3. Design Guide for Efficient Operators (ONNX)

When operators are designed within the range that the hardware supports, the hardware can be exploited more fully and model inference runs faster. This section explains how to design algorithms that run efficiently on the AX620 hardware platform.

3.1. Convolution

Note

The convolution operator consumes an input tensor and a filter, and computes the output.

Conv Supported OpSet Version: 1, 11-13

Supported attributes, their performance characteristics, and remarks:

kernel_shape
  • A kernel_shape of 3 achieves 100% performance; other values run at less than 50% efficiency.
  • A kernel_shape of 1 drops performance to 89% and requires input_channel % 32 == 0.

pads
  • Most efficient when kernel_shape is 3 and pads is 1.

strides
  • With stride_h = 1 and stride_w <= 2, performance is 100%.
  • With stride = [2, 2], efficiency is approximately output_channel / (output_channel + 8); in other cases, the larger the output_channel, the higher the efficiency.
  • When kernel_shape is 3, avoid strides of 3.

auto_pad
  • Only the NOTSET configuration is supported.

dilations
  • Efficiency is calculated as kernel_shape / ((kernel_shape - 1) * dilation + 1); the computation needed to fill the dilation is wasted.

group
  • Most efficient when channel/group is a multiple of 16, but input_width must then be a multiple of 32.
  • For example, depthwise convolution runs at about 1/16 efficiency.

Hint

input_channel / output_channel
  • Most efficient when input_channel is a multiple of 16 and output_channel is a multiple of 8.

  • When these multiples are not met, computation is wasted padding up to the corresponding multiple.
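As a concrete illustration of the guidance above, the following PyTorch sketch builds a convolution that hits the most efficient configuration (kernel_shape 3, pads 1, stride 1, input_channel a multiple of 16, output_channel a multiple of 8) and exports it to ONNX. The channel and spatial sizes are illustrative assumptions, not values taken from this guide.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen to follow the guidance above:
# input_channel is a multiple of 16, output_channel a multiple of 8,
# kernel_shape 3 with pads 1 and stride 1 (the most efficient pairing).
efficient_conv = nn.Conv2d(
    in_channels=32,    # multiple of 16
    out_channels=64,   # multiple of 8
    kernel_size=3,     # 100% performance per the attribute list above
    stride=1,
    padding=1,
)

dummy_input = torch.randn(1, 32, 224, 224)
torch.onnx.export(efficient_conv, dummy_input, "efficient_conv.onnx", opset_version=11)
```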

3.2. ConvTranspose

ConvTranspose is supported most efficiently in the following three cases (a sketch of these configurations follows the note below).

  • kernel_size is 2 x 2, stride is 2, pad is 0

  • kernel_size is 4 x 4, stride is 2, pad is 1

  • kernel_size is 4 x 4, stride is 4, pad is 0

Attention

The efficiency of ConvTranspose is slightly lower than that of the Resize operator performing the same upsampling.
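A minimal PyTorch sketch of the three efficient configurations, plus the Resize-style alternative mentioned in the note; the channel counts are illustrative assumptions (chosen as multiples of 16 per Section 3.1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The three efficient ConvTranspose configurations listed above.
up_2x_k2 = nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2, padding=0)
up_2x_k4 = nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2, padding=1)
up_4x_k4 = nn.ConvTranspose2d(32, 32, kernel_size=4, stride=4, padding=0)

x = torch.randn(1, 32, 56, 56)
assert up_2x_k2(x).shape[-1] == 112   # 2x upsampling
assert up_2x_k4(x).shape[-1] == 112   # 2x upsampling
assert up_4x_k4(x).shape[-1] == 224   # 4x upsampling

# Per the note above, the same 2x upsampling via Resize (interpolate) is
# slightly more efficient when learned weights are not needed.
y = F.interpolate(x, scale_factor=2, mode="nearest")
assert y.shape[-1] == 112
```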

3.3. Linear

It is recommended that channels be a multiple of 16.
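For example (a sketch with assumed sizes and a hypothetical rounding helper), a fully connected layer whose input and output feature counts are both padded up to multiples of 16:

```python
import torch.nn as nn

def round_up(value: int, multiple: int = 16) -> int:
    """Hypothetical helper: round a feature count up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple

# Assumed sizes: a layer that would naturally use 300 -> 100 features is
# padded to 304 -> 112 so that both dimensions are multiples of 16.
fc = nn.Linear(in_features=round_up(300), out_features=round_up(100))
```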

3.4. Activation

  • ReLU has the most efficient support

  • LeakyReLU, HardSwish, Swish, and Mish are also supported efficiently (though less so than ReLU)

  • PReLU support is less efficient
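A small PyTorch sketch of this preference order: use ReLU where possible, an efficiently supported alternative such as LeakyReLU when a non-zero negative slope is needed, and avoid PReLU. Layer sizes are assumptions.

```python
import torch.nn as nn

# Preferred: ReLU (most efficient on this platform).
block_relu = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

# Still efficient, but weaker than ReLU: LeakyReLU (similarly HardSwish).
block_leaky = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
)

# Less efficient on this platform: PReLU (learnable per-channel slope).
block_prelu = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.PReLU(num_parameters=64),
)
```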

3.5. Transpose/Reshape

Attention

These operators are implemented inefficiently and should be avoided where possible.
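Since Transpose and Reshape nodes are often introduced silently during export, one way to catch them is to count such nodes in the exported ONNX graph. A minimal sketch using the onnx Python package; the model path is a placeholder assumption.

```python
from collections import Counter

import onnx

# "model.onnx" is a placeholder path for an exported model.
model = onnx.load("model.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

for op in ("Transpose", "Reshape"):
    if op_counts[op]:
        print(f"warning: graph contains {op_counts[op]} {op} node(s); "
              "these are inefficient on this platform")
```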

3.6. Pool

Efficiency suggestions by operator:

MaxPool
  • Efficiently supported when kernel_size <= 2 and kernel_size == stride; it is recommended to keep kernel_size no larger than 3.

AvgPool
  • Most efficient when kernel_size is a power of 2; it is recommended that kernel_size not exceed 32.
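A PyTorch sketch of pooling layers that follow these suggestions; the sizes are assumptions.

```python
import torch.nn as nn

# MaxPool: kernel_size <= 2 with kernel_size == stride is the efficient case.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# AvgPool: a power-of-two kernel_size (here 4, well under 32) is the most efficient choice.
avg_pool = nn.AvgPool2d(kernel_size=4, stride=4)
```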

3.7. Resize

  • scale only supports powers of two; suggested values are 1/16, 1/8, 1/4, 1/2, 2, 4, 8, and 16.

  • mode only supports nearest, bilinear and area.
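A PyTorch sketch of resizing that stays within these constraints (power-of-two scale factors, supported modes); the tensor shape and scale factors are assumed examples.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 32, 56, 56)

# scale_factor is a power of two and mode is one of the supported choices;
# these calls export as ONNX Resize nodes.
up = F.interpolate(x, scale_factor=2, mode="nearest")
down = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)

assert up.shape[-1] == 112 and down.shape[-1] == 28
```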