3. Design Guide for Efficient Operators (ONNX)

When operators are designed within the range that the hardware supports, the hardware can be exploited more fully and model inference runs faster. This section explains how to design algorithms that run efficiently on the AX620 hardware platform.

3.1. Convolution

Note

The convolution operator consumes an input tensor and a filter, and computes the output.

Conv Supported OpSet Version: 1, 11-13

Supported attributes, their performance characteristics, and remarks:

kernel_shape
  • A kernel_shape of 3 achieves 100% performance; other values run at less than 50% efficiency.
  • A kernel_shape of 1 drops performance to 89% and requires input_channel % 32 == 0.

pads
  • Most efficient when kernel_shape is 3 and pads is 1.

strides
  • With stride_h = 1 and stride_w <= 2, performance is 100%.
  • With stride = [2, 2], efficiency is approximately output_channel / (output_channel + 8); in other cases, the larger the output_channel, the higher the efficiency.
  • When kernel_shape is 3, avoid strides of 3.

auto_pad
  • Only the NOTSET configuration is supported.

dilations
  • Efficiency is calculated as kernel_shape / ((kernel_shape - 1) * dilation + 1); the computation needed to fill the dilation is wasted.

group
  • Most efficient when channel/group is a multiple of 16, but input_width must then be a multiple of 32.
  • For example, depthwise convolution runs at about 1/16 efficiency.

Hint

input_channel / output_channel
  • Most efficient when input_channel is a multiple of 16 and output_channel is a multiple of 8.

  • When these multiples are not met, computation is wasted padding up to the corresponding multiple.
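As a concrete illustration of the guidance above, the following PyTorch sketch builds a convolution that hits the most efficient configuration (kernel_shape 3, pads 1, stride 1, input_channel a multiple of 16, output_channel a multiple of 8) and exports it to ONNX. The channel and spatial sizes are illustrative assumptions, not values taken from this guide.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen to follow the guidance above:
# input_channel is a multiple of 16, output_channel a multiple of 8,
# kernel_shape 3 with pads 1 and stride 1 (the most efficient pairing).
efficient_conv = nn.Conv2d(
    in_channels=32,    # multiple of 16
    out_channels=64,   # multiple of 8
    kernel_size=3,     # 100% performance per the attribute list above
    stride=1,
    padding=1,
)

dummy_input = torch.randn(1, 32, 224, 224)
torch.onnx.export(efficient_conv, dummy_input, "efficient_conv.onnx", opset_version=11)
```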

3.2. ConvTranspose

ConvTranspose is supported most efficiently in the following three cases (a sketch of these configurations follows the note below).

  • kernel_size is 2 x 2, stride is 2, pad is 0

  • kernel_size is 4 x 4, stride is 2, pad is 1

  • kernel_size is 4 x 4, stride is 4, pad is 0

Attention

The efficiency of ConvTranspose is slightly lower than that of the Resize operator performing the same upsampling.
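A minimal PyTorch sketch of the three efficient configurations, plus the Resize-style alternative mentioned in the note; the channel counts are illustrative assumptions (chosen as multiples of 16 per Section 3.1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The three efficient ConvTranspose configurations listed above.
up_2x_k2 = nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2, padding=0)
up_2x_k4 = nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2, padding=1)
up_4x_k4 = nn.ConvTranspose2d(32, 32, kernel_size=4, stride=4, padding=0)

x = torch.randn(1, 32, 56, 56)
assert up_2x_k2(x).shape[-1] == 112   # 2x upsampling
assert up_2x_k4(x).shape[-1] == 112   # 2x upsampling
assert up_4x_k4(x).shape[-1] == 224   # 4x upsampling

# Per the note above, the same 2x upsampling via Resize (interpolate) is
# slightly more efficient when learned weights are not needed.
y = F.interpolate(x, scale_factor=2, mode="nearest")
assert y.shape[-1] == 112
```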

3.3. Linear

It is recommended that channels be a multiple of 16.
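For example (a sketch with assumed sizes and a hypothetical rounding helper), a fully connected layer whose input and output feature counts are both padded up to multiples of 16:

```python
import torch.nn as nn

def round_up(value: int, multiple: int = 16) -> int:
    """Hypothetical helper: round a feature count up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple

# Assumed sizes: a layer that would naturally use 300 -> 100 features is
# padded to 304 -> 112 so that both dimensions are multiples of 16.
fc = nn.Linear(in_features=round_up(300), out_features=round_up(100))
```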

3.4. Activation

  • ReLU has the most efficient support

  • LeakyReLU, HardSwish, Swish, and Mish are also supported efficiently (though less so than ReLU)

  • PReLU support is less efficient
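A small PyTorch sketch of this preference order: use ReLU where possible, an efficiently supported alternative such as LeakyReLU when a non-zero negative slope is needed, and avoid PReLU. Layer sizes are assumptions.

```python
import torch.nn as nn

# Preferred: ReLU (most efficient on this platform).
block_relu = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

# Still efficient, but weaker than ReLU: LeakyReLU (similarly HardSwish).
block_leaky = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
)

# Less efficient on this platform: PReLU (learnable per-channel slope).
block_prelu = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.PReLU(num_parameters=64),
)
```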

3.5. Transpose/Reshape

Attention

These operators are implemented inefficiently and should be avoided where possible.
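Since Transpose and Reshape nodes are often introduced silently during export, one way to catch them is to count such nodes in the exported ONNX graph. A minimal sketch using the onnx Python package; the model path is a placeholder assumption.

```python
from collections import Counter

import onnx

# "model.onnx" is a placeholder path for an exported model.
model = onnx.load("model.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

for op in ("Transpose", "Reshape"):
    if op_counts[op]:
        print(f"warning: graph contains {op_counts[op]} {op} node(s); "
              "these are inefficient on this platform")
```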

3.6. Pool

Efficiency suggestions by operator:

MaxPool
  • Efficiently supported when kernel_size <= 2 and kernel_size == stride; it is recommended to keep kernel_size no larger than 3.

AvgPool
  • Most efficient when kernel_size is a power of 2; it is recommended that kernel_size not exceed 32.
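A PyTorch sketch of pooling layers that follow these suggestions; the sizes are assumptions.

```python
import torch.nn as nn

# MaxPool: kernel_size <= 2 with kernel_size == stride is the efficient case.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# AvgPool: a power-of-two kernel_size (here 4, well under 32) is the most efficient choice.
avg_pool = nn.AvgPool2d(kernel_size=4, stride=4)
```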

3.7. Resize

  • scale only supports powers of two; suggested values are 1/16, 1/8, 1/4, 1/2, 2, 4, 8, and 16.

  • mode only supports nearest, bilinear and area.
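A PyTorch sketch of resizing that stays within these constraints (power-of-two scale factors, supported modes); the tensor shape and scale factors are assumed examples.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 32, 56, 56)

# scale_factor is a power of two and mode is one of the supported choices;
# these calls export as ONNX Resize nodes.
up = F.interpolate(x, scale_factor=2, mode="nearest")
down = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)

assert up.shape[-1] == 112 and down.shape[-1] == 28
```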