3. Design Guide for Efficient Operators (ONNX)
When an operator's configuration falls within the range that the hardware supports natively, the hardware can be used more fully and model inference runs faster.
This section explains how to design operators so that they run efficiently on the AX620 hardware platform.
3.1. Convolution
Note
The convolution operator consumes an input tensor and a filter, and computes the output.
Conv Supported OpSet Version: 1, 11-13
| Support Properties | Performance Description | Remarks |
|---|---|---|
| kernel_shape | If kernel_shape is 3, 100% performance can be achieved | Other values reach less than 50% efficiency |
| kernel_shape | With kernel_shape of 1, performance drops to 89%, which requires input_channel % 32 == 0 | |
| pads | Most efficient when kernel_shape is 3 and pads is 1 | / |
| strides | stride_h = 1, stride_w <= 2: performance is 100%; stride = [2, 2]: efficiency is about output_channel / (output_channel + 8); in other cases, the larger the output_channel, the more efficient it is | When kernel_shape is 3, avoid strides of 3 |
| auto_pad | / | Only NOTSET is supported |
| dilations | Efficiency is calculated as kernel_shape / ((kernel_shape - 1) * dilation + 1) | The computation spent filling the dilation gaps is wasted |
| group | channel/group is most efficient when it is a multiple of 16, but input_width must be a multiple of 32 | For example, depthwise conv has an efficiency of 1/16 |
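The two closed-form expressions in the table (for dilations and for stride = [2, 2]) can be evaluated directly when comparing candidate layer configurations. The following is a minimal sketch, not part of any AX620 tooling; the function names `dilation_efficiency` and `stride2_efficiency` are hypothetical, and `kernel_shape` is treated as a single integer as in the table.

```python
from fractions import Fraction


def dilation_efficiency(kernel_shape: int, dilation: int) -> Fraction:
    """Table formula: kernel_shape / ((kernel_shape - 1) * dilation + 1)."""
    return Fraction(kernel_shape, (kernel_shape - 1) * dilation + 1)


def stride2_efficiency(output_channel: int) -> Fraction:
    """Table formula for stride = [2, 2]: output_channel / (output_channel + 8)."""
    return Fraction(output_channel, output_channel + 8)


if __name__ == "__main__":
    # Hypothetical sweep over a few configurations.
    for d in (1, 2, 4):
        eff = dilation_efficiency(3, d)
        print(f"kernel_shape=3, dilation={d}: efficiency {eff} ({float(eff):.2f})")
    for oc in (8, 64, 256):
        eff = stride2_efficiency(oc)
        print(f"stride=[2, 2], output_channel={oc}: efficiency {eff} ({float(eff):.2f})")
```

Under these formulas, a 3 x 3 kernel with dilation 2 evaluates to 3/5, and a stride-2 convolution with only 8 output channels evaluates to 1/2, which illustrates why wider outputs are preferred with stride = [2, 2].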
Hint
input/output_channel
Most efficient when input_channel is a multiple of 16 and output_channel is a multiple of 8. When a channel count does not meet its multiple, it is effectively rounded up to the next multiple, and the computation spent on the extra channels is wasted.
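To get a feel for how much work a misaligned channel count may waste, one can round each channel count up to its multiple and compare useful work to padded work. This is only a sketch under an assumption: it supposes that waste scales with the product of the padded channel counts, which the guide does not state explicitly. The helper names `round_up` and `channel_utilization` are hypothetical.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple (integer ceiling division)."""
    return -(-value // multiple) * multiple


def channel_utilization(input_channel: int, output_channel: int,
                        in_align: int = 16, out_align: int = 8) -> float:
    """Fraction of useful work, ASSUMING both channel axes are padded
    to their multiples (16 for input, 8 for output, per the hint)."""
    padded_in = round_up(input_channel, in_align)
    padded_out = round_up(output_channel, out_align)
    return (input_channel * output_channel) / (padded_in * padded_out)


if __name__ == "__main__":
    # Hypothetical layer shapes: (input_channel, output_channel).
    for ic, oc in [(16, 8), (24, 12), (3, 32)]:
        print(f"in={ic}, out={oc}: utilization {channel_utilization(ic, oc):.4f}")
```

Under this assumption, a first layer with 3 input channels would use only 3/16 of the padded input width, so such layers are natural candidates for redesign.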
3.2. ConvTranspose
ConvTranspose has the most efficient support for the following three cases:
- kernel_size is 2 x 2, stride is 2, pad is 0
- kernel_size is 4 x 4, stride is 2, pad is 1
- kernel_size is 4 x 4, stride is 4, pad is 0
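The three configurations above can be checked programmatically, and the standard transposed-convolution size formula shows that each one is an exact 2x or 4x upsampling. This is a minimal sketch assuming square kernels, dilation 1, and no output padding; `EFFICIENT_CONFIGS`, `is_efficient_conv_transpose`, and `conv_transpose_output_size` are hypothetical names, not part of any AX620 tooling.

```python
# (kernel_size, stride, pad) triples taken from the list above.
EFFICIENT_CONFIGS = {(2, 2, 0), (4, 2, 1), (4, 4, 0)}


def is_efficient_conv_transpose(kernel_size: int, stride: int, pad: int) -> bool:
    """True if the configuration is one of the three efficiently supported cases."""
    return (kernel_size, stride, pad) in EFFICIENT_CONFIGS


def conv_transpose_output_size(input_size: int, kernel_size: int,
                               stride: int, pad: int) -> int:
    """Output length along one axis, assuming dilation 1 and output_padding 0."""
    return (input_size - 1) * stride - 2 * pad + kernel_size


if __name__ == "__main__":
    for k, s, p in sorted(EFFICIENT_CONFIGS):
        out = conv_transpose_output_size(16, k, s, p)
        print(f"kernel={k}x{k}, stride={s}, pad={p}: 16 -> {out}")
```

For an input of size 16, the two stride-2 cases yield 32 and the stride-4 case yields 64.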
Attention
The efficiency of ConvTranspose is slightly lower than that of the resize operator, which performs the same upsampling function.
3.3. Linear
It is recommended that channels be a multiple of 16.
3.4. Activation
- ReLU has the most efficient support
- LeakyReLU, HardSwish, Swish, Mish are also efficiently supported (but weaker than ReLU)
- PReLU support is less efficient
3.5. Transpose/Reshape
Attention
Transpose and Reshape are implemented inefficiently and should be avoided where possible.
3.6. Pool
| Operator | Efficient suggestions |
|---|---|
| MaxPool | Efficient support for the case |
| AvgPool | |
3.7. Resize
- scale only supports powers of two, suggested in the range [1/16, 1/8, 1/4, 1/2, 2, 4, 8, 16].
- mode only supports nearest, bilinear and area.
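The Resize constraints lend themselves to a simple pre-flight check before a model is converted. The sketch below is illustrative only: `check_resize`, `SUGGESTED_SCALES`, and `SUPPORTED_MODES` are hypothetical names, and the mode strings are taken verbatim from this guide rather than from the ONNX attribute values.

```python
from fractions import Fraction

# Values copied from the guide above.
SUGGESTED_SCALES = {Fraction(1, 16), Fraction(1, 8), Fraction(1, 4), Fraction(1, 2),
                    Fraction(2), Fraction(4), Fraction(8), Fraction(16)}
SUPPORTED_MODES = {"nearest", "bilinear", "area"}


def is_power_of_two(scale: Fraction) -> bool:
    """True for 2**k with integer k (positive or negative)."""
    def pow2(x: int) -> bool:
        return x > 0 and x & (x - 1) == 0
    # Fraction is stored in lowest terms, so one side must be exactly 1.
    return (scale.numerator == 1 and pow2(scale.denominator)) or \
           (scale.denominator == 1 and pow2(scale.numerator))


def check_resize(scale, mode: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    s = Fraction(scale)
    if not is_power_of_two(s):
        problems.append(f"scale {s} is not a power of two")
    elif s not in SUGGESTED_SCALES:
        problems.append(f"scale {s} is outside the suggested range")
    if mode not in SUPPORTED_MODES:
        problems.append(f"mode {mode!r} is not supported")
    return problems


if __name__ == "__main__":
    for scale, mode in [(2, "nearest"), (0.5, "bilinear"), (3, "nearest"), (2, "cubic")]:
        print(scale, mode, check_resize(scale, mode) or "ok")
```

Using `Fraction` avoids floating-point comparisons for scales such as 1/16, which are exactly representable here.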