3. Design Guide for Efficient Operators (ONNX)
When operator designs stay within the range that the hardware supports, the hardware can be exploited more fully and model inference runs faster. This section explains how to design operators that run efficiently on the AX620 hardware platform.
3.1. Convolution
Note

The convolution operator consumes an input tensor and a filter, and computes the output.

Conv Supported OpSet Version: 1, 11-13
Support Properties | Performance Description | Remarks
---|---|---
kernel_shape | If kernel_shape is 3, 100% performance can be achieved | Other values reach less than 50% efficiency
kernel_shape | With kernel_shape of 1, performance drops to 89%; this requires input_channel % 32 == 0 |
pads | Most efficient when kernel_shape is 3 and pads is 1 | /
strides | With stride_h = 1 and stride_w <= 2, performance is 100%; with stride = [2, 2] the efficiency is about output_channel / (output_channel + 8); in other cases, the larger the output_channel, the higher the efficiency | When kernel_shape is 3, avoid strides of 3
auto_pad | / | Only the NOTSET configuration is supported
dilations | Efficiency is calculated as kernel_shape / ((kernel_shape - 1) * dilation + 1) | This wastes the amount of computation needed to fill the dilation
group | Most efficient when channel/group is a multiple of 16, but input_width must be a multiple of 32 | For example: depthwise conv efficiency is 1/16
Hint

input/output_channel

Most efficient when input_channel is a multiple of 16 and output_channel is a multiple of 8. When these multiples are not met, the computation is padded up to the corresponding multiple and wasted.
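A minimal sketch of a convolution shaped to the preferences above, assuming a PyTorch layer that will be exported to ONNX (the channel counts and input shape are illustrative, not requirements):

```python
import torch
import torch.nn as nn

# Convolution following the preferences above: kernel_shape = 3, pads = 1,
# stride = 1, input_channel a multiple of 16, output_channel a multiple of 8,
# no dilation and no grouping (exports as an ONNX Conv with auto_pad = NOTSET).
conv = nn.Conv2d(
    in_channels=32,   # multiple of 16
    out_channels=64,  # multiple of 8
    kernel_size=3,    # 3x3 kernels can reach 100% efficiency
    stride=1,         # stride_h = 1, stride_w <= 2 keeps 100% efficiency
    padding=1,        # pads = 1 together with kernel_shape = 3 is the most efficient
    dilation=1,       # dilation > 1 reduces efficiency to k / ((k - 1) * d + 1)
    groups=1,         # channel/group should stay a multiple of 16
)

x = torch.randn(1, 32, 224, 224)  # illustrative input
y = conv(x)
```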
3.2. ConvTranspose

ConvTranspose has the most efficient support for the following three cases:

- kernel_size is 2 x 2, stride is 2, pad is 0
- kernel_size is 4 x 4, stride is 2, pad is 1
- kernel_size is 4 x 4, stride is 4, pad is 0
Attention

The efficiency of ConvTranspose is slightly lower than that of the Resize operator, which performs the same upsampling function.
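A sketch of the three efficient configurations listed above, assuming PyTorch layers (the channel counts are illustrative):

```python
import torch.nn as nn

# The three ConvTranspose configurations listed above as the most efficient.
deconv_2x2_s2 = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2, padding=0)
deconv_4x4_s2 = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)
deconv_4x4_s4 = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=4, padding=0)

# Per the note above, a Resize that performs the same upsampling is slightly
# more efficient than ConvTranspose on this platform.
```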
3.3. Linear

It is recommended that channels be a multiple of 16.
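A minimal sketch, assuming a PyTorch Linear layer (the feature counts are illustrative):

```python
import torch.nn as nn

# Both input and output feature counts are multiples of 16, as recommended.
fc = nn.Linear(in_features=256, out_features=128)
```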
3.4. Activation
ReLU
has the most efficient supportLeakyReLU
,HardSwish
,Swish
,Mish
are also efficiently supported (but weaker thanReLU
)PReLU
support is less efficient
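The same ordering expressed as PyTorch modules, shown as an assumption (Swish corresponds to nn.SiLU in PyTorch):

```python
import torch.nn as nn

# Activations in the efficiency order described above.
act_best = nn.ReLU()          # most efficient
act_good = nn.LeakyReLU(0.1)  # or nn.Hardswish(), nn.SiLU(), nn.Mish()
act_slow = nn.PReLU()         # less efficient on this platform
```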
3.5. Transpose/Reshape

Attention

These operators are implemented inefficiently on this platform and should be avoided where possible.
3.6. Pool
Operator |
Efficient suggestions |
---|---|
MaxPool |
Efficient support for the case |
AvgPool |
|
3.7. Resize
scale
only supports powers of two, suggested in the range [1/16, 1/8, 1/4, 1/2, 2, 4, 8, 16].mode
only supportsnearest
,bilinear
andarea
.
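A minimal sketch of interpolations that export to an ONNX Resize matching these constraints, assuming PyTorch (the tensor shape is illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 32, 64, 64)

# Power-of-two scale factors with supported modes.
up = F.interpolate(x, scale_factor=2, mode='nearest')     # 2x upsample
down = F.interpolate(x, scale_factor=0.5, mode='bilinear',
                     align_corners=False)                 # 1/2 downsample
```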