4. Accuracy loss troubleshooting and accuracy tuning suggestions
4.1. Accuracy Loss Identification
Attention
If the converted model loses accuracy, follow the recommended procedure below to identify the stage or layer that causes the problem.
4.1.1. CheckLists
- Step 1: identify the hardware platform on which the accuracy loss occurs
If the loss occurs only on the AX platform, continue down the list.
If the loss also occurs on all other platforms, it is a common problem: the user needs to decide whether to train a better model and then re-quantize it. Also determine whether the other platforms use INT8 quantization, INT16 quantization, or a mix of the two.
- Step 2: determine the stage at which the accuracy loss occurs
If pulsar run reports low accuracy (cos-sim < 98%), follow the [Step 3] recommendations to continue the investigation.
If the model runs on the board together with the user's post-processing program and the accuracy is very low after parsing, follow [Step 4].
- Step 3: cos-sim is below 98%, troubleshooting suggestions
The output_config.prototxt file required for pulsar run must be generated automatically by pulsar build.
Check that the color space and mean/std configurations in the config.prototxt configuration file are correct.
Use pulsar run to compare the cos-sim values between model.lava_joint and model.onnx to see whether accuracy loss occurs.
Use layer-by-layer splitting to find the layer where the loss of precision occurs.
- Step 4: low accuracy on the board, troubleshooting suggestions
The run_joint command prints some information about the joint model when it is executed; use it to check whether the post-processor parses the output data correctly.
If other platforms do not lose accuracy but a BadCase is reported on the AX platform, see section 4.1.3, Accuracy loss occurs on board.
- Step 5: get help from AXera
If the problem is still unsolved after the first four steps, send the relevant logs and conclusions to the FAE team so that AX engineers can locate the problem.
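The checklist above can be read as a small decision procedure. The sketch below is only an illustration of that flow: the function name `next_step`, its arguments, and the returned strings are hypothetical and are not part of the Pulsar toolchain. Only the 98% threshold is taken from the checklist.

```python
from typing import Optional

COS_SIM_THRESHOLD = 0.98  # the 98% threshold quoted in the checklist


def next_step(only_on_ax: bool,
              cos_sim: Optional[float],
              low_accuracy_on_board: bool) -> str:
    """Return which checklist step to follow next (hypothetical helper)."""
    if not only_on_ax:
        # Step 1: the loss is not specific to the AX platform.
        return "common problem: consider retraining, then re-quantize"
    if cos_sim is not None and cos_sim < COS_SIM_THRESHOLD:
        # Step 2 -> Step 3: pulsar run already shows low accuracy.
        return "step 3"
    if low_accuracy_on_board:
        # Step 2 -> Step 4: simulation looks fine, board results do not.
        return "step 4"
    return "no accuracy loss detected"


print(next_step(only_on_ax=True, cos_sim=0.95, low_accuracy_on_board=False))
```

If none of the branches lead to a fix, Step 5 (contacting AXera with logs and conclusions) still applies.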
4.1.2. Accuracy loss occurs after model compilation
This section elaborates on Step 3 of the CheckLists.
Hint
pulsar run is an integrated tool in the Pulsar toolchain for simulation and output comparison; see Simulation and pairing on x86 platforms for details.
If the cos-sim reported by pulsar run is very low after the original onnx model has been compiled into a joint model, the converted model is losing accuracy and the problem needs to be investigated.
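As a reminder of what the cos-sim score measures, the sketch below computes the standard cosine similarity between two flattened output tensors and compares it with the 98% threshold used in this section. It assumes that cos-sim is the usual cosine similarity; how pulsar run computes it internally is not described here, and the function name and sample values are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened outputs of equal length."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical flattened outputs of the float model and the compiled model.
reference = [0.10, 0.70, 0.20]
quantized = [0.12, 0.66, 0.22]

score = cosine_similarity(reference, quantized)
verdict = "OK" if score >= 0.98 else "investigate"
print(f"cos-sim = {score:.4f} ({verdict})")
```

A score close to 1 means the two outputs point in almost the same direction; it does not by itself guarantee that task metrics such as mAP are preserved.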
- config Configuration
The config required for pulsar run is generated automatically by pulsar build.

    # Note that the following commands are not complete
    pulsar build --input model.onnx --config config.prototxt --output_config output_config.prototxt ...
    pulsar run model.onnx model.joint --config output_config.prototxt ...
- csc & mean/std
After the color space conversion (csc) has been configured, mean/std must be configured in the same channel order.

    # Configure the input data color space of the compiled model as BGR
    dst_input_tensors {
      color_space: TENSOR_COLOR_SPACE_BGR
    }

    # mean/std needs to be filled in the order of BGR
    input_normalization {
      mean: [0.485, 0.456, 0.406]  # mean
      std: [0.229, 0.224, 0.255]   # std
    }

The color_space in dst_input_tensors is BGR, which means that the calibration image data is read in BGR format at compile time, so mean/std must also be set in BGR order.
- Check whether the model lost accuracy during the quantization phase
During pulsar build compilation, an intermediate file model.lava_joint is generated for debugging. Passing it to pulsar run shows whether precision is lost in the quantization phase:

    # Note that the following command is incomplete
    pulsar run model.onnx model.lava_joint --input ...
- Solutions when accuracy is lost in the quantization stage
Add calibration data:

    dataset_conf_calibration {
      path: "imagenet-1k-images.tar"
      type: DATASET_TYPE_TAR
      size: 256        # The actual number of data needed for calibration during compilation
      batch_size: 32   # default is 32, can be changed to other values
    }

Adjust the quantization strategy and the quantization method.
Quantization strategies: CALIB_STRATEGY_PER_CHANNEL and CALIB_STRATEGY_PER_TENSOR
Quantization methods: OBSVR_METHOD_MIN_MAX and OBSVR_METHOD_MSE_CLIPPING
A strategy and a method can be combined in pairs. CALIB_STRATEGY_PER_CHANNEL may lose accuracy, so the PER_TENSOR/MIN_MAX or PER_TENSOR/MSE_CLIPPING combinations are recommended.

    dataset_conf_calibration {
      path: "imagenet-1k-images.tar"   # calibration dataset
      type: DATASET_TYPE_TAR
      size: 256        # The actual number of data needed for calibration during compilation
      batch_size: 32   # default is 32, can be changed to other values

      calibration_strategy: CALIB_STRATEGY_PER_TENSOR   # Quantization strategy
      observer_method: OBSVR_METHOD_MSE_CLIPPING        # Quantization method
    }

Use INT16 quantization. See 16bit quantization for details.
Turn on dataset_conf_error_measurement for error testing during compilation:

    dataset_conf_error_measurement {
      path: "imagenet-1k-images.tar"
      type: DATASET_TYPE_TAR
      size: 32
      batch_size: 8
    }
- Layer-by-layer comparison
See layer wise compare for details.
- pulsar debug
The pulsar debug function will be added later.
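The channel-order requirement from the csc & mean/std item above can be illustrated with a short sketch. It assumes that the per-channel statistics were originally measured in RGB order and must be rewritten in BGR order to match a BGR color_space. The helper names and all numeric values are hypothetical and chosen only for the demonstration.

```python
from typing import List, Sequence


def rgb_to_bgr(per_channel: Sequence[float]) -> List[float]:
    """Reorder a 3-element per-channel list from RGB order to BGR order."""
    if len(per_channel) != 3:
        raise ValueError("expected exactly three per-channel values")
    r, g, b = per_channel
    return [b, g, r]


def normalize_pixel(pixel: Sequence[float],
                    mean: Sequence[float],
                    std: Sequence[float]) -> List[float]:
    """Apply (value - mean) / std channel by channel; all inputs share one order."""
    return [(v - m) / s for v, m, s in zip(pixel, mean, std)]


# Hypothetical statistics measured in RGB order.
mean_rgb = [0.5, 0.4, 0.3]
std_rgb = [0.2, 0.25, 0.5]

# The compiled model reads BGR data, so both lists are reordered to BGR.
mean_bgr = rgb_to_bgr(mean_rgb)
std_bgr = rgb_to_bgr(std_rgb)

pixel_rgb = [0.9, 0.4, 0.8]
pixel_bgr = rgb_to_bgr(pixel_rgb)

# Matching orders give the same normalized values, only reordered.
print(normalize_pixel(pixel_bgr, mean_bgr, std_bgr))
# Mixing BGR pixels with RGB statistics silently normalizes the wrong channels.
print(normalize_pixel(pixel_bgr, mean_rgb, std_rgb))
```

The second call runs without any error, which is why a channel-order mismatch tends to show up only as an unexplained drop in cos-sim.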
4.1.3. Accuracy loss occurs on board
This section elaborates on Step 4 of the CheckLists.
- Determining whether the post-processor is wrong
Using the run_joint command on the AX development board, you can run inference on the board and then parse the results with your own post-processor.
To verify that the post-processor is error-free, compare the output of pulsar run with the output of run_joint for the same input (refer to the gt folder comparison instructions). If the two outputs match, inference on the board is consistent with the simulation, so the user's post-processor may have an error.
- The post-processor is correct, but the accuracy is still low.
- Possible reasons
The npu simulator and cmode produce inconsistent results for the generated instructions.
Errors in run_joint.so and the npu driver.
This kind of problem needs to be logged so that it can be fixed quickly.
- BadCase handling
For this type of BadCase, first check the cos-sim with pulsar run. If there is no serious accuracy loss (that is, cos-sim is not below 98%), send the BadCase to the board and run it with run_joint, then check whether the results are consistent with pulsar run. If they are not, there is a problem on the board side that needs to be fixed by AX engineers.
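The comparison logic described in this section can be sketched as follows. The sketch assumes that the outputs of pulsar run and run_joint have already been loaded into flat lists of floats; how to load them is not covered here. The function names and the tolerance value are hypothetical, and only the 98% threshold comes from the text.

```python
from typing import Sequence

COS_SIM_THRESHOLD = 0.98  # threshold quoted in this section


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def diagnose_bad_case(sim_cos_sim: float,
                      sim_output: Sequence[float],
                      board_output: Sequence[float],
                      tolerance: float = 1e-3) -> str:
    """Suggest where to look next for one BadCase (hypothetical helper)."""
    if sim_cos_sim < COS_SIM_THRESHOLD:
        # The loss is already visible in simulation.
        return "compile-time loss: go back to the Step 3 checks"
    if max_abs_diff(sim_output, board_output) > tolerance:
        # Simulation is fine but the board disagrees with it.
        return "board differs from pulsar run: report to AX engineers"
    # Board and simulation agree, so the inference results are consistent.
    return "board matches pulsar run: inspect the post-processor"


print(diagnose_bad_case(0.995, [1.0, 2.0], [1.0, 2.5]))
```

The tolerance should be chosen to suit the output data type; the value used here is only a placeholder.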
4.1.4. Other notes
If you need an AX engineer to troubleshoot the problem, please provide detailed log information and relevant experimental findings.
Note: Providing a minimal reproduction set makes troubleshooting more efficient.
Note
In some cases the SILU function causes the mAP of the detection model to be very low; replacing it with the ReLU function will solve the problem.
Note
If the calibration dataset is very different from the training dataset, the accuracy will be significantly reduced.
To determine whether the choice of calibration data is reasonable, select a sample from the calibration dataset and run pulsar run on it to score it.
4.2. Precision tuning suggestions
To reduce the accuracy error introduced by quantization, the following two methods are recommended. Both require the model to be converted again after the config.prototxt file has been updated.
4.2.1. calibration settings
Try the different combinations of quantization strategy and quantization method
Try other calibration datasets
Increase or decrease the amount of calibration data as appropriate
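To give a feel for what the strategy and method options refer to, the sketch below applies a generic symmetric 8-bit quantization to toy data for each of the four combinations. It is a textbook-style illustration, not the calibration algorithm used by pulsar build. Only the four option names come from this document; their mapping to the functions below, the helper names, and the toy data are assumptions.

```python
import itertools
from typing import Callable, Dict, List, Sequence, Tuple

QMAX = 127  # symmetric signed 8-bit range (an assumption of this sketch)


def quant_error(values: Sequence[float], clip: float) -> float:
    """Mean squared error after quantizing `values` with clipping threshold `clip`."""
    if clip <= 0.0:
        return sum(v * v for v in values) / len(values)
    scale = clip / QMAX
    total = 0.0
    for v in values:
        q = max(-QMAX, min(QMAX, round(v / scale)))  # quantize and saturate
        total += (v - q * scale) ** 2                # compare with dequantized value
    return total / len(values)


def min_max_clip(values: Sequence[float]) -> float:
    """Use the largest absolute value as the clipping threshold."""
    return max(abs(v) for v in values)


def mse_clip(values: Sequence[float], steps: int = 100) -> float:
    """Grid-search the clipping threshold with the lowest quantization MSE."""
    top = min_max_clip(values)
    candidates = [top * i / steps for i in range(1, steps)] + [top]
    return min(candidates, key=lambda c: quant_error(values, c))


STRATEGIES = ("CALIB_STRATEGY_PER_TENSOR", "CALIB_STRATEGY_PER_CHANNEL")
METHODS: Dict[str, Callable[[Sequence[float]], float]] = {
    "OBSVR_METHOD_MIN_MAX": min_max_clip,
    "OBSVR_METHOD_MSE_CLIPPING": mse_clip,
}


def total_error(channels: List[List[float]], strategy: str, method: str) -> float:
    """Quantization MSE over all channels for one strategy/method combination."""
    observe = METHODS[method]
    flat = [v for channel in channels for v in channel]
    if strategy == "CALIB_STRATEGY_PER_TENSOR":
        return quant_error(flat, observe(flat))      # one threshold for everything
    weighted = sum(quant_error(ch, observe(ch)) * len(ch) for ch in channels)
    return weighted / len(flat)                      # one threshold per channel


# Toy data: one narrow channel and one wide channel with a single outlier.
channels = [
    [0.01 * i for i in range(-50, 51)],
    [0.1 * i for i in range(-50, 51)] + [40.0],
]
results: Dict[Tuple[str, str], float] = {
    (s, m): total_error(channels, s, m)
    for s, m in itertools.product(STRATEGIES, METHODS)
}
for (s, m), err in results.items():
    print(f"{s} / {m}: mse = {err:.6f}")
```

The numbers printed for this toy data say nothing about which combination works best for a real model on AX hardware; the point is only that each combination yields a different set of clipping thresholds, which is why trying the combinations on the actual model is worthwhile.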
4.2.2. QAT Training
When the accuracy of the model cannot be improved with the tuning techniques above, the model is probably a corner case of the PTQ scheme, and you can try training it with QAT.
Attention
More tuning suggestions will be updated gradually.