4. Accuracy loss troubleshooting and accuracy tuning suggestions
4.1. Accuracy Loss Identification
Attention
If the converted model loses accuracy, follow the recommended procedure below to identify the stage or layer that causes the problem.
4.1.1. CheckLists
- Step 1, identify the hardware platform on which the accuracy loss occurs
  - Accuracy loss occurs only on the AX platform: continue down the list.
  - Accuracy loss also occurs on all other platforms: this is a common problem. The user needs to decide whether to train a better model and then re-quantize it. Also determine whether the other platforms use INT8 quantization, INT16 quantization, or a mix of the two.
- Step 2, determine the stage where the accuracy loss occurs
  - pulsar run reports low accuracy (cos-sim < 98%): follow the [Step 3] recommendations to continue the investigation.
  - The model runs on the board and is connected to the user's post-processing program, and the accuracy is very low after parsing: follow [Step 4] to continue the investigation.
- Step 3, cos-sim is below 98%, troubleshooting suggestions
  - The output_config.prototxt file required for pulsar run must be generated automatically by pulsar build.
  - Check that the color space and mean/std configurations in the config.prototxt configuration file are correct.
  - Use pulsar run to compare the cos-sim values between model.lava_joint and model.onnx to see whether accuracy loss occurs.
  - Use layer-by-layer splitting to find the layer where the loss of precision occurs.
- Step 4, low accuracy on the board, troubleshooting suggestions
  - When the run_joint command is executed, it prints some information about the joint model. Use it to check whether the post-processor is parsing the output data correctly.
  - If other platforms do not lose accuracy but a BadCase is reported on the AX platform, see the on-board accuracy loss troubleshooting method (section 4.1.3).
- Step 5, get help from AXera
  - If the problem is still unsolved after the first four steps, send the relevant log and conclusion to the FAE colleagues so that AX engineers can locate the problem.
4.1.2. Accuracy loss occurs after model compilation
This section elaborates on Step 3 of the CheckLists.
Hint
pulsar run is the tool integrated in the Pulsar toolchain for simulation and for comparing outputs; see "Simulation and pairing on x86 platforms" for details.
If pulsar run reports a very low cos-sim after the original ONNX model has been compiled into a joint model, the converted model has lost accuracy and the problem needs to be investigated.
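As a rough illustration of the metric behind the 98% rule, the sketch below computes cosine similarity between two flattened outputs using the standard textbook definition. How pulsar run computes and aggregates cos-sim internally is not described here, so the function names, the zero-vector convention, and the sample values are all assumptions made for illustration.

```python
import math

COS_SIM_THRESHOLD = 0.98  # the 98% figure used in the CheckLists


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: two all-zero outputs count as identical.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def needs_investigation(reference, converted, threshold=COS_SIM_THRESHOLD):
    """True when the converted output falls below the similarity threshold."""
    return cosine_similarity(reference, converted) < threshold


# Hypothetical flattened outputs of model.onnx and model.joint for one input.
onnx_out = [0.10, 0.70, 0.20]
joint_out = [0.11, 0.69, 0.20]
print(needs_investigation(onnx_out, joint_out))
```

Because cosine similarity ignores overall magnitude, a pure scaling error between the two outputs would not show up in this check on its own.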
- config configuration
  The config required for pulsar run is generated automatically by pulsar build.

      # Note that the following commands are not complete
      pulsar build --input model.onnx --config config.prototxt --output_config output_config.prototxt ...
      pulsar run model.onnx model.joint --config output_config.prototxt ...
- csc & mean/std
  After color space conversion (csc) is configured, mean/std must be configured in the matching channel order.

      # Configure the input data color space of the compiled model as BGR
      dst_input_tensors {
        color_space: TENSOR_COLOR_SPACE_BGR
      }

      # mean/std needs to be filled in the order of BGR
      input_normalization {
        mean: [0.485, 0.456, 0.406]  # mean
        std: [0.229, 0.224, 0.255]   # std
      }

  The color_space in dst_input_tensors is BGR, which means that the calibration image data is read in BGR format at compile time, so mean/std must also be set in BGR order.
- Check whether the model has lost accuracy during the quantization phase
  During compilation, pulsar build generates an intermediate file, model.lava_joint, for debugging. Running

      # Note that the following commands are incomplete
      pulsar run model.onnx model.lava_joint --input ...

  verifies whether precision is lost in the quantization phase.
- Solutions when accuracy is lost in the quantization stage
  - Increase the calibration dataset

        dataset_conf_calibration {
          path: "imagenet-1k-images.tar"
          type: DATASET_TYPE_TAR
          size: 256       # The actual number of data needed for calibration during compilation
          batch_size: 32  # default is 32, can be changed to other values
        }

  - Adjust the quantization strategy and the quantization method
    - Quantization strategies: CALIB_STRATEGY_PER_CHANNEL and CALIB_STRATEGY_PER_TENSOR
    - Quantization methods: OBSVR_METHOD_MIN_MAX and OBSVR_METHOD_MSE_CLIPPING
    - Any strategy can be paired with any method; CALIB_STRATEGY_PER_CHANNEL may lose accuracy
    - Recommended combinations: PER_TENSOR/MIN_MAX or PER_TENSOR/MSE_CLIPPING

        dataset_conf_calibration {
          path: "imagenet-1k-images.tar"  # calibration dataset
          type: DATASET_TYPE_TAR
          size: 256       # The actual number of data needed for calibration during compilation
          batch_size: 32  # default is 32, can be changed to other values

          calibration_strategy: CALIB_STRATEGY_PER_TENSOR  # Quantization strategy
          observer_method: OBSVR_METHOD_MSE_CLIPPING       # Quantization method
        }

  - Use INT16 quantization. See 16bit quantization for details.
  - Turn on dataset_conf_error_measurement for error testing during compilation

        dataset_conf_error_measurement {
          path: "imagenet-1k-images.tar"
          type: DATASET_TYPE_TAR
          size: 32
          batch_size: 8
        }
- Layer-by-layer comparison
See layer wise compare for details.
- pulsar debug
  The pulsar debug function will be added later.
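To give some intuition for the two observer methods named in the list above, here is a generic sketch of min-max range selection versus an MSE-driven clipping search for symmetric per-tensor quantization. This is not Pulsar's implementation: the functions, the 127-level grid, the candidate search, and the sample data are all assumptions, and the real behaviour of OBSVR_METHOD_MIN_MAX and OBSVR_METHOD_MSE_CLIPPING may differ.

```python
def quantize_dequantize(values, clip, levels=127):
    """Clip to [-clip, clip], snap to a symmetric integer grid, map back to floats."""
    if clip <= 0:
        return [0.0 for _ in values]
    scale = clip / levels
    return [round(max(-clip, min(clip, v)) / scale) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def min_max_clip(values):
    """Min-max style range: cover the largest absolute value seen."""
    return max(abs(v) for v in values)


def mse_clip(values, candidates=100):
    """Try narrower ranges and keep the one with the lowest reconstruction error."""
    full = min_max_clip(values)
    if full == 0:
        return 0.0
    best_clip = full
    best_err = mse(values, quantize_dequantize(values, full))
    for i in range(1, candidates):
        clip = full * i / candidates
        err = mse(values, quantize_dequantize(values, clip))
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip


# Hypothetical calibration activations: a dense bulk plus two large outliers.
activations = [((i % 21) - 10) / 10.0 for i in range(210)] + [8.0, -8.0]

minmax_err = mse(activations, quantize_dequantize(activations, min_max_clip(activations)))
clipped_err = mse(activations, quantize_dequantize(activations, mse_clip(activations)))
print(minmax_err, clipped_err)
```

Since the full min-max range is one of the candidates, the search in this sketch can never do worse than min-max on the data it was tuned on. Whether it does noticeably better depends on how heavy the tails of the data are.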
4.1.3. Accuracy loss occurs on board
This section details Step 4 of the CheckLists.
- Determining whether the post-processor is wrong
  Running the run_joint command on the AX development board performs inference on the board; the results are then parsed by the user's own post-processor. To verify that the user's post-processor is error-free, compare the output of pulsar run with the output of run_joint for the same input (refer to the gt folder comparison instructions). If the two outputs match, the model itself behaves correctly on the board, so the user's post-processor may have an error.
- The post-processor is correct, but the accuracy is still low
  - Possible reasons
    - The instructions generated by the npu simulator and the results produced by cmode are inconsistent.
    - There are errors in run_joint.so or in the npu driver.
  - Problems of this kind need to be logged so that they can be fixed quickly.
- BadCase handling
  For this type of BadCase, first check cos-sim with pulsar run. If there is no serious accuracy loss (cos-sim is not below 98%), send the BadCase to the board and run it with run_joint, then check whether the results are consistent with pulsar run. If they are not, there is a problem on the board side that needs to be fixed by an AX engineer.
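The comparison between pulsar run and run_joint outputs described above can be sketched as follows. The sketch assumes that both tools' outputs for the same input have been saved as raw float32 dumps in native byte order. That file format, the function names, and the tolerance values are assumptions for illustration only; the actual dump layout is covered by the gt folder comparison instructions.

```python
import array
import math


def load_float32(raw_bytes):
    """Interpret raw bytes as a flat list of native-endian float32 values."""
    values = array.array("f")
    values.frombytes(raw_bytes)
    return values.tolist()


def compare_outputs(sim_bytes, board_bytes, rel_tol=1e-3, abs_tol=1e-5):
    """Compare a simulator dump with a board dump element by element."""
    sim = load_float32(sim_bytes)
    board = load_float32(board_bytes)
    if len(sim) != len(board):
        return {"match": False, "reason": "size mismatch", "max_abs_diff": None}
    max_abs_diff = max((abs(s - b) for s, b in zip(sim, board)), default=0.0)
    match = all(
        math.isclose(s, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for s, b in zip(sim, board)
    )
    reason = "ok" if match else "value mismatch"
    return {"match": match, "reason": reason, "max_abs_diff": max_abs_diff}


# Hypothetical dumps built in memory; real dumps would be read from files.
sim_dump = array.array("f", [0.1, 0.7, 0.2]).tobytes()
board_dump = array.array("f", [0.1, 0.7, 0.2]).tobytes()
print(compare_outputs(sim_dump, board_dump)["match"])
```

Read against the steps above: if the dumps match but the final accuracy is still low, the post-processor is the first suspect; if they differ, the evidence points to the board side and is worth attaching to the report for AX engineers.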
4.1.4. Other notes
If you need an AX engineer to troubleshoot the problem, please provide detailed log information and relevant experimental findings.
Note: providing a minimal reproduction set makes troubleshooting more efficient.
Note
In some cases the SILU activation function causes the mAP of a detection model to be very low; replacing it with the ReLU function can solve the problem.
Note
If the calibration (quantization) dataset is very different from the training dataset, the accuracy will be significantly reduced.
To determine whether the choice of calibration data is reasonable, select a sample from the calibration dataset and run pulsar run on it to compare the scores.
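Since the calibration dataset in the examples is a tar archive (DATASET_TYPE_TAR), picking samples for such a spot check can be sketched as below. The function name, the list of image suffixes, and the fixed seed are assumptions for illustration; the sketch only selects member names and leaves extracting the image and running pulsar run to the user.

```python
import random
import tarfile

# Assumed set of image extensions; adjust to match the actual dataset.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".bmp")


def pick_calibration_samples(tar_path, count=1, seed=0):
    """Return up to `count` image member names chosen at random from a tar archive."""
    with tarfile.open(tar_path) as archive:
        names = sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(IMAGE_SUFFIXES)
        )
    if not names:
        raise ValueError("no image files found in " + str(tar_path))
    rng = random.Random(seed)  # fixed seed so the same samples can be re-checked later
    return rng.sample(names, min(count, len(names)))
```

For example, calling pick_calibration_samples("imagenet-1k-images.tar", count=3) would return three member names from the archive used in the configuration examples, provided that file exists locally.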
4.2. Accuracy tuning suggestions
To reduce the accuracy error introduced by quantization, the following 2 methods are recommended. Both require the model to be converted again after the config.prototxt file has been changed.
4.2.1. calibration settings
- Try the different combinations of quantization strategy and quantization method
- Try other calibration datasets
- Increase or decrease the amount of calibration data as appropriate
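Trying each strategy and method pairing means editing dataset_conf_calibration and recompiling once per pairing. The sketch below generates the four configuration fragments from the field and enum names shown in section 4.1.2; the helper name, the labels, and the default path are assumptions, and each fragment still has to be placed in config.prototxt by hand before running pulsar build again.

```python
from itertools import product

STRATEGIES = ["CALIB_STRATEGY_PER_TENSOR", "CALIB_STRATEGY_PER_CHANNEL"]
METHODS = ["OBSVR_METHOD_MIN_MAX", "OBSVR_METHOD_MSE_CLIPPING"]

TEMPLATE = """dataset_conf_calibration {{
  path: "{path}"
  type: DATASET_TYPE_TAR
  size: {size}
  batch_size: {batch_size}

  calibration_strategy: {strategy}
  observer_method: {method}
}}"""


def calibration_variants(path="imagenet-1k-images.tar", size=256, batch_size=32):
    """Yield (label, prototxt_fragment) for every strategy/method pairing."""
    for strategy, method in product(STRATEGIES, METHODS):
        label = (
            strategy.replace("CALIB_STRATEGY_", "")
            + "/"
            + method.replace("OBSVR_METHOD_", "")
        )
        fragment = TEMPLATE.format(
            path=path,
            size=size,
            batch_size=batch_size,
            strategy=strategy,
            method=method,
        )
        yield label, fragment


for label, fragment in calibration_variants():
    print("# " + label)
    print(fragment)
```

Changing the size argument covers the third suggestion as well, so the same loop can be reused to sweep the amount of calibration data.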
4.2.2. QAT Training
If none of the tuning techniques improves the accuracy of the model, the model is probably a corner case for the PTQ scheme, and you can try training it with QAT instead.
Attention
More tuning suggestions will be updated gradually.