4. Accuracy loss troubleshooting and accuracy tuning suggestions

4.1. Accuracy Loss Identification

Attention

If the converted model loses accuracy, please follow the recommended procedure below to locate the stage or layer causing the problem.

4.1.1. CheckLists

The first step is to identify the hardware platforms on which the accuracy loss occurs
  1. Accuracy loss occurs only on the AX platform

    Please continue down the checklist.

  2. Accuracy loss also occurs on all other platforms

    This is a common problem; the user needs to decide whether to train a better model and then re-quantize it.
    Also determine whether the other platforms use INT8, INT16, or mixed quantization.
    
In the second step, determine the stage where the accuracy loss occurs
  1. pulsar run reports low accuracy (cos-sim < 98%)

    Please follow the [Step 3] recommendations to continue the investigation
    
  2. On-board inference results parsed by the user's post-processing program show very low accuracy

    Please follow [Step 4] to continue the investigation.
    
Step 3, cos-sim below 98%, troubleshooting suggestions
  1. The output_config.prototxt file required by pulsar run must be generated automatically by pulsar build

  2. Check that the color space and mean/std settings in the config.prototxt configuration file are correct

  3. Use pulsar run to compare the cos-sim between model.lava_joint and model.onnx to check whether the accuracy loss occurs in the quantization stage

  4. Use layer-by-layer comparison to locate the layer where the precision loss occurs
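The cos-sim metric used throughout these steps is plain cosine similarity between the flattened outputs of the float model and the quantized model. A minimal sketch (an illustration only, not the toolchain's implementation):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two flattened output tensors.
    pulsar run reports this value as a percentage."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A cos-sim of 100% means the two outputs point in exactly the same direction; values below 98% indicate noticeable quantization error.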

Step 4, low accuracy on the board, troubleshooting suggestions
  1. When executing the run_joint command, information about the joint model is printed, so check whether the post-processor parses the output data correctly.

  2. If other platforms show no accuracy drop but a BadCase appears on the AX platform, see the on-board accuracy loss troubleshooting method.

Step 5, get help from AXera

If the problem remains unsolved after the first four steps, please send the relevant logs and conclusions to your FAE contact so that AX engineers can locate the problem.

4.1.2. Accuracy loss occurs after model compilation

This section elaborates on Step 3 of the CheckLists.

Hint

pulsar run is a tool integrated in the Pulsar toolchain for simulation and output comparison; see Simulation and pairing on x86 platforms for details.

If the cos-sim reported by pulsar run is very low after the original onnx model is compiled into a joint model, the converted model has lost accuracy and the problem needs to be investigated.

config Configuration

The config required by pulsar run is generated automatically by pulsar build.

# Note that the following commands are incomplete
pulsar build --input model.onnx --config config.prototxt --output_config output_config.prototxt  ...
pulsar run model.onnx model.joint --config output_config.prototxt  ...
csc & mean/std

After configuring the color space conversion (csc), mean/std must be configured in the same channel order.

# Configure the input data color space of the compiled model as BGR
dst_input_tensors {
    color_space: TENSOR_COLOR_SPACE_BGR
}

# mean/std must be filled in BGR order
input_normalization {
    mean: [0.406, 0.456, 0.485]  # mean, in B, G, R order
    std: [0.225, 0.224, 0.229]   # std, in B, G, R order
}

Setting color_space in dst_input_tensors to BGR means the calibration image data is read in BGR format at compile time, so mean/std must also be given in BGR order.
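To illustrate why the channel order matters, the sketch below applies BGR-order normalization to an image already decoded as BGR and scaled to [0, 1]. The constants are the common ImageNet statistics rearranged into B, G, R order and are purely illustrative:

```python
import numpy as np

# Illustrative BGR-order statistics (ImageNet values, example only)
MEAN_BGR = np.array([0.406, 0.456, 0.485])
STD_BGR = np.array([0.225, 0.224, 0.229])

def normalize_bgr(img):
    """Normalize an HxWx3 image whose last axis is B, G, R in [0, 1]."""
    return (img - MEAN_BGR) / STD_BGR
```

If the image were instead decoded as RGB, the same constants would be applied to the wrong channels, which is exactly the kind of silent accuracy loss this checklist targets.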

Check whether the model loses accuracy during the quantization phase

During pulsar build compilation, an intermediate file model.lava_joint is generated for debugging. It is used as follows:

# Note that the following commands are incomplete
pulsar run model.onnx model.lava_joint --input ...

This verifies whether any precision is lost during the quantization phase.

Solutions for accuracy loss in the quantization stage
  1. Enlarge the calibration dataset

    dataset_conf_calibration {
        path: "imagenet-1k-images.tar"
        type: DATASET_TYPE_TAR
        size: 256 # the actual number of samples used for calibration during compilation
        batch_size: 32 # default is 32, can be changed to other values
    }
    
  2. Adjust the calibration strategy and observer method

    • Calibration strategies: CALIB_STRATEGY_PER_CHANNEL and CALIB_STRATEGY_PER_TENSOR

    • Observer methods: OBSVR_METHOD_MIN_MAX and OBSVR_METHOD_MSE_CLIPPING

    • Strategies and methods can be combined pairwise; note that CALIB_STRATEGY_PER_CHANNEL may cause accuracy drops

    • The PER_TENSOR/MIN_MAX or PER_TENSOR/MSE_CLIPPING combinations are recommended

    dataset_conf_calibration {
        path: "imagenet-1k-images.tar" # calibration dataset
        type: DATASET_TYPE_TAR
        size: 256 # the actual number of samples used for calibration during compilation
        batch_size: 32 # default is 32, can be changed to other values

        calibration_strategy: CALIB_STRATEGY_PER_TENSOR # calibration strategy
        observer_method: OBSVR_METHOD_MSE_CLIPPING # observer method
    }
    
  3. Use INT16 quantization

  4. Enable dataset_conf_error_measurement to measure error during compilation

    dataset_conf_error_measurement {
        path: "imagenet-1k-images.tar"
        type: DATASET_TYPE_TAR
        size: 32
        batch_size: 8
    }
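The intuition behind the two observer methods can be sketched numerically: MIN_MAX scales to the full observed activation range, while MSE_CLIPPING searches for the clipping threshold that minimizes quantization error. The code below is a simplified illustration, not the toolchain's actual algorithm:

```python
import numpy as np

QMAX = 127  # symmetric INT8

def quantize(x, scale):
    """Fake-quantize x with a symmetric INT8 scale."""
    return np.clip(np.round(x / scale), -QMAX, QMAX) * scale

def scale_min_max(x):
    """MIN_MAX-style observer: cover the full observed range."""
    return np.abs(x).max() / QMAX

def scale_mse_clipping(x, steps=100):
    """MSE_CLIPPING-style observer: search for the clipping
    threshold with the lowest mean-squared quantization error."""
    amax = np.abs(x).max()
    best_scale, best_err = amax / QMAX, np.inf
    for t in np.linspace(amax / steps, amax, steps):
        s = t / QMAX
        err = np.mean((x - quantize(x, s)) ** 2)
        if err < best_err:
            best_err, best_scale = err, s
    return best_scale
```

Because the search includes the min-max threshold itself, the MSE-clipping scale never quantizes worse (in MSE, on the calibration data) than the min-max scale; for outlier-heavy activations it can choose a much tighter range.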
    
Layer-by-layer comparison

See layer wise compare for details.

pulsar debug

The pulsar debug function will be added later

4.1.3. Accuracy loss occurs on board


This section elaborates on Step 4 of the CheckLists.

Determining if the post-processor is wrong

Using the run_joint command on the AX development board, you can run on-board inference and then parse the results with the user's own post-processor.

To verify that the user's post-processor is error-free, compare the output of pulsar run with the output of run_joint for the same input.

Refer to the gt folder comparison instructions; if the outputs match but the final accuracy is still low, the user's post-processor likely contains an error.

The post-processor is correct, but the accuracy is still low.
Possible reasons:
  • The instructions generated by the NPU simulator and the cmode run produce inconsistent results.

  • Errors in run_joint.so or the NPU driver.

This kind of problem needs to be reported with logs so that it can be located and fixed quickly.

BadCase handling

For this type of BadCase, first check the cos-sim with pulsar run. If there is no serious accuracy loss (cos-sim is not below 98%),

send the BadCase input to the board and run it with run_joint,

then check whether the results are consistent with pulsar run. If not, there is a problem on the board side that needs to be fixed by an AX engineer.

4.1.4. Other notes

If you need an AX engineer to troubleshoot the problem, please provide detailed log information and relevant experimental findings.

Note: providing a minimal reproduction set will make it much faster to locate the problem.

Note

In some cases the SiLU activation causes the mAP of a detection model to be very low; replacing it with ReLU can solve the problem.
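One plausible intuition, sketched below: SiLU produces small negative values near zero, while ReLU is exactly zero there, so a coarse INT8 grid can represent the ReLU output more faithfully. This is an illustration of the two activations, not a statement about the toolchain's internals:

```python
import numpy as np

def silu(x):
    """SiLU (a.k.a. Swish): x * sigmoid(x); has a small negative tail."""
    return x / (1.0 + np.exp(-x))

def relu(x):
    """ReLU: exactly zero for all negative inputs."""
    return np.maximum(x, 0.0)
```

For example, silu(-1.0) ≈ -0.269 while relu(-1.0) == 0.0; those small negative responses consume part of the quantization range.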

Note

If the quantized dataset is very different from the training dataset, the accuracy will be significantly reduced.

To determine whether the calibration set is a reasonable choice, select samples from the calibration dataset and score them with pulsar run.

4.2. Precision tuning suggestions

For quantization accuracy errors, it is recommended that the user try the following two optimization methods; both require re-converting the model after updating the config.prototxt file.

4.2.1. calibration settings

  • Try different combinations of calibration strategy and observer method

  • Try other calibration datasets

  • Increase or decrease the amount of data as appropriate

4.2.2. QAT Training

When the model's accuracy cannot be improved with the various tuning techniques, the model is probably a corner case for the PTQ scheme; in that case, try training it with QAT.

Attention

More tuning suggestions will be updated gradually.