8. FAQ
8.1. Basic Questions
- What to do if you encounter a problem that cannot be solved for a long time
If you run into a problem that you cannot resolve on your own, please contact the relevant FAE for support. Also note that you should use the latest version of the toolchain for model conversion whenever possible.
- Basic information of the joint model
Basic information of the joint model:
Model input: the tensor DType defaults to UINT8 and the Layout defaults to NHWC.
Model output: the tensor DType defaults to FLOAT32 and the Layout is NCHW.
ColorSpace: defaults to TENSOR_COLOR_SPACE_AUTO, which is recognized automatically from the number of input channels (3-channel: BGR, 1-channel: GRAY). Other options: RGB, NV12, NV21, BGR0.
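If these defaults do not match the deployment, they can be overridden at conversion time in config.prototxt; a minimal sketch reusing the fields that appear later in this FAQ:
dst_input_tensors {
  color_space: TENSOR_COLOR_SPACE_RGB  # override the automatically chosen BGR
}
dst_output_tensors {
  data_type: INT16  # emit Q-value outputs instead of the default FLOAT32
}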
- If you need to change the VNPU settings, you need to recompile; the VNPU settings cannot be changed directly in an existing joint model
It is not technically possible to change the vnpu settings directly in a joint model, because:
During compilation the toolchain needs the input configuration information, and compilation gradually lowers the initial model into binary instruction code, which requires a large amount of raw model information.
Compilation is a continuous lowering process that loses information; the compiled binary instruction code cannot recover the original model information.
Besides, different vnpu configurations use different hardware resources and compile to different optimal binary models.
In summary, if you need to modify the vnpu setting of a model, you need to do the conversion again.
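Re-conversion is simply another pulsar build run against a config.prototxt that carries the new vnpu setting (the exact field name for the vnpu mode depends on your toolchain version; file names below are illustrative):
pulsar build --input xxx.onnx --config xxx_new_vnpu.prototxt --output xxx.joint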
- The detection model reports an OutOfMemory (OOM) error during conversion. How to solve it?
If the output layout of the detection model is NCHW, an OOM error can appear. This is because the NPU outputs the layout as NHWC by default, and the feature map output by a detection model is large, so the transpose of the model output is large and can easily exhaust memory when transposed directly. Therefore, you need to specify the output layout when converting a detection model:
pulsar build --input xxx.onnx --config xxx.prototxt --output xxx.joint --output_tensor_layout nhwc
- How to check the NPU utilization rate
After logging into the board via ssh, use the following commands to view it:
$ cd /proc/ax_npu
$ echo 1 > enable    # i.e. echo 1 > /proc/ax_npu/enable
$ cat top_period_ms
- What is the difference between --config and --output_config for pulsar build?
--config is a user-friendly prototxt that provides a variety of syntactic sugar to simplify configuration.
--output_config expands all the syntactic sugar and saves the result as a toolchain-friendly configuration file.
At this stage, the pulsar run emulation function requires the configuration file generated by --output_config.
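A typical round trip, as a hedged sketch (the flag spelling is taken from this FAQ, but the pulsar run argument layout may differ between toolchain versions; file names are illustrative):
# build the joint model and also dump the expanded, toolchain-friendly config
pulsar build --input model.onnx --config user.prototxt --output model.joint --output_config expanded.prototxt
# the emulation step consumes the expanded config, not the user-friendly one
pulsar run model.onnx model.joint --config expanded.prototxt --input test.jpg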
- The simulated fps output by pulsar build differs significantly from the fps measured on the board.
There are two general reasons for the discrepancy, both model specific:
A larger neu fps occurs because the DDR bandwidth on the board is not limited, while the pulsar simulation strictly enforces the DDR bandwidth limit.
A smaller neu fps is due to the fact that small costs omitted by the Cycle Model of the pulsar simulation (e.g. LUT configuration time) become non-negligible in some cases.
- How to configure multiple batches in prototxt
The following configuration allows you to set the target batch_size value explicitly:
# Configuration file parameter path: pulsar_conf
pulsar_conf {
  batch_size: 2  # Set the batch size to 2 for compiled model inference
}
- How to configure dynamic batch in prototxt
Dynamic batch can be implemented with the following configuration:
# Configuration file parameter path: pulsar_conf
pulsar_conf {
  batch_size_option: BSO_DYNAMIC  # The compiled model supports dynamic batch
  batch_size: 1  # A batch_size commonly used in actual inference
  batch_size: 2  # A batch_size commonly used in actual inference
  batch_size: 4  # The maximum batch_size is 4
}
For a more detailed description, see pulsar_conf.
- The onnx model input is RGB and the converted joint model is expected to take RGB input as well. How should this be configured?
Configure it in the configuration file as follows:
dst_input_tensors {
  color_space: TENSOR_COLOR_SPACE_RGB
}
- Can the converted .joint model run on the board like the previous .neu model?
A .joint model can run on the board just like a .neu model. In fact, the .joint model is the current mainstream on-board model format and the .neu model is the old format; Pulsar can convert a .neu model to a .joint model.
- Can PTQ run on GPUs?
The toolchain itself supports it, but the docker image does not use the nvidia base image, for size reasons.
- dataset_output_type defaults to BGR. Does this mean that the model input is in BGR format when calibrating with images from the dataset? If so, should the mean and std in config.prototxt also be set in BGR order?
Yes, they need to be configured in that order. A dataset_output_type of BGR means that the calibration images are read in BGR format at compile time, so mean/std have to be set in BGR order as well.
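As an illustration only (the exact message in config.prototxt that carries mean/std depends on the toolchain version, so treat the placement below as hypothetical), the repeated values follow the B, G, R channel order:
mean: 103.94  # B
mean: 116.78  # G
mean: 123.68  # R
std: 58.82
std: 58.82
std: 58.82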
- How to configure the Q value in config.prototxt
This can be done with the following configuration:
dst_output_tensors {
  data_type: INT16
}
- Is the Q value int16?
Not necessarily. Several Q value types can be used; see the description of data_type for details.
- How to calculate the CPU subgraph time for Q values
The Q value does not have a CPU subgraph; the /Q arithmetic operation is left to the client's post-processing code.
- The Q value still has to go to the CPU for the division, which doesn't seem to save time
Yes, the division runs on the CPU, but the /Q operation can be fused with other operations, and in most cases it is almost free. For example, if the division is needed after the detection post-processing step NMS, then computing Divisor*Q (i.e. scaling the comparison value by Q) is sufficient. Dividing the raw detection output alone is a large tensor operation that may take several times as long as the NPU inference, whereas the computation left after NMS is small.
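To make the Divisor*Q idea concrete, here is a minimal sketch in C that folds the /Q step into the detection threshold; every name and value in it (q_value, conf_threshold, raw_scores) is illustrative and not taken from the joint SDK:
#include <stdio.h>

/* Hedged sketch: fold the /Q step into the threshold so the large output
 * tensor is never divided element by element. */
int main(void) {
    const float q_value = 256.0f;        /* Q value printed by run_joint */
    const float conf_threshold = 0.25f;  /* detection threshold in float domain */
    /* scale the threshold once instead of dividing every element by Q */
    const float raw_threshold = conf_threshold * q_value;

    short raw_scores[] = {10, 90, 200, 250};  /* raw INT16 model outputs */
    const int num_boxes = sizeof(raw_scores) / sizeof(raw_scores[0]);

    for (int i = 0; i < num_boxes; ++i) {
        if ((float)raw_scores[i] >= raw_threshold) {  /* no per-element /Q */
            float score = raw_scores[i] / q_value;    /* divide survivors only */
            printf("box %d kept, score %.3f\n", i, score);
        }
    }
    return 0;
}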
- What is the Q value interface of the converted model?
Execute run_joint model.joint directly on the board, and the Q value will be printed in the log. The C++ interface in the joint sdk also provides nQuantizationValue.
- Do the toolchain and hardware support sparse acceleration?
Structured sparsity and low bit-width are supported, unstructured sparsity is not supported by hardware.
8.2. Viewing the Inference Report
According to the inference_report you can:
- Analyze inference bottlenecks: CONV, DDR, DMA, arithmetic
- Count the sources of computing power loss
- Evaluate the room for further optimization
- Analyze the direction of speed optimization
After the pulsar build command is run, an inference_report folder is saved in the current working directory.
This folder contains one or more part_x.lava folders (where x is a number starting from 0), and each part_x.lava folder contains an inference_report.log file.
For small models there is usually only one part_0.lava folder and one inference_report.log. When the model is too large, it is split into multiple sub-models that are executed sequentially, so there are multiple part_x.lava folders.
In this case, the tot_cyc of the model is the sum of the tot_cyc of the individual sub-models, and the total_io_data_size of the DDR is the sum of the total_io_data_size of the individual sub-models.
# Smaller model, contains only the part_0.lava folder
➜ DEMO cd inference_report
➜ inference_report tree -L 2
.
└── part_0.lava
├── inference_report.log
├── subgraph_0
├── subgraph_1
├── subgraph_2
├── subgraph_3
├── subgraph_4
└── subgraph_5
7 directories, 1 file
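When a model has been split into several part_x.lava folders, the per-part totals can be summed mechanically. A minimal sketch, assuming each inference_report.log reports tot_cyc on a line ending with its value (verify the exact log format in your toolchain version):
➜ inference_report grep -h "tot_cyc" part_*.lava/inference_report.log | awk '{sum += $NF} END {print "total tot_cyc:", sum}'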
View inference_report.log; an example is shown below.
inference_report.log contains some custom terms, some of which are explained in the following nomenclature:
- ld: reads from DDR, writes to OCM
- st: reads from OCM, writes to DDR
- mv: reads from OCM, writes to OCM
The role of inference_report.log is illustrated by a typical example, as in the following case:
In the non-virtual NPU case, as shown in the figure (blue box above), three types of EU are involved in model inference, namely conv-2cores, teng and stream.
The table counts the cycle percentage for each EU, which physically means the cycles actually run by each type of EU divided by the total cycles actually spent on model inference. This ratio gives a visual indication of how busy each EU is; for example, the ratio of teng in the figure reaches 98%, which means it is running almost at full capacity.
teng and stream are able to read and write data on the DDR. A detailed breakdown of the proportion of each type of task is shown in the profile of the stream EU.
The values of ld_ratio/ld_param_ratio/st_ratio (red box above) reflect the time and percentage of DDR read/write tasks performed by the corresponding EU and can be used to analyze the DDR bandwidth pressure.
Summary
In general, the following conditions indicate that, for a given DDR_BW, the speed bottleneck of the model is the DDR bandwidth:
- The ratio of teng/stream is high and significantly higher than the ratio of the other EUs.
- ld_ratio/ld_param_ratio/st_ratio is high in teng/stream.
Conversely, the following condition indicates that the speed bottleneck of the model is the computing power:
- The ratio of conv is high and significantly higher than the ratio of the other EUs.
More specifically, in the example above the ratio of teng is 98%, which is significantly higher than conv's 39.0%;
the DDR read/write tasks in teng take 87.1% + 0.4% = 87.5% of the time, which is the majority of that EU's work time, so the speed bottleneck of this model is considered to be the DDR bandwidth.
Hint
For the virtual NPU111 there are only two EUs, conv-1core/teng, which are counted in the same way as for the non-virtual NPU.