The Inference Engine can run models in different formats, with various input and output precisions. In this video, let's understand the preferred model formats for the different Intel hardware devices.
Quantization in deep neural networks refers to using lower-precision representations of tensor values, typically in place of the full-precision floating-point arithmetic used for computation and storage.
At training time, we typically use full-precision (FP32) representations to train and store the model. At inference time, this opens an avenue for optimization: the weights and biases of the network can be quantized to a lower precision, e.g. FP16 or INT8, before running inference.
The Model Optimizer can only create FP32 and FP16 IR models; for even lower precision, i.e. INT8 quantized models, the Post-Training Optimization Tool (POT) needs to be used. A calibration dataset is required to check accuracy during INT8 quantization; a small subset of the training/validation dataset is sufficient.
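As a concrete illustration, here is a minimal sketch of producing an FP16 IR. It assumes the OpenVINO 2022.x Python conversion API (openvino.tools.mo.convert_model) and a hypothetical ONNX model named model.onnx; older releases expose the same option through the mo command line, shown in the trailing comment.

```python
# Minimal sketch: convert a model to an FP16 IR.
# Assumes OpenVINO 2022.x APIs and a hypothetical input file "model.onnx".
from openvino.tools.mo import convert_model
from openvino.runtime import serialize

# compress_to_fp16=True stores the IR weights in FP16 instead of the default FP32.
ov_model = convert_model(input_model="model.onnx", compress_to_fp16=True)

# Write the IR to disk as model_fp16.xml / model_fp16.bin.
serialize(ov_model, "model_fp16.xml", "model_fp16.bin")

# Roughly equivalent legacy Model Optimizer command line:
#   mo --input_model model.onnx --data_type FP16
```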
From the above video, we learned that the CPU and GPU plugins support both FP32 and FP16, but the preferred model precision for inference on CPU is FP32, while FP16 is preferred on GPU. Both the CPU and GPU plugins also support the even lower INT8 precision, but one important point to note here is that the Model Optimizer cannot quantize a model to INT8; you need to use the Post-Training Optimization Tool to quantize the model to INT8 format.
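To make the INT8 path concrete, below is a minimal sketch of POT's DefaultQuantization flow. It assumes the openvino.tools.pot package layout from OpenVINO 2022.x; the file names, the CalibrationLoader class, the calibration image folder, and the 300-sample subset size are illustrative assumptions, and the exact DataLoader item format varies between POT releases.

```python
# Minimal sketch of INT8 post-training quantization with POT (DefaultQuantization).
# Assumes an IR produced by the Model Optimizer (model.xml / model.bin) and a
# small folder of calibration images; the loader is simplified for illustration.
import os

import cv2
import numpy as np
from openvino.tools.pot import DataLoader, IEEngine, load_model, save_model, create_pipeline


class CalibrationLoader(DataLoader):
    """Feeds a small subset of the training/validation images to POT."""

    def __init__(self, image_dir, input_shape=(1, 3, 224, 224)):
        self.image_dir = image_dir
        self.files = sorted(os.listdir(image_dir))
        self.shape = input_shape

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # DefaultQuantization needs no labels, so the annotation is None.
        # Note: older POT releases expect ((index, annotation), image) instead.
        image = cv2.imread(os.path.join(self.image_dir, self.files[index]))
        image = cv2.resize(image, (self.shape[3], self.shape[2]))
        image = image.transpose(2, 0, 1).astype(np.float32)  # HWC -> CHW
        return image, None


model_config = {"model_name": "model", "model": "model.xml", "weights": "model.bin"}
engine_config = {"device": "CPU"}
algorithms = [{
    "name": "DefaultQuantization",
    "params": {"target_device": "CPU", "preset": "performance", "stat_subset_size": 300},
}]

model = load_model(model_config)
engine = IEEngine(config=engine_config, data_loader=CalibrationLoader("calib_images"))
pipeline = create_pipeline(algorithms, engine)
int8_model = pipeline.run(model)
save_model(int8_model, save_path="int8_ir")
```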
For the VPU plugin, FP16 is the only supported data precision, so if you decide to use an accelerator like the Intel® Movidius™ Myriad™ X VPU, you need to convert your model to FP16 format first.
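For completeness, here is a minimal sketch of loading an FP16 IR onto the Myriad X VPU through the OpenVINO runtime. The file names are placeholders carried over from the earlier conversion sketch, "MYRIAD" is the device name the VPU plugin registers under, and the dummy input is only there to show the call flow.

```python
# Minimal sketch: run an FP16 IR on the Intel Movidius Myriad X VPU.
# Assumes an FP16 IR (model_fp16.xml / model_fp16.bin) and that the
# MYRIAD device is visible to the OpenVINO runtime on this host.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model_fp16.xml")        # weights are picked up from model_fp16.bin
compiled = core.compile_model(model, "MYRIAD")   # the VPU plugin only accepts FP16 models

# Single inference with a dummy input matching the model's first input shape.
dummy = np.random.rand(*compiled.input(0).shape).astype(np.float32)
result = compiled([dummy])[compiled.output(0)]
print(result.shape)
```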