Optimization and Quantization of Models for better performance


Introduction to low-precision optimization with the Post-Training Optimization Tool (POT)

The Post-Training Optimization Tool (POT) can lower the precision of a model from FP32 to INT8 through a process called quantization. Quantization accelerates certain models on hardware that supports INT8: an INT8 model has a smaller memory footprint and faster inference, at the cost of a small reduction in accuracy. In practice, POT improves latency with little to no degradation in model accuracy.

Post-training quantization is fast and usually takes a few minutes, depending on the model size and development hardware. To apply post-training algorithms from the POT, you will need:

  • A full precision model, FP32 or FP16, converted into the OpenVINO™ toolkit Intermediate Representation (IR) format

  • A representative calibration dataset of samples from your use case scenario, for example, 300 images (a minimal loader sketch follows this list)
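
The loader below is a minimal sketch of how such a calibration dataset could be fed to POT, assuming the openvino.tools.pot Python API shipped with recent OpenVINO™ releases; the folder name, image size, and preprocessing are illustrative and should match your own model, and the exact item format expected by `__getitem__` can vary slightly between POT versions.

```python
# Minimal calibration data loader sketch for POT (openvino.tools.pot API).
# "calibration_images/" and the 224x224 input size are placeholders.
import os

import cv2
import numpy as np
from openvino.tools.pot import DataLoader


class ImageFolderLoader(DataLoader):
    """Feeds a few hundred representative images to the POT calibration run."""

    def __init__(self, image_dir, input_size=(224, 224)):
        self._files = sorted(
            os.path.join(image_dir, name) for name in os.listdir(image_dir)
        )
        self._input_size = input_size

    def __len__(self):
        return len(self._files)

    def __getitem__(self, index):
        # Read and preprocess one image into NCHW float32, matching the model input.
        image = cv2.imread(self._files[index])
        image = cv2.resize(image, self._input_size)
        image = image.transpose(2, 0, 1)[np.newaxis].astype(np.float32)
        # DefaultQuantization needs no labels, so the annotation is None.
        return image, None
```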

[Figure: optimization flow with the Model Optimizer and POT]

The above figure shows the optimization flow for a model with the Model Optimizer and POT tools.

The process starts with the Model Optimizer tool, which converts the model from the source framework to the OpenVINO™ toolkit Intermediate Representation (IR) so it can run on the CPU with the Inference Engine. In this step, you should ensure that the model trained on the target dataset can be successfully inferred with the Inference Engine in floating-point precision (FP32 or FP16).
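
As a minimal sketch of this step, assuming an ONNX source model and the classic Inference Engine Python API (file names and the random input are placeholders, and the exact Model Optimizer command depends on your OpenVINO™ version):

```python
# Sketch: verify the FP32 IR runs on CPU with the Inference Engine before quantizing.
# The IR is assumed to have been produced by the Model Optimizer, e.g.:
#   mo --input_model model.onnx --output_dir ir_fp32
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="ir_fp32/model.xml", weights="ir_fp32/model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

# Run one inference on dummy data just to confirm the model loads and executes.
input_name = next(iter(net.input_info))
input_shape = net.input_info[input_name].input_data.shape
dummy = np.random.rand(*input_shape).astype(np.float32)
outputs = exec_net.infer({input_name: dummy})
print({name: out.shape for name, out in outputs.items()})
```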

The next step is to use the POT tool to quantize the model to INT8 format. There are two algorithms available:

  • DefaultQuantization This algorithm is designed to perform fast and, in many cases, accurate 8-bit quantization of neural networks. Its goal is the fastest possible quantization, so it is performance-focused; although it is not accuracy-aware, in most cases the accuracy drop is minimal.

  • AccuracyAwareQuantization As the name suggests, this algorithm performs accurate 8-bit quantization while keeping the model within a predefined accuracy drop, for example 1% (set through the configuration file). The algorithm quantizes as many layers as it can without exceeding the accuracy drop you allow. It starts by running the DefaultQuantization algorithm, then follows an iterative process in which each layer is ranked by the performance gain it can provide when converted to INT8 and the accuracy drop it can cause. At the end of the process, you get a high-performing model with an optimal combination of floating-point and quantized INT8 layers. A configuration sketch follows this list.
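
The snippet below is a minimal sketch of running DefaultQuantization through the openvino.tools.pot Python API; the paths, model name, and parameter values are illustrative, and ImageFolderLoader is the calibration loader sketched earlier.

```python
# Sketch: quantize an FP32 IR model to INT8 with POT's DefaultQuantization.
# Paths, names, and parameter values are placeholders.
from openvino.tools.pot import IEEngine, load_model, save_model, create_pipeline

model_config = {
    "model_name": "model",
    "model": "ir_fp32/model.xml",
    "weights": "ir_fp32/model.bin",
}
engine_config = {"device": "CPU"}
algorithms = [
    {
        "name": "DefaultQuantization",   # fast, performance-focused 8-bit quantization
        "params": {
            "target_device": "CPU",
            "preset": "performance",
            "stat_subset_size": 300,     # how many calibration samples to use
        },
    }
]

model = load_model(model_config)
engine = IEEngine(config=engine_config,
                  data_loader=ImageFolderLoader("calibration_images"))
pipeline = create_pipeline(algorithms, engine)

int8_model = pipeline.run(model)          # apply the quantization algorithm
save_model(int8_model, save_path="ir_int8", model_name="model_int8")
```

For AccuracyAwareQuantization, the algorithm entry would instead use "name": "AccuracyAwareQuantization", typically with a parameter such as "maximal_drop": 0.01 for a 1% allowed accuracy drop, and the engine would additionally need an accuracy metric and an annotated dataset so the drop can be measured.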

To understand more about the POT, you can refer to the official documentation for the Post-Training Optimization Tool. For details about the low-precision optimization flow in the OpenVINO™ toolkit, see the Low Precision Optimization Guide.
