Model Quantization

Quantization, Model compression

Deployment

Infrastructure

Soft glowing orange and yellow light with a gradient blending into black background.

TL;DR

The process of converting high-precision numerical values in an AI model to lower-precision formats, reducing memory footprint and accelerating inference speed.

In depth

This compression technique maps continuous, high-precision weights and activations, typically 32-bit floating-point numbers, into smaller discrete sets like 8-bit or 4-bit integers. It utilizes scaling factors and zero-points to ensure that the lower-precision representations retain the core mathematical relationships of the original model. While it can introduce a small drop in accuracy, it significantly decreases the memory bandwidth required for inference. There are two primary avenues for implementation, known as post-training quantization and quantization-aware training.

Why this matters for your business

It is a cornerstone requirement for running massive artificial intelligence models on edge devices like smartphones, laptops, and local servers. By heavily shrinking model sizes, businesses can sharply decrease cloud hosting costs and latency.