Definition: Technique that reduces the numerical precision of AI model weights to decrease size and accelerate inference, with minimal quality loss.
— Source: NERVICO, Product Development Consultancy
What is Quantization
Quantization is an optimization technique that reduces the numerical precision of an AI model’s weights and activations. Instead of using 32-bit floating point numbers (FP32) or 16-bit floating point (FP16), quantization converts these values to lower-precision formats like INT8 (8-bit integers) or INT4 (4-bit integers). This shrinks the model’s memory footprint, accelerates inference, and lowers hardware requirements, with quality loss typically below 1-2%.
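The core idea can be sketched in a few lines: pick a scale that maps the tensor's float range onto the integer range, round, and store the integers plus the scale. This is a minimal illustration (function names `quantize_int8` and `dequantize` are made up for this sketch), not any specific library's API:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric INT8 quantization: map floats into [-127, 127] with one scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored integers."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(w.nbytes, q.nbytes)  # 128 32 -- INT8 stores the same tensor in 1/4 the bytes
```

Rounding each value to the nearest level bounds the per-element error by half a quantization step, which is why quality loss stays small when the scale is chosen well.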
How It Works
There are two main approaches. Post-training quantization (PTQ) converts the weights of an already trained model to lower precision without retraining. It is fast and easy to apply but may lose some accuracy. Quantization-aware training (QAT) simulates lower precision during the training process, allowing the model to adapt and compensate for information loss. Both methods map continuous ranges of floating-point values to a discrete set of quantization levels, using techniques like calibration and per-channel scaling to minimize error.
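The per-channel scaling mentioned above can be sketched as a toy PTQ pass: each output channel (row) gets its own scale, so one channel with a large value range does not degrade the precision of the others. The function name and the max-based calibration are assumptions for illustration; production toolkits refine scales with calibration data or error-minimization searches:

```python
import numpy as np

def quantize_per_channel(weights: np.ndarray):
    """Per-channel symmetric INT8 PTQ: one scale per output channel (row).

    Calibration here is simply the per-row absolute maximum; real pipelines
    typically calibrate on sample activations instead.
    """
    scales = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

w = np.random.randn(3, 16).astype(np.float32)
w[0] *= 100.0  # give one channel a much larger range
q, scales = quantize_per_channel(w)
err = np.abs(w - q.astype(np.float32) * scales).max(axis=1)
print(err <= scales.ravel() / 2 + 1e-6)  # per-channel error stays within half a step
```

With a single tensor-wide scale, the small-range channels in this example would be crushed into a handful of integer levels; per-channel scales keep each row's resolution intact.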
Why It Matters
Quantization is essential for making LLMs economically viable in production. A 70B parameter model in FP16 requires approximately 140 GB of VRAM for its weights alone. Quantized to INT4, the weights shrink to roughly 35 GB, so the same model fits on a single GPU. For companies, this means running more powerful models on more accessible hardware, reducing cloud infrastructure costs, and enabling inference on edge devices.
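The VRAM figures above follow directly from parameter count times bits per parameter. A quick back-of-the-envelope helper (the function name is ours, and this counts weights only, ignoring KV cache and activations):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage in gigabytes: parameters * bits / 8 bytes."""
    return n_params * bits / 8 / 1e9

# A 70B-parameter model, weights only (KV cache and activations add more):
print(weight_memory_gb(70e9, 16))  # FP16 -> 140.0 GB
print(weight_memory_gb(70e9, 4))   # INT4 -> 35.0 GB
```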
Practical Example
A startup wants to run a 70B parameter Llama model on their own infrastructure. With FP16, they would need two A100 80 GB GPUs (monthly cloud cost exceeding $5,000). Applying GPTQ quantization to 4 bits, they run the model on a single A100 with imperceptible quality degradation, cutting their infrastructure cost in half.
Related Terms
- Model Distillation - Complementary model compression technique
- LoRA - Efficient fine-tuning method compatible with quantized models
- Inference Optimization - Field that includes quantization as a key technique
Last updated: February 2026
Category: Artificial Intelligence
Related to: Model Compression, Inference Optimization, LoRA, QLoRA
Keywords: quantization, model compression, int8, int4, gptq, inference optimization, ptq, qat, vram reduction