Benchmarking Dynamic Quantization for Larger Language Models

Modern language models can answer complex queries and generate high‑quality text, but they are heavy: hundreds of megabytes of weights must be loaded into memory, and every query runs millions of multiply‑accumulate operations. To reduce latency and cost, practitioners use optimisation techniques such as model distillation, quantization, pruning, dynamic/continuous batching, and KV‑cache reuse. When these methods are combined, costs can be cut by up to 80 % while maintaining acceptable quality (latitude-blog.ghost.io). This post focuses on dynamic INT8 quantization of a medium‑sized GPT‑2 model, shows the performance difference between the full‑precision and quantized versions, and discusses the trade‑offs.

Why Quantization?

Quantization reduces the precision of a model’s weights from 32‑bit floating‑point values to 8‑bit (or lower) integers. PyTorch’s quantization documentation notes that INT8 quantization shrinks model size and memory‑bandwidth requirements by roughly 4× and that INT8 arithmetic is typically 2–4× faster than FP32 (docs.pytorch.org). For example, a 500‑million‑parameter model occupying 2 GB in FP32 can be reduced to 0.5 GB when quantized to INT8 (latitude-blog.ghost.io). This smaller footprint allows large language models to run on memory‑constrained devices and reduces inference latency.
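
As a quick back‑of‑the‑envelope check (a sketch, not part of the original benchmark), the memory arithmetic is simple: each FP32 weight takes 4 bytes, each INT8 weight takes 1 byte.

    # Rough weight-memory estimate for a hypothetical 500M-parameter model.
    params = 500_000_000          # number of weights (illustrative figure)
    fp32_gb = params * 4 / 1e9    # 4 bytes per FP32 weight
    int8_gb = params * 1 / 1e9    # 1 byte per INT8 weight
    print(f"FP32: {fp32_gb:.1f} GB, INT8: {int8_gb:.1f} GB")  # 2.0 GB vs 0.5 GB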

Weight‑only quantization (e.g. 4‑bit AWQ or GPTQ) can improve speed even further. In real deployments, quantized LLMs show dramatic throughput gains: DeepSeek‑7B’s throughput on an NVIDIA RTX 4090 increases from 52 tokens/s to 130 tokens/s with AWQ, while Mistral‑7B on an AWS g5.xlarge instance jumps from 28 tokens/s to 88 tokens/s (latitude-blog.ghost.io). A 4‑bit GPTQ version of Llama‑3.2 1B maintained its F1 score while running about 30 % faster (latitude-blog.ghost.io).

Despite these benefits, quantization can degrade output quality if not applied carefully. Lower‑precision representations introduce rounding error and remove representational capacity. Advanced techniques like AWQ and GPTQ mitigate this with per‑channel weight scaling and calibration on sample data. Our example uses PyTorch’s simple dynamic quantization, which is less powerful than these weight‑only methods but trivial to apply.
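
For comparison, here is a minimal sketch of loading a model with 4‑bit weight‑only quantization through Hugging Face’s bitsandbytes integration (not AWQ or GPTQ, but the same weight‑only idea). It assumes a CUDA GPU and that the bitsandbytes and accelerate packages are installed.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # 4-bit NF4 weight-only quantization; activations are computed in FP16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model_4bit = AutoModelForCausalLM.from_pretrained(
        "gpt2",
        quantization_config=bnb_config,
        device_map="auto",   # place layers on the available GPU
    )
    tokenizer = AutoTokenizer.from_pretrained("gpt2")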

The Example: Measuring FP32 vs. INT8 GPT‑2

We built a script (inference_example_large.py) that loads GPT‑2 (124M parameters), processes a handful of prompts, quantizes the model dynamically, and then processes the same prompts. The key steps are:

  1. Load model and tokenizer. We use Hugging Face’s AutoModelForCausalLM with the GPT‑2 checkpoint and its tokenizer. Because GPT‑2 lacks a padding token, we assign the end‑of‑sequence token (eos_token) as the pad token and set padding_side='left' to avoid runtime warnings.

  2. Run FP32 inference. The script encodes a list of prompts, generates up to 50 new tokens for each, and measures total time.

  3. Apply dynamic quantization. torch.quantization.quantize_dynamic replaces the model’s nn.Linear modules with INT8 dynamically quantized versions. Embedding and layer‑norm parameters remain in FP32, so the model size reduction is limited.

  4. Run INT8 inference with the quantized model and measure time.

Here is a condensed version of the code’s core logic (full code in the JordiCorbilla/langgraph-cookbook repository):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time, psutil

model_name = "gpt2"
dev = torch.device("cpu")

# Load and patch tokenizer (GPT-2 has no pad token)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Load full-precision model
model = AutoModelForCausalLM.from_pretrained(model_name).to(dev)
model.eval()

prompts = [
    "Write a poem about quantization.",
    "Why is dynamic batching useful?",
    # 4 prompts total...
]
max_new_tokens = 50

# FP32 inference
start = time.time()
for p in prompts:
    inp = tokenizer(p, return_tensors="pt").to(dev)
    out = model.generate(
        input_ids=inp["input_ids"],
        attention_mask=inp["attention_mask"],
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
end = time.time()
fp32_time = end - start

# Dynamically quantize linear layers
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# INT8 inference
start = time.time()
for p in prompts:
    inp = tokenizer(p, return_tensors="pt").to(dev)
    out = model_int8.generate(
        input_ids=inp["input_ids"],
        attention_mask=inp["attention_mask"],
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
end = time.time()
int8_time = end - start
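
To see how much (or how little) the quantized model actually shrinks, one rough check is to serialize each state_dict and compare sizes. This is a sketch that reuses the model and model_int8 objects from the snippet above:

    import io
    import torch

    def state_dict_mb(m):
        """Serialize a model's state_dict to memory and return its size in MB."""
        buf = io.BytesIO()
        torch.save(m.state_dict(), buf)
        return buf.getbuffer().nbytes / 1e6

    print(f"FP32 state_dict: {state_dict_mb(model):.0f} MB")
    print(f"INT8 state_dict: {state_dict_mb(model_int8):.0f} MB")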

Results

Our run on a Windows laptop (CPU) produced the following numbers:

Model  | Precision    | Weight size | Total time for 4 prompts | Speedup
GPT‑2  | FP32         | ~474 MB     | ~6.57 s                  | baseline
GPT‑2  | INT8 dynamic | ~474 MB*    | ~5.78 s                  | 1.14×

*Dynamic quantization only converts nn.Linear layers. Embedding and layer‑norm weights remain in FP32, so the overall model footprint does not shrink. Weight‑only quantization such as AWQ/GPTQ can compress all weights and yield 4×–8× reductions (docs.pytorch.org; latitude-blog.ghost.io).
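
To verify exactly which submodules the dynamic quantizer converted, one can walk the quantized model and look for dynamically quantized Linear layers. A minimal sketch, assuming a recent PyTorch where these classes live under torch.ao.nn.quantized.dynamic:

    import torch.ao.nn.quantized.dynamic as nnqd

    # List every submodule that was replaced by a dynamically quantized Linear.
    for name, module in model_int8.named_modules():
        if isinstance(module, nnqd.Linear):
            print(f"quantized: {name} -> {module}")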

Interpretation

  • Modest gains on small models. On a 124M‑parameter GPT‑2, dynamic quantization delivered only a ~14 % speedup and no memory savings. This is because the model’s largest tensor, the token embedding matrix (which GPT‑2 ties to the LM head), remains in FP32. Dynamic quantization shines on CPU‑bound workloads dominated by large linear layers but yields diminishing returns on small models or GPU‑accelerated inference.

  • Larger models benefit more. Weight‑only quantization of big models (e.g., Llama‑3 70B) compresses every weight matrix, cutting memory by roughly 4× and roughly doubling token throughput (latitude-blog.ghost.io). AWQ and GPTQ produce 2×–4× speedups while preserving, and sometimes even improving, evaluation metrics (latitude-blog.ghost.io).

  • Accuracy trade‑offs. Lower‑bit quantization introduces rounding error. While our 8‑bit dynamic model produced sensible text, 4‑bit quantization must be applied with care (e.g., per‑channel scaling, calibration) to avoid harming output quality; a quick way to compare outputs is sketched after this list. The Latitude article emphasises that quantization maintains accuracy “good enough for most applications” (latitude-blog.ghost.io), but domain‑specific tasks may require careful evaluation.
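
As a minimal sanity check (a sketch that reuses model, model_int8, and tokenizer from the benchmark script), one can compare greedy generations from the FP32 and INT8 models on the same prompt:

    prompt = "Why is dynamic batching useful?"
    inp = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        fp32_out = model.generate(**inp, max_new_tokens=30, do_sample=False,
                                  pad_token_id=tokenizer.eos_token_id)
        int8_out = model_int8.generate(**inp, max_new_tokens=30, do_sample=False,
                                       pad_token_id=tokenizer.eos_token_id)

    print("FP32:", tokenizer.decode(fp32_out[0], skip_special_tokens=True))
    print("INT8:", tokenizer.decode(int8_out[0], skip_special_tokens=True))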

Continuous and Dynamic Batching

Quantization isn’t the only lever for faster LLMs. Dynamic or continuous batching groups multiple inference requests that arrive close together in time into a single batch: instead of running one forward pass per request, the server merges them into micro‑batches and runs a single forward pass over all of them. Systems like vLLM report up to 23× throughput gains by continuously adding new prompts mid‑generation and sharing key/value caches across requests.

In our earlier example (with a tiny MLP), the first request took ~23 ms, while subsequent requests were processed in 0.5–2 ms because they piggy‑backed on the existing batch. Continuous batching is especially powerful for interactive chat applications where many users send short messages concurrently.
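
Full continuous batching requires a serving engine, but the core idea of merging concurrent prompts into one forward pass can be illustrated with the GPT‑2 setup from earlier. This is a minimal static micro‑batching sketch (not true continuous batching), with illustrative prompts; left padding was configured when the tokenizer was loaded.

    # Merge several concurrent prompts into one batched forward pass.
    batch_prompts = [
        "Write a poem about quantization.",
        "Why is dynamic batching useful?",
        "Explain KV-cache reuse in one sentence.",  # illustrative extra prompt
    ]
    enc = tokenizer(batch_prompts, return_tensors="pt", padding=True)

    with torch.no_grad():
        out = model.generate(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
            max_new_tokens=30,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )

    for seq in out:
        print(tokenizer.decode(seq, skip_special_tokens=True))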

Conclusion

Dynamic quantization is an easy way to reduce the computational cost of LLM inference, but its benefits scale with model size. Our GPT‑2 experiment shows only modest gains because embeddings dominate memory usage and remain in FP32. For significant improvements, practitioners should adopt weight‑only quantization and pair it with continuous batching, KV‑cache optimisation, and model distillation. These methods, when applied thoughtfully, can make deploying large language models feasible even in cost‑constrained environments.
