How Residual Quantization breaks giant embedding tables into compact, hierarchical codebooks — dramatically reducing parameters while preserving representation quality.
In recommendation and search systems, every item (video, product, document) is assigned a unique embedding: the model learns one dense vector per item.
Instead of storing a unique vector for every item, we approximate each embedding as a sum of shared codebook vectors, selected level by level: `x ≈ c1[i1] + c2[i2] + ... + cL[iL]`, one index per level.
The process works by quantizing the residual error at each level: pick the codeword nearest the current residual, subtract it, and pass the remaining error to the next level.
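A minimal sketch of that loop. Everything here is illustrative: real systems learn the codebooks (e.g. by running k-means on the residuals at each level), whereas this block uses random codebooks with a decaying per-level scale to mimic the coarse-to-fine structure, and `rq_encode` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, D = 4, 256, 2   # levels, codewords per level, embedding dim (2D for illustration)

# Stand-in for learned codebooks: each level is scaled down so later
# levels model the (smaller) leftover error.
codebooks = [0.1**l * rng.normal(size=(K, D)) for l in range(L)]

def rq_encode(x, codebooks):
    """Quantize the residual at each level; return one index per level."""
    residual = x.astype(float).copy()
    indices = []
    for level in codebooks:
        # Nearest codeword to what is still unexplained.
        idx = int(np.argmin(np.linalg.norm(level - residual, axis=1)))
        indices.append(idx)
        residual = residual - level[idx]  # pass remaining error to next level
    return indices

x = rng.normal(size=D)           # a target embedding
codes = rq_encode(x, codebooks)  # L integers: the item's Semantic ID
```

The tuple of indices is all that needs to be stored per item; the codebooks themselves are shared across the whole catalog.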
977× fewer parameters — shared codebooks replace per-item storage.
Watch how each codebook level refines the approximation in 2D. Choose a target vector and step through the quantization levels.
The reconstruction is built by summing the codebook vectors chosen at each level. Each successive vector is smaller, capturing finer residual details.
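Decoding is just a sum. A self-contained sketch, again with scaled random codebooks standing in for learned ones (the scale factor is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, D = 4, 256, 2
# Hypothetical codebooks; the 0.1**l scale mimics the coarse-to-fine
# structure that learned codebooks develop.
codebooks = [0.1**l * rng.normal(size=(K, D)) for l in range(L)]

def rq_decode(indices, codebooks):
    """Reconstruction = sum of the chosen codeword at every level."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

approx = rq_decode([12, 7, 200, 3], codebooks)  # any 4-index Semantic ID
```

Because of the per-level scaling, the vector contributed by the last level is far smaller than the one contributed by the first, which is the "finer residual details" behavior described above.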
Each codebook level reduces the remaining quantization error, with the coarsest level capturing the most information.
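One way to see this numerically is to track the residual norm level by level. A sketch under the same illustrative assumptions as before (random, scale-decayed codebooks in place of learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, D = 4, 256, 2
codebooks = [0.1**l * rng.normal(size=(K, D)) for l in range(L)]  # stand-in for learned codebooks

x = rng.normal(size=D)
residual = x.copy()
errors = [np.linalg.norm(residual)]   # error before any quantization
for level in codebooks:
    idx = int(np.argmin(np.linalg.norm(level - residual, axis=1)))
    residual = residual - level[idx]
    errors.append(np.linalg.norm(residual))
# errors[0] is the raw norm of x; each later entry is the leftover error
# after one more level, with the coarsest level removing the largest share.
```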
| | Traditional Embedding | Semantic ID (RQ) |
|---|---|---|
| Representation | One unique D-dim vector per item | Tuple of L codebook indices per item |
| Parameters | N × D (scales with catalog) | L × K × D (fixed, shared) |
| Example | 1M × 256 = 256M params | 4 × 256 × 256 = 262K params |
| New items | Requires retraining | Encode with existing codebooks |
| Structure | Flat, no hierarchy | Hierarchical: coarse → fine |
| Storage per item | D float32s (256 × 4 B = 1024 bytes) | L byte-sized indices (4 × 1 B = 4 bytes) |
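The table's arithmetic, spelled out with the example numbers (float32 embeddings and one byte per index, since K = 256 indices fit in a byte):

```python
N, D = 1_000_000, 256   # catalog size, embedding dimension
L, K = 4, 256           # RQ levels, codewords per level

trad_params = N * D     # per-item table: 256,000,000 parameters
rq_params = L * K * D   # shared codebooks: 262,144 parameters
ratio = trad_params / rq_params

trad_bytes_per_item = D * 4   # 256 float32 values -> 1024 bytes
rq_bytes_per_item = L * 1     # 4 indices, one byte each -> 4 bytes

print(f"{ratio:.0f}x fewer parameters")
```

Note that the codebook cost is fixed: growing the catalog from 1M to 10M items adds nothing to `rq_params`, only 4 bytes of indices per new item.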