TL;DR
Google researchers built TurboQuant, a compression algorithm that shrinks LLM key-value cache memory by 6x and speeds up attention computation by up to 8x on H100 GPUs, all with zero measurable accuracy loss. It needs no retraining, no calibration data, and works on any existing model out of the box. The paper heads to ICLR 2026, community implementations already exist, and memory chip stocks dropped on the news. If you run local LLMs, you’ll want to know about this.
The Other Memory Problem
You’ve probably heard people complain about VRAM when running local models. “I need 24 GB just to load a 7B model.” That’s the weight memory problem, and quantization formats like GGUF and AWQ already handle it reasonably well.
But there’s a second memory problem that gets far less attention: the KV cache.
Every time an LLM generates a token, it computes attention over all previous tokens. To avoid redoing that math from scratch each time, the model stores intermediate results (the keys and values from each attention layer) in a cache. This cache grows linearly with context length. Run a 7B model with a 32K context window and the KV cache alone can eat 4-8 GB of VRAM, on top of the model weights.
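To see where those gigabytes come from, here's a back-of-the-envelope estimate. The architecture figures (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 storage) are my illustrative assumptions for a Mistral-7B-class model, not numbers from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """One key vector and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 7B-class config with grouped-query attention, fp16 cache:
gb = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768) / 2**30
print(f"{gb:.1f} GB")  # -> 4.0 GB
```

Models without grouped-query attention (more KV heads) land at the top of the 4-8 GB range or beyond, which is why the cache, not the weights, becomes the constraint at long context.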
As context windows keep expanding (Claude handles 1M tokens, Gemini does 2M), this cache becomes the binding constraint. Not the weights. Not compute. Memory for storing what the model has already read.
That’s the problem TurboQuant solves.
What TurboQuant Actually Does
TurboQuant is a quantization algorithm for KV caches. Weight quantization (GPTQ, AWQ, GGUF) compresses the model’s parameters before inference starts. KV cache quantization compresses the running memory that accumulates during inference. They solve different problems.
The headline numbers:
- 6x compression of KV cache memory at 2.5-3 bits per value (down from 16-bit)
- Up to 8x speedup in attention computation on H100 GPUs at 4-bit precision
- Zero accuracy loss across standard benchmarks including Needle-in-a-Haystack, LongBench, and RULER
- No training required: works on any existing model without fine-tuning or calibration
The “zero accuracy loss” claim is the part that raised my eyebrows. Previous approaches like KIVI (ICML 2024) could compress maybe 2.6x before accuracy started degrading. Going from 2.6x to 6x with no accuracy penalty is a big jump.
The Two-Stage Trick: PolarQuant + QJL
TurboQuant works through two complementary stages. The math is dense, but the core ideas make sense once you see the analogies.
Stage 1: PolarQuant — Think Compass, Not Grid
Traditional quantization works like snapping GPS coordinates to a grid. You take a high-precision number, round it to the nearest grid point, and accept some rounding error. To keep that error manageable, you need to store extra metadata: the scale and offset for each block of numbers. That metadata adds up.
PolarQuant takes a different approach. Instead of working with Cartesian coordinates (x, y, z), it converts each vector to polar coordinates: a single radius (how far from the origin) and a set of angles (which direction).
Why does this help? Because after a random rotation (a standard mathematical trick), those angles follow a predictable, concentrated distribution. You don’t need per-block normalization constants. You already know roughly where the values will land, so you can build a single quantization codebook that works for all blocks.
This eliminates the metadata overhead that limited previous methods. The compression ratio depends only on how many bits you allocate per angle, not on how many extra bits you burn on bookkeeping.
PolarQuant supports optimal scalar quantizers (Lloyd-Max), meaning it finds the best possible grid points for a given bit budget. For the values (as opposed to keys), PolarQuant alone handles the compression.
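To make the idea concrete, here's a toy sketch of the polar-coordinate trick (my own illustration, not the paper's code): rotate the vector, pair up coordinates, keep each pair's radius and snap its angle to one shared grid. For simplicity this sketch keeps radii in full precision and uses a uniform angle grid rather than the Lloyd-Max codebooks the paper describes:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR decomposition of a Gaussian matrix yields a uniformly random rotation
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def polar_quantize(v, rotation, bits=3):
    """Rotate, convert coordinate pairs to (radius, angle), quantize angles
    against a single shared uniform codebook -- no per-block scale/offset."""
    r = rotation @ v
    x, y = r[0::2], r[1::2]
    radius = np.hypot(x, y)
    theta = np.arctan2(y, x)                          # angles in (-pi, pi]
    levels = 2 ** bits
    code = np.round((theta + np.pi) / (2 * np.pi) * levels) % levels
    return radius, code.astype(np.uint8), levels

def polar_dequantize(radius, code, levels, rotation):
    theta = code / levels * 2 * np.pi - np.pi
    r = np.empty(2 * len(radius))
    r[0::2] = radius * np.cos(theta)
    r[1::2] = radius * np.sin(theta)
    return rotation.T @ r                             # undo the rotation

d = 64
v = rng.standard_normal(d)
rot = random_rotation(d)
v_hat = polar_dequantize(*polar_quantize(v, rot, bits=3), rot)
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Notice what's absent: no scale, no zero-point, no per-block metadata. The only stored state besides the codes is the rotation (shared across the whole cache) and the radii.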
Stage 2: QJL — A 1-Bit Safety Net for Keys
Keys are trickier than values because small quantization errors in keys get amplified when computing attention scores. PolarQuant can quantize them on its own, but the residual errors introduce a systematic bias into the attention weights.
QJL (Quantized Johnson-Lindenstrauss) fixes this with a clever correction layer.
Here’s the analogy: imagine you’re estimating the distance between two cities, but your map has small errors. Instead of trying to fix the map, you take a second, rough measurement from a different angle and use it to cancel out the systematic error from the first.
That’s what QJL does. It projects the residual quantization error into a lower-dimensional space using a random projection (the Johnson-Lindenstrauss transform), then collapses each value down to a single sign bit, just +1 or -1. When the model later computes attention, it combines the high-precision query with the PolarQuant-compressed keys and the 1-bit QJL correction to produce an unbiased estimate.
The cost: 1 extra bit per key dimension. The payoff: the systematic bias disappears entirely.
The Benchmarks Hold Up
Google tested TurboQuant on Gemma and Mistral models across five benchmark suites:
| Benchmark | What It Tests | TurboQuant Result |
|---|---|---|
| Needle-in-a-Haystack | Finding a fact buried in 104K tokens | 100% retrieval accuracy at every compression level |
| LongBench | QA, summarization, code gen across long contexts | Matched or beat KIVI baseline on all tasks |
| ZeroSCROLLS | Zero-shot long-document understanding | No degradation vs. full precision |
| RULER | Synthetic long-context reasoning tasks | Maintained accuracy at 8.5K–64K tokens |
| L-Eval | Long-form evaluation | Consistent with uncompressed baseline |
The 2.5-bit configuration (nearly 5x compression) maintained 100% exact match on needle-in-a-haystack retrieval. The 3-bit configuration (6x compression) showed zero degradation on every task they tested.
On hardware performance, 4-bit TurboQuant on H100 GPUs delivered up to 8x speedup on the attention logit computation specifically. That’s not end-to-end inference speedup. It’s the attention kernel itself. Real-world end-to-end gains will be smaller, but still meaningful for long-context workloads where attention dominates compute time.
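To put the kernel-versus-end-to-end distinction in numbers, a quick Amdahl's-law estimate helps. The 40% attention share below is an assumed illustrative figure, not a measurement from the paper:

```python
def end_to_end_speedup(attention_fraction, kernel_speedup):
    """Amdahl's law: only the attention fraction of runtime benefits."""
    return 1 / ((1 - attention_fraction) + attention_fraction / kernel_speedup)

# If attention were 40% of total inference time and the kernel got 8x faster:
print(round(end_to_end_speedup(0.40, 8.0), 2))  # -> 1.54
```

The flip side: the longer the context, the larger the attention fraction, and the closer real-world gains creep toward the kernel number.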
How It Compares to Nvidia’s KVTC
TurboQuant isn’t the only KV cache compression paper at ICLR 2026. Nvidia published KVTC (Key-Value Token Compression) with its own impressive numbers. The two take fundamentally different approaches:
| | TurboQuant (Google) | KVTC (Nvidia) |
|---|---|---|
| Compression | 6x (lossless) | Up to 20x (~1 point accuracy loss) |
| Calibration | None (data-oblivious) | One-time calibration per model |
| Approach | Polar coordinates + 1-bit correction | PCA decorrelation + entropy coding |
| Models tested | Up to ~8B parameters | 1.5B to 70B |
| Target | Real-time inference | Offline cache storage and reuse |
| Production path | Community (llama.cpp, MLX, Triton) | Nvidia Dynamo + vLLM |
TurboQuant gives you 6x compression with zero setup and zero accuracy loss. KVTC gives you 20x compression if you’re willing to accept a small accuracy penalty and run a calibration step per model.
For local LLM users, TurboQuant is the more immediately useful one. Drop it in, get 6x more context for free. For cloud inference providers optimizing cost per token, KVTC’s higher compression ratio might justify the calibration overhead.
Neither team published head-to-head benchmarks on the same models and tasks, which is unfortunate but not surprising given they’re from competing companies.
Community Implementations Are Already Running
Google hasn’t released official code yet (expected Q2 2026), but the community didn’t wait.
A developer at Tonbi Studio built a PyTorch implementation with a custom Triton kernel and tested it on Gemma 3 4B running on an RTX 4090. The result: character-identical output to the uncompressed baseline at 2-bit precision. The implementation is structured in two layers:
```python
# Layer 1: Core algorithm (turboquant_core.py)
#   Random rotation, Lloyd-Max codebook, quantize/dequantize
# Layer 2: KV cache integration (turboquant_kv_cache.py)
#   Patched DynamicCache that quantizes on every cache.update() call
#   Works with any HuggingFace model, no model-specific code needed
```
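A minimal sketch of the quantize-on-update pattern that layer 2 describes, using hypothetical class and method names and a plain 8-bit max-abs codec as a stand-in for the real PolarQuant one:

```python
import numpy as np

class QuantizingKVCache:
    """Toy quantize-on-update cache: stores int8 codes plus a per-entry
    scale instead of full-precision keys/values. The codec here is simple
    max-abs scaling, standing in for the real PolarQuant quantizer."""

    def __init__(self):
        self.codes, self.scales = [], []

    def update(self, kv):
        # Called once per generated token, mirroring cache.update()
        scale = max(np.abs(kv).max() / 127, 1e-8)
        self.codes.append(np.round(kv / scale).astype(np.int8))
        self.scales.append(scale)

    def materialize(self):
        # Dequantize on demand for the attention computation
        return np.stack([c * s for c, s in zip(self.codes, self.scales)])

cache = QuantizingKVCache()
for _ in range(4):
    cache.update(np.random.default_rng(0).standard_normal(64))
kv = cache.materialize()   # shape (4, 64), approximate fp values
```

The fast path in the real implementations avoids the dequantize step entirely by running the attention kernel directly on the compressed representation; this sketch only shows where quantization hooks into the cache lifecycle.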
Prince Canuma independently implemented it in MLX and tested on Qwen 3.5 35B with context lengths up to 64K tokens — 6/6 exact match on needle-in-haystack, 4.9x smaller KV cache at 2.5-bit.
In the llama.cpp community, at least three developers have C and CUDA implementations in progress, with one reporting all 18 tests passing and compression ratios matching the paper.
The Triton kernel on the RTX 4090 showed a ~1.2x speedup on the Q@K^T operation itself. Not the 8x number from the paper, but that was measured on H100s with 4-bit keys. Consumer GPU results will be more modest, but still worth having.
One caveat from early implementers: the QJL error-correction stage is tricky to get right. A naive implementation produces garbage. You need to follow the paper’s asymmetric estimator design carefully, where high-precision queries pair with the 1-bit compressed keys during attention scoring.
Practical Impact
If you run local LLMs: This is the update to watch. A 7B model that currently chokes at 16K context on your 8 GB GPU could potentially handle 96K+ with TurboQuant compressing the KV cache. Expect llama.cpp integration within weeks, not months.
If you build inference infrastructure: TurboQuant’s zero-calibration design means you can apply it uniformly across every model you serve without per-model setup. Combined with weight quantization, you’re looking at dramatically lower VRAM requirements per concurrent user.
If you care about the market: Memory chip stocks (SK Hynix, Samsung, Micron) dropped after the announcement. This comes at a time when companies like Musk’s Terafab are betting $25 billion on building more chips. The worry is that 6x memory compression means 6x fewer memory chips needed. That’s an oversimplification. Demand for longer contexts and more concurrent users will likely absorb the savings. But the market reaction tells you people are taking this seriously.
If you’re a researcher: The theoretical foundation here is strong. TurboQuant achieves near-optimal distortion rates backed by mathematical proofs alongside the empirical results. The polar coordinate trick for eliminating normalization overhead is elegant and likely applicable beyond KV caches.
FAQ
Does TurboQuant replace GGUF/AWQ/GPTQ weight quantization?
No. TurboQuant compresses the KV cache (runtime memory), not model weights. You’d use both: weight quantization to fit the model in VRAM, and TurboQuant to extend how much context it can handle.
Can I use TurboQuant right now?
Community implementations exist for PyTorch (with Triton kernels) and MLX; llama.cpp support is in progress. Google's official release is expected Q2 2026. If you're comfortable with experimental code, the PyTorch version on GitHub works today.
Does it work on every model?
In theory, yes. TurboQuant is data-oblivious and requires no model-specific calibration. In practice, it’s only been tested on models up to 8B parameters (Gemma, Mistral). Larger models (70B+) should work but haven’t been publicly validated yet.
Is the 8x speedup real?
The 8x number applies to attention logit computation on H100 GPUs at 4-bit precision. End-to-end inference speedup will be smaller since attention is only part of the total compute. On consumer GPUs like the RTX 4090, early implementations show ~1.2x on the attention kernel.
Will this kill the memory chip market?
Unlikely. Longer contexts, more concurrent users, and bigger models will absorb the efficiency gains. But it does shift the constraint from “how much memory can I buy” toward “how smart is my compression.” Chip companies that bet purely on capacity growth should be paying attention.
Bottom Line
TurboQuant is the kind of paper that doesn’t make splashy demos but changes what’s practical. Shrinking KV caches by 6x without losing accuracy means longer contexts on cheaper hardware, more users per GPU, and local models that can actually use the context windows they advertise. The math is proven, the community has validated it, and production integration is coming fast. If you run LLMs — locally or in the cloud — TurboQuant will probably be in your stack by summer.
