context + claim

Google announced TurboQuant — a KV cache compression algorithm claiming up to 6x memory reduction and 8x speedup. The internet went wild. But the comparison baseline is misleading: they’re comparing 4-bit to 32-bit, not to the quantization methods people already use in production.

Key assertion: This is Google’s attempt at a DeepSeek V3 moment — but DeepSeek actually delivered. TurboQuant is mostly marketing.

constraint map

What Google claimed:

  • 6x KV cache memory reduction
  • 8x inference speedup
  • Frees ~83% of memory in AI chips
  • Triggered Micron stock drop (~30% on DDR5)
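The 6x and 83% figures in the claims above are two views of the same number: a 6x compression ratio frees 1 - 1/6 of the original KV cache memory. A quick arithmetic check:

```python
# A 6x compression ratio implies 1 - 1/6 of the cache memory is freed,
# which matches the ~83% figure in the claims.
compression_ratio = 6
freed_fraction = 1 - 1 / compression_ratio
print(f"{freed_fraction:.1%}")  # 83.3%
```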

The problem with the claims:

  • 8x speedup: measured 4-bit against an unquantized 32-bit baseline, a configuration nobody ships in production
  • Modern inference stacks already quantize weights and KV caches to 8-bit or 4-bit
  • The real question, how TurboQuant compares to existing quantization methods, goes unanswered in the post
  • Paper published ~1 year ago, quietly re-marketed as “new”
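The baseline problem above is pure arithmetic. Treating memory ratio as a stand-in for the headline numbers (bit widths illustrative, ignoring quantization metadata overhead), the same 4-bit cache looks very different depending on what you compare it to:

```python
# Why the baseline choice dominates the headline number.
def memory_ratio(baseline_bits: int, compressed_bits: int) -> float:
    """How many times smaller the compressed cache is vs. the baseline."""
    return baseline_bits / compressed_bits

print(memory_ratio(32, 4))  # 8.0 -- vs. the FP32 strawman nobody ships
print(memory_ratio(8, 4))   # 2.0 -- vs. an INT8 cache already in production
print(memory_ratio(4, 4))   # 1.0 -- vs. INT4 methods already deployed
```

Against the baseline practitioners actually run, the "8x" headline collapses to 2x or less before accuracy is even considered.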

What KV cache actually does:

  • Transformers cache a key vector and a value vector for every token processed
  • As context grows, the KV cache grows linearly with sequence length
  • Each new query vector is scored against all stored keys via dot products; the softmax of those scores then weights a mix of the stored values
  • The memory bottleneck is real, but compressing the cache isn’t new
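The mechanics above can be sketched as a toy single-head cache (pure Python, dimensions illustrative; real implementations do this batched with tensors):

```python
import math

class KVCache:
    """Toy single-head KV cache: storage grows linearly with context length."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # One key vector and one value vector cached per token.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # Score the query against every stored key via dot product...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in self.keys]
        # ...then softmax the scores and mix the stored values.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        dim = len(q)
        return [sum(w * v[d] for w, v in zip(weights, self.values)) / z
                for d in range(dim)]

cache = KVCache()
for t in range(4):  # four tokens of context
    cache.append([float(t), 1.0], [float(t), -float(t)])
print(len(cache.keys))          # 4 -- cache size equals context length
out = cache.attend([1.0, 0.0])  # weighted mix of all four cached values
```

Every cached entry here is a full-precision float; quantizing those stored keys and values is exactly where compression schemes like TurboQuant operate.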

The DeepSeek comparison:

  • DeepSeek V3: genuine architectural and training breakthroughs; the impact was widely misread, but it was real
  • TurboQuant: comparison against a strawman baseline; the real improvement over shipped methods is unclear

my take

Google’s blog post is designed for Twitter virality, not technical rigor. “8x faster” makes great headlines. “8x faster than our 32-bit baseline that we never actually ship” doesn’t.

The technique itself might be solid — the paper has real contributions. But we can’t evaluate them without proper comparison to baselines practitioners actually use. Saying “I run 100x faster than a toddler” is technically true, but useless for anyone choosing between sprinters.

Watch for independent benchmarks comparing TurboQuant against INT4/INT8 methods already in vLLM, TensorRT-LLM, etc.

linkage