model distillation factories appear across teams
Distillation moved from research side-project to production discipline in 2024, especially after teams saw how quickly inference bills can erase product margin. What changed is the operational cadence: teams are no longer distilling once; they are distilling continuously.
see also: inference cost compression changes product bets · llama three launch pressures api only stacks
where the leverage moved
The old optimization target was model quality at any cost. The current target is acceptable quality at predictable unit economics. Distillation factories formalize that tradeoff: baseline model, teacher updates, student refreshes, and benchmark gates.
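The continuous cadence above implies an automated promotion check rather than a one-off comparison. A minimal sketch of such a benchmark gate, assuming hypothetical quality and cost fields (the names and thresholds are illustrative, not from the source):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    quality: float       # aggregate benchmark score in [0, 1]
    cost_per_1k: float   # inference cost per 1k requests, in dollars

def passes_gate(student: Candidate, teacher: Candidate,
                max_quality_drop: float = 0.02,
                min_cost_ratio: float = 0.5) -> bool:
    """Promote a distilled student only if the quality loss is bounded
    and the cost saving is material (both thresholds are illustrative)."""
    quality_ok = teacher.quality - student.quality <= max_quality_drop
    cost_ok = student.cost_per_1k <= min_cost_ratio * teacher.cost_per_1k
    return quality_ok and cost_ok

teacher = Candidate("teacher-v3", quality=0.91, cost_per_1k=4.00)
student = Candidate("student-v3.1", quality=0.90, cost_per_1k=1.20)
print(passes_gate(student, teacher))  # True: within the drop budget, far cheaper
```

The point of encoding the gate rather than eyeballing dashboards is that each student refresh can be promoted or rejected mechanically, which is what turns distillation into a factory.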
risk surface in this pattern
Distillation can silently encode teacher failures. If evaluation is weak, teams ship smaller models that are cheap but brittle on edge cases. That is exactly why governance work from [[open source model audits become procurement baseline]] is now coupled to performance engineering.
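One way that brittleness hides is in the aggregate: a student can match the teacher overall while regressing badly on a narrow slice. A hedged sketch, assuming per-slice benchmark scores keyed by hypothetical slice names (the data and threshold are illustrative):

```python
def brittle_slices(student: dict[str, float], teacher: dict[str, float],
                   max_drop: float = 0.05) -> list[str]:
    """Return evaluation slices where the student regresses more than
    max_drop relative to the teacher, even if the aggregate looks fine."""
    return [name for name, t_score in teacher.items()
            if t_score - student.get(name, 0.0) > max_drop]

teacher = {"overall": 0.90, "rare_languages": 0.84, "long_context": 0.88}
student = {"overall": 0.89, "rare_languages": 0.71, "long_context": 0.87}
print(brittle_slices(student, teacher))  # ['rare_languages']
```

A per-slice check like this is the cheapest form of the audit coupling described above: the aggregate gap is one point, but the edge-case slice has collapsed.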
decision boundary for teams
If your product tolerates slight quality loss for major latency savings, distillation is a clear win. If your domain is high-stakes and low-tolerance, the cost savings can be a false economy unless audits and rollback paths are mature.
my take
I treat distillation as infrastructure now, not experimentation. Teams that industrialize it will ship faster and survive price wars better.
linkage
- [[inference cost compression changes product bets]]
- [[llama three launch pressures api only stacks]]
- [[open source model audits become procurement baseline]]
ending questions
what evaluation gate should be mandatory before a distilled model can replace a teacher in production?