Architecture
Deep dives into how models actually work — from attention mechanisms to mixture of experts.
- KV Cache Optimization — Why Inference Memory Explodes and How to Fix It — Understanding PagedAttention, prefix caching, and MLA: three approaches to taming the KV cache bottleneck
- Multi-head Latent Attention (MLA) — Review — A review session testing recall on KV cache compression, learned projections, and the memory vs. compute tradeoff