Towards a subquadratic future
One of the reasons AI progress, and LLM progress in particular, often feels like brute-forcing is the quadratic nature of LLM attention, where every token is compared against every other token. This puts significant stress on compute and memory capacity and performance.
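To make the quadratic cost concrete, here is a minimal NumPy sketch of naive scaled dot-product attention (my own illustration; the function name `naive_attention` is just for this example). The intermediate score matrix is n × n, so both compute and memory grow with the square of the sequence length:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Q, K, V: arrays of shape (n, d) for a sequence of n tokens."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n, n): every token scored against every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(naive_attention(Q, K, V).shape)  # (1024, 64) -- but the scores matrix was 1024 x 1024
```

Double the context length and the scores matrix quadruples, which is exactly the pressure on compute and memory described above.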
Some of this pressure has been mitigated: FlashAttention, for example, avoids materializing the full attention matrix in slow memory, making attention far less memory-bandwidth-bound (though the compute is still quadratic). Nonetheless, researchers and developers continue to push the boundaries of linear attention and hybrid attention, in order to slay this dragon for good.
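As a taste of how linear attention sidesteps the n × n matrix, here is a hedged sketch of the kernelized-attention trick (in the spirit of Katharopoulos et al.'s "Transformers are RNNs"; the `phi` feature map and function names are illustrative): replace the softmax with a positive feature map φ, then use associativity so that φ(K)ᵀV is computed first at O(n·d²) cost, and the n × n intermediate never exists:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a simple positive feature map used in the linear-attention literature
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Q, K, V: (n, d). Non-causal variant; cost is O(n * d^2), no (n, n) intermediate."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V        # (d, d): shared across all query positions
    z = Kf.sum(axis=0)   # (d,): normalizer term
    return (Qf @ kv) / (Qf @ z)[:, None]  # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (1024, 64)
```

The trade-off is that this is an approximation of softmax attention, which is part of why hybrid designs that mix a few quadratic layers with many linear ones remain an active research direction.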
This page will chronicle the path to and from quadratic attention and where we are as an industry. An interesting resource on this subject I came across is Sebastian Raschka's [Big LLM Architecture Comparison](https