Simple analogy for how scaling improves LLMs

Introduction

Scaling Laws for Neural Language Models (Kaplan et al., 2020) popularized the idea that "loss scales as a power-law with model size, dataset size, and the amount of compute" - in other words, models get better with more parameters, more (quality) data, and more compute. Belief in this view, and the observation of it in practice, is one of the clear drivers behind massive infrastructure investments and the anticipation of continued improvement in models, and perhaps, one day, even human-like intelligence.
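
As a rough sketch of what that power law looks like in code, here is the model-size relationship with the approximate fitted constants reported in the paper (treat the numbers as illustrative, not exact):

    # Illustrative sketch of the model-size power law from Kaplan et al.:
    # L(N) ~ (N_c / N) ** alpha_N, with the paper's approximate fits
    # alpha_N ~ 0.076 and N_c ~ 8.8e13 non-embedding parameters.
    def loss_from_model_size(n_params: float,
                             alpha_n: float = 0.076,
                             n_c: float = 8.8e13) -> float:
        """Predicted test loss as a function of parameter count."""
        return (n_c / n_params) ** alpha_n

    # Each 10x jump in parameters shaves a steady slice off the loss.
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"{n:.0e} params -> predicted loss {loss_from_model_size(n):.2f}")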

While I and others have easy access to such papers, and can read them, sometimes it helps to distill the formal notation, experimental data, and a long paper into something a little simpler to carry around every day. For me, one of those simple ideas is that complex information spaces, with a large number of combinations, can be error prone. If there are more combinations than ways of expressing them, ambiguity is introduced: one code gets mapped to many different combinations. Ambiguity leads to suboptimal or undesired outcomes - for example, hallucinations. Determinism can be out of reach.
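
A toy way to see that "more combinations than codes" claim: force a small, fixed set of codes to cover a larger set of combinations, and collisions are unavoidable. This is a hypothetical pigeonhole sketch, not how LLMs assign meaning:

    import itertools

    # 3 codes must cover all 16 two-word combinations of a 4-word vocabulary,
    # so at least one code is forced to stand for several combinations.
    codes = ["code_0", "code_1", "code_2"]
    vocab = ["cat", "dog", "sat", "ran"]
    combinations = list(itertools.product(vocab, repeat=2))  # 16 combinations

    mapping = {}
    for i, combo in enumerate(combinations):
        mapping.setdefault(codes[i % len(codes)], []).append(combo)

    for code, combos in mapping.items():
        print(code, "->", len(combos), "different combinations")  # ambiguity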

While many want much better outcomes than LLMs provide today, whether in (LLM) agent form or not, those same dissatisfied people still acknowledge how much models have improved - improvements rooted in many things, including scaling laws. We get more, and it whets our appetite for even more.

The rest of this text expands on disambiguation with a simple analogy. It has flaws, but for me it is a starting point for explaining, to those who do not read research papers, one of the reasons LLMs have improved.

Image

A simple network

Do not think of the networks discussed below as neural networks, just think of them as very simple networks.

Imagine four nodes. There is only one way to get from A to D: by going from A to B. However, in this initial, limited system, the code for expressing link identity is limited to one color, blue. Clearly this is a limited approach, but the point is that selection becomes probabilistic. If you are told to select the "blue" link, you may choose A-C or A-B. If you choose A-B, great, you have a chance of getting to D; if you choose A-C, well, you are way off track.
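
Here is a minimal sketch of that one-color system, where "select the blue link" can only be resolved by guessing (node and link names follow the example above):

    import random

    # One code ("blue") labels both links out of A, so the instruction
    # "take the blue link" is ambiguous and has to be resolved by chance.
    links = {("A", "B"): "blue", ("A", "C"): "blue", ("B", "D"): "blue"}

    def follow(color: str, start: str) -> str:
        candidates = [dst for (src, dst), c in links.items()
                      if src == start and c == color]
        return random.choice(candidates)  # probabilistic, not deterministic

    # Roughly half the time this reaches B (and can continue to D);
    # the other half it lands on the dead end C.
    print([follow("blue", "A") for _ in range(10)])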

Image

Disambiguate with an expanded code space

Next, the system is upgraded to support two colors, red and blue. If you are now told to select the blue link, you will deterministically make it to D, via B. Errors have been reduced, and what was probabilistic is now deterministic.
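
Reusing the sketch above, expanding the code space to two colors removes the ambiguity entirely:

    # With two colors, each outgoing link from A has a unique label,
    # so "take the blue link" now has exactly one answer.
    links = {("A", "B"): "blue", ("A", "C"): "red", ("B", "D"): "blue"}

    def follow(color: str, start: str) -> str:
        candidates = [dst for (src, dst), c in links.items()
                      if src == start and c == color]
        assert len(candidates) == 1  # deterministic selection
        return candidates[0]

    print(follow("blue", "A"))  # always B
    print(follow("blue", "B"))  # always D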

This next step is analogous to the benefits of having a larger embedding space in LLMs - a larger "address" space of embedding codes.

Image

Optimal selection

Let us now create more disambiguated links and paths. Not only can you deterministically get from A to D, you can do so via a better-performing selection, the green link.

The point of this analogy is not to imagine that this is how a neural network operates; the point is to imagine how decisions get better when a model has more information. More specifically, in the case of LLMs, when you have a large embedding space (codes for things, such as semantics), the ability to unambiguously assign a code to one option increases. As the old example goes, you can now more easily differentiate between "Paris Hilton" and the "Hilton in Paris" - similar, but different.
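
One hedged way to make that concrete: a code space that ignores word order (a bag of words) cannot tell the two phrases apart, while a richer, order-aware code can. These are toy encodings, not real embeddings:

    from collections import Counter

    phrase_a = "paris hilton"
    phrase_b = "hilton in paris"

    # A bag-of-words "code" discards order and filler words, so the two
    # phrases collapse into the same representation.
    bag_a = Counter(w for w in phrase_a.split() if w != "in")
    bag_b = Counter(w for w in phrase_b.split() if w != "in")
    print(bag_a == bag_b)  # True: ambiguous under the smaller code space

    # An order-aware "code" (here, just the token sequence) keeps them apart,
    # the way a larger contextual embedding space can.
    print(tuple(phrase_a.split()) == tuple(phrase_b.split()))  # False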

In many information problems, information reduces uncertainty. However, when you have an information space of, say, 50,000 words, the number of potential combinations is enormous, and larger still when you consider potential phrases and sentences. Hence, disambiguating choices is important and challenging, requiring larger and larger models to reduce errors and make better choices - which may not be super intelligence, but fewer errors and better choices still lead to better outcomes.
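
The combinatorics behind that claim are easy to check: even short sequences over a 50,000-word vocabulary outrun any attempt to enumerate them one by one.

    # Number of possible word sequences of length k over a 50,000-word vocabulary.
    vocab_size = 50_000
    for k in (2, 5, 10):
        print(f"{k}-word sequences: {vocab_size ** k:.2e}")

    # Ten-word sequences alone number roughly 1e47, so a model cannot memorize
    # combinations; it has to learn codes that disambiguate between them.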

The analogy above has many limitations and as you dig in to it, the subtleties of LLMs emerge:

  • Modeling all possible combinations is not necessarily needed.
  • Scaling might amplify poor-quality data as a type of noise rather than disambiguating it
  • Curse of dimensionality / computational complexity / sparse data / overfitting noise
  • And all the approaches to address the above bullet points

Image

Despite the limitations of the analogy, it is a simple starting point. It is also extensible. Consider semantic subspaces, or any cluster of nodes that represents an area of expertise. One part of the model can be active while other experts are inactive, routing computation to one part of the model and reducing energy use - a simple analogy for Mixture of Experts (MoE) models, a form of scaling up that emphasizes specialization and efficiency over uniform parameter growth.
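
A minimal sketch of that routing idea, with hypothetical gate scores rather than a real MoE implementation: only the top-scoring expert subnetworks run, and the rest stay inactive.

    # Toy mixture-of-experts routing: a gate scores each expert for the
    # current input, and only the top-k experts are activated.
    def route(gate_scores: dict[str, float], k: int = 1) -> list[str]:
        ranked = sorted(gate_scores, key=gate_scores.get, reverse=True)
        return ranked[:k]  # every other expert stays inactive

    # Hypothetical scores for a question about chemistry.
    scores = {"code_expert": 0.1, "chemistry_expert": 0.7, "poetry_expert": 0.2}
    print(route(scores, k=1))  # ['chemistry_expert']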

While simple analogies have limitations, they can be a starting point for a journey - a journey that starts with one of the oldest insights about information: relevant and non-redundant information often decreases uncertainty, which in turn increases determinism. If you can hold that thought in your head, you can begin to understand why LLMs are getting better over time, and then start to research why sudden jumps in capabilities occur when thresholds are passed.

Image

None of this should be taken as an assertion that this is a path to super intelligence. In fact, there is growing emphasis not only on price/performance, but on bending the curve towards accessible and sustainable intelligence.