How Modern LLMs Work: From RNNs to Transformer Attention

Before today's AI, language models were dominated by recurrent neural networks. The shift to Transformer architecture and attention mechanisms changed what context costs — and why your prompt design suddenly matters.

Sean Robinson

Before the current generation of AI, language generation models were dominated by recurrent neural networks (RNNs). An RNN processes text token by token, carrying a small (emphasis on small) fixed-size state from one step to the next. In practice this meant models were fairly compact and used recent context well, but struggled to retain information across long passages. They were essentially sophisticated next-token predictors with limited short-term memory.
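As a rough sketch of the idea (toy sizes, no particular production model), everything an RNN has read so far must fit in one fixed-size hidden vector:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, vocab_size = 64, 1000           # illustrative sizes only
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.01
W_x = rng.standard_normal((hidden_size, vocab_size)) * 0.01

def rnn_step(hidden, token_onehot):
    """One step: the entire past is squeezed into `hidden` (64 floats here),
    no matter how many tokens came before."""
    return np.tanh(W_h @ hidden + W_x @ token_onehot)

hidden = np.zeros(hidden_size)
for token_id in [5, 42, 7, 901]:             # a toy token sequence
    x = np.zeros(vocab_size)
    x[token_id] = 1.0
    hidden = rnn_step(hidden, x)
# `hidden` is now the model's only memory of the sequence so far.
```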

A huge shift came with the Transformer architecture (2017) and the attention mechanism that powers it. Instead of compressing past context into a single running state, a Transformer can look directly at every prior token in the input and compute a weighted score for how relevant each one is to generating the next token. This gave models a qualitatively richer ability to track long-range dependencies and follow complex instructions, but it came with two significant costs. First, the attention matrix grows roughly with the square of the context length, making long contexts expensive in memory and compute. Second, and more important for users, the model must consider every token in the context every time it generates a new token.
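A minimal sketch of scaled dot-product attention, the mechanism from the 2017 Transformer paper, makes both costs visible. Shapes and sizes here are illustrative, and causal masking is omitted for brevity:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over a sequence of length n.
    The score matrix is n x n: every token is compared against every
    other token, so memory and compute grow with the square of the
    context length."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # shape (n, n): the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted mix of all token values

n, d = 8, 16                                         # toy context length and head size
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)                             # one updated representation per token
```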

So, context became a constant processing obligation.

Or as I like to put it: "Context is not just what the LLM knows — it is what the LLM is forced to consider at all times."

Every extra sentence in your prompt is a sentence the model must weigh against every bit of text it produces in reply. So even though modern models advertise very large theoretical context windows, in practice attention-based models show "cognitive strain" at much lower token counts, particularly when a prompt piles on many competing imperatives.
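A back-of-the-envelope sketch shows how fast the quadratic term bites. The numbers below cover just one attention score matrix in float32, per head per layer; real serving stacks differ in many ways, so treat them as order-of-magnitude illustrations only:

```python
# Size of one n x n attention score matrix in float32 (4 bytes per score),
# per head per layer. A 10x longer context costs 100x the memory.
for n in (1_000, 10_000, 100_000):
    size_bytes = n * n * 4
    print(f"{n:>7} tokens -> {size_bytes / 1e9:.3f} GB per head per layer")

# Output:
#    1000 tokens -> 0.004 GB per head per layer
#   10000 tokens -> 0.400 GB per head per layer
#  100000 tokens -> 40.000 GB per head per layer
```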

This means prompt design is a balancing act: include enough context for the model to do the task correctly, but no more. Padding the prompt with background the model does not need can actually degrade output quality, not just slow things down.

Frequently asked

Common questions on this topic.

Why does providing excessive context create cognitive strain?

Providing excessive context creates cognitive strain because the attention mechanism forces the model to weigh every input token against every token it generates. When prompts are padded with irrelevant background, the model may struggle to prioritize the actual imperatives, which degrades the quality of the output.