Fundamentals of AI: Inside the transformer

The engine behind modern AI explained

The transformer, introduced in a 2017 paper with the now-famous title “Attention Is All You Need,” is the engine behind every major language model you have heard of. GPT, Claude, Gemini, Llama, and Mistral are all transformer-based models. Understanding how this architecture works is like understanding how an internal combustion engine works. You do not need to know it to drive a car, but knowing it changes how you think about every car you will ever drive (and perhaps fix, if that is your thing).

This is the second post in the ‘Fundamentals of AI’ series. The first post, Fundamentals of AI, covered the foundational vocabulary. Now we go under the hood and walk through the architectural concepts that together explain how transformers process language and how they became the foundation of what we now know as large language models (LLMs).

Fair warning: This piece gets technical, but I have done my best to make every concept accessible.

Attention decides what matters

Before transformers, the dominant neural models processed text sequentially. They read word by word, left to right, maintaining a running summary of what they had seen. This worked, but it had a fatal flaw. By the time the model reached the end of a long sentence, the beginning had faded from its effective memory.

Attention solves this by letting the model look at the entire input simultaneously and decide, for each position, which other positions are most relevant. When processing the word “it” in the sentence “The cat sat on the mat because it was tired,” attention helps the model figure out that “it” refers to “cat” rather than “mat.” It does this by computing a relevance score between “it” and every other word in the sentence.

The mechanism works through three learned transformations. Each token gets projected into three vectors: a query (what am I looking for?), a key (what kind of information do I represent?), and a value (what do I actually hand over when asked?). The model computes the similarity between each query and all keys, typically as a dot product; high similarity means high relevance. These similarity scores are normalized with a softmax so that they are positive and sum to one, and then used as weights: the output for each token is a weighted sum of the value vectors, emphasizing the most relevant ones.
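Here is a minimal sketch of that calculation in plain Python, following the scaled dot-product form from the original paper (scores are divided by the square root of the key dimension before the softmax). The tiny two-number vectors are made up for illustration; in a real model the queries, keys, and values come from learned projections of the token embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for three tokens (illustration only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 2) for w in row])  # each row sums to 1
```

Each printed row is one token's attention distribution over all three tokens. The first token puts the least weight on the second token, whose key points in a different direction from its query.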

The beauty of this design is parallelism. Every token computes its attention simultaneously. There is no sequential bottleneck. This is what makes transformers trainable on modern GPUs, which excel at parallel operations across thousands of processors.

At its core, attention is a soft lookup table. The query is your search term. The keys are the index. The values are the data. Once you see it that way, the whole mechanism feels a little bit more natural.

Seeing the input from multiple angles

One attention pass isn’t enough. Take the sentence “The bank by the river had been closed since the flood.” There is a mess of connections to track. To understand the context, a model needs to link the subject to the verb, figure out that “bank” relates to “river,” and track what “the flood” actually refers to. A single attention pass struggles to capture all of these relationships at once.

That’s where multi-head attention comes in. Instead of one monolithic calculation, we run many separate attention streams in parallel. One head might obsess over grammar, tracking how nouns link to verbs. Another ignores syntax entirely to focus on semantic meaning, keeping the “bank” and “river” relationship in view. A third might handle pronouns and so on.

We don’t just stack these heads on top of each other; we split the work. If the model’s embedding dimension is 768, we slice that into twelve chunks of 64 (as in BERT-base or the GPT-2 small model). Each head gets its own slice to play with. At the end, we stitch those outputs back together. It sounds like a lot of overhead, but the math works out to roughly the same cost as one big attention pass.

An important part here is that we don’t hard-code a parser or write a rulebook on English grammar. Instead, the architecture is built so that the model can look at language from multiple angles simultaneously. When you train a transformer on enough data, its heads tend to settle into these patterns because they are useful for predicting the next word. That’s the real beauty of the transformer. You provide structure, and the data dictates the logic of what happens next.
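To make the slicing concrete, here is a small sketch of the bookkeeping only: splitting one 768-number vector into twelve chunks of 64 and stitching them back together. The helper names are my own, and the attention step that each head would run on its chunk is left out.

```python
D_MODEL = 768
N_HEADS = 12
HEAD_DIM = D_MODEL // N_HEADS  # 64 numbers per head

def split_heads(vector, n_heads):
    """Slice one projected vector into n_heads equal chunks."""
    head_dim = len(vector) // n_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(n_heads)]

def merge_heads(chunks):
    """Concatenate per-head outputs back into one vector."""
    return [x for chunk in chunks for x in chunk]

# A stand-in for one token's projected query vector (illustration only).
query = [float(i) for i in range(D_MODEL)]

heads = split_heads(query, N_HEADS)
merged = merge_heads(heads)
print(len(heads), len(heads[0]), len(merged))

# Multiplications needed to score one query against one key:
single_head_cost = D_MODEL            # one 768-dimensional dot product
multi_head_cost = N_HEADS * HEAD_DIM  # twelve 64-dimensional dot products
print(single_head_cost, multi_head_cost)
```

The two cost figures come out equal, which is the arithmetic behind the "roughly the same cost" claim. In standard implementations, a final learned projection is applied after the chunks are concatenated.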

Teaching order to a system that has none

Here is a strange fact about the attention mechanism we just described. It has no concept of word order. If you scramble the words in a sentence, each word still receives the same attention weights over the other words and the same output as before; the results are simply shuffled along with the words. The mechanism itself treats position 1 and position 50 identically. There is nothing in the architecture that says “this word comes before that one.”
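A small experiment shows this order-blindness. The sketch below runs a stripped-down self-attention (no learned projections, so queries, keys, and values are all just the made-up word vectors) on a sentence and on its reversal, then checks whether each word ends up with the same output vector.

```python
import math

def self_attention(x):
    """Self-attention with identity projections (Q = K = V = x)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out

# Hypothetical word vectors with no position information mixed in.
embeddings = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

def encode(sentence):
    """Map each word to its attention output for this word order."""
    words = sentence.split()
    return dict(zip(words, self_attention([embeddings[w] for w in words])))

a = encode("dog bites man")
b = encode("man bites dog")

for word in embeddings:
    same = all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a[word], b[word]))
    print(word, "same output in both orders:", same)
```

Without position information, “dog bites man” and “man bites dog” look identical to the mechanism, word for word.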

This presents a bit of a challenge, as word order is fundamental to language. Consider the difference between “dog bites man” and “man bites dog.” While the words are identical, the meaning changes entirely based on their sequence. To truly understand what a sentence is trying to convey, the model needs to recognize that order. So how do you teach order to something that doesn’t have a sense of order?

The model already represents each word as a numerical fingerprint, a long list of numbers that captures what the word means. What researchers figured out is that you can also build fingerprints for positions: one fingerprint for “first word in the sentence,” a different one for “second word,” another for “third,” and so on. Then you combine the two, typically by adding them together. The fingerprint for the word “dog” gets mixed with the fingerprint for whatever slot it’s sitting in. “Dog” in the first slot ends up with a slightly different numerical representation from “dog” in the fifth slot, even though it’s the same word located in a different place in the sentence. Once you do this for every word in the sentence, meaning and position are baked into the same signal, and the rest of the transformer can learn to read both out of it.
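One well-known way to build position fingerprints is the sinusoidal scheme from the original transformer paper; many later models learn their position vectors during training instead. The sketch below uses the sinusoidal version with a made-up four-number embedding for “dog,” so the specific values are illustrative only.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position fingerprint: sin on even indices, cos on odd."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

def embed(word_vector, position):
    """Combine meaning and position by element-wise addition."""
    pe = positional_encoding(position, len(word_vector))
    return [w + p for w, p in zip(word_vector, pe)]

# Hypothetical word embedding (real ones have hundreds of dimensions).
dog = [0.5, -0.2, 0.1, 0.9]

dog_first = embed(dog, 0)  # "dog" as the first word
dog_fifth = embed(dog, 4)  # "dog" as the fifth word

print([round(x, 3) for x in dog_first])
print([round(x, 3) for x in dog_fifth])
```

The two printed vectors differ even though the word is the same, which is what lets the later layers tell the two slots apart.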

Learning by fill-in-the-blank

How do you teach a model to understand language? One powerful approach is to give it sentences with missing words and train it to fill in the blanks. This is masked language modeling (MLM), the training objective behind BERT (Bidirectional Encoder Representations from Transformers) and its many variants.

During training, roughly 15% of the tokens in each sentence are selected for prediction, and most of those are replaced with a special [MASK] token (in the original BERT recipe, a small share are swapped for a random token or left unchanged instead). For example, “The cat sat on the mat” might become “The [MASK] sat on the mat.” The model must predict the original token using context from both sides of the gap, simultaneously rather than sequentially.
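The data preparation for this objective can be sketched in a few lines. This is a simplified version that always substitutes [MASK] for a selected token and splits on whitespace rather than using a real tokenizer; the function name and the seed parameter are my own choices.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for masked language modeling."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # nothing to predict at this position
    return masked, labels

tokens = "The cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

On a sentence this short, a given seed may mask one token, several, or none at all; the 15% figure only holds on average over a large corpus.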

That last point is where the power lies. Because the model can attend to words both before and after the masked position, it develops bidirectional understanding. In our example, “The” before the mask suggests a noun is coming, while “sat on the mat” after it tells the model that the noun is likely an animal or a person. Neither side alone is sufficient, so the model combines both directions to arrive at a strong prediction. This two-way context produces richer representations than models that can only look backward.

MLM produces models that are exceptionally good at “understanding” tasks such as sentiment analysis, question answering, and text classification. The bidirectional context gives them a thorough grasp of how words relate to each other within a sentence.

The downside is that MLM models are not natural text generators. Because they were trained to fill in blanks (which can be anywhere in a sentence), they do not learn the left-to-right generation pattern needed for tasks like writing essays or having conversations. That requires a different training objective, which brings us to autoregressive models.

Autoregressive and masked models

The AI field has produced two dominant training paradigms for language models, and each optimizes for different strengths.

Autoregressive models (the GPT family, Llama, etc.) are trained to predict the next token given all previous tokens. They process text strictly left to right. At each step, the model sees everything that came before and predicts what comes next. This makes them natural generators. These models literally learn to write by predicting one word at a time, building sentences from the beginning forward.
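The generation loop itself is simple enough to sketch. Below, a hand-written probability table stands in for the model; it is entirely hypothetical and, unlike a real transformer, it only looks at the most recent token. The loop is the part to focus on: predict, pick a token, append it, repeat.

```python
# Hypothetical next-token probabilities (a stand-in for a trained model).
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "mat": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 0.6, "down": 0.4},
    "down": {"</s>": 1.0},
    "mat": {"</s>": 1.0},
}

def predict_next(tokens):
    """Toy predictor: a real model would condition on all previous tokens."""
    return NEXT_TOKEN_PROBS[tokens[-1]]

def generate(max_tokens=10):
    """Greedy decoding: always take the most likely next token."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        probs = predict_next(tokens)
        next_token = max(probs, key=probs.get)
        if next_token == "</s>":  # end-of-sequence marker
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

print(generate())  # ['the', 'cat', 'sat']
```

Real systems usually sample from the probabilities rather than always taking the top choice, which is one reason the same prompt can produce different answers.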

Masked models such as BERT are trained to reconstruct hidden or corrupted inputs. They see the whole sentence (with some tokens masked) and predict the missing pieces. They can look both forward and backward, which gives them deeper contextual understanding of how words relate within a sentence.
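The difference between the two families shows up in which positions each token is allowed to attend to. The sketch below prints the two visibility patterns as grids of ones (visible) and zeros (hidden); the function names are my own.

```python
def causal_mask(n):
    """Autoregressive: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Masked-model style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

tokens = ["The", "cat", "sat", "down"]
patterns = [("causal", causal_mask(len(tokens))),
            ("bidirectional", bidirectional_mask(len(tokens)))]

for name, mask in patterns:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>4}  {row}")
```

In the causal grid, each row has ones only up to and including its own position, so “cat” can see “The” but not “sat.” In the bidirectional grid, every row is all ones.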

In engineering terms, autoregressive models are better at generation tasks such as writing text, completing code, and answering open-ended questions. Masked models are better at understanding tasks such as classifying text, extracting information, and comparing sentence similarity. The training objective shapes what the model learns to do well. It is also why you will find models trained for particular tasks: there are thousands of them these days, each optimized for a narrow job, and many are available on Hugging Face.

The practical landscape has shifted heavily toward autoregressive models in recent years, mainly because generation capability turned out to be more valuable and because scaled-up autoregressive models (like GPT-4) proved surprisingly good at understanding tasks, too. BERT-style models remain important in production systems where classification speed and accuracy matter more than generation ability.

Next sentence prediction

Masked language modeling teaches word-level relationships. But language also has structure at the sentence level. For example, does sentence B logically follow sentence A? Is this paragraph coherent? BERT addressed this with a second pretraining task called next sentence prediction (NSP).

During training, the model receives pairs of sentences. Half the time, sentence B actually follows sentence A in the source text (positive pair). Half the time, sentence B is a random sentence from elsewhere in the corpus (negative pair). The model learns to classify each pair as consecutive or random.
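Building those training pairs is mostly bookkeeping. Here is a minimal sketch, assuming the corpus is just an ordered list of at least three distinct sentences; the function name and the label convention (1 for consecutive, 0 for random) are my own.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Return (sentence_a, sentence_b, label) triples for NSP training."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        if rng.random() < 0.5:
            # Positive pair: sentence B really follows sentence A.
            pairs.append((first, sentences[i + 1], 1))
        else:
            # Negative pair: any sentence except A itself and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((first, rng.choice(candidates), 0))
    return pairs

sentences = [
    "The cat sat on the mat.",
    "It was tired after a long day.",
    "Interest rates rose again this quarter.",
    "The bank by the river had been closed since the flood.",
]

for a, b, label in make_nsp_pairs(sentences):
    print(label, "|", a, "->", b)
```

The model then sees each pair as a single input and is trained to output the label.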

The idea was to give the model a sense of coherence, an understanding of how sentences connect to form meaningful sequences. This would help with tasks like question answering (where the answer sentence needs to relate to the question) and natural language inference (determining whether one statement supports, contradicts, or is unrelated to another).

How LLMs differ from traditional language models

Before transformers and LLMs, the dominant approach to language modeling was statistical. N-gram models counted how often sequences of n words appeared together in a corpus and used those frequencies to estimate probabilities. Consider a sentence such as “A cat sat on the mat.” A trigram model (n=3), for instance, would estimate the probability of “mat” following “on the” by counting how many times “on the mat” appeared and dividing by the number of times “on the” appeared with any continuation.
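The counting fits in a few lines of Python. This sketch uses a tiny made-up corpus and no smoothing, so any trigram it has never seen gets a probability of zero.

```python
from collections import Counter

def trigram_prob(tokens, w1, w2, w3):
    """Estimate P(w3 | w1 w2) as count(w1 w2 w3) / count(w1 w2 followed by anything)."""
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    trigram_counts = Counter(trigrams)
    context_counts = Counter((a, b) for a, b, _ in trigrams)
    if context_counts[(w1, w2)] == 0:
        return 0.0  # context never seen in the corpus
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]

# Hypothetical three-sentence corpus (illustration only).
corpus = ("a cat sat on the mat . a dog sat on the rug . "
          "a cat slept on the mat .").split()

# "on the" appears three times: followed by "mat" twice and "rug" once.
print(trigram_prob(corpus, "on", "the", "mat"))
print(trigram_prob(corpus, "on", "the", "rug"))
print(trigram_prob(corpus, "on", "the", "sofa"))
```

The last line shows the weakness: “on the sofa” is perfectly good English, but the model has never counted it, so it rates it as impossible.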

These models worked surprisingly well for their simplicity, and they are still used in some applications. Speech recognition systems, for example, often use n-gram language models as a fast, lightweight component. However, n-gram models cannot capture dependencies beyond n words, they need smoothing tricks to cope with word sequences they never saw in training, and they scale poorly as the vocabulary grows. LLMs are different in three ways, and the first one is the most significant.

First, they don’t treat words as standalone symbols. The word “bank” in an n-gram model is just a string, indistinguishable from “xyzzy” except by how often it shows up. An LLM represents “bank” as a long list of numbers (an embedding) that places it near “loan” and “deposit” in abstract numerical space, and a different version of “bank” near “river” and “shore.” Meaning becomes geometry.
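“Near” has a concrete meaning here: it is usually measured with cosine similarity, the cosine of the angle between two vectors. The three-number vectors below are invented for illustration (real embeddings are learned and have hundreds or thousands of dimensions), so only the calculation is meaningful, not the specific values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked so the two senses of "bank" differ.
vectors = {
    "bank (money sense)": [0.9, 0.1, 0.2],
    "loan": [0.8, 0.2, 0.1],
    "bank (river sense)": [0.1, 0.9, 0.3],
    "river": [0.2, 0.8, 0.4],
}

money_bank = vectors["bank (money sense)"]
for word in ("loan", "river"):
    score = cosine_similarity(money_bank, vectors[word])
    print(f"bank (money sense) vs {word}: {score:.2f}")
```

With these made-up numbers, the money sense of “bank” scores much closer to “loan” than to “river,” which is the kind of geometry a trained model arrives at on its own.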

Second, they don’t rely on counting short, fixed windows. An n-gram model that looks at three words at a time can never connect a pronoun on page four to the noun it refers to on page two of your favorite novel. Attention changes that. LLMs can pull context from anywhere in the input window, not just the last few words.

Third, the training is different. Older NLP systems were typically built for one specific task and trained on labeled data for it. LLMs are trained by reading enormous amounts of unlabeled text and predicting the next word, over and over, until general language patterns fall out. The task-specific behavior comes later, often without retraining at all.

The scale difference is staggering. A well-trained trigram model might have millions of parameters. GPT-3 had 175 billion, and GPT-4 is reportedly much larger. This difference in scale, combined with the architectural advantages of transformers, is what enables LLMs to produce fluent text, reason about complex questions, and transfer knowledge across domains in ways that statistical models never could.

What you now know

We’ve gone under the hood. You now understand the architecture that powers every major language model in production today, from the attention mechanism that lets a model decide what matters to the multi-head design that lets it look at language from multiple angles simultaneously.

Most importantly, you now have a mental model for why transformers work. Provide the right structure, train on enough data, and the model discovers its own logic without hard-coded rules or dictated grammar – just architecture, data, and scale.

In the next installment, we will go deeper still, looking at how raw pretrained models get shaped into the helpful assistants you interact with through fine-tuning, prompting, and engineering.
