

LLM Training

Start here if you're new to AI and want to understand the fundamentals.

This guide explains how AI models like ChatGPT and Claude actually work, starting from the basics and building up step by step.

The Problem Transformers Solved

Before 2017, AI models processed text one word at a time, in order. Imagine reading a sentence but you can only remember the last few words you read.

If someone says "The cat sat on the mat because it was tired" — what does "it" refer to? You need to remember "cat" from earlier. Old models struggled with this because information faded as sequences got longer.

Transformers solved this by letting the model look at all words at once.

Step 1: Turning Words into Numbers

Computers can't read words. They need numbers. So first, we break text into tokens (roughly words or word-pieces) and convert each token into a list of numbers called a vector.

"The cat sat" → [0.12, -0.34, 0.56, ...], [0.78, 0.23, -0.11, ...], [0.45, 0.67, 0.89, ...]
"The" "cat" "sat"

Each vector has thousands of dimensions (like 4096 numbers). These vectors are learned during training — words with similar meanings end up with similar vectors.
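To make this concrete, here is a minimal sketch of the lookup, assuming a made-up three-word vocabulary and 4-dimensional vectors (real models use tens of thousands of tokens and thousands of dimensions; numpy is used only for illustration):

import numpy as np

# Toy vocabulary mapping tokens to IDs (illustrative only)
vocab = {"The": 0, "cat": 1, "sat": 2}

# Learned embedding table: one row per token, here 4 dimensions instead of 4096
embedding_table = np.random.randn(len(vocab), 4)

tokens = ["The", "cat", "sat"]
token_ids = [vocab[t] for t in tokens]   # [0, 1, 2]
vectors = embedding_table[token_ids]     # shape (3, 4): one vector per token
print(vectors)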

Step 2: The Core Idea — Self-Attention

Here's the magic. For each word, the model asks: "Which other words in this sentence should I pay attention to?"

Take: "The cat sat on the mat because it was tired"

When processing "it", the model needs to figure out that "it" refers to "cat". Self-attention lets every word look at every other word and decide how relevant they are.

How it works: every word produces a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what information do I pass along?"). Each word's query is compared against every other word's key, and the better the match, the more of that word's value gets blended into the result.

The result: when processing "it", the model pulls in information from "cat" because it learned they're related.
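Here is a minimal numpy sketch of that idea (scaled dot-product self-attention). The weight matrices and the toy dimension of 8 are random stand-ins for parameters a real model would learn:

import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors x."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v             # each token asks (Q), advertises (K), offers (V)
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # how relevant is each token to each other token
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: attention weights sum to 1 per token
    return weights @ V                              # each output is a weighted mix of all tokens

d = 8                                               # toy dimension (real models use thousands)
x = np.random.randn(10, d)                          # 10 token vectors, e.g. "The cat sat ... tired"
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)              # shape (10, 8): context-aware vectors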

Step 3: Multi-Head Attention

One attention mechanism learns one type of relationship. But language has many types: grammar (which words go together syntactically), reference (which noun a pronoun points back to), meaning, and more.

So transformers run multiple attention mechanisms in parallel — usually 32 or 96 "heads". Each head can specialize in different patterns. One might learn grammar, another might learn meaning.
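A rough sketch of that splitting, using the same toy setup as above; the per-head projection matrices here are random stand-ins for learned weights:

import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(x, n_heads=4):
    """Split the model dimension across heads, attend in each head, then concatenate."""
    d = x.shape[-1]
    head_dim = d // n_heads
    outputs = []
    for h in range(n_heads):
        # Each head gets its own projections (random here), so it can specialize in its own pattern
        W_q, W_k, W_v = (np.random.randn(d, head_dim) for _ in range(3))
        outputs.append(attention(x @ W_q, x @ W_k, x @ W_v))
    return np.concatenate(outputs, axis=-1)   # back to the original dimension

x = np.random.randn(10, 16)                   # 10 tokens, 16 dims, 4 heads of 4 dims each
out = multi_head_attention(x)                 # shape (10, 16)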

Step 4: Feedforward Networks

After attention figures out which words relate to each other, each word passes through a feedforward network — a small neural network that transforms the vector.

Think of attention as "gathering information from context" and feedforward as "processing that information into something useful."

This is also where a lot of the factual knowledge gets stored. When you ask "What's the capital of France?" — the feedforward layers contain patterns that encode "Paris."
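A sketch of the feedforward step, with made-up sizes and random weights in place of learned ones; the expand-then-project shape is the part that matters:

import numpy as np

def feedforward(x, W1, b1, W2, b2):
    """Per-token MLP: expand, apply a nonlinearity, project back down."""
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU here (real models often use GELU/SwiGLU variants)
    return hidden @ W2 + b2

d, d_ff = 8, 32                           # the hidden layer is typically ~4x wider than the model
x = np.random.randn(10, d)                # 10 context-aware token vectors from attention
W1, b1 = np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d), np.zeros(d)
out = feedforward(x, W1, b1, W2, b2)      # shape (10, 8)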

Step 5: Stack It Up (Layers)

One round of attention + feedforward is called a layer or block. Modern LLMs stack dozens of these blocks on top of each other.

Each layer refines the representation. Early layers might handle basic syntax. Middle layers might handle meaning. Later layers might handle complex reasoning.

Step 6: Residual Connections

Instead of just passing output from one layer to the next, transformers add the input back to the output:

output = layer(input) + input

This "skip connection" helps in two ways: gradients flow better during training (the model learns faster), and the model can be very deep without losing information.

Step 7: Positional Encoding

Attention has a problem: on its own it has no sense of word order, so it would treat "The cat sat" the same as "sat cat The". Word order matters in language.

So before processing, we add position information to each word's vector. "The" gets encoded as "The + position 1", "cat" as "cat + position 2", etc.
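One classic way to do this is the sinusoidal encoding from the original transformer paper; many modern models instead learn position vectors or use rotary embeddings (RoPE). A sketch with toy sizes:

import numpy as np

seq_len, d = 10, 8
token_vectors = np.random.randn(seq_len, d)   # embeddings for "The cat sat ..."

# Sinusoidal positional encoding: each position gets a unique pattern of sines and cosines
positions = np.arange(seq_len)[:, None]
dims = np.arange(0, d, 2)[None, :]
angles = positions / (10000 ** (dims / d))
pos_encoding = np.zeros((seq_len, d))
pos_encoding[:, 0::2] = np.sin(angles)
pos_encoding[:, 1::2] = np.cos(angles)

x = token_vectors + pos_encoding              # "cat" now also carries "I am token 2"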

Step 8: Predicting the Next Word

After all the layers, the model outputs a vector for the last position. This vector gets converted into probabilities over all possible next words:

"The cat sat on the" →
  "mat": 15%
  "floor": 12%
  "couch": 8%
  "ground": 7%
  ...

During training, if the real next word was "mat", the model's weights are nudged to put higher probability on "mat" next time. Over trillions of examples, it learns the patterns of language.
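A sketch of that last step, with a four-word stand-in vocabulary and random weights; the softmax and cross-entropy pattern is what real training uses:

import numpy as np

vocab = ["mat", "floor", "couch", "ground"]    # tiny stand-in vocabulary
final_vector = np.random.randn(8)              # output vector for the last position
unembedding = np.random.randn(8, len(vocab))   # learned projection to vocabulary size

logits = final_vector @ unembedding            # one score per possible next token
probs = np.exp(logits) / np.exp(logits).sum()  # softmax: scores become probabilities

# Training signal: if the real next word was "mat", the loss pushes probs[0] toward 1
loss = -np.log(probs[vocab.index("mat")])      # cross-entropy on the correct token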

The Full Picture

"The cat sat" [Convert to vectors + add positions] [Layer 1: Attention → Feedforward] [Layer 2: Attention → Feedforward] ... (96 more layers) [Final vector for last position] [Convert to probabilities] "on" (predicted next word)

The model then adds "on" to the input and repeats the whole process to predict the next word after that. This is autoregressive generation — one word at a time, each time looking at everything that came before.
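A sketch of that loop; `model` and `sample` are hypothetical placeholders for the full network and a token-picking rule:

def generate(model, sample, prompt_tokens, max_new_tokens=20):
    """Autoregressive generation: predict one token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)        # run the whole stack on everything so far
        next_token = sample(probs)   # e.g. pick the most likely token, or sample
        tokens.append(next_token)    # feed the prediction back in and repeat
    return tokens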

How LLMs Execute Tasks (Tool Use)

Here's a key insight: LLMs don't actually do anything. They just generate text. The magic happens when you connect that text generation to real tools.

Think of the LLM as a brain that understands your request and decides what to do. But the actual execution—reading files, running code, browsing the web—happens through separate tools. The LLM is the interface; the tools are the hands.

You: "Find all bugs in my code and fix them" LLM thinks: "I need to read the code first" LLM outputs: { "tool": "read_file", "path": "app.js" } System executes: Reads file, returns content to LLM LLM thinks: "I see a bug on line 42, let me fix it" LLM outputs: { "tool": "edit_file", "changes": [...] } System executes: Makes the edit LLM responds: "Fixed! Here's what I changed..."

This is what makes tools like Cursor powerful: the LLM understands your intent, then orchestrates tools to actually accomplish the task.

What's Next?

Once you understand these fundamentals, switch to Advanced mode to go deeper. The rest of this page covers how frontier models are actually trained.

How do companies like OpenAI, Anthropic, and xAI actually train their models? This page breaks down the architecture, algorithms, and training processes that power modern AI. Understanding this is essential for anyone looking to work at the frontier of AI development.

At its core, an LLM is a giant neural network trained to predict the next word (token) in a sequence. But the magic is in the details: the transformer architecture, the training data, the optimization algorithms, and the post-training alignment that makes these models useful and safe.

The Training Pipeline

Training a frontier LLM happens in distinct phases. Each phase builds on the previous one.

PHASE 1: PRETRAINING
  Raw text data (trillions of words)
  → Tokenization (break into tokens)
  → Embeddings (vectors)
  → Transformer (self-attention)
  → Next-token prediction (learn patterns)

PHASE 2: SUPERVISED FINE-TUNING (SFT)
  Human-written examples of ideal responses
  (demonstrations of helpful, accurate answers)
  → Fine-tune on instruction following

PHASE 3: REINFORCEMENT LEARNING (RLHF/RLAIF)
  Human rankings of responses (which response is better?)
  → Train reward model (predict preferences)
  → Optimize policy with RL (maximize reward signal)

The result is a model that not only understands language patterns but actively tries to be helpful, harmless, and honest. The pretraining gives it knowledge and capability; the post-training gives it behavior and alignment.

The Transformer Architecture

The transformer is the foundational architecture behind all modern LLMs. Introduced in 2017 with the paper "Attention Is All You Need," it replaced older recurrent neural networks with a mechanism called self-attention.

[Diagram: transformer architecture, showing input tokens flowing through embeddings and transformer blocks (self-attention and feed-forward networks) to output probabilities]

Decoder-only vs Encoder-Decoder: GPT, Claude, and Grok use decoder-only architectures optimized for text generation. They predict the next token autoregressively. Google's T5 and original BERT used encoder-decoder or encoder-only architectures for different tasks.
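Decoder-only models enforce next-token prediction with a causal mask: each position can attend only to itself and earlier positions. A numpy sketch of the mask (illustration only, not any specific model's code):

import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)   # raw attention scores

# Causal mask: position i may only look at positions 0..i (no peeking at the future)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                       # masked scores become 0 after softmax

weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)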

Tokenization & Embeddings

Before text enters the model, it must be converted into numbers. This happens in two steps: tokenization (breaking text into pieces) and embedding (converting pieces into vectors).

"Hello world" tokenize [15496, 995] embed [[0.12, -0.34, ...], [0.56, 0.78, ...]] (token IDs) (4096-dim vectors each)

Pretraining: Learning from the Internet

Pretraining is where the model learns language, facts, and reasoning from massive amounts of text. The objective is simple: predict the next token.
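A sketch of that objective: shift the sequence by one so every position's target is the token that actually came next, then score the model's probabilities with cross-entropy. The logits here are random stand-ins for real model outputs:

import numpy as np

token_ids = np.array([15496, 995, 13])                    # a training sequence
vocab_size = 50000
logits = np.random.randn(len(token_ids) - 1, vocab_size)  # predictions for positions 1..N

probs = np.exp(logits)
probs /= probs.sum(axis=-1, keepdims=True)
targets = token_ids[1:]                                   # each position's "right answer" is the next token
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()   # average cross-entropy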

Compute Scale: GPT-4 reportedly trained on ~25,000 A100 GPUs for months. Training a frontier model costs $50M-$100M+ in compute alone. Grok 4 trained on 200,000 GPUs in the Colossus cluster.

RLHF: Reinforcement Learning from Human Feedback

Pretraining produces a model that can predict text, but it doesn't know how to be helpful or follow instructions. RLHF aligns the model with human preferences.

The RLHF Process

REWARD MODEL TRAINING:
  Prompt + Response A vs Response B
  → Human ranks A > B (which response is better?)
  → Train model to predict this (predict preferences)

POLICY OPTIMIZATION:
  Generate response (sample from model)
  → Get reward (from reward model)
  → Update policy to increase expected reward (PPO algorithm)

Why it works: RLHF teaches the model that being helpful, honest, and harmless leads to higher reward. The model internalizes these preferences and generalizes them to new situations.
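The reward model itself is typically trained with a simple pairwise objective: it should assign a higher score to the response humans preferred. A sketch of that loss (the reward values are made up):

import numpy as np

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss: score the human-preferred response higher."""
    return -np.log(1 / (1 + np.exp(-(reward_chosen - reward_rejected))))  # -log sigmoid(difference)

loss = reward_model_loss(reward_chosen=1.8, reward_rejected=0.3)       # small loss: ranking is right
loss_bad = reward_model_loss(reward_chosen=0.3, reward_rejected=1.8)   # large loss: ranking is wrong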

Tool Use: LLMs as the Brain, Not the Hands

Here's the key insight: LLMs don't actually do anything. They just generate text. The magic happens when you connect that text generation to real tools that execute actions.

Think of the LLM as a brain that understands your request and decides what to do. But the actual execution—reading files, running code, browsing the web, sending emails—happens through separate tools, functions, scripts, and APIs. The LLM is the interface; the tools are the hands.

YOU: "Find all TODO comments in my codebase and fix them" LLM THINKS: "I need to search the codebase, then edit files" LLM OUTPUTS: { "tool": "grep", "args": { "pattern": "TODO", "path": "." } } SYSTEM EXECUTES: grep runs, returns results to LLM LLM OUTPUTS: { "tool": "edit_file", "args": { "path": "app.js", "changes": [...] } } SYSTEM EXECUTES: file is edited LLM RESPONDS: "I found 3 TODOs and fixed them. Here's what I changed..."

How Tool Use Works

Examples of Tools

This is what makes Cursor powerful: The LLM understands your intent ("fix this bug"), then orchestrates tools to actually do it—reading your code, making edits, running tests, committing to git. The model is the decision-maker; the tools are the execution layer.

Agentic AI

When an LLM can autonomously use tools in a loop—planning, executing, observing results, and adapting—it becomes an agent. This is the frontier of AI right now.
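A sketch of that loop; `llm` and the entries in `tools` are hypothetical placeholders for a real model API and real integrations:

def run_agent(llm, tools, task, max_steps=10):
    """Agent loop: the model plans, the system executes tools, the model observes and adapts."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(history)                                 # model decides: call a tool, or answer
        if action["type"] == "final_answer":
            return action["content"]
        result = tools[action["tool"]](**action["args"])      # system executes the tool call
        history.append({"role": "tool", "content": result})   # model observes the result, then adapts
    return "Stopped: step limit reached"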

MCP (Model Context Protocol): This is Anthropic's open standard for connecting LLMs to tools. Think of it like USB for AI—a universal way to plug capabilities into any model. Cursor uses MCP to connect to GitHub, Slack, browsers, and more.

How Different Companies Train

Each AI lab has developed unique training approaches that reflect their philosophy and research priorities.

OpenAI GPT Models

Anthropic Claude

xAI Grok

Google Gemini

Meta LLaMA

Key Concepts to Understand

These are the terms and ideas you need to know to discuss LLM training intelligently.

What Cursor Does

Cursor isn't training foundation models from scratch. Instead, they're doing applied AI research that makes these models dramatically more useful for coding.

Why this matters for jobs: Cursor is hiring for ML engineering, infrastructure, and product roles that focus on applying frontier models to real problems. You don't need to know how to pretrain GPT-4, but you do need to understand how these models work and how to make them useful.

Resources for Going Deeper

Papers

Videos & Courses

Company Research