Start here if you're new to AI and want to understand the fundamentals.
This guide explains how AI models like ChatGPT and Claude actually work,
starting from the basics and building up step by step.
The Problem Transformers Solved
Before 2017, AI models processed text one word at a time, in order.
Imagine reading a sentence but you can only remember the last few words you read.
If someone says "The cat sat on the mat because it was tired" — what does
"it" refer to? You need to remember "cat" from earlier. Old models struggled with this
because information faded as sequences got longer.
Transformers solved this by letting the model look at all words at once.
Computers can't read words. They need numbers. So first, we break text into
tokens (roughly words or word-pieces) and convert each token into
a list of numbers called a vector.
"The cat sat" → [0.12, -0.34, 0.56, ...], [0.78, 0.23, -0.11, ...], [0.45, 0.67, 0.89, ...]
Each vector has thousands of dimensions (like 4096 numbers). These vectors are learned
during training — words with similar meanings end up with similar vectors.
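As a rough sketch in Python (toy sizes and made-up token IDs, not any real model's values), that lookup is just indexing into a big table of learned vectors:

import numpy as np

# Toy sizes so the example runs instantly; real models use ~50K-100K tokens and ~4096 dimensions.
vocab_size, d_model = 1_000, 8
embedding_table = np.random.randn(vocab_size, d_model) * 0.02   # learned during training

token_ids = [464, 379, 332]                  # made-up IDs for "The", "cat", "sat"
vectors = embedding_table[token_ids]         # look up one vector per token
print(vectors.shape)                         # (3, 8)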
Here's the magic. For each word, the model asks: "Which other words in this
sentence should I pay attention to?"
Take: "The cat sat on the mat because it was tired"
When processing "it", the model needs to figure out that "it" refers to "cat".
Self-attention lets every word look at every other word and decide how relevant they are.
How it works:
- Each word creates three things: a Query ("What am I looking for?"),
a Key ("What do I contain?"), and a Value
("What information do I give if you attend to me?")
- To process "it", the model compares its Query against every word's Key. "Cat" has
a Key that matches well.
- Words with matching Keys get higher attention scores. The model
takes a weighted mix of all the Values.
The result: when processing "it", the model pulls in information from "cat" because
it learned they're related.
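Here is a minimal NumPy sketch of a single attention "head". In a real model the projection matrices are learned rather than random, and the sizes are far larger than these toy numbers:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes; real models use d_model around 4096.
d_model, d_head, seq_len = 64, 16, 10
x = np.random.randn(seq_len, d_model)          # one vector per word of the sentence

W_q = np.random.randn(d_model, d_head) * 0.02  # "What am I looking for?"
W_k = np.random.randn(d_model, d_head) * 0.02  # "What do I contain?"
W_v = np.random.randn(d_model, d_head) * 0.02  # "What do I hand over if attended to?"

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_head)             # how well each Query matches each Key
weights = softmax(scores, axis=-1)             # attention scores: each row sums to 1
output = weights @ V                           # weighted mix of Values for each word
print(weights.shape, output.shape)             # (10, 10) (10, 16)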
One attention mechanism learns one type of relationship. But language has many types:
- "it" → "cat" (what does the pronoun refer to?)
- "sat" → "cat" (who is doing the action?)
- "tired" → "sat" (why did it sit?)
So transformers run multiple attention mechanisms in parallel — usually
32 or 96 "heads". Each head can specialize in different patterns. One might learn grammar,
another might learn meaning.
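Multi-head attention is the same computation repeated in parallel with different learned projections, with the results concatenated at the end. A toy sketch (real models use far larger sizes and 32-96 heads):

import numpy as np

# Toy sizes; real models use d_model ~4096 and 32-96 heads.
d_model, n_heads = 64, 4
d_head = d_model // n_heads
x = np.random.randn(10, d_model)                # 10 token vectors

heads = []
for _ in range(n_heads):
    # each head gets its own projections (random here, learned in a real model)
    W_q = np.random.randn(d_model, d_head) * 0.02
    W_k = np.random.randn(d_model, d_head) * 0.02
    W_v = np.random.randn(d_model, d_head) * 0.02
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    heads.append(weights @ V)

combined = np.concatenate(heads, axis=-1)       # concatenate the heads: back to (10, 64)
print(combined.shape)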
After attention figures out which words relate to each other, each word passes through
a feedforward network — a small neural network that transforms the vector.
Think of attention as "gathering information from context" and feedforward as
"processing that information into something useful."
This is also where a lot of the factual knowledge gets stored. When you
ask "What's the capital of France?" — the feedforward layers contain patterns that encode "Paris."
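A hedged sketch of that feedforward step, using one common shape (expand, apply a nonlinearity, project back down) with toy sizes rather than any specific model's design:

import numpy as np

# Toy sizes; real models use d_model ~4096 with a hidden layer roughly 4x wider.
d_model, d_ff = 64, 256
x = np.random.randn(10, d_model)             # one vector per token, fresh from attention

W1 = np.random.randn(d_model, d_ff) * 0.02   # learned weights in a real model
W2 = np.random.randn(d_ff, d_model) * 0.02

hidden = np.maximum(0, x @ W1)               # expand, apply a ReLU-style nonlinearity
out = hidden @ W2                            # project back down to the model width
print(out.shape)                             # (10, 64): same shape, transformed content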
One round of attention + feedforward is called a layer or block.
Modern LLMs stack many of these:
- GPT-3: 96 layers
- Claude: ~100+ layers
- GPT-4: potentially 120+ layers
Each layer refines the representation. Early layers might handle basic syntax.
Middle layers might handle meaning. Later layers might handle complex reasoning.
Instead of just passing output from one layer to the next, transformers
add the input back to the output:
output = layer(input) + input
This "skip connection" helps in two ways: gradients flow better during training
(the model learns faster), and the model can be very deep without losing information.
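Putting the pieces together, one block roughly follows the sketch below; attention and feedforward stand in for the mechanisms described above, and the normalization steps real models also include are omitted for simplicity:

def transformer_block(x, attention, feedforward):
    # attention and feedforward are stand-ins for the mechanisms described above
    x = x + attention(x)      # skip connection around attention
    x = x + feedforward(x)    # skip connection around the feedforward network
    return x                  # same shape out as in, ready for the next layer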
Attention has a problem: on its own it treats "The cat sat" the same as "sat cat The", because it looks at all the words as an unordered set. Word order matters in language.
So before processing, we add position information to each word's vector.
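A minimal sketch of one common scheme, learned position embeddings, with toy sizes (real models use thousands of dimensions and context windows of 100K+ positions):

import numpy as np

d_model, max_len = 8, 512
token_vectors = np.random.randn(3, d_model)                  # "The", "cat", "sat"
position_table = np.random.randn(max_len, d_model) * 0.02    # one learned vector per position

x = token_vectors + position_table[:3]      # word meaning + "where am I in the sentence?"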
"The" gets encoded as "The + position 1", "cat" as "cat + position 2", etc.
After all the layers, the model outputs a vector for the last position. This vector
gets converted into probabilities over all possible next words:
"The cat sat on the" →
"mat": 15%
"floor": 12%
"couch": 8%
"ground": 7%
...
During training, if the real next word was "mat", the model's weights are adjusted to put more probability on "mat" next time. Over trillions of examples, it learns the patterns of language.
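As a toy sketch (made-up sizes and token IDs): an output matrix turns the final vector into one score per vocabulary word, a softmax turns the scores into probabilities, and the training loss is small when the correct next word got high probability.

import numpy as np

# Toy sizes; a real vocabulary has ~50K-100K entries and the final vector ~4096 dimensions.
vocab_size, d_model = 1_000, 8
final_vector = np.random.randn(d_model)            # output of the last layer, last position
W_out = np.random.randn(d_model, vocab_size) * 0.02

logits = final_vector @ W_out                      # one raw score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                        # softmax: probabilities that sum to 1

mat_id = 123                                       # pretend this is the token ID for "mat"
loss = -np.log(probs[mat_id])                      # cross-entropy: small when "mat" got high probability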
The Full Picture
"The cat sat"
↓
[Convert to vectors + add positions]
↓
[Layer 1: Attention → Feedforward]
↓
[Layer 2: Attention → Feedforward]
↓
... (many more layers)
↓
[Final vector for last position]
↓
[Convert to probabilities]
↓
"on"
The model then adds "on" to the input and repeats the whole process to predict the
next word after that. This is autoregressive generation — one word
at a time, each time looking at everything that came before.
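In pseudocode — a sketch with a stand-in model function rather than a real network:

def generate(model, prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        probs = model(tokens)             # run all the layers, get probabilities for the next token
        next_token = int(probs.argmax())  # greedy pick; real systems usually sample instead
        tokens.append(next_token)         # feed everything generated so far back in, repeat
    return tokens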
How LLMs Execute Tasks (Tool Use)
Here's a key insight: LLMs don't actually do anything. They just generate text.
The magic happens when you connect that text generation to real tools.
Think of the LLM as a brain that understands your request and decides what to do. But the
actual execution—reading files, running code, browsing the web—happens through separate
tools. The LLM is the interface; the tools are the hands.
You: "Find all bugs in my code and fix them"
↓
LLM thinks: "I need to read the code first"
↓
LLM outputs: { "tool": "read_file", "path": "app.js" }
↓
System executes: Reads file, returns content to LLM
↓
LLM thinks: "I see a bug on line 42, let me fix it"
↓
LLM outputs: { "tool": "edit_file", "changes": [...] }
↓
System executes: Makes the edit
↓
LLM responds: "Fixed! Here's what I changed..."
This is what makes tools like Cursor powerful: the LLM understands your intent,
then orchestrates tools to actually accomplish the task.
What's Next?
Once you understand these fundamentals, switch to Advanced mode to learn about:
- The three-phase training pipeline (pretraining, SFT, RLHF)
- How different companies train their models (OpenAI, Anthropic, xAI)
- Constitutional AI and other alignment techniques
- Key technical terms and concepts
- What Cursor specifically does with these models
How do companies like OpenAI, Anthropic, and xAI actually train their models? This page breaks
down the architecture, algorithms, and training processes that power modern AI. Understanding
this is essential for anyone looking to work at the frontier of AI development.
At its core, an LLM is a giant neural network trained to predict the next word (token) in a
sequence. But the magic is in the details: the transformer architecture, the training data,
the optimization algorithms, and the post-training alignment that makes these models useful
and safe.
The Training Pipeline
Training a frontier LLM happens in distinct phases. Each phase builds on the previous one.
PHASE 1: PRETRAINING
Raw Text Data (trillions of words) → Tokenization (break into tokens) → Embeddings (vectors) → Transformer (self-attention) → Next-Token Prediction (learn patterns)
↓
PHASE 2: SUPERVISED FINE-TUNING (SFT)
Human-written examples of ideal responses (demonstrations of helpful, accurate answers) → Fine-tune on instruction-following
↓
PHASE 3: REINFORCEMENT LEARNING (RLHF/RLAIF)
Human rankings of responses (which response is better?) → Train reward model (predict preferences) → Optimize policy with RL (maximize reward signal)
The result is a model that not only understands language patterns but actively tries to
be helpful, harmless, and honest. The pretraining gives it knowledge and capability;
the post-training gives it behavior and alignment.
The Transformer Architecture
The transformer is the foundational architecture behind all modern LLMs.
Introduced in 2017 with the paper "Attention Is All You Need," it replaced older recurrent
neural networks with a mechanism called self-attention.
- Self-Attention — Each token can "attend" to every other token in the sequence. The model learns which tokens are relevant to each other, regardless of distance. This is how it understands context.
- Multi-Head Attention — Instead of one attention mechanism, transformers use multiple "heads" in parallel. Each head can learn different types of relationships (syntax, semantics, coreference).
- Feedforward Networks — After attention, each position passes through a feedforward neural network. This is where much of the "knowledge" is stored.
- Residual Connections — Skip connections add the input to the output of each layer. This helps gradients flow during training and allows the network to be very deep.
- Layer Normalization — Normalizes activations within each layer to stabilize training and improve convergence.
- Positional Encoding — Since attention has no inherent notion of order, position information is added to tell the model where each token appears in the sequence.
Decoder-only vs Encoder-Decoder: GPT, Claude, and Grok use decoder-only architectures optimized for text generation; they predict the next token autoregressively. Google's T5 used an encoder-decoder architecture and the original BERT was encoder-only, designed for different kinds of tasks.
Tokenization & Embeddings
Before text enters the model, it must be converted into numbers. This happens in two steps:
tokenization (breaking text into pieces) and embedding (converting pieces into vectors).
- Tokenization — Text is split into tokens, which are subword units. The word "understanding" might become ["under", "stand", "ing"]. Common algorithms: BPE (Byte Pair Encoding), SentencePiece, tiktoken.
- Vocabulary Size — Modern LLMs use vocabularies of 32K-100K+ tokens. Larger vocabularies mean more efficient encoding but more parameters. GPT-4 uses ~100K tokens; Claude uses ~100K.
- Token Embeddings — Each token ID maps to a learned vector (e.g., 4096 dimensions). These embeddings capture semantic meaning and are trained along with the model.
- Position Embeddings — Added to token embeddings to encode position. Modern models use RoPE (Rotary Position Embedding) which extends better to long sequences.
- Context Window — The maximum number of tokens the model can process at once. GPT-4 Turbo: 128K tokens. Claude 3: 200K tokens. Grok 4: 1M tokens.
"Hello world" → tokenize → [15496, 995] → embed → [[0.12, -0.34, ...], [0.56, 0.78, ...]]
(the integers are token IDs; each embedded vector has ~4096 dimensions)
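If you want to see real token IDs, OpenAI's open-source tiktoken library exposes the BPE encodings used by GPT models (the exact IDs you get depend on which encoding you choose):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one of the BPE encodings used by GPT-4-era models
ids = enc.encode("Hello world")
print(ids)                                   # a short list of integer token IDs
print(enc.decode(ids))                       # "Hello world"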
Pretraining: Learning from the Internet
Pretraining is where the model learns language, facts, and reasoning from massive amounts
of text. The objective is simple: predict the next token.
- Training Data — Trillions of tokens from the web (Common Crawl), books, Wikipedia, code (GitHub), scientific papers, and curated datasets. Data quality matters enormously.
- Objective Function — Cross-entropy loss on next-token prediction. The model outputs a probability distribution over all tokens; the loss penalizes putting low probability on the correct next token.
- Optimization — Adam or AdamW optimizer with learning rate warmup and decay. Training uses mixed-precision (FP16/BF16) to reduce memory and speed up computation.
- Parallelism — Training happens across thousands of GPUs using data parallelism (different batches), tensor parallelism (split layers), and pipeline parallelism (split model stages).
- Scaling Laws — Performance improves predictably with more parameters, data, and compute. Loss scales as a power law. Larger models are more sample-efficient.
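The Chinchilla paper fits loss with a parametric form like the sketch below; the constants shown are placeholders for illustration, not the paper's fitted values.

def scaling_loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    # Chinchilla-style fit: an irreducible term plus power laws in model size and data size.
    # E, A, B, alpha, beta here are illustrative placeholders, not the published constants.
    return E + A / n_params**alpha + B / n_tokens**beta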
Compute Scale: GPT-4 reportedly trained on ~25,000 A100 GPUs for months. Training a frontier model costs $50M-$100M+ in compute alone. Grok 4 trained on 200,000 GPUs in the Colossus cluster.
RLHF: Reinforcement Learning from Human Feedback
Pretraining produces a model that can predict text, but it doesn't know how to be helpful
or follow instructions. RLHF aligns the model with human preferences.
The RLHF Process
- Step 1: Collect Demonstrations — Human labelers write examples of ideal responses to prompts. This creates a supervised fine-tuning (SFT) dataset.
- Step 2: Train SFT Model — Fine-tune the pretrained model on these demonstrations. The model learns the format and style of helpful responses.
- Step 3: Collect Comparisons — Show labelers two model responses to the same prompt. They rank which is better. This creates preference data.
- Step 4: Train Reward Model — Train a separate model to predict human preferences. Given a prompt and response, it outputs a scalar reward.
- Step 5: Optimize with RL — Use PPO (Proximal Policy Optimization) to fine-tune the SFT model to maximize reward while staying close to the original distribution (KL penalty).
REWARD MODEL TRAINING:
Prompt + Response A vs Response B → Human ranks A > B → Train model to predict this
POLICY OPTIMIZATION:
Generate response (sample from model) → Get reward (from reward model) → Update policy to increase expected reward (PPO algorithm)
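The standard reward-model objective is a pairwise loss: push the chosen response's score above the rejected one's. A minimal sketch, with scalar rewards standing in for the outputs of a real scoring network:

import numpy as np

def reward_model_loss(reward_chosen, reward_rejected):
    # small when the response humans preferred already scores higher than the rejected one
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

print(reward_model_loss(2.0, -1.0))   # ~0.05: reward model agrees with the human ranking
print(reward_model_loss(-1.0, 2.0))   # ~3.05: reward model disagrees, so the loss is large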
Why it works: RLHF teaches the model that being helpful, honest, and harmless leads to higher reward. The model internalizes these preferences and generalizes them to new situations.
Tool Use: LLMs as the Brain, Not the Hands
Here's the key insight: LLMs don't actually do anything. They just generate text.
The magic happens when you connect that text generation to real tools that execute actions.
Think of the LLM as a brain that understands your request and decides what to do. But the actual
execution—reading files, running code, browsing the web, sending emails—happens through separate
tools, functions, scripts, and APIs. The LLM is the interface; the tools are the hands.
YOU: "Find all TODO comments in my codebase and fix them"
↓
LLM THINKS: "I need to search the codebase, then edit files"
↓
LLM OUTPUTS: { "tool": "grep", "args": { "pattern": "TODO", "path": "." } }
↓
SYSTEM EXECUTES: grep runs, returns results to LLM
↓
LLM OUTPUTS: { "tool": "edit_file", "args": { "path": "app.js", "changes": [...] } }
↓
SYSTEM EXECUTES: file is edited
↓
LLM RESPONDS: "I found 3 TODOs and fixed them. Here's what I changed..."
How Tool Use Works
- Function Calling — The LLM is trained to output structured JSON that describes which tool to call and with what arguments. OpenAI calls this "function calling"; Anthropic calls it "tool use."
- Tool Definitions — You give the model a list of available tools with descriptions of what they do and what parameters they accept. The model learns to pick the right tool for each task.
- Execution Loop — The system executes the tool, captures the output, and feeds it back to the LLM. The model can then decide what to do next based on the result.
- Multi-step Reasoning — Complex tasks require multiple tool calls. The LLM plans, executes, observes, and iterates until the task is complete.
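A hedged sketch of that loop in Python; the message format, call_llm helper, and tool-call JSON shape here are hypothetical, not any specific vendor's API:

import json

def run_agent(call_llm, tools, user_request, max_steps=10):
    # tools: dict mapping tool name -> Python function that actually executes it
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = call_llm(messages)                     # the model returns text or a JSON tool call
        call = try_parse_tool_call(reply)
        if call is None:
            return reply                               # plain text: the model is done
        result = tools[call["tool"]](**call.get("args", {}))   # the system executes the tool
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "tool", "content": json.dumps(result)})   # feed the result back
    return "Stopped after too many steps."

def try_parse_tool_call(text):
    try:
        call = json.loads(text)
        return call if isinstance(call, dict) and "tool" in call else None
    except (json.JSONDecodeError, TypeError):
        return None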
Examples of Tools
- File Operations — Read files, write files, search codebases, create directories
- Shell Commands — Run terminal commands, install packages, execute scripts
- Browser Automation — Navigate to URLs, click buttons, fill forms, take screenshots
- APIs — Call external services (GitHub, Slack, databases, email)
- Code Execution — Run Python/JavaScript in a sandbox to compute results
This is what makes Cursor powerful: The LLM understands your intent ("fix this bug"),
then orchestrates tools to actually do it—reading your code, making edits, running tests, committing to git.
The model is the decision-maker; the tools are the execution layer.
Agentic AI
When an LLM can autonomously use tools in a loop—planning, executing, observing results,
and adapting—it becomes an agent. This is the frontier of AI right now.
- ReAct Pattern — Reason → Act → Observe → Repeat. The model thinks about what to do, takes an action, sees what happened, and continues.
- Planning — Advanced agents can break down complex tasks into subtasks and execute them systematically.
- Error Recovery — Good agents can detect when something went wrong and try a different approach.
- Memory — Agents can maintain state across interactions, remembering what they've done and learned.
MCP (Model Context Protocol): This is Anthropic's open standard for connecting LLMs to tools.
Think of it like USB for AI—a universal way to plug capabilities into any model. Cursor uses MCP to connect
to GitHub, Slack, browsers, and more.
How Different Companies Train
Each AI lab has developed unique training approaches that reflect their philosophy
and research priorities.
OpenAI GPT Models
- InstructGPT / ChatGPT — Pioneered RLHF for instruction-following. SFT on human demonstrations, then RL with human preference data.
- GPT-4 — Mixture-of-Experts architecture (rumored). Massive scale pretraining followed by RLHF. Strong focus on capabilities and reasoning.
- o1 / o3 — Reasoning models trained with RL on chain-of-thought. The model learns to "think" step-by-step before answering.
Anthropic Claude
- Constitutional AI (CAI) — Instead of pure human feedback, Claude is trained with a "constitution" of principles. The model critiques and revises its own responses.
- RLAIF — Reinforcement Learning from AI Feedback. An AI evaluates responses against the constitution, reducing reliance on human labelers.
- Two-Phase CAI — (1) Supervised phase: model self-critiques and improves. (2) RL phase: AI-generated preferences train the reward model.
- Focus on Safety — Anthropic prioritizes harmlessness and honesty. Claude is designed to refuse harmful requests while explaining why.
xAI Grok
- Mixture-of-Experts (MoE) — Grok-1 uses 314B parameters with MoE, activating only a subset for each token. More efficient than dense models.
- Real-time Data — Trained with continuous ingestion from X (Twitter), giving it more current knowledge than competitors.
- Colossus Supercomputer — 200,000 GPUs for Grok 4 training. 10x more compute than Grok 3.
- Open Weights — Grok-1 weights were released publicly, unlike GPT and Claude.
Google Gemini
- Multimodal Native — Trained on text, images, audio, and video from the start, not just text with vision added later.
- TPU Training — Uses Google's custom TPU chips rather than NVIDIA GPUs. Different hardware optimization.
- Ultra Scale — Gemini Ultra reportedly used more compute than GPT-4. Google has massive infrastructure advantages.
Meta LLaMA
- Open Source Focus — LLaMA models are released with weights, enabling the open-source community to build on top.
- Efficient Training — LLaMA 2 achieved competitive performance with less compute by focusing on data quality and longer training.
- Academic Friendly — Available for research, driving rapid innovation outside big labs.
Key Concepts to Understand
These are the terms and ideas you need to know to discuss LLM training intelligently.
- Parameters — The learnable weights in the neural network. GPT-4: ~1.7 trillion (rumored). Claude 3 Opus: ~200B. More parameters = more capacity.
- FLOPs — Floating point operations. A measure of compute. Frontier models train with 10^24+ FLOPs.
- Perplexity — A measure of how well the model predicts text. Lower is better. Exponential of average cross-entropy loss.
- Gradient Descent — The optimization algorithm that updates parameters to minimize loss. Backpropagation computes gradients through the network.
- Batch Size — How many examples are processed together before updating weights. Larger batches are more efficient on GPUs.
- Learning Rate — How big a step to take when updating weights. Too high = unstable. Too low = slow convergence.
- Overfitting — When the model memorizes training data instead of generalizing. Regularization and data diversity prevent this.
- Emergent Abilities — Capabilities that appear suddenly at scale, like in-context learning, chain-of-thought reasoning, and tool use.
- Chinchilla Scaling — DeepMind's finding that models should be trained on ~20 tokens per parameter for optimal compute efficiency (see the quick calculation after this list).
- KV Cache — Key-value cache that stores intermediate computations during inference to avoid redundant work. Critical for fast generation.
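The Chinchilla rule of thumb above turns into simple arithmetic; for example, with a hypothetical 70B-parameter model:

n_params = 70e9                        # a hypothetical 70B-parameter model
optimal_tokens = 20 * n_params         # Chinchilla rule of thumb: ~20 tokens per parameter
print(f"{optimal_tokens / 1e12:.1f} trillion training tokens")   # 1.4 trillion training tokens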
What Cursor Does
Cursor isn't training foundation models from scratch. Instead, they're doing applied AI research
that makes these models dramatically more useful for coding.
- Model Orchestration — Cursor routes queries to different models (Claude, GPT-4) based on task type. They optimize for speed, quality, and cost.
- Custom Fine-tuning — Cursor trains specialized models for code completion, error fixing, and codebase understanding.
- Context Engineering — Massive focus on what context to provide the model. Semantic search over codebases, relevant file detection, diff context.
- Tool Use — Training models to use tools: file editing, terminal commands, browser automation. This is agentic AI.
- Inference Optimization — Fast response times require optimized serving infrastructure. Speculative decoding, caching, batching.
Why this matters for jobs: Cursor is hiring for ML engineering, infrastructure, and product roles that focus on applying frontier models to real problems. You don't need to know how to pretrain GPT-4, but you do need to understand how these models work and how to make them useful.