PRE-TRAINING
Video (best)
- Andrej Karpathy — “Intro to Large Language Models”
- Watch: YouTube
- Why: Karpathy provides an exceptionally clear mental model of pre-training as the foundational “compression of the internet” step — covering next-token prediction, autoregressive generation, and the intuition behind perplexity in a way that is accessible yet technically honest. Already validated in the existing curated list.
- Level: beginner/intermediate
Blog / Written explainer (best)
- Lilian Weng — “Large Language Model Pre-training”
- Link: https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/
- Why: Lilian Weng’s blog posts are the gold standard for structured, citation-backed written explainers in ML. Her coverage of training objectives, data curation (including LAION-5B context), and architectural choices bridges theory and practice better than most written resources. The exact post slug should be verified.
- Level: intermediate/advanced
Deep dive
- Sebastian Raschka — “Pre-Training LLMs from Scratch” (Magazine/Substack series)
- Link: https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training [VERIFY — Raschka’s Substack at magazine.sebastianraschka.com is confirmed real; exact slug needs verification]
- Why: Raschka’s writing uniquely combines rigorous mathematical grounding with practical implementation notes. His pre-training coverage explicitly addresses data pipelines, tokenization, training stability, and evaluation via perplexity — making it the most complete written deep dive for practitioners building intuition before touching code.
- Level: advanced
Original paper
- Brown et al. (2020) — “Language Models are Few-Shot Learners” (GPT-3)
- Link: https://arxiv.org/abs/2005.14165
- Why: This is the most widely cited and pedagogically readable paper establishing the modern pre-training paradigm at scale. It clearly articulates the next-token prediction objective, training data composition, and emergent capabilities — making it the canonical reference for what “pre-training” means in the LLM era. For multi-modal pre-training specifically, the CLIP paper (arxiv.org/abs/2103.00020) is the contrastive pre-training counterpart. [NOT VERIFIED]
- Level: intermediate/advanced
Code walkthrough
- Andrej Karpathy — “Let’s build GPT: from scratch, in code, spelled out”
- Watch: YouTube
- Why: This is arguably the best hands-on pre-training walkthrough in existence. Karpathy implements autoregressive language model pre-training from scratch in ~2 hours, covering the training loop, next-token prediction loss, and perplexity evaluation with minimal abstraction. The associated GitHub repo (github.com/karpathy/ng-video-lecture) provides runnable code.
- Level: intermediate
Coverage notes
- Strong: Unimodal LLM pre-training (next-token prediction, autoregressive generation, perplexity, scale) — Karpathy’s video and code walkthrough cover this exceptionally well.
- Weak: Multi-modal pre-training specifics (interleaved training, natively multi-modal architectures, LAION-5B data curation) — no single curated video covers this with the same depth as the LLM-only case.
- Gap: No excellent standalone YouTube video exists specifically for contrastive pre-training (CLIP-style) or interleaved multi-modal pre-training (Flamingo/Gemini-style) at a beginner-friendly level. The intro-to-multimodal course will need supplementary resources for these sub-topics. Consider Yannic Kilcher’s CLIP paper walkthrough (youtube_id: T9XSU0pKX2E) [NOT VERIFIED] as a candidate for contrastive pre-training.
Cross-validation
This topic appears in 2 courses: intro-to-llms, intro-to-multimodal
- The Karpathy video (zjkBMFhNj_g) is already curated 4× for intro-to-llms/how-language-models-work — deduplication recommended in the platform’s content index.
- The intro-to-multimodal course will require additional resources specifically addressing LAION-5B, contrastive pre-training, and natively multi-modal objectives not covered by the LLM-focused resources above.
Additional Resources for Tutor Depth
7 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 Chinchilla compute-optimal scaling (tokens vs parameters)
Paper · source
Chinchilla compute-optimal rule-of-thumb: train smaller models on more tokens; explicit token/parameter tradeoff under fixed FLOPs.
Key content
- Objective (Eq. 1): choose parameters N and training tokens D to minimize final pre-training loss L(N,D) under a fixed compute budget C:
\[ (N_{\text{opt}}(C),D_{\text{opt}}(C))=\arg\min_{N,D:\ \text{FLOPs}(N,D)=C} L(N,D) \]
- \(N\) = number of model parameters; \(D\) = number of training tokens seen; \(C\) = total training FLOPs.
- Empirical dataset: >400 transformer LMs, 70M–16B parameters, trained on 5B–500B tokens.
- Main scaling result (Section 3; Table 2 exponents): compute-optimal scaling is approximately equal in parameters and tokens:
- Approach 1: \(N_{\text{opt}}\propto C^{0.50}\), \(D_{\text{opt}}\propto C^{0.50}\)
- Approach 2 (IsoFLOP profiles): \(a=0.49,\ b=0.51\), where \(N_{\text{opt}}\propto C^{a}\) and \(D_{\text{opt}}\propto C^{b}\)
- Approach 3 (parametric fit): \(a=0.46,\ b=0.54\)
- Rule-of-thumb: for every doubling of model size, double the training tokens.
- Parametric loss model (Eq. 2):
\[ \hat L(N,D)=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}} \]
- Training-procedure detail (Section 3.1/3.2): best final loss is achieved when the cosine LR schedule length matches the token horizon; they decay LR by 10× over ~D tokens.
- Key comparison (Abstract/Fig. 1): Chinchilla (70B) was trained with the same compute as Gopher (5.76×10^{23} FLOPs) but on 4× more data, and outperforms Gopher (280B), GPT‑3 (175B), Jurassic‑1 (178B), and MT‑NLG (530B).
- MMLU: 67.5% average accuracy, >7% absolute improvement over Gopher.
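The equal-exponent result above can be turned into a back-of-the-envelope calculator. A minimal sketch, assuming the standard \(C\approx 6ND\) FLOPs estimate and a fixed token/parameter ratio (~20 tokens per parameter is the widely quoted Chinchilla figure; the function name is illustrative):

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    # With N_opt ∝ C^0.5 and D_opt ∝ C^0.5, the ratio D/N is a constant;
    # ~20 tokens per parameter is the commonly quoted Chinchilla value.
    # Invert C ≈ 6·N·D = 6·(D/N)·N² for N, then D = (D/N)·N.
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

n, d = chinchilla_optimal(5.76e23)  # the Gopher/Chinchilla budget
print(f"N ≈ {n/1e9:.0f}B params, D ≈ {d/1e12:.1f}T tokens")
# → N ≈ 69B params, D ≈ 1.4T tokens
```

Plugging in the Gopher budget recovers roughly the Chinchilla configuration (70B parameters, 1.4T tokens), which is the sanity check the paper itself makes.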
📄 Cross-entropy ↔ Perplexity (next-token prediction)
Paper · source
Explicit definition of perplexity from cross-entropy loss; next-token prediction evaluation details + human-vs-LM numbers.
Key content
- Per-token cross-entropy loss (human or model) and perplexity (Section 4.1.1, Eq. 1):
Let \(p(\cdot\mid c)\) be the predictor’s distribution over next tokens given context \(c\), and \(t\) the true next token. The (expected) loss is
\[ \mathcal{L} \;=\; \mathbb{E}_{(c,t)}\big[-\log p(t\mid c)\big] \]
Perplexity is defined by exponentiating this cross-entropy:
\[ \mathrm{PPL} \;=\; \exp(\mathcal{L}) \]
(If \(\log\) is base-2, the loss is in bits and \(\mathrm{PPL}=2^{\mathcal{L}}\); base \(e\) gives nats and \(\exp(\mathcal{L})\).)
- Top-1 accuracy definition (Intro): fraction of positions where the predictor’s highest-probability token equals the true next token.
- Human vs LM top-1 accuracy on OpenWebText (Section 3.2):
- Humans: mean 29% top-1 accuracy (38 participants with ≥50 answers: 30%).
- GPT-3: 56% top-1 accuracy on same dataset.
- Even GPT-Neo-125M exceeded all human players in their sample.
- Human perplexity estimation procedure (Section 4):
- Humans can’t provide a full \(p(\cdot\mid c)\) over ~50k tokens, so the authors elicit relative likelihoods between two candidate tokens (the true token vs a token sampled from a generator LM).
- Importance sampling with generator \(q\) (GPT-2-small, 117M) is used to approximate sums over the vocabulary (Eq. 3–5).
- Bias control: scoring uses a weighted binary cross-entropy reward so optimal reporting matches the human’s true belief (Eq. 7–9).
- Defaults/parameters: OpenWebText validation prompts up to 120 tokens; response options restricted to 11 ratios (99%, 90%, …, 1%); 60 participants (top-1 game), 54 participants (perplexity game).
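The exponentiation step in the definition above is the whole computation. A minimal plain-Python sketch (illustrative function name), assuming per-token losses in nats:

```python
import math

def perplexity(per_token_nlls):
    # PPL = exp(mean per-token cross-entropy); losses assumed in nats.
    return math.exp(sum(per_token_nlls) / len(per_token_nlls))

# Example: three tokens predicted with probabilities 1/2, 1/4, 1/8.
nlls = [-math.log(p) for p in (0.5, 0.25, 0.125)]
print(round(perplexity(nlls), 6))  # → 4.0
```

The result is the geometric mean of the inverse token probabilities, which is why perplexity reads as an effective branching factor.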
📄 Kaplan et al. 2020 — LLM Scaling Laws & Compute-Optimal Training
Paper · source
Power-law fits for cross-entropy loss vs parameters/data/compute; compute-optimal allocation equations + fitted exponents/constants.
Key content
- Metric/setting: Autoregressive Transformer LM; optimize next-token cross-entropy loss L (nats) over 1024-token context on WebText2; BPE vocab 50,257. (Sec. 2)
- Power-law scaling (when not bottlenecked by other factors):
- Params-limited, trained to convergence:
  Eq. (1.1): \(L(N)=\left(\frac{N_c}{N}\right)^{\alpha_N}\), with \(\alpha_N\approx 0.076\), \(N_c\approx 8.8\times10^{13}\) non-embedding params.
- Data-limited, early-stopped:
  Eq. (1.2): \(L(D)=\left(\frac{D_c}{D}\right)^{\alpha_D}\), with \(\alpha_D\approx 0.095\), \(D_c\approx 5.4\times10^{13}\) tokens.
- Compute-limited, compute-optimal:
  Eq. (1.3): \(L(C_{\min})=\left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}\), with \(\alpha_C^{\min}\approx 0.050\), \(C_c^{\min}\approx 3.1\times10^{8}\) PF-days.
- Joint overfitting law:
  Eq. (1.5)/(4.1): \(L(N,D)=\Big[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D}+\frac{D_c}{D}\Big]^{\alpha_D}\). Implies data scaling \(D\propto N^{\alpha_N/\alpha_D}\approx N^{0.74}\).
- Practical rule to avoid overfitting near seed-noise (\(\sim0.02\)): Eq. (4.4): \(D\gtrsim (5\times10^3)\,N^{0.74}\) (tokens).
- Training compute estimate: \(C\approx 6NBS\) (forward+backward factor 6), with batch size B and steps S. (Sec. 3.3)
- Learning curve fit (infinite-data limit):
Eq. (1.6): \(L(N,S)=\left(\frac{N_c}{N}\right)^{\alpha_N}+\left(\frac{S_c}{S_{\min}(S)}\right)^{\alpha_S}\), with \(S_c\approx 2.1\times10^3\), \(\alpha_S\approx 0.76\). (Sec. 5)
- Critical batch size (depends on loss, not model size):
  Eq. (1.4)/(5.3): \(B_{\text{crit}}(L)=\frac{B_*}{L^{1/\alpha_B}}\), with \(B_*\approx 2\times10^8\) tokens, \(\alpha_B\approx 0.21\). (Sec. 5.1)
- Compute-optimal allocation (key empirical exponents): \(N\propto C_{\min}^{0.73}\), \(B\propto C_{\min}^{0.24}\), \(S\propto C_{\min}^{0.03}\) ⇒ spend extra compute mostly on bigger models and stop far before convergence. (Sec. 6; Eq. 1.8)
- Defaults: 10% dropout; LR schedule 3000-step warmup + cosine decay to 0; early stop when test loss stops decreasing. (Sec. 4.2, App. D.6)
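The joint overfitting law and the compute estimate quoted above can be turned into two small helpers. A sketch using the fitted constants from the paper (function names are illustrative):

```python
def kaplan_loss(n_params, n_tokens, n_c=8.8e13, d_c=5.4e13,
                alpha_n=0.076, alpha_d=0.095):
    # Eq. (1.5): L(N, D) = [(N_c / N)^(α_N/α_D) + D_c / D]^α_D, in nats/token.
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d

def train_flops(n_params, batch_tokens, steps):
    # Sec. 3.3 estimate: C ≈ 6 · N · B · S (forward + backward factor of 6).
    return 6.0 * n_params * batch_tokens * steps
```

As expected from the power laws, `kaplan_loss` decreases monotonically in both N and D, and the two terms make the params-limited and data-limited regimes explicit.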
📄 LAION-5B dataset creation & filtering (CLIP-based)
Paper · source
Web-scale image–text dataset creation pipeline (CLIP filtering, language split, safety tags) + dataset stats
Key content
- Dataset scale & composition (Abstract; Section 4):
- LAION-5B = 5.85B CLIP-filtered image–text pairs.
- Subsets derived from Common Crawl:
- 2.32B English image–text pairs (LAION-2B-en / LAION-2B).
- 2.26B multilingual pairs.
- 1.27B language-unspecific/unknown (e.g., places/products).
- Core filtering equation/criterion (Section 3.1):
- Compute CLIP cosine similarity between the image embedding and the text embedding:
  \( s = \cos(\mathbf{e}_{\text{img}}, \mathbf{e}_{\text{txt}}) \)
- Keep the pair if \( s \ge \tau \).
- Thresholds used:
  - English: remove if \(s < 0.28\).
  - Non-English: remove if \(s < 0.26\).
- Effect: starting from ~50B candidate images, this CLIP-threshold step removed ~90%, leaving just short of 6B examples.
- Models used for filtering (Section 3.1):
- English: OpenAI CLIP ViT-B/32.
- Other languages: multilingual CLIP ViT-B/32 (Carlsson et al.).
- Rationale: larger CLIP variants became available later, but the authors used ViT-B/32 consistently across the dataset for timing and consistency.
- Safety/metadata provided (Abstract):
- Released detection scores for watermark, NSFW, and toxic content; plus tooling for exploration/subset generation (e.g., nearest-neighbor indices).
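The keep/drop rule above amounts to a cosine-similarity threshold on CLIP embeddings. A minimal plain-Python sketch (illustrative function names; real pipelines batch this over normalized embedding matrices):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u · v) / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def keep_pair(img_emb, txt_emb, threshold=0.28):
    # LAION keep rule: retain the image–text pair iff cos(e_img, e_txt) >= τ
    # (τ = 0.28 for English pairs, 0.26 for non-English pairs).
    return cosine_similarity(img_emb, txt_emb) >= threshold
```

In the actual pipeline the embeddings come from CLIP ViT-B/32 (OpenAI for English, multilingual CLIP otherwise); here they are just vectors.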
📖 Trainer LR scheduler knobs (cosine, restarts, custom)
Reference Doc · source
Exact Trainer/TrainingArguments knobs for scheduler selection + how to override with a custom scheduler
Key content
- Built-in scheduler selection via `TrainingArguments`:
  - Set `lr_scheduler_type = "cosine_with_restarts"` to use cosine annealing with restarts.
  - Pass scheduler-specific parameters via `lr_scheduler_kwargs`, e.g. `lr_scheduler_kwargs = {"num_cycles": 5}` (controls the number of cosine restart cycles).
- Custom scheduler when the built-ins don’t fit (e.g., you don’t want to decay to 0):
  - HF core maintainer guidance: “You can pass your own learning rate scheduler to the `Trainer`.” (used when you want a different final LR such as 50% of peak).
- Manual cosine-with-warmup scheduler wiring (procedure):
  - Create the optimizer (example shown: `PagedAdamW_32bit(model.parameters())`).
  - Create `Trainer(..., optimizers=(optimizer, None))`.
  - Build the scheduler with the Transformers helper:
    `scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=training_args.warmup_steps, num_training_steps=...)`
  - Attach it: `trainer.lr_scheduler = scheduler`, or pass it directly: `Trainer(..., optimizers=(optimizer, lr_scheduler))`.
- Training step counts used in the examples:
  - `num_warmup_steps = int(max_steps * warmup_ratio)`
  - `num_training_steps = max_steps` (explicitly set via `TrainingArguments(max_steps=...)`).
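For intuition, the multiplier that a cosine-with-warmup schedule applies to the base LR can be sketched in plain Python. This mirrors the documented shape of `get_cosine_schedule_with_warmup` (linear warmup 0 → 1, then half-cosine decay to 0 when `num_cycles=0.5`), not the library’s exact source; the function name is illustrative:

```python
import math

def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps,
                             num_cycles=0.5):
    # Linear warmup from 0 to 1 over num_warmup_steps...
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    # ...then cosine decay; num_cycles=0.5 gives a single half-cosine to 0.
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))

# lr(step) = base_lr * cosine_warmup_multiplier(step, 500, 10_000)
```

A custom final LR (e.g. 50% of peak, as in the maintainer guidance above) would change the final `return` so the multiplier floors at 0.5 instead of 0.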
📖 Transformers v3.0.2 Optimization (AdamW + LR Schedules)
Reference Doc · source
Exact transformers.AdamW defaults + official LR scheduler APIs (PyTorch/TensorFlow)
Key content
- `AdamW` (PyTorch) API + defaults (Section “AdamW (PyTorch)”):
  - `transformers.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, correct_bias=True)`
  - `lr`: learning rate (default 1e-3)
  - `betas=(b1, b2)`: Adam momentum coefficients (default (0.9, 0.999))
  - `eps`: numerical-stability term (default 1e-6)
  - `weight_decay`: decoupled weight decay (default 0.0)
  - `correct_bias`: bias-correction toggle (default True; the BERT TF repo uses False)
  - Rationale: “weight decay fix” per Decoupled Weight Decay Regularization (the decay does not interact with Adam’s m/v states).
- `AdamWeightDecay` (TensorFlow) defaults:
  - `learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, weight_decay_rate=0.0`
  - Rationale: adding an L2 penalty to the loss is not correct for Adam; use decoupled decay (equivalent to L2 only with plain SGD).
- TF helper `create_optimizer`: warmup → linear decay schedule.
  - `create_optimizer(init_lr, num_train_steps, num_warmup_steps, min_lr_ratio=0.0, adam_epsilon=1e-8, weight_decay_rate=0.0, include_in_weight_decay=None)`
  - Final LR at the end of training: `init_lr * min_lr_ratio`.
- PyTorch LR schedulers (each returns a `torch.optim.lr_scheduler.LambdaLR`):
  - `get_constant_schedule(optimizer, last_epoch=-1)`
  - `get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1)` (linear warmup 0 → base LR)
  - `get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1)` (warmup, then linear decay to 0)
  - `get_cosine_schedule_with_warmup(..., num_cycles=0.5)` (half-cosine to 0)
  - `get_cosine_with_hard_restarts_schedule_with_warmup(..., num_cycles=1)` (cosine with hard restarts)
📋 Using evaluate’s perplexity metric on an already-loaded model (HF forum)
Forum · source: https://discuss.huggingface.co/t/how-can-i-use-evaluates-perplexity-metric-on-a-model-thats-already-loaded/48564