Long Context

Video (best)

  • Andrej Karpathy — “Let’s build GPT: from scratch, in code, spelled out.”
  • Watch: YouTube
  • Why: Clear, practical explanation of Transformer attention and why context length matters; good foundation before diving into long-context extensions.
  • Level: Beginner → Intermediate

Blog / Written explainer (best)

  • Lil’Log (Lilian Weng) — “Attention? Attention!”
  • Link: https://lilianweng.github.io/posts/2018-06-24-attention/
  • Why: Strong conceptual grounding in attention mechanisms; useful for understanding efficient attention and context utilization issues.
  • Level: Beginner → Intermediate

Deep dive

Original paper

  • Su et al. — “RoFormer: Enhanced Transformer with Rotary Position Embedding”
  • Link: https://arxiv.org/abs/2104.09864
  • Why: Primary reference for RoPE (rotary position embeddings), a key building block behind many modern long-context extension techniques (and later rope scaling variants).
  • Level: Intermediate → Advanced
  • Beltagy, Peters, Cohan — “Longformer: The Long-Document Transformer”
  • Link: https://arxiv.org/abs/2004.05150
  • Why: Canonical efficient-attention approach (sparse attention) for longer sequences; useful contrast vs “just increase context”.
  • Level: Intermediate → Advanced
  • Dao et al. — “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”
  • Link: https://arxiv.org/abs/2205.14135
  • Why: Widely used exact-attention implementation enabling longer contexts in practice by reducing memory/compute overhead.
  • Level: Advanced

Code walkthrough

  • Andrej Karpathy — “nanoGPT”
  • Link: https://github.com/karpathy/nanoGPT
  • Why: Readable reference implementation of GPT-style training/inference; good base for experimenting with context length, attention cost, and positional embeddings.
  • Level: Intermediate

Coverage notes

  • Strong: Transformer attention fundamentals; positional embeddings (incl. RoPE); efficient attention (Longformer) and practical attention optimization (FlashAttention).
  • Weak: Direct, practitioner-focused guides on rope scaling, yarn, and position interpolation as used in modern long-context LLMs (few stable, canonical explainers).
  • Gap: A single authoritative explainer that cleanly compares retrieval vs long context, addresses lost-in-the-middle and context utilization, and ties them to needle-in-a-haystack evaluation and context compression techniques.

Additional Resources for Tutor Depth

10 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.

📄 Position Interpolation (PI) for RoPE context extension

Paper · source

Defines Position Interpolation (PI) for RoPE by rescaling positions (e.g., (p’ = p/s)) and specifies how to apply it at inference and during continued training, including the exact modification point in RoPE computation.

Key content
  • RoPE definition (Eq. 1–2): RoPE applies a complex rotation to each head-dim pair using position (p). Attention score depends on relative position via trig functions after applying RoPE to both queries (q) and keys (k) at each layer.
  • Problem with direct extrapolation (Sec. 2.2): Using RoPE beyond the trained max length can cause catastrophically high attention scores and perplexity comparable to untrained models; even evidence at distance 2900 may fail when querying at position 3000 if trained max is 2048.
  • Position Interpolation (PI) (Eq. 4): For extending context from original (L) to longer (L’), down-scale position indices before RoPE:
    [ p’ = p/s,\quad s = L’/L ] i.e., replace RoPE((p)) with RoPE((p/s)). This aligns max relative distance back to the pretrained range (reduces from (L’) to (L)).
  • Stability rationale (Theorem 2.1, Eq. 5–7 vs Eq. 8): Interpolated attention scores are bounded by a linear interpolation of nearby “well-behaved” grid points; interpolation bound is at least smaller than RoPE’s extrapolation bound (numerically (B(s)\ge 1), often much larger).
  • Fine-tuning procedure (Sec. 3.1): Next-token prediction on Pile; AdamW ((\beta_1,\beta_2)=(0.9,0.95)); weight decay 0; warmup 20 steps from 0 to max LR. LR: (2!\times!10^{-5}) (7B/13B), (1!\times!10^{-5}) (33B/65B). PI fine-tune typically 1000 steps (direct FT baseline: 10000).
  • Empirical: effective context via passkey retrieval (Table 4): PI reaches target window after 200 steps (e.g., 7B: 8192/16384/32768 all achieved; 33B: 8192/16384 achieved). Direct FT barely improves: ~2048→2560 even after 10000 steps.
  • Empirical: perplexity improves with longer windows (Table 1): LLaMA-7B extended to 32768 with PI: PG19 ppl 7.23→6.77 (2048→32768). Direct FT to 8192 worsens at long windows (e.g., 7B: 7.21@2048 → 7.69@8192).

📄 YaRN RoPE extension (NTK-by-parts + attention scaling)

Paper · source

Piecewise/ramped RoPE frequency scaling (low/high-frequency treatment) + YaRN attention temperature/rescaling to preserve short-range behavior while extending max context.

Key content
  • Goal/notation (Section 2.2): pretrained max context (L); target (L’); scale factor (s=L’/L). RoPE wavelength for dim (i): (\lambda_i=\frac{2\pi}{\theta_i}) (Eq. 8). Rotation count ratio over pretrained window: (r_i=\frac{L}{\lambda_i}) (Eq. 10).
  • Position Interpolation (PI): scale positions by (1/s): (m(t)=t/s) (Eq. 9; also Eq. 16 uses (t\cdot L/L’)).
  • NTK-by-parts (piecewise ramp) (Section 3.2, Def. 1): choose thresholds (\alpha,\beta). If (r_i\le \alpha): interpolate like PI (avoid extrapolation). If (r_i\ge \beta): no interpolation (preserve local relative distances). Between: linear ramp (Eq. 11) to blend. Implemented via RoPE modification of form Eq. 7 with functions (m(\cdot)), (n(\cdot)) given in Eqs. 12–13. Recommended for LLaMA: (\alpha=1,\ \beta=32).
  • YaRN attention scaling (Section 3.3): modify attention softmax temperature:
    (\text{Attn}(Q,K,V)=\text{softmax}!\left(\frac{QK^\top}{\tau}\right)V) (Eq. 14). Implement via “length scaling” by scaling both (Q) and (K) through scaled complex RoPE embeddings (no attention-code change; cached RoPE ⇒ zero overhead).
  • Recommended YaRN params (Def. 2): set (\alpha=1,\beta=32) and choose (\tau) by fit vs extension factor (Eq. 15; used across LLaMA/Llama2).
  • Dynamic Scaling (Section 3.4): during inference, set scale factor per forward pass to current sequence length (vs fixed (s)); with KV-cache, cache K/V before applying RoPE because RoPE changes when (s) changes.
  • Key empirical results:
    • Llama 2 7B YaRN fine-tuned: 4k→64k (×16) perplexity on Proof-pile: 8k 3.51; 16k 2.99; 32k 2.65; 64k 2.42 (Table 1).
    • Llama 2 7B YaRN: 4k→128k (×32) (trained on 64k data, +200 steps) perplexity: 8k 3.56; 16k 3.04; 32k 2.70; 64k 2.45; 128k 2.37 (Table 1).
    • Passkey retrieval @128k: YaRN 7B 99.4%, 13B 99.4% (Table B.5).
    • Training recipe for 128k models (Section 4.1): PG19, chunked 64k with BOS/EOS; AdamW, lr (2\times10^{-5}), warmup 20 steps, no weight decay; 7B trained 400 steps (then +200 for 128k extrapolation).

📊 LaRA — RAG vs Long-Context Routing Has No Universal Winner

Benchmark · source

Structured benchmark + measured outcomes on when Retrieval-Augmented Generation (RAG) vs Long-Context (LC) input wins/loses, including routing criteria and failure modes.

Key content
  • Benchmark design (Section 1, 3): LaRA has 2326 test cases across 4 QA task types (Location, Reasoning, Comparison, Hallucination detection) and 3 natural long-context types: novels, academic papers, financial statements. Contexts chosen to be near 32k and 128k tokens (avoid truncation; reduce leakage via recency + entity replacement in novels).
  • Task definitions (Section 3.2):
    • Location: answer in one sentence/paragraph; paraphrase allowed (not verbatim “needle”).
    • Reasoning: inference/calculation from context.
    • Comparison: synthesize info from multiple distant parts.
    • Hallucination detection: correct response is refusal: “XXX is not mentioned in the provided context.”
  • RAG pipeline defaults (Section 4.1): chunk size 600 tokens, overlap 100, 5 chunks per document; embeddings GTE-large-en-v1.5; hybrid retrieval = embedding similarity + BM25.
  • Evaluation procedure (Section 3.4): GPT-4o as judge given (query, ground-truth, prediction); validated with Cohen’s Kappa vs humans.
  • Key empirical results (Abstract + Section 4.3/Table 2):
    • Model strength: RAG helps weaker models more; at 128k, RAG beats LC by +6.48% (Llama-3.2-3B) and +38.12% (Mistral-Nemo-12B).
    • Context length flip: at 32k, LC averages +2.40% over RAG; at 128k, RAG averages +3.68% over LC.
    • Task routing: RAG ≈ LC on single-location; RAG best for hallucination detection; LC best for reasoning + comparison. Comparison shows largest LC advantage: avg gap +15.22% (32k) and +14.30% (128k) (LC over RAG).
    • Lost-in-the-middle (Section 4.6): LC accuracy drops when answers are near context center; RAG shows no clear position correlation.

📊 Lost-in-the-Middle (Liu et al., 2023/2024)

Benchmark · source

Canonical “lost-in-the-middle” U-shaped curves + task definitions (multi-doc QA, key–value retrieval) + evaluation protocol knobs (context length, answer position, baselines)

Key content
  • Core phenomenon (Fig. 1, Fig. 5, Fig. 7): Performance vs. position of relevant info is U-shaped: best when relevant content is at start (primacy) or end (recency); worst in the middle, even for long-context models.
  • Multi-document QA task (Sec. 2 / 3.1):
    • Input = question + k documents, with exactly 1 answer-containing doc and k−1 distractors.
    • Manipulations: (i) increase context length by increasing k; (ii) change position of the answer doc by reordering documents (beginning/middle/end; finer-grained positions in plots).
    • Metric: Accuracy = whether any annotated correct answer string (NaturalQuestions annotations) appears in model output (string match).
    • Baselines: closed-book (no docs; parametric memory) and oracle (only the answer doc).
    • Concrete result: For GPT-3.5-Turbo, when the answer doc is in the middle, open-book accuracy can be lower than closed-book; closed-book reported as 56.1%.
    • Extended-context not necessarily better: When both models’ windows fit the same inputs (e.g., 10- and 20-doc settings), GPT-3.5-Turbo vs GPT-3.5-Turbo (16K) curves are nearly superimposed (Fig. 5).
  • Synthetic key–value retrieval (Sec. 3):
    • Context contains 75 / 140 / 300 key–value pairs; 500 examples each.
    • Query asks for the value of a key; metric: output contains the correct value.
    • Same U-shaped degradation with key–value position (Fig. 7); query-aware contextualization (placing query before and after data) yields near-perfect key–value performance but minimal change for multi-doc QA trends (Sec. 4.2).
  • Architecture probe (Sec. 4.1): Encoder–decoder models (Flan-T5-XXL, Flan-UL2) are robust within training-time lengths, but show U-shape when evaluated beyond (e.g., encoder training max noted as 2048 tokens for UL2).

📊 Lost-in-the-Middle (Long-Context Utilization Curves)

Benchmark · source

Canonical “lost-in-the-middle” U-shaped accuracy curves vs. position of relevant evidence in long contexts; setups for multi-doc QA + synthetic key–value retrieval.

Key content
  • Core finding (Fig. 1, Fig. 5, Fig. 7): Accuracy is highest when relevant info is at the beginning (primacy) or end (recency) of the context and drops in the middle (U-shaped “lost-in-the-middle” curve), even for explicitly long-context models.
  • Multi-document QA task (Sec. 2.1):
    • Input: question + k documents, with exactly 1 answer-containing doc and k−1 distractors.
    • Data: NaturalQuestions-Open, 2655 queries where long answer is a paragraph.
    • Docs: Wikipedia chunks ≤100 tokens. Distractors retrieved by Contriever (MS-MARCO fine-tuned); presented in decreasing relevance.
    • Controls: vary context length by changing total docs (10/20/30; ~2K/4K/6K tokens), and vary answer-doc position by reordering docs.
    • Metric: accuracy = any gold answer string appears in output.
  • Closed-book vs oracle baselines (Table 1, multi-doc QA accuracy):
    • LongChat-13B (16K): 35.0% closed-book, 83.4% oracle
    • MPT-30B-Instruct: 31.5%, 81.9%
    • GPT-3.5-Turbo: 56.1%, 88.3%
    • GPT-3.5-Turbo (16K): 56.0%, 88.6%
    • Claude-1.3: 48.3%, 76.1%
    • Claude-1.3 (100K): 48.2%, 76.4%
    • Notable comparison (Sec. 2.3): GPT-3.5-Turbo worst-case (middle) in 20/30-doc settings can be < closed-book 56.1%.
  • Extended context ≠ better utilization (Sec. 2.3): When prompts fit both windows, GPT-3.5 (4K) vs GPT-3.5 (16K) curves are nearly identical (Fig. 5); similarly Claude-1.3 vs Claude-1.3-100K.
  • Synthetic key–value retrieval (Sec. 3):
    • Input: JSON with k UUID→UUID pairs + query key; output associated value (Fig. 6).
    • k tested: 75/140/300 pairs (~4K/8K/16K tokens), 500 examples each.
    • Result: some models (notably Claude-1.3 / 100K) near-perfect; others show middle-position degradation (Fig. 7).
  • Query-aware contextualization (Sec. 4.2):
    • Put query before and after data.
    • Effect: dramatically improves key–value retrieval (e.g., GPT-3.5-Turbo-16K becomes perfect at 300 pairs; without it, worst-case 45.6%), but minimal improvement for multi-doc QA (Fig. 9).
  • Architecture effect (Sec. 4.1): Encoder–decoder (Flan-UL2, Flan-T5-XXL) is robust within training-time lengths (Flan-UL2: 1.9% best–worst gap within 2048 tokens), but shows U-shape beyond training lengths (Fig. 8).
  • More retrieved docs saturate (Sec. 5, Fig. 11): In open-domain QA, reader accuracy saturates well before retriever recall; using 50 vs 20 docs yields only ~+1.5% (GPT-3.5) and ~+1% (Claude-1.3).

📊 MMNeedle (Multimodal Needle-in-a-Haystack) long-context benchmark

Benchmark · source

Concrete NIAH-style multimodal evaluation design + reported accuracy drops as visual context (images × stitched sub-images) grows; includes negative-sample hallucination.

Key content
  • Task (Sec. 3.1): Given an image haystack (sequence of images; each may be a stitched grid of sub-images) + a caption query describing one sub-image, model must output the needle location.
  • Context-length construction (Sec. 3.2):
    • Vary input images (N \in {1,10}) (10 chosen because GPT-4V/4o max images/request).
    • Image stitching: stitch (s \times s) sub-images into one image; (s \in {1,2,4,8}). Total sub-images in context = (N \cdot s^2) (e.g., (N{=}10,s{=}8 \Rightarrow 640) sub-images).
    • Sub-image resize: 256×256 px; stitched resolutions: (s{=}2\Rightarrow512^2), (4\Rightarrow1024^2), (8\Rightarrow2048^2). 2048 chosen due to API image-size limits (GPT-4 long side ~2000 px; Claude max 8000×8000).
  • Dataset sampling (Sec. 3.3): MS COCO 2014 val; 5000 positive + 5000 negative per ((N,s,#needles)) setting. Steps: build stitched images → sample 10-image haystacks → pick needle sub-image inside haystack (positive) or outside (negative) → use COCO caption(s) as query.
  • Metrics (Sec. 3.4):
    • Output format (positive): “(i,r,c)” where (i)=image index (1..N), (r,c)=row/col (1..s). Negative: “-1”.
    • ExistenceAcc, IndexAcc, ExactAcc with relation: ExactAcc ≤ IndexAcc ≤ ExistenceAcc.
  • Key empirical results (Tables 2–3):
    • Single image (N=1): GPT-4o ExactAcc 94.60% (s=1), 83.00% (s=2), 19.00% (s=4); Gemini 1.5 best at s=4: 29.81%.
    • Multi-image (N=10): GPT-4o ExactAcc 97.00% (s=1)81.80% (s=2)26.90% (s=4)1.00% (s=8); IndexAcc drops 97.0% → 87.2% → 45.0% → 17.8%.
    • Open-source models: near-zero ExactAcc in multi-image settings (e.g., LLaVA-Llama-3 ExactAcc 0.00% across (s=1,2,4,8) for (N=10)).
  • Negative samples (Sec. 4.3): Strong retrievers (e.g., GPT-4o) show worse existence accuracy in harder negative settings → hallucination: predicting a needle exists when it doesn’t.
  • Statistical significance (Sec. 4.4): ExactAcc stabilizes after ~500 samples; standard error decreases markedly from 100→1000 samples.

📊 NoLiMa — Long-context eval without literal-match shortcuts

Benchmark · source

Needle-in-a-haystack-style long-context evaluation where questions/needles have minimal lexical overlap, forcing latent association retrieval; reports degradation across long context lengths + ablations (distractors, hops, CoT).

Key content
  • Core task design (Section 3): Hide a single needle (fact about a unique character) inside a long irrelevant haystack (book snippets). Ask a question whose keyword q is associatively linked to needle keyword k (world knowledge/commonsense), with minimal literal overlap.
    • Example: Needle “Yuki lives next to the Semper Opera House (k)” + Question “Which character has been to Dresden (q)?” (association: Semper Opera House → Dresden).
    • Latent hops: 1-hop vs 2-hop (e.g., Dresden → Saxony).
    • Fact order variants: Default (character name before k) vs Inverted (name after k); inverted harder due to causal attention backtracing limits.
  • Haystack construction + filtering (Section 3.1):
    • Build haystacks by concatenating random <250-token snippets from 10 open-licensed books until >2K lines / >60K tokens.
    • Distractor filtering: Contriever embeddings; inspect top-20 similar words to question keywords; remove sentences containing flagged words.
    • Conflicting-info filtering: scan chunks (1000 chars, 800-char stride250 tokens); instruction-tuned LLM flags potential false answers; manual review.
  • Evaluation setup (Section 4):
    • 58 question–needle pairs; 5 haystacks; each needle placed 26 times at equal intervals per context length ⇒ 7,540 tests per length.
    • Lengths tested: 250, 500, 1K, 2K, 4K, 8K, 16K, 32K.
    • Base score: for each example, take max accuracy over {250,500,1K} (avg over 5 haystacks), then average across examples.
    • Effective length: largest tested length with score ≥ 0.85 × base_score.
  • Key empirical results (Table 3): strong short-context, sharp long-context drop
    • GPT-4o: base 99.3; 8K 89.2, 16K 81.6, 32K 69.7; effective length 8K (claimed 128K).
    • Gemini 1.5 Pro: base 92.6; 32K 48.2; effective length 2K (claimed 2M).
    • Llama 3.3 70B: base 97.3; 32K 42.7; effective length 2K.
    • At 32K, 10/12 models fall to ≤50% of base; at 32K “10 models drop below 50% of strong short-length baselines.”
  • CoT prompting gains but doesn’t fix long contexts (Table 4):
    • One-hop: 32K 56.2 → 60.6 with CoT.
    • Two-hop: 32K 25.9 → 34.3 with CoT (bigger relative gains; still poor ≥16K).
  • Literal-match ablations (Section 4.4.4, Table 6):
    • Direct (literal overlap like vanilla NIAH): ~98.3–98.5 at 8K/16K/32K (near-saturated).
    • Multiple-choice adding literal matches: One-hop 32K 93.1; Two-hop 32K 87.2 (vs non-literal one-hop 56.2, two-hop 25.9).
    • Distracting literal matches (irrelevant overlap inserted) severely hurt: GPT-4o effective length drops to 1K (base also drops: GPT-4o 93.8, Llama 3.3 70B 84.4).
  • Needle position findings (Section 4.4.2):
    • “Lost-in-the-middle” dip at 32K.
    • In two-hop, performance depends more on context length than needle position; aligning placements in the last 2K tokens shows drops not explained by RoPE distance (relative distance constant), implicating attention limits without surface cues.

📖 vLLM Engine Args (Long-Context & KV/Attention-Relevant Flags)

Reference Doc · source

Authoritative CLI/API parameter names + defaults for vLLM serving settings that affect long-context behavior, KV cache, and performance.

Key content
  • Context length control
    • --max-model-len: Model context length; if unset, auto-derived from model config.
    • --disable-sliding-window: disables sliding window behavior (caps to sliding window size otherwise).
  • RoPE / context extension knobs
    • --rope-scaling: RoPE scaling JSON, e.g. {"type":"dynamic","factor":2.0} (factor scales context).
    • --rope-theta: RoPE theta; used with rope_scaling and can improve performance of scaled models.
  • KV cache memory/accuracy tradeoffs
    • --kv-cache-dtype {auto, fp8, fp8_e5m2, fp8_e4m3}; default auto (= model dtype). CUDA 11.8+ supports FP8; ROCm supports fp8_e4m3.
    • --quantization-param-path: JSON scaling factors for KV cache (generally needed for FP8 KV); otherwise scaling defaults to 1.0 (may hurt accuracy).
  • Batching/scheduling limits impacting long prompts
    • --max-num-batched-tokens: max batched tokens per iteration (throughput/latency tradeoff).
    • --max-num-seqs: default 256 sequences per iteration.
    • --enable-chunked-prefill: chunk prefill based on max_num_batched_tokens (helps very long prompts).
  • CUDA graphs vs eager fallback (long seq performance)
    • --max-seq-len-to-capture: default 8192; sequences longer fall back to eager mode.
    • --enforce-eager: always eager-mode PyTorch (disables CUDA-graph hybrid).
  • Memory provisioning
    • --gpu-memory-utilization: default 0.9 fraction of GPU memory for executor.
    • --swap-space: default 4 GiB per GPU.
    • --cpu-offload-gb: default 0; “virtual GPU memory” via CPU offload (requires fast interconnect).
  • KV block granularity
    • --block-size {8,16,32}; default 16 (token block size for contiguous chunks).

📋 # Source: https://aclanthology.org/2024.tacl-1.9/

Source ·

📋 YaRN (RoPE context extension + dynamic scaling)

Code · source

End-to-end YaRN method + concrete training/eval settings and released 64k/128k Llama 2 variants; shows how scaling schedules are applied.

Key content
  • Goal/notation (Sec. 2.2): pretrained max context (L); extend to (L’). Scale factor (s=L’/L). RoPE wavelength for dim (i): (\lambda_i = 2\pi/\theta_i) (Eq. 8 conceptually; used to reason about “high vs low frequency” dims).
  • Position Interpolation (PI) (Eq. 9 / App. A.1 Eq. 16): modify RoPE by scaling positions: (m(t)=t/s) (i.e., interpolate indices into pretrained range).
  • YaRN attention “length scaling” (Sec. 3.3): modify attention softmax temperature:
    (\text{Attn}(q,k)=\text{softmax}!\left(\frac{qk^\top}{\tau}\right)) (Eq. 14). Implemented by scaling rotary embeddings (no attention-kernel code change; cached RoPE ⇒ zero overhead).
  • Dynamic Scaling (Sec. 3.4): during inference, set scale factor per forward pass to current sequence length (vs fixed max). KV-cache caveat: cache K/V before applying RoPE because RoPE changes when scale changes.
  • Recommended YaRN params for LLaMA/Llama 2 (Def. 2, Eq. 15): provides fitted defaults for (\beta) (temperature/length scaling) as a function of extension factor; authors report it transfers across LLaMA 7B–65B and Llama 2 7B–70B.
  • Training pipeline (Sec. 4.1): Llama 2 7B/13B to 128k: PG19, chunked 64k segments with BOS/EOS; AdamW, lr (2\times10^{-5}), no weight decay, 20-step warmup, (\beta_1=0.9,\beta_2=0.95); FSDP + FlashAttention2. Steps: 400 (64k model), then +200 more to reach 128k.
  • Key results (Sec. 4.2 Table 1): Llama 2 7B YaRN (4k→64k, 400 steps) perplexity on Proof-pile: 8k 3.51; 16k 2.99; 32k 2.65; 64k 2.42. 7B YaRN (4k→128k, 400+200): 8k 3.56; 16k 3.04; 32k 2.70; 64k 2.45; 128k 2.37 (“train short, test long” extrapolation).
  • Passkey retrieval (App. B.5): YaRN 7B @128k: 99.4%; YaRN 13B @128k: 99.4%.