Chain Of Thought
Video (best)
- Yannic Kilcher — “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Paper Explained)”
- Watch: YouTube
- Why: Kilcher systematically walks through the original Wei et al. paper, explaining why intermediate reasoning steps improve LLM performance — not just that they do. His paper-reading format is ideal for learners who want mechanistic understanding rather than surface-level intuition.
- Level: intermediate
⚠️ Coverage note: I have moderate confidence in this specific video ID. The video title and Kilcher’s coverage of this paper are well-established, but the 11-character ID should be verified before publishing.
Blog / Written explainer (best)
- Lilian Weng — “Prompt Engineering”
- Link: https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
- Why: Weng dedicates a substantial, rigorous section to Chain-of-Thought (standard CoT, zero-shot CoT, self-consistency, Tree of Thoughts, and multimodal CoT) with clean diagrams and citations. It serves as a single-stop written reference covering all related concepts listed for this topic. Her writing bridges intuition and technical depth exceptionally well.
- Level: intermediate
Deep dive
- Author — Lilian Weng (same post serves dual purpose) / alternatively the original survey
- Link: https://arxiv.org/abs/2201.11903
- Why: “A Survey of Chain of Thought Reasoning in Large Language Models” (Chu et al.) is the most comprehensive technical taxonomy of CoT variants — covering standard CoT, zero-shot CoT, self-consistency, least-to-most prompting, Tree of Thoughts, and multimodal CoT — with structured comparisons across benchmarks. Better as a deep-dive reference than the original paper for breadth. [NOT VERIFIED]
- Level: advanced
Original paper
- Wei et al. (Google Brain), 2022 — “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”
- Link: https://arxiv.org/abs/2201.11903
- Why: This is the seminal paper that named and formalized the concept. It is unusually readable for an NLP paper — the examples are concrete, the ablations are clear, and the core insight (few-shot exemplars with reasoning steps unlock emergent reasoning) is presented accessibly. Essential primary source.
- Level: intermediate/advanced
⚠️ Note: This entry and the Deep dive entry above both link to arXiv 2201.11903. That ID belongs to the Wei et al. paper, so the link that needs replacing is the survey's. Verify the survey's title and arXiv ID independently before publishing.
Code walkthrough
- None identified — No single canonical hands-on CoT implementation tutorial from a top educator (Karpathy, fast.ai, etc.) has been confirmed with a verifiable URL.
Closest alternatives to verify:
- The `langchain` documentation includes a CoT prompting walkthrough: https://python.langchain.com/docs/tutorials/ [NOT VERIFIED]
- Hugging Face’s open-source cookbook has CoT examples but no single definitive notebook URL I can confirm with confidence.
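Since no canonical walkthrough is confirmed, here is a minimal sketch of what "implementing" CoT usually amounts to: building the prompt. It shows a few-shot variant (exemplars that spell out their reasoning) and a zero-shot variant (a trigger phrase). The exemplar text, the function names, and the `Q:`/`A:` layout are illustrative assumptions, not taken from any resource listed here.

```python
from typing import List, Tuple

# Illustrative exemplar (question, reasoning, answer); invented for this sketch.
EXEMPLARS: List[Tuple[str, str, str]] = [
    (
        "A shelf holds 3 boxes with 4 books each. How many books are on the shelf?",
        "Each box has 4 books and there are 3 boxes, so 3 * 4 = 12.",
        "12",
    ),
]


def few_shot_cot_prompt(question: str,
                        exemplars: List[Tuple[str, str, str]] = EXEMPLARS) -> str:
    """Few-shot CoT: every exemplar shows its reasoning before the answer."""
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    # The new question ends with an open "A:" for the model to continue.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: no exemplars, only a reasoning trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."


prompt = few_shot_cot_prompt("A crate holds 5 bags with 6 apples each. How many apples?")
```

The resulting string would be sent to whichever model API is in use. Only the prompt text differs from standard few-shot prompting.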
Coverage notes
- Strong: Written/blog coverage (Lilian Weng’s post is excellent and confirmed). Original paper is well-documented and readable.
- Weak: Hands-on code walkthroughs — CoT is primarily a prompting technique, so “implementation” is lightweight, and no educator has produced a definitive standalone coding tutorial comparable to, say, Karpathy’s nanoGPT.
- Gap: No confirmed high-quality video specifically on multimodal CoT (Zhang et al., 2023) exists from a preferred educator. Tree of Thoughts also lacks a dedicated video from the preferred educator list. General CoT video coverage exists but IDs need verification.
Cross-validation
This topic appears in 3 courses: intro-to-agentic-ai, intro-to-llms, intro-to-multimodal
| Course | Relevant aspect |
|---|---|
| intro-to-llms | Core CoT concept, zero-shot CoT, self-consistency |
| intro-to-agentic-ai | Tree of Thoughts, ReAct-style reasoning chains |
| intro-to-multimodal | Multimodal CoT (Zhang et al., 2023) |
The Lilian Weng blog post covers all three course contexts in a single resource, making it the highest-leverage written resource across the curriculum.
Additional Resources for Tutor Depth
9 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 Self-Consistency (SC) decoding for Chain-of-Thought
Paper · source
Definition + decoding procedure: sample multiple CoT reasoning paths, marginalize paths, aggregate answers (majority/most consistent); benchmark gains vs greedy CoT.
Key content
- Core method (Fig. 1; Section “Self-consistency over diverse reasoning paths”): Replace greedy decoding in CoT prompting with:
- CoT prompt the LM (few-shot CoT exemplars or zero-shot “let’s think step by step”).
- Sample a diverse set of outputs from the decoder to obtain pairs \((r_i, a_i)\), where \(r_i\) = reasoning path tokens, \(a_i\) = final answer.
- Marginalize out reasoning paths and choose the most consistent answer by aggregating over \(\{a_i\}\) (e.g., majority vote).
- Latent-variable view: Introduces latent \(r_i\) (reasoning) with \(r_i \rightarrow a_i\); reasoning is optional and used only to reach \(a_i\); aggregate over sampled paths to select answer.
- Defaults / parameters (Section 3.2):
- Reported results averaged over 10 runs.
- Each run samples 40 outputs independently (40 reasoning paths).
- Diversity controlled via sampling temperature (robust across a range; temperature \(T = 0\) is deterministic/greedy).
- Empirical gains (Abstract; Tables 2–3): SC improves CoT accuracy over greedy decoding by the following absolute margins:
- GSM8K: +17.9%
- SVAMP: +11.0%
- AQuA: +12.2%
- StrategyQA: +6.4%
- ARC-challenge: +3.9%
- Design rationale: Complex problems admit multiple reasoning paths to a unique answer; agreement across diverse paths increases confidence; SC avoids greedy local-optima/repetition and reduces single-sample stochasticity; no fine-tuning, no verifier/reranker, no extra annotation (“self-ensemble” on one model).
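The sample-then-vote procedure summarized above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `sample_fn` is a hypothetical stand-in for one temperature-sampled LM call that returns a `(reasoning, answer)` pair, and the toy sampler and its answer weights are invented so the block runs without a model.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (reasoning path r_i, final answer a_i)


def self_consistency(sample_fn: Callable[[str], Sample],
                     prompt: str,
                     m: int = 40) -> Tuple[str, List[Sample]]:
    """Draw m samples and return the most frequent final answer."""
    samples = [sample_fn(prompt) for _ in range(m)]
    # Marginalizing out reasoning paths reduces to counting answers.
    votes = Counter(answer for _, answer in samples)
    best_answer, _ = votes.most_common(1)[0]
    return best_answer, samples


def make_toy_sampler(seed: int = 0) -> Callable[[str], Sample]:
    """Hypothetical sampler: a fake LM that is right most of the time."""
    rng = random.Random(seed)

    def toy(prompt: str) -> Sample:
        answer = rng.choices(["18", "26", "14"], weights=[0.6, 0.25, 0.15])[0]
        return f"(some reasoning ending in {answer})", answer

    return toy


answer, samples = self_consistency(make_toy_sampler(), "Q: ...", m=40)
```

In a real setup the only extra step is parsing the final answer out of each sampled completion. That parser is task-specific and is left out here.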
📄 Self-Consistency (Sample-and-Marginalize) for CoT Decoding
Paper · source
Self-consistency selection objective: marginalize/aggregate over sampled rationales → majority-vote (or weighted) answer selection; explicit decoding procedure + notation.
Key content
- Core procedure (Figure 1; Section 2):
- Prompt LM with chain-of-thought exemplars.
- Sample a diverse set of outputs (reasoning paths) from the decoder (not greedy).
- Marginalize out reasoning paths and aggregate final answers to pick the most consistent answer.
- Notation (Section 2): sample \(m\) candidate outputs indexed by \(i = 1, \dots, m\).
- Final answers \(a_i \in \mathcal{A}\) (fixed answer set).
- Latent reasoning path \(r_i\) (token sequence) leading to \(a_i\) (reasoning optional: \(r_i \rightarrow a_i\)).
- Self-consistency objective (majority vote; Section 2): \[ \hat a = \arg\max_{a} \sum_{i=1}^{m} \mathbf{1}(a_i = a) \]
- Optional probability-weighted aggregation (Eq. 1, length-normalized): \[ P(r_i, a_i \mid \text{prompt}, q) = \exp\left(\frac{1}{K} \sum_{k=1}^{K} \log P(t_k \mid \text{prompt}, q, t_{1:k-1})\right) \] where \(t_k\) is the \(k\)-th token in \((r_i, a_i)\) and \(K\) is the number of tokens.
- Empirical aggregation comparison (Table 1, PaLM-540B): majority vote (“Unweighted sum”) strong: GSM8K 74.4, MultiArith 99.3, AQuA 48.3, SVAMP 86.6, CSQA 80.7, ARC-c 88.7.
- Main gains vs greedy CoT (Tables 2–3): PaLM-540B GSM8K 56.5→74.4 (+17.9); AQuA 35.8→48.3 (+12.5); SVAMP 79.0→86.6 (+7.6); StrategyQA 75.3→81.6 (+6.3); ARC-c 85.2→88.7 (+3.5).
- Defaults (Section 3.1): typically 40 sampled outputs per run; sampling: UL2/LaMDA (T=0.5, k=40); PaLM (T=0.7, k=40); GPT-3 (T=0.7) (no top-k).
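As a rough illustration of the optional probability-weighted aggregation, the sketch below scores each sample by the exponential of its mean token log-probability (the length-normalized form of Eq. 1 as summarized above) and sums those scores per answer. The function names and the sample log-probabilities are made up for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def normalized_prob(token_logprobs: List[float]) -> float:
    """exp((1/K) * sum_k log P(t_k | ...)) for one sampled output."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def weighted_vote(samples: List[Tuple[str, List[float]]]) -> str:
    """samples: (final answer, token log-probs). Sum normalized probs per answer."""
    scores: Dict[str, float] = defaultdict(float)
    for answer, logprobs in samples:
        scores[answer] += normalized_prob(logprobs)
    return max(scores, key=lambda a: scores[a])


# Invented numbers: two moderately confident paths agree on "18",
# one very confident path says "26".
toy_samples = [
    ("18", [-0.2, -0.1, -0.3]),
    ("18", [-0.5, -0.4]),
    ("26", [-0.05, -0.05, -0.05, -0.05]),
]
chosen = weighted_vote(toy_samples)
```

Here the two agreeing paths together outweigh the single high-probability path, so the weighted vote matches what a plain majority vote would pick.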
📄 Self-Consistency for Chain-of-Thought (ICLR 2023)
Paper · source
Benchmark tables/ablations comparing self-consistency vs greedy CoT across reasoning datasets + sampling settings + accuracy gains.
Key content
- Method (Self-Consistency; “sample-and-marginalize”, Sec. 2):
- Prompt LM with CoT exemplars.
- Sample \(m\) diverse outputs \((r_i, a_i)\) from decoder (reasoning path \(r_i\), final answer \(a_i \in \mathcal{A}\)).
- Aggregate by marginalizing out \(r_i\): majority vote
\[ a^* = \arg\max_{a \in \mathcal{A}} \sum_{i=1}^{m} \mathbf{1}(a_i = a) \]
- Optional probability-weighted aggregation (Eq. 1, length-normalized): \[ P(r_i, a_i \mid \text{prompt}, q) = \exp\Big(\frac{1}{K} \sum_{k=1}^{K} \log P(t_k \mid \text{prompt}, q, t_{<k})\Big) \] where \(t_k\) are output tokens and \(K\) is the number of tokens in \((r_i, a_i)\). Finding: majority vote ≈ normalized weighted sum; unnormalized weighting performs worse.
- Sampling defaults (Sec. 3.1): typically 40 samples, averaged over 10 runs.
- UL2-20B & LaMDA-137B: temperature \(T = 0.5\), top-\(k\) with \(k = 40\)
- PaLM-540B: \(T = 0.7\), top-\(k\) with \(k = 40\)
- GPT-3: \(T = 0.7\), no top-\(k\)
- Key empirical gains (Tables 2–3; absolute accuracy):
- PaLM-540B: GSM8K 56.5→74.4 (+17.9); SVAMP 79.0→86.6 (+7.6); AQuA 35.8→48.3 (+12.5); ARC-c 85.2→88.7 (+3.5); StrategyQA 75.3→81.6 (+6.3).
- GPT-3 code-davinci-002: GSM8K 60.1→78.0 (+17.9); SVAMP 75.8→86.8 (+11.0); AQuA 39.8→52.0 (+12.2); StrategyQA 73.4→79.8 (+6.4); ARC-c 83.6→87.5 (+3.9).
- Ablations/Comparisons: More sampled paths improves accuracy (Fig. 2). Self-consistency beats sample-and-rank (Fig. 3) and beam search (Table 6); beam search reduces diversity.
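To make the sampling settings concrete, here is a small sketch of temperature plus top-k sampling for a single token. It is a generic textbook formulation written in plain Python, not the decoder of any model named above. The logits dictionary and the function name are assumptions for illustration.

```python
import math
import random
from typing import Dict, Optional


def sample_token(logits: Dict[str, float],
                 temperature: float = 0.7,
                 top_k: Optional[int] = 40,
                 rng: Optional[random.Random] = None) -> str:
    """Sample one token after top-k truncation and temperature scaling."""
    rng = rng or random.Random()
    # Keep only the k highest-logit tokens (all of them when top_k is None).
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if temperature == 0:
        return ranked[0][0]  # temperature 0 degenerates to greedy decoding
    # Softmax over logits / T, shifted by the max for numerical stability.
    scaled = [value / temperature for _, value in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = {"11": 2.0, "12": 1.5, "9": 0.1}  # invented next-token logits
token = sample_token(toy_logits, temperature=0.7, top_k=40, rng=random.Random(0))
```

Lower temperatures sharpen the distribution toward the top token, and a smaller `top_k` removes low-probability tokens entirely. Both reduce the diversity that self-consistency relies on.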
📄 Tree of Thoughts (ToT) = deliberate search over “thoughts”
Paper · source
Algorithmic ToT procedure: generate thoughts → evaluate states (value/vote) → control search (BFS/DFS with pruning/backtracking)
Key content
- Problem framing (Sec. 3): Solve by search over a tree.
- State/node = input + sequence of thoughts so far (a partial solution).
- Thought = coherent text unit (size chosen so it’s (i) generatable/diverse and (ii) evaluable).
- ToT instantiation requires 4 choices (Sec. 3):
- Thought decomposition (e.g., Crosswords: a few words; Game of 24: one equation line; Creative writing: a plan paragraph).
- Thought generation from state s: propose/sample multiple candidate next thoughts (i.i.d. sampling works well when thought space is rich).
- State evaluation heuristic over frontier states S:
- Value each state: \(V(s)\) via LM prompt → scalar (1–10) or labels (e.g., sure/maybe/impossible).
- Vote across states: compare candidates and pick most promising (aggregate multiple votes for robustness).
- Search algorithm:
- BFS (Alg. 1): keep top-b states per depth; prune early.
- DFS (Alg. 2): follow most promising; prune if value below threshold; backtrack to parent to explore alternatives.
- Empirical results (Game of 24, Sec. 4.1; 100 hard games):
- IO 7.3%; CoT 4.0%; CoT-SC (k=100) 9.0%
- ToT BFS b=1: 45%, b=5: 74%
- Best-of-100: IO 33%, CoT 49% (still < ToT b=5).
- Defaults/params used in experiments: GPT-4 chat completion, temperature 0.7. Game of 24 ToT: 3 thought steps; value labels sure/maybe/impossible; sample values multiple times per thought. Crosswords DFS: max 100 search steps; depth ≤ 10 (no overwriting filled letters).
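The BFS variant described above can be sketched generically: expand every frontier state, score the candidates, keep the top `b`. This is a toy under stated assumptions, not the paper's implementation. The thought generator and evaluator here are hand-written functions standing in for LM prompts, the task (pick three numbers from 1 to 3 that sum to 7) is invented, and the numeric scores attached to the sure/maybe/impossible labels are placeholders.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]  # the sequence of thoughts so far

# Assumed label-to-score mapping; the numbers are illustrative only.
LABEL_SCORE = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}


def aggregate_labels(labels: Sequence[str]) -> float:
    """Average several sampled evaluator labels into one scalar value."""
    return sum(LABEL_SCORE[label] for label in labels) / len(labels)


def tot_bfs(root: State,
            generate: Callable[[State], List[int]],
            evaluate: Callable[[State], float],
            steps: int,
            b: int) -> List[State]:
    """Breadth-first search keeping the b best states at each depth."""
    frontier = [root]
    for _ in range(steps):
        candidates = [s + (t,) for s in frontier for t in generate(s)]
        candidates.sort(key=evaluate, reverse=True)  # stable sort
        frontier = candidates[:b]                    # prune early
    return frontier


TARGET, STEPS = 7, 3  # toy task: three picks from {1, 2, 3} summing to 7


def toy_generate(state: State) -> List[int]:
    return [1, 2, 3]


def toy_label(state: State) -> str:
    remaining = TARGET - sum(state)
    steps_left = STEPS - len(state)
    if not steps_left <= remaining <= 3 * steps_left:
        return "impossible"
    return "sure" if steps_left == 0 else "maybe"


def toy_evaluate(state: State) -> float:
    # A real evaluator would sample several LM labels; one label suffices here.
    return aggregate_labels([toy_label(state)])


narrow = tot_bfs((), toy_generate, toy_evaluate, steps=STEPS, b=1)
wide = tot_bfs((), toy_generate, toy_evaluate, steps=STEPS, b=5)
```

With a noisy LM evaluator, a larger `b` keeps more partial solutions alive and is therefore more tolerant of a single bad judgment, at the cost of more model calls per depth.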
📄 Tree of Thoughts (ToT) — deliberate search over “thoughts”
Paper · source
Empirical ToT gains (Game of 24, Creative Writing, Crosswords) + ToT search components (generate/evaluate/search; value/vote).
Key content
- Core formulation (Sec. 3): Problem solving as search over a tree.
- State \(s\): input + sequence of thoughts so far (partial solution).
- Thought: coherent text unit (size chosen per task: equation line / plan paragraph / crossword word).
- Thought generation (Sec. 3): from state \(s\), generate candidates \(T = \{t_i\}\) via LM prompting (i.i.d. sampling or sequential proposals conditioned on \(s\)).
- State evaluation heuristics (Sec. 3):
- Value: \(V(s)\) from a value prompt → scalar (e.g., 1–10) or labels (e.g., sure/maybe/impossible) mapped to numeric scores; can sample multiple times and aggregate.
- Vote: given frontier \(S\), sample votes to pick most promising state: \(\text{Vote}(S) \rightarrow s^*\).
- Search algorithms (Sec. 3):
- BFS (Alg. 1): keep top \(b\) states per depth step (beam-like). Used when depth is small and early pruning helps (Game of 24, Creative Writing).
- DFS (Alg. 2): expand best-looking state; prune if \(V(s) < v_{\text{th}}\); backtrack on prune or completion. Used for Crosswords; step budget 100.
- Empirical results, Game of 24 (Sec. 4.1, 100 hard games): IO 7.3%; CoT 4.0%; CoT-SC (k=100) 9.0%; ToT BFS b=1: 45%; ToT BFS b=5: 74%; IO+Refine (k=10) 27%; IO best-of-100 33%; CoT best-of-100 49%. Setup: 3 ToT steps (3 intermediate equations); evaluator labels sure/maybe/impossible; temperature 0.7.
- Creative Writing (Sec. 4.2, 100 inputs): GPT-4 coherency score (1–10, avg of 5 evals): ToT 7.56 vs IO 6.19 vs CoT 6.93; humans prefer ToT over CoT 41/100 (CoT over ToT 21/100). ToT: depth 2 (plan→passage), 5 votes each step, breadth limit \(b = 1\).
- Crosswords (Sec. 4.3): ToT DFS improves letter/word/game metrics; solves 4/20 games; oracle “+best state” solves 7/20; ablations show -prune worse, -backtrack word success 25%.
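The DFS variant (expand the best-looking child, prune below a threshold, backtrack, stop at a step budget) can be sketched as follows. The search function is a generic illustration and not the paper's Alg. 2 verbatim. The toy task, the deliberately weak evaluator, and all function names are assumptions chosen so that backtracking actually occurs.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]


def tot_dfs(root: State,
            generate: Callable[[State], List[int]],
            evaluate: Callable[[State], float],
            is_goal: Callable[[State], bool],
            v_th: float,
            max_steps: int = 100) -> Optional[State]:
    """Depth-first search with value-threshold pruning and a step budget."""
    steps = 0  # number of child states visited so far

    def visit(state: State) -> Optional[State]:
        nonlocal steps
        if is_goal(state):
            return state
        scored = [(evaluate(state + (t,)), state + (t,)) for t in generate(state)]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # most promising first
        for value, child in scored:
            if steps >= max_steps:
                return None          # budget exhausted
            steps += 1
            if value < v_th:
                continue             # prune this subtree
            found = visit(child)
            if found is not None:
                return found
            # Dead end below this child: backtrack and try the next sibling.
        return None

    return visit(root)


TARGET, DEPTH = 7, 3  # toy task: three picks from {1, 2, 3} summing to 7


def toy_generate(state: State) -> List[int]:
    return [] if len(state) == DEPTH else [1, 2, 3]


def toy_evaluate(state: State) -> float:
    # Weak heuristic on purpose: only rules out sums that already overshoot.
    return 1.0 if sum(state) <= TARGET else 0.0


def toy_is_goal(state: State) -> bool:
    return len(state) == DEPTH and sum(state) == TARGET


solution = tot_dfs((), toy_generate, toy_evaluate, toy_is_goal, v_th=0.5)
too_small = tot_dfs((), toy_generate, toy_evaluate, toy_is_goal, v_th=0.5, max_steps=5)
```

Because the toy evaluator cannot tell promising prefixes apart, the search walks into several dead ends and backtracks before reaching a solution. A tight step budget cuts it off first.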
📖 Completions API — decoding controls & reproducibility
Reference Doc · source
Parameter-level semantics for multi-sample decoding + reproducibility controls (n, best_of, logprobs, temperature, top_p, max_tokens, stop, seed)
Key content
- Endpoint: `POST /completions` creates a completion for the provided `prompt` + parameters; returns a Completion object (or a sequence if streamed).
- Model (`model`): string ID (examples listed: `"gpt-3.5-turbo-instruct"`, `"davinci-002"`, `"babbage-002"`).
- Prompt (`prompt`) types: string | array of strings | array of token IDs (numbers) | array of token arrays. If omitted, the model generates as if from the start of a new document; `<|endoftext|>` is the training-time document separator.
- Multi-sample decoding:
- `n` (min 1, max 128): number of completions to generate per prompt; increases token usage.
- `best_of` (min 0, max 20): generates `best_of` candidates server-side and returns the single best by highest log probability per token. Cannot be streamed.
- Constraint: when used together, `best_of` must be > `n`; `best_of` = candidates, `n` = returned.
- Token/probability controls:
- `max_tokens` (min 0): max generated tokens; prompt_tokens + max_tokens ≤ model context length.
- `temperature`: range [0, 2]; higher = more random, lower = more deterministic. Recommendation: change `temperature` OR `top_p`, not both.
- `top_p`: range [0, 1], nucleus sampling: consider tokens within `top_p` probability mass (e.g., 0.1 → top 10% mass).
- `logprobs` (min 0, max 5): return logprobs for the `logprobs` most likely tokens plus the chosen token (up to `logprobs`+1 entries).
- Stopping: `stop`: string or array (up to 4 sequences); returned text excludes the stop sequence. Not supported with reasoning models `o3` and `o4-mini`.
- Reproducibility: `seed` (int64): best-effort deterministic sampling with same seed + params; not guaranteed, so monitor `system_fingerprint`.
- Streaming: `stream: true` sends SSE token events; terminates with `data: [DONE]`. `best_of` disables streaming.
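As a hedged sketch of how these parameters combine for multi-sample (self-consistency style) decoding, the helper below assembles a request body and checks it against the ranges and constraints summarized in this entry. It makes no network call. The function name, the defaults, and the choice to reject out-of-range values client-side are assumptions for illustration, and the limits are copied from the notes above rather than re-verified against the live API.

```python
from typing import Any, Dict, List, Optional, Union


def build_completion_request(prompt: str,
                             model: str = "gpt-3.5-turbo-instruct",
                             n: int = 1,
                             best_of: Optional[int] = None,
                             temperature: float = 0.7,
                             max_tokens: int = 256,
                             logprobs: Optional[int] = None,
                             seed: Optional[int] = None,
                             stop: Optional[Union[str, List[str]]] = None,
                             stream: bool = False) -> Dict[str, Any]:
    """Build a JSON-serializable body for POST /completions (hypothetical helper)."""
    if not 1 <= n <= 128:
        raise ValueError("n must be in [1, 128]")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be in [0, 2]")
    if logprobs is not None and not 0 <= logprobs <= 5:
        raise ValueError("logprobs must be in [0, 5]")
    if isinstance(stop, list) and len(stop) > 4:
        raise ValueError("at most 4 stop sequences")
    if best_of is not None:
        if not 0 <= best_of <= 20:
            raise ValueError("best_of must be in [0, 20]")
        if best_of <= n:
            raise ValueError("best_of must be greater than n")
        if stream:
            raise ValueError("best_of cannot be combined with streaming")
    body: Dict[str, Any] = {
        "model": model, "prompt": prompt, "n": n, "best_of": best_of,
        "temperature": temperature, "max_tokens": max_tokens,
        "logprobs": logprobs, "seed": seed, "stop": stop, "stream": stream,
    }
    # Omit unset optional fields so server-side defaults apply.
    return {key: value for key, value in body.items() if value is not None}


# 40 sampled completions for one CoT prompt, with a seed for best-effort repeatability.
request_body = build_completion_request("Q: ...\nA: Let's think step by step.",
                                        n=40, temperature=0.7, seed=1)
```

For self-consistency, `n` is the relevant knob because every sampled completion is returned for voting. `best_of` returns only the highest-scoring candidates, so it is a different selection rule.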
📖 Messages endpoint (Chat Completions → Messages list)
Reference Doc · source
Current API surface location for “Messages” related to Chat Completions, within the broader OpenAI API reference navigation (useful for grounding discussions of message-structured inputs and where message objects live in the API).
Key content
- Where “Messages” fits in the API reference (navigation path):
- Chat Completions → Chat Completions → includes operations for completions and a Messages subresource.
- The reference lists “List chat completions (…/chat/completions/subresources/messages/methods/list)” indicating a dedicated endpoint to list messages associated with chat completions.
- Related modern API surfaces (for reasoning/tooling discussions):
- The docs emphasize the Responses API as the primary modern surface (see “Responses API” section in the same reference tree), including:
- Responses methods: create/retrieve/delete/list input items/count input tokens/cancel/compact.
- Streaming events for Responses.
- Tools & function calling are documented as core concepts (linked from the same reference hub): “Function calling”, “Using tools”, “Structured output”, “Images and vision”, “Audio”.
- Key parameter names surfaced in the reference search/navigation (useful keywords to cite precisely): `response_format`, `parallel_tool_calls`, `reasoning_effort` (shown as suggested search terms in the API docs UI).
📖 OpenAI API “Advanced usage – parameter details” (404 index only)
Reference Doc · source
Doc navigation pointers for sampling/reasoning parameter semantics (target page missing)
Key content
- HTTP result: Target URL returns 404: Not Found (“Page not found”).
- Available on-page guidance: Use Docs search; suggested queries shown: `responses create`, `reasoning_effort`, `realtime`, `prompt caching`.
- Relevant doc locations surfaced by navigation (for parameter semantics elsewhere):
- Core concepts → Text generation: https://platform.openai.com/api/docs/guides/text
- Reasoning → Reasoning models: https://platform.openai.com/api/docs/guides/reasoning
- Reasoning → Reasoning best practices: https://platform.openai.com/api/docs/guides/reasoning-best-practices
- Run and scale → Streaming: https://platform.openai.com/api/docs/guides/streaming-responses
- Context management → Prompt caching: https://platform.openai.com/api/docs/guides/prompt-caching
- API Reference overview: https://platform.openai.com/api/reference/overview
- Migration pointer: “Responses API” guide: https://platform.openai.com/api/docs/guides/migrate-to-responses
- No equations / defaults / parameter definitions are present in the fetched content; it is purely a site navigation + error page.
📋 Source: https://github.com/arpg/tree-of-thought-llm
Source ·