Reasoning Models
Video (best)
- Andrej Karpathy — “Deep Dive into LLMs like ChatGPT”
- Why: Clear, high-signal overview of how modern LLMs are trained and used, including RLHF and inference-time behavior that connects to “reasoning” and test-time compute ideas.
- Level: Intermediate
Blog / Written explainer (best)
- Lilian Weng (OpenAI) — “LLM Powered Autonomous Agents”
- Link: https://lilianweng.github.io/posts/2023-06-23-agent/
- Why: Strong conceptual grounding for reasoning-like behaviors in LLM systems (planning, reflection, tool use), and how inference-time scaffolding changes capabilities.
- Level: Intermediate
Deep dive
- Lilian Weng (OpenAI) — “Reward Hacking in Reinforcement Learning”
- Link: https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
- Why: One of the clearest end-to-end explainers of RLHF-style training loops and reward modeling; useful background for “reinforcement learning for reasoning,” outcome vs process supervision, and reward model design.
- Level: Intermediate–Advanced
- OpenAI — “Learning to summarize with human feedback”
- Link: https://openai.com/research/learning-to-summarize-with-human-feedback
- Why: Canonical, readable RLHF case study (reward modeling + policy optimization) that transfers directly to reasoning-focused RL setups.
- Level: Intermediate
Original paper
- Ouyang et al. (OpenAI, 2022) — “Training language models to follow instructions with human feedback” (InstructGPT)
- Link: https://arxiv.org/abs/2203.02155
- Why: Foundational paper for reward modeling + RL fine-tuning; core prerequisite for understanding later “reasoning model” training recipes.
- Level: Advanced
- Cobbe et al. (OpenAI, 2021) — “Training Verifiers to Solve Math Word Problems”
- Link: https://arxiv.org/abs/2110.14168
- Why: Directly targets verification at test time (verifier/reranker) and connects to test-time compute scaling via sampling + selection.
- Level: Advanced
Code walkthrough
- Hugging Face TRL — “TRL (Transformer Reinforcement Learning)”
- Link: https://github.com/huggingface/trl
- Why: Widely used, practical RLHF/RLAIF tooling (PPO/DPO-style training, reward modeling utilities) suitable for implementing outcome-reward training pipelines.
- Level: Intermediate–Advanced
- CarperAI — “trlx”
- Link: https://github.com/CarperAI/trlx
- Why: Another established RLHF training codebase; useful for seeing end-to-end reward model + policy optimization in practice.
- Level: Advanced
Coverage notes
- Strong: RLHF fundamentals; reward modeling; verifier-based selection; practical RL tooling (TRL/trlx).
- Weak: Branded reasoning models (o1, o3, DeepSeek-R1) and their proprietary details; “thinking tokens” as an explicit mechanism.
- Gap: High-confidence primary sources that explicitly define or standardize “process reward models” vs “outcome reward models” for reasoning, and authoritative public docs for o1/o3/DeepSeek-R1 and “reasoning traces” policies.
Additional Resources for Tutor Depth
13 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 DeepSeek-R1 training recipe + benchmark comparisons (o1-level)
Paper · source
DeepSeek-R1 technical report: RL/distillation pipeline, benchmark tables vs OpenAI o1, reasoning-trace formatting/rewards, test-time compute scaling.
Key content
- Models & goal (Abstract/§1): DeepSeek-R1-Zero = pure RL on DeepSeek-V3-Base (no SFT). DeepSeek-R1 = multi-stage pipeline with cold-start SFT + RL + SFT + RL; reported “comparable to OpenAI-o1-1217” on reasoning tasks.
- RL algorithm (GRPO, §2.2.1 Eq. 1–3): Group Relative Policy Optimization optimizes policy using group-sampled outputs; baseline estimated from group scores (no critic model). Uses advantage \(A_i\) computed from rewards within each sampled group (Eq. 3). (Hyperparameters \(\epsilon, \beta\) appear in objective Eq. 1–2.)
- Reward design (rule-based, §2.2.2):
- Accuracy reward: verifiable correctness (e.g., boxed math answer; compiler + tests for LeetCode).
- Format reward: enforce reasoning between <think>...</think> tags.
- Rationale: avoid neural outcome/process reward models due to reward hacking risk and added complexity/resources.
- DeepSeek-R1 4-stage pipeline (§2.3):
- Cold-start SFT: “thousands” of long-CoT samples; readable format |special_token|<reasoning_process>|special_token|<summary>.
- Reasoning RL: add language consistency reward (proportion of target-language words in CoT); final reward = accuracy + language consistency.
- Rejection sampling → SFT: ~600k reasoning samples + ~200k non-reasoning (writing/QA/etc.) = ~800k; SFT 2 epochs.
- RL all scenarios: rule-based rewards for reasoning; preference reward models for general data; helpfulness judged on final summary, harmlessness on entire response.
- Test-time compute scaling (§2.2.4/§3): long CoTs (hundreds–thousands tokens); benchmark outputs capped at 32,768 tokens.
- Key empirical results:
- R1-Zero AIME 2024: pass@1 15.6% → 71.0% after RL; cons@64 86.7% (matches/exceeds o1-0912).
- R1-Zero vs o1-0912 (Table §2.2.4): AIME 2024 pass@1 71.0 vs 74.4; AIME 2024 cons@64 86.7 vs 83.3; MATH-500 pass@1 95.9 vs 94.8; GPQA Diamond pass@1 73.3 vs 77.3; Codeforces rating 1444 vs 1843.
- DeepSeek-R1 vs OpenAI-o1-1217 (Table §3.1): AIME 2024 pass@1 79.8 vs 79.2; MATH-500 97.3 vs 96.4; GPQA Diamond 71.5 vs 75.7; LiveCodeBench 65.9 vs 63.4; Codeforces rating 2029 vs 2061.
- Distillation (Table §3.2): Distill-Qwen-32B AIME 2024 pass@1 72.6 (cons@64 83.3), MATH-500 94.3, GPQA Diamond 62.1, LiveCodeBench 57.2, Codeforces rating 1691.
- Distill vs RL-on-small (Table §4.1): RL-trained Qwen-32B (R1-Zero-Qwen-32B) AIME 47.0 vs Distill-Qwen-32B 72.6 → distillation > RL on small base at similar scale.
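The rule-based rewards and the group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not the report's implementation: the regular expressions, the exact-string answer match, the equal weighting of the two rewards, and the use of the population standard deviation are all assumptions made for the example.

```python
import re
import statistics

# Assumed patterns: a non-empty <think> block followed by an answer, and a LaTeX boxed answer.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion opens with a <think>...</think> block, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last boxed answer equals the reference string, else 0.0."""
    answers = BOXED_PATTERN.findall(completion)
    return 1.0 if answers and answers[-1].strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Assumed combination: unweighted sum of accuracy and format rewards."""
    return accuracy_reward(completion, reference) + format_reward(completion)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Every sample scored the same, so no sample is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# One group of sampled completions for a single prompt whose reference answer is "4".
group = [
    "<think>2+2=4</think> \\boxed{4}",
    "<think>2+2=5</think> \\boxed{5}",
    "\\boxed{4}",
    "<think>guess</think> \\boxed{3}",
]
rewards = [total_reward(c, "4") for c in group]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate critic model is needed: a completion is pushed up only to the extent that it beats its siblings sampled for the same prompt.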
📄 Inference-time compute ↔ adversarial robustness (o1)
Paper · source
Quantitative evidence that increasing inference-time compute improves robustness/performance; methods + metrics (e.g., StrongREJECT goodness@0.1), attack/defense scaling curves, and human red-teaming table.
Key content
- Adversarial training objective (context, Intro):
\[ \min_f \mathbb{E}_{(x,y)\sim D}\left[\max_{t\in T} L(f(t(x)),y)\right] \] where \(T\) is the set of allowed perturbations; contrasted with this paper’s approach: no adversarial training, only more inference-time compute (“thinking time”).
- Core measurement protocol (Sec. 2): For each prompt \(x\), sample multiple model outputs at varying defender inference-time compute; compute attack success probability vs (a) attacker resources and (b) defender compute. Models: o1-preview, o1-mini (plus o1-v for vision).
- Tasks + attacker goals (Table 1):
- Math (2-digit add/mult; MATH): force output 42, correct+1, or correct×7.
- Safety: StrongREJECT (60 filtered policy-violating prompts; 35 jailbreaks), Misuse Prompts, Rule Following (unambiguous internal spec subset).
- Agentic: AdvSimpleQA prompt injection; adversary wants output “COMPROMISED”.
- Vision: ImageNet-A; Attack-Bard (transfer, \(\epsilon=16/255\), \(L_\infty\)); plus Attack-Bard-clean.
- StrongREJECT metric (Sec. 3.2, Fig. 7): goodness@0.1 = for each misuse prompt, apply all jailbreaks; score responses with StrongREJECT grader; take worst 10% average; then average over prompts. Robustness increases nearly monotonically with compute.
- Key empirical patterns (Figs. 2,3,12–15):
- For unambiguous tasks (math, AdvSimpleQA), for fixed attacker resources, attack success tends toward 0 as compute increases.
- Soft-token attack procedure (Sec. 3.4): optimize soft embeddings on train split to maximize logprob of sampled CoT + adversary answer; evaluate on held-out test; unconstrained norms can grow >1000× typical embedding norm → use norm-capped projection to average embedding norm every few steps.
- Human red-teaming (Sec. 3.7, Table 2; o1-preview; 40 red-teamers; 5 compute levels):
- Avg attempts needed for success (↑ better): 11.0, 16.8, 15.6, 15.9, 22.6 (levels 1→5).
- Transfer success on 80 prompts (↓ better): 36%, 30%, 21%, 20%, 19%.
- New compute-targeting attacks: “Think Less” reduces model compute (Sec. 3.8); “Nerd sniping” shows top 5% longest-compute traces can have higher attack success than median (Sec. 3.9, Fig. 18).
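The goodness@0.1 aggregation described above (worst 10% of jailbreak scores per prompt, then averaged over prompts) can be sketched as follows. The function name, the input layout, and the rounding rule for the size of the "worst 10%" slice are assumptions; the paper's grader and exact rounding are not reproduced here.

```python
import math


def goodness_at_fraction(scores_by_prompt: dict[str, list[float]], fraction: float = 0.1) -> float:
    """Average over prompts of the mean of each prompt's lowest-scoring `fraction` of responses.

    Each list holds one goodness score in [0, 1] per jailbreak (higher means safer).
    """
    per_prompt = []
    for scores in scores_by_prompt.values():
        # Assumption: round the slice size up and always keep at least one score.
        k = max(1, math.ceil(fraction * len(scores)))
        worst = sorted(scores)[:k]
        per_prompt.append(sum(worst) / k)
    return sum(per_prompt) / len(per_prompt)


# Toy data: 35 jailbreak scores per prompt, so the worst slice holds 4 scores.
toy_scores = {
    "prompt_a": [0.0, 0.0, 1.0, 1.0] + [1.0] * 31,
    "prompt_b": [1.0] * 35,
}
toy_goodness = goodness_at_fraction(toy_scores)
```

Focusing on the worst slice makes the metric sensitive to the few jailbreaks that succeed, which a plain average over all jailbreaks would dilute.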
📄 PRM for reflective (long-CoT) math reasoning
Paper · source
Formal PRM training objective + reflective-step labeling rules (Error Propagation/Cessation) + evaluation procedures (BoN vs step-search) with concrete results.
Key content
- PRM vs ORM (Section 2):
- ORM scores whole solutions via final answer.
- PRM scores individual steps to provide granular intermediate feedback for search/RL.
- Reflective long-CoT labeling problem (Section 1, 3): Traditional PRM datasets truncate incorrect solutions at the first error, assuming all later steps wrong—fails when models self-correct after mistakes.
- New step-label rules (Section 4.2):
- Error Propagation: if earlier steps are incorrect and current step builds on them without correction/new approach ⇒ label incorrect.
- Error Cessation: if earlier steps are incorrect but current step corrects them or starts a new error-free approach ⇒ label correct.
- LLM judge annotation (Section 4.3, Appx B/E): Incorporate the above rules into a judge prompt; reported step-annotation accuracy: o1 = 0.963, claude-3.5-sonnet = 0.726, gpt-4o-2024-08-06 = 0.668 (Table 8).
- PRM training objective (Eq. 1): binary step classification with cross-entropy over steps
\[ L_{\text{PRM}} = -\sum_{i=0}^{K}\left[ y_i\log \hat y_i+(1-y_i)\log(1-\hat y_i)\right] \] where \(K\) = number of steps; \(y_i\) = gold label for step \(s_i\); \(\hat y_i=\text{PRM}(\text{prompt}, s_{\le i})\) = predicted probability/score for step \(s_i\).
- Evaluation metrics (Section 5.1.3):
- PRM@N (Best-of-N): pick best among N candidates using final-step score.
- PRM@N-step (Online search): at each step sample N continuations, choose top-scoring step to continue.
- Key results (Table 2): “Ours” PRM: MATH500 PRM@64 = 0.816, PRM@8-step = 0.750; AIME2024 PRM@64 = 0.267, PRM@8-step = 0.167; step-level F1 = 0.828 (Precision 0.850, Recall 0.806).
- Hyperparameters: Generator SFT: lr 1e-5, epochs 3, batch 24, max len 16384 (Table 6). PRM training: lr 1e-6, epochs 1, batch 256, max len 10240 (Table 7).
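A small numeric sketch may help connect the step-level objective to Best-of-N selection. It assumes gold step labels in {0, 1} and predicted step probabilities in (0, 1), writes the objective as a summed binary cross-entropy to be minimized, and represents each candidate solution only by its list of per-step PRM scores; the function names are illustrative.

```python
import math


def prm_step_loss(gold_labels: list[int], predicted_probs: list[float]) -> float:
    """Summed binary cross-entropy between gold step labels and predicted step scores."""
    if len(gold_labels) != len(predicted_probs):
        raise ValueError("need one predicted probability per labeled step")
    eps = 1e-12  # clamp to keep log() finite at 0 or 1
    loss = 0.0
    for y, p in zip(gold_labels, predicted_probs):
        p = min(max(p, eps), 1.0 - eps)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


def best_of_n(candidate_step_scores: list[list[float]]) -> int:
    """Best-of-N: return the index of the candidate with the highest final-step score."""
    return max(range(len(candidate_step_scores)), key=lambda i: candidate_step_scores[i][-1])


# Three steps labeled correct, correct, incorrect, with fairly confident predictions.
example_loss = prm_step_loss([1, 1, 0], [0.9, 0.8, 0.1])

# Three candidate solutions; only the last step's score decides the selection.
chosen = best_of_n([[0.9, 0.2], [0.5, 0.7], [0.6, 0.6]])
```

The step-search variant differs only in where the comparison happens: instead of scoring N finished solutions once, it compares N candidate continuations at every step and keeps the top-scoring one.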
📄 Self-Consistency (SC) for Chain-of-Thought Decoding
Paper · source
Empirical accuracy gains from test-time sampling + majority/consistency selection over multiple CoT reasoning paths; ablations on number of sampled paths.
Key content
- Core idea (Self-Consistency decoding): Replace greedy decoding in Chain-of-Thought (CoT) prompting with sampling multiple reasoning paths and selecting the most consistent final answer by marginalizing out the reasoning traces (Figure 1; Section 2).
- Procedure (Figure 1 / Section 2):
- Prompt LM with CoT exemplars (or zero-shot “let’s think step by step”).
- Sample a diverse set of outputs from the decoder to obtain pairs ((r_i, a_i)), where (r_i) is the reasoning path (tokens) and (a_i) is the final answer.
- Aggregate by choosing the answer with highest agreement across samples (majority vote / “most consistent answer”).
- Design rationale: Complex reasoning problems admit multiple valid reasoning paths leading to a unique correct answer; correct paths tend to agree more on the final answer than incorrect ones. SC avoids greedy decoding’s local optimality/repetitiveness and reduces variance vs a single sampled decode.
- Defaults / parameters (Section 3.2): Results averaged over 10 runs; each run samples 40 outputs (“40 reasoning paths”).
- Key empirical gains (Abstract / Tables 2–3): SC boosts CoT accuracy by:
- GSM8K: +17.9%
- SVAMP: +11.0%
- AQuA: +12.2%
- StrategyQA: +6.4%
- ARC-Challenge: +3.9%
- Sampling ablation (Figure 2): Increasing sampled paths improves accuracy; ~40 paths consistently better than fewer.
- Model coverage: Demonstrated across UL2-20B, LaMDA-137B, PaLM-540B, GPT-3 175B; gains often larger at larger scale.
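The sample-then-vote procedure above fits in a few lines. In this sketch `sample_path` is a hypothetical stand-in for one temperature-sampled CoT decode that returns a (reasoning, answer) pair, and `toy_sampler` is invented purely so the example runs without a model.

```python
import random
from collections import Counter
from typing import Callable


def self_consistency(sample_path: Callable[[], tuple[str, str]], num_paths: int = 40) -> tuple[str, float]:
    """Sample reasoning paths, discard the reasoning, and return the majority answer and its vote share."""
    answers = [sample_path()[1] for _ in range(num_paths)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / num_paths


def make_toy_sampler(seed: int = 0) -> Callable[[], tuple[str, str]]:
    """Hypothetical sampler: returns the correct answer '18' about 60% of the time."""
    rng = random.Random(seed)

    def toy_sampler() -> tuple[str, str]:
        if rng.random() < 0.6:
            return ("16 - 3 - 4 = 9 eggs, 9 * 2 = 18", "18")
        return ("16 - 3 = 13, 13 * 2 = 26", rng.choice(["26", "14"]))

    return toy_sampler


majority_answer, vote_share = self_consistency(make_toy_sampler(), num_paths=40)
```

The vote marginalizes over reasoning paths: two samples count as agreeing whenever their final answers match, regardless of how different their intermediate steps are.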
📄 Step-level PRM for inference-time reasoning search (HGS-PRM)
Paper · source
Step-level (process) reward modeling objective + using PRM feedback to guide multi-step reasoning at inference time
Key content
- Core distinction (PRM vs ORM):
- Outcome Reward Model (ORM): provides a single reward for the final answer/trajectory outcome (sparse terminal feedback).
- Process-Supervised Reward Model (PRM): provides step-by-step feedback over a multi-step reasoning trace (dense intermediate feedback), trained to predict correctness/quality of intermediate steps (process supervision).
- Inference-time use (main contribution): PRM is not only for training (e.g., PPO / rejection sampling), but can be used during decoding to discern better solution paths for multi-step tasks (math, code).
- Algorithmic procedure (HGS-PRM):
- A heuristic greedy search that uses step-level PRM scores to guide which next reasoning step/path to expand, aiming to optimize the explored reasoning pathway (search guided by per-step reward signals rather than only final correctness).
- Empirical comparisons (as stated in source excerpt):
- The PRM-guided inference method improves over Chain-of-Thought (CoT) on GSM8K and MATH benchmarks (math reasoning).
- Similar improvements reported for code generation, using an automatically generated step-level reward dataset.
- Data generation for code PRM (workflow):
- Construct step-level reward data for coding tasks via automatic code mutation plus unit tests to label/score intermediate steps.
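A generic PRM-guided greedy search can be sketched as below. This is not the paper's exact HGS-PRM procedure, whose details are not given in the excerpt; it only illustrates the idea of choosing each next step by its step-level score. `propose_steps`, `score_step`, and `is_final` are hypothetical callables standing in for the generator, the PRM, and a stop check.

```python
from typing import Callable

ProposeFn = Callable[[str, list[str]], list[str]]
ScoreFn = Callable[[str, list[str]], float]


def prm_guided_greedy_search(
    question: str,
    propose_steps: ProposeFn,
    score_step: ScoreFn,
    is_final: Callable[[str], bool],
    max_steps: int = 8,
) -> list[str]:
    """Build a solution one step at a time, keeping the candidate step the scorer rates highest."""
    steps: list[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(question, steps)
        if not candidates:
            break
        # Score each candidate in the context of the steps chosen so far.
        best = max(candidates, key=lambda step: score_step(question, steps + [step]))
        steps.append(best)
        if is_final(best):
            break
    return steps


# Toy stand-ins so the sketch runs without a model.
def toy_propose(question: str, steps: list[str]) -> list[str]:
    table = {0: ["x = 3 * 4", "x = 3 + 4"], 1: ["answer: 12", "answer: 7"]}
    return table.get(len(steps), [])


def toy_score(question: str, steps: list[str]) -> float:
    return 1.0 if steps[-1] in {"x = 3 + 4", "answer: 7"} else 0.0


solution = prm_guided_greedy_search(
    "What is 3 + 4?", toy_propose, toy_score, lambda step: step.startswith("answer:")
)
```

An ORM could only rank the two finished solutions; the step-level scorer rejects the wrong branch at the first step, before any further tokens are spent on it.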
📖 Reasoning best practices (doc index only)
Reference Doc · source
Entry point for OpenAI “Reasoning best practices” guidance; includes related navigation targets for controlling reasoning behavior (effort/latency/cost) and handling reasoning traces.
Key content
- The fetched page content is a 404 “Page not found” response; no best-practice guidance, equations, empirical results, or parameter defaults are present in the retrieved text.
- The document shell exposes adjacent/related doc endpoints via navigation and search suggestions (useful as pointers during tutoring):
- Reasoning section links:
- Reasoning models: https://platform.openai.com/api/docs/guides/reasoning
- Reasoning best practices: https://platform.openai.com/api/docs/guides/reasoning-best-practices
- Search suggestions shown on the page (as keywords students may ask about):
- responses create, reasoning_effort, realtime, prompt caching
- The broader docs IA (information architecture) visible here indicates where operational controls likely live:
- Responses API migration guide: https://platform.openai.com/api/docs/guides/migrate-to-responses
- Streaming responses: https://platform.openai.com/api/docs/guides/streaming-responses
- Latency optimization and Cost optimization sections (for managing reasoning latency/cost tradeoffs).
📖 Reasoning models via Responses API (effort, tokens, summaries)
Reference Doc · source
How to use reasoning models with the Responses API; parameters to control reasoning behavior and how to request/suppress reasoning outputs.
Key content
- Model guidance (selection): Start with gpt-5.4 for most reasoning workloads; use gpt-5.4-pro for highest intelligence (more latency); gpt-5-mini / gpt-5-nano for lower cost/latency. Reasoning models “work better” with Responses API vs Chat Completions.
- Core control knob: reasoning: {"effort": <level>} guides how many reasoning tokens are generated before visible output. Supported values (model-dependent): none, minimal, low, medium, high, xhigh.
- Table (start here when…):
- none: lowest latency for extraction/routing/simple transforms
- low: small extra thinking improves reliability
- medium / high: planning, coding, synthesis, harder reasoning
- xhigh: only if evals justify extra latency/cost
- Defaults are model-dependent: gpt-5.4 defaults to none; older GPT‑5 models default to medium.
- Token accounting & context: Reasoning tokens are discarded from context after the response, but still consume context window and are billed as output tokens. Usage shows reasoning tokens at: usage.output_tokens_details.reasoning_tokens (example: reasoning_tokens: 1024).
- Cost/length limit: max_output_tokens caps (reasoning + final output) tokens.
- Incomplete handling: If context limit or max_output_tokens hit → status: "incomplete" and incomplete_details.reason: "max_output_tokens". Can happen before any visible output (cost incurred for input + reasoning).
- Practical buffer: Recommend reserving ≥ 25,000 tokens for reasoning+outputs when experimenting.
- Function calling continuity: Pass back reasoning items (plus tool call + tool outputs) across turns; easiest via previous_response_id or replaying prior output items.
- Stateless/ZDR: Include "reasoning.encrypted_content" in include to receive encrypted reasoning items for reuse.
- Reasoning summaries (not raw traces): Opt-in via reasoning.summary (e.g., "auto"). Summary appears in an output item of type "reasoning" under summary[].
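The parameters above can be assembled into a request and checked on the way back without touching the network. The sketch below only builds and inspects plain dictionaries; the helper names are hypothetical, the model name is taken from the notes above, and setting `store` to False for stateless use is an assumption about how that mode is requested rather than something stated in these notes.

```python
from typing import Any, Optional


def build_reasoning_request(
    prompt: str,
    model: str = "gpt-5.4",
    effort: str = "medium",
    max_output_tokens: int = 25_000,
    summary: Optional[str] = "auto",
    stateless: bool = False,
) -> dict[str, Any]:
    """Assemble keyword arguments for a Responses API call (hypothetical helper)."""
    reasoning: dict[str, Any] = {"effort": effort}
    if summary is not None:
        reasoning["summary"] = summary  # opt in to reasoning summaries
    request: dict[str, Any] = {
        "model": model,
        "input": prompt,
        "reasoning": reasoning,
        # The cap covers reasoning tokens plus visible output tokens.
        "max_output_tokens": max_output_tokens,
    }
    if stateless:
        request["store"] = False  # assumption: disable server-side storage
        request["include"] = ["reasoning.encrypted_content"]
    return request


def ran_out_of_tokens(response: dict[str, Any]) -> bool:
    """True if the response stopped because max_output_tokens was reached."""
    details = response.get("incomplete_details") or {}
    return response.get("status") == "incomplete" and details.get("reason") == "max_output_tokens"


def reasoning_token_count(response: dict[str, Any]) -> int:
    """Read usage.output_tokens_details.reasoning_tokens, defaulting to 0."""
    usage = response.get("usage") or {}
    return (usage.get("output_tokens_details") or {}).get("reasoning_tokens", 0)


request = build_reasoning_request("Plan a database migration.", effort="high", stateless=True)
```

With the official Python SDK the dictionary would typically be passed as `client.responses.create(**request)`; that call is left out here so the example runs offline. Checking `ran_out_of_tokens` before reading the output matters because a response can hit the cap during reasoning and return no visible text at all.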
📖 o1 System Card — Safety, CoT visibility, evals & mitigations
Reference Doc · source
Safety/faithfulness constraints, what CoT is shown vs hidden (summaries), evaluation methodology, and risk mitigations for o1.
Key content
- Training & alignment (Sections 1–2, 5.3):
- o1 family trained with large-scale reinforcement learning to reason using chain-of-thought; includes deliberative alignment: teaches models to explicitly reason through safety specs before answering.
- Data pipeline: diverse public + proprietary partnerships + in-house datasets; filtering to reduce PII; Moderation API + safety classifiers to exclude harmful/sensitive content (incl. CSAM).
- Deployment decision: CoT surfaced as summaries (Section 4.3.2):
- ChatGPT surfaces CoT summaries (not full CoT). For o1 launch, same summarizer as o1-preview/mini; no summaries for image-input results (at time of writing).
- Summarizer safety eval: summary introduced disallowed content when answer didn’t in 0.06% of completions; no regurgitation found in summaries on regurgitation evals.
- Instruction hierarchy to prevent developer-message jailbreaks (Section 4.2):
- Priority: system > developer > user; supervised on conflicts.
- Tutor jailbreak eval pass rates: o1 0.95 (system message) and 0.92 (developer message) vs GPT-4o 0.33 / 0.58.
- Key safety eval numbers (Section 4.1):
- Challenging refusal (not_unsafe): GPT-4o 0.713 vs o1 0.92.
- Hallucinations: SimpleQA accuracy 0.47 (o1) vs 0.38 (GPT-4o); hallucination rate 0.44 vs 0.61. PersonQA hallucination rate 0.20 vs 0.30.
- External red teaming findings (Section 4.4):
- Pairwise safety: o1 rated safer 59.75% vs GPT-4o 28.48% (tie 11.76%).
- Gray Swan Arena ASR: harmful text 6% (o1) vs ~3.5% (4o); harmful image-text 5% vs 4%; malicious code 5% vs 6% (o1’s longer detail increased severity once jailbroken).
- Preparedness Framework (Section 5):
- Deployment rule: only post-mitigation “medium” or below can be deployed.
- o1 risk ratings: Medium (CBRN, persuasion), Low (cybersecurity, autonomy). Preparedness evals are a lower bound; more scaffolding/rollouts can elicit more.
📋 Source: https://openai.com/index/o3-o4-mini-system-card/
Source ·
📋 Source: https://platform.openai.com/docs/api-reference/responses-streaming/response/reasoning
Source ·
🔍 o1 reasoning via large-scale RL + test-time compute scaling (Safety & CoT)
Explainer · source
Primary-source description of o1-style approach: large-scale RL for reasoning, trading inference-time compute for performance, and rationale for “thinking longer” before answering.
Key content
- Training approach (Reasoning RL): “Large-scale reinforcement learning algorithm” trains the model to “think productively” using chain-of-thought in a highly data-efficient process. RL teaches the model to: recognize/correct mistakes, break down tricky steps, and try alternate approaches when stuck (Chain of Thought section).
- Compute scaling claim: o1 performance consistently improves with:
- More RL = more train-time compute
- More time spent thinking = more test-time compute
(Stated explicitly; figure caption: “smoothly improves with both train-time and test-time compute.”)
- Test-time selection/verification workflow (Coding/IOI):
- Sample many candidate submissions; submit 50 chosen via a test-time selection strategy using: IOI public tests + model-generated tests + a learned scoring function.
- With 10,000 submissions/problem, score 362.14 (above gold threshold) even without selection strategy.
- Empirical results (selected):
- AIME 2024: GPT‑4o 12% (1.8/15) avg; o1 74% (11.1/15) single-sample; 83% (12.5/15) with consensus@64; 93% (13.9/15) when re-ranking 1000 samples with learned scoring.
- GPQA diamond: o1 surpasses recruited PhD experts; pass@1 77.3, cons@64 78.0 (Appendix A).
- Safety table (harmful prompts): Challenging jailbreak/edge cases safe completions: GPT‑4o 0.714 vs o1‑preview 0.934; StrongREJECT Goodness@0.1: 0.220 → 0.840; Human-sourced jailbreak eval: 0.770 → 0.960.
- Design rationale (Hiding CoT): Raw chains-of-thought not shown to users; instead show a model-generated summary. Rationale: preserve potential for monitoring (“read the mind”) while avoiding training user-preference/policy compliance onto the hidden trace and avoiding exposing unaligned thoughts.
🔍 o3-mini empirical evals + compute/latency tradeoffs (benchmarks)
Explainer · source
o3-mini-specific benchmark tables; reasoning behavior vs latency/cost via tool scaffolds + test-time attempts
Key content
- Reasoning model training (Section 2): o-series trained with large-scale reinforcement learning to “think before answering” (chain-of-thought), learning to refine strategies and recognize mistakes; includes deliberative alignment (explicitly reason through safety specs before answering).
- Evaluation defaults / test-time compute:
- CTF eval (Sec. 5.3): headless Kali Linux; up to 60 rounds of tool use per attempt; 12 attempts per task. Results (post-mitigation): 61% high-school, 21% collegiate, 21% professional CTFs.
- SWE-bench Verified N=477 (Sec. 5.7.2):
- Agentless 1.0 scaffold: 5 tries to generate patch; metric pass@1 computed by averaging per-instance pass rates over valid patches.
- o3-mini (tools) internal scaffold: efficient iterative editing/debugging; 4 tries per instance; pass@1 = 61% (non-final checkpoint).
- o3-mini launch candidate (Agentless): 39%; o1: 48%. (Shows tool scaffold/test-time procedure materially changes performance.)
- Safety/jailbreak robustness (Sec. 4):
- StrongReject goodness@0.1: GPT-4o 0.37, o1-mini 0.72, o3-mini 0.73.
- Gray Swan Arena attack success rate: o3-mini 3.6%, o1-mini 3.7%, gpt-4o 4.0%, o1 1.9%.
- Hallucination (PersonQA, Table 3): accuracy o3-mini 21.7%; hallucination rate 14.8% (vs GPT-4o-mini 52.4%, o1-mini 27.4%).
- Model autonomy indicator (Sec. 5.7): interview coding 92% pass@1; multiple-choice matches o1 (cons@32).
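The SWE-bench pass@1 definition above (average of per-instance pass rates over valid patches) can be made concrete with a short sketch. The data layout is hypothetical: each try is True or False for a valid patch that passed or failed, and None for a try that produced no valid patch. Scoring an instance with no valid patch as 0 is an assumption; the notes do not say how that case is handled.

```python
from typing import Optional


def pass_at_1(results_by_instance: dict[str, list[Optional[bool]]]) -> float:
    """Mean over instances of (passing valid tries / valid tries)."""
    rates = []
    for tries in results_by_instance.values():
        valid = [t for t in tries if t is not None]
        # Assumption: no valid patch at all counts as a pass rate of 0.
        rates.append(sum(valid) / len(valid) if valid else 0.0)
    return sum(rates) / len(rates)


# Three instances with five tries each, mirroring the 5-try Agentless setup.
toy_results = {
    "instance_a": [True, False, None, True, None],  # 2 of 3 valid patches pass
    "instance_b": [False, False, False, False, False],
    "instance_c": [None, None, None, None, None],
}
toy_pass_at_1 = pass_at_1(toy_results)
```

Averaging per-instance rates first keeps an instance with many valid patches from outweighing one with few, which a pooled pass count would not do.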
🔍 o3/o4-mini — test-time compute, long rollouts, and benchmark evidence
Explainer · source
Citable PDF tables/figures quantifying reasoning/tool-use gains from longer rollouts (“test-time compute”) + evaluation methodology/metrics.
Key content
- Training / design (Sections 1–2):
- o3 and o4-mini are reasoning models trained with large-scale reinforcement learning on chains of thought; trained to “think before they answer,” trying strategies and recognizing mistakes.
- Models can use tools inside their chain-of-thought (web browsing, Python, image/file analysis, etc.) to augment reasoning.
- Test-time compute / long rollouts evidence (Cybersecurity CTF; Section 4.3.1, Fig. 7):
- Evaluation uses 16 rollouts per CTF, reports pass@12 (“best set of rollouts”).
- With 12 attempts: o3 solves 89% high-school, 68% collegiate, 59% professional CTFs; o4-mini solves 80%, 55%, 41% respectively.
- Authors attribute gains vs prior o-series models to improved tool use + ability to make use of long rollouts.
- No-browsing results plotted to avoid answer lookup contamination.
- Agentic cyber range workflow + compute settings (Section 4.3.2):
- Two scenarios; run configs: Normal, With Hints, With Solver Code.
- Trials: online-retailer scenario 30 trials/config; priv-esc scenario 16 trials/config.
- Metrics: pass@12 (Normal/With Hints), pass@1 (With Solver Code).
- Key result: no model solves either scenario unaided or with hints; o3/o4-mini solve with reasonably high accuracy when given solver code.
- Evaluation methodology note (Section 4.1):
- Evals are lower bounds; longer rollouts/scaffolding can elicit more capability.
- 95% CI for pass@1 via bootstrap over attempts per problem.
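The pass@k and bootstrap-CI ideas mentioned above can be sketched as follows. Two assumptions are made: pass@k is computed with the standard combinatorial estimator from n rollouts with c successes, which the system card does not confirm as its method, and the bootstrap resamples attempts within each problem and reads the interval off the percentiles of the resampled means.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k rollouts, drawn without replacement from n with c successes, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def bootstrap_pass_at_1_ci(
    attempts_by_problem: list[list[bool]],
    num_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for mean pass@1, resampling attempts within each problem."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_resamples):
        rates = []
        for attempts in attempts_by_problem:
            resampled = rng.choices(attempts, k=len(attempts))
            rates.append(sum(resampled) / len(resampled))
        estimates.append(sum(rates) / len(rates))
    estimates.sort()
    low_index = int((1.0 - level) / 2.0 * num_resamples)
    high_index = max(low_index, int((1.0 + level) / 2.0 * num_resamples) - 1)
    return estimates[low_index], estimates[high_index]


# 16 rollouts with a single success, evaluated at k = 12 as in the CTF setup.
single_success = pass_at_k(n=16, c=1, k=12)

# Toy attempt outcomes for three problems.
toy_attempts = [[True, False, True, True], [False, False, False, True], [True, True, True, True]]
ci_low, ci_high = bootstrap_pass_at_1_ci(toy_attempts)
```

With 16 rollouts and k = 12, even one success yields a high pass@k, which is why such numbers describe what repeated attempts can reach rather than single-shot reliability.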