Evaluation Benchmarks
Video (best)
- Andrej Karpathy — “State of GPT” (Microsoft Build 2023)
- Watch: YouTube
- Why: Karpathy dedicates a substantial segment to how LLMs are evaluated, covering benchmark design philosophy, contamination risks, and the limitations of static benchmarks — directly relevant to this topic and highly accessible.
- Level: beginner/intermediate
Blog / Written explainer (best)
- Eugene Yan — “Patterns for Building LLM-based Systems & Products”
- Link: https://eugeneyan.com/writing/llm-patterns/
- Why: Widely cited in the ML community, covers evaluation patterns including LLM-as-judge, human eval, and benchmark contamination with concrete examples and practical framing. Bridges research and production concerns.
- Level: intermediate
Supplementary:
- Chip Huyen — “Open Challenges in LLM Research” (evaluation section)
- Link: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html
- Why: Covers hallucination measurement, benchmark limitations, and production evaluation challenges from a practitioner perspective.
- Level: intermediate
Deep dive
- EleutherAI — Open LLM Leaderboard documentation + evaluation harness docs
- Link: https://github.com/EleutherAI/lm-evaluation-harness
- Why: The `lm-evaluation-harness` repository is the de facto standard implementation for running MMLU, HellaSwag, HumanEval, and dozens of other benchmarks. Its README and task implementations serve as the most comprehensive technical reference for how benchmarks actually work in practice, covering prompt formatting, few-shot setup, metric computation, and contamination concerns.
- Level: advanced
Original paper
- Hendrycks et al. — “Measuring Massive Multitask Language Understanding” (MMLU, 2020)
- Link: https://arxiv.org/abs/2009.03300
- Why: MMLU is the single most referenced LLM benchmark and this paper established the paradigm of broad, multi-domain academic evaluation. It is readable, well-structured, and directly motivates discussions of contamination, task diversity, and benchmark saturation that define the field. HumanEval (Chen et al., 2021, arxiv 2107.03374) is the complementary seminal paper for code evaluation.
- Level: intermediate
Honorable mention:
- Zheng et al. — “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (2023)
- Link: https://arxiv.org/abs/2306.05685
- Why: Foundational paper for the LLM-as-judge paradigm and Chatbot Arena (LMSYS), directly covering two of the related concepts in this topic.
- Level: intermediate/advanced
Code walkthrough
- EleutherAI — `lm-evaluation-harness` — running MMLU and HumanEval end-to-end
- Link: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks
- Why: Walking through actual task implementations (e.g., `mmlu/`, `humaneval/`) shows learners exactly how prompts are constructed, how metrics are computed, and where contamination can enter, which is far more instructive than any tutorial video. Pairs naturally with the deep dive above.
- Level: advanced
Supplementary notebook-style walkthrough:
- Hugging Face — “Evaluate” library documentation with worked examples
- Link: https://huggingface.co/docs/evaluate/index
- Why: More beginner-friendly entry point with runnable code for common metrics (accuracy, BLEU, ROUGE, exact match) before tackling full benchmark harnesses.
- Level: beginner/intermediate
Coverage notes
- Strong: Static benchmark evaluation (MMLU, HumanEval), LLM-as-judge / Chatbot Arena, contamination concerns, VQA evaluation — all have solid papers and written resources
- Weak: Agent evaluation and trajectory analysis — emerging area with limited consolidated pedagogical resources; most material is in recent papers (GAIA, AgentBench, τ-bench) rather than polished explainers
- Weak: Observability and task completion rate in production settings — covered in MLOps literature but rarely connected explicitly to benchmark framing
- Gap: No single high-quality YouTube video cleanly covers the full evaluation benchmarks landscape (static + arena + agent + multimodal) in one explainer. Most videos focus on a single benchmark or paper.
- Gap: VQA evaluation specifically (as distinct from general multimodal) lacks a strong standalone explainer video outside of original paper presentations.
Cross-validation
This topic appears in 3 courses: intro-to-agentic-ai, intro-to-llms, intro-to-multimodal
| Concept | intro-to-llms | intro-to-agentic-ai | intro-to-multimodal |
|---|---|---|---|
| MMLU / HumanEval | ✅ Core | ➖ Reference | ➖ Reference |
| Chatbot Arena / LLM-as-judge | ✅ Core | ✅ Relevant | ➖ Peripheral |
| Agent eval / trajectory analysis | ➖ Peripheral | ✅ Core | ➖ Peripheral |
| VQA evaluation | ➖ Not covered | ➖ Peripheral | ✅ Core |
| Contamination | ✅ Core | ✅ Relevant | ✅ Relevant |
| Observability / task completion | ➖ Peripheral | ✅ Core | ➖ Peripheral |
Additional Resources for Tutor Depth
54 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 Agentic LLM Reliability—KAMI v0.1 trace analysis + eval setup
Paper · source
Empirical reliability evidence in agentic tool-use + evaluation configuration (KAMI v0.1); failure archetypes that motivate reliability metrics beyond aggregate scores
Key content
- Dataset / method: 900 manually reviewed execution traces = 3 models × 10 scenarios × 30 trials (random sample, 12.5% of ~240 trials/model/scenario). Emergent coding over full traces (tool calls + tool outputs + outcomes). (Section 2.4)
- Models + pooled KAMI v0.1 accuracy (Table 1):
- Granite 4 Small (32B, dense): 58.5% (95% t-CI 57.5–59.3)
- Llama 4 Maverick (400B total / 17B active, MoE): 74.6% (CI 73.8–75.3)
- DeepSeek V3.1 (671B total / 37B active, MoE): 92.2% (CI 91.2–93.2)
- DeepSeek V3: 59.4% (CI 59.0–59.7) → same architecture as V3.1; improvement attributed to post-training RL (Sections 1, 4; Table 1).
- Recurring failure archetypes (Section 1.2): (1) premature action without grounding (schema guessing), (2) over-helpfulness substituting missing entities, (3) distractor-induced context pollution, (4) fragile execution under load (malformed tool calls, loops, inconsistent recovery).
- Benchmark configuration defaults (Section 2.3): max 20 rounds; single tool call per round; temperature 0.4; context window 32K (non-thinking) / 128K (thinking); max output 8K tokens/round (non-thinking); all analyzed models run non-thinking.
- Design rationale: randomized task parameters to probe stochasticity + reduce memorization; aggregate scores hide how failures occur—trace-level analysis needed for enterprise reliability engineering. (Sections 1.2, 2.3)
📄 CMMLU (Chinese MMLU-style benchmark) — construction + eval protocol + results
Paper · source
Benchmark construction + leaderboard-style results/ablations (splits, protocol, numbers)
Key content
- Benchmark design (Section 3):
- 67 subjects, 4-choice single-answer MCQ format.
- Total questions: reported as 11,528 (text) / 11,582 test-set (Table 1).
- Per-subject split: 5-question few-shot dev + >100-question test; each subject min 105 questions (Table 1).
- Supercategories: 17 STEM (2,531 Q), 13 Humanities (2,489 Q), 22 Social Science (3,652 Q), 15 Other (2,910 Q); China-specific: 15 tasks (2,572 Q) (Table 1).
- Data collection: 4 annotators; 50 CNY/hour; ~250 hours; >80% from PDFs (OCR) to reduce training contamination; estimated ~2% label noise.
- Overlap check vs CEval/M3KE: exact-string match after sorting choices + punctuation removal → 74 overlaps with CEval, 158 with M3KE (~1% of CMMLU).
- Evaluation protocol (Section 4):
- Closed models: free generation + regex extract option.
- Open models: next-token prediction over tokens {A,B,C,D} using next-token logits (preferred vs perplexity; Appendix G).
- Prompt: “以下是关于[主题]的单项选择题,请直接给出正确答案的选项…答案是:”; 0-shot and up to 5-shot; if context too long, drop longest examples by sub-token count.
- Main 5-shot results (Table 3; macro-average over subjects):
- GPT-4: 70.95% overall (STEM 65.23, Humanities 72.11, Social 72.06, Other 74.79, China-spec 66.12).
- ChatGPT: 55.51% overall.
- Best open multilingual: LLaMA2-70B 53.21%; best Chinese: Baichuan2-13B 61.92% (beats ChatGPT).
- Ablations (Section 4.2):
- Chain-of-thought prompt often doesn’t help; e.g., Baichuan2-13B-Chat overall 58.77 (DA) → 52.82 (COT) (Table 4).
- Negation words: ~10.7% of data; most models worse with negation (Table 5).
- Sub-options: 10.8% of data; accuracy drops ~10–20 points; GPT-4 5-shot: 71.72 (no sub-options) vs 53.41 (with) (−18.3) (Table 6).
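The open-model protocol above (compare next-token logits for the four option letters and take the largest) can be sketched as follows. This is a minimal illustration, not the CMMLU implementation: `pick_option`, `accuracy`, and the toy logit dictionaries are hypothetical, and a real harness would read the logits from the model output at the answer position.

```python
OPTIONS = ("A", "B", "C", "D")


def pick_option(next_token_logits, options=OPTIONS):
    """Return the option letter whose token has the highest next-token logit.

    next_token_logits: mapping token -> logit (a hypothetical stand-in for
    the model's output at the answer position). Missing tokens count as -inf.
    """
    return max(options, key=lambda o: next_token_logits.get(o, float("-inf")))


def accuracy(predictions, gold):
    """Fraction of questions where the predicted letter equals the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy data: only the relative order of the four option logits matters,
# so a high logit on an unrelated token ("the") is ignored.
logits_per_question = [
    {"A": 1.2, "B": 3.4, "C": 0.1, "D": -0.5, "the": 5.0},
    {"A": 2.0, "B": 1.0, "C": 2.5, "D": 0.0},
]
preds = [pick_option(logits) for logits in logits_per_question]
```

Restricting the comparison to the option tokens is what distinguishes this protocol from free generation plus regex extraction, which is the route described for closed models.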
📄 DPR dual-encoder objective + in-batch negatives
Paper · source
Dual-encoder DPR training objective (in-batch negatives / contrastive log-likelihood), dot-product scoring, end-to-end retriever→reader procedure + key hyperparams/results
Key content
- Retrieval scoring (Eq. 1, Sec. 3.1): $$\text{sim}(q,p)=E_Q(q)^\top E_P(p)$$ where $E_Q$ and $E_P$ are the question and passage encoders; vectors are $d$-dimensional (BERT-base [CLS], $d=768$). Retrieve the top-$k$ passages by maximum inner product search (FAISS).
- Training loss (Eq. 2, Sec. 3.2): for a training instance $\langle q_i,p_i^+,p_{i,1}^-,\dots,p_{i,n}^-\rangle$, $$L=-\log \frac{e^{\text{sim}(q_i,p_i^+)}}{e^{\text{sim}(q_i,p_i^+)}+\sum_{j=1}^n e^{\text{sim}(q_i,p_{i,j}^-)}}$$
- In-batch negatives (Sec. 3.2): batch size $B$. Build $Q,P\in\mathbb{R}^{B\times d}$ and $S=QP^\top\in\mathbb{R}^{B\times B}$. Positive pairs lie on the diagonal ($i=j$); negatives are the other passages in the batch ($B-1$ per question), yielding $B^2$ pairs per batch.
- Negatives used (Sec. 3.2, 5.2): best model uses gold in-batch negatives + 1 BM25 hard negative per question (BM25 passage that doesn’t contain answer). Adding 1 BM25 negative helps; adding 2 doesn’t.
- Key retrieval results (Table 2): Top-20 accuracy (answer in retrieved passages):
- NQ: DPR 78.4 vs BM25 59.1
- TriviaQA: 79.4 vs 66.9
- WQ: 73.2 vs 55.0
- TREC: 79.8 vs 70.9
- Training hyperparams (Sec. 5): batch size 128; epochs 40 (large datasets) / 100 (small); LR 1e-5 Adam + linear warmup; dropout 0.1.
- Indexing/runtime (Sec. 5.4): Wikipedia split into 21,015,324 passages of 100 words (+ title + [SEP]). FAISS retrieval ~995 Q/s (top-100). Dense embedding compute ~8.8h on 8 GPUs; FAISS index build 8.5h; Lucene index build ~30 min.
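The in-batch negatives objective above can be sketched in plain Python. This is an illustrative sketch, not the DPR training code: the function names are hypothetical, the vectors are toy lists rather than BERT embeddings, and averaging the per-question loss over the batch is an assumption (Eq. 2 is stated per instance).

```python
import math


def dot(u, v):
    """Inner product sim(q, p) between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_nll(Q, P):
    """Mean negative log-likelihood with in-batch negatives.

    Q, P: lists of B vectors; P[i] is the positive passage for Q[i].
    Every other passage in the batch acts as a negative for question i.
    """
    B = len(Q)
    total = 0.0
    for i in range(B):
        scores = [dot(Q[i], P[j]) for j in range(B)]  # row i of S = Q P^T
        m = max(scores)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]  # -log softmax at the diagonal entry
    return total / B


# Toy batch of size 2: each question matches its own passage.
Q_toy = [[1.0, 0.0], [0.0, 1.0]]
P_toy = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = in_batch_nll(Q_toy, P_toy)
loss_swapped = in_batch_nll(Q_toy, P_toy[::-1])
```

Swapping the passages puts the high scores off the diagonal, so the loss rises; that is the signal the encoders are trained to reduce. An extra BM25 hard negative per question would add one more column of scores for each row.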
📄 MMLU-CF (contamination-free MMLU) — methodology + results
Paper · source
Contamination-detection methodology + contamination-free evaluation results (tables) incl. sampling/settings
Key content
- Problem framing: contamination in public MCQ benchmarks (MMLU) can be unintentional (train-data overlap) or deliberate (benchmark added to training; models regurgitate exact choices). Example shown where models output identical MMLU choices from question-only prompt (Fig. 1).
- Dataset scale & split (Sec. 3): MMLU-CF = 20,000 MCQs across 14 fields, sourced from 200+ billion webpages; final split 10k closed-source test + 10k open-source validation to deter deliberate contamination while enabling transparency.
- Construction pipeline (Fig. 3):
- MCQ collection: extract 2.7M MCQs from 3000+ domains.
- Cleaning: length 10–512 chars; require ≥4 choices; normalize labels to A/B/C/D; English-only; dedup → 1.66M.
- Difficulty sampling: GPT-4o rates difficulty 0–9 (prompt Table 7); sample ~normal centered at 6; keep balanced disciplines → 50k.
- LLM checking: GPT-4o/Gemini/Claude rate quality (1–5); keep avg >4; safety filters (hate/sex/self-harm/violence); redundancy detection inspired by Decontaminator.
- Contamination-free processing (Sec. 3.2, Fig. 5):
- Rule 1 Rephrase question (GPT-4o)
- Rule 2 Shuffle choices (special-case “All/None of the above”)
- Rule 3 50%: replace one choice with “None of the other choices” (skip if last choice is All/None above)
- Evaluation defaults (Sec. 4, Appx A.6, Table 5): 0-shot & 5-shot, no CoT (except marked). Prompt: “Answer by replying A, B, C or D” (Table 6). Temperatures/max tokens: GPT-4o 0.7/2048, GPT-3.5 0.7/2048, DeepSeek-R1 0.6/32768, DeepSeek-V3 0.7/8192, Qwen2.5 0.7/4096, others 0.7/1024.
- Key empirical results (5-shot test, Table 1): large drops vs MMLU and rank reshuffles. Examples:
- OpenAI o1: 92.3 → 80.3 (−12.0)
- GPT-4o: 88.0 → 73.4 (−14.6)
- Qwen2-72B-instruct: 82.3 → 63.7 (−18.6), rank ↓7
- Rule ablation (Table 2, 5-shot): applying all 3 rules drops accuracy:
- On MMLU: GPT-4o 88.0 → 79.8 (−8.2)
- On MMLU-CF: GPT-4o 79.8 → 73.4 (−6.4)
Larger drop on MMLU ⇒ more contamination.
- Contamination detection metric (Sec. 4.5, Fig. 6): “match rate” of model outputs to original MMLU choices on 1k samples/40 models: ~10% of models show 1–5% match on MMLU; with decontam rules 97.5% of models <1% match; on MMLU-CF 100% of models <0.2% match.
- Validation–test gap as contamination monitor (Appx A.3, Table 4): define Δ = |score_val − score_test|; before validation release: ~60% of Δ <0.5 and 96% <1.0 (5-shot), suggesting similar difficulty; future Δ growth indicates validation contamination.
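Rules 2 and 3 of the contamination-free processing can be sketched as below. Several details are assumptions, not taken from the paper: the special-casing in Rule 2 is modeled as keeping a trailing "All/None of the above" in the last position, and Rule 3 replaces a uniformly random slot. All names here are hypothetical.

```python
import random

SPECIAL = ("all of the above", "none of the above")
NONE_OTHER = "None of the other choices"


def is_special(choice):
    return choice.strip().lower() in SPECIAL


def shuffle_choices(choices, answer_idx, rng):
    """Rule 2: shuffle choices and return (new_choices, new_answer_idx).

    Assumption: a trailing All/None-of-the-above choice stays last.
    """
    keep_last = is_special(choices[-1])
    head = list(range(len(choices) - 1 if keep_last else len(choices)))
    rng.shuffle(head)
    order = head + ([len(choices) - 1] if keep_last else [])
    return [choices[i] for i in order], order.index(answer_idx)


def maybe_replace_with_none(choices, answer_idx, rng, p=0.5):
    """Rule 3: with probability p, replace one choice with NONE_OTHER.

    Skipped when the last choice is All/None of the above. The replaced
    slot is chosen uniformly at random (assumption). The answer index does
    not move: if the correct choice is replaced, NONE_OTHER is now correct.
    """
    if is_special(choices[-1]) or rng.random() >= p:
        return list(choices), answer_idx
    slot = rng.randrange(len(choices))
    new_choices = list(choices)
    new_choices[slot] = NONE_OTHER
    return new_choices, answer_idx
```

Rule 1 (rephrasing the question with GPT-4o) is not sketched because it depends on a model call.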
📄 Maximal Marginal Relevance (MMR) selection criterion
Paper · source
Original MMR equation balancing query relevance vs novelty/diversity via λ
Key content
- MMR selection criterion (Eq. 1 / Section 2): incrementally select the next item $D_i$ from the retrieved set $R$ given the already-selected set $S$:
$$\mathrm{MMR} \triangleq \arg\max_{D_i \in R \setminus S}\Big[\lambda\,\mathrm{Sim}_1(D_i,Q)-(1-\lambda)\max_{D_j \in S}\mathrm{Sim}_2(D_i,D_j)\Big]$$
Variables:
- $C$: document collection/stream; $Q$: query/user profile
- $R = IR(C,Q,\theta)$: retrieved/ranked list from an IR system with threshold $\theta$ (match degree or top-N cutoff)
- $S\subset R$: already selected docs/passages; $R\setminus S$: unselected candidates
- $\mathrm{Sim}_1$: similarity for relevance (doc/passages ↔ query)
- $\mathrm{Sim}_2$: similarity for redundancy (candidate ↔ selected); may equal $\mathrm{Sim}_1$ or differ
- $\lambda\in[0,1]$: tradeoff; $\lambda=1$ ⇒ pure relevance ranking; $\lambda=0$ ⇒ maximal diversity among $R$
- Procedure (reranking / summarization): segment document into passages (sentences), compute cosine similarity, apply MMR to rerank passages for a query; output top passages in original document order (Section 4).
- Suggested λ strategy (Section 2): start broad with $\lambda\approx 0.3$, then refocus with a reformulated query and $\lambda\approx 0.7$.
- Empirical results:
- User study (Section 3): 80% (4/5) chose MMR method for a search task.
- SUMMAC’98 (Section 4): MMR summarizer achieved F-score 0.73 for query-relevant summaries; 70% accuracy on “informative summaries.”
- Sentence precision (Table 1): compression 10%: $\lambda=1$ 0.78/0.83, $\lambda=0.7$ 0.76/0.83, $\lambda=0.3$ 0.74/0.79; Lead sentences 0.74/0.83. Compression 25%: $\lambda=1$ 0.74/0.76, $\lambda=0.7$ 0.73/0.74, $\lambda=0.3$ 0.74/0.76; Lead sentences 0.60/0.65.
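The incremental selection in the MMR criterion can be sketched as a greedy loop. This is a minimal sketch under assumptions: `mmr_select` and the toy similarity functions are hypothetical, and the redundancy term is taken to be 0 while nothing has been selected yet (the maximum over an empty set is left undefined by the formula).

```python
def mmr_select(candidates, query_sim, pair_sim, lam=0.7, k=3):
    """Greedy MMR: repeatedly pick the candidate d maximizing
    lam * Sim1(d, Q) - (1 - lam) * max_{s in S} Sim2(d, s)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(d):
            # Assumption: redundancy is 0 while S is still empty.
            redundancy = max((pair_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim(d) - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a" and "b" are both relevant but nearly identical to each other.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}


def toy_query_sim(d):
    return relevance[d]


def toy_pair_sim(x, y):
    return 0.95 if {x, y} == {"a", "b"} else 0.1


pure_relevance = mmr_select(["a", "b", "c"], toy_query_sim, toy_pair_sim, lam=1.0)
balanced = mmr_select(["a", "b", "c"], toy_query_sim, toy_pair_sim, lam=0.5)
```

With `lam=1.0` the order follows relevance alone; with `lam=0.5` the near-duplicate "b" is pushed behind the less relevant but novel "c".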
📄 Measuring Position Bias in LLM-as-a-Judge
Paper · source
Operational definitions + measurement protocol for position bias (RS/PC/PF), factors, and analysis workflow (pairwise + list-wise)
Key content
- Evaluation protocol (Section 2.1):
- Pairwise: judge sees original prompt (A then B) and swapped prompt (B then A) → a judgment pair. Double-blind (candidate identities hidden).
- Option modes: Two-option {A,B}; Three-option {A,B,C} where C=tie (explicit in system prompt).
- List-wise: choose best among ≥3 candidates (not full ranking). Use all order permutations so each candidate appears in each position exactly once (for n candidates → n! permutations). Tie option allowed.
- Metrics (Section 2.2):
- Repetition Stability RS (Eq. 1): reliability under identical repeated queries. $$RS=\frac{1}{N}\sum_{i=1}^{N}\frac{\max_{c\in C}\text{count}_i(c)}{T}$$ where $C$ is the choice set, $T$ the number of repeats per query, and $N$ the number of queries.
- Position Consistency PC (Eq. 2): fraction of prompt series where the same winning solution is chosen across permutations: $$PC=\frac{\#\text{consistent series}}{\#\text{valid series}}$$
- Preference Fairness PF (Eq. 3): single min–max scaled score centered at 0; sign indicates primacy (favor first) vs recency (favor later). Extended to list-wise via “one-vs-all” (first=primacy, others=recency).
- Defaults/parameters (Section 3.1):
- Temperature =1 for all judges.
- RS computed with 3 repeats; sample: 3 questions/task and 4 candidate models, paired with baseline.
- Pairwise datasets: MTBench (baseline vicuna-13b-v1.3, Two-option, 30 candidates, 8 tasks, 10 Q/task) and DevBench (baseline human, Three-option, 10 candidates, 14 tasks, 8 Q/task).
- Scale: 4,800 (MTBench) and 2,240 (DevBench) instances for PC/PF; >100k total evaluations.
- Key empirical results (Table 2 + Findings):
- Capable judges show RS > 0.95 (e.g., Claude-3.5-Sonnet, GPT-4, Llama-3.3-70B), supporting the conclusion that position bias is not random noise.
- Bias varies by judge & task; PC and PF can diverge (high PC ≠ fair).
- Answer quality gap strongly affects PC (larger gap → higher PC); length effects weak (only output length minimally significant).
- Agreement analysis: >50% of instances have ≥80% judge agreement; <2% are extreme disagreement (hard-to-judge).
- Factor analysis workflow (Section 3.1, Appendix E): bidirectional stepwise regression with AIC predicting PC/PF using: judge identity/series, candidate identity, task category, lengths (input/output/prompt), and answer quality gap.
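RS and PC as defined above reduce to simple counting once the judgments are collected. The sketch below is illustrative: the function names and input layout are hypothetical, and treating a series with any missing judgment as invalid is an assumption. PF is omitted because its min–max scaling is not spelled out here.

```python
from collections import Counter


def repetition_stability(repeated_choices):
    """RS: mean over queries of (count of the most frequent choice) / T.

    repeated_choices: list of N lists, each holding the T judgments
    obtained by repeating one identical query.
    """
    ratios = []
    for choices in repeated_choices:
        top_count = Counter(choices).most_common(1)[0][1]
        ratios.append(top_count / len(choices))
    return sum(ratios) / len(ratios)


def position_consistency(series):
    """PC: (# consistent series) / (# valid series).

    series: list of lists; each inner list holds the winning candidate
    (not the winning position) under each order permutation. None marks
    an unparseable judgment; such a series is dropped (assumption).
    """
    valid = [s for s in series if s and all(w is not None for w in s)]
    if not valid:
        raise ValueError("no valid series")
    consistent = sum(1 for s in valid if len(set(s)) == 1)
    return consistent / len(valid)


rs = repetition_stability([["A", "A", "B"], ["B", "B", "B"]])
pc = position_consistency([["m1", "m1"], ["m1", "m2"], ["m2", None]])
```

Recording the winning candidate rather than the winning slot is what makes a swapped-order pair comparable: a judge that always picks the first slot yields two different winners and counts as inconsistent.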
📄 Neural Network Calibration Metrics & Temperature Scaling (Guo et al., 2017)
Paper · source
Definitions of calibration metrics (ECE/MCE), reliability diagrams, temperature scaling procedure
Key content
- Perfect calibration definition (Eq. 1): $$\mathbb{P}(\hat Y = Y \mid \hat P = p)=p,\quad \forall p\in[0,1]$$ where $\hat Y$ is the predicted label, $Y$ the true label, and $\hat P$ the confidence (probability assigned to $\hat Y$).
- Reliability diagram binning (Section 2): Partition predictions into $M$ confidence bins $I_m=\left(\frac{m-1}{M},\frac{m}{M}\right]$. Let $B_m=\{i:\hat p_i\in I_m\}$. $$\text{acc}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}\mathbf{1}(\hat y_i=y_i),\quad \text{conf}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}\hat p_i$$
- Expected Calibration Error (ECE) (Eq. 3): $$\mathrm{ECE}=\sum_{m=1}^M \frac{|B_m|}{n}\,\big|\text{acc}(B_m)-\text{conf}(B_m)\big|$$
- Maximum Calibration Error (MCE) (Eq. 5): $$\mathrm{MCE}=\max_{m\in\{1,\dots,M\}}\big|\text{acc}(B_m)-\text{conf}(B_m)\big|$$
- Negative Log Likelihood (NLL) (Eq. 6): $L=-\sum_{i=1}^n \log \hat\pi(y_i\mid x_i)$.
- Temperature scaling (multiclass) (Eq. 9, Section 4.2): with logits $z_i$ and temperature $T>0$: $$\hat q_i=\max_k \mathrm{softmax}(z_i/T)_k$$ Optimize $T$ on a held-out validation set by minimizing NLL; the predicted class is unchanged (the argmax is invariant to $T$), so accuracy is unchanged. $T>1$ softens the distribution; as $T\to\infty$ the confidence tends to $1/K$, and as $T\to 0$ it tends to $1$.
- Empirical defaults/results: ECE reported with $M=15$ bins (Table 1). Example: CIFAR-100 ResNet-110 (SD) ECE 12.67% → 0.96% with temperature scaling (Figure 4/Table 1). CIFAR-10 ResNet-110 ECE 4.6% → 0.83%. Temperature scaling often best on vision tasks.
- Compute/implementation note: temperature scaling is a 1D convex optimization; reported ~10 iterations with conjugate gradient; implement by inserting a scalar multiply by $1/T$ between logits and softmax.
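The ECE binning and the temperature fit above can be sketched in plain Python. This is a sketch under assumptions, not the paper's code: the function names are hypothetical, a confidence of exactly 0 is placed in the first bin, and a grid search stands in for the 1-D optimizer (the paper reports conjugate gradient).

```python
import math


def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: sum over bins of |B_m|/n * |acc(B_m) - conf(B_m)|.

    Bin m (1-based) covers ((m-1)/M, m/M].
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        m = min(max(math.ceil(p * n_bins), 1), n_bins)
        bins[m - 1].append((p, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        acc = sum(ok for _, ok in b) / len(b)
        conf = sum(p for p, _ in b) / len(b)
        ece += len(b) / n * abs(acc - conf)
    return ece


def log_softmax(logits, T=1.0):
    """Log-probabilities of softmax(z / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def nll(logits_list, labels, T):
    return -sum(log_softmax(z, T)[y] for z, y in zip(logits_list, labels))


def fit_temperature(logits_list, labels, grid=None):
    """Pick T minimizing validation NLL over a grid (stand-in optimizer)."""
    grid = grid or [0.05 * i for i in range(1, 201)]  # T in (0, 10]
    return min(grid, key=lambda T: nll(logits_list, labels, T))


# Toy validation set: always predicts class 0 with confidence ~0.98,
# but is right only 3 times out of 4, so it is overconfident.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T_star = fit_temperature(val_logits, val_labels)
```

On the toy set the fitted temperature is greater than 1, which pulls the confidence down toward the 75% accuracy while leaving every predicted class unchanged.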
📄 Unbiased pass@k for code functional correctness (HumanEval/Codex)
Paper · source
Primary-source definition + unbiased estimator + sampling protocol for pass@k (HumanEval-style)
Key content
- Why pass@k (functional correctness) vs BLEU/match metrics (Sec. 2.1): Many programs are functionally equivalent but text-different; unit-test passing matches how developers judge code. BLEU can overlap heavily between correct/incorrect solutions (Fig. 8), so BLEU improvements may not imply correctness.
- pass@k definition + unbiased estimator (Eq. 1, Sec. 2.1): For each task, generate n ≥ k samples, run unit tests, count c correct samples (c ≤ n). Unbiased estimator: $$\text{pass@}k := \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$ where $n$ = number of samples, $c$ = number of passing samples, $k$ = budget. Authors use n=200, k≤100 to reduce variance vs naive “any of k” computation.
- Numerically stable computation (Fig. 3):
- If $n-c<k$: return 1.0
- Else: $1 - \prod_{i=n-c+1}^{n}\left(1-\frac{k}{i}\right)$
- (Given as a stable numpy implementation.)
- Bias warning: Estimating pass@k as $1-(1-\hat p)^k$ with $\hat p$ = empirical pass@1 is biased (Appendix A).
- HumanEval dataset (Sec. 2.2): 164 hand-written Python function tasks; avg 7.7 unit tests/problem.
- Key empirical pass rates (Table 1, HumanEval): Codex-12B pass@1 28.81%, pass@100 72.31%; GPT-J 6B 11.62% / 27.74%; GPT-Neo 2.7B 6.41% / 21.37%; TabNine 2.58% / 7.59%.
- Sampling defaults: nucleus sampling top-p=0.95; stop sequences include `\nclass`, `\ndef`, `\n#`, `\nif`, `\nprint`. Temperature tuned by k (e.g., optimal for 679M: T=0.2 for pass@1, T=0.8 for pass@100).
📄 pass@k functional correctness eval (Codex / HumanEval)
Paper · source
Formal definition + unbiased estimator for pass@k; offline evaluation protocol for code generation (HumanEval), plus key reporting conventions.
Key content
- Why functional correctness (Sec. 2.1): Match-based metrics (exact match/BLEU) fail to capture the large space of functionally equivalent programs; instead evaluate correctness by unit tests.
- pass@k definition (Sec. 2.1): Generate k samples per problem; a problem is “solved” if any sample passes unit tests; report fraction solved across problems.
- Unbiased pass@k estimator (Eq. 1): For each task, generate n ≥ k samples and let c be the number of samples that pass the tests. Estimate $$\text{pass@}k := \mathbb{E}_{\text{problems}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
- Variables: n total samples, c correct samples, k budgeted draws.
- Paper uses n = 200, k ≤ 100.
- Notes: direct computation can be numerically unstable; paper provides a stable NumPy product-form implementation. Avoid the shortcut $1-(1-\hat p)^k$, which is biased (App. A).
- HumanEval dataset (Sec. 2.2): 164 hand-written Python function-synthesis problems; each includes signature/docstring/body + unit tests; avg 7.7 tests/problem.
- Sampling/eval defaults (Sec. 3): Nucleus sampling top-p = 0.95; stop sequences: `\nclass`, `\ndef`, `\n#`, `\nif`, `\nprint`.
- Temperature tuning (Sec. 3.3): Optimize temperature per k; higher T for larger k (diversity helps). Example (679M): T=0.2 for pass@1, T=0.8 for pass@100.
- Key results (Abstract/Fig. 1): On HumanEval: Codex-12B pass@1 = 28.8%; GPT-3 ~0%; GPT-J-6B 11.4%. With 100 samples, Codex reaches 70.2%; Codex-S pass@1 = 37.7%, and 77.5% within 100 samples. Best-of-100 by highest mean log-prob yields 44.5%.
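The bias of the shortcut $1-(1-\hat p)^k$ can be checked exactly on a small case. The sketch below assumes each of the n samples passes independently with probability p, so that c is binomial; that modeling choice and the function names are illustrative assumptions, not part of the paper's protocol.

```python
import math


def unbiased(n, c, k):
    """Estimator from Eq. 1: 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def naive(n, c, k):
    """Shortcut: plug the empirical pass rate c/n into 1 - (1 - p)^k."""
    return 1.0 - (1.0 - c / n) ** k


def expectation(estimator, n, p, k):
    """Exact expectation of an estimator when c ~ Binomial(n, p)."""
    return sum(
        math.comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
        for c in range(n + 1)
    )


n, p, k = 10, 0.1, 5
true_value = 1.0 - (1.0 - p) ** k
mean_unbiased = expectation(unbiased, n, p, k)
mean_naive = expectation(naive, n, p, k)
```

Under this model the unbiased estimator matches the true value in expectation, while the shortcut falls below it: $(1-\hat p)^k$ is convex in $\hat p$, so averaging it over noisy estimates overstates the failure probability.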
📊 AgentArch enterprise agent-architecture benchmark (18 configs)
Benchmark · source
Benchmark design + quantitative comparisons across orchestration, prompting (ReAct vs function calling), memory, thinking tools
Key content
- Benchmark setup (Section 3.1): 2 enterprise workflows, each 60 user utterances with deterministic tool outputs and messy/long enterprise-realistic data (KB articles thousands of words; tool outputs complex JSON with metadata/error codes).
- Requesting Time Off (TO): simpler; 8 tools, 3 agents; challenges include date calculations, leave balance/policy compliance.
- Customer Request Routing (CR): complex; 31 tools, 9 agents; challenges include escalation decisions, context preservation, ambiguous requests, routing logic.
- Architectural dimensions tested (18 configurations):
- Orchestration: (1) orchestrator-led isolated agents, (2) orchestrator-led open agent network, (3) single agent (all tools, no collaboration).
- Agent style: Function calling vs ReAct (reasoning-action format).
- Memory: complete (all prior tool calls/params/responses) vs summarized (final summaries only).
- Thinking tools: enabled/disabled (e.g., math, synthesize_collected_information).
- Primary metric (Section 3.2): Acceptable Score = % records satisfying all: correct tool choice, correct tool arguments, correct final decision.
- Tool choice scoring: Lenient Acceptable (allows extra read-only tools; penalizes extraneous writes) vs Strict Acceptable (exact tools, correct order, no extras/hallucinations). Main reporting uses lenient.
- Reliability: pass@1 (primary) and pass^k = probability that all k trials succeed.
- Key empirical results (Section 4.1):
- Peak performance: TO max 70.8% (GPT-4.1); CR max 35.3% (Sonnet 4).
- Thinking tools help on TO for non-reasoning models: GPT-4.1 48.5% → 70.8% (single-agent function calling, summarized memory). Minimal benefit for o3-mini 55.8% → 56.7%.
- Function calling generally > ReAct; multi-agent ReAct consistently underperforms.
- Hallucinations occur exclusively under ReAct for all models except GPT-4o; Sonnet 4 shows 36% hallucination in multi-agent ReAct vs 0% elsewhere.
- Reliability gap: best pass^k peaks at 0.0634 (only 6.34% chance of perfect success across 8 trials).
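The gap between pass@1 and pass^k follows directly from their definitions. The sketch below uses the simplest empirical form, assuming every record was run for exactly k trials; the function names and toy outcomes are hypothetical, and the benchmark's own estimator may differ.

```python
def pass_at_1(trials):
    """Mean single-trial success rate over all records and trials."""
    flat = [ok for record in trials for ok in record]
    return sum(flat) / len(flat)


def pass_hat_k(trials):
    """Fraction of records whose k trials all succeeded."""
    return sum(all(record) for record in trials) / len(trials)


# Toy outcomes: 4 records, 8 trials each.
toy_trials = [
    [True] * 8,                # always succeeds
    [True] * 7 + [False],      # one failure in eight
    [False] * 8,               # never succeeds
    [True] * 4 + [False] * 4,  # coin flip
]
p1 = pass_at_1(toy_trials)
pk = pass_hat_k(toy_trials)
```

A single failure in eight trials barely moves pass@1 but removes the record from pass^k, which is why the two metrics can diverge sharply.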
📊 AgentBench — LLMs as Interactive Agents (8 Environments)
Benchmark · source
End-to-end agent task success rates across multiple environments + trajectory/round limits + common failure causes (TLE/IF/IA/CLE)
Key content
- Formalization (Section 2): Interactive evaluation of an LLM agent is modeled as a POMDP with components: state space $S$, action space $A$, transition $T$, reward $R$, task-instruction space $I$, observation space $O$. Agent denoted $M$.
- Prompting/eval procedure (Section 4.1):
- Two-role dialogue: user (instruction + environment feedback) and agent alternating; trajectory stored as conversation history.
- Input truncation: choose minimal $k$ such that token count of history $\le 3500$; omit earlier messages and append `"[NOTICE] messages are omitted."`
- Output format includes Thought + Action in one round (CoT-style); temperature = 0 (greedy) for reproducibility.
- Non-chat models: prepend `USER:` / `AGENT:` per turn; end with `AGENT:` to elicit completion.
- Finish reason taxonomy (Section 2):
- CLE (context limit exceeded), IF (invalid format), IA (invalid action), TLE (task limit exceeded / repetitive generations), Complete.
- Benchmark composition (Section 3): 8 environments across code-grounded (OS bash SR; DB SQL SR; KG QA F1), game-grounded (DCG win rate; LTP game progress; HH/ALFWorld SR), web-grounded (WS reward; WB step SR). Estimated solving rounds per problem: 5–50.
- Dataset sizes (Table 2): total Dev 269, Test 1,014; ~3k and 11k inference calls (≈ MMLU call volume).
- Key empirical results (Table 3 / Section 4.2):
- gpt-4 (0613) overall score 4.01; notable SRs: House Holding 78.0%, Web Shopping 74.5%, OS 42.4%, DB 32.0%, KG F1 58.8, WB step SR 61.1.
- gpt-3.5-turbo (0613) overall 2.32; OS 32.6%, HH 64.1%, WS 33.7%.
- OSS vs API gap: average OSS overall 0.51 vs API 2.32; best OSS reported codellama-34b overall 0.96.
- Failure outcome proportions (Table 4; per-environment): TLE dominates in several tasks (e.g., KG TLE 67.9%, LTP TLE 82.5%); DB IF 53.3%; HH IA 64.1%; OS Complete 75.0% with TLE 23.9%.
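The history-truncation step in the prompting procedure can be sketched as below. Several details are assumptions rather than the benchmark's code: `count_tokens` splits on whitespace as a stand-in for a real tokenizer, k is read as the number of earliest messages dropped, the notice is placed where those messages were, and the notice is not counted against the limit.

```python
NOTICE = "[NOTICE] messages are omitted."
LIMIT = 3500


def count_tokens(text):
    """Hypothetical token counter: whitespace split, not a real tokenizer."""
    return len(text.split())


def truncate_history(messages, limit=LIMIT, counter=count_tokens):
    """Drop the smallest number k of earliest messages so the rest fits.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    Returns (kept_messages, k).
    """
    sizes = [counter(m["content"]) for m in messages]
    total = sum(sizes)
    k = 0
    while k < len(messages) and total > limit:
        total -= sizes[k]
        k += 1
    kept = messages[k:]
    if k > 0:
        # Assumption: the notice marks the position of the dropped messages.
        kept = [{"role": "user", "content": NOTICE}] + kept
    return kept, k
```

Dropping from the front keeps the most recent environment feedback in view. Hitting the limit repeatedly in long episodes is the situation the CLE finish reason records.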
📊 Chatbot Arena Elo Leaderboard (Anonymous Pairwise Human Votes)
Benchmark · source
Elo-based leaderboard methodology for Chatbot Arena (anonymous randomized battles, Elo computation framing, initial results)
Key content
- Benchmark setup (workflow):
- Users chat with two anonymous models side-by-side and vote for the better answer; model names revealed only after voting.
- Platform logs interactions; analysis uses only votes where names were hidden (anonymous votes).
- Initial launch collected 4.7k valid anonymous votes in ~1 week.
- Pairing policy: initially non-uniform (biased toward “strong pairings” based on prior ranking), later switched to uniform sampling for better coverage; introduced fastchat-t5-3b late → non-uniform model frequency.
- Prompts are “in the wild”; language distribution: mostly English (top-15 languages plotted).
- Elo model (Eq. 1–2):
- Win probability (logistic, base 10), Eq. 1: $$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$ where $R_A, R_B$ are Elo ratings; $E_A$ is the expected score (probability that A wins).
- Rating update, Eq. 2: $$R'_A = R_A + K(S_A - E_A)$$ where $S_A$ is the actual score (win=1, tie=0.5, loss=0); $K$ is the update factor.
- Empirical results (initial leaderboard):
- Timeframe: Apr 24 – May 1, 2023; 9 models listed; ratings computed from the 4.7k votes (notebook linked in post).
- Pairwise win-rate heatmap shown; Elo-predicted win rates match observed win rates “relatively well.”
- Design rationale: Pairwise human preference handles open-ended assistant quality; Elo provides scalability, incrementality (new model needs fewer trials), and a unique ordering across many models.
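Eq. 1 and Eq. 2 translate into a short online update. This is a minimal sketch: the values `k=32` and `init=1000` are illustrative assumptions (the text above does not state them), and the function names and the battle tuple layout are hypothetical.

```python
def expected_score(r_a, r_b):
    """Eq. 1: probability that A beats B given ratings r_a and r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a, r_b, s_a, k=32):
    """Eq. 2 applied to both players; s_a is 1, 0.5, or 0 from A's side."""
    e_a = expected_score(r_a, r_b)
    e_b = 1.0 - e_a
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - e_b)


def compute_elo(battles, k=32, init=1000):
    """Process battles in order; each battle is (model_a, model_b, s_a)."""
    ratings = {}
    for a, b, s_a in battles:
        r_a, r_b = ratings.get(a, init), ratings.get(b, init)
        ratings[a], ratings[b] = update(r_a, r_b, s_a, k)
    return ratings


toy_ratings = compute_elo([("m1", "m2", 1.0), ("m2", "m3", 0.5)])
```

Because the update is sequential, the final ratings depend on the order of the battles. A newly added model starts at `init` and moves as its votes arrive, which is the incrementality noted in the design rationale.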
📊 SpecTool tool-use error taxonomy + metrics
Benchmark · source
Quantitative taxonomy + measurement of tool-use failure modes (schema/argument/format vs planning/selection), with model breakdowns and feedback-based evaluation.
Key content
- SpecTool dataset (Section 4): 150 human-annotated tool-use queries across 10 environments (Tools, Movies, Travel, Sports, Entertainment, Data, Social, Media, Weather, Video Images; table also lists Patent, Spaceflight). Overall averages: 6.8 interactions/query, 6.2 APIs/query. Example env stats: Movies 30 queries, 11 avg interactions, 11.5 avg APIs; Patent 20, 6, 7.2; Spaceflight 25, 9, 10.
- 7 error patterns (Section 3):
- IAC Insufficient API Calls (too few calls to complete task)
- IAV Incorrect Argument Value (incl. missing required args)
- IAN Incorrect Argument Name (hallucinated arg names)
- IAT Incorrect Argument Type
- RAC Repeated API Calls (exact repeats)
- IFN Incorrect Function Name (hallucinated tool)
- IFE Invalid Format Error (violates required parseable format)
- Evaluation workflow (Section 5): deterministic environments; agent gets instructions + tool list + format requirements. A constructive feedback mechanism checks: (1) parse/format, (2) action in allowed action space (else list valid actions), (3) argument validity (else list valid args + descriptions), (4) argument types (else specify correct types). Only then execute tool and feed observation back.
- Metrics (Section 6): Let N = max steps; GT = labeled ground-truth trajectories.
- For IFN/IAN/IAT/IFE/RAC: error accuracy = 1 − (# API calls exhibiting that error / N).
- For IAC/IAV: computed vs each labeled trajectory g: 1 − (# API calls with errors w.r.t. GT trajectory g / N).
- Higher metric = fewer errors; aligned with success rate.
- Key empirical results (Table 3): Success rates: GPT-4-0125-preview 0.71 (best), xLAM-8x22b 0.68, xLAM-7b 0.64, Code-Llama-13b 0.54, GPT-3.5-turbo-1106 0.53, Meta-Llama3-8b 0.27, Vicuna-13b-16k 0.16, Mixtral-8x7b 0.10. Example error metrics for GPT-4-0125: RAC 1.00, IFE 1.00, IAC 0.84, IAV 0.94, IAN 0.94, IFN 0.94.
- Design rationale (Sections 1–2, 6–7): prior benchmarks mostly report success-only; SpecTool adds diagnostic error breakdown, multiple valid GT trajectories, and feedback to expose brittle tool-call failures (schema/format/selection) that propagate downstream.
📊 WebArena benchmark (realistic, reproducible web-agent eval)
Benchmark · source
End-to-end web task benchmark with baseline agent implementations and success-rate results in a reproducible, realistic browser environment.
Key content
- Environment design (Sec. 2): Standalone, self-hosted web apps (Docker + gym-style APIs) to ensure reproducibility (avoids CAPTCHAs, content drift, config changes) while preserving realism via open-source stacks + imported real-world data.
- Domains/sites: 4 fully functional websites: e-commerce, social forum, collaborative development (GitLab-based), CMS. Plus utility tools: map, calculator, scratchpad; knowledge resources: English Wikipedia + site manuals.
- Formal agent interaction (Sec. 2.1): Given intent $I$, agent chooses action $a_t$ from current observation $o_t$, action history, observation history; deterministic transition yields new state/observation. Reward checks whether state transitions satisfy intent (e.g., order placed; answer correctness).
- Observation space (Sec. 2.3): Mimics browser: URL + open tabs + focused tab content; supports multi-tab tasks. Render modes: DOM/HTML, screenshot, accessibility tree (compact structured subset of DOM). Optional viewport-limited observations for context constraints.
- Action space (Sec. 2.4): Compound mouse/keyboard, tab, and navigation actions: click/hover/type/press/scroll; tab_focus/new_tab/tab_close; go_back/go_forward/goto(URL). Elements are selectable by coordinates or by unique element IDs, which turns selection into N-way classification (e.g., click [1582]).
- Benchmark (Sec. 3): 812 long-horizon tasks from 241 templates (avg 3.3 instantiations/template). Categories: information-seeking, site navigation, content/config. Includes unachievable tasks labeled “N/A” to test non-hallucination.
- Evaluation (Sec. 3.2): Functional correctness via programmatic checks of intermediate states/DB/page content. Text answers scored by exact_match, must_include, or fuzzy_match (LM-based; uses gpt-4-0613).
- Baseline procedure (Sec. 4): Few-shot ICL with 2 in-context examples; two prompting strategies: direct action vs CoT then action; uses accessibility tree + element IDs. Optional Unachievable (UA) hint instructs stopping if impossible.
- Key empirical results (Table 2 / Sec. 5):
- Human: 78.24% success (unachievable detection 100%).
- GPT-4 + CoT + UA hint: 11.70% success.
- GPT-4 + CoT (no UA hint): 14.41% success; unachievable detection 44.44%.
- GPT-3.5 + CoT + UA hint: 8.75%; text-bison-001 + CoT + UA hint: 5.05%.
- UA hint causes early stopping: GPT-4 marks 54.9% of feasible tasks as impossible (Sec. 5.1).
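The string-based answer checks named in the evaluation bullet (exact_match, must_include) can be sketched as below. This is a minimal illustration, not the benchmark's own code: the normalization (strip plus lowercase) and the helper `success_rate` are assumptions, and the LM-based fuzzy_match is left out.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """True only when the normalized strings are identical (normalization is assumed)."""
    return prediction.strip().lower() == reference.strip().lower()


def must_include(prediction: str, required: list[str]) -> bool:
    """True when every required phrase occurs somewhere in the prediction."""
    text = prediction.lower()
    return all(phrase.lower() in text for phrase in required)


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks judged successful; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical task outcomes, only to show how per-task checks roll up.
outcomes = [
    exact_match(" 42 ", "42"),
    must_include("Order placed for 2 items", ["order", "2 items"]),
    must_include("Order placed", ["order", "refund"]),
    exact_match("N/A", "yes"),
]
rate = success_rate(outcomes)
```

An agent is only as good as its weakest check here: must_include passes long, rambling answers as long as the phrases appear, which is one reason the benchmark also uses programmatic state checks.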
📖 AWS Step Functions Retry/Catch Semantics (Error Handling)
Reference Doc · source
Retry fields (IntervalSeconds, MaxAttempts, BackoffRate) and Catch behavior; error names and propagation rules
Key content
- Default behavior: When a state reports an error, Step Functions fails the entire execution unless handled via Retry/Catch.
- Where Catch/Retry apply: Available on Task, Parallel, Map states (not for top-level execution failures). For anticipated execution-level failures: handle in caller, nest child workflows, or listen for TIMED_OUT events (Standard) via EventBridge.
- Error names (case-sensitive strings):
  - Built-ins start with States.; custom error names cannot start with the States. prefix.
  - Wildcards:
    - States.ALL matches any known error name but must appear alone and last in ErrorEquals; it cannot catch States.DataLimitExceeded or States.Runtime.
    - States.TaskFailed matches any known error except States.Timeout.
  - Notable errors:
    - States.Timeout: task timeout or heartbeat missed; if a nested state machine throws States.Timeout, the parent receives States.TaskFailed. Also emitted when an execution exceeds TimeoutSeconds.
    - States.Runtime: non-retriable; always fails; not caught by Retry/Catch on States.ALL.
    - States.DataLimitExceeded: terminal; not caught by States.ALL.
- Retry algorithm (ordered scan): On error, Step Functions scans Retry[] in order; the first retrier whose ErrorEquals contains the error governs retries. If retries are exhausted, normal error handling continues.
- Retry timing formula (Eq. 1): delay before attempt k (1-indexed) is Delay_k = IntervalSeconds * BackoffRate^(k-1), capped by MaxDelaySeconds if set; with JitterStrategy=FULL, the delay is randomized in [0, Delay_k].
- Retry defaults/limits: IntervalSeconds=1 (max 99999999), MaxAttempts=3 (0 = never retry; max 99999999), BackoffRate=2.0, JitterStrategy=NONE, MaxDelaySeconds optional (0 < value < 31622401).
- Catch algorithm (ordered scan): If there is no Retry or retries fail, Step Functions scans Catch[] in order; the first matching catcher transitions to its Next state. ResultPath controls whether the error output overwrites the input ($, the default) or is merged into it.
- Billing note: Retries count as state transitions.
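The retry timing formula can be checked with a few lines of Python. This is a sketch of the arithmetic only, not of Step Functions itself; the function name `retry_delay` and its parameter names are illustrative, and the jitter branch assumes a uniform draw over [0, Delay_k] as the notes describe.

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,
    interval_seconds: float = 1.0,
    backoff_rate: float = 2.0,
    max_delay_seconds: Optional[float] = None,
    full_jitter: bool = False,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay in seconds before retry `attempt` (1-indexed)."""
    if attempt < 1:
        raise ValueError("attempt is 1-indexed")
    # Delay_k = IntervalSeconds * BackoffRate^(k-1)
    delay = interval_seconds * backoff_rate ** (attempt - 1)
    # Optional cap, mirroring MaxDelaySeconds.
    if max_delay_seconds is not None:
        delay = min(delay, max_delay_seconds)
    # FULL jitter: pick a random delay in [0, Delay_k].
    if full_jitter:
        delay = (rng or random).uniform(0.0, delay)
    return delay


# With the defaults (IntervalSeconds=1, BackoffRate=2.0, MaxAttempts=3):
schedule = [retry_delay(k) for k in range(1, 4)]
```

With the default settings the three retries wait 1, 2, and 4 seconds, so a failing task adds about 7 seconds of waiting before the Catch scan begins.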
📖 AWS Step Functions — Quotas & Limits (numeric feasibility)
Reference Doc · source
Concrete numeric limits for Step Functions (payloads, timeouts, throttles, history size, retention)
Key content
- Name constraints (General): State machine / execution / activity task names ≤ 80 chars, unique per account+Region; must not include whitespace, wildcards (? *), brackets (< > { } [ ]), many special chars (" # % \ ^ | ~ $ & , ; : /), or control chars (\u0000-\u001f, \u007f-\u009f). Non-ASCII allowed but can break CloudWatch logging (recommend ASCII).
- Account quotas (selected):
- Registered state machines: 100,000 (increase to 150,000).
- Registered activities: 100,000 (increase to 150,000).
- State machine definition size: 1 MB (hard).
- Step Functions API max request size: 1 MB per request (hard) (includes headers + all request data).
- Open executions (Standard): 1,000,000 per account per Region (exceeding it raises ExecutionLimitExceeded); doesn’t apply to Express.
- Distributed Map: open Map Runs max 1000 (hard); parallel Map Run child executions max 10,000 (hard).
- Execution/task hard limits (Standard vs Express):
- Max execution time: Standard 1 year; Express 5 minutes.
- Max execution history size: Standard 25,000 events (hit → execution fails).
- Max idle time: Standard 1 year; Express 5 minutes.
- Execution history retention after close: Standard 90 days (can request reduction to 30 days); Express 14 days.
- Max input/output size (task/state/execution): 256 KiB UTF-8 string (both).
- HTTP Task: duration (request+response) 60 seconds (hard).
- API throttling (token bucket; per account per Region; soft/increasable), listed as bucket size / refill rate per second:
  - StartExecution (Standard): 1300/300 (us-east-1, us-west-2, eu-west-1); 800/150 (other Regions).
  - StartExecution (Express): 6000/6000 (all Regions).
  - RedriveExecution (Standard): 1300/300 (key Regions); 800/150 (others).
  - StopExecution (Standard): 1000/200 (key Regions); 500/25 (others).
  - GetActivityTask (Standard): 3000/500 (key Regions); 1500/300 (others).
- Versions/aliases: published versions 1000 per state machine; aliases 100 per state machine.
- Tagging (hard): 50 tags/resource; key 128 chars; value 256 chars; reserved prefix aws:.
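To make the bucket/refill pairs concrete, here is a generic token-bucket sketch. It is not AWS's implementation, which the notes do not describe; the class `TokenBucket` and its explicit `now` timestamps are assumptions chosen so the example is deterministic.

```python
class TokenBucket:
    """Generic token bucket: holds at most `capacity` tokens, refilled continuously."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = 0.0         # timestamp of the previous call, in seconds

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Return True if the call at time `now` is admitted, False if throttled."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# StartExecution (Standard) in a key Region: bucket 1300, refill 300 per second.
bucket = TokenBucket(capacity=1300, refill_per_second=300)
burst_admitted = sum(bucket.try_acquire(now=0.0) for _ in range(1400))
next_second_admitted = sum(bucket.try_acquire(now=1.0) for _ in range(400))
```

The bucket size bounds the burst and the refill rate bounds the sustained throughput, which is why the two numbers are quoted together.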
📖 Amazon States Language (ASL) — State machine definition skeleton + example
Reference Doc · source
Authoritative ASL workflow definition structure; example showing Task, Choice, Fail and transitions
Key content
- ASL definition (what it is): A JSON-based structured language to define a Step Functions state machine (a collection of states) that can:
  - do work with Task states,
  - branch with Choice states,
  - stop with error using Fail states, etc.
- File naming requirement (outside console): Save definitions with extension .asl.json.
- Top-level required structure (example):
  - Comment: free-text description.
  - QueryLanguage: example sets "JSONata".
  - StartAt: name of first state (example: "FirstState").
  - States: object mapping state names → state definitions.
- State transition fields (example):
  - Next: name of next state (used by Task, Choice branches).
  - End: true: marks terminal state (example: NextState).
  - Default: fallback transition for Choice (example: "DefaultState").
- Task state fields (example):
  - Type: "Task"
  - Resource: ARN (example Lambda ARN format: arn:aws:lambda:region:123456789012:function:FUNCTION_NAME)
  - Optional Assign (JSONata) to set variables (example assigns foo from $states.input.foo_input).
- Choice state fields (example):
  - Type: "Choice"
  - Choices: array of rules, each with Condition (JSONata) and Next.
- Fail state fields (example):
  - Type: "Fail", plus Error and Cause strings.
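The fields listed above fit together roughly as in the following sketch, which builds a definition as a Python dict and checks that every transition target exists. The state names ChoiceState and DefaultStateError, the `{% ... %}` JSONata delimiters, and the helper `dangling_targets` are assumptions for illustration; only the field names come from the notes.

```python
import json

LAMBDA_ARN = "arn:aws:lambda:region:123456789012:function:FUNCTION_NAME"

definition = {
    "Comment": "Sketch: Task -> Choice -> (Task | Fail)",
    "QueryLanguage": "JSONata",
    "StartAt": "FirstState",
    "States": {
        "FirstState": {
            "Type": "Task",
            "Resource": LAMBDA_ARN,
            # Assign stores a variable for later states (delimiters assumed).
            "Assign": {"foo": "{% $states.input.foo_input %}"},
            "Next": "ChoiceState",
        },
        "ChoiceState": {
            "Type": "Choice",
            "Choices": [{"Condition": "{% $foo = 1 %}", "Next": "NextState"}],
            "Default": "DefaultState",
        },
        "NextState": {"Type": "Task", "Resource": LAMBDA_ARN, "End": True},
        "DefaultState": {
            "Type": "Fail",
            "Error": "DefaultStateError",
            "Cause": "No Matches!",
        },
    },
}


def dangling_targets(defn: dict) -> set[str]:
    """Names referenced by StartAt/Next/Default/Choices that are not defined states."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        targets.update(t for t in (state.get("Next"), state.get("Default")) if t)
        targets.update(rule["Next"] for rule in state.get("Choices", []))
    return targets - set(states)


asl_json = json.dumps(definition, indent=2)  # contents for a .asl.json file
```

A check like `dangling_targets` catches only one class of mistake (misspelled state names); it is not a substitute for validating the definition with the service.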
📖 Anthropic Messages SSE Streaming Event Semantics
Reference Doc · source
SSE streaming message event structure, incremental delivery, tool/thinking deltas, and recovery patterns
Key content
- Enable streaming: set "stream": true on a Messages create request; responses arrive via Server-Sent Events (SSE).
- Event flow (canonical order):
  - message_start → contains a Message object with empty content: []
  - For each content block i in final message.content[i]: content_block_start(index=i) → 1+ content_block_delta(index=i) → content_block_stop(index=i)
  - 1+ message_delta (top-level Message changes)
  - message_stop (final)
- Usage accounting: token counts in message_delta.usage are cumulative (e.g., {"output_tokens": 15}).
- Ping events: arbitrary number of ping events may appear anywhere in the stream.
- Error events in-stream: SSE event: error with JSON like {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}} (maps to HTTP 529 in non-streaming).
- Forward compatibility: new event types may be added; clients should ignore/handle unknown event types gracefully.
- Delta types (content_block_delta.delta.type):
  - text_delta: incremental text, e.g. "text":"Hello".
  - input_json_delta: for tool_use blocks; provides partial JSON string chunks in partial_json. Accumulate chunks; parse to an object at content_block_stop. (Current models emit one complete key/value at a time, so gaps between chunks can occur.)
  - thinking_delta + signature_delta (extended thinking): thinking streams as deltas; a signature_delta arrives just before content_block_stop. If thinking.display: "omitted", no thinking deltas are sent, only the signature and then stop.
- SDK accumulation pattern: stream events but return final Message via .get_final_message() (Python) / .finalMessage() (TS); Go uses message.Accumulate(event); Java uses MessageAccumulator.create().accumulate(event); Ruby uses .accumulated_message.
- Recovery:
  - Claude 4.5 and earlier: resume by sending partial assistant content and continue.
  - Claude 4.6: add a user message: “Your previous response was interrupted and ended with [previous_response]. Continue…”
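The accumulation rule (collect deltas per block index, parse tool input only at content_block_stop) can be sketched without any SDK. The events below are hand-written dicts standing in for parsed SSE payloads; the function `accumulate`, the tool name, and the tool ID are hypothetical, and real clients would normally rely on the SDK helpers listed above.

```python
import json


def accumulate(events: list[dict]) -> list[dict]:
    """Rebuild content blocks from already-parsed stream events."""
    blocks: dict[int, dict] = {}
    buffers: dict[int, list[str]] = {}
    for event in events:
        kind = event.get("type")
        if kind == "content_block_start":
            blocks[event["index"]] = dict(event["content_block"])
            buffers[event["index"]] = []
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                buffers[event["index"]].append(delta["text"])
            elif delta["type"] == "input_json_delta":
                buffers[event["index"]].append(delta["partial_json"])
        elif kind == "content_block_stop":
            index = event["index"]
            joined = "".join(buffers.pop(index))
            if blocks[index]["type"] == "tool_use":
                # Partial JSON is only guaranteed to be parseable once the block stops.
                blocks[index]["input"] = json.loads(joined) if joined else {}
            elif blocks[index]["type"] == "text":
                blocks[index]["text"] = joined
        # ping, message_* and unknown event types are deliberately ignored.
    return [blocks[i] for i in sorted(blocks)]


events = [
    {"type": "message_start", "message": {"content": []}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "ping"},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_example", "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": "{\"city\": "}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": "\"SF\"}"}},
    {"type": "content_block_stop", "index": 1},
    {"type": "message_stop"},
]
content = accumulate(events)
```

Ignoring unrecognized event types, rather than raising, is what keeps a client like this forward compatible when new events are added.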
📖 Azure Durable Functions — Stateful serverless workflows (overview)
Reference Doc · source
Deterministic replay model context + orchestrator/activity/entity concepts; pointers to orchestrator constraints, storage providers, and monitoring.
Key content
- What Durable Functions is (definition): An extension of Azure Functions for building stateful workflows in a serverless environment by writing orchestrator, activity, and entity functions in code. Runtime manages state, checkpoints, retries, and recovery so workflows can run reliably for long periods.
- Core workflow structure (conceptual procedure):
- Orchestrator function coordinates execution.
- Activity functions perform work steps.
- Entity functions model stateful entities (for durable state patterns).
- Getting started procedure (numbered steps from doc):
- Create a new Azure Functions app using a language quickstart.
- Add an orchestrator function and one or more activity functions.
- Choose/configure a backend via Durable Functions storage providers; recommended: Durable Task Scheduler.
- Run/test locally with Azure Functions Core Tools.
- Deploy to Azure and monitor orchestration instances.
- Supported languages (table facts): Durable Functions support listed as Supported for .NET (C#), JavaScript, TypeScript, Python, PowerShell, Java (each has a “Create your first durable function” quickstart link).
- Design rationale (explicit): Runtime-managed state/checkpointing/retries/recovery enables reliable long-running workflows without manual state management.
- Key follow-up topics to consult next (links): Task hubs, HTTP features, and orchestrator code constraints (deterministic/replay constraints are referenced via link).
📖 Durable Functions External Events (Wait/Raise) API + Semantics
Reference Doc · source
API surface for WaitForExternalEvent / RaiseEvent and correlation semantics
Key content
- Purpose/use case: Orchestrator functions can wait for external events—commonly for human interaction or other external triggers (human-in-the-loop signaling).
- One-way async constraint: External events are one-way asynchronous; not suitable when the sender needs a synchronous response from the orchestrator.
- Wait API (orchestrator side):
  - Isolated worker: await context.WaitForExternalEventAsync<T>("EventName")
  - In-process: await context.WaitForExternalEvent<T>("EventName")
  - Orchestrator declares event name and expected payload type T.
  - Type conversion rule (.NET): if payload can’t be converted to T, an exception is thrown.
- Concurrency patterns:
  - Wait for any: create multiple WaitForExternalEvent* tasks and await Task.WhenAny(...).
  - Wait for all: await Task.WhenAll(gate1, gate2, gate3) before proceeding.
- Indefinite wait + lifecycle/billing:
- Waits indefinitely; app/worker can be stopped/unloaded while waiting; instance is awakened automatically when event arrives.
- Consumption Plan: no billing charges while an orchestrator is awaiting an external event task (regardless of duration).
- Raise API (client side):
  - await client.RaiseEventAsync(instanceId, eventName, eventData)
  - Event parameters: instanceId, eventName, eventData (must be JSON-serializable).
  - Correlation: eventName must match between sender and receiver.
  - Delivery mechanics: message is enqueued; if the instance isn’t currently waiting on that eventName, the event is buffered (in-memory) until the instance starts listening.
  - If no instance with instanceId exists: event is discarded.
- Reliability + dedup:
- External events have at-least-once delivery ⇒ duplicates possible (restarts/scaling/crashes).
- Best practice: include a unique ID in events for manual dedup in orchestrators.
- Storage note: MSSQL provider updates state transactionally ⇒ no duplicate risk vs Azure Storage provider, but unique IDs/names still recommended for portability.
- HTTP raise-event example: POST /runtime/webhooks/durabletask/instances/MyInstanceId/raiseEvent/Approval&code=XXX with JSON body "true".
📖 Durable Functions — Durable Timers (sleep/wait) semantics
Reference Doc · source
Durable timer API usage + semantics for long waits/timeouts (incl. human-approval wait patterns)
Key content
- Use durable timers in orchestrators (not language sleep/delay) to implement delays and timeouts in Durable Functions/Durable Task orchestrations.
- Timer creation (Eq. 1: due-time form)
  - dueTime = context.CurrentUtcDateTime + Δ (e.g., AddHours(72))
  - await context.CreateTimer(dueTime, cancellationToken)
  - Variables: context.CurrentUtcDateTime = orchestrator’s deterministic “now”; Δ = desired delay; cancellationToken controls cancellation.
- Timer creation (Eq. 2: duration form)
  - await context.CreateTimer(TimeSpan.FromHours(72), cancellationToken)
- Semantics: awaiting the timer “sleeps” the orchestrator until expiration while the orchestration can still process other incoming events during the wait.
- Underlying behavior: creating a timer for time T enqueues a message that becomes visible at T (e.g., 4:30 PM UTC). If the app scales to zero, the visible timer message triggers reactivation on an appropriate VM.
- Long-timer limits / behavior (numbers):
- JavaScript, Python, PowerShell: durable timers limited to 6 days; workaround: loop with multiple timers to simulate longer delays.
- .NET and Java (up-to-date): support arbitrarily long timers.
- Some SDK/storage-provider combos may implement ≥6-day waits as multiple shorter timers (e.g., 3-day chunks); visible in logs/history but not orchestration behavior.
- Time calculation rule: don’t use built-in date/time APIs; always use orchestration context time (context.CurrentUtcDateTime, ctx.current_utc_datetime, context.CurrentUtcDateTime in JS).
- Timeout pattern (procedure):
  - Start activity task: activityTask = CallActivityAsync(...)
  - Start timer: timeoutTask = CreateTimer(deadline, cts.Token)
  - winner = await Task.WhenAny(activityTask, timeoutTask)
  - If activity wins: cts.Cancel() (cancels timer); otherwise handle the timeout.
- Cancellation requirement: if you create timers you won’t await, cancel them; orchestration won’t reach “Completed” until all outstanding tasks (incl. timers) are completed or canceled.
- Consumption plan default: abandoned activities still run/bill; default function timeout 5 minutes (configurable).
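For the 6-day timer limit, the "loop with multiple timers" workaround comes down to splitting one long delay into chunks. The sketch below shows only that arithmetic in plain Python; `split_delay` and `MAX_TIMER` are hypothetical names, and inside a real orchestrator each chunk would be awaited as its own durable timer with a due time computed from the orchestration context clock.

```python
from datetime import timedelta

MAX_TIMER = timedelta(days=6)  # limit quoted for JavaScript, Python, PowerShell


def split_delay(total: timedelta, max_chunk: timedelta = MAX_TIMER) -> list[timedelta]:
    """Split a long delay into consecutive chunks no longer than max_chunk."""
    if total < timedelta(0):
        raise ValueError("delay must be non-negative")
    chunks = []
    remaining = total
    while remaining > timedelta(0):
        step = min(remaining, max_chunk)
        chunks.append(step)
        remaining -= step
    return chunks


chunks = split_delay(timedelta(days=15))
```

A 15-day wait becomes two 6-day timers followed by a 3-day timer, and the orchestration behaves as if it had waited once.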
📖 Evals API — Monitoring stored completions for regressions
Reference Doc · source
Production-ish workflow: log stored completions, create eval + runs, detect prompt regressions, iterate across prompt/model versions.
Key content
- Logging for later evals (production observability)
  - Set store=True on client.chat.completions.create(...) to log requests/responses for later evaluation.
  - Alternative: enable org-wide logging “on by default” in admin data controls: platform.openai.com/settings/organization/data-controls/data-retention.
  - Use metadata to segment use-cases and versions, e.g. metadata={"prompt_version": "v1", "usecase": "push_notifications_summarizer"}.
- Evals structure (configuration vs execution)
  - Eval = shared configuration: data_source_config + testing_criteria.
  - Run = an execution of an Eval over a specific data source slice (e.g., prompt version), producing a report URL.
- Data source config (stored completions)
  - data_source_config = {"type":"stored_completions","metadata":{"usecase":"push_notifications_summarizer"}}
  - Variables exposed to graders:
    - {{item.input}} = messages sent to the completion call
    - {{sample.output_text}} = assistant response text
- Testing criteria (LLM-as-judge label grader)
  - Grader type: "type": "label_model", model: "o3-mini".
  - Labels: ["correct","incorrect"]; passing: ["correct"].
  - Grader prompt judges whether summary is “concise and snappy”.
- Run creation patterns (regression detection)
  - Compare prompt versions by filtering stored completions:
    - Run v1: metadata={"prompt_version":"v1"}
    - Run v2: metadata={"prompt_version":"v2"}
  - Generate new completions for a different model using stored inputs:
    - input_messages={"type":"item_reference","item_reference":"item.input"}
    - model="gpt-4o" (vs stored gpt-4o-mini)
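The configuration pieces above can be assembled as plain dictionaries before anything is sent to the API. This sketch makes no network calls. Several keys are assumptions not stated in the notes and should be checked against the Evals API reference: the grader's name, input, and passing_labels fields, the outer "completions"/"source" wrapper in `run_data_source`, and the grader prompt wording.

```python
from typing import Optional

USECASE = "push_notifications_summarizer"

# Shared Eval configuration: which stored completions the eval reads.
data_source_config = {
    "type": "stored_completions",
    "metadata": {"usecase": USECASE},
}

# LLM-as-judge label grader (name, input, passing_labels are assumed field names).
label_grader = {
    "name": "summary-quality",  # hypothetical grader name
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {"role": "developer",
         "content": "Label the summary correct if it is concise and snappy, else incorrect."},
        {"role": "user",
         "content": "Notifications: {{item.input}}\nSummary: {{sample.output_text}}"},
    ],
    "labels": ["correct", "incorrect"],
    "passing_labels": ["correct"],
}


def run_data_source(prompt_version: str, model: Optional[str] = None) -> dict:
    """Data source for one run: filter by prompt version, optionally regenerate with `model`."""
    source = {
        "type": "completions",  # assumed wrapper type
        "source": {
            "type": "stored_completions",
            "metadata": {"prompt_version": prompt_version},
        },
    }
    if model is not None:
        # Reuse the stored inputs but sample fresh outputs from a different model.
        source["input_messages"] = {"type": "item_reference", "item_reference": "item.input"}
        source["model"] = model
    return source


v1_run = run_data_source("v1")
v2_run = run_data_source("v2")
gpt4o_run = run_data_source("v2", model="gpt-4o")
```

Keeping the Eval configuration fixed while varying only the run's data source is what makes the v1 and v2 pass rates comparable.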
📖 FastAPI wiring for LangChain/LangGraph event streaming (index)
Reference Doc · source
Community discussion hub pointing to FastAPI + “events stream API” plumbing patterns and related official how-to guides (stream runnables, debug apps, inspect runnables, stream events from tools), plus LangSmith for tracing/observability.
Key content
- Primary use: A navigation/index-style discussion page for streaming + observability topics across LangChain/LangGraph/LangSmith.
- Relevant procedures (as linked how-to areas):
- Streaming runnables (LCEL/Runnable protocol): guidance on streaming outputs back to clients (token streaming) and runtime configuration of runnable behavior.
- Inspecting/debugging: “How to: inspect runnables” and “How to: debug your LLM apps” (intermediate state inspection / debugging workflow).
- Tool event streaming: “How to: stream events from a tool” (step/event streaming hooks).
- Async/callback environments: “How to: use callbacks in async environments” and “dispatch custom callback events” (observability/event emission patterns).
- Observability rationale: Recommends LangSmith as the platform to trace, monitor, evaluate, and deploy agents; emphasizes tracing as vital for diagnosing issues and inspecting step-level execution.
- No equations / no numeric benchmarks / no explicit default hyperparameters are provided in the captured text; it functions as a pointer to the concrete implementations elsewhere.
📖 LangGraph streaming + runtime config + persistence (thread_id)
Reference Doc · source
Concrete config placement patterns + stream_mode="events" usage context; thread-level persistence/checkpointing patterns
Key content
- Minimal StateGraph compile/invoke pattern (hello world):

      from langgraph.graph import StateGraph, MessagesState, START, END

      def mock_llm(state: MessagesState):
          return {"messages": [{"role": "ai", "content": "hello world"}]}

      graph = StateGraph(MessagesState)
      graph.add_node(mock_llm)
      graph.add_edge(START, "mock_llm")
      graph.add_edge("mock_llm", END)
      graph = graph.compile()

      graph.invoke({"messages": [{"role": "user", "content": "hi!"}]})

  - Procedure: define node(s) → connect edges START → node → END → compile() → invoke(input_state).
- Prebuilt agent invocation pattern (ReAct-style):

      from langgraph.prebuilt import create_react_agent

      def get_weather(city: str) -> str:
          return f"It's always sunny in {city}!"

      agent = create_react_agent(
          model="anthropic:claude-3-7-sonnet-latest",
          tools=[get_weather],
          prompt="You are a helpful assistant",
      )

      agent.invoke({"messages": [{"role": "user", "content": "what is the weather in sf"}]})

  - Defaults/parameters shown: model="anthropic:claude-3-7-sonnet-latest", tools=[...], prompt=....
- State update rule for message history (add_messages reducer):
  - Merges left (existing messages) and right (new messages).
  - If IDs match: message in right replaces the one in left.
  - Else: messages from right are appended (append-only history).
- Design rationale (observability/streaming): LangGraph emphasizes durable execution + streaming + debugging/observability (LangSmith) for tracing execution paths and state transitions.
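The merge rule described for the add_messages reducer can be illustrated in plain Python. This is a re-implementation sketch over dict messages, not LangGraph's code: the function name `merge_messages` is hypothetical, and the real reducer handles more cases (message objects, ID assignment, removals).

```python
def merge_messages(left: list[dict], right: list[dict]) -> list[dict]:
    """Merge new messages into history: replace on matching ID, otherwise append."""
    merged = list(left)  # copy so the existing history is not mutated
    position = {m["id"]: i for i, m in enumerate(merged) if m.get("id") is not None}
    for message in right:
        message_id = message.get("id")
        if message_id is not None and message_id in position:
            merged[position[message_id]] = message  # same ID: replace in place
        else:
            if message_id is not None:
                position[message_id] = len(merged)
            merged.append(message)  # new ID (or no ID): append
    return merged


history = [
    {"id": "1", "role": "user", "content": "hi"},
    {"id": "2", "role": "ai", "content": "helo"},
]
update = [
    {"id": "2", "role": "ai", "content": "hello"},   # corrects message 2
    {"id": "3", "role": "user", "content": "thanks"},
]
merged = merge_messages(history, update)
```

Replacing by ID is what lets a node correct or overwrite an earlier message without duplicating it in the history.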
📖 LangSmith Datasets — Versioning, Splits, Filtering, Eval Inputs
Reference Doc · source
Dataset creation/versioning primitives + identifiers (versions/tags/splits/examples) used as inputs to eval automation.
Key content
- Core objects
  - Dataset = collection of examples for repeatable evaluation.
  - Example structure:
    - inputs: dict (passed to app)
    - reference_outputs: dict (optional; used for evaluation, not passed to app)
    - metadata: dict (optional; enables filtered views)
- Dataset versioning (default = timestamp)
  - Any add/update/delete of examples ⇒ new dataset version created automatically.
  - UI: Examples tab shows latest by default; selecting a past version (by timestamp) shows the dataset state at that time; examples are read-only in past versions.
  - Tests tab shows experiments across versions (latest shown in Examples; experiments from all versions shown in Tests).
- Tagging versions (human-readable milestones)
  - UI: + Tag this version (Examples tab).
  - SDK (Python): client.update_dataset_tag(dataset_name=..., as_of=<timestamp>, tag="prod")
  - Rationale: stable named versions (e.g., "prod") for CI / regression testing.
- Evaluate on a specific version / view
  - Fetch examples for a version via list_examples(dataset_name=..., as_of="latest" | <tag> | <timestamp>), then pass the iterable to evaluate/aevaluate(data=...).
- Evaluate on filtered/split subsets
  - Filter by metadata: list_examples(dataset_name=..., metadata={"desired_key":"desired_value"})
  - Evaluate on splits: list_examples(dataset_name=..., splits=["test","training"])
- UI workflows to build datasets
  - Add traces → dataset: from Tracing Projects, multi-select runs → Add to Dataset, or open run → Add to → Dataset.
  - Annotation queue: review/edit run → Add to Dataset (hotkey D); edits + run metadata carry over.
  - Playground: Set up Evaluation → select/create dataset → +Row; note: inline creation doesn’t support nested keys.
📖 LangSmith REST API — Run evals (API-only)
Reference Doc · source
LangSmith REST API endpoints + request/response patterns to run experiments/evals without SDKs (auth headers, dataset/session/run/feedback schema)
Key content
- Auth (all requests): HTTP header x-api-key: $LANGSMITH_API_KEY.
- Core workflow (single experiment/session):
  - Fetch dataset examples (filter by dataset id): GET https://api.smith.langchain.com/api/v1/examples with query dataset=<dataset_id>.
  - Create experiment = tracer session (ties runs to dataset): POST /api/v1/sessions with JSON:
    - start_time (ISO8601 UTC), reference_dataset_id (string)
    - optional: name, description, extra.metadata
    - Response includes id = experiment_id.
  - Create runs for each example (you must do parent/child + linking): POST /api/v1/runs with JSON fields:
    - id (uuid hex), name, run_type (e.g., "chain", "llm")
    - inputs (object), start_time (ISO8601 UTC)
    - Required for experiments: reference_example_id (example id), session_id (experiment id)
    - optional: parent_run_id (to form hierarchy).
  - Update/close runs with outputs: PATCH /api/v1/runs/{run_id} with JSON: outputs (object), end_time (ISO8601 UTC).
  - Close experiment/session: PATCH /api/v1/sessions/{session_id} with JSON: end_time (ISO8601 UTC).
- Add evaluation feedback (scoring):
  - Query root runs: POST /api/v1/runs/query with JSON: session: [experiment_id], is_root: true, select: ["id","reference_example_id","outputs"].
  - Create feedback: POST /api/v1/feedback with JSON: run_id, key (e.g., "correctness"), score (e.g., 1.0/0.0), optional comment.
- Pairwise/comparative experiments:
  - Create: POST /api/v1/datasets/comparative with JSON: experiment_ids (list), reference_dataset_id, name, optional description, extra.metadata.
  - Fetch: GET /api/v1/datasets/{dataset_id}/comparative with id=<comparative_experiment_id>.
  - Rank via feedback: POST /api/v1/feedback with key: "ranked_preference", score (1 preferred else 0), plus feedback_group_id and comparative_experiment_id.
📖 LangSmith/LangGraph Streaming API (runs + threads)
Reference Doc · source
Official streaming primitives/modes and how streamed outputs map to runs/threads for tracing & observability
Key content
- Core workflow (run streaming):
  - client = get_client(url=<DEPLOYMENT_URL>, api_key=<API_KEY>)
  - (Stateful) thread = await client.threads.create() → thread_id = thread["thread_id"]
  - Stream a run: async for chunk in client.runs.stream(thread_id, assistant_id, input=inputs, stream_mode="updates"): print(chunk.data)
- Stateless run: pass None instead of thread_id to avoid persisting outputs in the checkpointer DB: client.runs.stream(None, assistant_id, input=inputs, stream_mode="updates")
- Stream modes (run streaming):
  - values: full graph state after each super-step (.stream()/.astream() with stream_mode="values")
  - updates: state updates after each step; if multiple updates in same step (e.g., multiple nodes), streamed separately
  - messages-tuple: token-by-token LLM output + metadata (for chat UIs)
  - debug: “as much information as possible” incl. node name + full state
  - custom: user-defined streamed data from inside graph
  - events: all events (incl. state); mainly for migrating large LCEL apps (.astream_events())
- Multi-mode streaming: stream_mode=["updates","custom"] → outputs are tuples (mode, chunk).
- Subgraph streaming: set stream_subgraphs=True to include parent + subgraph outputs.
- Token streaming shape (messages-tuple): chunk.data == (message_chunk, metadata); the example skips chunks where chunk.event != "messages" and prints message_chunk["content"]. Metadata includes node/LLM invocation details (e.g., langgraph_node).
- Join existing run: client.runs.join_stream(thread_id, run_id); outputs are not buffered, so output emitted before joining is missed.
- Thread streaming vs run streaming (comparison table):
  - Methods: client.threads.join_stream() vs client.runs.stream()
  - REST: GET /threads/{thread_id}/stream vs POST /threads/{thread_id}/runs/stream
  - Scope: all runs on thread vs single run; lifetime: indefinite vs closes on completion; creates run: no vs yes.
- Thread stream modes: run_modes (default; equivalent to run stream output), lifecycle (only run start/end). Example: stream_mode=["lifecycle","state_update"].
- Resumability (thread streams): use Last-Event-ID / last_event_id="<LAST_EVENT_ID>"; pass "-" to replay from beginning.
📖 OTel Tracing SDK essentials (sampling + processors/exporters)
Reference Doc · source
Canonical SDK semantics for span recording vs sampling, parent-based sampling, span processors/exporters, defaults.
Key content
- Two gating signals for data flow
  - Span.IsRecording (bool): if false, span discards attributes/events/status; SpanProcessors MUST receive only spans with IsRecording=true.
  - SpanContext.TraceFlags.Sampled (bool): propagated to children; indicates span will be exported; SpanExporters MUST receive spans only when Sampled=true.
  - Forbidden combo: Sampled=true & IsRecording=false MUST NOT be allowed (would create trace gaps).
- Recording/Sampled reaction table
  - IsRecording=true, Sampled=true → Processor: yes; Exporter: yes
  - IsRecording=true, Sampled=false → Processor: yes; Exporter: no
  - IsRecording=false, Sampled=false → Processor: no; Exporter: no
- SDK span creation procedure (ordered)
  - Use parent trace ID if valid else generate new trace ID (before sampling).
  - Call Sampler.ShouldSample(...).
  - Generate new span ID regardless of sampling decision.
  - Create recording/non-recording span per decision (DROP, RECORD_ONLY, RECORD_AND_SAMPLE).
- Sampler API
  - ShouldSample(parentContext, traceId, name, kind, attributes, links) -> SamplingResult
  - Decisions: DROP (IsRecording=false), RECORD_ONLY (IsRecording=true, Sampled=false), RECORD_AND_SAMPLE (IsRecording=true, Sampled=true).
- Built-in sampler defaults
  - Default sampler: ParentBased(root=AlwaysOn).
  - ParentBased routing (defaults): remote/local parent sampled → AlwaysOn; not sampled → AlwaysOff.
- BatchSpanProcessor defaults
  - maxQueueSize=2048, scheduledDelayMillis=5000, exportTimeoutMillis=30000, maxExportBatchSize=512 (≤ queue).
- Span limits defaults
  - EventCountLimit=128, LinkCountLimit=128, AttributePerEventCountLimit=128, AttributePerLinkCountLimit=128.
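The decision-to-flags mapping and the reaction table can be expressed as a small lookup. This sketch models the rules in the notes with plain Python and does not use the OpenTelemetry SDK; the names `Decision`, `span_flags`, `routing`, and `parent_based` are illustrative, and `parent_based` covers only the default ParentBased(root=AlwaysOn) configuration.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    DROP = "DROP"
    RECORD_ONLY = "RECORD_ONLY"
    RECORD_AND_SAMPLE = "RECORD_AND_SAMPLE"


def span_flags(decision: Decision) -> tuple[bool, bool]:
    """Map a sampling decision to (IsRecording, Sampled)."""
    is_recording = decision is not Decision.DROP
    sampled = decision is Decision.RECORD_AND_SAMPLE
    return is_recording, sampled


def routing(decision: Decision) -> dict[str, bool]:
    """Which pipeline stages see the span, per the reaction table."""
    is_recording, sampled = span_flags(decision)
    return {"processor": is_recording, "exporter": is_recording and sampled}


def parent_based(parent_sampled: Optional[bool]) -> Decision:
    """Default ParentBased(root=AlwaysOn): None means there is no parent (root span)."""
    if parent_sampled is None:
        return Decision.RECORD_AND_SAMPLE  # root sampler is AlwaysOn
    return Decision.RECORD_AND_SAMPLE if parent_sampled else Decision.DROP


table = {d.name: routing(d) for d in Decision}
```

Because Sampled is derived only from RECORD_AND_SAMPLE, the forbidden combination (Sampled=true with IsRecording=false) cannot be produced by this mapping.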
📖 Okapi BM25 (Probabilistic Relevance Framework)
Reference Doc · source
BM25 scoring formula + parameter meanings (k1, b), IDF term, length normalization
Key content
- BM25 ranking score (Eq. BM25):
  \[ RSV_{BM25}(d,q)=\sum_{i\in q} \log\frac{N}{df_i}\cdot \frac{(k_1+1)\,tf_i}{k_1\left((1-b)+b\frac{dl}{avdl}\right)+tf_i} \]
  - \(N\): number of documents in collection
  - \(df_i\): document frequency of term \(i\)
  - \(tf_i\): term frequency of term \(i\) in document \(d\)
  - \(dl\): document length (often \(dl=\sum_{i\in V} tf_i\))
  - \(avdl\): average document length in collection
  - \(k_1\): term-frequency saturation control
  - \(b\in[0,1]\): length normalization strength
- Length normalization component (Eq. B): \[ B=(1-b)+b\frac{dl}{avdl} \] and normalized term frequency \(tf'=tf/B\).
- Parameter interpretations + defaults:
  - \(k_1=0\) → binary model; large \(k_1\) → approaches raw \(tf\).
  - \(b=0\) → no length norm; \(b=1\) → full relative-frequency scaling.
  - Typical settings: \(k_1\approx 1.2\text{–}2\), \(b\approx 0.75\).
- Design rationale: BM25 approximates a probabilistic “2-Poisson/eliteness” view with a saturating tf curve (bounded contribution vs unbounded tf-idf), plus partial length normalization to balance verbosity vs scope.
- Empirical comparison (machine learning query example, \(k_1=2\)):
  - doc1: learning=1024, machine=1 → BM25: \(7\cdot3 + 10\cdot1 = 31\)
  - doc2: learning=16, machine=8 → BM25: \(7\cdot2.67 + 10\cdot2.4 = 42.7\)
  - (tf-idf ranks doc1 higher: 87 vs 75)
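The worked comparison can be reproduced in a few lines. The sketch assumes what the example's arithmetic implies but does not state: term weights of 7 (learning) and 10 (machine) in place of computed IDF values, and dl = avdl so that the normalization factor B equals 1. The function names are illustrative.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """IDF term from the formula: log(N / df_i)."""
    return math.log(n_docs / df)


def bm25_term(tf: float, k1: float = 1.2, b: float = 0.75,
              dl: float = 1.0, avdl: float = 1.0) -> float:
    """Saturating tf component: (k1 + 1) * tf / (k1 * B + tf)."""
    norm = (1.0 - b) + b * dl / avdl  # B, the length normalization factor
    return (k1 + 1.0) * tf / (k1 * norm + tf)


def bm25_score(tfs: dict[str, float], weights: dict[str, float],
               k1: float = 1.2, b: float = 0.75,
               dl: float = 1.0, avdl: float = 1.0) -> float:
    """Sum of per-term weight times saturated tf over the query terms."""
    return sum(weights[t] * bm25_term(tf, k1, b, dl, avdl) for t, tf in tfs.items())


# Term weights taken from the example (assumed to stand in for the IDF values).
weights = {"learning": 7.0, "machine": 10.0}
doc1 = bm25_score({"learning": 1024, "machine": 1}, weights, k1=2.0)
doc2 = bm25_score({"learning": 16, "machine": 8}, weights, k1=2.0)
```

Here doc1 comes out near 31 and doc2 near 42.7: once tf saturates, 1024 occurrences of one term count for little more than 16, so the document covering both query terms wins.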
📖 OpenAI API Streaming (Responses + Events)
Reference Doc · source
Parameter-level reference for enabling streaming + event framing/lifecycle
Key content
- Default behavior: API returns the model’s entire output in one HTTP response (non-streaming). Streaming reduces perceived latency by sending partial output as it’s generated.
- Enable streaming (Responses endpoint): set stream: true (JS) / stream=True (Python) in client.responses.create(...). Procedure:
  - Call responses.create(..., stream=true)
  - Iterate events (for await (const event of stream) / for event in stream)
  - Route by event.type (SDK events are typed; the type property identifies the schema).
- Streaming model: Responses API streams semantic, typed events (type-safe). Example union includes: response.created, response.in_progress, response.failed, response.completed, response.output_text.delta, response.output_text.done, error, plus tool-related deltas (e.g., response.function_call_arguments.delta/done, file search and code interpreter progress events).
- Common text-stream events to listen for:
  - response.created (once)
  - response.output_text.delta (many; incremental text)
  - response.completed (once; end-of-stream)
  - error
- Chat Completions streaming: also supports stream=True, returning data-only SSE chunks; iterate chunks and read chunk.choices[0].delta.
- Design rationale: OpenAI recommends Responses API for streaming because it’s “designed with streaming in mind” and uses semantic, type-safe events.
- Production constraint: Moderation risk—streaming partial output is harder to moderate; partial completions may be difficult to evaluate.
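Routing by event type can be sketched with plain dicts standing in for the SDK's typed events. The function `collect_text` and the sample events are hypothetical, and reading the text from a `delta` field is an assumption about the response.output_text.delta payload; nothing here calls the API.

```python
def collect_text(events: list[dict]) -> tuple[str, str]:
    """Concatenate text deltas and report how the stream ended."""
    parts: list[str] = []
    status = "open"  # stream ended without a terminal event
    for event in events:
        kind = event["type"]
        if kind == "response.output_text.delta":
            parts.append(event["delta"])
        elif kind == "response.completed":
            status = "completed"
            break
        elif kind in ("response.failed", "error"):
            status = "failed"
            break
        # Other event types (created, in_progress, tool deltas) are skipped here.
    return "".join(parts), status


events = [
    {"type": "response.created"},
    {"type": "response.in_progress"},
    {"type": "response.output_text.delta", "delta": "Hello"},
    {"type": "response.output_text.delta", "delta": ", world"},
    {"type": "response.completed"},
]
text, status = collect_text(events)
```

Because the text only becomes final at response.completed, any moderation or evaluation step applied mid-stream sees a partial answer, which is the production constraint noted above.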
📖 Reliable streaming + efficient state management (LangGraph)
Reference Doc · source
Release-level guarantees/behavior changes for “reliable streaming” + “efficient state management”; recommended streaming/state patterns.
Key content
- Release guarantees / behavior changes (LangGraph API/Cloud):
  - Streaming runs now use the same job queue as background runs → greater reliability while keeping low-latency real-time output.
  - New streaming endpoint: `GET /threads/{thread_id}/runs/{run_id}/stream` and SDK: `client.runs.join_stream()` → stream output from any run, including background runs (supports UX where user leaves/returns and streaming continues).
  - Final state retrieval now reliable: `GET /threads/{thread_id}/runs/{run_id}/join` and SDK: `client.runs.join()` → reliably returns final state values whether run is ongoing or finished.
  - Thread status expanded: `GET /threads/{id}` / `client.threads.get()` now includes `error` and `interrupted` (in addition to existing `idle`, `busy`).
  - Streamlined state retrieval: `GET /threads/{id}` and `GET /threads` now include latest state values (fewer API calls; no separate “get state”).
  - Advanced search: `POST /threads/search` / `client.threads.search()` can filter by thread state values + status (enables “agent inbox” UIs).
- Streaming procedures (graph runtime):
  - Use `graph.stream(...)` / `graph.astream(...)` with `stream_mode` and `version="v2"` for unified StreamPart format.
  - Stream modes (table):
    - `values`: full state snapshot after each step
    - `updates`: only changed keys; multiple updates in same step streamed separately
    - `messages`: `(message_chunk, metadata)` from LLM calls (emitted even if model invoked via `.invoke`)
    - `custom`: arbitrary events via `get_stream_writer()` / injected `writer` arg
    - `checkpoints`: checkpoint events (same format as `get_state()`; requires checkpointer)
    - `debug`: “as much info as possible” incl. node name + full state
  - Subgraph streaming: pass `subgraphs=True`; streamed parts include `ns` namespace to distinguish root vs subgraph.
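The difference between the `values` and `updates` modes is easiest to see on a toy pipeline. The sketch below does not use LangGraph; `toy_stream` is a hypothetical helper that only mimics the two behaviours described above (full snapshot per step versus changed keys only), and the `{node_name: update}` shape is an assumption made for illustration.

```python
from typing import Callable, Dict, Iterator, List, Tuple

State = Dict[str, object]
Node = Callable[[State], State]


def toy_stream(nodes: List[Tuple[str, Node]], state: State, stream_mode: str) -> Iterator[object]:
    """Run nodes in order and yield one item per step in the chosen mode."""
    state = dict(state)  # do not mutate the caller's dict
    for name, node in nodes:
        update = node(state)  # a node returns only the keys it changed
        state.update(update)
        if stream_mode == "values":
            yield dict(state)      # full state snapshot after the step
        elif stream_mode == "updates":
            yield {name: update}   # only the changed keys, labelled by node
        else:
            raise ValueError(f"unsupported stream_mode: {stream_mode}")


nodes = [
    ("rewrite", lambda s: {"query": str(s["query"]).strip()}),
    ("answer", lambda s: {"answer": "re: " + str(s["query"])}),
]
values = list(toy_stream(nodes, {"query": " hi "}, "values"))
updates = list(toy_stream(nodes, {"query": " hi "}, "updates"))
print(values)
print(updates)
```

`values` is the simpler choice when the consumer re-renders the whole state; `updates` keeps payloads small when the state is large and only a few keys change per step.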
📖 Responses API SSE Streaming Events (event types + payload fields)
Reference Doc · source
Exact streaming (SSE/WebSocket) event names + object fields for incremental output, tool-call deltas, and lifecycle/error events in the Responses API.
Key content
- Streaming model: server emits a sequence of ResponseStreamEvent objects (also called ResponsesServerEvent for WebSocket). Each event includes:
  - `type` (event name discriminator)
  - often `sequence_number`
  - often `output_index`, `item_id`, and sometimes `content_index` for locating the delta within the response.
- Lifecycle/status events (ResponseStatus): `queued`, `in_progress`, `completed`, `failed`, `cancelled`, `incomplete`.
  - `response.created`: ResponseCreatedEvent `{ type, sequence_number, response }`
  - `response.queued`: ResponseQueuedEvent `{ type, sequence_number, response }`
  - `response.in_progress`: ResponseInProgressEvent `{ type, sequence_number, response }`
  - `response.completed`: ResponseCompletedEvent `{ type, sequence_number, response }`
  - `response.failed`: ResponseFailedEvent `{ type, sequence_number, response }`
  - `response.incomplete`: ResponseIncompleteEvent `{ type, sequence_number, response }`
  - `response.error`: ResponseErrorEvent `{ type, code, message, param, … }`
- Incremental text output:
  - `response.output_text.delta`: ResponseTextDeltaEvent `{ type, sequence_number, output_index, item_id, content_index, delta }`
  - `response.output_text.done`: ResponseTextDoneEvent `{ type, sequence_number, output_index, item_id, content_index, logprobs, … }`
  - Content-part boundaries: ResponseContentPartAddedEvent, ResponseContentPartDoneEvent (include `output_index`, `item_id`, `content_index`, …).
  - Output-item boundaries: ResponseOutputItemAddedEvent, ResponseOutputItemDoneEvent `{ type, sequence_number, output_index, item }`.
- Tool-call streaming (arguments/code/input deltas + done):
  - Function calls: ResponseFunctionCallArgumentsDeltaEvent `{ delta, item_id, output_index, … }`; …DoneEvent `{ arguments, name, item_id, output_index, … }`
  - Custom tool input: ResponseCustomToolCallInputDeltaEvent / …DoneEvent (`delta` → final `input`)
  - MCP tool args: ResponseMcpCallArgumentsDeltaEvent / …DoneEvent (`delta` → final `arguments`)
  - Code interpreter code: ResponseCodeInterpreterCallCodeDeltaEvent / …DoneEvent (`delta` → final `code`) plus state events: …InProgress, …Interpreting, …Completed
  - Search tools: web/file search state events …InProgress, …Searching, …Completed
  - Image generation: …InProgress, …Generating, …PartialImage (`partial_image_b64`), …Completed
- Audio streaming: ResponseAudioDeltaEvent (`delta`), ResponseAudioDoneEvent; transcript: ResponseAudioTranscriptDeltaEvent (`delta`), …DoneEvent.
- Refusals & reasoning summaries: refusal delta/done (ResponseRefusalDeltaEvent, ResponseRefusalDoneEvent); reasoning summary part/text delta/done events.
- Include extra data via `include[]` (ResponseIncludable): `web_search_call.action.sources`, `web_search_call.results`, `file_search_call.results`, `code_interpreter_call.outputs`, `computer_call_output.output.image_url`, `message.input_image.image_url`, `message.output_text.logprobs`, `reasoning.encrypted_content`.
- Text output formatting defaults: `ResponseTextConfig.format` default is `{ "type": "text" }`. Structured Outputs via `{ "type": "json_schema" }` (preferred over `{ "type": "json_object" }` for newer models).
📖 Stream only the final node’s output (LangGraph streamEvents)
Reference Doc · source
Event filtering/selection patterns (node-level filtering) for streaming/debugging
Key content
- Problem: In a multi-node LangGraph (e.g., RAG graph with a query-rewrite node then a generation node), the user wants to stream tokens only from the last/generation node, not earlier nodes.
- Baseline streaming loop (JS):
  - Create event stream: `const eventStream = await graph.streamEvents(inputs, config);`
  - Consume events: `for await (const { event, data } of eventStream) { ... }`
  - Token streaming event type used in example: `event === "on_chat_model_stream"`
  - Accumulate streamed text when chunk content is a string: `if (typeof data.chunk.content === "string") result += data.chunk.content;`
- Key filtering mechanism (design rationale):
  - Events include metadata that can identify which node produced the event (“metadata containing information about the node that it’s within”).
  - A common practice is to use tagging to narrow which events are published/handled, instead of maintaining a manual `currentNode` state variable.
- Canonical procedure reference: Maintainer points to an official how-to demonstrating streaming outputs from the final node (Python example): https://langchain-ai.github.io/langgraph/how-tos/streaming-from-final-node/#stream-outputs-from-the-final-node
  (Use this for the concrete pattern; this issue establishes that node metadata/tags are the intended approach.)
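The metadata-based filter can be sketched over simulated events. This is an illustration under stated assumptions, not the code from the linked how-to: events are modelled as plain dicts, the `langgraph_node` metadata key is an assumption about how the producing node is labelled, and `final_node_text` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def final_node_text(events: Iterable[Dict], final_node: str) -> str:
    """Keep only token chunks that the metadata attributes to `final_node`."""
    chunks: List[str] = []
    for ev in events:
        if ev.get("event") != "on_chat_model_stream":
            continue  # not a token-streaming event
        # Assumed metadata key; check the real event payload before relying on it.
        if ev.get("metadata", {}).get("langgraph_node") != final_node:
            continue  # produced by an earlier node, e.g. query rewrite
        content = ev["data"]["chunk"]["content"]
        if isinstance(content, str):
            chunks.append(content)
    return "".join(chunks)


# Simulated events from a two-node graph (rewrite, then generate).
events = [
    {"event": "on_chat_model_stream", "metadata": {"langgraph_node": "rewrite"},
     "data": {"chunk": {"content": "better query"}}},
    {"event": "on_chain_end", "metadata": {"langgraph_node": "rewrite"}, "data": {}},
    {"event": "on_chat_model_stream", "metadata": {"langgraph_node": "generate"},
     "data": {"chunk": {"content": "The "}}},
    {"event": "on_chat_model_stream", "metadata": {"langgraph_node": "generate"},
     "data": {"chunk": {"content": "answer."}}},
]
answer = final_node_text(events, "generate")
print(answer)  # The answer.
```

Filtering on metadata keeps the consumer stateless, which is the reason the notes prefer it over tracking a `currentNode` variable by hand.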
📖 Structured outputs / JSON mode (doc index only)
Reference Doc · source
Enforcing JSON outputs via response_format / JSON mode; schema/format constraints; failure modes & guardrails.
Key content
- This fetch contains no structured-output guidance. The target URL returns HTTP 404: Not Found and displays a “Page not found” screen.
- Available actionable items are navigation pointers to the current docs locations:
  - “Structured output” guide: https://platform.openai.com/api/docs/guides/structured-outputs
  - “Function calling” guide: https://platform.openai.com/api/docs/guides/function-calling
  - “Responses API” migration guide: https://platform.openai.com/api/docs/guides/migrate-to-responses
  - “Using tools” guide: https://platform.openai.com/api/docs/guides/tools
- No equations, parameters, schemas, or step-by-step procedures for `response_format` / JSON mode appear in the provided text (only site navigation and doc section listings).
- Design rationale / defaults / failure modes: not present in this excerpt; consult the linked “Structured output” guide above for the authoritative details.
📖 Temporal Activity Operations (Pause/Unpause/Reset/Update Options)
Reference Doc · source
Operational controls for Activity Executions + effects on retries/timeouts/heartbeats + observability limits
Key content
- Scope/availability
  - Applies to Activity Executions (not lifecycle behaviors). Not for Local or Standalone Activities.
  - Public Preview; available in Server v1.28.0+; self-hosted UI requires v2.47.0+.
  - Not available as SDK client methods; use CLI/UI/gRPC.
- Pause (`temporal activity pause`)
  - Stops server-side scheduling of new retries; parent Workflow keeps running (Signals/Queries/Updates unaffected).
  - Heartbeat semantics: with Heartbeat → interrupted on next Heartbeat (SDK raises pause-specific error); without Heartbeat → continues to completion; if it fails, no retry scheduled.
  - Does not stop/extend Schedule-To-Close timeout; may still time out → use update-options to adjust.
  - Idempotent; pausing completed Activity errors.
- Unpause (`temporal activity unpause`)
  - Reschedules immediately; discards remaining retry backoff.
  - Attempts + Heartbeat data preserved by default; optional `--reset-attempts`, `--reset-heartbeats`.
  - Doesn’t override Workflow Pause; both must be unpaused.
- Reset (`temporal activity reset`)
  - Clears retry state: attempt resets to 1, backoff discarded, rescheduled immediately.
  - If paused, Reset also unpauses unless `--keep-paused`.
  - Heartbeat: with Heartbeat → interrupted on next Heartbeat (reset-specific error); without Heartbeat → no interruption/concurrent run; if attempt>1, service rejects current result due to attempt mismatch; new execution after Start-To-Close expires.
  - `--restore-original-options` reverts timeouts/Retry Policy/Task Queue to original.
- Update Options (`temporal activity update-options`)
  - Change timeouts (Schedule-To-Close, Start-To-Close, Schedule-To-Start, Heartbeat), Retry Policy (initial interval, max interval, backoff coefficient, max attempts), Task Queue.
  - If waiting to retry → takes effect immediately (retry timer regenerated). If running → stored for next execution. If paused → stored; applies on unpause.
  - `--restore-original-options` works only with `--query` (batch); ignored in single-workflow mode.
- Observability/audit
  - Operations do not create Workflow Event History events; Workflow code/replay/tools reading history can’t detect them.
  - Check state via `temporal workflow describe` (paused flag, attempt, last failure) or UI (who/when/why). No namespace-wide query for paused activities; must know Workflow Id.
📖 Temporal Activity Retries — RetryPolicy defaults & backoff
Reference Doc · source
Exact RetryPolicy fields + default retry behavior (Activities vs Workflows)
Key content
- Default retry behavior
  - Activities retry automatically by default with exponential backoff until success or cancellation.
  - Workflow Executions do not retry by default (no default Retry Policy attached).
  - Retry Policies do not apply to Workflow Task Executions; Workflow Tasks retry until Workflow Execution Timeout (unlimited by default) with exponential backoff and max interval 10 minutes.
- RetryPolicy fields (exact names)
  - `initialInterval`, `backoffCoefficient`, `maximumInterval`, `maximumAttempts`, `nonRetryableErrorTypes`.
- Default RetryPolicy values (Properties → Default values)
  - `initialInterval` = 1s
  - `backoffCoefficient` = 2.0
  - `maximumInterval` = 100 × `initialInterval` (⇒ 100s with defaults)
  - `maximumAttempts` = ∞ (unlimited); 0 also means unlimited, 1 means no retries, negative ⇒ error
  - `nonRetryableErrorTypes` = [] (none)
- Retry interval formula (Retry interval section, Eq. 1)
  - `retryInterval = min(initialInterval * (backoffCoefficient ^ retries), maximumInterval)`
  - where `retries` = number of retries already attempted (0 for first retry delay).
- Procedure: what happens on Activity retry
  - Activity fails → service evaluates Retry Policy (attempt count, error type) and computes backoff.
  - If retryable: schedules a new Activity Task after backoff (new Activity Task Execution).
  - If not retryable / attempts exceeded: Activity fails and error is returned.
- Override mechanism
  - An Application Failure can set a “next Retry delay” that overrides the computed interval, but still respects `maximumAttempts` and overall timeouts (Activity Schedule-to-Close, Workflow Execution Timeout).
- Design rationale
  - Prefer retrying failed Activities (failure-prone external ops) vs retrying whole Workflows (deterministic replay; retrying often repeats same failure and wastes resources).
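The retry interval formula and its defaults can be checked with a few lines of arithmetic. This is a standalone sketch, not Temporal SDK code: `retry_interval` is a hypothetical helper whose parameter defaults mirror the default values listed above.

```python
from datetime import timedelta
from typing import Optional


def retry_interval(
    retries: int,
    initial_interval: timedelta = timedelta(seconds=1),
    backoff_coefficient: float = 2.0,
    maximum_interval: Optional[timedelta] = None,
) -> timedelta:
    """min(initialInterval * backoffCoefficient ** retries, maximumInterval)."""
    if maximum_interval is None:
        # Default cap is 100 x the initial interval.
        maximum_interval = 100 * initial_interval
    computed = initial_interval * (backoff_coefficient ** retries)
    return min(computed, maximum_interval)


# Delay before the 1st, 2nd, ... 9th retry with default settings.
delays = [retry_interval(n).total_seconds() for n in range(9)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 100.0, 100.0]
```

With the defaults, the cap is reached at the eighth retry (2^7 = 128s exceeds 100s), after which every further retry waits 100s.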
📖 Temporal Workflow History Event Types (Authoritative)
Reference Doc · source
Exact Temporal event type names + meanings/fields for Workflow Execution Event History (debugging/auditing/determinism)
Key content
- Event history basics: Events are created by the Temporal Service in response to (a) external occurrences and (b) Commands generated by a Workflow Execution.
- Workflow lifecycle (terminal + key):
  - `WorkflowExecutionStarted` (always first event). Key fields: `workflow_type`, `task_queue`, `input`, timeouts (`workflow_execution_timeout`, `workflow_run_timeout`, `workflow_task_timeout`), `retry_policy`, `attempt`, `cron_schedule`, `continued_execution_run_id`, `identity`, `memo`, `search_attributes`.
  - Terminal outcomes: `WorkflowExecutionCompleted` (`result`), `WorkflowExecutionFailed` (`failure`, `retry_state`), `WorkflowExecutionTimedOut` (`retry_state`), `WorkflowExecutionCanceled` (`details`), `WorkflowExecutionTerminated` (`reason`, `details`).
  - Control-flow: `WorkflowExecutionCancelRequested` (`cause`, `identity`), `WorkflowExecutionSignaled` (`signal_name`, `input`, `identity`), `WorkflowExecutionContinuedAsNew` (`new_execution_run_id`, `input`, `backoff_start_interval`, timeouts), `WorkflowExecutionOptionsUpdated` (versioning override, attached request id, completion callbacks).
- Workflow Task (WFT) progression: `WorkflowTaskScheduled` → `WorkflowTaskStarted` → `WorkflowTaskCompleted`; failure modes: `WorkflowTaskTimedOut` (`timeout_type`), `WorkflowTaskFailed` (often non-determinism; also used by reset with `base_run_id`, `new_run_id`, `fork_event_version`).
- Activity progression: `ActivityTaskScheduled` (timeouts: `schedule_to_close_timeout`, `schedule_to_start_timeout`, `start_to_close_timeout`, `heartbeat_timeout`, plus `retry_policy`) → `ActivityTaskStarted` (written to history only when terminal event occurs) → terminal: `ActivityTaskCompleted` / `ActivityTaskFailed` (`retry_state`) / `ActivityTaskTimedOut` (`timeout_type`) / cancel: `ActivityTaskCancelRequested` → `ActivityTaskCanceled`.
- Other primitives: Timers (`TimerStarted` / `TimerFired` / `TimerCanceled`), markers (`MarkerRecorded` is server-transparent), child workflows (initiate/start/complete/fail/cancel/timeout/terminate), external cancel/signal initiation + failure events, search attribute upserts (`UpsertWorkflowSearchAttributes`), Updates (`WorkflowExecutionUpdateAcceptedEvent`, `WorkflowExecutionUpdateCompletedEvent`), Nexus ops (`NexusOperationScheduled` / `Started` / `Completed` / `Failed` / `TimedOut` / `CancelRequested` / `Canceled`).
📖 Temporal Workflow timeouts (Execution/Run/Task) — definitions, defaults, API params
Reference Doc · source
Workflow-level timeout semantics + parameter names; contrast with Activity timeouts
Key content
- General guidance (design rationale):
  - Temporal generally does not recommend setting Workflow Timeouts because Workflows are long-running/resilient; timeouts can limit ability to handle delays.
  - For “do something after X time” inside a Workflow, prefer a Timer (durable sleep managed by Temporal service), not Workflow timeouts.
- Where configured (procedure/API):
  - Set at Workflow start via `client.start_workflow()` or `client.execute_workflow()`.
  - Timeout parameter names: `execution_timeout`, `run_timeout`, `task_timeout`.
  - Example (Python): `await client.execute_workflow(..., execution_timeout=timedelta(seconds=2), run_timeout=..., task_timeout=...)`
- Workflow timeout types (definitions + defaults):
  - Workflow Execution Timeout: max time a Workflow Execution can be Open, including retries and Continue-As-New.
    - Default: ∞ (infinite).
    - On reach: Execution becomes Timed Out.
    - Common use: limit total duration of a Temporal Cron Job over time.
  - Workflow Run Timeout: max duration of a single Run (one Run ID) within an Execution; excludes retries/Continue-As-New.
    - Default: same as Execution Timeout.
    - On reach: Execution becomes Timed Out.
    - Constraint: cannot be greater than Execution Timeout.
  - Workflow Task Timeout: max time a Worker may execute a Workflow Task after pulling from Task Queue (detect Worker down / recovery).
    - Default: 10s; max: 120s.
    - Increase only if large history load needs >10s; not recommended beyond default.
- Observability/troubleshooting:
  - Use Search Attribute `TemporalReportedProblems` to find Workflows with failed Workflow Tasks; a failed Workflow Task does not fail the Workflow but can prevent completion if unhandled.
📖 stream_mode="updates" can miss tool messages when tools return Command
Reference Doc · source
Edge-case semantics for stream_mode="updates" with multi-tool calls + tools returning Command(update=...)
Key content
- Repro setup (LangGraph ReAct agent):
  - Build tools `add` and `sub`; create agent via `create_react_agent(model, tools=tools, checkpointer=MemorySaver())`.
  - Stream with: `agent.stream(input={"messages":[("user","add(1,1), add(1,2), add(1,3) at once")]}, config={"configurable":{"thread_id":"1"}}, stream_mode="updates")`
- Tool return patterns compared:
  - `add` tool returns a Command:
    - Eq. 1 (Command update): `Command(update={"messages":[ToolMessage(f"add result: {result}", tool_call_id=tool_call_id)]})` where `result = a + b`, and `tool_call_id` is injected via `Annotated[str, InjectedToolCallId]`.
  - `sub` tool returns a plain string: `return f"sub result: {result}"` (note: code shows `result = a + b` even though tool is named `sub`).
- Observed streaming behavior (empirical):
  - When the LLM issues multiple tool calls at once (3 `add` calls), `stream_mode="updates"` emits:
    - an `agent` update containing 3 tool_calls,
    - then a `tools` update containing only 1 ToolMessage: `add result: 4` (the last call’s message),
    - then final `agent` response.
  - For `sub` (string return), the `tools` update contains all 3 ToolMessages in one chunk: `sub result: 2`, `sub result: 3`, `sub result: 4`.
- State vs stream discrepancy:
  - `agent.get_state(config).values["messages"]` shows all ToolMessages for `add` (2, 3, 4) even though streaming only showed the last one.
📖 lm-eval Harness Interface (CLI + Python API)
Reference Doc · source
CLI argument surface + equivalent simple_evaluate() kwargs for standardized eval runs
Key content
- Primary invocation: run via `python -m lm_eval` or `lm-eval` CLI entrypoint; flags viewable with `-h` / `--help`.
- Model selection
  - `--model <string>`: model type/provider name (must match enabled names list in main README).
  - `--model_args "arg1=val1,arg2=val2,..."`: comma-separated kwargs passed to model constructor (example: `pretrained=EleutherAI/pythia-160m,dtype=float32`).
- Task selection & prompting
  - `--tasks "t1,t2,group1,..."`: comma-separated task and/or task-group names (must be valid).
  - `--num_fewshot <int>`: number of few-shot examples inserted into context.
- Generation controls
  - `--gen_kwargs "k=v,..."`: kwargs passed to `generate_until` tasks (e.g., `temperature`, `top_p`, `top_k`); applies to all `generate_until` tasks in the run (no per-task overrides via CLI; per-task control via task YAML).
- Batching & device
  - `--batch_size <int|auto|auto:N>`: fixed batch size or auto-fit; `auto:N` re-finds max batch size N times during eval (helps because docs are sorted by descending context length).
  - `--max_batch_size <int>`: cap when using `--batch_size auto`.
  - `--device <string>`: e.g., `cuda` (default), `cuda:0`, `cpu`, `mps`.
- Outputs & observability
  - `--output_path dir/file.jsonl|dir/`: save high-level results; if `--log_samples`, also saves per-document outputs/metrics into directory.
  - `--log_samples`: requires `--output_path`; logs model inputs/outputs per document.
- Debugging/repro
  - `--limit <int|float 0.0–1.0>`: evaluate first X docs or first X% per task.
  - `--use_cache /path/to/sqlite_cache_`: creates per-process caches `/path/to/sqlite_cache_rank{i}.db`.
  - `--check_integrity`: run task tests.
  - `--write_out`: print prompt + gold target for first doc of each task.
  - `--show_config`: print full `TaskConfig` (incl. non-default YAML settings).
  - `--include_path <folder>`: add external YAML task configs to registry.
- Python API workflow
  - Implement an `lm_eval.api.model.LM` subclass (`loglikelihood`, `loglikelihood_rolling`, `generate_until`).
  - Register tasks: `lm_eval.tasks.initialize_tasks()` or `include_path(...)`.
  - Call `lm_eval.simple_evaluate(model=lm_obj, tasks=[...], num_fewshot=..., ...)` (kwargs mirror CLI flags).
  - `lm_eval.evaluate()` provides core functionality with less abstraction than `simple_evaluate()`.
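For scripted runs it can help to assemble the command line from structured settings rather than hand-written strings. The sketch below only builds an argument list from the flags listed above and does not invoke the harness; `build_lm_eval_argv` is a hypothetical helper, and the `hf` model name and `hellaswag` task are example values assumed to be enabled in the installed version.

```python
from typing import Dict, List, Optional, Union


def build_lm_eval_argv(
    model: str,
    model_args: Dict[str, str],
    tasks: List[str],
    num_fewshot: Optional[int] = None,
    batch_size: Union[int, str, None] = None,
    output_path: Optional[str] = None,
    log_samples: bool = False,
    limit: Union[int, float, None] = None,
) -> List[str]:
    """Assemble an `lm-eval` argument list from structured settings."""
    argv = ["lm-eval", "--model", model]
    if model_args:
        # --model_args expects comma-separated key=value pairs.
        argv += ["--model_args", ",".join(f"{k}={v}" for k, v in model_args.items())]
    argv += ["--tasks", ",".join(tasks)]
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    if batch_size is not None:
        argv += ["--batch_size", str(batch_size)]
    if output_path is not None:
        argv += ["--output_path", output_path]
    if log_samples:
        if output_path is None:
            raise ValueError("--log_samples requires --output_path")
        argv.append("--log_samples")
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv


argv = build_lm_eval_argv(
    model="hf",  # example model type; must match an enabled name
    model_args={"pretrained": "EleutherAI/pythia-160m", "dtype": "float32"},
    tasks=["hellaswag"],  # example task name
    num_fewshot=5,
    batch_size="auto",
    output_path="results/",
    log_samples=True,
    limit=10,
)
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the harness is installed; passing a list avoids shell quoting problems with the comma-separated values.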
📋 LangGraph Agent Streaming via FastAPI WebSocket (Repo Scaffold)
Code · source
Runnable FastAPI server scaffold showing LangGraph Agent + real-time streaming “tokens” (words) over WebSocket; practical place to add redaction/sanitization and observability hooks.
Key content
- End-to-end architecture (repo intent):
- LangGraph Agent used to build a stateful, multi-actor LLM application; coordinates and checkpoints multiple chains/actors across cyclic computational steps using regular Python functions.
- FastAPI provides the HTTP server framework (high-performance, auto API docs).
- WebSocket used for real-time, bidirectional low-latency streaming to a web UI.
- Streaming behavior (important implementation detail):
- “Streaming Tokens” feature is ChatGPT-like word streaming: not raw token streaming; tokens are converted to words before displaying in the web UI.
- Tooling/agent extensibility:
- Agent is created with LangGraph and has access to one tool by default in this example; design explicitly supports integrating many tools.
- Design rationale (as stated):
- WebSocket chosen to ensure low-latency data exchange and interactive UX.
- LangGraph chosen for coordination + checkpointing across iterative/cyclic steps (Pregel/Apache Beam-inspired; NetworkX-like interface).
- Repo structure (for quick navigation):
- Key files: `main.py` (FastAPI entry), `assistant.py` (agent logic), plus `static/` (web UI assets), `docs/`, `README.md`.
- Empirical/config values: none stated in the provided excerpt (no hyperparameters, ports, or numeric benchmarks).
🔍 LangGraph runtime + streaming modes (Pregel/BSP)
Explainer · source
Design rationale for LangGraph’s runtime (graph execution model + why streaming exists) and concrete graph.stream/graph.astream pattern via stream_mode values.
Key content
- Production needs driving design (6 features): Parallelization (reduce actual latency), Streaming (reduce perceived latency), Task queue (reliable retries), Checkpointing (cheap retries), Human-in-the-loop (interrupt/resume), Tracing (visibility into agent loops).
- Why streaming exists (latency rationale): LLM agents run in seconds/minutes/hours; when you can’t reduce true latency without harming quality, stream useful intermediate info (progress/actions) up to token-by-token output.
- Runtime architecture choice (Section “Execution algorithm”): Uses BSP/Pregel to support cycles/loops and deterministic concurrency (avoid data races).
- Execution model (algorithm steps):
- Channels: named data containers with monotonically increasing version strings.
- Nodes: functions subscribing to channels; run when subscribed channel versions change.
- Loop per iteration:
- Select runnable nodes by comparing channel versions vs last-seen versions.
- Execute selected nodes in parallel with independent state copies.
- Apply node updates to channels in a deterministic order, bump versions.
- Halt when no nodes runnable or max iteration steps reached (developer-set constant).
- Streaming implementation + modes: Engine emits stream output inside nodes while running and at step boundaries without custom developer code. Provides 6 stream modes: `values`, `updates`, `messages`, `tasks`, `checkpoints`, `custom`. Example guidance: chatbots → `messages`; long-running agents → `updates`.
- Checkpoint contents (for resume-anywhere): serialized channel values (MsgPack by default, optionally encrypted), version strings, and record of last-seen channel versions per node.
- Empirical scaling table (Big-O):
- Planning a step: nodes O(1), edges O(1), channels O(n), active nodes O(n), history O(1), threads O(1).
- History length is O(1) across start/plan/run/finish (fetch latest checkpoint; no replay).
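The execution loop described above (select by version, run on state copies, apply updates in a fixed order, halt) can be sketched in a few dozen lines. This is a toy model, not LangGraph's runtime: `Node` and `run` are hypothetical names, versions are plain integers instead of version strings, and nodes run sequentially where the real engine runs them in parallel.

```python
from typing import Callable, Dict, List, Tuple


class Node:
    def __init__(self, name: str, subscribes: List[str], fn: Callable[[Dict], Dict]):
        self.name = name
        self.subscribes = subscribes
        self.fn = fn
        self.seen: Dict[str, int] = {}  # last-seen version per subscribed channel


def run(nodes: List[Node], values: Dict, max_steps: int = 25) -> Tuple[Dict, int]:
    values = dict(values)
    versions = {ch: 1 for ch in values}  # initial inputs count as fresh writes
    steps = 0
    while steps < max_steps:
        # 1. Select nodes whose subscribed channels changed since they last ran.
        runnable = [
            n for n in nodes
            if any(versions.get(ch, 0) > n.seen.get(ch, 0) for ch in n.subscribes)
        ]
        if not runnable:
            break  # halt: nothing left to do
        # 2. Execute each selected node on its own copy of the state.
        results = []
        for n in runnable:
            results.append((n.name, n.fn(dict(values))))
            for ch in n.subscribes:
                n.seen[ch] = versions.get(ch, 0)
        # 3. Apply updates in a deterministic order and bump channel versions.
        for _, update in sorted(results, key=lambda r: r[0]):
            for ch, val in update.items():
                values[ch] = val
                versions[ch] = versions.get(ch, 0) + 1
        steps += 1
    return values, steps


nodes = [
    Node("double", ["input"], lambda s: {"doubled": s["input"] * 2}),
    Node("report", ["doubled"], lambda s: {"report": f"doubled={s['doubled']}"}),
]
final_values, steps = run(nodes, {"input": 3})
print(final_values, steps)
```

Because updates are applied only at step boundaries and in sorted order, the result does not depend on which node happens to finish first, which is the property the BSP/Pregel design is chosen for.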
📋 LangGraph streaming + state inspection patterns (SSE/WebSocket-adaptable)
Code · source
Concrete end-to-end patterns for streaming agent execution + inspecting intermediate state (checkpoint snapshots), suitable for adapting to SSE/WebSocket event streams.
Key content
- Graph “hello world” (JS/TS) procedure
  - Define state schema with `messages: MessagesValue`.
  - Node returns state update: `return { messages: [{ role: "ai", content: "hello world" }] };`
  - Build graph: `new StateGraph(State).addNode("mock_llm", mockLlm).addEdge(START,"mock_llm").addEdge("mock_llm",END).compile();`
  - Invoke: `await graph.invoke({ messages: [{ role:"user", content:"hi!" }] });`
- Tool-calling loop (Python) procedure
  - State: `messages: Annotated[list, add_messages]` (reducer appends, not overwrites).
  - Bind tools: `llm_with_tools = llm.bind_tools(tools)`
  - Nodes:
    - `chatbot(state) -> {"messages":[llm_with_tools.invoke(state["messages"])]}`
    - `ToolNode(tools=[...])`
  - Control flow:
    - `add_conditional_edges("chatbot", tools_condition)` routes to tools when tool calls exist.
    - `add_edge("tools","chatbot")` returns to LLM after tool execution.
    - `add_edge(START,"chatbot")`
- Checkpointing + intermediate state inspection
  - Compile with checkpointer: `graph.compile(checkpointer=MemorySaver())`
  - Provide `thread_id` in `configurable` to persist/restore across calls.
  - Inspect via `StateSnapshot(...)` containing:
    - `values` (full state, incl. message history)
    - `next` (empty when at `END`)
    - `config.configurable.thread_id`, `checkpoint_id`, `checkpoint_ns`
    - `metadata.step` (example shows `step: 4`)
- Human-in-the-loop (HIL) interrupt rationale + default constraint
  - Tool uses `interrupt({"query": query})` and returns `human_response["data"]`.
  - Rationale: disable parallel tool calling to avoid repeated tool invocations on resume.
  - Enforced by: `assert len(message.tool_calls) <= 1`.
🔍 Predictable background coding agents via verification loops
Explainer · source
Concrete production pattern: iterative verification/feedback loops (agent unaware of verifier details) to improve PR success predictability.
Key content
- Primary failure modes (production agent at scale):
- Agent fails to produce a PR (minor; manual fallback).
- PR produced but fails CI (frustrating; leaves half-broken code).
- PR passes CI but is functionally incorrect (most serious; erodes trust; hard to spot across thousands of components).
- Core procedure: “verification loop” (inner loop)
- Implement strong verification loops so the agent can incrementally confirm it’s on track before committing/opening a PR.
- Design principle: agent doesn’t know what verification does/how; it only knows it can/must call a verification tool.
- Loop consists of one or more independent verifiers that auto-activate based on repo contents (e.g., Maven verifier triggers if `pom.xml` exists at repo root).
- Verifiers are exposed via an abstraction layer (e.g., MCP tool definition); individual verifiers are not exposed directly to the agent.
- Verifiers run formatting/build/test and parse noisy outputs (often via regex) to return short, relevant error messages or a short success message.
- System runs all relevant verifiers before opening a PR; in Claude Code implemented via a stop hook. If any verifier fails → PR not opened; user gets error.
- Additional safeguard: LLM “Judge” (post-verifier)
- Inputs: diff of proposed change + original prompt; evaluated by an LLM.
- Purpose: prevent “ambitious” out-of-scope changes (refactors, disabling flaky tests).
- Empirics: across thousands of agent sessions, judge vetoes ~25%; when vetoed, agent course-corrects ~50% of the time.
- Design rationale for predictability/security
- Keep agent narrowly scoped: see codebase, edit files, run verifiers only.
- Surrounding infra handles pushing code, Slack interaction, prompt authoring.
- Run agent highly sandboxed (container, limited permissions, few binaries, minimal system access).
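The verifier abstraction described above can be sketched as a small registry with activation rules. Everything here is hypothetical and only mirrors the described shape: `Verifier`, `run_verifiers`, and `fake_maven_check` are illustrative names, and the Maven check is a placeholder for actually running a build and parsing its output.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, NamedTuple, Optional


class Verifier(NamedTuple):
    name: str
    applies: Callable[[Path], bool]         # auto-activation rule based on repo contents
    check: Callable[[Path], Optional[str]]  # None on success, short error text otherwise


def run_verifiers(repo: Path, verifiers: List[Verifier]) -> List[str]:
    """Single entry point the agent calls; it never sees individual verifiers."""
    errors = []
    for verifier in verifiers:
        if not verifier.applies(repo):
            continue  # verifier is not relevant for this repo
        error = verifier.check(repo)
        if error is not None:
            errors.append(f"{verifier.name}: {error}")
    return errors


def fake_maven_check(repo: Path) -> Optional[str]:
    # Placeholder for "run the build, parse noisy output, return a short message".
    text = (repo / "pom.xml").read_text()
    return None if "<project" in text else "pom.xml has no <project> element"


maven = Verifier("maven", lambda repo: (repo / "pom.xml").exists(), fake_maven_check)

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    no_pom_errors = run_verifiers(repo, [maven])   # verifier does not activate
    (repo / "pom.xml").write_text("not xml")
    broken_errors = run_verifiers(repo, [maven])   # verifier activates and fails
    (repo / "pom.xml").write_text("<project></project>")
    fixed_errors = run_verifiers(repo, [maven])    # verifier activates and passes

may_open_pr = not fixed_errors  # a PR is opened only when every verifier passes
print(no_pom_errors, broken_errors, fixed_errors, may_open_pr)
```

Returning a short error string rather than raw build output matches the stated goal of giving the agent concise, relevant feedback it can act on in the next iteration.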
🔍 Spotify “Honk” Background Coding Agent (Part 1) — Deployment Pattern
Explainer · source
System-level architecture narrative for a background coding agent: how work is scoped/queued, PRs produced, and operational constraints (human review, reliability, cost).
Key content
- Baseline platform (Fleet Management): Runs source-to-source transformations as jobs in a containerized environment, then automatically opens PRs against target repos. Historically strong for:
- Dependency bumps (e.g., Maven pom.xml)
- Config updates (deployment manifests)
- Simple refactors (replace deprecated calls)
- Scale/impact metrics:
- Since mid-2024, ~50% of Spotify PRs have been automated by Fleet Management.
- AI agents have generated 1,500+ PRs merged into production.
- Reported 60–90% total time savings vs writing changes by hand (for complex migrations).
- Why agents (design rationale):
- Deterministic transformation scripts become extremely complex (example: Maven dependency updater grew to 20,000+ LOC to handle corner cases).
- Goal: let engineers define fleet-wide changes in natural language, lowering expertise barrier.
- Architecture choice: Replace only the transformation declaration with an agent; keep surrounding infra unchanged (repo targeting → PR opening → review → merge).
- Internal CLI (pluggable agent runner):
- Delegates prompt execution to an agent
- Runs formatting/linting via local MCP (Model Context Protocol)
- Uses LLMs-as-judge to evaluate diffs
- Uploads logs to GCP
- Captures traces in MLflow
- Rationale: enables swapping agents/LLMs without changing user workflow.
- Background agent workflow (Slack/GitHub):
- User interacts with an interactive agent to gather task info
- Interaction produces a prompt
- Prompt handed to coding agent → PR produced
- Used for ADR drafting from Slack threads; PM-proposed simple changes.
- Operational constraints called out: long runtimes, unpredictable outputs → need validation/quality control; plus safety, sandboxing, and cost/LLM quota management.
🔍 Two-layer agent architecture (LangGraph logic + Temporal durability)
Explainer · source
Pattern: keep agent logic in LangGraph-style graphs, but use Temporal for durable execution (state persistence, retries, recovery, scaling).
Key content
- Use case: “Deep research agent” for a Fortune 500 manufacturer with 100+ plants; searches internal DBs/shared drives/repos, then expands to web if needed; labels internal vs open-web sources and cites sources.
- Observed LangGraph-in-prod pain points (why migrate):
- Needed robust error handling + retries → built custom retry/error-handling especially for human-in-the-loop waits, requiring manual state maintenance; led to inconsistent workflow state and hard recovery/debugging.
- Redis-based state: had to manage lifecycle/expiration; bugs around expired state were time-consuming to reproduce; caching updates could wipe common requests.
- Scaling/exactly-once: to ensure each request processed exactly once, they used Apache Kafka + executor pool; still hit race conditions, stale state, stuck agents.
- Temporal design rationale (what changed):
- State becomes part of the workflow (durably persisted in Temporal event history), not an external “baton” (Redis key). Workflow passes a serializable state object into each Activity; Activities return updated state.
- Declarative retries via `RetryPolicy` attached to Activity execution (delete “thousands of lines” of try/catch + retry loops).
- Example defaults shown: `initial_interval=1s`, `backoff_coefficient=2.0`, `maximum_interval=60s`, `maximum_attempts=4`; another policy `initial_interval=5s`, `backoff_coefficient=1.0`, `maximum_interval=15s`, `maximum_attempts=3`; `non_retryable_error_types` includes `ValueError` (and `TypeError` in second).
- Scaling procedure: run multiple identical stateless Temporal Worker replicas on Kubernetes polling the same task queue; Temporal handles load balancing/distribution.
- Architectural decoupling step: convert each LangGraph node into a self-contained Temporal Activity with explicit serializable inputs/outputs; move shared client init into Activities; optimize via client pooling/lazy init.