Agent Workflows

Video (best)

  • LangChain — “LangGraph: Build Stateful, Multi-Agent Workflows”
  • Link: https://blog.langchain.dev/langgraph-multi-agent-workflows
  • Why: LangGraph is one of the clearest, widely-used introductions to DAG/state-machine style orchestration for agents (nodes, edges, conditional routing, retries).
  • Level: intermediate

Blog / Written explainer (best)

  • Lilian Weng (OpenAI) — “LLM Powered Autonomous Agents”
  • Link: https://lilianweng.github.io/posts/2023-06-23-agent/
  • Why: High-signal overview of agent building blocks and patterns (planning, tool use, memory) that underpin real workflows; good conceptual grounding before orchestration/production details.
  • Level: intermediate

Deep dive

  • LangChain Docs — “LangGraph”
  • Link: https://langchain-ai.github.io/langgraph/
  • Why: Practical deep dive into orchestration patterns: DAG-like graphs, conditional branching, cycles, persistence/checkpointing, streaming, and human-in-the-loop patterns.
  • Level: intermediate/advanced
  • Microsoft — “AutoGen”
  • Link: https://microsoft.github.io/autogen/stable/index.html
  • Why: Multi-agent conversation/workflow patterns, tool execution, and coordination; useful for parallel/role-based agent workflows and production-ish patterns.
  • Level: intermediate/advanced

Original paper

  • ReAct — “ReAct: Synergizing Reasoning and Acting in Language Models”
  • Link: https://arxiv.org/abs/2210.03629
  • Why: Foundational pattern for tool-using agent loops (reason → act → observe), which is the core of many sequential chains and orchestration designs.
  • Level: intermediate
  • MRKL — “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning”
  • Link: https://arxiv.org/abs/2205.00445
  • Why: Early, influential framing for routing/orchestrating between tools and modules (a precursor to many workflow-engine patterns).
  • Level: intermediate

Code walkthrough

  • LangChain (GitHub) — “LangGraph” (examples)
  • Link: https://github.com/langchain-ai/langgraph
  • Why: Concrete reference implementations for graph/DAG orchestration, streaming, retries, and human-in-the-loop checkpoints.
  • Level: intermediate
  • Microsoft (GitHub) — “AutoGen” (examples)
  • Link: https://github.com/microsoft/autogen
  • Why: End-to-end multi-agent workflow examples (coordinator/worker patterns, tool calls, conversation-driven orchestration).
  • Level: intermediate

Coverage notes

  • Strong: orchestration-patterns (DAG/graph orchestration, sequential chains, conditional routing), workflow engines (LangGraph), multi-agent coordination (AutoGen), core agent loop pattern (ReAct).
  • Weak: production-patterns specifics (agent-as-API design, robust retry/fallback taxonomies, streaming UX patterns) in a single canonical explainer; agent-deployment details are scattered across vendor docs.
  • Gap: agent tracing/observability and deployment SRE guidance (latency management, cost optimization) in one stable, vendor-neutral “best” resource; likely needs a dedicated page or curated set of vendor/OSS observability docs.

Additional Resources for Tutor Depth

31 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.

📄 Core RL/MDP + Bellman + DP + Q-learning (Sutton & Barto 2e)

Paper · source

MDP formalism, Bellman equations, policy/value iteration, Q-learning, planning/model connections

Key content
  • MDP dynamics & policy notation (Ch. 3, “Summary of Notation”)
    • States (s\in\mathcal S), actions (a\in\mathcal A(s)), reward (r\in\mathcal R), discount (\gamma).
    • Policy: deterministic (\pi(s)) or stochastic (\pi(a\mid s)).
    • Transition model: (p(s’,r\mid s,a)).
  • Return (Ch. 3): (G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}).
  • Value functions (Ch. 3):
    • (v_\pi(s)=\mathbb E_\pi[G_t\mid S_t=s])
    • (q_\pi(s,a)=\mathbb E_\pi[G_t\mid S_t=s,A_t=a])
    • Optimal: (v_(s)=\max_\pi v_\pi(s)), (q_(s,a)=\max_\pi q_\pi(s,a)).
  • TD(0) / “backup” update shown in tic-tac-toe example (Sec. 1.5):
    Eq. 1: (V(s)\leftarrow V(s)+\alpha,[V(s’)-V(s)]) (step-size (\alpha>0)); illustrates bootstrapping from successor state estimate.
  • Q-learning (Ch. 6.5): off-policy TD control update
    Eq. 2: (Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha,[R_{t+1}+\gamma\max_a Q(S_{t+1},a)-Q(S_t,A_t)]).
  • Design rationale (Ch. 1.3): reward defines immediate goal; value is long-run desirability; estimating values is central because action choice should maximize expected long-run return, not immediate reward.
  • Planning/model-based connection (Ch. 1.3, Ch. 8): a model predicts next state/reward; planning = deciding by considering possible futures; integrates planning, acting, learning.

📄 ESAA — Event-Sourced Agent Orchestration (CQRS-style)

Paper · source

Architecture/procedure for representing agent execution as an event-sourced log with deterministic replay + hash-verified projections

Key content
  • Core separation (Section 3): LLM agent emits structured intentions only (e.g., agent.result, issue.report) in validated JSON; a deterministic orchestrator validates, persists, and applies effects (agent has no direct write permission).
  • Canonical artifacts (.roadmap/, Section 3.1):
    • activity.jsonl: append-only ordered event log (event_seq).
    • roadmap.json: materialized read-model (projection) incl. projection_hash_sha256.
    • Contracts: AGENT_CONTRACT.yaml, ORCHESTRATOR_CONTRACT.yaml (allowed actions + prohibitions, e.g., deny agent file.write).
    • PARCER profiles (PARCER_PROFILE.*.yaml): metaprompting constraints (Persona/Audience/Rules/Context/Execution/Response) enforcing strict JSON envelope.
  • Trace-first + immutability (Section 3.2): record event before irreversible effects; “done immutability rule”: completed tasks don’t regress—defects create new issue.report → new hotfix path.
  • Deterministic verification (Section 3.3, 4.2):
    • Compute SHA-256 over canonicalized projected state: projection_hash_sha256 = SHA256(canonicalize(roadmap.json)) (canonicalization aligned with RFC 8785 JCS).
    • esaa verify: replay log → reproject → compare hash; emits verify.ok or verify.fail.
  • Orchestrator pipeline (Section 4.2): validate (JSON Schema + boundary rules) → output.rejected if violation → apply effects via orchestrator.file.write → append events → reproject roadmap.json → verify via replay+hash.
  • Concurrency model (Section 3.4): agents run in parallel, but results are validated and appended sequentially, preserving total order; orchestrator can detect conflicts (e.g., overlapping file mods) before applying.
  • Empirical results (Case studies, Tables 1–2):
    • CS1 landing page: 9 tasks, 49 events, run.status=success, verify_status=ok, output.rejected=0. Cycle: attempt.create → orchestrator.dispatch → agent.result → orchestrator.file.write → task.update (repeat) → verify.ok → run.end.
    • CS2 clinic-asr: 50 tasks, 86 events, 4 concurrent agents, 15 phases (8 completed); at analysis 31/50 tasks done (62%); distribution: 30 claims, 30 completions, 17 promotes, 8 phase.complete, 1 version init; output.rejected=0; log size ~15 KB; run duration ~15 hours.
  • Overheads (Section 6.6): JSON envelope + validation preamble ~200–500 tokens/invocation; validation+persistence sub-second/event.

📄 GoalAct = continuously updated global plan + hierarchical execution

Paper · source

Taxonomy/positioning of agent architectures (ReAct vs Plan-and-Execute vs CodeAct) and how “global planning” changes decision flow

Key content
  • Problem framing (Intro):
    • ReAct: incremental “Thought–Action–Observation” w/o global perspective → can get stuck in local branches/local optima on multi-branch tasks.
    • Plan-and-Execute: makes a global plan then executes + adjusts via feedback, but plans often non-executable (exceed agent action space).
    • Execution trade-off: text/json tool calls are stable but limited (no loops/conditionals); code actions (CodeAct) are expressive but can be unstable in unpredictable tool environments; writing tasks not well-solved by code alone.
  • Global plan definition (Section 3.1):
    • Eq. (1): global plan (P={p_i}_{i=1}^{n}), where (p_i) is the i-th plan step and has corresponding action (a_i). Plan steps specify high-level skills (not low-level actions). Final step (p_n) is always Finish.
    • Eq. (2): at time (t), update plan (P_t = \pi(u, T, H_t)), where (\pi) is the update policy; (u)=user query; (T)=available tools; (H_t)=history.
    • Eq. (3): history (H_t={(a_i,o_i)}_{i=1}^{t-1}), where (o_i)=observation from executing (a_i); (H_1=\emptyset).
    • Rationale: tight coupling of planning+execution via continuous plan updates → coherent long-term goals + executability.
  • Hierarchical execution (Section 3.2): plan over skills (searching/coding/writing/…) then pick tools/params within skill.
    • Searching: simple/stable; limited expressiveness (no loops/branches).
    • Coding: python enables loops/branches; higher complexity → more error-prone with uncertain tool outputs.
    • Writing: needed for tasks like legal document generation; not solved by code/search alone.
  • Empirical results (LegalAgentBench, Table 2; temp=0):
    • GoalAct SOTA; average success-rate improvement +12.22% over second-best (reported).
    • GPT-4o-mini ALL: GoalAct 0.7720 vs ReAct 0.6161, CodeAct 0.6275, Plan-and-Solve 0.4196, Plan-and-Execute 0.4503.
    • GLM-4-Plus ALL: GoalAct 0.8710 vs ReAct 0.7499, CodeAct 0.6648.
  • Ablation (Table 3, GLM-4-Plus): removing global plan drops ALL from 0.8710 → 0.7896 (−8.14%).

📄 Latency-aware orchestration via critical path (LAMaS)

Paper · source

Method: orchestration/search space (layered DAG/parallel), latency reward + critical-path credit assignment, and latency/cost tradeoffs + ablations.

Key content
  • Parallel execution + critical path definition (Sec. 3.1–3.3): Operators are atomic nodes (each may include multiple LLM/tool calls). Operators are selected per layer and run in parallel within a layer after removing intra-layer dependencies (refinement ops consume previous-layer outputs to avoid synchronization barriers).
  • Latency vs cost under parallelism:
    • Latency (critical path), Eq. (1): (L=\sum_{l\in \mathcal{L}} \max_{o\in \mathcal{O}_l} t(o)), where (\mathcal{O}_l) are operators executed at layer (l), (t(o)) is operator time.
    • Cost, Eq. (2): (C=\sum_{l\in \mathcal{L}}\sum_{o\in \mathcal{O}_l} c(o)) (token/$ cost adds across all operators).
  • Controller/search space (Sec. 3.4): Probabilistic agentic supernet DAG; controller samples operators layer-by-layer with autoregressive factorization (Eq. 3). Threshold-based sampling (Eq. 4): select highest-scoring operators until cumulative confidence exceeds threshold (\tau) (controls width/parallelism). EarlyExit ends generation immediately if selected.
  • Reward + credit assignment:
    • Global reward, Eq. (5): (R = S - \lambda_c C - \lambda_l \hat{L}) (task score (S), cost (C), latency proxy (\hat{L})).
    • Critical operator per layer, Eq. (6): (o_l^*=\arg\max_{o\in \mathcal{O}_l}\hat{t}(o)).
    • CP-aware operator rewards, Eq. (7): apply latency penalty only to (o_l^*) to avoid credit assignment error.
    • Training, Eq. (8): policy gradient on sampled trajectories; reward normalization via EMA mean/variance.
  • Latency proxy for evaluation/optimization (Eq. 9): CP length (=\sum_l \max_{o\in \mathcal{O}_l}(\text{output tokens}(o)+\alpha\cdot \text{tool seconds}(o))); (\alpha=50) (1s tool time = 50 “virtual tokens”).
  • Key results (Tables 2–4): vs MaAS (same space, parallel enabled):
    • GSM8K: Score 93.37 vs 93.13; CP len 913.5 vs 1474.6 (−38.0%)
    • HumanEval: 92.11 vs 93.00; CP len 1042.7 vs 1810.8 (−42.4%)
    • MATH: 52.26 vs 51.23; CP len 1195.8 vs 2218.5 (−46.1%)
    • Cost/CP examples (Table 3): GSM8K LAMaS cost 0.88 vs MaAS 0.56; HumanEval LAMaS 0.10 vs MaAS 0.08; MATH LAMaS 0.99 vs MaAS 0.37.
  • Ablations (Table 4):
    • w/o latency weight (but parallel deps removed): GSM8K CP 1215.9 (vs 913.5) and cost 1.73 (vs 0.88); HumanEval CP 1629.3 (vs 1042.7); MATH score 48.97 (vs 52.26) and CP 1342.1 (vs 1195.8).
    • w/o CP credit (HumanEval): score 91.60, cost 0.12, CP 1197.5 (worse than LAMaS 1042.7).
  • Defaults/hyperparams (Sec. 4.1): LLM gpt-4o-mini-0718, temperature 1; supernet layers (L=4); cost penalty (\lambda_c=0.1); sampling times (N=5); threshold (\tau=0.8); latency weight (\lambda_l=0.5) (normalized by factor 50 in objective); tool scaling (\alpha=50).

📄 MasRouter (MAS Routing: topology + roles + LLMs)

Paper · source

Routing policy formulation + training objective/procedure + benchmark cost/quality tradeoffs

Key content
  • MAS search space & instance (Eq. 1, Sec. 3.1): Search space (S=(\mathcal M,\mathcal R,\mathcal T)) with LLM pool (\mathcal M) (size (N_m)), role set (\mathcal R) (size (N_r)), collaboration modes (\mathcal T) (size (N_t)). A MAS instance
    [ \mathbf S={{M_i}{i=1}^k,{R_i}{i=1}^k,T},; M_i\in\mathcal M,; R_i\in\mathcal R,; T\in\mathcal T ]
  • MASR definition (Eq. 2): Router defines ( \pi(\mathbf S)=P(\mathbf S\mid Q)) mapping query (Q) to a tailored MAS.
  • Cost–utility objective (Eq. 3):
    [ \max_{P(\mathbf S|Q)}; \mathbb E_{(Q,a)\sim D,;\mathbf S\sim P(\mathbf S|Q)}\big[U(\mathbf S;Q,a)-\lambda,C(\mathbf S;Q)\big] ] (U)=performance vs oracle (a); (C)=expected cost (tokens/API calls); (\lambda)=tradeoff.
  • Cascaded controller (Eq. 5): (F_\theta = F_{\theta m}\circ F_{\theta r}\circ F_{\theta t}): collaboration determiner (F_{\theta t}:Q\to T); role allocator (F_{\theta r}:(Q,T)\to {R_i}); LLM router (F_{\theta m}:(Q,T,{R_i})\to {M_i}).
  • Collaboration determiner (Eq. 6–7): variational latent (H): (F_{\theta t}(T|Q)=\int p_g(T|H)p_h(H|Q)dH), with (p_h(H|Q)=\mathcal N(\mu_t(Q),\mathrm{diag}(\sigma_t^2(Q)))); softmax via temperature (\tau).
  • Dynamic agent count (Sec. 4.1): (k=\lceil \delta(H)\cdot \gamma\rceil), (\delta:[0,1]), (\gamma)=max agents.
  • Role cascade (Eq. 8–9): sequential role sampling (\prod_{\ell=1}^k \pi^r_\ell(R_\ell|Q,T,R_{<\ell})) with softmax temperature (\tau).
  • LLM routing multinomial (Eq. 10–12): assigns (k) agents across (N_m) LLMs; multinomial coefficient approximated with Gamma to keep gradients: (\Gamma(\delta(H)\gamma+1)/\prod_i\Gamma(n_i+1)).
  • Training (Eq. 13, Sec. 4.4): minimize (\mathbb E[-p(a|Q)+\lambda C(\mathbf S;Q)]); optimized with policy gradient (Williams, 1992).
  • Empirical headline (Table 1): MasRouter best avg 85.93 vs RouterDC 82.42 (+3.51). On MBPP: MasRouter 84.00 vs AFlow 82.20 (+1.80) and AgentPrune 75.40 (+8.60). On HumanEval: MasRouter 90.62 vs RouterDC 87.75 (+2.87).
  • Cost results: overhead on HumanEval reduced 0.185 (intro). Plug-in (Table 2): MacNet HumanEval cost 0.404 with +MasRouter, performance 86.82 → 88.37; MAD HumanEval cost 1.096, performance 86.05 → 87.60.
  • Defaults (Sec. 5.1): learning rate (\alpha=0.01); temperature (\tau=1); (\lambda\in{5,15,25}); iterations (K\in{5,10}); max agents (\gamma=6). LLM pool: gpt-4o-mini-0718, claude-3.5-haiku, gemini-1.5-flash, llama-3.1-70b; temp=1.

📄 Options Framework & Induced SMDP Equations

Paper · source

Formal Options definition (⟨I, π, β⟩) + SMDP models/objectives/Bellman equations for temporally-extended actions

Key content
  • Option definition (Section 4): An option is ⟨I, π, β
    • Initiation set: (I \subseteq S). Option available iff (s\in I).
    • Policy: ( \pi: S\times A \to [0,1]) (Markov) selects primitive actions while option runs.
    • Termination: ( \beta: S^+ \to [0,1]) gives probability option terminates upon arrival in state (s). Episodic terminal state has (\beta(\text{terminal})=1).
  • Primitive actions as options (Section 4): action (a) corresponds to option with (I={s: a\in A_s}), (\beta(s)=1\ \forall s), (\pi(s,a)=1).
  • SMDP “multi-time” option model (Section 5): if option (o) initiated in (s) at time (t), terminates after random duration (k) in (s_{t+k}):
    • Reward model (Eq. 5):
      [ r^o_s = \mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{k-1}r_{t+k}\mid E(o,s,t)\right] ]
    • Discounted transition model (Eq. 6):
      [ p^o_{ss’}=\sum_{j\ge1}\gamma^j \Pr(s_{t+k}=s’,k=j\mid E(o,s,t)) =\mathbb{E}\left[\gamma^k \mathbf{1}{s_{t+k}=s’}\mid E(o,s,t)\right] ]
  • Bellman equations over options (Section 5): for Markov policy over options (\mu(s,o)):
    • State value (Eq. 7): (V^\mu(s)=\sum_{o\in O_s}\mu(s,o)\left[r^o_s+\sum_{s’}p^o_{ss’}V^\mu(s’)\right])
    • Option value (Eq. 8): (Q^\mu(s,o)=r^o_s+\sum_{s’}p^o_{ss’}\sum_{o’\in O_{s’}}\mu(s’,o’)Q^\mu(s’,o’))
  • Optimality with restricted option set (O) (Eq. 9–11):
    [ V^O(s)=\max{o\in O_s}\left[r^o_s+\sum_{s’}p^o_{ss’}V^O(s’)\right] ] [ Q^*O(s,o)=r^o_s+\sum{s’}p^o{ss’}\max_{o’\in O_{s’}}Q^*_O(s’,o’) ]
  • Key procedure (planning): Synchronous Value Iteration with options (Eq. 12):
    (V_{k+1}(s)\leftarrow \max_{o\in O_s}\left[r^o_s+\sum_{s’\in S^+}p^o_{ss’}V_k(s’)\right])

📄 POMDP formalism—belief updates & value iteration

Paper · source

Formal POMDP formulation with belief-state updates, value functions, and planning/acting loop assumptions (explicit equations and definitions).

Key content
  • MDP definition (Sec. 2.1): tuple ⟨S, A, T, R⟩ with finite states/actions.
    • Transition: (T(s,a,s’) = \Pr(s’ \mid s,a))
    • Reward: (R(s,a)) expected immediate reward.
  • Discounted return objective (Sec. 2.2): maximize
    (\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]), with (0<\gamma<1).
  • MDP Bellman optimality equation (Sec. 2.2):
    (Bellman*) (V^(s)=\max_a\left[R(s,a)+\gamma\sum_{s’\in S}T(s,a,s’)V^(s’)\right]).
    Greedy policy: (\pi_V(s)=\arg\max_a\left[R(s,a)+\gamma\sum_{s’}T(s,a,s’)V(s’)\right]).
  • Value iteration algorithm (Alg. 1): initialize (V_1(s)=0). Iterate
    (Q_t^a(s)=R(s,a)+\gamma\sum_{s’}T(s,a,s’)V_{t-1}(s’)); (V_t(s)=\max_a Q_t^a(s)).
    Stop when (|V_t(s)-V_{t-1}(s)|<\varepsilon\ \forall s). Error bound:
    (\max_s |V^{\pi_{V_t}}(s)-V^*(s)| < \frac{2\varepsilon\gamma}{1-\gamma}).
  • POMDP definition (Sec. 3.1): tuple ⟨S, A, T, R, Ω, O⟩ with observations Ω and observation model
    (O(s’,a,o)=\Pr(o\mid s’,a)).
  • Belief state update (Sec. 3.3): belief (b(s)) over S; after action a, obs o:
    (BeliefUpdate) (b’(s’)=\eta; O(s’,a,o)\sum_{s\in S}T(s,a,s’),b(s)), where (\eta) normalizes.
  • Belief-MDP reward (Sec. 3.4): (\rho(b,a)=\sum_{s\in S} b(s)R(s,a)).
    Rationale: belief equals true occupation probabilities under correct model ⇒ (\rho) is true expected reward.
  • Piecewise-linear convex value over beliefs (Sec. 4.1): for t-step policy trees p with vector (\alpha_p=\langle V_p(s_1),…,V_p(s_n)\rangle):
    (V_p(b)=b\cdot \alpha_p); (PWLC) (V_t(b)=\max_{p} b\cdot \alpha_p).

📄 SPAgent — Speculation to Reduce Search-Agent Latency

Paper · source

End-to-end latency benchmarks for a ReAct-style search agent using speculation + scheduling (not just microbenchmarks)

Key content
  • Problem: ReAct “Reason→Action” is strictly serial: full LLM reasoning must finish before tool execution; both inference and tool time are substantial contributors to wall-clock latency (Sec. I, II-A).
  • Key observation: Directly sampled speculative actions (no reasoning tokens) match the post-reasoning action 73.4% at step 1, but can drop to ~11% in later steps (Fig. 1b, Sec. III-A).
  • Two-phase adaptive speculation (Sec. III):
    • Aggressive Speculation Phase: skip reasoning; directly sample actions, execute via Action Server; reduces LLM inference time (Sec. III-B).
    • Verified Speculation Phase: run normal reasoning while parallel speculative action sampling+execution; reuse result if speculative action matches; else fallback to executing correct action (Sec. III-C).
    • Phase transition: self-reflection scoring of speculative actions on 1–5 scale; switch when all scores < threshold τ. Accuracy saturates once τ ≥ 3; τ=2 or 3 best latency/accuracy tradeoff (Fig. 4, Sec. III-D).
  • Action Server (Sec. III-E): in-memory thread-safe dict “Action Buffer” mapping action→state/result; avoids redundant tool calls; footprint ~200 Bytes/task.
  • Scheduling (Sec. IV):
    • Intra-speculation objective (Eq. 1): maximize expected overlap benefit − inference overhead by selecting subset of main requests for speculation.
    • Expected overlap benefit (Eq. 2): with hit prob p and k speculative samples, benefit scales with 1 − (1−p)^k times average tool time t_act (variables in Table I).
    • Overheads: decode overhead (Eq. 3) + prefill overhead (Eq. 4) using engine-profiled hybrid-batch times.
    • Inter-request scheduling: SJF-like “speculation-first” so short (<10 token) speculative jobs finish before long reasoning jobs (Sec. IV-B).
  • Empirical results (Sec. V):
    • Tool latency: Wikipedia API ~1.5 s/request (Setup).
    • Single-request: SPAgent reduces LLM time 23.8% and un-overlapped action exec 29.4% on average vs naive; “Speculative Actions” can increase inference latency up to 26% (Sec. V-B, Fig. 7).
    • Serving: SPAgent achieves 24.2% mean latency reduction on avg, up to 69.6% vs naive & Speculative Actions; Speculative Actions becomes up to 49.3% slower than naive when load > 2 rps (Fig. 8).
    • Accuracy: generally on par; Qwen2.5-32B TriviaQA +>5% accuracy gain (Table II).
    • Action Buffer hit rate: typically ~40% in single-request (Fig. 9a). Default speculative samples k=4 (diminishing returns beyond) (Fig. 9b).

📄 Speculative Actions (predict–verify for faster agents)

Paper · source

Procedure/algorithm for speculative tool/action execution (predict-verify) + measured latency reductions & trade-offs

Key content
  • Core idea (Section 2): Treat each agent step as an API call with non-trivial latency. Break strict sequentiality by running Speculator(s) (fast, cheap) in parallel with Actor(s) (slow, authoritative). Speculator predicts next action (API + params) and often predicted observation/state delta; Actor validates and commits.
  • Algorithm 1 (cache + async futures):
    • Maintain cache mapping API call specifier → pending future response.
    • While waiting for true response at step t, Speculator predicts likely next call(s) for step t+1 and pre-launches them asynchronously.
    • At time of issuing the real next call: if cache hit, skip invocation and only await the already-running future; else call normally.
    • Lossless via: (a) semantic guards (Actor confirms equivalence before commit), (b) safety envelope (only idempotent/reversible/sandboxed side effects), (c) repair paths (rollback/compensating actions).
  • Assumptions (Section 2):
    1. Speculation accuracy: implied next call matches true next call with probability p > 0.
    2. Concurrent + reversible pre-launch: wrong-branch calls have no external side effects or can be rolled back.
  • Proposition 1 (expected runtime ratio): Let L = mean latency of actual API call, l = mean latency of speculative model (with l < L), p = per-step probability speculative branch implies correct next call (independent). Then expected runtime ratio
    [ \frac{\mathbb{E}[T_{\text{spec}}]}{\mathbb{E}[T_{\text{seq}}]}=\frac{1}{2-p}\left(1+\frac{l}{L}\right) ] (Appendix A). Implies ≤50% ideal latency reduction when l→0 and p→1; multi-step speculation can exceed this bound.
  • Empirical results (Abstract + Section 3/4):
    • Next-action prediction accuracy up to 55%; up to 20% end-to-end lossless speedup (Abstract).
    • E-commerce (-bench retail): 22%–38% API-call prediction accuracy; low-budget speculators run 2–3s, below ~30s user typing time; multi-model speculation improves accuracy; ~34% accuracy at typing-time threshold (Appendix B.2).
    • HotpotQA (Wikipedia multi-hop): top-3 strict-match next-call accuracy up to 55%; top-3 >> top-1 (Section 3.3).
    • OS tuning (lossy extension, last-write-wins): p95 latency (ms): Untuned 102.97, Actor-only 54.00, Actor+Spec 37.93 (Section 4.2). Convergence: 10–15s (Actor+Spec) vs ~200s (Actor-only); Spec-only stuck at 0.55ms → 36.24ms, while Actor+Spec reaches 0.2ms → 30.26ms (Section 4.2). Tuned parameter: Linux CFS min_granularity_ns range 50,000–50,000,000 ns; default 3ms (Appendix B.3.1).
  • Cost/latency trade-off: More speculative calls (larger top-k, wider beams) ↑ accuracy but ↑ token/API cost; self-hosted LLMs can mitigate via batching (Section 2, Appendix B.1/B.3.3).

📊 ALFWorld task suite + success metrics (planning benchmark)

Benchmark · source

Task suite definition + standardized success metrics for long-horizon interactive tasks (ALFWorld), enabling numeric comparisons of planning/decomposition methods

Key content
  • ALFWorld setup (Section 2): Parallel aligned environments: TextWorld (high-level text actions) + ALFRED/THOR embodied simulator (low-level robot primitives). Uses PDDL latent state to generate equivalent TextWorld games from ALFRED scenes.
  • Task types + dataset sizes (Table 1): Pick&Place (train 790 / seen 35 / unseen 24), Examine in Light (308/13/18), Clean&Place (650/27/31), Heat&Place (459/16/23), Cool&Place (533/25/21), Pick Two&Place (813/24/17). All: train 3,553, seen 140, unseen 134.
  • Embodied action primitives: MoveAhead, RotateLeft/Right, LookUp/Down, Pickup, Put, Open, Close, ToggleOn/Off.
    TextWorld high-level actions: goto {recep}, take {obj} from {recep}, put {obj} in/on {recep}, open/close {recep}, toggle {obj}{recep}, clean/heat/cool {obj} with {recep}.
  • Splits definition: Seen = rooms seen in training but new object placements/appearances; Unseen = unseen rooms with different layouts/receptacles (OOD generalization).
  • Success metrics (Section 4.1): report task success rate and goal-condition success rate (ALFRED metric for partial completion; e.g., “put a hot potato on countertop” has 3 goal-conditions: heat something; put potato on countertop; heat potato + put on countertop).
  • Key embodied results (Table 2, All Tasks): Seq2Seq 6% (15) seen / 5% (14) unseen; BUTLER 19% (31) seen / 10% (20) unseen; BUTLER-Oracle 37% (46) seen / 26% (37) unseen. Parentheses = goal-condition success.
  • Training pipeline defaults (Appendix B): DAgger IL; 50K episodes (text agents), max 50 steps/episode, replay buffer 500K episodes, batch collect 10, update every 5 steps, sample 64, LR 0.001, grad clip 5, expert assistance anneal 100%→1% over 50K episodes. Beam-search recovery at eval: beam width 10, try top-5 candidates.

📊 Latency-aware parallel orchestration via critical path (LAMaS)

Benchmark · source

Experimental tables + equations showing latency (critical path) vs cost/accuracy for parallel DAG orchestration; ablations on latency reward + critical-path credit assignment.

Key content
  • Parallel latency vs cost distinction (Section 3.3):
    • Latency (critical path), Eq. 1: (L=\sum_{l\in \mathcal{L}} \max_{o\in \mathcal{O}_l} t(o))
      • (\mathcal{L}): layers; (\mathcal{O}_l): operators executed in parallel at layer (l); (t(o)): operator time.
    • Cost, Eq. 2: (C=\sum_{l\in \mathcal{L}} \sum_{o\in \mathcal{O}_l} c(o)) (token/$ cost accumulates additively).
  • Controller sampling (Eq. 4): threshold-based subset selection per layer: pick highest-scoring operators until cumulative confidence exceeds threshold (\tau); EarlyExit ends generation.
  • Reward (Eq. 5): global reward combines task score (S), cost penalty, and latency proxy penalty.
  • Critical-path-aware credit assignment (Eq. 6–7): identify per-layer critical operator (o_l^*!=\arg\max_{o\in \mathcal{O}_l}\hat{t}(o)); apply latency penalty only to bottleneck operators to avoid credit assignment error under parallelism.
  • Latency proxy metric (Eq. 9): CP length (CP len) sums, per layer, the max of (output tokens + scaled tool time). Tool scaling: 1 sec = 50 virtual tokens.
  • Key results vs MaAS (Table 2):
    • GSM8K: 93.37% score, CP 913.5 vs MaAS 93.13%, CP 1474.6 (-38.0%).
    • HumanEval: 92.11%, CP 1042.7 vs 93.00%, CP 1810.8 (-42.4%).
    • MATH: 52.26%, CP 1195.8 vs 51.23%, CP 2218.5 (-46.1%).
  • Fixed baselines tradeoffs (Table 3 examples): GSM8K Generate CP 405.2 (92.80%, cost 0.31) vs LAMaS CP 913.5 (93.37%, cost 0.88); CoT*5+SC cost 1.96 with CP 488.3 (92.99%).
  • Ablations (Table 4):
    • w/o latency weight: GSM8K CP 1215.9 (vs 913.5) and cost 1.73 (vs 0.88). HumanEval CP 1629.3 (vs 1042.7).
    • w/o CP credit (HumanEval): CP 1197.5 (vs 1042.7), score 91.60 (vs 92.11).
  • Defaults (Section 4.1): LLM gpt-4o-mini-0718, temperature 1; layers (=5); sampling times (=5); activation threshold (\tau=0.8); latency weight (\lambda_L=0.5) (normalized by 50); cost penalty (\lambda_C=0.1).

📊 VestaBench (safe long-horizon planning under adversarial constraints)

Benchmark · source

Benchmark construction + evaluation framework/metrics for multi-constraint long-horizon embodied planning with safety + adversarial instructions/environments

Key content
  • Benchmark design (Section 2):
    • Built from VirtualHome (Evolving Graph Simulator) and BEHAVIOR-100 (via Embodied Agent Interface simulator with an action-transition layer).
    • Two datasets:
      • VestaBench-VH: 100 tasks with safety constraints (physical, electrical, contamination, etc.). 70 tasks in normal or adversarial environments; 30 tasks with adversarial instructions the agent must avoid.
      • VestaBench-B50: 50 tasks from BEHAVIOR-100 augmented with safety constraints; simulator provides 30 actions.
    • Key claim: only benchmark (per Table 1) combining multi-constraint tasks + adversarial instructions + adversarial environments, with a guarantee tasks are safely achievable.
  • Problem definition (Section 3):
    • Given instruction t, agent A outputs plan P = (a₁,…,aₙ) with actions aᵢ ∈ 𝒜, executed in simulator S → final environment graph G*.
    • Plan is successful and safe iff predefined success + safety goals/criteria are satisfied on G*.
  • Planning strategies (Section 3, Fig. 3):
    • One-go: generate full multi-action plan once → execute → evaluate.
    • Stepwise: interact for n steps and m trials; each step executes aᵢⱼ → observation oᵢⱼ + state Gᵢⱼ; trajectory τᵢ = {a₁₁,o₁₁,a₁₂,o₁₂,…}. End of each trial: critic J gives feedback fᵢ; repeat until Done or trials exhausted.
  • Evaluation metrics (Section 4.1): report delivery rate, success rate, safety rate.
  • Empirical findings (Section 4.2–4.3):
    • Direct one-go is weakest; direct stepwise improves but remains low.
    • ReAct improves macro/micro success & safety on VestaBench-VH by ~5% and ~10% respectively; minimal gains on B50.
    • ReAct+Critic > ReAct+Reflexion (attributed to stronger critic model).
    • Complexity hurts safety: for ReAct+Critic (1) on VestaBench-VH, safety 66.67% (low), 48.64% (medium), 33.33% (high).
    • Adversarial instructions: agents often generate unsafe plans; struggle to distinguish malicious from safe instructions.
  • Defaults/models (Section 4.1): planning agents include GPT-4.1-Mini and Qwen3-32B; in ReAct+Critic (1), GPT-4.1 used as critic.

📖 Dagster job execution + per-run concurrency controls

Reference Doc · source

Run executor configuration knobs (e.g., max_concurrent, tag_concurrency_limits) and how steps are scheduled within a run

Key content
  • Default execution behavior
    • By default, Dagster runs jobs with multiprocess_executor: each step runs in its own process, and independent steps can run in parallel.
  • Execution entry points (procedures)
    • UI: Launchpad → Launch Run; includes config editor for runtime config.
    • CLI: dg launch --jobs my_job (launches asynchronously via the instance run launcher).
    • Python: JobDefinition.execute_in_process() returns ExecuteInProcessResult.
  • Executor configuration (per-run)
    • Each JobDefinition has an executor_def (an ExecutorDefinition) controlling isolation/parallelism (in-process ↔ multiprocess ↔ k8s pods, etc.).
    • Toggle to in-process via run config YAML:
      execution:
        config:
          in_process:
  • Multiprocess knobs (defaults/parameters)
    • max_concurrent: limits max concurrent subprocesses within a run.
      • Example sets max_concurrent: 4.
    • start_method: controls subprocess spawn method; example uses forkserver to reduce per-process overhead.
  • Op-level concurrency limits (per-run)
    • tag_concurrency_limits: caps concurrent ops matching a tag key or key-value; if launching an op would exceed a limit, it stays queued.
    • Example: overall max_concurrent: 4, plus at most 2 ops with tag database=redshift:
      tag_concurrency_limits:
        - key: database
          value: redshift
          limit: 2
    • Applies per-run only; cross-run limits via celery_executor / celery_k8s_job_executor.

📖 Dagster op retry policies (RetryPolicy & RetryRequested)

Reference Doc · source

Dagster op retry policy configuration (max_retries, delay/backoff/jitter) and retry behavior at the op boundary during job execution

Key content
  • Core behavior (Overview): When an exception occurs during op execution, Dagster can retry that op within the same job run (retry happens at the op boundary, not by rerunning the whole job).
  • Two mechanisms (Relevant APIs / Using op retries):
    • Declarative: attach dagster.RetryPolicy to an op (or job / invocation) so retries are requested automatically on exception.
    • Manual: raise dagster.RetryRequested from inside the op body to conditionally request a retry.
  • RetryPolicy parameters (Section “RetryPolicy”):
    • max_retries = maximum retry attempts (example: max_retries=3)
    • delay = base delay between retries in seconds (example: delay=0.2 → 200ms)
    • backoff modifies delay by attempt number (example enum: Backoff.EXPONENTIAL)
    • jitter adds randomness to delay (example enum: Jitter.PLUS_MINUS)
    • Delay formula (Eq. 1, conceptual): wait_time(attempt) = f(delay, backoff, jitter, attempt_number) where backoff scales with attempt and jitter perturbs the result.
  • Where to set policy (Section “RetryPolicy”):
    1. On op definition: @op(retry_policy=RetryPolicy(...))
    2. On a specific invocation: problematic.with_retry_policy(flakey_op_policy)()
    3. On a job for all contained ops: @job(op_retry_policy=default_policy)
    • Example job-level defaults/overrides: default_policy = RetryPolicy(max_retries=1), override op with RetryPolicy(max_retries=10).
  • RetryRequested usage (Section “RetryRequested”):
    • Pattern: try/exceptif should_retry(e): raise RetryRequested(max_retries=1, seconds_to_wait=1) from e
    • raise ... from e preserves original exception info in Dagster.
  • Applies to asset jobs too: define_asset_job(..., op_retry_policy=RetryPolicy(max_retries=3)).

📖 Dagster+ run-level retries (full deployment settings)

Reference Doc · source

Run-level retry configuration in Dagster+ deployment settings; boundary vs op/asset-level retries

Key content
  • Where configured: Full deployment settings are YAML. Run retries live under run_retries: (Section “Run retries”).
  • Core parameter (Eq. 1 — Run retry cap):
    max_run_retry_attempts = run_retries.max_retries
    • Definition: Maximum number of times Dagster+ will attempt to retry a failed run.
    • Default: 0 (no run retries).
    • Behavior: If run_retries.max_retries is undefined, Dagster+ uses its default.
  • Failure-scope toggle (boundary vs op/asset retries): run_retries.retry_on_asset_or_op_failure
    • Meaning: Whether to retry runs that failed because assets or ops in the run failed.
    • Rationale: Set to false to only retry failures due to the run worker crashing/unexpectedly terminating, and rely on op/asset-level retry policies for op/asset failures (explicit separation of concerns: run-level vs op/asset-level).
    • Version gate: Setting this to false changes behavior only on Dagster version ≥ 1.6.7.
  • Example snippet:
    run_retries:
      max_retries: 0
  • Related operational defaults (often confused with retries): run_monitoring.start_timeout_seconds: 1200, cancel_timeout_seconds: 1200, max_runtime_seconds: 7200 (timeouts affect run state transitions, not retry count).

📖 LangGraph Persistence & Checkpoint Semantics

Reference Doc · source

Concrete checkpoint/persistence semantics (checkpointer configuration, what state is stored, resume/replay behavior) and canonical durable-execution pattern in LangGraph.

Key content
  • Persistence model: Compile a LangGraph with a checkpointer to save a checkpoint (StateSnapshot) at every super-step (a “tick” where all scheduled nodes run, potentially in parallel). Enables HITL, memory, time travel, fault tolerance.
  • Required config (threading): Must pass thread_id in config to persist/resume:
    Config formula: config = { configurable: { thread_id: "<id>" } }
    Checkpointer uses thread_id as the primary key; without it, it cannot save state or resume after interrupts.
  • Checkpoint contents (StateSnapshot fields):
    • values: state channel values at checkpoint
    • next: node names to execute next ([] means complete)
    • config: includes thread_id, checkpoint_ns, checkpoint_id
    • metadata: source ∈ {“input”,“loop”,“update”}, writes (node outputs), step (super-step counter)
    • createdAt, parentConfig, tasks (task id/name/error/interrupts; may include subgraph state)
  • Empirical checkpoint count example: For sequential START -> A -> B -> END, invoking once yields exactly 4 checkpoints: (1) empty/START next, (2) input saved/nodeA next, (3) nodeA outputs/nodeB next, (4) nodeB outputs/complete. Reducers accumulate (e.g., bar becomes ['a','b']).
  • Replay semantics: Invoke with prior checkpoint_id to re-execute after that checkpoint; earlier nodes skipped. LLM calls/API requests/interrupts are re-triggered during replay.
  • Fault tolerance + pending writes: If a node fails mid super-step, LangGraph stores pending writes from successful nodes; on resume you don’t re-run successful nodes.
  • Namespaces: checkpoint_ns="" for root; subgraph checkpoints use "node_name:uuid"; nested join with |. Accessible via config.configurable.checkpoint_ns.
  • APIs: graph.getState(config) (latest or specific checkpoint_id), graph.getStateHistory(config) (most recent first), graph.updateState() creates a new checkpoint; reducer channels accumulate.
  • Defaults/infra: Agent Server / LangGraph API handle checkpointing (and stores) automatically. Checkpointer libs: MemorySaver (in-memory), SqliteSaver, PostgresSaver, MongoDBSaver, RedisSaver. Base interface methods: .put, .putWrites, .getTuple, .list.

📖 OpenAI Agents SDK (Python) — RunConfig & RunOptions

Reference Doc · source

Exact RunConfig fields + runtime hooks/tracing/session/handoff controls; RunOptions chaining + error handlers.

Key content
  • RunConfig (dataclass): config for an entire agent run

    • Model selection
      • model: str | Model | None = None — if set, overrides every agent’s model; model_provider must resolve string names.
      • model_provider: ModelProvider = MultiProvider() — default provider (docs: “Defaults to OpenAI”).
      • model_settings: ModelSettings | Nonenon-null values override agent-specific model settings.
    • Handoffs
      • handoff_input_filter: HandoffInputFilter | None — global filter for all handoffs; per-handoff filter takes precedence.
      • nest_handoff_historydefault disabled; when True, wraps prior run history into one assistant message before handoff if no custom filter.
      • handoff_history_mapper: HandoffHistoryMapper | None — runs only when nest_handoff_history=True; maps normalized transcript → history passed to next agent. If None, runner collapses transcript into one assistant message.
    • Guardrails
      • input_guardrails: list[InputGuardrail] | None — run on initial run input.
      • output_guardrails: list[OutputGuardrail] | None — run on final run output.
    • Tracing/telemetry knobs
      • tracing_disabled — disables tracing entirely.
      • tracing: TracingConfig | None
      • trace_include_sensitive_data — if False, spans exist but tool/LLM inputs/outputs omitted.
      • workflow_name, trace_id (custom), group_id (link traces), trace_metadata (dict).
    • Session/memory
      • session_input_callback: SessionInputCallback | None — default: append new input to session history; custom callback can merge history+input.
      • session_settings: SessionSettings | None — non-null overrides session defaults (e.g., retrieval item count).
    • Pre-model/tool hooks
      • call_model_input_filter(agent, context, ModelInputData) -> ModelInputData — invoked immediately before model call; can edit instructions/items (e.g., token limits, add system prompt).
      • tool_error_formatter(ToolErrorFormatterArgs) -> str | None — format tool errors; None uses SDK default.
    • Reasoning item IDs
      • reasoning_item_id_policy: None/"preserve"|"omit" — preserve IDs or strip them from next-turn model input.
  • RunOptions (TypedDict): arguments for AgentRunner methods

    • previous_response_id, auto_previous_response_id (auto chaining first turn), conversation_id
    • error_handlers: RunErrorHandlers | None keyed by error kind; currently supports max_turns.

📖 OpenAI Agents SDK (Python) — Session backends & persistence semantics

Reference Doc · source

Concrete session backends + what’s stored, how runs resume, and how conversation state is represented.

Key content
  • Core session semantics (client-side memory):
    • Before each run: Runner fetches session history via session.get_items(...) and prepends it to the new turn input.
    • After each run: Runner persists all new items from the run (user input, assistant outputs, tool calls, etc.) into the session.
    • Result: Subsequent Runner.run(..., session=...) includes full stored history automatically (no manual .to_input_list()).
  • Mutual exclusivity (important constraint): Sessions cannot be combined with conversation_id, previous_response_id, or auto_previous_response_id in the same run. Use sessions or OpenAI server-managed continuation mechanisms.
  • Resuming interrupted / HITL runs: If a run pauses for approval, resume by calling Runner.run(...) again with the same session (or another instance pointing to the same backing store) so history continues consistently.
  • History merge control (procedure): RunConfig.session_input_callback(history, new_input) -> final_input
    • Receives copies of history and new_input (safe to mutate).
    • Returned list controls model input for that turn, but SDK persists only new-turn items (filtering/reordering old history doesn’t re-save it).
    • Example policy: keep last 10 history items: history[-10:] + new_input.
  • Retrieval limiting (default/parameter): SessionSettings(limit=None) retrieves all items (default). limit=N retrieves most recent N items; set per-run via RunConfig(session_settings=SessionSettings(limit=50)).
  • Built-in backends (comparison table):
    • SQLiteSession (file-backed or in-memory), AsyncSQLiteSession (aiosqlite), RedisSession, SQLAlchemySession, DaprSession (supports TTL + consistency options), OpenAIConversationsSession (OpenAI Conversations API), wrappers: OpenAIResponsesCompactionSession (Responses API responses.compact), EncryptedSession (encryption + TTL), AdvancedSQLiteSession (branching/analytics).
  • Compaction specifics: OpenAIResponsesCompactionSession can auto-compact after turns; may block streaming until compaction completes. Modes: "previous_response_id" (best when chaining response IDs), "input" (rebuild from session items), default "auto"; if ModelSettings(store=False), "auto" falls back to input-based compaction. Do not wrap OpenAIConversationsSession with compaction wrapper.

📖 Responses API Streaming Event Types (Server-Sent Events / WS)

Reference Doc · source

Enumerated Responses streaming event types + payload shapes (delta vs done, lifecycle/termination, tool-call streaming)

Key content
  • Response lifecycle status values (ResponseStatus): "queued", "in_progress", "completed", "failed", "cancelled", "incomplete".
  • Top-level termination/lifecycle events (each includes type, sequence_number, and often response):
    • ResponseCreatedEvent { response, sequence_number, type }
    • ResponseQueuedEvent { response, sequence_number, type }
    • ResponseInProgressEvent { response, sequence_number, type }
    • ResponseCompletedEvent { response, sequence_number, type }
    • ResponseFailedEvent { response, sequence_number, type }
    • ResponseIncompleteEvent { response, sequence_number, type }
    • ResponseErrorEvent { code, message, param, … } (error during streaming)
  • Output structure events (robust client should handle item/content boundaries):
    • ResponseOutputItemAddedEvent { item, output_index, sequence_number, type }
    • ResponseOutputItemDoneEvent { item, output_index, sequence_number, type }
    • ResponseContentPartAddedEvent / ResponseContentPartDoneEvent { content_index, item_id, output_index, … }
  • Text streaming (delta → done):
    • ResponseTextDeltaEvent { content_index, delta, item_id, output_index, … }
    • ResponseTextDoneEvent { content_index, item_id, logprobs, … }
  • Refusal streaming: ResponseRefusalDeltaEvent / ResponseRefusalDoneEvent with { content_index, delta/refusal, item_id, output_index, … }.
  • Audio streaming: ResponseAudioDeltaEvent { delta, sequence_number, type } and ResponseAudioDoneEvent; transcript equivalents ResponseAudioTranscriptDeltaEvent / Done.
  • Tool-call streaming patterns (all keyed by item_id, output_index, sequence_number):
    • Function args: ResponseFunctionCallArgumentsDeltaEvent { delta, item_id, output_index, … }…DoneEvent { arguments, name, item_id, … }
    • Code interpreter code: …CallCodeDeltaEvent…CallCodeDoneEvent plus …InProgress/Interpreting/Completed
    • Web/file search: …InProgress…Searching…Completed
    • Image gen: …InProgress/Generating…PartialImage…Completed
    • MCP tool calls: args delta/done + in_progress/completed/failed; list-tools in_progress/completed/failed
  • Include-able extra fields (ResponseIncludable): "file_search_call.results", "web_search_call.results", "web_search_call.action.sources", "code_interpreter_call.outputs", "computer_call_output.output.image_url", "message.input_image.image_url", "message.output_text.logprobs", "reasoning.encrypted_content".

📖 Responses API — tool calls, streaming events, execution controls

Reference Doc · source

Exact request/response schema for tool calls, streaming events, and fields controlling execution (tool_choice, parallel tool calls where supported, truncation/limits) that determine what a supervisor can delegate and how results are returned

Key content
  • Core endpoints (Responses):
    • Create: POST /responses
    • Get: GET /responses/{response_id}
    • Delete: DELETE /responses/{response_id}
    • Cancel: POST /responses/{response_id}/cancel
    • Compact: POST /responses/compact
    • List input items: GET /responses/{response_id}/input_items
    • Input token counts: POST /responses/input_tokens
  • Instruction hierarchy (input messages): EasyInputMessage {role, content, ...} where developer/system instructions override user; assistant role is treated as prior model output.
  • Tool execution control (tool_choice):
    • ToolChoiceOptions: "none" | "auto" | "required"
      • none: model will not call tools; generates a message.
      • auto: model may choose message vs tool call(s).
      • required: model must call ≥1 tool.
    • Forcing specific tools (objects): ToolChoiceFunction {name}, ToolChoiceCustom {name}, ToolChoiceMcp {server_label, name}, ToolChoiceShell {type}, ToolChoiceApplyPatch {type}, ToolChoiceAllowed {mode, tools} (constrain to a set).
  • Response lifecycle status (ResponseStatus): "queued" | "in_progress" | "completed" | "failed" | "cancelled" | "incomplete".
  • Streaming/event model: server emits typed events including ResponseCreatedEvent, ResponseInProgressEvent, ResponseCompletedEvent, ResponseFailedEvent, ResponseIncompleteEvent, plus granular deltas/done events for text/audio/refusals and tool calls (e.g., ResponseFunctionCallArgumentsDelta/Done, ResponseMcpCall..., ResponseWebSearchCall..., ResponseFileSearchCall..., ResponseCodeInterpreterCall...).
  • Including extra tool/output data (include[]: ResponseIncludable):
    • web_search_call.action.sources, file_search_call.results, web_search_call.results
    • code_interpreter_call.outputs
    • computer_call_output.output.image_url
    • message.input_image.image_url
    • message.output_text.logprobs
    • reasoning.encrypted_content
  • Output formatting: text.format supports {type:"text"} (default), {type:"json_schema"} (Structured Outputs), {type:"json_object"} (older JSON mode; “not recommended for gpt-4o and newer”).

📖 Runner.run & RunConfig (OpenAI Agents SDK, Python)

Reference Doc · source

Canonical runner signatures, accepted input types, and multi-turn lifecycle semantics

Key content
  • Canonical async runner signature (Runner.run):
    await Runner.run(starting_agent, input, *, context=None, max_turns=DEFAULT_MAX_TURNS, hooks=None, run_config=None, error_handlers=None, previous_response_id=None, auto_previous_response_id=False, conversation_id=None, session=None) -> RunResult

    • Input types: input ∈ { str | list[TResponseInputItem] | RunState[TContext] }
    • starting_agent: Agent[TContext] (required)
  • Lifecycle loop (workflow semantics):

    1. Invoke agent with given input
    2. Stop condition: if agent produces final output of type agent.output_type
    3. If handoff occurs: repeat loop with the new agent
    4. Else: execute tool calls (if any), then re-run loop
  • Turn definition / limit:
    max_turns counts one AI invocation per turn, including tool calls.

  • Exceptions (unless handled):

    • MaxTurnsExceeded when max_turns exceeded
    • GuardrailTripwireTriggered when a guardrail tripwire triggers
    • Guardrail note: Only the first agent’s input guardrails are run.
  • Multi-turn / state parameters:

    • previous_response_id: Responses API optimization to avoid resending prior-turn input
    • conversation_id: uses Responses API conversation state; runner reads/writes items; recommended only if exclusively using OpenAI models (other providers won’t write to Conversation)
    • session: automatic conversation history management
  • Sync + streaming variants:

    • Runner.run_sync(...): wraps run; won’t work inside an existing event loop (e.g., Jupyter/async frameworks).
    • Runner.run_streamed(...) -> RunResultStreaming: provides method to stream semantic events.
  • RunConfig key overrides (global):

    • model: overrides every agent’s model
    • model_provider: resolves string model names (default: OpenAI via MultiProvider)
    • model_settings: non-null values override agent-specific settings
    • input_guardrails (initial input), output_guardrails (final output)
    • handoff_input_filter (global; per-handoff filter takes precedence)
    • nest_handoff_history (beta, default False) + handoff_history_mapper (used when nesting True)
    • call_model_input_filter (edit model input pre-call), tool_error_formatter
    • tracing controls: tracing_disabled, tracing, workflow_name, trace_id, group_id, trace_metadata, trace_include_sensitive_data
    • reasoning_item_id_policy: None/"preserve" keeps IDs; "omit" strips IDs

📖 Streaming API responses (SSE) — Responses vs Chat Completions

Reference Doc · source

Central index of streaming behavior across endpoints (SSE framing, event types, lifecycle patterns, and robust client iteration).

Key content
  • Default behavior: API returns the model’s entire output in one HTTP response; streaming lets clients process output incrementally while generation continues.
  • Enable streaming (Responses API): set stream: true (JS) / stream=True (Python) in client.responses.create(...), then iterate events (for await ... / for event in stream).
  • Transport: HTTP streaming uses Server-Sent Events (SSE) with semantic, typed events (type-safe schemas).
    • Persistent alternative: WebSocket mode (incremental inputs via previous_response_id) is referenced separately.
  • Common lifecycle events (Responses streaming):
    • response.created
    • response.output_text.delta
    • response.completed
    • error
  • Event typing (examples from union list): ResponseCreatedEvent, ResponseInProgressEvent, ResponseCompletedEvent, ResponseOutputTextDelta, ResponseFunctionCallArgumentsDelta/Done, tool-call progress events (file search, code interpreter), plus Error.
  • Chat Completions streaming: set stream: true / stream=True on chat.completions.create(...). Stream returns data-only SSE chunks.
    • Key parsing rule: streamed chunks use choices[0].delta (not message).
      • delta may contain a role token, content token, or nothing (example shows {} at end).
    • To print only text: write chunk.choices[0]?.delta?.content || "" (JS) or check delta.content is not None (Python).
  • Design rationale: OpenAI recommends Responses API for streaming because it’s “designed with streaming in mind” and uses semantic events.
  • Moderation risk: streaming partial outputs makes moderation harder; may affect approved usage.

📖 Structured Outputs (JSON Schema enforcement)

Reference Doc · source

JSON schema-based structured output constraints + enforcement behavior via response_format / text.format

Key content
  • What it guarantees: Structured Outputs ensures the model output adheres to your supplied JSON Schema (not just valid JSON). Prevents missing required keys and invalid enum values.
  • How to enable (Responses API): set text: { format: { type: "json_schema", strict: true, schema: {...} } } or use SDK helpers (responses.parse with Pydantic / Zod).
  • When to use which:
    • Function calling: when connecting model to tools/functions/data in your system.
    • response_format / text.format schema: when you want the assistant’s user-facing response structured for UI, tutoring steps, extraction, etc.
  • Structured Outputs vs JSON mode (table facts):
    • Valid JSON: both Yes
    • Schema adherence: Structured Outputs Yes; JSON mode No
    • Structured Outputs compatible models: gpt-4o-mini, gpt-4o-mini-2024-07-18, gpt-4o-2024-08-06 and later (JSON mode works on broader set incl. gpt-3.5-turbo, gpt-4-*, gpt-4o-*).
    • JSON mode enable: text: { format: { type: "json_object" } }
  • Refusals: If the model refuses for safety, output may not match schema; API includes a refusal content item/field so refusals are programmatically detectable.
  • Schema rules/limits (enforced):
    • Root schema must be an object (cannot be top-level anyOf).
    • All fields must be required; emulate optional via union with null (e.g., "type": ["string","null"]).
    • Objects must set additionalProperties: false.
    • Limits: ≤5000 total object properties, ≤10 nesting levels; total string size ≤120,000 chars; ≤1000 enum values overall.
    • Key ordering: output keys follow schema key order.
  • Supported JSON Schema subset: types string, number, boolean, integer, object, array, enum, anyOf; supports $defs and recursion ($ref: "#", etc.). Unsupported keywords include allOf, not, if/then/else, etc.

📖 Tool/Function Calling Schema & Control Knobs (OpenAI Responses API)

Reference Doc · source

Tool-calling request/response schema (tool definitions, tool_choice behavior, tool call arguments), plus planning-loop-relevant fields like parallel_tool_calls

Key content
  • Tool-calling workflow (5 steps):
    1. Request model with tools available → 2) Model returns tool call(s) → 3) App executes tool(s) → 4) App sends tool outputs back → 5) Model returns final answer or more tool calls.
  • Tool definition schema (function tools):
    {"type":"function","name":..., "description":..., "parameters": JSONSchema, "strict": bool}.
    Example includes additionalProperties:false and required:[...].
  • Tool call item (model → app): response output array contains items with:
    type:"function_call", call_id, name, arguments (JSON-encoded string). Multiple calls may appear in one turn.
  • Tool output item (app → model): append to next request input:
    {"type":"function_call_output","call_id": <from tool call>, "output": <string | array of image/file objects>}.
  • Reasoning-model constraint: for GPT-5 / o4-mini, any reasoning items returned alongside tool calls must be passed back with tool outputs in the next request.
  • tool_choice behaviors (defaults & forcing):
    • Default: "auto" (0, 1, or many tool calls)
    • "required" (must call ≥1 tool)
    • Force one tool: {"type":"function","name":"get_weather"}
    • Restrict without changing tools: {"type":"allowed_tools","mode":"auto","tools":[...]}
    • "none" imitates passing no tools.
  • Parallelism control: parallel_tool_calls:false ⇒ model can call exactly 0 or 1 tool per turn. (Parallel calling not possible with built-in tools.)
  • Strict mode requirements (Structured Outputs): if strict:true: every object must set additionalProperties:false and all properties must be in required; optional fields use union types like ["string","null"].

📖 Typed Handoffs (Supervisor → Sub-agent) in OpenAI Agents SDK

Reference Doc · source

Precise, typed handoff interfaces for transferring context/state to a delegated agent (filters, history nesting, schemas, enable/disable).

Key content
  • Core type aliases

    • HandoffInputFilter (Eq. 1):
      Callable[[HandoffInputData], MaybeAwaitable[HandoffInputData]]
      Filters/edits the data passed to the next agent.
    • HandoffHistoryMapper (Eq. 2):
      Callable[[list[TResponseInputItem]], list[TResponseInputItem]]
      Maps prior transcript → nested summary payload.
  • HandoffInputData (dataclass) fields

    • input_history: str | tuple[TResponseInputItem, ...] — history before Runner.run().
    • pre_handoff_items: tuple[RunItem, ...] — items generated before the turn where handoff invoked.
    • new_items: tuple[RunItem, ...] — items generated during current turn including the triggering item and the tool output message representing the handoff output.
    • run_context: RunContextWrapper[Any] | None = None — optional (backwards compatibility).
    • input_items: tuple[RunItem, ...] | None = None — if set, used instead of new_items to build next agent input (lets you filter duplicates for model input while keeping full new_items in session history).
    • clone(**kwargs) -> HandoffInputData — copy with modifications.
  • Handoff (dataclass) behavior/params

    • input_json_schema — schema exposed to model as tool parameters; describes structured payload passed to on_invoke_handoff and does not replace next agent’s main input.
    • on_invoke_handoff: Callable[[RunContextWrapper[Any], str], Awaitable[TAgent]] — receives (1) handoff run context, (2) LLM JSON args string (or "" if schema empty); must return an agent.
    • input_filter: HandoffInputFilter | None — default: next agent sees entire conversation history; can remove older inputs/tools, etc. Streaming note: results of this function are not streamed; earlier items already streamed.
    • strict_json_schema — recommended True to increase correct JSON input.
    • is_enabled: bool | Callable[[RunContextWrapper[Any], AgentBase[Any]], MaybeAwaitable[bool]] = True — disabled handoffs hidden from LLM at runtime.
    • nest_handoff_history — per-handoff override of run-level nesting behavior.
  • History nesting utilities

    • default_handoff_history_mapper(transcript)single assistant message summarizing transcript.
    • nest_handoff_history(handoff_input_data, history_mapper=None) → summarizes previous transcript for next agent.
    • Wrapper markers: get_conversation_history_wrappers(), set_conversation_history_wrappers(...), reset_conversation_history_wrappers().
  • Factory: handoff(...) -> Handoff

    • Key args: agent (required), tool_name_override, tool_description_override, on_handoff (+ optional input_type for validation/parsing), input_filter, nest_handoff_history, is_enabled.

📋 # Source: https://docs.temporal.io/encyclopedia/event-history/event-history-go

Source ·

📋 # Source: https://openai.github.io/openai-agents-python/ref/memory/

Source ·

📋 # Source: https://openai.github.io/openai-agents-python/ref/memory/session/

Source ·

🔍 LLM Agent Evaluation Metrics & Benchmark Construction

Explainer · source

Definitions/taxonomy of agent evaluation metrics + how benchmarks/evals are run (offline/online, tooling, contexts)

Key content
  • Two-dimensional taxonomy (Section 2):
    • Evaluation Objectives (what): Agent Behavior, Agent Capabilities, Reliability, Safety & Alignment.
    • Evaluation Process (how): Interaction Mode, Evaluation Data, Metrics Computation Methods, Evaluation Tooling, Evaluation Contexts.
  • Agent Behavior metrics (Section 3.1):
    • Task completion: Success Rate (SR) / Task Success Rate / Overall Success Rate; Task Goal Completion (TGC); Pass Rate; binary reward {0,1} for goal achievement.
    • Multi-trial success: pass@k = succeeds at least once in k attempts; stricter pass^ = succeeds in all k attempts (used for mission-critical consistency).
    • Latency: TTFT (Time To First Token) = delay until first streamed token; End-to-End Request Latency = time until complete response (more relevant for async agents).
    • Cost: estimated from input tokens + output tokens (usage-based pricing proxy).
  • Tool-use capability metrics (Section 3.2.1):
    • Invocation Accuracy (call tool vs not), Tool Selection Accuracy, Retrieval Accuracy (rank-based); ranking metrics: MRR, NDCG.
    • Parameter evaluation: parameter name F1; execution-based evaluation runs tool calls to catch semantic errors beyond AST validity.
  • Planning/reasoning metrics (Section 3.2.2): Node F1 (tool set), Edge F1 / Normalized Edit Distance (tool sequence/graph structure), stepwise “next tool” alignment (T-Eval), Progress Rate (trajectory vs expected), Step Success Rate (% plan steps executed).
  • Evaluation process (Section 4):
    • Offline/static datasets vs online/dynamic (simulators/users); Evaluation-driven Development (EDD) + AgentOps loop for continuous monitoring/regression detection.
    • Metric computation methods: code-based (assertions), LLM-as-a-judge, human-in-the-loop (gold standard for subjective/safety).

📋 LangGraph human-in-the-loop via checkpointed interrupts (Pregel runtime)

Code · source

Concrete pattern for pausing/resuming execution (human input) using checkpointing + deterministic graph runtime.

Key content
  • Design rationale (production agents): LangGraph prioritizes control + durability over “easy start.” Agents differ from classic software mainly due to latency (seconds→minutes→hours) and need for: Parallelization, Streaming, Task queue, Checkpointing, Human-in-the-loop, Tracing (six-feature shortlist).
  • Why structured graphs (not one big while-loop): Splitting into discrete nodes enables checkpointing + human-in-the-loop; execution state of arbitrary subroutines can’t be portably saved/resumed across machines.
  • Execution algorithm (Pregel/BSP) procedure (Section “Execution algorithm”):
    • Channels: named data containers with version = monotonically increasing string.
    • Nodes: functions subscribing to channels; run when subscribed channel versions change.
    • Input mapping: initial input written to input channels triggers subscribed nodes.
    • Output mapping: agent returns values of output channels when execution halts.
    • Per-iteration loop:
      1. Select runnable nodes by comparing channel versions vs last-seen versions.
      2. Execute selected nodes in parallel with isolated copies of state.
      3. Nodes write updates locally.
      4. Apply updates to channels in a deterministic order (prevents data races), bump versions.
    • Stop when no nodes runnable or iteration limit reached (developer-set constant).
  • Checkpointing details: Save serialized channel values (default MsgPack, optionally encrypted), channel version strings, and “which versions each node has seen.” Enables resume on any machine, arbitrarily later.
  • Human-in-the-loop mechanism: Add interrupt() inside a node to pause; later resume from checkpoint with human input (scales better than keeping processes waiting).

📋 OpenAI Agents SDK — Examples Index (Patterns & Multi-Agent Building Blocks)

Code · source

Runnable end-to-end examples demonstrating agent composition and handoffs (supervisor-to-specialist patterns) with concrete execution flow and payload shapes.

Key content
  • Where to find runnable implementations: All examples live in the repo under examples/ with categorized subfolders: https://github.com/openai/openai-agents-python/tree/main/examples
  • Agent design patterns (multi-agent relevant): examples/agent_patterns/ includes concrete patterns for:
    • Agents as tools (including streaming events):
      • examples/agent_patterns/agents_as_tools_streaming.py
      • Structured tool inputs: examples/agent_patterns/agents_as_tools_structured.py
    • Parallel agent execution (pattern category explicitly listed).
    • Conditional tool usage and forcing tool use: examples/agent_patterns/forcing_tool_use.py
    • Guardrails & judging: input/output guardrails, “LLM as a judge,” routing, streaming guardrails.
    • Human-in-the-loop (HITL) with approval + state serialization:
      • examples/agent_patterns/human_in_the_loop.py
      • Streaming HITL: examples/agent_patterns/human_in_the_loop_stream.py
      • Custom rejection messages: examples/agent_patterns/human_in_the_loop_custom_rejection.py
  • Handoffs (delegation/message filtering): examples/handoffs/ provides practical handoff flows with message filtering:
    • examples/handoffs/message_filter.py
    • Streaming variant: examples/handoffs/message_filter_streaming.py
  • Basic execution plumbing useful for orchestration: examples/basic/ includes lifecycle hooks (examples/basic/lifecycle_example.py), streaming outputs, retry management (examples/basic/retry.py), and websocket streaming with shared session helper (examples/basic/stream_ws.py).