Agent Memory

Video (best)

Sam Witteveen — “LangChain - Conversations with Memory (explanation & code walkthrough)”
Watch: YouTube
Why: Covers LangChain memory types (buffer, window, summary, entity) with code walkthroughs — the most accessible video introduction to memory patterns in LLM agents.
Level: intermediate

Blog / Written explainer (best)

Lilian Weng — “LLM Powered Autonomous Agents”
Link: https://lilianweng.github.io/posts/2023-06-23-agent/
Why: Weng’s post has a dedicated, well-structured section on agent memory — explicitly breaking down sensory, short-term, and long-term memory, with coverage of vector stores, episodic memory, and knowledge retrieval. It is the most cited written explainer in the agentic AI space and connects memory to the broader agent architecture in a pedagogically coherent way.
Level: intermediate

Deep dive

LangChain / LangGraph Documentation — “Memory in LangGraph”
Link: https://langchain-ai.github.io/langgraph/concepts/memory/
Why: Provides the most comprehensive technical breakdown of how memory is implemented in a production agentic framework — covering in-thread vs. cross-thread memory, semantic memory stores, episodic recall, and integration with vector databases. Bridges theory and implementation better than any standalone article.
Level: advanced

Original paper

Generative Agents: Interactive Simulacra of Human Behavior — Park et al., 2023
Link: https://arxiv.org/abs/2304.03442
Why: This is the clearest seminal paper on agent memory architectures. It explicitly introduces a memory stream (episodic storage), retrieval mechanisms combining recency + importance + relevance, and reflection/synthesis into higher-level memories. Highly readable and directly maps to the concepts taught in this topic (episodic memory, semantic memory, long-term memory, personalization).
Level: intermediate/advanced

Code walkthrough

mem0ai/mem0 GitHub repository — Official mem0 quickstart and examples
Link: https://github.com/mem0ai/mem0
Why: mem0 is the most focused open-source library specifically built for agent memory (long-term, personalized memory for LLM applications). The repository includes working examples of adding, retrieving, and updating memories across sessions — directly demonstrating vector store integration, user-level personalization, and semantic search over memory. Hands-on and immediately runnable.
Level: intermediate

Coverage notes

Strong: Written explainers (Lilian Weng’s agent post is excellent), original research (Generative Agents paper is canonical), and code-level implementations (mem0, LangGraph memory docs).
Weak: Video content from top-tier ML educators. Agent memory is a relatively applied/systems topic that hasn’t attracted the same deep-learning-theory video treatment as transformers or backprop.
Gap: No high-quality video from the preferred educator list (3Blue1Brown, Karpathy, Yannic, StatQuest, Serrano, Stanford/MIT) specifically addresses agent memory architectures. Knowledge graphs as a memory substrate are also underserved in beginner-friendly formats. A dedicated explainer video on the episodic/semantic/procedural memory taxonomy in agents would be a valuable addition to this platform.

Additional Resources for Tutor Depth

33 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.

📄 BM25 / BM25F scoring (Lucene implementation notes)

Paper · source

Concrete BM25 scoring + parameter meanings (k1, b, tf, length norm) and BM25F fielded variant (Eq. 1–2)

Key content

BM25 term contribution (as implemented): for query q and document d, sum over terms t ∈ q that occur in d:
[ R(q,d)=\sum_{t\in q} idf(t)\cdot \frac{tf_{t,d}}{k_1\left((1-b)+b\frac{l_d}{avl_d}\right)+tf_{t,d}} ] where (tf_{t,d}) (“occurs(_{d,t})”) = term frequency of t in d; (l_d)=doc length; (avl_d)=average doc length in collection; (k_1) free parameter (usually 2); (b\in[0,1]) (usually 0.75).
- (b=0): no length normalization; (b=1): full length normalization.
IDF (classical):
[ idf(t)=\log\frac{N-df(t)+0.5}{df(t)+0.5} ] where (N)=#documents; (df(t))=#documents containing term t.
- Note: a common variant multiplies BM25 weight by ((k_1+1)) (mentioned as Wikipedia variant).
BM25F (fielded) accumulated term weight:
[ weight(t,d)=\sum_{c\in d}\frac{tf_{t,d,c}\cdot boost_c}{(1-b_c)+b_c\frac{l_c}{avl_c}} ] with field length (l_c), avg field length (avl_c), field parameter (b_c), and field boost (boost_c).
BM25F final score (Eq. 1):
[ R(q,d)=\sum_{t\in q} idf(t)\cdot \frac{weight(t,d)}{k_1+weight(t,d)} ] with (idf(t)) as above (Eq. 2).
Implementation workflow (Lucene): average lengths not in API → compute at index time via custom Similarity counting tokens per field; after indexing, divide total field length by numDocs and load into parameters at search time.
Defaults: BM25Parameters: (k_1=2), (b=0.75). BM25FParameters: same (k_1); per-field (b_c) defaults 0.75; per-field boosts default 1; arrays must align with fields order.
Design/engineering note: Lucene docFreq is field-level; for BM25F they heuristically compute IDF using the field with the longest average length unless indexing an “all terms” field.

📄 DPR dual-encoder retrieval (dot product + in-batch negatives)

Paper · source

Dual-encoder similarity scoring (dot product), in-batch negatives objective, retrieval pipeline (question encoder + passage encoder + ANN/FAISS index)

Key content

Retrieval setup (Section 3.1): Encode passages with (E_P(\cdot)\in\mathbb{R}^d) and questions with (E_Q(\cdot)\in\mathbb{R}^d); precompute all passage vectors and build ANN index (FAISS). At runtime embed question (v_q=E_Q(q)) and retrieve top-(k) passages by similarity.
Similarity (Eq. 1):
[ \mathrm{sim}(q,p)=E_Q(q)^\top E_P(p) ] (dot product / maximum inner product search). Chosen because decomposable ⇒ passage embeddings can be precomputed; ablations show comparable to L2, better than cosine.
Training objective (Eq. 2): For each question (q_i), positive passage (p_i^+), negatives ({p_{i,j}^-}{j=1}^n):
[ \mathcal{L}=-\log \frac{e^{\mathrm{sim}(q_i,p_i^+)}}{e^{\mathrm{sim}(q_i,p_i^+)}+\sum{j=1}^n e^{\mathrm{sim}(q_i,p_{i,j}^-)}} ]
In-batch negatives (Section 3.2): Batch size (B). Let (Q,P\in\mathbb{R}^{B\times d}); (S=QP^\top\in\mathbb{R}^{B\times B}). Pair ((q_i,p_i)) is positive (diagonal), all ((q_i,p_j), j\neq i) are negatives ⇒ (B-1) negatives per question; effectively trains on (B^2) pairs/batch. Best: in-batch “gold” negatives + one extra BM25 hard negative per question.
Encoders/defaults: Two independent BERT-base (uncased); use ([CLS]) vector; (d=768). Passage units: fixed 100-word blocks; Wikipedia Dec 20, 2018 ⇒ 21,015,324 passages.
Key retrieval results (Table 2, Top-20): DPR vs BM25: NQ 78.4 vs 59.1, TriviaQA 79.4 vs 66.9, WQ 73.2 vs 55.0, TREC 79.8 vs 70.9 (SQuAD is exception: DPR 63.2 vs BM25 68.8). BM25+DPR can improve (e.g., TREC 85.2 Top-20).
Efficiency (Section 5.4): FAISS index retrieval ~995 questions/sec (top-100). Index build: embed 21M passages ~8.8h on 8 GPUs; build FAISS index ~8.5h; Lucene inverted index ~30 min.

📄 Maximal Marginal Relevance (MMR) objective

Paper · source

Primary-source MMR objective with explicit tradeoff parameter λ balancing query relevance vs novelty/diversity (anti-redundancy)

Key content

MMR objective (Eq. 1 / Section 2): select next item incrementally to balance relevance to query and novelty vs already-selected items:
[ \mathrm{MMR} \triangleq \arg\max_{D_i \in R \setminus S}\Big[\lambda,\mathrm{Sim}1(D_i,Q);-;(1-\lambda)\max{D_j\in S}\mathrm{Sim}_2(D_i,D_j)\Big] ] Definitions:
- (C): document collection/stream; (Q): query (or user profile)
- (R = IR(C,Q,\theta)): retrieved/ranked set from an IR system with threshold (\theta)
- (S): subset already selected; (R\setminus S): unselected candidates
- (\mathrm{Sim}_1): similarity for query relevance; (\mathrm{Sim}_2): similarity for redundancy (can equal (\mathrm{Sim}_1) or differ)
- (\lambda \in [0,1]): tradeoff parameter
Parameter behavior (Section 2):
- (\lambda=1) ⇒ standard relevance ranking.
- (\lambda=0) ⇒ maximal diversity ranking within (R).
- Suggested interactive strategy: start (\lambda \approx 0.3) (broad sampling), then refine query and use (\lambda \approx 0.7) (focus).
Summarization procedure (Section 4): segment document into passages (sentences); rerank passages with MMR + cosine similarity; output top passages in original document order.
Empirical results:
- User pilot (Section 3): 80% chose MMR method for a search task.
- SUMMAC’98 (Section 4): query-relevant summaries F-score = 0.73; informative summaries 70% accuracy.
- Table 1 (sentence precision; compression 10% / 25%):
  - 10%: (\lambda=1) .78/.83, (\lambda=.7) .76/.83, (\lambda=.3) .74/.79, Lead .74/.83
  - 25%: (\lambda=1) .74/.76, (\lambda=.7) .73/.74, (\lambda=.3) .74/.76, Lead .60/.65

📄 Options → SMDP framing + regret (UCRL-SMDP)

Paper · source

Formal mapping from MDP+options to SMDP; average-reward objective; regret definition; optimistic exploration (UCRL) adapted to options/macro-actions.

Key content

Option definition (Sec. 2): option (o={I_o,\beta_o,\pi_o}) with initiation set (I_o\subset S), termination prob. (\beta_o:S\to[0,1]), intra-option policy (\pi_o:S\to A).
MDP + options induces SMDP (Prop. 1): (M_O={S_O,O,p_O,r_O,\tau_O}), where
(S_O=(\cup_o I_o)\cup(\cup_o{s:\beta_o(s)>0})).
Transition: (p_O(s,o,s’)=\sum_{k\ge1} \Pr(s_k=s’\mid s,\pi_o),\beta_o(s’)).
(r_O(s,o,s’)): cumulative reward until termination at (s’).
(\tau_O(s,o,s’)): holding time (# primitive steps) until termination.
Average reward in SMDP (Prop. 2, Eq. 1): with (N(t)=\sup{n:\sum_{i=1}^n\tau_i\le t}),
(\rho^\pi(s)=\limsup_{t\to\infty}\mathbb E_\pi[\sum_{i=1}^{N(t)} r_i/t\mid s_0=s]) (and (\liminf)). Communicating ⇒ stationary deterministic (\pi^) exists with constant gain (\rho^).
Optimality equation (Eq. 2):
(u^(s)=\max_a{r(s,a)-\rho^\tau(s,a)+\sum_{s’}p(s’|s,a)u^*(s’)}).
Regret for temporally-extended actions (Def. 1, Eq. 3): after (n) decision steps,
(\Delta(M,A,s,n)=(\sum_{i=1}^n\tau_i)\rho^*(M)-\sum_{i=1}^n r_i). (Reduces to MDP regret when (\tau_i=1).)
UCRL-SMDP procedure (Sec. 3, Fig. 1): episodic optimism. Each episode (k): build confidence set (M_k) over (\tilde r,\tilde\tau,\tilde p); run Extended Value Iteration (EVI) to pick optimistic (\tilde M_k) and policy (\tilde\pi_k); execute until some ((s,a)) sample count doubles.
Uniformization to equivalent MDP (Eq. 4, Prop. 3): choose (\tau<\tau_{\min}).
(r^{eq}(s,a)=r(s,a)/\tau(s,a));
(p^{eq}(s’|s,a)=\frac{\tau}{\tau(s,a)}(p(s’|s,a)-\delta_{s,s’})+\delta_{s,s’}).
Optimal gain preserved; bias rescales by (\tau^{-1}).
EVI update (Eq. 5) + stop (Eq. 6):
(u_{j+1}(s)=\max_a{\tilde r_{j+1}(s,a)/\tilde\tau_{j+1}(s,a)+\frac{\tau}{\tilde\tau_{j+1}(s,a)}(\tilde p_{j+1}^\top u_j-u_j(s))}+u_j(s)).
Stop when (\max_s\Delta u-\min_s\Delta u<\varepsilon); greedy policy is (\varepsilon)-optimal (Lemma 1). Use (\varepsilon=R_{\max}/\sqrt{i_k}).
Options vs primitive actions (Lemma 2): for (T_n=\sum_{i=1}^n\tau_i),
(\Delta(M,A,T_n)=\Delta(M_O,A,n)+T_n(\rho^(M)-\rho^(M_O))). Linear term = policy-class restriction from options.
Holding times distribution (Lemma 3): option holding times are sub-exponential in general; sub-Gaussian iff a.s. bounded (e.g., no cycles).
Regret bound (Thm. 1): w.p. (\ge 1-\delta),
(\Delta=O\big((D\sqrt S + C(M,n,\delta))R_{\max}\sqrt{SA,n\log(n/\delta)}\big)), where (D) is SMDP diameter (Def. 2, Eq. 7).
Lower bound (Thm. 2): (\mathbb E[\Delta]=\Omega((\sqrt D+\sqrt{T_{\max}})R_{\max}\sqrt{SA,n})).
Sufficient condition for options to help (Eq. 9, Sec. 5, bounded holding times): roughly
(\frac{D_O\sqrt S + T_{\max}}{D\sqrt S,\tau_{\min}}\le 1) (longer options can reduce regret if they don’t blow up diameter / don’t reduce optimal gain).

📄 POAD/BAD — Token-level PPO for Language Agents

Paper · source

Formal RL framing + explicit objective/updates for policy optimization with action decomposition (BAD → POAD)

Key content

Language-augmented POMDP setup (Sec. 3.1): observations/actions are token sequences. At env step (t), agent generates action string (a_t=(x_{t,1},\dots,x_{t,n})) autoregressively; env reward (r_t) is given only for complete actions; objective is maximize discounted return (G=\sum_{t=0}^{T}\gamma^t r_t).
Action-level value/advantage (Sec. 3.2):
(Q^\pi(o_t,a_t)=\mathbb{E}[G_t\mid o_t,a_t]) (Eq. 1), (V^\pi(o_t)=\mathbb{E}[G_t\mid o_t]) (Eq. 2), advantage (A_t=Q^\pi(o_t,a_t)-V^\pi(o_t)).
Action likelihood via chain rule (Eq. 3): (\pi_\theta(a_t\mid o_t)=\prod_{i=1}^{n}\pi_\theta(x_{t,i}\mid o_t,x_{t,<i})).
Bellman backups (Eq. 4–5): (Q(o_t,a_t)=r_t+\gamma,\mathbb{E}[V(o_{t+1})]), (V(o_t)=\mathbb{E}_{a\sim\pi}[Q(o_t,a)]).
Problem rationale: action space grows exponentially (|\mathcal{A}|=|\mathcal{V}|^n); action-level RL gives uncertain intra-action token credit and often requires manual action-space restriction (GLAM/TWOSOME).
BAD: Bellman backup with Action Decomposition (Sec. 5.1): modifies naive token-level backups to ensure optimality consistency with original action-level MDP while giving token-level credit (Eq. 13–14). Complexity reduces from (\mathcal{O}(|\mathcal{V}|^n)) to (\mathcal{O}(n|\mathcal{V}|)).
POAD = PPO + BAD (Sec. 5.2):
- Critic learns token value by minimizing empirical token Bellman error (Eq. 15) with a target network (\bar\phi) updated periodically.
- Actor uses PPO clipped objective at token granularity (Eq. 16): ratio (r_{t,i}(\theta)=\frac{\pi_\theta(x_{t,i}\mid \cdot)}{\pi_{\theta_{\text{old}}}(x_{t,i}\mid \cdot)}); advantage (\hat A_{t,i}) via GAE.
Empirical numbers (Sec. 6.3.2, DataSciCoding unrestricted actions up to 128 tokens): ROC-AUC (higher better)
- Cheese: POAD 0.7553 vs NTPO 0.7428
- Hamburger: 0.7602 vs 0.7476
- Apple Pie: 0.7650 vs 0.7141
- Pizza: 0.7625 vs 0.7355
- Washing Plate: 0.7075 vs 0.7491 (NTPO higher)
- Laundry: 0.7014 vs 0.5687
General ability retention (Sec. 6.6): zero-shot benchmarks roughly unchanged; e.g., ARC_C: base 0.44, POAD 0.45; HellaSwag: base 0.57, POAD 0.59; PIQA: 0.78 unchanged; MMLU: 0.41 unchanged.

📄 RAG = retrieval + generation with top‑K document marginalization

Paper · source

RAG marginalization over retrieved docs (top‑k), joint retriever+generator likelihood objective, rationale for retrieval-then-generate

Key content

Model components (Sec. 2):
- Retriever (p_\eta(z\mid x)): returns a (top‑(K) truncated) distribution over passages (z) given input/query (x) (initialized from DPR).
- Generator (p_\theta(y\mid x,z)): seq2seq (BART-large, ~400M params) generating output (y) conditioned on (x) and retrieved passage (z) (implemented by concatenating (x) and (z)).
Latent-document marginalization (Sec. 2.1): retrieved passage (z) is a latent variable; approximate marginalization with top‑(K) docs.
- RAG-Sequence (same doc for whole output):
  [ p(y\mid x)\approx \sum_{z\in \text{top-}K} p_\eta(z\mid x); p_\theta(y\mid x,z) ]
- RAG-Token (doc may vary per token):
  [ p(y\mid x)=\prod_{i} \sum_{z\in \text{top-}K} p_\eta(z\mid x); p_\theta(y_i\mid x,z,y_{<i}) ]
Training objective (Sec. 2.4): minimize negative marginal log-likelihood over paired data ((x,y)): (-\log p(y\mid x)). Jointly learn retriever + generator without retrieval supervision; keep document encoder/index fixed, fine-tune query encoder + BART.
Retriever details (Sec. 2.2–3): DPR bi-encoder with inner product scoring; retrieval via MIPS (FAISS). Wikipedia index: Dec 2018 dump, split into 100-word chunks, ~21M documents.
Defaults/parameters (Sec. 3): retrieve top (K=5) during training; (K=10) at test (chosen on dev). Retrieving more docs: monotonic gains for RAG-Seq; RAG-Tok peaks around 10.
Empirical results (Tables 2/6):
- Jeopardy QGen Q-BLEU-1: BART 19.7, RAG-Token 22.2, RAG-Seq 21.4.
- FEVER label accuracy: BART 64.0/81.1 (3-way/2-way), RAG-Token 72.5/89.5.
- Human eval Jeopardy factuality: RAG better 42.7% vs BART better 7.1%.
Design rationale (Intro/Results): retrieval augments parametric memory to reduce hallucinations, improve factuality/specificity, provide inspectable evidence, and enable knowledge updates by swapping the index (no retraining).

📄 Recursive Summarization for Long-Term Dialogue Memory (LLM-Rsum)

Paper · source

Concrete procedure for recursive summarization/memory compression over long dialogues + evaluation setup for long-term retention

Key content

Problem: Long dialogues cause inconsistency/forgetting even with large context windows; explicit memory helps LLMs “digest” key history.
Two-stage formulation (Section 3, Eq. 1):
Goal decomposed into memory update then response generation:
(p(y \mid \text{history}, x_t) = p(M \mid \text{history}) \cdot p(y \mid M, x_t)) (conceptually; paper defines memory (M) available after session (t)).
Markov assumption: memory after session (t) depends only on previous memory and current session.
Memory iteration (Section 4.1, Eq. 2):
(M_t = \text{LLM}(P_m,; M_{t-1},; S_t))
where (M_t)=updated memory (multi-sentence summary), (S_t)=dialogue session (t), (P_m)=memory-iteration prompt. Initialize (M_0=) "none".
Memory-based response generation (Section 4.2, Eq. 3):
(y_t = \text{LLM}(P_g,; M_t,; x_t))
where (x_t)=current session context, (P_g)=response prompt emphasizing consistency with memory.
Algorithm (Section 4.3, Alg. 1): loop over sessions: update (M) at each session end; generate response using latest (M).
Defaults/params (Implementation): temperature = 0; retriever baselines use BM25/DPR, top-k=3 or 5 utterances.
Datasets (Section 5.1): MSC + Carecall; 5 sessions; evaluate mainly sessions 4–5.
Key results (Session 5, Table 3; ChatGPT backbone):
- MSC: ChatGPT-Rsum F1 20.48, BScore 86.89, human Consistency 1.45 vs vanilla ChatGPT F1 19.41, BScore 86.13, Consistency 1.32.
- Carecall: ChatGPT-Rsum F1 14.02 vs ChatGPT 13.69; human Consistency 1.70 vs 1.43.
Ablation (Table 5, MSC S5): Ours F1 20.48; W/O Memory F1 18.94. Using gold memory slightly worse than predicted memory (BLEU/F1), attributed to gold being fragmented vs recursive memory being more coherent/easy-to-digest.
Complementarity (Table 8): Retrieval + framework improves: ChatGPT-BM25(k=5) F1 20.91 → 21.81; coherence 75.44 → 84.44, consistency 76.88 → 90.68.
Memory error rate (Table 6): fabricated facts 2.7%, incorrect relationships 3.2%, missing details 3.9% (≤ ~10% total issues in sampled summaries).

📄 ReflAct (Reflection-for-Action) vs ReAct Backbone

Paper · source

World-grounded ReAct-style thought/action loop variant with explicit goal grounding + decision procedure

Key content

POMDP framing (Section 2): Task as (M=\langle U,S,A,O,P,R\rangle). Instruction (u\in U), hidden state (s\in S), action (a\in A), observation (o\in O). (U,A,O) are natural language.
ReAct loop formalization (Section 2): Context (c_t=(h_t,o_t)), where (h_t={u,\tau_1,a_1,o_1,\dots,\tau_t,a_t}).
Thought: (\tau_t\sim \pi^{thought}_\theta(\cdot\mid c_t)).
Augmented context: (c’t=c_t\oplus \tau_t).
Action: (a_t\sim \pi^{act}\theta(\cdot\mid c’_t)).
Thought strongly reweights actions (Section 3.1): Average action-distribution entropy over 134 ALFWorld tasks (Llama-3.1-8B-Instruct):
(\bar H_{NoThinking}=1.23) vs (\bar H_{ReAct}=0.30) (Table 1).
Core failure modes of ReAct (Section 3.2): (1) incoherent internal belief/state (revisits, false “holding” assumptions), (2) short-sighted planning vs long-term goal → compounding errors/hallucinations.
ReflAct objective (Section 4): Long-term return (G_t=\sum_{k=0}^\infty \gamma^k R_{t+k}), (\gamma\in[0,1)). Optimal thought:
(\tau^*t=\arg\max{\tau\in T}\mathbb{E}{a\sim \pi^{act}\theta(\cdot\mid c_t\oplus\tau)}[\mathbb{E}[G_t\mid s_t,a]]).
ReflAct procedure (Section 4): Replace “next-action thinking” with goal-state reflection each step. Prompt: “reflect on the agent’s state in relation to the task goal, then output the action.” Reflection space (K): structured encoding of belief state (M) + goal (G).
Empirical gains (Table 2): Success Rate (SR) improvements of ReflAct over ReAct: ALFWorld +36.4%, ScienceWorld +8.5%, Jericho +38.1%. Example rows:
- GPT-4o ALFWorld SR: ReAct 85.1 → ReflAct 93.3.
- Llama-3.1-8B ALFWorld SR: ReAct 29.1 → ReflAct 60.5.
Safety/reliability (Fig. 8): “No tasks where only ReflAct fails”; ReAct introduces unique failures vs NoThinking.
Defaults/params: ALFWorld test set 134 tasks; ScienceWorld 211; Jericho 20. Reflexion trials repeated 3. RAFA cost example: depth (d=2), branching (b=2) ⇒ 13 LLM queries/step vs ReflAct 1.

📄 Tree of Thoughts (ToT) = deliberate search over “thoughts”

Paper · source

Explicit ToT search procedure (generate–evaluate–select/backtrack) + compute-vs-performance vs linear CoT

Key content

LM sequence model (Sec. 2): For tokens (x=(x[1],…,x[n])), (p_\theta(x)=\prod_{i=1}^n p_\theta(x[i]\mid x[1..i])).
IO prompting: (y\sim p^{IO}\theta(y\mid x)).
CoT: sample thoughts sequentially (z_i\sim p^{CoT}\theta(z_{i}\mid x,z_{1..i-1})), then (y\sim p^{CoT}\theta(y\mid x,z{1..n})).
CoT-SC: sample (k) chains, return (\arg\max_y #{i\mid y^{(i)}=y}).
ToT framing (Sec. 3): search tree over states (s=[x,z_{1..i}]) (partial solution). Instantiation choices: (1) thought decomposition granularity; (2) generator (G(p_\theta,s,k)); (3) evaluator (V(p_\theta,S)); (4) search algorithm.
Thought decomposition rationale: thoughts must be “small enough” for diverse generation but “big enough” for evaluability; examples: word (crosswords), equation line (24), paragraph plan (writing).
Generation (Sec. 3.2): (a) i.i.d. sampling (z^{(j)}\sim p^{CoT}\theta(z{i+1}\mid s)) for rich spaces; (b) sequential “propose prompt” ([z^{(1)}..z^{(k)}]\sim p^{propose}\theta(z{i+1}^{(1..k)}\mid s)) to reduce duplicates in constrained spaces.
Evaluation (Sec. 3.3): (a) value each state (V(s)\sim p^{value}\theta(v\mid s)) (scalar or classes like sure/maybe/impossible); (b) vote: pick best (s^*\sim p^{vote}\theta(s^*\mid S)).
Search algorithms (Sec. 3.4):
- ToT-BFS (Alg. 1): expand frontier with (k) thoughts/state; score via (V); keep top breadth (b) each step up to depth (T).
- ToT-DFS (Alg. 2): explore best-first with pruning if (V(s)\le v_{th}); backtrack on prune/failure.
Empirical results (GPT-4, Game of 24; Table 2): IO 7.3%, CoT 4.0%, CoT-SC(k=100) 9.0%, ToT(b=1) 45%, ToT(b=5) 74%; IO+Refine(k=10) 27%; best-of-100: IO 33%, CoT 49%.
Compute/cost (App. B.3, Table 7): Game of 24 per case: IO(best-of-100) 1.8k completion/1.0k prompt tokens, $0.13, 33$ 0.47, 49%; ToT 5.5k/1.4k, $0.74, 74%.
Default experimental knobs: GPT-4 chat completion, temperature 0.7; Game of 24 ToT: BFS, depth (T=3), breadth (b=5), propose prompt, value prompt (“sure/maybe/impossible”), sample values 3× per thought.

📊 AgentBench (LLMs as Agents) — environments, protocol, results

Benchmark · source

Benchmark suite + reported success rates across domains with unified tool interface + evaluation protocol

Key content

Formalization (Section 2): Interactive evaluation of an LLM-as-Agent is modeled as a POMDP:
((\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \mathcal{I}, \mathcal{O})) with state (\mathcal{S}), action space (\mathcal{A}), transition (\mathcal{T}), reward (\mathcal{R}), task instruction space (\mathcal{I}), observation space (\mathcal{O}). Agent denoted (\pi).
Environments (Section 3): 8 total, grouped as:
- Code-grounded: Operating System (bash in Ubuntu Docker; metric Success Rate, SR), Database (authentic SQL interface; SR), Knowledge Graph (partially observable KG QA; metric Answer F1).
- Game-grounded: Digital Card Game (Aquawar; metric win rate), Lateral Thinking Puzzles (host answers yes/no/irrelevant; metric game progress), House Holding (ALFWorld; SR).
- Web-grounded: Web Shopping (WebShop; metric reward), Web Browsing (Mind2Web; metric step SR).
Evaluation defaults (Section 4.1):
- Temperature = 0 (greedy) for reproducibility.
- Multi-round chat history truncated to keep input ≤ 3500 tokens; append "[NOTICE] messages are omitted."
- Prompt format includes “Thought” + “Action” in one round (primitive CoT; no reflection/ensembles/search).
Dataset scale (Table 2): Dev 269, Test 1,014; ~3k and 11k inference calls (≈ MMLU call count).
Main empirical results (Table 3):
- gpt-4 (0613) overall OA 4.01; notable task scores: HH SR 78.0, WS reward 74.5, WB step SR 29.0, OS SR 42.4, DB SR 32.0, KG F1 58.8.
- gpt-3.5-turbo (0613) OA 2.32; OS SR 32.6, HH SR 64.1.
- Best OSS in scope: codellama-34b-instruct OA 0.96 (still far below gpt-3.5-turbo).
- Average API vs OSS OA: 2.32 vs 0.51.
Failure modes (Section 2; Table 4): Finish reasons: CLE, Invalid Format (IF), Invalid Action (IA), Task Limit Exceeded (TLE), Complete. Predominant failure: TLE (weak long-term reasoning/decision-making). DB/DCG often IF; HH/WB often IA.
Design rationale (Section 4.1): Overall score uses task reweighting: rescale each task’s average score to ([0,1]) across evaluated models, then average; fixed future weights use reciprocal of average score per task to avoid easy tasks dominating.

📊 LoCoMo — Very Long-Term Conversational Memory Benchmark

Benchmark · source

LoCoMo construction (multi-session, ~600 turns) + tasks/metrics for evaluating very long-term conversational memory (QA, event summarization, multimodal dialog gen)

Key content

Dataset scale (Table 1 / Conclusion): LoCoMo has 10 conversations, each ~588 turns and ~16,618 tokens on average, spanning ~27.2 sessions (up to 32 sessions) over a few months; multimodal (image sharing). Compared to MSC: 16× longer in tokens, 10× more turns, 5× more sessions.
Generation pipeline (Section 3):
- Persona (Sec 3.1): start from MSC persona seed (4–5 sentences) → expand with gpt-3.5-turbo.
- Temporal event graph (Sec 3.2): per speaker graph G with up to 25 events over 6–12 months; iterative generation in batches k=3; includes causal links l=(e_i,e_j) and event times t_i.
- Agent memory (Sec 3.3): after each session k, produce summary w_k conditioned on session history h_k and prior summary w_{k−1}; store turn-level observations o_{k,j} in long-term memory; responses conditioned on persona p, current session history, retrieved observations, and events between sessions {e ∈ G | t_s^k < t_e < t_s^{k+1}}. Image captions also stored as memory.
- Human editing (Sec 3.4): annotators edited ~15% of turns; removed/substituted ~19% images.
Evaluation tasks (Section 4):
- QA (Sec 4.1): 5 types: single-hop, multi-hop, temporal, open-domain knowledge, adversarial (unanswerable). Metric: token-level F1 after normalization; answers drawn directly from dialogue; QA annotated with turn IDs; for RAG report retrieval recall@k.
- Event summarization (Sec 4.2): summarize events in timeframe; compare to event graph G using FactScore (atomic facts) → precision/recall/F1.
- Multimodal dialog gen (Sec 4.3): evaluate alignment with ground-truth using MMRelevance (+ standard NLG metrics).
Key empirical results (Tables 2–4):
- Human QA overall F1: 87.9 (single-hop 95.1, temporal 92.6).
- Best long-context QA: gpt-4-turbo 128K overall 51.6 (single-hop 72.3, multi-hop 51.5, temporal 51.4, adversarial 15.7).
- Long-context hallucination vulnerability: adversarial drops sharply (e.g., claude-3-sonnet adversarial 2.5, gemini-1.0-pro 5.2).
- Event summarization best: gpt-4-turbo FactScore F1 48.9 (precision 51.9, recall 46.5).
- RAG finding (Sec 6.1/Table 3): storing history as observations (assertions) improves QA vs raw dialog; too many retrieved items hurts (signal-to-noise).

📊 LongBench — Task taxonomy & long-context baselines

Benchmark · source

6-category / 21-task taxonomy + quantitative long-context performance baselines (LongBench, LongBench-E) + effects of truncation & context compression

Key content

Problem formalization (Sec. 3.1): Given (I, C) → output A, where I (input) and A (answer) are short; C is long (thousands of tokens). (Task-specific instantiations in Table 7.)
Benchmark composition: 21 datasets across 6 categories (Single-Doc QA, Multi-Doc QA, Summarization, Few-shot Learning, Synthetic, Code). 4,750 test instances. Avg length: 6,711 words (EN) / 13,386 chars (ZH).
Datasets & metrics (Table 1): QA uses F1 (most) / ROUGE-L (DuReader); Summarization uses ROUGE-L; Few-shot uses Accuracy (CLS) / F1 / ROUGE-L; Synthetic uses Accuracy (EM); Code uses Edit Sim.
Truncation rule when input length L > model max M (Sec. 4.1): truncate from the middle:
S₁:L → [S₁:⌊M/2⌋ ; S_{L-⌊M/2⌋-1:L}] (keep beginning + end).
Model baselines (Tables 2–3, Overall-All): GPT-3.5-Turbo-16k 44.7; ChatGLM2-6B-32k 41.4; Vicuna-16k 30.5; LongChat-32k 31.6; Llama2-4k 26.8.
LongBench-E length robustness (Fig. 3): relative drop from 0–4k → 8k+: ChatGLM2-6B-32k −4%, LongChat-32k −7%, GPT-3.5-16k −17%.
Context compression (Sec. 4.2): Retrieval pipeline: chunk size M=200 or 500 words/chars; take top-N=7 (M=200) or top-N=3 (M=500); retrievers: ada-002, Contriever, BM25. Best retrieval improves Llama2-4k by +21%, but GPT-3.5-16k −2%, ChatGLM2-32k −5% (Table 4). Summarization-based compression generally hurts (Table 5; only helps VCSUM).
Memorization check (Sec. 4.3, Table 6): evaluate w/o context vs with context; ∆score indicates reliance on context (e.g., GPT-3.5 NarrativeQA 4.7→23.6 (+18.9); MultiFieldQA-zh 10.9→61.2 (+50.3)).

📊 Open-Source Agent Framework Comparison (Langfuse)

Benchmark · source

Criteria-based comparison across agent frameworks (control-flow expressiveness, observability/debugging, tool integration, production-readiness tradeoffs)

Key content

Core decision variables (framework selection):
- Task complexity & workflow structure: complex multi-step reasoning benefits from explicit orchestration (e.g., graph-/skill-based); simpler tasks can use lightweight code-centric agents.
- Collaboration / multi-agent needs: choose frameworks supporting role delegation or asynchronous conversations when multiple specialized agents must coordinate.
- Integrations: evaluate ease of tool calling / external system integration.
- Performance & scalability: high concurrency / real-time interactions may require event-driven architectures; observability becomes crucial for optimization.
Framework paradigms & “best for” positioning (comparative facts):
- LangGraph (LangChain extension): graph/DAG-based agent steps as nodes; edges control transitions/data flow. Best when you need precise control, branching, error handling, parallel branching, and stateful workflows; benefits from LangChain integrations.
- OpenAI Agents SDK: structured runtime for roles, tools, triggers; strong if already in OpenAI stack (GPT-4o / GPT-o3 mentioned).
- Google ADK: declarative agent definition + runner abstraction; built-in multi-agent orchestration, tool use, session management; native Gemini ecosystem (also supports other providers).
- CrewAI: role-based “Crew” container coordinating multiple agents; supports memory modules and error-handling logic.
- AutoGen: asynchronous conversation among agents; suited for long tasks, waiting on external events, real-time concurrency, role switching.
- Semantic Kernel: skills + planner; enterprise readiness (security/compliance/Azure integration); multi-language (C#, Python, Java).
- Strands Agents / Pydantic AI / Mastra / Microsoft Agent Framework: emphasize production readiness with OpenTelemetry tracing; Strands adds optional deep AWS/Bedrock integrations; Mastra is TypeScript-first.
Observability rationale: agent systems have many moving parts (prompts, tool calls, branching). Langfuse tracing captures a structured timeline of prompts/responses/tool calls to debug and evaluate production behavior.

📊 Procedural Memory Retrieval Benchmark (ALFWorld)

Benchmark · source

Benchmark tasks + quantitative results isolating procedural (skill/trajectory) retrieval under cross-context (vocabulary shift) generalization.

Key content

Problem definitions (Sec. 3.1):
- Procedural trajectory: sequence of state–action pairs ((s_t, a_t)).
- Procedural similarity aims at functional equivalence independent of object vocabulary: combines structural patterns (action sequence) + semantic intent (Eq. 1; described conceptually).
- Cross-context retrieval: given corpus (C), query (q) with unseen object vocab, retrieval function (R(q,C)\to) ranked subset; objective maximize expected “procedural utility” over relevant set (Eq. 2).
- Generalization gap: performance drop from seen→unseen contexts for method (m) (Eq. 3–4; defined as degradation).
Corpora (Sec. 3.2):
- Expert ALFWorld corpus: 78 trajectories (63 successful + 15 interrupted), avg length 14.2 actions, 42 unique object types, 6 task types.
- AgentInstruct corpus: 336 trajectories (filtered from 954; 35.2% retained), avg length 12.7 actions, 38 object types.
Retrieval methods (Sec. 3.3): embeddings use all-MiniLM-L6-v2 (384-d), cosine similarity (Eq. 5), stored in ChromaDB. Variants: action-only, enriched (adds task metadata/context), LLM procedural summaries (GPT-5), combined. Keyword baseline uses Jaccard (Eq. 6). BM25 omitted in final benchmark due to leakage from keyword-based coverage filtering.
Evaluation workflow (Sec. 3.4):
- LLM-as-judge: GPT-5, reasoning effort “low”, relevance score 1–10; binary relevance threshold 6 (avoids zero-relevant queries).
- Coverage-balanced benchmark: select validation tasks with 8–20 relevant trajectories (estimated via lightweight keyword overlap), then stratify into EASY/MEDIUM/HARD by procedural complexity; final 40 queries.
- Metrics: MAP (Eq. 7), P@k (Eq. 8), Recall@k (Eq. 9), F1@k (Eq. 10), NDCG@k (Eq. 11).
Key empirical results:
- Generalization cliff (Table 1; 78 traj, 36 queries):
  - Combined embeddings MAP 0.844→0.592 (−29.9%)
  - Enriched 0.794→0.565 (−28.9%)
  - Action-only 0.756→0.488 (−35.5%)
  - Summary embeddings 0.754→0.671 (−11.0%), rank 6→1 (best on unseen).
- Coverage-balanced (336 traj, 40 queries; Table Sec. 4.2):
  - State-aware MAP 0.7945 (EASY 0.842, MED 0.746, HARD 0.791)
  - Action-only MAP 0.7231 (EASY 0.668, MED 0.802, HARD 0.699)
- Corpus scale ablation (Sec. 4.3.1): 336 vs 78 (state-aware) MAP 0.7945 vs 0.644; subsample retains 81.0% of full.
Design rationale (Sec. 5): mean-pooled sentence-transformers behave like bag-of-words, discarding temporal order; LLM summaries help by explicitly abstracting object-specific details before embedding.

📊 WebArena benchmark—realistic web environment + functional success metrics

Benchmark · source

Definition of a realistic, reproducible web-navigation environment + task success metrics for autonomous web agents

Key content

Environment formalization (Sec. 2.1): WebArena is an environment with state space (S), action space (A) (Sec. 2.4), and observation space (O) (Sec. 2.3). Transition function is deterministic and defined by website implementations.
- Given intent (I) (NL instruction), agent selects action (a_t) based on (I), current observation (o_t), action history (a_{<t}), observation history (o_{<t}). Action yields new state (s_{t+1}) and observation (o_{t+1}).
Reward / success metric (Sec. 2.1, 3.2): Reward function evaluates whether the execution trajectory (action sequence (a_{1:T}) and intermediate states (s_{1:T})) achieves the intent via functional correctness checks (e.g., verify an order was placed; verify repository contents), not matching a reference action sequence.
Design rationale (Sec. 2):
- Reproducibility: standalone self-hosted sites (avoids CAPTCHAs, content drift, config changes). Delivered via Docker + gym-style APIs, with scripts to reset to a deterministic initial state.
- Realism: uses open-source stacks mirroring real sites + imported real-world data.
Websites/domains (Sec. 2.2): 4 fully functional domains: e-commerce, social forum, collaborative development (GitLab-like), CMS. Plus tools: map, calculator, scratchpad; knowledge resources: Wikipedia + site manuals.
Observation space (Sec. 2.3): browser-like: URL + opened tabs + focused tab content; supports multi-tab tasks. Render modes: DOM/HTML, screenshot, accessibility tree (compact subset of DOM with roles/text/properties). Optional viewport-limited observations.
Action space (Sec. 2.4): compound UI actions: element ops (click/hover/type/press/scroll), tab ops (new_tab/tab_close/tab_focus), navigation (goto/go_back/go_forward). Elements selectable by coordinates or unique element IDs (turns selection into (N)-way classification; e.g., click [1582]).
Benchmark scale (Sec. 3): 812 long-horizon tasks from 241 templates (avg 3.3 instantiations/template). Intent categories: information-seeking, site navigation, content/config operations.
Evaluation details (Sec. 3.2):
- Info-seeking scoring: exact_match, must_include, fuzzy_match (LM-based semantic equivalence using gpt-4-0613).
- Navigation/content tasks: programmatic locators (DB query/API/JS DOM selection) + keyword/URL checks over intermediate states.
- Includes unachievable tasks labeled “N/A” to test non-hallucination.
Key empirical results (Table 2, Sec. 5):
- Best GPT-4 agent overall 14.41% end-to-end success vs human 78.24%.
- With CoT+UA hint: GPT-4 11.70%, GPT-3.5 8.75%, text-bison-001 5.05%.
- Removing “unachievable hint” boosts GPT-4 to 14.41%; GPT-4 still identifies 44.44% of unachievable tasks.
- Error analysis: GPT-4 falsely marks 54.9% of feasible tasks as impossible when UA hint is present.

📖 Completions API — conversation + defaults

Reference Doc · source

Exact request fields/defaults (e.g., background default=false, conversation default=null) and how conversation items are prepended to input_items.

Key content

Endpoint purpose: Create a text completion from a model; request supports attaching/continuing a conversation state via a conversation field.
Conversation field (stateful context):
- conversation: default = null.
- When provided, the API uses the referenced conversation’s stored items as prior context.
- Ordering rule: conversation items are prepended to the request’s input_items (i.e., effective context = conversation.items + input_items). This matters for context window budgeting and “what the model sees first.”
Background execution:
- background: default = false.
- When true, the completion can run asynchronously (“background mode”), useful for long tasks without holding an interactive connection.
Context-management implication (design rationale):
- Prepending conversation history ensures continuity across turns while letting the caller supply new input_items each request.
- Because the model has a finite context window, long conversations may require message trimming/compaction/summarization on the client side (sliding window) to keep the most relevant items in the effective input.
Practical procedure (workflow):
1. Start with conversation=null for a fresh interaction.
2. Persist the returned conversation identifier (if used by your integration).
3. On later turns, send conversation=<id> plus new input_items; remember the API will place prior items before your new items.

📖 Completions API — output budgeting & determinism knobs

Reference Doc · source

Endpoint-specific schema for request/response fields and parameter semantics needed to implement memory-carryover and output budgeting in production.

Key content

Endpoint: POST /v1/completions → returns a Completion object (or a sequence if streamed). Streamed and non-streamed responses share the same shape.
Core budgeting constraint (Eq. 1):
tokens(prompt) + max_tokens ≤ model_context_length
- max_tokens = maximum tokens generated in the completion (min 0).
Prompt formats: prompt can be string, array of strings, array of tokens (numbers), or array of token arrays. <|endoftext|> acts as a document separator; if no prompt, model generates from start of a new document.
Candidate generation controls:
- n (min 1, max 128): number of completions per prompt.
- best_of (min 0, max 20): generates best_of candidates server-side and returns the single best by highest logprob per token. Cannot be streamed. If used with n: best_of > n. Token quota can rise quickly → pair with reasonable max_tokens and stop.
Stopping: stop = string or array (up to 4 sequences); returned text excludes stop sequence. Not supported with reasoning models o3 and o4-mini.
Sampling knobs: temperature ∈ [0,2]; top_p ∈ [0,1]; recommendation: adjust one, not both.
Penalties: frequency_penalty, presence_penalty ∈ [-2, 2].
Determinism: seed (int64) → best-effort deterministic; monitor backend changes via system_fingerprint in response.
Logprobs: logprobs max 5; response may include up to logprobs+1 tokens (always includes sampled token).
Logit bias: map token_id → bias [-100,100]; e.g., {"50256": -100} bans <|endoftext|>.
Streaming: stream: true uses SSE; terminates with data: [DONE]. stream_options only when streaming.
Response usage accounting: usage = {prompt_tokens, completion_tokens, total_tokens}.

📖 LangGraph Graph API (StateGraph) essentials

Reference Doc · source

Exact Python API surface for building/compiling/invoking StateGraph; state reducers (esp. add_messages), conditional routing, Command, recursion limits, runtime context.

Key content

Core model: Graph = State + Nodes + Edges. Nodes do work; edges decide next. Execution proceeds in discrete super-steps (Pregel-style message passing): nodes become active when receiving messages; execution halts when all nodes are inactive and no messages are in transit.
Build procedure (required):
1. Define State schema (TypedDict, dataclass for defaults, or Pydantic BaseModel—slower).
2. builder = StateGraph(State, input_schema=..., output_schema=..., context_schema=...)
3. add_node(name, fn, cache_policy=...)
4. Add edges: add_edge(src, dst); entry via add_edge(START, first); finish via add_edge(node, END)
5. Conditional routing: add_conditional_edges(node, routing_fn, path_map?)
6. Must compile: graph = builder.compile(...) (structure checks; set runtime args like checkpointers/breakpoints/cache).
Reducers (state update rule): Each key has an independent reducer; default = overwrite. With Annotated[T, reducer], updates combine via reducer (e.g., operator.add for list concatenation).
Messages channel best practice: Use add_messages reducer to append new messages and overwrite by message ID on updates; also deserializes inputs like {"messages":[{"type":"human","content":"..."}]} into LangChain Message objects (access via .content). MessagesState = prebuilt state with messages: Annotated[list[AnyMessage], add_messages].
Command primitive: Command(update=..., goto=..., graph=..., resume=...).
- Return from nodes/tools to combine state update + routing (goto adds dynamic edges; static edges still run).
- Input to invoke/stream: only Command(resume=...) for interrupts. To continue a thread normally, pass a plain dict input (not Command(update=...)).
Recursion limit: default 1000 super-steps (v1.0.6+). Set via graph.invoke(inputs, config={"recursion_limit": 5}) (top-level config key). Step counter available at config["metadata"]["langgraph_step"]. RemainingSteps managed value enables proactive routing before GraphRecursionError.
Runtime context: context_schema=...; pass via graph.invoke(..., context={...}); access in nodes via runtime: Runtime[ContextSchema].

📖 LangGraph many-tools (ToolNode + conditional routing)

Reference Doc · source

End-to-end official pattern for scaling tool selection/routing (ToolNode + tools_condition) with runnable code structure

Key content

Tool definition (LangChain @tool)
- Type hints required (define input schema). Docstring becomes tool description by default.
- Prefer snake_case tool names; avoid spaces/special chars for provider compatibility.
- Customize:
  - Name: @tool("web_search")
  - Description: @tool("calculator", description="...")
- Reserved argument names: config, runtime (cannot be tool args). Use ToolRuntime parameter instead.
Advanced tool schemas
- Use Pydantic via @tool(args_schema=WeatherInput) with fields + descriptions (e.g., units: Literal["celsius","fahrenheit"], include_forecast: bool=False).
Runtime access inside tools (ToolRuntime)
- runtime.state (short-term state, incl. messages)
- runtime.context (immutable invocation config; e.g., user_id)
- runtime.store (long-term memory; namespace/key pattern like store.get(("users",), user_id))
- runtime.stream_writer (emit progress updates)
- runtime.execution_info (thread/run IDs, attempt number)
- runtime.server_info (assistant/graph/auth user when on LangGraph Server)
Tool return patterns
- Return string → becomes ToolMessage.
- Return object/dict → serialized structured output.
- Return Command(update=...) to mutate state; can include ToolMessage(..., tool_call_id=runtime.tool_call_id) (example sets preferred_language).
ToolNode + agent loop (core workflow)
- Create: tool_node = ToolNode([search, calculator])
- Graph pattern:
  1. builder = StateGraph(MessagesState)
  2. Nodes: "llm", "tools"=ToolNode(tools)
  3. Edges: START -> "llm"
  4. Conditional: builder.add_conditional_edges("llm", tools_condition) (routes to "tools" or END)
  5. Loop: "tools" -> "llm"
  6. graph = builder.compile()
ToolNode error handling defaults/options
- Default: catch invocation errors, re-raise execution errors.
- handle_tool_errors=True (return error to LLM), string message, callable handler, or tuple of exception types.

📖 OpenAI API Function/Tool Calling (Schemas, tool_choice, outputs)

Reference Doc · source

Exact tool/function JSON schema, tool_choice behavior, and how tool call arguments are returned/parsed

Key content

Tool calling flow (5 steps): (1) Request model with tools it can call → (2) receive function_call(s) → (3) execute app-side code using tool args → (4) send function_call_output with results (matched by call_id) → (5) receive final model response (or more tool calls).
Function tool schema (per tool):
- Fields: type:"function", name, description, parameters (JSON Schema), strict (enforce schema).
- Example requirements for strict schemas: additionalProperties:false on each object; all properties fields must be in required. Optional fields via union types like "type":["string","null"].
How tool calls appear in Responses API: response.output is an array; tool calls are items with type:"function_call" containing call_id, name, and JSON-encoded string arguments (parse with json.loads(arguments)).
Returning tool results: append an item:
{"type":"function_call_output","call_id": <call_id>, "output": <string|array-of-image/file-objects>}. For no-return actions, output e.g. "success".
Multiple calls: assume zero/one/multiple tool calls per response; iterate through response.output.
Reasoning models note: any reasoning items returned alongside tool calls must also be passed back with tool outputs.
tool_choice modes:
- "auto" (default): zero/one/many calls
- "required": one or more calls
- {"type":"function","name":"get_weather"}: force exactly one specific function
- {"type":"allowed_tools","mode":"auto","tools":[...]}
- "none": imitate passing no tools
Parallel calls control: parallel_tool_calls:false ⇒ exactly 0 or 1 tool call. (Built-in tools: no parallel calling.)
Token/cost rationale: tool definitions count as input tokens (injected into system message); keep initial tools small (soft suggestion: <20) or use tool search (only gpt-5.4+).

📖 Responses API — conversation state, truncation, and token limits

Reference Doc · source

Concrete request/response fields & defaults for production conversation state handling (previous_response_id, conversation, truncation, max_output_tokens, background).

Key content

Conversation state (multi-turn)
- previous_response_id: string|null — “Use this to create multi-turn conversations.” Cannot be used with conversation.
- conversation: object|null — “Input items and output items from this response are automatically added to this conversation.”
Context window handling (truncation)
- truncation: "auto" | "disabled" with default = "disabled".
  - "auto": if context exceeds model window, API drops input items in the middle to fit.
  - "disabled": if response would exceed context window, request fails with 400.
Output length control
- max_output_tokens: int|null — upper bound on tokens generated including visible output + reasoning tokens.
Background execution
- background: boolean|null — whether to run the model response in the background.
Tool-call controls
- parallel_tool_calls: boolean — allow parallel tool calls.
- max_tool_calls: int|null — max total built-in tool calls across the response; further attempts ignored.
- tool_choice: string|object, tools: array — configure tool selection and availability.
Retrieving / auditing history
- GET /v1/responses/{response_id} retrieve a Response by id.
- GET /v1/responses/{response_id}/input_items lists input items; pagination limit 1–100 (default 20), order default desc.
- Conversations: POST /v1/conversations (add up to 20 initial items); GET /v1/conversations/{id}/items with limit default 20, order desc.
Usage accounting
- usage: input_tokens, output_tokens, total_tokens, plus cached_tokens and reasoning_tokens.

📖 Trace LangGraph applications (observability + debugging hooks)

Reference Doc · source

Concrete tracing/observability workflow for LangGraph runs (node-level execution, state transitions) via LangSmith; supports production debugging/reliability claims.

Key content

Install (CLI):
- pip install -U langgraph
Minimal “hello world” graph (run structure):
- Build a StateGraph(MessagesState) with explicit START -> node -> END edges, then compile() to a runnable graph.
- Example node signature: def mock_llm(state: MessagesState): return {"messages": [{"role":"ai","content":"hello world"}]}
- Invoke with message input: graph.invoke({"messages": [{"role":"user","content":"hi!"}]})
State schema pattern (message accumulation):
- Define state as TypedDict with messages: Annotated[list, add_messages]
- Rationale: add_messages() appends to the message list rather than overwriting, preserving conversation history across node executions.
Streaming execution (debugging/UX):
- Iterate events: for event in graph.stream({"messages":[("user", user_input)]}): ...
- Print latest assistant message from streamed state updates: value["messages"][-1].content
Persistence/checkpointing for reliability (durable execution):
- SQLite checkpointer:
  - from langgraph.checkpoint.sqlite import SqliteSaver
  - memory = SqliteSaver.from_conn_string(":memory:")
  - graph = graph_builder.compile(checkpointer=memory)
- Rationale: durable execution + resume-from-failure via persisted state.
Conditional routing (control-flow debugging):
- Conditional edges require: (1) upstream node, (2) decision function returning a string outcome, (3) mapping from outcomes to next nodes.

📋 # Source: https://docs.anthropic.com/en/docs/build-with-claude/tool-use

Source ·

📋 # Source: https://milvus.io/api-reference/go/v2.3.x/Index/indexLimitations.md

Source ·

🔍 BM25 / BM25F (Okapi) core equations + rationale

Explainer · source

Derivation-style presentation of BM25/BM25F with notation, saturation + length normalization, plus a concrete tf-idf vs BM25 comparison.

Key content

BIM (Binary Independence Model) log-odds RSV (Eq. 1):
[ RSV_{BIM}=\sum_{i\in q} c_i,\quad c_i=\log\frac{p_i(1-r_i)}{(1-p_i)r_i} ] where (p_i=P(x_i=1\mid R=1)), (r_i=P(x_i=1\mid R=0)), (x_i) indicates term presence. With constant (p_i=0.5) simplifies to IDF weighting (\log(N/df_i)).
BM25 term saturation (Eq. 2): bounded tf contribution via
[ \frac{tf}{k_1+tf} ] (monotone in (tf), asymptotically saturating).
Document length normalization (Eq. 3):
[ B=(1-b)+b\frac{dl}{avdl},\quad 0\le b\le 1 ] (dl)=doc length, (avdl)=avg doc length.
Okapi BM25 scoring (Eq. 4):
[ RSV_{BM25}=\sum_{i\in q}\log\frac{N}{df_i}\cdot \frac{(k_1+1)tf_i}{k_1\left((1-b)+b\frac{dl}{avdl}\right)+tf_i} ] Defaults: (k_1\approx 1.2\text{–}2), (b\approx 0.75). Interpretations: (k_1=0) binary; large (k_1) ≈ raw tf. (b=0) none; (b=1) full length norm.
BM25F (zones) (Eq. 5): weighted tf/length across zones (z):
[ \tilde{tf}i=\sum{z=1}^Z v_z\frac{tf_{zi}}{B_z},\quad B_z=(1-b_z)+b_z\frac{len_z}{avlen_z} ] Then
[ RSV_{BM25F}=\sum_{i\in q}\log\frac{N}{df_i}\cdot\frac{(k_1+1)\tilde{tf}_i}{k_1+\tilde{tf}_i} ] Rationale: eliteness is shared across zones; zone-specific normalization ((b_z)) helps empirically.
Empirical comparison (machine learning query, (k_1=2)):
doc1: learning 1024, machine 1 → tf-idf 87, BM25 31;
doc2: learning 16, machine 8 → tf-idf 75, BM25 42.7 (BM25 favors balanced evidence; tf-idf over-rewards huge tf).

🔍 Deployed Dense Retrieval for Semantic Podcast Episode Search

Explainer · source

End-to-end deployed dense retrieval architecture (shared embeddings + ANN serving) + operational considerations for online vector search

Key content

Dense retrieval setup (Eq. 1: cosine similarity): Train encoders to map query text and episode text metadata into a shared embedding space; retrieve by nearest neighbors.
- Episode input text = concatenation of fields: episode title/description + parent show title/description + other metadata.
- Similarity: cos(q, d) = (q·d) / (||q||·||d||) where q = query vector, d = episode vector.
Model choice rationale: Vanilla BERT not ideal because (1) off-the-shelf sentence embeddings are weak (SBERT finding), (2) English-only pretraining. Chosen base: Universal Sentence Encoder CMLM multilingual (100+ languages) with Conditional Masked Language Modeling objective designed for sentence embeddings.
Training procedure (siamese + in-batch negatives):
- Siamese network with shared weights for query/episode encoders.
- Batch size B: for each positive (q, d) in batch, treat other docs as negatives → B positives and B² − B negatives per batch.
- Compute B×B cosine similarity matrix once per batch; diagonal = positives.
- Losses mentioned: MSE vs identity matrix; later improved with in-batch hard negative mining + margin loss.
Data pipeline: positives from (1) successful search logs (Elasticsearch-derived), (2) query reformulations after initial failure, (3) synthetic queries generated via BART fine-tuned on MS MARCO, (4) curated semantic queries (eval only). Ensure eval episodes not in train.
Production deployment workflow:
- Offline: precompute episode vectors; index in Vespa with ANN (tens of millions of episodes) + first-phase rerank using features like popularity.
- Online: compute query vector via Vertex AI GPU inference; retrieve top 30 semantic episodes from Vespa; use vector cache to avoid recomputation.
- GPU cost result: T4 GPU ~6× cheaper than CPU for inference (load tests).
System design: Dense retrieval complements (doesn’t replace) sparse/term retrieval (Elasticsearch). Final-stage reranker blends sources; add cosine similarity feature to help rank semantic candidates.

🔍 Deployment requirements LangGraph Platform (LangSmith Deployment)

Explainer · source

Explicit deployment requirements (retries, persistence/checkpointing, observability/streaming, scaling) + how LangGraph Platform addresses them

Key content

Renaming (timeline): As of Oct 2025, “LangGraph Platform” renamed to LangSmith Deployment.
When you don’t need it: Stateless, quick, low-volume agents can be deployed simply (e.g., “run them as a lambda”).
When deployment gets hard: Agents that are long-running, stateful, or bursty.
Long-running agents: requirements → platform features
- Avoid holding open connections for hours: launch runs in background + polling endpoints, streaming endpoints, webhooks for status.
- Handle timeouts/disruptions: heartbeat signals to prevent connection closure; stream endpoints can be rejoined after drops; (planned) buffer events during disconnect.
- Reliability via retries + persistence: on failure, retry configurable number of times, and each retry resumes from most recent successful checkpoint (checkpoint-based recovery). Workers use isolated event loops/background threads to reduce exceptions.
- Observability/user feedback: multiple streaming modes—intermediate results, token-by-token LLM messages, and custom payloads emitted by nodes; multiple consumers can attach to the same stream; streams re-establishable.
Bursty agents: requirements → platform features
- Load spikes: built-in task queue; horizontal scaling of server + queue; servers are stateless; queue shares runs fairly and never executes the same run more than once.
- “Double texting” (multiple user messages before prior response completes): four built-in strategies to manage.
Stateful agents: requirements → platform features
- Supports complex state (beyond message lists), short/long-term memory, human-in-the-loop pauses, and human-on-the-loop (“time travel”) resume-from-prior-state.
- Provides optimized checkpointers + memory store; specialized endpoints for human-in-the-loop.
- Storage control: attach TTLs to conversation threads and long-term memory entries for automatic expiry.

🔍 LangGraph = explicit cyclical state-machine graphs for agents

Explainer · source

Rationale + mechanics for building reliable cyclical agent runtimes via nodes + shared state (vs linear DAG chains)

Key content

Design rationale (Motivation):
- Traditional LangChain “chains” are effectively DAGs; they lack an easy way to introduce cycles.
- Many production agents need an LLM-in-a-loop (“for-loop”) to reason about next actions (e.g., refine retrieval queries when initial RAG retrieval is poor).
- Pure agent loops (LLM decides everything) often need more control in production: force specific tools first, control tool calling, vary prompts by state → treated as state machines.
- LangGraph provides a way to specify these state machines as graphs with explicit control flow and termination.
Core abstraction: StateGraph(State)
- State is a central typed key-value object updated over time by nodes.
- State update modes:
  - Override: node returns a new value for a key.
  - Accumulate (Eq. 1): state[key] ← state[key] + delta for list-like fields via Annotated[..., operator.add].
    - Example: all_actions: Annotated[List[str], operator.add].
Nodes
- graph.add_node(name, runnable_or_fn); node input/output are dicts shaped like State; output dict specifies state updates.
- Special terminal node: END (cycles must eventually reach END).
Edges / control flow
1. Entry point: graph.set_entry_point("model")
2. Normal edge: always follow: graph.add_edge("tools","model")
3. Conditional edge (Procedure 1):
  - graph.add_conditional_edge(upstream, router_fn, mapping)
  - router_fn → str; mapping routes e.g. { "end": END, "continue": "tools" }.
Compilation
- app = graph.compile() → runnable supporting LangChain runnable methods (.invoke, .stream, .astream_log, etc.).
Agent state defaults (AgentExecutor recreation)
- input, chat_history, agent_outcome (AgentAction|AgentFinish|None), intermediate_steps: Annotated[list[tuple[AgentAction,str]], operator.add].
- Chat-style variant: messages: Annotated[Sequence[BaseMessage], operator.add] (nodes append messages; supports tool/function calling patterns).

🔍 LangGraph Agent Search Loop (Onyx) — cyclical graphs + evolving state

Explainer · source

Concrete LangGraph architecture for an agentic loop (search→decompose→answer→refine) showing why cyclical graphs matter and how state evolves across steps

Key content

Agent Search workflow (high-level loop):
1. Initial search on original question to gather context.
2. Decompose into narrower sub-questions (disambiguation + focused retrieval), informed by initial search.
3. For each sub-question, run a multi-step pipeline: query expansion → search → document validation → reranking → sub-answer generation → sub-answer verification.
4. Compose initial answer from retrieved docs + sub-answers.
5. If initial answer is lacking, refinement loop: re-decompose to address shortcomings using: (a) question + initial answer, (b) sub-questions/answers + unanswerable sub-questions, (c) entity/relationship/term extraction from initial search. Then generate refined answer.
Why LangGraph (design rationale): flow maps naturally to Nodes/Edges/State; needs control, parallelism, dependency management, streaming, extensibility (incl. future human-in-the-loop and reruns with altered parameters).
Parallelism patterns:
- Map-Reduce fan-out for “identical flows” (e.g., validate each retrieved document in parallel; fan-out updates a bolded state key).
- Subgraphs as nodes for “distinct segments” to avoid unnecessary waiting; they “always use subgraphs as nodes within the parent graph” (not invoked inside a node).
State management best practices (Pydantic):
- Build graph state by inheriting from per-node “update” models (grouped keys, defaults allowed, overlapping keys allowed).
- Defaults guidance: define input state keys without defaults (except documented nested-subgraph exceptions). Define updated keys as type | None = None, except list keys expected to be appended by many nodes.
- Avoid silent bugs where missing parent→subgraph key mapping + default values yields empty/incorrect state instead of errors.
Implementation detail: prototype built in ~1 week / 1 FTE to test end-to-end runtime, fan-out parallelization, subgraph parallelization, state management, streaming.

🔍 LangGraph Cloud (LangSmith Deployment) — scaling & reliability rationale

Explainer · source

System-design rationale for scaling/reliability (deployment model, operational concerns, and what the platform adds beyond local execution)

Key content

LangGraph v0.1 design rationale (control vs. agency):
- Legacy high-level abstractions (e.g., LangChain AgentExecutor) can “hide too many details,” reducing developer control and hurting reliability on complex, domain-specific workflows.
- LangGraph provides low-level control over code flow, prompts, and LLM calls, supporting conditional branching and looping for single- or multi-agent architectures (hierarchical/sequential patterns).
- Enables moderation/quality checks as explicit gates before continuing, reducing chances an agent gets stuck on an unrecoverable path.
- Human-in-the-loop via persistence layer: design workflows to (1) wait for human approval before executing and resuming, (2) edit actions before execution, (3) “time travel” to inspect/rewire/edit/resume execution.
- Supports streaming intermediate steps and token-by-token streaming for responsiveness in long-running tasks.
LangGraph Cloud (beta) deployment model & operational additions:
- Purpose-built infra for scalable, fault-tolerant agent deployment; addresses overload from uneven task distribution that can cause slowdowns/downtime.
- Manages horizontally-scaling task queues and servers plus a robust Postgres checkpointer to store large states/threads and handle many concurrent users.
- Adds real-world interaction patterns:
  - Double-texting on currently-running threads with 4 strategies: reject, queue, interrupt, rollback.
  - Async background jobs for long tasks; completion via polling or webhook.
  - Cron jobs for scheduled tasks.
- Integrated dev/ops: LangGraph Studio for visualizing trajectories, debugging failure modes, adding breakpoints, interruption/state editing/resumption/time travel; sharing for stakeholder feedback.
- Deployment workflow: select LangGraph GitHub repo → one-click deploy (no infra expertise). Integrated with LangSmith for monitoring usage, errors, performance, costs.
Naming note: As of Oct 2025, “LangGraph Platform” renamed to LangSmith Deployment.

🔍 LangGraph in Production — reliability/observability/control

Explainer · source

Production-oriented discussion of reliability/observability/control requirements and how LangGraph + LangSmith address them (with named company examples).

Key content

Production adoption examples (named):
- LinkedIn: Built an AI-powered recruiter using a hierarchical agent system on LangGraph to automate candidate sourcing, matching, and messaging; freed human recruiters for higher-level strategy.
- AppFolio: Copilot for property managers; saved 10+ hours/week, cut app latency, and 2× decision accuracy (as attributed to LangGraph’s impact).
- Replit: Multi-agent software-building copilot; LangGraph enables human-in-the-loop transparency so users can see agent actions (e.g., package installs, file creation).
- Uber: Used LangGraph to streamline large-scale code migrations; structured specialized agents for unit test generation steps “with precision.”
- Elastic: Orchestrated a network of agents for real-time threat detection, improving speed/effectiveness of security response.
Why production is hard (key hurdles):
- Unpredictability of LLMs: dynamic generation + free-form user input makes correctness/context hard to guarantee.
- Complex orchestration: coordinating multiple agents, dependencies, error recovery, and communication.
- Observability/debugging limits: hard to diagnose “why” behind bad agent decisions without tracing/monitoring.
What LangGraph is (design rationale): a controllable agent framework designed for production use, emphasizing:
- Low-level, customizable primitives (fully descriptive; intended to scale beyond prototyping).
- Reliability controls: moderation checks, human-in-the-loop, persisted context for long-running workflows.
- Observability: integrates with LangSmith for visibility into interactions, performance monitoring, debugging.
Historical note: Built in early 2024; positioned as default framework for many production agentic apps; tradeoff: steeper learning curve for scalability.

🔍 Production-ready agentic loop patterns (Shopify Sidekick)

Explainer · source

Production architecture patterns for an agentic loop incl. Just-in-Time (JIT) instructions/context injection + operational considerations

Key content

Agentic loop (architecture): Human input → LLM decides actions → actions executed in environment/tools → feedback collected → loop continues until task complete (single-agent emphasized).
Tool scaling failure mode (“Tool Complexity Problem”):
- 0–20 tools: clear boundaries, easy debugging.
- 20–50 tools: unclear boundaries; tool combinations cause unexpected outcomes.
- 50+ tools: multiple ways to do same task; system hard to reason about.
- Leads to “Death by a Thousand Instructions”: bloated system prompt with special cases/conflicts.
Just-in-Time (JIT) instructions (context injection): Return relevant instructions alongside tool data only when needed; aim: “perfect context… not a token less, not a token more.”
- Benefits: localized guidance, cache efficiency (change instructions without breaking prompt caches), modularity (vary by beta flags, model versions, page context).
Evaluation shift: Reject “vibe testing” / generic “rate 0–10” judges; require principled, statistically rigorous evaluation.
Ground Truth Sets (GTX) over golden datasets: Sample real production conversations; define criteria from observed distribution.
- Human eval: ≥ 3 product experts label conversations across criteria.
- Inter-annotator stats: Cohen’s Kappa, Kendall Tau, Pearson correlation; treat human agreement as theoretical max for judges.
LLM-as-a-Judge calibration: Improve judge from Kappa 0.02 → 0.61; human baseline 0.69. Trust when swapping judge/human in GTX is hard to distinguish.
Pre-prod testing: LLM merchant simulator replays “essence/goals” of real convos through candidate systems to catch regressions.
Training: GRPO with N-stage gated rewards = procedural validation (syntax/schema) + semantic LLM-judge rewards.
- Reward hacking modes: opt-out (“can’t help”), tag hacking, schema violations (hallucinated IDs/enum).
- Fixes improved syntax validation ~93% → ~99%, judge correlation 0.66 → 0.75; end-to-end quality matched SFT baseline.
Deployment recommendations: stay simple (quality>quantity tools), start modular (JIT early), avoid multi-agent early; expect reward hacking; iterative judge refinement.

📋 ToolNode (LangGraph prebuilt tool execution + injection)

Code · source

Reference implementation of ToolNode (tool execution loop, tool-call parsing/dispatch, and how tool outputs are written back into graph state/messages).

Key content

Purpose/design patterns implemented (module docstring):
- Parallel execution of multiple tool calls (efficiency).
- Robust error handling with customizable error messages.
- State injection for tools needing graph state.
- Store injection for tools needing persistent storage.
- Command-based state updates for advanced control flow.
Core types referenced for tool calling loop
- Messages: AIMessage, ToolCall, ToolMessage, RemoveMessage, convert_to_messages.
- Graph update primitives: Command, Send, StreamWriter.
- Runtime context: ToolRuntime bundles state, context, config, stream_writer, tool_call_id, store.
Default error templates (string constants)
- Invalid tool name:
  INVALID_TOOL_NAME_ERROR_TEMPLATE = "Error: {requested_tool} is not a valid tool, try one of [{available_tools}]."
- Tool call error:
  TOOL_CALL_ERROR_TEMPLATE = "Error: {error}\n Please fix your mistakes."
- Execution error:
  TOOL_EXECUTION_ERROR_TEMPLATE = "Error executing tool '{tool_name}' with kwargs {tool_kwargs} with error:\n {error}\n Please fix the error and try again."
- Invocation error:
  TOOL_INVOCATION_ERROR_TEMPLATE = "Error invoking tool '{tool_name}' with kwargs {tool_kwargs} with error:\n {error}\n Please fix the error and try again."
Tool call interception request object
- ToolCallRequest dataclass fields: tool_call: ToolCall, tool: BaseTool | None, runtime: ToolRuntime.
- Direct attribute assignment is deprecated; use ToolCallRequest.override() (enforced via __setattr__ warning).
Injected-argument detection algorithm (_get_all_injected_args(tool))
- Collect annotations from both tool input schema (tool.get_input_schema() + get_all_basemodel_annotations) and function signature (get_type_hints(..., include_extras=True)), preferring schema annotations.
- Detect injected keys via _is_injected_arg_type.
- Runtime injection if param name is "runtime" or annotated with ToolRuntime.
- Store injection if annotated with InjectedStore.

SocraticTutor LLM Wiki

Explorer

Agent Memory

Agent Memory

Video (best)

Blog / Written explainer (best)

Deep dive

Original paper

Code walkthrough

Coverage notes

Additional Resources for Tutor Depth

📄 BM25 / BM25F scoring (Lucene implementation notes)

📄 DPR dual-encoder retrieval (dot product + in-batch negatives)

📄 Maximal Marginal Relevance (MMR) objective

📄 Options → SMDP framing + regret (UCRL-SMDP)

📄 POAD/BAD — Token-level PPO for Language Agents

📄 RAG = retrieval + generation with top‑K document marginalization

📄 Recursive Summarization for Long-Term Dialogue Memory (LLM-Rsum)

📄 ReflAct (Reflection-for-Action) vs ReAct Backbone

📄 Tree of Thoughts (ToT) = deliberate search over “thoughts”

📊 AgentBench (LLMs as Agents) — environments, protocol, results

📊 LoCoMo — Very Long-Term Conversational Memory Benchmark

📊 LongBench — Task taxonomy & long-context baselines

📊 Open-Source Agent Framework Comparison (Langfuse)

📊 Procedural Memory Retrieval Benchmark (ALFWorld)

📊 WebArena benchmark—realistic web environment + functional success metrics

📖 Completions API — conversation + defaults

📖 Completions API — output budgeting & determinism knobs

📖 LangGraph Graph API (StateGraph) essentials

📖 LangGraph many-tools (ToolNode + conditional routing)

📖 OpenAI API Function/Tool Calling (Schemas, tool_choice, outputs)

📖 Responses API — conversation state, truncation, and token limits

📖 Trace LangGraph applications (observability + debugging hooks)

📋 # Source: https://docs.anthropic.com/en/docs/build-with-claude/tool-use

📋 # Source: https://milvus.io/api-reference/go/v2.3.x/Index/indexLimitations.md

🔍 BM25 / BM25F (Okapi) core equations + rationale

🔍 Deployed Dense Retrieval for Semantic Podcast Episode Search

🔍 Deployment requirements LangGraph Platform (LangSmith Deployment)

🔍 LangGraph = explicit cyclical state-machine graphs for agents

🔍 LangGraph Agent Search Loop (Onyx) — cyclical graphs + evolving state

🔍 LangGraph Cloud (LangSmith Deployment) — scaling & reliability rationale

🔍 LangGraph in Production — reliability/observability/control

🔍 Production-ready agentic loop patterns (Shopify Sidekick)

📋 ToolNode (LangGraph prebuilt tool execution + injection)

Related Topics

Graph View

Table of Contents

Backlinks