Agentic Coding
Video (best)
- Andrej Karpathy — “Software Is Changing (Again)”
- Watch: YouTube
- Why: Clear, high-level framing of LLM-driven software development and the shift toward more autonomous/agentic tooling; good conceptual grounding before diving into specific coding agents and workflows.
- Level: Beginner → Intermediate
Blog / Written explainer (best)
- Simon Willison — “Prompt injection explained”
- Link: https://simonwillison.net/2023/Nov/27/prompt-injection-explained/
- Why: Essential security and workflow context for agentic coding (tool use, untrusted context, and how “instructions” can be subverted), which directly impacts rules files, context management, and agent-human collaboration.
- Level: Intermediate
Deep dive
- Anthropic — “Building effective agents” [VERIFY]
- Link: https://www.anthropic.com/research/building-effective-agents [VERIFY]
- Why: Practical patterns for agent design (task decomposition, tool use, feedback loops) that map well to agentic workflows like iterative debugging, multi-step prompt-to-code, and context management.
- Level: Intermediate → Advanced
Original paper
- Yao et al. (2022) — “ReAct: Synergizing Reasoning and Acting in Language Models”
- Link: https://arxiv.org/abs/2210.03629
- Why: Foundational approach for agentic behavior (interleaving reasoning and tool actions) that underpins many modern coding-agent workflows.
- Level: Intermediate → Advanced
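The interleaved reasoning/acting loop ReAct introduces can be sketched in a few lines of Python. This is a minimal sketch only; `llm` and `tools` are hypothetical stand-ins, not the paper's code:

```python
# Minimal ReAct-style loop: the model alternates Thought -> Action -> Observation
# until it emits a final answer. `llm` is any callable returning the next step;
# `tools` maps tool names to callables.

def react_loop(llm, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Model produces a reasoning step plus (optionally) a proposed action.
        step = llm(transcript)          # e.g. "Thought: ...\nAction: search[query]"
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Parse "Action: tool[arg]" and execute the tool, feeding back an Observation.
        if "Action:" in step:
            action = step.split("Action:", 1)[1].strip()
            name, arg = action.split("[", 1)
            result = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {result}\n"
    return None  # gave up after max_steps
```

The key design point, per the paper, is that observations re-enter the context before the next reasoning step, so tool results can redirect the plan mid-trajectory.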
Code walkthrough
- OpenAI Cookbook — “Function calling” examples
- Link: https://cookbook.openai.com/
- Why: Concrete, runnable patterns for tool/function calling that are directly applicable to coding agents (planning → tool invocation → result integration), and a good base for building prompt-to-code and iterative debugging loops.
- Level: Intermediate
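The planning → tool invocation → result integration cycle these examples teach reduces to a small dispatch loop. A minimal sketch with a hypothetical `model` callable standing in for a chat-completion API (not the Cookbook's actual client code):

```python
import json

# Generic tool-calling dispatch loop: the model returns either plain text
# (final answer) or a tool call encoded as JSON; we run the tool and feed
# the result back as a "tool" message for the next round.

def run_with_tools(model, registry, user_msg, max_rounds=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_rounds):
        reply = model(messages)  # e.g. '{"tool": "add", "args": {"a": 1, "b": 2}}'
        call = json.loads(reply) if reply.lstrip().startswith("{") else None
        if call is None or "tool" not in call:
            return reply  # plain text => final answer
        result = registry[call["tool"]](**call["args"])   # invoke the tool
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return None
```

Real APIs return structured tool-call objects rather than raw JSON strings, but the loop shape (detect call → execute → append result → re-query) is the same, and it is the base for iterative-debugging agents that run tests instead of calculators.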
Coverage notes
- Strong: High-level motivation for agentic coding; foundational agent pattern (ReAct); practical tool-calling patterns; security considerations relevant to context/rules.
- Weak: Specific IDE agent products (Claude Code, Cursor, Windsurf, GitHub Copilot) and their exact feature sets change rapidly and are not covered deeply by the above evergreen resources.
- Gap: A stable, vendor-neutral “agents.md / rules files” best-practices spec and a canonical, long-lived multi-file refactor walkthrough using a modern coding agent in a real repo.
Additional Resources for Tutor Depth
9 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 Toolformer — self-supervised tool/API call insertion via likelihood filtering
Paper · source
Toolformer’s self-supervised procedure: generate candidate API calls, execute, filter by future-token loss improvement, finetune on augmented text.
Key content
- Goal (Section 1–2): Train an LM $M$ to decide which API to call, when, with which arguments, and how to use results in next-token prediction; requires only a handful of demonstrations per API.
- Pipeline (Figure 2, Section 2):
  - Sample API calls: prompt $P(\mathbf{x})$ to annotate plain text $\mathbf{x}=x_{1:n}$ with candidate calls at position $i$.
  - Execute calls to obtain results $r_i$.
  - Filter calls by whether they improve predicting future tokens; merge surviving calls across tools; finetune on augmented corpus $\mathcal{C}^*$.
- Filtering objective (Section 2): weighted cross-entropy loss over future tokens:
  $$L_i(\mathbf{z}) = -\sum_{j=i}^{n} w_{j-i}\,\log p_M(x_j \mid \mathbf{z},\, x_{1:j-1})$$
  Define $L_i^{+} = L_i(e(c_i, r_i))$ and $L_i^{-} = \min\big(L_i(\varepsilon),\, L_i(e(c_i, \varepsilon))\big)$; keep the call if $L_i^{-} - L_i^{+} \ge \tau_f$, where $e(\cdot)$ inserts the call-plus-result text, $\varepsilon$ is the empty sequence, and $\tau_f$ is a tool-specific threshold.
- Weighting (Section 4.1): $\tilde w_t = \max(0,\, 1 - 0.2t)$, $w_t = \tilde w_t / \sum_s \tilde w_s$ (encourages calls near where they are useful).
- Finetuning (Section 4.1): batch size 128, LR $1\times10^{-5}$, linear warmup 10%.
- Tools (Section 3): QA (Atlas), Wikipedia search, calculator, calendar (current date), translation (NLLB 600M + fastText language ID).
- Empirical highlights (Section 4.2):
- LAMA: Toolformer improves over best same-size baseline by +11.7 / +5.2 / +18.6 points (SQuAD / Google-RE / T-REx subsets); uses QA tool 98.1% of cases.
- Math (ASDiv/SVAMP/MAWPS): enabling API calls more than doubles performance; calculator used 97.9% of examples.
- Scaling (Section 4.4): effective tool use emerges around 775M parameters (GPT-2 family); smaller models show little gain.
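The filtering criterion above can be sketched numerically, assuming per-token losses for the three prefix variants are already computed (the paper's actual implementation differs):

```python
# Toolformer-style filtering: keep an API call only if conditioning on
# call + result lowers the weighted future-token loss by at least tau.

def weights(n_future):
    # w~_t = max(0, 1 - 0.2t), normalized to sum to 1 (Section 4.1).
    raw = [max(0.0, 1 - 0.2 * t) for t in range(n_future)]
    s = sum(raw)
    return [r / s for r in raw]

def weighted_loss(per_token_losses):
    # L_i(z): weighted cross-entropy over tokens x_i..x_n given prefix z.
    w = weights(len(per_token_losses))
    return sum(wi * li for wi, li in zip(w, per_token_losses))

def keep_call(losses_with_result, losses_no_call, losses_call_only, tau):
    l_plus = weighted_loss(losses_with_result)               # L_i^+
    l_minus = min(weighted_loss(losses_no_call),
                  weighted_loss(losses_call_only))           # L_i^-
    return l_minus - l_plus >= tau
```

Note the asymmetry: the call is kept only if it beats both the no-call baseline and the call-without-result baseline, so the model must actually be using the tool's output.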
📊 RACE-bench (Reasoning-Augmented Repo-Level Code Agent Eval)
Benchmark · source
Repository-level agent evaluation protocol for feature addition with dual-track (patch + reasoning) metrics
Key content
- Benchmark scope/design (Sec. 2.1): RACE-bench = 528 real-world feature addition instances from 12 OSS Python repos. Each instance includes: Task Context (issue text + env setup + optional hints), Reasoning Ground Truth (4 stages / 5 modules), and Verification (tests + gold/test patches).
- Verification protocol (Sec. 2.1, 2.2.2): uses Fail-to-Pass (FTP) tests (fail on base commit, pass after gold patch) + Pass-to-Pass (PTP) tests (regression preservation). Patches applied via `git apply`; tests run with `pytest` in per-instance Docker.
- Reasoning Ground Truth construction (Sec. 2.2.3):
- Issue Understanding: DeepSeek generates Concept Explanations + Goal Expectation (behavior-only; no code details).
- File Localization: derive from gold/test patches; ablation per modified file—remove file’s changes, rerun tests → label Necessary Code File if tests fail; else Other File; test patch files = Test Files.
- Issue Implementation: static parse gold patch to extract changed functions/methods (classes treated as containers); annotate purpose + is_necessary using FTP tests.
- Step Decomposition: minimal ordered steps from closed taxonomy: introduce new capability; reuse existing semantics; change existing semantics; deprecate/replace behavior; enforce constraints/edge cases.
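The per-file ablation used for File Localization labels can be sketched as follows; `tests_pass_without` is a hypothetical helper that reruns the suite with one file's changes reverted:

```python
# RACE-bench-style file labeling: a modified file is a "Necessary Code File"
# if reverting its changes makes the tests fail; otherwise "Other File".
# Files touched only by the test patch are labeled "Test File".

def label_files(gold_patch_files, test_patch_files, tests_pass_without):
    labels = {}
    for f in gold_patch_files:
        # Ablate this file's changes and rerun the verification tests.
        labels[f] = "other_file" if tests_pass_without(f) else "necessary_code_file"
    for f in test_patch_files:
        labels[f] = "test_file"
    return labels
```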
- Dual-track evaluation (Sec. 2.3–2.4):
- Patch metrics: Resolved Rate = % instances passing FTP+PTP on first attempt; Patch Apply Rate = % patches that apply cleanly.
- Reasoning metrics (Table 1): Recall/OverPrediction for files/tasks/steps; Score@GoalExpectation (10-pt LLM judge); concept recall/accuracy.
- Key empirical results (Sec. 4.1 Table 2):
- AutoCodeRover: Apply 96.21% (508/528), Resolved 28.79% (152/528)
- TraeAgent: Apply 78.98% (417/528), Resolved 52.65% (278/528)
- mini-SWE-Agent: Apply 95.83% (506/528), Resolved 70.08% (370/528)
- Reasoning findings (Sec. 4.2): High intent understanding (Score@Goal ~9.2–9.6/10) but “waterfall” drop from file→task→step recall (e.g., mini-SWE-Agent Recall@Files 0.890 → Recall@Tasks 0.751 → Recall@Steps 0.445). Apply-success/test-fail cases: 35.7% recall decrease and 94.1% over-prediction increase vs successes.
- Defaults/params (Sec. 3.2): single run per instance; temperature=0, top_p=1; max tokens 4096 (agent) / 8192 (summarizer).
📊 RepoBench — repository-level code completion benchmark
Benchmark · source
RepoBench task suite (RepoBench-R retrieval, RepoBench-C completion, RepoBench-P pipeline) + multi-file evaluation protocol
Key content
- Motivation (Sec. 1): Prior benchmarks are mostly single-file; RepoBench targets repository-level (multi-file) auto-completion with explicit cross-file context.
- Data (Sec. 3.1–3.2):
- Train source: `github-code` (cutoff Mar 16, 2022); select repos with 32–128 Python/Java files.
- Test source: newly crawled non-fork GitHub repos created Feb 9, 2023–Aug 3, 2023 (to reduce leakage).
- Parsed with tree-sitter focusing on import statements → identify cross-file modules, “cross-file lines,” and defining snippets.
- Sizes: training repos 10,345 Python / 14,956 Java; test repos 1,075 Python / 594 Java.
- Task settings (Sec. 3.3):
- XF-F: mask first cross-file line (hardest). XF-R: mask random non-first cross-file line. IF: mask in-file line (no cross-file module).
- Prompt construction (Fig. 1, App. A): cross-file snippets (commented, with path) + in-file context (path + imports + preceding lines). Default in RepoBench-C: max 30 preceding lines.
- RepoBench-R retrieval (Sec. 3.3, 4.1):
- Retrieval objective: top-k by similarity
  $$\arg\max^{k}_{i \in \{1..n\}} f(C[-m:],\, S_i)$$
  where $C$ = in-file code, $S_i$ = candidate snippet, $n$ = #candidates, $m$ = kept preceding lines (baseline $m=3$), $f$ = similarity.
- Candidates: Easy 5–9, Hard ≥10. Metric: acc@k (Easy: @1, @3; Hard: @1, @3, @5).
- Key results (Table 2, Hard/Python acc@1): InstructOR 19.10, UniXcoder 18.48, Jaccard 10.47, Random 6.43. (Easy/Python acc@1: InstructOR 28.22, UniXcoder 27.09.)
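The Jaccard baseline in this comparison amounts to token-set similarity between the last $m$ in-file lines and each candidate snippet. A minimal sketch (not the benchmark's code):

```python
# Lexical retrieval baseline: rank candidate cross-file snippets by Jaccard
# similarity of token sets against the tail of the in-file context.

def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_top_k(in_file_lines, candidates, k=3, m=3):
    # Query = whitespace tokens of the last m preceding lines (baseline m=3).
    query = " ".join(in_file_lines[-m:]).split()
    scored = [(jaccard(query, snip.split()), i) for i, snip in enumerate(candidates)]
    scored.sort(key=lambda t: (-t[0], t[1]))    # highest similarity first
    return [i for _, i in scored[:k]]
```

The Table 2 numbers show even this lexical baseline beats random, while dense retrievers (InstructOR, UniXcoder) add roughly 8–9 points of acc@1 on the Hard split.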
- RepoBench-C completion (Sec. 3.3, 4.2):
- Autoregressive next-line probability (Eq. 1):
  $$P(Y) = \prod_{i=1}^{n} P(y_i \mid y_{<i},\, C_x,\, C_{in})$$
  where $C_x$ = cross-file context, $C_{in}$ = in-file context.
- Subsets: 2k prompts ≤ 1,925 tokens (for the 2,048 limit); 8k prompts ≤ 7,685 tokens.
- Metrics: Exact Match (EM), Edit Similarity, CodeBLEU.
- Key results (Table 3, 2k/Python EM): CodeLlama‑34B 37.40 (best); Codex 31.31. (2k/Java EM: Codex 42.47 best; CodeLlama‑34B 39.41.)
- RepoBench-P pipeline (Sec. 4.3):
- Pipeline probability (Eq. 2):
  $$P(Y) = \prod_{i=1}^{n} P(y_i \mid y_{<i},\, S_1..S_k,\, C_{in})$$
- Constraints: minimum prompt tokens 12k (Python) / 24k (Java); retrieval requires ≥10 candidates.
- Codex baseline config: reserve 1,600 tokens for in-file; crop 60 preceding lines; fill to 6,400 tokens with cross-file snippets.
- Key result (Table 4, Python EM): in-file-only baseline 33.15 vs Jaccard 36.46 vs UniXcoder-L2H 37.11; even Random 34.94 improves → cross-file context helps; snippet ordering matters (higher-similarity nearer completion helps).
📊 SWE-bench Verified (human-validated SWE-bench subset)
Benchmark · source
Defines SWE-bench Verified (500 human-filtered SWE-bench instances) and reports verified performance comparisons + rationale for filtering.
Key content
- What SWE-bench evaluates (workflow):
- Input to agent: GitHub issue text (“problem statement”) + repository codebase; tests are hidden.
- Output: a patch (multi-file edits allowed) intended to fix the issue.
- Scoring requires both:
  - `FAIL_TO_PASS` tests: fail before the PR solution, pass after; passing implies the issue is solved.
  - `PASS_TO_PASS` tests: pass both before and after; passing implies no regressions.
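The two-sided scoring rule can be sketched as follows, assuming a `results` map of test outcomes after applying the candidate patch:

```python
# SWE-bench-style scoring: an instance is resolved only if every FAIL_TO_PASS
# test now passes AND every PASS_TO_PASS test still passes.

def is_resolved(results, fail_to_pass, pass_to_pass):
    # results: test id -> True (pass) / False (fail); missing result counts as fail.
    issue_fixed = all(results.get(t, False) for t in fail_to_pass)
    no_regressions = all(results.get(t, False) for t in pass_to_pass)
    return issue_fixed and no_regressions
```

Requiring both sets is what makes the metric strict: a patch that fixes the issue but breaks an unrelated test scores zero for the instance.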
- Why Verified was created (design rationale): Original SWE-bench can systematically underestimate capability due to:
- overly specific / unrelated unit tests rejecting valid solutions,
- underspecified issue descriptions,
- unreliable environment setup causing failures independent of solution.
- SWE-bench Verified definition & construction (procedure):
- Human annotation campaign: 93 Python-experienced developers.
- Annotated 1,699 random SWE-bench test samples; each sample labeled 3×.
- Two main criteria labeled on severity scale {0,1,2,3}: underspecification; unfair `FAIL_TO_PASS` tests.
- Ensembling rule: take max severity across the 3 annotators.
- Filter rule: discard any sample where either criterion has ensemble ≥ 2, or “other major issues” flagged.
- Final dataset: 500 non-problematic samples; includes difficulty slicing from released annotations: easy = 196 (<15 min), hard = 45 (>1 hr).
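The ensembling and filter rules above reduce to a few lines (a minimal sketch of the described procedure):

```python
# SWE-bench Verified-style filtering: per criterion, take the max severity
# across the 3 annotators; discard the sample if either ensembled severity
# is >= 2, or any annotator flagged "other major issues".

def keep_sample(underspec_labels, unfair_test_labels, other_issue_flags):
    underspec = max(underspec_labels)        # severities in {0, 1, 2, 3}
    unfair = max(unfair_test_labels)
    return underspec < 2 and unfair < 2 and not any(other_issue_flags)
```

Taking the max (rather than majority vote) is conservative: a single annotator rating a sample severity 2+ is enough to exclude it.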
- Key empirical results:
- 68.3% of SWE-bench samples filtered out (underspecification, unfair tests, or other issues).
- Flag rates: 38.3% underspecified problem statements; 61.1% unfair unit tests.
- Difficulty estimate (original SWE-bench, from 1,699-sample estimate): 77.8% of samples < 1 hour.
- Performance: GPT‑4o = 33.2% solve rate on SWE-bench Verified (model `gpt-4o-2024-05-13`); vs 16% on original SWE-bench (best scaffold reported).
- Scaffold sensitivity example (SWE-bench Lite): GPT‑4 ranges 2.7% → 28.3% depending on scaffold (early RAG vs CodeR).
- Evaluation reliability improvement: new Docker/containerized harness for easier, more reliable evaluation.
📖 Claude Code Best Practices (Agentic Coding Loop)
Reference Doc · source
Actionable agent loop guidance: task decomposition, iterative verification (tests/linters), safe tool use patterns, and prompt templates
Key content
- Core constraint (Context Window): Claude’s context includes entire conversation + every file read + every command output; it “fills up fast” (debugging/exploration can consume tens of thousands of tokens) and performance degrades as it fills (forgetting earlier instructions, more mistakes). Track via custom status line; reduce via token-usage strategies.
- Verification loop (must-have): Claude performs “dramatically better” when it can verify its own work (run tests, linters, Bash checks, compare screenshots/UI). Without success criteria, the human becomes the only feedback loop. Prefer: write failing test → fix → rerun tests.
- Recommended workflow (4 phases):
- Explore in Plan Mode (read files, answer questions; no changes).
- Plan: produce detailed implementation plan; Ctrl+G opens plan in editor for human edits.
- Implement in Normal Mode; verify against plan.
- Iterate with verification.
Skip planning for tiny diffs (“describe the diff in one sentence”: typo/log/rename).
- Prompting procedures (concrete patterns): specify file(s), scenario, constraints, and testing prefs (e.g., “edge case logged out; avoid mocks”); point to sources (git history); reference existing code patterns; describe symptom + likely location + definition of fixed.
- Persistent rules (CLAUDE.md): loaded at the start of every conversation; keep it short. Include: non-obvious Bash commands, style deviations, test runners, repo etiquette, architecture decisions, env quirks, gotchas. Exclude: things Claude can infer, long tutorials, file-by-file descriptions. Locations: `~/.claude/CLAUDE.md`, `./CLAUDE.md`, `./CLAUDE.local.md` (gitignored); parent dirs auto-included; child dirs loaded on demand.
- Safety/automation defaults: permissions prompt by default; reduce via Auto mode (classifier blocks risky actions), allowlists, or sandboxing. Use hooks for deterministic steps (e.g., run eslint after every edit; block writes to migrations).
- Context management commands:
  `/clear` between unrelated tasks; `/compact <instructions>`; `/rewind` or Esc+Esc to restore/summarize; Esc stops mid-action. After two failed correction cycles, `/clear` and rewrite the prompt.
📖 Claude Code CLI surface (commands + flags)
Reference Doc · source
Complete Claude Code CLI command/flag surface (sessions -c/-r, MCP via claude mcp, print mode -p, etc.)
Key content
- Doc index lookup (important): fetch the full documentation index at `https://code.claude.com/docs/llms.txt` to discover all pages.
- Feedback endpoint: POST `https://code.claude.com/docs/_mintlify/feedback/claude-code/agent-feedback` with JSON `{ "path": "/current-page-path", "feedback": "..." }`.
- Core session commands
  - Start interactive: `claude` or `claude "query"`.
  - Print/SDK then exit: `claude -p "query"`; pipe: `cat file | claude -p "query"`.
  - Continue most recent convo (cwd): `claude -c` / `claude --continue`; also `claude -c -p "query"`.
  - Resume by ID or name: `claude -r <id|name> "query"` or `claude --resume <id|name>` (picker if omitted).
  - Name session: `claude -n "my-feature-work"`; resume named session with `--resume`.
  - Fork on resume: `--fork-session` (with `--resume`/`--continue`).
- Auth & updates: `claude update`; `claude auth login [--email] [--sso] [--console]`; `claude auth status` (JSON; `--text`; exit code 0 logged in / 1 not); `claude auth logout`; `claude setup-token` (prints long-lived OAuth token).
- MCP / plugins / remote control
  - MCP config: `claude mcp`; load via `--mcp-config` (JSON file/string); `--strict-mcp-config` ignores other MCP configs.
  - Plugins: `claude plugin` (alias `claude plugins`); `--plugin-dir` repeatable.
  - Remote control server: `claude remote-control`; interactive RC: `claude --remote-control` / `--rc`; name prefix flag `--remote-control-session-name-prefix` (env: `CLAUDE_REMOTE_CONTROL_SESSION_NAME_PREFIX`).
- Permission/tooling controls
  - Permission modes: `--permission-mode {default,acceptEdits,plan,auto,dontAsk,bypassPermissions}`; `--dangerously-skip-permissions` == bypassPermissions; `--allow-dangerously-skip-permissions` adds bypass to the mode cycle.
  - Tool allow/deny: `--tools` (restrict available tools), `--allowedTools` (auto-allow patterns), `--disallowedTools` (remove tools).
- Print-mode I/O & limits: `--output-format {text,json,stream-json}`; `--input-format {text,stream-json}`; `--max-turns N`; `--max-budget-usd X`; `--no-session-persistence`; `--json-schema <schema>`.
- System prompt flags (rationale): prefer append to preserve built-ins.
  - Replace: `--system-prompt` XOR `--system-prompt-file`.
  - Append: `--append-system-prompt`, `--append-system-prompt-file` (can combine with replacement).
- Other notable defaults/notes: `claude --help` is not exhaustive for flags; `--bare` skips auto-discovery (hooks/skills/plugins/MCP/auto memory/CLAUDE.md) and sets env `CLAUDE_CODE_SIMPLE`.
📖 Claude Code Common Workflows (Plan Mode, tests, PRs, sessions, worktrees)
Reference Doc · source
End-to-end agentic coding workflows + concrete prompt/session management patterns
Key content
- Plan Mode (safe analysis, read-only planning)
- When to use: multi-step implementations (many files), deep code exploration before edits, interactive iteration on direction.
  - How to enable (in-session): `Shift+Tab` cycles permission modes: Normal → Auto-Accept → Plan Mode. Plan Mode indicator: “⏸ plan mode on”.
  - Start in Plan Mode (CLI): `claude --permission-mode plan` (also `-p` in headless mode).
  - Rationale: forces read-only operations while Claude analyzes and proposes a plan; uses `AskUserQuestion` to clarify requirements before planning.
- Tests workflow
- Ask for tests with specific behaviors to verify.
- Claude should inspect existing test files to match project conventions (framework, assertion style).
- Prompt for edge cases: error conditions, boundary values, unexpected inputs.
- Pull request workflow
  - You can ask directly: “create a pr for my changes”, or guide step-by-step using `gh pr create`.
  - Session ↔ PR linking: creating a PR via `gh pr create` automatically links the session; resume later with `claude --from-pr <number>`.
- Session management (resume/organize)
  - Resume: `claude --continue` (most recent in current dir), `claude --resume` (picker), `claude --from-pr 123`, in-session `/resume`.
  - Sessions stored per project directory; picker spans the same git repo incl. worktrees.
  - Naming: start with `-n` or rename via `/rename`; picker rename shortcut `R`.
- Parallel work with Git worktrees
  - Create isolated worktree session: `--worktree` (`-w`); creates `<repo>/.claude/worktrees/<name>` and branch `worktree-<name>` from `origin/HEAD`.
  - Update base ref if needed: `git remote set-head origin your-branch-name`.
  - Cleanup rules: no changes → auto-remove; changes/commits → prompt keep/remove.
  - Copy gitignored env/config into worktrees via `.worktreeinclude` (gitignore syntax; only matches files that are also gitignored).
🔍 How Anthropic teams use Claude Code — org workflows & constraints
Explainer · source
Real organizational usage patterns + operational workflow details for Claude Code in team settings
Key content
- Workflow patterns (repeatable procedures)
- Checkpoint-heavy autonomy loop (Product Dev/RL Eng): start from clean git state → enable auto-accept mode (Shift+Tab) → let Claude write code/run tests/iterate → commit checkpoints regularly for easy rollback if it goes off track.
- Task classification heuristic: use async autonomy for peripheral/prototyping/edge features; use synchronous supervision for core business logic/critical fixes (monitor architecture/style in real time).
- “Try one-shot, then collaborate” (RL Eng): let Claude attempt full implementation first; if it fails, switch to guided iteration.
- End-of-session doc loop (Data Infra): ask Claude to summarize the session + suggest improvements → update CLAUDE.md continuously based on real usage.
- Parallel instances: run multiple Claude Code instances in different repos; each maintains context across hours/days for parallel workstreams.
- Integration touchpoints
- GitHub Actions: Claude can address PR comments (e.g., formatting/renames) automatically.
- MCP servers (security control): Data Infra recommends MCP servers instead of BigQuery CLI for sensitive data access control/logging.
- Screenshots/images: used for Kubernetes debugging (dashboard screenshots) and design-to-prototype (paste mockups via Cmd+V).
- Empirical results (numbers)
- Product Dev: ~70% of Vim mode implementation came from autonomous Claude work.
- Security Eng: incident code-scanning reduced 10–15 min → ~5 min.
- Inference: ML research time reduced by ~80% (~60 min → 10–20 min).
- Data Sci/Vis: built ~5,000-line TypeScript app; reports 2–4× time savings on refactors.
- Growth Marketing: ad copy creation 2 hours → 15 min; 10× creative output; constraints: 30-char headlines, 90-char descriptions; Figma plugin generates up to 100 variations per batch.
- Product Design: Figma + Claude Code open ~80% of time; execution 2–3× faster; messaging project reduced ~1 week → two 30-min calls.
- RL Eng: one-shot success ~1/3 of the time.
🔍 Measuring GitHub Copilot’s Impact on Productivity (telemetry + survey)
Explainer · source
Measured productivity impacts + how Copilot usage telemetry relates to perceived productivity (ACM CACM write-up; DOI:10.1145/3633453)
Key content
- Study design (survey + telemetry, 2022 preview):
- Survey emailed to 17,420 preview users; 2,047 responses matched to IDE telemetry.
- Focus period: 4 weeks leading up to survey completion (most responses within first 2 days, on/before Feb 12, 2022).
- Survey built on SPACE; used S, P, C, E (excluded self-reported Activity). Aggregate productivity = mean of 12 measures (11 SPACE statements + “I am more productive…”), excluding skipped items.
- Telemetry event funnel (Table 1): `opportunity`, `shown`, `accepted`, `accepted_char`, `mostly_unchanged_X` (Levenshtein distance < 33%) at X ∈ {30, 120, 300, 600}s, `unchanged_X` at the same X, and `(active) hour`.
- Core metric formulas (Table 2; “X_per_Y” normalization):
  - Acceptance rate = `accepted_per_shown` = (# accepted completions) / (# shown completions).
  - Shown rate = `shown_per_opportunity`.
  - Acceptance frequency = `accepted_per_active_hour`.
  - Contribution speed = `accepted_char_per_active_hour`.
  - Persistence rate = `unchanged_X_per_accepted`; fuzzy persistence = `mostly_unchanged_X_per_accepted`.
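These normalizations can be computed directly from aggregate event counts; a sketch (the dict field names are assumptions for illustration, not GitHub's schema):

```python
# Compute Table 2-style Copilot metrics from per-user aggregate event counts.

def copilot_metrics(counts):
    shown, accepted = counts["shown"], counts["accepted"]
    hours = counts["active_hours"]
    return {
        "acceptance_rate": accepted / shown if shown else 0.0,          # accepted_per_shown
        "shown_rate": counts["shown"] / counts["opportunity"]
                      if counts["opportunity"] else 0.0,                # shown_per_opportunity
        "acceptance_frequency": accepted / hours,                       # accepted_per_active_hour
        "contribution_speed": counts["accepted_char"] / hours,          # accepted_char_per_active_hour
        "persistence_rate_30s": counts["unchanged_30"] / accepted
                                if accepted else 0.0,                   # unchanged_X_per_accepted
    }
```

The study's headline finding is that the simplest of these, acceptance rate, tracks perceived productivity better than the persistence metrics that measure whether accepted code survives unchanged.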
- Key empirical findings:
- Acceptance rate is the strongest positive predictor of perceived productivity, outperforming persistence-based metrics.
- PLS regression: Component 1 explains 43.2% of variance; Component 2 explains 21.2%; both draw strongly from acceptance rate.
- Controlled experiment (speed): 95 pro developers, JS HTTP server task.
- Completion success: 78% (Copilot) vs 70% (control).
- Time: 1h11m (Copilot) vs 2h41m (control) ⇒ 55% faster; P = .0017; 95% CI [21%, 89%].