Agentic Coding

Video (best)

  • Andrej Karpathy — “Software Is Changing (Again)”
  • Watch: YouTube
  • Why: Clear, high-level framing of LLM-driven software development and the shift toward more autonomous/agentic tooling; good conceptual grounding before diving into specific coding agents and workflows.
  • Level: Beginner → Intermediate

Blog / Written explainer (best)

  • Simon Willison — “Prompt injection explained”
  • Link: https://simonwillison.net/2023/Nov/27/prompt-injection-explained/
  • Why: Essential security and workflow context for agentic coding (tool use, untrusted context, and how “instructions” can be subverted), which directly impacts rules files, context management, and agent-human collaboration.
  • Level: Intermediate

Deep dive

  • Anthropic — “Building effective agents”
  • Link: https://www.anthropic.com/research/building-effective-agents
  • Why: Practical patterns for agent design (task decomposition, tool use, feedback loops) that map well to agentic workflows like iterative debugging, multi-step prompt-to-code, and context management.
  • Level: Intermediate → Advanced

Original paper

  • Yao et al. (2022) — “ReAct: Synergizing Reasoning and Acting in Language Models”
  • Link: https://arxiv.org/abs/2210.03629
  • Why: Foundational approach for agentic behavior (interleaving reasoning and tool actions) that underpins many modern coding-agent workflows.
  • Level: Intermediate → Advanced

Code walkthrough

  • OpenAI Cookbook — “Function calling” examples
  • Link: https://cookbook.openai.com/
  • Why: Concrete, runnable patterns for tool/function calling that are directly applicable to coding agents (planning → tool invocation → result integration), and a good base for building prompt-to-code and iterative debugging loops.
  • Level: Intermediate

Coverage notes

  • Strong: High-level motivation for agentic coding; foundational agent pattern (ReAct); practical tool-calling patterns; security considerations relevant to context/rules.
  • Weak: Specific IDE agent products (Claude Code, Cursor, Windsurf, GitHub Copilot) and their exact feature sets change rapidly and are not covered deeply by the above evergreen resources.
  • Gap: A stable, vendor-neutral “agents.md / rules files” best-practices spec and a canonical, long-lived multi-file refactor walkthrough using a modern coding agent in a real repo.

Additional Resources for Tutor Depth

9 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.

📄 Toolformer — self-supervised tool/API call insertion via likelihood filtering

Paper · source

Toolformer’s self-supervised procedure: generate candidate API calls, execute, filter by future-token loss improvement, finetune on augmented text.

Key content
  • Goal (Section 1–2): Train LM (M) to decide which API, when, arguments, and how to use results in next-token prediction; requires only a handful of demonstrations per API.
  • Pipeline (Figure 2, Section 2):
    1. Sample API calls: use a prompt \(P(\mathbf{x})\) to annotate plain text \(\mathbf{x}=x_{1:n}\) with candidate calls at position \(i\).
    2. Execute calls to obtain results \(r_i\).
    3. Filter calls by whether they improve predicting future tokens; merge surviving calls across tools; finetune on augmented corpus \(\mathcal{C}^*\).
  • Filtering objective (Section 2): weighted cross-entropy loss over future tokens:
    \[ L_i(\mathbf{z}) = -\sum_{j=i}^{n} w_{j-i}\,\log p_M(x_j \mid \mathbf{z}, x_{1:j-1}) \]
    Define:
    \[ L_i^{+} = L_i(e(c_i, r_i)), \quad L_i^{-} = \min\big(L_i(\varepsilon),\, L_i(e(c_i, \varepsilon))\big) \]
    Keep the call if:
    \[ L_i^{-} - L_i^{+} \ge \tau_f \]
    where \(e(\cdot)\) inserts the call + result text, \(\varepsilon\) is the empty sequence, and \(\tau_f\) is a tool-specific threshold.
  • Weighting (Section 4.1): \(\tilde w_t=\max(0,\,1-0.2t)\), \(w_t=\tilde w_t/\sum_s \tilde w_s\) (encourages calls near where they are useful).
  • Finetuning (Section 4.1): batch size 128, LR \(1\times10^{-5}\), linear warmup over the first 10% of training.
  • Tools (Section 3): QA (Atlas), Wikipedia search, calculator, calendar (current date), translation (NLLB 600M + fastText language ID).
  • Empirical highlights (Section 4.2):
    • LAMA: Toolformer improves over the best same-size baseline by +11.7 / +5.2 / +18.6 points (SQuAD / Google-RE / T-REx subsets); it calls the QA tool in 98.1% of cases.
    • Math (ASDiv/SVAMP/MAWPS): enabling API calls more than doubles performance; the calculator is used for 97.9% of examples.
    • Scaling (Section 4.4): effective tool use emerges around 775M parameters (GPT-2 family); smaller models show little gain.
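
The filtering rule and weighting above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the per-token log-probabilities would come from a real LM, and the function names and example numbers here are hypothetical.

```python
from typing import Sequence

MAX_OFFSET = 5  # raw weight max(0, 1 - 0.2 t) is zero from t = 5 onward


def raw_weight(t: int) -> float:
    """Unnormalized weight for a token t positions after the call site."""
    return max(0.0, 1.0 - 0.2 * t)


def position_weights(n_future: int) -> list[float]:
    """Normalized weights w_t; the normalizer sums over all nonzero offsets."""
    norm = sum(raw_weight(s) for s in range(MAX_OFFSET))
    return [raw_weight(t) / norm for t in range(n_future)]


def weighted_loss(logprobs: Sequence[float]) -> float:
    """L_i(z): weighted negative log-likelihood of the future tokens.

    logprobs[t] stands in for log p_M(x_{i+t} | z, x_{1:i+t-1}).
    """
    weights = position_weights(len(logprobs))
    return -sum(w * lp for w, lp in zip(weights, logprobs))


def keep_api_call(
    lp_with_result: Sequence[float],     # prefix = call + result
    lp_no_call: Sequence[float],         # prefix = empty sequence
    lp_call_no_result: Sequence[float],  # prefix = call without result
    tau_f: float,
) -> bool:
    """Keep the call only if it lowers the loss by at least tau_f."""
    l_plus = weighted_loss(lp_with_result)
    l_minus = min(weighted_loss(lp_no_call), weighted_loss(lp_call_no_result))
    return l_minus - l_plus >= tau_f
```

With made-up log-probabilities such as `keep_api_call([-0.1, -0.2], [-2.0, -1.0], [-1.5, -1.0], tau_f=0.5)`, the call is kept because seeing the result makes the next tokens much easier to predict than either baseline.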

📊 RACE-bench (Reasoning-Augmented Repo-Level Code Agent Eval)

Benchmark · source

Repository-level agent evaluation protocol for feature addition with dual-track (patch + reasoning) metrics

Key content
  • Benchmark scope/design (Sec. 2.1): RACE-bench = 528 real-world feature addition instances from 12 OSS Python repos. Each instance includes: Task Context (issue text + env setup + optional hints), Reasoning Ground Truth (4 stages / 5 modules), and Verification (tests + gold/test patches).
  • Verification protocol (Sec. 2.1, 2.2.2): Uses Fail-to-Pass (FTP) tests (fail on base commit, pass after gold patch) + Pass-to-Pass (PTP) tests (regression preservation). Patches applied via git apply; tests run with pytest in per-instance Docker.
  • Reasoning Ground Truth construction (Sec. 2.2.3):
    • Issue Understanding: DeepSeek generates Concept Explanations + Goal Expectation (behavior-only; no code details).
    • File Localization: derive from gold/test patches; ablation per modified file—remove file’s changes, rerun tests → label Necessary Code File if tests fail; else Other File; test patch files = Test Files.
    • Issue Implementation: static parse gold patch to extract changed functions/methods (classes treated as containers); annotate purpose + is_necessary using FTP tests.
    • Step Decomposition: minimal ordered steps from closed taxonomy: introduce new capability; reuse existing semantics; change existing semantics; deprecate/replace behavior; enforce constraints/edge cases.
  • Dual-track evaluation (Sec. 2.3–2.4):
    • Patch metrics: Resolved Rate = % instances passing FTP+PTP on first attempt; Patch Apply Rate = % patches that apply cleanly.
    • Reasoning metrics (Table 1): Recall/OverPrediction for files/tasks/steps; Score@GoalExpectation (10-pt LLM judge); concept recall/accuracy.
  • Key empirical results (Sec. 4.1 Table 2):
    • AutoCodeRover: Apply 96.21% (508/528), Resolved 28.79% (152/528)
    • TraeAgent: Apply 78.98% (417/528), Resolved 52.65% (278/528)
    • mini-SWE-Agent: Apply 95.83% (506/528), Resolved 70.08% (370/528)
  • Reasoning findings (Sec. 4.2): High intent understanding (Score@Goal ~9.2–9.6/10) but “waterfall” drop from file→task→step recall (e.g., mini-SWE-Agent Recall@Files 0.890 → Recall@Tasks 0.751 → Recall@Steps 0.445). Apply-success/test-fail cases: 35.7% recall decrease and 94.1% over-prediction increase vs successes.
  • Defaults/params (Sec. 3.2): single run per instance; temperature=0, top_p=1; max tokens 4096 (agent) / 8192 (summarizer).
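
The two patch metrics defined above reduce to simple counting. The sketch below assumes a hypothetical per-instance record (`InstanceResult` and its field names are not from the benchmark) and treats an instance as resolved only when the patch applies and both the FTP and PTP tests pass.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class InstanceResult:
    """Hypothetical outcome record for one benchmark instance."""
    patch_applied: bool  # `git apply` succeeded
    ftp_passed: bool     # all Fail-to-Pass tests pass after the patch
    ptp_passed: bool     # all Pass-to-Pass tests still pass


def patch_metrics(results: Sequence[InstanceResult]) -> dict[str, float]:
    """Return Patch Apply Rate and Resolved Rate as percentages."""
    total = len(results)
    applied = sum(r.patch_applied for r in results)
    resolved = sum(
        r.patch_applied and r.ftp_passed and r.ptp_passed for r in results
    )
    return {
        "apply_rate": 100.0 * applied / total,
        "resolved_rate": 100.0 * resolved / total,
    }
```

Feeding in 528 records with 506 applied patches and 370 resolved instances reproduces the mini-SWE-Agent row: 95.83% apply rate and 70.08% resolved rate after rounding to two decimals.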

📊 RepoBench — repository-level code completion benchmark

Benchmark · source

RepoBench task suite (RepoBench-R retrieval, RepoBench-C completion, RepoBench-P pipeline) + multi-file evaluation protocol

Key content
  • Motivation (Sec. 1): Prior benchmarks are mostly single-file; RepoBench targets repository-level (multi-file) auto-completion with explicit cross-file context.
  • Data (Sec. 3.1–3.2):
    • Train source: github-code (cutoff Mar 16, 2022); select repos with 32–128 Python/Java files.
    • Test source: newly crawled non-fork GitHub repos created Feb 9, 2023–Aug 3, 2023 (to reduce leakage).
    • Parsed with tree-sitter focusing on import statements → identify cross-file modules, “cross-file lines,” and defining snippets.
    • Sizes: training repos 10,345 Python / 14,956 Java; test repos 1,075 Python / 594 Java.
  • Task settings (Sec. 3.3):
    • XF-F: mask first cross-file line (hardest). XF-R: mask random non-first cross-file line. IF: mask in-file line (no cross-file module).
  • Prompt construction (Fig. 1, App. A): cross-file snippets (commented, with path) + in-file context (path + imports + preceding lines). Default in RepoBench-C: max 30 preceding lines.
  • RepoBench-R retrieval (Sec. 3.3, 4.1):
    • Retrieval objective: top‑k by similarity
      \[ \arg\max_{i\in\{1,\dots,n\}}^{k} f(C[-m:], S_i) \] where \(C\) = in-file code, \(S_i\) = candidate snippet, \(n\) = number of candidates, \(m\) = number of preceding lines kept (baseline \(m=3\)), \(f\) = similarity function.
    • Candidates: Easy 5–9, Hard ≥10. Metric: acc@k (Easy: @1,@3; Hard: @1,@3,@5).
    • Key results (Table 2, Hard/Python acc@1): InstructOR 19.10, UniXcoder 18.48, Jaccard 10.47, Random 6.43. (Easy/Python acc@1: InstructOR 28.22, UniXcoder 27.09.)
  • RepoBench-C completion (Sec. 3.3, 4.2):
    • Autoregressive next-line probability (Eq. 1):
      \[ P(Y)=\prod_{i=1}^{n} P(y_i \mid y_{<i}, C_x, C_{in}) \] where \(C_x\) = cross-file context and \(C_{in}\) = in-file context.
    • Subsets: 2k prompts ≤ 1,925 tokens (for 2,048 limit); 8k prompts ≤ 7,685 tokens.
    • Metrics: Exact Match (EM), Edit Similarity, CodeBLEU.
    • Key results (Table 3, 2k/Python EM): CodeLlama‑34B 37.40 (best); Codex 31.31. (2k/Java EM: Codex 42.47 best; CodeLlama‑34B 39.41.)
  • RepoBench-P pipeline (Sec. 4.3):
    • Pipeline probability (Eq. 2):
      \[ P(Y)=\prod_{i=1}^{n} P(y_i \mid y_{<i}, S_1,\dots,S_k, C_{in}) \]
    • Constraints: minimum prompt tokens 12k (Python) / 24k (Java); retrieval requires ≥10 candidates.
    • Codex baseline config: reserve 1,600 tokens for in-file; crop 60 preceding lines; fill to 6,400 tokens with cross-file snippets.
    • Key result (Table 4, Python EM): in-file-only baseline 33.15 vs Jaccard 36.46 vs UniXcoder-L2H 37.11; even Random 34.94 improves → cross-file context helps; snippet ordering matters (higher-similarity nearer completion helps).
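
The retrieval objective (top-k candidates by similarity to the last m in-file lines) can be illustrated with the Jaccard baseline. This is a rough sketch: the identifier-level tokenizer and the function names are assumptions made for illustration, not the benchmark's implementation.

```python
import re
from typing import Sequence


def tokens(code: str) -> set[str]:
    """Identifier-like tokens; a simplifying assumption for this sketch."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_top_k(
    in_file_code: str, candidates: Sequence[str], k: int = 3, m: int = 3
) -> list[int]:
    """Indices of the k snippets most similar to the last m in-file lines."""
    query = "\n".join(in_file_code.splitlines()[-m:])
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: jaccard(query, candidates[i]),
        reverse=True,
    )
    return ranked[:k]


def acc_at_k(ranked: Sequence[int], gold_index: int, k: int) -> bool:
    """acc@k for one example: is the gold snippet among the top k?"""
    return gold_index in ranked[:k]
```

Swapping `jaccard` for an embedding-based similarity (as with UniXcoder or InstructOR in the results above) would leave the ranking logic unchanged; only `f` differs.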

📊 SWE-bench Verified (human-validated SWE-bench subset)

Benchmark · source

Defines SWE-bench Verified (500 human-filtered SWE-bench instances) and reports verified performance comparisons + rationale for filtering.

Key content
  • What SWE-bench evaluates (workflow):
    • Input to agent: GitHub issue text (“problem statement”) + repository codebase; tests are hidden.
    • Output: a patch (multi-file edits allowed) intended to fix the issue.
    • Scoring requires both:
      • FAIL_TO_PASS tests: fail before PR solution, pass after; passing implies the issue is solved.
      • PASS_TO_PASS tests: pass before/after; passing implies no regressions.
  • Why Verified was created (design rationale): Original SWE-bench can systematically underestimate capability due to:
    1. overly specific / unrelated unit tests rejecting valid solutions,
    2. underspecified issue descriptions,
    3. unreliable environment setup causing failures independent of solution.
  • SWE-bench Verified definition & construction (procedure):
    • Human annotation campaign: 93 Python-experienced developers.
    • Annotated 1,699 random SWE-bench test samples; each sample labeled 3×.
    • Two main criteria labeled on severity scale {0,1,2,3}: underspecification; unfair FAIL_TO_PASS tests.
    • Ensembling rule: take max severity across 3 annotators.
    • Filter rule: discard any sample where either criterion has ensemble ≥ 2, or “other major issues” flagged.
    • Final dataset: 500 non-problematic samples; includes difficulty slicing from released annotations: easy = 196 (<15 min), hard = 45 (>1 hr).
  • Key empirical results:
    • 68.3% of SWE-bench samples filtered out (underspecification, unfair tests, or other issues).
    • Flag rates: 38.3% underspecified problem statements; 61.1% unfair unit tests.
    • Difficulty estimate (original SWE-bench, from 1,699-sample estimate): 77.8% of samples < 1 hour.
    • Performance: GPT‑4o = 33.2% solve rate on SWE-bench Verified (model gpt-4o-2024-05-13); vs 16% on original SWE-bench (best scaffold reported).
    • Scaffold sensitivity example (SWE-bench Lite): GPT‑4 ranges 2.7% → 28.3% depending on scaffold (early RAG vs CodeR).
  • Evaluation reliability improvement: new Docker/containerized harness for easier, more reliable evaluation.
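
The ensembling and filter rules, together with the two-part scoring criterion, are compact enough to express directly. The sketch below is illustrative only; the function and argument names are hypothetical, and test outcomes are modeled as plain name-to-bool mappings.

```python
from typing import Mapping, Sequence


def ensemble_severity(labels: Sequence[int]) -> int:
    """Ensemble annotator labels (each in {0, 1, 2, 3}) by taking the max."""
    return max(labels)


def keep_sample(
    underspecified: Sequence[int],
    unfair_tests: Sequence[int],
    other_major_issues: bool = False,
) -> bool:
    """Keep a sample only if no criterion reaches ensemble severity >= 2."""
    if other_major_issues:
        return False
    return (
        ensemble_severity(underspecified) < 2
        and ensemble_severity(unfair_tests) < 2
    )


def is_resolved(
    fail_to_pass: Mapping[str, bool], pass_to_pass: Mapping[str, bool]
) -> bool:
    """A patch counts as solved only if every test in both groups passes."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())
```

Because the ensemble takes the maximum, a single annotator rating of 2 or 3 is enough to discard a sample, which is a deliberately conservative choice.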

📖 Claude Code Best Practices (Agentic Coding Loop)

Reference Doc · source

Actionable agent loop guidance: task decomposition, iterative verification (tests/linters), safe tool use patterns, and prompt templates

Key content
  • Core constraint (Context Window): Claude’s context includes entire conversation + every file read + every command output; it “fills up fast” (debugging/exploration can consume tens of thousands of tokens) and performance degrades as it fills (forgetting earlier instructions, more mistakes). Track via custom status line; reduce via token-usage strategies.
  • Verification loop (must-have): Claude performs “dramatically better” when it can verify its own work (run tests, linters, Bash checks, compare screenshots/UI). Without success criteria, the human becomes the only feedback loop. Prefer: write failing test → fix → rerun tests.
  • Recommended workflow (4 phases):
    1. Explore in Plan Mode (read files, answer questions; no changes).
    2. Plan: produce detailed implementation plan; Ctrl+G opens plan in editor for human edits.
    3. Implement in Normal Mode; verify against plan.
    4. Iterate with verification.
    Skip the planning phases when the diff is small enough to describe in one sentence (e.g., a typo fix, a log line, a rename).
  • Prompting procedures (concrete patterns): specify file(s), scenario, constraints, and testing prefs (e.g., “edge case logged out; avoid mocks”); point to sources (git history); reference existing code patterns; describe symptom + likely location + definition of fixed.
  • Persistent rules (CLAUDE.md): loaded at start of every convo; keep short. Include: non-obvious Bash commands, style deviations, test runners, repo etiquette, architecture decisions, env quirks, gotchas. Exclude: things Claude can infer, long tutorials, file-by-file descriptions. Locations: ~/.claude/CLAUDE.md, ./CLAUDE.md, ./CLAUDE.local.md (gitignored); parent dirs auto-included; child dirs loaded on demand.
  • Safety/automation defaults: permissions prompt by default; reduce via Auto mode (classifier blocks risky actions), allowlists, or sandboxing. Use hooks for deterministic steps (e.g., run eslint after every edit; block writes to migrations).
  • Context management commands: /clear between unrelated tasks; /compact <instructions>; /rewind or Esc+Esc to restore/summarize; Esc stops mid-action. After two failed correction cycles, /clear and rewrite prompt.
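
The verification loop and the "two failed correction cycles" rule can be pictured as a small control loop. This is a conceptual sketch only, not how Claude Code is implemented: `attempt` and `verify` are hypothetical callables (for example, `verify` might wrap a test runner and return a pass flag plus its output).

```python
from typing import Callable


def run_with_verification(
    attempt: Callable[[str], None],
    verify: Callable[[], tuple[bool, str]],
    prompt: str,
    max_corrections: int = 2,
) -> str:
    """Try a task, then feed verification failures back as corrections.

    Returns "verified" on success, or "clear_and_rewrite_prompt" once the
    allowed correction cycles are used up without passing verification.
    """
    attempt(prompt)
    ok, report = verify()
    corrections = 0
    while not ok and corrections < max_corrections:
        corrections += 1
        # Give the agent the concrete failure output, not a vague "try again".
        attempt(f"Verification failed:\n{report}\nFix the cause and re-check.")
        ok, report = verify()
    return "verified" if ok else "clear_and_rewrite_prompt"
```

The point of the sketch is that the success criterion lives in `verify`, so the human is not the only feedback loop; the fallback result mirrors the advice to start fresh with a better prompt rather than keep correcting.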

📖 Claude Code CLI surface (commands + flags)

Reference Doc · source

Complete Claude Code CLI command/flag surface (sessions -c/-r, MCP via claude mcp, print mode -p, etc.)

Key content
  • Doc index lookup (important): fetch full documentation index at https://code.claude.com/docs/llms.txt to discover all pages.
  • Core session commands
    • Start interactive: claude or claude "query".
    • Print/SDK then exit: claude -p "query"; pipe: cat file | claude -p "query".
    • Continue most recent convo (cwd): claude -c / claude --continue; also claude -c -p "query".
    • Resume by ID or name: claude -r <id|name> "query" or claude --resume <id|name> (picker if omitted).
    • Name session: claude -n "my-feature-work"; resume named session with --resume.
    • Fork on resume: --fork-session (with --resume/--continue).
  • Auth & updates: claude update; claude auth login [--email] [--sso] [--console]; claude auth status (JSON; --text; exit code 0 logged in / 1 not); claude auth logout; claude setup-token (prints long-lived OAuth token).
  • MCP / plugins / remote control
    • MCP config: claude mcp; load via --mcp-config (JSON file/string); --strict-mcp-config ignores other MCP configs.
    • Plugins: claude plugin (alias claude plugins); --plugin-dir repeatable.
    • Remote control server: claude remote-control; interactive RC: claude --remote-control / --rc; name prefix flag --remote-control-session-name-prefix (env: CLAUDE_REMOTE_CONTROL_SESSION_NAME_PREFIX).
  • Permission/tooling controls
    • Permission modes: --permission-mode {default,acceptEdits,plan,auto,dontAsk,bypassPermissions}; --dangerously-skip-permissions == bypassPermissions; --allow-dangerously-skip-permissions adds bypass to mode cycle.
    • Tool allow/deny: --tools (restrict available tools), --allowedTools (auto-allow patterns), --disallowedTools (remove tools).
  • Print-mode I/O & limits: --output-format {text,json,stream-json}; --input-format {text,stream-json}; --max-turns N; --max-budget-usd X; --no-session-persistence; --json-schema <schema>.
  • System prompt flags (rationale): prefer append to preserve built-ins.
    • Replace: --system-prompt XOR --system-prompt-file.
    • Append: --append-system-prompt, --append-system-prompt-file (can combine with replacement).
  • Other notable defaults/notes: claude --help is not exhaustive for flags; --bare skips auto-discovery (hooks/skills/plugins/MCP/auto memory/CLAUDE.md) and sets env CLAUDE_CODE_SIMPLE.
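
For scripting, the print-mode flags listed above can be assembled into an argument list. The helper below is a hypothetical convenience wrapper that only builds the command; it uses flags named in this summary and makes no assumption about the shape of the output.

```python
from typing import Optional

OUTPUT_FORMATS = {"text", "json", "stream-json"}
PERMISSION_MODES = {
    "default", "acceptEdits", "plan", "auto", "dontAsk", "bypassPermissions",
}


def build_print_command(
    query: str,
    output_format: str = "json",
    max_turns: Optional[int] = None,
    permission_mode: Optional[str] = None,
) -> list[str]:
    """Build an argv list for a non-interactive `claude -p` run."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    cmd = ["claude", "-p", query, "--output-format", output_format]
    if max_turns is not None:
        cmd += ["--max-turns", str(max_turns)]
    if permission_mode is not None:
        if permission_mode not in PERMISSION_MODES:
            raise ValueError(f"unknown permission mode: {permission_mode}")
        cmd += ["--permission-mode", permission_mode]
    return cmd
```

On a machine where the CLI is installed, the resulting list could be passed to `subprocess.run(cmd, capture_output=True, text=True)`; passing a list rather than a shell string avoids quoting problems in the query.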

📖 Claude Code Common Workflows (Plan Mode, tests, PRs, sessions, worktrees)

Reference Doc · source

End-to-end agentic coding workflows + concrete prompt/session management patterns

Key content
  • Plan Mode (safe analysis, read-only planning)
    • When to use: multi-step implementations (many files), deep code exploration before edits, interactive iteration on direction.
    • How to enable (in-session): Shift+Tab cycles permission modes: Normal → Auto-Accept → Plan Mode. Plan Mode indicator: “⏸ plan mode on”.
    • Start in Plan Mode (CLI): claude --permission-mode plan (can be combined with -p for headless runs).
    • Rationale: forces read-only operations while Claude analyzes and proposes a plan; uses AskUserQuestion to clarify requirements before planning.
  • Tests workflow
    • Ask for tests with specific behaviors to verify.
    • Claude should inspect existing test files to match project conventions (framework, assertion style).
    • Prompt for edge cases: error conditions, boundary values, unexpected inputs.
  • Pull request workflow
    • You can ask directly: “create a pr for my changes”, or guide step-by-step using gh pr create.
    • Session ↔ PR linking: creating a PR via gh pr create automatically links the session; resume later with claude --from-pr <number>.
  • Session management (resume/organize)
    • Resume: claude --continue (most recent in current dir), claude --resume (picker), claude --from-pr 123, in-session /resume.
    • Sessions stored per project directory; picker spans same git repo incl. worktrees.
    • Naming: start with -n or rename via /rename; picker rename shortcut R.
  • Parallel work with Git worktrees
    • Create isolated worktree session: --worktree (-w); creates <repo>/.claude/worktrees/<name> and branch worktree-<name> from origin/HEAD.
    • Update base ref if needed: git remote set-head origin your-branch-name.
    • Cleanup rules: no changes → auto-remove; changes/commits → prompt keep/remove.
    • Copy gitignored env/config into worktrees via .worktreeinclude (gitignore syntax; only matches files that are also gitignored).

🔍 How Anthropic teams use Claude Code — org workflows & constraints

Explainer · source

Real organizational usage patterns + operational workflow details for Claude Code in team settings

Key content
  • Workflow patterns (repeatable procedures)
    • Checkpoint-heavy autonomy loop (Product Dev/RL Eng): start from clean git state → enable auto-accept mode (Shift+Tab) → let Claude write code/run tests/iterate → commit checkpoints regularly for easy rollback if it goes off track.
    • Task classification heuristic: use async autonomy for peripheral/prototyping/edge features; use synchronous supervision for core business logic/critical fixes (monitor architecture/style in real time).
    • “Try one-shot, then collaborate” (RL Eng): let Claude attempt full implementation first; if it fails, switch to guided iteration.
    • End-of-session doc loop (Data Infra): ask Claude to summarize session + suggest improvements → update CLAUDE.md continuously based on real usage.
    • Parallel instances: run multiple Claude Code instances in different repos; each maintains context across hours/days for parallel workstreams.
  • Integration touchpoints
    • GitHub Actions: Claude can address PR comments (e.g., formatting/renames) automatically.
    • MCP servers (security control): Data Infra recommends MCP servers instead of BigQuery CLI for sensitive data access control/logging.
    • Screenshots/images: used for Kubernetes debugging (dashboard screenshots) and design-to-prototype (paste mockups via Cmd+V).
  • Empirical results (numbers)
    • Product Dev: ~70% of Vim mode implementation came from autonomous Claude work.
    • Security Eng: incident code-scanning reduced 10–15 min → ~5 min.
    • Inference: ML research time reduced by ~80% (~60 min → 10–20 min).
    • Data Sci/Vis: built ~5,000-line TypeScript app; reports 2–4× time savings on refactors.
    • Growth Marketing: ad copy creation 2 hours → 15 min; 10× creative output; constraints: 30-char headlines, 90-char descriptions; Figma plugin generates up to 100 variations per batch.
    • Product Design: Figma + Claude Code open ~80% of time; execution 2–3× faster; messaging project reduced ~1 week → two 30-min calls.
    • RL Eng: one-shot success ~1/3 of the time.

🔍 Measuring GitHub Copilot’s Impact on Productivity (telemetry + survey)

Explainer · source

Measured productivity impacts + how Copilot usage telemetry relates to perceived productivity (ACM CACM write-up; DOI:10.1145/3633453)

Key content
  • Study design (survey + telemetry, 2022 preview):
    • Survey emailed to 17,420 preview users; 2,047 responses matched to IDE telemetry.
    • Focus period: 4 weeks leading up to survey completion (most responses within first 2 days, on/before Feb 12, 2022).
    • Survey built on SPACE; used S, P, C, E (excluded self-reported Activity). Aggregate productivity = mean of 12 measures (11 SPACE statements + “I am more productive…”), excluding skipped items.
  • Telemetry event funnel (Table 1): opportunity, shown, accepted, accepted_char, mostly_unchanged_X (Levenshtein distance < 33%) at X ∈ {30,120,300,600}s, unchanged_X at same X, and (active) hour.
  • Core metric formulas (Table 2; “X_per_Y” normalization):
    • Acceptance rate = accepted_per_shown = (# accepted completions) / (# shown completions).
    • Shown rate = shown_per_opportunity.
    • Acceptance frequency = accepted_per_active_hour.
    • Contribution speed = accepted_char_per_active_hour.
    • Persistence rate = unchanged_X_per_accepted; Fuzzy persistence = mostly_unchanged_X_per_accepted.
  • Key empirical findings:
    • Acceptance rate is the strongest positive predictor of perceived productivity, outperforming persistence-based metrics.
    • PLS regression: Component 1 explains 43.2% of variance; Component 2 explains 21.2%; both draw strongly from acceptance rate.
  • Controlled experiment (speed): 95 pro developers, JS HTTP server task.
    • Completion success: 78% (Copilot) vs 70% (control).
    • Time: 1h11m (Copilot) vs 2h41m (control), i.e., 55% faster; P = .0017; 95% CI [21%, 89%].
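
The "X_per_Y" metrics above are plain ratios of event counts, and the speed result is a relative difference of task times. The sketch below assumes a hypothetical dictionary of per-user counts keyed by the event names in Table 1, with X fixed at 30 seconds for the persistence metrics.

```python
from typing import Mapping


def copilot_metrics(counts: Mapping[str, float]) -> dict[str, float]:
    """Compute the normalized usage metrics from raw event counts."""
    return {
        "shown_rate": counts["shown"] / counts["opportunity"],
        "acceptance_rate": counts["accepted"] / counts["shown"],
        "acceptance_frequency": counts["accepted"] / counts["active_hour"],
        "contribution_speed": counts["accepted_char"] / counts["active_hour"],
        "persistence_rate_30": counts["unchanged_30"] / counts["accepted"],
        "fuzzy_persistence_rate_30": (
            counts["mostly_unchanged_30"] / counts["accepted"]
        ),
    }


def relative_time_saved(treatment_min: float, control_min: float) -> float:
    """Fraction of the control group's time saved by the treatment group."""
    return (control_min - treatment_min) / control_min
```

Plugging in the reported times, `relative_time_saved(71, 161)` is roughly 0.56, in line with the reported figure of about 55% faster.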