Prompting
Video (best)
- Andrej Karpathy — “Intro to Large Language Models”
- Watch: YouTube
- Why: Karpathy’s talk naturally covers how prompting works in the context of LLM inference, including zero-shot and few-shot patterns, temperature, and how the model responds to context. It’s the most pedagogically grounded explanation of why prompting works, not just how to do it — rooted in the mechanics of next-token prediction. Already curated for this platform.
- Level: beginner/intermediate
Coverage note: This video is a strong general LLM intro but does not deeply cover structured outputs, top-p sampling, or advanced prompting techniques. A more prompting-specific video would strengthen this topic.
Blog / Written explainer (best)
- Lilian Weng — “Prompt Engineering”
- Link: https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
- Why: Weng’s post is the gold standard written reference for prompting. It systematically covers zero-shot, few-shot, chain-of-thought, self-consistency, and structured output strategies with clear examples and citations. Her writing bridges intuition and rigor, making it suitable for learners who want depth without reading papers directly.
- Level: intermediate
[NOT VERIFIED] — URL structure is consistent with her blog conventions; confirm post slug is exact.
Deep dive
- DAIR.AI / Elvis Saravia — “Prompt Engineering Guide”
- Link: https://www.promptingguide.ai/
- Why: The most comprehensive freely available reference covering the full prompting landscape: zero-shot, few-shot, chain-of-thought, ReAct, structured outputs, temperature/top-p parameters, and more. Actively maintained, well-organized, and widely used in both academic and industry settings. Serves as a living technical reference rather than a static article.
- Level: intermediate/advanced
Original paper
- Brown et al. (OpenAI), 2020 — “Language Models are Few-Shot Learners” (GPT-3 paper)
- Link: https://arxiv.org/abs/2005.14165
- Why: This is the seminal paper that introduced and formalized the concepts of zero-shot, one-shot, and few-shot prompting as distinct in-context learning paradigms. It is the foundational citation for virtually all prompting research. The results sections are readable without deep ML background, making it accessible to motivated learners.
- Level: intermediate/advanced
Code walkthrough
- OpenAI Cookbook — “Techniques to improve reliability” (few-shot, structured outputs, temperature)
- Link: https://cookbook.openai.com/articles/techniques_to_improve_reliability
- Why: Hands-on, runnable examples demonstrating few-shot prompting, structured output formatting (JSON mode), and the practical effect of temperature and top-p on outputs. Uses the OpenAI API directly, which is the most common practical context learners will encounter. Bridges conceptual understanding to working code.
- Level: beginner/intermediate
[NOT VERIFIED] — OpenAI Cookbook URLs have shifted; confirm this slug resolves correctly.
Coverage notes
- Strong: Zero-shot and few-shot prompting (well covered by GPT-3 paper + Weng blog + Karpathy video); in-context learning conceptual foundations; temperature intuition
- Weak: Top-p (nucleus) sampling mechanics — most resources mention it but few explain it deeply at a pedagogical level; structured outputs / JSON mode is underrepresented in video format
- Gap: No single excellent YouTube video exists that is specifically about prompting techniques end-to-end (zero-shot → few-shot → structured outputs → sampling parameters). Karpathy’s video is the best available but is not a dedicated prompting tutorial. A video from a source like Serrano.Academy or a Stanford lecture specifically on prompt engineering would significantly strengthen this topic’s video coverage.
Cross-validation
This topic appears in 2 courses: intro-to-agentic-ai, intro-to-llms
- For intro-to-llms: the Karpathy video and GPT-3 paper are the natural anchors; Weng's blog provides the written complement.
- For intro-to-agentic-ai: the promptingguide.ai deep dive and OpenAI Cookbook are more actionable for learners building agents who need structured outputs and reliable prompting patterns.
Additional Resources for Tutor Depth
9 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 Demonstrations in ICL: labels often don’t matter
Paper · source
Ablations on demonstration properties (label correctness, exemplar order, input distribution, random labels) isolating what drives ICL performance.
Key content
- ICL inference objective (classification/multi-choice, Sections 3–4 framing): predict via
  - Zero-shot ("No demonstrations"):
    $$\hat y=\arg\max_{y\in C} P(y\mid x)$$
  - k-shot ("Demonstrations w/ gold labels"):
    $$\hat y=\arg\max_{y\in C} P(y\mid x_1,y_1,\ldots,x_k,y_k,x)$$
  - Random-label ablation ("Demonstrations w/ random labels"):
    $$\hat y=\arg\max_{y\in C} P(y\mid x_1,\tilde y_1,\ldots,x_k,\tilde y_k,x)$$
  where $C$ is the discrete label set, $(x_i,y_i)$ are demonstrations, and $\tilde y_i$ are randomly replaced labels.
- Core empirical result (Section 4, Fig. 1/3): replacing gold labels in demonstrations with random labels causes only a marginal performance drop across classification and multi-choice tasks, consistent across 12 models including GPT-3.
- Meta-training effect (Sections 4/6): in MetaICL, the drop from randomizing demonstration labels is 0.1–0.9% absolute, suggesting meta-trained ICL models ignore the input–label mapping even more.
- What actually drives gains (Section 5, Fig. 7–10): demonstrations help mainly by specifying 1) the label space, 2) the input-text distribution (in-distribution examples matter), and 3) the overall format (input–label pairing). Removing the format ("labels only" or "inputs only" without pairing) performs close to or worse than zero-shot.
- Label-space ablation (Section 5.2): for direct models, using labels from the true label space vs. random English-word labels yields a 5–16% absolute gap → label-space specification is a key contributor.
📄 Nucleus (top‑p) sampling definition & rationale
Paper · source
Primary-source definition + algorithm for nucleus (top‑p) sampling; contrasts with top‑k/beam; explains “unreliable tail” and degeneration.
Key content
- LM factorization (Eq. 1): for tokens $x_{1:m+n}$,
  $$P(x_{1:m+n})=\prod_{i=1}^{m+n} P(x_i \mid x_{1}\ldots x_{i-1})$$
  Generation proceeds token by token using a decoding strategy.
- Nucleus / top-p set (Section 3.1, Eq. 2): given the next-token distribution $P(x\mid x_{1:i-1})$ over vocabulary $V$, define the top-p vocabulary $V^{(p)}\subset V$ as the smallest set such that
  $$\sum_{x\in V^{(p)}} P(x\mid x_{1:i-1}) \ge p$$
- Renormalize + sample (Eq. 3): let $p'=\sum_{x\in V^{(p)}} P(x\mid x_{1:i-1})$. Define the truncated distribution
  $$P'(x\mid x_{1:i-1})=\begin{cases} P(x\mid x_{1:i-1})/p' & x\in V^{(p)}\\ 0 & \text{otherwise} \end{cases}$$
  then sample the next token from $P'$. The candidate set size is dynamic (it expands and contracts with the shape of the distribution).
- Top-k contrast (Section 3.2): $V^{(k)}$ is the size-$k$ set maximizing $\sum_{x\in V^{(k)}}P(x\mid \cdot)$; renormalize as in Eq. 3. Unlike top-p, the retained mass $p'$ "can vary wildly" across steps.
- Temperature (Eq. 4): with logits $u_l$ and temperature $t$,
  $$p(x=V_l\mid x_{1:i-1})=\frac{\exp(u_l/t)}{\sum_{l'}\exp(u_{l'}/t)}$$
  Lower $t\in[0,1)$ skews toward high-probability tokens, reducing diversity.
- Design rationale / empirical claims: beam/greedy (maximization) yields repetitive, generic "degeneration"; pure sampling can be incoherent due to an "unreliable tail" of many low-probability tokens. The authors report nucleus sampling best overall by human evaluation (HUSE), matching human-like perplexity and diversity better than top-k or beam search.
- Experimental defaults mentioned: GPT-2 Large (762M params); 5,000 conditional generations; max length 200 tokens; context = initial paragraph truncated to 1–40 tokens. HUSE: 200 generations × 20 annotations = 4,000 per decoding scheme; KNN with $k=13$; smoothing for truncated methods by interpolating 0.1 mass of the original distribution.
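Eqs. 2–4 translate almost directly into code. A minimal pure-Python sketch of one nucleus-sampling step (the logits and `p` values below are illustrative):

```python
import math
import random

def top_p_sample(logits, p=0.9, temperature=1.0, rng=random.Random(0)):
    """One nucleus-sampling step: temperature-scale the logits (Eq. 4),
    keep the smallest probability-sorted prefix with mass >= p (Eq. 2),
    renormalize over that set (Eq. 3), and sample a token index."""
    # softmax with temperature (max-subtraction for numerical stability)
    scaled = [u / temperature for u in logits]
    m = max(scaled)
    exps = [math.exp(u - m) for u in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # smallest top-p set: sort tokens by probability, take the prefix with mass >= p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # sample from the renormalized truncated distribution
    r = rng.random() * mass
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

Note the dynamic candidate set: a peaked distribution yields a nucleus of one or two tokens, while a flat distribution keeps many, which is exactly the property that distinguishes top-p from fixed-size top-k.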
📄 Temperature Scaling for Neural Net Calibration
Paper · source
Temperature scaling equation (logits ÷ T) + fitting T by NLL on validation set
Key content
- Perfect calibration definition (Eq. 1):
  $$\Pr(\hat Y = Y \mid \hat P = p)=p,\quad \forall p\in[0,1]$$
  where $\hat Y$ is the predicted class and $\hat P$ the predicted confidence.
- Reliability diagram binning (Section 2): partition confidences into $M$ bins $I_m=((m-1)/M,\,m/M]$. For bin $B_m$:
  $$\text{acc}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}\mathbf{1}(\hat y_i=y_i),\qquad \text{conf}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}\hat p_i$$
- Expected Calibration Error (ECE) (Eq. 3):
  $$\text{ECE}=\sum_{m=1}^M \frac{|B_m|}{n}\,\bigl|\text{acc}(B_m)-\text{conf}(B_m)\bigr|$$
- Negative log-likelihood objective (Eq. 6):
  $$L=-\sum_{i=1}^n \log \hat\pi(y_i\mid x_i)$$
- Temperature scaling (multiclass) (Eq. 9, Section 4.2): given logits vector $z_i$, calibrated probabilities apply the softmax to scaled logits:
  $$\sigma_{\text{SM}}(z_i/T)^{(k)}=\frac{e^{z_i^{(k)}/T}}{\sum_{j=1}^K e^{z_i^{(j)}/T}},\qquad \hat q_i=\max_k \sigma_{\text{SM}}(z_i/T)^{(k)}$$
  $T>0$ is fit by minimizing NLL on a held-out validation set with model weights fixed. The argmax is unchanged, so accuracy is unchanged.
- Rationale: modern networks overfit NLL and become overconfident; a single scalar $T$ often corrects the miscalibration ("intrinsically low dimensional").
- Empirical (Table 1, $M=15$ bins): CIFAR-100 ResNet-110 (SD) ECE 12.67% → 0.96% with temperature scaling; CIFAR-10 ResNet-110 4.6% → 0.54%.
- Implementation note: insert a multiplicative constant $1/T$ between the logits and the softmax; set $T=1$ during training and tune afterwards.
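Eq. 9 is a one-line transformation. A minimal sketch showing the key property that the argmax (and therefore accuracy) is unchanged while confidence shrinks for $T>1$ (the logits below are illustrative):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Eq. 9: divide logits by T, then apply softmax. T > 1 softens the
    distribution (lower confidence); T < 1 sharpens it. In the paper, T is
    fit by minimizing NLL on a validation set with model weights fixed."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # max-subtraction for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
p1 = softmax_with_temperature(logits, T=1.0)   # uncalibrated
p2 = softmax_with_temperature(logits, T=2.0)   # softened
# same argmax, lower peak confidence
```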
📊 GPT-3 Few-shot / One-shot / Zero-shot Evaluation Protocol & Benchmarks
Benchmark · source
Definitions + evaluation protocol for 0S/1S/FS; benchmark tables showing scaling trends with model size and # in-context examples.
Key content
- Learning settings (Section 2 “Approach”):
- Fine-tuning (FT): update weights on supervised task data.
- Few-shot (FS): provide K demonstrations (context→completion pairs) in the prompt; no gradient updates. Typical K ≈ 10–100, limited by context window nctx = 2048 tokens.
- One-shot (1S): FS with K = 1.
- Zero-shot (0S): task description/instruction only, K = 0.
- Few-shot evaluation procedure (Section 2.4):
- For each eval example, randomly draw K examples from the task training set as conditioning; delimiter 1–2 newlines depending on task.
- If no training set (e.g., LAMBADA, StoryCloze): draw conditioning examples from dev, evaluate on test.
- Some tasks add a natural-language prompt and/or answer formatting changes.
- Free-form completion decoding: beam search with beam width = 4, length penalty α = 0.6.
- Key empirical results (Tables 3.1–3.5):
- CoQA (F1): 0S 81.5, 1S 84.0, FS 85.0.
- TriviaQA (acc): 0S 64.3, 1S 68.0, FS 71.2 (FS reported as SOTA in closed-book comparison).
- LAMBADA (acc): 0S 76.2, FS 86.4.
- SuperGLUE (FS, 32 examples): Avg 69.0; notable: COPA 52.0, ReCoRD F1 91.1, WiC 49.4 (near chance).
- Design rationale (Intro/Fig 1.1): performance improves with model size and # in-context examples; gap between 0S/1S/FS often grows with capacity, suggesting larger models are better at in-context learning/meta-learning.
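The Section 2.4 conditioning procedure (randomly draw K training examples, join with a newline delimiter, end with the eval input) can be sketched in a few lines. The Q/A template below is an illustrative assumption, not GPT-3's exact task formatting:

```python
import random

def assemble_few_shot_context(train_set, eval_input, k=4,
                              delimiter="\n\n", rng=random.Random(0)):
    """GPT-3-style few-shot conditioning: randomly draw K demonstrations
    from the task training set, join them with a 1-2 newline delimiter,
    and end with the eval input for the model to complete. No gradient
    updates; K is limited by the context window (n_ctx = 2048 tokens)."""
    demos = rng.sample(train_set, k)
    return delimiter.join(demos + [eval_input])

train = [f"Q: {i}+{i} = A: {2*i}" for i in range(20)]
ctx = assemble_few_shot_context(train, "Q: 7+7 = A:", k=4)
```

With k=1 this reproduces the one-shot setting, and with k=0 demonstrations (task instruction only) the zero-shot setting.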
📊 JSONSchemaBench metrics for structured-output constrained decoding
Benchmark · source
Compliance-rate + efficiency methodology for JSON-Schema–constrained decoding (incl. failure analysis; TTFT/TPOT)
Key content
- Constrained decoding definition (Intro): masks invalid tokens at each step given the constraints + prefix, forcing only valid tokens → schema-compliant JSON.
- Benchmark: JSONSchemaBench = 9,558 real-world JSON Schemas (10 datasets; GitHub split by field count: trivial <10, easy 10–30, medium 30–100, hard 100–500, ultra >500). Experiments exclude GitHub-Trivial & GitHub-Ultra (too easy/hard).
- Efficiency metrics (Sec. 4):
- GCT = Grammar Compilation Time (s)
- TTFT = Time To First Token (s)
- TPOT = Time Per Output Token after first (ms)
- Fairness: compute efficiency on intersection of covered instances across all engines to avoid coverage bias.
- Setup: Llama-3.1-8B-Instruct; single A100 80GB; batch=1. Outlines/Guidance/Llamacpp via llama.cpp; XGrammar via HF Transformers.
- Example (GlaiveAI, llama.cpp backend): LM-only TPOT 15.40ms vs Guidance 6.37ms (TTFT 0.24s), Llamacpp 29.98ms, Outlines GCT 3.48s, TTFT 3.65s, TPOT 30.33ms.
- Coverage notions (Sec. 5):
- Declared coverage: accepts schema w/o explicit reject/runtime error.
- Empirical coverage: generated outputs validate against schema.
- True coverage: constraints semantically equivalent to schema (ideal; not directly measurable).
- Compliance Rate (CR): CR = Empirical / Declared (reliability conditional on accepting schema).
- Coverage experiment defaults (Sec. 5.1): Llama-3.2-1B-Instruct; prompt = instruction + 2-shot examples; greedy, temperature=0, single sample; 40s compile timeout + 40s generation timeout; validation via `jsonschema` (Draft 2020-12) with format checks enabled.
- Empirical results (Sec. 5.2, selected):
- GitHub Easy: Guidance Declared 0.90, Empirical 0.86, CR 0.96; LM-only Empirical 0.65.
- GitHub Hard: Guidance Empirical 0.41 (CR 0.69); LM-only 0.13.
- Closed-source (OpenAI/Gemini): often low declared/empirical but CR ~1.00 (conservative feature subset).
- Failure analysis via JSON Schema Test Suite (Sec. 5.3):
- Failure modes: Over-constrained (rejects valid instances) vs Under-constrained (allows invalid).
- Category-level failures: Under-constrained counts—Guidance 1 vs XGrammar 38; Compile errors—Outlines 42, Llamacpp 37, XGrammar 3, Guidance 25.
- Quality (Sec. 6): constrained decoding improved downstream accuracy up to ~4%; on reasoning tasks (Llama-3.1-8B): GSM8K LM-only 80.1% vs Guidance 83.8%.
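The coverage notions compose into a single ratio, CR = Empirical / Declared. A minimal sketch computing all three from per-schema outcomes (the outcome encoding below is my simplification of the benchmark's methodology):

```python
def coverage_metrics(results):
    """Compute declared coverage, empirical coverage, and compliance rate
    CR = empirical / declared from a list of per-schema outcomes.
    Each outcome is (accepted_by_engine: bool, output_validates: bool)."""
    n = len(results)
    declared = sum(1 for accepted, _ in results if accepted)
    empirical = sum(1 for accepted, valid in results if accepted and valid)
    return {
        "declared": declared / n,
        "empirical": empirical / n,
        # reliability conditional on the engine accepting the schema
        "compliance_rate": empirical / declared if declared else 0.0,
    }

# 10 schemas: 9 accepted by the engine, 8 of those produced schema-valid output
outcomes = [(True, True)] * 8 + [(True, False)] + [(False, False)]
m = coverage_metrics(outcomes)
```

This makes the paper's point concrete: an engine can look reliable on CR (it rarely fails on schemas it accepts) while having low declared coverage, which is the pattern reported for the closed-source APIs.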
📖 Chat Completions API — message schema, tools, streaming
Reference Doc · source
Canonical request/response objects for /chat/completions, message roles, tool-choice defaults, streaming options, and related JSON fields.
Key content
- Endpoint & operations
  - Create: `POST /chat/completions`
  - List: `GET /chat/completions`
  - Get: `GET /chat/completions/{completion_id}`
  - Update: `POST /chat/completions/{completion_id}`
  - Delete: `DELETE /chat/completions/{completion_id}`
  - Get stored messages: `GET /chat/completions/{completion_id}/messages`
- Core response object
  - `ChatCompletion = { id, choices, created, ... }` (response returned by the model based on the provided messages)
  - Streaming: `ChatCompletionChunk = { id, choices, created, ... }`
- Message roles & precedence
  - `ChatCompletionRole` includes `"developer"`, `"system"`, `"user"`, …
  - `ChatCompletionDeveloperMessageParam = { role, content, name }`: developer instructions the model should follow; with o1 models and newer, developer messages replace previous system messages.
  - `ChatCompletionSystemMessageParam = { role, content, name }`: same purpose, but developer messages are preferred for o1+.
  - User message: `ChatCompletionUserMessageParam = { role, content, name }`
- Multimodal content parts
  - `ChatCompletionContentPartText = { type, text }`
  - `ChatCompletionContentPartImage = { type, image_url }`
  - `ChatCompletionContentPartInputAudio = { type, input_audio }`
  - Refusal part: `{ type, refusal }`
- Audio output
  - Output object: `ChatCompletionAudio = { id, data, expires_at, transcript }`
  - Request params: `ChatCompletionAudioParam = { format, voice }` (required when requesting modalities `["audio"]`)
  - Modalities: `ChatCompletionModality = "text" | "audio"`
- Tools & tool choice (defaults matter)
  - Tool types: `ChatCompletionTool` = function tool | custom tool
  - Force a function call: `ChatCompletionFunctionCallOption = { name }`
  - `tool_choice`: `"none" | "auto" | "required" | AllowedToolChoice | NamedToolChoice | NamedToolChoiceCustom`
    - `"none"` = model won't call tools; generates a message
    - `"auto"` = model may choose between a message and tool call(s)
    - `"required"` = model must call ≥1 tool
  - Defaults: `"none"` when no tools are present; `"auto"` when tools are present
- Streaming options
  - `ChatCompletionStreamOptions = { include_obfuscation, include_usage }` (only when `stream: true`)
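The tool-choice defaults are the part most often gotten wrong in practice. A minimal sketch that assembles a request body and applies them explicitly (this builds the JSON payload only, without calling the API; the model name is an illustrative assumption):

```python
def build_chat_request(messages, tools=None, tool_choice=None):
    """Assemble a /chat/completions request body, applying the documented
    tool_choice defaults: "none" when no tools are present, "auto" when
    tools are present (unless the caller overrides)."""
    body = {"model": "gpt-4o-mini", "messages": messages}
    if tools:
        body["tools"] = tools
        body["tool_choice"] = tool_choice or "auto"
    else:
        body["tool_choice"] = tool_choice or "none"
    return body

msgs = [
    {"role": "developer", "content": "Answer tersely."},  # replaces system for o1+
    {"role": "user", "content": "What time is it in Tokyo?"},
]
tool = {"type": "function",
        "function": {"name": "get_time",
                     "parameters": {"type": "object", "properties": {}}}}
req = build_chat_request(msgs, tools=[tool])  # tool_choice defaults to "auto"
```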
📖 Schema-Constrained Decoding (Structured Outputs)
Reference Doc · source
Design rationale + mechanism for schema-constrained decoding vs JSON mode (guarantees/limits)
Key content
- JSON mode vs Structured Outputs
  - JSON mode: improves the validity of JSON but does not guarantee conformance to a specific schema.
  - Structured Outputs: designed to ensure outputs exactly match developer-supplied JSON Schemas.
- How to enable (2 paths)
  - Function calling: set `strict: true` inside the tool/function definition → outputs match the supplied tool schema.
  - response_format: set `response_format: { type: "json_schema", json_schema: { strict: true, schema: ... } }` → outputs match the schema (supported on `gpt-4o-2024-08-06`, `gpt-4o-mini-2024-07-18`).
- Reliability / empirical results
  - On complex JSON schema-following evals: `gpt-4o-2024-08-06` + Structured Outputs scores 100%; `gpt-4-0613` scores <40%.
  - Model training alone reached 93% on the benchmark; deterministic constrained decoding is used to reach 100% reliability.
- Mechanism: constrained decoding (dynamic token masking)
  - Convert the JSON Schema → a context-free grammar (CFG).
  - During sampling, after every token, compute the valid next tokens from the CFG and mask invalid tokens (probability → 0).
  - The first request with a new schema incurs preprocessing latency; artifacts are cached for reuse.
- Why CFG (vs FSM/regex)
  - CFGs express a broader class of languages; better for nested/recursive schemas (e.g., `$ref: "#"`) where FSMs struggle.
- Operational limits
  - Output can still fail the schema on refusal, `max_tokens`/stop truncation, or parallel tool calls (set `parallel_tool_calls: false`).
  - Structured Outputs ensures structure, not correctness of values (e.g., a math step may be wrong).
  - First-schema latency: typically <10s, complex schemas up to ~1 min.
  - Refusals are surfaced via the `message.refusal` string; if there is no refusal and generation was not interrupted (`finish_reason`), the output matches the schema.
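The masking step can be illustrated with a toy stand-in for the grammar. This is a deliberately simplified sketch: a real implementation compiles the schema into a CFG, whereas here a prefix-validity check on a single target literal plays the grammar's role, and the tiny token vocabulary is invented for illustration:

```python
def mask_invalid_tokens(probs, prefix, is_valid_prefix):
    """One toy constrained-decoding step: zero the probability of every
    token whose concatenation with the current prefix cannot extend to a
    valid output, then renormalize over the surviving tokens."""
    masked = {tok: (p if is_valid_prefix(prefix + tok) else 0.0)
              for tok, p in probs.items()}
    total = sum(masked.values())
    if total == 0:
        raise ValueError("grammar rejects every continuation")
    return {tok: p / total for tok, p in masked.items()}

# toy "grammar": outputs must be prefixes of the literal {"ok": true}
TARGET = '{"ok": true}'
is_valid = lambda s: TARGET.startswith(s)

# model proposes three next tokens; only one keeps the output valid
step = mask_invalid_tokens({'{"': 0.2, 'ok': 0.5, '("': 0.3}, "", is_valid)
```

The renormalization mirrors the nucleus-sampling truncation above: invalid tokens get probability zero, and the remaining mass is rescaled so sampling stays well-defined.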
📖 Structured Outputs & JSON Mode (OpenAI API)
Reference Doc · source
Concrete SDK pattern client.responses.parse(...) / client.chat.completions.parse(...) with typed schemas + documented constraints/behavior of Structured Outputs vs JSON mode.
Key content
- Structured Outputs (SO): guarantees valid JSON + adherence to the supplied JSON Schema (`strict: true`), preventing missing required keys / invalid enums. Recommended over JSON mode when supported.
- SDK parsing workflow (Python/Pydantic):
  - Chat Completions: `client.chat.completions.parse(..., response_format=MyModel)` → `completion.choices[0].message.parsed`
  - Responses API: `client.responses.parse(..., text_format=MyModel)` → `response.output_parsed`
- When to use:
  - Function calling: bridge the model ↔ tools/functions/data.
  - response_format / text.format: structure the assistant's user-facing response (e.g., tutoring UI sections).
- Model support:
  - SO via `response_format: {type:"json_schema", json_schema:{strict:true, schema:...}}` supported on gpt-4o-mini, gpt-4o-mini-2024-07-18, gpt-4o-2024-08-06, and later snapshots.
  - JSON mode: `response_format: {type:"json_object"}` (Chat Completions) or `text.format: {type:"json_object"}` (Responses).
- Refusals: if a safety refusal occurs, the API includes a refusal field/content (programmatically detectable) rather than schema output.
- Schema constraints (SO subset):
  - Root schema must be an object (not a top-level `anyOf`).
  - All fields must be required; emulate optional via a union with `null` (e.g., `"type": ["string","null"]`).
  - Objects must set `additionalProperties: false`.
  - Limits: ≤5000 total object properties, ≤10 nesting levels; total schema string length ≤120,000 chars; ≤1000 enum values overall.
  - Key ordering in the output follows the schema key order.
- JSON mode gotcha: you must explicitly instruct the model to output JSON; the API errors if "JSON" is absent from the context; otherwise the model may emit endless whitespace.
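The strict-mode schema constraints can be enforced mechanically when building schemas by hand. A minimal sketch (the helper, the `person` schema name, and the example fields are illustrative; the payload shape follows the `response_format: json_schema` pattern above):

```python
def strict_object_schema(properties):
    """Build a JSON Schema object satisfying the Structured Outputs subset:
    root is an object, every field is listed in `required`, and
    additionalProperties is false."""
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),        # all fields must be required
        "additionalProperties": False,
    }

schema = strict_object_schema({
    "name": {"type": "string"},
    "age": {"type": ["integer", "null"]},    # "optional" via union with null
})

# request fragment in the documented response_format shape
payload = {"type": "json_schema",
           "json_schema": {"name": "person", "strict": True, "schema": schema}}
```

Because `required` is derived from the property dict, a field can never be accidentally omitted from it, which is one of the subset rules most likely to trip up hand-written schemas.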
📖 Structured Outputs (JSON Schema) — OpenAI Responses API
Reference Doc · source
Exact request/response patterns + guarantees/limits for Structured Outputs (client.responses.parse(...), schema subset, refusals, streaming, supported models)
Key content
- Guarantee (Structured Outputs): Model output always adheres to supplied JSON Schema (type-safety; no missing required keys; no invalid enum values). Distinct from JSON mode which guarantees valid JSON only, not schema adherence.
- Enable Structured Outputs (Responses API):
  - SDK pattern (Python/Pydantic): `response = client.responses.parse(model=..., input=[...], text_format=MyPydanticModel)` → parsed object at `response.output_parsed`.
  - REST/format equivalent: `text: { format: { type: "json_schema", strict: true, schema: ... } }`
- Supported models (json_schema): `gpt-4o-mini`, `gpt-4o-mini-2024-07-18`, `gpt-4o-2024-08-06`, and later. Older models use JSON mode.
- JSON mode enable: `text: { format: { type: "json_object" } }`
  - Must explicitly instruct the model to output JSON; the API errors if "JSON" is not present in the context. Risk: an endless whitespace stream if not instructed.
- Refusals: if a safety refusal occurs, the response includes `refusal` content (programmatically detectable) rather than matching the schema.
- Streaming: use `client.responses.stream(..., text_format=Schema)`; handle events like `response.output_text.delta`, `response.refusal.delta`, `response.completed`. The SDK is recommended for parsing.
- Schema subset + hard limits:
  - Types: string, number, boolean, integer, object, array, enum, anyOf.
  - Root schema must be an object (not anyOf). All fields required; emulate optional via a union with `null` (e.g., `"type": ["string","null"]`).
  - Objects must set `additionalProperties: false`.
  - Limits: ≤5000 total object properties; ≤10 nesting levels; total schema string length ≤120,000 chars; ≤1000 enum values overall; per enum property string total ≤15,000 chars when >250 values.
  - Key ordering: output keys follow the schema order.