Prompting
Video (best)
- Andrej Karpathy — “Intro to Large Language Models”
- Watch: YouTube
- Why: Karpathy’s talk naturally covers how prompting works in the context of LLM inference, including zero-shot and few-shot patterns, temperature, and how the model responds to context. It’s the most pedagogically grounded explanation of why prompting works, not just how to do it — rooted in the mechanics of next-token prediction. Already curated for this platform.
- Level: beginner/intermediate
Coverage note: This video is a strong general LLM intro but does not deeply cover structured outputs, top-p sampling, or advanced prompting techniques. A more prompting-specific video would strengthen this topic.
Blog / Written explainer (best)
- Lilian Weng — “Prompt Engineering”
- Link: https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
- Why: Weng’s post is the gold standard written reference for prompting. It systematically covers zero-shot, few-shot, chain-of-thought, self-consistency, and structured output strategies with clear examples and citations. Her writing bridges intuition and rigor, making it suitable for learners who want depth without reading papers directly.
- Level: intermediate
[NOT VERIFIED] — URL structure is consistent with her blog conventions; confirm post slug is exact.
Deep dive
- DAIR.AI / Elvis Saravia — “Prompt Engineering Guide”
- Link: https://www.promptingguide.ai/
- Why: The most comprehensive freely available reference covering the full prompting landscape: zero-shot, few-shot, chain-of-thought, ReAct, structured outputs, temperature/top-p parameters, and more. Actively maintained, well-organized, and widely used in both academic and industry settings. Serves as a living technical reference rather than a static article.
- Level: intermediate/advanced
Original paper
- Brown et al. (OpenAI), 2020 — “Language Models are Few-Shot Learners” (GPT-3 paper)
- Link: https://arxiv.org/abs/2005.14165
- Why: This is the seminal paper that introduced and formalized the concepts of zero-shot, one-shot, and few-shot prompting as distinct in-context learning paradigms. It is the foundational citation for virtually all prompting research. The results sections are readable without deep ML background, making it accessible to motivated learners.
- Level: intermediate/advanced
Code walkthrough
- OpenAI Cookbook — “Techniques to improve reliability” (few-shot, structured outputs, temperature)
- Link: https://cookbook.openai.com/articles/techniques_to_improve_reliability
- Why: Hands-on, runnable examples demonstrating few-shot prompting, structured output formatting (JSON mode), and the practical effect of temperature and top-p on outputs. Uses the OpenAI API directly, which is the most common practical context learners will encounter. Bridges conceptual understanding to working code.
- Level: beginner/intermediate
[NOT VERIFIED] — OpenAI Cookbook URLs have shifted; confirm this slug resolves correctly.
Coverage notes
- Strong: Zero-shot and few-shot prompting (well covered by GPT-3 paper + Weng blog + Karpathy video); in-context learning conceptual foundations; temperature intuition
- Weak: Top-p (nucleus) sampling mechanics — most resources mention it but few explain it deeply at a pedagogical level; structured outputs / JSON mode is underrepresented in video format
- Gap: No single excellent YouTube video exists that is specifically about prompting techniques end-to-end (zero-shot → few-shot → structured outputs → sampling parameters). Karpathy’s video is the best available but is not a dedicated prompting tutorial. A video from a source like Serrano.Academy or a Stanford lecture specifically on prompt engineering would significantly strengthen this topic’s video coverage.
Cross-validation
This topic appears in 2 courses: intro-to-agentic-ai, intro-to-llms
- For intro-to-llms: the Karpathy video and GPT-3 paper are the natural anchors; Weng's blog provides the written complement.
- For intro-to-agentic-ai: the promptingguide.ai deep dive and OpenAI Cookbook are more actionable for learners building agents who need structured outputs and reliable prompting patterns.
Additional Resources for Tutor Depth
9 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 Demonstrations in ICL: labels often don’t matter
Paper · source
Ablations on demonstration properties (label correctness, exemplar order, input distribution, random labels) isolating what drives ICL performance.
Key content
- ICL inference objective (classification/multi-choice, Sections 3–4 framing): predict via
  - Zero-shot ("No demonstrations"):
    $$\hat y=\arg\max_{y\in C} P(y\mid x)$$
  - k-shot ("Demonstrations w/ gold labels"):
    $$\hat y=\arg\max_{y\in C} P(y\mid x_1,y_1,\ldots,x_k,y_k,x)$$
  - Random-label ablation ("Demonstrations w/ random labels"):
    $$\hat y=\arg\max_{y\in C} P(y\mid x_1,\tilde y_1,\ldots,x_k,\tilde y_k,x)$$
  where $C$ is the discrete label set, $(x_i,y_i)$ are demonstrations, and $\tilde y_i$ are randomly replaced labels.
- Core empirical result (Section 4, Fig. 1/3): replacing gold labels in demonstrations with random labels causes only a marginal performance drop across classification and multi-choice tasks, consistent across 12 models including GPT-3.
- Meta-training effect (Sections 4/6): in MetaICL, the drop from randomizing demonstration labels is 0.1–0.9% absolute, suggesting meta-trained ICL models ignore the input–label mapping even more.
- What actually drives gains (Section 5, Fig. 7–10): demonstrations help mainly by specifying 1) the label space, 2) the input-text distribution (in-distribution examples matter), and 3) the overall format (input–label pairing). Removing the format ("labels only" or "inputs only" without pairing) performs close to or worse than zero-shot.
- Label-space ablation (Section 5.2): for direct models, using labels from the true label space vs. random English-word labels yields a 5–16% absolute gap → label-space specification is a key contributor.
📄 Nucleus (top‑p) sampling definition & rationale
Paper · source
Primary-source definition + algorithm for nucleus (top‑p) sampling; contrasts with top‑k/beam; explains “unreliable tail” and degeneration.
Key content
- LM factorization (Eq. 1): for tokens $x_{1:m+n}$,
  $$P(x_{1:m+n})=\prod_{i=1}^{m+n} P(x_i \mid x_{1}\ldots x_{i-1})$$
  Generation proceeds token by token using a decoding strategy.
- Nucleus / top-p set (Section 3.1, Eq. 2): given the next-token distribution $P(x\mid x_{1:i-1})$ over vocabulary $V$, define the top-p vocabulary $V^{(p)}\subset V$ as the smallest set such that
  $$\sum_{x\in V^{(p)}} P(x\mid x_{1:i-1}) \ge p$$
- Renormalize + sample (Eq. 3): let $p'=\sum_{x\in V^{(p)}} P(x\mid x_{1:i-1})$. Define the truncated distribution
  $$P'(x\mid x_{1:i-1})=\begin{cases} P(x\mid x_{1:i-1})/p' & x\in V^{(p)}\\ 0 & \text{otherwise} \end{cases}$$
  then sample the next token from $P'$. The candidate set size is dynamic (it expands and contracts with the shape of the distribution).
- Top-k contrast (Section 3.2): $V^{(k)}$ is the size-$k$ set maximizing $\sum_{x\in V^{(k)}}P(x\mid \cdot)$; renormalize as in Eq. 3. Unlike top-p, the retained mass $p'$ "can vary wildly" across steps.
- Temperature (Eq. 4): with logits $u_l$ and temperature $t$,
  $$p(x=V_l\mid x_{1:i-1})=\frac{\exp(u_l/t)}{\sum_{l'}\exp(u_{l'}/t)}$$
  Lower $t\in[0,1)$ skews toward high-probability tokens, reducing diversity.
- Design rationale / empirical claims: beam/greedy (maximization) yields repetitive, generic "degeneration"; pure sampling can be incoherent due to an "unreliable tail" of many low-probability tokens. The authors report nucleus sampling best overall by human evaluation (HUSE), matching human-like perplexity and diversity better than top-k or beam search.
- Experimental defaults mentioned: GPT-2 Large (762M params); 5,000 conditional generations; max length 200 tokens; context = initial paragraph truncated to 1–40 tokens. HUSE: 200 generations × 20 annotations = 4,000 per decoding scheme; KNN with $k=13$; smoothing for truncated methods by interpolating 0.1 mass of the original distribution.
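Eqs. 2–4 translate almost directly into code. A minimal pure-Python sketch of one nucleus-sampling step (the logits and `p` values below are illustrative):

```python
import math
import random

def top_p_sample(logits, p=0.9, temperature=1.0, rng=random.Random(0)):
    """One nucleus-sampling step: temperature-scale the logits (Eq. 4),
    keep the smallest probability-sorted prefix with mass >= p (Eq. 2),
    renormalize over that set (Eq. 3), and sample a token index."""
    # softmax with temperature (max-subtraction for numerical stability)
    scaled = [u / temperature for u in logits]
    m = max(scaled)
    exps = [math.exp(u - m) for u in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # smallest top-p set: sort tokens by probability, take the prefix with mass >= p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # sample from the renormalized truncated distribution
    r = rng.random() * mass
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

Note the dynamic candidate set: a peaked distribution yields a nucleus of one or two tokens, while a flat distribution keeps many, which is exactly the property that distinguishes top-p from fixed-size top-k.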
📄 Temperature Scaling for Neural Net Calibration
Paper · source
Temperature scaling equation (logits ÷ T) + fitting T by NLL on validation set
Key content
- Perfect calibration definition (Eq. 1):
  $$\Pr(\hat Y = Y \mid \hat P = p)=p,\quad \forall p\in[0,1]$$
  where $\hat Y$ is the predicted class and $\hat P$ the predicted confidence.
- Reliability diagram binning (Section 2): partition confidences into $M$ bins $I_m=((m-1)/M,\,m/M]$. For bin $B_m$:
  $$\text{acc}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}\mathbf{1}(\hat y_i=y_i),\qquad \text{conf}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}\hat p_i$$
- Expected Calibration Error (ECE) (Eq. 3):
  $$\text{ECE}=\sum_{m=1}^M \frac{|B_m|}{n}\,\bigl|\text{acc}(B_m)-\text{conf}(B_m)\bigr|$$
- Negative log-likelihood objective (Eq. 6):
  $$L=-\sum_{i=1}^n \log \hat\pi(y_i\mid x_i)$$
- Temperature scaling (multiclass) (Eq. 9, Section 4.2): given logits vector $z_i$, calibrated probabilities apply the softmax to scaled logits:
  $$\sigma_{\text{SM}}(z_i/T)^{(k)}=\frac{e^{z_i^{(k)}/T}}{\sum_{j=1}^K e^{z_i^{(j)}/T}},\qquad \hat q_i=\max_k \sigma_{\text{SM}}(z_i/T)^{(k)}$$
  $T>0$ is fit by minimizing NLL on a held-out validation set with model weights fixed. The argmax is unchanged, so accuracy is unchanged.
- Rationale: modern networks overfit NLL and become overconfident; a single scalar $T$ often corrects the miscalibration ("intrinsically low dimensional").
- Empirical (Table 1, $M=15$ bins): CIFAR-100 ResNet-110 (SD) ECE 12.67% → 0.96% with temperature scaling; CIFAR-10 ResNet-110 4.6% → 0.54%.
- Implementation note: insert a multiplicative constant $1/T$ between the logits and the softmax; set $T=1$ during training and tune afterwards.
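Eq. 9 is a one-line transformation. A minimal sketch showing the key property that the argmax (and therefore accuracy) is unchanged while confidence shrinks for $T>1$ (the logits below are illustrative):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Eq. 9: divide logits by T, then apply softmax. T > 1 softens the
    distribution (lower confidence); T < 1 sharpens it. In the paper, T is
    fit by minimizing NLL on a validation set with model weights fixed."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # max-subtraction for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
p1 = softmax_with_temperature(logits, T=1.0)   # uncalibrated
p2 = softmax_with_temperature(logits, T=2.0)   # softened
# same argmax, lower peak confidence
```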
📊 GPT-3 Few-shot / One-shot / Zero-shot Evaluation Protocol & Benchmarks
Benchmark · source
Definitions + evaluation protocol for 0S/1S/FS; benchmark tables showing scaling trends with model size and # in-context examples.
Key content
- Learning settings (Section 2 “Approach”):
- Fine-tuning (FT): update weights on supervised task data.
- Few-shot (FS): provide K demonstrations (context→completion pairs) in the prompt; no gradient updates. Typical K ≈ 10–100, limited by context window nctx = 2048 tokens.
- One-shot (1S): FS with K = 1.
- Zero-shot (0S): task description/instruction only, K = 0.
- Few-shot evaluation procedure (Section 2.4):
- For each eval example, randomly draw K examples from the task training set as conditioning; delimiter 1–2 newlines depending on task.
- If no training set (e.g., LAMBADA, StoryCloze): draw conditioning examples from dev, evaluate on test.
- Some tasks add a natural-language prompt and/or answer formatting changes.
- Free-form completion decoding: beam search with beam width = 4, length penalty α = 0.6.
- Key empirical results (Tables 3.1–3.5):
- CoQA (F1): 0S 81.5, 1S 84.0, FS 85.0.
- TriviaQA (acc): 0S 64.3, 1S 68.0, FS 71.2 (FS reported as SOTA in closed-book comparison).
- LAMBADA (acc): 0S 76.2, FS 86.4.
- SuperGLUE (FS, 32 examples): Avg 69.0; notable: COPA 52.0, ReCoRD F1 91.1, WiC 49.4 (near chance).
- Design rationale (Intro/Fig 1.1): performance improves with model size and # in-context examples; gap between 0S/1S/FS often grows with capacity, suggesting larger models are better at in-context learning/meta-learning.
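The Section 2.4 conditioning procedure (randomly draw K training examples, join with a newline delimiter, end with the eval input) can be sketched in a few lines. The Q/A template below is an illustrative assumption, not GPT-3's exact task formatting:

```python
import random

def assemble_few_shot_context(train_set, eval_input, k=4,
                              delimiter="\n\n", rng=random.Random(0)):
    """GPT-3-style few-shot conditioning: randomly draw K demonstrations
    from the task training set, join them with a 1-2 newline delimiter,
    and end with the eval input for the model to complete. No gradient
    updates; K is limited by the context window (n_ctx = 2048 tokens)."""
    demos = rng.sample(train_set, k)
    return delimiter.join(demos + [eval_input])

train = [f"Q: {i}+{i} = A: {2*i}" for i in range(20)]
ctx = assemble_few_shot_context(train, "Q: 7+7 = A:", k=4)
```

With k=1 this reproduces the one-shot setting, and with k=0 demonstrations (task instruction only) the zero-shot setting.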
📊 JSONSchemaBench metrics for structured-output constrained decoding
Benchmark · source
Compliance-rate + efficiency methodology for JSON-Schema–constrained decoding (incl. failure analysis; TTFT/TPOT)
Key content
- Constrained decoding definition (Intro): masks invalid tokens at each step given the constraints + prefix, forcing only valid tokens → schema-compliant JSON.
- Benchmark: JSONSchemaBench = 9,558 real-world JSON Schemas (10 datasets; GitHub split by field count: trivial <10, easy 10–30, medium 30–100, hard 100–500, ultra >500). Experiments exclude GitHub-Trivial & GitHub-Ultra (too easy/hard).
- Efficiency metrics (Sec. 4):
- GCT = Grammar Compilation Time (s)
- TTFT = Time To First Token (s)
- TPOT = Time Per Output Token after first (ms)
- Fairness: compute efficiency on intersection of covered instances across all engines to avoid coverage bias.
- Setup: Llama-3.1-8B-Instruct; single A100 80GB; batch=1. Outlines/Guidance/Llamacpp via llama.cpp; XGrammar via HF Transformers.
- Example (GlaiveAI, llama.cpp backend): LM-only TPOT 15.40ms vs Guidance 6.37ms (TTFT 0.24s), Llamacpp 29.98ms, Outlines GCT 3.48s, TTFT 3.65s, TPOT 30.33ms.
- Coverage notions (Sec. 5):
- Declared coverage: accepts schema w/o explicit reject/runtime error.
- Empirical coverage: generated outputs validate against schema.
- True coverage: constraints semantically equivalent to schema (ideal; not directly measurable).
- Compliance Rate (CR): CR = Empirical / Declared (reliability conditional on accepting schema).
- Coverage experiment defaults (Sec. 5.1): Llama-3.2-1B-Instruct; prompt = instruction + 2-shot examples; greedy, temperature=0, single sample; 40s compile timeout + 40s generation timeout; validation via `jsonschema` (Draft 2020-12) with format checks enabled.
- Empirical results (Sec. 5.2, selected):
- GitHub Easy: Guidance Declared 0.90, Empirical 0.86, CR 0.96; LM-only Empirical 0.65.
- GitHub Hard: Guidance Empirical 0.41 (CR 0.69); LM-only 0.13.
- Closed-source (OpenAI/Gemini): often low declared/empirical but CR ~1.00 (conservative feature subset).
- Failure analysis via JSON Schema Test Suite (Sec. 5.3):
- Failure modes: Over-constrained (rejects valid instances) vs Under-constrained (allows invalid).
- Category-level failures: Under-constrained counts—Guidance 1 vs XGrammar 38; Compile errors—Outlines 42, Llamacpp 37, XGrammar 3, Guidance 25.
- Quality (Sec. 6): constrained decoding improved downstream accuracy up to ~4%; on reasoning tasks (Llama-3.1-8B): GSM8K LM-only 80.1% vs Guidance 83.8%.
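The coverage notions compose into a single ratio, CR = Empirical / Declared. A minimal sketch computing all three from per-schema outcomes (the outcome encoding below is my simplification of the benchmark's methodology):

```python
def coverage_metrics(results):
    """Compute declared coverage, empirical coverage, and compliance rate
    CR = empirical / declared from a list of per-schema outcomes.
    Each outcome is (accepted_by_engine: bool, output_validates: bool)."""
    n = len(results)
    declared = sum(1 for accepted, _ in results if accepted)
    empirical = sum(1 for accepted, valid in results if accepted and valid)
    return {
        "declared": declared / n,
        "empirical": empirical / n,
        # reliability conditional on the engine accepting the schema
        "compliance_rate": empirical / declared if declared else 0.0,
    }

# 10 schemas: 9 accepted by the engine, 8 of those produced schema-valid output
outcomes = [(True, True)] * 8 + [(True, False)] + [(False, False)]
m = coverage_metrics(outcomes)
```

This makes the paper's point concrete: an engine can look reliable on CR (it rarely fails on schemas it accepts) while having low declared coverage, which is the pattern reported for the closed-source APIs.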
📖 Chat Completions API — message schema, tools, streaming
Reference Doc · source
Canonical request/response objects for /chat/completions, message roles, tool-choice defaults, streaming options, and related JSON fields.
Key content
- Endpoint & operations
  - Create: `POST /chat/completions`
  - List: `GET /chat/completions`
  - Get: `GET /chat/completions/{completion_id}`
  - Update: `POST /chat/completions/{completion_id}`
  - Delete: `DELETE /chat/completions/{completion_id}`
  - Get stored messages: `GET /chat/completions/{completion_id}/messages`
- Core response object
  - `ChatCompletion = { id, choices, created, ... }` (response returned by the model based on the provided messages)
  - Streaming: `ChatCompletionChunk = { id, choices, created, ... }`
- Message roles & precedence
  - `ChatCompletionRole` includes `"developer"`, `"system"`, `"user"`, …
  - `ChatCompletionDeveloperMessageParam = { role, content, name }`: developer instructions the model should follow; with o1 models and newer, developer messages replace previous system messages.
  - `ChatCompletionSystemMessageParam = { role, content, name }`: same purpose, but developer messages are preferred for o1+.
  - User message: `ChatCompletionUserMessageParam = { role, content, name }`
- Multimodal content parts
  - `ChatCompletionContentPartText = { type, text }`
  - `ChatCompletionContentPartImage = { type, image_url }`
  - `ChatCompletionContentPartInputAudio = { type, input_audio }`
  - Refusal part: `{ type, refusal }`
- Audio output
  - Output object: `ChatCompletionAudio = { id, data, expires_at, transcript }`
  - Request params: `ChatCompletionAudioParam = { format, voice }` (required when requesting modalities `["audio"]`)
  - Modalities: `ChatCompletionModality = "text" | "audio"`
- Tools & tool choice (defaults matter)
  - Tool types: `ChatCompletionTool` = function tool | custom tool
  - Force a function call: `ChatCompletionFunctionCallOption = { name }`
  - `tool_choice`: `"none" | "auto" | "required" | AllowedToolChoice | NamedToolChoice | NamedToolChoiceCustom`
    - `"none"` = model won't call tools; generates a message
    - `"auto"` = model may choose between a message and tool call(s)
    - `"required"` = model must call ≥1 tool
  - Defaults: `"none"` when no tools are present; `"auto"` when tools are present
- Streaming options
  - `ChatCompletionStreamOptions = { include_obfuscation, include_usage }` (only when `stream: true`)
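The tool-choice defaults are the part most often gotten wrong in practice. A minimal sketch that assembles a request body and applies them explicitly (this builds the JSON payload only, without calling the API; the model name is an illustrative assumption):

```python
def build_chat_request(messages, tools=None, tool_choice=None):
    """Assemble a /chat/completions request body, applying the documented
    tool_choice defaults: "none" when no tools are present, "auto" when
    tools are present (unless the caller overrides)."""
    body = {"model": "gpt-4o-mini", "messages": messages}
    if tools:
        body["tools"] = tools
        body["tool_choice"] = tool_choice or "auto"
    else:
        body["tool_choice"] = tool_choice or "none"
    return body

msgs = [
    {"role": "developer", "content": "Answer tersely."},  # replaces system for o1+
    {"role": "user", "content": "What time is it in Tokyo?"},
]
tool = {"type": "function",
        "function": {"name": "get_time",
                     "parameters": {"type": "object", "properties": {}}}}
req = build_chat_request(msgs, tools=[tool])  # tool_choice defaults to "auto"
```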
📖 Schema-Constrained Decoding (Structured Outputs)
Reference Doc · source
Design rationale + mechanism for schema-constrained decoding vs JSON mode (guarantees/limits)
Key content
- JSON mode vs Structured Outputs
  - JSON mode: improves the validity of JSON but does not guarantee conformance to a specific schema.
  - Structured Outputs: designed to ensure outputs exactly match developer-supplied JSON Schemas.
- How to enable (2 paths)
  - Function calling: set `strict: true` inside the tool/function definition → outputs match the supplied tool schema.
  - response_format: set `response_format: { type: "json_schema", json_schema: { strict: true, schema: ... } }` → outputs match the schema (supported on `gpt-4o-2024-08-06`, `gpt-4o-mini-2024-07-18`).
- Reliability / empirical results
  - On complex JSON schema-following evals: `gpt-4o-2024-08-06` + Structured Outputs scores 100%; `gpt-4-0613` scores <40%.
  - Model training alone reached 93% on the benchmark; deterministic constrained decoding is used to reach 100% reliability.
- Mechanism: constrained decoding (dynamic token masking)
  - Convert the JSON Schema → a context-free grammar (CFG).
  - During sampling, after every token, compute the valid next tokens from the CFG and mask invalid tokens (probability → 0).
  - The first request with a new schema incurs preprocessing latency; artifacts are cached for reuse.
- Why CFG (vs FSM/regex)
  - CFGs express a broader class of languages; better for nested/recursive schemas (e.g., `$ref: "#"`) where FSMs struggle.
- Operational limits
  - Output can still fail the schema on refusal, `max_tokens`/stop truncation, or parallel tool calls (set `parallel_tool_calls: false`).
  - Structured Outputs ensures structure, not correctness of values (e.g., a math step may be wrong).
  - First-schema latency: typically <10s, complex schemas up to ~1 min.
  - Refusals are surfaced via the `message.refusal` string; if there is no refusal and generation was not interrupted (`finish_reason`), the output matches the schema.
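The masking step can be illustrated with a toy stand-in for the grammar. This is a deliberately simplified sketch: a real implementation compiles the schema into a CFG, whereas here a prefix-validity check on a single target literal plays the grammar's role, and the tiny token vocabulary is invented for illustration:

```python
def mask_invalid_tokens(probs, prefix, is_valid_prefix):
    """One toy constrained-decoding step: zero the probability of every
    token whose concatenation with the current prefix cannot extend to a
    valid output, then renormalize over the surviving tokens."""
    masked = {tok: (p if is_valid_prefix(prefix + tok) else 0.0)
              for tok, p in probs.items()}
    total = sum(masked.values())
    if total == 0:
        raise ValueError("grammar rejects every continuation")
    return {tok: p / total for tok, p in masked.items()}

# toy "grammar": outputs must be prefixes of the literal {"ok": true}
TARGET = '{"ok": true}'
is_valid = lambda s: TARGET.startswith(s)

# model proposes three next tokens; only one keeps the output valid
step = mask_invalid_tokens({'{"': 0.2, 'ok': 0.5, '("': 0.3}, "", is_valid)
```

The renormalization mirrors the nucleus-sampling truncation above: invalid tokens get probability zero, and the remaining mass is rescaled so sampling stays well-defined.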
📖 Structured Outputs & JSON Mode (OpenAI API)
Reference Doc · source
Concrete SDK pattern client.responses.parse(...) / client.chat.completions.parse(...) with typed schemas + documented constraints/behavior of Structured Outputs vs JSON mode.
Key content
- Structured Outputs (SO): guarantees valid JSON + adherence to the supplied JSON Schema (`strict: true`), preventing missing required keys / invalid enums. Recommended over JSON mode when supported.
- SDK parsing workflow (Python/Pydantic):
  - Chat Completions: `client.chat.completions.parse(..., response_format=MyModel)` → `completion.choices[0].message.parsed`
  - Responses API: `client.responses.parse(..., text_format=MyModel)` → `response.output_parsed`
- When to use:
  - Function calling: bridge the model ↔ tools/functions/data.
  - response_format / text.format: structure the assistant's user-facing response (e.g., tutoring UI sections).
- Model support:
  - SO via `response_format: {type:"json_schema", json_schema:{strict:true, schema:...}}` supported on gpt-4o-mini, gpt-4o-mini-2024-07-18, gpt-4o-2024-08-06, and later snapshots.
  - JSON mode: `response_format: {type:"json_object"}` (Chat Completions) or `text.format: {type:"json_object"}` (Responses).
- Refusals: if a safety refusal occurs, the API includes a refusal field/content (programmatically detectable) rather than schema output.
- Schema constraints (SO subset):
  - Root schema must be an object (not a top-level `anyOf`).
  - All fields must be required; emulate optional via a union with `null` (e.g., `"type": ["string","null"]`).
  - Objects must set `additionalProperties: false`.
  - Limits: ≤5000 total object properties, ≤10 nesting levels; total schema string length ≤120,000 chars; ≤1000 enum values overall.
  - Key ordering in the output follows the schema key order.
- JSON mode gotcha: you must explicitly instruct the model to output JSON; the API errors if "JSON" is absent from the context; otherwise the model may emit endless whitespace.
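The strict-mode schema constraints can be enforced mechanically when building schemas by hand. A minimal sketch (the helper, the `person` schema name, and the example fields are illustrative; the payload shape follows the `response_format: json_schema` pattern above):

```python
def strict_object_schema(properties):
    """Build a JSON Schema object satisfying the Structured Outputs subset:
    root is an object, every field is listed in `required`, and
    additionalProperties is false."""
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),        # all fields must be required
        "additionalProperties": False,
    }

schema = strict_object_schema({
    "name": {"type": "string"},
    "age": {"type": ["integer", "null"]},    # "optional" via union with null
})

# request fragment in the documented response_format shape
payload = {"type": "json_schema",
           "json_schema": {"name": "person", "strict": True, "schema": schema}}
```

Because `required` is derived from the property dict, a field can never be accidentally omitted from it, which is one of the subset rules most likely to trip up hand-written schemas.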
📖 Structured Outputs (JSON Schema) — OpenAI Responses API
Reference Doc · source
Exact request/response patterns + guarantees/limits for Structured Outputs (client.responses.parse(...), schema subset, refusals, streaming, supported models)
Key content
- Guarantee (Structured Outputs): Model output always adheres to supplied JSON Schema (type-safety; no missing required keys; no invalid enum values). Distinct from JSON mode which guarantees valid JSON only, not schema adherence.
- Enable Structured Outputs (Responses API):
  - SDK pattern (Python/Pydantic): `response = client.responses.parse(model=..., input=[...], text_format=MyPydanticModel)` → parsed object at `response.output_parsed`.
  - REST/format equivalent: `text: { format: { type: "json_schema", strict: true, schema: ... } }`
- Supported models (json_schema): `gpt-4o-mini`, `gpt-4o-mini-2024-07-18`, `gpt-4o-2024-08-06`, and later. Older models use JSON mode.
- JSON mode enable: `text: { format: { type: "json_object" } }`
  - Must explicitly instruct the model to output JSON; the API errors if "JSON" is not present in the context. Risk: an endless whitespace stream if not instructed.
- Refusals: if a safety refusal occurs, the response includes `refusal` content (programmatically detectable) rather than matching the schema.
- Streaming: use `client.responses.stream(..., text_format=Schema)`; handle events like `response.output_text.delta`, `response.refusal.delta`, `response.completed`. The SDK is recommended for parsing.
- Schema subset + hard limits:
  - Types: string, number, boolean, integer, object, array, enum, anyOf.
  - Root schema must be an object (not anyOf). All fields required; emulate optional via a union with `null` (e.g., `"type": ["string","null"]`).
  - Objects must set `additionalProperties: false`.
  - Limits: ≤5000 total object properties; ≤10 nesting levels; total schema string length ≤120,000 chars; ≤1000 enum values overall; per enum property string total ≤15,000 chars when >250 values.
  - Key ordering: output keys follow the schema order.