Cnns
Video (best)
- Andrej Karpathy / Stanford CS231n — “Lecture 5: Convolutional Neural Networks”
- Watch: YouTube
- Why: CS231n Lecture 5 is the canonical academic treatment of CNNs — covers convolution mechanics, stride, padding, pooling, and feature maps with precise mathematical intuition. Karpathy’s delivery bridges theory and practice better than any other single lecture on this topic.
- Level: intermediate
Blog / Written explainer (best)
- Christopher Olah — “Conv Nets: A Modular Perspective”
- Link: https://colah.github.io/posts/2014-07-Conv-Nets-Modular/
- Why: Olah builds CNN intuition from first principles using a modular decomposition — convolution as a patch operation, pooling as downsampling, and how these compose into feature hierarchies. His visual style makes abstract spatial operations concrete. Pairs well with his companion post on understanding convolutions.
- Level: beginner/intermediate
Deep dive
- CS231n Course Notes — “Convolutional Neural Networks for Visual Recognition”
- Link: https://cs231n.github.io/convolutional-networks/
- Why: The most comprehensive freely available written reference for CNNs. Covers the full stack: convolution arithmetic, parameter sharing, pooling variants, common architectures (LeNet, AlexNet, VGG, ResNet), and practical implementation considerations. Regularly cited in research and industry onboarding.
- Level: intermediate/advanced
Original paper
- Krizhevsky, Sutskever, Hinton (2012) — “ImageNet Classification with Deep Convolutional Neural Networks” (AlexNet)
- Link: https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
- Why: AlexNet is the inflection point for modern CNNs — it demonstrated that deep convolutional architectures trained on ImageNet with GPUs and ReLU activations could dramatically outperform prior methods. Highly readable for a systems paper; directly motivates every concept in the topic (convolution, pooling, feature maps, ImageNet benchmark). The arxiv mirror is: https://arxiv.org/abs/1404.5997 [NOT VERIFIED]
- Level: intermediate
Code walkthrough
- Sebastian Raschka — “Convolutional Neural Networks from Scratch in PyTorch”
- Link: https://github.com/rasbt/machine-learning-book
- Why: Raschka’s implementations are pedagogically structured — he builds from a manual convolution operation up to a full training loop, with clear annotation of each step. His code prioritizes readability over performance, making it ideal for learners.
- Level: intermediate
Alternative (higher confidence): The official PyTorch tutorial “Training a Classifier” at https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html is a well-maintained, verified code walkthrough that covers CNN construction, feature maps, and training on a real dataset (CIFAR-10, closely related to ImageNet-scale thinking).
Coverage notes
- Strong: Core convolution mechanics, pooling, feature maps, ImageNet benchmarking, AlexNet-era architecture design — all extremely well covered across video, blog, and reference materials.
- Weak: The transition from CNNs to Vision Transformers (ViT) and DINOv2 specifically is underserved in single resources — most CNN resources predate or treat ViT as a separate topic.
- Gap: No single excellent resource cleanly bridges CNNs → DINOv2 (self-supervised ViT) in one narrative. For the
intro-to-multimodalcourse context, instructors will need to supplement with the DINOv2 paper (https://arxiv.org/abs/2304.07193) and Yannic Kilcher’s ViT walkthrough separately. The related conceptvision transformerlikely needs its own resource card.
Cross-validation
This topic appears in 2 courses: intro-to-multimodal, ml-engineering-foundations
- For
intro-to-multimodal: Emphasis should be on CNNs as feature extractors and the conceptual bridge to ViT/DINOv2. The Olah blog and CS231n notes cover the extractor framing well. - For
ml-engineering-foundations: Emphasis should be on implementation, parameter counts, memory/compute tradeoffs, and practical training. The PyTorch CIFAR-10 walkthrough and CS231n notes are most relevant here.
Additional Resources for Tutor Depth
10 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 DINOv2 (LVD-142M curation + eval results)
Paper · source
DINOv2 details: LVD-142M dataset curation/processing, evaluation protocols, ImageNet + downstream results, linear vs k-NN vs finetune, ablations, ViT sizes/patch sizes/distillation.
Key content
- LVD-142M data pipeline (Sec. 3):
- Start from 1.2B unique web images after URL filtering + postprocess: PCA-hash dedup, NSFW filtering, blur identifiable faces.
- Remove near-duplicates via copy detection (Pizzi et al. 2022) and remove near-duplicates of any benchmark val/test images.
- Self-supervised retrieval curation: embed images with self-supervised ViT-H/16 pretrained on ImageNet-22k; use cosine similarity; k-means cluster uncurated pool; retrieve typically k=4 nearest neighbors per curated query (more neighbors increased “collisions”).
- Uses Faiss GPU IVF+PQ; processing distributed on 20 nodes × 8×V100-32GB, <2 days.
- Training objective (Sec. 4): combined DINO (image-level) + iBOT (patch-level masked tokens) with SwAV centering.
- DINO loss: cross-entropy between student/teacher prototype distributions from [CLS] token (teacher is EMA of student).
- iBOT loss: cross-entropy on masked patch tokens (student sees masks; teacher sees visible patches).
- Sinkhorn-Knopp centering: 3 iterations for teacher; student uses softmax.
- KoLeo regularizer: (L=\frac{1}{n}\sum_i \log d_i), (d_i=\min_{j\neq i}|x_i-x_j|) on L2-normalized features (encourages spread).
- Resolution schedule: short high-res phase: train at 224, then +10k iters at 518.
- Ablation (Table 1, ViT-L on INet-22k): iBOT kNN 72.9 / linear 82.3 → DINOv2 kNN 82.0 / linear 84.5; key steps include 128k prototypes, patch size 14, teacher momentum 0.994, batch size 3k, untying DINO/iBOT heads.
- Data source impact (Table 2, ViT-g/14): Uncurated INet-1k 83.3, Im-A 59.4 vs LVD-142M INet-1k 85.8, Im-A 73.9, Oxford-M 64.6.
- Loss ablations (Table 3): KoLeo improves Oxford-M 55.6→63.9; removing MIM drops ADE20k 47.1→44.2.
- Distillation (Sec. 5, Table in 6.5): ViT-L/14 scratch INet-1k 84.5 vs distill 86.3 (teacher ViT-g/14 scratch 86.5).
- ImageNet linear eval (Table 4, frozen): DINOv2 ViT-B/14 84.5, ViT-L/14 86.3, ViT-g/14 86.5 (val top-1).
- Finetuning sanity check (Table 5): ViT-g/14 linear 86.5@224 → finetuned 88.5; 86.7@448 → 88.9.
- Robustness (Table 6, linear head): DINOv2 ViT-g/14: Im-A 75.9, Im-R 78.8, Sketch 62.5.
- Dense tasks (Table 10): ADE20k mIoU (linear): OpenCLIP-G 39.3 vs DINOv2 ViT-g/14 49.0; +ms: 46.0 vs 53.0.
📄 DeiT/ViT tokenization + distillation token (DeiT)
Paper · source
Practical ViT token pipeline (patch→tokens + class token + positional embeddings) and DeiT’s distillation token + key training/accuracy impacts.
Key content
- Self-attention (Eq. 1):
[ \text{Attention}(Q,K,V)=\text{Softmax}(QK^\top/\sqrt{d})V ] with queries (Q\in\mathbb{R}^{N\times d}), keys/values (K,V\in\mathbb{R}^{k\times d}) (self-attn uses (k=N)). (Q=XW^Q, K=XW^K, V=XW^V) for token matrix (X\in\mathbb{R}^{N\times D}). - ViT tokenization (Section 3): image split into (N) patches of size (16\times16) (for 224px: (N=14\times14)). Each patch (dim (3\cdot16\cdot16=768)) is linearly projected to embedding dim (D). Positional embeddings (fixed or trainable) are added before transformer blocks.
- Class token (Section 3): trainable vector appended → sequence length (N+1). Only class token is used for classification head; forces attention to route patch info into class token.
- Resolution change: keep patch size fixed ⇒ (N) changes; interpolate positional embeddings when fine-tuning at higher resolution (ViT approach). DeiT uses bicubic interpolation to better preserve embedding norms (bilinear reduced norms harmed accuracy without fine-tuning).
- Distillation losses:
Soft KD (Eq. 2):
(L=(1-\lambda)L_{CE}(\psi(Z_s),y)+\lambda\tau^2 KL(\psi(Z_s/\tau),\psi(Z_t/\tau))). Defaults: (\tau=3.0,\lambda=0.1).
Hard KD (Eq. 3): (y_t=\arg\max_c Z_t(c)),
(L=\tfrac12 CE(\psi(Z_s),y)+\tfrac12 CE(\psi(Z_s),y_t)). - Distillation token (Section 4): add a new token alongside class+patch tokens; its head predicts teacher label (hard). At test time: use class head, distill head, or late-fuse by summing softmax outputs.
- Empirical (Table 3, ImageNet): DeiT-B 224: 81.8%; usual soft distill: 81.8%; hard distill: 83.0%; DeiT-B⚗ class: 83.0%, distill: 83.1%, class+distill fusion: 83.4%. At 384: DeiT-B 83.1% vs DeiT-B⚗ fusion 84.5%.
- Teacher choice (Table 2): convnet teachers (RegNetY-16GF teacher acc 82.9%) yield better student; DeiT-B⚗ at 384 reaches 84.2% with RegNetY-16GF teacher.
📄 Inception v1 / GoogLeNet (ILSVRC14) — efficiency via multi-branch + 1×1 bottlenecks
Paper · source
ImageNet results + Inception module rationale (1×1 reductions, multi-scale branches, auxiliary classifiers) and compute/parameter efficiency
Key content
- Compute/parameter efficiency targets (Intro):
- Designed around ~1.5 billion multiply-adds at inference (computational budget).
- GoogLeNet uses ~12× fewer parameters than Krizhevsky et al. (AlexNet, ILSVRC12 winner) while being more accurate.
- Inception module design (Section 4, Fig. 2):
- Parallel branches: 1×1 conv, 3×3 conv, 5×5 conv, and 3×3 max-pool, then concatenate along channels.
- 1×1 convolutions used as dimension reduction (“#3×3 reduce”, “#5×5 reduce”) before expensive 3×3/5×5 convs to prevent compute blow-up; also used as pool projection after pooling.
- Rationale: approximate an “optimal sparse” structure with dense ops; multi-scale processing (different receptive fields) aggregated per module.
- Depth / architecture facts (Section 4):
- GoogLeNet is 22 layers deep (counting layers with parameters); Inception modules stacked with occasional stride-2 pooling.
- Uses ReLU throughout (including reduction/projection layers).
- Auxiliary classifiers (Section 5):
- Added at outputs of Inception (4a) and (4d) to help gradient propagation + regularization.
- Aux loss weight = 0.3 added to main loss during training; discarded at inference.
- Aux head: avg pool 5×5, stride 3 → 1×1 conv 128 → FC 1024 → dropout 70% dropped → softmax (1000 classes).
- Training/testing procedure (Sections 6–7):
- Async SGD, momentum 0.9; learning rate decreased by 4% every 8 epochs; Polyak averaging for final model.
- ILSVRC metric: top-5 error (correct if GT in top 5).
- Key empirical results (Section 7):
- Final ILSVRC14 submission: top-5 error = 6.67% (validation and test), 7-model ensemble with 144 crops.
- Detection: ensemble of 6 GoogLeNets improves region classification accuracy 40% → 43.9%; proposal reduction to ~60% of R-CNN proposals while coverage 92% → 93%; ~+1% mAP single-model effect from proposal changes.
📄 ResNet—Residual blocks, degradation problem, ImageNet depth results
Paper · source
ImageNet benchmark across depths (18/34/50/101/152), residual-block rationale + optimization evidence (degradation)
Key content
- Degradation problem (Intro, Fig.1/Fig.4): Increasing depth in plain nets can increase training error (not just overfitting). Example: on ImageNet, 34-layer plain has higher training error throughout training than 18-layer plain (Fig.4 left), despite BN and healthy gradient norms.
- Residual learning reformulation (Sec.3.1):
- Desired mapping: H(x). Residual mapping: F(x) := H(x) − x.
- Output with shortcut: y = F(x, {Wi}) + x (Eq.1).
- If dimensions differ: y = F(x, {Wi}) + Ws x (Eq.2), where Ws is a linear projection (typically 1×1 conv) used mainly for dimension matching.
- Rationale: If identity mapping is optimal, easier to push F(x) → 0 than to make stacked nonlinear layers approximate identity.
- Empirical optimization benefit (Sec.4, Fig.4): With residual blocks, situation reverses: ResNet-34 > ResNet-18 and shows lower training error + better validation generalization (Fig.4 right).
- Bottleneck design for deeper nets (Sec.3.3): Residual block uses 1×1, 3×3, 1×1 conv stack (Fig.5) for efficiency; identity shortcuts preferred; projection shortcuts can increase cost.
- Complexity & depth (Sec.4):
- ResNet-50: ~3.8B FLOPs.
- ResNet-152: ~11.3B FLOPs, still < VGG-16/19 (15.3/19.6B FLOPs).
- Headline ImageNet results (Abstract/Table 5): 152-layer evaluated; ensemble top-5 test error = 3.57% (ILSVRC 2015 winner). Single-model 152-layer top-5 val error = 4.49%.
- Training defaults (Sec.3.4): SGD, batch size 256, LR 0.1 then ÷10 on plateau; BN after each conv before activation; weight decay 0.0001, momentum 0.9; no dropout; global average pooling + 1000-way FC.
📄 VGG (Very Deep ConvNets) — depth + 3×3 stacks on ImageNet
Paper · source
Canonical ImageNet error tables across depth variants (11/13/16/19) + rationale for stacking 3×3 convs vs larger kernels
Key content
- Architecture defaults (Sec. 2.1):
- Input: 224×224 RGB; preprocess: subtract mean RGB (train-set mean).
- Convs: 3×3 filters, stride 1, padding 1 (preserve spatial size). Optional 1×1 conv in one config.
- Pooling: 5 max-pool layers, 2×2 window, stride 2.
- Classifier: 3 FC layers: 4096, 4096, 1000 + softmax. ReLU on hidden layers.
- LRN: tested (A-LRN) but no improvement → omitted in deeper nets (B–E).
- Depth variants (Sec. 2.2): configs differ mainly by depth: A=11 (8 conv + 3 FC) … E=19 (16 conv + 3 FC). Channels start 64, double after each max-pool up to 512.
- Design rationale (Sec. 2.3): effective receptive field + parameter count
- Two 3×3 convs ⇒ 5×5 effective RF; three 3×3 ⇒ 7×7 effective RF.
- Parameter comparison (C channels in/out):
- 3-layer 3×3 stack: 3·(3²)·C² = 27C² weights
- single 7×7: 7²·C² = 49C² weights (81% more)
- Benefit: more non-linearities (3 ReLUs vs 1) + fewer params (regularizing decomposition).
- Training hyperparameters (Sec. 3.1):
- Mini-batch SGD: batch 256, momentum 0.9, weight decay 5e−4, dropout 0.5 (first two FC).
- LR: start 1e−2, drop ×10 when val stalls; 3 drops total; stop 370k iters (~74 epochs).
- Init: weights ~ N(0, 1e−2), biases 0; deeper nets partially initialized from net A.
- Scale: train at S=256, optionally fine-tune at S=384 (LR 1e−3); also scale jittering with S sampled from [Smin, Smax].
- Testing workflow (Sec. 3.2):
- Resize so min side Q ≥ 224; convert FC→conv (FC1→7×7 conv; FC2/FC3→1×1 conv); apply densely over whole image; spatially average (sum-pool) score map; average with horizontal flip.
- Key empirical findings (Sec. 4):
- Error decreases with depth 11→16 layers; 19 layers saturates (no further gain).
- Config D (all 3×3) beats Config C (includes 1×1) despite same depth → spatial context matters.
- Replacing pairs of 3×3 with single 5×5 (same RF) yields top-1 error ~7% higher (shallower worse).
- Best single-network reported with scale jittering: 25.9% top-1 / 8.0% top-5 error.
- Ensemble: 2 nets achieve 6.8% top-5 test error (multi-crop & dense eval).
📖 Torchvision vit_b_16 (Vision Transformer Base, patch=16) API + Weights/Transforms
Reference Doc · source
Official builder signature, available pretrained weight enums, and exact inference preprocessing transforms (resize/crop/normalize), plus key metrics (acc, params, GFLOPs, min input size).
Key content
- Model builder (signature):
torchvision.models.vit_b_16(*, weights=None, progress=True, **kwargs) -> VisionTransformerweights: optional pretrained weights enum (or string like'DEFAULT').progress: download progress bar (defaultTrue).**kwargs: forwarded totorchvision.models.vision_transformer.VisionTransformerbase class.
- Weights enum:
torchvision.models.ViT_B_16_WeightsViT_B_16_Weights.DEFAULT == ViT_B_16_Weights.IMAGENET1K_V1
- Empirical results / model stats (ImageNet-1K):
IMAGENET1K_V1: acc@1 81.072, acc@5 95.318, 86,567,656 params, 17.56 GFLOPs, min_size 224×224, file 330.3 MB.IMAGENET1K_SWAG_E2E_V1(end-to-end fine-tune): acc@1 85.304, acc@5 97.65, 86,859,496 params, 55.48 GFLOPs, min_size 384×384, file 331.4 MB.IMAGENET1K_SWAG_LINEAR_V1(frozen trunk + linear head): acc@1 81.886, acc@5 96.18, 86,567,656 params, 17.56 GFLOPs, min_size 224×224, file 330.3 MB.
- Inference preprocessing (via
weights.transforms): acceptsPIL.Imageortorch.Tensorimages: single(C,H,W)or batched(B,C,H,W).- Common final steps: rescale to [0.0, 1.0], then normalize with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225].
IMAGENET1K_V1.transforms: resize [256] (BILINEAR) → center crop [224].SWAG_E2E_V1.transforms: resize [384] (BICUBIC) → center crop [384].SWAG_LINEAR_V1.transforms: resize [224] (BICUBIC) → center crop [224].
📋 # Source: https://docs.pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
Source ·
📋 # Source: https://docs.pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html
Source ·
🔍 Convolution / Pooling / Transposed Conv Output-Size Arithmetic
Explainer · source
Output-size formulas for conv/pooling/transposed conv (incl. padding/stride/dilation), with canonical “same/full” cases.
Key content
- Per-axis independence (Sec. 2): compute output size separately for each axis (j) using input (i_j), kernel/window (k_j), stride (s_j), padding (p_j). (Paper often shows square 2D case: (i,k,s,p).)
- Direct convolution output size
- No padding, stride 1 (Rel. 1): (o = (i-k)+1).
- Padding (p), stride 1 (Rel. 2): (o = (i-k)+2p+1) (effective input (i+2p)).
- “Same” / half padding, stride 1, odd (k=2n+1) (Rel. 3): (p=\lfloor k/2\rfloor=n \Rightarrow o=i).
- “Full” padding, stride 1 (Rel. 4): (p=k-1 \Rightarrow o=i+(k-1)).
- No padding, stride (s) (Rel. 5): (o=\left\lfloor\frac{i-k}{s}\right\rfloor+1).
- General padding + stride (Rel. 6): (o=\left\lfloor\frac{i+2p-k}{s}\right\rfloor+1).
- Pooling output size (Sec. 3, Rel. 7): same as strided conv without padding: (o=\left\lfloor\frac{i-k}{s}\right\rfloor+1).
- Transposed convolution output size
- Stride 1 (Rel. 9): for conv with kernel (k), padding (p): transposed has (p’ = k-p-1), output (o’ = i’ + (k-1) - 2p). Special cases:
- No padding (Rel. 8): (p=0 \Rightarrow p’=k-1,\ o’=i’+(k-1)).
- Same padding (Rel. 10): (p=\lfloor k/2\rfloor \Rightarrow o’=i’).
- Full padding (Rel. 11): (p=k-1 \Rightarrow p’=0,\ o’=i’-(k-1)).
- Stride (s>1) (Rel. 14): (o’ = s(i’-1) + a + k - 2p), where (a=(i+2p-k)\bmod s) (disambiguates cases); conceptual “stretching” inserts (s-1) zeros between input units.
- Stride 1 (Rel. 9): for conv with kernel (k), padding (p): transposed has (p’ = k-p-1), output (o’ = i’ + (k-1) - 2p). Special cases:
- Dilated conv (Sec. 5.1): effective kernel (\hat{k}=k+(k-1)(d-1)); output (Rel. 15):
(o=\left\lfloor\frac{i+2p-k-(k-1)(d-1)}{s}\right\rfloor+1).
📋 DINOv2 model loading + eval entry points (PyTorch Hub)
Code · source
Official model-loading/usage patterns, model variants, and evaluation commands.
Key content
- What DINOv2 provides (design rationale): pretrained self-supervised ViT backbones that yield robust visual features usable “as-is” with simple classifiers (e.g., linear layers) across tasks/domains without fine-tuning. Pretrained on 142M images with no labels/annotations.
- Backbone variants + params (empirical): ViT-S/14 distilled 21M params (table lists additional ViT-B/14, ViT-L/14, ViT-g/14; also “with registers” variants).
- ImageNet performance example (empirical): ViT-S/14 distilled (no registers): 79.0% k-NN, 81.1% linear (ImageNet-1k).
- Load pretrained backbones (procedure):
import torch dinov2_vits14 = torch.hub.load('facebookresearch/dinov2','dinov2_vits14') dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2','dinov2_vitb14') dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2','dinov2_vitl14') dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2','dinov2_vitg14') # with registers dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2','dinov2_vits14_reg') - Load full classifier (“lc”) models (procedure):
dinov2_vitb14_lc = torch.hub.load('facebookresearch/dinov2','dinov2_vitb14_lc') dinov2_vitb14_reg_lc = torch.hub.load('facebookresearch/dinov2','dinov2_vitb14_reg_lc') - Evaluate pretrained weights on ImageNet-1k (procedure):
python dinov2/run/eval/linear.py \ --config-file dinov2/configs/eval/vitg14_pretrain.yaml \ --pretrained-weights https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth \ --train-dataset ImageNet:split=TRAIN:root=<ROOT>:extra=<EXTRA> \ --val-dataset ImageNet:split=VAL:root=<ROOT>:extra=<EXTRA> - Training defaults/examples (procedure + parameters):
- Requires PyTorch 2.0 + xFormers 0.0.18 (Linux-tested).
- Example runs: 4 nodes / 32 GPUs A100-80GB, config
vitl16_short.yaml, ~1 day, reaches 81.6% k-NN / 82.9% linear; teacher weights saved every 12500 iterations. - Larger: 12 nodes / 96 GPUs, config
vitl14.yaml, ~3.3 days, 82.0% k-NN / 84.5% linear.
- Extra model entry points (procedure):
- dino.txt hub id:
dinov2_vitl14_reg4_dinotxt_tet1280d20h24l. - XRay-DINO / Cell-DINO / Channel-Adaptive DINO loaded via
torch.hub.load(REPO_DIR, ..., source='local', weights=... or pretrained_path/pretrained_url=...).
- dino.txt hub id: