LoRA and PEFT
Video (best)
- Andrej Karpathy — “State of GPT” (covers fine-tuning landscape including LoRA/PEFT in context)
- Watch: YouTube
- Why: Karpathy provides authoritative, intuitive framing of why parameter-efficient fine-tuning matters and where LoRA fits in the modern LLM training pipeline. Accessible to practitioners without sacrificing technical depth.
- Level: intermediate
Note: A more directly focused alternative is Sebastian Raschka’s dedicated LoRA explainer videos on YouTube — search “Sebastian Raschka LoRA” to verify current best candidate. No single canonical 3Blue1Brown/Karpathy video exists that is exclusively about LoRA.
Blog / Written explainer (best)
- Sebastian Raschka — “Parameter-Efficient LLM Fine-Tuning With Low-Rank Adaptation (LoRA)”
- Link: https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
- Why: Raschka systematically explains the mathematical intuition behind low-rank decomposition, compares LoRA to other PEFT methods (prefix tuning, adapters), and includes practical guidance. His writing bridges theory and implementation better than most sources for this specific topic.
- Level: intermediate
Deep dive
- Lilian Weng — “Parameter-Efficient Transfer Learning”
- Link: https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/
- Why: Weng’s blog posts are the gold standard for comprehensive, well-cited technical surveys. Her coverage of adapter methods, prompt tuning, and LoRA variants provides the broadest and most rigorous reference for understanding the full PEFT landscape including QLoRA and multi-modal extensions.
- Level: advanced
Note: the link above (https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/) may not be the exact post; her PEFT-specific post should be verified. Search lilianweng.github.io for “fine-tuning” or “PEFT”. [NOT VERIFIED]
Original paper
- Hu et al. (2021) — “LoRA: Low-Rank Adaptation of Large Language Models”
- Link: https://arxiv.org/abs/2106.09685
- Why: This is the seminal, clearly written paper that introduced LoRA. The authors provide strong motivation, clean mathematical formulation (W = W₀ + BA where B and A are low-rank matrices), and empirical results across GPT-2/3 and RoBERTa. Unusually readable for a systems paper. QLoRA (arxiv.org/abs/2305.14314) is the essential follow-on for quantization-aware fine-tuning.
- Level: intermediate
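To make the W = W₀ + BA formulation concrete, here is a minimal sketch of the trainable-parameter arithmetic for a single weight matrix. The 4096×4096 shape and rank 8 are illustrative assumptions, not figures taken from the paper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # full fine-tuning trains every entry of W0
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


# Illustrative shape and rank (assumed, not from the paper).
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, lora, f"{100 * lora / full:.2f}% of full")
```

Because the LoRA count grows with r·(d_out + d_in) rather than d_out·d_in, the saving widens as the matrices get larger.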
Code walkthrough
- Hugging Face — PEFT library documentation and LoRA fine-tuning notebook
- Link: https://github.com/huggingface/peft
- Why: The official PEFT library by Hugging Face is the de facto implementation standard. Their example notebooks cover LoRA, QLoRA, and multi-modal fine-tuning (including LLaVA-style VLMs) with working code. Directly maps to how practitioners implement these methods in production. The examples/ directory includes causal LM and sequence classification walkthroughs.
- Level: intermediate
Supplementary code resource: Tim Dettmers’ QLoRA repository (github.com/artidoro/qlora) is the canonical reference for quantized LoRA implementation.
Coverage notes
- Strong: Core LoRA mathematics, PEFT comparison, QLoRA, LLM fine-tuning workflows — well covered across the resources above
- Weak: LoRA specifically for Vision-Language Models (VLMs) and multi-modal fine-tuning — fewer dedicated tutorials exist; most resources treat this as an extension of LLM LoRA
- Gap: No single excellent YouTube video exists that covers both LoRA fundamentals AND its application to VLMs (lora-for-vlms, visual instruction tuning) in one place. The multi-modal fine-tuning angle (relevant to intro-to-multimodal) requires piecing together the LLaVA paper and the PEFT docs. No 3Blue1Brown or Yannic Kilcher video is exclusively dedicated to LoRA/PEFT as of knowledge cutoff.
Cross-validation
This topic appears in 2 courses: intro-to-llms, intro-to-multimodal
- For intro-to-llms: The LoRA paper + Raschka blog + PEFT code walkthrough form a complete unit covering adapter methods, low-rank adaptation, and QLoRA.
- For intro-to-multimodal: Additional coverage of visual instruction tuning and LoRA-for-VLMs is needed. The LLaVA paper (arxiv.org/abs/2304.08485) and InstructBLIP serve as companion readings for the multi-modal fine-tuning angle. The PEFT library’s multimodal examples are the best available code resource for this gap.
Additional Resources for Tutor Depth
10 sources — papers, official docs, working code, benchmarks, and deep explainers that give the AI tutor precision on this topic.
📄 LoRAServe—rank-aware distributed serving for heterogeneous LoRA
Paper · source
Cluster-level design to serve many LoRA adapters with heterogeneous ranks; quantifies rank interference + dynamic placement/routing + RDMA-based remote adapter access.
Key content
- Problem (rank heterogeneity interference): Multi-tenant LoRA kernels (Punica BGMV, S-LoRA MBGMV) size compute tiles/pipelines to the maximum rank in the batch, so low-rank requests “pay” for high-rank ones → tail latency skew. Example (Sec. I/III-A5, Fig.1): co-serving rank-8 + rank-128 on Llama-7B increases P95 TTFT of rank-8 by 84% vs serving only rank-8.
- SLO impact: Common SLO cited: P95 TTFT < 10s (Sec. III-A4). Under a 4 RPS Poisson workload with P95 TTFT SLO=20s, ranks 64/128 violate SLO while smaller ranks do not (Fig.6).
- Scaling effects: Rank heterogeneity penalty grows with model size: up to 45% degradation on Llama-70B (Sec. III-A2). Tensor parallelism reduces but doesn’t remove it: with TP=8, rank-128 still causes ~20% TTFT increase vs rank-8 on Llama-7B (Sec. III-A3).
- Memory pressure numbers: For a 200B model quantized to 8-bit, base size ≈ 200GB; LoRA adapters ≈ 1% of model → ~2GB/adapter; 500 adapters ≈ 1TB if replicated per server (Sec. I).
- LoRAServe architecture (Sec. IV): Cluster orchestrator maintains routing table with tuples (adapter a, servers S, probabilities p); route to server s with probability p_s, with ∑_{s∈S} p_s = 1. If adapter absent locally, fetch from remote server via GPUDirect RDMA over InfiniBand, then cache in host memory.
- Placement algorithm (Alg.1, Sec. IV-A): Per timestep: (1) estimate TPS demand per adapter; (2) compute per-rank server budget using profiled rank operating points under SLO (max TPS per rank); (3) fractional bin packing for ranks with budget; (4) place remaining adapters on servers with higher max-rank capacity; (5) permute to minimize deviation from previous placement; (6) update routing + metadata.
- Empirical gains (Abstract/Sec. V-F): On Company X traces: up to 2× throughput, up to 9× lower TTFT, and up to 50% fewer GPUs vs SOTA; reduces per-server adapter storage footprint up to 16× vs Toppings.
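The routing-table idea above (adapter, servers S, probabilities p with the p_s summing to 1) can be sketched as follows. This is a toy illustration only: the table layout, adapter names, and server names are hypothetical and do not reflect LoRAServe's actual data structures.

```python
import random


def validate_routing(table):
    """Check that each adapter's server probabilities form a distribution."""
    for adapter, (servers, probs) in table.items():
        if len(servers) != len(probs):
            raise ValueError(f"{adapter}: servers/probabilities length mismatch")
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError(f"{adapter}: probabilities must sum to 1")


def route(table, adapter, rng):
    """Pick a server s for this adapter with probability p_s."""
    servers, probs = table[adapter]
    return rng.choices(servers, weights=probs, k=1)[0]


# Hypothetical table: adapter -> (servers S, probabilities p).
routing_table = {
    "adapter-a": (["s0", "s1"], [0.75, 0.25]),
    "adapter-b": (["s2"], [1.0]),
}
validate_routing(routing_table)

rng = random.Random(0)
picks = [route(routing_table, "adapter-a", rng) for _ in range(1000)]
print(picks.count("s0"), picks.count("s1"))
```

Splitting a hot adapter across several servers with fractional probabilities is what lets the placement step treat demand as divisible (the fractional bin packing in step 3).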
📄 MoReS (LLaVA Steering) — VLM PEFT with extreme parameter reduction
Paper · source
VLM-specific PEFT results/ablations: where to add steering modules, parameter counts, benchmark impacts in LLaVA-style visual instruction tuning
Key content
- Autoregressive conditioning (Eq. 1, Sec. 3):
(p(\hat{y})=\prod_{i=1}^{L} p(\hat{y}_i \mid \hat{y}_{<i}, R_{\text{text}}, R_{\text{image}}, R_{\text{sys}})).
(\hat{y}_i): i-th output token; (R_{\text{text}}), (R_{\text{image}}): text/vision representations; (R_{\text{sys}}): system context; (L): output length.
- Modality balance metric LMAR (Eq. 2, Sec. 3):
(\text{LMAR}_l=\frac{1}{N}\sum_{i=1}^{N}\frac{\alpha^{l}_{\text{image},i}}{\alpha^{l}_{\text{text},i}}).
(\alpha^{l}_{\text{image},i}), (\alpha^{l}_{\text{text},i}): mean per-token attention to visual/text tokens at layer (l) for sample (i); (N): samples. LMAR≈1 implies balanced per-token attention (important because vision tokens can be ~576 vs dozens of text tokens).
- MoReS steering (Eqs. 3–4, Sec. 4): freeze LLM; insert per-layer linear steering on visual tokens in a low-dim subspace.
(\text{MoReS}(h)=W_{\text{up}}\cdot \phi(h)); (\phi(h)=\text{Linear}(h)-W_{\text{down}}h).
(h\in\mathbb{R}^D), (W_{\text{down}}\in\mathbb{R}^{d\times D}), (W_{\text{up}}\in\mathbb{R}^{D\times d}), (d<D); constraint (W_{\text{down}}W_{\text{up}}^{T}=I_D).
- Training procedure defaults (Sec. 5): LLaVA-1.5 recipe; visual instruction tuning on LLaVA-665k; apply MoReS in each LLM layer but only to 1% of visual tokens (sparse steering).
- Multi-task SFT results (Table 1, LLaVA Steering-3B):
Trainable params in LLM (TP*): FT 2.78B, Adapter 83M, LoRA 188.7M, OFT 39.3M, IA3 0.49M, MoReS-B 0.164M, MoReS-L 0.328M, MoReS-H 0.655M.
MoReS-H: POPE 88.2, MMMU 35.8, SciQA-IMG 71.9, MM-Vet 31.1; achieves 287–1150× fewer TP than LoRA (depending on setup).
- Hallucination mitigation (Table 6): MoReS best: POPE Acc 88.2 (vs Full 87.2; LoRA 86.7); HallucinationBench Hard Acc 42.6 (vs IA3 39.3; Full 37.4).
- Ablations (Sec. 5.7):
- Subspace rank (Table 7): rank=1 best avg 81.8 across 4 tasks with 0.164M TP (rank 2: 0.328M; rank 4: 0.655M; rank 8: 1.340M).
- Steered visual token ratio (Table 8): 1% best overall (e.g., SciQA-IMG 89.7, IconQA-blank 94.1); dense 100% hurts (SciQA-IMG 85.8, IconQA-txt 67.7).
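As a worked illustration of the LMAR metric (Eq. 2) summarized above, the sketch below averages the per-sample ratio of mean attention on image tokens to mean attention on text tokens at one layer. The input values are made up for the example; in practice they would be read from a model's attention maps.

```python
def lmar(image_attn, text_attn):
    """Layer-wise modality attention ratio.

    image_attn[i] / text_attn[i] are the mean per-token attention values
    on visual / text tokens for sample i at a single layer.
    """
    if not image_attn or len(image_attn) != len(text_attn):
        raise ValueError("need equal-length, non-empty inputs")
    ratios = [a_img / a_txt for a_img, a_txt in zip(image_attn, text_attn)]
    return sum(ratios) / len(ratios)


# Made-up values: sample 0 under-attends to the image, sample 1 is balanced.
score = lmar(image_attn=[0.001, 0.002], text_attn=[0.002, 0.002])
print(score)  # values well below 1 indicate text-dominated attention
```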
📄 MobileVLM training + LoRA insertion points in VLM stacks
Paper · source
Reproducible VLM pipeline (frozen vision encoder + projector + LLM) + 2-step VLM training and LoRA results for PEFT discussion.
Key content
- Architecture (Sec. 3.1): 3 parts: (1) vision encoder, (2) LLM (MobileLLaMA), (3) efficient projector (LDP) aligning vision→text embedding space.
- Eq. (1) (Sec. 3.1): Projector maps visual embeddings to LLM word-embedding dimension:
- Input visual tokens: (Z \in \mathbb{R}^{N \times D_v}) (N patches/tokens, (D_v) vision hidden size).
- Output image tokens: (V \in \mathbb{R}^{M \times D_t}) (M visual tokens after compression/alignment, (D_t) LLM embedding size).
- Eq. (2): Autoregressive generation conditioned on multimodal tokens (image tokens + text tokens) to produce output length (L).
- Projector rationale (Sec. 3.4): Q-Former can lose spatial info + slow convergence + inefficient on edge; plain MLP keeps spatial info but injects many useless/background tokens → slows inference. LDP uses depthwise conv and stride-2 downsampling to reduce tokens while preserving spatial structure.
- Token reduction & quality (Sec. 5.1): LDP reduces visual tokens 576 → 144 (−75%) with equivalent or sometimes better benchmark performance vs baseline.
- Resolution vs token strategy (Sec. 5.3, Table 11): Keeping 144 tokens via LDP beats reducing input resolution (RIR):
- LDP: GQA 56.1, SQA 54.7, VQA 41.5, POPE 84.5, MME 1196.2, MMB 53.2
- RIR: GQA 53.9, SQA 53.1, VQA 37.1, POPE 81.5, MME 1072.5, MMB 46.7
- VLM training procedure (Sec. 4.1): Two-step multimodal training (like LLaVA/mPLUG style):
- Pre-train: freeze vision encoder + LLM, train projector only on CC-595K for 1 epoch, lr 2e-3, batch 256.
- Instruction tuning: fine-tune projector + LLM on LLaVA-Instruct-158K for 1 epoch, lr 2e-5, batch 128. Optimizer AdamW, no weight decay, cosine LR, 3% warmup.
- LoRA PEFT result (Sec. 4.4): During visual instruction tuning, freeze all LLM params except LoRA; trainable params are 8.87% (1.4B) and 7.41% (2.7B) of full LLM; LoRA config r=128, α=256; achieves comparable performance to full finetuning on 6 benchmarks.
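The 576 → 144 token reduction reported for LDP follows from stride-2 downsampling of a square token grid. The sketch below checks that arithmetic with the standard convolution output-size formula; the 24×24 grid (since 24² = 576) and the kernel size 3 / padding 1 are assumptions made for illustration, not values quoted from the paper.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


side = 24                                   # assumed 24 x 24 grid = 576 tokens
new_side = conv_out_size(side, kernel=3, stride=2, padding=1)  # assumed k=3, p=1

tokens_before = side * side
tokens_after = new_side * new_side
reduction = 1 - tokens_after / tokens_before
print(tokens_before, tokens_after, f"{reduction:.0%} fewer tokens")
```

Halving each spatial side quarters the token count, which is why a single stride-2 stage accounts for the full −75%.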
📄 PEFT A2Z — PEFT taxonomy + core fine-tuning equations
Paper · source
Broad PEFT survey spanning LLMs/VLMs; taxonomy + mechanisms (LoRA, adapters, prefix/prompt, BitFit) and efficiency motivations.
Key content
- Scaled dot-product attention (Eq. 2, Sec. 3.1):
Given token embeddings (X), projections (Q=XW_Q,\ K=XW_K,\ V=XW_V) (Eq. 1).
[ \text{Attn}(Q,K,V)=\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V ] where (d_k) is key/query head dimension; scaling stabilizes softmax.
- Multi-head attention (Sec. 3.3): per-head attention computed independently then concatenated and projected:
[ \text{MHA}(X)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)W_O ]
- FFN (Sec. 3.4): position-wise two-layer MLP: ( \text{FFN}(x)=\sigma(xW_1+b_1)W_2+b_2).
- Full fine-tuning objective (Eq. 12, Sec. 3.6):
[ \theta^*=\arg\min_\theta \sum_{(x,y)\in D}\mathcal{L}(f_\theta(x),y) ] Gradient update (Eq. 13): (\theta_{t+1}=\theta_t-\eta\nabla_\theta \mathcal{L}), learning rate (\eta).
- LM pretraining losses (Sec. 3.5): MLM loss (Eq. 10) predicts masked token (x_i) from (x_{\setminus i}); AR loss (Eq. 11) predicts (x_i) from prefix (x_{<i}).
- Design rationale for PEFT (Intro/Sec. 3.7): full FT updates all parameters → high memory for parameters+gradients+optimizer states; prone to overfitting on small data + catastrophic forgetting; PEFT updates a small structured subset (e.g., adapters/LoRA/BitFit/prefix/prompt) to reduce compute/storage and act as implicit regularization.
- Taxonomy (Sec. 5): five families—additive, selective, reparameterized, hybrid, MoE-based/unified—to compare trade-offs (efficiency vs. performance vs. complexity).
- Efficiency procedures (Sec. 4): precision-aware quantization (e.g., 2/4-bit for less critical params; 8/16-bit for sensitive layers), activation checkpointing, gradient offloading, reversible fine-tuning, KV-cache optimization (hierarchical storage; entropy-based pruning), structured pruning (layer-wise adapter pruning; channel-wise LoRA pruning).
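The scaled dot-product attention formula summarized above can be sketched in plain Python for tiny inputs. This is a didactic, single-head version on nested lists (no masking, no batching); the helper names and example values are chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


q = [[0.0, 0.0], [1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
print(out)
```

The all-zero query scores every key equally, so its output row is the plain average of the value rows; the second query leans toward the first key.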
📄 QLoRA core procedure + key numbers
Paper · source
Core QLoRA procedure (NF4 + double quantization + paged optimizers), key equations, and benchmark tradeoffs (quality vs memory)
Key content
- LoRA equation (Eq. 3): For linear projection (Y=XW), LoRA uses
[ Y = XW + s X L_1 L_2 ] where (X\in\mathbb{R}^{b\times h}), (W\in\mathbb{R}^{h\times o}), (L_1\in\mathbb{R}^{h\times r}), (L_2\in\mathbb{R}^{r\times o}), (s)=scalar, (r)=rank.
- Blockwise quantization (Eq. 1–2):
(X_{\text{Int8}}=\text{round}\big(\frac{127}{\text{absmax}(X_{\text{FP32}})}X_{\text{FP32}}\big)=\text{round}(c\cdot X_{\text{FP32}})); dequant: (X_{\text{FP32}}=X_{\text{Int8}}/c).
- QLoRA forward (Eq. 5–6, Sec. 3): store base weights in 4-bit (NF4), compute in BF16:
[ Y_{\text{BF16}} = X_{\text{BF16}}\;\text{doubleDequant}(c^{(1)}_{\text{FP32}}, c^{(2)}_{k\text{-bit}}, W_{k\text{-bit}}) + X_{\text{BF16}}L^{(1)}_{\text{BF16}}L^{(2)}_{\text{BF16}} ] with (\text{doubleDequant}(c^{(1)},c^{(2)},W)=\text{dequant}(\text{dequant}(c^{(1)},c^{(2)}),W)=W_{\text{BF16}}); (L^{(1)}), (L^{(2)}) are the LoRA factors (L_1), (L_2) of Eq. 3 held in BF16. Only LoRA params get gradients (base (W) frozen).
- NF4 rationale (Sec. 3): weights ~ (N(0,\sigma)); NFk uses theoretical normal quantiles (Eq. 4) normalized to ([-1,1]); asymmetric construction ensures exact zero.
- Double Quantization defaults (Sec. 3): quantize quantization constants: first-level blocksize 64 for (W); second-level uses FP8, blocksize 256 for constants. Memory for constants drops from 0.5 bits/param (32/64) to 0.127 bits/param (8/64 + 32/(64·256)), saving 0.373 bits/param (~3GB for 65B).
- Paged optimizers (Sec. 3): use NVIDIA Unified Memory paging to avoid OOM spikes during gradient checkpointing; reported same speed as regular optimizers for 65B, batch size 16.
- Empirical comparisons:
- Pile Common Crawl PPL (Table 2): Int4 34.34, FP4(E2M1) 31.07, FP4(E3M0) 29.48, NF4+DQ 27.41 (best).
- MMLU 5-shot mean (Table 3): BF16 53.0, Float4 52.2, NF4+DQ 53.1 (matches BF16; FP4 ~1 pt behind).
- Vicuna benchmark vs ChatGPT (Table 4, memory): Guanaco 65B 4-bit 41GB: 99.3% ±4.4; Guanaco 33B 4-bit 21GB: 97.8% ±4.4; Vicuna 13B 16-bit 26GB: 94.9% ±4.5; Guanaco 13B 4-bit 10GB: 90.4% ±5.2; Guanaco 7B 4-bit 5GB: 87.0% ±5.4.
- Elo (Table 1, GPT-4 judge): Guanaco 65B 1022±1, 33B 992±1, ChatGPT 966±1, Vicuna 13B 974±1.
- Hyperparam scaling rule (Sec. 5.1): 7B settings generalize; for 33B/65B halve learning rate and double batch size.
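Two pieces of the summary above lend themselves to a quick check: the absmax Int8 round trip (Eq. 1–2) and the double-quantization memory arithmetic. The sketch below is a plain-Python illustration on one small block with made-up values; it is not the bitsandbytes implementation and does not cover NF4 itself.

```python
def absmax_quantize(block):
    """Int8 absmax quantization of one block: round(c * x), c = 127 / absmax."""
    absmax = max(abs(v) for v in block)
    if absmax == 0:
        return [0] * len(block), 1.0      # all-zero block: any constant works
    c = 127.0 / absmax
    return [round(c * v) for v in block], c


def dequantize(q, c):
    """Inverse map: x is approximately q / c."""
    return [v / c for v in q]


block = [0.5, -1.0, 0.25, 0.0]            # made-up weights
q, c = absmax_quantize(block)
restored = dequantize(q, c)
max_err = max(abs(a - b) for a, b in zip(block, restored))

# Overhead of quantization constants, in bits per parameter.
bits_plain = 32 / 64                      # one FP32 constant per 64 weights
bits_dq = 8 / 64 + 32 / (64 * 256)        # 8-bit constants + FP32 second level
saved_gb_65b = (bits_plain - bits_dq) * 65e9 / 8 / 1e9
print(q, max_err, bits_dq, saved_gb_65b)
```

The round-trip error is bounded by half a quantization step (0.5 / c), and the constant overhead works out to roughly 0.127 bits per parameter, in line with the figures quoted above.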
📄 S-LoRA multi-tenant LoRA serving (Unified Paging + heterogeneous batching)
Paper · source
System design + empirical scaling claims for serving thousands of concurrent LoRA adapters (memory pool, batching, kernels, multi-GPU TP).
Key content
- LoRA math (Section 2, Eq. 1–2): For pretrained weight matrix (W), LoRA adds update (\Delta W = BA) where (B\in\mathbb{R}^{d\times r}), (A\in\mathbb{R}^{r\times k}), rank (r). Base forward (h = xW). With LoRA: (h = xW + xBA) (compute on-the-fly rather than merging for multi-adapter serving).
- Design rationale (Section 4): Merging adapters into base weights eliminates per-request overhead for one adapter, but for many adapters it causes weight duplication or serial adapter swapping → missed batching + GPU underutilization. S-LoRA separates batchable base-model compute from per-request LoRA compute and batches LoRA via custom kernels (avoid padding inefficiency from heterogeneous ranks/seq lengths).
- Unified Paging (Section 5.1): Extends vLLM PagedAttention to a unified GPU memory pool jointly managing KV cache and adapter weights to reduce fragmentation. Pool is a large static buffer using GPU space not occupied by base weights/temporary activations. Storage is paged; each page is a vector of length (h) (hidden size). KV cache with seq len (s) uses (s) pages; LoRA weight with rank (r) uses (r) pages; KV + adapters interleaved, non-contiguous.
- Prefetching (Section 5.2): Predict adapters needed for next decoding batch from waiting queue; prefetch to overlap I/O with compute.
- Custom kernels (Section 5.3): MBGMM (prefill, matrix-matrix) in Triton; MBGMV (decode, matrix-vector) implemented via modified Punica kernels to support non-contiguous memory + multiple ranks.
- Multi-GPU TP (Section 6): Align LoRA partitions with Megatron-LM TP; schedule comms on small LoRA intermediates and fuse with base-model comms. Base comm cost: one all-reduce (O(th)). Added LoRA comm: (O(tr)) (3 all-gathers for Q/K/V + 1 all-reduce for output), negligible since (r\ll h). No replicated weights (partitioned across devices).
- Empirical results (Section 7.2, Table 3):
- S-LoRA serves 2,000 adapters simultaneously with stable throughput once adapters ≥ ~100 (active adapters per batch bounded by GPU mem).
- vLLM-packed (merged copies) can serve <5 adapters before OOM.
- Throughput: up to 4× higher than vLLM-packed (small adapter counts) and up to 30× higher than HuggingFace PEFT; “several orders of magnitude” more adapters than naive vLLM LoRA support.
- Eval defaults (Section 7.1–7.2): Models: Llama-7B/13B/30B/70B. Example adapter ranks: S1 {8}; S2 {64,32,16,8}; S4 {64,32,16}; S5 {32}; S6 {64}. Hardware: A10G 24GB; A100 40/80GB; host RAM 64–670GB. SLO attainment metric: % requests with first token ≤ 6s. Synthetic trace: total rate (\lambda) req/s; input/output lengths uniform [8,512] tokens.
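The Unified Paging idea above (one pool of pages of length h, where a sequence of length s takes s pages and a rank-r adapter takes r pages) can be sketched with a toy allocator. The class and key names are hypothetical; the real system manages GPU memory and pairs this layout with custom kernels.

```python
class UnifiedPagePool:
    """Toy page pool shared by KV-cache entries and LoRA adapter weights."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))   # indices of unused pages
        self.owner = {}                      # key -> list of page indices

    def alloc(self, key, n_pages: int):
        if key in self.owner:
            raise KeyError(f"{key} already allocated")
        if n_pages > len(self.free):
            raise MemoryError("pool exhausted")
        pages = [self.free.pop() for _ in range(n_pages)]  # need not be contiguous
        self.owner[key] = pages
        return pages

    def release(self, key):
        self.free.extend(self.owner.pop(key))


pool = UnifiedPagePool(num_pages=32)
kv1 = pool.alloc(("kv", "req-1"), 10)        # sequence length 10 -> 10 pages
lora = pool.alloc(("lora", "adapter-7"), 8)  # rank 8 -> 8 pages
pool.release(("kv", "req-1"))                # request finished
kv2 = pool.alloc(("kv", "req-2"), 20)        # reuses freed pages
print(len(pool.free), sorted(lora))
```

Because both object types draw from the same free list, memory left unused by adapters is immediately available for KV cache, and vice versa.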
📄 UniPELT adapters + Prompt Tuning on RoBERTa (parameter counts & benchmark deltas)
Paper · source
Concrete PEFT adapter comparisons (UniPELT variants), trainable-parameter budgets, and benchmark tables (GLUE/domain/SQuAD)
Key content
- Design rationale
- Goal: match DAPT/TAPT or full fine-tuning performance while training far fewer parameters (keep most pretrained weights frozen).
- Uses UniPELT (unifies LoRA + Prefix Tuning + SeqBn/bottleneck adapters with gating to regulate submodule activation); explores stacking and swapping submodules.
- Adds Prompt Tuning (PT) on top of UniPELT to test whether stacking adapters improves feature capture with minimal parameter increase.
- Mechanism detail (IA3 vs LoRA)
- IA3: “three learned vectors” rescale keys and values in attention layers (vector scaling) vs LoRA’s decomposed low-rank matrices.
- Training procedure / defaults
- Model: RoBERTa-Base.
- Batch size 16, input length 128, dropout 0.1, epochs 50 with early stopping patience=10 (SQuAD: omit early stopping).
- Loss: Cross-Entropy for classification.
- Learning rates tuned: 2e-4 and 5e-4 (reported best).
- Tooling: Adapters library + Hugging Face Transformers.
- Empirical results (GLUE avg + key tasks)
- Fine-tuning avg 86.35 vs UniPELT (Adapter Lib) 85.15; PT+UniPELT (Adapter Lib) 85.66.
- Best per-task examples: CoLA 66.14 (PT+UniPELT Adapter Lib), MRPC 90.90 (PT+UniPELT Adapter Lib), RTE 78.00 (full FT), QNLI 93.12 (PT+UniPELT Paper).
- Trainable parameter budgets (RoBERTa-base total 124,645,632 = 100%)
- UniPELT 11,083,376 (8.892%)
- PT+UniPELT 11,091,056 (8.898%) (negligible +0.006%)
- IA3+Prefix+SeqBn 10,852,988 (8.707%)
- UniPELT Stack-3 33,250,128 (26.68%) (≈3× params; not consistently better)
- Domain tasks (selected deltas)
- CS gains with PT+UniPELT: ACL-ARC 63.0 → 82.10, SCIERC 77.3 → 86.81 (small datasets: 1,688 / 3,219).
- Vocabulary overlap with RoBERTa pretraining: News 54.1%, Reviews 34.5%, BioMed 27.3%, CS 19.2% (lower overlap → larger adapter benefit).
- SQuAD 1.1 (QA)
- Fine-tuning F1 94.6 / EM 88.9
- UniPELT (Paper) 90.23 / 82.37
- PT+UniPELT (Paper) 88.70 / 80.74 (PT hurts vs UniPELT by ~1.5 F1)
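The trainable-parameter percentages listed above can be reproduced directly from the raw counts. The sketch below only redoes that arithmetic; the dictionary keys are labels chosen for this example.

```python
TOTAL = 124_645_632  # RoBERTa-base parameter count quoted above

budgets = {
    "UniPELT": 11_083_376,
    "PT+UniPELT": 11_091_056,
    "IA3+Prefix+SeqBn": 10_852_988,
    "UniPELT Stack-3": 33_250_128,
}

percent = {name: 100 * n / TOTAL for name, n in budgets.items()}
for name, pct in percent.items():
    print(f"{name}: {pct:.3f}%")

# Extra cost of stacking Prompt Tuning on top of UniPELT, in percentage points.
pt_overhead = percent["PT+UniPELT"] - percent["UniPELT"]
print(f"PT overhead: {pt_overhead:.3f} points")
```

Stack-3 is exactly three times the UniPELT count, which is where the ≈3× figure comes from.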
📊 PEFT Taxonomy + Method-by-Method Numbers (Adapters, Prompts, BitFit, IA3, LoRA, QLoRA)
Benchmark · source
Taxonomy + comparative breakdown of where PEFT injects/updates params and efficiency tradeoffs.
Key content
- PEFT taxonomy (Section 3):
- Addition-based: add new modules; train only added params (Adapters, Soft/Prompt/Prefix).
- Selection-based: tune subset of existing params (BitFit biases; sparse masks; layer subsets).
- Reparametrization-based: low-rank update parameterization (LoRA, KronA, Intrinsic SAID).
- Hybrid: combine (e.g., UniPELT = LoRA + Prefix + Adapters with gates).
- Memory rationale (Section 3.1 “Why add parameters?”): with Adam, per byte of trainable parameter: +1 byte gradient +2 bytes optimizer moments; overall training often 12–20× model-weight memory. Freezing most weights saves optimizer/grad memory; can also quantize frozen weights.
- LoRA update (Section 9.2): for weight matrix (W), learn low-rank update
(\Delta W = B A) (rank (r)); effective weight (W’ = W + \alpha/r \cdot BA). Train only (A,B); merge after training by adding the scaled (\Delta W) into (W). Typically applied to attention projections (often (Q,V)); best performance when applied to all weight matrices (cites Dettmers et al. 2023).
- (IA)³ (Section 7.2): learn vectors that rescale activations: key, value, and FFN hidden activations; minimal inference overhead (can fold scaling into linear layers; only one vector remains as overhead).
- Key empirical parameter ranges (Table 2 excerpt):
- Adapters: 0.1–6% trainable.
- BitFit: 0.05–0.1% trainable; underperforms on >1B models; note bias-less architectures (T5 mostly no biases; LLaMA none).
- Prompt tuning: 0.1%; inference overhead from longer sequence.
- Prefix-tuning: 0.1–4%.
- LoRA: 0.01–0.5% trainable, but ~30% “changed parameters” (update affects many weights when merged).
- (IA)³: 0.02% trainable; reported to beat LoRA with 16× more trainable params on T0-3B.
- QLoRA (Section 9.8): memory savings via 4-bit NF4 quantization, double quantization (quantize quant constants), and CPU↔GPU paging for optimizer-state spikes.
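The merge step described in the LoRA bullet above (W′ = W + (α/r)·BA) can be checked on a toy example. The sketch below uses nested lists and a column-vector convention (y = Wx); the matrices and α value are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, b, a, alpha, rank):
    """Return W' = W + (alpha / rank) * B A."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]      # frozen base weight (2 x 2)
b = [[1.0], [2.0]]                # B: 2 x r, with r = 1
a = [[0.5, -0.5]]                 # A: r x 2
alpha, rank = 2.0, 1

merged = merge_lora(w, b, a, alpha, rank)

# Unmerged forward pass: y = W x + (alpha / r) * B (A x).
x = [[1.0], [3.0]]
base = matmul(w, x)
update = matmul(b, matmul(a, x))
unmerged = [[base[i][0] + (alpha / rank) * update[i][0]] for i in range(2)]
print(merged, matmul(merged, x), unmerged)
```

Both paths give the same output, which is why a merged adapter adds no inference latency. Note that only three numbers were trained here, yet all four entries of W changed, a small-scale version of the "changed parameters" point in the table excerpt above.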
📖 bitsandbytes 4-bit Linear Layers (QLoRA)
Reference Doc · source
Exact API surface for bitsandbytes 4-bit layers used in QLoRA-style finetuning: Linear4bit, LinearFP4, LinearNF4, and Params4bit (all via autodoc of their __init__).
Key content
- QLoRA procedure (high-level workflow):
- Quantize a pretrained model’s weights to 4-bit.
- Add LoRA (low-rank adaptation) weights.
- Finetune LoRA parameters “through the quantized weights” (i.e., base weights remain quantized while adapters are trained).
(Section: “4-bit quantization”)
- 4-bit layer/data-type options (design rationale):
- Introduces two 4-bit quantization data types for linear layers:
- Float4 via LinearFP4 (“standard Float4 data type”).
- NormalFloat 4-bit via LinearNF4 (“4-bit NormalFloat”).
- Rationale for NF4: LinearNF4 is “a quantization data type for normally distributed data” and can improve performance vs standard Float4.
(Section: “4-bit quantization”)
- API entry points (consult autodoc for exact parameters/defaults):
- bitsandbytes.nn.Linear4bit.__init__
- bitsandbytes.nn.LinearFP4.__init__
- bitsandbytes.nn.LinearNF4.__init__
- bitsandbytes.nn.Params4bit.__init__
(Sections: Linear4bit / LinearFP4 / LinearNF4 / Params4bit)
🔍 S-LoRA serving thousands of concurrent LoRA adapters
Explainer · source
Step-by-step operational explanation of S-LoRA’s serving approach (adapter storage/loading, routing, batching strategy)
Key content
- LoRA equations (Section “Low-Rank Adaptation”)
- For base weight (W\in\mathbb{R}^{h\times d}): Eq.(1) (W’ = W + AB), where (A\in\mathbb{R}^{h\times r}), (B\in\mathbb{R}^{r\times d}), rank (r \ll \min(h,d)).
- If base forward is (h=xW), then with LoRA: Eq.(2) (h=xW’ = x(W+AB)=xW + xAB).
- Rationale: merging adapters into (W) is fast for one adapter, but switching/merge per batch causes GPU under-utilization and throughput collapse with >2 adapters; separating base compute (batchable) from per-adapter LoRA compute scales better.
- Unified Paging memory design (Section “Reserved Memory v.s. Unified Memory”)
- Avoid fixed “reserved adapter memory” because it (1) wastes memory when adapters < reserved (reduces KV cache → smaller batch size → lower throughput) and (2) caps active adapters (hurts continuous batching).
- Put KV cache + adapter weights into one paged pool (extends vLLM paged KV cache).
- KV cache per layer tensor shape ((S,H)) (sequence length (S)); LoRA weights shape ((R,H)) (rank (R)); choose page size = (H) to reduce fragmentation (common factor).
- Non-contiguous layout → custom kernels (Section “Non-contiguous Memory Layout”)
- Interleaved, non-contiguous KV/adapter pages break standard contiguous ops (PyTorch/xFormers/CUTLASS grouped GEMM assumptions).
- Prefill: Triton tiled kernel gathers adapter weights of varying ranks from pool.
- Decode: modified Punica BGMV kernel supports multiple ranks in a batch + fine-grained gathers aligned to pool.
- Multi-GPU scaling: S-LoRA TP (Section “Tensor Parallelism”)
- Align LoRA partitioning with Megatron-LM TP; minimize comms by avoiding unnecessary comms and fusing some comms; overhead from LoRA comms is “small” vs compute; scaling from 2→4 GPUs yields >2× throughput (memory-bound, superlinear).
- Empirical throughput (A100 80GB, Table “Throughput”)
- S1 (Llama-7B, rank {8}): (n=5) adapters 8.05 req/s (vLLM-packed 2.04, PEFT 0.88); (n=100) 7.99 (vLLM-packed OOM, PEFT 0.25); (n=2000) 7.61.
- S2 (Llama-7B, ranks {64,32,16,8}): (n=5) 7.48; (n=2000) 6.71 (vLLM-packed OOM at 100).
- S4 (Llama-13B, ranks {64,32,16}): (n=2) 4.49 (vLLM-packed 3.83, PEFT 0.54); (n=1000) 3.96.
- Claim: serves 2,000 adapters with minimal overhead; up to 4× throughput vs vLLM-packed (small (n)), up to 30× vs PEFT.
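The split between batchable base compute and per-request adapter compute (h = xW + xAB) can be sketched as below. This is a plain-Python toy with made-up matrices and hypothetical adapter names; it shows the arithmetic only, not S-LoRA's kernels or memory layout.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batched_lora_forward(xs, w, adapters, adapter_ids):
    """Compute h_i = x_i W + x_i A_i B_i, with a per-request adapter."""
    base = matmul(xs, w)                      # one shared matmul for the whole batch
    out = []
    for x, h_base, aid in zip(xs, base, adapter_ids):
        a, b = adapters[aid]                  # A: h x r, B: r x d; r may differ
        delta = matmul(matmul([x], a), b)[0]  # low-rank path costs O(r (h + d))
        out.append([hb + d for hb, d in zip(h_base, delta)])
    return out


w = [[1.0, 0.0], [0.0, 1.0]]                  # shared base weight
adapters = {                                  # hypothetical adapters
    "rank1": ([[1.0], [0.0]], [[0.0, 1.0]]),
    "rank2": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
}
xs = [[1.0, 2.0], [3.0, 4.0]]                 # two requests in one batch
hs = batched_lora_forward(xs, w, adapters, ["rank1", "rank2"])
print(hs)
```

Requests that use different adapters, even of different ranks, still share the single base matmul, so no per-adapter merged copy of W is ever needed.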