Deep Learning
for the GPU Poor
Preface
These notes accompany the two-part lecture series Deep Learning for the GPU Poor, delivered as part of COMP70010 at Imperial College London, Dept of Computing. The goal is to equip researchers and engineers—particularly those without access to large compute clusters—with the quantitative tools and practical techniques needed to train, fine-tune/post-train, and deploy large models effectively. Throughout, I try to cite the original papers and point towards where these methods have been used in state-of-the-art models.
How to read these notes
The notes are organised around four questions:
- How expensive is it? Chapter 1 develops the skills to calculate the memory requirements and FLOPs for different model architectures and training/inference regimes. This is the foundation for understanding the methods introduced in later chapters, allowing the reader to weigh competing approaches quantitatively.
- Why do models keep getting bigger? Chapter 2 explores the empirical scaling laws that govern the relationship between compute, data, and performance, as well as the emerging paradigm of inference-time scaling.
- How do we make training tractable? Chapter 3 covers mixed-precision training, gradient accumulation and checkpointing, mixture of experts (scaling capacity without scaling compute), and parameter-efficient fine-tuning via LoRA and QLoRA.
- How do we make inference efficient? Chapter 4 addresses KV caching, prefill–decode disaggregation, Multi-Head Latent Attention and speculative decoding.
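To give a flavour of the estimates Chapter 1 develops, here is a minimal back-of-the-envelope sketch. It uses two standard rules of thumb: weight memory is simply parameter count times bytes per parameter, and total training compute is roughly 6 FLOPs per parameter per token for a dense transformer. The function names and example figures (a 7B model, 1T tokens) are illustrative, not from the notes themselves.

```python
def param_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone (bf16/fp16 by default)."""
    return n_params * bytes_per_param / 1e9

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute: ~6 FLOPs per parameter per token
    (forward + backward pass for a dense transformer)."""
    return 6 * n_params * n_tokens

if __name__ == "__main__":
    n = 7e9   # a 7B-parameter model
    d = 1e12  # trained on 1T tokens
    print(f"weights: {param_memory_gb(n):.1f} GB in bf16")
    print(f"training: {training_flops(n, d):.2e} FLOPs")
```

Note that weight memory is only a lower bound: training also needs gradients, optimiser states, and activations, which is exactly the accounting Chapter 1 works through.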
Errata. I am sure there are bugs in here. If you find one, please let me know (harry.coppock@imperial.ac.uk) or raise a PR, and I will patch it!
Prerequisites
Familiarity with the transformer architecture, basic linear algebra, and backpropagation is assumed. No prior knowledge of quantisation, AI architecture beyond transformers, or hardware architecture is required.