Cache Placement Optimization on GPUs
Why this project
Large models are bottlenecked by memory residency and bandwidth. Strategic placement of tensors across GPU caches and memory tiers can unlock latency savings without hardware changes.
Objective
Minimize training/inference step time subject to GPU memory and interconnect constraints.
Method
- Cost model over kernel traces capturing reuse distance, transfer latency, and eviction penalties (sketched after the diagram below).
- Optimization view: mixed‑integer program; practical solvers use greedy and Lagrangian relaxations with approximation bounds (greedy heuristic sketched below).
- Policy integrates with a PyTorch pass to observe live profiling signals and adjust placement online (profiler hook sketched below).
Params/Acts ─► Cost Model ─► Solver (Greedy | Lagrange) ─► Placement Plan
                                                                  │
                                                  Runtime Feedback (profiling)
                                                                  ▼
                                                        Online Adjustments
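
As a concrete illustration of the first bullet, here is a minimal sketch of the cost model in Python. The field names, the miss-probability proxy, and the cost formula are assumptions for exposition, not the exact model used in the project.

```python
from dataclasses import dataclass

@dataclass
class TensorTrace:
    """Per-tensor statistics extracted from a kernel trace (illustrative fields)."""
    size_bytes: int
    accesses: int               # uses per training/inference step
    reuse_distance: float       # bytes touched between consecutive uses
    transfer_latency_s: float   # latency to fetch it from the slower tier
    eviction_penalty_s: float   # latency to re-fetch after a premature eviction

def miss_probability(t: TensorTrace, fast_capacity_bytes: int) -> float:
    """Crude proxy: reuse distances larger than the fast tier mostly miss."""
    return min(1.0, t.reuse_distance / fast_capacity_bytes)

def placement_cost(t: TensorTrace, resident: bool, miss_prob: float) -> float:
    """Expected per-step latency attributable to one tensor under a placement.

    Resident tensors pay an eviction penalty only when pressure evicts them;
    non-resident tensors pay the transfer latency on every access.
    """
    if resident:
        return miss_prob * t.eviction_penalty_s * t.accesses
    return t.transfer_latency_s * t.accesses

def plan_cost(traces: dict, plan: dict, fast_capacity_bytes: int) -> float:
    """Total expected step-time cost of a {tensor_id: resident?} plan."""
    return sum(
        placement_cost(t, plan[i], miss_probability(t, fast_capacity_bytes))
        for i, t in traces.items()
    )
```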
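
Building on the sketch above, the greedy relaxation can be read as a knapsack-style heuristic: rank tensors by latency saved per byte of fast-tier capacity and place them until the budget runs out. The Lagrangian variant and the approximation bounds are omitted here.

```python
def greedy_placement(traces: dict, fast_capacity_bytes: int) -> dict:
    """Knapsack-style heuristic: place tensors with the highest latency saving
    per byte of fast-tier capacity until the budget is exhausted."""
    def saving_per_byte(t: TensorTrace) -> float:
        slow = placement_cost(t, resident=False, miss_prob=0.0)
        fast = placement_cost(t, resident=True,
                              miss_prob=miss_probability(t, fast_capacity_bytes))
        return (slow - fast) / max(t.size_bytes, 1)

    plan = {i: False for i in traces}
    budget = fast_capacity_bytes
    ranked = sorted(traces.items(), key=lambda kv: saving_per_byte(kv[1]), reverse=True)
    for i, t in ranked:
        if saving_per_byte(t) > 0 and t.size_bytes <= budget:
            plan[i] = True
            budget -= t.size_bytes
    return plan
```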
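
For the online part, one plausible wiring uses torch.profiler's step-based schedule, with a hypothetical adjust_placement callback standing in for the placement pass; the callback receives the profiler object and could inspect prof.key_averages() to decide on adjustments.

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile, schedule

def train_with_online_placement(model, optimizer, data_loader, adjust_placement):
    """Profile a few steps per cycle and hand the results to a placement callback.

    adjust_placement(prof) is a hypothetical hook: it would inspect the collected
    CUDA/CPU activity (e.g. prof.key_averages()) and update the placement plan.
    """
    sched = schedule(wait=1, warmup=1, active=3)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 schedule=sched,
                 on_trace_ready=adjust_placement) as prof:
        for batch, target in data_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(batch), target)
            loss.backward()
            optimizer.step()
            prof.step()  # advance the profiling schedule once per training step
```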
Data & setup
- Benchmarks on CNN/Transformer workloads; traces collected with Nsight/torch.profiler.
- Constraints reflect HBM capacity, L2 cache size, and NVLink/PCIe bandwidth caps.
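
For concreteness, the constraint set can be captured in a small config handed to the solver; the numbers below are illustrative, roughly A100-class, and not measurements from the benchmarks.

```python
from dataclasses import dataclass

@dataclass
class HardwareConstraints:
    """Capacity and bandwidth caps fed to the solver (values are illustrative)."""
    hbm_capacity_bytes: int       # device memory budget
    l2_cache_bytes: int           # on-chip L2 capacity
    nvlink_bw_bytes_per_s: float  # cross-GPU link cap
    pcie_bw_bytes_per_s: float    # host<->device link cap

# Example: roughly A100-class numbers, used only to exercise the simulator.
a100_like = HardwareConstraints(
    hbm_capacity_bytes=40 * 2**30,
    l2_cache_bytes=40 * 2**20,
    nvlink_bw_bytes_per_s=600e9,
    pcie_bw_bytes_per_s=32e9,
)
```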
Evaluation & ablations
- KPIs: step time, stall breakdown, bytes moved, cache hit rate, convergence parity (aggregated as sketched below).
- Compare against baseline eviction policies and prefetchers.
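
A small sketch of how these KPIs might be aggregated from per-step counter records emitted by the simulator; the record field names are assumptions.

```python
def summarize(step_records: list) -> dict:
    """Aggregate per-step records like
    {"step_time_s": ..., "bytes_moved": ..., "cache_hits": ..., "cache_accesses": ...}
    into headline metrics."""
    n = len(step_records)
    hits = sum(r["cache_hits"] for r in step_records)
    accesses = sum(r["cache_accesses"] for r in step_records)
    return {
        "mean_step_time_s": sum(r["step_time_s"] for r in step_records) / n,
        "total_bytes_moved": sum(r["bytes_moved"] for r in step_records),
        "cache_hit_rate": hits / max(accesses, 1),
    }
```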
Outcomes
- 1.2×–1.6× speedups with consistent convergence; reduced cross‑GPU traffic and higher cache hit rates.
- Sensitivity analysis quantifies gains vs. model size and batch size.
Artifacts
- Simulator, placement library, and profiling notebooks; reproducible scripts for all ablations.
