Optimizing Cache Placement on GPUs

Why this project

Large models are bottlenecked by memory residency and bandwidth. Placing tensors strategically across GPU caches and memory tiers can unlock latency savings without hardware changes.

Objective

Minimize training/inference step time subject to GPU memory and interconnect constraints.
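
One hedged way to write this objective as the mixed-integer program described under Method below (the symbols x_t, c_t, s_t, and C_HBM are illustrative, not taken from the project):

    \min_{x \in \{0,1\}^{n}} \; \sum_{t=1}^{n} c_t \, (1 - x_t)
    \quad \text{s.t.} \quad \sum_{t=1}^{n} s_t \, x_t \le C_{\mathrm{HBM}}

Here x_t = 1 keeps tensor t resident in fast memory, c_t is its estimated transfer/eviction penalty, s_t its size, and C_HBM the fast-memory budget; interconnect bandwidth caps enter as additional linear constraints.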

Method

  • A cost model over kernel traces captures reuse distance, transfer latency, and eviction penalties.
  • Optimization view: a mixed-integer program; practical solvers use greedy heuristics and Lagrangian relaxations with approximation bounds (a minimal solver sketch follows the diagram below).
  • The placement policy integrates with a PyTorch pass that observes live profiling signals and adjusts placement online.
Params/Acts ─► Cost Model ─► Solver (Greedy | Lagrange) ─► Placement Plan
                                               │
                                     Runtime Feedback (profiling)
                                               ▼
                                       Online Adjustments
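
A minimal sketch, in Python, of the cost model and greedy solver described above (the TensorStats fields and the placement_benefit heuristic are illustrative assumptions, not the project's actual API):

from dataclasses import dataclass

@dataclass
class TensorStats:
    """Per-tensor statistics extracted from kernel traces (illustrative fields)."""
    name: str
    size_bytes: int
    reuse_count: int           # times the tensor is re-read within a step
    avg_reuse_distance: float  # kernels between successive uses (smaller = hotter)
    transfer_latency_s: float  # cost of fetching it over NVLink/PCIe when not resident

def placement_benefit(t: TensorStats) -> float:
    """Estimated step time saved per byte if the tensor stays resident in fast memory:
    transfers avoided, discounted by reuse distance (far-apart reuses are likely evicted anyway)."""
    avoided = t.reuse_count * t.transfer_latency_s
    discount = 1.0 / (1.0 + t.avg_reuse_distance)
    return avoided * discount / max(t.size_bytes, 1)

def greedy_placement(tensors: list[TensorStats], fast_capacity_bytes: int) -> set[str]:
    """Knapsack-style greedy: keep the highest benefit-per-byte tensors in fast memory
    until the budget is spent; everything else spills to slower memory."""
    plan: set[str] = set()
    remaining = fast_capacity_bytes
    for t in sorted(tensors, key=placement_benefit, reverse=True):
        if t.size_bytes <= remaining:
            plan.add(t.name)
            remaining -= t.size_bytes
    return plan

The Lagrangian variant would fold the capacity constraint into the objective with a multiplier and iterate on it; the greedy pass above is the simpler baseline, and the online loop would re-run it on refreshed profiling statistics.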

Data & setup

  • Benchmarks on CNN/Transformer workloads; traces collected with Nsight/torch.profiler (see the sketch after this list).
  • Constraints reflect HBM capacity, L2 cache size, and NVLink/PCIe bandwidth caps.
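
A minimal sketch of trace collection and constraint inputs, assuming a standard torch.profiler session (the HardwareBudget dataclass and its default numbers are placeholders, not measured device limits):

import torch
from dataclasses import dataclass
from torch.profiler import profile, ProfilerActivity

@dataclass
class HardwareBudget:
    """Capacity/bandwidth caps handed to the solver as constraints (illustrative values)."""
    hbm_capacity_bytes: int = 80 * 1024**3  # e.g. an 80 GB HBM part
    l2_cache_bytes: int = 50 * 1024**2      # e.g. a 50 MB L2
    interconnect_gbps: float = 600.0        # e.g. aggregate NVLink bandwidth

def collect_trace(model: torch.nn.Module, batch: torch.Tensor):
    """Run one forward step under torch.profiler and return per-kernel averages
    (CUDA time, memory usage) for the cost model to consume."""
    model, batch = model.cuda(), batch.cuda()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 profile_memory=True, record_shapes=True) as prof:
        model(batch)
    return prof.key_averages()

The key_averages() table feeds the cost model; in practice the HardwareBudget values would come from the target GPU's spec sheet or device queries.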

Evaluation & ablations

  • KPIs: step time, stall breakdown, bytes moved, cache hit rate, and convergence parity (a small KPI sketch follows this list).
  • Comparisons against baseline eviction policies and prefetchers.
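
A small sketch of how two of these KPIs might be computed from trace counters (the input formats are assumptions about what the profiler export provides):

def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of accesses served from cache; 0.0 when there were no accesses."""
    total = hits + misses
    return hits / total if total else 0.0

def bytes_moved(transfers: list[tuple[int, str]]) -> dict[str, int]:
    """Aggregate bytes transferred per link (e.g. 'nvlink', 'pcie') from (size, link) records."""
    totals: dict[str, int] = {}
    for size_bytes, link in transfers:
        totals[link] = totals.get(link, 0) + size_bytes
    return totals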

Outcomes

  • 1.2×–1.6× speedups with consistent convergence; reduced cross‑GPU traffic and higher cache hit rates.
  • Sensitivity analysis quantifies gains vs. model size and batch size.

Artifacts

  • Simulator, placement library, and profiling notebooks; reproducible scripts for all ablations.