Real test data from RunPod H200 GPU instances across 31 global regions. Every figure independently verifiable via nvidia-smi telemetry.
Platform: RunPod · Hardware: NVIDIA H200 · Regions: 31 global · Measurement: nvidia-smi · Interval: 1 second
Important context: The 89% figure applies to idle/dormant intervals only — not the full duty cycle average. At 50% GPU utilization (typical enterprise), approximately half of all GPU-hours are idle. Layer 1 captures all of that idle waste.
Mixed enterprise workload profile · Exact + semantic cache combined hit rate: 66.6%
| Request Type | % of Requests | GPU Power | Contribution |
|---|---|---|---|
| Exact cache hit | 51.4% | 0W | 0W |
| Semantic cache hit | 15.2% | 0W | 0W |
| Cache miss → 7B model | 26.0% | 200W | 52W avg |
| Cache miss → 13B model | 5.0% | 350W | 18W avg |
| Cache miss → 70B model | 2.3% | 700W | 16W avg |
| Weighted Average | 100% | 86W | 87% vs 650W baseline |
| Layer | Mechanism | Savings | Measurement Basis |
|---|---|---|---|
| 1 — Idle States | Predictive power scheduling | 89% | 700W → 75W · H200 validated |
| 2 — Inference | Semantic cache + ML routing | 87% | 650W → 86W · V2.0 formula |
| 3 — Total Facility | Combined IT + cooling effect | ~51% | 80% cache hit · PUE 1.6 |