


Why Does LLM Inference Get Expensive So Quickly?
A 70B-parameter model needs roughly 140 GB of HBM at FP16 (70B parameters × 2 bytes per weight), or about 70–80 GB with 8-bit weights, just to sit idle; every second that memory occupies a GPU still incurs the hourly fee. Under low traffic, most of your spend is simply paying for unused capacity.
What Actually Drives the Cost of Serving Large Models?
- GPU hourly rate — public clouds list H100 at $3–$9/hr; Corvex’s H200 starts at $2.15/hr.
- Token throughput (tokens/sec) — higher TPS means fewer GPU-hours per million tokens; vLLM and TensorRT-LLM reach 250–300 t/s on 70B models. (A quick cost estimator is sketched after this list.)
- Prompt + generation length — compute scales almost linearly with total tokens.
- GPU utilization — sporadic traffic strands expensive GPUs.
- Model precision — 8- or 4-bit weights fit in less memory and decode faster.
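To see how these drivers interact, here is a minimal back-of-the-envelope estimator in Python. It is a sketch under the assumption of steady-state serving; the rates, GPU counts, and throughput in the example are illustrative, not benchmark or published pricing figures.

```python
def cost_per_million_tokens(gpu_hourly_rate: float,
                            num_gpus: int,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Rough $ per 1M generated tokens for a steady-state deployment.

    gpu_hourly_rate   -- price of one GPU per hour (e.g. 2.15 for an H200)
    num_gpus          -- GPUs dedicated to the model replica
    tokens_per_second -- aggregate decode throughput of the replica
    utilization       -- fraction of each hour the GPUs do useful work
    """
    tokens_per_hour = tokens_per_second * 3600 * utilization
    dollars_per_hour = gpu_hourly_rate * num_gpus
    return dollars_per_hour / tokens_per_hour * 1_000_000

# Illustrative only: 4x GPUs at $2.15/hr, 300 t/s aggregate, 80% utilized
print(f"${cost_per_million_tokens(2.15, 4, 300, 0.8):.2f} per 1M tokens")
```

Plugging your own hourly rate, throughput, and utilization into this formula is usually the fastest way to compare engines and hardware tiers before running a full benchmark.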
Which Inference Engines Squeeze the Most Out of Each GPU?
Engine | Key Optimization | Llama-2 70B TPS (H200) | Memory (GB) | Notes |
---|---|---|---|---|
TensorRT-LLM | CUDA kernels + speculative decoding | 300 t/s | 65 | C++/Python; ideal on Corvex H200 |
vLLM | Paged-attention + continuous batching | 268 t/s | 72 | Python; easy drop-in |
Hugging Face TGI | Tensor & pipeline parallel | 180 t/s | 80 | Managed or self-host |
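As an illustration of how little code a drop-in engine needs, here is a minimal vLLM sketch. The checkpoint name, tensor-parallel degree, and sampling settings are assumptions for illustration, and constructor arguments can shift between vLLM releases.

```python
from vllm import LLM, SamplingParams

# Paged attention and continuous batching are handled internally by vLLM;
# you only declare the model and how many GPUs to shard it across.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed checkpoint
    tensor_parallel_size=4,                  # e.g. shard a 70B model over 4 GPUs
    gpu_memory_utilization=0.90,             # leave headroom for the KV cache
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain paged attention in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```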
How Much Does 4-bit Quantization Save?
Published results for GPTQ and QLoRA show that weight-only 4-bit quantization cuts VRAM by roughly 75% while keeping accuracy within about 1 point on MMLU and TruthfulQA. That reduction lets a single H200 hold Llama-3 70B entirely in memory, boosting throughput 1.5–2×.
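One way to try weight-only 4-bit in practice is a minimal sketch using Hugging Face Transformers with a bitsandbytes NF4 config; the checkpoint name below is an assumption, and prequantized GPTQ or AWQ exports are equally valid routes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumed checkpoint

# NF4 weight-only quantization: ~0.5 bytes per parameter plus small overheads
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPU memory
)

inputs = tokenizer("Why do 4-bit weights cut VRAM?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Always validate the quantized model on your own evaluation set before routing production traffic to it.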
Which Hardware Tier Should I Choose?
Tier | Best for | Example Corvex Option | Why It Wins |
---|---|---|---|
H200 | 7–70B models, low-latency chat & RAG | 8× H200 HGX node | HBM3e + InfiniBand remove IO bottlenecks |
B200 | High-QPS inference & training | 8× B200 HGX | Blackwell FP4/FP8 modes lift TPS by 3–4× |
GB200 NVL72 | Massive multi-agent or 100B+ models | 72× GB200 super-node | NVLink Switch fabric delivers ~870k t/s |
Note: Corvex does not sell H100 servers; we focus on newer H200 and Blackwell-class GPUs with better price-performance.
Seven Proven Tactics That Cut Inference Cost
- Right-size the model — distill or pick smaller checkpoints
- Quantize to 8- or 4-bit
- Use an efficient engine — TensorRT-LLM, vLLM, or TGI
- Batch + stream — continuous batching keeps utilization high
- Cache embeddings and full responses (a minimal cache sketch follows this list)
- Autoscale to zero — park GPUs during lulls (Corvex supports auto-park)
- Measure $ per million tokens — the only KPI that matters
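On the caching tactic, here is a hedged sketch of an exact-match response cache. The key scheme, TTL, and the in-process dict store are assumptions; in production you would typically swap the store for Redis or a similar shared cache.

```python
import hashlib
import time


class ResponseCache:
    """Exact-match cache: identical (model, prompt, params) never hits the GPU twice."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, model: str, prompt: str, params: str) -> str:
        return hashlib.sha256(f"{model}|{params}|{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str, params: str) -> str | None:
        entry = self._store.get(self._key(model, prompt, params))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, model: str, prompt: str, params: str, response: str) -> None:
        self._store[self._key(model, prompt, params)] = (time.time(), response)


# Usage: check the cache before calling the inference endpoint
cache = ResponseCache()
hit = cache.get("llama-3-8b", "What is paged attention?", "temp=0.0")
if hit is None:
    response = "...call your vLLM / TensorRT-LLM endpoint here..."
    cache.put("llama-3-8b", "What is paged attention?", "temp=0.0", response)
```

Exact-match caching only pays off for repeated prompts (FAQ bots, RAG boilerplate); for paraphrased queries you would need semantic caching on embeddings, which trades accuracy risk for hit rate.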
How Corvex Lowers Total Cost of Ownership
- H200 nodes at $2.15/hr with 141 GB HBM3e
- 3.2 Tb/s non-blocking InfiniBand across every rack
- Auto-park policies that spin GPUs down to billing-free idle
- One-click images for vLLM, TensorRT-LLM, and TGI
- White-glove support from engineers who tune inference daily
Customers report up to 40% lower cost per million tokens than public-cloud endpoints that charge higher hourly rates for the same silicon.
Real-World Cost per Million Generated Tokens
Model | Engine | Hardware | Corvex ($ / M tokens) | Typical Public Cloud ($ / M tokens) |
---|---|---|---|---|
Llama-3-8B | vLLM | 1× H200 | $4.70 | $9.30 |
Llama-2-70B | TensorRT-LLM | 4× H200 | $19.20 | $36.80 |
Mixtral-8×7B | vLLM | 1× B200 | $6.80 | $13.50 |
How Big Should My GPU Be?
Allocate ~2 GB of VRAM per billion parameters at FP16 (2 bytes per weight), halve that for 8-bit, quarter it for 4-bit, then add ~20% headroom for the KV cache. This prevents mid-request OOMs that waste paid compute.
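A minimal sketch of that rule of thumb follows, assuming weight memory dominates and the 20% headroom roughly covers KV cache and activations; real requirements vary with context length and batch size.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 16,
                     kv_headroom: float = 0.20) -> float:
    """Rule-of-thumb VRAM estimate: weights plus ~20% headroom for KV cache."""
    weight_gb = params_billion * (bits_per_weight / 8)  # 1e9 params * bytes/param ~= GB
    return weight_gb * (1 + kv_headroom)

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: ~{estimate_vram_gb(70, bits):.0f} GB")
# 16-bit ~168 GB (needs 2x H200), 8-bit ~84 GB, 4-bit ~42 GB (fits one H200)
```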
Frequently Asked Questions
- Does batching hurt latency for a single user? No. Engines like vLLM interleave incoming prompts into micro-batches, keeping first-token latency under 30 ms even at high load.
- Are "serverless" GPU wrappers cheaper for bursty traffic? Sometimes. They avoid idle cost, but if bursts last longer than about three minutes, Corvex auto-parked nodes are usually cheaper.
- Is 4-bit quantization production-safe? Yes, for most chat/RAG workloads. Accuracy loss on Llama-2/3 is negligible, but always validate on your own domain.
Next Step
Launch a pre-tuned TensorRT-LLM or vLLM endpoint on Corvex in under five minutes, or talk to our solution architects for a cost benchmark tailored to your workload.