Serving LLMs Without Breaking the Bank

Why Does LLM Inference Get Expensive So Quickly?



A 70B-parameter model needs roughly 140 GB of HBM at FP16 (about 70 GB at 8-bit) just to sit idle; every second that memory occupies a GPU still incurs the hourly fee. Under low traffic, most of your spend is simply paying for unused capacity.

What Actually Drives the Cost of Serving Large Models?



  • GPU hourly rate — public clouds list H100 at $3–$9/hr; Corvex’s H200 starts at $2.15/hr.
  • Token throughput (tokens/sec) — higher TPS means fewer GPU-hours per million tokens; vLLM and TensorRT-LLM reach 250–300 t/s on 70B models (see the cost sketch after this list).
  • Prompt + generation length — compute scales almost linearly with total tokens.
  • GPU utilization — sporadic traffic strands expensive GPUs.
  • Model precision — 8- or 4-bit weights fit in less memory and decode faster.
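
Combining the first two drivers gives the KPI this article keeps returning to: dollars per million generated tokens. A minimal Python sketch of that arithmetic, where the 60% utilization figure is an illustrative assumption rather than a measured benchmark:

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_sec: float,
                            num_gpus: int = 1,
                            utilization: float = 1.0) -> float:
    """Rough $ per 1M generated tokens for hardware you pay for by the hour.

    utilization < 1.0 models the idle time you still pay for.
    """
    tokens_per_hour = tokens_per_sec * utilization * 3600
    return (gpu_hourly_usd * num_gpus) / tokens_per_hour * 1_000_000

# Example: one $2.15/hr H200 sustaining 268 t/s, but only 60% utilized
print(f"${cost_per_million_tokens(2.15, 268, utilization=0.6):.2f} per 1M tokens")
```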

Which Inference Engines Squeeze the Most Out of Each GPU?



Engine | Key Optimization | Llama-2 70B TPS (H200) | Memory (GB) | Notes
TensorRT-LLM | CUDA kernels + speculative decoding | 300 t/s | 65 | C++/Python; ideal on Corvex H200
vLLM | Paged attention + continuous batching | 268 t/s | 72 | Python; easy drop-in
Hugging Face TGI | Tensor & pipeline parallelism | 180 t/s | 80 | Managed or self-host
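
For orientation, here is a minimal vLLM offline-inference sketch. The checkpoint name is a placeholder, and `tensor_parallel_size=4` assumes a 70B-class model spread across four GPUs; adjust both to your deployment.

```python
# pip install vllm  (requires CUDA GPUs)
from vllm import LLM, SamplingParams

# Continuous batching is automatic: submit many prompts and the engine
# interleaves them into the same decode iterations.
prompts = [
    "Summarize the benefits of paged attention.",
    "Explain continuous batching in one sentence.",
]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf",  # placeholder model ID
          tensor_parallel_size=4)                   # assumption: 4-GPU node

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```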

How Much Does 4-bit Quantization Save?



Recent papers on GPTQ and QLoRA show that weight-only 4-bit models cut VRAM use by roughly 75% while keeping accuracy within one point on MMLU and TruthfulQA. That shrink lets a single H200 hold Llama-3 70B entirely in memory, boosting throughput 1.5–2×.
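
As a sketch of what 4-bit loading looks like in practice (using Hugging Face Transformers with bitsandbytes NF4 rather than a GPTQ checkpoint specifically; the model ID is a placeholder):

```python
# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder checkpoint

# Weight-only 4-bit (NF4) quantization; matmuls still run in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers wherever GPU memory is available
)

inputs = tokenizer("4-bit weights matter for serving cost because",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0],
                       skip_special_tokens=True))
```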

Which Hardware Tier Should I Choose?



Tier | Best for | Example Corvex Option | Why It Wins
H200 | 7–70B models, low-latency chat & RAG | 8× H200 HGX node | HBM3e + InfiniBand remove I/O bottlenecks
B200 | High-QPS inference & training | 8× B200 HGX | Blackwell FP4/FP8 modes lift TPS by 3–4×
GB200 NVL72 | Massive multi-agent or 100B+ models | 72× GB200 super-node | NVLink Switch fabric delivers ~870k t/s

Note: Corvex does not sell H100 servers; we focus on newer H200 and Blackwell-class GPUs with better price-performance.

Seven Proven Tactics That Cut Inference Cost



  • Right-size the model — distill or pick smaller checkpoints
  • Quantize to 8- or 4-bit
  • Use an efficient engine — TensorRT-LLM, vLLM, or TGI
  • Batch + stream — continuous batching keeps utilization high
  • Cache embeddings and full responses (see the caching sketch after this list)
  • Autoscale to zero — park GPUs during lulls (Corvex supports auto-park)
  • Measure $ per million tokens — the only KPI that matters
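
For the caching tactic, a minimal in-process sketch is shown below; `generate_fn` stands in for whatever inference call you use, and a shared store such as Redis would replace the dict in production.

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cached_generate(prompt: str, params: dict, generate_fn) -> str:
    """Return a cached completion when this exact prompt + params has been seen before."""
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, "params": params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = generate_fn(prompt, **params)  # GPU time is spent only on a miss
    return _cache[key]

# Usage with any callable, e.g. a thin wrapper around a vLLM or TGI endpoint:
# answer = cached_generate("What is paged attention?", {"max_tokens": 128}, my_llm_call)
```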

How Corvex Lowers Total Cost of Ownership



  • H200 nodes at $2.15/hr with 141 GB HBM3e
  • 3.2 Tb/s non-blocking InfiniBand across every rack
  • Auto-park policies that spin GPUs down to billing-free idle
  • One-click images for vLLM, TensorRT-LLM, and TGI
  • White-glove support from engineers who tune inference daily

Customers report up to 40% lower $ / million tokens vs. public-cloud endpoints charging more for the same silicon.

Real-World Cost per Million Generated Tokens



Model | Engine | Hardware | Cost on Corvex ($ / 1M tokens) | Typical Public Cloud ($ / 1M tokens)
Llama-3-8B | vLLM | 1× H200 | $4.70 | $9.30
Llama-2-70B | TensorRT-LLM | 4× H200 | $19.20 | $36.80
Mixtral-8×7B | vLLM | 1× B200 | $6.80 | $13.50

How Big Should My GPU Be?



Allocate roughly 2 GB of VRAM per billion parameters at FP16, halve that for 8-bit and quarter it for 4-bit, then add about 20% headroom for the KV cache. This prevents mid-request OOMs that waste paid compute.
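
The same rule of thumb as a quick calculator (a sketch only; real usage also depends on context length, batch size, and engine overhead):

```python
def estimate_vram_gb(params_billion: float, bits: int = 16, kv_headroom: float = 0.2) -> float:
    """Rule-of-thumb VRAM estimate: ~2 GB per billion parameters at FP16,
    scaled by precision, plus headroom for the KV cache."""
    weight_gb = params_billion * 2 * (bits / 16)
    return weight_gb * (1 + kv_headroom)

# A 70B model at FP16, 8-bit, and 4-bit:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {estimate_vram_gb(70, bits):.0f} GB")
```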

Frequently Asked Questions



  • Does batching hurt latency for a single user?
    No. Engines like vLLM interleave prompts into micro-batches, keeping first-token latency under 30ms even under high load.
  • Are “serverless” GPU wrappers cheaper for bursty traffic?
    Sometimes. They avoid idle cost, but if bursts last >3 minutes, Corvex auto-parked nodes are usually cheaper.
  • Is 4-bit quantization production-safe?
    Yes for most chat/RAG workloads. Accuracy loss on Llama-2/3 is negligible, but always validate on your domain.

Next Step



Launch a pre-tuned TensorRT-LLM or vLLM endpoint on Corvex in under five minutes, or talk to our solution architects for a cost benchmark tailored to your workload.

Ready to Try an Alternative to Traditional Hyperscalers?

Let Corvex make it easy for you.

Talk to an Engineer