GPU Architecture Deep Dive: H200 vs B200 vs Custom Silicon

Understanding the hardware powering the AI revolution is essential for practitioners.

NVIDIA Hopper Generation

H100 Specifications

The H100 established new baselines:

| Specification | H100 SXM5 | H100 PCIe | Notes |
|---|---|---|---|
| Transistors | 80B | 80B | TSMC 4N |
| CUDA Cores | 16,896 | 14,592 | - |
| Tensor Cores | 528 | 456 | 4th generation |
| FP16 Performance | 989 TFLOPS | 756 TFLOPS | SXM runs higher clocks and power |
| Memory | 80GB HBM3 | 80GB HBM2e | 3.35 TB/s vs 2.0 TB/s |
| TDP | 700W | 350W | SXM requires serious cooling |
| Price | ~$25K | ~$20K | List prices; street prices ran far higher |
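
To put the 80GB figure in context, here is a back-of-envelope feasibility check; a minimal sketch, where the helper function and the ~10% overhead factor are illustrative assumptions, not measured values:

```python
# Rough check: do a model's weights fit in a single H100's 80 GB of HBM?
# Real deployments also need activations, KV cache, and framework
# overhead; a flat 10% fudge factor is assumed here.

def fits_on_gpu(params_b: float, bytes_per_param: float,
                hbm_gb: float = 80.0, overhead: float = 0.10) -> bool:
    """True if weights (plus overhead) fit in the given HBM capacity."""
    weight_gb = params_b * bytes_per_param  # 1e9 params * N bytes = N GB
    return weight_gb * (1 + overhead) <= hbm_gb

print(fits_on_gpu(70, 2.0))   # 70B @ FP16 = 140 GB of weights -> False
print(fits_on_gpu(70, 0.5))   # 70B @ 4-bit = ~35 GB -> True
```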

H200 Improvements

The H200's gains are concentrated in the memory system:

| Metric | H100 | H200 | Improvement |
|---|---|---|---|
| Memory | 80GB | 141GB | +76% |
| Bandwidth | 3.35 TB/s | 4.8 TB/s | +43% |
| HBM3e | No | Yes | Lower power |
| FP8 Training | Native | Native | - |

Key insight: the H200 is not faster on raw compute. It is faster because of memory capacity and bandwidth, which is precisely what bounds LLM inference.
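
A back-of-envelope roofline argument makes this concrete: in batch-1 autoregressive decoding, every generated token streams the full weight set from HBM, so bandwidth sets a hard ceiling on tokens per second. A minimal sketch, where the function and the 70B/FP8 example are illustrative assumptions:

```python
# Upper bound on decode throughput for a memory-bound LLM.
# Assumes batch-1 decoding where weight streaming dominates HBM traffic.

def decode_tokens_per_sec(params_b: float, bytes_per_param: float,
                          hbm_bandwidth_tbs: float) -> float:
    """Tokens/sec ceiling = bytes/sec of HBM / bytes of weights."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return (hbm_bandwidth_tbs * 1e12) / weight_bytes

# 70B-parameter model served in FP8 (1 byte per parameter)
for name, bw in [("H100", 3.35), ("H200", 4.8)]:
    print(f"{name}: ~{decode_tokens_per_sec(70, 1.0, bw):.0f} tokens/s ceiling")
# H100: ~48 tokens/s, H200: ~69 tokens/s -- the +43% bandwidth gain
# maps almost one-to-one onto decode throughput.
```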

Blackwell Architecture: B200

B200 Specifications

The B200 represents NVIDIA’s architectural leap:

| Specification | B200 | H100 | Improvement |
|---|---|---|---|
| Transistors | 208B | 80B | 2.6x |
| FP4 Performance | 20 PFLOPS | - | New precision |
| Memory | 192GB | 80GB | 2.4x |
| Bandwidth | 8 TB/s | 3.35 TB/s | 2.4x |
| NVLink | 1.8 TB/s | 900 GB/s | 2x |
| TDP | 1000W | 700W | 43% increase |

Blackwell Innovations

  1. FP4 Support: 4-bit floating point, doubling throughput over FP8 (see the footprint sketch after this list)
  2. Second-gen Transformer Engine: finer-grained precision management for transformer layers, extending down to FP4
  3. NVLink Switch: fifth-generation NVLink for faster multi-GPU communication
  4. RAS Features: reliability, availability, and serviceability hardware for datacenter deployment
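
Since FP4 halves the bytes per weight yet again, a quick footprint comparison shows why it matters for serving. A minimal sketch, assuming uniform quantization of all weights and ignoring KV cache and activations:

```python
# Weight memory footprint of a 70B-parameter model at each precision
# Blackwell can serve natively. Illustrative numbers only.

PRECISIONS = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}  # bytes per parameter

params = 70e9
for name, bytes_per_param in PRECISIONS.items():
    gb = params * bytes_per_param / 1e9
    print(f"{name}: {gb:.0f} GB of weights")
# FP16: 140 GB (spills past one H100), FP8: 70 GB, FP4: 35 GB -- at FP4
# a single B200 (192 GB) holds the weights with ample room for KV cache.
```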

Custom Silicon Landscape

Google TPU v5

| Metric | TPU v5 | H100 | Notes |
|---|---|---|---|
| Performance | 459 TFLOPS | 989 TFLOPS | BF16 / FP16 |
| Memory | 95GB HBM | 80GB HBM | - |
| Memory BW | 2.76 TB/s | 3.35 TB/s | H100 higher |
| Interconnect | 300 GB/s | 900 GB/s | NVLink wins |
| Availability | GCP only | Widely available | - |
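
One way to read these tables side by side is the compute-to-bandwidth ratio: peak FLOPs divided by HBM bytes per second gives the arithmetic intensity a kernel needs to stay compute-bound. A rough sketch using the dense 16-bit figures above; the chip list and breakeven framing are illustrative:

```python
# FLOPs-per-byte breakeven from the spec tables above. A kernel whose
# arithmetic intensity falls below this ratio is memory-bound on that chip.

chips = {
    # name: (peak 16-bit TFLOPS, HBM bandwidth in TB/s)
    "H100":   (989, 3.35),
    "H200":   (989, 4.8),
    "TPU v5": (459, 2.76),
}

for name, (tflops, tbs) in chips.items():
    print(f"{name}: ~{tflops / tbs:.0f} FLOPs/byte breakeven")
# H100 ~295, H200 ~206, TPU v5 ~166: the lower the breakeven, the easier
# it is to keep the chip compute-bound at small batch sizes.
```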

Amazon Trainium 2

  • Targets training of models in the ~65B-parameter class
  • Designed for distributed training
  • ~40% cost reduction vs H100
  • Limited ecosystem support

Apple Silicon (M4 Ultra)

| Specification | M4 Ultra | H100 |
|---|---|---|
| Neural Engine | 810 TOPS | N/A |
| Unified Memory | 192GB | 80GB |
| Memory BW | 800 GB/s | 3.35 TB/s |
| Best For | On-device | Datacenter |

Architecture Comparison

Transformer Workloads

For LLM training/inference:

| Chip | Training | Inference | Cost Efficiency |
|---|---|---|---|
| H100 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| H200 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| B200 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| TPU v5 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Trainium | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

Practical Recommendations

  • For training: H100/H200/B200 with NVLink for multi-GPU scaling
  • For inference: H200/B200 for long context; A100 for cost-sensitive workloads
  • For budget: spot H100s or Trainium for batch inference
  • For research: M4 Ultra for local experimentation, cloud for scale
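
These rules of thumb are simple enough to encode directly. The sketch below is illustrative only; the function name, workload categories, and return strings are assumptions, not any vendor's API:

```python
# A minimal selection heuristic encoding the recommendations above.

def pick_chip(workload: str, budget_sensitive: bool = False) -> str:
    """Map a workload type onto the recommendations above."""
    if workload == "training":
        return "Trainium 2" if budget_sensitive else "H100/H200/B200 + NVLink"
    if workload == "inference":
        return "spot H100 / A100" if budget_sensitive else "H200/B200"
    if workload == "research":
        return "M4 Ultra locally, cloud GPUs at scale"
    raise ValueError(f"unknown workload: {workload}")

print(pick_chip("inference", budget_sensitive=True))  # spot H100 / A100
```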

