GPU Architecture Deep Dive: H200 vs B200 vs Custom Silicon
Understanding the hardware powering the AI revolution is essential for practitioners.
NVIDIA Hopper Generation
H100 Specifications
The H100 established new baselines:
| Specification | H100 SXM5 | H100 PCIe | Notes |
|---|---|---|---|
| Transistors | 80B | 80B | TSMC 4N |
| CUDA Cores | 16,896 | 16,896 | - |
| Tensor Cores | 528 | 456 | 4th generation |
| FP16 Performance | 989 TFLOPS | 756 TFLOPS | Dense Tensor Core throughput |
| Memory | 80GB HBM3 | 80GB HBM3 | 3.35 TB/s (SXM5), 2 TB/s (PCIe) |
| TDP | 700W | 350W | SXM5 requires serious cooling |
| Price | ~$25K | ~$20K | MSRP; street prices are another story |
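To put these peak numbers in perspective, a quick roofline check shows how much arithmetic a kernel must do per byte of HBM traffic before the H100 stops being memory-bound. This is a sketch using the SXM5 figures from the table; the kernel intensities are illustrative assumptions.

```python
# Back-of-the-envelope roofline for the H100 SXM5, using the table's figures.
PEAK_FP16_FLOPS = 989e12   # dense FP16 Tensor Core throughput, FLOP/s
HBM_BANDWIDTH = 3.35e12    # HBM3 bandwidth, bytes/s

# Arithmetic intensity (FLOPs per byte) where compute time equals memory time.
ridge_point = PEAK_FP16_FLOPS / HBM_BANDWIDTH
print(f"Ridge point: {ridge_point:.0f} FLOP/byte")  # ~295 FLOP/byte

def attainable_tflops(flops_per_byte: float) -> float:
    """Attainable throughput for a kernel with the given arithmetic intensity."""
    return min(PEAK_FP16_FLOPS, flops_per_byte * HBM_BANDWIDTH) / 1e12

# Illustrative intensities (assumptions, not measurements):
print(attainable_tflops(2))    # elementwise op: ~6.7 TFLOPS, memory-bound
print(attainable_tflops(300))  # large GEMM: ~989 TFLOPS, compute-bound
```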
H200 Improvements
The H200 is primarily a memory upgrade over the H100:
| Metric | H100 | H200 | Improvement |
|---|---|---|---|
| Memory | 80GB | 141GB | +76% |
| Bandwidth | 3.35 TB/s | 4.8 TB/s | +43% |
| HBM3e | No | Yes | Lower power |
| FP8 Training | Native | Native | - |
Key insight: the H200's raw compute is essentially the same as the H100's. The gains come from memory capacity and bandwidth, which is exactly what memory-bound LLM inference needs.
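A rough sketch of why that matters at decode time: at small batch sizes each generated token streams roughly the full set of weights from HBM, so tokens per second is capped near bandwidth divided by bytes read per token. The 70B-parameter FP8 model below is an illustrative assumption, and the estimate ignores KV-cache traffic and batching.

```python
# Rough upper bound on single-stream decode throughput: each generated token
# must read (approximately) all model weights from HBM once.
def max_decode_tokens_per_s(bandwidth_tb_s: float, params_b: float,
                            bytes_per_param: float) -> float:
    bytes_per_token = params_b * 1e9 * bytes_per_param  # weight traffic per token
    return bandwidth_tb_s * 1e12 / bytes_per_token

# Illustrative: a 70B-parameter model served in FP8 (1 byte per parameter).
for name, bw in [("H100", 3.35), ("H200", 4.8)]:
    print(name, round(max_decode_tokens_per_s(bw, 70, 1.0)), "tokens/s ceiling")
# H100: ~48 tokens/s, H200: ~69 tokens/s -- same compute, ~43% more headroom.
```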
Blackwell Architecture: B200
B200 Specifications
The B200 represents NVIDIA’s architectural leap:
| Specification | B200 | H100 | Improvement |
|---|---|---|---|
| Transistors | 208B | 80B | 2.6x |
| FP4 Performance | 20 PFLOPS | - | New precision |
| Memory | 192GB | 80GB | 2.4x |
| Bandwidth | 8 TB/s | 3.35 TB/s | 2.4x |
| NVLink | 1.8 TB/s | 900 GB/s | 2x |
| TDP | 1000W | 700W | 43% increase |
Blackwell Innovations
- FP4 Support: 4-bit floating point, doubling throughput over FP8 (footprint sketch after this list)
- Second-gen Transformer Engine: adds FP4/FP6 with finer-grained scaling for low-precision training and inference
- Fifth-gen NVLink and NVLink Switch: 1.8 TB/s per GPU for faster multi-GPU communication
- RAS Engine: reliability, availability, and serviceability features for datacenter deployment
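To make the FP4 point concrete, here is a quick footprint comparison for model weights at different precisions. The 180B parameter count is an illustrative assumption, and the numbers ignore scaling factors, activations, and KV cache.

```python
# Approximate weight footprint at different precisions.
def weights_gb(params_b: float, bits_per_param: int) -> float:
    return params_b * 1e9 * bits_per_param / 8 / 1e9

params = 180  # illustrative 180B-parameter model
for fmt, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{fmt}: {weights_gb(params, bits):.0f} GB")
# FP16: 360 GB (exceeds one B200's 192GB), FP8: 180 GB (barely fits),
# FP4: 90 GB (fits with room left over for KV cache)
```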
Custom Silicon Landscape
Google TPU v5p
| Metric | TPU v5p | H100 | Notes |
|---|---|---|---|
| Performance | 459 TFLOPS | 989 TFLOPS | BF16, dense |
| Memory | 95GB HBM | 80GB HBM | - |
| Memory BW | 2.76 TB/s | 3.35 TB/s | Lower |
| Interconnect | 300 GB/s | 900 GB/s | NVLink wins |
| Availability | GCP only | Widely available | - |
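To see why the interconnect row matters for training, here is a rough ring all-reduce estimate: each device sends and receives about 2 x (N-1)/N times the gradient size per step, so synchronization overhead scales inversely with link bandwidth. The 8-device group and 70B-parameter BF16 gradient size are illustrative assumptions, and real systems overlap much of this communication with compute.

```python
# Rough ring all-reduce time per step for gradient synchronization.
def allreduce_seconds(grad_gb: float, n_devices: int, link_gb_s: float) -> float:
    traffic_gb = 2 * (n_devices - 1) / n_devices * grad_gb  # GB moved per device
    return traffic_gb / link_gb_s

grad_gb = 140  # illustrative: 70B parameters, BF16 gradients (2 bytes each)
for name, bw_gb_s in [("TPU v5p ICI (table figure)", 300), ("H100 NVLink", 900)]:
    ms = allreduce_seconds(grad_gb, n_devices=8, link_gb_s=bw_gb_s) * 1000
    print(f"{name}: ~{ms:.0f} ms per full gradient all-reduce")
```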
Amazon Trainium 2
- AWS's second-generation training accelerator, aimed at large foundation models
- Designed for distributed training
- ~40% cost reduction vs H100 (see the price-performance sketch after this list)
- Limited ecosystem support
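A quick way to reason about cost claims like that one: what matters is throughput per dollar, not raw speed, so a slower chip can still come out ahead. The relative throughput and price below are illustrative assumptions, not Trainium benchmarks.

```python
# Price-performance relative to an H100 baseline.
#   relative_perf  -- training throughput per chip vs. H100 (assumed)
#   relative_price -- hourly cost per chip vs. H100 (assumed)
def relative_price_performance(relative_perf: float, relative_price: float) -> float:
    return relative_perf / relative_price

# Illustrative: a chip at 70% of H100 throughput for 50% of the price still
# delivers ~1.4x more work per dollar.
print(relative_price_performance(0.7, 0.5))  # 1.4
```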
Apple Silicon (M4 Ultra)
| Specification | M4 Ultra | H100 |
|---|---|---|
| Neural Engine | 810 TOPS | N/A |
| Unified Memory | 192GB | 80GB |
| Memory BW | 800 GB/s | 3.35 TB/s |
| Best For | On-device | Datacenter |
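The interesting trade-off in that table is capacity versus bandwidth: 192GB of unified memory means surprisingly large models fit, but at 800 GB/s the decode-speed ceiling is low. A minimal sketch, with illustrative model sizes and quantization levels:

```python
# Capacity vs. bandwidth on unified-memory hardware: large models fit, but
# single-stream decode is capped by how fast weights stream from memory.
def decode_ceiling_tokens_s(bandwidth_gb_s: float, params_b: float,
                            bytes_per_param: float) -> float:
    return bandwidth_gb_s / (params_b * bytes_per_param)  # GB/s over GB per token

print(decode_ceiling_tokens_s(800, 70, 2.0))  # 70B @ FP16: ~5.7 tokens/s ceiling
print(decode_ceiling_tokens_s(800, 8, 0.5))   # 8B @ 4-bit: ~200 tokens/s ceiling
```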
Architecture Comparison
Transformer Workloads
For LLM training/inference:
| Chip | Training | Inference | Cost Efficiency |
|---|---|---|---|
| H100 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| H200 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| B200 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| TPU v5 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Trainium | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Practical Recommendations
- For training: H100/H200/B200 with NVLink for multi-GPU scaling
- For inference: H200/B200 for long context; A100 for cost-sensitive workloads
- For budget: spot H100 instances, or Trainium for batch inference
- For research: M4 Ultra for local experimentation, cloud for scale