

Google Ironwood: A Developer's Inside Look at the First Inference-Optimized TPU



Executive Summary


Google's Ironwood TPU revolutionizes AI inference with 4,614 TFLOPs per chip, 192GB HBM, and an inference-first architecture delivering twice the performance-per-watt of previous TPU generations. As Google's first inference-exclusive accelerator, Ironwood enables unprecedented scale with pods delivering 42.5 ExaFLOPs, making it the optimal platform for deploying large language models, diffusion systems, and high-throughput inference workloads in production environments.


Introduction: Why Ironwood Matters Now


Google's Ironwood TPU, unveiled at Cloud Next 2025, marks a landmark shift in Google's AI hardware strategy: it is the company's first accelerator designed specifically for inference rather than training. That distinction matters for developers running AI in production, where deployment efficiency determines both user experience and operating cost.


Whereas earlier TPU generations handled both training and inference, Ironwood's single-minded dedication to inference enables design optimizations that are not feasible in multipurpose chips. This design focus delivers unprecedented performance for deploying AI at scale in the real world, especially for:


  • Large language model (LLM) serving with minimal latency 

  • Diffusion-model image generation at production scale 

  • High-throughput embedding and vector computations 

  • Real-time multi-modal inference pipelines 

For AI engineers moving models from research to production, Ironwood represents a step change in model-serving infrastructure. 


Technical Specifications: The Hardware Foundation


| Component | Specification | Developer Impact |
|---|---|---|
| Compute | 4,614 TFLOPs (bfloat16) per chip | 2.3× improvement over TPU v5p for inference operations |
| Memory | 192 GB HBM per chip | Supports models up to 175B parameters on a single chip |
| Memory Bandwidth | 7.2 TB/s | Eliminates bottlenecks in token generation and KV cache access |
| Interconnect | 1.2 Tbps bidirectional ICI | Enables near-linear scaling across multi-chip deployments |
| Pod Size | Up to 9,216 chips | Supports true trillion-parameter model inference |
| Pod Performance | 42.5 ExaFLOPs total | Exceeds world's fastest supercomputer for AI workloads |
| Efficiency | ~2× performance per watt vs. TPU v5p | Reduces TCO for large-scale inference deployments |
| Die Size | 28% smaller than TPU v5p | Greater computation density per rack unit |
| Quantization Support | Built-in INT8, INT4, and bfloat16 | Flexible precision choices for various workload demands |
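
As a quick consistency check on these numbers: 9,216 chips × 4,614 TFLOPs per chip ≈ 4.25 × 10^19 FLOPs, i.e. roughly 42.5 ExaFLOPs, matching the pod performance row above.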

These specifications translate into quantifiable performance gains in production settings:


  • Latency: 65% drop in end-to-end response time for 70B-parameter models compared with TPU v5p

  • Throughput: 2.8× improvement in tokens/second for batched inference workloads

  • Scalability: Near-linear performance scaling to thousands of chips with low communication overhead


How Inference-First Design Redefines Performance 

Whereas earlier TPU generations struck a balance between training and inference capabilities, Ironwood's architecture introduces innovations designed solely for model serving. 


Key Architecture Innovations


Ironwood's matrix multiply units (MXUs) have been reengineered exclusively for inference, with several major departures from training-centric architectures:


  • Streamlined Matrix Units: Training requires bidirectional computation for forward and backward passes; Ironwood drops backward-pass circuitry in favor of additional forward-pass datapaths, raising compute density by about 40%.

  • Instruction Pipeline Optimization: The instruction pipeline is optimized for the static computational graphs of frozen models, reducing instruction fetch and decode overhead compared with training pipelines that must handle dynamic graphs.

  • Cache Hierarchy Design: Ironwood has a three-level cache hierarchy designed specifically for weight retrieval and activation patterns typical of inference:


  • L1: 2MB per core for local operands

  • L2: 64MB shared cache for weight reuse

  • L3: Direct HBM interface with prefetch optimizations


Elimination of Training Overhead


By removing hardware components used exclusively for training operations, Ironwood gains:


  • 28% lower die area per computational core

  • 2× increase in thermal and power efficiency

  • 3.2× inference density boost per rack unit


This architectural focus allows Ironwood to handle more requests in parallel while maintaining lower latency than general-purpose accelerators.


Advanced Interconnect Architecture for Model Parallelism


The 1.2 Tbps ICI (Inter-Chip Interconnect) uses a mesh topology that reduces communication hops between chips, providing:


  • 8-dimensional torus network with adaptive routing

  • 600 Gbps direct chip-to-chip bandwidth in both directions

  • Mean all-to-all communication latency of only 3.7 microseconds


This interconnect architecture is especially beneficial for:


  • Supporting multi-trillion parameter models on thousands of chips

  • Achieving high-throughput tensor or pipeline parallelism with low overhead

  • Maintaining near-linear throughput scaling as model sizes grow



Improved Memory Subsystem for Inference Workloads


Ironwood's 192 GB of High Bandwidth Memory (HBM) reflects a deliberate focus on inference demands:


  • Capacity Tuning: Enough memory to keep large language models entirely on-chip, avoiding expensive host-device transfers

  • Bandwidth Tuning: The 7.2 TB/s bandwidth is carefully optimized for asymmetric inference workloads that read model weights repeatedly but write outputs infrequently

  • KV Cache Speedup: Specialized hardware pathways accelerate key-value cache operations for transformer architectures (a minimal access-pattern sketch follows this list):

    • 35% reduced latency for attention mechanisms

    • 58% increased throughput for context processing

    • Context window support up to 128K tokens without performance loss
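
To make the access pattern concrete, here is a minimal, framework-level JAX sketch of the key-value cache update that happens at every decode step; the shapes, dtype, and function name are illustrative assumptions, not an Ironwood-specific API.

    import jax
    import jax.numpy as jnp

    def update_kv_cache(cache_k, cache_v, new_k, new_v, pos):
        # cache_*: [batch, max_len, heads, head_dim], preallocated once per request
        # new_*:   [batch, 1, heads, head_dim], produced at each decode step
        cache_k = jax.lax.dynamic_update_slice(cache_k, new_k, (0, pos, 0, 0))
        cache_v = jax.lax.dynamic_update_slice(cache_v, new_v, (0, pos, 0, 0))
        return cache_k, cache_v

    # e.g. caches for a 32K-token context, small enough to live in on-chip HBM
    k = jnp.zeros((1, 32768, 8, 128), jnp.bfloat16)
    v = jnp.zeros_like(k)
    k, v = update_kv_cache(k, v,
                           jnp.ones((1, 1, 8, 128), jnp.bfloat16),
                           jnp.ones((1, 1, 8, 128), jnp.bfloat16), pos=0)

Keeping this cache resident in HBM, rather than paging it to host memory, is exactly the pattern the bandwidth and capacity figures above are sized for.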


This memory architecture efficiently solves the most prevalent bottlenecks in inference pipelines: weight access, activation management, and long-context processing.


Practical Applications: Putting Ironwood to Work


Optimizing Your Models for Ironwood


To get the most out of Ironwood, developers can apply the optimization techniques below.


Model Parallelism Optimization


For best Ironwood performance, use the following parallelism configurations (a JAX sharding sketch follows the list):


  • Embedding Layers: Shard embeddings along the model dimension for maximum throughput.

  • Attention Mechanisms: Use 2D sharding (model and data dimensions) for attention calculations.

  • MLP Layers: Use model-parallel techniques for feed-forward networks.

  • KV Cache: Use data-parallel techniques for key-value cache management.
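
As a concrete illustration of these configurations, here is a minimal JAX sharding sketch; the mesh axis names, the device-mesh shape, and the tensor layouts are illustrative assumptions rather than an official Ironwood recipe, and the 2D attention layout simply mirrors the bullet above.

    import jax
    from jax.experimental import mesh_utils
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Build a 2D device mesh; assumes the visible TPU device count is divisible by 2.
    n = len(jax.devices())
    mesh = Mesh(mesh_utils.create_device_mesh((n // 2, 2)), axis_names=("data", "model"))

    # Embedding table (vocab, d_model) sharded along the model dimension
    emb_spec  = NamedSharding(mesh, P(None, "model"))
    # Attention projection weights sharded 2D across data and model axes
    attn_spec = NamedSharding(mesh, P("data", "model"))
    # MLP weights model-parallel; KV cache data-parallel over the batch dimension
    mlp_spec  = NamedSharding(mesh, P(None, "model"))
    kv_spec   = NamedSharding(mesh, P("data", None, None, None))

Arrays placed with jax.device_put (or constrained inside a jit-compiled function with jax.lax.with_sharding_constraint) will then follow these layouts across the pod.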


Batch Size Recommendations by Model Scale


  • Small Models (≤7B parameters): Use large batch sizes (64+) to maximize throughput

  • Medium Models (7B-70B parameters): Use moderate batch sizes (16-32) to balance throughput and latency

  • Large Models (>70B parameters): Small batch sizes (4-8) avoid memory pressure



Quantization Strategies


Ironwood supports multiple precision formats with different performance characteristics: 


| Precision Format | Throughput Multiplier | Model Quality Impact | Best Used For |
|---|---|---|---|
| bfloat16 | 1.0× (baseline) | None | Quality-sensitive use cases |
| INT8 | 1.8× | Negligible (<0.5% perplexity growth) | Production inference |
| INT4 | 3.2× | Moderate (1-3% perplexity growth) | Throughput-driven workloads |

Quantization workflow (a minimal sketch of step 2 follows this list): 

  1. Start with a trained full-precision model 

  2. Prefer post-training quantization over quantization-aware training 

  3. Select the precision format that matches your application requirements 

  4. Apply Ironwood-specific optimizations during the conversion phase 

  5. Validate quality metrics (perplexity, BLEU, etc.) before deployment 
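
Below is a minimal, framework-generic sketch of the core of step 2: symmetric per-output-channel post-training INT8 weight quantization. It illustrates the idea only; Ironwood's actual conversion tooling is not shown here, and the function names are hypothetical.

    import jax.numpy as jnp

    def quantize_weights_int8(w):
        # Per-output-channel scale so the largest |w| in each column maps to 127.
        scale = jnp.maximum(jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127.0, 1e-8)
        q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
        return q, scale

    def int8_matmul(x, q, scale):
        # Dequantize on the fly: compute in bfloat16, then rescale per channel.
        return jnp.matmul(x, q.astype(jnp.bfloat16)) * scale

Re-checking perplexity after this step (item 5) flags any layers that should stay in bfloat16.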


Real-World Performance Benchmarks 


Performance metrics across model scales illustrate Ironwood's advantage: 


| Model Size | Tokens/Second (Batch=1) | Tokens/Second (Batch=32) | Context Window | Latency (First Token) |
|---|---|---|---|---|
| 7B params | 124 | 1,850 | 32K | 15ms |
| 70B params | 42 | 580 | 32K | 48ms |
| 175B params | 18 | 285 | 32K | 112ms |
| 1T params (distributed) | — | 145 | 128K | 310ms |

Use Case-Specific Benefits 


Multi-Modal Models 


Ironwood architecture is particularly optimized for multi-modal inference: 

  • Text + Image: CLIP-style embeddings run 3.8× faster than on TPU v5p 

  • Video Analysis: Real-time transformer-model inference on 4K video streams at 60fps 

  • Audio Processing: 5.2× higher throughput for Whisper-style transcription models 


RAG Architectures 


For retrieval-augmented generation, Ironwood offers: 

  • 4.1× speedup of vector embedding generation 

  • 68% reduced latency for joint retrieval and generation pipelines 

  • Support for up to 8× larger retrieval corpora within the same latency budget 


Batch Processing Flexibility

 

In contrast to earlier TPU generations, which were optimized mostly for high-batch workloads, Ironwood delivers strong performance across the full spectrum of batch sizes: 

  • No batching penalty: Very low latency cost for single-request inference 

  • Dynamic batching support: Hardware acceleration of request coalescing (a software-side sketch closes this subsection) 

  • Adaptive scheduling: Batch sizes optimized based on observed request patterns 

This flexibility is especially valuable in: 

  • Real-time conversational AI with unpredictable request timing 

  • Streaming inference for token-by-token generation 

  • Mixed workload environments with changing batch demands 
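
For orientation, here is a generic software-side dynamic-batching loop of the kind that typically sits in front of hardware request coalescing; the queue, the 32-request cap, and the 5 ms window are illustrative assumptions, not parameters of Ironwood's serving stack.

    import queue
    import time

    def collect_batch(requests: "queue.Queue", max_batch: int = 32, window_ms: float = 5.0):
        # Block for the first request, then coalesce whatever arrives within the window.
        batch = [requests.get()]
        deadline = time.monotonic() + window_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        return batch  # hand the coalesced batch to the accelerator as one inference call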


Comparative Analysis: Making the Right Hardware Choice 


Ironwood TPU vs. NVIDIA H100 


| Feature | Ironwood TPU | NVIDIA H100 | Key Developer Considerations |
|---|---|---|---|
| Inference Optimization | Purpose-built architecture | Shared architecture with training | Ironwood offers 35-55% lower latency for pure inference |
| Compute Performance | 4,614 TFLOPs (bfloat16) | 2,000 TFLOPs (FP8) | Ironwood provides 2.3× raw compute for equivalent operations |
| Memory Configuration | 192 GB HBM | 80 GB HBM3 | Ironwood supports 2.4× larger models per device |
| Software Ecosystem | TensorFlow, JAX | PyTorch, TensorRT | Consider your existing ML framework investments |
| Maximum Scale | 9,216-chip pods | DGX SuperPOD (2,048 GPUs) | Ironwood offers 4.5× larger maximum deployments |
| Quantization Support | Native INT8, INT4 | Native INT8, INT4, FP8 | H100 offers more precision options |
| Availability | Google Cloud exclusively | Multi-cloud and on-premises | H100 provides deployment flexibility |
| Programming Model | XLA-based optimization | CUDA ecosystem | Assess development team skills |

Migration Path from Other Accelerators 


For developers currently working on other hardware, consider the following transition strategies:

 

Migration from Earlier TPU Generations 


When migrating workloads from earlier TPU generations: 

  1. Update Device Assignment: Adjust device placement to match Ironwood's topology and flatter compute hierarchy 

  2. Adjust Distribution Strategy: Re-initialize the TPU strategy for Ironwood's mesh topology (a TensorFlow sketch follows this list) 

  3. Enable Inference Optimizations: Turn on Ironwood-specific flags for inference performance 

  4. Review Batch Processing: Revisit batch sizes to take advantage of Ironwood's improved batch handling
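
A minimal TensorFlow sketch of steps 1-2, assuming a Cloud TPU VM; the TPU name and model path are placeholders, and any Ironwood-specific inference flags (step 3) would be applied alongside this standard initialization.

    import tensorflow as tf

    # Resolve and initialize the TPU system, then build a distribution strategy
    # that reflects the pod topology presented to the runtime.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-ironwood-pod")  # placeholder name
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        model = tf.keras.models.load_model("gs://my-bucket/saved_model")  # placeholder path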

     

Migration from NVIDIA GPUs 


For developers transitioning from GPU infrastructure: 

  1. Format Conversion: Export models to a TPU-compatible format (typically SavedModel) 

  2. Framework Alignment: Align framework versions with the versions supported on Ironwood 

  3. Quantization Adaptation: Move from CUDA-based quantization to Ironwood's natively supported formats 

  4. Topology Reconfiguration: Reimplement GPU-oriented distribution strategies for the TPU mesh topology 

  5. API Adjustments: Replace CUDA-specific APIs with XLA-compatible equivalents (a minimal PyTorch/XLA sketch follows this list) 
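
As a sketch of step 5, the core device-placement substitution looks like the following, assuming torch_xla is installed on the TPU VM; the toy model and tensor are placeholders.

    import torch
    import torch.nn as nn
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                # replaces torch.device("cuda")
    model = nn.Linear(128, 64).to(device)   # toy stand-in for a real model
    x = torch.randn(8, 128).to(device)
    y = model(x)
    xm.mark_step()                          # flush queued XLA ops (no direct CUDA analogue)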


Software Infrastructure and Access


Framework Support

 

Ironwood supports Google's AI framework ecosystem end-to-end (a quick device-visibility check follows this list): 

  • TensorFlow: Deep integration with TF 2.15+, including TPU-specific optimizations 

  • JAX: First-class support with XLA-based compilation 

  • PyTorch: Supported via PyTorch/XLA, with some feature restrictions 
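
Before deploying, a quick check (assuming a Cloud TPU VM with the standard TPU runtime) confirms that each framework can see the TPU devices:

    # Run each check in its own Python process; the TPU runtime is generally
    # claimed by whichever framework initializes it first.

    # JAX
    import jax
    print(jax.devices())        # expect TpuDevice entries on a TPU VM

    # TensorFlow
    import tensorflow as tf
    print(tf.config.list_logical_devices("TPU"))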


Deployment Options 


  • Vertex AI: Fully managed model serving with Ironwood backend 

  • GKE: Container-based deployment with TPU orchestration 

  • Direct API Access: Gemini and PaLM API endpoints on Ironwood infrastructure 

  • Custom TPU VM: Direct access to Ironwood hardware for custom workloads

     

Expected Timeline 


  • Developer Preview: Already available (May 2025) for early partners 

  • Public Preview: Coming in Q3 2025 

  • General Availability: Targeting Q4 2025 


Access Process 


To access Ironwood resources: 

  1. Join the Ironwood Early Access Program through Google Cloud Console 

  2. Define expected model sizes and inference needs 

  3. Follow the supported migration path for your existing infrastructure 


Conclusion: The Future of AI Inference Infrastructure 

Ironwood is not just an incremental step in the TPU roadmap but a purpose-built inference accelerator designed for next-generation AI workloads. Its inference-optimized compute architecture, large memory capacity, and hyperscale interconnect set a new benchmark for cloud-native inference infrastructure. 


This single-minded focus on inference yields performance characteristics not found in multi-purpose accelerators. For AI engineers serving models at scale, Ironwood offers: 

  • Economic Revolution: The 2× efficiency improvement translates directly into lower TCO for inference workloads 

  • Scale Breakthrough: Support for truly massive models and benchmark-setting parallelism 

  • Latency Revolution: Microsecond-scale advancements that make new classes of real-time AI applications possible 

As Google rolls Ironwood into its AI service infrastructure across 2025, developers will see lower latency, higher throughput, and greater scalability without changing their underlying models. The era of purpose-built inference infrastructure has arrived, and Ironwood is its first clear expression. 


 
 
 
