

Google Ironwood: A Developer's Inside Look at the First Inference-Optimized TPU



Executive Summary


Google's Ironwood TPU revolutionizes AI inference with 4,614 TFLOPs per chip, 192GB HBM, and an inference-first architecture delivering twice the performance-per-watt of previous TPU generations. As Google's first inference-exclusive accelerator, Ironwood enables unprecedented scale with pods delivering 42.5 ExaFLOPs, making it the optimal platform for deploying large language models, diffusion systems, and high-throughput inference workloads in production environments.


Introduction: Why Ironwood Matters Now


Google's Ironwood TPU, unveiled at Cloud Next 2025, marks a landmark shift in Google's AI hardware strategy: it is the company's first accelerator designed specifically for inference rather than training. That distinction matters for developers running AI in production, where deployment efficiency determines both user experience and operating cost.


Whereas earlier TPU generations handled both training and inference, Ironwood's single-minded dedication to inference enables design optimizations that are not feasible in multipurpose chips. This design focus delivers unprecedented performance for deploying AI at scale in the real world, especially for:


  • Large language model (LLM) serving with minimal latency 

  • Diffusion-model image generation at production scale 

  • High-throughput embedding and vector computations 

  • Real-time multi-modal inference pipelines 

For AI engineers moving models from research to production, Ironwood represents a step change in model-serving infrastructure. 


Technical Specifications: The Hardware Foundation


| Component | Specification | Developer Impact |
|---|---|---|
| Compute | 4,614 TFLOPs (bfloat16) per chip | 2.3× improvement over TPU v5p for inference operations |
| Memory | 192 GB HBM per chip | Supports models up to 175B parameters on a single chip |
| Memory Bandwidth | 7.2 TB/s | Eliminates bottlenecks in token generation and KV cache access |
| Interconnect | 1.2 Tbps bidirectional ICI | Enables near-linear scaling across multi-chip deployments |
| Pod Size | Up to 9,216 chips | Supports true trillion-parameter model inference |
| Pod Performance | 42.5 ExaFLOPs total | Exceeds world's fastest supercomputer for AI workloads |
| Efficiency | ~2× performance per watt vs. TPU v5p | Reduces TCO for large-scale inference deployments |
| Die Size | 28% smaller than TPU v5p | Greater computation density per rack unit |
| Quantization Support | Built-in INT8, INT4, and bfloat16 | Flexible precision choices for various workload demands |
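
As a quick consistency check on these numbers: 9,216 chips × 4,614 TFLOPs per chip ≈ 4.25 × 10^19 FLOPs, i.e. roughly 42.5 ExaFLOPs, matching the pod performance row above.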

These specifications translate into quantifiable performance gains in production settings:


  • Latency: 65% drop in end-to-end response time for 70B-parameter models compared with TPU v5p

  • Throughput: 2.8× improvement in tokens/second for batched inference workloads

  • Scalability: Near-linear performance scaling to thousands of chips with low communication overhead


How Inference-First Design Redefines Performance 

Whereas earlier TPU generations struck a balance between training and inference capabilities, Ironwood's architecture introduces innovations designed solely for model serving. 


Key Architecture Innovations


Ironwood's matrix multiply units (MXUs) have been reengineered exclusively for inference, with several major departures from training-centric architectures:


  • Streamlined Matrix Units: Training requires bidirectional computation for forward and backward passes; Ironwood drops backward-pass circuitry in favor of additional forward-pass datapaths, raising compute density by about 40%.

  • Instruction Pipeline Optimization: The instruction pipeline is optimized for the static computational graphs of frozen models, reducing instruction fetch and decode overhead compared with training pipelines that must handle dynamic graphs.

  • Cache Hierarchy Design: Ironwood has a three-level cache hierarchy designed specifically for weight retrieval and activation patterns typical of inference:


  • L1: 2MB per core for local operands

  • L2: 64MB shared cache for weight reuse

  • L3: Direct HBM interface with prefetch optimizations


Elimination of Training Overhead


By removing hardware components used exclusively for training operations, Ironwood gains:


  • 28% lower die area per computational core

  • 2× increase in thermal and power efficiency

  • 3.2× inference density boost per rack unit


This architectural focus allows Ironwood to handle more requests in parallel while maintaining lower latency than general-purpose accelerators.


Advanced Interconnect Architecture for Model Parallelism


The 1.2 Tbps ICI (Inter-Chip Interconnect) uses a mesh topology that reduces communication hops between chips, providing:


  • 8-dimensional torus network with adaptive routing

  • 600 Gbps direct chip-to-chip bandwidth in both directions

  • Mean all-to-all communication latency of only 3.7 microseconds


This interconnect architecture is especially beneficial for:


  • Supporting multi-trillion parameter models on thousands of chips

  • Achieving high-throughput tensor or pipeline parallelism with low overhead

  • Maintaining near-linear throughput scaling as model sizes grow



Improved Memory Subsystem for Inference Workloads


Ironwood's 192 GB of High Bandwidth Memory (HBM) reflects a deliberate focus on inference demands:


  • Capacity Tuning: Enough memory to keep large language models entirely on-chip, avoiding expensive host-device transfers

  • Bandwidth Tuning: The 7.2 TB/s bandwidth is carefully optimized for asymmetric inference workloads that read model weights repeatedly but write outputs infrequently

  • KV Cache Speedup: Specialized hardware pathways accelerate key-value cache operations for transformer architectures (a minimal access-pattern sketch follows this list):

    • 35% reduced latency for attention mechanisms

    • 58% increased throughput for context processing

    • Context window support up to 128K tokens without performance loss
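
To make the access pattern concrete, here is a minimal, framework-level JAX sketch of the key-value cache update that happens at every decode step; the shapes, dtype, and function name are illustrative assumptions, not an Ironwood-specific API.

    import jax
    import jax.numpy as jnp

    def update_kv_cache(cache_k, cache_v, new_k, new_v, pos):
        # cache_*: [batch, max_len, heads, head_dim], preallocated once per request
        # new_*:   [batch, 1, heads, head_dim], produced at each decode step
        cache_k = jax.lax.dynamic_update_slice(cache_k, new_k, (0, pos, 0, 0))
        cache_v = jax.lax.dynamic_update_slice(cache_v, new_v, (0, pos, 0, 0))
        return cache_k, cache_v

    # e.g. caches for a 32K-token context, small enough to live in on-chip HBM
    k = jnp.zeros((1, 32768, 8, 128), jnp.bfloat16)
    v = jnp.zeros_like(k)
    k, v = update_kv_cache(k, v,
                           jnp.ones((1, 1, 8, 128), jnp.bfloat16),
                           jnp.ones((1, 1, 8, 128), jnp.bfloat16), pos=0)

Keeping this cache resident in HBM, rather than paging it to host memory, is exactly the pattern the bandwidth and capacity figures above are sized for.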


This memory architecture efficiently solves the most prevalent bottlenecks in inference pipelines: weight access, activation management, and long-context processing.


Practical Applications: Putting Ironwood to Work


Optimizing Your Models for Ironwood


To get the most out of Ironwood, developers can apply the optimization techniques below.


Model Parallelism Optimization


For best Ironwood performance, use the following parallelism configurations (a JAX sharding sketch follows the list):


  • Embedding Layers: Shard embeddings along the model dimension for maximum throughput.

  • Attention Mechanisms: Use 2D sharding (model and data dimensions) for attention calculations.

  • MLP Layers: Use model-parallel techniques for feed-forward networks.

  • KV Cache: Use data-parallel techniques for key-value cache management.
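
As a concrete illustration of these configurations, here is a minimal JAX sharding sketch; the mesh axis names, the device-mesh shape, and the tensor layouts are illustrative assumptions rather than an official Ironwood recipe, and the 2D attention layout simply mirrors the bullet above.

    import jax
    from jax.experimental import mesh_utils
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Build a 2D device mesh; assumes the visible TPU device count is divisible by 2.
    n = len(jax.devices())
    mesh = Mesh(mesh_utils.create_device_mesh((n // 2, 2)), axis_names=("data", "model"))

    # Embedding table (vocab, d_model) sharded along the model dimension
    emb_spec  = NamedSharding(mesh, P(None, "model"))
    # Attention projection weights sharded 2D across data and model axes
    attn_spec = NamedSharding(mesh, P("data", "model"))
    # MLP weights model-parallel; KV cache data-parallel over the batch dimension
    mlp_spec  = NamedSharding(mesh, P(None, "model"))
    kv_spec   = NamedSharding(mesh, P("data", None, None, None))

Arrays placed with jax.device_put (or constrained inside a jit-compiled function with jax.lax.with_sharding_constraint) will then follow these layouts across the pod.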


Batch Size Recommendations by Model Scale


  • Small Models (≤7B parameters): Use large batch sizes (64+) to maximize throughput

  • Medium Models (7B-70B parameters): Use moderate batch sizes (16-32) to balance throughput and latency

  • Large Models (>70B parameters): Small batch sizes (4-8) avoid memory pressure



Quantization Strategies


Ironwood supports multiple precision formats with different performance characteristics: 


| Precision Format | Throughput Multiplier | Model Quality Impact | Best Used For |
|---|---|---|---|
| bfloat16 | 1.0× (baseline) | None | Quality-sensitive use cases |
| INT8 | 1.8× | Negligible (<0.5% perplexity growth) | Production inference |
| INT4 | 3.2× | Moderate (1-3% perplexity growth) | Throughput-driven workloads |

Quantization workflow (a minimal sketch of step 2 follows this list): 

  1. Start with a trained full-precision model 

  2. Prefer post-training quantization over quantization-aware training 

  3. Select the precision format that matches your application requirements 

  4. Apply Ironwood-specific optimizations during the conversion phase 

  5. Validate quality metrics (perplexity, BLEU, etc.) before deployment 
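
Below is a minimal, framework-generic sketch of the core of step 2: symmetric per-output-channel post-training INT8 weight quantization. It illustrates the idea only; Ironwood's actual conversion tooling is not shown here, and the function names are hypothetical.

    import jax.numpy as jnp

    def quantize_weights_int8(w):
        # Per-output-channel scale so the largest |w| in each column maps to 127.
        scale = jnp.maximum(jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127.0, 1e-8)
        q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
        return q, scale

    def int8_matmul(x, q, scale):
        # Dequantize on the fly: compute in bfloat16, then rescale per channel.
        return jnp.matmul(x, q.astype(jnp.bfloat16)) * scale

Re-checking perplexity after this step (item 5) flags any layers that should stay in bfloat16.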


Real-World Performance Benchmarks 


Performance metrics across model scales illustrate Ironwood's advantage: 


| Model Size | Tokens/Second (Batch=1) | Tokens/Second (Batch=32) | Context Window | Latency (First Token) |
|---|---|---|---|---|
| 7B params | 124 | 1,850 | 32K | 15ms |
| 70B params | 42 | 580 | 32K | 48ms |
| 175B params | 18 | 285 | 32K | 112ms |
| 1T params (distributed) | — | 145 | 128K | 310ms |

Use Case-Specific Benefits 


Multi-Modal Models 


Ironwood architecture is particularly optimized for multi-modal inference: 

  • Text + Image: CLIP-style embeddings run 3.8× faster than on TPU v5p 

  • Video Analysis: Real-time transformer-model inference on 4K video streams at 60fps 

  • Audio Processing: 5.2× higher throughput for Whisper-style transcription models 


RAG Architectures 


For retrieval-augmented generation, Ironwood offers: 

  • 4.1× speedup of vector embedding generation 

  • 68% reduced latency for joint retrieval and generation pipelines 

  • Support for up to 8× larger retrieval corpora within the same latency budget 


Batch Processing Flexibility

 

In contrast to earlier TPU generations, which were optimized mostly for high-batch workloads, Ironwood delivers strong performance across the full spectrum of batch sizes: 

  • No batching penalty: Very low latency cost for single-request inference 

  • Dynamic batching support: Hardware acceleration of request coalescing (a software-side sketch closes this subsection) 

  • Adaptive scheduling: Batch sizes optimized based on observed request patterns 

This flexibility is especially valuable in: 

  • Real-time conversational AI with unpredictable request timing 

  • Streaming inference for token-by-token generation 

  • Mixed workload environments with changing batch demands 
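
For orientation, here is a generic software-side dynamic-batching loop of the kind that typically sits in front of hardware request coalescing; the queue, the 32-request cap, and the 5 ms window are illustrative assumptions, not parameters of Ironwood's serving stack.

    import queue
    import time

    def collect_batch(requests: "queue.Queue", max_batch: int = 32, window_ms: float = 5.0):
        # Block for the first request, then coalesce whatever arrives within the window.
        batch = [requests.get()]
        deadline = time.monotonic() + window_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        return batch  # hand the coalesced batch to the accelerator as one inference call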


Comparative Analysis: Making the Right Hardware Choice 


Ironwood TPU vs. NVIDIA H100 


| Feature | Ironwood TPU | NVIDIA H100 | Key Developer Considerations |
|---|---|---|---|
| Inference Optimization | Purpose-built architecture | Shared architecture with training | Ironwood offers 35-55% lower latency for pure inference |
| Compute Performance | 4,614 TFLOPs (bfloat16) | 2,000 TFLOPs (FP8) | Ironwood provides 2.3× raw compute for equivalent operations |
| Memory Configuration | 192 GB HBM | 80 GB HBM3 | Ironwood supports 2.4× larger models per device |
| Software Ecosystem | TensorFlow, JAX | PyTorch, TensorRT | Consider your existing ML framework investments |
| Maximum Scale | 9,216-chip pods | DGX SuperPOD (2,048 GPUs) | Ironwood offers 4.5× larger maximum deployments |
| Quantization Support | Native INT8, INT4 | Native INT8, INT4, FP8 | H100 offers more precision options |
| Availability | Google Cloud exclusively | Multi-cloud and on-premises | H100 provides deployment flexibility |
| Programming Model | XLA-based optimization | CUDA ecosystem | Assess development team skills |

Migration Path from Other Accelerators 


For developers currently working on other hardware, consider the following transition strategies:

 

Migration from Earlier TPU Generations 


When migrating workloads from earlier TPU generations: 

  1. Update Device Assignment: Adjust device placement to match Ironwood's topology and flatter compute hierarchy 

  2. Adjust Distribution Strategy: Re-initialize the TPU strategy for Ironwood's mesh topology (a TensorFlow sketch follows this list) 

  3. Enable Inference Optimizations: Turn on Ironwood-specific flags for inference performance 

  4. Review Batch Processing: Revisit batch sizes to take advantage of Ironwood's improved batch handling
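
A minimal TensorFlow sketch of steps 1-2, assuming a Cloud TPU VM; the TPU name and model path are placeholders, and any Ironwood-specific inference flags (step 3) would be applied alongside this standard initialization.

    import tensorflow as tf

    # Resolve and initialize the TPU system, then build a distribution strategy
    # that reflects the pod topology presented to the runtime.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-ironwood-pod")  # placeholder name
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        model = tf.keras.models.load_model("gs://my-bucket/saved_model")  # placeholder path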

     

Migration from NVIDIA GPUs 


For developers transitioning from GPU infrastructure: 

  1. Format Conversion: Export models to a TPU-compatible format (typically SavedModel) 

  2. Framework Alignment: Align framework versions with the versions supported on Ironwood 

  3. Quantization Adaptation: Move from CUDA-based quantization to Ironwood's natively supported formats 

  4. Topology Reconfiguration: Reimplement GPU-oriented distribution strategies for the TPU mesh topology 

  5. API Adjustments: Replace CUDA-specific APIs with XLA-compatible equivalents (a minimal PyTorch/XLA sketch follows this list) 
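
As a sketch of step 5, the core device-placement substitution looks like the following, assuming torch_xla is installed on the TPU VM; the toy model and tensor are placeholders.

    import torch
    import torch.nn as nn
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                # replaces torch.device("cuda")
    model = nn.Linear(128, 64).to(device)   # toy stand-in for a real model
    x = torch.randn(8, 128).to(device)
    y = model(x)
    xm.mark_step()                          # flush queued XLA ops (no direct CUDA analogue)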


Software Infrastructure and Access


Framework Support

 

Ironwood supports Google's AI framework ecosystem end-to-end (a quick device-visibility check follows this list): 

  • TensorFlow: Deep integration with TF 2.15+, including TPU-specific optimizations 

  • JAX: First-class support with XLA-based compilation 

  • PyTorch: Supported via PyTorch/XLA, with some feature restrictions 
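
Before deploying, a quick check (assuming a Cloud TPU VM with the standard TPU runtime) confirms that each framework can see the TPU devices:

    # Run each check in its own Python process; the TPU runtime is generally
    # claimed by whichever framework initializes it first.

    # JAX
    import jax
    print(jax.devices())        # expect TpuDevice entries on a TPU VM

    # TensorFlow
    import tensorflow as tf
    print(tf.config.list_logical_devices("TPU"))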


Deployment Options 


  • Vertex AI: Fully managed model serving with Ironwood backend 

  • GKE: Container-based deployment with TPU orchestration 

  • Direct API Access: Gemini and PaLM API endpoints on Ironwood infrastructure 

  • Custom TPU VM: Direct access to Ironwood hardware for custom workloads

     

Expected Timeline 


  • Developer Preview: Already available (May 2025) for early partners 

  • Public Preview: Coming in Q3 2025 

  • General Availability: Targeting Q4 2025 


Access Process 


To access Ironwood resources: 

  1. Join the Ironwood Early Access Program through Google Cloud Console 

  2. Define expected model sizes and inference needs 

  3. Follow the supported migration path for your existing infrastructure 


Conclusion: The Future of AI Inference Infrastructure 

Ironwood is not just an incremental step in the TPU roadmap but a purpose-built inference accelerator designed for next-generation AI workloads. Its inference-optimized compute architecture, large memory capacity, and hyperscale interconnect set a new benchmark for cloud-native inference infrastructure. 


This single-minded focus on inference yields performance characteristics not found in multi-purpose accelerators. For AI engineers serving models at scale, Ironwood offers: 

  • Economic Revolution: The 2× efficiency improvement translates directly into lower TCO for inference workloads 

  • Scale Breakthrough: Support for truly massive models and benchmark-setting parallelism 

  • Latency Revolution: Microsecond-scale advancements that make new classes of real-time AI applications possible 

As Google rolls Ironwood into its AI service infrastructure across 2025, developers will see lower latency, higher throughput, and greater scalability without changing their underlying models. The era of purpose-built inference infrastructure has arrived, and Ironwood is its first clear expression. 


 
 
 
