TBPN

The AI Infrastructure Stack Nobody Talks About: Storage, Networking, Power, and Cooling

Beyond GPUs and models, discover the full AI infrastructure stack — power, cooling, networking, and storage — that actually makes artificial intelligence work at scale.


Open any tech publication and you will find wall-to-wall coverage of the latest foundation model, the newest GPU architecture, or the funding round that just broke a record. What you will almost never find is a serious discussion of the infrastructure that makes all of it actually run. That is a problem — because when AI deployments fail at scale, the root cause is almost never the model. It is the power delivery, the cooling system, the network fabric, or the storage layer underneath.

On the TBPN daily show, John Coogan and Jordi Hays have repeatedly made the point that understanding the full stack is not optional for anyone building in AI. The companies that win will be the ones that treat infrastructure as a first-class engineering discipline, not an afterthought you hand to a facilities team. This post breaks down every layer of the AI infrastructure stack that the mainstream press ignores — and explains why each one can make or break your AI ambitions.

Layer 1: Power — The Foundation of Everything

Before a single tensor is computed, someone has to deliver an enormous amount of electricity to a building. The scale of power consumption in modern AI data centers is difficult to overstate.

The Numbers That Matter

A single NVIDIA DGX GB200 NVL72 rack can draw upward of 120 kilowatts. A training cluster with a few hundred of these racks easily pushes into the tens of megawatts. The largest AI data centers now being planned or constructed target 1 to 2 gigawatts of total capacity — roughly the output of a nuclear power plant dedicated entirely to computation.

  • Power Purchase Agreements (PPAs): AI companies are signing 10- to 15-year contracts to lock in electricity at 2 to 4 cents per kilowatt-hour. Microsoft, Google, and Amazon have all announced nuclear PPAs.
  • Uninterruptible Power Supply (UPS): Every watt entering the facility passes through UPS systems — battery banks or rotary UPS units — that protect against grid fluctuations. A 100 MW data center might have 20 MW of battery backup alone.
  • Power Distribution Units (PDUs): Transformers step voltage down from utility-grade (138 kV or 69 kV) to the 480 V or 208 V that PDUs distribute to racks. Each conversion step introduces losses, typically 3 to 8 percent cumulatively.
  • Power Usage Effectiveness (PUE): The industry-standard metric. A PUE of 1.0 means every watt goes to compute. Real-world AI facilities target 1.1 to 1.2, meaning 10 to 20 percent overhead for cooling and other systems.
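The relationship between PUE and usable IT power is simple enough to sketch. A minimal example in Python, using a 100 MW facility at a PUE of 1.15 as illustrative inputs (not figures from any specific site):

```python
def it_load(total_facility_mw, pue):
    """PUE = total facility power / IT power, so IT load = total / PUE."""
    return total_facility_mw / pue

def overhead(total_facility_mw, pue):
    """Power spent on cooling and other non-compute systems."""
    return total_facility_mw - it_load(total_facility_mw, pue)

# A 100 MW facility at PUE 1.15: ~87 MW reaches compute, ~13 MW is overhead.
print(round(it_load(100, 1.15), 1), round(overhead(100, 1.15), 1))
```

At a PUE of 1.5, common in older enterprise facilities, the same 100 MW facility would deliver only about 67 MW to compute, which is why operators chase every tenth of a point.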

Key Companies in Power Infrastructure

Vertiv, Eaton, and Schneider Electric dominate power distribution and UPS hardware. On the generation side, companies like Oklo (micro-nuclear), Fervo Energy (geothermal), and traditional utilities like Dominion Energy and Duke Energy are racing to serve data center demand. The bottleneck is not technology — it is permitting and interconnection queues, which can stretch three to five years.

Cost Breakdown

Power typically represents 30 to 40 percent of the total cost of ownership (TCO) for an AI data center over its lifetime. For a 100 MW facility, annual electricity costs alone can exceed $50 million at market rates. This is why location selection — proximity to cheap, reliable power — is the single most important decision in data center development.
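The arithmetic behind that figure is worth making explicit. A quick sketch, assuming a 100 MW IT load, a PUE of 1.15, and an illustrative 5 cents per kilowatt-hour:

```python
def annual_power_cost_usd(it_load_mw, pue, price_per_kwh):
    """Annual electricity cost: facility draw (IT load * PUE) times hours times rate."""
    facility_mw = it_load_mw * pue
    kwh_per_year = facility_mw * 1000 * 8760  # MW -> kW, 8760 hours per year
    return kwh_per_year * price_per_kwh

cost = annual_power_cost_usd(100, 1.15, 0.05)
print(f"${cost / 1e6:.0f}M per year")  # roughly $50M
```

Shave one cent per kilowatt-hour off the rate and the same facility saves about $10 million per year, which is the entire case for long-term PPAs.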

Layer 2: Cooling — The Silent Constraint

Every watt of electricity consumed by a GPU becomes heat that must be removed. At traditional data center densities of 5 to 10 kW per rack, air cooling works fine. At AI densities of 40 to 120+ kW per rack, air cooling is physically insufficient. This is driving a revolution in thermal management.

Why Air Cooling Cannot Keep Up

Air has a low specific heat capacity compared to liquids. To remove 100 kW of heat from a rack using air, you need enormous volumes of airflow — fans running at full speed, raised floors, hot-aisle/cold-aisle containment. The energy cost of moving that much air erodes your PUE. More critically, air simply cannot maintain safe operating temperatures for densely packed GPU clusters. Hotspots develop, chips throttle, and training runs fail.
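The physics here is just Q = ṁ · cp · ΔT. A rough comparison of the coolant flow needed to carry 100 kW at a 15 °C temperature rise, using textbook properties for air and water (actual values vary with temperature and pressure):

```python
def coolant_flow_m3_s(heat_w, cp_j_per_kg_k, density_kg_m3, delta_t_k):
    """Volumetric flow needed to carry heat_w at a given temperature rise.
    Q = m_dot * cp * dT  ->  m_dot = Q / (cp * dT), then divide by density."""
    m_dot = heat_w / (cp_j_per_kg_k * delta_t_k)
    return m_dot / density_kg_m3

air = coolant_flow_m3_s(100_000, 1005, 1.2, 15)    # air: cp ~1005 J/(kg*K)
water = coolant_flow_m3_s(100_000, 4186, 998, 15)  # water: cp ~4186 J/(kg*K)
print(round(air, 2), "m^3/s of air vs", round(water * 1000, 2), "L/s of water")
```

Moving roughly 5.5 cubic meters of air per second through a single rack, continuously, is what makes air cooling untenable at these densities; the equivalent water flow fits in a garden hose.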

The Liquid Cooling Revolution

Three approaches are emerging, each with different trade-offs:

  1. Rear-Door Heat Exchangers (RDHx): A coil of chilled water mounted on the back of a server rack. Hot exhaust air passes through the coil and is cooled before it recirculates. Relatively easy to retrofit into existing facilities. Handles densities up to about 40 to 50 kW per rack.
  2. Direct-to-Chip (Cold Plate) Cooling: Cold plates attached directly to GPU and CPU packages circulate coolant (water or specialized fluid) within millimeters of the heat source. This is what NVIDIA recommends for GB200 deployments. Handles 80 to 120+ kW per rack. Requires plumbing throughout the data hall — a significant infrastructure investment.
  3. Immersion Cooling: Entire servers are submerged in a dielectric fluid (typically engineered by companies like 3M or GRC). The fluid absorbs heat directly from all components. Handles the highest densities and eliminates fans entirely. The challenge: maintenance requires extracting servers from fluid, and the supply chain for dielectric coolant is still maturing.

Key Companies in Cooling

CoolIT Systems and Motivair lead in direct-to-chip solutions. GRC (Green Revolution Cooling) and LiquidCool Solutions are prominent in immersion. Vertiv and Schneider also offer liquid cooling product lines. On the facility side, cooling towers and chillers from companies like Trane and Carrier remain critical for rejecting heat to the atmosphere.

The Water Problem

Liquid cooling and evaporative cooling towers consume significant amounts of water. A large data center can use millions of gallons per day. In water-stressed regions, this creates political and environmental friction. Some operators are moving to closed-loop dry coolers or air-cooled chillers to reduce water dependency, but these are less efficient in hot climates.

Layer 3: Networking — The Invisible Bottleneck

The fastest GPU in the world is useless if it cannot communicate with its neighbors quickly enough. In large-scale AI training, network performance often determines overall training speed more than raw compute throughput.

InfiniBand vs. Ethernet at Scale

For years, InfiniBand — developed by Mellanox (now part of NVIDIA) — was the undisputed champion for high-performance computing interconnects. It offers ultra-low latency (sub-microsecond), high bandwidth (400 Gbps per port, moving to 800 Gbps), and sophisticated congestion management through credit-based flow control.

Ethernet, traditionally a lossy protocol designed for general-purpose networking, has been fighting to close the gap. Technologies like RDMA over Converged Ethernet (RoCE v2), Priority Flow Control (PFC), and Ultra Ethernet Consortium standards are making Ethernet viable for AI training at scale. The advantage of Ethernet: it is cheaper, more widely available, and supported by a broader ecosystem of vendors.

  • NVIDIA's position: Heavily promotes InfiniBand through its Quantum switches, bundled with DGX systems. Lock-in is real — if you buy the full NVIDIA stack, you get InfiniBand.
  • Broadcom's position: The leading merchant silicon provider for Ethernet switches, now pushing hard on 800G Ethernet and AI-optimized switch ASICs (Jericho3-AI, Ramon3).
  • Arista Networks: Builds high-performance Ethernet switches for hyperscalers and AI clusters. Their 7800R series is designed specifically for AI back-end networks.

Why Network Topology Matters

In distributed AI training, GPUs must synchronize gradients after every forward and backward pass. The all-reduce communication pattern means every GPU needs to exchange data with every other GPU. Network topology — fat-tree, dragonfly, rail-optimized — determines how efficiently this happens.
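The traffic volume involved is easy to underestimate. A sketch of per-GPU data movement for a ring all-reduce, using an illustrative 70-billion-parameter model with fp16 gradients (real systems shard gradients and overlap communication with compute, so treat this as an upper-bound intuition):

```python
def ring_allreduce_bytes_per_gpu(num_gpus, grad_bytes):
    """Data each GPU sends (and receives) in one ring all-reduce:
    2 * (N - 1) / N * message_size."""
    return 2 * (num_gpus - 1) / num_gpus * grad_bytes

params = 70e9
grad_bytes = params * 2  # fp16 gradients, 2 bytes per parameter (illustrative)
per_gpu = ring_allreduce_bytes_per_gpu(1024, grad_bytes)
print(f"{per_gpu / 1e9:.0f} GB moved per GPU per synchronization")
```

At hundreds of gigabytes per synchronization, even a 400 Gbps link spends seconds per step on communication unless the topology and collective algorithm hide it behind compute.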

A poorly designed network fabric can leave GPUs idle for 30 to 50 percent of the training cycle, waiting for data. A well-designed fabric keeps GPU utilization above 90 percent. The difference in training cost is enormous — potentially tens of millions of dollars for a single large model training run.
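The cost claim follows directly from the utilization gap. A back-of-envelope comparison with illustrative assumptions (10,000 GPUs, a run needing 30 days of pure compute, $2 per GPU-hour):

```python
def training_cost_usd(gpus, compute_days, utilization, usd_per_gpu_hour):
    """Wall-clock time stretches as utilization drops; cost stretches with it."""
    wall_hours = compute_days * 24 / utilization
    return gpus * wall_hours * usd_per_gpu_hour

good = training_cost_usd(10_000, 30, 0.90, 2.0)  # well-designed fabric
poor = training_cost_usd(10_000, 30, 0.50, 2.0)  # GPUs idle half the time
print(f"${(poor - good) / 1e6:.1f}M extra for the same model")
```

The same model, trained on the same hardware, costs roughly $13 million more when the network leaves GPUs waiting half the time.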

Cost Breakdown

Networking typically represents 10 to 20 percent of total cluster cost. For a 10,000-GPU training cluster, the network fabric (switches, cables, transceivers) can cost $50 million to $150 million, depending on whether you use InfiniBand or Ethernet.

Layer 4: Storage — The Forgotten Layer

AI workloads have voracious and varied storage requirements. Training data, model checkpoints, logs, and inference caches all demand different storage characteristics.

Training Data Storage

Large language models train on datasets measured in trillions of tokens, amounting to tens of terabytes of text. Multimodal models (text + image + video) can require petabytes of raw data. This data must be served to GPUs fast enough to avoid starving the compute pipeline.

  • Parallel file systems like Lustre, GPFS (IBM Spectrum Scale), and WekaFS are designed to provide high aggregate throughput by striping data across hundreds of storage nodes. WekaFS, in particular, has gained significant traction in AI workloads for its ability to deliver millions of IOPS with low latency.
  • NVMe flash arrays from Pure Storage, NetApp, and VAST Data provide the raw performance needed for checkpoint storage and fast dataset access. A single VAST Data cluster can deliver multiple terabytes per second of read throughput.
  • Object stores like Amazon S3, Google Cloud Storage, or MinIO serve as the long-term repository for training datasets. They are cheap and durable but too slow for real-time training data feeding — hence the need for a caching layer.

Checkpoint Storage

During training, models periodically save checkpoints — full snapshots of model weights and optimizer state. For a large model, a single checkpoint can be 1 to 5 terabytes. Training runs save checkpoints every few minutes to hours. The storage system must be able to absorb these massive writes without disrupting the training pipeline. This is where NVMe flash and parallel file systems earn their keep.
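A quick sizing sketch makes the requirement concrete. Assuming a 2 TB checkpoint and a 60-second write window (both illustrative), the aggregate throughput the storage system must absorb:

```python
def required_write_gb_s(checkpoint_tb, write_window_s):
    """Aggregate write throughput (GB/s) to land a checkpoint within the window."""
    return checkpoint_tb * 1000 / write_window_s

# 2 TB in 60 seconds: the storage tier must sustain ~33 GB/s of writes,
# on top of whatever read traffic is feeding the training pipeline.
print(round(required_write_gb_s(2, 60), 1), "GB/s")
```

This is why checkpoints typically land on NVMe flash or a parallel file system first and migrate to object storage asynchronously.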

Key Companies

VAST Data has emerged as a favorite among AI labs for its unified storage architecture. Pure Storage continues to innovate in all-flash arrays. DDN remains a stalwart in HPC and AI storage with its EXAScaler Lustre offering. NetApp and Dell compete in enterprise AI storage with ONTAP AI and PowerScale, respectively.

Layer 5: Compute Management and Orchestration

Having thousands of GPUs is meaningless without software to schedule, allocate, and manage workloads across them.

Job Scheduling

Slurm (Simple Linux Utility for Resource Management) remains the dominant job scheduler for AI training clusters. Originally built for academic HPC, it has been extended by companies like SchedMD and NVIDIA to handle GPU-aware scheduling, multi-node training jobs, and preemption policies. Alternatives include PBS Pro and LSF, but Slurm's open-source model and community have made it the default choice.

Container Orchestration for Inference

Kubernetes has become the standard for deploying and scaling AI inference workloads. The NVIDIA GPU Operator and Device Plugin enable Kubernetes to treat GPUs as first-class schedulable resources. KServe, Triton Inference Server, and vLLM provide serving frameworks that sit atop Kubernetes to handle model loading, batching, and auto-scaling.

Monitoring and Observability

At scale, you need to know instantly when a GPU fails, a network link degrades, or a cooling system underperforms. Tools like Prometheus and Grafana provide metrics collection and visualization. NVIDIA's DCGM (Data Center GPU Manager) exposes GPU-level telemetry — temperature, utilization, memory errors, power draw — that feeds into these monitoring stacks.
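As a concrete illustration, DCGM-exporter metrics scraped by Prometheus can be queried over the Prometheus HTTP API and checked against thresholds. A minimal sketch that parses a response in the instant-query format and flags hot GPUs (the sample payload, hostnames, and 85 °C threshold are illustrative; DCGM_FI_DEV_GPU_TEMP is the exporter's GPU-temperature metric):

```python
# Shape follows the Prometheus /api/v1/query instant-query response;
# the payload below is a hand-written sample, not real telemetry.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"gpu": "0", "Hostname": "node-01"}, "value": [1700000000, "68"]},
            {"metric": {"gpu": "1", "Hostname": "node-01"}, "value": [1700000000, "91"]},
        ],
    },
}

def hot_gpus(response, threshold_c=85.0):
    """Return (hostname, gpu index) pairs whose temperature exceeds the threshold."""
    hits = []
    for series in response["data"]["result"]:
        temp = float(series["value"][1])  # Prometheus returns values as strings
        if temp > threshold_c:
            hits.append((series["metric"]["Hostname"], series["metric"]["gpu"]))
    return hits

print(hot_gpus(sample_response))  # [('node-01', '1')]
```

In production this check would run as an alerting rule inside Prometheus itself; the point is that GPU thermals become just another metric stream once DCGM exposes them.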

Why Founders Need to Understand the Full Stack

If you are building an AI product, you might think infrastructure is someone else's problem — just rent GPUs from a cloud provider and focus on your model. That works at small scale. It breaks at large scale for several reasons:

  • Cost: Cloud GPU pricing includes a margin for the cloud provider's infrastructure investment. At scale, the difference between renting and owning can be 3 to 5x in annual cost. Understanding infrastructure lets you make informed build-vs-buy decisions.
  • Performance: Network topology, storage throughput, and cooling effectiveness directly impact training speed and inference latency. Ignorance of these factors leads to wasted compute and slower iteration.
  • Reliability: Hardware failures in large clusters are not exceptions — they are constants. Understanding the failure modes of each layer (power, cooling, network, storage, compute) lets you design fault-tolerant systems.
  • Competitive advantage: The companies that master infrastructure — Google, Meta, and increasingly startup labs like xAI — train faster, iterate more, and deploy more efficiently than competitors who treat infrastructure as a black box.
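The build-vs-buy point in the first bullet can be sketched with rough numbers. Every input here is an illustrative assumption ($3 per GPU-hour rental, $25,000 per-GPU capex amortized over five years, 25 cents per hour for power and operations), not a quote from any provider:

```python
def annual_cost_per_gpu(rent_usd_hr=None, capex_usd=None, amort_years=5,
                        opex_usd_hr=0.25, hours=8760):
    """Annual per-GPU cost: rental rate, or amortized capex plus power/ops."""
    if rent_usd_hr is not None:
        return rent_usd_hr * hours
    return capex_usd / amort_years + opex_usd_hr * hours

rent = annual_cost_per_gpu(rent_usd_hr=3.00)   # cloud list-price rental
own = annual_cost_per_gpu(capex_usd=25_000)    # owned, fully utilized
print(f"rent/own ratio: {rent / own:.1f}x")
```

The ratio collapses quickly if owned GPUs sit idle, which is why the break-even depends on sustained utilization, not just the sticker prices.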

As John and Jordi often emphasize on TBPN, the AI race is not just about who has the best researchers. It is about who can build and operate the most efficient infrastructure at scale. That is an engineering and operations challenge, not a research challenge.

The Investment Landscape

For investors, the AI infrastructure stack presents opportunities at every layer. The picks-and-shovels thesis is well-worn, but many investors still focus narrowly on GPUs and cloud compute. The real alpha may be in the less glamorous layers:

  1. Power infrastructure: Companies building modular substations, advanced UPS systems, and grid interconnection solutions.
  2. Cooling technology: Liquid cooling companies are experiencing demand that far outstrips supply. Direct-to-chip and immersion cooling are early enough that winners have not been decided.
  3. Networking: As Ethernet competes with InfiniBand for AI workloads, networking companies that can deliver AI-grade performance at Ethernet prices will capture enormous value.
  4. Storage: AI-optimized storage is a growing category, with VAST Data, WekaFS, and others raising significant capital.
  5. Management software: Orchestration, monitoring, and capacity planning tools purpose-built for AI infrastructure.

If you want to represent the builder's mindset while you explore the AI stack, check out TBPN hoodies and t-shirts — gear made for the people who actually build the future, not just talk about it.

Frequently Asked Questions

What is the most common bottleneck in AI infrastructure?

Power availability is the single most common bottleneck. Many regions have multi-year queues for new grid interconnections, which delays data center construction regardless of how much capital is available. After power, cooling capacity is the next most frequent constraint — especially as GPU density increases beyond what air cooling can handle.

How much does it cost to build an AI data center from scratch?

A modern AI-ready data center costs approximately $10 million to $15 million per megawatt of IT capacity, fully built out with liquid cooling, high-performance networking, and redundant power. A 100 MW facility therefore costs $1 billion to $1.5 billion before you install a single GPU. The GPUs themselves can double or triple this figure.

Can startups realistically build their own AI infrastructure?

Most startups should not build their own data centers. The capital requirements and operational complexity are prohibitive below a certain scale (typically 1,000+ GPUs running continuously). However, startups can make smarter infrastructure decisions by understanding the stack: choosing colocation providers with liquid cooling, selecting network fabrics that minimize training overhead, and using storage architectures that match their workload patterns.

Why is liquid cooling becoming mandatory for AI workloads?

Modern AI accelerators — NVIDIA's GB200, AMD's MI300X — generate too much heat per square foot for air cooling to remove efficiently. A single GB200 NVL72 rack produces over 100 kW of heat. Moving that much thermal energy with air requires fan power that destroys energy efficiency and still risks hotspots. Liquid coolants carry on the order of a thousand times more heat per unit volume than air, enabling higher GPU density, lower PUE, and more reliable operation.