
In a recent interview, chip investment expert Gavin Baker digs into the differences between NVIDIA GPUs (Hopper, Blackwell) and Google TPUs, analyzing them from the perspectives of technology, performance, cost, and alliance-building. He argues that while Google's TPUs hold the advantage in the short term, NVIDIA's GPU ecosystem commands the stronger monopoly position over the long run.
GPUs are full-stack platforms, while TPUs are single-purpose ASICs.
Baker said the divergence between the two accelerator families starts with their fundamental design philosophies. NVIDIA's GPUs, from Hopper and Blackwell through the upcoming Rubin, are built as a full-stack platform: the GPU itself, the NVLink GPU-to-GPU interconnect, network cards, switches, and software layers such as CUDA and TensorRT are all handled by NVIDIA. An enterprise that buys GPUs essentially gets a complete environment ready for training and inference, with no need to assemble its own network or rewrite its software.
Google's TPUs (v4, v5e, v6, v7), by contrast, are essentially application-specific integrated circuits (ASICs): accelerators designed for particular AI computations. Google handles the front-end logic design, Broadcom does the back-end physical design, and TSMC manufactures the chips. The other indispensable pieces around the TPU, such as switches, network cards, and the software ecosystem, Google must integrate itself, so its supply chain coordination is far more complex than on the GPU side.
In short, the GPU's advantage lies not in single-chip performance but in the completeness of the whole platform and ecosystem, and that is where the increasingly visible competitive gap between the two begins.
Blackwell's performance leap puts greater pressure on TPU v6/v7.
Baker pointed out that heading into 2024-2025, the performance gap between GPUs and TPUs will become increasingly evident. The GB200 and GB300 generation is a major architectural jump: a move to liquid cooling, single racks drawing up to 130 kW, and unprecedented overall complexity. Large-scale deployment has only been underway for three to four months, so the platform is still very new.
The next-generation GB300 slots directly into GB200 racks, letting enterprises expand faster. xAI, thanks to the speed at which it builds data centers, is seen as among the first customers able to extract Blackwell's full performance. Baker offers an analogy:
"If Hopper is described as the most advanced airplane at the end of World War II, then TPU v6/v7 is like the F-4 Phantom, an aircraft from two generations later. Blackwell, on the other hand, is the F-35, belonging to a completely different level of performance."
The analogy places TPU v6/v7 and Blackwell in different hardware tiers, and Baker notes that Google's current Gemini 3 still runs on TPU v6/v7 rather than Blackwell-class equipment. Google can clearly train top-tier models like Gemini 3 on TPU v6/v7, but as the Blackwell series rolls out at scale, the performance gap between the two architectures will grow increasingly apparent.
TPU was once the king of low-cost training, but GB300 will rewrite the picture.
Baker said the TPU's most critical advantage in the past was the lowest training cost in the world, and Google did use that advantage to squeeze competitors' room to raise funds and operate.
Once GB300 is deployed at scale, however, Baker argued the lowest-cost training platform will shift to companies adopting GB300, especially vertically integrated teams like xAI that build their own data centers. If OpenAI can break through its compute bottleneck and gain the ability to build its own infrastructure, it may join the GB300 camp as well.
Once Google no longer holds cost leadership, its earlier low-price strategy becomes hard to sustain, and dominance over training costs shifts from long-term TPU control to a landscape redivided by GB300.
GPUs scale out faster, while TPUs carry a heavier integration burden.
The faster large models progress, the greater the demand for massive GPU scale-out, and this is one of the key reasons GPUs have pulled well ahead of TPUs in recent years. Baker pointed out that GPU clusters, built on NVLink and NVIDIA's networking stack, can push coordinated training to 200,000-300,000 GPUs, letting large models spend a much bigger training budget. xAI's rapid build-out of giant data centers has in turn pushed NVIDIA to ship optimizations earlier, accelerating the evolution of the whole GPU ecosystem.
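For a sense of what clusters that size imply, here is a rough back-of-envelope sketch. The numbers are ours, not Baker's: we assume roughly 72 GPUs per liquid-cooled NVL72-style rack, combined with the ~130 kW per-rack figure cited above.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Back-of-envelope power estimate for a 300,000-GPU Blackwell-class cluster.
% Assumptions (not from the interview): ~72 GPUs per NVL72-style rack,
% ~130 kW per rack (the figure cited earlier in the article).
\begin{align*}
\text{racks} &\approx \frac{300{,}000\ \text{GPUs}}{72\ \text{GPUs/rack}} \approx 4{,}167,\\
\text{cluster power} &\approx 4{,}167\ \text{racks} \times 130\ \text{kW/rack} \approx 542\ \text{MW}.
\end{align*}
\end{document}
```

The same arithmetic at the 200,000-GPU end of the range gives roughly 360 MW, which is part of why Baker ties Blackwell's value so closely to how quickly a customer can build data centers.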
For TPUs, by contrast, Google must integrate switches and networking itself while coordinating a supply chain spanning Broadcom and TSMC, so the overall engineering complexity is higher than on the GPU side.
GPUs are moving towards one generation per year, while TPU iterations are constrained by the supply chain.
Baker noted that in response to competitive pressure from ASICs, both NVIDIA and AMD are accelerating their release cadence, moving GPUs toward one generation per year. That rhythm is a huge advantage in the large-model era, because the expansion of model scale is nearly uninterrupted.
TPU iteration is comparatively constrained. From v1 to v4 and on to v6, each generation took years to mature, and future v8 and v9 versions are further limited by a supply chain involving Google, Broadcom, TSMC, and others, so their development pace cannot match the GPUs'. Over the next three years, Baker expects the GPU advantage in iteration speed to become increasingly apparent.
The three other giants are clearly aligning with NVIDIA, while Google stands alone with TPU.
The four leading model players globally today are OpenAI, Google (Gemini), Anthropic, and xAI, and the overall alignment is tilting increasingly toward NVIDIA.
Baker said Anthropic has signed a US$5 billion long-term procurement contract with NVIDIA, formally binding itself to the GPU camp. xAI is Blackwell's largest early customer and has invested heavily in building GPU data centers. OpenAI, which must rent compute externally and faces mounting cost pressure as prices rise, hopes to resolve its long-term compute bottleneck through the Stargate project.
Among the four, Google is the only heavy TPU user, and it faces pressure from the TPU's eroding cost competitiveness and slower iteration. The result is a "three against one" compute landscape: OpenAI, Anthropic, and xAI clustered in the GPU camp, with Google relatively isolated in the TPU camp.
This article, "Chip investment expert: Google TPU has the upper hand for now, but NVIDIA GPU holds the greater long-term advantage," first appeared on Chain News ABMedia.
