Nvidia and the AI Hardware Race

John Santa

The Beat Drops at GTC 2024

In a world where tech conferences sometimes feel like a déjà vu fest, Nvidia’s GTC 2024 was anything but. Picture this: Jensen Huang, Nvidia’s fearless rockstar leader, struts onto the stage in his iconic leather jacket, ready to drop some serious AI hardware knowledge on us. And drop he did, unveiling the Blackwell platform, a beast of a processor tailor-made for the generative AI epoch; an NVLink Switch chip promising seamless integration of GPUs at scale; and performance stats that had us picking our jaws up off the floor. The Blackwell GB200, with NVLink, boasts a 30x performance leap over the H100, coupled with a 25x jump in energy efficiency. Mind. Blown.

But Nvidia didn’t stop there. They also introduced Nvidia Inference Microservices (NIMs) and showcased Omniverse, a digital world where robots can train. It’s not just hardware; it’s Nvidia shaping the future of AI ecosystems. OK, so that’s a recap of the news! But what does all of this mean? Where is AI hardware heading?

Dissecting Moore’s Law

Remember Moore’s Law? Some say it’s dead, but let’s not hold a funeral just yet. While Jensen Huang himself proclaimed the demise of the law—pointing to the soaring costs of cutting-edge tech (ahem, RTX 4090 at $1600 MSRP)—the story is more “Game of Thrones” complex than “Sesame Street” simple.

We believe answering this question requires thinking about it from two perspectives: performance and cost. Yes, the costs are climbing, but the performance? It’s scaling Everest without breaking a sweat (see Chart 1), thanks to the ingenious mix of squeezing more transistors onto chips and cleverly parallelizing multiple systems. So WTH is happening on the cost side? Our hypothesis is that parallelization (horizontal scaling) and the plain limits of physics are making it increasingly hard for manufacturers to keep costs down.

Chart 1 (see References and Specs for sources)

The Performance Trajectory: Our Research

To truly understand the trajectory of AI hardware performance, we need to delve into the historical data and trends that have shaped the landscape of computing power over the past three decades. Our research, which compiles data on CPUs, GPUs, and specialized processing units like TPUs and NPUs since 1990, offers invaluable insights into this evolution.

In Chart 1, we plot the performance of top processors from leading manufacturers such as Intel, Nvidia, Apple, Google, AMD, and others. This visualization spans from 1990 to 2023, showcasing the journey of computational power. The chart encompasses four primary types of processors:

1. CPUs (Central Processing Units): Measured in Instructions per Second (IPS), CPUs have long been the workhorses of computing, driving general-purpose tasks and computations.
2. GPUs (Graphics Processing Units): Measured in Floating-Point Operations per Second (FLOPS), GPUs revolutionized parallel computing, particularly in graphics rendering and scientific simulations.
3. TPUs (Tensor Processing Units): Primarily found in cloud or data center environments, TPUs are specialized hardware designed to accelerate machine learning tasks, particularly those involving tensor operations. Their performance is measured in Tera Operations per Second (TOPS).
4. NPUs (Neural Processing Units): Found predominantly in mobile devices, NPUs are optimized for AI tasks on the go. Their performance is also measured in TOPS, but with a focus on inference, power efficiency, and mobility.

Our analysis reveals three distinct stages in the evolution of processing power: CPU era -> GPU era -> TPU era. One of the most striking observations from our research is the seamless hand-off of performance between the CPU, GPU, and TPU stages. Despite the differences in architectures and performance metrics, each stage builds upon the foundations laid by its predecessor, enabling a continuation of exponential growth in computational power. While our analysis acknowledges the limitations of comparing IPS, FLOPS, and TOPS directly, it provides valuable insights into the overarching trends driving the AI hardware race.
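
If you want to assemble a Chart-1-style view from your own spec table, here is a minimal sketch of the approach (not our actual analysis code); the `processor_specs.csv` file and its column names are assumptions for illustration:

```python
# Illustrative sketch of a Chart-1-style plot. Assumes a hypothetical CSV with
# columns: year, name, type (CPU/GPU/TPU/NPU), ops_per_second (IPS, FLOPS, or
# TOPS converted to raw operations per second).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("processor_specs.csv")  # hypothetical data file

fig, ax = plt.subplots(figsize=(10, 6))
for proc_type, group in df.groupby("type"):
    ax.scatter(group["year"], group["ops_per_second"], label=proc_type, s=25)

ax.set_yscale("log")  # exponential growth shows up as a roughly straight trend
ax.set_xlabel("Year")
ax.set_ylabel("Peak operations per second (IPS / FLOPS / TOPS, log scale)")
ax.set_title("Top processor performance, 1990-2023")
ax.legend(title="Processor type")
plt.tight_layout()
plt.show()
```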

OK, enough history! So, where are we now, and where are we heading?

Nvidia: Still the King of the Hill

In the datacenter AI hardware realm, Nvidia remains firmly at the forefront. While AMD and Intel vie for a slice of the market, Nvidia’s Blackwell GB200 asserts its supremacy with an impressive 2250 TFLOPS of computational power. In comparison, AMD’s MI300X offers 1307 TFLOPS, while Intel’s Gaudi 3, projected to launch this year, is estimated to deliver 1600 TFLOPS (see Chart 2). Despite these commendable efforts, Nvidia maintains a significant lead, boasting an average advantage of roughly 55%. Based on our extrapolation of year-over-year performance improvements since 2017 (an average of 60% per year), Nvidia’s lead equates to nearly a year’s worth of advancements over its closest rivals.

Chart 2
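
As a quick sanity check on that lead, here is a back-of-the-envelope sketch using the Chart 2 figures quoted above; the ~60% annual growth rate is the same assumption stated in the text:

```python
import math

# Peak TFLOPS figures quoted above (Chart 2)
nvidia = 2250                                              # Blackwell GB200
rivals = {"AMD MI300X": 1307, "Intel Gaudi 3 (est.)": 1600}

# Nvidia's advantage over each rival, then the average
advantages = [nvidia / tflops - 1 for tflops in rivals.values()]
avg_advantage = sum(advantages) / len(advantages)
print(f"Average advantage: {avg_advantage:.0%}")           # ~56%

# At ~60% performance growth per year, that gap is worth roughly:
years_of_lead = math.log(1 + avg_advantage) / math.log(1.60)
print(f"Equivalent lead: ~{years_of_lead:.1f} years")      # ~1 year
```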

In the realm of consumer and mobile devices, competitors like Apple and Intel are making strides in integrating AI capabilities into their chips. Apple touts its Neural Engine and custom-designed silicon, while Microsoft and Intel have outlined 40 TOPS of NPU performance as the minimum requirement for Windows Copilot and AI PC platforms. Yet it is Nvidia that stands as the undisputed leader, with its RTX 4090 Laptop GPU at an estimated performance of around 200 TFLOPS (see Chart 3).

Chart 3

In summary, Nvidia’s position as the industry leader remains unassailable. AMD and Intel are slowly closing the gap, but they are still far from offering strong alternatives, except in specific use cases where their solutions may be more cost-effective.

Peering into the Crystal Ball: Cheap AI for Everyone

Fast forward to 2029 at the current pace of performance growth. Imagine a world where datacenter-grade hardware delivers between 16 and 30 PetaFLOPS per chip, making today’s best look like calculators. To make this more tangible, think of the original, groundbreaking ChatGPT (GPT-3.5). Based on this paper, training the model took 1024 NVIDIA A100s running for 34 days. With a hypothetical future-generation TPU in 2029 (five years from the writing of this blog), it may be possible to train a GPT-3-equivalent model with just 10 such chips in about the same time span (<34 days). Assuming a cost of $5 per TPU per hour (see cost reference), that adds up to around $40,000 USD in training cost (compared to the original $4M+ cost estimate). By then, almost any organization would be able to train these modern models (provided they have the data).
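
The arithmetic behind that projection is simple enough to sketch; the 60% annual growth rate and the $5-per-chip-hour price are the same assumptions used above, so treat the outputs as rough orders of magnitude:

```python
# Projecting 2029 chip performance from the Blackwell GB200 baseline,
# assuming the ~60%/year growth rate observed in our data since 2017.
GROWTH_PER_YEAR = 0.60
CURRENT_TFLOPS = 2250            # Blackwell GB200 (2024)
YEARS_AHEAD = 5                  # 2024 -> 2029

projected_tflops = CURRENT_TFLOPS * (1 + GROWTH_PER_YEAR) ** YEARS_AHEAD
print(f"Projected 2029 chip: ~{projected_tflops / 1000:.0f} PFLOPS")  # ~24 PFLOPS

# Training-cost sketch: 10 such chips running for 34 days at $5/chip-hour
chips, days, usd_per_chip_hour = 10, 34, 5
cost = chips * days * 24 * usd_per_chip_hour
print(f"Estimated training cost: ${cost:,}")                          # $40,800
```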

On the other hand, we estimate mobile AI chips hitting ~300 TFLOPS in 2027, which would make them equivalent to a datacenter-grade NVIDIA A100! Imagine that, dear reader! What could you build if AI hardware limitations were a thing of the past? The AI hardware race is on, and Nvidia is leading the charge. But this race isn’t just for the tech giants; it’s for all of us. Whether you’re a developer, a dreamer, or a disruptor, the question isn’t what Nvidia will do next; it’s what you will do with these incredible tools at your disposal. Come join us at Factored to build the future of AI together!

References and Specs
