Deep Learning Weekly: AI Hardware Deep Dive

An overview of the hardware used to power AI research and industrial applications

Hey folks,

Today we’re sending you our second deep dive! Our aim is to examine a single topic in depth. This time, we look at the hardware used to train AI models and run inference at scale.

We give a history of the field, an overview of what’s available today, and a summary of the most promising research directions shaping the hardware of tomorrow.

AI Hardware: Past, Present and Future

Artificial Intelligence (AI) has been around for a few decades, and its advances are mostly driven by three factors: algorithmic innovation, access to large datasets, and the amount of computing power available.

The first two factors - algorithmic innovation and access to large datasets - have clearly improved very significantly over the last decades, although their progress is difficult to track and quantify. Here we focus on the third factor: what is being done to increase the amount of computing power available to train AI models and run inference at scale?

The emergence of industrial applications at an unprecedented scale - for example search engines, self-driving vehicles and speech recognition - has driven a wave of investment in hardware research. We present here an overview of the recent history of the field and summarize the most promising directions for the future.

Past and Present

In a must-read analysis of the evolution of the amount of compute used in AI research, OpenAI estimates that since 2012, the amount of compute used in the largest AI models has been increasing exponentially with a 3.4-month doubling time, far faster than Moore’s law’s 2-year doubling period. This impressive increase comes from several factors, detailed below.
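
To get a feel for how different those two doubling times really are, here is a small illustrative sketch (plain Python; the function and variable names are ours, not OpenAI’s) comparing the growth they imply over a six-year window:

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Total multiplicative growth after `months`, doubling every `doubling_months`."""
    return 2 ** (months / doubling_months)

six_years = 72  # months

# Growth implied by the 3.4-month doubling time OpenAI estimates
ai_growth = growth_factor(six_years, 3.4)      # roughly 2.4 million-fold

# Growth implied by Moore's law's 2-year doubling period, for comparison
moore_growth = growth_factor(six_years, 24.0)  # exactly 8-fold
```

Small differences in doubling time compound dramatically: over the same six years, a 2-year doubling period yields only an 8x increase, while a 3.4-month doubling time yields a multi-million-fold one.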

The use of specialized chips

Before 2010, ML models were mostly trained on CPUs. Graphics Processing Units (GPUs) were popularized for general-purpose computing by Nvidia in the early 2000s, with the introduction of parallel GPUs for applications requiring complex, simultaneous calculations. Using GPUs for deep learning was pioneered by Andrew Ng’s team in a famous paper, and became the standard in the following years.

A race then began among the main chip providers to develop faster and more powerful domain-specific chips. In 2016, the release of Google’s Tensor Processing Unit (TPU), which delivered an impressive speedup for deep learning in TensorFlow, as well as Intel’s acquisitions of Nervana Systems and Movidius, marked the start of an arms race. Chip architectures are now increasingly designed and optimized for specific applications.

The availability of ever bigger and cheaper hardware infrastructure

After GPUs were introduced to run deep learning calculations far faster than traditional chips, a wave of research and investment enabled the training of AI models on many GPUs at once, leading to training runs at a much larger scale:

  • Before 2014, infrastructure to train on more than a dozen GPUs was very uncommon, and most state-of-the-art results were obtained with 1-8 GPUs

  • Then, the release of bigger and cheaper GPU clusters (see for example Nvidia’s data center GPUs or Cerebras’ wafer-scale hardware) made it possible to scale training to hundreds of GPUs

  • Also, some theoretical advances (for example very large batch sizes or neural architecture search) allowed greater algorithmic parallelism, making it more efficient to spread workloads across many GPUs
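
The data-parallel pattern behind these multi-GPU training runs can be illustrated with a toy sketch (plain NumPy standing in for GPU workers; all names are ours, and real systems would use a framework such as PyTorch’s DistributedDataParallel):

```python
import numpy as np

def local_gradient(w, X, y):
    # Gradient of mean squared error for a linear model, on one worker's shard
    pred = X @ w
    return 2 * X.T @ (pred - y) / len(y)

def data_parallel_step(w, X, y, n_workers, lr=0.1):
    # Split the global batch into equal shards, one per simulated worker
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    # Each worker computes a gradient on its own shard...
    grads = [local_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # ...then an all-reduce averages them across workers
    avg_grad = np.mean(grads, axis=0)
    return w - lr * avg_grad

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)

w_parallel = data_parallel_step(w, X, y, n_workers=8)
w_serial = w - 0.1 * local_gradient(w, X, y)
# With equal shard sizes, the averaged gradient matches the full-batch one
assert np.allclose(w_parallel, w_serial)
```

This is why large batch sizes matter: the bigger the global batch, the more shards it can be split into, and the more workers can be kept busy at once.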

The promising directions for the future

It appears that AI hardware is still in its infancy, and there is a lot of uncertainty about the future of the field. IBM Research has built a team dedicated to developing new devices and hardware architectures optimized for AI, and has released a very nice introduction to their work.

Several research directions stand out as especially promising for the years ahead.

We conclude by noting that when it comes to AI, hardware and software are deeply linked: more efficient hardware makes it possible to test and validate new algorithmic approaches at scale, while the latest theoretical advances in turn shape chip architectures.