Beat the Heat
Artificial intelligence and machine learning continue to move toward center stage. But the powerful processing they require is tied to high power dissipation that results in a lot of heat to manage. In his article, Tom Gregory from 6SigmaET explores the alternatives available today, with a special look at cooling Google’s Tensor Processing Unit 3.0 (TPUv3), which was designed with machine learning in mind.
For much of the electronics community, 2018 was the year that artificial intelligence (AI)—and in particular machine learning—became a reality. Used in everything from retail recommendations to driverless cars, machine learning represents the ongoing effort to get computers to solve problems without being explicitly programmed to do so.
While both machine learning and the wider notion of artificial intelligence come with revolutionary new implications for the technology sector, they also require significant investment in new forms of electronics, silicon and customized hardware. But just why do we need such an extensive ground-up redesign? And why shouldn’t AI firms simply keep building on the processing and acceleration technologies that are already in place?
BUILDING A BRAIN
For most of those working in AI, the end goal is to create artificial general intelligence (AGI): “thinking machines” that assess, learn, and handle nuance and inference in a way that replicates the thought processes of the human brain. Based on the current design and architecture of electronics, however, this simply isn’t possible.
As it stands, the vast majority of current-generation computing devices are based on the same Von Neumann architecture, a beautifully simple but fundamentally limited way of organizing information. In this structure, programs and data are held in memory, separate from the processor, and data must move between the two for operations to be completed. The limitations of this structure result in what has become known as the “Von Neumann bottleneck,” in which the processor is left waiting on memory transfers and latency is unavoidable.
Unlike today’s computing devices, the human brain does not separate memory from processing. Biology makes no distinction between the two, with every neuron and every synapse storing and computing information simultaneously.
While even the largest investments in AI are nowhere near recreating such a system, there is a growing demand to rethink the Von Neumann architecture and to overcome the natural bottleneck of the two-part memory/processing system. It is here that new hardware developments are proving so vital. While plenty of AI companies have had significant success simply throwing more processing power and ever more GPUs at the problem, the reality is that AI and machine learning will never reach their full potential until new, more “biological” hardware is developed from the ground up.
THE RACE FOR HARDWARE
This demand for new and increasingly advanced hardware has become known as the “Cambrian Explosion,” an apt reference to the most important evolutionary event in the history of life. This colossal growth in the AI market has resulted in a huge variety of new electronics, as well as an increasing number of high-end investments being made in any start-up that can offer a solution to the ever-expanding disconnect between existing hardware outputs and the massive potential of machine learning technology.
Already in 2018, we saw several such investments. In March, AI chip start-up SambaNova Systems received $56 million in funding for its new GPU replacement, designed specifically with AI operations in mind. More recently, its competitor Mythic received a similar $40 million investment to continue its efforts to replace the traditional GPU. And it’s not just start-ups that are getting in on the action. Tech giants such as Microsoft, Google, Nvidia and IBM are all looking for their own ‘killer app’ solution to the AI hardware problem.
While each of these different companies has its own particular solution – often attacking the problem from completely different angles – one of the most common pieces of hardware being developed is an accelerated GPU.
In traditional computing environments, CPUs have been used for the bulk of processing, with GPUs being added to ramp things up where required (such as for rendering videos or animation). However, in the age of machine learning, GPUs still aren’t enough. What’s needed now is a new, more powerful, and more streamlined processing unit which can undertake the heavy lifting needed for machine learning – a unit that can analyze large data sets, recognize patterns and draw meaningful conclusions.
TENSOR PROCESSING UNIT
The recently released Tensor Processing Unit 3.0 (TPUv3) is Google’s latest foray into the AI hardware market. Designed with the future of machine learning in mind, the TPU is a custom ASIC tailored specifically for TensorFlow, Google’s open-source software library for machine learning and AI applications (Figure 1).
Unlike GPUs, which act as processors in their own right, Google’s TPU is a coprocessor: all code execution is shifted to the CPU in order to free up the TPU for a stream of machine learning-based microoperations. In purely practical terms, TPUs are designed to be significantly cheaper and to (theoretically) use less power than GPUs, despite playing a pivotal role in some pretty hefty machine learning calculations, predictions and processes. The only question is, does Google’s TPU really achieve what it claims?
While positioned as a game-changing development in the AI space, the new TPUv3 still faces many of the same challenges as competitor products offered by Amazon and Nvidia, in particular the potential for thermal complications.
As with so much of the hardware developed specifically for the machine learning market, Google’s TPUv3 offers a colossal amount of processing power. In fact, according to Google CEO Sundar Pichai, the new TPU will be eight times more powerful than any of Google’s previous efforts in this area.
From an AI standpoint this is hugely beneficial, since machine learning relies on the ability to crunch huge volumes of data instantaneously in order to make self-determined decisions. From a thermal perspective, however, this dramatic increase in processing represents a minefield of potential complications, with increased power meaning more heat is generated throughout the device. This accumulation of heat could potentially impact performance, ultimately risking the reliability and longevity of the TPU.
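To make the link between power and temperature concrete, the sketch below uses the simplest lumped model of a cooling path: steady-state junction temperature equals ambient temperature plus dissipated power multiplied by a single junction-to-ambient thermal resistance. The power levels, ambient temperature and resistance are hypothetical illustration values, not figures for the TPUv3 or any other real device.

```python
def junction_temp_c(power_w, ambient_c, theta_ja_c_per_w):
    """Steady-state junction temperature for one lumped thermal resistance.

    T_junction = T_ambient + P * theta_ja
    """
    return ambient_c + power_w * theta_ja_c_per_w


# Hypothetical values: the same cooling path at two power levels.
AMBIENT_C = 35.0          # assumed inlet air temperature
THETA_JA_C_PER_W = 0.25   # assumed junction-to-ambient resistance

for power_w in (100.0, 200.0):
    t_j = junction_temp_c(power_w, AMBIENT_C, THETA_JA_C_PER_W)
    print(f"{power_w:.0f} W -> junction at {t_j:.1f} C")
```

With the cooling path unchanged, doubling the power doubles the temperature rise above ambient. That is why a large jump in processing power tends to force a change in the cooling approach rather than a small adjustment to it.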
In a market where reliability is essential, and buyers have little room for system downtime, this issue could prove the deciding factor in which hardware manufacturer ultimately claims ownership of the AI space.
Given these high stakes, Google has clearly invested significant time in optimizing the thermal design of its TPUv3. Unlike the company’s previous tensor processing units, the TPUv3 is the first to bring liquid cooling to the chip, with coolant being delivered to a cold plate sitting atop each TPUv3 ASIC (Figure 2). According to Pichai, this will be the first time ever that Google has needed to incorporate a form of liquid cooling into its data centers.
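As a rough guide to what a cold plate asks of its coolant loop, the heat carried away by a liquid follows the energy balance Q = m_dot × c_p × ΔT. The sketch below estimates the flow rate needed to remove a given heat load for an allowed coolant temperature rise. It assumes plain water as the coolant and a hypothetical 200 W load; the article does not specify the coolant or the heat load of the TPUv3.

```python
WATER_CP_J_PER_KG_K = 4186.0      # specific heat of water (approximate)
WATER_DENSITY_KG_PER_M3 = 997.0   # density of water near room temperature


def required_flow_l_per_min(heat_w, coolant_rise_k,
                            cp=WATER_CP_J_PER_KG_K,
                            density=WATER_DENSITY_KG_PER_M3):
    """Coolant flow needed to absorb heat_w with a temperature rise of coolant_rise_k."""
    if coolant_rise_k <= 0:
        raise ValueError("coolant_rise_k must be positive")
    mass_flow_kg_per_s = heat_w / (cp * coolant_rise_k)
    volume_flow_m3_per_s = mass_flow_kg_per_s / density
    return volume_flow_m3_per_s * 1000.0 * 60.0  # m^3/s -> L/min


# Hypothetical load: 200 W per cold plate, 5 K allowed coolant temperature rise.
flow = required_flow_l_per_min(heat_w=200.0, coolant_rise_k=5.0)
print(f"Required flow: {flow:.2f} L/min per cold plate")
```

This ignores the thermal resistance between die and coolant, pump power and pressure drop, so it gives only a lower bound on the plumbing a real design would need.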
KEEPING IT COOL
While the addition of liquid cooling technology has been positioned as an innovation for the industry, the reality is that high-powered electronics running in rugged environments have been using similar heat dissipation systems for some time.
When preparing equipment for these environments, engineers are faced with the challenge of working with limited cooling resources. As a result, they must find clever ways of dissipating heat away from the critical components. In this context, carefully designed hybrid liquid and air-cooling systems have proved vital in ensuring that servers and other critical electronics systems function reliably.
As an example, researchers at The University of Texas at Arlington have used 6SigmaET thermal simulation software to model liquid cooling systems for servers. One significant challenge the researchers faced was cooling the components other than the main processing chip, such as the DIMMs, PCH, HDD and other heat-generating parts that are not directly cooled by a liquid cooling loop. The research therefore used a combination of warm water and recirculated air to keep the server’s critical temperatures within the recommended range.
While such liquid cooling systems are extremely effective, they should not necessarily be used as the go-to solution for thermal management. With those in the machine learning space looking to optimize efficiency—both in terms of energy and cost—it’s vital that designers minimize thermal issues across their entire designs, and do not rely on the sledgehammer approach of installing a liquid cooling system just because the option is available.
For some of the most powerful chips, such as the Google TPUv3, it may be that liquid cooling is the most viable solution. In the future, however, as ever more investment is placed in machine learning hardware, engineers should not grow complacent when it comes to exploring different thermal management solutions. Liquid cooling may be sufficient to dissipate heat build-up in the most high-powered components currently available, but this may not always be the case. It may be more efficient to strive for designs that do not risk such accumulations of heat in the first place.
If AI hardware manufacturers are truly going to overcome the thermal complications associated with their increasingly powerful designs, they must take every opportunity to optimize thermal management at every stage of the design process.
At the chip level, appropriate materials for substrates, bonding, die attaches and interface materials need to be selected. At the system level there are equally important decisions to be made regarding PCB materials, heat sinks and where to incorporate liquid cooling or thermoelectric coolers.
The more robust materials used in high power electronics also bring their own challenges. Compared to typical FR4 PCBs, materials like ceramic or copper have high thermal conductivity, which can be advantageous in thermal management, but at the same time these materials can also add significant cost and weight to a design if not used optimally.
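The trade-off between conductivity and weight can be sketched with the one-dimensional conduction formula R = t / (k × A) and a simple mass calculation. The material properties below are approximate, order-of-magnitude textbook values chosen for illustration, and the slab geometry is hypothetical. Real boards are layered composites, so a proper thermal simulation would be needed for design decisions.

```python
# Approximate material properties (assumed for illustration, not from the article).
MATERIALS = {
    "FR4 (through-plane)": {"k_w_per_m_k": 0.3, "density_kg_per_m3": 1850.0},
    "alumina ceramic": {"k_w_per_m_k": 25.0, "density_kg_per_m3": 3900.0},
    "copper": {"k_w_per_m_k": 400.0, "density_kg_per_m3": 8960.0},
}


def conduction_resistance_k_per_w(thickness_m, area_m2, k_w_per_m_k):
    """One-dimensional conduction resistance through a uniform slab."""
    return thickness_m / (k_w_per_m_k * area_m2)


def slab_mass_g(thickness_m, area_m2, density_kg_per_m3):
    """Mass of the same slab in grams."""
    return density_kg_per_m3 * thickness_m * area_m2 * 1000.0


# Hypothetical geometry: a 1.6 mm thick slab under a 20 mm x 20 mm heat source.
THICKNESS_M = 1.6e-3
AREA_M2 = 0.02 * 0.02

for name, props in MATERIALS.items():
    r = conduction_resistance_k_per_w(THICKNESS_M, AREA_M2, props["k_w_per_m_k"])
    m = slab_mass_g(THICKNESS_M, AREA_M2, props["density_kg_per_m3"])
    print(f"{name}: {r:.3f} K/W, {m:.2f} g")
```

Under these assumptions copper conducts heat through the slab far better than FR4, but the same volume weighs several times as much, which is the cost-and-weight penalty the text describes.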
While these may seem like minor considerations, it is the businesses that maximize their use of innovative solutions within their design constraints and build thermal considerations into the fabric of their designs that will be most effective at minimizing energy waste and mitigating unnecessary risk and costs due to component failure.
UP FRONT CONSIDERATION
According to 6SigmaET’s research, which incorporates data from over 350 professional engineers, 75% don’t test the thermal performance of their designs until late in the design process. And 56% don’t run these tests until after the first prototype has been developed, while 27% wait until after a design is complete before even considering thermal complications.
Instead of relying on physical prototypes, which are expensive and time consuming to produce, more and more of today’s engineers are choosing to test the thermal qualities of their designs virtually—in the form of thermal simulations.
By creating a thermal simulation model in advance, AI hardware engineers can test their designs using a wide variety of different materials and configurations, for example switching a part from copper to aluminum at the click of a button. Simulation also enables designs to be tested in a massive array of different environments, temperatures and operating mode scenarios (Figure 3). This not only helps to identify potential inefficiencies, but also reduces the need for multiple real-world prototypes.
Through the early-stage incorporation of thermal simulation into the design process, it is becoming increasingly easy for engineers to precisely understand the unique thermal challenges facing AI hardware. This means that thermal considerations can be dealt with far earlier, enabling the thermal performance of TPUs and related AI hardware to be fully optimized and reducing the risk of expensive late-stage ‘fixes’ and unnecessary over-engineering.
With numerous industry titans and hundreds of electronics start-ups all racing to be the first to develop truly effective and efficient AI hardware, those that fail to account for the ‘little things’ will quickly fall behind. When working in such a high-precision field, every wasted second or lost watt represents a significant burden on the effectiveness of the resulting system. Inevitably, the firms that produce the most elegant, efficient and, ultimately, most reliable products will be those that claim ownership of the space and win their place as the leaders in the AI hardware market.
For detailed article references and additional resources go to:
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • FEBRUARY 2019 #343