Why CPUs also make sense for AI inference - interview with Ampere Computing's Jeff Wittich

13/11/23 · 4 min read

As CPO of US-based chipmaker Ampere Computing, Jeff Wittich has an important message for IT executives: artificial intelligence inference doesn’t necessarily need supercomputers, or GPUs. In many cases, he claims, CPUs are not only good enough, they’re even ideal. Why? Because they can offer right-sized compute power with minimal energy consumption, thereby limiting AI’s impact on the planet and on cloud budgets. We spoke to Wittich ahead of his keynote at ai-PULSE on November 17…

How does Ampere want to be considered by cloud providers today when it comes to AI?

Jeff Wittich: Ampere’s mission from day one has been to deliver sustainable computing for modern performance environments like the cloud. That extends to AI too. Cloud service providers (CSPs) should consider Ampere for all needs in the cloud, including when looking to build AI workload capabilities.

We know one of CSPs’ biggest challenges is power consumption. Using more power is costly, plus power is scarce, and you can’t expand your data center infinitely. This means we need to deliver more efficient systems over time, to provide more compute capacity without consuming more power.

AI inference has really brought this into the forefront, as demand for it has increased rapidly, making that power challenge even more difficult to solve. We have a solution that tackles that.

Often when we talk about AI, we forget that AI training and inference are two different tasks.

Training [or teaching the AI model with large quantities of data] is a one-off, gigantic task that takes a long time; and for that one time, you might be OK to use the considerable amounts of power required by GPUs and supercomputers.

Inference [or using the trained AI model on a regular basis] is different, as it can be millions of tasks running every second. Inference is your “scale” model, that you’re running all the time, so efficiency is more important here.

So whereas accelerators can make a lot of sense for training, inference workloads don’t need to run on supercomputing hardware.

In fact, general-purpose CPUs are good at inference, and they always have been. Our CPUs are especially well-suited to the task because they are high-performance and balanced. Plus you need predictable latency in these cases, and to keep processing close to the core, not have it bouncing around all over the place. Having a lot of cores is useful too, as is flexibility. It may be that AI inference isn’t 100% of what you’re asking a CPU to do. If it can do other things at the same time, you get higher overall utilization.

How can CPUs be enough for inference, when the current trend is “throwing more expensive, power-hungry, and narrowly specialized hardware at AI”*?

JW: AI needs today cover a whole spectrum. What are your project’s compute requirements? Do you need to be inferencing all the time? What about memory bandwidth? For the vast majority of that spectrum, CPUs will be the right-sized solution. Some inference needs may have a particularly high memory footprint, and therefore need a GPU.

But I think we’ll see a shift in time to smaller, more versatile solutions. It’s like I could have come to work in a Ferrari today, when what I actually need is a more economical electric vehicle that’ll get me here in the same time.

We’re still in the hype and research phase for AI, due to the euphoria around these massive large language models (LLMs), where the instinct is to throw the most possible power at a problem and see what happens. But at some point, these use cases will mature, and efficiency and sustainability will win out.

Not everyone will be able to pay for a solution like ChatGPT, which features all of human knowledge. We’ll see more specialization of models, as well as refinement of existing models. Overall, models will become smaller, and more focused on specific tasks.

*A quote from Ampere's recent white paper.

What are the most interesting inference use cases for Ampere chips today?

JW: We’re already seeing some great examples, from real-time voice-to-text translation in any language, which makes meetings with colleagues in other countries easier and improves accessibility for people with hearing impairments, to generative AI use cases such as creating artwork and videos or simplifying everyday routine tasks. These cases all work well with our CPUs.

More specifically, Matoha uses Ampere CPUs to power its near-infrared spectroscopy. This allows them to scan a 30-year-old landfill for waste no one back then thought of recycling. They can scan a bottle, figure out what type of plastic it is, and send it to the right recycling location. And it works with other materials too, like fabrics.

We also have Red Bull Racing, the highly successful Formula One team, which uses our processors for pre- and in-race day analysis, to optimize their racing strategies. They have a limited amount of time to run these analyses, using complex models based on past race data. Our CPUs allow them to process a lot of data in a very short time, so they can change strategies in real-time, for example, if the weather changes.

How exactly do Ampere CPUs transfer training data from Nvidia GPUs, for inference?

JW: It’s a common misperception that you need to run training and inference on the same hardware. It’s actually very easy to take a model built in one framework and run it on another piece of hardware. It’s particularly easy when you use [AI frameworks like] PyTorch and TensorFlow; the models are extremely portable.

We have a whole AI team at Ampere, which has developed software called AI-O, that allows us to have compatibility across all AI frameworks. So there’s no need to adapt data models at all. Just take a model trained with any GPU, put it on an Ampere CPU and it’ll run great. AI-O does some optimization on the data and processing sides, but you don’t need to use it unless you really want to improve performance. Otherwise, no need for quantization or anything like that. People think (transferring from training GPUs to inference CPUs) is incredibly complicated, but it’s not!

Can data models be adapted to get maximum performance from Ampere CPUs?

JW: Yes, just use the software library we have (AI-O): it’s sophisticated, it gets better results, and it makes sure the way the code is compiled is well-suited to our processors. You’ll get several times higher performance for some models, should you choose that option (but you don’t have to).

Sometimes, there’s an advantage to running at lower precision. So instead of running an FP32 [data model], run the model in something like int8. Our processors support FP32, FP16, Bfloat16, int8… any numerical format you’ll want to run in. In the case of int8, you’re essentially getting four times more performance capacity than FP32, and in many cases you’re not losing any accuracy as a result of doing so. And that’s just as easy to do on our processors as it would be on an Nvidia GPU, or Intel or AMD CPU.
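To make the lower-precision idea concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. It is an illustration only, not how AI-O or any particular framework implements quantization: the function names, the sample weights, and the per-tensor scaling scheme are all assumptions chosen for brevity. What it does show is that each int8 value occupies one byte instead of the four used by FP32, and that the rounding error stays within half a quantization step.

```python
from array import array


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale factor for the whole tensor (hypothetical, simplest scheme).
    scale = max_abs / 127 if max_abs else 1.0
    quantized = array(
        "b",  # signed 8-bit integers
        (max(-127, min(127, round(v / scale))) for v in values),
    )
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


# Made-up example weights, stored as 32-bit floats.
weights = array("f", [0.82, -1.27, 0.05, 0.4, -0.33, 1.1])

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Memory footprint: 4 bytes per FP32 value vs 1 byte per int8 value.
size_ratio = (len(weights) * weights.itemsize) / (len(q_weights) * q_weights.itemsize)
# Worst-case rounding error introduced by quantization.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"size ratio: {size_ratio:.0f}x, scale: {scale:.5f}, max error: {max_error:.5f}")
```

Whether a real model keeps its accuracy at int8 depends on the model and on calibration, so the error bound here should be read as a property of this toy example rather than a general guarantee.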

To make things even easier, we ensure you get full support from our AI engineers. That doesn’t exist with all the manufacturers today: they’ll have hardware support, but not software. Better still: we haven’t had many help requests yet, so we like to think that means our solution just works. We do know a lot of people are using AI-O: we’ve seen a sevenfold usage increase in the past six months, so that’s fantastic.

AI consumes considerable amounts of energy and (indirectly) water. Can you quantify the energy savings of Ampere CPUs vs other GPUs for AI inference?

JW: If you run [OpenAI’s generative speech recognition model] Whisper on our 128-core Altra CPU versus Nvidia’s A10 card, we consume 3.6 times less power per inference. Or for something lower-power, like Nvidia Tesla T4 cards, we consume 5.6 times less.

You also have the cooling aspect: the power you’re drawing turns into dissipated heat. So doing this with 3.6 times less power means it’s that much easier to cool. So our hardware doesn’t require super-exotic cooling systems, just standard fans.

Water requirements are harder to calculate because there are so many different ways of cooling data centers. But it’s a fact that the easier a CPU is to cool, the less water you need to cool it.
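A "power per inference" comparison like the one above can be reasoned about as energy per inference: average power draw divided by throughput. The sketch below assumes that definition, which the interview does not spell out, and uses made-up wattage and throughput numbers that are not Ampere or Nvidia measurements. The function name and parameters are hypothetical.

```python
def energy_per_inference_joules(avg_power_watts, inferences_per_second):
    """Energy per inference (J) = average power (W) / throughput (inferences/s)."""
    if inferences_per_second <= 0:
        raise ValueError("throughput must be positive")
    return avg_power_watts / inferences_per_second


# Hypothetical measurements for two devices running the same model.
# These values are illustrative only, NOT vendor benchmark results.
device_a = energy_per_inference_joules(avg_power_watts=150.0, inferences_per_second=30.0)
device_b = energy_per_inference_joules(avg_power_watts=200.0, inferences_per_second=20.0)

# How many times less energy device A uses per inference than device B.
ratio = device_b / device_a

print(f"device A: {device_a:.1f} J, device B: {device_b:.1f} J, ratio: {ratio:.1f}x")
```

The point of dividing by throughput is that a device drawing more watts can still come out ahead if it completes proportionally more inferences in the same time, so raw power draw alone does not settle the comparison.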

How can Ampere help cloud providers to become sustainability leaders?

JW: That’s absolutely our mission, as sustainability is one of CSPs’ main pillars. Most people only see the cost, so if we can provide a more efficient processor, great. But we’re seeing more and more CSPs stepping up and providing sustainability messages too, with energy figures, and the carbon footprint of Ampere versus Intel and other chipmakers, and so on. We encourage CSPs to be vocal about that.

We’re tackling how to reduce the amount of energy consumed without asking people to use less compute power. So we should be at the forefront of finding ways to create as little impact as possible. Especially with AI: you hear about some AI usages causing data centers to double their energy consumption. We need to pick the right solutions to make sure that doesn’t happen.

What can we expect from Ampere in terms of future developments in CPU technology, particularly in the context of AI and emerging technologies?

JW: Over the next few years, we’ll continue to release CPUs that are more efficient, and deliver ever-higher core counts, as that gives you more and more throughput for things like AI inferencing. So you’ll see us looking to increase compute output without requiring more power, by adding more cores and increasing memory bandwidth and I/O bandwidth, so that’s perfect for AI inferencing too.

In AI, as we have a team of dedicated engineers, you’ll see us put more new features into our CPUs: we’ve got some interesting ideas in the pipeline to increase inference performance disproportionately. The pace of innovation in the AI space is extraordinarily fast. We’re releasing new products extremely quickly for that reason. We’re also learning from how our clients are using our CPUs in AI today, to anticipate innovations we’ll work into products we’ll release very soon. If you take five years to make this tech, you’re already obsolete. So this is why we’ve adapted our development cycle.

What are you most excited about today?

JW: Sustainability has to be one. Doing something that has a huge impact globally is really exciting. The cloud has a big emissions footprint, globally speaking, so it’s important we take the lead here, including with regard to other industries.

More broadly speaking, I’m excited that we're building a new type of general-purpose compute for the world, which isn’t constrained by the limits of data centers to date. By thinking “What does the cloud need?” we’ve done some really cool things, and that’s why we can deliver such great performance across all CSPs. We have limitless capacity to innovate within our CPUs. It’s a new generation for the cloud era!

_Jeff Wittich presents "The Key to AI's Power Efficiency Revolution" (17:25) at ai-PULSE November 17, followed by a panel with Gladia and Powder, "How to make Inference as cost-efficient, sustainable and performant as possible?", from 17:45. [More info here](https://www.ai-pulse.eu/agenda)..._
