
AI Workloads' Impact on Processor Design

Author: AIVON, January 29, 2026


AI is fundamentally changing processor design by combining custom processing elements tailored to specific AI workloads with more traditional processors for other tasks.

However, the trade-offs are becoming increasingly confusing, complex, and difficult to manage. For example, workload changes can occur faster than it takes to produce a custom design. In addition, an AI-specific accelerator may exceed power and thermal budgets, which can require adjustments to workloads. Integrating all of these pieces can create problems that need to be solved at the system level, not just within the chip.

 

Architecture shifts and heterogeneous designs

"AI workloads have thoroughly changed processor architectures," said Steven Woo from Rambus. "It became clear that existing architectures were not working well. Once people realized around 2014 that you could use GPUs and get huge gains in throughput, AI got a major push. People then started to call GPUs a specialized architecture and asked whether we could do more. It became obvious that multiply-accumulate operations common in AI were a bottleneck. Now we have a lot of dedicated hardware for MACs. The question is what else needs to be in hardware. The key is finding the long poles in the tent and driving them in."
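The multiply-accumulate (MAC) operation Woo identifies as the bottleneck is simple to state, and a minimal sketch of a dense layer shows why it dominates: a layer with n_in inputs and n_out outputs performs n_in × n_out MACs, and almost nothing else.

```python
# Sketch: a dense (fully connected) layer reduces to rows of multiply-accumulates,
# which is why dedicated MAC hardware pays off for AI workloads.

def dense_layer(x, weights, bias):
    """Compute y = W @ x + b with explicit multiply-accumulate loops."""
    y = []
    for row, b in zip(weights, bias):
        acc = b
        for w, xi in zip(row, x):
            acc += w * xi          # one MAC (multiply-accumulate)
        y.append(acc)
    return y

x = [1.0, 2.0, 3.0]
W = [[0.5, 0.5, 0.5],
     [1.0, 0.0, -1.0]]
b = [0.0, 1.0]
print(dense_layer(x, W, b))        # -> [3.0, -1.0]
mac_count = len(W) * len(W[0])     # 6 MACs even for this tiny layer
print(mac_count)
```

Scaled to real networks, the inner loop runs billions of times per inference, so it is the "long pole" worth hardening into silicon.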

Others agree. "AI initially mapped well onto GPU architectures, which explains much of NVIDIA's market success," said Rich Goldman, director at Ansys. "Intel has long driven video processing within its CPUs, and now it also builds discrete GPUs. AMD has an architecture where GPU and CPU share memory. But CPUs remain important. NVIDIA's Grace Hopper is a CPU-GPU combination because not everything fits a GPU architecture. Even in such applications, parts of the system run small CPUs. For decades we ran everything on a single CPU architecture—x86 or maybe RISC—so different applications may run better on different architectures. NVIDIA focused on gaming and then adapted that architecture for animation and film. That same architecture fits many AI workloads, and AI is now driving a lot of the market."

The current challenge is how to develop more efficient platforms that can be optimized for specific use cases. "When you implement a capability in truly scalable hardware rather than as a one-off use case, the challenge becomes how to operate that capability," said Suhas Mitra, product marketing director for Tensilica AI products at Cadence. "Traditionally, a system has a CPU and, for mobile, a GPU, DSP, and so on. All of these are being rethought as AI workloads expose massive parallelism. GPUs became popular because they have excellent hardware engines for parallel processing, so vendors could benefit quickly."

Expedera chief scientist Sharad Chole said architectures perform best when workloads are well understood. "In these types of designs, suppose you are integrating an ISP and an NPU tightly on an edge device. SoC teams are looking at how to reduce area and power," he said.

Chole added that understanding memory-latency implications is a key challenge. "What does memory look like when the NPU is slow? What does memory look like when the NPU is fast? Ultimately, the question of balancing MACs against memory comes down to minimizing I/O buffering wherever possible."
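The MAC-versus-memory balance Chole describes can be checked with a back-of-envelope roofline calculation. The sketch below uses illustrative numbers (not figures from the article): a layer is memory-bound when the time to move its data exceeds the time its MAC array needs to compute.

```python
# Sketch (illustrative numbers): is a layer limited by the MAC array or by memory?

def bound(macs, bytes_moved, peak_macs_per_s, bytes_per_s):
    """Roofline-style estimate: return the limiting resource and its time."""
    t_compute = macs / peak_macs_per_s      # time if MACs are the limit
    t_memory = bytes_moved / bytes_per_s    # time if bandwidth is the limit
    if t_compute >= t_memory:
        return ("compute-bound", t_compute)
    return ("memory-bound", t_memory)

# A layer with few MACs per byte moved is typically memory-bound.
print(bound(macs=1e6, bytes_moved=4e6, peak_macs_per_s=1e12, bytes_per_s=10e9))
# A layer reusing the same data heavily becomes compute-bound.
print(bound(macs=1e9, bytes_moved=4e6, peak_macs_per_s=1e12, bytes_per_s=10e9))
```

When most layers land on the memory-bound side, adding more MACs is wasted area, which is the balance question NPU architects face.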

External memory bandwidth is also a key part of the equation, particularly for edge devices. "No one has enough bandwidth," he noted. "So how do we partition workloads or schedule neural networks to preserve external memory bandwidth and keep it as low as possible? Essentially we do this by packing or splitting networks into smaller parts and attempting to execute those parts appropriately."
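The partitioning Chole mentions often comes down to tiling: splitting a layer's working set so each piece fits in an on-chip buffer, leaving only a single pass over external memory. A minimal sketch with hypothetical sizes:

```python
# Sketch (hypothetical sizes): choose a tile size so each tile of a feature map
# fits in the on-chip buffer, bounding external-memory traffic to one pass.

def tile_rows(total_rows, row_bytes, buffer_bytes):
    """Largest number of rows per tile that fits the buffer, and the tile count."""
    rows_per_tile = max(1, buffer_bytes // row_bytes)
    n_tiles = -(-total_rows // rows_per_tile)   # ceiling division
    return rows_per_tile, n_tiles

# e.g. a 224-row feature map, 224 channels-bytes per row element, 512 KiB buffer
rows, tiles = tile_rows(total_rows=224, row_bytes=224 * 64, buffer_bytes=512 * 1024)
print(rows, tiles)
```

Scheduling those tiles back-to-back keeps the external bus at its minimum sustained rate rather than demanding bursts it cannot supply.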

 

Designing for a fast-changing future

A major issue with AI is that algorithms and compute models evolve faster than the time it takes to design hardware from scratch.

"If you say you will build a CPU that excels on LSTM models, that cycle takes years," Woo said. "Then you realize that LSTM models may rise and fall or be replaced as dominant models within two years. You want to build specialized hardware, but you must do it faster to keep up. The holy grail would be creating hardware as quickly as algorithms change. That would be ideal, but the industry is under pressure and cannot fully achieve that speed."

This also means processor architectures that handle AI workloads will differ from processors not focused on AI. "If you look at engines used for training, they are not running Linux or general-purpose applications because they are not designed for general branching, multiple instruction types, or multi-language support," Woo said. "They are very simple engines that run a small set of operations quickly. They are highly optimized for the specific data-movement patterns required for compute. For example, Google's TPU uses a systolic-array-style architecture that has been around since the 1980s. It excels at a specific, highly regular type of work on large data arrays, making it well suited for dense neural networks. But running general code is not the design intent. These accelerators behave like large coprocessors that can handle the bulk of computation while still needing interfaces to other components to manage the remainder of the workload."
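Woo's description of the TPU-style systolic array can be illustrated with a toy, cycle-stepped model (a sketch only, not Google's implementation): operands enter from the array edges, skewed by one cycle per row or column, and each processing element (PE) multiply-accumulates in place as values flow past it.

```python
# Sketch: toy cycle-level model of an output-stationary systolic array.
# A streams in from the left edge, B from the top edge; each PE does one MAC
# per cycle and passes its operands to the next PE.

def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[None] * n for _ in range(n)]   # operand latched in each PE
    b_reg = [[None] * n for _ in range(n)]
    for t in range(3 * n - 2):               # cycles until the array drains
        new_a = [[None] * n for _ in range(n)]
        new_b = [[None] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # a enters at the left edge, skewed by row, then moves right
                a = (A[i][t - i] if 0 <= t - i < n else None) if j == 0 else a_reg[i][j - 1]
                # b enters at the top edge, skewed by column, then moves down
                b = (B[t - j][j] if 0 <= t - j < n else None) if i == 0 else b_reg[i - 1][j]
                if a is not None and b is not None:
                    C[i][j] += a * b         # one MAC per PE per cycle
                new_a[i][j], new_b[i][j] = a, b
        a_reg, b_reg = new_a, new_b
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```

The rigid, nearest-neighbor data movement is exactly what makes the array efficient for dense matrix work, and exactly what makes it unsuited to general branching code.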

Benchmarking is also difficult because comparisons are not always apples-to-apples, complicating architecture development. "This is a hard topic because different teams use different tools," Chole said. "In engineering practice, the task looks like system-level benchmarking. You benchmark each SoC component separately and try to infer required bandwidth from those numbers—performance, latency, and so on. Based on that, you estimate system behavior. As designs mature, we consider transaction-accurate simulation or hybrid emulation approaches to get exact performance and bandwidth requirements for different modules. For instance, a RISC-V core and an NPU must cooperate and coexist. Do they need to be pipelined? Can their workloads be pipelined? Exactly how many cycles does the RISC-V core require? For that, we must compile programs for the RISC-V core and the NPU and run a joint simulation."

 

Impact on processor PPA (power, performance, area)

All these variables affect power, performance, and area/cost trade-offs.

"ML workload PPA trade-offs are similar to the choices architects face when considering acceleration—energy efficiency versus area," said Ian Bratt, researcher and senior technical director at Arm. "Over the past few years, CPUs have improved significantly for ML workloads through added ML-specific instructions. Many ML workloads can run on modern CPUs. However, in highly constrained energy environments, it can be worthwhile to accept additional silicon area to add a dedicated NPU that is more energy-efficient for inference than a CPU. That efficiency comes at the cost of extra silicon area and reduced flexibility; NPU IP typically runs neural networks only. Compared with more flexible components like CPUs, dedicated units such as NPUs may achieve higher overall performance or lower latency."

Russell Klein, program director for Siemens EDA's Catapult software group, explained two design aspects that most affect PPA. "One is the data representation used in compute. For most ML computations, floating point is very inefficient. Using a more suitable representation makes designs faster, smaller, and lower power," Klein said.

The other major factor is the number of compute elements in the design. "Essentially, how many multipliers are included," Klein said. "This determines the parallelism needed to deliver throughput. A design can include many multipliers, making it large, power-hungry, and fast. Or it can include only a few, making it small and low-power but much slower. Beyond power, performance, and area, another important metric is energy per inference. Any battery-powered or energy-harvesting device may be more sensitive to energy per inference than to peak power."
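Klein's trade-off between multiplier count, latency, and energy per inference can be sketched with a crude cost model. All numbers below are hypothetical, chosen only to show the shape of the trade: more multipliers cut latency but add area (and leakage), while energy per inference can favor a different design point than peak power does.

```python
# Sketch (hypothetical numbers): latency and energy per inference as a
# function of how many multipliers the design includes.

def design_point(n_multipliers, total_macs, mac_energy_pj, static_mw, clock_hz):
    cycles = -(-total_macs // n_multipliers)       # ceil: n_multipliers MACs/cycle
    latency_s = cycles / clock_hz
    dynamic_j = total_macs * mac_energy_pj * 1e-12       # same work either way
    static_j = static_mw * 1e-3 * n_multipliers * latency_s  # leakage ~ area * time
    return latency_s, dynamic_j + static_j

for n in (64, 1024):
    lat, e = design_point(n, total_macs=10_000_000, mac_energy_pj=1.0,
                          static_mw=0.01, clock_hz=500e6)
    print(f"{n} multipliers: {lat * 1e3:.3f} ms, {e * 1e6:.2f} uJ per inference")
```

In this toy model the dynamic energy is fixed by the workload, so the large array buys a 16x latency win for the same energy per inference—the kind of point a battery-powered design would weigh against its area budget.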

Numeric representation of activations and weights also significantly impacts PPA.

"In data centers, everything is often 32-bit floating point. Alternative representations reduce operator size and the amount of data that needs to be moved and stored," Klein noted. "Most AI algorithms do not need the full dynamic range of floating point and work well in fixed point. Fixed-point multipliers typically have much smaller area and power than corresponding floating-point multipliers, and they operate faster. Often 32-bit fixed point is unnecessary. Many algorithms can reduce feature and weight bit widths to 16 bits, or in some cases to 8 bits or smaller. Multiplier area and power scale roughly with the square of data width. Thus a 16-bit multiplier is much smaller than a 32-bit one; an 8-bit fixed-point multiplier may consume only about 3% of the area and power of a 32-bit floating-point multiplier. If algorithms can use 8-bit fixed point instead of 32-bit float, memory storage and bus bandwidth requirements fall, yielding substantial area and power savings. Quantization-aware training can further reduce required bit widths. Networks trained with quantization-aware methods typically need about half the bit width that post-training quantization requires. Quantization-aware networks often need only 3–8 bits of fixed-point representation; some layers may use a single bit. A 1-bit multiplier is essentially an AND gate."
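The representation change Klein describes can be sketched in a few lines: symmetric post-training quantization maps float weights onto 8-bit integer codes plus one scale factor, and the worst-case rounding error per weight is half a quantization step.

```python
# Sketch: symmetric post-training quantization of a weight vector to 8-bit
# fixed point (int8 codes plus a single float scale).

def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.82, -0.41, 0.05, -1.27]
q, s = quantize(w)
print(q)                                        # int8 codes
print(max(abs(a - b) for a, b in zip(w, dequantize(q, s))))  # rounding error
```

The int8 codes need a quarter of the storage and bus bandwidth of 32-bit floats, which is where the memory and power savings in the quote come from; the multiplier-area savings come on top of that.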

Aggressive quantization introduces overflow concerns. "With 32-bit float, developers do not worry about values exceeding the representable range. With small fixed-point types, overflow must be addressed, and it may occur frequently. Using saturation arithmetic is one solution: operations clamp to the maximum representable value instead of wrapping. This approach works well for many ML algorithms because the exact magnitude of a large intermediate sum matters less than the fact that it is large. Saturating math lets developers shave another bit or two off fixed-point sizes. Some networks do need the dynamic range offered by floating point; when converted to fixed point they lose too much precision or would require more than 32 bits for acceptable accuracy. In such cases, there are several reduced-precision floating-point formats. Bfloat16, developed by Google for its NPUs, is a 16-bit floating-point format that converts easily to and from standard 32-bit floats. There is also the IEEE-754 16-bit float and NVIDIA's TensorFloat," Klein added.
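The difference between saturating and wrapping arithmetic is easy to demonstrate. In the sketch below, plain two's-complement wrap-around flips the sign of an overflowing sum, while the saturating version clamps it, preserving the one fact ML algorithms usually need: that the value is large and positive.

```python
# Sketch: saturating vs. wrapping addition for a signed 8-bit accumulator.

def sat_add(a, b, bits=8):
    """Clamp the sum to the representable range instead of overflowing."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1   # -128..127 for 8 bits
    return max(lo, min(hi, a + b))

def wrap_add(a, b, bits=8):
    """What plain two's-complement hardware does on overflow."""
    m = 1 << bits
    s = (a + b) % m
    return s - m if s >= 1 << (bits - 1) else s

print(sat_add(100, 50))    # -> 127  (clamped: still reads as "large positive")
print(wrap_add(100, 50))   # -> -106 (wrapped: sign flipped, far more damaging)
```

Because the clamped result stays ordered correctly relative to smaller sums, layers tolerate it well, which is what lets designers drop another bit or two of accumulator width.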

Using any of these approaches leads to smaller, faster, and lower-power designs.

Woo also noted that a general-purpose core is good at many things but not optimized for anything. "At any time, portions of a general-purpose core are used while other parts are idle. These cores require area and energy. As Moore's Law continues to provide more transistors, there is merit to building specialized cores along the AI pipeline that excel at certain tasks. Sometimes those cores are idle; sometimes they are active. That can be better than always using general-purpose cores because general cores waste area and power and never achieve optimal performance. Coupled with a market willing to pay for higher-margin solutions, this becomes a viable strategy."

Ansys product marketing director Marc Swinnen said this approach is straightforward for hardware engineers. "You ship a first version, observe what is used and what is not, then address those gaps. The applications you run are critical to understanding trade-offs. If you align your hardware to the applications you want to run, you can achieve a more efficient design than using an off-the-shelf solution. A custom chip fits the intended workload well."

Some generative AI developers are exploring building their own silicon, which suggests that current commodity semiconductors may not meet future needs. This is another example of how AI is changing processor design and the surrounding market dynamics.

AI can also play a role in smaller chip markets, where semi-custom and custom hardware modules can be characterized and added into designs without building everything from scratch. Large chipmakers such as Intel and AMD have done this internally for some time, but smaller fabless companies may be at a disadvantage.

"The issue is that your small chip must compete with existing solutions," said Andy Heinig of Fraunhofer IIS' Engineering of Adaptive Systems division. "If you are not focusing on performance, you cannot compete. People are concerned with getting the ecosystem running. From our perspective, it is a chicken-and-egg problem: you need performance because these chips are more expensive than established SoC solutions, but you cannot focus on performance until the ecosystem is up and running."

 

Getting the start right

Unlike the past, when many chips were designed for a single socket, AI-driven design starts with workloads.

"When these trade-offs occur, understanding the target is extremely important," Chole said. "If you say 'I want to do everything and support everything,' you are not really optimizing for anything. You are essentially putting a general solution in and hoping it meets your power budget. In our experience, that seldom works. Every neural network and deployment case on edge devices is unique. If your chip goes into a headset running RNNs rather than into an ADAS chip running transformers, the use case is completely different. NPU, memory system, configuration, and power are all different. So it is important to understand a set of key workloads you want to optimize for. These can be multiple networks. Teams must agree on the important networks and optimize for them. Engineering teams often lack this focus when considering NPUs; they just want the best possible thing, but you cannot have the best of everything without trade-offs."

Mitra at Cadence noted that everyone considers PPA similarly but emphasizes different parts of the trade-off depending on their domain. "If you are in the data center, you might accept slightly more area because your goal is very high-throughput machines for billions of inferences or large-scale model training. You need massive datasets and large-scale compute clusters. Running inference for large language models on a desktop is no longer realistic; even inference for these large models requires substantial resources."

Other factors matter as well. "Hardware architecture decisions drive much of this, but software also plays a critical role," said William Ruby, product management director at Synopsys EDA. "Performance versus energy efficiency is key. How much memory is needed? How will the memory subsystem be partitioned? Can software be optimized for energy efficiency? Process technology choices are also important, for power, performance, area, and cost (PPAC) reasons."

According to Synopsys AI/ML processor product manager Gordon Cooper, if power efficiency is not the primary concern, an embedded GPU can provide maximum coding flexibility. "However, it will never match dedicated processors in area and energy efficiency. If you design with an NPU, there are still trade-offs between area and power. Minimizing on-chip memory will reduce total area but increase off-chip memory traffic, thereby increasing power. Increasing on-chip memory reduces off-chip read/write power."
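Cooper's on-chip versus off-chip memory trade can be put into a crude cost model. Everything below is hypothetical (the hit-rate model, energy-per-byte figures, and area density are placeholders), but it shows the shape of the trade-off: larger SRAM costs die area while converting expensive DRAM accesses into cheap local ones.

```python
# Sketch (hypothetical numbers): energy and area as a function of on-chip SRAM size.

def npu_cost(sram_kib, traffic_mib, hit_rate_per_kib=0.001,
             dram_pj_per_byte=100.0, sram_pj_per_byte=1.0, mm2_per_kib=0.01):
    hit_rate = min(0.95, sram_kib * hit_rate_per_kib)   # crude capacity->hit model
    bytes_total = traffic_mib * 1024 * 1024
    energy_mj = bytes_total * ((1 - hit_rate) * dram_pj_per_byte
                               + hit_rate * sram_pj_per_byte) * 1e-9
    area_mm2 = sram_kib * mm2_per_kib
    return energy_mj, area_mm2

small = npu_cost(sram_kib=64, traffic_mib=100)
large = npu_cost(sram_kib=512, traffic_mib=100)
print(small)   # (energy_mJ, area_mm2) for the small-SRAM design
print(large)   # more area, noticeably less energy
```

With DRAM accesses costing on the order of 100x an SRAM access in this model, the energy curve falls steeply with SRAM size until the hit rate saturates, which is why the "right" amount of on-chip memory depends entirely on the workload's traffic pattern.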

 

Conclusion

Many of these issues are increasingly system-level rather than solely chip-level problems.

"People think of the training portion of AI as extremely compute-intensive with a lot of data movement," Woo said. "Once you add acceleration hardware, the rest of the system can become the bottleneck. For this reason, we see platforms from NVIDIA and others that combine complex training engines with Xeon-class CPUs. There are parts of computation that AI engines are not suited for; they are not designed to run general code. This becomes a heterogeneous system problem: everything must cooperate."

The software side of the problem also presents opportunities for efficiency improvements, such as in-network aggregation and simplification. "There is a recognition that AI workloads include specific algorithmic reductions and aggregation operations that reduce many numbers into a single number or a small set of numbers," Woo explained. "Traditionally, you would send all that data across the interconnect to a processor and then aggregate. Why not perform aggregation in the switch since the data already passes through it? The advantage is that you do online reduction and only need to send a single result, which reduces network traffic."
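The in-switch aggregation Woo describes can be reduced to a toy comparison (a sketch of the idea, not any vendor's protocol): with host-side reduction, the processor receives every node's full gradient vector; with in-network reduction, the switch sums vectors as they pass through and the processor receives a single result.

```python
# Sketch: traffic arriving at the processor with and without in-network reduction.

def reduce_at_host(grads):
    """Baseline: every node ships its full vector to the host."""
    total = [sum(col) for col in zip(*grads)]
    received = sum(len(g) for g in grads)    # values the host must receive
    return total, received

def reduce_in_switch(grads):
    """The switch sums vectors in flight; the host receives one vector."""
    total = [sum(col) for col in zip(*grads)]
    return total, len(total)

grads = [[1, 2], [3, 4], [5, 6], [7, 8]]     # 4 nodes, 2-element gradients
print(reduce_at_host(grads))    # -> ([16, 20], 8)
print(reduce_in_switch(grads))  # -> ([16, 20], 2)
```

The result is identical either way; what changes is that the final-hop traffic drops from n_nodes vectors to one, and the reduction work is absorbed by hardware the data was already traversing.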

Architectural approaches like that address multiple problems at once: data movement between nodes is slow, so moving less data is beneficial; they eliminate redundant data transfer to processors; and they enable parallelism by letting each switch perform partial computation.

Expedera's Chole added that AI workloads can be defined as graphs. "The graph is not a small instruction set. We may perform millions of adds or tens of millions of matrix multiplies at once. That changes how you think about execution, instruction encoding, compression, prediction, and scheduling. Doing this on a general CPU is impractical and too costly. In a neural network the number of active MACs is enormous, and the way you generate, compress, and schedule instructions substantially affects utilization and bandwidth. That is a major impact AI has on processor architecture."


2025 AIVON.COM All Rights Reserved