Hardware acceleration is fundamental for modern AI algorithms and high-performance computing in general. We use it for AI training and inference, and to accelerate signal processing, image processing and mathematical computations.
What is a CPU?
CPUs are by far the most widely used processing devices in modern computing systems. They are usually based on processing pipelines that perform instruction fetching, decoding, execution and memory accesses. A processing core usually takes the instruction list from an instruction segment and, aided by a program counter, goes through it one instruction at a time until the list is complete.
Modern CPUs can execute instructions out of order by analyzing data dependencies, which increases execution performance. They also have multiple processing cores that execute multiple instruction lists (tasks, threads and processes) simultaneously. Moreover, most are equipped with vector instructions that operate on multiple data elements with a single instruction (SIMD, Single-Instruction Multiple-Data), and they run at clock frequencies between 1.0 GHz and 6.0 GHz, depending on the model.
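To make the role of the program counter concrete, here is a minimal sketch of a core stepping through an instruction list. The instruction set (LOAD, ADD, JUMP_IF_ZERO, HALT) and the `run` function are hypothetical and exist only for illustration; a real CPU pipelines these steps and may reorder them.

```python
# Toy model of a core walking an instruction list with a program counter.
# The opcodes below are hypothetical, not a real instruction set.

def run(program, registers=None):
    """Execute `program` until HALT (or the end) and return the registers."""
    regs = dict(registers or {})
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        opcode, *operands = program[pc]  # fetch and decode
        pc += 1                          # default: move on to the next instruction
        if opcode == "LOAD":             # LOAD dst, constant
            dst, value = operands
            regs[dst] = value
        elif opcode == "ADD":            # ADD dst, src_a, src_b
            dst, a, b = operands
            regs[dst] = regs[a] + regs[b]
        elif opcode == "JUMP_IF_ZERO":   # JUMP_IF_ZERO reg, target
            reg, target = operands
            if regs[reg] == 0:
                pc = target              # a taken branch overwrites the program counter
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return regs


program = [
    ("LOAD", "a", 2),
    ("LOAD", "b", 3),
    ("ADD", "c", "a", "b"),
    ("HALT",),
]
print(run(program))
```

The ADD instruction cannot start before both LOADs have produced their values. That kind of data dependency is what an out-of-order core analyzes before reordering instructions.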
The most important considerations when choosing a CPU are the following:
Serial algorithms: serial algorithms, such as statistics and workloads with high data dependencies (like temporal data), are mostly suitable for CPUs. The execution time of these algorithms often scales inversely with the CPU frequency.
Branching: branching implies bifurcations in the execution flow, that is, selecting a set of instructions depending on the result of a condition. CPUs are optimized for this kind of workload thanks to branch prediction.
Control: CPUs are widely used for controlling other devices, processing user input and interacting with the user. They also manage the memory and computational resources available in the system.
Floating-point data: CPUs handle floating-point operations very efficiently. Modern processors execute trivial floating-point arithmetic at almost the speed of integer operations.
Parallel data: CPUs can process data efficiently thanks to SIMD instructions. Modern processors can process from 1 to 16 floating-point numbers per instruction. The caveat is that heavy data processing on the CPU can saturate one of the processing cores and cause problems with data consumption.
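The difference between a serial, data-dependent algorithm and a SIMD-friendly one can be sketched as follows. The exponential moving average is only an assumed example of temporal data. The lane width of 8 is also an assumption, picked from the 1 to 16 range mentioned above. Pure Python does not issue vector instructions, so the inner generator merely stands in for one.

```python
def exponential_moving_average(samples, alpha):
    """Serial: every output needs the previous output, so steps cannot overlap."""
    out = []
    prev = samples[0] if samples else 0.0
    for x in samples:
        prev = alpha * x + (1.0 - alpha) * prev  # depends on the previous iteration
        out.append(prev)
    return out


def scale_simd_style(samples, factor, lanes=8):
    """Data-parallel: elements are independent, so they are handled `lanes` at a time.

    Returns the scaled values and how many vector-style instructions were modelled.
    """
    out = []
    instructions = 0
    for start in range(0, len(samples), lanes):
        chunk = samples[start:start + lanes]
        out.extend(x * factor for x in chunk)  # stands in for one vector multiply
        instructions += 1
    return out, instructions


print(exponential_moving_average([1.0, 3.0], alpha=0.5))
print(scale_simd_style(list(range(20)), factor=2))
```

In the first function, more lanes would not help because each step waits for the previous one. In the second, the modelled instruction count shrinks as the lane width grows.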
Some weaknesses that argue against CPUs:
High-dimensional data: CPUs are inherently serial execution units. When dealing with massive, high-dimensional data, core utilization often saturates, requiring multi-core approaches. However, using all the CPU cores affects overall system performance because too little CPU time is left for the operating system.
Memory access: CPU memory bandwidth is often limited compared to other alternatives. This is a crucial factor when processing and moving data.
What is a GPU?
The most popular device for hardware acceleration is the Graphics Processing Unit (GPU). GPUs are well-suited for algorithm acceleration, given their architecture comprising several compute units. Unlike CPUs, GPUs have simpler processing cores with many execution units inside, capable of processing much more data per instruction. As a rough comparison, a GPU can be modelled as a processing core with a much larger vector unit. Moreover, a GPU can be composed of many processing cores.
The terminology changes from one vendor to another. Here is a table that illustrates a comparison:
| Component Name | NVIDIA | AMD | Intel |
| Execution Unit | CUDA Core | Vector Unit | Execution Unit |
| Processor Unit | Streaming Multiprocessor | Compute Unit | Subslice |
AI has pushed GPU vendors to include multiple kinds of execution units to process different data types. Still, most of them focus on floating-point operations, delivering outstanding performance on tasks with these data types. Moreover, GPUs run at lower frequencies than CPUs but have higher memory bandwidth, making them ideal for processing large amounts of data.
The most important considerations when choosing a GPU are the following:
Uniform algorithms: a GPU reaches its maximum efficiency when one instruction operates on a large amount of data without branching or spatial data dependencies. This is because all the execution units can process the data at once.
Large amounts of data: GPUs are handy for analyzing large amounts of data in parallel. For instance, in AI training, data samples are batched, and the GPU can process each batch in parallel to compute the errors and deviations, which are later collected and reduced to perform one training step.
Sectorized data: dividing data into sectors is one of the most common ways to process data with specific spatial dependencies on the GPU. A dataset can be split into sectors of two or more dimensions, and each sector can be processed in isolation for later reduction.
Floating-point data: GPUs also handle floating-point operations very efficiently. Modern GPUs execute trivial floating-point arithmetic at almost the same speed as integer operations.
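The sectorized pattern described above (split, process in isolation, reduce) can be sketched in plain Python. The function names and the 2x2 tile size are hypothetical. On a real GPU each tile would typically map to a group of threads, whereas here the tiles are simply processed one after another.

```python
def split_into_tiles(grid, tile_rows, tile_cols):
    """Yield rectangular tiles (lists of rows) that cover the 2D list `grid`."""
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    for r in range(0, n_rows, tile_rows):
        for c in range(0, n_cols, tile_cols):
            yield [row[c:c + tile_cols] for row in grid[r:r + tile_rows]]


def tile_sum(tile):
    """Work done on one tile in isolation; it never reads another tile's data."""
    return sum(sum(row) for row in tile)


def tiled_total(grid, tile_rows=2, tile_cols=2):
    """Return the grand total and the per-tile partial results."""
    # Each partial is independent, so the tiles could be processed in parallel.
    partials = [tile_sum(t) for t in split_into_tiles(grid, tile_rows, tile_cols)]
    return sum(partials), partials  # the final sum is the reduction step


grid = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(tiled_total(grid))
```

The same shape applies to batched training: per-sample results are computed independently and then reduced into a single update.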
The weaknesses to weigh when deciding whether an algorithm is suitable for GPUs are:
Spatial dependencies: non-coalesced or non-uniform memory accesses can degrade GPU performance by factors of about ten.
Branching and Control: GPUs are not well equipped to deal with branches. Modern architectures are optimized to mitigate branching. However, because the execution units are usually vector units, a branch whose result differs across the data serializes the execution: each path runs in turn.
Serial processing: Serial processing is perhaps a GPU's most relevant performance killer. Not having enough workload or having highly serial processing leads to suboptimal usage of the GPU.
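The branching penalty can be illustrated with a small model of a vector unit that runs both sides of a branch under a mask. This is a simplified sketch, not a description of any specific GPU. The group width of 32 lanes and the even/odd branch are assumptions chosen for illustration.

```python
def simulate_branch(values, lanes=32):
    """Model `x // 2 if x is even else 3 * x + 1` on a vector unit.

    Returns the results and the number of vector passes that were needed.
    """
    results = []
    passes = 0
    for start in range(0, len(values), lanes):
        group = values[start:start + lanes]
        mask = [x % 2 == 0 for x in group]  # per-lane branch condition
        out = list(group)
        if any(mask):      # "then" path runs with the odd lanes masked off
            out = [x // 2 if m else o for x, m, o in zip(group, mask, out)]
            passes += 1
        if not all(mask):  # "else" path runs with the even lanes masked off
            out = [o if m else 3 * x + 1 for x, m, o in zip(group, mask, out)]
            passes += 1
        results.extend(out)
    return results, passes


print(simulate_branch([2, 4, 6, 8], lanes=4))  # all lanes agree: one pass
print(simulate_branch([1, 2, 3, 4], lanes=4))  # lanes disagree: two passes
```

When every lane in a group takes the same path, only one pass is needed. As soon as lanes disagree, both paths run one after the other and part of the vector unit sits idle in each pass.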
How can FPGAs help?
Field-Programmable Gate Arrays (FPGAs) are another kind of device. In contrast to CPUs and GPUs, FPGAs do not execute instructions per se; they specialize in hardware-level implementations. FPGAs usually comprise thousands of cells containing Look-Up Tables (LUTs), registers, full adders and multiplexers.
In contrast to other alternatives, one of the key advantages of FPGAs is power consumption. Because FPGAs configure their structures to build custom implementations, designs can be optimized to consume the least power possible. Interestingly, it is even possible to build a GPU or a CPU inside an FPGA.
Regarding hardware acceleration, FPGAs are suitable for massive parallelism, where they compete against GPUs, and for low latency, which lets dataflow-like patterns achieve better throughput. However, they run at lower frequencies than GPUs and CPUs, ranging between 100 and 500 MHz.
Some considerations when taking FPGAs into account for acceleration:
Low-latency and dataflows: FPGAs are well-suited for dataflow processing, given the implementation happens at the hardware level and the processing can be represented through execution pipelines. Moreover, having large amounts of data is unnecessary to get high performance.
Third-party hardware connections: FPGAs are electronic devices capable of interconnecting with other devices, such as cameras, sensors, actuators, and others. It is possible to capture data from a sensor and do the pre-processing and core processing inside the FPGA at low power.
Multiple functionality: FPGAs can implement multiple designs within the programmable logic. It is possible to divide the FPGA among different tasks that run simultaneously.
Determinism: FPGAs are highly deterministic, making them suitable for critical applications.
Low power: FPGAs are well-known for low power consumption and better performance/power trade-offs.
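The low-latency and determinism points can be made concrete with a cycle-count model of a hardware pipeline. This is a minimal sketch under stated assumptions: an idealized pipeline with no stalls, a hypothetical depth of 10 stages, and a 200 MHz clock picked from the 100 to 500 MHz range mentioned earlier.

```python
def pipeline_cycles(n_items, depth, initiation_interval=1):
    """Cycles to push `n_items` through an ideal pipeline with `depth` stages.

    The first result appears after `depth` cycles; each further item adds
    `initiation_interval` cycles.
    """
    if n_items <= 0:
        return 0
    return depth + (n_items - 1) * initiation_interval


def pipeline_time_us(n_items, depth, clock_mhz, initiation_interval=1):
    """Same quantity expressed in microseconds (cycles divided by MHz)."""
    return pipeline_cycles(n_items, depth, initiation_interval) / clock_mhz


# Hypothetical design: 10 stages, 200 MHz clock, one new item accepted per cycle.
print(pipeline_time_us(1, depth=10, clock_mhz=200))     # latency of a single item
print(pipeline_time_us(1000, depth=10, clock_mhz=200))  # 1000 items streamed through
```

Under these assumptions a single item already gets its result after `depth` cycles, so no large batch is needed to use the hardware well, and the cycle count is the same on every run.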
However, there are some cases where we have to reconsider:
Floating-point operations: FPGAs are weak at floating-point arithmetic, given its complexity in terms of hardware. Instead, fixed-point arithmetic is highly recommended when possible.
Communication bottlenecks: for massive parallelism, the Achilles' heel is communication and moving data. FPGAs are recommended when all data movement stays inside the implementation or when the FPGA is connected directly to the sensors.
Workflow: depending on the workflow, development times are high. However, in HPC, we recommend using high-level synthesis, which allows algorithms to be expressed in C/C++ and optimized toward a particular implementation goal.
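As an illustration of the fixed-point recommendation above, the following sketch emulates fixed-point arithmetic with plain integers. The choice of 8 fractional bits and the helper names are assumptions for this example. A real design would size the integer and fractional widths to the data range and would use the types offered by its HLS tool.

```python
FRACTION_BITS = 8  # assumed format: 8 fractional bits, so the resolution is 1/256


def to_fixed(value, fraction_bits=FRACTION_BITS):
    """Convert a real number to its nearest fixed-point integer representation."""
    return int(round(value * (1 << fraction_bits)))


def to_float(fixed, fraction_bits=FRACTION_BITS):
    """Convert a fixed-point integer back to a Python float."""
    return fixed / (1 << fraction_bits)


def fixed_mul(a, b, fraction_bits=FRACTION_BITS):
    """Multiply two fixed-point values.

    The raw product carries twice the fractional bits, so it is shifted back.
    The shift truncates toward negative infinity instead of rounding.
    """
    return (a * b) >> fraction_bits


a = to_fixed(1.5)    # 384
b = to_fixed(2.25)   # 576
print(to_float(a + b))            # addition is a plain integer add
print(to_float(fixed_mul(a, b)))  # multiplication needs the extra shift
```

Both operations use only integer adders, multipliers and shifts, which map to FPGA logic far more cheaply than a floating-point unit. The price is limited range and precision, which must be checked against the application.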
Conclusions
In this blog post, we have covered some details of choosing an accelerator, including CPU, GPU, and FPGA. We excluded the device cost from the equation and focused on technical details. Moreover, we promote the selection based on objectives, always focusing on performance and energy consumption.
RidgeRun Is Here To Help You Reach The Next Level
RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:
Algorithm Acceleration using FPGAs.
Image Signal Processing IP Cores.
Linux Device Drivers.
Low Power AI Acceleration using FPGAs.
Accelerated C++ Applications using AMD XRT.
And it includes much more. Contact us at https://www.ridgerun.com/contact.