AI Hardware Accelerators
Notes on AI hardware accelerators, including FPGAs and emerging ASICs such as the Google TPU, AMD AIE, and AWS Trainium. This note does not cover the software stack; it focuses only on the hardware architecture and its design choices.
AMD Versal AIE
- AIE is essentially an AI-specific CGRA, where each functional unit is a VLIW processor with an AI-optimized ISA.
- The AIE tiles (i.e., the FUs) can be used as a systolic array to compute GEMM, or as a CGRA to deploy spatially heterogeneous dataflow applications (see the sketch after this list).
- At the time of writing, AIE is only capable of systolic-array-style computation as a slave co-processor; one should not expect anything more from it.
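To make the systolic-array mode concrete, here is a minimal Python sketch of an output-stationary array computing GEMM. It collapses the cycle-by-cycle input skewing into plain loops; the function and variable names are illustrative and not part of any AIE toolchain.

```python
import numpy as np

def systolic_gemm(A, B, n):
    """Simulate an n x n output-stationary systolic array computing C = A @ B.

    A is (n, K), B is (K, n). Each PE (i, j) holds one accumulator;
    A values stream in from the left, B values from the top.
    """
    K = A.shape[1]
    acc = np.zeros((n, n))
    # In real hardware, inputs are skewed so PE (i, j) sees a[i, k] and
    # b[k, j] at cycle i + j + k; here the timing is collapsed and only
    # the dataflow (one K-wavefront per step) is kept.
    for k in range(K):
        for i in range(n):
            for j in range(n):
                acc[i, j] += A[i, k] * B[k, j]
    return acc

A = np.random.rand(4, 8)
B = np.random.rand(8, 4)
assert np.allclose(systolic_gemm(A, B, 4), A @ B)
```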
FPGAs
Applications
- Latency-sensitive real-time applications
- Image or signal de-noising, de-blurring, brightness adjustment, ROI extraction
- DSP such as signal synchronization, data aggregation, data compression
- ASIC prototyping and emulation
- Small DNN inference, though not as performant as a GPU/TPU
  - Best with small batch sizes, custom precision, and simple kernels (see the quantization sketch after this list)
- Xilinx introduced the Versal AIE architecture for DNN inference because FPGAs have limited circuitry for massive parallelism, run at low clock frequencies, and carry architectural debt to maintain their flexibility.
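As a rough illustration of the custom-precision point: an FPGA can synthesize multipliers at exactly the bit width a model needs, whereas GPU/TPU datapaths come in fixed widths. The sketch below shows generic symmetric uniform quantization to an arbitrary width; it is an illustration, not Xilinx tooling.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantization to an arbitrary bit width.

    On an FPGA, multipliers can be synthesized at exactly `bits` wide,
    which is where the area/power win over fixed-width GPU/TPU
    datapaths comes from. Illustrative only.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g., 7 for int4
    scale = np.abs(w).max() / qmax        # one fp scale per tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale

w = np.random.randn(16)
q4, s4 = quantize(w, 4)                   # int4 weights + one fp scale
print(np.abs(w - q4 * s4).max())          # quantization error
```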
AWS Trainium
Architecture components in each NeuronCore:
- 128x128 systolic array
- Compute engine:
- Scalar engine
- Vector engine
- GPSIMD engine for custom ops (e.g., issuing DMA requests)
- 24MB SBUF (a compiler-managed scratchpad with 128 partitions) to feed data into the systolic array, plus a 2MB PSUM buffer for draining PE results (see the sketch after this list)
- NeuronLink interconnect between NeuronCores
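A toy model of how these pieces might cooperate on a tiled GEMM: tiles are staged through an SBUF-like scratchpad, each pass through the 128x128 array accumulates into a PSUM-like buffer, and finished tiles are drained out. This is a NumPy sketch of the dataflow under those assumptions, not the Neuron SDK API; all names are illustrative.

```python
import numpy as np

T = 128  # systolic array dimension / number of SBUF partitions

def neuroncore_gemm(A, B):
    """Toy model of a tiled GEMM on one NeuronCore (illustrative only)."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(0, M, T):
        for n in range(0, N, T):
            # Stand-in for the 2MB PSUM buffer: one output tile's accumulator.
            psum = np.zeros((T, T), dtype=A.dtype)
            for k in range(0, K, T):
                a_tile = A[m:m+T, k:k+T]   # staged via the SBUF scratchpad in reality
                b_tile = B[k:k+T, n:n+T]
                psum += a_tile @ b_tile    # one pass through the 128x128 array
            C[m:m+T, n:n+T] = psum         # drain PSUM results to the output
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(neuroncore_gemm(A, B), A @ B, atol=1e-3)
```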