AI Hardware Accelerators
Notes on AI hardware accelerators, including FPGAs and emerging ASICs such as the Google TPU, AMD AIE, and AWS Trainium. This note does not cover the software stack; it focuses only on the hardware architecture and its design choices.
AMD Versal AIE
- AIE is essentially an AI-specific CGRA, where each functional unit is a VLIW processor with an AI-optimized ISA.
- The AIE tiles (i.e., the FUs) can be used as a systolic array to compute GEMM, or as a CGRA to deploy spatially heterogeneous dataflow applications (see the sketch after this list).
- At the time of writing, AIE is only capable of systolic-array-style computation as a slave co-processor; one should not expect anything more from it.
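To make the systolic-array mode concrete, here is a minimal Python sketch of an output-stationary array computing GEMM. It collapses the cycle-by-cycle input skewing into plain loops; the function and variable names are illustrative and not part of any AIE toolchain.

```python
import numpy as np

def systolic_gemm(A, B, n):
    """Simulate an n x n output-stationary systolic array computing C = A @ B.

    A is (n, K), B is (K, n). Each PE (i, j) holds one accumulator;
    A values stream in from the left, B values from the top.
    """
    K = A.shape[1]
    acc = np.zeros((n, n))
    # In real hardware, inputs are skewed so PE (i, j) sees a[i, k] and
    # b[k, j] at cycle i + j + k; here the timing is collapsed and only
    # the dataflow (one K-wavefront per step) is kept.
    for k in range(K):
        for i in range(n):
            for j in range(n):
                acc[i, j] += A[i, k] * B[k, j]
    return acc

A = np.random.rand(4, 8)
B = np.random.rand(8, 4)
assert np.allclose(systolic_gemm(A, B, 4), A @ B)
```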
FPGAs
Applications
- Latency-sensitive real-time applications
- Image or signal de-noising, de-blurring, brightness adjustment, ROI extraction
- DSP such as signal synchronization, data aggregation, data compression
- ASIC prototyping and emulation
- Small DNN inference, though not as performant as a GPU/TPU
  - Best with small batch sizes, custom precision, and simple kernels (see the quantization sketch after this list)
- Xilinx introduced the Versal AIE architecture for DNN inference because FPGAs have limited circuitry for massive parallelism, run at low clock frequencies, and carry architectural debt to maintain their flexibility.
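As a rough illustration of the custom-precision point: an FPGA can synthesize multipliers at exactly the bit width a model needs, whereas GPU/TPU datapaths come in fixed widths. The sketch below shows generic symmetric uniform quantization to an arbitrary width; it is an illustration, not Xilinx tooling.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantization to an arbitrary bit width.

    On an FPGA, multipliers can be synthesized at exactly `bits` wide,
    which is where the area/power win over fixed-width GPU/TPU
    datapaths comes from. Illustrative only.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g., 7 for int4
    scale = np.abs(w).max() / qmax        # one fp scale per tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale

w = np.random.randn(16)
q4, s4 = quantize(w, 4)                   # int4 weights + one fp scale
print(np.abs(w - q4 * s4).max())          # quantization error
```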
AWS Trainium
Architecture components in each NeuronCore:
- 128x128 systolic array
- Compute engine:
- Scalar engine
- Vector engine
- GPSIMD engine for custom ops (e.g., issuing DMA requests)
- 24MB SBUF (a compiler-managed scratchpad with 128 partitions) to feed data into the systolic array, plus a 2MB PSUM buffer for draining PE results (see the sketch after this list)
- NeuronLink interconnect between NeuronCores
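A toy model of how these pieces might cooperate on a tiled GEMM: tiles are staged through an SBUF-like scratchpad, each pass through the 128x128 array accumulates into a PSUM-like buffer, and finished tiles are drained out. This is a NumPy sketch of the dataflow under those assumptions, not the Neuron SDK API; all names are illustrative.

```python
import numpy as np

T = 128  # systolic array dimension / number of SBUF partitions

def neuroncore_gemm(A, B):
    """Toy model of a tiled GEMM on one NeuronCore (illustrative only)."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(0, M, T):
        for n in range(0, N, T):
            # Stand-in for the 2MB PSUM buffer: one output tile's accumulator.
            psum = np.zeros((T, T), dtype=A.dtype)
            for k in range(0, K, T):
                a_tile = A[m:m+T, k:k+T]   # staged via the SBUF scratchpad in reality
                b_tile = B[k:k+T, n:n+T]
                psum += a_tile @ b_tile    # one pass through the 128x128 array
            C[m:m+T, n:n+T] = psum         # drain PSUM results to the output
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(neuroncore_gemm(A, B), A @ B, atol=1e-3)
```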