Boqueria - Next Generation At-Memory Inference Acceleration Device with 1000+ RISC-V Cores @ HotChips 2022
- video: https://www.youtube.com/watch?v=J8N9bG5YQ_g
- slides: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9895618
Motivation
- Data movement is the costliest part of inference (90% of energy consumption).
- Optimizing the compute architecture to minimize the distance data travels leads to inference-specific AI accelerators.
- Right balance between coarse-grained and fine-grained approach (WHAT IS THAT?)
- Utilize most efficient data types.
Boqueria proposes "at-memory computation" to reduce the distance between the processing elements and memory. This differs from in-memory computation, where the compute units reside inside the memory itself.
Overall Architecture
Figure 1: source: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9895618
- 729 memory banks in total, each with two RISC-V processors.
- 1.35 GHz, TSMC 7nm
- 30 TFLOPS/W (I have no idea how good that is)
- 238MB on-chip SRAM (328KB per memory bank)
- 1 PB/s SRAM bandwidth
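As a quick sanity check, the headline figures above can be combined into per-bank numbers. This is only back-of-the-envelope arithmetic on the quoted values; it assumes decimal units (1 MB = 1000 KB, 1 PB = 10^15 bytes) and that the SRAM bandwidth is spread evenly across all banks, neither of which the slides state explicitly.

```python
# Back-of-the-envelope arithmetic on the figures quoted in these notes.
BANKS = 729
CORES_PER_BANK = 2
SRAM_PER_BANK_KB = 328
CLOCK_HZ = 1.35e9
SRAM_BW_BYTES_PER_S = 1e15  # "1 PB/s", assuming decimal petabytes

total_cores = BANKS * CORES_PER_BANK

# Assumption: decimal megabytes (1 MB = 1000 KB).
total_sram_mb = BANKS * SRAM_PER_BANK_KB / 1000

# Assumption: aggregate bandwidth is split evenly across banks.
bw_per_bank_bytes_per_s = SRAM_BW_BYTES_PER_S / BANKS
bytes_per_cycle_per_bank = bw_per_bank_bytes_per_s / CLOCK_HZ

print(f"RISC-V cores:            {total_cores}")
print(f"Total SRAM (MB):         {total_sram_mb:.1f}")
print(f"Bandwidth per bank:      {bw_per_bank_bytes_per_s / 1e12:.2f} TB/s")
print(f"Bytes per cycle per bank: {bytes_per_cycle_per_bank:.0f}")
```

Under these assumptions the core count comes out above 1000, matching the title, the total SRAM lands close to the quoted 238MB, and each bank would read on the order of a kilobyte per clock cycle.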
Memory Bank Design
Figure 2: source: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9895618
- Each RISC-V processor manages 4 row controllers.
- Row controllers operate independently. (64 SIMD PEs)
- Rotator cuff moves activations between nearest-neighbor PEs.
- 8 E/W NOC (7GB/s, bi-directional)
- 1 N/S NOC (70GB/s, bi-directional)
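To get a feel for why moving activations between nearest-neighbor PEs is useful, here is a toy software model of a matrix-vector product in which the weights stay put and only the activations rotate. It is a sketch under my own assumptions, not the actual Boqueria dataflow: the function name `rotate_matvec`, the one-row-per-PE weight layout, and the ring-shaped rotation are all hypothetical.

```python
def rotate_matvec(weights, activations):
    """Toy model: PE i keeps row i of `weights` locally and holds one activation.

    Each step, every PE multiplies the activation it currently holds by the
    matching local weight, then all activations shift one PE over in a ring.
    After n steps every PE has seen every activation exactly once.
    """
    n = len(activations)
    assert len(weights) == n and all(len(row) == n for row in weights)

    held = list(activations)  # held[pe] = activation currently sitting in PE `pe`
    acc = [0] * n             # one accumulator per PE

    for step in range(n):
        for pe in range(n):
            col = (pe + step) % n  # original index of the activation PE `pe` holds now
            acc[pe] += weights[pe][col] * held[pe]
        # Nearest-neighbor rotation: PE i receives the activation from PE i+1.
        held = held[1:] + held[:1]

    return acc


W = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
x = [1, 0, -1]
print(rotate_matvec(W, x))  # [-2, -2, -2]
```

The point of the sketch is that weights never leave the PE that stores them; only one activation per PE moves per step, and only to an adjacent PE.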
SRAM Array & Processing Element Design
- Low Power SRAM Array (0.4V datapath operation)
- Processing elements support int4/int8, fp8 and bf16 data types, detect zeros to save power, support structured sparsity, and include dedicated circuitry for softmax/layernorm.
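The zero-detect idea can be illustrated in software as a dot product that skips the multiply whenever either operand is zero. This is only a sketch of the general technique; the function name `gated_dot` is hypothetical, and the real hardware presumably gates the multiplier circuit rather than branching.

```python
def gated_dot(weights, activations):
    """Dot product that skips multiplies with a zero operand.

    Returns (result, skipped), where `skipped` counts the multiply-accumulates
    that a zero-detecting PE would not have to perform.
    """
    acc = 0
    skipped = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            skipped += 1  # product is zero, so the accumulator is unchanged
            continue
        acc += w * a
    return acc, skipped


result, skipped = gated_dot([1, 0, 3, 0], [2, 5, 0, 7])
print(result, skipped)  # 2 3
```

The result is identical to an ungated dot product; the saving is in the work (and hence energy) not spent on the skipped multiplies.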
Custom RISC-V Processor
Figure 4: source: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9895618&tag=1
- Standard RV32EMC instruction set + some custom instructions
- Each processor has 6KB of memory, a 32-bit ALU, a 32-bit multiplier, a 16-entry register file and 4-way context switching.
High-bandwidth I/O
- I/O Ring NOC (141GB/s in both clockwise and counter-clockwise direction).
- 1.5 TB/s E/W throughput and 1.9 TB/s N/S throughput
- x16 PCIe Gen5 for host connectivity (63 GB/s)
- x8 PCIe Gen5 for chip-to-chip connectivity (31.5 GB/s)
- 4MB scratchpad for data manipulation (why?) and 32GB of external LPDDR5 (>100GB/s)
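The two PCIe figures above match the standard per-direction throughput of PCIe Gen5, which can be reproduced from the 32 GT/s per-lane rate and the 128b/130b line encoding. The helper name `pcie_gen5_gb_per_s` is mine, and the calculation ignores protocol overhead above the line encoding.

```python
def pcie_gen5_gb_per_s(lanes):
    """Per-direction PCIe Gen5 throughput in GB/s, before protocol overhead."""
    gt_per_s_per_lane = 32.0  # PCIe Gen5 raw signalling rate per lane
    encoding = 128 / 130      # 128b/130b line encoding efficiency
    bits_per_byte = 8
    return lanes * gt_per_s_per_lane * encoding / bits_per_byte


print(f"x16: {pcie_gen5_gb_per_s(16):.1f} GB/s")
print(f"x8:  {pcie_gen5_gb_per_s(8):.1f} GB/s")
```

Both links sit well below the LPDDR5 bandwidth quoted above, so the host link is the narrower path for getting data onto the device.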