OuterSPACE @ HPCA 2018
Table of Contents
OuterSPACE is a SpGEMM accelerator that exploits:
- outer-product based matrix multiplication
- multiply and merge
- software controlled scratchpad memory, non-coherent.
Advantage of Outer-Product
- No index-matching.
- Maximize the re-use of non-zeros.
- Minimize loads of a column and row.
Multiply Phase
Below is a figure illustrating the outer-product of two sparse matrices, the input matrix \(\mathbb{A}\) is stored in CSC format while \(\mathbb{B}\) is stored in CSR format, the output format could either be CSR or CSC.
The intermediate partial products are stored as a set of linked lists (see \(R_i\) / \(C_i\) in the figure).
Merge Phase
In this phase the processing elements in OuterSPACE walked through and linked list and merging values. The paper claims that merging is rare for highly sparse matrices.
Matrix Format Conversion
OuterSPACE also supports CSR/CSC conversion, by partition the conversion operator to conversion-load and conversion-merge phases, this is equivalent to multiplying the input matrix \(mathbb{A}\) with an identity matrix \(\mathbb{I}\):
\[\mathbb{I}_{CC}\times\mathbb{A}_{CR} \rightarrow \mathbb{A}_{CC} \]
OuterSPACE Architecture
- Processing Element
- Data-streaming and compute engine: ALU, scratchpad memory, a control unit, a work queue and outstanding request queue. PEs are grouped into processing tiles.
- Local Control Processor
- coordinates the PEs inside a tile, streaming instructions for the PE to execute.
- Central Control Processor
- power-efficient core that is responsible for scheduling work and allocating memory for intermediate data structures.
- High-speed crossbars and coalescing caches
- can be reconfigured into scratchpad memories.
- HBM
- stores input matrices and intermediate partial products.