Spatial Sparse Matrix Multiplier @ HPCA 2021
link: https://arxiv.org/pdf/2101.08884.pdf
TL;DR: It's possible and advantageous to implement sparse operations directly in programmable hardware.
Background
Reservoir Computing
Reservoir computing (Echo State Networks) uses a very large sparse matrix.
\[ \textbf{x}(n) = f(W^{in}\textbf{u}(n) + W\textbf{x}(n-1))\] \[ \textbf{y}(n) = W^{out}\textbf{x}(n)\]
\(n\) refers to the time step. \(W^{in}\) and \(W^{out}\) are the input/output weight matrices, and \(W\) is the large sparse matrix that stays fixed during both training and inference. In this paper \(W\) has a dimension of 800 with 75% sparsity.
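A minimal NumPy sketch of the update equations above; only the reservoir size (800) and 75% sparsity come from the paper, while the input/output sizes, random initialization, and scaling are illustrative assumptions:

```python
import numpy as np

# Illustrative dimensions: N (reservoir size) is from the paper,
# n_in / n_out are assumed for the example.
N, n_in, n_out = 800, 16, 4
rng = np.random.default_rng(0)

W_in = rng.standard_normal((N, n_in)) * 0.1    # input weights
W = rng.standard_normal((N, N))                # fixed reservoir weights
W[rng.random((N, N)) < 0.75] = 0.0             # enforce 75% sparsity
W_out = rng.standard_normal((n_out, N)) * 0.1  # readout weights (trained in practice)

def step(x_prev, u, f=np.tanh):
    """One time step: x(n) = f(W_in u(n) + W x(n-1)), y(n) = W_out x(n)."""
    x = f(W_in @ u + W @ x_prev)
    y = W_out @ x
    return x, y

x = np.zeros(N)
x, y = step(x, rng.standard_normal(n_in))
```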
The paper discusses how to accelerate sparse matrix-vector multiplication (SpMV) on an FPGA using bit-serial arithmetic when the sparsity pattern is fixed.
Bit-serial Vector Matrix Multiplier
An FPGA consists of an array of Lookup Tables (LUTs), registers (FFs), and programmable interconnect.
This paper exploits the FPGA's ability to re-purpose some LUTs as small RAMs or shift registers (LUTRAMs).
Bit-serial adders were popular decades ago, when parallel arithmetic was too costly:
Figure 1: Bit-serial adder (instead of chaining the carries "in space", a single adder processes the carry "in time").
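A small software sketch of that idea, assuming an LSB-first bit-stream convention (the function name and convention are mine, not from the paper):

```python
def bit_serial_add(a_bits, b_bits):
    """Add two numbers presented LSB-first, one bit per 'clock cycle'.

    A single full adder is reused every cycle; the carry is held in a
    register ('in time') instead of rippling through a chain of adders
    ('in space').
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):           # one full-adder evaluation per cycle
        s = a ^ b ^ carry                      # sum bit
        carry = (a & b) | (carry & (a ^ b))    # carry saved for the next cycle
        out.append(s)
    out.append(carry)                          # final carry-out
    return out

# 6 + 3 = 9, bits LSB-first
assert bit_serial_add([0, 1, 1, 0], [1, 1, 0, 0]) == [1, 0, 0, 1, 0]
```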
With the bit-serial adder as a basic building block, we can construct a dot-product multiplier. Below is an example of a dot product between a multi-bit vector and a single-bit vector:
Figure 2: Multi-bit vector a multiplied by single-bit vector b; the trapezoid refers to the bit-serial adder.
Note that we can exploit the bit-level sparsity of the b vector to reduce the circuit.
This multiplier can then be used as the basic construct for a multi-bit by multi-bit vector dot product and, in turn, a multi-bit vector-matrix multiplier.
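A sketch of how these pieces compose, under my own naming and with unsigned b assumed for brevity; in hardware b is fixed at configuration time, so every zero bit of b means the corresponding adder is simply never instantiated, which is mimicked here by skipping those terms:

```python
def dot_multibit_singlebit(a_vec, b_bits):
    """Dot product of a multi-bit vector with a single-bit (0/1) vector.

    Positions where b_i == 0 need no adder at all; this is how the
    bit-level sparsity of b shrinks the circuit.
    """
    acc = 0
    for a, b in zip(a_vec, b_bits):
        if b:                     # adder present only for b_i == 1
            acc += a
    return acc

def dot_multibit_multibit(a_vec, b_vec, b_width=8):
    """Multi-bit x multi-bit dot product built from the single-bit primitive.

    Bit plane k of b contributes dot(a, bit_k(b)) << k, so the full
    multiplier is a shift-and-add tree over the single-bit units.
    (Unsigned b assumed for brevity.)
    """
    total = 0
    for k in range(b_width):
        b_plane = [(b >> k) & 1 for b in b_vec]   # k-th bit of every b element
        total += dot_multibit_singlebit(a_vec, b_plane) << k
    return total

a = [3, 5, 7, 2]
b = [6, 0, 9, 1]
assert dot_multibit_multibit(a, b) == sum(x * y for x, y in zip(a, b))
```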
Synthesis and Evaluation
Hardware utilization vs bit sparsity, matrix size, bandwidth.
Large scale synthesis
The achieved frequency was bottlenecked by the interconnect delay between LUTs and flops:
- The initial layer has a large fanout.
- Nets that cross chiplet boundaries have significantly slower propagation delays.
Related terms:
Canonical Signed Digit (CSD)
CSD was proposed to reduce the cost of multipliers. The key idea of CSD is to decompose an unsigned integer \(c\) as \(c = a - b\), where \(a\) and \(b\) have better bit-sparsity, for example
\[ 15_{10} = 16_{10} - 1_{10} \leftrightarrow 1111_{2} = {10000}_{2} - {00001}_2 \]
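A small sketch of one standard CSD recoding (the specific algorithm and helper names below are a common textbook approach, not taken from the paper):

```python
def csd(c):
    """Canonical Signed Digit recoding of a non-negative integer.

    Returns digits in {-1, 0, 1}, LSB first, such that
    c == sum(d * 2**i for i, d in enumerate(digits)) and no two adjacent
    digits are nonzero, minimizing the nonzero-digit count (i.e. the
    number of adders in a constant multiplier).
    """
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)     # c mod 4 == 1 -> +1, c mod 4 == 3 -> -1
            digits.append(d)
            c -= d
        else:
            digits.append(0)
        c >>= 1
    return digits

def csd_split(c):
    """Split c into the positive part a and negative part b, so c = a - b."""
    digits = csd(c)
    a = sum(2**i for i, d in enumerate(digits) if d == 1)
    b = sum(2**i for i, d in enumerate(digits) if d == -1)
    return a, b

# 15 = 16 - 1, as in the example above
assert csd(15) == [-1, 0, 0, 0, 1]
assert csd_split(15) == (16, 1)
```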