# Spatial Sparse Matrix Multiplier @ HPCA 2021

## Table of Contents

link: https://arxiv.org/pdf/2101.08884.pdf

TL;DR: It's possible and advantageous to implement sparse operations directly in programmable hardware.

## Background

### Reservoir Computing

Reservoir computing (Echo State Networks) uses very large sparse matrix.

\[ \textbf{x}(n) = f(W^{in}\textbf{u}(n) + W\textbf{x}(n-1))\] \[ \textbf{y}(n) = W^{out}\textbf{x}(n)\]

\(n\) refers to the time stamp. \(W^{in}\) and \(W^{out}\) are input/output weight matrix, \(W\) is the large **fixed** (fixed during training and inference) sparse matrix.
In this paper \(W\) has a dimension of 800 with %75 sparsity.

The paper discusses how to accelerate SPMV on an FPGA using bit serial arithmetic when sparsity pattern is fixed.

### Bit-serial Vector Matrix Multiplier

FPGA consists of an array of Lookup Tables (LUTs), Registers (FFs), and programmable interconnect.

This paper uses FPGA's feature about re-purpose some LUTs into small RAMs or shift registers (LUTRAMs).

Bit-serial adder was popular decades ago when parallel arithmetic is too costly:

Figure 1: Bit-serial adder (instead of chaining the carries "in space", they use a single adder and process the carry "in time".

We bit-serial adder as a basic construct, we can build dot-product multiplier, below is an example of multi-bit vector and single-bit vector dot product:

Figure 2: multi-bit vector `a`

multiplies single bit vector `b`

, trapezoid refers to the bit-serial adder.

Note that we can use bit-level sparsity of `b`

vector to reduce the circuit.

The multiplier could then be used as basic construct for multi-bit multi-bit vector dot product and multi-bit vector matrix multiplier.

### Synthesis and Evaluation

Hardware utilization vs bit sparsity, matrix size, bandwidth.

#### Large scale synthesis

Achieved frequency was bottlenecked by interconnect delay bewteen LUTs and flops:

- Initial layer has a large fanout.
- Nets cross chiplet boundaries have significantly slower propagation delays.

Related terms:

#### Canonical Signed Digit (CSD)

CSD was proposed to reduce the cost of multiplier. The key idea of CSD is tot decompose a unsigned integer \(c\) to sum of positive integer \(a\) and negative integer \(b\), where \(a\) and \(b\) have better bit-sparsity, for example

\[ 15_{10} = 16_{10} - 1_{10} \leftrightarrow 1111_{2} = {10000}_{2} - {00001}_2 \]