# Spatial Sparse Matrix Multiplier @ HPCA 2021

TL;DR: It's possible and advantageous to implement sparse operations directly in programmable hardware.

## Background

### Reservoir Computing

Reservoir computing (Echo State Networks) uses very large sparse matrix.

$\textbf{x}(n) = f(W^{in}\textbf{u}(n) + W\textbf{x}(n-1))$ $\textbf{y}(n) = W^{out}\textbf{x}(n)$

$$n$$ refers to the time stamp. $$W^{in}$$ and $$W^{out}$$ are input/output weight matrix, $$W$$ is the large fixed (fixed during training and inference) sparse matrix. In this paper $$W$$ has a dimension of 800 with %75 sparsity.

The paper discusses how to accelerate SPMV on an FPGA using bit serial arithmetic when sparsity pattern is fixed.

### Bit-serial Vector Matrix Multiplier

FPGA consists of an array of Lookup Tables (LUTs), Registers (FFs), and programmable interconnect.

This paper uses FPGA's feature about re-purpose some LUTs into small RAMs or shift registers (LUTRAMs).

Bit-serial adder was popular decades ago when parallel arithmetic is too costly: Figure 1: Bit-serial adder (instead of chaining the carries "in space", they use a single adder and process the carry "in time".

We bit-serial adder as a basic construct, we can build dot-product multiplier, below is an example of multi-bit vector and single-bit vector dot product: Figure 2: multi-bit vector a multiplies single bit vector b , trapezoid refers to the bit-serial adder.

Note that we can use bit-level sparsity of b vector to reduce the circuit.

The multiplier could then be used as basic construct for multi-bit multi-bit vector dot product and multi-bit vector matrix multiplier.

### Synthesis and Evaluation

Hardware utilization vs bit sparsity, matrix size, bandwidth.

#### Large scale synthesis

Achieved frequency was bottlenecked by interconnect delay bewteen LUTs and flops:

1. Initial layer has a large fanout.
2. Nets cross chiplet boundaries have significantly slower propagation delays.

Related terms:

SLR

#### Canonical Signed Digit (CSD)

CSD was proposed to reduce the cost of multiplier. The key idea of CSD is tot decompose a unsigned integer $$c$$ to sum of positive integer $$a$$ and negative integer $$b$$, where $$a$$ and $$b$$ have better bit-sparsity, for example

$15_{10} = 16_{10} - 1_{10} \leftrightarrow 1111_{2} = {10000}_{2} - {00001}_2$

Email: expye@outlook.com

Date: 2022-08-05 Fri 00:00