Spatial Sparse Matrix Multiplier @ HPCA 2021

Table of Contents


TL;DR: It's possible and advantageous to implement sparse operations directly in programmable hardware.


Reservoir Computing

Reservoir computing (Echo State Networks) uses very large sparse matrix.

\[ \textbf{x}(n) = f(W^{in}\textbf{u}(n) + W\textbf{x}(n-1))\] \[ \textbf{y}(n) = W^{out}\textbf{x}(n)\]

\(n\) refers to the time stamp. \(W^{in}\) and \(W^{out}\) are input/output weight matrix, \(W\) is the large fixed (fixed during training and inference) sparse matrix. In this paper \(W\) has a dimension of 800 with %75 sparsity.

The paper discusses how to accelerate SPMV on an FPGA using bit serial arithmetic when sparsity pattern is fixed.

Bit-serial Vector Matrix Multiplier

FPGA consists of an array of Lookup Tables (LUTs), Registers (FFs), and programmable interconnect.

This paper uses FPGA's feature about re-purpose some LUTs into small RAMs or shift registers (LUTRAMs).

Bit-serial adder was popular decades ago when parallel arithmetic is too costly:


Figure 1: Bit-serial adder (instead of chaining the carries "in space", they use a single adder and process the carry "in time".

We bit-serial adder as a basic construct, we can build dot-product multiplier, below is an example of multi-bit vector and single-bit vector dot product:


Figure 2: multi-bit vector a multiplies single bit vector b , trapezoid refers to the bit-serial adder.

Note that we can use bit-level sparsity of b vector to reduce the circuit.

The multiplier could then be used as basic construct for multi-bit multi-bit vector dot product and multi-bit vector matrix multiplier.

Synthesis and Evaluation

Hardware utilization vs bit sparsity, matrix size, bandwidth.

Large scale synthesis

Achieved frequency was bottlenecked by interconnect delay bewteen LUTs and flops:

  1. Initial layer has a large fanout.
  2. Nets cross chiplet boundaries have significantly slower propagation delays.

Related terms:


Canonical Signed Digit (CSD)

CSD was proposed to reduce the cost of multiplier. The key idea of CSD is tot decompose a unsigned integer \(c\) to sum of positive integer \(a\) and negative integer \(b\), where \(a\) and \(b\) have better bit-sparsity, for example

\[ 15_{10} = 16_{10} - 1_{10} \leftrightarrow 1111_{2} = {10000}_{2} - {00001}_2 \]

Author: expye(Zihao Ye)


Date: 2022-08-05 Fri 00:00

Last modified: 2022-08-12 Fri 09:07

Licensed under CC BY-NC 4.0