SparseLNR @ ICS 2022


paper link:


TACO's default schedules are sub-optimal for some (fused) sparse operators.

SparseLNR proposes a Branched Iteration Graph that breaks the computation down into smaller pieces and then fuses the outermost dimensions of the loop nests, while the innermost dimensions are distributed.
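To illustrate the idea (my own sketch, not SparseLNR's generated code): for a fused SDDMM + SpMM kernel over a CSR-format sparse matrix, the two sub-computations can share the outermost row loop while their inner loops are distributed, communicating through a small per-row temporary buffer:

```python
import numpy as np

def outer_fused(indptr, indices, B, C):
    # Computes out[i] = sum_j A[i, j] * (B[i] . C[j]) * C[j] for a CSR
    # sparse pattern A. The two sub-kernels (SDDMM, then SpMM) share the
    # outermost i loop; their inner loops are distributed, with a per-row
    # temporary holding the SDDMM partial results.
    n, d = B.shape
    out = np.zeros_like(B)
    for i in range(n):
        lo, hi = indptr[i], indptr[i + 1]
        tmp = np.empty(hi - lo)
        for k, e in enumerate(range(lo, hi)):   # first inner loop nest: SDDMM
            tmp[k] = B[i] @ C[indices[e]]
        for k, e in enumerate(range(lo, hi)):   # second inner loop nest: SpMM
            out[i] += tmp[k] * C[indices[e]]
    return out
```

The temporary is scoped to one row at a time, which is the point of fusing only the outermost dimension: the buffer stays small instead of materializing the full intermediate.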

SparseLNR can describe the scheduling space of the FusedMM operator.

Prerequisite (FusedMM)

FusedMM is a template for fusing SpMM and SDDMM, and it uses vectorized instructions to accelerate the computation.
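To make the fusion concrete, here is a minimal NumPy sketch (my own illustration, not FusedMM's actual kernel) of the fused SDDMM + SpMM pattern over a CSR-format graph with dense node features H and W:

```python
import numpy as np

def fused_sddmm_spmm(indptr, indices, H, W):
    # For each edge (i, j): s_ij = <H[i], W[j]>   (SDDMM),
    # then aggregate:       out[i] += s_ij * W[j] (SpMM),
    # fused in a single pass over the sparse structure, so the
    # edge scores s_ij are never materialized as a full matrix.
    n, d = H.shape
    out = np.zeros_like(H)
    for i in range(n):
        for e in range(indptr[i], indptr[i + 1]):
            j = indices[e]
            s = H[i] @ W[j]        # SDDMM: sampled dot product
            out[i] += s * W[j]     # SpMM: weighted aggregation
    return out
```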

The authors conduct experiments on CPU and obtain significant speedups. This is appealing to me because I haven't implemented any CPU optimizations for DGL.

My Question

How do we implement the backward pass of the FusedMM operator? (In particular, GAT is more than fused SDDMM and SpMM; there is also a sparse softmax.)

I don't think this paper addresses the problem properly, and the authors do not give any GAT example.

Motivating Example

The figure below shows the different ways to schedule a single sparse operator.


In the language of TVM, these schedules can be implemented with the rfactor primitive.
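Roughly, rfactor splits a reduction axis and materializes partial results into a temporary buffer before a final reduction. A NumPy sketch of the data movement (illustrative only, not the TVM API):

```python
import numpy as np

def reduce_rfactor(x, factor):
    # rfactor-style split of a sum reduction: split the reduction axis
    # by `factor`, materialize partial sums into a temporary buffer
    # (one entry per chunk), then reduce the buffer in a second stage.
    n = len(x)
    pad = (-n) % factor
    xp = np.concatenate([x, np.zeros(pad)])       # pad so the axis splits evenly
    partial = xp.reshape(-1, factor).sum(axis=1)  # stage 1: partial reductions
    return partial.sum()                          # stage 2: final reduction
```

The introduced partial-sum buffer is what makes the two stages independently schedulable (e.g. parallelizing stage 1 across chunks).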

Core Idea

Extend the Iteration Graph to enable branches.

The schedules in SparseLNR are essentially rfactor and compute_at/reverse_compute_at in TVM. In SparseTIR, we can also support them in Stage I if we want the created temporary buffer to be sparse.

My Digest

I had a chat with the SparseLNR group.

Author: expye(Zihao Ye)


Date: 2022-08-10 Wed 00:00

Last modified: 2022-12-27 Tue 07:18

Licensed under CC BY-NC 4.0