Notes on CAAQA 6th Edition
Chapter 4: Vector Architecture
Gather-Scatter
for (i = 0; i < n; ++i)
    A[K[i]] = A[K[i]] + C[M[i]];
Although indexed loads and stores (gather and scatter) can be pipelined, they typically run much more slowly than non-indexed accesses because their memory access pattern is unpredictable.
A carefully designed memory system can deliver better performance by devoting more hardware resources to these accesses.
On GPUs, programmers need to ensure that all addresses in a gather or scatter refer to adjacent locations, so that the hardware can turn them into efficient unit-stride memory accesses.
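The loop above can be read as three phases: gather the indexed elements into dense temporaries, operate on them with unit-stride accesses, then scatter the results back. The following is only a minimal scalar C sketch of that idea, not how any particular vector ISA implements it. The function name gather_scatter_add, the double element type, and the temporaries va and vc are assumptions made for illustration. It also assumes that every index is in bounds and that K contains no duplicate indices; with duplicates the result would differ from the sequential loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sketch of A[K[i]] = A[K[i]] + C[M[i]] split into gather, add, scatter.
 * Assumes all indices are in bounds and K holds no duplicates.
 * Returns 0 on success, -1 if the temporaries cannot be allocated. */
int gather_scatter_add(double *a, const size_t *k,
                       const double *c, const size_t *m, size_t n)
{
    assert(a != NULL && k != NULL && c != NULL && m != NULL);
    if (n == 0)
        return 0;

    double *va = malloc(n * sizeof *va);  /* dense copy of A[K[i]] */
    double *vc = malloc(n * sizeof *vc);  /* dense copy of C[M[i]] */
    if (va == NULL || vc == NULL) {
        free(va);
        free(vc);
        return -1;
    }

    /* Gather: indexed loads into dense temporaries. */
    for (size_t i = 0; i < n; ++i) {
        va[i] = a[k[i]];
        vc[i] = c[m[i]];
    }

    /* Compute: unit-stride add on the dense temporaries. */
    for (size_t i = 0; i < n; ++i)
        va[i] += vc[i];

    /* Scatter: indexed stores back to the sparse locations. */
    for (size_t i = 0; i < n; ++i)
        a[k[i]] = va[i];

    free(va);
    free(vc);
    return 0;
}
```

Only the gather and scatter loops touch memory through an index vector; the add in the middle runs over contiguous data, which is the part a vector unit handles at full speed.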
Chapter 7: Domain Specific Architectures
The challenge for a DSA is to find a target domain whose demand is large enough to justify the design effort.
The nonrecurring engineering (NRE) costs of a custom chip and its supporting software are amortized over the number of chips manufactured, so it is not reasonable to design a DSA for only 1000 users.
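The amortization argument can be made concrete with a small calculation: the effective cost of one chip is its share of the NRE plus its own manufacturing cost. The C sketch below assumes this simple model; the function name cost_per_chip and all dollar figures mentioned afterward are hypothetical illustration values, not numbers from the book.

```c
#include <assert.h>

/* Effective cost of one chip when a one-time NRE cost is spread
 * evenly over `volume` manufactured chips.
 * The model and any inputs used with it are illustrative assumptions. */
double cost_per_chip(double nre, double unit_cost, double volume)
{
    assert(volume > 0.0);
    return nre / volume + unit_cost;
}
```

With a hypothetical NRE of 1,000,000 and a unit cost of 10, cost_per_chip gives 1010 per chip at a volume of 1000 but only 11 per chip at a volume of 1,000,000, which is why a small user base cannot carry a custom design.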
An FPGA has lower NRE than an ASIC; however, its hardware is not as efficient as an ASIC's.