Data movement between HBM and on-chip SRAM during standard self-attention. Watch how every float is shuttled through the 4 memory controller lanes, square by square.
Compare with Flash Attention
Every square represents one float. Transfers are accounted for square by square through the memory controller lanes. SRAM shows the actual blocks loaded for each computation step.
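To put rough numbers on the traffic shown here, the following is a minimal sketch that counts floats crossing the HBM/SRAM boundary for unfused standard self-attention. It assumes a single head with sequence length `n` and head dimension `d`, that the score matrix and the softmax output are each written back to HBM in full, and that each lane moves one float per transfer step. The function name, the `lanes` parameter, and the one-float-per-lane model are illustrative assumptions, not details taken from the visualization.

```python
import math


def standard_attention_hbm_traffic(n: int, d: int, lanes: int = 4) -> dict:
    """Count floats moved between HBM and SRAM for unfused attention.

    Assumption: every intermediate (S and P) round-trips through HBM,
    and nothing is kept in SRAM between the three steps.
    """
    steps = {
        # S = Q @ K^T: read Q and K (n*d each), write S (n*n)
        "scores": {"read": 2 * n * d, "write": n * n},
        # P = softmax(S): read S, write P (n*n each)
        "softmax": {"read": n * n, "write": n * n},
        # O = P @ V: read P (n*n) and V (n*d), write O (n*d)
        "output": {"read": n * n + n * d, "write": n * d},
    }
    total = sum(s["read"] + s["write"] for s in steps.values())
    # Hypothetical lane model: each lane carries one float per transfer step.
    transfer_steps = math.ceil(total / lanes)
    return {"steps": steps, "total": total, "transfer_steps": transfer_steps}


if __name__ == "__main__":
    report = standard_attention_hbm_traffic(n=4, d=2)
    for name, counts in report["steps"].items():
        print(name, counts)
    print("total floats moved:", report["total"])
    print("transfer steps on 4 lanes:", report["transfer_steps"])
```

Under these assumptions the total comes to 4nd + 4n² floats, so the n² terms from the score and softmax matrices dominate as the sequence grows. That quadratic traffic is the part a fused, tiled kernel such as Flash Attention is designed to avoid.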