Double buffer implementation for matrix multiply


#1

I used the following TVM schedule to generate and run an SGEMM kernel on a Vega10 machine.

One feature that gives high performance is double buffering of the loads of A and B into local memory, so as to pipeline fetches with compute. Here is the sequence of operations as observed in the generated GCN assembly:

    global_load_dwordx4 v[72:75], v[72:73], off
    global_load_dwordx4 v[76:79], v[70:71], off
    global_load_dwordx4 v[81:84], v[81:82], off
    global_load_dwordx4 v[85:88], v[85:86], off

    ds_write_b64 v90, v[78:79] offset:8
    ds_write2_b64 v90, v[76:77], v[74:75] offset1:17
    s_waitcnt vmcnt(0)
    ds_write2_b64 v89, v[85:86], v[87:88] offset1:1
    ds_write_b64 v89, v[83:84] offset:136
    ds_write2st64_b64 v91, v[81:82], v[72:73] offset1:8

    ds_read2_b64 v[76:79], v80 offset1:16
    ds_read2_b64 v[72:75], v70 offset1:16
    ds_read2_b64 v[81:84], v70 offset0:1 offset1:17
    ds_read_b64 v[85:86], v71 offset:6024
    ds_read_b64 v[87:88], v80 offset:8

    v_mac_f32_e32 v66, v76, v72
    …
    ds_read2_b64 v[76:79], v76 offset1:15

I believe both global and local latency hiding mechanisms are absent in this implementation.

I think the double buffering has to be tied to the unroll factor for each of the tensors.

For a 4K matrix this TVM schedule achieves 7.5 TFLOPS, compared to 12.2 TFLOPS with an optimized kernel on Vega10.
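
For reference, below is a minimal sketch of the kind of schedule being discussed, since the original schedule is not reproduced in this post. The stage names (AA, BB, AL, BL) follow the convention used later in this thread, and every tile/unroll factor is an assumption chosen only to show where double_buffer() sits relative to the inner "unroll factor" ki.

    import tvm
    from tvm import te

    # Illustrative sketch only: not the schedule from the original post, and
    # all tile/unroll factors are assumptions.
    M = N = K = 4096
    A = te.placeholder((M, K), name="A", dtype="float32")
    B = te.placeholder((K, N), name="B", dtype="float32")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    AA = s.cache_read(A, "shared", [C])   # A tile staged in LDS
    BB = s.cache_read(B, "shared", [C])   # B tile staged in LDS
    AL = s.cache_read(AA, "local", [C])   # per-thread register tile of A
    BL = s.cache_read(BB, "local", [C])   # per-thread register tile of B
    CC = s.cache_write(C, "local")        # accumulators held in registers

    block_x = te.thread_axis("blockIdx.x")
    block_y = te.thread_axis("blockIdx.y")
    thread_x = te.thread_axis("threadIdx.x")
    thread_y = te.thread_axis("threadIdx.y")

    # Block/thread tiling of the output (64x64 macro-tile, 4x4 per thread).
    x, y = s[C].op.axis
    bx, xi = s[C].split(x, factor=64)
    by, yi = s[C].split(y, factor=64)
    tx, xi = s[C].split(xi, factor=4)
    ty, yi = s[C].split(yi, factor=4)
    s[C].reorder(bx, by, tx, ty, xi, yi)
    s[C].bind(bx, block_x)
    s[C].bind(by, block_y)
    s[C].bind(tx, thread_x)
    s[C].bind(ty, thread_y)
    s[CC].compute_at(s[C], ty)

    # Split the reduction: ko walks over K-tiles; ki is the inner axis, i.e.
    # the "unroll factor" that the double buffering has to cover.
    ko, ki = s[CC].split(s[CC].op.reduce_axis[0], factor=8)
    s[AA].compute_at(s[CC], ko)           # one LDS tile of A per ko iteration
    s[BB].compute_at(s[CC], ko)
    s[AL].compute_at(s[CC], ki)           # register fetches inside that loop
    s[BL].compute_at(s[CC], ki)

    # Cooperative fetch of each LDS tile by the whole work-group.
    for load in (AA, BB):
        fused = s[load].fuse(*s[load].op.axis)
        fused, tx_ax = s[load].split(fused, factor=16)
        fused, ty_ax = s[load].split(fused, factor=16)
        s[load].bind(tx_ax, thread_x)
        s[load].bind(ty_ax, thread_y)

    # Ping-pong the LDS tiles so the next global fetch can overlap the FMAs.
    s[AA].double_buffer()
    s[BB].double_buffer()

    print(tvm.lower(s, [A, B, C], simple_mode=True))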


#2

One thing that might be helpful is to use the callback hack (maybe in OpenCL) to manually hijack the code: start from a TVM-generated version and make the minimum manual changes needed to arrive at a double-buffered version. see

There is also a similar callback for OpenCL that allows us to hijack the generated code and make gradual manual changes. Doing so would help us understand the minimum code transformation we need to get the best performance.
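
As a sketch of that callback hack: the CUDA source path looks up a registered function named tvm_callback_cuda_postproc and, if present, lets it rewrite the generated kernel source before compilation; the post says the OpenCL path has an analogous hook (its exact name is not given here, so take the registration name below only as the CUDA example of the pattern). A minimal version that dumps the generated code and swaps in a hand-edited file might look like this:

    import os
    import tvm

    @tvm.register_func("tvm_callback_cuda_postproc")
    def hijack_kernel_source(code, *args):
        # Some TVM versions pass extra arguments (e.g. the target); ignore them.
        # Dump whatever TVM generated so it can be edited by hand.
        with open("generated_kernel.cu", "w") as f:
            f.write(code)
        # If a hand-modified version exists, compile that instead; otherwise
        # keep the generated code unchanged. The file names are just a
        # convention chosen for this sketch.
        if os.path.exists("manual_kernel.cu"):
            with open("manual_kernel.cu") as f:
                return f.read()
        return code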


#3

Looking at the assembly snippet, we need to insert the math operations (v_mac_f32) after the loads and before the s_waitcnt vmcnt(0). Ideally we would do the global_load, then many iterations of MACs, and then the vmcnt and ds_write operations at the end of the loop. Is it possible to modify the TVM scheduler to move this waitcnt down?


#4

Here is how the double buffering is structured right now:

    load_global AA
    store_local AA
    load_global BB
    store_local BB

    branch BB0_2

    BB0_1:
    load_global AA
    store_local AA
    load_global BB
    store_local BB
    load_local AL
    load_local BL
    FMA
    load_local AL
    load_local BL
    FMA

    BB0_2:
    barrier
    branch if loopcount > 0 BB0_1

The correct implementation would put some distance between a global load and its corresponding write to local memory:

    load_global AA
    load_global BB

    branch BB0_2

    BB0_1:
    load_global AA
    load_global BB
    load_local AL(i-1)
    load_local BL(i-1)
    FMA
    load_local AL(i-1)
    load_local BL(i-1)
    FMA

    BB0_2:
    store_local AA(i)
    store_local BB(i)
    barrier
    branch if loopcount > 0 BB0_1
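
To spell out the i-1 staggering in that corrected structure, here is a toy NumPy model of the same loop (index bookkeeping only; this is not what TVM emits). Two buffers per operand stand in for the double-buffered LDS tiles: the global fetch for tile i is issued first, the FMAs consume the tile staged on the previous iteration, and only then is tile i written into the other buffer.

    import numpy as np

    def pipelined_gemm_tiles(A, B, tile_k):
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N), dtype=A.dtype)
        num_tiles = K // tile_k

        lds_A = [None, None]   # double-buffered "local memory" copies of A
        lds_B = [None, None]   # double-buffered "local memory" copies of B

        # Prologue: global loads for tile 0, staged into buffer 0.
        fetch_A, fetch_B = A[:, 0:tile_k], B[0:tile_k, :]
        lds_A[0], lds_B[0] = fetch_A.copy(), fetch_B.copy()

        for i in range(1, num_tiles + 1):
            cur, nxt = (i - 1) % 2, i % 2
            if i < num_tiles:
                # Issue the global loads for tile i early; nothing waits on them yet.
                fetch_A = A[:, i * tile_k:(i + 1) * tile_k]
                fetch_B = B[i * tile_k:(i + 1) * tile_k, :]
            # FMAs consume the tile staged on the previous iteration.
            C += lds_A[cur] @ lds_B[cur]
            if i < num_tiles:
                # Only now "wait" for the loads and write tile i into the other buffer.
                lds_A[nxt], lds_B[nxt] = fetch_A.copy(), fetch_B.copy()
        return C

    # Quick check against a plain matmul.
    A = np.random.rand(64, 32).astype("float32")
    B = np.random.rand(32, 64).astype("float32")
    assert np.allclose(pipelined_gemm_tiles(A, B, tile_k=8), A @ B, atol=1e-4)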


#5

I tried another experiment to check double buffering from local memory into registers, using

    s[AL].double_buffer()
    s[BL].double_buffer()

instead of

    s[AA].double_buffer()
    s[BB].double_buffer()

This improves performance to 7.9 TFLOPS.