I used the following TVM schedule to generate and run an sGEMM kernel on a Vega10 machine.
One feature that gives high performance is double buffering of the loads of A and B into local memory, so as to pipeline fetches with compute. Here is the sequence of operations observed in the generated GCN assembly:
```
global_load_dwordx4 v[72:75], v[72:73], off
global_load_dwordx4 v[76:79], v[70:71], off
global_load_dwordx4 v[81:84], v[81:82], off
global_load_dwordx4 v[85:88], v[85:86], off
ds_write_b64 v90, v[78:79] offset:8
ds_write2_b64 v90, v[76:77], v[74:75] offset1:17
s_waitcnt vmcnt(0)
ds_write2_b64 v89, v[85:86], v[87:88] offset1:1
ds_write_b64 v89, v[83:84] offset:136
ds_write2st64_b64 v91, v[81:82], v[72:73] offset1:8
ds_read2_b64 v[76:79], v80 offset1:16
ds_read2_b64 v[72:75], v70 offset1:16
ds_read2_b64 v[81:84], v70 offset0:1 offset1:17
ds_read_b64 v[85:86], v71 offset:6024
ds_read_b64 v[87:88], v80 offset:8
v_mac_f32_e32 v66, v76, v72
…
ds_read2_b64 v[76:79], v76 offset1:15
```
I believe both the global and the local latency-hiding mechanisms are absent in this implementation: the `s_waitcnt vmcnt(0)` stalls until every outstanding global load has completed before the LDS writes can issue, and the `ds_read`s sit immediately in front of the MACs that consume them.
I think the double buffering has to be tied to the unroll factor for each of the tensors.
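The interleaving I would expect from true double buffering can be modeled in plain Python. This is a toy sketch only, not TVM code; `gemm_double_buffered`, `tile_k`, and the two-slot buffer layout are my assumptions. The point is the ping-pong structure: tile k+1 is fetched into the alternate buffer slot while tile k is being consumed, which is what would let a real kernel overlap global loads with the MACs instead of serializing them behind a `vmcnt(0)` wait.

```python
import numpy as np

def gemm_double_buffered(A, B, tile_k=2):
    """Toy model of a software-pipelined GEMM main loop: K-dimension
    tiles of A and B are prefetched one iteration ahead into a two-slot
    ("ping-pong") buffer, so a GPU kernel built this way could overlap
    the fetch of tile k+1 with the multiply-accumulate on tile k.
    Illustrative sketch only -- not the TVM schedule from this post."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % tile_k == 0
    ntiles = K // tile_k

    # Two buffer slots standing in for the double-buffered local memory.
    bufA = [None, None]
    bufB = [None, None]

    def prefetch(t):
        # Stands in for the global_load -> ds_write of tile t.
        s = t % 2
        bufA[s] = A[:, t * tile_k:(t + 1) * tile_k].copy()
        bufB[s] = B[t * tile_k:(t + 1) * tile_k, :].copy()

    C = np.zeros((M, N), dtype=A.dtype)
    prefetch(0)                      # prologue: fill the first slot
    for t in range(ntiles):
        if t + 1 < ntiles:
            prefetch(t + 1)          # fetch next tile into the other slot
        s = t % 2
        C += bufA[s] @ bufB[s]       # compute on the current slot
    return C
```

In a sequential Python model the prefetch obviously does not run concurrently with the compute; it only makes the dependence structure explicit: each iteration's MACs depend only on the slot filled in the previous iteration, so nothing forces a full wait on the in-flight loads.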
For a 4K matrix, this TVM schedule achieves 7.5 TFLOPS, compared to 12.2 TFLOPS for an optimized kernel on Vega10.