#65953 added a `128x64xf16` test that does a single TMA load. This PR adds a more complex test that does 2 additional TMA loads with 128B swizzling:

```
TMA Load: Matrix-A[0:128][0:64]
TMA Load: Matrix-B[0:64][0:64]
TMA Load: Matrix-B[64:128][0:64]
```

The program checks the loaded data for Matrix-B.
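As a rough host-side illustration of the tile ranges listed above, the sketch below models each TMA load as a plain box copy and checks that the two Matrix-B tiles together reproduce the source. It is not the MLIR test itself: the helper names (`make_matrix`, `load_tile`, `check_matrix_b`) are hypothetical, the 128x64 shape of Matrix-B is assumed from the slice bounds, and the 128B swizzled shared-memory layout is not modeled.

```python
# Hypothetical host-side model of the three tile loads; not the actual test.
# Assumption: Matrix-A and Matrix-B are both 128x64, inferred from the ranges.

def make_matrix(rows, cols, seed):
    # Fill with small deterministic values so mismatches are easy to spot.
    return [[(seed + r * cols + c) % 251 for c in range(cols)]
            for r in range(rows)]

def load_tile(matrix, row_range, col_range):
    # Model one TMA load as copying the box [r0:r1][c0:c1] out of the source.
    # The real load also applies 128B swizzling, which is ignored here.
    (r0, r1), (c0, c1) = row_range, col_range
    return [row[c0:c1] for row in matrix[r0:r1]]

def check_matrix_b(tile_lo, tile_hi, source):
    # Stacking the two Matrix-B tiles should reproduce the source rows.
    return tile_lo + tile_hi == source

matrix_a = make_matrix(128, 64, seed=1)
matrix_b = make_matrix(128, 64, seed=7)

tile_a = load_tile(matrix_a, (0, 128), (0, 64))    # Matrix-A[0:128][0:64]
tile_b0 = load_tile(matrix_b, (0, 64), (0, 64))    # Matrix-B[0:64][0:64]
tile_b1 = load_tile(matrix_b, (64, 128), (0, 64))  # Matrix-B[64:128][0:64]

matrix_b_ok = check_matrix_b(tile_b0, tile_b1, matrix_b)
```

Here only Matrix-B is compared against its source, mirroring the description; Matrix-A is loaded but not checked.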