This commit adds the following rewrite:
```
llvm.amdgcn.tensor.{load.to/store.from}.lds(
<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer,
<4 x i32> zeroinitializer, i32 [cachepolicy])
=>
llvm.amdgcn.tensor.{load.to/store.from}.lds.d2(
<4 x i32> %d0, <8 x i32> %d1, i32 [cachepolicy])
```
This is justified because, when the short encoding that uses the NULL
SGPR for registers 2 and 3 is used, the hardware acts as if those
registers were 0, including in gather mode.
It is always safe not to run this transformation.
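
For illustration only, the condition can be modelled in a few lines of
Python. This is a toy sketch, not the code in this commit and not an LLVM
API: `shorten_tensor_call`, `ZERO_V4` and the list-of-operands
representation are hypothetical names made up for the example, and only
the intrinsic names come from the rewrite above.

```python
# Toy model of the rewrite; nothing here is an LLVM API.
# A call is modelled as (intrinsic name, list of operands). Non-constant
# operands are strings such as "%d0"; constant vectors are tuples of ints.

ZERO_V4 = (0, 0, 0, 0)  # stands in for `<4 x i32> zeroinitializer`

LONG_FORMS = (
    "llvm.amdgcn.tensor.load.to.lds",
    "llvm.amdgcn.tensor.store.from.lds",
)


def shorten_tensor_call(name, operands):
    """Return the .d2 form when operands 2 and 3 are constant zero.

    Any call that does not match is returned unchanged, which mirrors the
    point that skipping the transformation is always safe.
    """
    if name not in LONG_FORMS or len(operands) != 5:
        return name, operands
    d0, d1, d2, d3, cachepolicy = operands
    if d2 != ZERO_V4 or d3 != ZERO_V4:
        return name, operands
    # Drop the two all-zero operands and switch to the .d2 intrinsic.
    return name + ".d2", [d0, d1, cachepolicy]
```

In this model a call with `ZERO_V4` in both positions is shortened, while a
call whose third operand is a non-constant such as `"%d2"` is left as is.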
(Note: tests were LLM'd and then tweaked.)