The allocation order of 16 bit registers is vgpr0lo16, vgpr0hi16,
vgpr1lo16, vgpr1hi16, vgpr2lo16.... We prefer (essentially require) that
allocation order, because it uses the minimum number of registers. But
when you have 16 bit data passing between 16 and 32 bit instructions you
get lots of COPY.
This patch teach the compiler that a COPY of a 16-bit value from a 32
bit register to a lo-half 16 bit register is free, to a hi-half 16 bit
register is not.
This might get improved to coalescing with additional cases, and perhaps
as an alternative to the RA hints. For now upstreaming this solution
first.
T16D16 table is implemented in
https://github.com/llvm/llvm-project/pull/127673
this is a follow up patch to add load/store pseudo for:
flat_store
global_load/global_store
scratch_load/scratch_store
in true16 mode and updated the codegen test file
MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's
been identified this message is only really needed for VGPR limited
kernels. A kernel becomes VGPR limited if a total number of VGPRs per
SIMD / number of used VGPRs is more than a number of wave slots.
Similar to 806761a7629df268c8aed49657aeccffa6bca449.
For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.
Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outrightly.
This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:
```
LLVM :: CodeGen/AMDGPU/fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fabs.ll
LLVM :: CodeGen/AMDGPU/floor.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```