We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from
it). When VALU is issued, it increments internal counter VA_SDST used to
track use of this SGPR. SALU will not issue until VA_SDST is zero, that
is when VALU is finished writing. Therefore, delays added by s_delay_alu
are not needed in this situation.
We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from
it). When VALU is issued, it increments internal counter VA_SDST used to
track use of this SGPR. SALU will not issue until VA_SDST is zero, that
is when VALU is finished writing. Therefore, delays added by s_delay_alu
are not needed in this situation.
Since LowerBufferFatPointers runs before PreISelIntrinsicLowering, which
normally handles unsupported memcpy()s,, and since you can't have a
`noalias {ptr addrspace(8), i32}` becasue it crashes later passes,
manually expand memcpy()s involving buffer fat pointers to loops.
Additionally, though they're unlikely to be used, this commit adds
support for memset().
This commit doesn't implement writing direct-to-LDS loads as the
intrinsics, but leaves the option in the future.