On GFX12+, GLOBAL_INV increments the loadcnt counter but does not write
results to any VGPRs. Previously, we unconditionally inserted
s_wait_loadcnt 0 at function returns, even when the only pending loadcnt
events came from GLOBAL_INV instructions.
This patch skips the loadcnt wait at function boundaries when no VGPR
has a pending load. To decide this, we check whether any VGPR has a
score greater than the lower bound for LOAD_CNT. If none does, the
pending loadcnt must come from instructions that do not write VGPRs,
such as GLOBAL_INV.
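
The check described above can be illustrated with a small standalone
model. This is only a sketch, not the code in the waitcnt pass: the
struct LoadCntModel, its members, and the fixed register count are
hypothetical names invented for illustration, and the real pass tracks
considerably more state.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical, simplified model of score tracking for one counter
// (LOAD_CNT). None of these names come from the actual pass.
struct LoadCntModel {
  static constexpr int NumVGPRs = 8; // assumption: tiny register file

  uint32_t LowerBound = 0; // score up to which events are known complete
  uint32_t UpperBound = 0; // score of the most recently issued event
  // Score of the last load that writes each VGPR (0 = never written).
  std::array<uint32_t, NumVGPRs> VgprScore{};

  // A load that writes DstVgpr: bumps the counter and records the score.
  void issueLoad(int DstVgpr) { VgprScore[DstVgpr] = ++UpperBound; }

  // A GLOBAL_INV-like event: bumps the counter but writes no VGPR.
  void issueInvalidate() { ++UpperBound; }

  // Models a full wait (s_wait_loadcnt 0): everything issued is complete.
  void waitAll() { LowerBound = UpperBound; }

  // True if any loadcnt event is still outstanding.
  bool hasPending() const { return UpperBound > LowerBound; }

  // True if some VGPR has a score above the lower bound, i.e. a load
  // into that register may still be in flight.
  bool hasPendingVgprLoad() const {
    return std::any_of(VgprScore.begin(), VgprScore.end(),
                       [this](uint32_t S) { return S > LowerBound; });
  }

  // The wait at a function return is needed only when a VGPR-writing
  // load is pending; pending events without a VGPR can be skipped.
  bool needsLoadcntWaitAtReturn() const {
    return hasPending() && hasPendingVgprLoad();
  }
};
```

In this model, calling issueInvalidate() alone leaves hasPending() true
while needsLoadcntWaitAtReturn() stays false, which corresponds to the
case where the wait is skipped. A following issueLoad() makes the wait
necessary again.
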
The optimization is limited to GFX12+ targets, which have GLOBAL_INV
and use the extended wait count instructions.
This is a follow-up to PR #135340, which added tracking for
GLOBAL_INV in the waitcnt pass.