Implicit defs and uses on spill stores were accounted as real defs and
uses, while only exist for liveness accounting. As a result unneded
waits were generated.
Fixes: SWDEV-484177
This reverts commit
7792b4ae79.
The problem was a conflict with
e55d6f5ea2
"[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive
(https://github.com/llvm/llvm-project/pull/107889)"
which changed the syntax of V_SET_INACTIVE (and thus made my MIR test
crash).
...if only we had a merge queue.
- Updated `phi_moveimm_subreg_input` test case to introduce
sub-registers as PHI input operands.
Currently subreg is making the testcase in non-SSA format, need to fix
this by giving subreg as an input operand to PHI instead defining the
subreg register.
This change is relevant for : [[AMDGPU] Add MachineVerifier check to
detect illegal copies from vector register to SGPR
](https://github.com/llvm/llvm-project/pull/105494)
Fix a bug which resulted in selection of s_load_b96 on GFX11, which only
exists in GFX12.
The root cause was a mismatch between legalization and selection. The
condition used to check that the load was uniform in legalization
(SITargetLowering::LowerLOAD) was "!Op->isDivergent()". The condition
used to detect a non-uniform load during selection
(AMDGPUDAGToDAGISel::isUniformLoad()) was
"N->isDivergent() && !AMDGPUInstrInfo::isUniformMMO(MMO)". This makes a
difference when IR uniformity analysis has more information than SDAG's
built in analysis. In the test case this is because IR UA reports that
everything is uniform if isSingleLaneExecution() returns true, e.g. if
the specified max flat workgroup size is 1, but SDAG does not have this
optimization.
The immediate fix is to use the same condition to detect uniform loads
in legalization and selection. In future SDAG should learn about
isSingleLaneExecution(), and then it could probably stop relying on IR
metadata to detect uniform loads.
Reverts llvm/llvm-project#108173
si-init-whole-wave.mir crashes on some buildbots (although it passed
both locally with sanitizers enabled and in pre-merge tests).
Investigating.
Always generate v_cndmask_b32 instead of modifying exec around
v_mov_b32. This is expected to be faster because
modifying exec generally causes pipeline stalls.
This is a large patch includes the MC level support for V_CVT_F16_F32,
V_CVT_F32_F16 and V_LDEXP_F16 in true16 format.
This patch includes the asm/disasm changes to encode/decode the 16bit
vsrc, vdst and src modifieres for vop and dpp format. This patch is a
dependency for many 16 bit instructions while only three instructions
are updated to make it easier to review.
There will be another patch to support these three instructions in the
codeGen level, this patch just replaces these two instructions with its
fake16 format.
trunc (binop X, C) --> binop (trunc X, trunc C) --> binop (trunc X, C`)
Try to narrow the width of math or bitwise logic instructions by pulling
a truncate ahead of binary operators.
Vx and Nx cores consider 32-bit and 64-bit basic arithmetic equal in
costs.
This intrinsic is meant to be used in functions that have a "tail" that
needs to be run with all the lanes enabled. The "tail" may contain
complex control flow that makes it unsuitable for the use of the
existing WWM intrinsics. Instead, we will pretend that the function
starts with all the lanes enabled, then branches into the actual body of
the function for the lanes that were meant to run it, and then finally
all the lanes will rejoin and run the tail.
As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):
```
entry:
%func_exec = call i1 @llvm.amdgcn.init.whole.wave()
br i1 %func_exec, label %func, label %tail
func:
; ... stuff that should run with the actual EXEC mask
br label %tail
tail:
; ... stuff that runs with all the lanes enabled;
; can contain more than one basic block
```
It's an error to use the result of this intrinsic for anything
other than a branch (but unfortunately checking that in the verifier is
non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if
between the intrinsic and the branch).
The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now
is expanded in si-wqm (which is where SI_INIT_EXEC is handled too);
however the information that the function was conceptually started in
whole wave mode is stored in the machine function info
(hasInitWholeWave). This will be useful in prolog epilog insertion,
where we can skip saving the inactive lanes for CSRs (since if the
function started with all the lanes active, then there are no inactive
lanes to preserve).
LLPC can generate llvm.amdgcn.image.atomic.swap intrinsic with data
argument as float type as well as float return type. This went unnoticed
until CreateIntrinsic with implicit mangling was used.
Optimize V_SET_INACTIVE by allow it to run in WWM.
Hence WWM sections are not broken up for inactive lane setting.
WWM V_SET_INACTIVE can typically be lower to V_CNDMASK.
Some cases require use of exec manipulation V_MOV as previous code.
GFX9 sees slight instruction count increase in edge cases due to
smaller constant bus.
Additionally avoid introducing exec manipulation and V_MOVs where
a source of V_SET_INACTIVE is the destination.
This is a common pattern as WWM register pre-allocation often
assigns the same register.
Dynamic lds and Table lds both use the amdgpu_lds_kernel_id intrinsic.
Kernels and functons that make an indirect use of this should not have
the
"amdgpu-no-lds-kernel-id" attribute.
For the later, this was done. For the dynamic lds case, this was
missing. This patch fixes it.
Any SGPR read by a VALU can potentially obscure SALU writes to the same
register.
Insert s_wait_alu instructions to mitigate the hazard on affected paths.
Compute a global cache of SGPRs with any VALU reads and use this to
avoid inserting mitigation for SGPRs never accessed by VALUs.
To avoid excessive search when compile time is priority implement
secondary mode where all SALU writes are mitigated.
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
Use GCNPat instead of Custom Lowering to select instructions for
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" is developed to translate the round
modes to the corresponding ones that hardware recognizes.
This patch is part of a set of patches that add an `-fextend-lifetimes`
flag to clang, which extends the lifetimes of local variables and
parameters for improved debuggability. In addition to that flag, the
patch series adds a pragma to selectively disable `-fextend-lifetimes`,
and an `-fextend-this-ptr` flag which functions as `-fextend-lifetimes`
for this pointers only. All changes and tests in these patches were
written by Wolfgang Pieb (@wolfy1961), while Stephen Tozer (@SLTozer)
has handled review and merging. The extend lifetimes flag is intended to
eventually be set on by `-Og`, as discussed in the RFC
here:
https://discourse.llvm.org/t/rfc-redefine-og-o1-and-add-a-new-level-of-og/72850
This patch implements a new intrinsic instruction in LLVM,
`llvm.fake.use` in IR and `FAKE_USE` in MIR, that takes a single operand
and has no effect other than "using" its operand, to ensure that its
operand remains live until after the fake use. This patch does not emit
fake uses anywhere; the next patch in this sequence causes them to be
emitted from the clang frontend, such that for each variable (or this) a
fake.use operand is inserted at the end of that variable's scope, using
that variable's value. This patch covers everything post-frontend, which
is largely just the basic plumbing for a new intrinsic/instruction,
along with a few steps to preserve the fake uses through optimizations
(such as moving them ahead of a tail call or translating them through
SROA).
Co-authored-by: Stephen Tozer <stephen.tozer@sony.com>
For some reason, isOperationLegalOrCustom is not the same as
isOperationLegal || isOperationCustom. Unfortunately, it checks
if the type is legal which makes it uesless for custom lowering
on non-legal types (which is always ppcf128).
Really the DAG builder shouldn't be going to expand this in the
builder, it makes it difficult to work with. It's only here to work
around the DAG requiring legal integer types the same size as
the FP type after type legalization.
The implementation of `AAPointerInfo::RangeList::set_difference` doesn't
consider the case where two ranges have the same offset but different
sizes.
This could cause `AccessList` and `OffsetBins` out of sync because a
range has
been already updated in `AccessList` but missing in `ToRemove`.
I do have a reproducer but the reproducer itself is 248kb. `llvm-reduce`
can't
further reduce it. Not sure how I can make a smaller reproducer.
Fixes: SWDEV-479757.