Adds rules for G_ATOMICRMW_{MAX, MIN, UMAX, UMIN, UINC_WRAP, UDEC_WRAP}.
Each of these generic opcodes is supported for S32 and S64 types
on flat, global and local address spaces.
Start trying to use SimplifyDemandedFPClass on instructions, starting
with fmul. This subsumes the old transform on multiply of 0. The
main change is the introduction of nnan/ninf. I do not think anything
was systematically trying to introduce fast math flags before, though
a few odd transforms would set them.
Previously we only called SimplifyDemandedFPClass on function returns
with nofpclass annotations. Start following the pattern of
SimplifyDemandedBits, where this will be called from relevant root
instructions.
I was wondering if this should go into InstCombineAggressive, but that
apparently does not make use of InstCombineInternal's worklist.
Make it possible to use `s_alloc_vgpr` at the IR level. This is a huge
footgun and use for anything other than compiler internal purposes is
heavily discouraged. The calling code must make sure that it does not
allocate fewer VGPRs than necessary - the intrinsic is NOT a request to
the backend to limit the number of VGPRs it uses (in essence it's not so
different from what we do with the dynamic VGPR flags of the
`amdgcn.cs.chain` intrinsic, it just makes it possible to use this
functionality in other scenarios).
Sinking ShuffleVectors / ExtractElement / InsertElement into user blocks
can help enable SDAG combines by providing visibility to the values
instead of emitting CopyTo/FromRegs. The sink IR pass disables sinking
into loops, so this PR extends the CodeGenPrepare target hook
shouldSinkOperands.
Co-authored-by: Jeffrey Byrnes <Jeffrey.Byrnes@amd.com>
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
`getLit64Encoding` uses a different approach to determine whether 64-bit
literal encoding is used, which caused a size mismatch between the
`MachineInstr` and the `MCInst`.
For `!isValid32BitLiteral`, it is effectively `!(isInt<32>(Val) ||
isUInt<32>(Val))`, which is `!isInt<32>(Val) && !isUInt<32>(Val)`, but
in `getLit64Encoding`, it is `!isInt<32>(Val) || !isUInt<32>(Val)`.
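The disagreement between the two predicates can be sketched as follows. This is a minimal model, not the backend code: `is_int32` and `is_uint32` are hypothetical stand-ins for `isInt<32>` and `isUInt<32>`, and `Val` is treated as a signed 64-bit value.

```python
def is_int32(val: int) -> bool:
    """Stand-in for isInt<32>: val fits in a signed 32-bit integer."""
    return -(1 << 31) <= val < (1 << 31)


def is_uint32(val: int) -> bool:
    """Stand-in for isUInt<32>: val fits in an unsigned 32-bit integer."""
    return 0 <= val < (1 << 32)


def needs_lit64_and(val: int) -> bool:
    """!isValid32BitLiteral: neither 32-bit form can hold the value."""
    return (not is_int32(val)) and (not is_uint32(val))


def needs_lit64_or(val: int) -> bool:
    """The `||` form described above: at least one 32-bit form cannot hold it."""
    return (not is_int32(val)) or (not is_uint32(val))


def predicates_disagree(val: int) -> bool:
    """True when the two checks would pick different encodings."""
    return needs_lit64_and(val) != needs_lit64_or(val)
```

Under this model the two checks agree for small non-negative values and for values that need all 64 bits, but disagree for values such as `-1` or `0x80000000`, which fit only one of the two 32-bit forms.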
The definition for V_INDIRECT_REG_READ_GPR_IDX_B32_V*'s SSrc_b32 operand
allows immediates, but the expansion logic currently handles only the
register case. This can result in expansion failures when e.g.
llvm.amdgcn.wave.reduce.umin.i32 is folded into a constant and then used
as an insertelement idx.
Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication without
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bit encodings.
Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.
This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.
Solves SWDEV-516398.
Exactly match the s_wait_event instruction. For some reason we already
had this instruction used through llvm.amdgcn.s.wait.event.export.ready,
but that hardcodes a specific value. This should really be a bitmask that
can combine multiple wait types.
gfx11 -> gfx12 broke compatibility in a weird way, by inverting the
interpretation of the bit but also shifting the used bit by 1. Simplify
the selection of the old intrinsic by just using the magic number 2,
which should satisfy both cases.
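A rough sketch of why the value 2 can serve both targets. The bit positions and polarities below are assumptions made for illustration (gfx11 using bit 0 with inverted sense, gfx12 using bit 1 with direct sense); the function names are hypothetical and the ISA documentation remains the authority.

```python
EXPORT_READY_IMM = 2  # the "magic number" mentioned above (0b10)


def waits_for_export_ready_gfx11(imm: int) -> bool:
    # Assumption: on gfx11, bit 0 set means "do NOT wait for export ready".
    return (imm & 0b01) == 0


def waits_for_export_ready_gfx12(imm: int) -> bool:
    # Assumption: on gfx12, bit 1 set means "wait for export ready".
    return (imm & 0b10) != 0


def waits_on_both(imm: int) -> bool:
    """True when the same immediate requests the wait on both targets."""
    return waits_for_export_ready_gfx11(imm) and waits_for_export_ready_gfx12(imm)
```

Under these assumptions, `0b10` leaves the gfx11 "don't wait" bit clear and sets the gfx12 "wait" bit, so one immediate covers both interpretations.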
This PR fixes #177753 by converting disjoint S_OR_B32 to S_ADDK_I32
whenever possible. It avoids this transformation when S_OR_B32 can be
converted to bitset.
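The arithmetic behind the transform can be sketched as below. This is only a value-level model with hypothetical helper names: it assumes S_ADDK_I32 adds a sign-extended 16-bit immediate to a 32-bit register, checks disjointness on concrete values (the compiler would rely on the `disjoint` flag or known bits instead), and does not model the bitset exception.

```python
from typing import Optional

MASK32 = 0xFFFF_FFFF


def to_u32(x: int) -> int:
    """Wrap a Python integer to its 32-bit unsigned representation."""
    return x & MASK32


def fits_simm16(k: int) -> bool:
    """True when k is representable as a signed 16-bit immediate."""
    return -(1 << 15) <= k < (1 << 15)


def or_via_addk(reg: int, imm: int) -> Optional[int]:
    """Return reg + imm (mod 2^32) when it equals reg | imm, else None."""
    if not fits_simm16(imm):
        return None  # immediate does not fit the assumed 16-bit field
    if to_u32(reg) & to_u32(imm):
        return None  # operands share set bits, so OR and ADD differ
    return to_u32(reg + imm)  # no carries occur, so this equals the OR
```

For example, `or_via_addk(0x100, 0x0F)` yields `0x10F`, the same value as `0x100 | 0x0F`, while overlapping operands such as `0x3` and `0x1` are rejected.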
Note on Test Failures (Draft Status)
This change causes significant register reshuffling across the test suite
due to the new allocation hints, the swaps performed when src0 is not a
register and src1 is, and the change from or to addk. To avoid a massive,
noisy diff during the initial logic review, this Draft PR only includes a
representative sample of updated tests:
- CodeGen/AMDGPU/combine-reg-or-const.ll: showcases the change from S_OR to
S_ADDK.
- CodeGen/AMDGPU/s-barrier.ll: showcases the swap between Src0 and Src1 if
src0 is not a register.
The rest of the tests show the result of the register allocation hint we
give. I have checked every test I updated, and they seem OK to me.
Once the core logic is approved, I will run the update script across the
remaining ~70 failing tests and mark the PR as "Ready for Review."
Select (fsub x, y) -> (fma y, -1.0, x). Using -1.0 as the constant
avoids the need for ComplexPatterns to negate x or y.
This also fixes the bad pattern (fsub x, y) -> (fma -x, 1.0, y).
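A quick numeric check of the two patterns, as a sketch only. `fma` here is a hypothetical unfused stand-in (`a * b + c`); that is sufficient for this purpose because multiplying by 1.0 or -1.0 is exact, so only the final addition rounds.

```python
def fma(a: float, b: float, c: float) -> float:
    """Unfused stand-in for fma; exact enough when b is 1.0 or -1.0."""
    return a * b + c


def select_fsub(x: float, y: float) -> float:
    """Pattern from this patch: (fsub x, y) -> (fma y, -1.0, x)."""
    return fma(y, -1.0, x)


def bad_select_fsub(x: float, y: float) -> float:
    """The old, incorrect pattern: (fsub x, y) -> (fma -x, 1.0, y)."""
    return fma(-x, 1.0, y)
```

With `x = 7.0` and `y = 2.5`, `select_fsub` computes `x - y`, whereas `bad_select_fsub` computes `y - x`, the operands reversed.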
Mark the memory operand of spill load/stores as MOThreadPrivate, so that
these loads and stores are emitted with `nv` set.
The reason is that scratch memory used by spills will never be shared by
another thread. It's purely thread local and thus a good fit for the
`nv` bit, which is controlled by the MOThreadPrivate flag.
- Add `MONonVolatile` MachineMemOperand flag.
- Set nv=1 on memory operations on GFX12.5 if the operation accesses a
constant address space, is an invariant load, or has the `MONonVolatile`
flag set.
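The rule in the second bullet can be written as a small predicate. This is an illustrative sketch only; the function and parameter names are hypothetical and do not correspond to backend APIs.

```python
def sets_nv(is_gfx12_5: bool,
            in_constant_address_space: bool,
            is_invariant_load: bool,
            has_mo_non_volatile: bool) -> bool:
    """Model of the stated rule: nv=1 only on GFX12.5, and only when at
    least one of the three listed conditions holds."""
    return is_gfx12_5 and (in_constant_address_space
                           or is_invariant_load
                           or has_mo_non_volatile)
```

Any single condition is enough on GFX12.5; on other targets the predicate is always false in this model.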
Previously, the DAG combiner did not optimize exact signed division by a
power-of-two constant divisor for integer types exceeding the size of
division supported by the target architecture (e.g., i128 on x86-64).
However, such an optimization was expected by the division expansion
logic, leading to unsupported division operations making it to
instruction selection.
This commit addresses this issue by making an exception to the existing
exclusion of signed division with the exact flag for the aforementioned
operations. That is, the DAG combiner will now optimize exact signed
division if the divisor is a power-of-two constant and the integer type
exceeds the size of division supported by the target architecture.
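As a sketch of why this case is cheap to optimize: when the division is exact and the divisor is a positive power of two, the quotient is assumed to reduce to an arithmetic right shift. The helper below is hypothetical and uses Python's arbitrary-precision integers to stand in for a wide type such as i128; it does not model negative power-of-two divisors.

```python
def sdiv_exact_pow2(n: int, log2_divisor: int) -> int:
    """Divide n by 2**log2_divisor, requiring the division to be exact."""
    divisor = 1 << log2_divisor
    if n % divisor != 0:
        # The 'exact' flag promises no remainder; reject anything else.
        raise ValueError("division is not exact")
    # Python's >> on int is an arithmetic shift, so the sign is preserved.
    return n >> log2_divisor
```

For instance, `sdiv_exact_pow2(-(3 << 100), 100)` yields `-3` without any wide division being performed.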
---------
Signed-off-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
This reverts commit bff619f91015a633df659d7f60f842d5c49351df.
This was reverted due to regressions caused by poor copysign
optimization, which have been fixed.
`global_load_lds` and `buffer_load to lds` only increment `vmcnt` and do
not touch `lgkmcnt`. This causes invalid `waitcnts` for some Triton
kernels, similar to the added lit tests.
Note that the change for buffer ops is not necessary, i.e. the lit test
passes even before this PR, because it seems like `SIInsertWaitcnts`
does not use `LGKM_CNT` for buffer ops. But this change might prevent a
bug in the future.
On GFX9, BUFFER_WBL2 is used to write back dirty cache lines and
requires an s_waitcnt vmcnt(0) afterwards to ensure completion.
This patch fixes this by incrementing vmcnt for the buffer_wbl2 instruction.
---------
Co-authored-by: Jay Foad <jay.foad@gmail.com>
GISel CallLowering currently does a Type -> EVT -> Type roundtrip early
on when populating ArgInfo in splitToValueType(). This is a bit odd as
this structure operates at the IR Type level. Keep the original type
there and only convert to EVT when performing assignments.
Convert "denormal-fp-math" and "denormal-fp-math-f32" into a first
class denormal_fpenv attribute. Previously the query for the effective
denormal mode involved two string attribute queries with parsing. I'm
introducing more uses of this, so it makes sense to convert this
to a more efficient encoding. The old representation was also awkward
since it was split across two separate attributes. The new encoding
just stores the default and float modes as bitfields, largely avoiding
the need to consider if the other mode is set.
The syntax in the common cases looks like this:
`denormal_fpenv(preservesign,preservesign)`
`denormal_fpenv(float: preservesign,preservesign)`
`denormal_fpenv(dynamic,dynamic float: preservesign,preservesign)`
I wasn't sure about reusing the float type name instead of adding a
new keyword. It's parsed as a type but only accepts float. I'm also
debating switching the name to subnormal to match the current
preferred IEEE terminology (also used by nofpclass and other
contexts).
This has a behavior change when using the command-line debug
flags to set the denormal mode. The flags ignored functions with
an explicit attribute set, separately for
the default and f32 versions. Now that these are one attribute,
the flag logic can't distinguish which of the two components
was explicitly set on the function. Only one test appeared to
rely on this behavior, so I just avoided using the flags in it.
This also does not perform all the code cleanups this enables.
In particular the attributor handling could be cleaned up.
I also guessed at how to support this in MLIR. I followed
MemoryEffects as a reference; it appears bitfields are expanded
into arguments to attributes, so the representation there is
a bit uglier with the 2 2-element fields flattened into 4 arguments.
Accurately represent both the load and the store part of those intrinsics.
The test changes seem to be mostly fairly insignificant changes caused
by subtly different scheduler behavior.
Targets without a `modf` libcall lower the intrinsic directly, matching
the existing `llvm.frexp` expansion. Targets with an existing libcall
are unchanged.
Fixes #173021
Create t16 pseudos for mubuffer d16 load/store with vgpr16 in vdst/vdata
and use these t16 pseudos for the isel patterns. Lower them back to d16
machine instructions at the MC level.