When a global variable has a size that exceeds the size of the address
space it resides in, the verifier should fail as the variable can
neither be materialized nor fully accessed. This patch adds a check to
the verifier to enforce it.
---------
Signed-off-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
Co-authored-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
Adds rules for G_ATOMICRMW_{MAX, MIN, UMAX, UMIN, UINC_WRAP, UDEC_WRAP}.
Each of these generic opcode are supported for S32 and S64 types
on flat, global and local address spaces.
This patch adds support in Clang for the PRFM IR instruction, by adding
the following builtin:
void __pldir(void const *addr);
This builtin is described in the following ACLE proposal:
https://github.com/ARM-software/acle/pull/406
Start trying to use SimplifyDemandedFPClass on instructions, starting
with fmul. This subsumes the old transform on multiply of 0. The
main change is the introduction of nnan/ninf. I do not think anywhere
was systematically trying to introduce fast math flags before, though
a few odd transforms would set them.
Previously we only called SimplifyDemandedFPClass on function returns
with nofpclass annotations. Start following the pattern of
SimplifyDemandedBits, where this will be called from relevant root
instructions.
I was wondering if this should go into InstCombineAggressive, but that
apparently does not make use of InstCombineInternal's worklist.
This lowers the splitting as:
```
any_active(hi_mask)
? (find_last_active(hi_mask) + lo_mask.getVectorElementCount())
: find_last_active(lo_mask)
```
And trivially lowers `<1 x i1>` scalarization to returning zero. Which
is a natural result of the splitting (and the lack of a sentinel
"none-active" result value).
The lowerings likely can be improved. This patch is for completeness.
Should fix:
https://github.com/llvm/llvm-project/pull/178862#issuecomment-3862310334Fixes#180212
Make it possible to use `s_alloc_vgpr` at the IR level. This is a huge
footgun and use for anything other than compiler internal purposes is
heavily discouraged. The calling code must make sure that it does not
allocate fewer VGPRs than necessary - the intrinsic is NOT a request to
the backend to limit the number of VGPRs it uses (in essence it's not so
different from what we do with the dynamic VGPR flags of the
`amdgcn.cs.chain` intrinsic, it just makes it possible to use this
functionality in other scenarios).
When comparing multi-word integers with Zicond, we generate:
(or (czero_eqz (lo1 < lo2), (hi1 == hi2)),
(czero_nez (hi1 < hi2), (hi1 == hi2)))
The czero_nez is redundant because when hi1 == hi2 is true, hi1 < hi2 is
already 0. This patch adds a DAG combine to recognize:
czero_nez (setcc X, Y, CC), (setcc X, Y, eq) -> (setcc X, Y, CC)
when CC is a strict inequality (lt, gt, ult, ugt).
This saves one instruction in 128-bit comparisons on RV64 with Zicond.
Note the czero_nez becomes a czero.eqz in the final assembly because the
seteq is replaced by an xor that produces 0 when the values are equal.
Part of #179584
Assisted-by: claude
This adds atomicrmw `uinc_wrap` and `udec_wrap` operations support for
SPIR-V. Since SPIR-V doesn't provide dedicated instructions for those
two operations, we have to use the `AtomicExpand` pass to expand the
operations into CAS forms.
Closes#177204.
The target function to be checked by the Control Flow Guard Check
function is stored in `X15` on AArch64. This register is guaranteed to
be preserved by that function (on success), thus after it returns `X15`
can be used to branch to the target function instead of having to load
it from another register or the stack.
Compressing to a single shuffle doesn't remove any information and the backend can better apply specific optimizations to a single shuffle.
Addresses #176218.
---------
Co-authored-by: Luke Lau <luke_lau@igalia.com>
This patch implements the SPIR-V lowering for the following HLSL
intrinsics:
- SampleBias
- SampleGrad
- SampleLevel
- SampleCmp
- SampleCmpLevelZero
It defines the required LLVM intrinsics in 'IntrinsicsDirectX.td' and
'IntrinsicsSPIRV.td'.
It updates 'SPIRVInstructionSelector.cpp' to handle the new intrinsics
and
generates the correct 'OpImageSample*' instructions with the required
operands
(Bias, Grad, Lod, ConstOffset, MinLod, etc.).
CodeGen tests are added to verify the implementation for images with
dimension 1D, 2D, 3D, and Cube.
Assisted-by: Gemini
If 32-bit (or less) "v0" registers coming from inline asm are treated as
vector ones, codegen might produce incorrect vector<->scalar
conversions. This causes types mismatch assertion failures later during
compile-time. The fix treats 32-bit or less v0-v31 AArch64 registers as
scalar, along with 64-bit ones.
Fixes#153442
Sinking ShuffleVectors / ExtractElement / InsertElement into user blocks
can help enable SDAG combines by providing visibility to the values
instead of emitting CopyTo/FromRegs. The sink IR pass disables sinking
into loops, so this PR extends the CodeGenPrepare target hook
shouldSinkOperands.
Co-authored-by: Jeffrey Byrnes <Jeffrey.Byrnes@amd.com>
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Extend the existing combineADDDToWMACC DAG combine to also match
RISCVISD::WMULSU and produce RISCVISD::WMACCSU. This is similar to
how ADDD+UMUL_LOHI is combined to WMACCU and ADDD+SMUL_LOHI is
combined to WMACC.
This patch was generated by AI, but I reviewed it.
This commit adds initial support to lower the intrinsinc
`@llvm.structured.gep` into proper SPIR-V.
For now, the backend continues to support both GEP formats. We might
want to revisit this at some point for the logical part.
`getLit64Encoding` uses a different approach to determine whether 64-bit
literal encoding is used, which caused a size mismatch between the
`MachineInstr` and the `MCInst`.
For `!isValid32BitLiteral`, it is effectively `!(isInt<32>(Val) ||
isUInt<32>(Val))`, which is `!isInt<32>(Val) && !isUInt<32>(Val)`, but
in `getLit64Encoding`, it is `!isInt<32>(Val) || !isUInt<32>(Val)`.
selectConst() was asserting for constants wider than 64 bits. Add APInt
overloads of getOrCreateConstInt and getOrCreateConstVector that avoid
the uint64_t truncation.
Fix bug where Fast ISel incorrectly set `IncomingArgSize` to `0` for
functions with no arguments, since `MIPS O32` uses _the reserved
argument area_ of 16 bytes even for the functions with no args at all.
If the scalar integer selection sources are freely transferable to the
FPU, then splat to create an allbits select condition and create a
vector select instead
The definition for V_INDIRECT_REG_READ_GPR_IDX_B32_V*'s SSrc_b32 operand
allows immediates, but the expansion logic handles only register cases
now. This can result in expansion failures when e.g.
llvm.amdgcn.wave.reduce.umin.i32 is folded into a constant and then used
as an insertelement idx.
Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication w/o
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bits encoding.
Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.
This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.
Solves SWDEV-516398.
Exactly match the s_wait_event instruction. For some reason we already
had this instruction used through llvm.amdgcn.s.wait.event.export.ready,
but that hardcodes a specific value. This should really be a bitmask
that
can combine multiple wait types.
gfx11 -> gfx12 broke compatabilty in a weird way, by inverting the
interpretation of the bit but also shifting the used bit by 1. Simplify
the selection of the old intrinsic by just using the magic number 2,
which should satisfy both cases.
We add pseudos/patterns for `vabs.v` instruction and handle the
lowering in `RISCVTargetLowering::lowerABS`.
Reviewers: topperc, 4vtomat, mshockwave, preames, lukel97, tclin914
Reviewed By: mshockwave
Pull Request: https://github.com/llvm/llvm-project/pull/180142
We directly lower `ISD::ABDS`/`ISD::ABDU` to `Zvabd` instructions.
Note that we only support SEW=8/16 for `vabd.vv`/`vabdu.vv`.
Reviewers: mshockwave, lukel97, topperc, preames, tclin914, 4vtomat
Reviewed By: lukel97, topperc
Pull Request: https://github.com/llvm/llvm-project/pull/180141
We should add used callee-saved registers as implicit used to save
libcall and as implicit defined to restore libcall. It likes what we did
for CM_PUSH/CM_POPRET. That can help to construct correct dataflow. In
entry bb, save libcall implicitly uses the callee-saved registers which
live in. And in return bb, restore libcall implicitly defines the
callee-saved registers which live out.
Combine the pattern:
ADDD(addlo, addhi, UMUL_LOHI(x, y).0, UMUL_LOHI(x, y).1)
into:
WMACCU(x, y, addlo, addhi)
And similarly for SMUL_LOHI -> WMACC.
This patch was written with AI, but I reviewed it carefully.