56971 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin
21704a685d
[AMDGPU] Fix printing hasInitWholeWave in mir (#123232) 2025-01-17 03:00:02 -08:00
Phoebe Wang
fbb9d49506
[X86][APX] Support APX + AMX-MOVRS/AMX-TRANSPOSE (#123267)
Ref.: https://cdrdv2.intel.com/v1/dl/getContent/784266
2025-01-17 17:51:42 +08:00
Will Froom
c8ba551da1
[AArch64] Return early rather than asserting when Size of value passed to targetShrinkDemandedConstant is not 32 or 64 (#123084)
See https://github.com/llvm/llvm-project/issues/123029 for details.
2025-01-17 08:41:33 +00:00
Phoebe Wang
1274bca2ad
[X86][APX] Support APX + MOVRS (#123264)
Ref.: https://cdrdv2.intel.com/v1/dl/getContent/784266
2025-01-17 16:06:31 +08:00
Vikram Hegde
225fc4f356
[AMDGPU][SDAG] Try folding "lshr i64 + mad" to "mad_u64_u32" (#119218)
The intention is to use a "copy" instead of a "sub" to handle the high
parts of 64-bit multiply for this specific case.

This unlocks copy prop use cases where the copy can be reused by later
multiply+add sequences if possible.

Fixes: SWDEV-487672, SWDEV-487669
2025-01-17 11:09:39 +05:30
Matt Arsenault
7475f0a345
DAG: Avoid forming shufflevector from a single extract_vector_elt (#122672)
This avoids regressions in a future AMDGPU commit. Previously we
would have a build_vector (extract_vector_elt x), undef with free
access to the elements bloated into a shuffle of one element + undef,
which has much worse combine support than the extract.

Alternatively could check aggressivelyPreferBuildVectorSources, but
I'm not sure it's really different than isExtractVecEltCheap.
2025-01-17 08:44:43 +07:00
Matt Arsenault
ca95519704
AMDGPU: Implement isExtractVecEltCheap (#122460)
Once again we have excessive TLI hooks with bad defaults. Permit this
for 32-bit element vectors, which are just use-different-register.
We should permit 16-bit vectors as cheap with legal packed instructions,
but I see some mixed improvements and regressions that need investigation.
2025-01-17 08:38:01 +07:00
Matt Arsenault
4431106630
DAG: Fix vector bin op scalarize defining a partially undef vector (#122459)
This avoids some of the pending regressions after AMDGPU implements
isExtractVecEltCheap.

In a case like shl <value, undef>, splat k, because the second operand
was fully defined, we would fall through and use the splat value for the
first operand, losing the undef high bits. This would result in an additional
instruction to handle the high bits. Add some reduced testcases for different
opcodes for one of the regressions.
2025-01-17 08:34:03 +07:00
Luke Lau
a761e26b23
[RISCV] Allow non-loop invariant steps in RISCVGatherScatterLowering (#122244)
The motivation for this is to allow us to match strided accesses that
are emitted from the loop vectorizer with EVL tail folding (see #122232)

In these loops the step isn't loop invariant and is based off of
@llvm.experimental.get.vector.length.

We can relax this as long as we make sure to construct the updates after
the definition inside the loop, instead of the preheader.

I presume the restriction was previously added so that the step would
dominate the insertion point in the preheader. I can't think of why it
wouldn't be safe to calculate it in the loop otherwise.
2025-01-17 08:58:56 +08:00
Philip Reames
bb6e94a05d
[RISCV] Custom legalize <N x i128>, <4 x i256>, etc.. shuffles (#122352)
I have a particular user downstream who likes to write shuffles in terms
of unions involving _BitInt(128) types. This isn't completely crazy
because there's a bunch of code in the wild which was written with SSE
in mind, so 128 bits is a common data fragment size.

The problem is that generic lowering scalarizes this to ELEN, and we end
up with really terrible extract/insert sequences if the i128 shuffle is
between other (non-i128) operations.

I explored trying to do this via generic lowering infrastructure, and
frankly got lost. Doing this a target specific DAG is a bit ugly -
really, there's nothing hugely target specific here - but oh well. If
reviewers prefer, I could probably phrase this as a generic DAG combine,
but I'm not sure that's hugely better. If reviewers have a strong
preference on how to handle this, let me know, but I may need a bit of
help.

A couple notes:

* The argument passing weirdness is due to a missing combine to turn a
build_vector of adjacent i64 loads back into a vector load. I'm a bit
surprised we don't get that, but the isel output clearly has the
build_vector at i64.
* The splat case I plan to revisit in another patch. That's a relatively
common pattern, and the fact I have to scalarize that to avoid an
infinite loop is non-ideal.
2025-01-16 14:55:45 -08:00
Brox Chen
8a0c2e7567
[AMDGPU][True16][MC][CodeGen] true16 for v_cndmask_b16 (#119736)
Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and
fake16 flow.

Since we are replacing `v_cndmask_b16` to `v_cndmask_b16_t16/fake16`, we
have to at least update the fake16 codeGen to get codeGen test passing.
For this case, we have to update the true16 and with fake16 together,
otherwise some of the true16 tests will fail
2025-01-16 17:18:28 -05:00
Princeton Ferro
3ba339b5e7
[NVPTX] Improve support for {ex2,lg2}.approx (#120519)
- Add support for `@llvm.exp2()`:
  - LLVM: `float`        -> PTX: `ex2.approx{.ftz}.f32`
  - LLVM: `half`         -> PTX: `ex2.approx.f16`
  - LLVM: `<2 x half>`   -> PTX: `ex2.approx.f16x2`
  - LLVM: `bfloat`       -> PTX: `ex2.approx.ftz.bf16`
  - LLVM: `<2 x bfloat>` -> PTX: `ex2.approx.ftz.bf16x2`
  - Any operations with non-native vector widths are expanded. On
    targets not supporting f16/bf16, values are promoted to f32.

- Add *CONDITIONAL* support for `@llvm.log2()` [^1]:
  - LLVM: `float` -> PTX: `lg2.approx{.ftz}.f32`
  - Support for f16/bf16 is emulated by promoting values to f32.

[1]: CUDA implements `exp2()` with `ex2.approx` but `log2()` is
implemented differently, so this is off by default. To enable, use the
flag `-nvptx-approx-log2f32`.
2025-01-16 12:21:32 -08:00
Raphael Moreira Zinsly
01d7f434d2
[RISCV] Stack clash protection for dynamic alloca (#122508)
Create a probe loop for dynamic allocation and add the corresponding
SelectionDAG support in order to use it.
2025-01-16 11:58:42 -08:00
Adam Yang
4446a9849a
[HLSL][SPIRV][DXIL] Implement WaveActiveSum intrinsic (#118580)
```    - add clang builtin to Builtins.td
      - link builtin in hlsl_intrinsics
      - add codegen for spirv intrinsic and two directx intrinsics to retain
        signedness information of the operands in CGBuiltin.cpp
      - add semantic analysis in SemaHLSL.cpp
      - add lowering of spirv intrinsic to spirv backend in
        SPIRVInstructionSelector.cpp
      - add lowering of directx intrinsics to WaveActiveOp dxil op in
    DXIL.td

      - add test cases to illustrate passespendent pr merges.
```
Resolves #70106

---------

Co-authored-by: Finn Plummer <canadienfinn@gmail.com>
2025-01-16 10:35:23 -08:00
Craig Topper
fc7a1ed0ba
[RISCV] Fold vp.reverse(vp.load(ADDR, MASK)) -> vp.strided.load(ADDR, -1, MASK). (#123115)
Co-authored-by: Brandon Wu <brandon.wu@sifive.com>
2025-01-16 08:20:17 -08:00
Luke Lau
437e1a70ca
[RISCV][VLOPT] Handle tied pseudos in getOperandInfo (#123170)
For .wv widening instructions when checking if the opperand is vs1 or
vs2, we take into account whether or not it has a passthru. For tied
pseudos though their passthru is the vs2, and we weren't taking this
into account.
2025-01-16 23:00:13 +08:00
peterbell10
5e5fd0e6fc
[NVPTX] Select bfloat16 add/mul/sub as fma on SM80 (#121065)
SM80 has fma for bfloat16 but not add/mul/sub. Currently these ops incur
a promotion to f32, but we can avoid this by writing them in terms of
the fma:
```
FADD(a, b) -> FMA(a, 1.0, b)
FMUL(a, b) -> FMA(a, b, -0.0)
FSUB(a, b) -> FMA(b, -1.0, a)
```

Unfortunately there is no `fma.ftz` so when ftz is enabled, we still
fall back to promotion.
2025-01-16 14:53:24 +00:00
Simon Pilgrim
95ff3b5167 [X86] vector-compress.ll - regenerate with missing AVX2 test coverage
Shows some really poor codegen for the maskbit extraction that we should address.
2025-01-16 14:17:33 +00:00
Vyacheslav Levytskyy
6ada0022ce
[SPIR-V] Fix --target-env version value in the test case (#123191)
This PR fixes `--target-env` version value in the test case
`llvm/test/CodeGen/SPIRV/validate/sycl-tangle-group-algorithms.ll`: the
issue was introduced in https://github.com/llvm/llvm-project/pull/122755
2025-01-16 14:26:29 +01:00
Simon Pilgrim
24df8f5da4 [X86] vector-compress.ll - add nounwind attoribute to remove cfi noise 2025-01-16 10:13:28 +00:00
Oliver Stannard
9e436c2daa
[MachineCP] Correctly handle register masks and sub-registers (#122734)
When passing an instruction with a register mask, the machine copy
propagation pass was dropping the information about some copy
instructions which define a register which is preserved by the mask,
because that register overlaps a register which is partially clobbered
by it. This resulted in a miscompilation for AArch64, because this
caused a live copy to be considered dead.

The fix is to clobber register masks by finding the set of reg units
which is preserved by the mask, and clobbering all units not in that
set.

This is based on #122472, and fixes the compile time performance
regressions which were caused by that.
2025-01-16 09:39:27 +00:00
David Green
ccd8d0b548 [AArch64][GlobalISel] Add gisel coverage for double-reductions. NFC
The extra tests are simpler for GISel to detect.
2025-01-16 09:24:09 +00:00
Pedro Lobo
c23f2417dc
[CodeGenPrepare] Replace undef use with poison [NFC] (#123111)
When generating a constant vector, if `UseSplat` is false, the indices
different from the index of the extract can be filled with `poison`
instead of `undef`.
2025-01-16 08:17:55 +00:00
Christudasan Devadasan
1797fb6b23
[AMDGPU][NewPM] Port SILowerControlFlow pass into NPM. (#123045) 2025-01-16 11:06:38 +05:30
LiqinWeng
d2484127cd
[VP] IR expansion to Int Func Call (#122867)
Add basic handling for VP ops that can expand to Int intrinsics, which
includes: ctpop/cttz/ctlz/sadd.sat/uadd.sat/ssub.sat/usub.sat/fshl/fshr
2025-01-16 10:12:29 +08:00
Ashley Coleman
4f48abff0f
[HLSL] Implement elementwise firstbitlow builtin (#116858)
Closes https://github.com/llvm/llvm-project/issues/99116

Implements `firstbitlow` by extracting common functionality from
`firstbithigh` into a shared function while also fixing a bug for an edge
case where `u64x3` and larger vectors will attempt to create vectors
larger than the SPRIV max of 4.
---------

Co-authored-by: Steven Perron <stevenperron@google.com>
2025-01-15 15:36:50 -07:00
Brian Cain
d8a68fe680
[Hexagon] Omit calls to specialized {float,fix} routines (#117423)
These were introduced in 1213a7a57fdc (Hexagon backend support,
2011-12-12) but they aren't present in libclangrt.builtins-hexagon. The
generic versions of these functions are present in the builtins, though.
So it should suffice to call those instead.
2025-01-15 14:21:48 -06:00
peterbell10
0068078dca
[NVPTX] Remove NVPTX::IMAD opcode, and rely on intruction selection only (#121724)
I noticed that NVPTX will sometimes emit `mad.lo` to multiply by 1, e.g.
in https://gcc.godbolt.org/z/4j47Y9W4c.

This happens when DAGCombiner operates on the add before the mul, so the
imad contraction happens regardless of whether the mul could have been
simplified.

To fix this, I remove `NVPTXISD::IMAD` and only combine to mad during
selection. This allows the default DAGCombiner patterns to simplify
the graph without any NVPTX-specific intervention.
2025-01-15 20:09:18 +00:00
Philip Reames
e19bc76812 [RISCV] Precommit test coverage for pr118873 2025-01-15 10:18:56 -08:00
jofrn
c8bbbaa5c7
[SelectionDAG][AMDGPU] Negative offset when selecting scratch sv offsets (#122251)
APInt will fail when given a negative offset. SelectScratchSVAddr
utilizes this function and can be given a negative offset as well, so
this change modifies it to use APSInt instead.
2025-01-15 06:56:28 -05:00
David Green
9025c269aa [AArch64] Add an extra test case for adds and subs combines. NFC 2025-01-15 10:51:44 +00:00
Mariusz Sikora
b3924cb9ec
[AMDGPU] Set Convergent property for image.(getlod/sample*) intrinsics which uses WQM (#122908)
This change adds IntrConvergent property to image.getlod intrinsic and
to several image.sample intrinsics. All image.sample intrinsics apart
from LOD(_L), Level 0(_LZ), Derivative(_D) will be marked as Convergent.
2025-01-15 10:23:28 +01:00
Simon Pilgrim
8ac00ca486
[X86] lowerShuffleWithUndefHalf - don't split vXi8 unary shuffles if the 128-bit source lanes are already in place (#122919)
Allows us to use PSHUFB to shuffle the lanes, and then perform a sub-lane permutation down to the lower half

Fixes #116815
2025-01-15 08:19:54 +00:00
Luke Lau
02403f4e45 [RISCV] Split strided-load-store.ll tests into EVL and VP. NFC
None of the changes in #122232 or the upcoming #122244 are specific to
the EVL, so split out the EVL tail-folded loops into separate
"integration tests" that reflect the output of the loop vectorizer.
2025-01-15 13:42:53 +08:00
Alex MacLean
273a94b3d5
[NVPTX] Add some more immediate instruction variants (#122746)
While this likely won't impact the final SASS, it makes for more compact
PTX.
2025-01-14 21:28:29 -08:00
Shoreshen
b665dddd70
[AMDGPU] Add tests for v_sat_pk_u8_i16 codegen (#122438)
Preparation for #121124 

This PR provides tests added into
[PR](https://github.com/llvm/llvm-project/pull/121124) that add
selection patterns for instruction `v_sat_pk`, in order to specify the
change of the tests before and after the commit.

Pre-commit tests PR for #121124 : Add selection patterns for instruction
`v_sat_pk`
2025-01-14 19:26:46 -05:00
Florian Hahn
0b3912622e [ARM] Update LV test in test/Codegen/ARM after 1de3dc7d23. 2025-01-14 22:41:31 +00:00
Steven Perron
e511b3e24a
[SPIRV] Fix graphic test to use correct triple. (#122738) 2025-01-14 14:17:17 -05:00
S. Bharadwaj Yadavalli
a4b7a2d021
[DirectX] Propagate shader flags mask of callees to callers (#118306)
Propagate shader flags mask of callees to callers.

Add tests to verify propagation of shader flags
2025-01-14 13:18:16 -05:00
Vyacheslav Levytskyy
9ba27ca5c7
[SPIR-V] Ensure no uses of intrinsic global variables after module translation (#122729)
Ensure that the backend satisfies the requirement of the verifier that
disallows uses of intrinsic global variables. This PR fixes
https://github.com/llvm/llvm-project/issues/110495
2025-01-14 17:45:10 +01:00
Vyacheslav Levytskyy
b74d3e179d
[SPIR-V] Specify target environment in tests referring to the BuiltIn WorkgroupSize variable (#122755)
https://github.com/KhronosGroup/SPIRV-Tools/pull/5407 introduces a check
for WorkgroupSize variable to be a 3-component 32-bit int vector, and
indeed, we see this requirement in
https://registry.khronos.org/vulkan/specs/latest/man/html/WorkgroupSize.html#VUID-WorkgroupSize-WorkgroupSize-04427

However, OpenCL imposes different requirements, documented here:
https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Env.html#_built_in_variables
OpenCL environment requires WorkgroupSize variable to have components of
size_t size that will be 32 or 64 depending on a target. This is the way
how the SPIR-V Backend implements this, by querying pointer size of the
current platform/target.

To allow spirv-val to account target environments difference, this PR
adds `--target-env <env>` to test cases referring to the BuiltIn
WorkgroupSize variable.
2025-01-14 17:44:07 +01:00
Brox Chen
f1b1c7f3c1
[AMDGPU][True16][CodeGen] Undo sub(x,c) to add in true16 flow (#118854)
Undo sub x, c -> add x, -c canonicalization in true16 fow.

This duplicating the pattern from fake16 and implemement the same
pattern in true16 format
2025-01-14 10:57:33 -05:00
Brox Chen
5e26ff35c1
[AMDGPU][True16][MC] true16 for v_cmp_lt_f16 (#122499)
True16 format for v_cmp_lt_f16. Update VOPC t16 and fake16 pseudo.
2025-01-14 10:03:36 -05:00
Simon Pilgrim
0bd098b1cc
[X86] Fold VPERMV3(WIDEN(X),M,WIDEN(Y)) -> VPERMV(CONCAT(X,Y),M') iff the CONCAT is free (#122750)
Minor followup to #122485 - if the source operands were widened half-size subvectors, then attempt to concatenate the subvectors directly, and then adjust the shuffle mask so references to the second operand now refer to the upper half of the concat result.
2025-01-14 13:26:38 +00:00
Jay Foad
e87f94a6a8
[llvm-project] Fix typos mutli and mutliple. NFC. (#122880) 2025-01-14 11:59:41 +00:00
Acim Maravic
cc3aab580b
[AMDGPU] Handle nontemporal and amdgpu.last.use metadata in amdgpu-lower-buffer-fat-pointers (#120139) 2025-01-14 11:22:20 +01:00
Piotr Sobczak
40fa7f5e8b
[AMDGPU] Fix computed kill mask (#122736)
Replace S_XOR with S_ANDN2 when computing the kill mask in demote/kill
lowering. This has the effect of AND'ing demote/kill condition with exec
which is needed for proper live mask update.

The S_XOR is inadequate because it may return true for lane with exec=0.

This patch fixes an image corruption in game.

I think the issue went unnoticed because demote/kill condition is often
naturally dependent on exec, so AND'ing with exec is usually not
required.
2025-01-14 10:00:40 +01:00
Guy David
1a935d7a17
[llvm] Mark scavenging spill-slots as *spilled* stack objects. (#122673)
This seems like an oversight when copying code from other backends.
2025-01-14 10:18:31 +02:00
Piotr Fusik
cfe5a0847a
[RISCV] Enable Zbb ANDN/ORN/XNOR for more 64-bit constants (#122698)
This extends PR #120221 to 64-bit constants that don't match
the 12-low-bits-set pattern.
2025-01-14 09:15:14 +01:00
Piotr Fusik
87d7aebdd4 [RISCV][test] Add more 64-bit tests in zbb-logic-neg-imm.ll 2025-01-14 08:28:58 +01:00