82186 Commits

Author SHA1 Message Date
Vyacheslav Levytskyy
fe7cb15606
[SPIR-V] Improve portability of the code (#123584)
Adding SPIRV to LLVM_ALL_TARGETS
(https://github.com/llvm/llvm-project/pull/119653) revealed a series of
minor compilation problems and sanitizer complaints. This PR addresses
those problems.
2025-01-20 12:05:15 +01:00
Akshat Oke
96c4f978d0
[AMDGPU][NewPM] Port SIOptimizeExecMasking to NPM (#123572) 2025-01-20 16:34:01 +05:30
yingopq
754ed95b66
[Mips] Fix compiler crash when returning fp128 after calling a function returning { i8, i128 } (#117525)

Fixes https://github.com/llvm/llvm-project/issues/96432.
2025-01-20 16:47:40 +08:00
ZhaoQi
333562e7ec
[LoongArch] Avoid compilation warning. NFC (#123553)
Avoid `warning: enumerated mismatch in conditional expression:
'llvm::LoongArchISD::NodeType' vs 'llvm::ISD::NodeType'` while compiling
`LoongArchISelLowering.cpp`.
2025-01-20 16:37:27 +08:00
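For context, a minimal standalone reduction (placeholder enums, not the real LLVM ones) of the kind of conditional expression that triggers this warning, together with the usual cast-based fix:

```cpp
// Mixing two distinct enum types in the arms of ?: forces the compiler to
// pick a common type and, on some hosts, emit the "enumerated mismatch in
// conditional expression" warning quoted above.
namespace ISD { enum NodeType { ADD, SUB }; }
namespace LoongArchISD { enum NodeType { FIRST_TARGET_OPCODE = 1000, REVB_2W }; }

unsigned pickOpcode(bool UseTargetNode) {
  // Warns: enumerated mismatch in conditional expression.
  // return UseTargetNode ? LoongArchISD::REVB_2W : ISD::ADD;

  // Fix: cast so both arms of the conditional have the same type.
  return UseTargetNode ? static_cast<unsigned>(LoongArchISD::REVB_2W)
                       : static_cast<unsigned>(ISD::ADD);
}
```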
ZhaoQi
ca4886bf96
[LoongArch] Impl TTI hooks for LoongArch to support LoopDataPrefetch pass (#118437)
Inspired by https://reviews.llvm.org/D146600, this commit adds
some TTI hooks for LoongArch to make the LoopDataPrefetch pass
actually work, including:

- `getCacheLineSize()`: 64 for loongarch64.
- `getPrefetchDistance()`: after testing SPEC CPU 2017, the improvements
from prefetching are most noticeable when PrefetchDistance is set to
200 (results shown below), although different benchmarks favor different
values.
- `enableWritePrefetching()`: store prefetch is supported by LoongArch,
so WritePrefetching is enabled by default.
- `getMinPrefetchStride()` and `getMaxPrefetchIterationsAhead()` keep
their default values (1 and UINT_MAX), so they are not overridden.

(A schematic sketch of these hook values follows this entry.)

After this commit, the test added by https://reviews.llvm.org/D146600
can generate llvm.prefetch intrinsic IR correctly.

Results of spec2017rate benchmarks (test data: ref, copies: 1):
- For all C/C++ benchmarks, compared to O3+novec/lsx/lasx, prefetching
brings about -1.58%/0.31%/0.07% performance improvement for the int
benchmarks and 3.26%/3.73%/3.78% improvement for the floating point
benchmarks. (Only O3+novec+prefetch regresses when testing intrate.)
- But prefetching reduces performance for almost every Fortran
benchmark compiled by flang. Considering all C/C++/Fortran benchmarks
together, prefetching decreases performance by about 1%-5%.

FIXME: Keep the `loongarch-enable-loop-data-prefetch` option defaulted to
false for now due to the negative effect on Fortran.
2025-01-20 16:20:15 +08:00
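A rough, self-contained sketch of what the hook values described above amount to (schematic only; it mirrors the hook names from the commit message rather than reproducing the actual LoongArch TTI implementation):

```cpp
#include <climits>

// Schematic stand-in for the TTI hooks described in the commit above; the
// real hooks live in the LoongArch TTI implementation and return these
// values to drive the LoopDataPrefetch pass.
struct LoongArchPrefetchParams {
  unsigned getCacheLineSize() const { return 64; }       // loongarch64 cache line
  unsigned getPrefetchDistance() const { return 200; }   // tuned on SPEC CPU 2017
  bool enableWritePrefetching() const { return true; }   // store prefetch supported
  unsigned getMinPrefetchStride() const { return 1; }    // default kept
  unsigned getMaxPrefetchIterationsAhead() const { return UINT_MAX; } // default kept
};
```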
ZhaoQi
84220eccb6
[LoongArch] Add generation support for preld instruction (#118436)
The `preld` instruction prefetches one cache line of data from
memory into the cache in advance.

This commit allows it to be generated automatically.
2025-01-20 16:11:09 +08:00
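As a usage illustration (assumed example, not taken from the patch): an explicit `__builtin_prefetch` hint produces an `llvm.prefetch` intrinsic, which this change allows to be selected to `preld` when targeting LoongArch.

```cpp
// Illustrative only: a software prefetch hint a few iterations ahead. The
// resulting llvm.prefetch intrinsic can now be lowered to `preld`.
void sum_with_hint(const int *data, int n, long long *out) {
  long long acc = 0;
  for (int i = 0; i < n; ++i) {
    __builtin_prefetch(&data[i + 64], /*rw=*/0, /*locality=*/3); // read prefetch
    acc += data[i];
  }
  *out = acc;
}
```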
Hervé Poussineau
be68f35bf5
[MC][CodeGen][Mips] Add CodeView mapping (#120877)
Also add support for new relocation types required by debug information.

Constants have been taken from the CodeView Symbolic Debug Information
Specification.
2025-01-20 15:00:24 +08:00
Guy David
295d1c361e
[AArch64] apple-m4 & apple-a15 have ADRP+ADD fusion (#123504) 2025-01-20 08:18:56 +02:00
ZhaoQi
0288d065ee
[LoongArch] Avoid scheduling relaxable code sequence and attach relax relocs (#121330)
If linker relaxation is enabled, relaxable code sequences expanded from
pseudos should not be separated by instruction scheduling. This commit
marks them as scheduling boundaries so they are not scheduled apart.
(Except for `tls_le/tls_ie` and `call36/tail36`: `tls_le/tls_ie` can be
scheduled without affecting relaxation, and `call36/tail36` are expanded
later, in the `LoongArchExpandPseudo` pass.)

A new mask target-flag is added to attach relax relocs to the relaxable
code sequences. (There is no need to add it for `tls_le` and
`call36/tail36` because relax relocs can simply be added for them
according to their relocations. But for other code sequences, such as
`PCALA_{HI20/LO12}`, the mask flag must be used, mainly because
relaxation should not be applied when the code model is large.)

Because of the new mask target-flag, getting the "direct" flags is
necessary when using these target-flags. In addition, a code sequence
may no longer be relaxable after being optimized by the
`MergeBaseOffset` pass, so the relax "bitmask" flag should be removed
there.
2025-01-20 10:00:05 +08:00
Patryk Wychowaniec
814b34f31e
[AVR] Force relocations for non-encodable jumps (#121498)
This commit changes the branch emission logic so that instead of
emitting the "branch target out of range" error, we emit a relocation.
2025-01-20 09:23:57 +08:00
Simon Pilgrim
6adeda8f55
[X86] combinePTESTCC - fold PTESTC(PCMPEQ(X,0),-1) == PTESTZ(X,X) (#123466)
Simplifies the hidden "all_of(X == 0)" pattern

Fixes #123456
2025-01-19 13:09:17 +00:00
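To make the folded pattern concrete, here is a hedged intrinsics-level sketch of the equivalence (illustrative C++ using SSE4.1 intrinsics, not the DAG combine itself):

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_testc_si128 / _mm_testz_si128 (compile with -msse4.1)

// "all_of(X == 0)" written the long way: compare every lane of X against zero
// and check that the comparison mask is all-ones, i.e. PTESTC(PCMPEQ(X,0), -1).
bool allZeroLong(__m128i X) {
  __m128i EqZero = _mm_cmpeq_epi32(X, _mm_setzero_si128());
  return _mm_testc_si128(EqZero, _mm_set1_epi32(-1));
}

// The folded form: X has no set bits at all, i.e. PTESTZ(X, X).
bool allZeroShort(__m128i X) {
  return _mm_testz_si128(X, X);
}
```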
Craig Topper
4a486e773e [CodeGen] Use Register/MCRegister::isPhysical. NFC 2025-01-18 23:37:03 -08:00
Carl Ritson
f811482a74
[AMDGPU] SIWholeQuadMode: Ensure earliest WQM entry point for PS (#123266)
Ensure shaders running WQM (PS) enter at the earliest point irrespective
of WQM marking.
2025-01-19 15:50:33 +09:00
ssijaric-nv
6789442eb2
[AArch64] Fix a corner case with large stack allocation (#122038)
In the unlikely case where the stack size is greater than 4GB, we may run into
the situation where the local stack size and the callee saved registers stack
size get combined incorrectly when restoring the callee saved registers. This
happens because the stack size in shouldCombineCSRLocalStackBumpInEpilogue
is represented as an 'unsigned', but is passed in as an 'int64_t'. We end up with
something like

$fp, $lr = frame-destroy LDPXi $sp, 536870912

This change just makes 'shouldCombineCSRLocalStackBumpInEpilogue' match
'shouldCombineCSRLocalStackBump', where 'StackBumpBytes' is a 'uint64_t'.
2025-01-18 22:09:25 -08:00
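A small standalone illustration of the underlying truncation (hypothetical numbers and threshold, not the actual frame-lowering code): passing an `int64_t` stack size through an `unsigned` parameter silently drops the high bits once the size exceeds 4GB, so the combining decision is made on a bogus "small" value.

```cpp
#include <cstdint>
#include <cstdio>

// Mimics the old signature: the 64-bit stack size is truncated to 32 bits.
bool shouldCombineTruncated(unsigned StackBumpBytes) {
  return StackBumpBytes < 512; // decision made on the truncated value (illustrative threshold)
}

int main() {
  int64_t StackSizeInBytes = (1LL << 32) + 496; // just over 4GB of local stack
  // Only the low 32 bits survive the conversion, so the check sees 496 and
  // wrongly decides the bump is small enough to combine.
  std::printf("truncated = %u, combine = %d\n",
              static_cast<unsigned>(StackSizeInBytes),
              shouldCombineTruncated(static_cast<unsigned>(StackSizeInBytes)));
  return 0;
}
```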
Simon Pilgrim
33f9d839ef [X86] X86FixupVectorConstants - split ConvertToBroadcastAVX512 helper to handle single bitwidth at a time.
Attempt 32-bit broadcasts first, and then fall back to 64-bit broadcasts on failure.

We lose an explicit assertion for matching operand numbers but X86InstrFoldTables already does something similar.

Pulled out of WIP patch #73509
2025-01-18 14:08:48 +00:00
David Green
d666616804
[AArch64] Fold swapped sub/SUBS conditions (#121412)
This fold already exists in a couple of places (DAG and CGP), where an
icmp's operands are swapped to allow CSE with a sub. They do not handle
constants though. This patch adds an AArch64 version that can be more
precise.
2025-01-18 12:12:51 +00:00
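A hedged source-level illustration of the kind of code that benefits (assumed example, not taken from the patch): the comparison uses the operands in the opposite order from the subtraction, and swapping the predicate lets the compare reuse the flags set by SUBS instead of emitting a separate CMP.

```cpp
// `a - b` can lower to SUBS on AArch64, which also sets the NZCV flags.
// The condition `b < a` is the swapped form of `a > b`; rewriting it lets
// the comparison share the SUBS result rather than needing its own CMP.
int clampedDiff(int a, int b) {
  int d = a - b;
  if (b < a)      // same information as the flags produced by a - b
    return d;
  return 0;
}
```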
Simon Pilgrim
82be3adcff
[X86] Consistently use getVectorIdxConstant for element/subvector extract/insertion nodes. NFC. (#123312)
Avoid the use of getIntPtrConstant for anything other than address pointer related code.

Noticed while trying to use getVectorIdxConstant as a breakpoint.
2025-01-18 11:47:03 +00:00
Simon Pilgrim
26c9be2b8d
[X86] Only call combineBitcastToBoolVector after legalization (#123386)
Prevents an infinite loop between combineBitcastToBoolVector and hoistLogicOpWithSameOpcodeHands, which only performs the "logicop(bitcast(A),bitcast(B)) -> bitcast(logicop(A,B))" fold up to type legalization.

combineBitcastToBoolVector doesn't care much as it's mainly for AVX512 cleanup that X86DomainReassignment can't handle for us.

Fixes #123333
2025-01-18 11:44:46 +00:00
Simon Pilgrim
67c3f2b430
[X86] mayFoldIntoStore - peek through oneuse bitcast users to find a store node (#123366)
mayFoldIntoStore currently just checks the direct (one-use) user of an
SDValue to see whether it is stored, which prevents cases where we
bitcast the value prior to storing (usually the bitcast will be removed
later).

This patch peeks up through a chain of one-use BITCAST nodes to see if
the value is eventually stored.

The main use of mayFoldIntoStore is v8i16 EXTRACT_VECTOR_ELT lowering
which will only use PEXTRW/PEXTRB for index0 extractions (vs the faster
MOVD) if the extracted value will be folded into a store on SSE41+
targets.

Fixes #107086
2025-01-18 14:08:24 +05:30
Kristof Beyls
10cfd54e6a
[AArch64] Correct defs and uses on {PAC,AUT}I{A,B}171615 (#123354)
I'm not adding tests for this, as I don't think we usually have tests to
verify correct description of defs and uses in instructions?

This fix will be tested when #122304 lands, as one of the regression
tests in that PR fails without this fix.
2025-01-18 07:56:15 +01:00
Philip Reames
143c33c6df
[RISCV] Consider only legally typed splats to be legal shuffles (#123415)
Given the comment, I'd expected test coverage. There was none, so let's
do the simple thing, which benefits the one case we have tests for.
2025-01-17 19:13:04 -08:00
Craig Topper
0c6e03eea0
[RISCV] Fold vp.store(vp.reverse(VAL), ADDR, MASK) -> vp.strided.store(VAL, NEW_ADDR, -1, MASK) (#123123)
Co-authored-by: Brandon Wu <brandon.wu@sifive.com>
2025-01-17 14:22:25 -08:00
Farzon Lotfi
eddeb36cf1
[SPIRV] add pre legalization instruction combine (#122839)
- Add the boilerplate to support instcombine in SPIRV
- instcombine `length(X-Y)` to `distance(X,Y)`
- switch HLSL's distance intrinsic to not special-case SPIRV
- fixes #122766
- In this RFC we were asked to add the infrastructure for pattern matching:
https://discourse.llvm.org/t/rfc-add-targetbuiltins-for-spirv-to-support-hlsl/83329/13
2025-01-17 14:46:14 -05:00
Shubham Sandeep Rastogi
ee1c852252
[DebugInfo][InstrRef] Treat ORRWrr as a copy instr (#123102)
The instruction selector uses the `MachineFunction::copySalvageSSA`
function to insert `DBG_PHIs` or identify a defining instruction for a
copy-like instruction when finalizing Instruction References.

AArch64 has the ORR instruction, a logical OR with two variants: ORRWrr,
the register-to-register variant, and ORRWrs, the
register-to-shifted-register variant.

An ORRWrs where the shift amount is 0 and the zero register ($wzr) is
used is considered a copy, for example:

`$w0 = ORRWrs $wzr, killed $w3, 0`

However, an ORRWrr with a zero register is not considered a copy:

`$w0 = ORRWrr $wzr, killed $w3`

This causes an issue in the livedebugvalues pass: at AArch64 ISel time
the instruction is the ORRWrr variant, but it is changed to the ORRWrs
variant before the livedebugvalues pass runs.

The resulting mismatch between the two passes leads to a crash in
livedebugvalues.

This patch fixes the issue.
2025-01-17 09:27:36 -08:00
Steven Perron
4b692a95d1
[SPIRV] Expand RWBuffer load and store from HLSL (#122355)
The code pattern that clang generates for HLSL has changed from the
original plan. This patch allows the SPIR-V backend to handle the code
that clang currently generates.

It looks for patterns of the form:

```
%1 = @llvm.spv.resource.handlefrombinding
%2 = @llvm.spv.resource.getpointer(%1, index)
load/store %2
```

These three LLVM IR instructions are treated as a single unit that will:

1. Generate or find the global variable identified by the call to
   `resource.handlefrombinding`.
2. Generate an OpLoad of the variable to get the handle to the image.
3. Generate an OpImageRead or OpImageWrite using that handle with the
   given index.

This will generate the OpLoad in the same BB as the read/write.

Note: Now that `resource.handlefrombinding` is not processed on its own,
many existing tests had to be removed. We do not have intrinsics that
are able to use handles to sampled images, input attachments, etc., so
we cannot generate the load of the handle. These tests are removed for
now, and will be added when those resource types are fully implemented.
2025-01-17 12:22:28 -05:00
Cullen Rhodes
f719771f25
Revert "[AArch64] Combine and and lsl into ubfiz" (#123356)
Reverts llvm/llvm-project#118974
2025-01-17 16:53:33 +00:00
Simon Pilgrim
76569025dd
[X86] Fold (v4i32 (scalar_to_vector (i32 (anyext (bitcast (f16)))))) -> (v4i32 bitcast (v8f16 scalar_to_vector)) (#123338)
This pattern tends to appear during f16 -> f32 promotion

Partially addresses the unnecessary XMM->GPR->XMM moves when working with f16 types (#107086)
2025-01-17 14:46:22 +00:00
Brox Chen
a18f4bdb18
[AMDGPU][True16][MC] true16 for v_cmpx_lt_f16 (#122936)
True16 format for v_cmpx_lt_f16. Update VOPCX t16 and fake16 pseudo.
2025-01-17 09:38:52 -05:00
Brox Chen
703e9e97d9
[AMDGPU][True16][CodeGen] true16 codegen for bswap (#122849)
true16 codegen pattern for bswap
2025-01-17 09:36:55 -05:00
Phoebe Wang
48803bc8c7
[X86][AMX-AVX512][NFC] Remove P from intrinsic and instruction name (#123270)
Ref.: https://cdrdv2.intel.com/v1/dl/getContent/828965
2025-01-17 22:21:19 +08:00
Wesley Wiser
41f430a48d
[X86] Don't fold very large offsets into addr displacements during ISel (#121678)
Doing so can cause the resulting displacement after frame layout to
become inexpressible (or cause over/underflow currently during frame
layout).

Fixes the error reported in
https://github.com/llvm/llvm-project/pull/101840#issuecomment-2306975944.
2025-01-17 20:09:00 +07:00
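A rough illustration of the constraint involved (assumed numbers; the real offsets come from frame layout): x86 memory operands encode a signed 32-bit displacement, so an address-mode fold that is representable during ISel can become unrepresentable once the frame-layout offset is added.

```cpp
#include <cstdint>
#include <cstdio>

// Signed 32-bit displacement range used by x86 addressing modes.
bool fitsInDisp32(int64_t Disp) {
  return Disp >= INT32_MIN && Disp <= INT32_MAX;
}

int main() {
  int64_t FoldedOffset = 0x7fffff00;   // huge constant offset folded during ISel
  int64_t FrameAdjustment = 0x1000;    // added later by frame layout
  int64_t Final = FoldedOffset + FrameAdjustment;
  std::printf("fits before layout: %d, fits after layout: %d\n",
              fitsInDisp32(FoldedOffset), fitsInDisp32(Final));
  return 0;
}
```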
Simon Pilgrim
a864906772 [X86] Fix logical operator warnings. NFC. 2025-01-17 11:55:22 +00:00
Stanislav Mekhanoshin
21704a685d
[AMDGPU] Fix printing hasInitWholeWave in mir (#123232) 2025-01-17 03:00:02 -08:00
Simon Pilgrim
ad282f4c1f [X86] Rename combineScalarToVector to combineSCALAR_TO_VECTOR. NFC.
Match the file style of using the ISD NodeType name for the combine/lower method name.
2025-01-17 10:50:13 +00:00
Alexandros Lamprineas
831527a5ef
[FMV][GlobalOpt] Statically resolve calls to versioned functions. (#87939)
To deduce whether the optimization is legal we need to compare the target
features between caller and callee versions. The criteria for bypassing
the resolver are the following:

 * If the callee's feature set is a subset of the caller's feature set,
   then the callee is a candidate for direct call.

 * Among such candidates the one of highest priority is the best match
   and it shall be picked, unless there is a version of the callee with
   higher priority than the best match which cannot be picked from a
   higher priority caller (directly or through the resolver).

 * For every higher priority callee version than the best match, there
   is a higher priority caller version whose feature set availability
   is implied by the callee's feature set.

Example:

Callers and Callees are ordered in decreasing priority.
The arrows indicate successful call redirections.

  Caller        Callee      Explanation
=========================================================================
mops+sve2 --+--> mops       all the callee versions are subsets of the
            |               caller but mops has the highest priority
            |
     mops --+    sve2       between mops and default callees, mops wins

      sve        sve        between sve and default callees, sve wins
                            but sve2 does not have a high priority caller

  default -----> default    sve (callee) implies sve (caller),
                            sve2(callee) implies sve (caller),
                            mops(callee) implies mops(caller)
2025-01-17 10:49:43 +00:00
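To ground the discussion, a small hedged example of AArch64 function multi-versioning source (illustrative names, using clang's `target_version` attribute): a caller whose own version implies the needed features can have its call bound directly instead of going through the ifunc resolver.

```cpp
// Two versions of the same callee, normally selected at load time via a
// resolver unless the optimization described above binds the call statically.
__attribute__((target_version("sve2"))) int kernel(int x) { return x * 2; }
__attribute__((target_version("default"))) int kernel(int x) { return x + x; }

// The sve2 caller is only selected when sve2 is available, so GlobalOpt can
// redirect its call straight to the sve2 callee, bypassing the resolver.
__attribute__((target_version("sve2"))) int run(int x) { return kernel(x); }
__attribute__((target_version("default"))) int run(int x) { return kernel(x); }
```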
Benjamin Maxwell
32a4650f3c
[AArch64] Avoid hardcoding spill size/align in FrameLowering (NFC) (#123080)
This is already defined for each register class in AArch64RegisterInfo;
not hardcoding it here makes these values easier to change (perhaps
based on hardware mode).
2025-01-17 10:10:21 +00:00
Benjamin Maxwell
2c9dc089fd
[AArch64] Use spill size when calculating callee saves size (NFC) (#123086)
This is NFC right now, as all register and spill sizes are currently
the same, but the spill size is the correct size to use here.
2025-01-17 10:09:31 +00:00
Phoebe Wang
fbb9d49506
[X86][APX] Support APX + AMX-MOVRS/AMX-TRANSPOSE (#123267)
Ref.: https://cdrdv2.intel.com/v1/dl/getContent/784266
2025-01-17 17:51:42 +08:00
ZhaoQi
31b62e2d3d
[LoongArch] Add relax relocations for tls_le code sequence (#121329)
This commit adds relax relocations for the `tls_le` code sequence.
Both handwritten assembly and code generated by clang are affected.

The scheduled `tls_le` code sequence can be relaxed normally, and we can
add relax relocs during code emission according to its relocations.
Other relaxable macros' code sequences cannot simply get relax relocs
from their relocations; for example, for `PCALA_{HI20/LO12}` we do not
want to add relax relocs when the code model is large. This will be
implemented in a later commit.
2025-01-17 17:30:57 +08:00
ZhaoQi
89e3a649f2
[LoongArch] Emit R_LARCH_RELAX when expanding some macros (#120067)
Emit `R_LARCH_RELAX` relocations when expanding some macros, including:

- `la.tls.ie`, `la.tls.ld`, `la.tls.gd`, `la.tls.desc`,
- `call36`, `tail36`.

Other macros that need to emit `R_LARCH_RELAX` relocations were
implemented in https://github.com/llvm/llvm-project/pull/72961, including:

- `la.local`, `la.pcrel`, `la.pcrel` expanded as `la.abs`, `la`,
`la.global`, `la/la.global` expanded as `la.pcrel`, `la.got`.

Note: the `la.tls.le` macro can be relaxed when expanded with
`R_LARCH_TLS_LE_{HI20/ADD/LO12}_R` relocations. But if we did so,
previously handwritten assembly code would break due to the redundant
`add.{w/d}` following `la.tls.le`. So `la.tls.le` keeps expanding with
`R_LARCH_TLS_LE_{HI20/LO12}`.
2025-01-17 17:29:22 +08:00
Will Froom
c8ba551da1
[AArch64] Return early rather than asserting when Size of value passed to targetShrinkDemandedConstant is not 32 or 64 (#123084)
See https://github.com/llvm/llvm-project/issues/123029 for details.
2025-01-17 08:41:33 +00:00
Phoebe Wang
1274bca2ad
[X86][APX] Support APX + MOVRS (#123264)
Ref.: https://cdrdv2.intel.com/v1/dl/getContent/784266
2025-01-17 16:06:31 +08:00
Kazu Hirata
bfb6bb69fd [AMDGPU] Fix a warning
This patch fixes:

  llvm/lib/Target/AMDGPU/SIISelLowering.cpp:13908:46: error:
  comparison of integers of different signs: 'uint32_t' (aka 'unsigned
  int') and 'int' [-Werror,-Wsign-compare]
2025-01-16 22:40:08 -08:00
Vikram Hegde
225fc4f356
[AMDGPU][SDAG] Try folding "lshr i64 + mad" to "mad_u64_u32" (#119218)
The intention is to use a "copy" instead of a "sub" to handle the high
parts of 64-bit multiply for this specific case.

This unlocks copy prop use cases where the copy can be reused by later
multiply+add sequences if possible.

Fixes: SWDEV-487672, SWDEV-487669
2025-01-17 11:09:39 +05:30
Matt Arsenault
ca95519704
AMDGPU: Implement isExtractVecEltCheap (#122460)
Once again we have excessive TLI hooks with bad defaults. Permit this
for 32-bit element vectors, which are just use-different-register.
We should permit 16-bit vectors as cheap with legal packed instructions,
but I see some mixed improvements and regressions that need investigation.
2025-01-17 08:38:01 +07:00
Luke Lau
a761e26b23
[RISCV] Allow non-loop invariant steps in RISCVGatherScatterLowering (#122244)
The motivation for this is to allow us to match strided accesses that
are emitted from the loop vectorizer with EVL tail folding (see #122232)

In these loops the step isn't loop invariant and is based off of
@llvm.experimental.get.vector.length.

We can relax this as long as we make sure to construct the updates after
the definition inside the loop, instead of the preheader.

I presume the restriction was previously added so that the step would
dominate the insertion point in the preheader. I can't think of why it
wouldn't be safe to calculate it in the loop otherwise.
2025-01-17 08:58:56 +08:00
Philip Reames
bb6e94a05d
[RISCV] Custom legalize <N x i128>, <4 x i256>, etc.. shuffles (#122352)
I have a particular user downstream who likes to write shuffles in terms
of unions involving _BitInt(128) types. This isn't completely crazy
because there's a bunch of code in the wild which was written with SSE
in mind, so 128 bits is a common data fragment size.

The problem is that generic lowering scalarizes this to ELEN, and we end
up with really terrible extract/insert sequences if the i128 shuffle is
between other (non-i128) operations.

I explored trying to do this via generic lowering infrastructure, and
frankly got lost. Doing this as a target-specific DAG lowering is a bit
ugly - really, there's nothing hugely target-specific here - but oh well. If
reviewers prefer, I could probably phrase this as a generic DAG combine,
but I'm not sure that's hugely better. If reviewers have a strong
preference on how to handle this, let me know, but I may need a bit of
help.

A couple notes:

* The argument passing weirdness is due to a missing combine to turn a
build_vector of adjacent i64 loads back into a vector load. I'm a bit
surprised we don't get that, but the isel output clearly has the
build_vector at i64.
* The splat case I plan to revisit in another patch. That's a relatively
common pattern, and the fact I have to scalarize that to avoid an
infinite loop is non-ideal.
2025-01-16 14:55:45 -08:00
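For reference, a hedged sketch of the kind of user code being described (assumed shape, not the downstream user's actual source): a shuffle expressed through a union of `_BitInt(128)` chunks, which generic lowering would otherwise scalarize to ELEN-sized pieces.

```cpp
// Clang supports _BitInt in C++ as an extension; this mirrors SSE-era code
// that treats 128 bits as the natural data fragment size.
typedef union {
  unsigned _BitInt(128) chunks[2];
  int lanes[8];
} Vec256;

// Swap the two 128-bit halves; with this patch the <2 x i128>-style shuffle
// is custom-legalized instead of being scalarized element by element.
Vec256 swapHalves(Vec256 v) {
  Vec256 r;
  r.chunks[0] = v.chunks[1];
  r.chunks[1] = v.chunks[0];
  return r;
}
```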
Brox Chen
8a0c2e7567
[AMDGPU][True16][MC][CodeGen] true16 for v_cndmask_b16 (#119736)
Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and
fake16 flow.

Since we are replacing `v_cndmask_b16` with `v_cndmask_b16_t16/fake16`, we
have to at least update the fake16 CodeGen to get the CodeGen tests passing.
For this case, we have to update true16 and fake16 together, otherwise
some of the true16 tests will fail.
2025-01-16 17:18:28 -05:00
Princeton Ferro
3ba339b5e7
[NVPTX] Improve support for {ex2,lg2}.approx (#120519)
- Add support for `@llvm.exp2()`:
  - LLVM: `float`        -> PTX: `ex2.approx{.ftz}.f32`
  - LLVM: `half`         -> PTX: `ex2.approx.f16`
  - LLVM: `<2 x half>`   -> PTX: `ex2.approx.f16x2`
  - LLVM: `bfloat`       -> PTX: `ex2.approx.ftz.bf16`
  - LLVM: `<2 x bfloat>` -> PTX: `ex2.approx.ftz.bf16x2`
  - Any operations with non-native vector widths are expanded. On
    targets not supporting f16/bf16, values are promoted to f32.

- Add *CONDITIONAL* support for `@llvm.log2()` [^1]:
  - LLVM: `float` -> PTX: `lg2.approx{.ftz}.f32`
  - Support for f16/bf16 is emulated by promoting values to f32.

[^1]: CUDA implements `exp2()` with `ex2.approx` but `log2()` is
implemented differently, so this is off by default. To enable, use the
flag `-nvptx-approx-log2f32`.
2025-01-16 12:21:32 -08:00
Raphael Moreira Zinsly
01d7f434d2
[RISCV] Stack clash protection for dynamic alloca (#122508)
Create a probe loop for dynamic allocation and add the corresponding
SelectionDAG support in order to use it.
2025-01-16 11:58:42 -08:00