57106 Commits

Author SHA1 Message Date
Justin Bogner
2f39d138dc
[DirectX] Handle dx.RawBuffer in DXILResourceAccess (#121725)
This adds handling for raw and structured buffers when lowering resource
access via `llvm.dx.resource.getpointer`.

Fixes #121714
2025-01-23 21:35:34 -08:00
Pradeep Kumar
435609b70c
[LLVM][NVPTX] Add support for griddepcontrol instruction (#123511)
This commit adds support for griddepcontrol PTX instruction with tests
under griddepcontrol.ll
2025-01-24 09:33:16 +05:30
Cinhi Young
6735d527f9
[MIPS] [MSA] Widen v2i8, v216 and v2i32 vectors (#123040)
- Widen v2i8, v2i16 and v2i32 vectors so they don't cast back and forth,
and make sure that instructions with correct data unit is being used.
- Handle undef indices for VSHF when lowering VECTOR_SHUFFLE (it crashes
if such index is present).
2025-01-24 11:23:34 +08:00
Jeffrey Byrnes
acb7859f07
[MachineSink] Extend loop sinking capability (#117247)
The current MIR cycle sinking capabilities are rather limited. It only
support sinking copies into a single successor block while obeying
limits.

This opt-in feature adds a more aggressive option, that is not limited
to the above concerns. The feature will try to "sink" by duplicating any
top-level preheader instruction (that we are sure is safe to sink) into
any user block, then does some dead code cleanup. In particular, this is
useful for high RP situations when loop bodies have control flow.
2025-01-23 17:08:23 -08:00
Phoebe Wang
24f177df61
[X86][AVX10.2-BF16] Update VCOMISBF16 intrinsics and instructions (#123307)
- Add `I` to intrinsics and instructions
- Add `_` before sbf16 in intrinsics

Ref.: https://cdrdv2.intel.com/v1/dl/getContent/828965
2025-01-24 08:37:29 +08:00
Sam Elliott
33c4407471
[RISCV] Support cR Inline Asm Constraint (#124174)
This denotes RVC-compatible GPR Pairs, which are used by the Zclsd
extension.

C API PR: riscv-non-isa/riscv-c-api-doc#102
2025-01-23 16:19:19 -08:00
Min-Yih Hsu
bc74a1edbe
[IA] Generalize the support for power-of-two (de)interleave intrinsics (#123863)
Previously, AArch64 used pattern matching to support
llvm.vector.(de)interleave of 2 and 4; RISC-V only supported
(de)interleave of 2.

This patch consolidates the logics in these two targets by factoring out
the common factor calculations into the InterleaveAccess Pass.
2025-01-23 15:27:51 -08:00
Michael Maitland
f5bd623d06 [RISCV][VLOPT] Rename vx to vf where appropriate in test case 2025-01-23 14:02:15 -08:00
David Green
fc952b2a69 [AArch64] Add pre-index store patterns for bf16.
These, like the postinc patterns, need adding very similarly to fp16.

Fixes #97870
2025-01-23 21:52:20 +00:00
Michael Maitland
bf258dbd57
[RISCV][VLOPT] support fp sign injection instructions (#124195) 2025-01-23 16:50:35 -05:00
Michael Maitland
f402e06e7d
[RISCV][VLOPT] Add vector fp min/max instructions to isSupportedInstr (#124196) 2025-01-23 16:47:14 -05:00
mingmingl
1688c8719f s/requires/REQUIRES to fix the test on release build 2025-01-23 12:59:42 -08:00
Craig Topper
e30a4fc3e2
[TargetLowering] Improve one signature of forceExpandWideMUL. (#123991)
We have two forceExpandWideMUL functions. One takes the low and high
half of 2 inputs and calculates the low and high half of their product.
This does not calculate the full 2x width product.

The other signature takes 2 inputs and calculates the low and high half
of their full 2x width product. Previously it did this by sign/zero
extending the inputs to create the high bits and then calling the other
function.

We can instead copy the algorithm from the other function and use the
Signed flag to determine whether we should do SRA or SRL. This avoids
the need to multiply the high part of the inputs and add them to the
high half of the result. This improves the generated code for signed
multiplication.

This should improve the performance of #123262. I don't know yet how
close we will get to gcc.
2025-01-23 12:49:35 -08:00
mingmingl
c3ecbe6792 Disable the test again.
* https://lab.llvm.org/buildbot/#/builders/127/builds/2148/steps/7/logs/stdio shows a failure.
2025-01-23 11:07:14 -08:00
Florian Hahn
0d0190815d
[TailDup] Allow large number of predecessors/successors without phis. (#116072)
This adjusts the threshold logic added in #78582 to only trigger for
cases where there are actually phis to duplicate in either TailBB or in
one of the successors.

In cases there are no phis, we only have to pay the cost of extra edges,
but have no explosion in PHI related instructions.

This improves performance of Python on some inputs by 2-3% on Apple
Silicon CPUs.

PR: https://github.com/llvm/llvm-project/pull/116072
2025-01-23 18:24:20 +00:00
Ulrich Weigand
6d5697f7cb [SystemZ] Fix ICE with i128->i64 uaddo carry chain
We can only optimize a uaddo_carry via specialized instruction
if the carry was produced by another uaddo(_carry) instruction;
there is already a check for that.

However, i128 uaddo(_carry) use a completely different mechanism;
they indicate carry in a vector register instead of the CC flag.
Thus, we must also check that we don't mix those two - that check
has been missing.

Fixes: https://github.com/llvm/llvm-project/issues/124001
2025-01-23 19:15:11 +01:00
mingmingl
3dec24d2a2 Stats are sorted before they are printed. Try fixing test failure by checking stats in its print order. 2025-01-23 10:11:11 -08:00
Craig Topper
2f6b0b4a85
[RISCV] Add SiFive sf.vqmacc tests to vmv-copy.mir. NFC (#124075)
The vqmaccu.2x8x2 test is currently being miscompiled. We need to use a
whole register move instead of vmv.v.v. The input has VL elements with
EEW=8 EMUL=4. The output has VL/4 elements with EEW=32 EMUL=4. We can't
use the original VL or input SEW for a vmv.v.v.
2025-01-23 10:03:33 -08:00
Nikita Popov
bca6dbd3a2 [X86] Add additional i128 abi test (NFC) 2025-01-23 17:34:47 +01:00
mingmingl
96410edd47 mark test as unsupported as I investigate test failure on certain environments 2025-01-23 08:07:48 -08:00
Nikita Popov
c3b40c7ea2 [X86] Regenerate test checks (NFC)
Regenerate some tests for the new vpternlog printing.
2025-01-23 16:15:04 +01:00
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Mikołaj Piróg
25653e558c
[AVX10.2] Update convert chapter intrinsic and mnemonics names (#123656)
Intel spec for avx10.2
(https://cdrdv2.intel.com/v1/dl/getContent/828965) has been updated.
This PR changes relevant names from the "AVX10 CONVERT INSTRUCTIONS"
chapter .
2025-01-23 22:23:56 +08:00
Nico Weber
99d450e9f5 Revert "[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942)"
This reverts commit 6fdaaafd89d7cbc15dafe3ebf1aa3235d148aaab.
Breaks check-llvm, see
https://github.com/llvm/llvm-project/pull/123942#issuecomment-2609861953
2025-01-23 09:19:42 -05:00
Matt Arsenault
e28e93550a
AMDGPU: Make vector_shuffle legal for v2i32 with v_pk_mov_b32 (#123684)
For VALU shuffles, this saves an instruction in some case.
2025-01-23 20:58:02 +07:00
Kareem Ergawy
ff55c9bc63
[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering (#124089)
Fixes #123800

Extends LDS lowering by allowing it to discover transitive
indirect/escpaing references to LDS GVs.

For example, given the following input:
```llvm
@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}

```

With the above input, prior to this patch, LDS lowering failed to lower
the reference to `@lds_item_to_indirectly_load` because:
1. it is indirectly called by a function whose address is taken in the
kernel.
2. we did not check if the kernel indirectly makes any calls to unknown
functions (we only checked the direct calls).

Co-authored-by: Jon Chesterfield <jonathan.chesterfield@amd.com>
2025-01-23 14:53:11 +01:00
Frederik Harwath
6fdaaafd89
[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942)
This is meant as a short-term workaround for an invalid conversion in
this pass that occurs because existing SDWA selections are not correctly
taken into account during the conversion.

See the draft PR #123221 for an attempt to fix the actual issue.

---------

Co-authored-by: Frederik Harwath <fharwath@amd.com>
2025-01-23 14:32:01 +01:00
Michael Liao
590e5e20b1 [M68k] Fix llc pass test after 3630d9ef65b30af7e4ca78e668649bbc48b5be66 2025-01-23 08:28:25 -05:00
Simon Pilgrim
90e9895a93
[X86] Handle BSF/BSR "zero-input pass through" behaviour (#123623)
Intel docs have been updated to be similar to AMD and now describe
BSF/BSR as not changing the destination register if the input value was
zero, which allows us to support CTTZ/CTLZ zero-input cases by setting
the destination to support a NumBits result (BSR is a bit messy as it
has to be XOR'd to create a CTLZ result). VIA/Zhaoxin x86_64 CPUs have also
been confirmed to match this behaviour.

This patch adjusts the X86ISD::BSF/BSR nodes to take a "pass through"
argument for zero-input cases, by default this is set to UNDEF to match
existing behaviour, but it can be set to a suitable value if supported.

There are still some limits to this - its only supported for x86_64
capable processors (and I've only enabled it for x86_64 codegen), and
Intel CPUs sometimes zero the upper 32-bits of a pass through register
when used for BSR32/BSF32 with a zero source value (i.e. the whole
64bits may not get passed through).

Fixes #122004
2025-01-23 12:59:59 +00:00
Abhilash Majumder
fa7f0e582b
[NVPTX] Add Bulk Copy Prefetch Intrinsics (#123226)
This patch adds NVVM intrinsics and NVPTX codegen for:

- cp.async.bulk.prefetch.L2.* variants 
- These intrinsics optionally support cache_hints as indicated by the
   boolean flag argument.
- Lit tests are added for all combinations of these intrinsics in
   cp-async-bulk.ll.
- The generated PTX is verified with a 12.3 ptxas executable.
- Added docs for these intrinsics in NVPTXUsage.rst file.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch


Co-authored-by: abmajumder <abmajumder@nvidia.com>
2025-01-23 16:49:44 +05:30
SivanShani-Arm
ee99c4d484
[LLVM][Clang][AArch64] Implement AArch64 build attributes (#123990)
- Added support for AArch64-specific build attributes.
- Print AArch64 build attributes to assembly.
- Emit AArch64 build attributes to ELF.

Specification: https://github.com/ARM-software/abi-aa/pull/230
2025-01-23 09:46:59 +00:00
mconst
3fb8c5b431
[X86] Fix invalid instructions on x32 with large stack frames (#124041)
`X86FrameLowering::emitSPUpdate()` assumes that 64-bit targets use a
64-bit stack pointer, but that's not true on x32.
When checking the stack pointer size, we need to look at
`Uses64BitFramePtr` rather than `Is64Bit`. This avoids generating
invalid instructions like `add esp, rcx`.

For impossibly-large stack frames (4 GiB or larger with a 32-bit stack
pointer), we were also generating invalid instructions like `mov eax,
5000000000`. The inline stack probe code already had a check for that
situation; I've moved the check into `emitSPUpdate()`, so any attempt to
allocate a 4 GiB stack frame with a 32-bit stack pointer will now trap
rather than adjusting ESP by the wrong amount. This also fixes the
"can't have 32-bit 16GB stack frame" assertion, which used to be
triggerable by user code but is now correct.

To help catch situations like this in the future, I've added
`-verify-machineinstrs` to the stack clash tests that generate large
stack frames.

This fixes the expensive-checks buildbot failure caused by #113219.
2025-01-23 12:37:07 +05:30
Heejin Ahn
c3dfd34e54
[WebAssembly] Add unreachable before catch destinations (#123915)
When `try_table`'s catch clause's destination has a return type, as in
the case of catch with a concrete tag, catch_ref, and catch_all_ref. For
example:
```wasm
block exnref
  try_table (catch_all_ref 0)
    ...
  end_try_table
end_block
... use exnref ...
```

This code is not valid because the block's body type is not exnref. So
we add an unreachable after the 'end_try_table' to make the code valid
here:
```wasm
block exnref
  try_table (catch_all_ref 0)
    ...
  end_try_table
  unreachable                    ;; Newly added
end_block
```
Because 'unreachable' is a terminator we also need to split the BB.

---

We need to handle the same thing for unwind mismatch handling. In the
code below, we create a "trampoline BB" that will be the destination for
the nested `try_table`~`end_try_table` added to fix a unwind mismatch:
```wasm
try_table (catch ... )
  block exnref
    ...
    try_table (catch_all_ref N)
      some code
    end_try_table
    ...
  end_block                      ;; Trampoline BB
  throw_ref
end_try_table
```
While the `block` added for the trampoline BB has the return type
`exnref`, its body, which contains the nested `try_table` and other
code, wouldn't have the `exnref` return type. Most times it didn't
become a problem because the block's body ended with something like `br`
or `return`, but that may not always be the case, especially when there
is a loop. So we add an `unreachable` to make the code valid here too:
```wasm
try_table (catch ... )
  block exnref
    ...
    try_table (catch_all_ref N)
      some code
    end_try_table
    ...
    unreachable                  ;; Newly added
  end_block                      ;; Trampoline BB
  throw_ref
end_try_table
```
In this case we just append the `unreachable` at the end of the layout
predecessor BB. (This was tricky to do in the first (non-mismatch) case
because there `end_try_table` and `end_block` were added in the
beginning of an EH pad in `placeTryTableMarker` and moving
`end_try_table` and the new `unreachable` to the previous BB caused
other problems.)

---

This adds many `unreaachable`s to the output, but this adds
`unreachable` to only a few places to see if this is working. The
FileCheck lines in `exception.ll` and `cfg-stackify-eh.ll` are already
heavily redacted to only leave important control-flow instructions, so I
don't think it's worth adding `unreachable`s everywhere.
2025-01-22 22:39:43 -08:00
mingmingl
5d8390d48e Temporarily disable test on Fuchsia 2025-01-22 22:33:17 -08:00
mingmingl
ea49d474fd Specify triple for llc test 2025-01-22 21:46:51 -08:00
Mingming Liu
de209fa11b
[CodeGen] Introduce Static Data Splitter pass (#122183)
https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
proposes to partition static data sections.

This patch introduces a codegen pass. This patch produces jump table
hotness in the in-memory states (machine jump table info and entries).
Target-lowering and asm-printer consume the states and produce `.hot`
section suffix. The follow up PR
https://github.com/llvm/llvm-project/pull/122215 implements such
changes.

---------

Co-authored-by: Ellis Hoag <ellis.sparky.hoag@gmail.com>
2025-01-22 21:06:46 -08:00
quic_hchandel
163935a48d
[RISCV] Add Qualcomm uC Xqcilo (Large Offset Load Store) extension (#123881)
This extension adds eight 48 bit load store instructions.

The current spec can be found at:
https://github.com/quic/riscv-unified-db/releases/latest

This patch adds assembler only support.

---------

Co-authored-by: Harsh Chandel <hchandel@qti.qualcomm.com>
2025-01-23 10:14:25 +05:30
tangaac
19834b4623
[LoongArch] Support sc.q instruction for 128bit cmpxchg operation (#116771)
Two options for clang
  -mno-scq:                Disable sc.q instruction.
  -mscq:                   Enable sc.q instruction.
The default is -mno-scq.
2025-01-23 12:11:07 +08:00
Akshay Deodhar
892a804d93
[NVPTX] Stop using 16-bit CAS instructions from PTX (#120220)
Increases minimum CAS size from 16 bit to 32 bit, for better SASS
codegen.

When atomics are emulated using atom.cas.b16, the SASS generated
includes 2 (nested) emulation loops. When emulated using an atom.cas.b32
loop, the SASS too has a single emulation loop. Using 32 bit CAS thus
results in better codegen.
2025-01-22 19:37:11 -08:00
Finn Plummer
0fe8e70c66
Revert "Reland "[HLSL] Implement the reflect HLSL function"" (#124046)
Reverts llvm/llvm-project#123853

The introduction of `reflect-error.ll` surfaced a bug with the use of
`report_fatal_error` in `SPIRVInstructionSelector` that was propagated
into the pr. This has caused a build-bot breakage, and the work to solve
the underlying issue is tracked here:
https://github.com/llvm/llvm-project/issues/124045. We can re-apply this
commit when the underlying issue is resolved.
2025-01-22 18:22:03 -08:00
Hua Tian
a9d2834508
[llvm][CodeGen] Fix the issue caused by live interval checking in window scheduler (#123184)
At some corner cases, the cloned MI still retains an old slot index,
which leads to the compiler crashing. This patch update the slot index
map before delete the recycled MI.

https://github.com/llvm/llvm-project/issues/123165
2025-01-23 09:39:03 +08:00
Craig Topper
96dbd0006c [RISCV] Re-generate test checks so we pick up implicit on whole register moves. NFC 2025-01-22 16:11:43 -08:00
Deric Cheung
2656928d0c
Reland "[HLSL] Implement the reflect HLSL function" (#123853)
This PR relands
[#122992](https://github.com/llvm/llvm-project/pull/122992).

Some machines were failing to run the `reflect-error.ll` test due to the
RUN lines
```llvm
; RUN: not %if spirv-tools %{ llc -O0 -mtriple=spirv64-unknown-unknown %s -o /dev/null 2>&1 -filetype=obj %}
; RUN: not %if spirv-tools %{ llc -O0 -mtriple=spirv32-unknown-unknown %s -o /dev/null 2>&1 -filetype=obj %}
```
which failed when `spirv-tools` was not present on the machine due to
running the command `not` without any arguments.

These RUN lines have been removed since they don't actually test
anything new compared to the other two RUN lines due to the expected
error during instruction selection.
```llvm
; RUN: not llc -verify-machineinstrs -O0 -mtriple=spirv64-unknown-unknown %s -o /dev/null 2>&1 | FileCheck %s
; RUN: not llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown %s -o /dev/null 2>&1 | FileCheck %s
```
2025-01-22 13:29:19 -08:00
Michael Maitland
1687aa2a99
[RISCV][VLOPT] Don't reduce the VL is the same as CommonVL (#123878)
This fixes the slowdown in #123862.
2025-01-22 13:49:54 -05:00
Stefan Pintilie
340706f311
[PowerPC] Fix saving of Link Register when using ROP Protect (#123101)
An optimization was added that tries to move the uses of the mflr
instruction away from the instruction itself. However, this doesn't work
when we are using the hashst instruction because that instruction needs
to be run before the stack frame is obtained.

This patch disables moving instructions away from the mflr in the case
where ROP protection is being used.

---------

Co-authored-by: Lei Huang <lei@ca.ibm.com>
2025-01-22 13:44:20 -05:00
Kazu Hirata
b40739a6e9 Revert "[LLVM][Clang][AArch64] Implement AArch64 build attributes (#118771)"
This reverts commit d7fb4a275c98f4035d1083b5eb3edd2ffb2da00e.

Buildbots failing:
https://lab.llvm.org/buildbot/#/builders/169/builds/7671
https://lab.llvm.org/buildbot/#/builders/65/builds/11046
2025-01-22 10:12:27 -08:00
Simon Pilgrim
44f3168110 [X86] vector reduction tests - regenerate VPTERNLOG comments 2025-01-22 17:23:37 +00:00
Simon Pilgrim
a25f2cb3e6 [X86] vector rotate tests - regenerate VPTERNLOG comments 2025-01-22 17:23:37 +00:00
Simon Pilgrim
bb754f2c98 [X86] avx512 intrinsics tests - regenerate VPTERNLOG comments 2025-01-22 17:23:37 +00:00
Simon Pilgrim
e6c7d6a56a [X86] avx512-broadcast-unfold.ll - regenerate VPTERNLOG comments 2025-01-22 17:23:37 +00:00