llvm-project

Author	SHA1	Message	Date
Justin Bogner	2f39d138dc	[DirectX] Handle dx.RawBuffer in DXILResourceAccess (#121725 ) This adds handling for raw and structured buffers when lowering resource access via `llvm.dx.resource.getpointer`. Fixes #121714	2025-01-23 21:35:34 -08:00
Pradeep Kumar	435609b70c	[LLVM][NVPTX] Add support for griddepcontrol instruction (#123511 ) This commit adds support for griddepcontrol PTX instruction with tests under griddepcontrol.ll	2025-01-24 09:33:16 +05:30
Cinhi Young	6735d527f9	[MIPS] [MSA] Widen v2i8, v2i16 and v2i32 vectors (#123040 ) - Widen v2i8, v2i16 and v2i32 vectors so they don't cast back and forth, and make sure that instructions with the correct data unit are used. - Handle undef indices for VSHF when lowering VECTOR_SHUFFLE (lowering crashed if such an index was present).	2025-01-24 11:23:34 +08:00
Jeffrey Byrnes	acb7859f07	[MachineSink] Extend loop sinking capability (#117247 ) The current MIR cycle sinking capabilities are rather limited: they only support sinking copies into a single successor block while obeying limits. This opt-in feature adds a more aggressive option that is not subject to those limitations. The feature will try to "sink" by duplicating any top-level preheader instruction (that we are sure is safe to sink) into any user block, then does some dead code cleanup. In particular, this is useful in high register pressure (RP) situations when loop bodies have control flow.	2025-01-23 17:08:23 -08:00
Phoebe Wang	24f177df61	[X86][AVX10.2-BF16] Update VCOMISBF16 intrinsics and instructions (#123307 ) - Add `I` to intrinsics and instructions - Add `_` before sbf16 in intrinsics Ref.: https://cdrdv2.intel.com/v1/dl/getContent/828965	2025-01-24 08:37:29 +08:00
Sam Elliott	33c4407471	[RISCV] Support cR Inline Asm Constraint (#124174 ) This denotes RVC-compatible GPR Pairs, which are used by the Zclsd extension. C API PR: riscv-non-isa/riscv-c-api-doc#102	2025-01-23 16:19:19 -08:00
Min-Yih Hsu	bc74a1edbe	[IA] Generalize the support for power-of-two (de)interleave intrinsics (#123863 ) Previously, AArch64 used pattern matching to support llvm.vector.(de)interleave of 2 and 4; RISC-V only supported (de)interleave of 2. This patch consolidates the logic in these two targets by factoring out the common factor calculations into the InterleaveAccess Pass.	2025-01-23 15:27:51 -08:00
Michael Maitland	f5bd623d06	[RISCV][VLOPT] Rename vx to vf where appropriate in test case	2025-01-23 14:02:15 -08:00
David Green	fc952b2a69	[AArch64] Add pre-index store patterns for bf16. These, like the postinc patterns, need adding very similarly to fp16. Fixes #97870	2025-01-23 21:52:20 +00:00
Michael Maitland	bf258dbd57	[RISCV][VLOPT] support fp sign injection instructions (#124195 )	2025-01-23 16:50:35 -05:00
Michael Maitland	f402e06e7d	[RISCV][VLOPT] Add vector fp min/max instructions to isSupportedInstr (#124196 )	2025-01-23 16:47:14 -05:00
mingmingl	1688c8719f	s/requires/REQUIRES to fix the test on release build	2025-01-23 12:59:42 -08:00
Craig Topper	e30a4fc3e2	[TargetLowering] Improve one signature of forceExpandWideMUL. (#123991 ) We have two forceExpandWideMUL functions. One takes the low and high half of 2 inputs and calculates the low and high half of their product. This does not calculate the full 2x width product. The other signature takes 2 inputs and calculates the low and high half of their full 2x width product. Previously it did this by sign/zero extending the inputs to create the high bits and then calling the other function. We can instead copy the algorithm from the other function and use the Signed flag to determine whether we should do SRA or SRL. This avoids the need to multiply the high part of the inputs and add them to the high half of the result. This improves the generated code for signed multiplication. This should improve the performance of #123262. I don't know yet how close we will get to gcc.	2025-01-23 12:49:35 -08:00
mingmingl	c3ecbe6792	Disable the test again. * https://lab.llvm.org/buildbot/#/builders/127/builds/2148/steps/7/logs/stdio shows a failure.	2025-01-23 11:07:14 -08:00
Florian Hahn	0d0190815d	[TailDup] Allow large number of predecessors/successors without phis. (#116072 ) This adjusts the threshold logic added in #78582 to only trigger for cases where there are actually phis to duplicate in either TailBB or in one of the successors. In cases where there are no phis, we only pay the cost of extra edges, without an explosion in PHI-related instructions. This improves performance of Python on some inputs by 2-3% on Apple Silicon CPUs. PR: https://github.com/llvm/llvm-project/pull/116072	2025-01-23 18:24:20 +00:00
Ulrich Weigand	6d5697f7cb	[SystemZ] Fix ICE with i128->i64 uaddo carry chain We can only optimize a uaddo_carry via specialized instruction if the carry was produced by another uaddo(_carry) instruction; there is already a check for that. However, i128 uaddo(_carry) use a completely different mechanism; they indicate carry in a vector register instead of the CC flag. Thus, we must also check that we don't mix those two - that check has been missing. Fixes: https://github.com/llvm/llvm-project/issues/124001	2025-01-23 19:15:11 +01:00
mingmingl	3dec24d2a2	Stats are sorted before they are printed. Try fixing the test failure by checking stats in their print order.	2025-01-23 10:11:11 -08:00
Craig Topper	2f6b0b4a85	[RISCV] Add SiFive sf.vqmacc tests to vmv-copy.mir. NFC (#124075 ) The vqmaccu.2x8x2 test is currently being miscompiled. We need to use a whole register move instead of vmv.v.v. The input has VL elements with EEW=8 EMUL=4. The output has VL/4 elements with EEW=32 EMUL=4. We can't use the original VL or input SEW for a vmv.v.v.	2025-01-23 10:03:33 -08:00
Nikita Popov	bca6dbd3a2	[X86] Add additional i128 abi test (NFC)	2025-01-23 17:34:47 +01:00
mingmingl	96410edd47	mark test as unsupported while I investigate a test failure in certain environments	2025-01-23 08:07:48 -08:00
Nikita Popov	c3b40c7ea2	[X86] Regenerate test checks (NFC) Regenerate some tests for the new vpternlog printing.	2025-01-23 16:15:04 +01:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies, i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
Mikołaj Piróg	25653e558c	[AVX10.2] Update convert chapter intrinsic and mnemonics names (#123656 ) Intel spec for avx10.2 (https://cdrdv2.intel.com/v1/dl/getContent/828965) has been updated. This PR changes relevant names from the "AVX10 CONVERT INSTRUCTIONS" chapter.	2025-01-23 22:23:56 +08:00
Nico Weber	99d450e9f5	Revert "[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942 )" This reverts commit 6fdaaafd89d7cbc15dafe3ebf1aa3235d148aaab. Breaks check-llvm, see https://github.com/llvm/llvm-project/pull/123942#issuecomment-2609861953	2025-01-23 09:19:42 -05:00
Matt Arsenault	e28e93550a	AMDGPU: Make vector_shuffle legal for v2i32 with v_pk_mov_b32 (#123684 ) For VALU shuffles, this saves an instruction in some cases.	2025-01-23 20:58:02 +07:00
Kareem Ergawy	ff55c9bc63	[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering (#124089 ) Fixes #123800 Extends LDS lowering by allowing it to discover transitive indirect/escaping references to LDS GVs. For example, given the following input: ```llvm @lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8 %store_type = type { i32, ptr } @place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8 define amdgpu_kernel void @offloading_kernel() { store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8 call void @call_unknown() ret void } define void @call_unknown() { %1 = alloca ptr, align 8 %2 = call i32 %1() ret void } define void @indirectly_load_lds() { call void @directly_load_lds() ret void } define void @directly_load_lds() { %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8 ret void } ``` With the above input, prior to this patch, LDS lowering failed to lower the reference to `@lds_item_to_indirectly_load` because: 1. it is indirectly called by a function whose address is taken in the kernel. 2. we did not check if the kernel indirectly makes any calls to unknown functions (we only checked the direct calls). Co-authored-by: Jon Chesterfield <jonathan.chesterfield@amd.com>	2025-01-23 14:53:11 +01:00
Frederik Harwath	6fdaaafd89	[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942 ) This is meant as a short-term workaround for an invalid conversion in this pass that occurs because existing SDWA selections are not correctly taken into account during the conversion. See the draft PR #123221 for an attempt to fix the actual issue. --------- Co-authored-by: Frederik Harwath <fharwath@amd.com>	2025-01-23 14:32:01 +01:00
Michael Liao	590e5e20b1	[M68k] Fix llc pass test after 3630d9ef65b30af7e4ca78e668649bbc48b5be66	2025-01-23 08:28:25 -05:00
Simon Pilgrim	90e9895a93	[X86] Handle BSF/BSR "zero-input pass through" behaviour (#123623 ) Intel docs have been updated to be similar to AMD and now describe BSF/BSR as not changing the destination register if the input value was zero, which allows us to support CTTZ/CTLZ zero-input cases by setting the destination to support a NumBits result (BSR is a bit messy as it has to be XOR'd to create a CTLZ result). VIA/Zhaoxin x86_64 CPUs have also been confirmed to match this behaviour. This patch adjusts the X86ISD::BSF/BSR nodes to take a "pass through" argument for zero-input cases; by default this is set to UNDEF to match existing behaviour, but it can be set to a suitable value if supported. There are still some limits to this: it's only supported for x86_64 capable processors (and I've only enabled it for x86_64 codegen), and Intel CPUs sometimes zero the upper 32 bits of a pass through register when used for BSR32/BSF32 with a zero source value (i.e. the whole 64 bits may not get passed through). Fixes #122004	2025-01-23 12:59:59 +00:00
Abhilash Majumder	fa7f0e582b	[NVPTX] Add Bulk Copy Prefetch Intrinsics (#123226 ) This patch adds NVVM intrinsics and NVPTX codegen for: - cp.async.bulk.prefetch.L2.* variants - These intrinsics optionally support cache_hints as indicated by the boolean flag argument. - Lit tests are added for all combinations of these intrinsics in cp-async-bulk.ll. - The generated PTX is verified with a 12.3 ptxas executable. - Added docs for these intrinsics in NVPTXUsage.rst file. PTX Spec reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch Co-authored-by: abmajumder <abmajumder@nvidia.com>	2025-01-23 16:49:44 +05:30
SivanShani-Arm	ee99c4d484	[LLVM][Clang][AArch64] Implement AArch64 build attributes (#123990 ) - Added support for AArch64-specific build attributes. - Print AArch64 build attributes to assembly. - Emit AArch64 build attributes to ELF. Specification: https://github.com/ARM-software/abi-aa/pull/230	2025-01-23 09:46:59 +00:00
mconst	3fb8c5b431	[X86] Fix invalid instructions on x32 with large stack frames (#124041 ) `X86FrameLowering::emitSPUpdate()` assumes that 64-bit targets use a 64-bit stack pointer, but that's not true on x32. When checking the stack pointer size, we need to look at `Uses64BitFramePtr` rather than `Is64Bit`. This avoids generating invalid instructions like `add esp, rcx`. For impossibly-large stack frames (4 GiB or larger with a 32-bit stack pointer), we were also generating invalid instructions like `mov eax, 5000000000`. The inline stack probe code already had a check for that situation; I've moved the check into `emitSPUpdate()`, so any attempt to allocate a 4 GiB stack frame with a 32-bit stack pointer will now trap rather than adjusting ESP by the wrong amount. This also fixes the "can't have 32-bit 16GB stack frame" assertion, which used to be triggerable by user code but is now correct. To help catch situations like this in the future, I've added `-verify-machineinstrs` to the stack clash tests that generate large stack frames. This fixes the expensive-checks buildbot failure caused by #113219.	2025-01-23 12:37:07 +05:30
Heejin Ahn	c3dfd34e54	[WebAssembly] Add unreachable before catch destinations (#123915 ) When a `try_table`'s catch clause destination has a return type, as in the case of catch with a concrete tag, catch_ref, and catch_all_ref, the body of the destination block must also produce that type. For example: ```wasm block exnref try_table (catch_all_ref 0) ... end_try_table end_block ... use exnref ... ``` This code is not valid because the block's body type is not exnref. So we add an unreachable after the 'end_try_table' to make the code valid here: ```wasm block exnref try_table (catch_all_ref 0) ... end_try_table unreachable ;; Newly added end_block ``` Because 'unreachable' is a terminator we also need to split the BB. --- We need to handle the same thing for unwind mismatch handling. In the code below, we create a "trampoline BB" that will be the destination for the nested `try_table`~`end_try_table` added to fix an unwind mismatch: ```wasm try_table (catch ... ) block exnref ... try_table (catch_all_ref N) some code end_try_table ... end_block ;; Trampoline BB throw_ref end_try_table ``` While the `block` added for the trampoline BB has the return type `exnref`, its body, which contains the nested `try_table` and other code, wouldn't have the `exnref` return type. Most of the time this was not a problem because the block's body ended with something like `br` or `return`, but that may not always be the case, especially when there is a loop. So we add an `unreachable` to make the code valid here too: ```wasm try_table (catch ... ) block exnref ... try_table (catch_all_ref N) some code end_try_table ... unreachable ;; Newly added end_block ;; Trampoline BB throw_ref end_try_table ``` In this case we just append the `unreachable` at the end of the layout predecessor BB. (This was tricky to do in the first (non-mismatch) case because there `end_try_table` and `end_block` were added at the beginning of an EH pad in `placeTryTableMarker` and moving `end_try_table` and the new `unreachable` to the previous BB caused other problems.) --- This adds many `unreachable`s to the output, but the tests check for `unreachable` in only a few places to confirm it is working. The FileCheck lines in `exception.ll` and `cfg-stackify-eh.ll` are already heavily redacted to only leave important control-flow instructions, so I don't think it's worth adding `unreachable` checks everywhere.	2025-01-22 22:39:43 -08:00
mingmingl	5d8390d48e	Temporarily disable test on Fuchsia	2025-01-22 22:33:17 -08:00
mingmingl	ea49d474fd	Specify triple for llc test	2025-01-22 21:46:51 -08:00
Mingming Liu	de209fa11b	[CodeGen] Introduce Static Data Splitter pass (#122183 ) https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744 proposes to partition static data sections. This patch introduces a codegen pass. This patch produces jump table hotness in the in-memory states (machine jump table info and entries). Target-lowering and asm-printer consume the states and produce `.hot` section suffix. The follow up PR https://github.com/llvm/llvm-project/pull/122215 implements such changes. --------- Co-authored-by: Ellis Hoag <ellis.sparky.hoag@gmail.com>	2025-01-22 21:06:46 -08:00
quic_hchandel	163935a48d	[RISCV] Add Qualcomm uC Xqcilo (Large Offset Load Store) extension (#123881 ) This extension adds eight 48-bit load/store instructions. The current spec can be found at: https://github.com/quic/riscv-unified-db/releases/latest This patch adds assembler-only support. --------- Co-authored-by: Harsh Chandel <hchandel@qti.qualcomm.com>	2025-01-23 10:14:25 +05:30
tangaac	19834b4623	[LoongArch] Support sc.q instruction for 128bit cmpxchg operation (#116771 ) Adds two clang options: -mno-scq disables the sc.q instruction and -mscq enables it. The default is -mno-scq.	2025-01-23 12:11:07 +08:00
Akshay Deodhar	892a804d93	[NVPTX] Stop using 16-bit CAS instructions from PTX (#120220 ) Increases minimum CAS size from 16 bit to 32 bit, for better SASS codegen. When atomics are emulated using atom.cas.b16, the generated SASS includes 2 (nested) emulation loops. When emulated using an atom.cas.b32 loop, the SASS has only a single emulation loop. Using 32-bit CAS thus results in better codegen.	2025-01-22 19:37:11 -08:00
Finn Plummer	0fe8e70c66	Revert "Reland "[HLSL] Implement the `reflect` HLSL function"" (#124046 ) Reverts llvm/llvm-project#123853 The introduction of `reflect-error.ll` surfaced a bug with the use of `report_fatal_error` in `SPIRVInstructionSelector` that was propagated into the PR. This has caused a build-bot breakage, and the work to solve the underlying issue is tracked here: https://github.com/llvm/llvm-project/issues/124045. We can re-apply this commit when the underlying issue is resolved.	2025-01-22 18:22:03 -08:00
Hua Tian	a9d2834508	[llvm][CodeGen] Fix the issue caused by live interval checking in window scheduler (#123184 ) In some corner cases, the cloned MI still retains an old slot index, which crashes the compiler. This patch updates the slot index map before deleting the recycled MI. https://github.com/llvm/llvm-project/issues/123165	2025-01-23 09:39:03 +08:00
Craig Topper	96dbd0006c	[RISCV] Re-generate test checks so we pick up implicit on whole register moves. NFC	2025-01-22 16:11:43 -08:00
Deric Cheung	2656928d0c	Reland "[HLSL] Implement the `reflect` HLSL function" (#123853 ) This PR relands [#122992](https://github.com/llvm/llvm-project/pull/122992). Some machines were failing to run the `reflect-error.ll` test due to the RUN lines ```llvm ; RUN: not %if spirv-tools %{ llc -O0 -mtriple=spirv64-unknown-unknown %s -o /dev/null 2>&1 -filetype=obj %} ; RUN: not %if spirv-tools %{ llc -O0 -mtriple=spirv32-unknown-unknown %s -o /dev/null 2>&1 -filetype=obj %} ``` which failed when `spirv-tools` was not present on the machine due to running the command `not` without any arguments. These RUN lines have been removed since they don't actually test anything new compared to the other two RUN lines due to the expected error during instruction selection. ```llvm ; RUN: not llc -verify-machineinstrs -O0 -mtriple=spirv64-unknown-unknown %s -o /dev/null 2>&1 | FileCheck %s ; RUN: not llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown %s -o /dev/null 2>&1 | FileCheck %s ```	2025-01-22 13:29:19 -08:00
Michael Maitland	1687aa2a99	[RISCV][VLOPT] Don't reduce the VL if it is the same as CommonVL (#123878 ) This fixes the slowdown in #123862.	2025-01-22 13:49:54 -05:00
Stefan Pintilie	340706f311	[PowerPC] Fix saving of Link Register when using ROP Protect (#123101 ) An optimization was added that tries to move the uses of the mflr instruction away from the instruction itself. However, this doesn't work when we are using the hashst instruction because that instruction needs to be run before the stack frame is obtained. This patch disables moving instructions away from the mflr in the case where ROP protection is being used. --------- Co-authored-by: Lei Huang <lei@ca.ibm.com>	2025-01-22 13:44:20 -05:00
Kazu Hirata	b40739a6e9	Revert "[LLVM][Clang][AArch64] Implement AArch64 build attributes (#118771 )" This reverts commit d7fb4a275c98f4035d1083b5eb3edd2ffb2da00e. Buildbots failing: https://lab.llvm.org/buildbot/#/builders/169/builds/7671 https://lab.llvm.org/buildbot/#/builders/65/builds/11046	2025-01-22 10:12:27 -08:00
Simon Pilgrim	44f3168110	[X86] vector reduction tests - regenerate VPTERNLOG comments	2025-01-22 17:23:37 +00:00
Simon Pilgrim	a25f2cb3e6	[X86] vector rotate tests - regenerate VPTERNLOG comments	2025-01-22 17:23:37 +00:00
Simon Pilgrim	bb754f2c98	[X86] avx512 intrinsics tests - regenerate VPTERNLOG comments	2025-01-22 17:23:37 +00:00
Simon Pilgrim	e6c7d6a56a	[X86] avx512-broadcast-unfold.ll - regenerate VPTERNLOG comments	2025-01-22 17:23:37 +00:00
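The worked example in commit 6206f5444f (10 waves/EU, 4 EUs/CU, 64-wide waves, no LDS usage, flat workgroup size range [513,1024]) can be reproduced by brute force. The Python sketch below is not the constant-time computation behind `getOccupancyWithWorkGroupSizes`; `occupancy_range` is a hypothetical helper, and it assumes a simplified model with no LDS or register limits, no cap on workgroups per CU, and per-EU occupancy taken as the CU's total waves divided by the number of EUs.

```python
def occupancy_range(min_size, max_size, wave_size=64, waves_per_eu=10, eus_per_cu=4):
    """Brute-force (min, max) waves per EU over every workgroup size in the range.

    Simplified model (assumption): only whole groups fit on a CU, and the
    per-EU figure is the CU's total waves averaged over its EUs.
    """
    cu_slots = waves_per_eu * eus_per_cu  # wave slots on one CU (40 in the example)
    occupancies = []
    for size in range(min_size, max_size + 1):
        waves_per_group = (size + wave_size - 1) // wave_size  # ceil(size / wave_size)
        groups = cu_slots // waves_per_group                   # whole groups that fit
        occupancies.append(groups * waves_per_group / eus_per_cu)
    return min(occupancies), max(occupancies)


print(occupancy_range(513, 1024))  # (7.0, 10.0)
```

Averaging over EUs can give fractional values for some sizes (11 waves per group fits 3 groups, or 33 waves, i.e. 8.25 per EU), but the bounds for [513,1024] still come out as the [7,10] stated in the commit, attained at 896 and 640 items.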
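Commit e30a4fc3e2 describes computing both halves of a full 2x-wide product directly from the inputs, with the Signed flag only choosing between SRA and SRL. The Python sketch below models that idea with the classic half-word schoolbook expansion (the `mulhs`/`mulhu` scheme from Hacker's Delight) on simulated fixed-width registers. It is not LLVM's `forceExpandWideMUL` code, and `wide_mul` is a hypothetical name.

```python
def wide_mul(a, b, bits=32, signed=False):
    """Return (lo, hi), the full 2*bits product of two bits-wide operands.

    Operands and results are unsigned bit patterns in [0, 2**bits). Only
    bits-wide multiplies, adds and shifts are used; `signed` only picks SRA vs SRL.
    """
    half = bits // 2
    mask = (1 << bits) - 1
    half_mask = (1 << half) - 1

    def sra(x):  # arithmetic shift right by `half` on a bits-wide register
        x &= mask
        if x >> (bits - 1):  # sign bit set: reinterpret as negative
            x -= 1 << bits
        return (x >> half) & mask

    def srl(x):  # logical shift right by `half`
        return (x & mask) >> half

    shr = sra if signed else srl  # the only place the signedness matters

    def mul(x, y):  # a bits-wide MUL keeps only the low bits
        return (x * y) & mask

    def add(x, y):
        return (x + y) & mask

    a_lo, a_hi = a & half_mask, shr(a)
    b_lo, b_hi = b & half_mask, shr(b)

    w0 = mul(a_lo, b_lo)                      # low x low, always unsigned
    t = add(mul(a_hi, b_lo), srl(w0))         # first cross term plus carry out of w0
    w1 = add(mul(a_lo, b_hi), t & half_mask)  # second cross term plus low half of t
    hi = add(add(mul(a_hi, b_hi), shr(t)), shr(w1))
    lo = add((w1 << half) & mask, w0 & half_mask)
    return lo, hi
```

For example, `wide_mul(0xFFFFFFFF, 2)` treats the first operand as 2**32 - 1 and returns `(0xFFFFFFFE, 1)`, while `wide_mul(0xFFFFFFFF, 2, signed=True)` treats it as -1 and returns `(0xFFFFFFFE, 0xFFFFFFFF)`, the 64-bit pattern of -2.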
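Commit 90e9895a93 relies on BSF/BSR leaving the destination unchanged for a zero input, so preloading the destination can produce the NumBits result that CTTZ/CTLZ require for zero. The Python model below illustrates the arithmetic, including why BSR needs an XOR. The function names are hypothetical, and the preload constants (NumBits for BSF, 2*NumBits - 1 for BSR before the XOR) are an assumption about a suitable pass-through value, not a quote of the X86 lowering. The model also ignores the commit's caveat that Intel CPUs may zero the upper 32 bits of the pass-through register for BSR32/BSF32.

```python
def bsf(x, passthru, bits=32):
    """Model of BSF: index of the lowest set bit, or `passthru` (dest unchanged) if x == 0."""
    x &= (1 << bits) - 1
    return (x & -x).bit_length() - 1 if x else passthru


def bsr(x, passthru, bits=32):
    """Model of BSR: index of the highest set bit, or `passthru` (dest unchanged) if x == 0."""
    x &= (1 << bits) - 1
    return x.bit_length() - 1 if x else passthru


def cttz(x, bits=32):
    # Preloading NumBits makes a zero input yield cttz(0) == bits directly.
    return bsf(x, passthru=bits, bits=bits)


def ctlz(x, bits=32):
    # For x != 0: ctlz = (bits - 1) - bsr(x), which equals bsr(x) ^ (bits - 1)
    # because bsr(x) lies in [0, bits - 1] and bits is a power of two.
    # Preloading 2*bits - 1 gives (2*bits - 1) ^ (bits - 1) == bits for x == 0.
    return bsr(x, passthru=2 * bits - 1, bits=bits) ^ (bits - 1)


print(ctlz(0), ctlz(1), cttz(0), cttz(8))  # 32 31 32 3
```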
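Commit 892a804d93 moves NVPTX atomic emulation from 16-bit to 32-bit CAS. The sketch below shows the general shape of such emulation: a 16-bit atomic add performed with a single compare-and-swap loop on the containing aligned 32-bit word. It is a sequential Python model, not the NVPTX lowering or real PTX; `Memory32` and `atomic_add_u16` are hypothetical, and it assumes little-endian placement of the halfword within its word.

```python
class Memory32:
    """Toy word-addressed memory exposing only a 32-bit compare-and-swap (hypothetical model)."""

    def __init__(self):
        self.words = {}

    def load32(self, addr):
        return self.words.get(addr, 0)

    def cas32(self, addr, expected, new):
        """Store `new` if the word equals `expected`; always return the old word."""
        old = self.words.get(addr, 0)
        if old == expected:
            self.words[addr] = new & 0xFFFFFFFF
        return old


def atomic_add_u16(mem, byte_addr, val):
    """16-bit atomic add emulated with one 32-bit CAS loop; returns the old halfword."""
    assert byte_addr % 2 == 0, "halfword must be 2-byte aligned"
    word_addr = byte_addr & ~3
    shift = (byte_addr & 3) * 8  # little-endian lane position (assumption): 0 or 16
    mask = 0xFFFF << shift
    old = mem.load32(word_addr)
    while True:
        old16 = (old & mask) >> shift
        new16 = (old16 + val) & 0xFFFF
        new = (old & ~mask & 0xFFFFFFFF) | (new16 << shift)  # keep the neighbouring halfword
        seen = mem.cas32(word_addr, old, new)
        if seen == old:
            return old16
        old = seen  # another writer changed the word: retry with the fresh value
```

Because the loop operates on the full 32-bit word, the neighbouring halfword is carried through unchanged, and a retry is needed only when some other writer modified the word between the load and the CAS.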


57106 Commits