llvm-project

Author	SHA1	Message	Date
David Green	d9d71bdc14	[AArch64] Move BSL generation to lowering. (#151855 ) It is generally better to allow the target independent combines before creating AArch64 specific nodes (providing they don't mess it up). This moves the generation of BSL nodes to lowering, not a combine, so that intermediate nodes are more likely to be optimized. There is a small change in the constant handling to detect legalized buildvector arguments correctly. Fixes #149380 but not directly. #151856 contained a direct fix for expanding the pseudos.	2025-08-21 09:54:42 +01:00
David Green	bd94aabfb6	[AArch64][GlobalISel] Remove Selection code for s/uitofp. NFC (#154488 ) These are already handled by tablegen patterns.	2025-08-20 20:52:42 +01:00
Benjamin Maxwell	478b4b012f	[AArch64][SME] Rework VG CFI information for streaming-mode changes (#152283 ) This patch reworks how VG is handled around streaming mode changes. Previously, for functions with streaming mode changes, we would: - Save the incoming VG in the prologue - Emit `.cfi_offset vg, <offset>` and `.cfi_restore vg` around streaming mode changes Additionally, for locally streaming functions, we would: - Also save the streaming VG in the prologue - Emit `.cfi_offset vg, <incoming VG offset>` in the prologue - Emit `.cfi_offset vg, <streaming VG offset>` and `.cfi_restore vg` around streaming mode changes In both cases, this ends up doing more than necessary and would be hard for an unwinder to parse, as using `.cfi_offset` in this way does not follow the semantics of the underlying DWARF CFI opcodes. So the new scheme in this patch is to: In functions with streaming mode changes (inc locally streaming) - Save the incoming VG in the prologue - Emit `.cfi_offset vg, <offset>` in the prologue (not at streaming mode changes) - Emit `.cfi_restore vg` after the saved VG has been deallocated - This will be in the function epilogue, where VG is always the same as the entry VG - Explicitly reference the incoming VG expressions for SVE callee-saves in functions with streaming mode changes - Ensure the CFA is not described in terms of VG in functions with streaming mode changes A more in-depth discussion of this scheme is available in: https://gist.github.com/MacDue/b7a5c45d131d2440858165bfc903e97b But the TLDR is that following this scheme, SME unwinding can be implemented with minimal changes to existing unwinders. All unwinders need to do is initialize VG to `CNTD` at the start of unwinding, then everything else is handled by standard opcodes (which don't need changes to handle VG).	2025-08-20 14:06:12 +01:00
Paul Walker	d6a688fb3d	[LLVM][CodeGen][SME] hasB16b16() is not sufficient to prove BFADD availability. (#154143 ) The FEAT_SVE_B16B16 arithmetic instructions are only available to streaming mode functions when SME2 is available.	2025-08-20 11:12:43 +01:00
SivanShani-Arm	460e9a8837	[LLVM][AArch64] Build attributes: Support switching to a defined subsection by name only (#154159 ) The AArch64 build attribute specification now allows switching to an already-defined subsection using its name alone, without repeating the optionality and type parameters. This patch updates the parser to support that behavior. Spec reference: https://github.com/ARM-software/abi-aa/pull/230/files	2025-08-20 10:50:48 +01:00
jyli0116	9df7ca1f0f	[GlobalISel] Legalize Saturated Truncate instructions and intrinsics (#154340 ) Adds legalization support for `G_TRUNC_SSAT_S`, `G_TRUNC_SSAT_S`, `G_TRUNC_USAT_U` instructions for GlobalISel.	2025-08-20 10:37:22 +01:00
Kerry McLaughlin	c34cba0413	[AArch64][SME] Lower aarch64.sme.cnts* to vscale when in streaming mode (#154305 ) In streaming mode, both the @llvm.aarch64.sme.cnts and @llvm.aarch64.sve.cnt intrinsics are equivalent. For SVE, cnt* is lowered in instCombineIntrinsic to @llvm.sme.vscale(). This patch lowers the SME intrinsic similarly when in streaming-mode.	2025-08-20 09:48:36 +01:00
David Tellenbach	0542355147	[AArch64] Fix zero-register copying with zero-cycle moves (#154362 ) Fix incorrect super-register lookup when copying from $wzr on subtargets that lack zero-cycle zeroing but support 64-bit zero-cycle moves. When copying from $wzr, we used the wrong register class to lookup the super-register, causing $w0 = COPY $wzr to get expanded as $x0 = ORRXrr $xzr, undef $noreg, implicit $wzr, rather than the correct $x0 = ORRXrr $xzr, undef $xzr, implicit $wzr.	2025-08-19 21:07:16 +02:00
Tomer Shafir	ffddf33beb	[AArch64] Remove wrong processor feature (#151289 ) `fmov dX, dY` is not a preferred instruction. Previously introduced by: https://github.com/llvm/llvm-project/pull/144152	2025-08-19 21:20:09 +03:00
Jie Fu	81f1b46cc6	[AArch64] Silent an unused-variable warning (NFC) /llvm-project/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp:1042:11: error: unused variable 'TRI' [-Werror,-Wunused-variable] auto *TRI = MBB.getParent()->getSubtarget().getRegisterInfo(); ^ 1 error generated.	2025-08-19 20:32:33 +08:00
Benjamin Maxwell	7170a81241	[AArch64][SME] Rename `EdgeBundles` to `Bundles` (NFC) (#154295 ) It seems some buildbots do not like the shadowing. See: https://lab.llvm.org/buildbot/#/builders/137/builds/23838	2025-08-19 10:58:19 +01:00
Benjamin Maxwell	eb764040bc	[AArch64][SME] Implement the SME ABI (ZA state management) in Machine IR (#149062 ) ## Short Summary This patch adds a new pass `aarch64-machine-sme-abi` to handle the ABI for ZA state (e.g., lazy saves and agnostic ZA functions). This is currently not enabled by default (but aims to be by LLVM 22). The goal is for this new pass to more optimally place ZA saves/restores and to work with exception handling. ## Long Description This patch reimplements management of ZA state for functions with private and shared ZA state. Agnostic ZA functions will be handled in a later patch. For now, this is under the flag `-aarch64-new-sme-abi`, however, we intend for this to replace the current SelectionDAG implementation once complete. The approach taken here is to mark instructions as needing ZA to be in a specific ("ACTIVE" or "LOCAL_SAVED"). Machine instructions implicitly defining or using ZA registers (such as $zt0 or $zab0) require the "ACTIVE" state. Function calls may need the "LOCAL_SAVED" or "ACTIVE" state depending on the callee (having shared or private ZA). We already add ZA register uses/definitions to machine instructions, so no extra work is needed to mark these. Calls need to be marked by glueing Arch64ISD::INOUT_ZA_USE or Arch64ISD::REQUIRES_ZA_SAVE to the CALLSEQ_START. These markers are then used by the MachineSMEABIPass to find instructions where there is a transition between required ZA states. These are the points we need to insert code to set up or restore a ZA save (or initialize ZA). To handle control flow between blocks (which may have different ZA state requirements), we bundle the incoming and outgoing edges of blocks. Bundles are formed by assigning each block an incoming and outgoing bundle (initially, all blocks have their own two bundles). Bundles are then combined by joining the outgoing bundle of a block with the incoming bundle of all successors. These bundles are then assigned a ZA state based on the blocks that participate in the bundle. Blocks whose incoming edges are in a bundle "vote" for a ZA state that matches the state required at the first instruction in the block, and likewise, blocks whose outgoing edges are in a bundle vote for the ZA state that matches the last instruction in the block. The ZA state with the most votes is used, which aims to minimize the number of state transitions.	2025-08-19 10:00:28 +01:00
David Sherwood	13d8ba7dea	[LV][TTI] Calculate cost of extracting last index in a scalable vector (#144086 ) There are a couple of places in the loop vectoriser where we want to calculate the cost of extracting the last lane in a vector. However, we wrongly assume that asking for the cost of extracting lane (VF.getKnownMinValue() - 1) is an accurate representation of the cost of extracting the last lane. For SVE at least, this is non-trivial as it requires the use of whilelo and lastb instructions. To solve this problem I have added a new getReverseVectorInstrCost interface where the index is used in reverse from the end of the vector. Suppose a vector has a given ElementCount EC, the extracted/inserted lane would be EC - 1 - Index. For scalable vectors this index is unknown at compile time. I've added a AArch64 hook that better represents the cost, and also a RISCV hook that maintains compatibility with the behaviour prior to this PR. I've also taken the liberty of adding support in vplan for calculating the cost of VPInstruction::ExtractLastElement.	2025-08-19 09:31:37 +01:00
Kazu Hirata	5fdc7478a6	[AArch64] Replace SmallSet with SmallPtrSet (NFC) (#154264 ) This patch replaces SmallSet<T , N> with SmallPtrSet<T , N>. Note that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer element types: template <typename PointeeType, unsigned N> class SmallSet<PointeeType, N> : public SmallPtrSet<PointeeType, N> {}; We only have 30 instances that rely on this "redirection". Since the redirection doesn't improve readability, this patch replaces SmallSet with SmallPtrSet for pointer element types. I'm planning to remove the redirection eventually.	2025-08-18 22:39:53 -07:00
AZero13	08a140add8	[AArch64] Fix build-bot assertion error in AArch64 (#154124 ) Fixes build bot assertion. I forgot to include logic that will be added in a future PR that handles -1 correctly. For now, let's just return nullptr like we used to.	2025-08-18 15:12:07 +00:00
AZero13	0e52092ff7	[AArch64] Adjust comparison constant if adjusting it means less instructions (#151024 ) Prefer constants that require less instructions to materialize, in both Global-ISel and Selection-DAG	2025-08-18 14:56:45 +01:00
Benjamin Maxwell	81c06d198e	Reland "[AArch64][SME] Port all SME routines to RuntimeLibcalls" (#153417 ) This updates everywhere we emit/check an SME routines to use RuntimeLibcalls to get the function name and calling convention.	2025-08-18 14:53:40 +01:00
Jonathan Thackray	f38c83c582	[AArch64][llvm] Disassemble instructions in `SYS` alias encoding space more correctly (#153905 ) For instructions in the `SYS` alias encoding space which take no register operands, and where the unused 5 register bits are not all set (0x31, 0b11111), then disassemble to a `SYS` alias and not the instruction, since it is not considered valid. This is because it is specified in the Arm ARM in text similar to this (e.g. page C5-1037 of DDI0487L.b for `TLBI ALLE1`, or page C5-1585 for `GCSPOPX`): ``` Rt should be encoded as 0b11111. If the Rt field is not set to 0b11111, it is CONSTRAINED UNPREDICTABLE whether: * The instruction is UNDEFINED. * The instruction behaves as if the Rt field is set to 0b11111. ``` Since we want to follow "should" directives, and not encourage undefined behaviour, only assemble or disassemble instructions considered valid. Add an extra test-case for this, and all existing test-cases are continuing to pass.	2025-08-18 14:41:41 +01:00
Jonathan Cohen	c6fe567064	[AArch64][MachineCombiner] Combine sequences of gather patterns (#152979 ) Reland of #142941 Squashed with fixes for #150004, #149585 This pattern matches gather-like patterns where values are loaded per lane into neon registers, and replaces it with loads into 2 separate registers, which will be combined with a zip instruction. This decreases the critical path length and improves Memory Level Parallelism. rdar://151851094	2025-08-18 15:10:59 +03:00
David Green	8f98529209	[AArch64] Remove SIMDLongThreeVectorTiedBHSabal tablegen class. Similar to #152987 this removes SIMDLongThreeVectorTiedBHSabal as it is equivalent to SIMDLongThreeVectorTiedBHS with a better TriOpFrag pattern.	2025-08-18 09:11:13 +01:00
Ahmad Yasin	1b0bce972b	Reorder checks to speed up getAppleRuntimeUnrollPreferences() (#154010 ) - Delay load/store values calculation unless a best unroll-count is found - Remove extra getLoopLatch() invocation	2025-08-18 11:06:37 +03:00
Kazu Hirata	cbf5af9668	[llvm] Remove unused includes (NFC) (#154051 ) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-08-17 23:46:35 -07:00
David Green	732eb5427c	[AArch64] Replace SIMDLongThreeVectorBHSabd with SIMDLongThreeVectorBHS. (#152987 ) We just need to use a BinOpFrag to share the patterns. This also moves UABDL to where it belongs in with similar instructions, and removes some patterns that are now handled by abd nodes. This is mostly NFC except for GISel, which will catch back up when it handles abd nodes in the same way.	2025-08-15 20:35:27 +01:00
Nikita Popov	01bc742185	[CodeGen] Give ArgListEntry a proper constructor (NFC) (#153817 ) This ensures that the required fields are set, and also makes the construction more convenient.	2025-08-15 18:06:07 +02:00
David Green	144f3c4cbf	[AArch64] Adjust the scheduling info of SVE FCMP on Cortex-A510. (#153810 ) According to the SWOG, these have a lower throughput than other instructions. Mark them as taking multiple cycles to model that.	2025-08-15 15:45:33 +01:00
Gaëtan Bossu	9828745661	[AArch64][ISel] Select constructive EXT_ZZI pseudo instruction (#152554 ) The patch adds patterns to select the EXT_ZZI_CONSTRUCTIVE pseudo instead of the EXT_ZZI destructive instruction for vector_splice. This only works when the two inputs to vector_splice are identical. Given that registers aren't tied anymore, this gives the register allocator more freedom and a lot of MOVs get replaced with MOVPRFX. In some cases however, we could have just chosen the same input and output register, but regalloc preferred not to. This means we end up with some test cases now having more instructions: there is now a MOVPRFX while no MOV was previously needed.	2025-08-15 14:30:24 +01:00
Gaëtan Bossu	fdd2d4df12	[AArch64] Define constructive EXT_ZZI pseudo instruction (#152552 ) It will get expanded into MOVPRFX_ZZ and EXT_ZZI by the AArch64ExpandPseudo pass. This instruction takes a single Z register as input, as opposed to the existing destructive EXT_ZZI instruction. Note this patch only defines the pseudo, it isn't used in any ISel pattern yet. It will later be used for vector.extract.	2025-08-15 09:05:10 +01:00
David Green	5836bae463	[AArch64] Change the cost of fma and fmuladd to match fmul. (#152963 ) As fmul and fmadd are so similar, their performance characteristics tend to be the same on most platforms, at least in terms of reciprocal throughputs. Processors capable of performing a given number of fmul per cycle can usually perform the same number of fma, with the extra add being relatively simple on top. This patch makes the scores of the two operations the same, which brings the throughput cost of a fma/fmuladd to 2, and the latency to 3, which are the defaults for fmul. Note that we might also want to change the throughput cost of a fmul to 1, as most processors have ample bandwidth for them, but they should still stay in-line with one another.	2025-08-14 21:53:45 +01:00
Elvis Wang	01fac67e2a	[TTI] Add cost kind to getAddressComputationCost(). NFC. (#153342 ) This patch add cost kind to `getAddressComputationCost()` for #149955. Note that this patch also remove all the default value in `getAddressComputationCost()`.	2025-08-14 16:01:44 +08:00
David Green	4c28bbf5b8	[AArch64] Fix ‘>= 0’ is always true warning. NFC	2025-08-14 08:17:10 +01:00
Nikita Popov	48beed5b71	Revert "[AArch64][SME] Port all SME routines to RuntimeLibcalls" (#153392 ) This introduced a 5% compile-time regression on AArch64, see https://llvm-compile-time-tracker.com/compare.php?from=b9138bde3562de5c28a239dbd303caf2406678c6&to=271688b87abe7cf45aceaff8266270a25eb7b436&stat=instructions:u. Reverts llvm/llvm-project#152505.	2025-08-13 11:54:39 +00:00
Ahmad Yasin	1f2fb8e979	[AArch64] Tune unrolling prefs for more patterns on Apple CPUs (#149358 ) Enhance the heuristics in `getAppleRuntimeUnrollPreferences` to let a bit more loops to be unrolled. Specifically, this patch adjusts two checks: I. Tune the loop size budget from 8 to 10 II. Include immediate in-loop users of loaded values in the load/stores dependencies predicate --------- Co-authored-by: Florian Hahn <flo@fhahn.com> PR: https://github.com/llvm/llvm-project/pull/149358	2025-08-13 11:16:54 +01:00
Benjamin Maxwell	271688b87a	[AArch64][SME] Port all SME routines to RuntimeLibcalls (#152505 ) This updates everywhere we emit/check an SME routines to use RuntimeLibcalls to get the function name and calling convention. Note: RuntimeLibcallEmitter had some issues with emitting non-unique variable names for sets of libcalls, so I tweaked the output to avoid the need for variables.	2025-08-13 08:48:59 +01:00
Trevor Gross	919021b0df	[Arm64EC] Add support for `half` (#152843 ) `f16` is passed and returned in vector registers on both x86 on AArch64, the same calling convention as `f32`, so it is a straightforward type to support. The calling convention support already exists, added as part of a6065f0fa55a ("Arm64EC entry/exit thunks, consolidated. (#79067)"). Thus, add mangling and remove the error in order to make `half` work. MSVC does not yet support `_Float16`, so for now this will remain an LLVM-only extension. Fixes the `f16` portion of https://github.com/llvm/llvm-project/issues/94434	2025-08-12 14:15:52 -07:00
Min-Yih Hsu	ca05058b49	[IA][RISCV] Recognize deinterleaved loads that could lower to strided segmented loads (#151612 ) Turn the following deinterleaved load patterns ``` %l = masked.load(%ptr, /mask=/110110110110, /passthru=/poison) %f0 = shufflevector %l, [0, 3, 6, 9] %f1 = shufflevector %l, [1, 4, 7, 10] %f2 = shufflevector %l, [2, 5, 8, 11] ``` into ``` %s = riscv.vlsseg2(/passthru=/poison, %ptr, /mask=/1111) %f0 = extractvalue %s, 0 %f1 = extractvalue %s, 1 %f2 = poison ``` The mask `110110110110` is regarded as 'gap mask' since it effectively skips the entire third field / component. Similarly, turning the following snippet ``` %l = masked.load(%ptr, /mask=/110000110000, /passthru=/poison) %f0 = shufflevector %l, [0, 3, 6, 9] %f1 = shufflevector %l, [1, 4, 7, 10] ``` into ``` %s = riscv.vlsseg2(/passthru=/poison, %ptr, /mask=/1010) %f0 = extractvalue %s, 0 %f1 = extractvalue %s, 1 ``` Right now this patch only tries to detect gap mask from a constant mask supplied to a masked.load/vp.load.	2025-08-12 14:08:18 -07:00
Ricardo Jesus	ef5e65d27b	[AArch64] Fix stp kill when merging forward. (#152994 ) As an alternative to #149177, iterate through all instructions in `AArch64LoadStoreOptimizer`.	2025-08-12 14:19:43 +01:00
Sam Tebbs	0bfa1718af	[LV] Create in-loop sub reductions (#147026 ) This PR allows the loop vectorizer to handle in-loop sub reductions by forming a normal in-loop add reduction with a negated input. Stacked PRs: 1. -> https://github.com/llvm/llvm-project/pull/147026 2. https://github.com/llvm/llvm-project/pull/147255 3. https://github.com/llvm/llvm-project/pull/147302 4. https://github.com/llvm/llvm-project/pull/147513	2025-08-12 10:22:41 +01:00
Benjamin Maxwell	d0c9599c41	[AArch64][SME] Use entry pstate.sm for conditional streaming-mode changes (#152169 ) We only do conditional streaming mode changes in two cases: - Around calls in streaming-compatible functions that don't have a streaming body - At the entry/exit of streaming-compatible functions with a streaming body In both cases, the condition depends on the entry pstate.sm value. Given this, we don't need to emit calls to __arm_sme_state at every mode change. This patch handles this by placing a "AArch64ISD::ENTRY_PSTATE_SM" node in the entry block and copying the result to a register. The register is then used whenever we need to emit a conditional streaming mode change. The "ENTRY_PSTATE_SM" node expands to a call to "__arm_sme_state" only if (after SelectionDAG) the function is determined to have streaming-mode changes. This has two main advantages: 1. It allows back-to-back conditional smstart/stop pairs to be folded 2. It has the correct behaviour for EH landing pads - These are entered with pstate.sm = 0, and should switch mode based on the entry pstate.sm - Note: This is not fully implemented yet	2025-08-12 09:15:30 +01:00
Peter Collingbourne	a9227316bf	MC: Introduce R_AARCH64_PATCHINST relocation type. The R_AARCH64_PATCHINST relocation type is to support deactivation symbols. For more information, see the RFC: https://discourse.llvm.org/t/rfc-deactivation-symbols/85556 Part of the AArch64 psABI extension: https://github.com/ARM-software/abi-aa/issues/340	2025-08-11 11:31:36 -07:00
Luke Lau	acb86fb9e0	[TTI] Consistently pass the pointer type to getAddressComputationCost. NFCI (#152657 ) In some places we were passing the type of value being accessed, in other cases we were passing the type of the pointer for the access. The most "involved" user is LoopVectorizationCostModel::getMemInstScalarizationCost, which is the only call site that passes in the SCEV, and it passes along the pointer type. This changes call sites to consistently pass the pointer type, and renames the arguments to clarify this. No target actually checks the contents of the type passed, only to see if it's a vector or not, so this shouldn't have an effect.	2025-08-11 18:00:12 +08:00
Sander de Smalen	2ad1d77b17	[AArch64] Match constants in SelectSMETileSlice (#151494 ) If the slice is a constant then it should try to use `WZR + <imm>` addressing mode if the constant fits the range.	2025-08-11 10:19:26 +01:00
Nikita Popov	e92b7e9641	[CodeGen] Provide original IR type to CC lowering (NFC) (#152709 ) It is common to have ABI requirements for illegal types: For example, two i64 argument parts that originally came from an fp128 argument may have a different call ABI than ones that came from a i128 argument. The current calling convention lowering does not provide access to this information, so backends come up with various hacks to support it (like additional pre-analysis cached in CCState, or bypassing the default logic entirely). This PR adds the original IR type to InputArg/OutputArg and passes it down to CCAssignFn. It is not actually used anywhere yet, this just does the mechanical changes to thread through the new argument.	2025-08-11 08:57:53 +02:00
AZero13	e6b4daf48c	[AArch64] Support MI and PL (#150314 ) Now, why would we want to do this? There are a small number of places where this works: 1. It helps peepholeopt when less flag checking. 2. It allows the folding of things such as x - 0x80000000 < 0 to be folded to cmp x, register holding this value 3. We can refine the other passes over time for this.	2025-08-11 07:41:38 +01:00
David Green	26b302fd8b	[AArch64] Rename Cost -> PromotedCost to avoid shadowing error	2025-08-08 14:37:24 +01:00
David Green	7f1638efc1	[AArch64] Generalize costing for FP16 instructions (#150033 ) This extracts the code for modelling a fp16 operation as `fptrunc(fpop(fpext,fpext))` into a new function named getFP16BF16PromoteCost so that it can be reused by the arithmetic instructions. The function takes a lambda to calculate the cost of the operation with the promoted type.	2025-08-08 13:40:07 +01:00
Graham Hunter	de72cca671	[CostModel] Provide a default model for histogram intrinsics (#149348 ) Since we scalarize these intrinsics when the target does not support them, we should model that for costing purposes.	2025-08-08 11:00:00 +01:00
Nikita Popov	c23b4fbdbb	[IR] Remove size argument from lifetime intrinsics (#150248 ) Now that #149310 has restricted lifetime intrinsics to only work on allocas, we can also drop the explicit size argument. Instead, the size is implied by the alloca. This removes the ability to only mark a prefix of an alloca alive/dead. We never used that capability, so we should remove the need to handle that possibility everywhere (though many key places, including stack coloring, did not actually respect this).	2025-08-08 11:09:34 +02:00
Cullen Rhodes	e9d71efb83	[AArch64] Mark [usp]mull, [us]addl, [us]abdl as commutative (#152158 ) Fixes #61461.	2025-08-08 09:35:28 +01:00
David Green	229ab5aa2b	[AArch64] Drop flags from BSP pseudos (#151856 ) This prevents cases where some of the operands match from hitting verifier errors with kill flags. These nodes should have been removed earlier in most cases. Fixes the direct issue from #149380. #151855 cleans up the codegen.	2025-08-08 07:47:56 +01:00
David Green	6f272d1ecf	[AArch64] Move tryCombineToBSL. NFC This is for #151855, to make the changes more obvious.	2025-08-07 16:45:58 +01:00

1 2 3 4 5 ...

9481 Commits