llvm-project

Author	SHA1	Message	Date
Vyacheslav Levytskyy	fe7cb15606	[SPIR-V] Improve portability of the code (#123584 ) Adding SPIRV to LLVM_ALL_TARGETS (https://github.com/llvm/llvm-project/pull/119653) revealed a series of minor compilation problems and sanitizer complaints. This PR is to address the problem.	2025-01-20 12:05:15 +01:00
Akshat Oke	96c4f978d0	[AMDGPU][NewPM] Port SIOptimizeExecMasking to NPM (#123572 )	2025-01-20 16:34:01 +05:30
yingopq	754ed95b66	[Mips] Fix compiler crash when returning fp128 after calling a functi… (#117525 ) …on returning { i8, i128 } Fixes https://github.com/llvm/llvm-project/issues/96432.	2025-01-20 16:47:40 +08:00
ZhaoQi	333562e7ec	[LoongArch] Avoid compilation warning. NFC (#123553 ) Avoid `warning: enumerated mismatch in conditional expression: 'llvm::LoongArchISD::NodeType' vs 'llvm::ISD::NodeType'` while compiling `LoongArchISelLowering.cpp`.	2025-01-20 16:37:27 +08:00
ZhaoQi	ca4886bf96	[LoongArch] Impl TTI hooks for LoongArch to support LoopDataPrefetch pass (#118437 ) Inspired by https://reviews.llvm.org/D146600, this commit adds some TTI hooks for LoongArch to make LoopDataPrefetch pass really work. Including: - `getCacheLineSize()`: 64 for loongarch64. - `getPrefetchDistance()`: After testing SPEC CPU 2017, improvements taken by prefetching are more obvious when set PrefetchDistance to 200(results shown blow), although different benchmarks fit for different best choice. - `enableWritePrefetching()`: store prefetch is supported by LoongArch, so set WritePrefetching to true in default. - `getMinPrefetchStride()` and `getMaxPrefetchIterationsAhead()` still use default values: 1 and UINT_MAX, so not override them. After this commit, the test added by https://reviews.llvm.org/D146600 can generate llvm.prefetch intrinsic IR correctly. Results of spec2017rate benchmarks (testing date: ref, copies: 1): - For all C/C++ benchmarks, compared to O3+novec/lsx/lasx, prefetch can bring about -1.58%/0.31%/0.07% performance improvement for int benchmarks and 3.26%/3.73%/3.78% improvement for floating point benchmarks. (Only O3+novec+prefetch decreases when testing intrate.) - But prefetch results in performance reduction almost for every Fortran benchmark compiled by flang. While considering all C/C++/Fortran benchmarks, prefetch performance will decrease about 1% ~ 5%. FIXME: Keep `loongarch-enable-loop-data-prefetch` option default to false for now due to the bad effect for Fortran.	2025-01-20 16:20:15 +08:00
ZhaoQi	84220eccb6	[LoongArch] Add generation support for `preld` instruction (#118436 ) Instruction `preld` is used to prefetch one cache-line of data from memory in advance into the cache. This commit allows it to be generated automatically.	2025-01-20 16:11:09 +08:00
Hervé Poussineau	be68f35bf5	[MC][CodeGen][Mips] Add CodeView mapping (#120877 ) Also add support for new relocation types required by debug information. Constants have been taken from CodeView Symbolic Debug Information Specification.	2025-01-20 15:00:24 +08:00
Guy David	295d1c361e	[AArch64] apple-m4 & apple-a15 have ADRP+ADD fusion (#123504 )	2025-01-20 08:18:56 +02:00
ZhaoQi	0288d065ee	[LoongArch] Avoid scheduling relaxable code sequence and attach relax relocs (#121330 ) If linker relaxation enabled, relaxable code sequence expanded from pseudos should avoid being separated by instruction scheduling. This commit tags scheduling boundary for them to avoid being scheduled. (Except for `tls_le/tls_ie` and `call36/tail36`. Because `tls_le/tls_ie` can be scheduled and have no influence to relax, `call36/tail36` are expanded later in `LoongArchExpandPseudo` pass.) A new mask target-flag is added to attach relax relocs to the relaxable code sequence. (No need to add it for `tls_le` and `call36/tail36` because we can simply add relax relocs for them according to their relocs. But for other code sequence, such as `PCALA_{HI20/LO12}`, we must use the mask flag, mainly because relax should not be added when code model is large.) Because of the new mask target-flag, get "direct" flags is necessary when using their target-flags. In addition, code sequence after being optimized by `MergeBaseOffset` pass may not relaxable any more, so the relax "bitmask" flag should be removed.	2025-01-20 10:00:05 +08:00
Patryk Wychowaniec	814b34f31e	[AVR] Force relocations for non-encodable jumps (#121498 ) This commit changes the branch emission logic so that instead of throwing the "branch target out of range" error, we emit a relocation instead.	2025-01-20 09:23:57 +08:00
Simon Pilgrim	6adeda8f55	[X86] combinePTESTCC - fold PTESTC(PCMPEQ(X,0),-1) == PTESTZ(X,X) (#123466 ) Simplifies the hidden "all_of(X == 0)" pattern Fixes #123456	2025-01-19 13:09:17 +00:00
Craig Topper	4a486e773e	[CodeGen] Use Register/MCRegister::isPhysical. NFC	2025-01-18 23:37:03 -08:00
Carl Ritson	f811482a74	[AMDGPU] SIWholeQuadMode: Ensure earliest WQM entry point for PS (#123266 ) Ensure shaders running WQM (PS) enter at the earliest point irrespective of WQM marking.	2025-01-19 15:50:33 +09:00
ssijaric-nv	6789442eb2	[AArch64] Fix a corner case with large stack allocation (#122038 ) In the unlikely case where the stack size is greater than 4GB, we may run into the situation where the local stack size and the callee saved registers stack size get combined incorrectly when restoring the callee saved registers. This happens because the stack size in shouldCombineCSRLocalStackBumpInEpilogue is represented as an 'unsigned', but is passed in as an 'int64_t'. We end up with something like $fp, $lr = frame-destroy LDPXi $sp, 536870912 This change just makes 'shouldCombineCSRLocalStackBumpInEpilogue' match 'shouldCombineCSRLocalStackBump' where 'StackBumpBytes' is an 'uint64_t'	2025-01-18 22:09:25 -08:00
Simon Pilgrim	33f9d839ef	[X86] X86FixupVectorConstants - split ConvertToBroadcastAVX512 helper to handle single bitwidth at a time. Attempt 32-bit broadcasts first, and then fallback to 64-bit broadcasts on failure. We lose an explicit assertion for matching operand numbers but X86InstrFoldTables already does something similar. Pulled out of WIP patch #73509	2025-01-18 14:08:48 +00:00
David Green	d666616804	[AArch64] Fold swapped sub/SUBS conditions (#121412 ) This fold already exists in a couple places (DAG and CGP), where an icmps operands are swapped to allow CSE with a sub. They do not handle constants though. This patch adds an AArch64 version that can be more precise.	2025-01-18 12:12:51 +00:00
Simon Pilgrim	82be3adcff	[X86] Consistently use getVectorIdxConstant for element/subvector extract/insertion nodes. NFC. (#123312 ) Avoid the use of getIntPtrConstant for anything other than address pointer related code. Noticed while trying to use getVectorIdxConstant as a breakpoint.	2025-01-18 11:47:03 +00:00
Simon Pilgrim	26c9be2b8d	[X86] Only call combineBitcastToBoolVector after legalization (#123386 ) Prevents infinite loop between combineBitcastToBoolVector and hoistLogicOpWithSameOpcodeHands, which only performs the "logicop(bitcast(A),bitcast(B)) -> bitcast(logicop(A,B))" upto type legalization. combineBitcastToBoolVector doesn't care much as its mainly for AVX512 cleanup that X86DomainReassignment can't handle for us. Fixes #123333	2025-01-18 11:44:46 +00:00
Simon Pilgrim	67c3f2b430	[X86] mayFoldIntoStore - peek through oneuse bitcase users to find a store node (#123366 ) mayFoldIntoStore currently just checks the direct (oneuse) user of a SDValue to check its stored, which prevents cases where we bitcast the value prior to storing (usually the bitcast will be removed later). This patch peeks up through oneuse BITCAST nodes chain to see if its eventually stored. The main use of mayFoldIntoStore is v8i16 EXTRACT_VECTOR_ELT lowering which will only use PEXTRW/PEXTRB for index0 extractions (vs the faster MOVD) if the extracted value will be folded into a store on SSE41+ targets. Fixes #107086	2025-01-18 14:08:24 +05:30
Kristof Beyls	10cfd54e6a	[AArch64] Correct defs and uses on {PAC,AUT}I{A,B}171615 (#123354 ) I'm not adding tests for this, as I don't think we usually have tests to verify correct description of defs and uses in instructions? This fix will be tested when #122304 lands, as one of the regression tests in that PR fails without this fix.	2025-01-18 07:56:15 +01:00
Philip Reames	143c33c6df	[RISCV] Consider only legally typed splats to be legal shuffles (#123415 ) Given the comment, I'd expected test coverage. There was none so let's do the simple thing which benefits the one thing we have tests for.	2025-01-17 19:13:04 -08:00
Craig Topper	0c6e03eea0	[RISCV] Fold vp.store(vp.reverse(VAL), ADDR, MASK) -> vp.strided.store(VAL, NEW_ADDR, -1, MASK) (#123123 ) Co-authored-by: Brandon Wu <brandon.wu@sifive.com>	2025-01-17 14:22:25 -08:00
Farzon Lotfi	eddeb36cf1	[SPIRV] add pre legalization instruction combine (#122839 ) - Add the boilerplate to support instcombine in SPIRV - instcombine length(X-Y) to distance(X,Y) - switch HLSL's distance intrinsic to not special case for SPIRV. - fixes #122766 - This RFC we were requested to add in the infra for pattern matching: https://discourse.llvm.org/t/rfc-add-targetbuiltins-for-spirv-to-support-hlsl/83329/13	2025-01-17 14:46:14 -05:00
Shubham Sandeep Rastogi	ee1c852252	[DebugInfo][InstrRef] Treat ORRWrr as a copy instr (#123102 ) The insturction selector uses the `MachineFunction::copySalvageSSA` function to insert `DBG_PHIs` or identify a defining instruction for a copy-like instruction when finalizing Instruction References. AArch64 has the ORR instruction which is a logical OR with the variants ORRWrr which refers to a register to register variant, and ORRWrs which is a register to a shifted register variant. An ORRWrs where the shift amount is 0, and the zero register ($wzr) is used is considered a copy, for example: `$w0 = ORRWrs $wzr, killed $w3, 0` However an ORRWrr with a zero register is not considered a copy `$w0 = ORRWrr $wzr, killed $w3` This causes an issue in the livedebugvalues pass because in aarch64-isel the instruction is the ORRWrr variant, but is then changed to the ORRWrs variant before the livedebugvalues pass. This causes a mismatch between the two passes which leads to a crash in the livedebugvalues pass. This patch fixes the issue.	2025-01-17 09:27:36 -08:00
Steven Perron	4b692a95d1	[SPIRV] Expand RWBuffer load and store from HLSL (#122355 ) The code pattern that clang will generate for HLSL has changed from the original plan. This allows the SPIR-V backend to generate code for the current code generation. It looks for patterns of the form: ``` %1 = @llvm.spv.resource.handlefrombinding %2 = @llvm.spv.resource.getpointer(%1, index) load/store %2 ``` These three llvm-ir instruction are treated as a single unit that will 1. Generate or find the global variable identified by the call to `resource.handlefrombinding`. 2. Generate an OpLoad of the variable to get the handle to the image. 3. Generate an OpImageRead or OpImageWrite using that handle with the given index. This will generate the OpLoad in the same BB as the read/write. Note: Now that `resource.handlefrombinding` is not processed on its own, many existing tests had to be removed. We do not have intrinsics that are able to use handles to sampled images, input attachments, etc., so we cannot generate the load of the handle. These tests are removed for now, and will be added when those resource types are fully implemented.	2025-01-17 12:22:28 -05:00
Cullen Rhodes	f719771f25	Revert "[AArch64] Combine and and lsl into ubfiz" (#123356 ) Reverts llvm/llvm-project#118974	2025-01-17 16:53:33 +00:00
Simon Pilgrim	76569025dd	[X86] Fold (v4i32 (scalar_to_vector (i32 (anyext (bitcast (f16)))))) -> (v4i32 bitcast (v8f16 scalar_to_vector)) (#123338 ) This pattern tends to appear during f16 -> f32 promotion Partially addresses the unnecessary XMM->GPR->XMM moves when working with f16 types (#107086)	2025-01-17 14:46:22 +00:00
Brox Chen	a18f4bdb18	[AMDGPU][True16][MC] true16 for v_cmpx_lt_f16 (#122936 ) True16 format for v_cmpx_lt_f16. Update VOPCX t16 and fake16 pseudo.	2025-01-17 09:38:52 -05:00
Brox Chen	703e9e97d9	[AMDGPU][True16][CodeGen] true16 codegen for bswap (#122849 ) true16 codegen pattern for bswap	2025-01-17 09:36:55 -05:00
Phoebe Wang	48803bc8c7	[X86][AMX-AVX512][NFC] Remove P from intrinsic and instruction name (#123270 ) Ref.: https://cdrdv2.intel.com/v1/dl/getContent/828965	2025-01-17 22:21:19 +08:00
Wesley Wiser	41f430a48d	[X86] Don't fold very large offsets into addr displacements during ISel (#121678 ) Doing so can cause the resulting displacement after frame layout to become inexpressible (or cause over/underflow currently during frame layout). Fixes the error reported in https://github.com/llvm/llvm-project/pull/101840#issuecomment-2306975944.	2025-01-17 20:09:00 +07:00
Simon Pilgrim	a864906772	[X86] Fix logical operator warnings. NFC.	2025-01-17 11:55:22 +00:00
Stanislav Mekhanoshin	21704a685d	[AMDGPU] Fix printing hasInitWholeWave in mir (#123232 )	2025-01-17 03:00:02 -08:00
Simon Pilgrim	ad282f4c1f	[X86] Rename combineScalarToVector to combineSCALAR_TO_VECTOR. NFC. Match the file style of using the ISD NodeType name for the combine/lower method name.	2025-01-17 10:50:13 +00:00
Alexandros Lamprineas	831527a5ef	[FMV][GlobalOpt] Statically resolve calls to versioned functions. (#87939 ) To deduce whether the optimization is legal we need to compare the target features between caller and callee versions. The criteria for bypassing the resolver are the following: * If the callee's feature set is a subset of the caller's feature set, then the callee is a candidate for direct call. * Among such candidates the one of highest priority is the best match and it shall be picked, unless there is a version of the callee with higher priority than the best match which cannot be picked from a higher priority caller (directly or through the resolver). * For every higher priority callee version than the best match, there is a higher priority caller version whose feature set availability is implied by the callee's feature set. Example: Callers and Callees are ordered in decreasing priority. The arrows indicate successful call redirections. Caller Callee Explanation ========================================================================= mops+sve2 --+--> mops all the callee versions are subsets of the \| caller but mops has the highest priority \| mops --+ sve2 between mops and default callees, mops wins sve sve between sve and default callees, sve wins but sve2 does not have a high priority caller default -----> default sve (callee) implies sve (caller), sve2(callee) implies sve (caller), mops(callee) implies mops(caller)	2025-01-17 10:49:43 +00:00
Benjamin Maxwell	32a4650f3c	[AArch64] Avoid hardcoding spill size/align in FrameLowering (NFC) (#123080 ) This is already defined for each register class in AArch64RegisterInfo, not hardcoding it here makes these values easier to change (perhaps based on hardware mode).	2025-01-17 10:10:21 +00:00
Benjamin Maxwell	2c9dc089fd	[AArch64] Use spill size when calculating callee saves size (NFC) (#123086 ) This is an NFC right now, as currently, all register and spill sizes are the same, but the spill size is the correct size to use here.	2025-01-17 10:09:31 +00:00
Phoebe Wang	fbb9d49506	[X86][APX] Support APX + AMX-MOVRS/AMX-TRANSPOSE (#123267 ) Ref.: https://cdrdv2.intel.com/v1/dl/getContent/784266	2025-01-17 17:51:42 +08:00
ZhaoQi	31b62e2d3d	[LoongArch] Add relax relocations for tls_le code sequence (#121329 ) This commit add relax relocations for `tls_le` code sequence. Handwritten assembly and generating source code by clang are both affected. Scheduled `tls_le` code sequence can be relaxed normally and we can add relax relocs when code emitting according to their relocs. Other relaxable macros' code sequence cannot simply add relax relocs according to their relocs, such as `PCALA_{HI20/LO12}`, we do not want to add relax relocs when code model is large. This will be implemented in later commit.	2025-01-17 17:30:57 +08:00
ZhaoQi	89e3a649f2	[LoongArch] Emit R_LARCH_RELAX when expanding some macros (#120067 ) Emit `R_LARCH_RELAX` relocations when expanding some macros, including: - `la.tls.ie`, `la.tls.ld`, `la.tls.gd`, `la.tls.desc`, - `call36`, `tail36`. Other macros that need to emit `R_LARCH_RELAX` relocations was implemented in https://github.com/llvm/llvm-project/pull/72961, including: - `la.local`, `la.pcrel`, `la.pcrel` expanded as `la.abs`, `la`, `la.global`, `la/la.global` expanded as `la.pcrel`, `la.got`. Note: `la.tls.le` macro can be relaxed when expanded with `R_LARCH_TLS_LE_{HI20/ADD/LO12}_R` relocations. But if we do so, previously handwritten assembly code will occur error due to the redundant `add.{w/d}` followed by `la.tls.le`. So `la.tls.le` keeps to expands with `R_LARCH_TLS_LE_{HI20/LO12}`.	2025-01-17 17:29:22 +08:00
Will Froom	c8ba551da1	[AArch64] Return early rather than asserting when Size of value passed to targetShrinkDemandedConstant is not 32 or 64 (#123084 ) See https://github.com/llvm/llvm-project/issues/123029 for details.	2025-01-17 08:41:33 +00:00
Phoebe Wang	1274bca2ad	[X86][APX] Support APX + MOVRS (#123264 ) Ref.: https://cdrdv2.intel.com/v1/dl/getContent/784266	2025-01-17 16:06:31 +08:00
Kazu Hirata	bfb6bb69fd	[AMDGPU] Fix a warning This patch fixes: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:13908:46: error: comparison of integers of different signs: 'uint32_t' (aka 'unsigned int') and 'int' [-Werror,-Wsign-compare]	2025-01-16 22:40:08 -08:00
Vikram Hegde	225fc4f356	[AMDGPU][SDAG] Try folding "lshr i64 + mad" to "mad_u64_u32" (#119218 ) The intention is to use a "copy" instead of a "sub" to handle the high parts of 64-bit multiply for this specific case. This unlocks copy prop use cases where the copy can be reused by later multiply+add sequences if possible. Fixes: SWDEV-487672, SWDEV-487669	2025-01-17 11:09:39 +05:30
Matt Arsenault	ca95519704	AMDGPU: Implement isExtractVecEltCheap (#122460 ) Once again we have excessive TLI hooks with bad defaults. Permit this for 32-bit element vectors, which are just use-different-register. We should permit 16-bit vectors as cheap with legal packed instructions, but I see some mixed improvements and regressions that need investigation.	2025-01-17 08:38:01 +07:00
Luke Lau	a761e26b23	[RISCV] Allow non-loop invariant steps in RISCVGatherScatterLowering (#122244 ) The motivation for this is to allow us to match strided accesses that are emitted from the loop vectorizer with EVL tail folding (see #122232) In these loops the step isn't loop invariant and is based off of @llvm.experimental.get.vector.length. We can relax this as long as we make sure to construct the updates after the definition inside the loop, instead of the preheader. I presume the restriction was previously added so that the step would dominate the insertion point in the preheader. I can't think of why it wouldn't be safe to calculate it in the loop otherwise.	2025-01-17 08:58:56 +08:00
Philip Reames	bb6e94a05d	[RISCV] Custom legalize <N x i128>, <4 x i256>, etc.. shuffles (#122352 ) I have a particular user downstream who likes to write shuffles in terms of unions involving _BitInt(128) types. This isn't completely crazy because there's a bunch of code in the wild which was written with SSE in mind, so 128 bits is a common data fragment size. The problem is that generic lowering scalarizes this to ELEN, and we end up with really terrible extract/insert sequences if the i128 shuffle is between other (non-i128) operations. I explored trying to do this via generic lowering infrastructure, and frankly got lost. Doing this a target specific DAG is a bit ugly - really, there's nothing hugely target specific here - but oh well. If reviewers prefer, I could probably phrase this as a generic DAG combine, but I'm not sure that's hugely better. If reviewers have a strong preference on how to handle this, let me know, but I may need a bit of help. A couple notes: * The argument passing weirdness is due to a missing combine to turn a build_vector of adjacent i64 loads back into a vector load. I'm a bit surprised we don't get that, but the isel output clearly has the build_vector at i64. * The splat case I plan to revisit in another patch. That's a relatively common pattern, and the fact I have to scalarize that to avoid an infinite loop is non-ideal.	2025-01-16 14:55:45 -08:00
Brox Chen	8a0c2e7567	[AMDGPU][True16][MC][CodeGen] true16 for v_cndmask_b16 (#119736 ) Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and fake16 flow. Since we are replacing `v_cndmask_b16` to `v_cndmask_b16_t16/fake16`, we have to at least update the fake16 codeGen to get codeGen test passing. For this case, we have to update the true16 and with fake16 together, otherwise some of the true16 tests will fail	2025-01-16 17:18:28 -05:00
Princeton Ferro	3ba339b5e7	[NVPTX] Improve support for {ex2,lg2}.approx (#120519 ) - Add support for `@llvm.exp2()`: - LLVM: `float` -> PTX: `ex2.approx{.ftz}.f32` - LLVM: `half` -> PTX: `ex2.approx.f16` - LLVM: `<2 x half>` -> PTX: `ex2.approx.f16x2` - LLVM: `bfloat` -> PTX: `ex2.approx.ftz.bf16` - LLVM: `<2 x bfloat>` -> PTX: `ex2.approx.ftz.bf16x2` - Any operations with non-native vector widths are expanded. On targets not supporting f16/bf16, values are promoted to f32. - Add CONDITIONAL support for `@llvm.log2()` [^1]: - LLVM: `float` -> PTX: `lg2.approx{.ftz}.f32` - Support for f16/bf16 is emulated by promoting values to f32. [1]: CUDA implements `exp2()` with `ex2.approx` but `log2()` is implemented differently, so this is off by default. To enable, use the flag `-nvptx-approx-log2f32`.	2025-01-16 12:21:32 -08:00
Raphael Moreira Zinsly	01d7f434d2	[RISCV] Stack clash protection for dynamic alloca (#122508 ) Create a probe loop for dynamic allocation and add the corresponding SelectionDAG support in order to use it.	2025-01-16 11:58:42 -08:00

1 2 3 4 5 ...

82186 Commits