llvm-project

Author	SHA1	Message	Date
Matt Arsenault	f4598194b5	DAG: Fold bitcast of scalar_to_vector to anyext (#122660 ) scalar_to_vector is difficult to make appear and test, but I found one case where this makes an observable difference. It fires more often than this in the test suite, but most of them have no net result in the final code. This helps reduce regressions in a future commit.	2025-01-13 19:38:58 +07:00
Matt Arsenault	e9a55770dc	AMDGPU: Add gfx9 run line to scalar_to_vector test (#122659 )	2025-01-13 19:35:56 +07:00
Akshat Oke	73b0e8a191	[AMDGPU][NewPM] Port AMDGPUOpenCLEnqueuedBlockLowering to NPM (#122434 )	2025-01-13 17:52:30 +05:30
Sander de Smalen	3efe83291f	[AArch64] Fix chain for calls from agnostic-ZA functions. The lowering code was using the wrong chain value, which meant that the 'smstart' after the call from streaming agnostic-ZA functions -> non-streaming private-ZA functions was incorrectly removed from the DAG.	2025-01-13 12:06:50 +00:00
Simon Pilgrim	6c5941b09f	[X86] subvectorwise-store-of-vector-splat.ll - regenerate VPTERNLOG comments	2025-01-13 11:36:58 +00:00
quic_hchandel	171d3edd05	[RISCV] Add Qualcomm uC Xqciint (Interrupts) extension (#122256 ) This extension adds eleven instructions to accelerate interrupt servicing. The current spec can be found at: https://github.com/quic/riscv-unified-db/releases/latest This patch adds assembler only support. --------- Co-authored-by: Harsh Chandel <hchandel@qti.qualcomm.com>	2025-01-13 16:36:05 +05:30
Durgadoss R	7e2eb0f83e	[NVPTX] Add float to tf32 conversion intrinsics (#121507 ) This patch adds the missing variants of float to tf32 conversion intrinsics, with their corresponding lit tests. PTX Spec link: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-01-13 16:17:42 +05:30
Oliver Stannard	e2a071ece5	[MachineCP] Correctly handle register masks and sub-registers (#122472 ) When passing an instruction with a register mask, the machine copy propagation pass was dropping the information about some copy instructions which define a register which is preserved by the mask, because that register overlaps a register which is partially clobbered by it. This resulted in a miscompilation for AArch64, because this caused a live copy to be considered dead. The fix is to clobber register masks by finding the set of reg units which is preserved by the mask, and clobbering all units not in that set.	2025-01-13 09:55:08 +00:00
Akshat Oke	7bf1cb702b	[AMDGPU][NewPM] Port AMDGPURemoveIncompatibleFunctions to NPM (#122261 )	2025-01-13 10:11:40 +05:30
Shilei Tian	f15da5fb78	[AMDGPU] Fix an invalid cast in `AMDGPULateCodeGenPrepare::visitLoadInst` (#122494 ) Fixes: SWDEV-507695	2025-01-12 23:40:25 -05:00
Pengcheng Wang	681c4a2068	Reapply "[RISCV] Rework memcpy test (#120364 )" Use descriptive names and add more cases. This recommits 59bba39 which was reverted in 4637c77.	2025-01-13 12:06:26 +08:00
Pengcheng Wang	4637c77746	Revert "[RISCV] Rework memcpy test" (#122662 ) Reverts llvm/llvm-project#120364 The test should be updated due to some recent changes.	2025-01-13 11:36:37 +08:00
Pengcheng Wang	59bba39a69	[RISCV] Rework memcpy test (#120364 ) Use descriptive names and add more cases.	2025-01-13 11:28:24 +08:00
Justin Bogner	0e51b54b7a	[DirectX] Implement the resource.store.rawbuffer intrinsic (#121282 ) This introduces `@llvm.dx.resource.store.rawbuffer` and generalizes the buffer store docs under DirectX/DXILResources. Fixes #106188	2025-01-12 18:52:20 -07:00
Simon Pilgrim	be6c752e15	[X86] X86FixupVectorConstantsPass - use VPMOVSX/ZX extensions for PS/PD domain moves (#122601 ) For targets with free domain moves, or AVX512 support, allow the use of VPMOVSX/ZX extension loads to reduce the load sizes. I've limited this to extension to i32/i64 types as we're mostly interested in shuffle mask loading here, but we could include i16 types as well just as easily. Inspired by a regression on #122485	2025-01-12 15:59:05 +00:00
Daniel Paoliello	5ee0a71df9	[aarch64][win] Add support for import call optimization (equivalent to MSVC /d2ImportCallOptimization) (#121516 ) This change implements import call optimization for AArch64 Windows (equivalent to the undocumented MSVC `/d2ImportCallOptimization` flag). Import call optimization adds additional data to the binary which can be used by the Windows kernel loader to rewrite indirect calls to imported functions as direct calls. It uses the same [Dynamic Value Relocation Table mechanism that was leveraged on x64 to implement `/d2GuardRetpoline`](https://techcommunity.microsoft.com/blog/windowsosplatform/mitigating-spectre-variant-2-with-retpoline-on-windows/295618). The change to the obj file is to add a new `.impcall` section with the following layout: ```cpp // Per section that contains calls to imported functions: // uint32_t SectionSize: Size in bytes for information in this section. // uint32_t Section Number // Per call to imported function in section: // uint32_t Kind: the kind of imported function. // uint32_t BranchOffset: the offset of the branch instruction in its // parent section. // uint32_t TargetSymbolId: the symbol id of the called function. ``` NOTE: If the import call optimization feature is enabled, then the `.impcall` section must be emitted, even if there are no calls to imported functions. The implementation is split across a few parts of LLVM: * During AArch64 instruction selection, the `GlobalValue` for each call to a global is recorded into the Extra Information for that node. * During lowering to machine instructions, the called global value for each call is noted in its containing `MachineFunction`. * During AArch64 asm printing, if the import call optimization feature is enabled: - A (new) `.impcall` directive is emitted for each call to an imported function. - The `.impcall` section is emitted with its magic header (but is not filled in). 
* During COFF object writing, the `.impcall` section is filled in based on each `.impcall` directive that was encountered. The `.impcall` section can only be filled in when we are writing the COFF object as it requires the actual section numbers, which are only assigned at that point (i.e., they don't exist during asm printing). I had tried to avoid using the Extra Information during instruction selection and instead implement this either purely during asm printing or in a `MachineFunctionPass` (as suggested [on the forums](https://discourse.llvm.org/t/design-gathering-locations-of-instructions-to-emit-into-a-section/83729/3)) but this was not possible due to how loading and calling an imported function works on AArch64. Specifically, they are emitted as `ADRP` + `LDR` (to load the symbol) then a `BR` (to do the call), so at the point when we have machine instructions, we would have to work backwards through the instructions to discover what is being called. An initial prototype did work by inspecting instructions; however, it didn't correctly handle the case where the same function was called twice in a row, which caused LLVM to elide the `ADRP` + `LDR` and reuse the previously loaded address. Worse than that, sometimes for the double-call case LLVM decided to spill the loaded address to the stack and then reload it before making the second call. So, instead of trying to implement logic to discover where the value in a register came from, I instead recorded the symbol being called at the last place where it was easy to do: instruction selection.	2025-01-11 21:30:17 -08:00
Austin Kerbow	657fb4433e	[AMDGPU] Add target hook to isGlobalMemoryObject (#112781 ) We want special handling for IGLP instructions in the scheduler but they should still be treated like they have side effects by other passes. Add a target hook to the ScheduleDAGInstrs DAG builder so that we have more control over this.	2025-01-11 09:57:57 -08:00
David Green	ab9a80a3ad	[DAG] Allow AssertZExt to scalarize. (#122463 ) With range and undef metadata on a call we can have vector AssertZExt generated on a target with no vector operations. The AssertZExt needs to scalarize to a normal `AssertZext tin, ValueType`. I have added AssertSext too, although I do not have a test case. Fixes #110374	2025-01-11 16:29:06 +00:00
Marius Kamp	1eed46960c	[AArch64] Eliminate Common Subexpression of CSEL by Reassociation (#121350 ) If we have a CSEL instruction that depends on the flags set by a (SUBS x c) instruction and the true and/or false expression is (add (add x y) -c), we can reassociate the latter expression to (add (SUBS x c) y) and save one instruction. Proof for the basic transformation: https://alive2.llvm.org/ce/z/-337Pb We can extend this transformation for slightly different constants. For example, if we have (add (add x y) -(c-1)) and the comparison x <u c, we can transform the comparison to x <=u c-1 to eliminate the comparison instruction, too. Similarly, we can transform (x == 0) to (x <u 1). Proofs for the transformations that alter the constants: https://alive2.llvm.org/ce/z/3nVqgR Fixes #119606.	2025-01-11 16:26:11 +00:00
Simon Pilgrim	70f37321de	[X86] avx512-build-vector.ll - regenerate VPTERNLOG comments	2025-01-11 15:02:53 +00:00
Simon Pilgrim	6078815498	[X86] avx512-mask-op.ll - regenerate VPTERNLOG comments	2025-01-11 15:02:52 +00:00
Simon Pilgrim	7b184687dd	[X86] vselect-avx.ll - regenerate VPTERNLOG comments	2025-01-11 15:02:52 +00:00
Simon Pilgrim	78953433a5	[X86] vector popcnt tests - regenerate VPTERNLOG comments	2025-01-11 15:02:52 +00:00
Simon Pilgrim	b622cc67d0	[X86] LowerCTPOP - check if the operand is a constant when collecting KnownBits Under certain circumstances, lowering of other instructions can result in computeKnownBits being able to detect a constant that it couldn't previously. Fixes #122580	2025-01-11 13:41:50 +00:00
Alina Sbirlea	29e5c1c927	[Hexagon] Fix test after 9d7df23f4d6537752854d54b0c4c583512b930d0	2025-01-10 13:03:28 -08:00
Austin Kerbow	2e5c298281	[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167 ) Add a prologue to the kernel entry to handle cases where code designed for kernarg preloading is executed on hardware equipped with incompatible firmware. If hardware has compatible firmware the 256 bytes at the start of the kernel entry will be skipped. This skipping is done automatically by hardware that supports the feature. A pass is added which is intended to be run at the very end of the pipeline to avoid any optimizations that would assume the prologue is a real predecessor block to the actual code start. In reality we have two possible entry points for the function. 1. The optimized path that supports kernarg preloading which begins at an offset of 256 bytes. 2. The backwards compatible entry point which starts at offset 0.	2025-01-10 11:39:02 -08:00
Farzon Lotfi	b900379e26	[HLSL] Reapply Move length support out of the DirectX Backend (#121611 ) (#122337 ) ## Changes - Delete DirectX length intrinsic - Delete HLSL length lang builtin - Implement length algorithm entirely in the header. ## History - In the past if an HLSL intrinsic lowered to either a spirv op code or a DXIL opcode we represented it with intrinsics ## Why we are moving away? - To make HLSL apis more portable the team decided that it makes sense for some intrinsics to be defined only in the header. - Since there tends to be more SPIRV opcodes than DXIL opcodes the plan is to support SPIRV opcodes either with target specific builtins or via pattern matching.	2025-01-10 14:16:27 -05:00
Raphael Moreira Zinsly	6f53886a9a	[RISCV] Add stack clash vector support (#119458 ) Use the probe loop structure to allocate vector code in the stack as well. We add the pseudo instruction RISCV::PROBED_STACKALLOC_RVV to differentiate from the normal loop.	2025-01-10 09:48:21 -08:00
Durgadoss R	372044ee09	[NVPTX] Add TMA Bulk Copy intrinsics (#122344 ) PR #96083 added intrinsics for async copy of 'tensor' data using TMA. Following a similar design, this PR adds intrinsics for async copy of bulk data (non-tensor variants) through TMA. * These intrinsics optionally support multicast and cache_hints, as indicated by the boolean arguments at the end of the intrinsics. * The backend looks through these flag arguments and lowers to the appropriate PTX instructions. * Lit tests are added for all combinations of these intrinsics in cp-async-bulk.ll. * The generated PTX is verified with a 12.3 ptxas executable. * Added docs for these intrinsics in NVPTXUsage.rst file. PTX Spec reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-01-10 22:31:53 +05:30
Matt Arsenault	7ebf0df409	AMDGPU: Test gfx940 mfma intrinsics on gfx950 This requires splitting the xf32 cases into a separate file	2025-01-10 23:16:25 +07:00
Santanu Das	9d7df23f4d	[Hexagon] Add missing pattern for v8i1 type (#120703 ) HexagonISD::PFALSE and PTRUE patterns do not form independently in general as they are treated like operands of all 0s or all 1s. Eg: i32 = transfer HEXAGONISD::PFALSE. In this case, v8i1 = HEXAGONISD::PFALSE is formed independently without accompanying opcode. This patch adds a pattern to transfer all 0s or all 1s to a scalar register and then use that register and this PFALSE/PTRUE opcode to transfer to a predicate register like v8i1.	2025-01-10 09:54:02 -06:00
Simon Pilgrim	35a392553d	[X86] widenSubVector - widen from smaller build vector if the upper elements are already the same padding elements (#122445 ) Further simplifies some shuffle masks to help additional combines	2025-01-10 15:13:53 +00:00
Philip Reames	24bb180e8a	[RISCV] Attempt to widen SEW before generic shuffle lowering (#122311 ) This takes inspiration from AArch64 which does the same thing to assist with zip/trn/etc. Doing this recursion unconditionally when the mask allows is slightly questionable, but seems to work out okay in practice. As a bit of context, it's helpful to realize that we have existing logic in both DAGCombine and InstCombine which mutates the element width in an analogous manner. However, that code has two restrictions which prevent it from handling the motivating cases here. First, it only triggers if there is a bitcast involving a different element type. Second, the matcher used considers a partially undef wide element to be a non-match. I considered trying to relax those assumptions, but the information loss for undef in mid-level opt seemed more likely to open a can of worms than I wanted.	2025-01-10 07:12:24 -08:00
David Green	5a069eac5f	[AArch64] Don't try to sink and(load) (#122274 ) If we sink the and in and(load), CGP can hoist it back again to the load, getting into an infinite loop. This prevents sinking the and in this case. Fixes #122074	2025-01-10 11:54:46 +00:00
Mirko Brkušanin	3def49cb64	[AMDGPU] Remove s_wakeup_barrier instruction (#122277 )	2025-01-10 11:30:22 +01:00
Usha Gupta	4c853be667	[AArch64] Replace uaddlv with addv for popcount operation (#121934 ) Replace `uaddlv` with `addv` for popcount operation as it is a simpler operation. On certain platforms like Cortex-A510, `addv` has a latency of 3 cycles whereas `uaddlv` has a latency of 4 cycles. GCC generates `addv` as well: https://godbolt.org/z/MnYG9jcEo	2025-01-10 09:47:50 +00:00
Nikita Popov	eeac0ffaf4	Revert "[MachineLICM] Use `RegisterClassInfo::getRegPressureSetLimit` (#119826 )" This reverts commit b4e17d4a314ed87ff6b40b4b05397d4b25b6636a. This causes a large compile-time regression.	2025-01-10 09:05:06 +01:00
Jakub Chlanda	01a7d4e26b	[AMDGPU] Allow selection of BITOP3 for some 2 opcodes and B32 cases (#122267 ) This came up in downstream static analysis as dead code. Admittedly, it depends on what the intention was when checking for [`if (NumOpcodes == 2 && IsB32)`](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp#L3792C3-L3792C32) and I took a guess that for certain cases the selection should take place. If that's incorrect, that whole if statement can be removed, as it is after a check for: [`if (NumOpcodes < 4)`](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp#L3788)	2025-01-10 07:49:11 +01:00
Heejin Ahn	a8e1135baa	[WebAssembly] Add -wasm-use-legacy-eh option (#122158 ) This replaces the existing `-wasm-enable-exnref` with the `-wasm-use-legacy-eh` option, in an effort to make the new standardized exnref proposal the 'default' state, so that the legacy proposal needs to be separately enabled with an option. But given that most users haven't switched to the new proposal and major web browsers haven't turned it on by default, this `-wasm-use-legacy-eh` is turned on by default, so nothing will change for now from a functionality perspective. This also removes the restriction that `-wasm-enable-exnref` be only used with `-wasm-enable-eh` because this option is enabled by default. This option does not have any effect when `-wasm-enable-eh` is not used.	2025-01-09 22:36:10 -08:00
ssijaric-nv	a4472c7dac	[AArch64] Fix the size passed to __trampoline_setup (#118234 ) The trampoline size is 36 bytes on AArch64. The runtime function __trampoline_setup aborts as it expects the trampoline size of at least 36 bytes, and the size passed is 20 bytes. Fix the inconsistency in AArch64TargetLowering::LowerINIT_TRAMPOLINE.	2025-01-09 22:09:08 -08:00
Chinmay Deshpande	211bcf67aa	[AMDGPU] Implement IR variant of isFMAFasterThanFMulAndFAdd (#121465 )	2025-01-10 09:05:41 +05:30
Michael Maitland	d0373dbe7c	[RISCV][VLOPT] Add vadc to isSupportedInstr (#122345 )	2025-01-09 19:44:40 -05:00
Michael Maitland	04e54cc19f	[RISCV][VLOPT] Add Vector Single-Width Averaging Add and Subtract to isSupportedInstr (#122351 )	2025-01-09 19:39:12 -05:00
Thurston Dang	4f42e16516	[hwasan] Omit tag check for null pointers (#122206 ) If the pointer to be checked is statically known to be zero, the tag check will always pass since: 1) the tag is zero 2) shadow memory for address 0 is initialized to 0 and never updated. We can therefore elide the tag check. We perform the elision in two places: 1) the HWASan pass 2) when lowering the CHECK_MEMACCESS intrinsic. Conceivably, the HWASan pass may encounter a "cannot currently statically prove to be null" pointer (and is therefore unable to omit the intrinsic) that later optimization passes convert into a statically known-null pointer. As a last line of defense, we perform elision here too. This also updates the tests from https://github.com/llvm/llvm-project/pull/122186	2025-01-09 13:48:26 -08:00
Michael Maitland	328c3a843f	[RISCV][VLOPT] Add vmerge to isSupportedInstr (#122340 )	2025-01-09 16:10:40 -05:00
Nico Weber	9ec92873ec	Revert "[HLSL] Move length support out of the DirectX Backend (#121611 )" This reverts commit a6b7181733c83523a39d4f4e788c6b7a227d477d. Breaks Clang :: CodeGenHLSL/builtins/length.hlsl, see https://github.com/llvm/llvm-project/pull/121611#issuecomment-2581004278	2025-01-09 14:19:03 -05:00
Michael Maitland	5f70fea79f	[RISCV][VLOPT] Add Vector Floating-Point Compare Instructions to getSupportedInstr	2025-01-09 10:50:32 -08:00
Michael Maitland	b419edeec3	[RISCV][VLOPT] Add widening floating point multiply to isSupportedInstr	2025-01-09 10:50:32 -08:00
Michael Maitland	a484fa1d0a	[RISCV][VLOPT] Add floating point multiply divide instructions to getSupportedInstr	2025-01-09 10:50:32 -08:00
Michael Maitland	8beb9d393d	[RISCV][VLOPT] Add vector widening floating point add subtract instructions to isSupportedInstr	2025-01-09 10:50:31 -08:00
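
The `.impcall` layout described in commit 5ee0a71df9 above can be pictured with a small encoder. This is only a sketch and not the LLVM implementation: `ImportCall`, `appendU32`, and `encodeSectionBlock` are hypothetical names, the magic header mentioned in the commit is left out because its contents are not given there, and it is an assumption that `SectionSize` counts the whole per-section block including its two header words.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record for one call to an imported function.
struct ImportCall {
  uint32_t Kind;           // kind of imported function (values not given in the commit)
  uint32_t BranchOffset;   // offset of the branch instruction in its parent section
  uint32_t TargetSymbolId; // symbol id of the called function
};

// Appends a 32-bit value in little-endian byte order.
inline void appendU32(std::vector<uint8_t> &Out, uint32_t V) {
  for (int I = 0; I < 4; ++I)
    Out.push_back(static_cast<uint8_t>(V >> (8 * I)));
}

// Encodes the block for one section that contains calls to imported functions.
// ASSUMPTION: SectionSize covers the two header words plus all call records.
inline std::vector<uint8_t>
encodeSectionBlock(uint32_t SectionNumber, const std::vector<ImportCall> &Calls) {
  std::vector<uint8_t> Out;
  const uint32_t Size = static_cast<uint32_t>(8 + 12 * Calls.size());
  appendU32(Out, Size);
  appendU32(Out, SectionNumber);
  for (const ImportCall &C : Calls) {
    appendU32(Out, C.Kind);
    appendU32(Out, C.BranchOffset);
    appendU32(Out, C.TargetSymbolId);
  }
  return Out;
}
```

Calling `encodeSectionBlock(3, {{1, 0x10, 7}})` would yield a 20-byte block: two header words followed by one three-word call record.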
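
The arithmetic identities behind the CSEL reassociation in commit 1eed46960c above can be spot-checked in plain C++ with wrapping unsigned arithmetic. This is an illustrative sketch, not the DAG combine itself; the function names are hypothetical, and the Alive2 links in the commit remain the actual proofs.

```cpp
#include <cstdint>

// Original form: (add (add x y) -c), using wrapping 64-bit arithmetic.
constexpr uint64_t addThenSubtract(uint64_t X, uint64_t Y, uint64_t C) {
  return (X + Y) + (0 - C);
}

// Reassociated form: (add (sub x c) y). The inner subtraction is the value
// a SUBS x, c already computes, so it can be shared with the comparison.
constexpr uint64_t subtractThenAdd(uint64_t X, uint64_t Y, uint64_t C) {
  return (X - C) + Y;
}

// x <u c is equivalent to x <=u c-1, provided c != 0.
constexpr bool lessThan(uint64_t X, uint64_t C) { return X < C; }
constexpr bool lessEqualPred(uint64_t X, uint64_t C) { return X <= C - 1; }

// x == 0 is equivalent to x <u 1.
constexpr bool isZero(uint64_t X) { return X == 0; }
constexpr bool belowOne(uint64_t X) { return X < 1; }

// Compile-time spot checks, including a case that wraps around.
static_assert(addThenSubtract(5, 7, 9) == subtractThenAdd(5, 7, 9), "no wrap");
static_assert(addThenSubtract(1, 2, 10) == subtractThenAdd(1, 2, 10), "wraps");
static_assert(lessThan(4, 5) == lessEqualPred(4, 5), "x < c");
static_assert(lessThan(5, 5) == lessEqualPred(5, 5), "x == c");
static_assert(isZero(0) == belowOne(0) && isZero(3) == belowOne(3), "zero test");
```

The checks only sample a few values; they show why the rewrite is plausible rather than prove it for all inputs.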
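
For the popcount change in commit 4c853be667 above, one way to see why a non-widening reduction is sufficient is that the per-byte counts of a 128-bit value sum to at most 16 * 8 = 128, which fits in 8 bits. This reasoning is my reading and is not stated in the commit, which cites simplicity and latency. The sketch below models the two reductions in scalar C++ with hypothetical helper names; it does not use the actual NEON instructions.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Counts the set bits of one byte by clearing the lowest set bit each step.
inline uint8_t popcount8(uint8_t B) {
  uint8_t N = 0;
  for (; B != 0; B &= static_cast<uint8_t>(B - 1))
    ++N;
  return N;
}

// Models a non-widening horizontal add: the accumulator stays 8 bits wide.
inline uint8_t sumNarrow(const Bytes16 &V) {
  uint8_t S = 0;
  for (uint8_t B : V)
    S = static_cast<uint8_t>(S + popcount8(B));
  return S;
}

// Models a widening horizontal add: the accumulator is 16 bits wide.
inline uint16_t sumWide(const Bytes16 &V) {
  uint16_t S = 0;
  for (uint8_t B : V)
    S = static_cast<uint16_t>(S + popcount8(B));
  return S;
}
```

Because the total never exceeds 128, `sumNarrow` and `sumWide` agree for every input, including the all-ones vector.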

56904 Commits