llvm-project

Author	SHA1	Message	Date
XinWang10	dd6fec5d4f	[X86][APX]Support lowering for APX promoted AMX-TILE instructions (#78689 ) The enc/dec of promoted AMX-TILE instructions was added in https://github.com/llvm/llvm-project/pull/76210. This patch supports lowering for promoted AMX-TILE instructions and integrates the tests into the existing tests.	2024-01-22 11:33:23 +08:00
XinWang10	d3cd1ce6ab	[X86] Add lowering tests for promoted CMPCCXADD and update CC representation (#78685 ) https://github.com/llvm/llvm-project/pull/76125 added the enc/dec for CMPCCXADD instructions. This patch 1. adds a lowering test for promoted CMPCCXADD and 2. updates the representation of the condition code for promoted CMPCCXADD to align with the existing one.	2024-01-22 11:32:03 +08:00
Emma Pilkington	bc82cfb38d	[AMDGPU] Add an asm directive to track code_object_version (#76267 ) Named '.amdhsa_code_object_version'. This directive sets the e_ident[ABIVERSION] in the ELF header, and should be used as the assumed COV for the rest of the asm file. This commit also weakens the --amdhsa-code-object-version CL flag. Previously, the CL flag took precedence over the IR flag. Now the IR flag/asm directive take precedence over the CL flag. This is implemented by merging a few COV-checking functions in AMDGPUBaseInfo.h.	2024-01-21 11:54:47 -05:00
Fangrui Song	d0230446d2	[AArch64] Remove non-sensible define nonlazybind test nonlazybind is for declarations, not for definitions. We could test the behavior, but the output would be misleading.	2024-01-20 23:39:07 -08:00
Fangrui Song	f9614b328a	[AArch64] Improve nonlazybind test Prepare for -fno-plt implementation.	2024-01-20 22:16:29 -08:00
Kerry McLaughlin	a8a3711e74	[AArch64][SME2] Preserve ZT0 state around function calls (#78321 ) If a function has ZT0 state and calls a function which does not preserve ZT0, the caller must save and restore ZT0 around the call. If the caller shares ZT0 state and the callee is not shared ZA, we must additionally call SMSTOP/SMSTART ZA around the call. This patch adds new AArch64ISDNodes for spilling & filling ZT0. Where requiresPreservingZT0 is true, ZT0 state will be preserved across a call.	2024-01-20 12:06:00 +00:00
Jay Foad	63d7ca924f	[AMDGPU] Add GFX12 llvm.amdgcn.s.wait.*cnt intrinsics (#78723 )	2024-01-20 11:44:42 +00:00
Craig Topper	9396891271	[RISCV] Don't look for sext in RISCVCodeGenPrepare::visitAnd. We want to know the upper 33 bits of the And Input are zero. SExt only guarantees they are the same. We originally checked for SExt or ZExt when we were using isImpliedByDomCondition because a ZExt may have been changed to SExt before we visited the And. We are no longer using isImpliedByDomCondition so we can only look for zext with the nneg flag. While here, switch to PatternMatch to simplify the code. Fixes #78783	2024-01-19 14:44:47 -08:00
Craig Topper	66cea7143a	[RISCV] Add test case for #78783 . NFC	2024-01-19 14:44:47 -08:00
Arthur Eubanks	86eaf6083b	[X86] Refine X86DAGToDAGISel::isSExtAbsoluteSymbolRef() (#76191 ) We just need to check if the global is large or not. In the kernel code model, globals are in the negative 2GB of the address space, so globals can be a sign extended 32-bit immediate. In other code models, small globals are in the low 2GB of the address space, so sign extending them is equivalent to zero extending them.	2024-01-19 14:11:18 -08:00
Craig Topper	9ae28fb9d3	[RISCV] Prevent RISCVMergeBaseOffsetOpt from calling getVRegDef on a physical register. (#78762 ) Fixes #78679.	2024-01-19 12:15:08 -08:00
Min-Yih Hsu	5330daad41	[RISCV] Add support for Smepmp 1.0 (#78489 ) Smepmp is a supervisor extension that prevents privileged processes from accessing unprivileged program and data. Spec: https://github.com/riscv/riscv-tee/blob/main/Smepmp/Smepmp.pdf	2024-01-19 11:09:35 -08:00
Durgadoss R	43531e7196	[LLVM][NVPTX] Add cp.async.bulk.commit/wait intrinsics (#78698 ) This patch adds NVVM intrinsics and NVPTX codegen for the bulk variants of the async-copy commit/wait instructions. lit tests are added to verify the generated PTX. PTX Doc link: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-commit-group Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2024-01-19 10:42:33 -08:00
Jay Foad	89226ecbb9	[AMDGPU] Do not widen scalar loads on GFX12 (#78724 ) GFX12 has subword scalar loads so there is no need to do this.	2024-01-19 15:30:07 +00:00
Jay Foad	ed12388082	[AMDGPU] Do not emit `V_DOT2C_F32_F16_e32` on GFX12 (#78709 ) That instruction is not supported on GFX12. Added a testcase which previously crashed without this change. Co-authored-by: pvanhout <pierre.vanhoutryve@amd.com>	2024-01-19 14:36:27 +00:00
Simon Pilgrim	a2a0089ac3	[X86] movsd/movss/movd/movq - add support for constant comments (#78601 ) If we're loading a constant value, print the constant (and the zero upper elements) instead of just the shuffle mask. This did require me to move the shuffle mask handling into addConstantComments as we can't handle this in the MC layer.	2024-01-19 14:21:26 +00:00
Sander de Smalen	340054e561	[AArch64][SME] Remove combination of private-ZA and preserves_za. (#78563 ) The new Clang attributes no longer support the combination of having a private-ZA function that preserves ZA. The use of __arm_preserves("za") means that ZA is shared and preserved. There wasn't that much benefit to the special handling of this, because in practice it only meant that we'd avoid restoring the lazy-save afterwards, but it still needed setting up a lazy-save (with the possibility of using a 0-sized buffer). Perhaps a new attribute will be added in the future to support this case, at which point we can revert back some of the changes removed in this patch. But for now removing this code simplifies things.	2024-01-19 13:48:44 +00:00
Danila Malyutin	9ad7d8f0e4	[Statepoint] Optimize Location structure size (#78600 ) Reduce its size from 24 to 12 bytes. Improves memory consumption when dealing with statepoint-heavy code.	2024-01-19 17:15:36 +04:00
David Spickett	955417ade2	Revert "[llvm][AArch64] Copy all operands when expanding BLR_BTI bundle (#78267 )" This reverts commit 228aecbcf106a50c30b1f8f1915d61850860cbcd. Failing expensive checks: https://lab.llvm.org/buildbot/#/builders/16/builds/59798	2024-01-19 12:06:30 +00:00
Jay Foad	879cbe06ed	[AMDGPU] Fix predicates for BUFFER_ATOMIC_CSUB pattern (#78701 ) Use OtherPredicates to avoid interfering with other uses of SubtargetPredicate for GFX12.	2024-01-19 12:01:31 +00:00
David Spickett	228aecbcf1	[llvm][AArch64] Copy all operands when expanding BLR_BTI bundle (#78267 ) Fixes #77915 Previously I based the operand copying on expandCALL_RVMARKER but did not understand it properly at the time. This led to me dropping the arguments of the function being branched to. This fixes that by copying all operands from the BLR_BTI to the BL/BLR without skipping anything. I've updated the existing test by adding function arguments.	2024-01-19 11:22:43 +00:00
Mirko Brkušanin	0185c76456	[AMDGPU] Fix test for expensive-checks build (#78687 )	2024-01-19 11:32:02 +01:00
Leon Clark	2759cfa0c3	[AMDGPU] Remove unnecessary add instructions in ctlz.i8 (#77615 ) Add custom lowering for ctlz.i8 to avoid multiple add/sub operations. --------- Co-authored-by: Leon Clark <leoclark@amd.com> Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2024-01-19 10:16:46 +00:00
Craig Topper	0ad83bc26c	[RISCV] Don't look through EXTRACT_ELEMENT in lowerScalarInsert if the element types are different. (#78668 ) If the element type of the vector we're extracting from doesn't match the type we're inserting into, we can't directly insert or extract the subvector.	2024-01-18 22:35:24 -08:00
Christudasan Devadasan	4d566e57a2	[AMDGPU] Precommit lit test.	2024-01-19 09:32:03 +05:30
Luke Lau	8649328060	[RISCV] Add support for new unprivileged extensions defined in profiles spec (#77458 ) This adds minimal support for 7 new unprivileged extensions that were defined as a part of the RISC-V Profiles specification here: https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc#7-new-isa-extensions * Ziccif: Main memory supports instruction fetch with atomicity requirement * Ziccrse: Main memory supports forward progress on LR/SC sequences * Ziccamoa: Main memory supports all atomics in A * Zicclsm: Main memory supports misaligned loads/stores * Za64rs: Reservation set size of 64 bytes * Za128rs: Reservation set size of 128 bytes * Zic64b: Cache block size is 64 bytes As stated in the specification, these extensions don't add any new features but describe existing features. So this patch only adds parsing and subtarget features.	2024-01-19 06:57:06 +07:00
Florian Hahn	83365152a4	[AArch64] Add tests for operations on vectors with 3 elements.	2024-01-18 21:42:06 +00:00
Haohai Wen	fb2c6bbf42	[BranchFolding] Use isSuccessor to confirm fall through (#77923 ) When merging blocks, if the previous block has no branch instruction and has one successor, the successor may be an SEH landing pad, in which case the block will always raise an exception and never fall through to the next block. We cannot merge the blocks in that case. isSuccessor should be used to confirm that the block can fall through to the next block.	2024-01-18 23:26:22 +08:00
Simon Pilgrim	33287e35f2	[X86] Emit verbose (constant) comments before EVEX compression tag (#78585 ) This helps ensure the encoding details are next to the EVEX tag Noticed while preparing to add more constant commenting as part of #73783 and #71078	2024-01-18 15:13:42 +00:00
Piotr Sobczak	57f6a3f7ea	[AMDGPU] Add global_load_tr for GFX12 (#77772 ) Support new amdgcn_global_load_tr instructions for load with transpose. * MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128 * Intrinsic int_amdgcn_global_load_tr * Clang builtins amdgcn_global_load_tr*	2024-01-18 15:14:42 +01:00
Jay Foad	745b193260	[AMDGPU] Regenerate tests for #77892 after #77438	2024-01-18 13:50:59 +00:00
Jay Foad	0a3a0ea591	[AMDGPU] Update uses of new VOP2 pseudos for GFX12 (#78155 ) New pseudos were added for instructions that were natively VOP3 on GFX11: V_ADD_F64_pseudo, V_MUL_F64_pseudo, V_MIN_NUM_F64, V_MAX_NUM_F64, V_LSHLREV_B64_pseudo --------- Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>	2024-01-18 13:26:13 +00:00
Mariusz Sikora	3e6589f21c	[AMDGPU][GFX12] Add 16 bit atomic fadd instructions (#75917 ) - image_atomic_pk_add_f16 - image_atomic_pk_add_bf16 - ds_pk_add_bf16 - ds_pk_add_f16 - ds_pk_add_rtn_bf16 - ds_pk_add_rtn_f16 - flat_atomic_pk_add_f16 - flat_atomic_pk_add_bf16 - global_atomic_pk_add_f16 - global_atomic_pk_add_bf16 - buffer_atomic_pk_add_f16 - buffer_atomic_pk_add_bf16	2024-01-18 14:01:09 +01:00
Mariusz Sikora	28b7e498b6	AMDGPU/GFX12: Add new dot4 fp8/bf8 instructions (#77892 ) Encoding is VOP3P. Tagged as deep/machine learning instructions. i32 type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32 fneg(neg_lo[2]) and f32 fabs(neg_hi[2]). --------- Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>	2024-01-18 14:00:27 +01:00
Florian Hahn	40d952b874	[CGP] Avoid replacing a free ext with multiple other exts. (#77094 ) Replacing a free extension with 2 or more extensions unnecessarily increases the number of IR instructions without providing any benefits. It also unnecessarily causes operations to be performed on wider types than necessary. In some cases, the extra extensions also pessimize codegen (see bfis-in-loop.ll). The changes in arm64-codegen-prepare-extload.ll also show that we avoid promotions that should only be performed in stress mode. PR: https://github.com/llvm/llvm-project/pull/77094	2024-01-18 10:48:10 +00:00
Jay Foad	ba52f06f9d	[AMDGPU] CodeGen for GFX12 S_WAIT_* instructions (#77438 ) Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into separate LOADCNT, SAMPLECNT and BVHCNT counters.	2024-01-18 10:47:45 +00:00
Jay Foad	9ca36932b5	[AMDGPU] Work around s_getpc_b64 zero extending on GFX12 (#78186 )	2024-01-18 10:23:27 +00:00
Jay Foad	c111dc72e9	[AMDGPU] Allow potentially negative flat scratch offsets on GFX12 (#78193 ) https://github.com/llvm/llvm-project/pull/70634 has disabled use of potentially negative scratch offsets, but we can use it on GFX12. --------- Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2024-01-18 10:02:40 +00:00
Matthew Devereau	51e3d2f73d	[AArch64][SME] Conditionally do smstart/smstop (#77113 ) This patch adds conditional enabling/disabling of streaming mode for functions which have both the aarch64_pstate_sm_compatible and aarch64_pstate_sm_body attributes. This combination allows callees to determine if switching streaming mode is required instead of relying on the caller.	2024-01-18 09:17:23 +00:00
Luke Lau	15b0fabb21	[RISCV] Vectorize phi for loop carried @llvm.vector.reduce.fadd (#78244 ) LLVM vector reduction intrinsics return a scalar result, but on RISC-V vector reduction instructions write the result in the first element of a vector register. So when a reduction in a loop uses a scalar phi, we end up with unnecessary scalar moves: loop: vfmv.s.f v10, fa0 vfredosum.vs v8, v8, v10 vfmv.f.s fa0, v8 This mainly affects ordered fadd reductions, which have a scalar accumulator operand. This tries to vectorize any scalar phis that feed into a fadd reduction in RISCVCodeGenPrepare, converting: loop: %phi = phi float [ ..., %entry ], [ %acc, %loop ] %acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec) to loop: %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ] %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0 %acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec) %acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0 This eliminates the scalar -> vector -> scalar crossing during instruction selection.	2024-01-18 16:15:20 +07:00
Ivan Kosarev	2a869ced61	[AMDGPU][True16] Support V_FLOOR_F16. (#78446 )	2024-01-18 08:43:47 +00:00
Mirko Brkušanin	1d286ad59b	[AMDGPU] Add mark last scratch load pass (#75512 )	2024-01-18 09:36:44 +01:00
Stanislav Mekhanoshin	021def6c22	[AMDGPU] Use alias info to relax waitcounts for LDS DMA (#74537 ) LDS DMA loads increase VMCNT, and a load from the LDS must wait on this counter so that it only reads memory after it is written. The wait count insertion pass does not track memory dependencies; it tracks register dependencies. To model the LDS dependency a pseudo register is used in the scoreboard, acting as if LDS DMA writes it and LDS load reads it. This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis. Fixes: SWDEV-433427	2024-01-17 23:44:15 -08:00
Matt Arsenault	c8007f9047	DAG: Fix chain mismanagement in SoftenFloatRes_FP_EXTEND (#74558 )	2024-01-18 14:32:44 +07:00
Matt Arsenault	11bf02e019	DAG: Fix ABI lowering with FP promote in strictfp functions (#74405 ) This was emitting non-strict casts in ABI contexts for illegal types.	2024-01-18 10:57:53 +07:00
Chia	ba81477e9c	Recommit "[RISCV][ISel] Combine scalable vector add/sub/mul with zero/sign extension." (#76785 ) This patch was originally introduced in PR #72340, but was reverted due to a bug that produced an invalid extension combine. Specifically, we resolve the case reported in https://github.com/llvm/llvm-project/pull/72340#issuecomment-1874810998 ``` define <vscale x 1 x i32> @foo(<vscale x 1 x i1> %x, <vscale x 1 x i2> %y) { %a = zext <vscale x 1 x i1> %x to <vscale x 1 x i32> %b = zext <vscale x 1 x i1> %y to <vscale x 1 x i32> %c = add <vscale x 1 x i32> %a, %b ret <vscale x 1 x i32> %c } ``` The previous patch didn't check whether the semantics of `ISD::SIGN_EXTEND` and `ISD::ZERO_EXTEND` are equivalent to `vsext.vf2` or `vzext.vf2` (it did not ensure the SEW condition on widening Vector Arithmetic Instructions). Thanks to @topperc for pointing out this bug. ## The original description This PR mainly aims at resolving the missed-optimization case below, while it could also be considered an extension of the previous patch https://reviews.llvm.org/D133739?id= ### Missed-Optimization Case Compiler Explorer: https://godbolt.org/z/GzWzP7Pfh ### Source Code: ``` define <vscale x 2 x i16> @multiple_users(ptr %x, ptr %y, ptr %z) { %a = load <vscale x 2 x i8>, ptr %x %b = load <vscale x 2 x i8>, ptr %y %b2 = load <vscale x 2 x i8>, ptr %z %c = sext <vscale x 2 x i8> %a to <vscale x 2 x i16> %d = sext <vscale x 2 x i8> %b to <vscale x 2 x i16> %d2 = sext <vscale x 2 x i8> %b2 to <vscale x 2 x i16> %e = mul <vscale x 2 x i16> %c, %d %f = add <vscale x 2 x i16> %c, %d2 %g = sub <vscale x 2 x i16> %c, %d2 %h = or <vscale x 2 x i16> %e, %f %i = or <vscale x 2 x i16> %h, %g ret <vscale x 2 x i16> %i } ``` ### Before This Patch ``` # %bb.0: vsetvli a3, zero, e16, mf2, ta, ma vle8.v v8, (a0) vle8.v v9, (a1) vle8.v v10, (a2) vsext.vf2 v11, v8 vsext.vf2 v8, v9 vsext.vf2 v9, v10 vmul.vv v8, v11, v8 vadd.vv v10, v11, v9 vsub.vv v9, v11, v9 vor.vv v8, v8, v10 vor.vv v8, v8, v9 ret ``` ### After This Patch 
``` # %bb.0: vsetvli a3, zero, e8, mf4, ta, ma vle8.v v8, (a0) vle8.v v9, (a1) vle8.v v10, (a2) vwmul.vv v11, v8, v9 vwadd.vv v9, v8, v10 vwsub.vv v12, v8, v10 vsetvli zero, zero, e16, mf2, ta, ma vor.vv v8, v11, v9 vor.vv v8, v8, v12 ret ``` We can see Add/Sub/Mul are combined with the Sign Extension. ### Relation to the Patch D133739 The patch D133739 introduced an optimization for folding `ADD_VL` / `SUB_VL` / `MUL_VL` with `VSEXT_VL` / `VZEXT_VL`. However, that patch did not consider the non-fixed-length (scalable) vector case, so this PR could also be considered an extension of D133739.	2024-01-17 18:30:27 -08:00
XinWang10	2d92f7de80	[X86] Support lowering for APX promoted BMI instructions. (#77433 ) R16-R31 were added to the GPRs in https://github.com/llvm/llvm-project/pull/70958. This patch supports the lowering for promoted BMI instructions in EVEX space; enc/dec was added in https://github.com/llvm/llvm-project/pull/73899. RFC: https://discourse.llvm.org/t/rfc-design-for-apx-feature-egpr-and-ndd-support/73031/4	2024-01-18 10:15:54 +08:00
XinWang10	f6617091a9	[X86][test] Add --show-mc-encoding for lowering tests of NDD arithmetic instructions (#78406 ) #77564 added lowering tests for NDD arithmetic instructions. It would be great to add `--show-mc-encoding` to check the NDD variant is selected first.	2024-01-18 09:29:02 +08:00
Stanislav Mekhanoshin	558ea41159	[AMDGPU] Reapply 'Sign extend simm16 in setreg intrinsic' (#78492 ) We currently force users to use a negative constant in the intrinsic call. Changing it to zext would break existing programs, so just sign extend the argument.	2024-01-17 17:23:46 -08:00
Alex MacLean	430a40d12e	[NVPTX] extend type support for nvvm.{min,max,mulhi,sad} (#78385 ) Ensure intrinsics and auto-upgrades support i16, i32, and i64 for `nvvm.{min,max,mulhi,sad}` - `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select` instructions but it is still nice to support the 16 bit variants just in case any generators of IR are still trying to use these intrinsics. - `nvvm.sad` added both the 16 and 64 bit variants, also marked this instruction as speculatable. These directly correspond to the PTX `sad.{u16,s16,u64,s64}` instructions. - `nvvm.mulhi` added the 16 bit variants. These directly correspond to the PTX `mul.hi.{s,u}16` instructions.	2024-01-17 16:18:39 -08:00
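
Note on 86eaf6083b ([X86] Refine X86DAGToDAGISel::isSExtAbsoluteSymbolRef()): the message argues that small globals in the low 2GB can be treated as sign-extended 32-bit immediates because sign and zero extension agree there, while kernel code model globals in the negative 2GB are reachable only by sign extension. The sketch below just checks that arithmetic with plain integers. It is not LLVM code; the helper names and the sample addresses are hypothetical and chosen for illustration.

```python
# Illustrative only: models 32-bit -> 64-bit extension with Python integers.
# None of these helpers exist in LLVM; the names are made up for this note.
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def zext32(value: int) -> int:
    """Zero-extend the low 32 bits of value to a 64-bit pattern."""
    return value & MASK32


def sext32(value: int) -> int:
    """Sign-extend the low 32 bits of value to a 64-bit pattern."""
    low = value & MASK32
    if low & 0x8000_0000:  # bit 31 set: replicate it into bits 32..63
        low |= 0xFFFF_FFFF_0000_0000
    return low


def fits_sext_imm32(address: int) -> bool:
    """True if the 64-bit address equals the sign extension of its low 32 bits."""
    return sext32(address) == (address & MASK64)


# Hypothetical sample addresses.
low_addr = 0x7FFF_F000               # low 2GB: sext and zext agree
kernel_addr = 0xFFFF_FFFF_8000_1000  # negative 2GB (kernel code model)
far_addr = 0x0000_0000_8000_1000     # just above the low 2GB

for name, addr in [("low", low_addr), ("kernel", kernel_addr), ("far", far_addr)]:
    print(f"{name}: fits sign-extended imm32 = {fits_sext_imm32(addr)}")
```

Under these assumptions, the low and kernel addresses fit a sign-extended 32-bit immediate and the address just above 2GB does not, which matches the reasoning in the commit message.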

51669 Commits