llvm-project

Author	SHA1	Message	Date
Craig Topper	0ad83bc26c	[RISCV] Don't look through EXTRACT_ELEMENT in lowerScalarInsert if the element types are different. (#78668 ) If the element type of the vector we're extracting from doesn't match the type we're inserting into, we can't directly insert or extract the subvector.	2024-01-18 22:35:24 -08:00
Christudasan Devadasan	4d566e57a2	[AMDGPU] Precommit lit test.	2024-01-19 09:32:03 +05:30
Luke Lau	8649328060	[RISCV] Add support for new unprivileged extensions defined in profiles spec (#77458 ) This adds minimal support for 7 new unprivileged extensions that were defined as a part of the RISC-V Profiles specification here: https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc#7-new-isa-extensions * Ziccif: Main memory supports instruction fetch with atomicity requirement * Ziccrse: Main memory supports forward progress on LR/SC sequences * Ziccamoa: Main memory supports all atomics in A * Zicclsm: Main memory supports misaligned loads/stores * Za64rs: Reservation set size of 64 bytes * Za128rs: Reservation set size of 128 bytes * Zic64b: Cache block size is 64 bytes As stated in the specification, these extensions don't add any new features but describe existing features. So this patch only adds parsing and subtarget features.	2024-01-19 06:57:06 +07:00
Florian Hahn	83365152a4	[AArch64] Add tests for operations on vectors with 3 elements.	2024-01-18 21:42:06 +00:00
Haohai Wen	fb2c6bbf42	[BranchFolding] Use isSuccessor to confirm fall through (#77923 ) When merging blocks, if the previous block has no branch instruction and has one successor, the successor may be an SEH landing pad, and the block will always raise an exception and never fall through to the next block. We cannot merge them in such a case. isSuccessor should be used to confirm that the block can fall through to the next block.	2024-01-18 23:26:22 +08:00
Simon Pilgrim	33287e35f2	[X86] Emit verbose (constant) comments before EVEX compression tag (#78585 ) This helps ensure the encoding details are next to the EVEX tag Noticed while preparing to add more constant commenting as part of #73783 and #71078	2024-01-18 15:13:42 +00:00
Piotr Sobczak	57f6a3f7ea	[AMDGPU] Add global_load_tr for GFX12 (#77772 ) Support new amdgcn_global_load_tr instructions for load with transpose. * MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128 * Intrinsic int_amdgcn_global_load_tr * Clang builtins amdgcn_global_load_tr*	2024-01-18 15:14:42 +01:00
Jay Foad	745b193260	[AMDGPU] Regenerate tests for #77892 after #77438	2024-01-18 13:50:59 +00:00
Jay Foad	0a3a0ea591	[AMDGPU] Update uses of new VOP2 pseudos for GFX12 (#78155 ) New pseudos were added for instructions that were natively VOP3 on GFX11: V_ADD_F64_pseudo, V_MUL_F64_pseudo, V_MIN_NUM_F64, V_MAX_NUM_F64, V_LSHLREV_B64_pseudo --------- Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>	2024-01-18 13:26:13 +00:00
Mariusz Sikora	3e6589f21c	[AMDGPU][GFX12] Add 16 bit atomic fadd instructions (#75917 ) - image_atomic_pk_add_f16 - image_atomic_pk_add_bf16 - ds_pk_add_bf16 - ds_pk_add_f16 - ds_pk_add_rtn_bf16 - ds_pk_add_rtn_f16 - flat_atomic_pk_add_f16 - flat_atomic_pk_add_bf16 - global_atomic_pk_add_f16 - global_atomic_pk_add_bf16 - buffer_atomic_pk_add_f16 - buffer_atomic_pk_add_bf16	2024-01-18 14:01:09 +01:00
Mariusz Sikora	28b7e498b6	AMDGPU/GFX12: Add new dot4 fp8/bf8 instructions (#77892 ) Encoding is VOP3P. Tagged as deep/machine learning instructions. i32 type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32 fneg(neg_lo[2]) and f32 fabs(neg_hi[2]). --------- Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>	2024-01-18 14:00:27 +01:00
Florian Hahn	40d952b874	[CGP] Avoid replacing a free ext with multiple other exts. (#77094 ) Replacing a free extension with 2 or more extensions unnecessarily increases the number of IR instructions without providing any benefits. It also unnecessarily causes operations to be performed on wider types than necessary. In some cases, the extra extensions also pessimize codegen (see bfis-in-loop.ll). The changes in arm64-codegen-prepare-extload.ll also show that we avoid promotions that should only be performed in stress mode. PR: https://github.com/llvm/llvm-project/pull/77094	2024-01-18 10:48:10 +00:00
Jay Foad	ba52f06f9d	[AMDGPU] CodeGen for GFX12 S_WAIT_* instructions (#77438 ) Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into separate LOADCNT, SAMPLECNT and BVHCNT counters.	2024-01-18 10:47:45 +00:00
Jay Foad	9ca36932b5	[AMDGPU] Work around s_getpc_b64 zero extending on GFX12 (#78186 )	2024-01-18 10:23:27 +00:00
Jay Foad	c111dc72e9	[AMDGPU] Allow potentially negative flat scratch offsets on GFX12 (#78193 ) https://github.com/llvm/llvm-project/pull/70634 has disabled use of potentially negative scratch offsets, but we can use it on GFX12. --------- Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2024-01-18 10:02:40 +00:00
Matthew Devereau	51e3d2f73d	[AArch64][SME] Conditionally do smstart/smstop (#77113 ) This patch adds conditional enabling/disabling of streaming mode for functions which have both the aarch64_pstate_sm_compatible and aarch64_pstate_sm_body attributes. This combination allows callees to determine if switching streaming mode is required instead of relying on the caller.	2024-01-18 09:17:23 +00:00
Luke Lau	15b0fabb21	[RISCV] Vectorize phi for loop carried @llvm.vector.reduce.fadd (#78244 ) LLVM vector reduction intrinsics return a scalar result, but on RISC-V vector reduction instructions write the result in the first element of a vector register. So when a reduction in a loop uses a scalar phi, we end up with unnecessary scalar moves: loop: vfmv.s.f v10, fa0 vfredosum.vs v8, v8, v10 vfmv.f.s fa0, v8 This mainly affects ordered fadd reductions, which have a scalar accumulator operand. This tries to vectorize any scalar phis that feed into a fadd reduction in RISCVCodeGenPrepare, converting: loop: %phi = phi float [ ..., %entry ], [ %acc, %loop] %acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec) to loop: %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop] %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0 %acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec) %acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0 Which eliminates the scalar -> vector -> scalar crossing during instruction selection.	2024-01-18 16:15:20 +07:00
Ivan Kosarev	2a869ced61	[AMDGPU][True16] Support V_FLOOR_F16. (#78446 )	2024-01-18 08:43:47 +00:00
Mirko Brkušanin	1d286ad59b	[AMDGPU] Add mark last scratch load pass (#75512 )	2024-01-18 09:36:44 +01:00
Stanislav Mekhanoshin	021def6c22	[AMDGPU] Use alias info to relax waitcounts for LDS DMA (#74537 ) LDS DMA loads increase VMCNT, and a load from the LDS location being written must wait on this counter so that it reads memory only after it is written. The wait count insertion pass does not track memory dependencies; it tracks register dependencies. To model the LDS dependency a pseudo register is used in the scoreboard, acting as if LDS DMA writes it and an LDS load reads it. This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis. Fixes: SWDEV-433427	2024-01-17 23:44:15 -08:00
Matt Arsenault	c8007f9047	DAG: Fix chain mismanagement in SoftenFloatRes_FP_EXTEND (#74558 )	2024-01-18 14:32:44 +07:00
Matt Arsenault	11bf02e019	DAG: Fix ABI lowering with FP promote in strictfp functions (#74405 ) This was emitting non-strict casts in ABI contexts for illegal types.	2024-01-18 10:57:53 +07:00
Chia	ba81477e9c	Recommit "[RISCV][ISel] Combine scalable vector add/sub/mul with zero/sign extension." (#76785 ) This patch was originally introduced in PR #72340, but was reverted due to a bug on invalid extension combine. Specifically, we resolve the case reported in https://github.com/llvm/llvm-project/pull/72340#issuecomment-1874810998 ``` define <vscale x 1 x i32> @foo(<vscale x 1 x i1> %x, <vscale x 1 x i2> %y) { %a = zext <vscale x 1 x i1> %x to <vscale x 1 x i32> %b = zext <vscale x 1 x i2> %y to <vscale x 1 x i32> %c = add <vscale x 1 x i32> %a, %b ret <vscale x 1 x i32> %c } ``` The previous patch didn't check whether the semantics of `ISD::SIGN_EXTEND` and `ISD::ZERO_EXTEND` are equivalent to `vsext.vf2` or `vzext.vf2` (it did not ensure the SEW condition on widening Vector Arithmetic Instructions). Thanks to @topperc for pointing out this bug. ## The original description This PR mainly aims at resolving the below missed-optimization case, while it could also be considered as an extension of the previous patch https://reviews.llvm.org/D133739?id= ### Missed-Optimization Case Compiler Explorer: https://godbolt.org/z/GzWzP7Pfh ### Source Code: ``` define <vscale x 2 x i16> @multiple_users(ptr %x, ptr %y, ptr %z) { %a = load <vscale x 2 x i8>, ptr %x %b = load <vscale x 2 x i8>, ptr %y %b2 = load <vscale x 2 x i8>, ptr %z %c = sext <vscale x 2 x i8> %a to <vscale x 2 x i16> %d = sext <vscale x 2 x i8> %b to <vscale x 2 x i16> %d2 = sext <vscale x 2 x i8> %b2 to <vscale x 2 x i16> %e = mul <vscale x 2 x i16> %c, %d %f = add <vscale x 2 x i16> %c, %d2 %g = sub <vscale x 2 x i16> %c, %d2 %h = or <vscale x 2 x i16> %e, %f %i = or <vscale x 2 x i16> %h, %g ret <vscale x 2 x i16> %i } ``` ### Before This Patch ``` # %bb.0: vsetvli a3, zero, e16, mf2, ta, ma vle8.v v8, (a0) vle8.v v9, (a1) vle8.v v10, (a2) vsext.vf2 v11, v8 vsext.vf2 v8, v9 vsext.vf2 v9, v10 vmul.vv v8, v11, v8 vadd.vv v10, v11, v9 vsub.vv v9, v11, v9 vor.vv v8, v8, v10 vor.vv v8, v8, v9 ret ``` ### After This Patch
``` # %bb.0: vsetvli a3, zero, e8, mf4, ta, ma vle8.v v8, (a0) vle8.v v9, (a1) vle8.v v10, (a2) vwmul.vv v11, v8, v9 vwadd.vv v9, v8, v10 vwsub.vv v12, v8, v10 vsetvli zero, zero, e16, mf2, ta, ma vor.vv v8, v11, v9 vor.vv v8, v8, v12 ret ``` We can see Add/Sub/Mul are combined with the Sign Extension. ### Relation to the Patch D133739 The patch D133739 introduced an optimization for folding `ADD_VL` / `SUB_VL` / `MUL_VL` with `VSEXT_VL` / `VZEXT_VL`. However, that patch did not consider non-fixed-length vectors, so this PR could also be considered an extension of D133739.	2024-01-17 18:30:27 -08:00
XinWang10	2d92f7de80	[X86] Support lowering for APX promoted BMI instructions. (#77433 ) R16-R31 was added into GPRs in https://github.com/llvm/llvm-project/pull/70958. This patch supports the lowering for promoted BMI instructions in EVEX space; enc/dec has been supported in https://github.com/llvm/llvm-project/pull/73899. RFC: https://discourse.llvm.org/t/rfc-design-for-apx-feature-egpr-and-ndd-support/73031/4	2024-01-18 10:15:54 +08:00
XinWang10	f6617091a9	[X86][test] Add --show-mc-encoding for lowering tests of NDD arithmetic instructions (#78406 ) #77564 added lowering tests for NDD arithmetic instructions. It would be great to add `--show-mc-encoding` to check the NDD variant is selected first.	2024-01-18 09:29:02 +08:00
Stanislav Mekhanoshin	558ea41159	[AMDGPU] Reapply 'Sign extend simm16 in setreg intrinsic' (#78492 ) We currently force users to use a negative constant in the intrinsic call. Changing it to zext would break existing programs, so just sign extend the argument.	2024-01-17 17:23:46 -08:00
Alex MacLean	430a40d12e	[NVPTX] extend type support for nvvm.{min,max,mulhi,sad} (#78385 ) Ensure intrinsics and auto-upgrades support i16, i32, and i64 for `nvvm.{min,max,mulhi,sad}` - `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select` instructions but it is still nice to support the 16 bit variants just in case any generators of IR are still trying to use these intrinsics. - `nvvm.sad` added both the 16 and 64 bit variants, also marked this instruction as speculatable. These directly correspond to the PTX `sad.{u16,s16,u64,s64}` instructions. - `nvvm.mulhi` added the 16 bit variants. These directly correspond to the PTX `mul.hi.{s,u}16` instructions.	2024-01-17 16:18:39 -08:00
Arthur Eubanks	00647a18ce	[X86] Don't respect large data threshold for globals with an explicit section (#78348 ) If multiple globals are placed in an explicit section, there's a chance that the large data threshold will cause the different globals to be inconsistent in whether they're large or small. Mixing sections with mismatched large section flags can cause undesirable issues like increased relocation pressure because there may be 32-bit references to the section in some TUs, but the section is considered large since input section flags are unioned and other TUs added the large section flag. An explicit code model on the global still overrides the decision. We can do this for globals without any references to them, like what we did with asan_globals in #74514. If we have some precompiled small code model files where asan_globals is not considered large mixed with medium/large code model files, that's ok because the section is considered large and placed farther. However, overriding the code model for globals in some TUs but not others and having references to them from code will still result in the above undesired behavior. This mitigates a whole class of mismatched large section flag issues like what #77986 was trying to fix. This ends up not adding the SHF_X86_64_LARGE section flag on explicit sections in the medium/large code model. This is ok for the large code model since all references from large text must use 64-bit relocations anyway.	2024-01-17 15:38:32 -08:00
Mikhail Gudim	c1f433849b	[GISel][RISCV] Implement selectShiftMask. (#77572 ) Implement `selectShiftMask` in `GlobalISel`.	2024-01-17 16:25:43 -05:00
Thorsten Schütt	67dc6e9075	[GlobalIsel][AArch64] more legal icmps (#78239 ) In https://github.com/llvm/llvm-project/pull/78181 the godbolt (https://llvm.godbolt.org/z/vMsnxMf1v) crashed with GlobalIsel. LLVM ERROR: unable to legalize instruction: %90:_(<3 x s32>) = G_ICMP intpred(uge), %15:_(<3 x s32>), %0:_ (in function: vec3_i32)	2024-01-17 22:23:51 +01:00
Philip Reames	de423cfe3d	[RISCV] Prefer vsetivli for VLMAX when VLEN is exactly known (#75509 ) If VLEN is exactly known, we may be able to use the vsetivli encoding instead of the vsetvli a0, zero, <vtype> encoding. This slightly reduces register pressure. This builds on 632f1c5, but reverses course a bit. It turns out to be quite complicated to canonicalize from VLMAX to immediate early because the sentinel value is widely used in tablegen patterns without knowledge of LMUL. Instead, we canonicalize towards the VLMAX representation, and then pick the immediate form during insertion since we have the LMUL information there. Within InsertVSETVLI, this could reasonably fit in a couple places. If reviewers want me to e.g. move it to emission, let me know. Doing so may require a bit of extra code to e.g. handle comparisons of the two forms, but shouldn't be too complicated.	2024-01-17 12:40:00 -08:00
Alex MacLean	da7462a6ae	[NVPTX] Add tex.grad.cube{array} intrinsics (#77693 ) Extend IR support for PTX `tex` instruction described in [PTX ISA. 9.7.9.3. Texture Instructions: tex](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-instructions-tex). Add support for unified-mode versions of `tex.grad.cube{array}` variants added in PTX ISA 4.3.	2024-01-17 10:41:11 -08:00
Mariusz Sikora	c99da46fc1	[AMDGPU][GFX12] Add Atomic cond_sub_u32 (#76224 ) Co-authored-by: Vang Thao <Vang.Thao@amd.com>	2024-01-17 19:23:42 +01:00
Petar Avramovic	90bdf76fdb	Revert "AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis" (#78468 ) Reverts llvm/llvm-project#76145	2024-01-17 17:41:19 +01:00
Simon Pilgrim	d92ce344bf	Revert faecc736e2ac3cd8c77 #74443 [DAG] isSplatValue - node is a splat if all demanded elts have the same whole constant value (#74443 ) Relying on ComputeKnownBits to find a splat is causing miscompilations where a shift of zero is being assumed to give zero, but further simplification leads to a shift of zero by undef, resulting in an unexpected undef value. Fixes #78109	2024-01-17 15:59:33 +00:00
Jay Foad	e4c8c58517	[AMDGPU] Src1 of VOP3 DPP instructions can be SGPR on GFX12 (#77929 )	2024-01-17 15:57:36 +00:00
Matt Arsenault	af4f1766ae	AMDGPU: Allocate special SGPRs before user SGPR arguments (#78234 )	2024-01-17 21:41:50 +07:00
Jay Foad	f12059eb3f	[AMDGPU] Fix llvm.amdgcn.s.wait.event.export.ready for GFX12 (#78191 ) The meaning of bit 0 of the immediate operand of S_WAIT_EVENT has been flipped from GFX11.	2024-01-17 11:59:15 +00:00
Jay Foad	e9e9d1b0b1	[AMDGPU] Disable V_MAD_U64_U32/V_MAD_I64_I32 workaround for GFX12 (#77927 )	2024-01-17 11:52:19 +00:00
Petar Avramovic	1fbf533286	AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis (#76145 ) Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper. Use machine uniformity analysis to find divergent i1 phis and select them as lane mask phis in the same way SILowerI1Copies selects VReg_1 phis. Note that divergent i1 phis include phis created by LCSSA, and all cases of uses outside of a cycle are actually covered by "lowering LCSSA phis". GlobalISel lane masks are registers with sgpr register class and S1 LLT. TODO: General goal is that instructions created in this pass are fully instruction-selected so that selection of lane mask phis is not split across multiple passes. patch 3 from: https://github.com/llvm/llvm-project/pull/73337	2024-01-17 12:10:24 +01:00
Jay Foad	4a77414660	[AMDGPU] CodeGen for GFX12 8/16-bit SMEM loads (#77633 )	2024-01-17 10:28:03 +00:00
Jay Foad	42b9ea841e	[AMDGPU] Increase max scratch allocation for GFX12 (#77625 )	2024-01-17 10:25:28 +00:00
Jay Foad	36ef291d63	[AMDGPU] Fix hang caused by VS_CNT handling at calls (#78318 ) Fix a potential hang introduced by #77439 and #77935. This line: setScoreUB(VS_CNT, getScoreLB(VS_CNT) + getWaitCountMax(VS_CNT)); could potentially set UB lower than it was before, which confused SIInsertWaitcnts's fixed point algorithm. This was only triggered by a STORE instruction with an implicit-def, which seems odd but apparently happens for some spills.	2024-01-17 10:24:29 +00:00
Matt Arsenault	53a3c738a9	AMDGPU: Remove fixed fixme from a test	2024-01-17 16:52:50 +07:00
Nikita Popov	435bcea83b	[GISel] Add debug counter to force sdag fallback (#78257 ) Add a debug counter that allows forcing an sdag fallback after a certain number of functions. The intended use-case is to bisect which function gets miscompiled by global isel using `-debug-counter=globalisel-count=N` (in cases where sdag doesn't also miscompile it, of course). The "falling back" debug line is printed unconditionally, because using `-debug-only` is usually too spammy for the intended purpose.	2024-01-17 09:33:31 +01:00
Nikita Popov	cde780c18f	[DAGCombine] Add debug counter (#78259 ) Add a debug counter for DAGCombine. This can help with bisecting which DAG combine introduced a miscompile.	2024-01-17 09:31:56 +01:00
Dávid Ferenc Szabó	55172b7005	[GlobalISel] Improve combines for extend operation by taking hint ins… (#74125 ) …tructions into account Hint instructions like G_ASSERT_ZEXT can be viewed as a copy. Including this fact in the combiner allows it to match more patterns involving such instructions.	2024-01-17 15:21:02 +07:00
Alex Bradbury	da0755f7b7	[RISCV][test] Test showing missed optimisation for spills/fills of GPR<->FPR moves The fmv can be removed through appropriate logic in RISCVInstrInfo::foldMemoryOperandImpl.	2024-01-17 08:11:06 +00:00
Fangrui Song	d4cb5d9f2b	[X86] Add "Ws" constraint and "p" modifier for symbolic address/label reference (#77886 ) Printing the raw symbol is useful in inline asm (e.g. getting the C++ mangled name, referencing a symbol in a custom way while ensuring it is not optimized out even if internal). Similar constraints are available in other targets (e.g. "S" for aarch64/riscv, "Cs" for m68k). ``` namespace ns { extern int var, a[4]; } void foo() { asm(".pushsection .xxx,\"aw\"; .dc.a %p0; .popsection" :: "Ws"(&ns::var)); asm(".reloc ., BFD_RELOC_NONE, %p0" :: "Ws"(&ns::a[3])); } ``` Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105576	2024-01-16 23:57:42 -08:00
Danila Malyutin	46a929f0a0	[SelectionDAG] Fix isKnownNeverZeroFloat for vectors (#78308 ) Return true iff all of vector elements are constant AND not zero Fixes #77805 Previously, it'd return `true` (as in - the value is known to be never zero) for any build_vector/splat_vector with non-constant elements.	2024-01-17 12:55:57 +07:00
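
The rule stated in the isKnownNeverZeroFloat fix above (46a929f0a0) can be sketched outside of LLVM. This is only an illustration, not the code from the patch: `Element` and `allElementsKnownNonZero` are hypothetical names, and a `std::optional<double>` stands in for a vector operand that is either a known constant or not a constant at all.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for one operand of a build_vector: a known
// floating-point constant, or std::nullopt when the element is not a constant.
using Element = std::optional<double>;

// Returns true only if every element is a constant and that constant is not
// zero. Comparing against 0.0 rejects both +0.0 and -0.0.
bool allElementsKnownNonZero(const std::vector<Element> &elements) {
  return std::all_of(elements.begin(), elements.end(), [](const Element &e) {
    return e.has_value() && *e != 0.0;
  });
}

// Small self-check covering the three interesting cases.
void checkExamples() {
  assert(allElementsKnownNonZero({1.0, -2.5}));           // all constant, none zero
  assert(!allElementsKnownNonZero({1.0, 0.0}));           // contains a constant zero
  assert(!allElementsKnownNonZero({1.0, std::nullopt}));  // contains a non-constant
}
```

The third case is the one the commit message describes as previously wrong: a vector with a non-constant element must not be reported as known never zero.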


52796 Commits