llvm-project

Author	SHA1	Message	Date
Pierre van Houtryve	ac296b696c	[AMDGPU] Drop verify from SIMemoryLegalizer tests (#78697 ) SIMemoryLegalizer tests were slow, with most of them taking 4.5 to 5.3s to complete and that's on a fast machine. I also recall seeing them in the slowest tests list on build bots. This removes the verify-machineinstrs option from these tests to speed them up, bringing the slowest test down to +-2s. Verifier still runs in EXPENSIVE_CHECKS builds.	2024-01-22 10:31:37 +01:00
Emma Pilkington	bc82cfb38d	[AMDGPU] Add an asm directive to track code_object_version (#76267 ) Named '.amdhsa_code_object_version'. This directive sets the e_ident[ABIVERSION] in the ELF header, and should be used as the assumed COV for the rest of the asm file. This commit also weakens the --amdhsa-code-object-version CL flag. Previously, the CL flag took precedence over the IR flag. Now the IR flag/asm directive take precedence over the CL flag. This is implemented by merging a few COV-checking functions in AMDGPUBaseInfo.h.	2024-01-21 11:54:47 -05:00
Jay Foad	63d7ca924f	[AMDGPU] Add GFX12 llvm.amdgcn.s.wait.*cnt intrinsics (#78723 )	2024-01-20 11:44:42 +00:00
Jay Foad	89226ecbb9	[AMDGPU] Do not widen scalar loads on GFX12 (#78724 ) GFX12 has subword scalar loads so there is no need to do this.	2024-01-19 15:30:07 +00:00
Jay Foad	ed12388082	[AMDGPU] Do not emit `V_DOT2C_F32_F16_e32` on GFX12 (#78709 ) That instruction is not supported on GFX12. Added a testcase which previously crashed without this change. Co-authored-by: pvanhout <pierre.vanhoutryve@amd.com>	2024-01-19 14:36:27 +00:00
Jay Foad	879cbe06ed	[AMDGPU] Fix predicates for BUFFER_ATOMIC_CSUB pattern (#78701 ) Use OtherPredicates to avoid interfering with other uses of SubtargetPredicate for GFX12.	2024-01-19 12:01:31 +00:00
Mirko Brkušanin	0185c76456	[AMDGPU] Fix test for expensive-checks build (#78687 )	2024-01-19 11:32:02 +01:00
Leon Clark	2759cfa0c3	[AMDGPU] Remove unnecessary add instructions in ctlz.i8 (#77615 ) Add custom lowering for ctlz.i8 to avoid multiple add/sub operations. --------- Co-authored-by: Leon Clark <leoclark@amd.com> Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2024-01-19 10:16:46 +00:00
Christudasan Devadasan	4d566e57a2	[AMDGPU] Precommit lit test.	2024-01-19 09:32:03 +05:30
Piotr Sobczak	57f6a3f7ea	[AMDGPU] Add global_load_tr for GFX12 (#77772 ) Support new amdgcn_global_load_tr instructions for load with transpose. * MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128 * Intrinsic int_amdgcn_global_load_tr * Clang builtins amdgcn_global_load_tr*	2024-01-18 15:14:42 +01:00
Jay Foad	745b193260	[AMDGPU] Regenerate tests for #77892 after #77438	2024-01-18 13:50:59 +00:00
Jay Foad	0a3a0ea591	[AMDGPU] Update uses of new VOP2 pseudos for GFX12 (#78155 ) New pseudos were added for instructions that were natively VOP3 on GFX11: V_ADD_F64_pseudo, V_MUL_F64_pseudo, V_MIN_NUM_F64, V_MAX_NUM_F64, V_LSHLREV_B64_pseudo --------- Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>	2024-01-18 13:26:13 +00:00
Mariusz Sikora	3e6589f21c	[AMDGPU][GFX12] Add 16 bit atomic fadd instructions (#75917 ) - image_atomic_pk_add_f16 - image_atomic_pk_add_bf16 - ds_pk_add_bf16 - ds_pk_add_f16 - ds_pk_add_rtn_bf16 - ds_pk_add_rtn_f16 - flat_atomic_pk_add_f16 - flat_atomic_pk_add_bf16 - global_atomic_pk_add_f16 - global_atomic_pk_add_bf16 - buffer_atomic_pk_add_f16 - buffer_atomic_pk_add_bf16	2024-01-18 14:01:09 +01:00
Mariusz Sikora	28b7e498b6	AMDGPU/GFX12: Add new dot4 fp8/bf8 instructions (#77892 ) Encoding is VOP3P. Tagged as deep/machine learning instructions. i32 type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32 fneg(neg_lo[2]) and f32 fabs(neg_hi[2]). --------- Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>	2024-01-18 14:00:27 +01:00
Jay Foad	ba52f06f9d	[AMDGPU] CodeGen for GFX12 S_WAIT_* instructions (#77438 ) Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into separate LOADCNT, SAMPLECNT and BVHCNT counters.	2024-01-18 10:47:45 +00:00
Jay Foad	9ca36932b5	[AMDGPU] Work around s_getpc_b64 zero extending on GFX12 (#78186 )	2024-01-18 10:23:27 +00:00
Jay Foad	c111dc72e9	[AMDGPU] Allow potentially negative flat scratch offsets on GFX12 (#78193 ) https://github.com/llvm/llvm-project/pull/70634 has disabled use of potentially negative scratch offsets, but we can use it on GFX12. --------- Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2024-01-18 10:02:40 +00:00
Ivan Kosarev	2a869ced61	[AMDGPU][True16] Support V_FLOOR_F16. (#78446 )	2024-01-18 08:43:47 +00:00
Mirko Brkušanin	1d286ad59b	[AMDGPU] Add mark last scratch load pass (#75512 )	2024-01-18 09:36:44 +01:00
Stanislav Mekhanoshin	021def6c22	[AMDGPU] Use alias info to relax waitcounts for LDS DMA (#74537 ) LDS DMA loads increase VMCNT, and a load from the LDS location they store to must wait on this counter so that it reads memory only after it is written. The wait count insertion pass does not track memory dependencies; it tracks register dependencies. To model the LDS dependency a pseudo register is used in the scoreboard, acting as if LDS DMA writes it and LDS load reads it. This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis. Fixes: SWDEV-433427	2024-01-17 23:44:15 -08:00
Matt Arsenault	11bf02e019	DAG: Fix ABI lowering with FP promote in strictfp functions (#74405 ) This was emitting non-strict casts in ABI contexts for illegal types.	2024-01-18 10:57:53 +07:00
Stanislav Mekhanoshin	558ea41159	[AMDGPU] Reapply 'Sign extend simm16 in setreg intrinsic' (#78492 ) We currently force users to use a negative constant in the intrinsic call. Changing it to zext would break existing programs, so just sign extend an argument.	2024-01-17 17:23:46 -08:00
Mariusz Sikora	c99da46fc1	[AMDGPU][GFX12] Add Atomic cond_sub_u32 (#76224 ) Co-authored-by: Vang Thao <Vang.Thao@amd.com>	2024-01-17 19:23:42 +01:00
Petar Avramovic	90bdf76fdb	Revert "AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis" (#78468 ) Reverts llvm/llvm-project#76145	2024-01-17 17:41:19 +01:00
Jay Foad	e4c8c58517	[AMDGPU] Src1 of VOP3 DPP instructions can be SGPR on GFX12 (#77929 )	2024-01-17 15:57:36 +00:00
Matt Arsenault	af4f1766ae	AMDGPU: Allocate special SGPRs before user SGPR arguments (#78234 )	2024-01-17 21:41:50 +07:00
Jay Foad	f12059eb3f	[AMDGPU] Fix llvm.amdgcn.s.wait.event.export.ready for GFX12 (#78191 ) The meaning of bit 0 of the immediate operand of S_WAIT_EVENT has been flipped from GFX11.	2024-01-17 11:59:15 +00:00
Jay Foad	e9e9d1b0b1	[AMDGPU] Disable V_MAD_U64_U32/V_MAD_I64_I32 workaround for GFX12 (#77927 )	2024-01-17 11:52:19 +00:00
Petar Avramovic	1fbf533286	AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis (#76145 ) Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper. Use machine uniformity analysis to find divergent i1 phis and select them as lane mask phis in same way SILowerI1Copies select VReg_1 phis. Note that divergent i1 phis include phis created by LCSSA and all cases of uses outside of cycle are actually covered by "lowering LCSSA phis". GlobalISel lane masks are registers with sgpr register class and S1 LLT. TODO: General goal is that instructions created in this pass are fully instruction-selected so that selection of lane mask phis is not split across multiple passes. patch 3 from: https://github.com/llvm/llvm-project/pull/73337	2024-01-17 12:10:24 +01:00
Jay Foad	4a77414660	[AMDGPU] CodeGen for GFX12 8/16-bit SMEM loads (#77633 )	2024-01-17 10:28:03 +00:00
Jay Foad	42b9ea841e	[AMDGPU] Increase max scratch allocation for GFX12 (#77625 )	2024-01-17 10:25:28 +00:00
Jay Foad	36ef291d63	[AMDGPU] Fix hang caused by VS_CNT handling at calls (#78318 ) Fix a potential hang introduced by #77439 and #77935. This line: setScoreUB(VS_CNT, getScoreLB(VS_CNT) + getWaitCountMax(VS_CNT)); could potentially set UB lower than it was before, which confused SIInsertWaitcnts's fixed point algorithm. This was only triggered by a STORE instruction with an implicit-def, which seems odd but apparently happens for some spills.	2024-01-17 10:24:29 +00:00
Matt Arsenault	53a3c738a9	AMDGPU: Remove fixed fixme from a test	2024-01-17 16:52:50 +07:00
Fangrui Song	9e9907f1cf	[AMDGPU,test] Change llc -march= to -mtriple= (#75982 ) Similar to 806761a7629df268c8aed49657aeccffa6bca449. For IR files without a target triple, -mtriple= specifies the full target triple while -march= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. amdgpu-apple-darwin. Therefore, -march= is error-prone and not recommended for tests without a target triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outright. This patch changes AMDGPU tests to not rely on the default OS/environment components. Tests that need fixes are not changed: ``` LLVM :: CodeGen/AMDGPU/fabs.f64.ll LLVM :: CodeGen/AMDGPU/fabs.ll LLVM :: CodeGen/AMDGPU/floor.ll LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll LLVM :: CodeGen/AMDGPU/fneg-fabs.ll LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll LLVM :: CodeGen/AMDGPU/schedule-if-2.ll ```	2024-01-16 21:54:58 -08:00
Florian Mayer	f3190c78ec	Revert "[AMDGPU] Sign extend simm16 in setreg intrinsic" (#78372 ) Reverts llvm/llvm-project#77997 Broke UBSan bots.	2024-01-16 16:37:48 -08:00
Stanislav Mekhanoshin	371fdbaa57	[AMDGPU] Sign extend simm16 in setreg intrinsic (#77997 ) We currently force users to use a negative constant in the intrinsic call. Changing it to zext would break existing programs, so just sign extend an argument.	2024-01-16 09:17:18 -08:00
Pierre van Houtryve	4b0a76a3d7	[GlobalISel] Fix buildCopyFromRegs for split vectors (#77448 ) Fixes #77055	2024-01-16 10:04:20 +01:00
Matt Arsenault	480cc413b7	AMDGPU/GlobalISel: Handle inreg arguments as SGPRs (#78123 ) This is the missing GISel part of 54470176afe20b16e6b026ab989591d1d19ad2b7	2024-01-16 15:13:31 +07:00
Shilei Tian	d63c2e52e6	[AMDGPU][MC] Remove incorrect `_e32` suffix from `v_dot2c_f32_f16` and `v_dot4c_i32_i8` (#77993 ) The two VOP2 instructions cannot be encoded as VOP3. Fix #54691.	2024-01-15 23:11:50 -05:00
Jay Foad	ba131b7017	[AMDGPU] Do not generate s_set_inst_prefetch_distance for GFX12 (#78190 ) GFX12 can still encode the s_set_inst_prefetch_distance instruction but it has no effect.	2024-01-15 18:20:45 +00:00
Jay Foad	ed60cb8fb9	[AMDGPU] Disable hasVALUPartialForwardingHazard for GFX12 (#78188 )	2024-01-15 18:20:10 +00:00
Jay Foad	85705bbf1d	[AMDGPU] Disable hasVALUMaskWriteHazard for GFX12 (#78187 )	2024-01-15 18:19:32 +00:00
chuongg3	fcfe1b6482	[GlobalISel] Refactor extractParts() (#75223 ) Moved extractParts() and extractVectorParts() from LegalizerHelper to Utils to be able to use it in different passes. extractParts() will also try to use unmerge when doing irregular splits where possible, falling back to extract elements when not.	2024-01-15 16:40:39 +00:00
Jay Foad	f3d07881c8	[AMDGPU] Remove functions with incompatible gws attribute (#78143 ) This change is to remove incompatible gws related functions in order to make device-libs work correctly under -O0 for gfx1200+ Co-authored-by: Changpeng Fang <changpeng.fang@amd.com>	2024-01-15 16:23:39 +00:00
Krzysztof Drewniak	88871784fd	[AMDGPU] Allow buffer intrinsics to be marked volatile at the IR level (#77847 ) In order to ensure the correctness of ptr addrspace(7) lowering, we need a backwards-compatible way to flag buffer intrinsics as volatile that can't be dropped (unlike metadata). To achieve this in a backwards-compatible way, we use bit 31 of the auxiliary immediates of buffer intrinsics as the volatile flag. When this bit is set, the MachineMemOperand for said intrinsic is marked volatile. Existing code will ensure that this results in the appropriate use of flags like glc and dlc. This commit also harmonizes the handling of the auxiliary immediate for atomic intrinsics, which now go through extract_cpol like loads and stores, which masks off the volatile bit.	2024-01-12 11:20:01 -06:00
Jay Foad	dec74a8347	[AMDGPU] Fix VS_CNT overflow assertion (#77935 ) Always set the upper bound for VS_CNT higher than the lower bound. Before #77439 this code was only executed on function entry where the lower bound was 0 so it was not a problem. Fixes #77931	2024-01-12 17:11:19 +00:00
Carl Ritson	6752f1517d	[TwoAddressInstruction] Recompute live intervals for partial defs (#74431 ) Force live interval recomputation for a register if its definition is narrowed to become partial. The live interval repair process cannot otherwise detect these changes.	2024-01-12 13:26:01 +09:00
Mirko Brkušanin	3867e6689e	[AMDGPU] Add new GFX12 image atomic float instructions (#76946 )	2024-01-11 17:28:04 +01:00
Jay Foad	b120dae9bb	[AMDGPU] Support GFX12 VDSDIR instructions WAITVMSRC operand in GCNHazardRecognizer (#77628 ) Modify GCNHazardRecognizer::fixLdsDirectVMEMHazard() so the waitvsrc operand in gfx12 DS_PARAM_LOAD or DS_DIRECT_LOAD instructions is set appropriately depending on whether a hazard is found or not, rather than inserting an S_WAITCNT_DEPCTR instruction if a hazard needs to be mitigated. Co-authored-by: Stephen Thomas <Stephen.Thomas@amd.com>	2024-01-11 13:20:19 +00:00
Amara Emerson	bbbe8ecc17	[GlobalISel][Localizer] Allow localization of a small number of repeated phi uses. (#77566 ) We previously had a heuristic that if a value V was used multiple times in a single PHI, then to avoid potentially rematerializing into many predecessors we bail out. The phi uses only counted as a single use in the shouldLocalize() hook because it counted the PHI as a single instruction use, not factoring in it may have many incoming edges. It turns out this heuristic is slightly too pessimistic, and allowing a small number of these uses to be localized can improve code size due to shortening live ranges, especially if those ranges span a call. This change results in some improvements in size on CTMark -Os: ``` Program size.__text before after diff kimwitu++/kc 451676.00 451860.00 0.0% mafft/pairlocalalign 241460.00 241540.00 0.0% tramp3d-v4/tramp3d-v4 389216.00 389208.00 -0.0% 7zip/7zip-benchmark 587528.00 587464.00 -0.0% Bullet/bullet 457424.00 457348.00 -0.0% consumer-typeset/consumer-typeset 405472.00 405376.00 -0.0% SPASS/SPASS 410288.00 410120.00 -0.0% lencod/lencod 426396.00 426108.00 -0.1% ClamAV/clamscan 380108.00 379756.00 -0.1% sqlite3/sqlite3 283664.00 283372.00 -0.1% Geomean difference -0.0% ``` I experimented with different variations and thresholds. Using 3 instead of 2 resulted in a further 0.1% improvement on ClamAV but also regressed sqlite3 by the same %.	2024-01-11 18:57:37 +08:00
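
The volatile-flag encoding described in the 88871784fd entry above (bit 31 of a buffer intrinsic's auxiliary immediate marks the access volatile, and extract_cpol masks that bit off) can be modelled with a few lines of bit arithmetic. This is only an illustrative Python sketch, not LLVM code: the names `VOLATILE_BIT`, `is_volatile`, `with_volatile` and the Python `extract_cpol` are hypothetical stand-ins, and the real extract_cpol may mask more than this one bit.

```python
# Illustrative model only; all names here are hypothetical, not LLVM's API.
VOLATILE_BIT = 1 << 31   # bit 31 of the auxiliary immediate, per the commit message
U32_MASK = 0xFFFFFFFF    # the auxiliary immediate is treated as a 32-bit value


def with_volatile(aux: int) -> int:
    """Return the auxiliary immediate with the volatile flag set."""
    return (aux | VOLATILE_BIT) & U32_MASK


def is_volatile(aux: int) -> bool:
    """True if bit 31 is set, i.e. the memory operand should be marked volatile."""
    return bool(aux & VOLATILE_BIT)


def extract_cpol(aux: int) -> int:
    """Assumed behaviour: drop the volatile bit and keep the remaining bits."""
    return aux & ~VOLATILE_BIT & U32_MASK


if __name__ == "__main__":
    aux = with_volatile(0b11)
    print(hex(aux), is_volatile(aux), extract_cpol(aux))
```

Because the flag lives in the immediate itself rather than in metadata, it survives any transformation that preserves the call's operands, which is the property the commit message asks for.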

7117 Commits
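
The "Sign extend simm16 in setreg intrinsic" entries (371fdbaa57, reapplied as 558ea41159) turn on the difference between zero-extending and sign-extending a 16-bit immediate. The following Python sketch shows generic 16-bit sign extension under that reading; `sign_extend_simm16` is a hypothetical helper for illustration and is not taken from the LLVM sources.

```python
def sign_extend_simm16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed 16-bit integer.

    Hypothetical helper: models generic simm16 sign extension only.
    """
    low = value & 0xFFFF          # keep only the 16-bit field
    if low & 0x8000:              # sign bit set: value is negative
        return low - 0x10000
    return low


if __name__ == "__main__":
    # A negative constant and its 16-bit unsigned spelling map to the same value.
    print(sign_extend_simm16(-1), sign_extend_simm16(0xFFFF))
```

Under this model an existing caller that passes a negative constant keeps its meaning, while a caller that passes the equivalent unsigned 16-bit pattern gets the same result, which matches the compatibility concern stated in the commit message.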