SIMemoryLegalizer tests were slow, with most of them taking 4.5 to 5.3s
to complete and that's on a fast machine. I also recall seeing them in
the slowest tests list on build bots.
This removes the verify-machineinstrs option from these tests to speed
them up, bringing the slowest test down to +-2s.
Verifier still runs in EXPENSIVE_CHECKS builds.
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.
This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
That instruction is not supported on GFX12.
Added a testcase which previously crashed without this change.
Co-authored-by: pvanhout <pierre.vanhoutryve@amd.com>
Add custom lowering for ctlz.i8 to avoid multiple add/sub operations.
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Support new amdgcn_global_load_tr instructions for load with transpose.
* MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128
* Intrinsic int_amdgcn_global_load_tr
* Clang builtins amdgcn_global_load_tr*
New pseudos were added for instructions that were natively VOP3 on
GFX11: V_ADD_F64_pseudo, V_MUL_F64_pseudo, V_MIN_NUM_F64, V_MAX_NUM_F64,
V_LSHLREV_B64_pseudo
---------
Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>
Endoding is VOP3P. Tagged as deep/machine learning instructions. i32
type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and
src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32
fneg(neg_lo[2]) and f32 fabs(neg_hi[2]).
---------
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait
instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into
separate LOADCNT, SAMPLECNT and BVHCNT counters.
https://github.com/llvm/llvm-project/pull/70634 has disabled use
of potentially negative scratch offsets, but we can use it on GFX12.
---------
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on
this counter to only read memory after it is written. Wait count
insertion pass does not track memory dependencies, it tracks register
dependencies. To model the LDS dependency a pseudo register is used in
the scoreboard, acting like if LDS DMA writes it and LDS load reads it.
This patch adds 8 more pseudo registers to use for independent LDS
locations if we can prove they are disjoint using alias analysis.
Fixes: SWDEV-433427
We currently force users to use a negative contant in the intrinsic
call. Changing it zext would break existing programs, so just sign
extend an argument.
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.
TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.
patch 3 from: https://github.com/llvm/llvm-project/pull/73337
Fix a potential hang introduced by #77439 and #77935. This line:
setScoreUB(VS_CNT, getScoreLB(VS_CNT) + getWaitCountMax(VS_CNT));
could potentialy set UB lower than it was before, which confused
SIInsertWaitcnts's fixed point algorithm.
This was only triggered a STORE instruction with an implicit-def, which
seems odd but apparently happens for some spills.
Similar to 806761a7629df268c8aed49657aeccffa6bca449.
For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.
Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outrightly.
This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:
```
LLVM :: CodeGen/AMDGPU/fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fabs.ll
LLVM :: CodeGen/AMDGPU/floor.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
We currently force users to use a negative contant in the intrinsic
call. Changing it zext would break existing programs, so just sign
extend an argument.
Moved extractParts() and extractVectorParts() from LegalizerHelper
to Utils to be able to use it in different passes.
extractParts() will also try to use unmerge when doing irregular
splits where possible, falling back to extract elements when not.
This change is to remove incompatible gws related functions
in order to make device-libs work correctly under -O0 for
gfx1200+
Co-authored-by: Changpeng Fang <changpeng.fang@amd.com>
In order to ensure the correctness of ptr addrspace(7) lowering, we need
a backwards-compatible way to flag buffer intrinsics as volatile that
can't be dropped (unlike metadata).
To acheive this in a backwards-compatible way, we use bit 31 of the
auxilliary immediates of buffer intrinsics as the volatile flag. When
this bit is set, the MachineMemOperand for said intrinsic is marked
volatile. Existing code will ensure that this results in the appropriate
use of flags like glc and dlc.
This commit also harmorizes the handling of the auxilliary immediate for
atomic intrinsics, which new go through extract_cpol like loads and
stores, which masks off the volatile bit.
Always set the upper bound for VS_CNT higher than the lower bound.
Before #77439 this code was only executed on function entry where the
lower bound was 0 so it was not a problem.
Fixes#77931
Force live interval recomputation for a register if its definition is
narrowed to become partial. The live interval repair process cannot
otherwise detect these changes.
Modify GCNHazardRecognizer::fixLdsDirectVMEMHazard() so the waitvsrc
operand
in gfx12 DS_PARAM_LOAD or DS_DIRECT_LOAD instructions is set
appropriately
depending on whether a hazard is found or not, rather than inserting an
S_WAITCNT_DEPCTR instruction if a hazard needs to be mitigated.
Co-authored-by: Stephen Thomas <Stephen.Thomas@amd.com>
We previously had a heuristic that if a value V was used multiple times
in a single PHI, then to avoid potentially rematerializing into many predecessors
we bail out. The phi uses only counted as a single use in the shouldLocalize() hook
because it counted the PHI as a single instruction use, not factoring in it may
have many incoming edges.
It turns out this heuristic is slightly too pessimistic, and allowing a small number
of these uses to be localized can improve code size due to shortening live ranges,
especially if those ranges span a call.
This change results in some improvements in size on CTMark -Os:
```
Program size.__text
before after diff
kimwitu++/kc 451676.00 451860.00 0.0%
mafft/pairlocalalign 241460.00 241540.00 0.0%
tramp3d-v4/tramp3d-v4 389216.00 389208.00 -0.0%
7zip/7zip-benchmark 587528.00 587464.00 -0.0%
Bullet/bullet 457424.00 457348.00 -0.0%
consumer-typeset/consumer-typeset 405472.00 405376.00 -0.0%
SPASS/SPASS 410288.00 410120.00 -0.0%
lencod/lencod 426396.00 426108.00 -0.1%
ClamAV/clamscan 380108.00 379756.00 -0.1%
sqlite3/sqlite3 283664.00 283372.00 -0.1%
Geomean difference -0.0%
```
I experimented with different variations and thresholds. Using 3 instead
of 2 resulted in a further 0.1% improvement on ClamAV but also regressed
sqlite3 by the same %.