The enc/dec of promoted AMX-TILE instructions have been supported in
https://github.com/llvm/llvm-project/pull/76210.
This patch support lowering for promoted AMX-TILE instructions and
integrate test to existing tests.
https://github.com/llvm/llvm-project/pull/76125 supported the enc/dec
for CMPCCXADD instructions, this patch
1. Add lowering test for promoted CMPCCXADD
2. Update the representation of condition code for promoted CMPCCXADD to
align with the existing one
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.
This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
If a function has ZT0 state and calls a function which does not
preserve ZT0, the caller must save and restore ZT0 around the call.
If the caller shares ZT0 state and the callee is not shared ZA, we must
additionally call SMSTOP/SMSTART ZA around the call.
This patch adds new AArch64ISDNodes for spilling & filling ZT0.
Where requiresPreservingZT0 is true, ZT0 state will be preserved
across a call.
We want to know the upper 33 bits of the And Input are zero. SExt
only guarantees they are the same.
We originally checked for SExt or ZExt when we were using
isImpliedByDomCondition because a ZExt may have been changed to SExt
before we visited the And.
We are no longer using isImpliedByDomCondition so we can only look
for zext with the nneg flag.
While here, switch to PatternMatch to simplify the code.
Fixes#78783
We just need to check if the global is large or not.
In the kernel code model, globals are in the negative 2GB of the address
space, so globals can be a sign extended 32-bit immediate.
In other code models, small globals are in the low 2GB of the address
space, so sign extending them is equivalent to zero extending them.
That instruction is not supported on GFX12.
Added a testcase which previously crashed without this change.
Co-authored-by: pvanhout <pierre.vanhoutryve@amd.com>
If we're loading a constant value, print the constant (and the zero upper elements) instead of just the shuffle mask.
This did require me to move the shuffle mask handling into addConstantComments as we can't handle this in the MC layer.
The new Clang attributes no longer support the combination of having a
private-ZA function that preserves ZA. The use of __arm_preserves("za")
means that ZA is shared and preserved.
There wasn't that much benefit to the special handling of this, because
in practice it only meant that we'd avoid restoring the lazy-save
afterwards, but it still needed setting up a lazy-save (with the
possibility of using a 0-sized buffer).
Perhaps a new attribute will be added in the future to support this
case, at which point we can revert back some of the changes removed in
this patch. But for now removing this code simplifies things.
Fixes#77915
Previously I based the operand copying on expandCALL_RVMARKER but did
not understand it properly at the time. This lead to me dropping the
arguments of the function being branched to.
This fixes that by copying all operands from the BLR_BTI to the BL/BLR
without skipping anything.
I've updated the existing test by adding function arguments.
Add custom lowering for ctlz.i8 to avoid multiple add/sub operations.
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
If the element type of the vector we're extracting from doesn't match the type we're
inserting into, we can't directly insert or extract the subvector.
This adds minimal support for 7 new unprivileged extensions that were
defined as a part of
the RISC-V Profiles specification here:
https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc#7-new-isa-extensions
* Ziccif: Main memory supports instruction fetch with atomicity
requirement
* Ziccrse: Main memory supports forward progress on LR/SC sequences
* Ziccamoa: Main memory supports all atomics in A
* Zicclsm: Main memory supports misaligned loads/stores
* Za64rs: Reservation set size of 64 bytes
* Za128rs: Reservation set size of 128 bytes
* Zic64b: Cache block size isf 64 bytes
As stated in the specification, these extensions don't add any new
features but
describe existing features. So this patch only adds parsing and
subtarget
features.
When merging blocks, if the previous block has no any branch instruction
and has one successor, the successor may be SEH landing pad and the
block will always raise exception and nerver fall through to next block.
We can not merge them in such case. isSuccessor should be used to
confirm it can fall through to next block.
Support new amdgcn_global_load_tr instructions for load with transpose.
* MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128
* Intrinsic int_amdgcn_global_load_tr
* Clang builtins amdgcn_global_load_tr*
New pseudos were added for instructions that were natively VOP3 on
GFX11: V_ADD_F64_pseudo, V_MUL_F64_pseudo, V_MIN_NUM_F64, V_MAX_NUM_F64,
V_LSHLREV_B64_pseudo
---------
Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>
Endoding is VOP3P. Tagged as deep/machine learning instructions. i32
type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and
src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32
fneg(neg_lo[2]) and f32 fabs(neg_hi[2]).
---------
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Replacing a free extension with 2 or more extensions unnecessarily
increases the number of IR instructions without providing any benefits.
It also unnecessarily causes operations to be performed on wider types
than necessary.
In some cases, the extra extensions also pessimize codegen (see
bfis-in-loop.ll).
The changes in arm64-codegen-prepare-extload.ll also show that we avoid
promotions that should only be performed in stress mode.
PR: https://github.com/llvm/llvm-project/pull/77094
Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait
instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into
separate LOADCNT, SAMPLECNT and BVHCNT counters.
https://github.com/llvm/llvm-project/pull/70634 has disabled use
of potentially negative scratch offsets, but we can use it on GFX12.
---------
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
This patch adds conditional enabling/disabling of streaming mode for
functions which have both the aarch64_pstate_sm_compatible and
aarch64_pstate_sm_body attributes.
This combination allows callees to determine if switching streaming mode
is required instead of relying on the caller.
LLVM vector reduction intrinsics return a scalar result, but on RISC-V
vector reduction instructions write the result in the first element of a
vector register. So when a reduction in a loop uses a scalar phi, we end
up with unnecessary scalar moves:
loop:
vfmv.s.f v10, fa0
vfredosum.vs v8, v8, v10
vfmv.f.s fa0, v8
This mainly affects ordered fadd reductions, which has a scalar accumulator
operand.
This tries to vectorize any scalar phis that feed into a fadd reduction
in RISCVCodeGenPrepare, converting:
loop:
%phi = phi <float> [ ..., %entry ], [ %acc, %loop]
%acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x float> %vec)
```
to
loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc.next, i64 0
Which eliminates the scalar -> vector -> scalar crossing during
instruction selection.
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on
this counter to only read memory after it is written. Wait count
insertion pass does not track memory dependencies, it tracks register
dependencies. To model the LDS dependency a pseudo register is used in
the scoreboard, acting like if LDS DMA writes it and LDS load reads it.
This patch adds 8 more pseudo registers to use for independent LDS
locations if we can prove they are disjoint using alias analysis.
Fixes: SWDEV-433427
This patch was originally introduced in PR #72340, but was reverted due
to a bug on invalid extension combine.
Specifically, we resolve the case in the
https://github.com/llvm/llvm-project/pull/72340#issuecomment-1874810998
```
define <vscale x 1 x i32> @foo(<vscale x 1 x i1> %x, <vscale x 1 x i2> %y) {
%a = zext <vscale x 1 x i1> %x to <vscale x 1 x i32>
%b = zext <vscale x 1 x i1> %y to <vscale x 1 x i32>
%c = add <vscale x 1 x i32> %a, %b
ret <vscale x 1 x i32> %c
}
```
The previous patch didn't check if the semantic of `ISD::ZERO_EXTEND`
and `ISD::ZERO_EXTEND` is equivalent to the `vsext.vf2` or `vzext.vf2`
(not ensuring the SEW condition on widening Vector Arithmetic
Instructions).
Thanks for @topperc pointing out this bug.
## The original description
This PR mainly aims at resolving the below missed-optimization case,
while it could also be considered as an extension of the previous patch
https://reviews.llvm.org/D133739?id=
### Missed-Optimization Case
Compiler Explorer: https://godbolt.org/z/GzWzP7Pfh
### Source Code:
```
define <vscale x 2 x i16> @multiple_users(ptr %x, ptr %y, ptr %z) {
%a = load <vscale x 2 x i8>, ptr %x
%b = load <vscale x 2 x i8>, ptr %y
%b2 = load <vscale x 2 x i8>, ptr %z
%c = sext <vscale x 2 x i8> %a to <vscale x 2 x i16>
%d = sext <vscale x 2 x i8> %b to <vscale x 2 x i16>
%d2 = sext <vscale x 2 x i8> %b2 to <vscale x 2 x i16>
%e = mul <vscale x 2 x i16> %c, %d
%f = add <vscale x 2 x i16> %c, %d2
%g = sub <vscale x 2 x i16> %c, %d2
%h = or <vscale x 2 x i16> %e, %f
%i = or <vscale x 2 x i16> %h, %g
ret <vscale x 2 x i16> %i
}
```
### Before This Patch
```
# %bb.0:
vsetvli a3, zero, e16, mf2, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
vle8.v v10, (a2)
svf2 v11, v8
vsext.vf2 v8, v9
vsext.vf2 v9, v10
vmul.vv v8, v11, v8
vadd.vv v10, v11, v9
vsub.vv v9, v11, v9
vor.vv v8, v8, v10
vor.vv v8, v8, v9
ret
```
### After This Patch
```
# %bb.0:
vsetvli a3, zero, e8, mf4, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
vle8.v v10, (a2)
vwmul.vv v11, v8, v9
vwadd.vv v9, v8, v10
vwsub.vv v12, v8, v10
vsetvli zero, zero, e16, mf2, ta, ma
vor.vv v8, v11, v9
vor.vv v8, v8, v12
ret
```
We can see Add/Sub/Mul are combined with the Sign Extension.
### Relation to the Patch D133739
The patch D133739 introduced an optimization for folding `ADD_VL`/
`SUB_VL` / `MUL_V` with `VSEXT_VL` / `VZEXT_VL`. However, the patch did
not consider the case of non-fixed length vector case, thus this PR
could also be considered as an extension for the D133739.
We currently force users to use a negative contant in the intrinsic
call. Changing it zext would break existing programs, so just sign
extend an argument.
Ensure intrinsics and auto-upgrades support i16, i32, and i64 for for
`nvvm.{min,max,mulhi,sad}`
- `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select`
instructions but it is still nice to support the 16 bit variants just in
case any generators of IR are still trying to use these intrinsics.
- `nvvm.sad` added both the 16 and 64 bit variants, also marked this
instruction as speculateble. These directly correspond to the PTX
`sad.{u16,s16,u64,s64}` instructions.
- `nvvm.mulhi` added the 16 bit variants. These directly correspond to
the PTX `mul.hi.{s,u}16` instructions.