51669 Commits

Author SHA1 Message Date
XinWang10
dd6fec5d4f
[X86][APX]Support lowering for APX promoted AMX-TILE instructions (#78689)
The enc/dec of promoted AMX-TILE instructions have been supported in
https://github.com/llvm/llvm-project/pull/76210.
This patch support lowering for promoted AMX-TILE instructions and
integrate test to existing tests.
2024-01-22 11:33:23 +08:00
XinWang10
d3cd1ce6ab
[X86] Add lowering tests for promoted CMPCCXADD and update CC representation (#78685)
https://github.com/llvm/llvm-project/pull/76125 supported the enc/dec
for CMPCCXADD instructions, this patch
1. Add lowering test for promoted CMPCCXADD
2. Update the representation of condition code for promoted CMPCCXADD to
align with the existing one
2024-01-22 11:32:03 +08:00
Emma Pilkington
bc82cfb38d
[AMDGPU] Add an asm directive to track code_object_version (#76267)
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.

This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
2024-01-21 11:54:47 -05:00
Fangrui Song
d0230446d2 [AArch64] Remove non-sensible define nonlazybind test
nonlazybind is for declarations, not for definitions. We could test the
behavior, but the output would be misleading.
2024-01-20 23:39:07 -08:00
Fangrui Song
f9614b328a [AArch64] Improve nonlazybind test
Prepare for -fno-plt implementation.
2024-01-20 22:16:29 -08:00
Kerry McLaughlin
a8a3711e74
[AArch64][SME2] Preserve ZT0 state around function calls (#78321)
If a function has ZT0 state and calls a function which does not
preserve ZT0, the caller must save and restore ZT0 around the call.
If the caller shares ZT0 state and the callee is not shared ZA, we must
additionally call SMSTOP/SMSTART ZA around the call.

This patch adds new AArch64ISDNodes for spilling & filling ZT0.
Where requiresPreservingZT0 is true, ZT0 state will be preserved
across a call.
2024-01-20 12:06:00 +00:00
Jay Foad
63d7ca924f
[AMDGPU] Add GFX12 llvm.amdgcn.s.wait.*cnt intrinsics (#78723) 2024-01-20 11:44:42 +00:00
Craig Topper
9396891271 [RISCV] Don't look for sext in RISCVCodeGenPrepare::visitAnd.
We want to know the upper 33 bits of the And Input are zero. SExt
only guarantees they are the same.

We originally checked for SExt or ZExt when we were using
isImpliedByDomCondition because a ZExt may have been changed to SExt
before we visited the And.

We are no longer using isImpliedByDomCondition so we can only look
for zext with the nneg flag.

While here, switch to PatternMatch to simplify the code.

Fixes #78783
2024-01-19 14:44:47 -08:00
Craig Topper
66cea7143a [RISCV] Add test case for #78783. NFC 2024-01-19 14:44:47 -08:00
Arthur Eubanks
86eaf6083b
[X86] Refine X86DAGToDAGISel::isSExtAbsoluteSymbolRef() (#76191)
We just need to check if the global is large or not.

In the kernel code model, globals are in the negative 2GB of the address
space, so globals can be a sign extended 32-bit immediate.

In other code models, small globals are in the low 2GB of the address
space, so sign extending them is equivalent to zero extending them.
2024-01-19 14:11:18 -08:00
Craig Topper
9ae28fb9d3
[RISCV] Prevent RISCVMergeBaseOffsetOpt from calling getVRegDef on a physical register. (#78762)
Fixes #78679.
2024-01-19 12:15:08 -08:00
Min-Yih Hsu
5330daad41
[RISCV] Add support for Smepmp 1.0 (#78489)
Smepmp is a supervisor extension that prevents privileged processes from
accessing unprivileged program and data.

Spec: https://github.com/riscv/riscv-tee/blob/main/Smepmp/Smepmp.pdf
2024-01-19 11:09:35 -08:00
Durgadoss R
43531e7196
[LLVM][NVPTX] Add cp.async.bulk.commit/wait intrinsics (#78698)
This patch adds NVVM intrinsics and NVPTX codegen for the bulk variants
of the async-copy commit/wait instructions.
lit tests are added to verify the generated PTX.

PTX Doc link:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-commit-group

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-01-19 10:42:33 -08:00
Jay Foad
89226ecbb9
[AMDGPU] Do not widen scalar loads on GFX12 (#78724)
GFX12 has subword scalar loads so there is no need to do this.
2024-01-19 15:30:07 +00:00
Jay Foad
ed12388082
[AMDGPU] Do not emit V_DOT2C_F32_F16_e32 on GFX12 (#78709)
That instruction is not supported on GFX12.
Added a testcase which previously crashed without this change.

Co-authored-by: pvanhout <pierre.vanhoutryve@amd.com>
2024-01-19 14:36:27 +00:00
Simon Pilgrim
a2a0089ac3
[X86] movsd/movss/movd/movq - add support for constant comments (#78601)
If we're loading a constant value, print the constant (and the zero upper elements) instead of just the shuffle mask.

This did require me to move the shuffle mask handling into addConstantComments as we can't handle this in the MC layer.
2024-01-19 14:21:26 +00:00
Sander de Smalen
340054e561
[AArch64][SME] Remove combination of private-ZA and preserves_za. (#78563)
The new Clang attributes no longer support the combination of having a
private-ZA function that preserves ZA. The use of __arm_preserves("za")
means that ZA is shared and preserved.

There wasn't that much benefit to the special handling of this, because
in practice it only meant that we'd avoid restoring the lazy-save
afterwards, but it still needed setting up a lazy-save (with the
possibility of using a 0-sized buffer).

Perhaps a new attribute will be added in the future to support this
case, at which point we can revert back some of the changes removed in
this patch. But for now removing this code simplifies things.
2024-01-19 13:48:44 +00:00
Danila Malyutin
9ad7d8f0e4
[Statepoint] Optimize Location structure size (#78600)
Reduce its size from 24 to 12 bytes. Improves memory consumption when
dealing with statepoint-heavy code.
2024-01-19 17:15:36 +04:00
David Spickett
955417ade2 Revert "[llvm][AArch64] Copy all operands when expanding BLR_BTI bundle (#78267)"
This reverts commit 228aecbcf106a50c30b1f8f1915d61850860cbcd.

Failing expensive checks: https://lab.llvm.org/buildbot/#/builders/16/builds/59798
2024-01-19 12:06:30 +00:00
Jay Foad
879cbe06ed
[AMDGPU] Fix predicates for BUFFER_ATOMIC_CSUB pattern (#78701)
Use OtherPredicates to avoid interfering with other uses of
SubtargetPredicate for GFX12.
2024-01-19 12:01:31 +00:00
David Spickett
228aecbcf1
[llvm][AArch64] Copy all operands when expanding BLR_BTI bundle (#78267)
Fixes #77915

Previously I based the operand copying on expandCALL_RVMARKER but did
not understand it properly at the time. This lead to me dropping the
arguments of the function being branched to.

This fixes that by copying all operands from the BLR_BTI to the BL/BLR
without skipping anything.

I've updated the existing test by adding function arguments.
2024-01-19 11:22:43 +00:00
Mirko Brkušanin
0185c76456
[AMDGPU] Fix test for expensive-checks build (#78687) 2024-01-19 11:32:02 +01:00
Leon Clark
2759cfa0c3
[AMDGPU] Remove unnecessary add instructions in ctlz.i8 (#77615)
Add custom lowering for ctlz.i8 to avoid multiple add/sub operations.

---------

Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2024-01-19 10:16:46 +00:00
Craig Topper
0ad83bc26c
[RISCV] Don't look through EXTRACT_ELEMENT in lowerScalarInsert if the element types are different. (#78668)
If the element type of the vector we're extracting from doesn't match the type we're
inserting into, we can't directly insert or extract the subvector.
2024-01-18 22:35:24 -08:00
Christudasan Devadasan
4d566e57a2 [AMDGPU] Precommit lit test. 2024-01-19 09:32:03 +05:30
Luke Lau
8649328060
[RISCV] Add support for new unprivileged extensions defined in profiles spec (#77458)
This adds minimal support for 7 new unprivileged extensions that were
defined as a part of
the RISC-V Profiles specification here:

https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc#7-new-isa-extensions

* Ziccif: Main memory supports instruction fetch with atomicity
requirement
* Ziccrse: Main memory supports forward progress on LR/SC sequences
* Ziccamoa: Main memory supports all atomics in A
* Zicclsm: Main memory supports misaligned loads/stores
* Za64rs: Reservation set size of 64 bytes
* Za128rs: Reservation set size of 128 bytes
* Zic64b: Cache block size isf 64 bytes

As stated in the specification, these extensions don't add any new
features but
describe existing features. So this patch only adds parsing and
subtarget
features.
2024-01-19 06:57:06 +07:00
Florian Hahn
83365152a4
[AArch64] Add tests for operations on vectors with 3 elements. 2024-01-18 21:42:06 +00:00
Haohai Wen
fb2c6bbf42
[BranchFolding] Use isSuccessor to confirm fall through (#77923)
When merging blocks, if the previous block has no any branch instruction
and has one successor, the successor may be SEH landing pad and the
block will always raise exception and nerver fall through to next block.
We can not merge them in such case. isSuccessor should be used to
confirm it can fall through to next block.
2024-01-18 23:26:22 +08:00
Simon Pilgrim
33287e35f2
[X86] Emit verbose (constant) comments before EVEX compression tag (#78585)
This helps ensure the encoding details are next to the EVEX tag

Noticed while preparing to add more constant commenting as part of #73783 and #71078
2024-01-18 15:13:42 +00:00
Piotr Sobczak
57f6a3f7ea
[AMDGPU] Add global_load_tr for GFX12 (#77772)
Support new amdgcn_global_load_tr instructions for load with transpose.

* MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128
* Intrinsic int_amdgcn_global_load_tr
* Clang builtins amdgcn_global_load_tr*
2024-01-18 15:14:42 +01:00
Jay Foad
745b193260 [AMDGPU] Regenerate tests for #77892 after #77438 2024-01-18 13:50:59 +00:00
Jay Foad
0a3a0ea591
[AMDGPU] Update uses of new VOP2 pseudos for GFX12 (#78155)
New pseudos were added for instructions that were natively VOP3 on
GFX11: V_ADD_F64_pseudo, V_MUL_F64_pseudo, V_MIN_NUM_F64, V_MAX_NUM_F64,
V_LSHLREV_B64_pseudo

---------

Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>
2024-01-18 13:26:13 +00:00
Mariusz Sikora
3e6589f21c
[AMDGPU][GFX12] Add 16 bit atomic fadd instructions (#75917)
- image_atomic_pk_add_f16
- image_atomic_pk_add_bf16
- ds_pk_add_bf16
- ds_pk_add_f16
- ds_pk_add_rtn_bf16
- ds_pk_add_rtn_f16
- flat_atomic_pk_add_f16
- flat_atomic_pk_add_bf16
- global_atomic_pk_add_f16
- global_atomic_pk_add_bf16
- buffer_atomic_pk_add_f16
- buffer_atomic_pk_add_bf16
2024-01-18 14:01:09 +01:00
Mariusz Sikora
28b7e498b6
AMDGPU/GFX12: Add new dot4 fp8/bf8 instructions (#77892)
Endoding is VOP3P. Tagged as deep/machine learning instructions. i32
type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and
src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32
fneg(neg_lo[2]) and f32 fabs(neg_hi[2]).

---------

Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
2024-01-18 14:00:27 +01:00
Florian Hahn
40d952b874
[CGP] Avoid replacing a free ext with multiple other exts. (#77094)
Replacing a free extension with 2 or more extensions unnecessarily
increases the number of IR instructions without providing any benefits.
It also unnecessarily causes operations to be performed on wider types
than necessary.

In some cases, the extra extensions also pessimize codegen (see
bfis-in-loop.ll).

The changes in arm64-codegen-prepare-extload.ll also show that we avoid
promotions that should only be performed in stress mode.

PR: https://github.com/llvm/llvm-project/pull/77094
2024-01-18 10:48:10 +00:00
Jay Foad
ba52f06f9d
[AMDGPU] CodeGen for GFX12 S_WAIT_* instructions (#77438)
Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait
instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into
separate LOADCNT, SAMPLECNT and BVHCNT counters.
2024-01-18 10:47:45 +00:00
Jay Foad
9ca36932b5
[AMDGPU] Work around s_getpc_b64 zero extending on GFX12 (#78186) 2024-01-18 10:23:27 +00:00
Jay Foad
c111dc72e9
[AMDGPU] Allow potentially negative flat scratch offsets on GFX12 (#78193)
https://github.com/llvm/llvm-project/pull/70634 has disabled use
of potentially negative scratch offsets, but we can use it on GFX12.

---------

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2024-01-18 10:02:40 +00:00
Matthew Devereau
51e3d2f73d
[AArch64][SME] Conditionally do smstart/smstop (#77113)
This patch adds conditional enabling/disabling of streaming mode for
functions which have both the aarch64_pstate_sm_compatible and
aarch64_pstate_sm_body attributes.

This combination allows callees to determine if switching streaming mode
is required instead of relying on the caller.
2024-01-18 09:17:23 +00:00
Luke Lau
15b0fabb21
[RISCV] Vectorize phi for loop carried @llvm.vector.reduce.fadd (#78244)
LLVM vector reduction intrinsics return a scalar result, but on RISC-V
vector reduction instructions write the result in the first element of a
vector register. So when a reduction in a loop uses a scalar phi, we end
up with unnecessary scalar moves:

loop:
    vfmv.s.f v10, fa0
    vfredosum.vs v8, v8, v10
    vfmv.f.s fa0, v8

This mainly affects ordered fadd reductions, which has a scalar accumulator
operand.
This tries to vectorize any scalar phis that feed into a fadd reduction
in RISCVCodeGenPrepare, converting:

loop:
%phi = phi <float> [ ..., %entry ], [ %acc, %loop]
%acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x float> %vec)
```

to

loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc.next, i64 0

Which eliminates the scalar -> vector -> scalar crossing during
instruction selection.
2024-01-18 16:15:20 +07:00
Ivan Kosarev
2a869ced61
[AMDGPU][True16] Support V_FLOOR_F16. (#78446) 2024-01-18 08:43:47 +00:00
Mirko Brkušanin
1d286ad59b
[AMDGPU] Add mark last scratch load pass (#75512) 2024-01-18 09:36:44 +01:00
Stanislav Mekhanoshin
021def6c22
[AMDGPU] Use alias info to relax waitcounts for LDS DMA (#74537)
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on
this counter to only read memory after it is written. Wait count
insertion pass does not track memory dependencies, it tracks register
dependencies. To model the LDS dependency a pseudo register is used in
the scoreboard, acting like if LDS DMA writes it and LDS load reads it.

This patch adds 8 more pseudo registers to use for independent LDS
locations if we can prove they are disjoint using alias analysis.

Fixes: SWDEV-433427
2024-01-17 23:44:15 -08:00
Matt Arsenault
c8007f9047
DAG: Fix chain mismanagement in SoftenFloatRes_FP_EXTEND (#74558) 2024-01-18 14:32:44 +07:00
Matt Arsenault
11bf02e019
DAG: Fix ABI lowering with FP promote in strictfp functions (#74405)
This was emitting non-strict casts in ABI contexts for illegal
types.
2024-01-18 10:57:53 +07:00
Chia
ba81477e9c
Recommit "[RISCV][ISel] Combine scalable vector add/sub/mul with zero/sign extension." (#76785)
This patch was originally introduced in PR #72340, but was reverted due
to a bug on invalid extension combine.

Specifically, we resolve the case in the
https://github.com/llvm/llvm-project/pull/72340#issuecomment-1874810998

```
define <vscale x 1 x i32> @foo(<vscale x 1 x i1> %x, <vscale x 1 x i2> %y) {     
  %a = zext <vscale x 1 x i1> %x to <vscale x 1 x i32>                           
  %b = zext <vscale x 1 x i1> %y to <vscale x 1 x i32>                           
  %c = add <vscale x 1 x i32> %a, %b                                             
  ret <vscale x 1 x i32> %c                                                      
}
```
The previous patch didn't check if the semantic of `ISD::ZERO_EXTEND`
and `ISD::ZERO_EXTEND` is equivalent to the `vsext.vf2` or `vzext.vf2`
(not ensuring the SEW condition on widening Vector Arithmetic
Instructions).

Thanks for @topperc pointing out this bug.

## The original description 
This PR mainly aims at resolving the below missed-optimization case,
while it could also be considered as an extension of the previous patch
https://reviews.llvm.org/D133739?id=

### Missed-Optimization Case
Compiler Explorer: https://godbolt.org/z/GzWzP7Pfh

### Source Code: 
```
define <vscale x 2 x i16> @multiple_users(ptr  %x, ptr  %y, ptr %z) {
  %a = load <vscale x 2 x i8>, ptr %x
  %b = load <vscale x 2 x i8>, ptr %y
  %b2 = load <vscale x 2 x i8>, ptr %z
  %c = sext <vscale x 2 x i8> %a to <vscale x 2 x i16>
  %d = sext <vscale x 2 x i8> %b to <vscale x 2 x i16>
  %d2 = sext <vscale x 2 x i8> %b2 to <vscale x 2 x i16>
  %e = mul <vscale x 2 x i16> %c, %d
  %f = add <vscale x 2 x i16> %c, %d2
  %g = sub <vscale x 2 x i16> %c, %d2
  %h = or <vscale x 2 x i16> %e, %f
  %i = or <vscale x 2 x i16> %h, %g
  ret <vscale x 2 x i16> %i
}
```
### Before This Patch
```
# %bb.0:
        vsetvli a3, zero, e16, mf2, ta, ma
        vle8.v  v8, (a0)
        vle8.v  v9, (a1)
        vle8.v  v10, (a2)
        svf2       v11, v8
        vsext.vf2       v8, v9
        vsext.vf2       v9, v10
        vmul.vv v8, v11, v8
        vadd.vv v10, v11, v9
        vsub.vv v9, v11, v9
        vor.vv  v8, v8, v10
        vor.vv  v8, v8, v9
        ret
```
###  After This Patch 
```
# %bb.0:
	vsetvli	a3, zero, e8, mf4, ta, ma
	vle8.v	v8, (a0)
	vle8.v	v9, (a1)
	vle8.v	v10, (a2)
	vwmul.vv	v11, v8, v9
	vwadd.vv	v9, v8, v10
	vwsub.vv	v12, v8, v10
	vsetvli	zero, zero, e16, mf2, ta, ma
	vor.vv	v8, v11, v9
	vor.vv	v8, v8, v12
	ret
```
We can see Add/Sub/Mul are combined with the Sign Extension.

### Relation to the Patch D133739
The patch D133739 introduced an optimization for folding `ADD_VL`/
`SUB_VL` / `MUL_V` with `VSEXT_VL` / `VZEXT_VL`. However, the patch did
not consider the case of non-fixed length vector case, thus this PR
could also be considered as an extension for the D133739.
2024-01-17 18:30:27 -08:00
XinWang10
2d92f7de80
[X86] Support lowering for APX promoted BMI instructions. (#77433)
R16-R31 was added into GPRs in
https://github.com/llvm/llvm-project/pull/70958,
This patch supports the lowering for promoted BMI instructions in EVEX
space, enc/dec has been supported in
https://github.com/llvm/llvm-project/pull/73899.

RFC:
https://discourse.llvm.org/t/rfc-design-for-apx-feature-egpr-and-ndd-support/73031/4
2024-01-18 10:15:54 +08:00
XinWang10
f6617091a9
[X86][test] Add --show-mc-encoding for lowering tests of NDD arithmetic instructions (#78406)
#77564 added lowering tests for NDD arithmetic instructions.
It would be great to add `--show-mc-encoding` to check the NDD variant
is selected first.
2024-01-18 09:29:02 +08:00
Stanislav Mekhanoshin
558ea41159
[AMDGPU] Reapply 'Sign extend simm16 in setreg intrinsic' (#78492)
We currently force users to use a negative contant in the intrinsic
call. Changing it zext would break existing programs, so just sign
extend an argument.
2024-01-17 17:23:46 -08:00
Alex MacLean
430a40d12e
[NVPTX] extend type support for nvvm.{min,max,mulhi,sad} (#78385)
Ensure intrinsics and auto-upgrades support i16, i32, and i64 for for
`nvvm.{min,max,mulhi,sad}`

- `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select`
instructions but it is still nice to support the 16 bit variants just in
case any generators of IR are still trying to use these intrinsics.
- `nvvm.sad` added both the 16 and 64 bit variants, also marked this
instruction as speculateble. These directly correspond to the PTX
`sad.{u16,s16,u64,s64}` instructions.
- `nvvm.mulhi` added the 16 bit variants. These directly correspond to
the PTX `mul.hi.{s,u}16` instructions.
2024-01-17 16:18:39 -08:00