81365 Commits

Author SHA1 Message Date
Matt Arsenault
5dd48c4901
AMDGPU: MC support for v_cvt_scalef32_pk32_f32_[fp|bf]6 of gfx950 (#117590)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 19:20:51 -08:00
Philip Reames
6657d4bd70
[TTI][RISCV] Unconditionally break critical edges to sink ADDI (#108889)
This looks like a rather weird change, so let me explain why this isn't
as unreasonable as it looks. Let's start with the problem it's solving.

```
define signext i32 @overlap_live_ranges(ptr %arg, i32 signext %arg1) { bb:
  %i = icmp eq i32 %arg1, 1
  br i1 %i, label %bb2, label %bb5

bb2:                                              ; preds = %bb
  %i3 = getelementptr inbounds nuw i8, ptr %arg, i64 4
  %i4 = load i32, ptr %i3, align 4
  br label %bb5

bb5:                                              ; preds = %bb2, %bb
  %i6 = phi i32 [ %i4, %bb2 ], [ 13, %bb ]
  ret i32 %i6
}
```

Right now, we codegen this as:

```
	li	a3, 1
	li	a2, 13
	bne	a1, a3, .LBB0_2
	lw	a2, 4(a0)
.LBB0_2:
	mv	a0, a2
	ret
```

In this example, we have two values which must be assigned to a0 per the
ABI (%arg, and the return value). SelectionDAG ensures that all values
used in a successor phi are defined before exit the predecessor block.
This creates an ADDI to materialize the immediate in the entry block.

Currently, this ADDI is not sunk into the tail block because we'd have
to split a critical edges to do so. Note that if our immediate was
anything large enough to require two instructions we *would* split this
critical edge.

Looking at other targets, we notice that they don't seem to have this
problem. They perform the sinking, and tail duplication that we don't.
Why? Well, it turns out for AArch64 that this is entirely an accident of
the existance of the gpr32all register class. The immediate is
materialized into the gpr32 class, and then copied into the gpr32all
register class. The existance of that copy puts us right back into the
two instruction case noted above.

This change essentially just bypasses this emergent behavior aspect of
the aarch64 behavior, and implements the same "always sink immediates"
behavior for RISCV as well.
2024-11-25 18:59:31 -08:00
Pengcheng Wang
6633916ef5
[RISCV] Remove getPostRAMutations (#117527)
We are using `PostMachineScheduler` instead of `PostRAScheduler`
since #68696.

The hook `getPostRAMutations` is only used in `PostRAScheduler` so
it is actually dead code for RISC-V now.
2024-11-26 10:55:43 +08:00
LiqinWeng
dd7aabf7c0
[TTI][RISCV] Deduplicate type-based VP costing of vpcmp/vpcast (#117520)
Refered to: https://github.com/llvm/llvm-project/pull/115983
2024-11-26 10:49:24 +08:00
Daniel Zabawa
c1a3960abe
[X86] Add APX imulzu support. (#116806)
Add patterns to select 16b imulzu with -mapx-feature=zu, including
folding of zero-extends of the result. IsDesirableToPromoteOp is changed
to leave 16b multiplies by constant un-promoted, as imulzu will not
cause partial-write stalls.
2024-11-26 08:26:23 +08:00
Phoebe Wang
2ab84a60ff
[X86][FP16][BF16] Improve vectorization of fcmp (#116153) 2024-11-26 08:17:48 +08:00
Craig Topper
c2bb056482
[SelectionDAG][RISCV][AArch64] Allow f16 STRICT_FLDEXP to be promoted. Fix integer promotion of STRICT_FLDEXP in type legalizer. (#117633)
A special case in type legalization wasn't accounting for different
operand numbering between FLDEXP and STRICT_FLDEXP.

AArch64 already asked STRICT_FLDEXP to be promoted, but had no test for
it.
2024-11-25 16:12:45 -08:00
Helena Kotas
cac978331f
[HLSL] Add Increment/DecrementCounter methods to structured buffers (#117608)
Introduces `__builtin_hlsl_buffer_update_counter` clang buildin that is
used to implement the `IncrementCounter` and `DecrementCounter` methods
on `RWStructuredBuffer` and `RasterizerOrderedStructuredBuffer` (see
Note).

The builtin is translated to LLVM intrisic `llvm.dx.bufferUpdateCounter`
or `llvm.spv.bufferUpdateCounter`.

Introduces `BuiltinTypeMethodBuilder` helper in `HLSLExternalSemaSource`
that enables adding methods to builtin types using builder pattern like
this:
```
   BuiltinTypeMethodBuilder(Sema, RecordBuilder, "MethodName", ReturnType)
       .addParam("param_name", Type, InOutModifier)
       .callBuiltin("buildin_name", { BuiltinParams })
       .finalizeMethod();
```

Fixes #113513

[First version](llvm/llvm-project#114148) of this PR was reverted
because of build break.
2024-11-25 16:10:48 -08:00
Feng Zou
ab4e06667d
[X86][MC] Add R_X86_64_CODE_6_GOTTPOFF (#117277)
For

    add %reg1, name@GOTTPOFF(%rip), %reg2
    add name@GOTTPOFF(%rip), %reg1, %reg2
    {nf} add %reg1, name@GOTTPOFF(%rip), %reg2
    {nf} add name@GOTTPOFF(%rip), %reg1, %reg2
    {nf} add name@GOTTPOFF(%rip), %reg

add

  `R_X86_64_CODE_6_GOTTPOFF` = 50

if the instruction starts at 6 bytes before the relocation offset. It's
similar to R_X86_64_GOTTPOFF.

Linker can treat `R_X86_64_CODE_6_GOTTPOFF` as `R_X86_64_GOTTPOFF` or
convert the instructions above to

    add $name@tpoff, %reg1, %reg2
    add $name@tpoff, %reg1, %reg2
    {nf} add $name@tpoff, %reg1, %reg2
    {nf} add $name@tpoff, %reg1, %reg2
    {nf} add $name@tpoff, %reg

if the first byte of the instruction at the relocation `offset - 6` is
`0xd5` (namely, encoded w/REX2 prefix) when possible.


Binutils patch:
5bc71c2a6b
Binutils mailthread:
https://sourceware.org/pipermail/binutils/2024-February/132351.html
ABI discussion:
https://groups.google.com/g/x86-64-abi/c/FhEZjCtDLFw/m/VHDjN4orAgAJ
Blog: https://kanrobert.github.io/rfc/All-about-APX-relocation
2024-11-26 07:00:20 +08:00
Matt Arsenault
935da49a4d
AMDGPU: Pass HwMode to AMDGPUGenRegisterInfo (#117449)
I haven't figured out how to do anything useful with this yet,
but it seems you are supposed to pass this to the subtarget
constructor.
2024-11-25 14:21:52 -08:00
S. Bharadwaj Yadavalli
96547decd5
[DirectX] Infrastructure to collect shader flags for each function (#112967)
Currently, ShaderFlagsAnalysis pass represents various module-level
properties as well as function-level properties of a DXIL Module using a
single mask. However, one mask per function is needed for accurate
computation of shader flags mask, such as for entry function metadata
creation.

This change introduces a structure that wraps a sorted vector of
function-shader flag mask pairs that represent function properties
instead of a single shader flag mask that represents module properties
and properties of all functions. The result type of ShaderFlagsAnalysis
pass is changed to newly-defined structure type instead of a single
shader flags mask.

This allows accurate computation of shader flags of an entry function
(and all functions in a library shader) for use during its metadata
generation (DXILTranslateMetadata pass) and its feature flags in DX
container globals construction (DXContainerGlobals pass) based on the
shader flags mask of functions. However, note that the change to
implement propagation of such callee-based shader flags mask computation
is planned in a follow-on PR. Consequently, this PR changes shader flag
mask computation in DXILTranslateMetadata and DXContainerGlobals passes
to simply be a union of module flags and shader flags of all functions,
thereby retaining the existing effect of using a single shader flag
mask.
2024-11-25 16:02:46 -05:00
Kazu Hirata
8e510b8472 [RISCV] Fix a warning
This patch fixes:

  llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp:476:25: error: unused
  variable 'ST' [-Werror,-Wunused-variable]
2024-11-25 12:57:05 -08:00
Matt Arsenault
5001f16058
AMDGPU: MC support for v_cvt_scalef32_pk_{f|bf}16_fp4 of gfx950. (#117418)
OPSEL ASM Syntax for v_cvt_scalef32_pk_{f|bf}16_fp4 : opsel:[x,y,z]
where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read.

Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 12:43:45 -08:00
Philip Reames
d733fa1c90
[RISCV] Consolidate VLS codepaths in stack frame manipulation [nfc] (#117605)
We can move the logic from adjustStackForRVV into adjustReg, which
results in the remaining logic being trivially inlined to the two
callers and allows a duplicate copy of the same logic in
eliminateFrameIndex to be pruned.
2024-11-25 12:40:37 -08:00
Justin Bogner
bb88fd171a
[DirectX] Calculate resource binding offsets using the lower bound (#117303)
In the DXIL CreateHandle and CreateHandleFromBinding ops, resource
bindings are
indexed from the beginning of the binding space, not from the binding
itself.
Translate from an index into the binding to one from the beginning of
the space
when lowering to these operations.
2024-11-25 10:44:01 -08:00
Simon Pilgrim
b0bc4674b7 [X86] Fix bad instregex in VPMOVSX/ZX znver4 512-bit patterns.
The Z size was optional, meaning it matched with the 128-bit SSE instructions as well.

Noticed while triaging the strange perf numbers on #110308
2024-11-25 18:39:02 +00:00
Craig Topper
ed6749a405 [RISCV] Promote frexp with Zfh.
The default expansion tries to create an illegal integer type after
legalization.
2024-11-25 10:27:37 -08:00
Matt Arsenault
1a86d44c80
AMDGPU: MC support for v_cvt_scale_fp4<->f32 of gfx950. (#117417)
OPSEL ASM Syntax for v_cvt_scalef32_pk_f32_fp4 : opsel:[x,y,z]
where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read.

OPSEL ASM Syntax for v_cvt_scalef32_pk_fp4_f32 : opsel:[a,b,c,d]
where, c & d i.e. OPSEL[3 : 2] selects which dst_byte  to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 10:08:13 -08:00
Craig Topper
29828b26fa
[RISCV] Fix double counting scalar CSRs with Zcmp when emitting cfi_offset for RVV CSRs. (#117408)
getCalleeSavedStackSize() already contains RVPushStackSize. Don't
subtract it again.
2024-11-25 10:03:48 -08:00
Matt Arsenault
362d8fb241
AMDGPU: MC support for v_cvt_scalef32_pk_{fp|bf}8_{f|bf}16 of gfx950. (#117384)
OPSEL ASM Syntax: opsel:[x,y,z]
where,
    opsel[z] = Inst{14} = src0_modifier{3}

Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 09:59:10 -08:00
Matt Arsenault
70fef78329
AMDGPU: MC support for v_cvt_scalef32_pk_f32_[fp|bf]8 of gfx950. (#117383)
OPSEL[0] selects srcword to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 09:56:22 -08:00
Matt Arsenault
8997bf8e46
AMDGPU: MC support for v_cvt_scalef32_pk_{fp8|bf8}_f32 of gfx950. (#117382)
OPSEL[3] selects low/high 16 bits of dest write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 09:53:15 -08:00
Matt Arsenault
91af15b764
AMDGPU: MC support for v_cvt_scale_[f16|f32]_bf8 of gfx950. (#117381)
OPSEL ASM Syntax: opsel:[x,y,z]
where,
    opsel[x] = Inst{11} = src0_modifier{2}
    opsel[y] = Inst{12} = src1_modifier{2}
    opsel[z] = Inst{14} = src0_modifier{3}
Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 09:49:49 -08:00
Matt Arsenault
7ad1084b52
AMDGPU: MC support for v_cvt_scale_[f16|f32]_fp8 of gfx950. (#117380)
OPSEL ASM Syntax: opsel:[x,y,z]
where,
    opsel[x] = Inst{11} = src0_modifier{2}
    opsel[y] = Inst{12} = src1_modifier{2}
    opsel[z] = Inst{14} = src0_modifier{3}
Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 09:45:48 -08:00
Matt Arsenault
6f8e7c11cf
AMDGPU: Add MC support for gfx950 V_BITOP3_B32/B16 (#117379)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 09:42:07 -08:00
Matt Arsenault
e97fb2207e
AMDGPU: Add support for load transpose instructions for gfx950 (#117378)
This patch support for intrinsics in clang, as well as assembly
instructions in the backend.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 09:39:04 -08:00
Raphael Moreira Zinsly
d88ed9357a
[NFC][RISCV] Refactor allocation of the stack space (#116625)
Separates the stack allocations from prologue in preparation for the
stack clash protection support.
2024-11-25 09:36:15 -08:00
Matt Arsenault
27a8afa3fc
AMDGPU: Handle gfx950 valu write vdst + permlane read hazard (#117287) 2024-11-25 09:33:04 -08:00
Matt Arsenault
c3fe5ad6be
AMDGPU: Handle vcmpx+permalane gfx950 hazard (#117286)
Confusingly, this is a different hazard to the one on gfx10
with a subtarget feature.
2024-11-25 09:27:53 -08:00
Matt Arsenault
3db4f5b0da
AMDGPU: Refine gfx950 xdl-write-vgpr hazard cases (#117285)
The 2-pass XDL write VGPR, read by non-XDL SGEMM/DGEMM case
was 1 wait state overly conservative. Previously, for gfx940,
the XDL/non-XDL cases happened to have the same number of cycles
in all cases. Now the XDL consumer case has an additional state for
2 pass sources.
2024-11-25 09:23:51 -08:00
Craig Topper
20bd029a40
[RISCV] Promote fldexp with Zfh. (#117396)
The default expansion tries to create i16 operations after type
legalization.

Fixes #117349
2024-11-25 09:08:56 -08:00
David Sherwood
9b76e7fc60
Revert "[DAGCombiner] Add support for scalarising extracts of a vector setcc (#116031)" (#117556)
This reverts commit 22ec44f509ff266b581dbb490d7b040473b7c31a.
2024-11-25 13:49:21 +00:00
Nikita Popov
3699931dee [M68k] Use getSignedConstant() where necessary 2024-11-25 14:43:11 +01:00
Nikita Popov
4715dec8c0 [XTensa] Use getSignedConstant() for negative values 2024-11-25 14:43:11 +01:00
Alex Voicu
48ec59c234
[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early (#114481)
Fold `llvm.amdgcn.wavefrontsize` early, during InstCombine, so that it's
concrete value is used throughout subsequent optimisation passes.
2024-11-25 10:29:50 +00:00
Luke Lau
15fadeb2aa
[RISCV] Add cost for @llvm.experimental.vp.splat (#117313)
This is split off from #115274. There doesn't seem to be an easy way to
share this with getShuffleCost since that requires passing in a real
insert_element operand to get it to recognise it's a scalar splat.

For i1 vectors we can't currently lower them so it returns an invalid
cost.

---------

Co-authored-by: Shih-Po Hung <shihpo.hung@sifive.com>
2024-11-25 11:28:46 +01:00
David Green
c537c75278 [AArch64][GlobalISel] Scalarize i128 vector sadd_sat/uadd_sat/etc.
As with other operations we scalarize any vectors with larger types to let the
scalare legalization kick in.
2024-11-25 09:55:46 +00:00
LiqinWeng
db14010405
[RISCV][TTI] Implement cost of intrinsic abs with LMUL (#115813) 2024-11-25 17:35:58 +08:00
David Sherwood
22ec44f509
[DAGCombiner] Add support for scalarising extracts of a vector setcc (#116031)
For IR like this:

  %icmp = icmp ult <4 x i32> %a, splat (i32 5)
  %res = extractelement <4 x i1> %icmp, i32 1

where there is only one use of %icmp we can take a similar approach
to what we already do for binary ops such add, sub, etc. and convert
this into

  %ext = extractelement <4 x i32> %a, i32 1
  %res = icmp ult i32 %ext, 5

For AArch64 targets at least the scalar boolean result will almost
certainly need to be in a GPR anyway, since it will probably be
used by branches for control flow. I've tried to reuse existing code
in scalarizeExtractedBinop to also work for setcc.

NOTE: The optimisations don't apply for tests such as
extract_icmp_v4i32_splat_rhs in the file

CodeGen/AArch64/extract-vector-cmp.ll

because scalarizeExtractedBinOp only works if one of the input
operands is a constant.
2024-11-25 09:25:01 +00:00
Nikita Popov
3317c9ceac
[AMDGPU] Use getSignedConstant() where necessary (#117328)
Create signed constant using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().

This also touches some generic legalization code, which apparently only
AMDGPU tests.
2024-11-25 09:49:34 +01:00
Nikita Popov
815a1bb53a
[SystemZ] Use getSignedConstant() where necessary (#117181)
This will avoid assertion failures once we disable implicit truncation
in getConstant().

Inside adjustSubwordCmp() I ended up suppressing the issue with an
explicit cast, because this code deals with a mix of unsigned and signed
immediates.
2024-11-25 09:47:49 +01:00
Craig Topper
3fb0bea859 [RISCV][GISel] Add register class to some isel output patterns so they can be imported.
This makes (fcopysign X, (fneg Y)) patterns work.
2024-11-24 19:29:52 -08:00
hev
e26af0938c
[llvm] Add BasicTTIImpl::areInlineCompatible for target feature subset checks (#117493)
This patch moves the `areInlineCompatible` implementation from multiple
subclasses (`AArch64TTIImpl`, `RISCVTTIImpl`, `WebAssemblyTTIImpl`) to
the base class `BasicTTIImpl`. The new implementation checks whether the
callee's target features are a subset of the caller's, enabling
consistent behavior across targets. Subclasses now simply delegate to
the base implementation, reducing code duplication and improving
maintainability.
2024-11-25 11:22:49 +08:00
Craig Topper
bb5bbe523d [RISCV][GISel] Support s32/s64 G_FSUB/FDIV/FNEG without F/D extensions.
Use libcalls for G_FSUB/FDIV. Use integer operations for G_FNEG.

Copy most of the IR tests for arithmetic from SelectionDAG.
2024-11-24 18:22:12 -08:00
Sergei Barannikov
5f3eab9e45
[AVR] Remove extra ROL / ROR operands (#117510)
The nodes have one input, shift amount of 1 is implied.
2024-11-25 05:15:20 +03:00
Aiden Grossman
512dc5cb32
[X86] Swap ports 10 and 11 in Alder Lake Scheduling Model (#117466)
Based on https://github.com/intel/perfmon/issues/149, the documentation
is incorrect and the pfm counter names are actually correct. This patch
adjusts the Alder Lake scheduling model to match the performance counter
naming/ correct naming that will soon be reflected in the optimization
manual.

This fixes part of #117360.
2024-11-24 16:58:16 -08:00
Aiden Grossman
095f489757
[X86] Swap ports 10 and 11 in SapphireRapids Scheduling Model (#117468)
Based on intel/perfmon#149, the documentation is incorrect and the pfm
counter names are actually correct. This patch adjusts the
SapphireRapids scheduling model to match the performance counter naming/
correct naming that will soon be reflected in the optimization manual.

This fixes part of #117360.
2024-11-24 16:58:01 -08:00
Simon Pilgrim
0a6d797c20 [X86] Improve F16C CVT schedules on SNB/HSW/BDW
Add complete IvyBridge schedule (which is included in the SandyBridge model, IvyBridge was the first to support F16C) - split rr/rm schedules as they usually have very different port usage.

Haswell/Broadwell use Port1 not Port0.

Confirmed with a mixture of Agner + uops.info comparisons.
2024-11-24 17:04:53 +00:00
Simon Pilgrim
6cfaddfd52
[X86] Split rr/rm CVT schedules on SNB/HSW/BDW (#117494)
The folded load variants almost never require Port5 for length changing conversions (just for SNB ymm cases), and don't typically use an extra uop for the load.

Confirmed with a mixture of Agner + uops.info comparisons.
2024-11-24 16:19:34 +00:00
Sergei Barannikov
c85c77c054
[AVR] Fix shift node descriptions (#117456)
Wide shift nodes produce two results, not one.
Reuse the added type profile to define the standard "shift parts" nodes.
2024-11-24 09:26:52 +03:00