57164 Commits

Author SHA1 Message Date
Oliver Stannard
184d1783db [AArch64] PAUTH_PROLOGUE should not be duplicated with PAuthLR (#124775)
When using PAuthLR, the PAUTH_PROLOGUE expands into a sequence of
instructions which takes the address of one of those instructions, and
uses that address to compute the return address signature. If this is
duplicated, there will be two different addresses used in calculating
the signature, so the epilogue will only be correct for (at most) one of
them.

This change also restricts code generation when using v8.3-A return
address signing, without PAuthLR. This isn't strictly needed, as
duplicating the prologue there would be valid. We could fix this by
having two copies of PAUTH_PROLOGUE, with and without isNotDuplicable,
but I don't think it's worth adding the extra complexity to a security
feature for that.

(cherry picked from commit 36b3c43524c8ca86a5050496b8773f07c5ccddff)
2025-01-31 17:55:58 -08:00
Yingwei Zheng
40ca089d99 [CodeGenPrepare] Replace deleted ext instr with the promoted value. (#71058)
This PR replaces the deleted ext with the promoted value in `AddrMode`.
Fixes #70938.

(cherry picked from commit 3c6aa04cf4dee65113e2a780b9f90b36bb4c4e04)
2025-01-31 17:49:18 -08:00
David Green
b23297a7f1 [GlobalISel] Do not run verifier after ResetMachineFunctionPass (#124799)
After we fall back from GlobalISel to SDAG, the verifier gets called,
which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs
which caches the result of UsesAGPRs. Because we have just fallen-back
the function is empty and it incorrectly gets cached to false. This
patch makes sure we don't try to run the verifier whilst the function is
empty.

(cherry picked from commit 66e0498dafbfa7f8fd7deaa88ae62bdf38a12113)
2025-01-31 16:38:42 -08:00
Luke Lau
ff271d04a2
[RISCV][VLOPT] Fix assertion failure across blocks (#124734)
Whilst adding a cross-block test, I encountered an assertion failure in
the second pass where we check the instruction popped off the worklist
is a candidate.

The leaf instruction %c in this case will be added to the worklist when
its VL is VLMAX, but during the first pass it will have its VL reduced
to 1.

Then in the second pass when its processed via the worklist, isCandidate
will no longer be true due to its VL == 1.

This fixes it by moving the VL == 1 check to tryReduceVL, keeping it
alongside the other VL check for bailing out early as an optimisation.
2025-01-29 11:00:50 +08:00
yonghong-song
8aae191cb6
[BPF] Remove 'may_goto 0' instructions (#123482)
Emil Tsalapatis from Meta reported such a case where 'may_goto 0' insn
is generated by clang compiler. But 'may_goto 0' insn is actually a
no-op so it makes sense to remove that in llvm. The patch is also able
to handle the following code pattern
```
   ...
   may_goto 2
   may_goto 1
   may_goto 0
   ...
```
where three may_goto insns can all be removed.

---------

Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
2025-01-28 15:19:05 -08:00
Stephen Tozer
822f74a911
[Clang] Cleanup docs and comments relating to -fextend-variable-liveness (#124767)
This patch contains a number of changes relating to the above flag;
primarily it updates comment references to the old flag names,
"-fextend-lifetimes" and "-fextend-this-ptr" to refer to the new names,
"-fextend-variable-liveness[={all,this}]". These changes are all NFC.

This patch also removes the explicit -fextend-this-ptr-liveness flag
alias, and shortens the help-text for the main flag; these are both
changes that were meant to be applied in the initial PR (#110000), but
due to some user-error on my part they were not included in the merged
commit.
2025-01-28 18:25:32 +00:00
Venkata Ramanaiah Nalamothu
a0b049055d
[RISC-V] Fix incorrect epilogue_begin setting in debug line table (#120623)
The DwarfDebug.cpp implementation expects the epilogue instructions to
have source location of last non-debug instruction after which the epilogue
instructions are inserted. The epilogue_begin is set on location of the first
FrameDestroy instruction with source line information that has been seen in
the epilogue basic block.

In the trunk, the risc-v backend sets the epilogue_begin after the epilogue has
actually begun i.e. after callee saved register reloads and the source line
information is not set on those reload instructions. This is leading to #120553
where, while debugging, breaking on or single stepping to the epilogue_begin
location will make accessing the variables from wrong place as the FP has been
restored to the parent frame's FP.

To fix that, this patch sets FrameSetup/FrameDestroy flags on the callee saved
register spill/reload instructions which is actually correct. Then the
RISCVInstrInfo::loadRegFromStackSlot uses FrameDestroy flag to identify a
reload of the callee saved register in the epilogue and copies the source
line information from insert position instruction to that reload instruction.

Requires PR #120622

Fixes #120553
2025-01-28 21:03:12 +05:30
Daniil Fukalov
68d90cff58
[AMDGPU][GlobalISel] Fix assert on APInt creation. (#124608)
Since 3494ee95902cef62f767489802e469c58a13ea04 APInt stopped to
implicitly truncate values, therefore it asserts on a big signed value
converted to (implicitly) unsigned APInt.

The change explicitly marks offset as a signed value.
2025-01-28 15:53:17 +01:00
Stephen Tozer
22687aa97b
[CodeGen] Correctly handle non-standard cases in RemoveLoadsIntoFakeUses (#111551)
In the RemoveLoadsIntoFakeUses pass, we try to remove loads that are
only used by fake uses, as well as the fake use in question. There are
two existing errors with the pass however: it incorrectly examines every
operand of each FAKE_USE, when only the first is relevant (extra
operands will just be "killed" regs assigned by a previous pass), and it
ignores cases where the FAKE_USE register is not an exact match for the
loaded register, which is incorrect as regalloc may choose to load a
wider value than the FAKE_USE required pre-regalloc. This patch fixes
both of these cases.
2025-01-28 13:59:41 +00:00
Renat Idrisov
11db7fb09b
[GlobalISel] Catching inconsistencies in load memory, result, and range metadata type (#121247)
This is a fix for:
https://github.com/llvm/llvm-project/issues/97290
Please let me know if that is the right way to address the issue. Thank
you!

---------

Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-01-28 20:54:34 +07:00
abhishek-kaushik22
015aed18ee
[SelectionDAG] WidenVecOp_INSERT_SUBVECTOR - Replace INSERT_SUBVECTOR with series of INSERT_VECTOR_ELT (#124420)
If the operands to `INSERT_SUBVECTOR` can't be widened legally, just
replace the `INSERT_SUBVECTOR` with a series of `INSERT_VECTOR_ELT`.

Closes #124255 (and possibly #102016)
2025-01-28 18:54:49 +05:30
Luke Lau
500a1834d9 [RISCV][VLOPT] Fix some typos in vl-opt-op-info.mir test. NFC
vleN_v_incompatible_emul reassigns to %x and
vsuxeiN_v_idx_incompatible_eew has a dead instruction
2025-01-28 20:28:02 +08:00
Pierre van Houtryve
8ea018ce1d
[DAGISel] Fix MMRA Handling in copyExtraInfo (#124730)
#78569 did not implement this correctly and an edge case breaks it by
triggering `Assertion `!Leafs.empty()' failed.`

Fixes SWDEV-507698
2025-01-28 13:27:26 +01:00
Cullen Rhodes
8017ca1d00
Reapply "[AArch64] Combine and and lsl into ubfiz" (#123356) (#124576)
Patch was reverted due to test case (added) exposing an infinite loop in
combiner, where (shl C1, C2) create by performSHLCombine isn't
constant-folded:

  Combining: t14: i64 = shl t12, Constant:i64<1>
Creating new node: t36: i64 = shl
OpaqueConstant:i64<-2401053089408754003>, Constant:i64<1>
  Creating new node: t37: i64 = shl t6, Constant:i64<1>
  Creating new node: t38: i64 = and t37, t36
   ... into: t38: i64 = and t37, t36
  ...
  Combining: t38: i64 = and t37, t36
Creating new node: t39: i64 = and t6,
OpaqueConstant:i64<-2401053089408754003>
  Creating new node: t40: i64 = shl t39, Constant:i64<1>
   ... into: t40: i64 = shl t39, Constant:i64<1>

and subsequently gets simplified by DAGCombiner::visitAND:

  // Simplify: (and (op x...), (op y...))  -> (op (and x, y))
  if (N0.getOpcode() == N1.getOpcode())
    if (SDValue V = hoistLogicOpWithSameOpcodeHands(N))
      return V;

before being folded by performSHLCombine once again and so on.

The combine in performSHLCombine should only be done if (shl C1, C2) can
be constant-folded, it may otherwise be unsafe and generally have a
worse end result. Thanks to Dave Sherwood for his insight on this one.

This reverts commit f719771f251d7c30eca448133fe85730f19a6bd1.
2025-01-28 11:27:34 +00:00
Csanád Hajdú
4a00c84fbb
[AArch64] Allow register offset addressing mode for prefetch (#124534)
Previously instruction selection failed to generate PRFM instructions
with register offsets because `AArch64ISD::PREFETCH` is not a
`MemSDNode`.
2025-01-28 09:16:40 +00:00
Aaditya
cd57c9530b
[NFC][AMDGPU] Autogenerating test cases (#124507) 2025-01-28 13:41:59 +05:30
Adam Yang
aab25f20f6
[HLSL][SPIRV][DXIL] Implement WaveActiveMax intrinsic (#123428)
```    - add clang builtin to Builtins.td
      - link builtin in hlsl_intrinsics
      - add codegen for spirv intrinsic and two directx intrinsics to retain
        signedness information of the operands in CGBuiltin.cpp
      - add semantic analysis in SemaHLSL.cpp
      - add lowering of spirv intrinsic to spirv backend in
        SPIRVInstructionSelector.cpp
      - add lowering of directx intrinsics to WaveActiveOp dxil op in
    DXIL.td

      - add test cases to illustrate passespendent pr merges.
```
Resolves #99170
2025-01-27 23:26:56 -08:00
Djordje Todorovic
0cb7636a46
[RISCV] Add MIPS extensions (#121394)
Adding two extensions for MIPS p8700 CPU:
  1. cmove (conditional move)
  2. lsp (load/store pair)

The official product page here:
https://mips.com/products/hardware/p8700
2025-01-28 08:04:09 +01:00
Craig Topper
d4af658323 [RISCV] Support multiple memory operands in expandRV32ZdinxStore.
TailMerge can create stores with multiple memory operands. We
need to split all of them instead of assuming there is only one.
2025-01-27 22:10:51 -08:00
quic_hchandel
2d0688797c
[RISCV] Renaming muladdi to muliadd as per v0.5 spec. (#124237)
muliadd is more relevant to the operation performed, i.e. multiply by
immediate.

The latest spec can be found at:
https://github.com/quic/riscv-unified-db/releases/latest
2025-01-27 20:40:45 -08:00
Shilei Tian
6e4105574e
[NFC][AMDGPU] Improve code introduced in #124607 (#124672) 2025-01-27 22:57:16 -05:00
Momchil Velikov
f75860f895
[AArch64] Implement NEON FP8 intrinsics for fused multiply-add (#123615)
This patch adds the following intrinsics:

* Fused multiply-add non-indexed

float16x8_t vmlalbq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t,
fpm_t)
float16x8_t vmlaltq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t,
fpm_t)
        
float32x4_t vmlallbbq_f32_mf8_fpm(float32x4_t, mfloat8x16_t,
mfloat8x16_t, fpm_t)
float32x4_t vmlallbtq_f32_mf8_fpm(float32x4_t, mfloat8x16_t,
mfloat8x16_t, fpm_t)
float32x4_t vmlalltbq_f32_mf8_fpm(float32x4_t, mfloat8x16_t,
mfloat8x16_t, fpm_t)
float32x4_t vmlallttq_f32_mf8_fpm(float32x4_t, mfloat8x16_t,
mfloat8x16_t, fpm_t)

* Floating-point multiply-add long to half-precision (vector, by
element)

float16x8_t vmlalbq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vmlalbq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vmlaltq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vmlaltq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
    
* Floating-point multiply-add long-long to single-precision (vector, by
element)

float32x4_t vmlallbbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallbbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallbtq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallbtq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlalltbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlalltbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallttq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallttq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
2025-01-28 00:38:44 +00:00
Shilei Tian
3b2b7ec07d
[AMDGPU] Handle invariant marks in AMDGPUPromoteAllocaPass (#124607)
Fixes SWDEV-509327.
2025-01-27 17:30:50 -05:00
David Green
5a81a559d6
[GISel] Explicitly disable BF16 tablegen patterns. (#124113)
We currently have an issue where bf16 patters can be used to match fp16
types, as GISel does not know about the difference between the two. This
patch explicitly disables them to make sure that they are never used.

The opposite can also happen too, where fp16 patterns are used for
operators that should be bf16. So this also changes any operations with
bf16 types to now cause a fallback to SDAG.

The pass setup for GISel has been slightly adjusted to make sure that a
verify pass does not get added between AMD-SDAG and SIFixSGPRCopiesPass,
which otherwise can cause verifier issues when falling back.
2025-01-27 22:21:12 +00:00
Momchil Velikov
804b81d39f
[AArch64] Add FP8 Neon intrinsics for dot-product (#123613)
This patch adds the following intrinsics:

float16x4_t vdot_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x8_t
vm, fpm_t fpm)
float16x8_t vdotq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, fpm_t fpm)
    
float16x4_t vdot_lane_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x4_t vdot_laneq_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vdotq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vdotq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
    
float32x2_t vdot_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x8_t
vm, fpm_t fpm)
float32x4_t vdotq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, fpm_t fpm)

float32x2_t vdot_lane_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x2_t vdot_laneq_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vdotq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vdotq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
2025-01-27 21:14:16 +00:00
Craig Topper
aa34a6ab29
[RISCV] Add register allocation hints for lui/auipc+addi fusion. (#123860)
Spotted the auipc case while looking at code for P550. I'm not sure this
is the right long term fix. We're still missing rematerialization
opportunities for these pairs so a pseudo might be better. That would
interfere with folding auipc+add into load/store addressing though.

Fixes #76779.
2025-01-27 11:16:22 -08:00
Heejin Ahn
539b2e0654
[WebAssembly] Fix catch block type in wasm64 (#124381)
`try_table`'s `catch` or `catch_ref`'s target block's return type should
be `i64` and `(i64, exnref)` in case of wasm64.
2025-01-27 11:01:48 -08:00
Jeffrey Byrnes
e77d428e46
[AMDGPU] Do not remat instructions with PhysReg uses (#124366)
This blocks rematerialization during scheduling if the instruction has a
non accepted PhysReg use.

Currently, there aren't any checks like this in place, and we may create
invalid code: https://godbolt.org/z/xjPjdcorf
2025-01-27 10:50:06 -08:00
Brox Chen
d1139b32d2
[AMDGPU][True16][CodeGen] true16 codegen pats for v_mad_u16 (#124000)
true16 codegen pats for v_mad_u16 (mul+add)
2025-01-27 13:47:17 -05:00
Momchil Velikov
5d6d982df6
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (11/11) (#116837)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`SXTB`, `UXTB`, `SXTH`, `UXTH`, `SXTW`, and `UXTW` instructions.
2025-01-27 18:12:00 +00:00
Momchil Velikov
99bd2e3f12
[AArch64] Add Neon FP8 conversion intrinsics (#123612)
The patch adds the following intrinsics:

    bfloat16x8_t vcvt1_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm)
    bfloat16x8_t vcvt1_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    bfloat16x8_t vcvt2_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm)
    bfloat16x8_t vcvt2_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    
    bfloat16x8_t vcvt1_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    bfloat16x8_t vcvt2_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    
    float16x8_t vcvt1_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm)
    float16x8_t vcvt1_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    float16x8_t vcvt2_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm)
    float16x8_t vcvt2_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    
    float16x8_t vcvt1_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    float16x8_t vcvt2_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    
mfloat8x8_t vcvt_mf8_f32_fpm(float32x4_t vn, float32x4_t vm, fpm_t fpm)
mfloat8x16_t vcvt_high_mf8_f32_fpm(mfloat8x8_t vd, float32x4_t vn,
float32x4_t vm, fpm_t fpm)
    
mfloat8x8_t vcvt_mf8_f16_fpm(float16x4_t vn, float16x4_t vm, fpm_t fpm)
mfloat8x16_t vcvtq_mf8_f16_fpm(float16x8_t vn, float16x8_t vm, fpm_t
fpm)

Co-Authored-By: Caroline Concatto <caroline.concatto@arm.com>
2025-01-27 17:32:47 +00:00
Momchil Velikov
4e231014c1
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (10/11) (#116836)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`RBIT`, `REVB`, `REVH`, `REVW`, and `REVD` instructions.
2025-01-27 16:45:40 +00:00
Luke Lau
cb6f021af2
[RISCV][VLOPT] Remove unnecessary passthru restriction (#124549)
We currently check for passthrus in two places, on the instruction to
reduce in isCandidate, and on the users in checkUsers.

We cannot reduce the VL if an instruction has a user that's a passthru,
because the user will read elements past VL in the tail.

However it's fine to reduce an instruction if it itself contains a
non-undef passthru. Since the VL can only be reduced, not increased, the
previous tail will always remain the same.
2025-01-27 23:54:32 +08:00
Momchil Velikov
f95f10c7e6
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (9/11) (#116835)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`URECPE`, `URSQRTE`, `SQABS` and `SQNEG` instructions.
2025-01-27 15:50:53 +00:00
Simon Pilgrim
86705eb624 [X86] huge-stack-offset.ll - add gnux32 test coverage
This should match x86 for the basic implementation, but its useful to check it actually runs correctly.
2025-01-27 14:10:16 +00:00
David Green
ef54e0bbfb
[AArch64] Avoid generating LDAPUR on certain cores (#124274)
On the CPUs listed below, we want to avoid LDAPUR for performance
reasons. Add a tuning feature to disable them when using:
 -mcpu=neoverse-v2
 -mcpu=neoverse-v3
 -mcpu=cortex-x3
 -mcpu=cortex-x4
 -mcpu=cortex-x925
2025-01-27 13:12:11 +00:00
Momchil Velikov
d8ad1eef8f
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (7/11) (#116833)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`FLOGB` instructions.
2025-01-27 12:53:38 +00:00
Durgadoss R
3b5e9eed2f
[NVPTX] Add float to tf32 conversion intrinsics (#124316)
This patch adds the set of f32 -> tf32 cvt intrinsics introduced
in sm100 with ptx8.6. This builds on top of the recent PR #121507.

Tests are verified with a 12.8 ptxas executable.

PTX ISA link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2025-01-27 15:52:43 +05:30
Momchil Velikov
bd38c4993a
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (8/11) (#116834)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`FRINTx`, `FRECPX`, and `FSQRT` instructions.
2025-01-27 09:21:56 +00:00
YunQiang Su
bfa7de0df5
X86: Support FCANONICALIZE on f64/f80 for i686 with SSE2 or AVX (#123917)
Currently, FCANONICALIZE is not enabled for f64 with SSE2,
and is not enabled for f80 for 32bit system.
Let's enable them.
2025-01-27 07:27:26 +08:00
mconst
1f26ac10ca
[X86] Better handling of impossibly large stack frames (#124217)
If you try to create a stack frame of 4 GiB or larger with a 32-bit
stack pointer, we currently emit invalid instructions like `mov eax,
5000000000` (unless you specify `-fstack-clash-protection`, in which
case we emit a trap instead).

The trap seems nicer, so let's do that in all cases. This avoids
emitting invalid instructions, and also fixes the "can't have 32-bit
16GB stack frame" assertion in `X86FrameLowering::emitSPUpdate()` (which
used to be triggerable by user code, but is now correct).

This was originally part of #124041.

@phoebewang
2025-01-25 12:03:57 +08:00
Hiroshi Yamauchi
425d25f5df
[AArch64][WinCFI] Fix a crash due to missing seh directives (#123993)
https://github.com/llvm/llvm-project/issues/123808
2025-01-24 14:01:41 -08:00
Philip Reames
a9ad601f7c
[RISCV] Use vrsub for select of add and sub of the same operands (#123400)
If we have a (vselect c, a+b, a-b), we can combine this to a+(vselect c,
b, -b). That by itself isn't hugely profitable, but if we reverse the
select, we get a form which matches a masked vrsub.vi with zero. The
result is that we can use a masked vrsub *before* the add instead of a
masked add or sub. This doesn't change the critical path (since we
already had the pass through on the masked second op), but does reduce
register pressure since a, b, and (a+b) don't need to all be alive at
once.

In addition to the vselect form, we can also see the same pattern with a
vector_shuffle encoding the vselect. I explored canonicalizing these to
vselects instead, but that exposes several unrelated missing combines.
2025-01-24 10:08:42 -08:00
Brox Chen
ec66c4af09
[AMDGPU][True16][CodeGen] true16 codegen pattern for f16 canonicalize (#122000)
true16 codegen pattern for f16 canonicalize
2025-01-24 10:44:00 -05:00
Nikita Popov
2068b1ba03
[X86] Fix ABI for passing after i128 (#124134)
If we're passing an i128 value and we no longer have enough argument
registers (only r9 unallocated), the value gets passed via the stack.
However, r9 is still allocated as a shadow register, which means that a
following i64 argument will not use it. This doesn't match the x86-64
psABI.

Fix this by making i128 arguments as requiring consecutive registers,
and then adding a custom CC lowering that will allocate both parts of
the i128 at the same time, either to register or to stack, without
reserving a shadow register.

Fixes https://github.com/llvm/llvm-project/issues/123935.
2025-01-24 15:31:53 +01:00
Aaditya
11b0401926
[AMDGPU] Restore SP from saved-FP or saved-BP (#124007)
Currently, the AMDGPU backend bumps the Stack Pointer 
by fixed size offsets in the prolog of device functions, and 
restores it by the same amount in the epilog.
Prolog:
sp += frameSize

Epilog:
sp -= frameSize

If a function has dynamic stack realignment,
Prolog:
sp += frameSize + max_alignment

Epilog:
sp -= frameSize + max_alignment

These calculations are not optimal in case of dynamic 
stack realignment, and completely fail in case of 
dynamic stack readjustment.
This patch uses the saved Frame Pointer to restore SP. 
Prolog:
fp = sp
sp += frameSize

Epilog:
sp = fp

In case of dynamic stack realignment, SP is restored from 
the saved Base Pointer. 
Prolog:
fp = sp + (max_alignment - 1)
fp = fp & (-max_alignment)
bp = sp
sp += frameSize + max_alignment

Epilog:
sp = bp

(Note: The presence of BP has been enforced in case of any 
dynamic stack realignment.)

---------

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-01-24 19:13:40 +05:30
yingopq
d6e0798a2a
[Mips] Add the missing judgment when processing function handleMFLOSlot (#121463)
In function handleMFLOSlot, we may get a variable LastInstInFunction
with a value of true from function getNextMachineInstr and IInSlot may
be null which would trigger an assert.
So we need to skip this case.

Fix #118223.
2025-01-24 20:03:03 +08:00
Petar Avramovic
b60c118f53
MachineUniformityAnalysis: Improve isConstantOrUndefValuePhi (#112866)
Change existing code for G_PHI to match what LLVM-IR version is doing
via PHINode::hasConstantOrUndefValue. This is not safe for regular PHI
since it may appear with an undef operand and getVRegDef can fail.
Most notably this improves number of values that can be allocated
to sgpr in AMDGPURegBankSelect.
Common case here are phis that appear in structurize-cfg lowering
for cycles with multiple exits:
Undef incoming value is coming from block that reached cycle exit
condition, if other incoming is uniform keep the phi uniform despite
the fact it is joining values from pair of blocks that are entered
via divergent condition branch.
2025-01-24 12:43:40 +01:00
Petar Avramovic
4831fa8632
AMDGPU/GlobalISel: RegBankLegalize rules for load (#112882)
Add IDs for bit width that cover multiple LLTs: B32 B64 etc.
"Predicate" wrapper class for bool predicate functions used to
write pretty rules. Predicates can be combined using &&, || and !.
Lowering for splitting and widening loads.
Write rules for loads to not change existing mir tests from old
regbankselect.
2025-01-24 12:36:41 +01:00
Petar Avramovic
0ee037b861
AMDGPU/GlobalISel: AMDGPURegBankLegalize (#112864)
Lower G_ instructions that can't be inst-selected with register bank
assignment from AMDGPURegBankSelect based on uniformity analysis.
- Lower instruction to perform it on assigned register bank
- Put uniform value in vgpr because SALU instruction is not available
- Execute divergent instruction in SALU - "waterfall loop"

Given LLTs on all operands after legalizer, some register bank
assignments require lowering while other do not.
Note: cases where all register bank assignments would require lowering
are lowered in legalizer.

AMDGPURegBankLegalize goals:
- Define Rules: when and how to perform lowering
- Goal of defining Rules it to provide high level table-like brief
  overview of how to lower generic instructions based on available
  target features and uniformity info (uniform vs divergent).
- Fast search of Rules, depends on how complicated Rule.Predicate is
- For some opcodes there would be too many Rules that are essentially
  all the same just for different combinations of types and banks.
  Write custom function that handles all cases.
- Rules are made from enum IDs that correspond to each operand.
  Names of IDs are meant to give brief description what lowering does
  for each operand or the whole instruction.
- AMDGPURegBankLegalizeHelper implements lowering algorithms

Since this is the first patch that actually enables -new-reg-bank-select
here is the summary of regression tests that were added earlier:
- if instruction is uniform always select SALU instruction if available
- eliminate back to back vgpr to sgpr to vgpr copies of uniform values
- fast rules: small differences for standard and vector instruction
- enabling Rule based on target feature - salu_float
- how to specify lowering algorithm - vgpr S64 AND to S32
- on G_TRUNC in reg, it is up to user to deal with truncated bits
  G_TRUNC in reg is treated as no-op.
- dealing with truncated high bits - ABS S16 to S32
- sgpr S1 phi lowering
- new opcodes for vcc-to-scc and scc-to-vcc copies
- lowering for vgprS1-to-vcc copy (formally this is vgpr-to-vcc G_TRUNC)
- S1 zext and sext lowering to select
- uniform and divergent S1 AND(OR and XOR) lowering - inst-selected into
  SALU instruction
- divergent phi with uniform inputs
- divergent instruction with temporal divergent use, source instruction
  is defined as uniform(AMDGPURegBankSelect) - missing temporal
  divergence lowering
- uniform phi, because of undef incoming, is assigned to vgpr. Will be
  fixed in AMDGPURegBankSelect via another fix in machine uniformity
  analysis.
2025-01-24 12:12:45 +01:00