57176 Commits

Author SHA1 Message Date
Matt Arsenault
1cbfac04d0
SystemZ: Handle copies between gr64 and fp64 (#124890)
I'm guessing based on tablegen definitions. I also don't
really understand how this could have been missing.

This defends against regressions in a future peephole-opt
patch.
2025-01-30 11:08:08 +07:00
Matt Arsenault
6017480461
MachineVerifier: Fix check for range type (#124894)
We need to permit scalar extending loads with range annotations.

Fix expensive_checks failures after 11db7fb09b36e656a801117d6a2492133e9c2e46
2025-01-30 10:56:12 +07:00
Matt Arsenault
97a1f494a6
DAG: Avoid breaking legal vector_shuffle with multiple uses (#123712)
Previously this combine would undo AMDGPU's new custom legalization of
wide vector shuffles into 2 element pieces. The comment also
states that this combine is only done before legalization,
but the case with a build_vector source was unconditional.

We probably don't want to do this if the multiple uses are full
scalarization of the vector, but this seems to work well enough.
Scalarizing extracts should have folded out pre-legalize.
2025-01-30 10:55:21 +07:00
Carl Ritson
a3a3e6997b
[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750)
- Algorithm operates over whole IR to attempt to minimize waits.
- Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.
2025-01-30 11:21:11 +09:00
Yingwei Zheng
3c6aa04cf4
[CodeGenPrepare] Replace deleted ext instr with the promoted value. (#71058)
This PR replaces the deleted ext with the promoted value in `AddrMode`.
Fixes #70938.
2025-01-30 08:58:23 +08:00
Alex MacLean
de7438e472
[NVPTX] Auto-Upgrade some nvvm.annotations to attributes (#119261)
Add a new AutoUpgrade function to convert some legacy nvvm.annotations
metadata to function level attributes. These attributes are quicker to
look up, which improves compile time, and they are more idiomatic than
metadata, which should not carry required information that changes the
meaning of the program.

Currently supported annotations are:

- !"kernel" -> ptx_kernel calling convention
- !"align" -> alignstack parameter attributes (return not yet supported)
2025-01-29 16:27:27 -08:00
Konstantina Mitropoulou
9adc99bcc5
[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. (#124028)
- **[NFC] Use GCNPat instead of Pat.**
- **[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point
branches.**

---------

Co-authored-by: Konstantina Mitropoulou <KonstantinaMitropoulou@amd.com>
2025-01-29 09:00:40 -08:00
Nikita Popov
29441e4f5f
[IR] Convert from nocapture to captures(none) (#123181)
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.

Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
   reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
   make it easier to use old IR files and somewhat reduce the test churn in
   this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
   attribute. The representation in the LLVM IR dialect should be updated
   separately.
2025-01-29 16:56:47 +01:00
Mikhail Gudim
3c3c850a45
[ReachingDefAnalysis] Extend the analysis to stack objects. (#118097)
We track definitions of stack objects, the implementation is identical
to tracking of registers.

Also, added printing of all found reaching definitions for testing
purposes.

---------

Co-authored-by: Michael Maitland <michaeltmaitland@gmail.com>
2025-01-29 10:55:16 -05:00
Acim Maravic
3a29dfe37c
[LLVM][AMDGPU] Add Intrinsic and Builtin for ds_bpermute_fi_b32 (#124616)
2025-01-29 14:04:10 +01:00
David Green
66e0498daf
[GlobalISel] Do not run verifier after ResetMachineFunctionPass (#124799)
After we fall back from GlobalISel to SDAG, the verifier gets called,
which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs
which caches the result of UsesAGPRs. Because we have just fallen-back
the function is empty and it incorrectly gets cached to false. This
patch makes sure we don't try to run the verifier whilst the function is
empty.
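
The caching problem described above can be illustrated with a toy model. This is only a sketch: `ToyFunction`, `ToyFunctionInfo`, and `usesSpecialRegs` are hypothetical names invented for the example and are not the AMDGPU classes; the point is just that a query answered while the function is empty leaves a stale `false` in the cache.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for a machine function: just a list of register ids.
struct ToyFunction {
  std::vector<int> regsUsed;
};

// Hypothetical stand-in for per-function info that caches a query result.
class ToyFunctionInfo {
  std::optional<bool> cachedUsesSpecialRegs;

public:
  // Returns true if any register id is >= firstSpecialReg. The answer is
  // computed on the first call and served from the cache afterwards.
  bool usesSpecialRegs(const ToyFunction &fn, int firstSpecialReg) {
    assert(firstSpecialReg >= 0 && "register ids are non-negative");
    if (!cachedUsesSpecialRegs) {
      bool found = false;
      for (int reg : fn.regsUsed)
        found = found || reg >= firstSpecialReg;
      cachedUsesSpecialRegs = found;
    }
    return *cachedUsesSpecialRegs;
  }
};

// Models the failure: an early query on an empty function caches "false",
// and that answer survives after the function body has been filled in.
inline bool staleAnswerAfterEarlyQuery() {
  ToyFunction fn;                       // empty, as right after the fallback
  ToyFunctionInfo info;
  (void)info.usesSpecialRegs(fn, 100);  // early query caches false
  fn.regsUsed = {1, 2, 150};            // the body is populated later
  return info.usesSpecialRegs(fn, 100); // still false: stale cache
}
```

Skipping the early query, as the patch does by not running the verifier on the empty function, lets the first real query compute the right answer.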
2025-01-29 12:48:11 +00:00
Simon Pilgrim
9534d27e33
[X86] vector-idiv-sdiv-512.ll - regenerate VPTERNLOG comments
2025-01-29 11:34:44 +00:00
Oliver Stannard
36b3c43524
[AArch64] PAUTH_PROLOGUE should not be duplicated with PAuthLR (#124775)
When using PAuthLR, the PAUTH_PROLOGUE expands into a sequence of
instructions which takes the address of one of those instructions, and
uses that address to compute the return address signature. If this is
duplicated, there will be two different addresses used in calculating
the signature, so the epilogue will only be correct for (at most) one of
them.

This change also restricts code generation when using v8.3-A return
address signing, without PAuthLR. This isn't strictly needed, as
duplicating the prologue there would be valid. We could fix this by
having two copies of PAUTH_PROLOGUE, with and without isNotDuplicable,
but I don't think it's worth adding the extra complexity to a security
feature for that.
2025-01-29 10:42:47 +00:00
Mingming Liu
3feb724496
[AsmPrinter][ELF] Support profile-guided section prefix for jump tables' (read-only) data sections (#122215)
https://github.com/llvm/llvm-project/pull/122183 adds a codegen pass to
infer machine jump table entry's hotness from the MBB hotness. This is a
follow-up PR to produce `.hot` and/or `.unlikely` section prefix for
jump table's (read-only) data sections in the relocatable `.o` files.

When this patch is enabled, linker will see {`.rodata`, `.rodata.hot`,
`.rodata.unlikely`} in input sections. It can map `.rodata.hot` and
`.rodata` in the input sections to `.rodata.hot` in the executable, and
map `.rodata.unlikely` into `.rodata` with a pending extension to
`--keep-text-section-prefix` like
059e7cbb66,
or with a linker script.

1. To partition hot and cold jump tables, the AsmPrinter pass slices a function's jump table indices into two groups, one for hot and the other for cold jump tables. It then emits hot jump tables into a `.hot`-prefixed data section and cold ones into a `.unlikely`-prefixed data section, retaining the relative order of `LJT<N>` labels within each group.

2. [ELF only] To have data sections with _dynamic_ names (e.g., `.rodata.hot[.func]`), we implement
`TargetLoweringObjectFile::getSectionForJumpTable` method that accepts a `MachineJumpTableEntry` parameter, and update `selectELFSectionForGlobal` to generate `.hot` or `.unlikely` based on
MJTE's hotness.
    - The dynamic JT section name doesn't depend on `-ffunction-section=true` or `-funique-section-names=true`, even though it leverages the similar underlying mechanism to have a MCSection with on-demand name as `-ffunction-section` does.

3. The new code path is off by default.
    - Typically, `TargetOptions` conveys clang or LLVM tools' options to code generation passes. To follow the pattern, add option `EnableStaticDataPartitioning` bit in `TargetOptions` and make it
readable through `TargetMachine`.
    - To enable the new code path in tools like `llc`, `partition-static-data-sections` option is introduced in
`CodeGen/CommandFlags.h/cpp`.
    -  A subsequent patch
([draft](8f36a13743)) will add a clang option to enable the new code path.
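
The slicing in step 1 can be sketched as an order-preserving split. This is a minimal illustration, not the AsmPrinter code: `partitionJumpTables` and the `isHot` flags are hypothetical, and the real pass takes hotness from the jump table entries rather than from a plain vector of booleans.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits jump table indices into {hot, cold} groups while keeping the
// original relative order inside each group. isHot[i] is a hypothetical
// per-table hotness flag.
inline std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
partitionJumpTables(const std::vector<bool> &isHot) {
  std::vector<std::size_t> hot, cold;
  for (std::size_t i = 0; i < isHot.size(); ++i)
    (isHot[i] ? hot : cold).push_back(i);
  // Every index lands in exactly one of the two groups.
  assert(hot.size() + cold.size() == isHot.size());
  return {hot, cold};
}
```

With flags `{true, false, true, false, false}`, the hot group holds indices 0 and 2 and the cold group holds 1, 3 and 4, each in ascending order.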

---------

Co-authored-by: Ellis Hoag <ellis.sparky.hoag@gmail.com>
2025-01-28 22:49:28 -08:00
Luke Lau
8675cd3fac
[RISCV][VLOPT] Compute demanded VLs up front (#124530)
This replaces the worklist by instead computing what VL is demanded by
each instruction's users first, which is done via checkUsers.

The demanded VLs are stored in a DenseMap, and then we can just do a
single forward pass of tryReduceVL where we check if a candidate's
demanded VL is less than its VLOp.

This means the pass should now be linear in complexity, and allows us to
relax the restriction on tied operands more easily, as in #124066.
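
The two-step structure described above can be sketched on a toy model. Everything here is an assumption made for illustration: `ToyInst`, `computeDemandedVLs`, and `reduceVLs` are hypothetical, VLs are plain integers, and the sketch also folds in a user's own demanded VL, which is a choice of the toy model rather than a statement about the pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical toy instruction: a VL plus the indices of later instructions
// that read its result.
struct ToyInst {
  unsigned vl;
  std::vector<std::size_t> users;
};

// Step 1: walk bottom-up and record, per instruction, the largest VL that
// any user needs. Instructions without users get no entry.
inline std::vector<std::optional<unsigned>>
computeDemandedVLs(const std::vector<ToyInst> &insts) {
  std::vector<std::optional<unsigned>> demanded(insts.size());
  for (std::size_t i = insts.size(); i-- > 0;) {
    for (std::size_t user : insts[i].users) {
      assert(user > i && user < insts.size() && "users must come later");
      unsigned need = insts[user].vl;
      if (demanded[user])
        need = std::min(need, *demanded[user]);
      demanded[i] = std::max(demanded[i].value_or(0u), need);
    }
  }
  return demanded;
}

// Step 2: a single forward pass that shrinks a VL when less is demanded.
inline void reduceVLs(std::vector<ToyInst> &insts) {
  const auto demanded = computeDemandedVLs(insts);
  for (std::size_t i = 0; i < insts.size(); ++i)
    if (demanded[i] && *demanded[i] < insts[i].vl)
      insts[i].vl = *demanded[i];
}
```

Each instruction and each use edge is visited a constant number of times, which is the sense in which such a scheme is linear.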
2025-01-29 12:39:38 +08:00
Luke Lau
ff271d04a2
[RISCV][VLOPT] Fix assertion failure across blocks (#124734)
Whilst adding a cross-block test, I encountered an assertion failure in
the second pass where we check the instruction popped off the worklist
is a candidate.

The leaf instruction %c in this case will be added to the worklist when
its VL is VLMAX, but during the first pass it will have its VL reduced
to 1.

Then in the second pass when it's processed via the worklist, isCandidate
will no longer be true due to its VL == 1.

This fixes it by moving the VL == 1 check to tryReduceVL, keeping it
alongside the other VL check for bailing out early as an optimisation.
2025-01-29 11:00:50 +08:00
yonghong-song
8aae191cb6
[BPF] Remove 'may_goto 0' instructions (#123482)
Emil Tsalapatis from Meta reported a case where a 'may_goto 0' insn
is generated by the clang compiler. Since 'may_goto 0' is actually a
no-op, it makes sense to remove it in llvm. The patch also handles
the following code pattern
```
   ...
   may_goto 2
   may_goto 1
   may_goto 0
   ...
```
where three may_goto insns can all be removed.
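
Why all three can go is easiest to see on a toy model. The sketch below is not the BPF backend code: `ToyInsn` and `removeNoopMayGotos` are hypothetical, only forward may_goto jumps are modelled, and an offset is counted from the following instruction, so offset 0 targets the fall-through.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction: either a may_goto with a forward offset or
// any other instruction.
struct ToyInsn {
  bool isMayGoto;
  unsigned offset; // only meaningful when isMayGoto is true
};

// Repeatedly deletes may_goto instructions with offset 0. Deleting one
// shortens every earlier may_goto that jumps across it, which can turn that
// jump into another offset-0 no-op.
inline void removeNoopMayGotos(std::vector<ToyInsn> &prog) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t i = 0; i < prog.size(); ++i) {
      if (!prog[i].isMayGoto || prog[i].offset != 0)
        continue;
      for (std::size_t j = 0; j < i; ++j) {
        // The target of j is index j + 1 + offset; if it lies past i, the
        // jump spans the deleted instruction and gets one shorter.
        if (prog[j].isMayGoto && j + 1 + prog[j].offset > i) {
          assert(prog[j].offset > 0 && "a jump across i has offset > 0");
          --prog[j].offset;
        }
      }
      prog.erase(prog.begin() + static_cast<std::ptrdiff_t>(i));
      changed = true;
      break; // indices shifted, so rescan from the start
    }
  }
}
```

On the pattern above, removing `may_goto 0` turns `may_goto 1` into a new `may_goto 0`, and then the same happens to `may_goto 2`.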

---------

Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
2025-01-28 15:19:05 -08:00
Stephen Tozer
822f74a911
[Clang] Cleanup docs and comments relating to -fextend-variable-liveness (#124767)
This patch contains a number of changes relating to the above flag;
primarily it updates comment references to the old flag names,
"-fextend-lifetimes" and "-fextend-this-ptr" to refer to the new names,
"-fextend-variable-liveness[={all,this}]". These changes are all NFC.

This patch also removes the explicit -fextend-this-ptr-liveness flag
alias, and shortens the help-text for the main flag; these are both
changes that were meant to be applied in the initial PR (#110000), but
due to some user-error on my part they were not included in the merged
commit.
2025-01-28 18:25:32 +00:00
Venkata Ramanaiah Nalamothu
a0b049055d
[RISC-V] Fix incorrect epilogue_begin setting in debug line table (#120623)
The DwarfDebug.cpp implementation expects the epilogue instructions to
have source location of last non-debug instruction after which the epilogue
instructions are inserted. The epilogue_begin is set on location of the first
FrameDestroy instruction with source line information that has been seen in
the epilogue basic block.

On trunk, the RISC-V backend sets epilogue_begin after the epilogue has
actually begun, i.e. after the callee saved register reloads, because the source
line information is not set on those reload instructions. This leads to #120553:
while debugging, breaking on or single stepping to the epilogue_begin location
reads variables from the wrong place, because the FP has already been
restored to the parent frame's FP.

To fix that, this patch sets FrameSetup/FrameDestroy flags on the callee saved
register spill/reload instructions which is actually correct. Then the
RISCVInstrInfo::loadRegFromStackSlot uses FrameDestroy flag to identify a
reload of the callee saved register in the epilogue and copies the source
line information from insert position instruction to that reload instruction.

Requires PR #120622

Fixes #120553
2025-01-28 21:03:12 +05:30
Daniil Fukalov
68d90cff58
[AMDGPU][GlobalISel] Fix assert on APInt creation. (#124608)
Since 3494ee95902cef62f767489802e469c58a13ea04, APInt no longer
implicitly truncates values, so it asserts when a large signed value is
(implicitly) converted to an unsigned APInt.

The change explicitly marks offset as a signed value.
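
The distinction can be shown with a small standalone check. This sketch does not use APInt itself: `fitsUnsigned` and `fitsSigned` are hypothetical helpers that only model the question of whether a 64-bit input is representable in a given bit width under each interpretation.

```cpp
#include <cassert>
#include <cstdint>

// True if `value`, read as an unsigned number, fits in `bits` bits.
inline bool fitsUnsigned(unsigned bits, std::uint64_t value) {
  assert(bits >= 1 && bits <= 64 && "toy model supports 1..64 bits");
  return bits == 64 || (value >> bits) == 0;
}

// True if `value`, read as a signed number, fits in `bits` bits.
inline bool fitsSigned(unsigned bits, std::int64_t value) {
  assert(bits >= 1 && bits <= 64 && "toy model supports 1..64 bits");
  if (bits == 64)
    return true;
  const std::int64_t lo = -(std::int64_t{1} << (bits - 1));
  const std::int64_t hi = (std::int64_t{1} << (bits - 1)) - 1;
  return value >= lo && value <= hi;
}
```

A negative offset such as -16 fits in 32 bits when treated as signed, but the same bit pattern read as an unsigned 64-bit number has its upper bits set and does not fit, which is the situation the assertion catches.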
2025-01-28 15:53:17 +01:00
Stephen Tozer
22687aa97b
[CodeGen] Correctly handle non-standard cases in RemoveLoadsIntoFakeUses (#111551)
In the RemoveLoadsIntoFakeUses pass, we try to remove loads that are
only used by fake uses, as well as the fake use in question. There are
two existing errors with the pass however: it incorrectly examines every
operand of each FAKE_USE, when only the first is relevant (extra
operands will just be "killed" regs assigned by a previous pass), and it
ignores cases where the FAKE_USE register is not an exact match for the
loaded register, which is incorrect as regalloc may choose to load a
wider value than the FAKE_USE required pre-regalloc. This patch fixes
both of these cases.
2025-01-28 13:59:41 +00:00
Renat Idrisov
11db7fb09b
[GlobalISel] Catching inconsistencies in load memory, result, and range metadata type (#121247)
This is a fix for:
https://github.com/llvm/llvm-project/issues/97290
Please let me know if that is the right way to address the issue. Thank
you!

---------

Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-01-28 20:54:34 +07:00
abhishek-kaushik22
015aed18ee
[SelectionDAG] WidenVecOp_INSERT_SUBVECTOR - Replace INSERT_SUBVECTOR with series of INSERT_VECTOR_ELT (#124420)
If the operands to `INSERT_SUBVECTOR` can't be widened legally, just
replace the `INSERT_SUBVECTOR` with a series of `INSERT_VECTOR_ELT`.

Closes #124255 (and possibly #102016)
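
The replacement amounts to writing the subvector one element at a time. A minimal sketch on plain `std::vector`, where `insertSubvectorByElement` is a hypothetical helper and each loop iteration stands in for one `INSERT_VECTOR_ELT`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Writes `sub` into a copy of `vec` starting at `index`, one element at a
// time, and returns the result.
inline std::vector<int> insertSubvectorByElement(std::vector<int> vec,
                                                 const std::vector<int> &sub,
                                                 std::size_t index) {
  assert(index <= vec.size() && sub.size() <= vec.size() - index &&
         "subvector must fit inside the destination");
  for (std::size_t i = 0; i < sub.size(); ++i)
    vec[index + i] = sub[i];
  return vec;
}
```

The element-wise form needs no vector type wider than the ones already present, which is why it is usable when widening the operands is not legal.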
2025-01-28 18:54:49 +05:30
Luke Lau
500a1834d9
[RISCV][VLOPT] Fix some typos in vl-opt-op-info.mir test. NFC
vleN_v_incompatible_emul reassigns to %x and
vsuxeiN_v_idx_incompatible_eew has a dead instruction
2025-01-28 20:28:02 +08:00
Pierre van Houtryve
8ea018ce1d
[DAGISel] Fix MMRA Handling in copyExtraInfo (#124730)
#78569 did not implement this correctly and an edge case breaks it by
triggering `Assertion `!Leafs.empty()' failed.`

Fixes SWDEV-507698
2025-01-28 13:27:26 +01:00
Cullen Rhodes
8017ca1d00
Reapply "[AArch64] Combine and and lsl into ubfiz" (#123356) (#124576)
The patch was reverted because a test case (now added) exposed an infinite
loop in the combiner, where the (shl C1, C2) created by performSHLCombine
isn't constant-folded:

  Combining: t14: i64 = shl t12, Constant:i64<1>
  Creating new node: t36: i64 = shl OpaqueConstant:i64<-2401053089408754003>, Constant:i64<1>
  Creating new node: t37: i64 = shl t6, Constant:i64<1>
  Creating new node: t38: i64 = and t37, t36
   ... into: t38: i64 = and t37, t36
  ...
  Combining: t38: i64 = and t37, t36
  Creating new node: t39: i64 = and t6, OpaqueConstant:i64<-2401053089408754003>
  Creating new node: t40: i64 = shl t39, Constant:i64<1>
   ... into: t40: i64 = shl t39, Constant:i64<1>

and subsequently gets simplified by DAGCombiner::visitAND:

  // Simplify: (and (op x...), (op y...))  -> (op (and x, y))
  if (N0.getOpcode() == N1.getOpcode())
    if (SDValue V = hoistLogicOpWithSameOpcodeHands(N))
      return V;

before being folded by performSHLCombine once again and so on.

The combine in performSHLCombine should only be done if (shl C1, C2) can
be constant-folded; otherwise it may be unsafe and generally gives a
worse end result. Thanks to Dave Sherwood for his insight on this one.

This reverts commit f719771f251d7c30eca448133fe85730f19a6bd1.
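
The two forms that the combiners keep converting between compute the same value, which is why neither rewrite is wrong on its own and the loop only shows up when the shifted constant is not folded. A small standalone check, where `andThenShl` and `shlThenAnd` are hypothetical names for the two shapes on 64-bit values:

```cpp
#include <cassert>
#include <cstdint>

// Shape produced by DAGCombiner::visitAND's hoisting: mask, then shift.
inline std::uint64_t andThenShl(std::uint64_t x, std::uint64_t c1,
                                unsigned c2) {
  assert(c2 < 64 && "shift amount must be smaller than the bit width");
  return (x & c1) << c2;
}

// Shape produced by the shl combine: shift the value and the mask, then AND.
// This only pays off if (c1 << c2) folds to a single constant.
inline std::uint64_t shlThenAnd(std::uint64_t x, std::uint64_t c1,
                                unsigned c2) {
  assert(c2 < 64 && "shift amount must be smaller than the bit width");
  return (x << c2) & (c1 << c2);
}
```

Both functions return the same result for every `x`, `c1`, and in-range `c2`, so the only difference is which shape the later instruction selection patterns can match.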
2025-01-28 11:27:34 +00:00
Csanád Hajdú
4a00c84fbb
[AArch64] Allow register offset addressing mode for prefetch (#124534)
Previously instruction selection failed to generate PRFM instructions
with register offsets because `AArch64ISD::PREFETCH` is not a
`MemSDNode`.
2025-01-28 09:16:40 +00:00
Aaditya
cd57c9530b
[NFC][AMDGPU] Autogenerating test cases (#124507)
2025-01-28 13:41:59 +05:30
Adam Yang
aab25f20f6
[HLSL][SPIRV][DXIL] Implement WaveActiveMax intrinsic (#123428)
```
- add clang builtin to Builtins.td
- link builtin in hlsl_intrinsics
- add codegen for spirv intrinsic and two directx intrinsics to retain
  signedness information of the operands in CGBuiltin.cpp
- add semantic analysis in SemaHLSL.cpp
- add lowering of spirv intrinsic to spirv backend in
  SPIRVInstructionSelector.cpp
- add lowering of directx intrinsics to WaveActiveOp dxil op in
  DXIL.td

- add test cases to illustrate passespendent pr merges.
```
Resolves #99170
2025-01-27 23:26:56 -08:00
Djordje Todorovic
0cb7636a46
[RISCV] Add MIPS extensions (#121394)
Adding two extensions for MIPS p8700 CPU:
  1. cmove (conditional move)
  2. lsp (load/store pair)

The official product page here:
https://mips.com/products/hardware/p8700
2025-01-28 08:04:09 +01:00
Craig Topper
d4af658323
[RISCV] Support multiple memory operands in expandRV32ZdinxStore.
TailMerge can create stores with multiple memory operands. We
need to split all of them instead of assuming there is only one.
2025-01-27 22:10:51 -08:00
quic_hchandel
2d0688797c
[RISCV] Renaming muladdi to muliadd as per v0.5 spec. (#124237)
muliadd is more relevant to the operation performed, i.e. multiply by
immediate.

The latest spec can be found at:
https://github.com/quic/riscv-unified-db/releases/latest
2025-01-27 20:40:45 -08:00
Shilei Tian
6e4105574e
[NFC][AMDGPU] Improve code introduced in #124607 (#124672)
2025-01-27 22:57:16 -05:00
Momchil Velikov
f75860f895
[AArch64] Implement NEON FP8 intrinsics for fused multiply-add (#123615)
This patch adds the following intrinsics:

* Fused multiply-add non-indexed

float16x8_t vmlalbq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t,
fpm_t)
float16x8_t vmlaltq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t,
fpm_t)
        
float32x4_t vmlallbbq_f32_mf8_fpm(float32x4_t, mfloat8x16_t,
mfloat8x16_t, fpm_t)
float32x4_t vmlallbtq_f32_mf8_fpm(float32x4_t, mfloat8x16_t,
mfloat8x16_t, fpm_t)
float32x4_t vmlalltbq_f32_mf8_fpm(float32x4_t, mfloat8x16_t,
mfloat8x16_t, fpm_t)
float32x4_t vmlallttq_f32_mf8_fpm(float32x4_t, mfloat8x16_t,
mfloat8x16_t, fpm_t)

* Floating-point multiply-add long to half-precision (vector, by
element)

float16x8_t vmlalbq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vmlalbq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vmlaltq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vmlaltq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
    
* Floating-point multiply-add long-long to single-precision (vector, by
element)

float32x4_t vmlallbbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallbbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallbtq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallbtq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlalltbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlalltbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallttq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vmlallttq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
2025-01-28 00:38:44 +00:00
Shilei Tian
3b2b7ec07d
[AMDGPU] Handle invariant marks in AMDGPUPromoteAllocaPass (#124607)
Fixes SWDEV-509327.
2025-01-27 17:30:50 -05:00
David Green
5a81a559d6
[GISel] Explicitly disable BF16 tablegen patterns. (#124113)
We currently have an issue where bf16 patterns can be used to match fp16
types, as GISel does not know about the difference between the two. This
patch explicitly disables them to make sure that they are never used.

The opposite can also happen, where fp16 patterns are used for
operators that should be bf16. So this also changes any operations with
bf16 types to now cause a fallback to SDAG.

The pass setup for GISel has been slightly adjusted to make sure that a
verify pass does not get added between AMD-SDAG and SIFixSGPRCopiesPass,
which otherwise can cause verifier issues when falling back.
2025-01-27 22:21:12 +00:00
Momchil Velikov
804b81d39f
[AArch64] Add FP8 Neon intrinsics for dot-product (#123613)
This patch adds the following intrinsics:

float16x4_t vdot_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x8_t
vm, fpm_t fpm)
float16x8_t vdotq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, fpm_t fpm)
    
float16x4_t vdot_lane_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x4_t vdot_laneq_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vdotq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float16x8_t vdotq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
    
float32x2_t vdot_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x8_t
vm, fpm_t fpm)
float32x4_t vdotq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, fpm_t fpm)

float32x2_t vdot_lane_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x2_t vdot_laneq_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vdotq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm)
float32x4_t vdotq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn,
mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)
2025-01-27 21:14:16 +00:00
Craig Topper
aa34a6ab29
[RISCV] Add register allocation hints for lui/auipc+addi fusion. (#123860)
Spotted the auipc case while looking at code for P550. I'm not sure this
is the right long term fix. We're still missing rematerialization
opportunities for these pairs so a pseudo might be better. That would
interfere with folding auipc+add into load/store addressing though.

Fixes #76779.
2025-01-27 11:16:22 -08:00
Heejin Ahn
539b2e0654
[WebAssembly] Fix catch block type in wasm64 (#124381)
`try_table`'s `catch` or `catch_ref`'s target block's return type should
be `i64` and `(i64, exnref)` in case of wasm64.
2025-01-27 11:01:48 -08:00
Jeffrey Byrnes
e77d428e46
[AMDGPU] Do not remat instructions with PhysReg uses (#124366)
This blocks rematerialization during scheduling if the instruction has a
non accepted PhysReg use.

Currently, there aren't any checks like this in place, and we may create
invalid code: https://godbolt.org/z/xjPjdcorf
2025-01-27 10:50:06 -08:00
Brox Chen
d1139b32d2
[AMDGPU][True16][CodeGen] true16 codegen pats for v_mad_u16 (#124000)
true16 codegen pats for v_mad_u16 (mul+add)
2025-01-27 13:47:17 -05:00
Momchil Velikov
5d6d982df6
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (11/11) (#116837)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`SXTB`, `UXTB`, `SXTH`, `UXTH`, `SXTW`, and `UXTW` instructions.
2025-01-27 18:12:00 +00:00
Momchil Velikov
99bd2e3f12
[AArch64] Add Neon FP8 conversion intrinsics (#123612)
The patch adds the following intrinsics:

    bfloat16x8_t vcvt1_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm)
    bfloat16x8_t vcvt1_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    bfloat16x8_t vcvt2_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm)
    bfloat16x8_t vcvt2_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    
    bfloat16x8_t vcvt1_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    bfloat16x8_t vcvt2_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    
    float16x8_t vcvt1_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm)
    float16x8_t vcvt1_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    float16x8_t vcvt2_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm)
    float16x8_t vcvt2_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    
    float16x8_t vcvt1_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    float16x8_t vcvt2_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm)
    
    mfloat8x8_t vcvt_mf8_f32_fpm(float32x4_t vn, float32x4_t vm, fpm_t fpm)
    mfloat8x16_t vcvt_high_mf8_f32_fpm(mfloat8x8_t vd, float32x4_t vn, float32x4_t vm, fpm_t fpm)
    
    mfloat8x8_t vcvt_mf8_f16_fpm(float16x4_t vn, float16x4_t vm, fpm_t fpm)
    mfloat8x16_t vcvtq_mf8_f16_fpm(float16x8_t vn, float16x8_t vm, fpm_t fpm)

Co-Authored-By: Caroline Concatto <caroline.concatto@arm.com>
2025-01-27 17:32:47 +00:00
Momchil Velikov
4e231014c1
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (10/11) (#116836)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`RBIT`, `REVB`, `REVH`, `REVW`, and `REVD` instructions.
2025-01-27 16:45:40 +00:00
Luke Lau
cb6f021af2
[RISCV][VLOPT] Remove unnecessary passthru restriction (#124549)
We currently check for passthrus in two places, on the instruction to
reduce in isCandidate, and on the users in checkUsers.

We cannot reduce the VL if an instruction has a user that's a passthru,
because the user will read elements past VL in the tail.

However it's fine to reduce an instruction if it itself contains a
non-undef passthru. Since the VL can only be reduced, not increased, the
previous tail will always remain the same.
2025-01-27 23:54:32 +08:00
Momchil Velikov
f95f10c7e6
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (9/11) (#116835)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`URECPE`, `URSQRTE`, `SQABS` and `SQNEG` instructions.
2025-01-27 15:50:53 +00:00
Simon Pilgrim
86705eb624
[X86] huge-stack-offset.ll - add gnux32 test coverage
This should match x86 for the basic implementation, but it's useful to check it actually runs correctly.
2025-01-27 14:10:16 +00:00
David Green
ef54e0bbfb
[AArch64] Avoid generating LDAPUR on certain cores (#124274)
On the CPUs listed below, we want to avoid LDAPUR for performance
reasons. Add a tuning feature to disable them when using:
 -mcpu=neoverse-v2
 -mcpu=neoverse-v3
 -mcpu=cortex-x3
 -mcpu=cortex-x4
 -mcpu=cortex-x925
2025-01-27 13:12:11 +00:00
Momchil Velikov
d8ad1eef8f
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (7/11) (#116833)
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.

This patch adds support for emitting the zeroing forms of certain
`FLOGB` instructions.
2025-01-27 12:53:38 +00:00
Durgadoss R
3b5e9eed2f
[NVPTX] Add float to tf32 conversion intrinsics (#124316)
This patch adds the set of f32 -> tf32 cvt intrinsics introduced
in sm100 with ptx8.6. This builds on top of the recent PR #121507.

Tests are verified with a 12.8 ptxas executable.

PTX ISA link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2025-01-27 15:52:43 +05:30