52796 Commits

Author SHA1 Message Date
Tuan Chuong Goh
5686f06d7f [AArch64][GlobalISel] Select USHLL2 Instruction
Select ushll2 instruction instead of using mov and ushll

Differential Revision: https://reviews.llvm.org/D158420
2023-08-23 13:41:47 +01:00
David Green
ef0b8cf3f4 [AArch64][GISel] Expand coverage of FAdd and FSub.
This adds some more extensive test coverage for fadd/fsub through global isel,
switching the opcodes to use the more complete ActionDefinitions to handle more
cases.
2023-08-23 09:51:06 +01:00
Jianjian GUAN
879e801a91 [RISCV] Apply promotion for f16 vector ops when only have zvfhmin
For most fp16 vector ops, we could promote it to fp32 vector when zvfhmin is enable but zvfh is not.
But for nxv32f16, we need to split it first since nxv32f32 is not a valid MVT.

Reviewed By: michaelmaitland

Differential Revision: https://reviews.llvm.org/D153848
2023-08-23 16:49:20 +08:00
Jianjian GUAN
759903568f [RISCV] Add Zvfhmin extension support for llvm RISCV backend
This patch supports Zvfhmin for RISCV codegen.

Reviewed By: michaelmaitland

Differential Revision: https://reviews.llvm.org/D151414
2023-08-23 16:47:47 +08:00
esmeyi
96b5ea6e00 [NFC][PowerPC] Add cases for 64-bit constants. 2023-08-23 04:10:16 -04:00
wanglei
1bb7766489 [LoongArch] Optimize stack realignment using BSTRINS instruction
Prior to this change, stack realignment was achieved using the SRLI/SLLI
instructions in two steps. With this patch, stack realignment is
optimized using a single `BSTRINS` instruction.

Reviewed By: SixWeining, xen0n

Differential Revision: https://reviews.llvm.org/D158384
2023-08-23 09:21:42 +08:00
Rahman Lavaee
d7e10df605 Remove checking stats from -gc-empty-basic-blocks test.
The test does not require asserts. So it can't check the stats.
2023-08-23 01:18:47 +00:00
Rahman Lavaee
d0ec03a384 Revert "[BasicBlockSections] avoid insertting redundant branch to fall through blocks"
This reverts commit ab53109166c0345a79cbd6939cf7bc764a982856 which was
commited by mistake.
2023-08-23 01:09:13 +00:00
Rahman Lavaee
ab53109166 [BasicBlockSections] avoid insertting redundant branch to fall through blocks 2023-08-22 23:32:02 +00:00
Rahman Lavaee
e280e406c2 Add a pass to garbage-collect empty basic blocks after code generation.
Propeller and pseudo-probes map profiles back to Machine IR via basic block addresses that are stored in metadata sections.
Empty basic blocks (basic blocks without real code) obfuscate the profile mapping because their addresses collide with their next basic blocks.
For instance, the fallthrough block of an empty block should always be adjacent to it. Otherwise, a completely unnecessary jump would be added.
This patch adds a MachineFunction pass named `GCEmptyBasicBlocks` which attempts to garbage-collect the empty blocks before the `BasicBlockSections` and pass.
This pass removes each empty basic block after redirecting its incoming edges to its fall-through block.
The garbage-collection is not complete. We keep the empty block in 4 cases:
      1. The empty block is an exception handling pad.
      2. The empty block has its address taken.
      3. The empty block is the last block of the function and it has
         predecessors.
      4. The empty block is the only block of the function.
The first three cases are extremely rare in normal code (no cases for the clang binary). Removing the blocks under the first two cases requires modifying exception handling structures and operands of non-terminator instructions -- which is doable but not worth the additional complexity in the pass.

Reviewed By: tmsriram

Differential Revision: https://reviews.llvm.org/D107534
2023-08-22 22:42:19 +00:00
Daniel Hoekwater
90ab85a1b2 Reland "[CodeGen][AArch64] Make MFS testable on AArch64"
Reverted by 3d22dac6c3b97d7bb92f243886dfb0d32a5c42e9 because it depended
on b9d079d6188b50730e0a67267b7fee36008435ce, which broke some tests.
2023-08-22 20:21:33 +00:00
Craig Topper
d9320e22d4 [RISCV][GlobalISel] Select register banks for GPR ALU instructions
This patch implements the getInstrMapping hook for RISCVRegisterBankInfo and others in order to correctly select the GPR register bank for operands of ALU instructions, and the associated operations introduced by the legalizer.

Co-authored-by: Lewis Revill <lewis.revill@embecosm.com>

Reviewed By: nitinjohnraj

Differential Revision: https://reviews.llvm.org/D76051
2023-08-22 12:31:20 -07:00
Joseph Huber
8a20612467 [AMDGPU] Respect nobuiltin when converting printf
The AMDGPU backend uses a pass to transform calls to the `printf`
function to a built-in verision for either HIP or OpenCL. Currently this
does not respect `-fno-builtin` and is always emitted. This allows the
user to turn off this functionality as is standard for these types of
built-in transformations. The motivation behind this change is to allow
the `libc` project to provide a linkable version of the `printf`
function in the future.

Reviewed By: sameerds

Differential Revision: https://reviews.llvm.org/D158477
2023-08-22 12:48:16 -05:00
Changpeng Fang
930e8dea41 AMDGPU: Add s-memrealtime and s-memtime-inst to RemoveIncompatibleFunctions
Summary:
 Under -O0, device-libs may still emit these instructions under conditions.
So we need to remove them with warning if not compatible.

Fixes: SWDEV-417219

Reviewers:
  arsenm, Pierre-vh and b-sumner

Differential Revision:
  https://reviews.llvm.org/D158316
2023-08-22 10:22:41 -07:00
David Green
13c2514df3 [AArch64] Disable GlobalISel/FastISel for more SME functions
The patch D136361 disabled GlobalISel and FastISel for some SME functions, as
the saving and restoring of SM is not yet handled. There were several tests
added for fp128 fadd, which will be expanded to a libcall, that only happened
to work by accident and did not handle other cases such as f32/f64 frem
libcalls.

This extends the cases where GlobalISel / FastISel is disabled for functions
with SME attributes, under the assumption that it is difficult to tell what
will become a libcall reliably, and so should fall back for all function until
GlobalISel and/or FastISel can handle them.

Differential Revision: https://reviews.llvm.org/D158490
2023-08-22 18:06:27 +01:00
David Green
ba114abec7 [AArch64] Add extra SME attribute tests for expanded intrinsics. NFC
See D136361.
2023-08-22 17:51:57 +01:00
Noah Goldstein
7c9fe735d4 [ValueTracking] Strengthen analysis in computeKnownBits of phi
Use the comparison based analysis to strengthen the standard
knownbits analysis rather than choosing either/or.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D157807
2023-08-22 10:59:03 -05:00
Philip Reames
c3b48ec6ff [RISCV] Match strided loads with reversed indexing sequences
This extends the concat_vector of loads to strided_load transform to handle reversed index pattern. The previous code expected indexing of the form (a0, a1+S, a2+S,...). However, we can also see indexing of the form (a1+S, a2+S, a3+S, .., aS). This form is a strided load starting at address aN + S*(n-1) with stride -S.

Note that this is also fixing what looks to be a bug in the memory location reasoning for forward strided case. A strided load with negative stride access eltsize bytes past base ptr, and then bytes *before* base ptr. (That is, the range should extend from before base ptr to after base ptr.)

Differential Revision: https://reviews.llvm.org/D157886
2023-08-22 07:59:49 -07:00
Philip Reames
ecb855a5a8 [RISCV] Reduce LMUL for vector extracts
If we have a known (or bounded) index which definitely fits in a smaller LMUL register group size, we can reduce the LMUL of the slide and extract instructions. This loosens constraints on register allocation, and allows the hardware to do less work, at the potential cost of some additional VTYPE toggles. In practice, we appear (after prior patches) to do a decent job of eliminating the additional VTYPE toggles in most cases.

Differential Revision: https://reviews.llvm.org/D158460
2023-08-22 07:36:17 -07:00
David Green
8f6a1a07cb [GISel][AArch64] Combine G_BUILD_VECTOR(G_UNMERGE) with undef elements
This extends the existing legalization combine to fold G_BUILD_VECTOR where the
sources are all from the same G_UNMERGE, to handle cases where some of the
lanes are undef. This comes up in the legalization of <3 x ..> vectors in
AArch64, where they are padded with undef.

There are two choices for what to create. This patch just removes the
G_BUILD_VECTOR/G_UNMERGE, losing the information about which lanes are undef.
The alternative would be to generate an identity G_SHUFFLE_VECTOR with undef
lanes marked as undef. I think both have advantages and disadvantages.

Differential Revision: https://reviews.llvm.org/D158063
2023-08-22 14:25:31 +01:00
Luke Lau
946c672fe0 [RISCV] Remove fixed length lmul max restriction from fp build_vector tests. NFC
For the same reasons as D157973, remove the LMUL flag from the tests to
simplify them and make the diffs in D157976 easier to read.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D158270
2023-08-22 11:13:02 +01:00
Luke Lau
4f996d7fbf [RISCV] Add test for constant build_vector that could use vid. NFC
We currently don't lower this to a vid because the addend doesn't fit into a
vadd.vi immediate. An extra li here seems like a small cost to pay for a
constant pool load.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D157975
2023-08-22 11:12:57 +01:00
Luke Lau
7492b54bd5 [RISCV] Split up structs in buildvec tests. NFC
Some of these tests have multiple vid sequence build_vectors in them, and machine CSE tidies up their output.
But now that small build_vectors are sometimes lowered to scalar `vmv.x.s`s, the output is harder to read with no CSE, e.g. see `buildvec_no_vid_v4i8`.

This patch splits them up into separate functions to address this, and also makes the diff in D157976 more clear since it causes `vmv.x.s` to be lowered to more frequently.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D157974
2023-08-22 11:12:52 +01:00
Luke Lau
9918853215 [RISCV] Remove fixed length lmul max restriction from build_vector tests. NFC
I'm not sure why the flag was added initially but I find the tests a bit easier
to read with it off, and it matches the default behaviour.

Reviewed By: craig.topper, reames

Differential Revision: https://reviews.llvm.org/D157973
2023-08-22 11:12:47 +01:00
Luke Lau
007b41b393 [RISCV] Don't relax policy to ta when vmerge's VL shrinks during folding
When folding a vmerge into its operands, if the resulting VL is smaller than
what the vmerge had originally then what was previously in its body then gets
moved to the tail. In that case, we can't relax the tail policy to agnostic
when the merge operand is undefined, since we need to preserve these elements
past the new VL.

Fixes https://github.com/llvm/llvm-project/issues/64754

Reviewed By: craig.topper, reames

Differential Revision: https://reviews.llvm.org/D158161
2023-08-22 10:39:22 +01:00
Luke Lau
6e532f94eb [RISCV] Add test case showing vmerge fold miscompile with tail policy
Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D158160
2023-08-22 10:39:18 +01:00
Tuan Chuong Goh
0b91b1aec4 [AArch64][GlobalISel] Legalize and Lower Funnel Shift for GlobalISel
Recognise G_FSHR with constant shift amount as a legal instruction.
Lowers G_FSHL with constant shift to G_FSHR.
If shift amount is non-constant, generic lowering is applied to the
instruction.

Differential Revision: https://reviews.llvm.org/D155484
2023-08-22 10:32:50 +01:00
Nikita Popov
1c6e6432ca [SCEVExpander] Fix incorrect reuse of more poisonous instructions (PR63763)
SCEVExpander tries to reuse existing instruction with the same
SCEV expression. However, doing this replacement blindly is not
safe, because the instruction might be more poisonous.

What we were already doing is to drop poison-generating flags on
the reused instruction. But this is not the only way that more
poison can be introduced. The poison-generating flag might not
be directly on the reused instruction, or the poison contribution
might come from something like 0 * %var, which folds to 0 but can
still introduce poison.

This patch fixes the issue in a principled way, by determining which
values can contribute poison to the SCEV expression, and then
checking whether any additional values can contribute poison to the
instruction being reused. Poison-generating flags are dropped if
doing that enables reuse.

This is a pretty big hammer and does cause some regressions in
tests, but less than I would have expected. I wasn't able to come
up with a less intrusive fix that still satisfies the correctness
requirements.

Fixes https://github.com/llvm/llvm-project/issues/63763.
Fixes https://github.com/llvm/llvm-project/issues/63926.
Fixes https://github.com/llvm/llvm-project/issues/64333.
Fixes https://github.com/llvm/llvm-project/issues/63727.

Differential Revision: https://reviews.llvm.org/D158181
2023-08-22 09:27:07 +02:00
pvanhout
2d87319f06 [GlobalISel] Rewrite some simple rules using MIR Patterns
Rewrites some simple rules that cause little to no codegen regressions as MIR patterns.

I may have missed some easy cases, but some other rules have intentionally been left as-is because bigger
changes are needed to make them work.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D157690
2023-08-22 09:09:54 +02:00
Mikael Holmen
d1e685df45 [test] Add -verify-coalescing to testcase and fix problems
Apparently the testcase
 coalesce-partial-redundant-reguse-terminator.mir
was broken in a way that -verify-coalescing detected.

Update the testcase so -verify-coalescing doesn't complain and so
that it still exposes the problem originally fixed in 6c062b7641623.

Differential Revision: https://reviews.llvm.org/D158397
2023-08-22 07:20:53 +02:00
Eduard Zingerman
651e644595 [BPF] Replace BPFMIPeepholeTruncElim by custom logic in isZExtFree()
Replace `BPFMIPeepholeTruncElim` by adding an overload for
`TargetLowering::isZExtFree()` aware that zero extension is
free for `ISD::LOAD`.

Short description
=================

The `BPFMIPeepholeTruncElim` handles two patterns:

Pattern #1:

    %1 = LDB %0, ...              %1 = LDB %0, ...
    %2 = AND_ri %1, 0xff      ->  %2 = MOV_ri %1    <-- (!)

Pattern #2:

    bb.1:                         bb.1:
      %a = LDB %0, ...              %a = LDB %0, ...
      br %bb3                       br %bb3
    bb.2:                         bb.2:
      %b = LDB %0, ...        ->    %b = LDB %0, ...
      br %bb3                       br %bb3
    bb.3:                         bb.3:
      %1 = PHI %a, %b               %1 = PHI %a, %b
      %2 = AND_ri %1, 0xff          %2 = MOV_ri %1  <-- (!)

Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32

Both patterns could be handled by built-in transformations at
instruction selection phase if suitable `isZExtFree()` implementation
is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.

When evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM
test cases this revisions performs slightly better than
BPFMIPeepholeTruncElim, see "Impact" section below for details.

Commit also adds a few test cases to make sure that patterns in
question are handled.

Long description
================

Why this works: Pattern #1
--------------------------

Consider the following example:

    define i1 @foo(ptr %p) {
    entry:
      %a = load i8, ptr %p, align 1
      %cond = icmp eq i8 %a, 0
      ret i1 %cond
    }

Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:

    ...
    Type-legalized selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
          t19: i64 = and t16, Constant:i64<255>
        t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Replacing.1 t19: i64 = and t16, Constant:i64<255>
    With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
     and 0 other values
    ...
    Optimized type-legalized selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 11 nodes:
      t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
        t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...

Note:
- Optimized type-legalized selection DAG:
  - `t19 = and t16, 255` had been replaced by `t16` (load).
  - Patterns like `(and (load ... i8), 255)` are replaced by `load`
    in `DAGCombiner::BackwardsPropagateMask` called from
    `DAGCombiner::visitAND`.
  - Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by
    `(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge,
    look for `TLI.shouldFoldConstantShiftPairToMask()` call).

Why this works: Pattern #2
--------------------------

Consider the following example:

    define i1 @foo(ptr %p) {
    entry:
      %a = load i8, ptr %p, align 1
      br label %next

    next:
      %cond = icmp eq i8 %a, 0
      ret i1 %cond
    }

Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command.
Log for first basic block:

    Initial selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 9 nodes:
      t0: ch,glue = EntryToken
      t3: i64 = Constant<0>
            t2: i64,ch = CopyFromReg t0, Register:i64 %1
          t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
        t6: i64 = zero_extend t5
      t8: ch = CopyToReg t0, Register:i64 %0, t6
    ...
    Replacing.1 t6: i64 = zero_extend t5
    With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
     and 0 other values
    ...
    Optimized lowered selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 7 nodes:
      t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %1
        t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
      t8: ch = CopyToReg t0, Register:i64 %0, t9

Note:
- Initial selection DAG:
  - `%a = load ...` is lowered as `t6 = (zero_extend (load ...))`
    w/o special `isZExtFree()` overload added by this commit
    it is instead lowered as `t6 = (any_extend (load ...))`.
  - The decision to generate `zero_extend` or `any_extend` is
    done in `RegsForValue::getCopyToRegs` called from
    `SelectionDAGBuilder::CopyValueToVirtualRegister`:
    - if `isZExtFree()` for load returns true `zero_extend` is used;
    - `any_extend` is used otherwise.
- Optimized lowered selection DAG:
  - `t6 = (any_extend (load ...))` is replaced by
    `t9 = load ..., zext from i8`
    This is done by `DagCombiner.cpp:tryToFoldExtOfLoad()` called from
    `DAGCombiner::visitZERO_EXTEND`.

Log for second basic block:

    Initial selection DAG: %bb.1 'foo:next'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
                t2: i64,ch = CopyFromReg t0, Register:i64 %0
              t4: i64 = AssertZext t2, ValueType:ch:i8
            t5: i8 = truncate t4
          t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
        t9: i64 = any_extend t8
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Replacing.2 t18: i64 = and t4, Constant:i64<255>
    With: t4: i64 = AssertZext t2, ValueType:ch:i8
    ...
    Type-legalized selection DAG: %bb.1 'foo:next'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t4: i64 = AssertZext t2, ValueType:ch:i8
          t18: i64 = and t4, Constant:i64<255>
        t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Optimized type-legalized selection DAG: %bb.1 'foo:next'
    SelectionDAG has 11 nodes:
      t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t4: i64 = AssertZext t2, ValueType:ch:i8
        t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...

Note:
- Initial selection DAG:
  - `t0` is an input value for this basic block, it corresponds load
    instruction (`t9`) from the first basic block.
  - It is accessed within basic block via
    `t4` (AssertZext (CopyFromReg t0, ...)).
  - The `AssertZext` is generated by RegsForValue::getCopyFromRegs
    called from SelectionDAGBuilder::getCopyFromRegs, it is generated
    only when `LiveOutInfo` with known number of leading zeros is
    present for `t0`.
  - Known register bits in `LiveOutInfo` are computed by
    `SelectionDAG::computeKnownBits` called from
    `SelectionDAGISel::ComputeLiveOutVRegInfo`.
  - `computeKnownBits()` generates leading zeros information for
    `(load ..., zext from ...)` but *does not* generate leading zeros
    information for `(load ..., anyext from ...)`.
    This is why `isZExtFree()` added in this commit is important.
- Type-legalized selection DAG:
  - `t5 = truncate t4` is replaced by `t18 = and t4, 255`
- Optimized type-legalized selection DAG:
  - `t18 = and t4, 255` is replaced by `t4`, this is done by
    `DAGCombiner::SimplifyDemandedBits` called from
    `DAGCombiner::visitAND`, which simplifies patterns like
    `(and (assertzext ...))`

Impact
------

This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and
  remove_truncate_8 improved slightly, for other tests it is
  unchanged.

For remove_truncate_7:

    Before this revision                 After this revision
    --------------------                 -------------------
        r1 <<= 0x20                          r1 <<= 0x20
        r1 >>= 0x20                          r1 >>= 0x20
        if r1 == 0x0 goto +0x2 <LBB0_2>      if r1 == 0x0 goto +0x2 <LBB0_2>
        r1 = *(u32 *)(r2 + 0x0)              r0 = *(u32 *)(r2 + 0x0)
        goto +0x1 <LBB0_3>                   goto +0x1 <LBB0_3>
    <LBB0_2>:                            <LBB0_2>:
        r1 = *(u32 *)(r2 + 0x4)              r0 = *(u32 *)(r2 + 0x4)
    <LBB0_3>:                            <LBB0_3>:
        r0 = r1                              exit
        exit

For remove_truncate_8:

    Before this revision                 After this revision
    --------------------                 -------------------
        r2 = *(u32 *)(r1 + 0x0)              r2 = *(u32 *)(r1 + 0x0)
        r3 = r2                              r3 = r2
        r3 <<= 0x20                          r3 <<= 0x20
        r4 = r3                              r3 s>>= 0x20
        r4 s>>= 0x20
        if r4 s> 0x2 goto +0x5 <LBB0_3>      if r3 s> 0x2 goto +0x4 <LBB0_3>
        r4 = *(u32 *)(r1 + 0x4)              r3 = *(u32 *)(r1 + 0x4)
        r3 >>= 0x20
        if r3 >= r4 goto +0x2 <LBB0_3>       if r2 >= r3 goto +0x2 <LBB0_3>
        r2 += 0x2                            r2 += 0x2
        *(u32 *)(r1 + 0x0) = r2              *(u32 *)(r1 + 0x0) = r2
    <LBB0_3>:                            <LBB0_3>:
        r0 = 0x3                             r0 = 0x3
        exit                                 exit

For kernel BPF selftests statistics is as follows: (-mcpu=v4):
- For -mcpu=v4: 9 out of 655 object files have differences,
  in all cases total number of instructions marginally decreased
  (-27 instructions).
- For -mcpu=v2: 9 out of 655 object files have differences:
  - For 19 object files number of instruction decreased
    (-129 instruction in total): some redundant `rX &= 0xffff`
    and register to register assignments removed;
  - For 2 object files number of instructions increased +2
    instructions in each file.

Both -mcpu=v2 instruction increases could be reduced to the same
example:

    define void @foo(ptr %p) {
    entry:
      %a = load i32, ptr %p, align 4
      %b = sext i32 %a to i64
      %c = icmp ult i64 1, %b
      br i1 %c, label %next, label %end

    next:
      call void inttoptr (i64 62 to ptr)(i32 %a)
      br label %end

    end:
      ret void
    }

Note that this example uses value loaded to `%a` both as a sign
extended (`%b`) and as zero extended (`%a` passed as parameter).
Here is the difference in final assembly code:

    Before this revision          After this revision
    --------------------          -------------------
        r1 = *(u32 *)(r1 + 0)         r1 = *(u32 *)(r1 + 0)
        r1 <<= 32                     r1 <<= 32
        r1 s>>= 32                    r1 s>>= 32
        if r1 < 2 goto <LBB0_2>       if r1 < 2 goto <LBB0_2>
                                      r1 <<= 32
                                      r1 >>= 32
        call 62                       call 62
    <LBB0_2>:                     <LBB0_2>:
        exit                          exit

Before this commit `%a` is passed to call as a sign extended value,
after this commit `%a` is passed to call as a zero extended value,
both are correct as 32-bit sub-register is the same.

The difference comes from `DAGCombiner` operation on the initial DAG:

Initial selection DAG before this commit:

    t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
          t6: i64 = any_extend t5         <--------------------- (1)
        t8: ch = CopyToReg t0, Register:i64 %0, t6
            t9: i64 = sign_extend t5
          t12: i1 = setcc Constant:i64<1>, t9, setult:ch

Initial selection DAG after this commit:

    t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
          t6: i64 = zero_extend t5        <--------------------- (2)
        t8: ch = CopyToReg t0, Register:i64 %0, t6
            t9: i64 = sign_extend t5
          t12: i1 = setcc Constant:i64<1>, t9, setult:ch

The node `t9` is processed before node `t6` and `load` instruction is
combined to load with sign extension:

    Replacing.1 t9: i64 = sign_extend t5
    With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
     and 0 other values
    Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
    With: t31: i32 = truncate t30
     and 1 other values

This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from
`DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which
is `any_extend` in (1) and `zero_extend` in (2).
`tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:
- `any_extend` is simply removed
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later
  converted to a pair of shifts. This pair of shifts survives till the
  end of translation.

Differential Revision: https://reviews.llvm.org/D157870
2023-08-22 00:04:51 +03:00
Fangrui Song
77596e6b16 Revert D157750 "[Driver][CodeGen] Properly handle -fsplit-machine-functions for fatbinary compilation."
This reverts commit 317a0fe5bd7113c0ac9d30b2de58ca409e5ff754.
This reverts commit 30c4b97aec60895a6905816670f493cdd1d7c546.

See post-commit discussions on https://reviews.llvm.org/D157750 that
we should use a different mechanism to handle the error with --cuda-gpu-arch=

The IR/DiagnosticInfo.cpp, warn_drv_for_elf_only, codegne tests in
clang/test/Driver, and the following driver behavior (downgrading error
to warning) changes are undesired.
```
% clang --target=riscv64 -fsplit-machine-functions -c a.c
warning: -fsplit-machine-functions is not valid for riscv64 [-Wbackend-plugin]
```
2023-08-21 13:54:15 -07:00
Philip Reames
dd0d36d09f [RISCVInsertVSETVLI] Handle vl-preserve case in backwards rewrite
This updates the backwards mutation code to handle the case where the previous vset was in vl-preserving (x0, x0) form, but that VL was never used before the next vset which changes the VL. Since this requires writing both VL operands, eliminate the restriction on removing GPR producing vsetv as well. (The register will now be written by the earlier vsetv.)

Differential Revision: https://reviews.llvm.org/D158019
2023-08-21 12:28:28 -07:00
Craig Topper
479716d954 [RISCV][GISel] Make G_SEXT_INREG with source size of 32 legal for RV64.
This maps to the sext.w instruction.

As far as I could tell this needs custom lowering to check the immediate.

Reviewed By: nitinjohnraj

Differential Revision: https://reviews.llvm.org/D158350
2023-08-21 10:43:10 -07:00
Daniel Hoekwater
e223e45677 Reland "[AArch64][CodeGen] Avoid inverting hot branches during relaxation""
This is a reland of 46d2d7599d9ed5e68fb53e910feb10d47ee2667b, which was
reverted because of breaking build
https://lab.llvm.org/buildbot/#/builders/21/builds/78779. However, this
buildbot is spuriously broken due to Flang::underscoring.f90 being
nondeterministic.
2023-08-21 17:29:47 +00:00
Daniel Hoekwater
0303137bfc Revert "[AArch64][CodeGen] Avoid inverting hot branches during relaxation"
This reverts commit 46d2d7599d9ed5e68fb53e910feb10d47ee2667b.
Breaks build https://lab.llvm.org/buildbot/#/builders/21/builds/78779
2023-08-21 17:13:35 +00:00
Daniel Hoekwater
46d2d7599d [AArch64][CodeGen] Avoid inverting hot branches during relaxation
Current behavior for relaxing out-of-range conditional branches
is to invert the conditional and insert a fallthrough unconditional
branch to the original destination. This approach biases the branch
predictor in the wrong direction, which can degrading performance.

Machine function splitting introduces many rarely-taken cross-section
conditional branches, which are improperly relaxed. Avoid inverting
these branches; instead, retarget them to trampolines at the end of the
function. Doing so increases the runtime cost of jumping to cold code
but eliminates the misprediction cost of jumping to hot code.

Differential Revision: https://reviews.llvm.org/D156837
2023-08-21 16:41:02 +00:00
Harvin Iriawan
7ba4896ecf [AArch64][NFC] Fix stack-guard-sysreg.ll
Fix test updated by commit db158c7c830807caeeb0691739c41f1d522029e9

  Differential Revision: https://reviews.llvm.org/D158432
2023-08-21 17:30:53 +01:00
Stefan Pintilie
d0e1e7649b [NFC][PowerPC] Add a test case for rotate and clear.
Added a test case for situations where a rotate is followed by a clear.
NFC because only a test case is added.
2023-08-21 11:01:47 -04:00
Harvin Iriawan
db034da211 [AArch64][NFC] Update test related to a510 sched update
Missed updating the win64_vararg2.ll test on db158c7c830807caeeb0691739c41f1d522029e9

  Differential Revision: https://reviews.llvm.org/D158410
2023-08-21 13:03:56 +01:00
Harvin Iriawan
db158c7c83 [AArch64] Update generic sched model to A510
Refresh of the generic scheduling model to use A510 instead of A55.
  Main benefits are to the little core, and introducing SVE scheduling information.
  Changes tested on various OoO cores, no performance degradation is seen.

  Differential Revision: https://reviews.llvm.org/D156799
2023-08-21 12:25:15 +01:00
Martin Storsjö
955d7615bd [AArch64] [GlobalISel] Fix clobbered callee saved registers with win64 varargs
This fixes a regression since 1c10d5b175992a9d056a2d763a932e5652386fc1
/ https://reviews.llvm.org/D130903 by applying the same fix from
SelectionDAG from 8cb3667541a94c4fa11b06e19020f753414c1d03 /
https://reviews.llvm.org/D35720.

This could possibly have been detected if the existing testcases
in win64_vararg.ll had been tested with GlobalISel too, but all
the IR snippets there fail to be translated with GlobalISel.

This adds a separate testcase based on real world LLVM IR (instead of
hand-reduced IR), which GlobalISel does translate happily - tested
with both SelectionDAG and GlobalISel.

Before this change, the stack object locations (visible in MIR
with "llc -print-after-all") didn't match with what the prologue
emitted by AArch64FrameLowering actually looked like, which caused
clobbered callee saved registers when function local stack objects
aliased the actual location of the callee saved registers.

This fixes https://github.com/llvm/llvm-project/issues/64740.

Differential Revision: https://reviews.llvm.org/D158272
2023-08-21 14:08:23 +03:00
wangpc
dc60003ec8 [RISCV] Support global address as inline asm memory operand of m
In D146245, we have supported lowering inline asm `m` with offset
to `register+imm`, but we didn't handle the case that the offset
is the low part of global address.

This patch will emit `%lo(g)` when `g` is a global address.

Fixes #64656

Reviewed By: asb

Differential Revision: https://reviews.llvm.org/D157839
2023-08-21 18:59:52 +08:00
wangpc
a3b11ce786 [RISCV][NFC] Move tests of inline asm memory constraints to separate file
We will need to check the output of medium code model.

Reviewed By: wangpc

Differential Revision: https://reviews.llvm.org/D157965
2023-08-21 18:59:52 +08:00
Diana Picus
5272ae667d [AMDGPU] Add IsChainFunction to the MachineFunctionInfo
This will represent functions with the amdgpu_cs_chain or
amdgpu_cs_chain_preserve calling conventions.

Differential Revision: https://reviews.llvm.org/D156410
2023-08-21 12:37:32 +02:00
Simon Pilgrim
ba818c4019 [DAG] replaceStoreOfInsertLoad - don't fold if the inserted element is implicitly truncated
D152276 wasn't handling the case where the inserted element is implicitly truncated into the vector - resulting in a i1 element (implicitly truncated from i8) overwriting 8 bits instead of 1 bit.

This patch is intended to be merged into 17.x so I've just disallowed any vector element vs inserted element type mismatch - technically we could be more elegant and permit truncated stores (as long as the store is still byte sized), but the use cases for that are so limited I'd prefer to play it safe for now.

Candidate patch for #64655 17.x merge

Differential Revision: https://reviews.llvm.org/D158366
2023-08-21 11:22:07 +01:00
Nikita Popov
69bd66b3ce [Tests] Remove some and/or constant expressions in tests (NFC)
In preparation for their removal in D158081.
2023-08-21 12:05:32 +02:00
Nikita Popov
fa1b6e6b34 [X86] Fix i128 argument passing under SysV ABI
The x86_64 SysV ABI specifies that __int128 is passed either in
two registers (if available) or in a 16 byte aligned stack slot.
GCC implements this behavior. However, if only one free register
is available, LLVM will instead pass one half of the i128 in a
register, and the other on the stack.

Make sure that either both are passed in registers or both on the
stack.

Fixes https://github.com/llvm/llvm-project/issues/41784.
The patch is basically what craig.topper proposed to do there.

Differential Revision: https://reviews.llvm.org/D158169
2023-08-21 11:44:35 +02:00
Diana Picus
26dc284498 [AMDGPU] ISel for amdgpu_cs_chain[_preserve] functions
Lower formal arguments and returns for functions with the
`amdgpu_cs_chain` and `amdgpu_cs_chain_preserve` calling conventions:

* Put `inreg` arguments into SGPRs, starting at s0, and other arguments
into VGPRs, starting at v8. No arguments should end up on the stack, if
we don't have enough registers we should error out.

* Lower the return (which is always void) as an S_ENDPGM.

* Set the ScratchRSrc register to s48:51, as described in the docs.

* Set the SP to s32, matching amdgpu_gfx. This might be revisited in a
future patch.

Differential Revision: https://reviews.llvm.org/D153517
2023-08-21 11:16:17 +02:00
Tuan Chuong Goh
a40c984976 [AArch64][GlobalISel] Support more legal types for EXTEND
Expand (s/z/any)ext instructions to be compatible with more
types for GlobalISel.
This patch mainly focuses on 64-bit and 128-bit vectors with
element size of powers of 2.
It also notably handles larger than legal vectors.

Differential Revision: https://reviews.llvm.org/D157113
2023-08-21 09:51:17 +01:00