52796 Commits

Author SHA1 Message Date
spupyrev
cc2fbc648d
[CodeLayout] Faster basic block reordering, ext-tsp (#68617)
Aggressive inlining might produce huge functions with >10K basic
blocks. Since BFI treats _all_ blocks and jumps as "hot" having
non-negative (but perhaps small) weight, the current implementation can
be slow, taking minutes to produce a layout. This change introduces a
few modifications that significantly (up to 50x on some instances)
speed up the computation. Some notable changes:
- reduced the maximum chain size to 512 (from the prior 4096);
- introduced MaxMergeDensityRatio param to avoid merging chains with
very different densities;
- dropped a couple of params that seem unnecessary.

Looking at some "offline" metrics (e.g., the number of created 
fall-throughs), there shouldn't be problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure a perf
difference for such small changes. I did test the performance of a clang-14
binary and did not record any perf or i-cache-related differences.

My 6 benchmarks, with ext-tsp runtime (the lower the better) and
"tsp-score" (the higher the better).
**Before**:

- benchmark 1:
  num functions: 13,047
  reordering running time is 2.4 seconds
  score: 125503458 (128.3102%)
- benchmark 2:
  num functions: 16,438
  reordering running time is 3.4 seconds
  score: 12613997277 (129.7495%)
- benchmark 3:
  num functions: 12,359
  reordering running time is 1.9 seconds
  score: 1315881613 (105.8991%)
- benchmark 4:
  num functions: 96,588
  reordering running time is 7.3 seconds
  score: 89513906284 (100.3413%)
- benchmark 5:
  num functions: 1
  reordering running time is 372 seconds
  score: 21292505965077 (99.9979%)
- benchmark 6:
  num functions:  71,155
  reordering running time is 314 seconds
  score: 29795381626270671437824 (102.7519%)

**After**:
- benchmark 1:
  reordering running time is 2.2 seconds
  score: 125510418 (128.3130%)

- benchmark 2:
  reordering running time is 2.6 seconds
  score: 12614502162 (129.7525%)

- benchmark 3:
  reordering running time is 1.6 seconds
  score: 1315938168 (105.9024%)

- benchmark 4:
  reordering running time is 4.9 seconds
  score: 89518095837 (100.3454%)

- benchmark 5:
  reordering running time is 4.8 seconds
  score: 21292295939119 (99.9971%)

- benchmark 6:
  reordering running time is 104 seconds
  score: 29796710925310302879744 (102.7565%)
2023-10-25 07:52:26 -07:00
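The two merge restrictions described in cc2fbc648d can be illustrated with a minimal sketch. This is not the LLVM implementation: the `Chain` class, the `may_merge` helper, the definition of density as execution count per byte, and the interpretation of the 512 cap as a block count are all assumptions made for illustration. Only the value 512 and the parameter name MaxMergeDensityRatio come from the commit message.

```python
from dataclasses import dataclass

MAX_CHAIN_SIZE = 512  # new cap from the commit message (assumed to count blocks)


@dataclass
class Chain:
    """Hypothetical stand-in for a chain of basic blocks."""
    num_blocks: int    # number of basic blocks in the chain
    size_bytes: int    # total code size of the chain
    exec_count: float  # total execution count of the chain

    @property
    def density(self) -> float:
        # Assumed definition: execution count per byte of code.
        return self.exec_count / self.size_bytes


def may_merge(a: Chain, b: Chain, max_merge_density_ratio: float) -> bool:
    """Return True if the two chains remain candidates for merging."""
    # Skip merges that would exceed the chain-size cap.
    if a.num_blocks + b.num_blocks > MAX_CHAIN_SIZE:
        return False
    lo, hi = sorted((a.density, b.density))
    # Skip merges of chains whose densities differ too much.
    return hi <= max_merge_density_ratio * lo
```

With a hypothetical ratio of 100, a chain of density 100 may merge with one of density 2, but not with one of density 0.01. Pruning such pairs early is one way fewer candidate merges would need to be scored.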
Vladislav Dzhidzhoev
2dea7bd8a0 [AArch64][GlobalISel] Legalize NEON smin,smax,umin,umax,fmin,fmax intrinsics
Replace these intrinsics with the corresponding GISel operators during
legalization stage to reuse available selection patterns.
2023-10-25 16:02:44 +02:00
Simon Pilgrim
ac534d2a16 [X86] combineArithReduction - use PACKUSWB directly for PSADBW(TRUNCATE(v8i16 X)) reduction patterns
Avoids a crash in the D152928 patch due to a reduction pattern appearing after legalization

We can probably extend this further to avoid truncating to sub-128-bit vXi8 (and then calling WidenToV16I8) entirely, but we can't currently hit other cases.
2023-10-25 14:56:58 +01:00
Nikita Popov
d9cfb82207 [AArch64] Add test for #70207 (NFC) 2023-10-25 15:43:33 +02:00
Simon Pilgrim
a8913f8e04 [X86] Regenerate pr38539.ll
Even though we're only interested in the X64 codegen for the first test, it's much easier to maintain if we just let the update script generate the codegen checks for X86 as well.
2023-10-25 13:25:15 +01:00
Jay Foad
c82ebfb97a Revert "[AMDGPU] Accept arbitrary sized sources in CalculateByteProvider"
This reverts commit ef33659492325de7871c8c85e35bd9c1c37f7347.

It was causing incorrect codegen for some Vulkan CTS tests.
2023-10-25 11:11:27 +01:00
Momchil Velikov
9d35387811
[AArch64] Disable by default MachineSink sink-and-fold (#70101)
There is a report about a large compile time regression in V8 when
generating debug info.
2023-10-25 10:58:31 +01:00
Oliver Stannard
7e8eccd990 [AArch64] Move SLS later in pass pipeline
Currently, the SLS hardening pass is run before the machine outliner,
which means that the outliner creates new functions and calls which do
not have the SLS hardening applied.

The fix for this is to move the SLS passes to after the outliner, as has
recently been done for the return address signing pass.

This also avoids a bug where the SLS outliner emits code with
instructions after a return, which the outliner doesn't correctly
handle.

Reviewed By: kristof.beyls

Differential Revision: https://reviews.llvm.org/D158511
2023-10-25 10:45:12 +01:00
Oliver Stannard
5640d28201 [AArch64] Add test showing incorrect code-gen
Differential Revision: https://reviews.llvm.org/D158512
2023-10-25 10:28:50 +01:00
Craig Topper
34af57c5c1 [RISCV][GISel] Add G_SEXTLOAD to legalizer and regbank select. Add instruction selection tests.
This updates our G_SEXTLOAD support to the same level as G_ZEXTLOAD.
Still missing some legalizer rules for both though.
2023-10-25 00:13:21 -07:00
Craig Topper
35d771fd4f [RISCV][GISel] Fix failure to legalize non-power of 2 shifts between i32 and i64 on RV64.
We weren't legalizing the shift amount to i64.
2023-10-24 23:30:41 -07:00
Matthias Braun
e3cf80c5c1
BlockFrequencyInfoImpl: Avoid big numbers, increase precision for small spreads
BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that:

* Avoid big numbers close to UINT64_MAX to avoid users overflowing/saturating when adding multiple frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room.
* Spread the difference between hottest/coldest block as much as possible to increase precision.
* If the hot/cold spread cannot be represented, lose precision at the lower end, but keep the frequencies at the upper end for hot blocks differentiable.
2023-10-24 20:27:39 -07:00
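The conversion goals listed in e3cf80c5c1 can be sketched as follows. This is an illustration, not the BlockFrequencyInfoImpl code: `to_integer_frequencies` is a hypothetical helper, exact fractions stand in for Scaled64, and clamping very cold blocks to 1 is an assumption about how precision is lost at the lower end.

```python
from fractions import Fraction

SLACK_BITS = 10  # topmost bits left unused, as stated in the commit message


def to_integer_frequencies(freqs):
    """Map positive relative frequencies to integers with 10 bits of headroom."""
    exact = [Fraction(f) for f in freqs]
    # Scale so the hottest block lands at 2**(64 - SLACK_BITS).
    factor = Fraction(2 ** (64 - SLACK_BITS)) / max(exact)
    # Assumption: blocks too cold to represent clamp to 1 rather than 0.
    return [max(1, int(f * factor)) for f in exact]
```

Anchoring the scale at the hottest block keeps hot blocks distinguishable from each other, while the headroom lets a caller add many frequencies together before reaching UINT64_MAX.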
Ruiling, Song
ac24238002
[LowerSwitch] Don't let pass manager handle the dependency (#68662)
Some passes have the limitation that they only support simple terminators:
branch/unreachable/return. Right now, they ask the pass manager to add the
LowerSwitch pass to eliminate `switch`. Let's manage this kind of pass
dependency ourselves. Also add assertions in the related passes.
2023-10-25 09:24:36 +08:00
Min-Yih Hsu
cdcaef876c
[RISCV][GISel] Add ISel support for SHXADD_UW and SLLI.UW (#69972)
This patch also includes:
- Remove legacy non_imm12 PatLeaf from RISCVInstrInfoZb.td
- Implement a custom GlobalISel operand renderer for TrailingZeros
SDNodeXForm
2023-10-24 16:26:38 -07:00
Luke Lau
b2accb9d8e
[RISCV] Mark V0 regclasses as larger superclasses of non-V0 classes (#70109) 2023-10-24 22:13:17 +01:00
Amara Emerson
1b11729dc0
[AArch64][GlobalISel] Add support for post-indexed loads/stores. (#69532)
Gives small code size improvements across the board at -Os CTMark.

Much of the work is porting the existing heuristics in the DAGCombiner.
2023-10-24 13:51:59 -07:00
Benjamin Kramer
4c600bd117 [NVPTX] Add a test to verify the .version with sm_90(a) 2023-10-24 18:01:48 +02:00
Mircea Trofin
ec0645939b Revert "[mlgo] Fix tests post 760e7d0"
This reverts commit ab91e05e48d9ea47b60858dc259bdbf00dfde7fa.

This is because 760e7d0 has been reverted in 3fb5b18.
2023-10-24 08:43:32 -07:00
Brandon Wu
7cce908367
[RISCV][GISel][NFC] Correct the test case in constant32.mir (#70003) 2023-10-24 22:57:05 +08:00
Simon Pilgrim
2df69ed14c [X86] Add scalar isel test coverage for AND/OR/XOR types
Even something as simple as bitlogic ops are showing differences between DAG/Fast/Global ISel - promotion, commutation, load/rmw folding etc.
2023-10-24 15:13:17 +01:00
Simon Pilgrim
f2eef3fab6 [DAG] Add test case for Issue #69965 2023-10-24 13:58:18 +01:00
Benjamin Kramer
858d6a15a0 [wasm] Don't crash on non-simple value types during shuffle combine
These still exist during the DAGCombine phase.
2023-10-24 12:35:43 +02:00
Stanislav Mekhanoshin
945e943db7
[AMDGPU] Fix subreg check in the SIFixSGPRCopies (#70007)
It checks for the copy of subregs, but it checks destination which may
never happen in SSA. It misses the subreg check and happily produces
S_MOV_B64 out of a subreg COPY.

The affected test should have never been formed in the first place
because the pass is running in SSA and copies into a subreg shall never
happen.
2023-10-24 01:44:58 -07:00
Nikita Popov
eb86de63d9
[IR] Require that ptrmask mask matches pointer index size (#69343)
Currently, we specify that the ptrmask intrinsic allows the mask to have
any size, which will be zero-extended or truncated to the pointer size.

However, what the semantics of the specified GEP expansion actually imply is
that the mask is only meaningful up to the pointer type *index* size --
any higher bits of the pointer will always be preserved. In other words,
the mask gets 1-extended from the index size to the pointer size. This
is also the behavior we want for CHERI architectures.

This PR makes two changes:
* It spells out the interaction with the pointer type index size more
explicitly.
* It requires that the mask matches the pointer type index size. The
intention here is to make handling of this intrinsic more robust, to
avoid accidental mix-ups of pointer size and index size in code
generating this intrinsic. If a zero-extend or truncate of the mask is
desired, it should just be done explicitly in IR. This also cuts down on
the amount of testing we have to do, and things transforms need to
check for.

As far as I can tell, we don't actually support pointers with different
index type size at the SDAG level, so I'm just asserting the sizes match
there for now. Out-of-tree targets using different index sizes may need
to adjust that code.
2023-10-24 09:54:29 +02:00
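The "1-extended" mask semantics described in eb86de63d9 can be modeled on plain integers. This is a sketch under stated assumptions: `ptrmask` here is a hypothetical Python function, not the intrinsic, and pointers are treated as unsigned integers of `ptr_bits` width.

```python
def ptrmask(ptr: int, mask: int, ptr_bits: int, index_bits: int) -> int:
    """Model ptrmask with a mask that is exactly index_bits wide."""
    assert 0 < index_bits <= ptr_bits
    assert 0 <= mask < (1 << index_bits), "mask must match the index size"
    ptr_all_ones = (1 << ptr_bits) - 1
    index_all_ones = (1 << index_bits) - 1
    # 1-extend the mask: every bit above the index size is set,
    # so the high bits of the pointer are always preserved.
    extended = mask | (ptr_all_ones & ~index_all_ones)
    return ptr & extended
```

For a 64-bit pointer with a 32-bit index size, `ptrmask(0xABCD123456789ABF, 0xFFFFFFF0, 64, 32)` clears only the low four bits and leaves the upper 32 bits untouched.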
Mogball
3fb5b18e81 Revert 24633ea and 760e7d0 "Enable FoldImmediate for X86"
This reverts commits 24633eac38d46cd4b253ba53258165ee08d886cd
and 760e7d00d142ba85fcf48c00e0acc14a355da7c3.

I have confirmed that these commits are introducing a new crash in the
peephole optimizer. I have minimized a test case, which you can find
below.

```llvmir
; ModuleID = 'bugpoint-reduced-simplified.bc'
source_filename = "/mnt/big/modular/Kernels/mojo/Mogg/MOGG.mojo"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

declare dso_local void @foo({ { ptr, [4 x i64], [4 x i64], i1 }, { ptr, [4 x i64], [4 x i64], i1 } }, { ptr }, { ptr, i64, i8 })

define dso_local void @bad_fn(ptr %0, ptr %1, ptr %2) {
  %4 = load i64, ptr null, align 8
  %5 = insertvalue [4 x i64] poison, i64 12, 1
  %6 = insertvalue [4 x i64] %5, i64 poison, 2
  %7 = insertvalue [4 x i64] %6, i64 poison, 3
  %8 = insertvalue { ptr, [4 x i64], [4 x i64], i1 } poison, [4 x i64] %7, 1
  %9 = insertvalue { ptr, [4 x i64], [4 x i64], i1 } %8, [4 x i64] poison, 2
  %10 = insertvalue { ptr, [4 x i64], [4 x i64], i1 } %9, i1 poison, 3
  %11 = icmp ne i64 %4, 1
  %12 = or i1 false, %11
  %13 = select i1 %12, i64 %4, i64 0
  %14 = zext i1 %12 to i64
  %15 = insertvalue [4 x i64] poison, i64 12, 1
  %16 = insertvalue [4 x i64] %15, i64 poison, 2
  %17 = insertvalue [4 x i64] %16, i64 %13, 3
  %18 = insertvalue [4 x i64] poison, i64 %14, 3
  %19 = icmp eq i64 0, 0
  %20 = icmp eq i64 0, 0
  %21 = icmp eq i64 %13, 0
  %22 = and i1 %20, %19
  %23 = select i1 %22, i1 %21, i1 false
  %24 = select i1 %23, i1 %12, i1 false
  %25 = insertvalue { ptr, [4 x i64], [4 x i64], i1 } poison, [4 x i64] %17, 1
  %26 = insertvalue { ptr, [4 x i64], [4 x i64], i1 } %25, [4 x i64] %18, 2
  %27 = insertvalue { ptr, [4 x i64], [4 x i64], i1 } %26, i1 %24, 3
  %28 = insertvalue { { ptr, [4 x i64], [4 x i64], i1 }, { ptr, [4 x i64], [4 x i64], i1 } } undef, { ptr, [4 x i64], [4 x i64], i1 } %10, 0
  %29 = insertvalue { { ptr, [4 x i64], [4 x i64], i1 }, { ptr, [4 x i64], [4 x i64], i1 } } %28, { ptr, [4 x i64], [4 x i64], i1 } %27, 1
  br label %31

30:                                               ; preds = %3
  br label %softmax_pass

31:                                               ; preds = %31
  %exitcond.not.i = icmp eq i64 poison, 3
  br i1 %exitcond.not.i, label %37, label %31

32:                                               ; preds = %31
  br i1 poison, label %34, label %33

33:                                               ; preds = %32
  br label %34

34:                                               ; preds = %33, %32
  br i1 poison, label %35, label %36

35:                                               ; preds = %34
  br label %softmax_pass

36:                                               ; preds = %34
  br i1 poison, label %37, label %.critedge.i

37:                                               ; preds = %36
  br i1 poison, label %38, label %.critedge.i

38:                                               ; preds = %37
  br i1 poison, label %40, label %39

39:                                               ; preds = %38
  br label %40

40:                                               ; preds = %39, %38
  br i1 poison, label %.lr.ph28.i, label %._crit_edge.i

.lr.ph28.i:                                       ; preds = %40
  br label %41

41:                                               ; preds = %51, %.lr.ph28.i
  br i1 poison, label %.thread, label %42

42:                                               ; preds = %41
  br i1 poison, label %43, label %44

43:                                               ; preds = %42
  br label %45

44:                                               ; preds = %42
  br label %45

45:                                               ; preds = %44, %43
  br i1 poison, label %46, label %.thread

46:                                               ; preds = %45
  br label %47

.thread:                                          ; preds = %45, %41
  br label %47

47:                                               ; preds = %.thread, %46
  br i1 poison, label %51, label %48

48:                                               ; preds = %47
  br i1 poison, label %49, label %50

49:                                               ; preds = %48
  br label %51

50:                                               ; preds = %48
  br label %51

51:                                               ; preds = %50, %49, %47
  call void @foo({ { ptr, [4 x i64], [4 x i64], i1 }, { ptr, [4 x i64], [4 x i64], i1 } } %29, { ptr } poison, { ptr, i64, i8 } poison)
  br i1 poison, label %._crit_edge.i, label %41

._crit_edge.i:                                    ; preds = %51, %40
  br label %softmax_pass

.critedge.i:                                      ; preds = %37, %36
  br i1 poison, label %.lr.ph.i, label %softmax_pass

.lr.ph.i:                                         ; preds = %.lr.ph.i, %.critedge.i
  store { ptr, [4 x i64], [4 x i64], i1 } %10, ptr poison, align 8
  br i1 poison, label %.lr.ph.i, label %softmax_pass

softmax_pass:                                     ; preds = %.lr.ph.i, %.critedge.i, %._crit_edge.i, %35, %30
  ret void
}
```
2023-10-24 07:08:38 +00:00
pvanhout
300190ffa7 [AMDGPU] Regenerate udiv.ll 2023-10-24 07:59:41 +02:00
Pierre van Houtryve
2bc93584f5
[DAG] Constant Folding for U/SMUL_LOHI (#69437) 2023-10-24 07:37:55 +02:00
huhu233
dbe8def9cc
[AArch64] Lower mathlib call ldexp into fscale when sve is enabled (#67552)
The function of 'fscale' is equivalent to mathlib call ldexp, but has
better performance. This patch lowers ldexp into fscale when sve is
enabled.
2023-10-24 10:17:04 +08:00
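The equivalence that dbe8def9cc relies on is that ldexp scales a value by a power of two. A minimal scalar check of that identity, assuming nothing about the SVE instruction beyond what the commit message states (`scale_by_power_of_two` is a hypothetical reference model, not part of the lowering):

```python
import math


def scale_by_power_of_two(x: float, exponent: int) -> float:
    """Reference model: multiply x by 2**exponent."""
    return x * (2.0 ** exponent)


# ldexp agrees with the reference model on a few exactly representable cases.
for x, n in [(3.0, 4), (0.75, -2), (-1.5, 10)]:
    assert math.ldexp(x, n) == scale_by_power_of_two(x, n)
```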
Evgenii Kudriashov
cc455033d4
[X86][GlobalISel] Reorganize shift scalar tests (NFC) (#68232)
Removed duplicated tests from GlobalISel directory
2023-10-24 02:00:30 +02:00
Jeffrey Byrnes
ef33659492 [AMDGPU] Accept arbitrary sized sources in CalculateByteProvider
This allows working with e.g. v8i8 / v16i8 sources.

It is generally useful, but is primarily beneficial when allowing e.g. v8i8s to be passed to branches directly through registers. As such, this is the first in a series of patches to enable that work. However, it affects https://reviews.llvm.org/D155995, so it has been implemented on top of that.

Differential Revision: https://reviews.llvm.org/D159036

Change-Id: Idfcb57dacd0c32cab040fe4dd4ac2ec762750664
2023-10-23 16:07:54 -07:00
Artem Belevich
6115b1b907
[NVPTX] Add lowering for bitcasts float<->v4i8 (#69960)
.. and move bitcast from a constant for integer-based types into a
better suited location. It solves the mystery of why we sometimes used
`mov.u32` and sometimes `mov.b32` for loading constants. Now they all
should use `.b32`
2023-10-23 13:54:39 -07:00
alex-t
2973febe10
[AMDGPU] Force the third source operand of the MAI instructions to VGPR if no AGPRs are used. (#69720)
eaf85b9c28 "[AMDGPU] Select VGPR versions of MFMA if possible" prevents
the compiler from reserving AGPRs if a kernel has no inline asm
explicitly using AGPRs, no calls, and runs at least 2 waves with not
more than 256 VGPRs. This, in turn, makes it impossible to allocate AGPR
if necessary. As a result, regalloc fails in case we have an MAI
instruction that has at least one AGPR operand.
This change checks if we have AGPRs and forces operands to VGPR if we do
not have them.

---------

Co-authored-by: Alexander Timofeev <alexander.timofeev@amd.com>
2023-10-23 19:41:07 +02:00
Philip Reames
25da9bb7d4
[RISCV] Allow swapped operands in reduction formation (#68634)
Very straightforward, but worth landing on its own in advance of a
more complicated generalization.
2023-10-23 10:37:56 -07:00
David Green
2e69407547
[AArch64] Don't generate st2 for 64bit store that can use stp (#69901)
D142966 made it so that st2 that do not start at element 0 use zip2
instead of st2. This extends that to any 64bit store that has a nearby
load that can better become a LDP operation, which is expected to have a
higher throughput. It searches up to 20 instructions away for a store to
p+16 or p-16.
2023-10-23 18:15:36 +01:00
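The bounded neighbor search mentioned in 2e69407547 can be sketched as a window scan. This is an illustration only: the `Store` tuple, the use of `None` for unrelated instructions, and scanning in both directions are assumptions. The window of 20 instructions and the p+16 / p-16 offsets come from the commit message.

```python
from typing import NamedTuple, Optional, Sequence

SEARCH_LIMIT = 20  # instructions scanned, per the commit message


class Store(NamedTuple):
    """Hypothetical record of a store through base + offset."""
    base: str    # symbolic base pointer
    offset: int  # byte offset from the base


def has_pairable_neighbor(insts: Sequence[Optional[Store]], idx: int) -> bool:
    """Check for a store to p+16 or p-16 within SEARCH_LIMIT instructions."""
    here = insts[idx]
    assert here is not None, "idx must point at a store"
    lo = max(0, idx - SEARCH_LIMIT)
    hi = min(len(insts), idx + SEARCH_LIMIT + 1)
    for j in range(lo, hi):
        other = insts[j]
        if j == idx or other is None:
            continue  # skip the store itself and non-store instructions
        if other.base == here.base and abs(other.offset - here.offset) == 16:
            return True
    return False
```

Capping the scan keeps the check cheap on long blocks, at the cost of missing candidates that sit further away.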
Sundeep
4554eac5d4
Update call-long1.ll
[llvm][test][Hexagon] NFC: test commit
2023-10-23 11:55:42 -05:00
Michael Maitland
4458ba8cef
[RISCV][GISel] Select G_SELECT (G_ICMP, A, B) (#68247)
If MI is a G_SELECT(G_ICMP(tst, A, B), C, D) then we can use (A, B, tst)
as the (LHS, RHS, CC) of the Select_GPR_Using_CC_GPR.
2023-10-23 10:07:15 -04:00
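The operand mapping described in 4458ba8cef can be sketched as a small pattern match. The `ICmp` and `Select` tuples are hypothetical stand-ins for the generic machine instructions, and the fallback for a non-compare condition (testing the condition against the zero register) is an assumption, not something the commit message states.

```python
from typing import NamedTuple, Union


class ICmp(NamedTuple):
    pred: str  # condition code, e.g. "slt"
    lhs: str
    rhs: str


class Select(NamedTuple):
    cond: Union[ICmp, str]  # either a folded compare or a plain register
    true_val: str
    false_val: str


def select_operands(sel: Select):
    """Return (LHS, RHS, CC, TrueV, FalseV) for a Select_GPR_Using_CC_GPR."""
    if isinstance(sel.cond, ICmp):
        # G_SELECT(G_ICMP(tst, A, B), C, D) -> (A, B, tst) feeds the pseudo.
        c = sel.cond
        return (c.lhs, c.rhs, c.pred, sel.true_val, sel.false_val)
    # Assumed fallback: compare the condition register against zero.
    return (sel.cond, "x0", "ne", sel.true_val, sel.false_val)
```

Folding the compare into the select avoids materializing the boolean result in a register first.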
Igor Kirillov
b507509f6a
[AArch64] Allow SVE code generation for fixed-width vectors (#67122)
This patch allows the generation of SVE code with masks that mimic Neon.
2023-10-23 12:41:34 +01:00
Hans Wennborg
e2fc68c3db Typos: 'maxium', 'minium' 2023-10-23 10:42:28 +02:00
Nikita Popov
2ad9fde418
[MemDep] Use EarliestEscapeInfo (#69727)
Use BatchAA with EarliestEscapeInfo instead of callCapturesBefore() in
MemDepAnalysis. The advantage of this is that it will also take
not-captured-before information into account for non-calls (see
test_store_before_capture for a representative example), and that this
is a cached analysis. The disadvantage is that EII is slightly less
precise than full CapturedBefore analysis.

In practice the impact is positive, with gvn.NumGVNLoad going from 22022
to 22808 on test-suite. 

The impact to compile-time is also positive, mainly in the ThinLTO
configuration.
2023-10-23 09:57:26 +02:00
Jon Roelofs
461918e290
[CodeGen][Remarks] Add the function name to the stack size remark (#69346)
It is already present in the yaml, but missing from the printed
diagnostics.
2023-10-22 11:39:02 -07:00
Ashley Nelson
47f0f8ca47
[WebAssembly] Add exp10 libcall signatures (#69661)
The llvm.exp.* family of intrinsics and their corresponding libcalls
were recently added, which means we need to know their signatures.
2023-10-20 12:15:48 -07:00
Craig Topper
cfdafc1e70
[RISCV][GISel] Support G_PTRTOINT and G_INTTOPTR (#69542)
Legalizer, register bank selection, and instruction selection.
2023-10-20 12:03:09 -07:00
Craig Topper
f533e8ca9f Recommit "[RISCV][GISel] Disable call lowering for integers larger than 2*XLen. (#69144)"
Remove bad test for >2x XLen scalar. Don't restrict struct returns if they aren't homogenous.

Original commit message:
Types larger than 2*XLen are passed indirectly which is not supported
yet. Currently, we will incorrectly pass X10 multiple times.
2023-10-20 11:51:47 -07:00
Luke Lau
b4729f79ed
[RISCV] Use LMUL=1 for vmv_s_x_vl with non-undef passthru (#66659)
We currently shrink the type of vmv_s_x_vl to LMUL=1 when its passthru is
undef to avoid constraining the register allocator since it ignores LMUL.
This patch relaxes it for non-undef passthrus, which occurs when lowering
insert_vector_elt.
2023-10-20 14:19:04 -04:00
weiguozhi
24633eac38
[Peephole] Check instructions from CopyMIs are still COPY (#69511)
Function foldRedundantCopy records COPY instructions in CopyMIs and uses
them later. But other optimizations may delete or modify them. So before
using one we should check that the extracted instruction still exists and
is still a COPY instruction.
2023-10-20 08:34:43 -07:00
Ivan Kosarev
b6ecdf0a6b
[AMDGPU] Segregate 16-bit fix-sgpr-copies tests. (#69353)
The 16-bit instructions used in them are not available on the generic
target. This patch makes them run for GFX11.
2023-10-20 14:09:30 +01:00
Wang Pengcheng
f24d9490e5
[RISCV] Match prefetch address with offset (#66072)
A new ComplexPattern `AddrRegImmLsb00000` is added, which is like
`AddrRegImm` except that if the least significant 5 bits aren't all
zeros, we will fall back to offset 0.
2023-10-20 14:22:48 +08:00
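The fallback rule described in f24d9490e5 can be sketched as a check on the offset. `split_prefetch_offset` is a hypothetical helper, and the signed 12-bit range check is an assumption carried over from what a register+immediate pattern would normally require. The commit message only specifies the low-5-bits condition and the fallback to offset 0.

```python
def split_prefetch_offset(offset: int):
    """Return (folded_imm, remainder) for a base + offset prefetch address.

    folded_imm goes into the instruction; remainder must be added to the
    base register separately.
    """
    fits_simm12 = -2048 <= offset <= 2047    # assumed immediate range
    low5_clear = (offset & 0b11111) == 0     # least significant 5 bits are zero
    if fits_simm12 and low5_clear:
        return offset, 0
    # Otherwise fall back to offset 0 and keep the whole offset in the base.
    return 0, offset
```

For example, an offset of 64 folds into the instruction, while 33 does not because its low 5 bits are not all zero.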
Wang Pengcheng
af3ead4ccf
[RISCV] Add more prefetch tests (#67644)
We should be able to merge the offset later.
2023-10-20 14:19:49 +08:00
yubingex007-a11y
f2517cbcee
[X86][AMX] remove related code of X86PreAMXConfigPass (#69569)
In https://reviews.llvm.org/D125075, we switched to use
FastPreTileConfig in O0 and abandoned X86PreAMXConfigPass.
We can now safely remove the code related to X86PreAMXConfigPass.
2023-10-20 13:43:34 +08:00
Brandon Wu
d1985e3d1f
[RISCV] Support Xsfvqmaccdod and Xsfvqmaccqoq extensions (#68295)
SiFive Int8 Matrix Multiplication Extensions Specification

https://sifive.cdn.prismic.io/sifive/c4f0e51d-4dd3-402a-98bc-1ffad6011259_int8-matmul-spec.pdf
2023-10-20 11:16:20 +08:00