52796 Commits

Author SHA1 Message Date
Simon Pilgrim
d2856ff457 [X86] Enable v32f16 FNEG custom lowering on AVX512 targets 2023-11-30 10:07:01 +00:00
Simon Pilgrim
8851e411ed [X86] Enable v16f16 FNEG custom lowering on AVX targets 2023-11-30 10:07:01 +00:00
Simon Pilgrim
06b9d92e0e [X86] Enable v8f16 FABS custom lowering on SSE2 targets 2023-11-30 10:07:00 +00:00
Simon Pilgrim
abc60e9808 [X86] vec_fabs.ll - add SSE test coverage 2023-11-30 10:07:00 +00:00
Simon Pilgrim
9d4c3e9035 [X86] Enable v8f16 FNEG custom lowering 2023-11-30 10:07:00 +00:00
Shengchen Kan
eb64697a7b [X86][Codegen] Correct the domain of VP2INTERSECT
GenericDomain -> SSEPackedInt

Found by #73654
2023-11-30 17:56:21 +08:00
wanglei
b72456120f
[LoongArch] Add codegen support for extractelement (#73759)
Add codegen support for extractelement when the `lsx` or `lasx`
feature is enabled.
2023-11-30 17:29:18 +08:00
Shengchen Kan
511ba45a47
[X86][MC][CodeGen] Support EGPR for KMOV (#73781)
KMOV is essential for copying between k-registers and GPRs.
R16-R31 were added to the GPRs in #70958, so we extend KMOV for these new
registers first.

This patch
1. Promotes KMOV instructions from VEX space to EVEX space
2. Emits prefix {evex} for the EVEX variants
3. Prefers the EVEX variant over the VEX variant in ISEL and optimizations
for better RA

EVEX variants will be compressed to VEX variants by existing EVEX2VEX
pass if no EGPR is used.

RFC:
https://discourse.llvm.org/t/rfc-design-for-apx-feature-egpr-and-ndd-support/73031/4
TAG: llvm-test-suite && CPU2017 can be built with feature egpr
successfully.
2023-11-30 16:13:51 +08:00
Pierre van Houtryve
8a66510fa7
[AMDGPU] Don't create mulhi_24 in CGP (#72983)
Instead, create a mul24 with a 64 bit result and let ISel take care of
it.

This allows patterns to simply match mul24 even for 64-bit muls instead of having to match both mul/mulhi and a buildvector/bitconvert/etc.
2023-11-30 08:26:45 +01:00
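A minimal sketch of the arithmetic behind 8a66510fa7, assuming the unsigned flavour of the operation. The helper names are hypothetical and are not LLVM APIs. It shows that a single 24-bit multiply with a 64-bit result carries the same information as a separate low half and a mulhi_24-style high half.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mul_u24_64(a, b):
    """Multiply the low 24 bits of a and b; the product fits in 48 bits."""
    return (a & MASK24) * (b & MASK24)


def mul_u24_lo(a, b):
    """Low 32 bits of the 24-bit product (what a 32-bit mul24 would return)."""
    return mul_u24_64(a, b) & MASK32


def mulhi_u24(a, b):
    """High 32 bits of the 24-bit product (the mulhi_24-style half)."""
    return mul_u24_64(a, b) >> 32


def rebuild_from_halves(a, b):
    """Reassemble the 64-bit value from the two 32-bit halves."""
    return (mulhi_u24(a, b) << 32) | mul_u24_lo(a, b)
```

Matching only the 64-bit form avoids having to recognise the pair of halves plus the glue that joins them.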
Craig Topper
ce570d1a22 [RISCV] Remove old FIXMEs from test. NFC 2023-11-29 21:53:32 -08:00
Kai Luo
afd9582b36 [PowerPC] Enhance test for PR #73609. NFC. 2023-11-30 05:06:29 +00:00
Philip Reames
e947f95337 [LSR][TTI][RISCV] Enable terminator folding for RISC-V
If looking for a miscompile revert candidate, look here!

The transform being enabled prefers comparing to a loop invariant
exit value for a secondary IV over using an otherwise dead primary
IV.  This increases register pressure (by requiring the exit value
to be live through the loop), but reduces the number of instructions
within the loop by one.

On RISC-V which has a large number of scalar registers, this is
generally a profitable transform.  We lose the ability to use a beqz
on what is typically a count down IV, and pay the cost of computing
the exit value on the secondary IV in the loop preheader, but save
an add or sub in the loop body.  For anything except an extremely
short running loop, or one with extreme register pressure, this is
profitable.  On spec2017, we see a 0.42% geomean improvement in
dynamic icount, with no individual workload regressing by more than
0.25%.

Code size wise, we trade a (possibly compressible) beqz and a (possibly
compressible) addi for an uncompressible beq.  We also add instructions
in the preheader.  Net result is a slight regression overall, but
neutral or better inside the loop.

Previous versions of this transform had numerous corner-case correctness
bugs.  All of the ones I can spot by inspection have been fixed, and I
have run this through all of spec2017, but there may be further issues
lurking.  Adding uses to an IV is a fraught thing to do given poison
semantics, so this transform is somewhat inherently risky.

This patch is a reworked version of D134893 by @eop.  That patch has
been abandoned since May, so I picked it up, reworked it a bit, and
am landing it.
2023-11-29 12:04:06 -08:00
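A source-level sketch of the loop shape that e947f95337 describes, written in Python purely for illustration. The function names are hypothetical, and the real transform operates on LLVM IR induction variables rather than on source code. The first loop keeps a count-down primary IV only for the exit test. The second compares the secondary IV against a loop-invariant exit value computed before the loop.

```python
def sum_primary_iv(data):
    """Before: a count-down IV exists only to drive the exit test."""
    remaining = len(data)  # primary IV, otherwise dead
    pos = 0                # secondary IV, used for addressing
    total = 0
    while remaining != 0:
        total += data[pos]
        pos += 1
        remaining -= 1     # extra add/sub in the loop body
    return total


def sum_folded_terminator(data):
    """After: compare the secondary IV against an invariant exit value."""
    end = len(data)        # exit value, computed once in the preheader
    pos = 0
    total = 0
    while pos != end:      # `end` must stay live through the loop
        total += data[pos]
        pos += 1
    return total
```

The second form has one fewer update per iteration but needs one more value kept live, which is the register pressure trade-off the message mentions.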
David Li
f688e09012
Enable custom lowering of fabs_v16f16 with AVX and fabs_v32f16 with A… (#73565)
This is the last patch for fabs lowering. v32f16 works for AVX as well
with the patch (with type legalization).
2023-11-29 09:39:53 -08:00
Nick Desaulniers
b053359892
[X86InstrInfo] support memfold on spillable inline asm (#70832)
This enables -regalloc=greedy to memfold spillable inline asm
MachineOperands.

Because no instruction selection framework marks MachineOperands as
spillable, no language frontend can observe functional changes from this
patch. That will change once instruction selection frameworks are
updated.

Link: https://github.com/llvm/llvm-project/issues/20571
2023-11-29 08:18:51 -08:00
Simon Pilgrim
244389ad17 [X86] Add fneg vector test coverage 2023-11-29 15:03:26 +00:00
Simon Pilgrim
53f3e59e59 [X86] Rename vec_fneg.ll to combine-fneg.ll
These tests are for fneg canonicalization combines, not codegen coverage tests like most of the other vec_* test files.
2023-11-29 15:03:25 +00:00
Paschalis Mpeis
1bfb84b477
[NFC][TLI] Improve tests for ArmPL and SLEEF Intrinsics. (#73352)
Auto-generate test `armpl-intrinsics.ll` and simplify tests:
- Eliminate scalar tail with no tail-folding flag.
- Use active lane mask for shorter check lines (no long `shufflevectors`).
- Eliminate scalar loops by providing `noalias` to relevant arguments and
run `simplifycfg` to drop them.
- The update script now uses `@llvm.compiler.used` instead of a longer regex.
2023-11-29 11:19:10 +00:00
Simon Pilgrim
0fac9da734 [DAG] getNode() - relax (zext (trunc x)) -> x fold iff the upper bits are known zero.
Just leave the (zext (trunc (and x, c))) pattern which is still being used to create some zext_inreg patterns.
2023-11-29 10:38:11 +00:00
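A small numeric illustration of the condition in 0fac9da734, using non-negative Python integers as a stand-in for fixed-width values. The helpers are hypothetical, and the compiler's "known zero" is a static analysis result whereas this sketch simply inspects concrete values. Zero-extending a truncated value gives back the original exactly when the bits dropped by the truncation were already zero.

```python
def trunc(x, narrow_bits):
    """Keep only the low `narrow_bits` bits of x."""
    return x & ((1 << narrow_bits) - 1)


def zext(x):
    """Zero-extension leaves a non-negative integer's value unchanged."""
    return x


def upper_bits_are_zero(x, narrow_bits):
    """True when every bit above `narrow_bits` is zero."""
    return (x >> narrow_bits) == 0


def fold_is_valid(x, narrow_bits):
    """Check whether (zext (trunc x)) equals x for this value."""
    return zext(trunc(x, narrow_bits)) == x
```

For any non-negative `x`, `fold_is_valid(x, n)` and `upper_bits_are_zero(x, n)` agree.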
Simon Pilgrim
621183cb45 [X86] Add test case showing failure to remove unnecessary zext from address math
Thanks to @yubingex007-a11y for the original test case
2023-11-29 10:38:11 +00:00
Alex Bradbury
85c9c16895
[RISCV] Support load clustering in the MachineScheduler (off by default) (#73754)
This adds minimal support for load clustering, but disables it by
default. The intent is to iterate on the precise heuristic and the
question of turning this on by default in a separate PR. Although
previous discussion indicates hope that the MachineScheduler would
replace most uses of the SelectionDAG scheduler, it does seem most
targets aren't using MachineScheduler load clustering right now:
PPC+AArch64 seem to just use it to help with paired load/store formation
and although AMDGPU uses it for general clustering it also implements
ShouldScheduleLoadsNear for the SelectionDAG scheduler's clustering.
2023-11-29 10:01:55 +00:00
Qiu Chaofan
403ab9ac74 [NFC] Update X86 frem CodeGen case 2023-11-29 16:53:34 +08:00
David Green
b6ee831b59 [AArch64] Load/store optimizer fixes and cleanup.
This includes a couple of fixes after #71908 for bundles and some cleanup for
the debug output. One was an iterator type that asserted on bundles, the second
a rather subtle issue where forAllMIsUntilDef would hit the LdStLimit when
renaming registers, meaning the last instruction was not updated leaving an
invalid `ldp x6, x6` instruction.
2023-11-29 07:41:15 +00:00
Craig Topper
d345cfb55c [RISCV][GISel] Support s64 G_SELECT on RV32 with D extension.
We have to force the register bank to FPRB if the type is s64 and
the GPR is 32 bits.
2023-11-28 23:36:51 -08:00
wanglei
5e7e0d6032
[LoongArch] Fix pattern for FNMSUB_{S/D} instructions (#73742)
```
when a=c=-0.0, b=0.0:
-(a * b + (-c)) = -0.0
-a * b + c = 0.0
(fneg (fma a, b, (-c))) != (fma (fneg a), b, c)
```

See https://reviews.llvm.org/D90901 for a similar discussion on X86.
2023-11-29 15:21:21 +08:00
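The signed-zero case quoted in 5e7e0d6032 can be reproduced with ordinary IEEE-754 doubles. This is a minimal sketch with hypothetical function names. It uses a separate multiply and add rather than a fused operation, which makes no difference for these inputs because the product is exact.

```python
import math


def negated_fma(a, b, c):
    """-(a * b + (-c)), the form whose sign handling is the reference."""
    return -(a * b + (-c))


def fma_of_negated(a, b, c):
    """(-a) * b + c, the form the old pattern treated as equivalent."""
    return (-a) * b + c


def sign_of(x):
    """Return -1.0 or 1.0, distinguishing -0.0 from 0.0."""
    return math.copysign(1.0, x)


a, b, c = -0.0, 0.0, -0.0
reference = negated_fma(a, b, c)
alternative = fma_of_negated(a, b, c)
print(sign_of(reference), sign_of(alternative))
```

Both results compare equal with `==`, so the difference only shows up when the sign bit is inspected, as `sign_of` does here.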
Craig Topper
35db35b7cf [RISCV][GISel] Support G_FCOPYSIGN with F and D extension. 2023-11-28 21:50:04 -08:00
Yeting Kuo
f35c0f2f23
[RISCV] Refine pattern (select_cc seteq (and x, C), 0, 0, A) with Zbs. (#73746)
PR #72978 disabled transformation (select_cc seteq (and x, C), 0, 0, A)
-> (and (sra(shl x)), A) for better Zicond codegen. It still enables the
combine when C does not fit into 12 bits. This patch disables the combine
when Zbs is enabled.
2023-11-29 13:09:47 +08:00
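A sketch of why the shift-based form mentioned in f35c0f2f23 computes the same value as the select, assuming C has a single bit set and a 64-bit register. The shift amounts are an assumption made for illustration, since the message only gives the shape `(and (sra (shl x)), A)`.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def sra(x, shamt):
    """Arithmetic shift right of an XLEN-bit value held as an unsigned int."""
    if x & (1 << (XLEN - 1)):
        x -= 1 << XLEN          # reinterpret as a negative number
    return (x >> shamt) & MASK


def select_form(x, bit, a):
    """(select_cc seteq (and x, C), 0, 0, A) with C = 1 << bit."""
    return 0 if (x & (1 << bit)) == 0 else a


def shift_form(x, bit, a):
    """Move the tested bit to the sign position, smear it, then mask A."""
    shifted = (x << (XLEN - 1 - bit)) & MASK
    return sra(shifted, XLEN - 1) & a
```

The arithmetic shift turns the tested bit into an all-ones or all-zero mask, so the final AND yields either A or 0.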
Ruiling, Song
c1511a65d5
[AMDGPU] Folding imm offset in more cases for scratch access (#70634)
For scratch load/store, our hardware only accepts non-negative values in
SGPR/VGPR. Besides the case that we can prove from known bits, we can
also prove that the value in `base` will be non-negative: 1.) When the
ADD for the address calculation has the NoUnsignedWrap flag. 2.) When the
immediate offset is already negative.
2023-11-29 12:46:45 +08:00
Yeting Kuo
f73844d92b
[RISCV] Generate bexti for (select(setcc eq (and x, c))) where c is power of 2. (#73649)
Currently, LLVM can transform (setcc ne (and x, c)) to (bexti x,
log2(c)) where c is a power of 2.
This patch transforms (select (setcc ne (and x, c)), T, F) into (select
(setcc eq (and x, c)), F, T).
This benefits the case where c does not fit into 12 bits.
2023-11-29 11:56:48 +08:00
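The two equivalences that f73844d92b relies on can be checked with a short model. The function names are hypothetical, and `bexti` here only models the single-bit extract described in the message, not the instruction encoding.

```python
def bexti(x, shamt):
    """Model of a single-bit extract: bit `shamt` of x as 0 or 1."""
    return (x >> shamt) & 1


def setcc_ne_and(x, c):
    """(setcc ne (and x, c), 0) as 0 or 1."""
    return int((x & c) != 0)


def select_ne(x, c, t, f):
    """(select (setcc ne (and x, c)), T, F)."""
    return t if (x & c) != 0 else f


def select_eq_swapped(x, c, t, f):
    """(select (setcc eq (and x, c)), F, T): condition inverted, arms swapped."""
    return f if (x & c) == 0 else t


# A power of two that does not fit in a 12-bit immediate.
C = 1 << 20
SHAMT = C.bit_length() - 1  # log2(C) for a power of two
```

With `C` a power of two, `setcc_ne_and(x, C)` equals `bexti(x, SHAMT)`, and the two select forms return the same arm for every `x`.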
paperchalice
1debbae96b
[CodeGen] Port CallBrPrepare to new pass manager (#73630)
IIUC in the new pass manager infrastructure, the analysis result is
always computed lazily. So just use `getResult` here.
2023-11-29 10:33:14 +09:00
Arthur Eubanks
d8d9394cb0 Revert "[X86] With large code model, put functions into .ltext with large section flag (#73037)"
This reverts commit 38e435895779c6f0e6c47a171f3b300ad99828b3.

May be culprit for https://lab.llvm.org/buildbot/#/builders/37/builds/28079/steps/9/logs/stdio.
2023-11-28 14:14:40 -08:00
Arthur Eubanks
38e4358957
[X86] With large code model, put functions into .ltext with large section flag (#73037)
So that when mixing small and large text, large text stays out of the
way of the rest of the binary.

This is useful for mixing precompiled small code model object files and
built-from-source large code model binaries so that the text
sections don't get merged.
2023-11-28 12:55:17 -08:00
Philip Reames
02cbae4fe0
[RISCV] Work on subreg for insert_vector_elt when vlen is known (#72666) (#73680)
If we have a constant index and a known vlen, then we can identify which
register out of a register group is being accessed. Given this, we can
reuse the (slightly generalized) existing handling for working on
sub-register groups. This results in all constant index extracts with
known vlen becoming m1 operations.

One bit of weirdness to highlight and explain: the existing code uses
the VL from the original vector type, not the inner vector type. This is
correct because the inner register group must be smaller than the
original (possibly fixed length) vector type. Overall, this seems to be a
reasonable codegen tradeoff as it biases us towards immediate AVLs,
which avoids needing the vsetvli form which clobbers a GPR for no real
purpose. The downside is that for large fixed length vectors, we end up
materializing an immediate in register for little value. We should
probably generalize this idea and try to optimize the large fixed length
vector case, but that can be done in separate work.
2023-11-28 10:45:22 -08:00
Stanislav Mekhanoshin
87d884b5c8
[AMDGPU] Fix folding of v2i16/v2f16 splat imms (#72709)
We can use inline constants with packed 16-bit operands, but these
should use op_sel. Currently splat of inlinable constants is considered
legal, which is not really true if we fail to fold it with op_sel and
drop the high half. It may be legal as a literal but not as inline
constant, but then usual literal checks must be performed.

This patch makes these splat literals illegal but adds additional logic
to the operand folding to keep current folds. This logic is somewhat
heavy though.

This fixes a constant bus violation in the fdot2 test.
2023-11-28 09:07:26 -08:00
Philip Reames
3e5acc78f7 [RISCV] Precommit test coverage for insert_vector_elt with exact VLEN 2023-11-28 08:49:04 -08:00
Philip Reames
f3a9dbe7fc
[RISCV] Split build_vector into vreg sized pieces when exact VLEN is known (#73606)
If we have a high LMUL build_vector and a known exact VLEN, we can
decompose the build_vector into one build_vector per register in the
register group. Doing so requires exact knowledge of which elements
correspond to each register in the register group, and thus an exact
VLEN must be known.

Since we no longer have operations which are linear (or worse) in LMUL,
this also allows us to lower all build_vectors without resorting to
going through the stack.
2023-11-28 07:39:58 -08:00
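A sketch of the decomposition described in f3a9dbe7fc, modelling a build_vector as a plain list of elements. The function and parameter names are hypothetical. With an exact VLEN and element width, the number of elements per vector register is fixed, so the list can be cut into one piece per register in the group.

```python
def split_build_vector(elements, vlen_bits, sew_bits):
    """Split an element list into one chunk per vector register.

    vlen_bits: exact vector register length in bits (assumed known).
    sew_bits:  element width in bits.
    """
    per_register = vlen_bits // sew_bits
    return [
        elements[start:start + per_register]
        for start in range(0, len(elements), per_register)
    ]
```

For example, eight 32-bit elements with VLEN = 128 occupy a two-register group, so `split_build_vector(list(range(8)), 128, 32)` yields two chunks of four elements, each of which could be built independently.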
Jay Foad
0d40831765
[AMDGPU] Allow folding to FMAAK with SGPR and immediate operand on GFX10+ (#72266)
Allow foldImmediate to create instructions like:

  v_fmaak_f32 v0, s0, v0, 0x42000000

This instruction has two "scalar values": s0 and 0x42000000. On GFX10+
this is allowed. This fold was originally implemented before the
compiler supported GFX10, when all ASICs were limited to one scalar
value.
2023-11-28 14:36:37 +00:00
Uday Bondhugula
b5d132010d
[NFC][NVPTX] Add a simpler test case for 0b80288e9e0b (#73379)
While 0b80288e9e0b allowed more efficient lowering for 16xi8 loads, its
test case was closer to an "integration" one. Add a much simpler unit
test case that exercises it.
2023-11-28 19:28:51 +05:30
David Green
ab7110bcd6
[AArch64][SVE] Remove pseudo from LD1_IMM (#73631)
The LD1 immediate offset instructions have both a pseudo and a real
instruction, mostly as the instructions share a tablegen class with the
FFR version of the instructions. As far as I can tell the pseudo for the
non-ffr versions does not serve any useful purpose though, and we can
rejig the classes to only define the pseudo for FFR instructions
similar to the existing sve_mem_cld_ss instructions.

The end result of this is that we don't have a SideEffects flag on the
LD1_IMM instructions whilst scheduling them, and have a few less pseudo
instructions which is usually a good thing.
2023-11-28 12:13:26 +00:00
Mariusz Sikora
facead618b
[AMDGPU] PromoteAlloca - bail always if load/store is volatile (#73228)
This change addresses the case where the alloca size is the same as the
load/store size.
2023-11-28 12:01:35 +01:00
Simon Pilgrim
eba50929b8 [X86] X86DAGToDAGISel - fix typo in #73126
We were casting the LoadSDNode from the wrong node in the base pointer uses list, meaning the ptr/chain comparisons were comparing against themselves.
2023-11-28 10:17:57 +00:00
paperchalice
61e58c4dc1
[CodeGen] Port DwarfEHPrepare to new pass manager (#72500)
Co-authored-by: PaperChalice <example@example.com>
2023-11-28 17:53:25 +09:00
Stanislav Mekhanoshin
82d22a1bb4
[AMDGPU] Fixed folding of inline imm into dot w/o opsel (#73589)
A splat packed constant can be folded as an inline immediate but it
shall use opsel. On gfx940 this code path can be skipped due to a HW bug
workaround and then it may be folded w/o opsel which is a bug. Fixed.
2023-11-28 00:50:41 -08:00
Craig Topper
ffcc5c7796
[RISCV][GISel] Select G_FENCE. (#73184)
Using IR test to make it easier to compare with the SelectionDAG test
output. The constant operands otherwise make it harder to understand.
2023-11-27 20:24:03 -08:00
Shengchen Kan
a3b7b2d635
[X86][CodeGen] Not compress EVEX into VEX when R16-R31 is used (#73604)
Because the VEX prefix cannot encode R16-R31.
2023-11-28 11:40:48 +08:00
Kai Luo
00f9946680 [PowerPC] Precommit test of building vector via load and zeros. NFC. 2023-11-28 03:32:57 +00:00
Shengchen Kan
d9221da72b
[X86][MC] Keep backward compatibility in inline asm for constraints (#73529)
Do not use r16-r31 with the 'q', 'r', 'l' constraints, for backward compatibility.
2023-11-28 09:42:03 +08:00
Philip Reames
52b413f25a [RISCV] Precommit tests for buildvector lowering with exact VLEN 2023-11-27 16:48:20 -08:00
Philip Reames
93e156833b
[DAG] Fix a miscompile in insert_subvector undef (insert_subvector undef, ..), idx combine (#73587)
The combine was implicitly assuming that the index on the outer
insert_subvector meant the same thing when the source was switched to be
the index of the inner insert_subvector. This is not true if the
innermost sub-vector is fixed, and the outer subvector is scalable.

I could do a less restrictive fix here - i.e. allow the case where the
scalability of the subvectors are the same - but there's no test
coverage which shows this transform actually has profit. Given that, go
for the simplest fix.
2023-11-27 16:45:29 -08:00
Craig Topper
b4cf014991
[RISCV][GISel] Select trap and debugtrap. (#73171) 2023-11-27 15:52:15 -08:00
Philip Reames
cf17a24a4b
[RISCV] Use subreg extract for extract_vector_elt when vlen is known (#72666)
This is the first in a planned patch series to teach our vector lowering
how to exploit register boundaries in LMUL>1 types when VLEN is known to
be an exact constant. This corresponds to code compiled by clang with
the -mrvv-vector-bits=zvl option.

For extract_vector_elt, if we have a constant index and a known vlen,
then we can identify which register out of a register group is being
accessed. Given this, we can do a sub-register extract for that
register, and then shift any remaining index.

This results in all constant index extracts becoming m1 operations, and
thus eliminates the complexity concern for explode-vector idioms at high
lmul.
2023-11-27 14:33:16 -08:00
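The index arithmetic behind cf17a24a4b can be sketched as follows, assuming an exact VLEN and a fixed element width. The function name and parameters are hypothetical and this is not the LLVM lowering code. A constant element index splits into the register within the group and the remaining index inside that register.

```python
def locate_element(index, vlen_bits, sew_bits):
    """Return (register_in_group, index_within_register) for an element.

    vlen_bits: exact vector register length in bits (assumed known).
    sew_bits:  element width in bits.
    """
    elements_per_register = vlen_bits // sew_bits
    register = index // elements_per_register
    remaining_index = index % elements_per_register
    return register, remaining_index
```

With VLEN = 128 and 32-bit elements there are four elements per register, so element 9 of an LMUL=4 group is `locate_element(9, 128, 32)`, that is register 2 at index 1. The extract then only has to operate on that single register.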