52796 Commits

Author SHA1 Message Date
Fangrui Song
157ba33007 [CSKY,test] Update switch.ll 2023-10-31 15:47:05 -07:00
Fangrui Song
5888dee7d0
[ARM,ELF] Fix access to dso_preemptable __stack_chk_guard with static relocation model (#70014)
The ELF code from https://reviews.llvm.org/D112811 emits LDRLIT_ga_pcrel
when `TM.isPositionIndependent()` but uses a different condition
`Subtarget.isGVIndirectSymbol(GV)` (aka dso_preemptable on ELF targets).
This would cause incorrect access for dso_preemptable
`__stack_chk_guard` with the static relocation model.

Regarding whether `__stack_chk_guard` gets the dso_local specifier,
https://reviews.llvm.org/D150841 switched to
`M.getDirectAccessExternalData()` (implied by "PIC Level") instead of
`TM.getRelocationModel() == Reloc::Static`.

The result is that when non-zero "PIC Level" is used with static
relocation model (e.g. -fPIE/-fPIC LTO compiles with -no-pie linking),
`__stack_chk_guard` accesses are incorrect.
```
        ldr     r0, .LCPI0_0
        ldr     r0, [r0]
        ldr     r0, [r0]   // incorrectly dereferences __stack_chk_guard
...
.LCPI0_0:
        .long   __stack_chk_guard
```

To fix this, for dso_preemptable `__stack_chk_guard`, emit a GOT PIC
code sequence like for -fpic using `LDRLIT_ga_pcrel`:
```
        ldr     r0, .LCPI0_0
.LPC0_0:
        add     r0, pc, r0
        ldr     r0, [r0]
        ldr     r0, [r0]
...
LCPI0_0:
.Ltmp0:
        .long   __stack_chk_guard(GOT_PREL)-((.LPC0_0+8)-.Ltmp0)
```

Technically, `LDRLIT_ga_abs` with `R_ARM_GOT_ABS` could be used, but
`R_ARM_GOT_ABS` does not have GNU or integrated assembler support.
(Note, `.LCPI0_0: .long __stack_chk_guard@GOT` produces an
`R_ARM_GOT_BREL`, which is not desired).

This patch fixes #6499 while not changing behavior for the following
configurations:
```
run arm.linux.nopic --target=arm-linux-gnueabi -fno-pic
run arm.linux.pie --target=arm-linux-gnueabi -fpie
run arm.linux.pic --target=arm-linux-gnueabi -fpic
run armv6.darwin.nopic --target=armv6-apple-darwin -fno-pic
run armv6.darwin.dynamicnopic --target=armv6-apple-darwin -mdynamic-no-pic
run armv6.darwin.pic --target=armv6-apple-darwin -fpic
run armv7.darwin.nopic --target=armv7-apple-darwin -mcpu=cortex-a8 -fno-pic
run armv7.darwin.dynamicnopic --target=armv7-apple-darwin -mcpu=cortex-a8 -mdynamic-no-pic
run armv7.darwin.pic --target=armv7-apple-darwin -mcpu=cortex-a8 -fpic
run arm64.darwin.pic --target=arm64-apple-darwin
```
2023-10-31 15:37:26 -07:00
Fangrui Song
6ae7b735db [ARM][test] Improve stack-protector tests
llvm/test/LTO/ARM/ssp-static-reloc.ll is more about using the static
relocation model with "PIC Level" and unrelated to the LTO infrastructure.
Move the test. Update stack_guard_remat.ll to clearly test "PIC Level"
with the relevant relocation models.
2023-10-31 15:30:08 -07:00
Ramkumar Ramachandra
7a76038835
CodeGen/RISCV: increase test coverage of lrint, llrint (#70826)
To follow up on 98c90a1 (ISel: introduce vector ISD::LRINT, ISD::LLRINT;
custom RISCV lowering), increase the test coverage to test the codegen
of the i32-variant of lrint on RV64, and llrint on RV32.
2023-10-31 19:16:39 +00:00
Sander de Smalen
73498d2608
[AArch64] Also implement PNR -> PNR copies. (#70682)
Previously we only implemented PNR -> PPR and PPR -> PNR copies.
2023-10-31 16:52:42 +00:00
Simon Pilgrim
51d4ad6701 [AMDGPU] amdgpu-codegenprepare-idiv.ll - regenerate checks. NFC.
Reduces diffs in a future patch
2023-10-31 13:24:27 +00:00
Matt Arsenault
e62d25e37d
RegisterCoalescer: Relax assert for super register def rematerialization (#69088) 2023-10-31 21:52:36 +09:00
Jay Foad
a6dabed348
[AMDGPU] Fix nondeterminism in SIFixSGPRCopies (#70644)
There are a couple of loops that iterate over V2SCopies. The iteration
order needs to be deterministic, otherwise we can call moveToVALU in
different orders, which causes temporary vregs to be allocated in
different orders, which can affect register allocation heuristics.
2023-10-31 11:47:42 +00:00
Kerry McLaughlin
3b786f2c76 [AArch64] Add intrinsic to count trailing zero elements
This patch introduces an experimental intrinsic for counting the
trailing zero elements in a vector. The intrinsic has generic expansion
in SelectionDAGBuilder, and for AArch64 there is a pattern which matches
to brkb & cntp instructions where SVE is enabled.

The intrinsic has a second operand, is_zero_poison, similar to the
existing cttz intrinsic.

These changes have been split out from D158291.
2023-10-31 10:48:08 +00:00
Jessica Del
b8d3ccdff1
[AMDGPU] - Add s_bitreplicate intrinsic (#69209)
Add intrinsic for s_bitreplicate. Lower to S_BITREPLICATE_B64_B32
machine instruction in both GISel and Selection DAG.

Support VGPR arguments by inserting a `v_readfirstlane`.
2023-10-31 11:26:45 +01:00
Ilya Leoshkevich
03934e70ef
[SystemZ] Enable AtomicExpand pass (#70398)
The upcoming OpenMP support for SystemZ requires handling of IR insns
like `atomicrmw fadd`. Normally atomic float operations are expanded by
Clang and such insns do not occur, but OpenMP generates them directly.
Other architectures handle this using the AtomicExpand pass, which
SystemZ did not need so far. Enable it.

Currently AtomicExpand treats atomic load and stores of floats
pessimistically: it casts them to integers, which SystemZ does not need,
since the floating point load and store instructions are already atomic.
However, the way Clang currently expands them is pessimistic as well, so
this change does not make things worse. Optimizing operations on atomic
floats can be a separate change in the future.

This change does not create any differences the Linux kernel build.
2023-10-31 09:51:06 +01:00
Luke Lau
03c8fbf092
[RISCV] Add _RM pseudos to pseudos table (#70693) 2023-10-31 08:41:56 +00:00
Fangrui Song
5908559c10
[X86] Don't set SHF_X86_64_LARGE for variables with explicit section name of a well-known small data section prefix (#70748)
Commit f3ea73133f91c1c23596d45680c8f2269c1dd289 allows SHF_X86_64_LARGE
for all global variables with an explicit section. For the following
variables, their data sections will be annotated as SHF_X86_64_LARGE.

```
const char relro[512] __attribute__((section(".rodata"))) = "a";
const char *const relro __attribute__((section(".data.rel.ro"))) = "a";
char data[512] __attribute__((section(".data"))) = "a";
```

The typical linker requirement is that we do not create more than one
output section with the same name, and the only output section should
have the bitwise OR value of all input section flags. Therefore, the
output .data section will have the SHF_X86_64_LARGE flag and be
moved away from the regular sections. This is undesired but benign.
However, .data.rel.ro having the SHF_X86_64_LARGE flag is problematic
because dynamic loaders do not support more than one PT_GNU_RELRO
program header, and LLD produces the error
`error: section: .jcr is not contiguous with other relro sections`.

I believe the most appropriate solution is to disallow SHF_X86_64_LARGE
on variables with an explicit section of certain prefixes (
.bss/.data/.bss) and allow others (e.g. metadata sections for various
instrumentation). Fortunately, global variables with an explicit
.bss/.data/.bss section are rare, so they should not cause excessive
relocation overflow pressure.
2023-10-30 17:03:04 -07:00
Philip Reames
784a2cd561
[RISCV] Rewrite RISCVCodeGenPrepare using zext nneg [nfc-ish] (#70739)
This stacks on #70725. Once we have lowering for zext nneg, we can
rewrite all of the existing RISCVCodeGenPrepare login in terms of zext
nneg instead of sext. The change isn't NFC from the perspective of the
individual pass, but should be from the perspective of codegen as a
whole.

As noted in the TODO, one piece can be moved to instcombine, but I'll
leave that to a separate commit.
2023-10-30 16:35:30 -07:00
Philip Reames
83c560b3bf
[SDAG] Prefer forming sign_extend for zext nneg per target preference (#70725)
Builds on #67982 which recently introduced the nneg flag on a zext
instruction. Note that this change is the first point where the flag is
being used for an optimization, and thus may expose latent miscompiles.
We've recently taught both CVP and InstCombine to infer the flag when
forming zext, but nothing else is using the flag just yet.
2023-10-30 15:29:57 -07:00
Justin Bogner
428af867d8
[DirectX] Update test after opt learned to infer datalayout (#70726)
Since e39f6c1844fa "[opt] Infer DataLayout from triple if not
specified", this test (correctly) emits a load of an i64 with 8 byte
alignment, rather than with 4 byte alignment.
2023-10-30 14:04:15 -07:00
Philip Reames
c92c86f66a [RISCV] Add test coverage for "zext nneg" [nfc]
This IR feature was recently added in #67982.  An upcoming change will
improve our lowering on these examples.
2023-10-30 13:35:58 -07:00
Philip Reames
cc6f9cf5a2 [RISCV] Add zbb coverage to test file [nfc] 2023-10-30 13:18:35 -07:00
Michael Maitland
04dd2ac03a
[RISCV][GlobalISel] Select G_GLOBAL_VALUE (#70091)
G_GLOBAL_VALUE should be lowered into an absolute address if
`-codemodel=small` is used or into a PC-relative if `-codemodel=medium`
is used.

PR #68380 tried to create special instructions to do this, but I don't
see why we need to do that.
2023-10-30 15:46:36 -04:00
Igor Kirillov
849f963e31
[CodeGen] Improve ExpandMemCmp for more efficient non-register aligned sizes handling (#70469)
* Enhanced the logic of ExpandMemCmp pass to merge contiguous
subsequences
  in LoadSequence, based on sizes allowed in `AllowedTailExpansions`.
* This enhancement seeks to minimize the number of basic blocks and
produce
  optimized code when using memcmp with non-register aligned sizes.
* Enable this feature for AArch64 with memcmp sizes modulo 8 equal to
  3, 5, and 6.

Reapplication of #69942 after fixing a bug
2023-10-30 18:40:48 +00:00
Antonio Frighetto
9fe5700611 [AArch64] Add support for v8.4a ldapur/stlur
AArch64 backend now features v8.4a atomic Load-Acquire
RCpc and Store-Release register unscaled support.
2023-10-30 19:27:48 +01:00
Antonio Frighetto
a8799719f7 [AArch64] Introduce tests for PR67879 (NFC) 2023-10-30 19:27:48 +01:00
Nikita Popov
e46dd6fbc0 Revert "[InstCombine] Simplify and/or of icmp eq with op replacement (#70335)"
This reverts commit 1770a2e325192f1665018e21200596da1904a330.

Stage 2 llvm-tblgen crashes when generating X86GenAsmWriter.inc and
other files.
2023-10-30 18:33:03 +01:00
Craig Topper
9a7c26a399
[GISel] Restrict G_BSWAP to multiples of 16 bits. (#70245)
This is consistent with the IR verifier and SelectionDAG's getNode.

Update tests accordingly. I tried to keep some coverage of non-pow2 when
possible. X86 didn't like a G_UNMERGE_VALUES from s48 to 3 s16 that got
created when I tried s48.
2023-10-30 10:27:57 -07:00
Craig Topper
77e88db6b7 [RISCV][GISel] Add missing curly brace to test. NFC 2023-10-30 10:12:56 -07:00
Craig Topper
284d136c4a
[RISCV] Teach copyPhysReg to allow copies between GPR<->FPR32/FPR64 (#70525)
This is needed because GISel emits copies instead of bitcasts like
SelectionDAG.
2023-10-30 09:58:51 -07:00
Jay Foad
101008be83
[AMDGPU] CodeGen for 64-bit buffer atomic cmpswap intrinsics (#70475)
Implement codegen for:
llvm.amdgcn.raw.buffer.atomic.cmpswap.i64
llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.i64
llvm.amdgcn.struct.buffer.atomic.cmpswap.i64
llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.i64
2023-10-30 16:44:22 +00:00
Jessica Del
849297c97d
[AMDGPU][wmma] - Add tied wmma intrinsic (#69903)
These new intrinsics, `amdgcn_wmma_tied_f16_16x16x16_f16` and
`amdgcn_wmma_tied_f16_16x16x16_f16`,
explicitly tie the destination accumulator matrix to the input
accumulator matrix.

The `wmma_f16` and `wmma_bf16` intrinsics only write to 16-bit of the
32-bit destination VGPRs.
Which half is determined via the `op_sel` argument. The other half of
the destination registers remains unchanged.

In some cases however, we expect the destination to copy the other
halves from the input accumulator.
For instance, when packing two separate accumulator matrices into one.
In that case, the two matrices
are tied into the same registers, but separate halves. Then it is
important to copy the other matrix values
to the new destination.
2023-10-30 16:23:49 +01:00
Luke Lau
72e6c1c70d
[RISCV] Begin moving post-isel vector peepholes to a MF pass (#70342)
We currently have three postprocess peephole optimisations for vector
pseudos:

1) Masked pseudo with all ones mask -> unmasked pseudo
2) Merge vmerge pseudo into operand pseudo's mask
3) vmerge pseudo with all ones mask -> vmv.v.v pseudo

This patch aims to move these peepholes out of SelectionDAG and into a
separate RISCVFoldMasks MachineFunction pass.

There are a few motivations for doing this:

* The current SelectionDAG implementation operates on MachineSDNodes,
which are essentially MachineInstrs but require a bunch of logic to
reason about chain and glue operands. The RISCVII::has*Op helper
functions also don't exactly line up with the SDNode operands. Mutating
these pseudos and their operands in place becomes a good bit easier at
the MachineInstr level. For example, we would no longer need to check
for cycles in the DAG during performCombineVMergeAndVOps.

* Although it's further down the line, moving this code out of
SelectionDAG allows it to be reused by GlobalISel later on.

* In performCombineVMergeAndVOps, it may be possible to commute the
operands to enable folding in more cases (see
test/CodeGen/RISCV/rvv/vmadd-vp.ll). There is existing machinery to
commute operands in TII::commuteInstruction, but it's implemented on
MachineInstrs.

The pass runs straight after ISel, before any of the other machine SSA
optimization passes run. This is so that dead-mi-elimination can mop up
any vmsets that are no longer used (but if preferred we could try and
erase them from inside RISCVFoldMasks itself). This also means that
these peepholes are no longer run at codegen -O0, so this patch isn't
strictly NFC.

Only the performVMergeToVMv peephole is refactored in this patch, the
remaining two would be implemented later. And as noted by @preames, it
should be possible to move doPeepholeSExtW out of SelectionDAG as well.
2023-10-30 15:17:00 +00:00
Stanislav Mekhanoshin
fe8335babb
[AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395)
This allows folding of 64-bit operands if fit into 32-bit. Fixes
https://github.com/llvm/llvm-project/issues/67781
2023-10-30 08:12:28 -07:00
Stanislav Mekhanoshin
ee6d62db99
[AMDGPU] Prevent folding of the negative i32 literals as i64 (#70274)
We can use sign extended 64-bit literals, but only for signed operands.
At the moment we do not know if an operand is signed. Such operand will
be encoded as its low 32 bits and then either correctly sign extended or
incorrectly zero extended by HW.
2023-10-30 08:07:43 -07:00
Nikita Popov
292f34b0d3
[AArch64][GlobalISel] Fix incorrect ABI when tail call not supported (#70215)
The check for whether a tail call is supported calls
determineAssignments(), which may modify argument flags. As such, even
though the check fails and a non-tail call will be emitted, it will not
have a different (incorrect) ABI.

Fix this by operating on a separate copy of the arguments.

Fixes https://github.com/llvm/llvm-project/issues/70207.
2023-10-30 15:01:01 +01:00
Simon Pilgrim
432649700d [X86] vec_insert-5.ll - ensure we build with +mmx as we reference x86_mmx types
Enabling SSE doesn't guarantee MMX is enabled on all targets

Avoids a crash in D152928 (although we still currently see a regression with that patch applied resulting in MMX codegen)
2023-10-30 12:43:18 +00:00
Nikita Popov
1770a2e325
[InstCombine] Simplify and/or of icmp eq with op replacement (#70335)
and/or in logical (select) form benefit from generic simplifications via
simplifyWithOpReplaced(). However, the corresponding fold for plain
and/or currently does not exist.

Similar to selects, there are two general cases for this fold
(illustrated with `and`, but there are `or` conjugates).

The basic case is something like `(a == b) & c`, where the replacement
of a with b or b with a inside c allows it to fold to true or false.
Then the whole operation will fold to either false or `a == b`.

The second case is something like `(a != b) & c`, where the replacement
inside c allows it to fold to false. In that case, the operand can be
replaced with c, because in the case where a == b (and thus the icmp is
false), c itself will already be false.

As the test diffs show, this catches quite a lot of patterns in existing
test coverage. This also obsoletes quite a few existing special-case
and/or of icmp folds we have (e.g. simplifyAndOrOfICmpsWithLimitConst),
but I haven't removed anything as part of this patch in the interest of
risk mitigation.

Fixes #69050.
Fixes #69091.
2023-10-30 10:05:39 +01:00
Cullen Rhodes
54732a3e0b [AArch64] Use TargetRegisterClass::hasSubClassEq in tryToFindRegisterToRename
When renaming store operands for pairing in the load/store optimizer it
tries to find an available register from the minimal physical register
class of the original register. For each register it compares the
equality of minimal physical register class of all sub/super registers
with the minimal physical register class of the original register.

Simply checking for register class equality can break once additional
register classes are added, as was the case when adding:

    def foo : RegisterClass<"AArch64", [i32], 32, (sequence "W%u", 12, 15)>

which broke:

    CodeGen/AArch64/stp-opt-with-renaming-reserved-regs.mir
    CodeGen/AArch64/stp-opt-with-renaming.mir

Since the introduction of the register class above, the rename register
in test1 of the reserved regs test changed from x12 to x18. The reason
for this is the minimal physical register class of x12 (as well as
x13-x15) and its sub/super registers no longer matches that of x9
(GPR64noip_and_tcGPR64).

Rather than selecting a matching register based on a comparison of the minimal
physical register classes of the original and rename registers, this patch
selects based on `MachineInstr::getRegClassConstraint` for the original
register.

It's worth mentioning the parameter passing registers (r0-r7) could be now be
used as rename registers since the GPR32arg and GPR64arg register classes are
subclasses of the minimal physical register class for x8 for example. I'm not
entirely sure if we want to exclude those registers, if so maybe we could
explicitly exclude those register classes.

Reviewed By: efriedma, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D88663
2023-10-30 08:47:39 +00:00
David Green
072a7edec3 [AArch64] Add additional concat trunc -> UZP1 patterns
These extra patterns come from the lowering of fptoi, where an extra assertzext
is present between some of the vectors.
2023-10-29 22:58:57 +00:00
Simon Pilgrim
d96529af3c [DAG] Attempt shl narrowing in SimplifyDemandedBits (REAPPLIED)
If a shl node leaves the upper half bits zero / undemanded, then see if we can profitably perform this with a half-width shl and a free trunc/zext.

Followup to D146121

Reapplied - moved after the ShrinkDemandedOp call; reuse the existing KnownBits result; ensure that we only attempt this if all the upper bits are demanded; 547dc461225ba should address the remaining regressions that were noticed in the previous commit.

Differential Revision: https://reviews.llvm.org/D155472
2023-10-29 15:38:46 +00:00
Phoebe Wang
b5281afe42
[X86] Avoid returning the same shuffle operation for broadcast (#70592)
This is to fix a crash since aab8b2eb080d, which generates a new pattern
```
      t35: v8i32 = xor t11, t14
    t36: v8i32 = vector_shuffle<0,1,0,1,0,1,0,1> t35, undef:v8i32
```

The pattern exposed a bug introduced since f885c08034, which breaks
element widen but doesn't handle the broadcast case.

The patch just solved the crash issue. I observed performance regression
cased by above patches in the test, which may need further
investigation.
2023-10-29 21:55:00 +08:00
Craig Topper
133e50db31 [RISCV][GISel] Directly emit X0 from getICMPOperandsForBranch instead of using buildConstant+selectConstant.
This is simpler to implement and matches what SelectionDAG emits.

Also allows us to make getICMPOperandsForBranch a static function.
2023-10-28 23:50:46 -07:00
Craig Topper
4ac304242b [RISCV][GISel] Support G_FPEXT/G_FPTRUNC for F and D extension. 2023-10-28 16:22:17 -07:00
Craig Topper
49ae2efb80 [RISCV][GISel] Support G_FMA/NEG/ABS/SQRT/MAXNUM/MINNUM for F and D extension. 2023-10-28 15:53:15 -07:00
XChy
fc6bdb8549
[SimplifyCFG] Reland transform for redirecting phis between unmergeable BB and SuccBB (#68473)
Reland #67275 with #68953 resolved.
2023-10-28 17:10:20 +08:00
Rahman Lavaee
f70e39ec17
[BasicBlockSections] Apply path cloning with -basic-block-sections. (#68860)
28b9126879
introduced the path cloning format in the basic-block-sections profile.

This PR validates and applies path clonings. 
A path cloning is valid if all of these conditions hold:
  1. All bb ids in the path are mapped to existing blocks.
2. Each two consecutive bb ids in the path have a successor relationship
in the CFG.
3. The path does not include a block with indirect branches, except
possibly as the last block.
 
Applying a path cloning involves cloning all blocks in the path (except
the first one) and setting up their branches.
Once all clonings are applied, the cluster information is used to guide
block layout in the modified function.
2023-10-27 21:49:39 -07:00
Changpeng Fang
8ceb72ffe5
[AMDGPU] make v32i16/v32f16 legal (#70484)
Some upcoming intrinsics will be using these new types
2023-10-27 15:28:31 -07:00
Mircea Trofin
840bf2a0bb [mlgo][regalloc] Fix tests post 9a091de7 2023-10-27 14:03:25 -07:00
Stanislav Mekhanoshin
d136432038
[AMDGPU] Remove unneeded implicit-def from shrink-i32-kimm.mir. NFC. (#70489) 2023-10-27 13:32:48 -07:00
Guozhi Wei
9a091de7fe [X86, Peephole] Enable FoldImmediate for X86
Enable FoldImmediate for X86 by implementing X86InstrInfo::FoldImmediate.

Also enhanced peephole by deleting identical instructions after FoldImmediate.

Differential Revision: https://reviews.llvm.org/D151848
2023-10-27 19:47:23 +00:00
Michael Maitland
4b581125ed [RISCV][GISel] Fix failing test case for G_BSWAP
The test was not updated correctly in #70226. This patch resolves that
problem.
2023-10-27 10:22:09 -07:00
Michael Maitland
4c43c1eeef
[RISCV][GISEL] Add legalizer for G_BSWAP (#70226)
Lower G_BSWAP into simpler instructions that can be selected in
instruction selection. A future patch can handle when there is Zbb.
2023-10-27 13:03:25 -04:00
Paul Walker
7c90be2857
[SVE] Fix incorrect offset calculation when rewriting an instruction's frame index. (#70315)
When partially packing an offset into an SVE load/store instruction we
are incorrectly calculating the remainder.
2023-10-27 16:53:30 +01:00