52796 Commits

Author SHA1 Message Date
David Green
dc1b772933 [AArch64][GlobalISel] Add missing legalization for v16i8 extract element. 2024-02-19 07:26:57 +00:00
XinWang10
bb91b43719
[X86] Handle repeated blend mask in combineConcatVectorOps (#82155)
https://github.com/llvm/llvm-project/commit/1d27669e8ad07f8f2 add
support for fold 512-bit concat(blendi(x,y,c0),blendi(z,w,c1)) to
AVX512BW mask select.
But when the type of subvector is v16i16, we need to generate repeated
mask to make the result correct.
The subnode looks like t87: v16i16 = X86ISD::BLENDI t132, t58,
TargetConstant:i8<-86>.
2024-02-19 09:24:21 +08:00
Craig Topper
d5167c84f9
[DAGCombiner] Allow tryToFoldExtOfLoad to use a sextload for zext nneg. (#81714)
If the load is used by any signed setccs, we can use a sextload
instead of zextload. Then we don't have to give up on extending
the load.
2024-02-17 11:37:13 -08:00
David Green
3a77522387
[AArch64][GlobalISel] Improve and expand fcopysign lowering (#71283)
This alters the lowering of G_COPYSIGN to support vector types. The
general idea is that we just lower it to vector operations using and/or
and a mask, which are now converted to a BIF/BIT/BSP.

In the process the existing AArch64LegalizerInfo::legalizeFCopySign can
be removed, replying on expanding the scalar versions to vector instead,
which just needs a small adjustment to allow widening scalars to
vectors.
2024-02-17 10:19:27 +00:00
David Green
47c65cf62d
[AArch64][GlobalISel] Fail legalization for unknown libcalls. (#81873)
If, like powi on windows, the libcall is unavailable we should fall back
to SDAG. Currently we try and generate a call to "".
2024-02-17 08:57:14 +00:00
Philip Reames
db836f6094
[RISCV] Narrow vector absolute value (#82041)
If we have a abs(sext a) we can legally perform this as a zext (abs a). 
(See the same combine in instcombine - note that the IntMinIsPoison flag
 doesn't exist in SDAG yet.)

On RVV, this is likely profitable because it may allow us to perform the arithmetic
operations involved in the abs at a narrower LMUL before widening for the user.
We could arguably avoid narrowing below DLEN, but the transform should
at worst move around the extend and create one extra vsetvli toggle if the source
could previously be handled via loads explicit w/EEW.
2024-02-16 19:26:06 -08:00
Prabhuk
ea9ec80b7a
Revert "[AArch64] Add soft-float ABI (#74460)" (#82032)
This reverts commit 9cc98e336980f00cbafcbed8841344e6ac472bdc.

Issue: https://github.com/ClangBuiltLinux/linux/issues/1997
2024-02-16 16:43:50 -08:00
Sumanth Gundapaneni
0e6a48c3e8
[Hexagon] Optimize post-increment load and stores in loops. (#82011)
This patch optimizes the post-increment instructions so that we can
packetize them together.
v1 = phi(v0, v3')
v2,v3  = post_load v1, 4
v2',v3'= post_load v3, 4

This can be optimized in two ways

v1 = phi(v0, v3')
v2,v3' = post_load v1, 8
v2' = load v1, 4
2024-02-16 16:47:54 -06:00
Philip Reames
4265ad1215
[RISCV] Adjust sdloc when creating an extend for widening instructions (#82040)
We were using the SDLoc corresponding to the original arithmetic
instruction, but here using the SDLoc corresponding to the original
extend if we need to introduce a new narrower extend seems cleaner.

As can be seen in the test diffs, this very minorly impacts scheduling
and register allocation by given the scheduler a hint from original
program order.
2024-02-16 13:00:25 -08:00
Shilei Tian
46734aa1e5
[AMDGPU] Use bf16 instead of i16 for bfloat (#80908)
Currently we generally use `i16` to represent `bf16` in those tablegen
files. This patch is trying to use `bf16` directly.

Fix #79369.
2024-02-16 15:58:30 -05:00
Florian Hahn
2f083b364f
[AArch64] Fix resource length computation for STP. (#81749)
On some uArchs, `STP [s|d], [s|d]` first combines the 2 input registers
in a single register using a vector execution unit. IIUC
AArch64StorePairSuppress tries to prevent forming STPs in case the
critical resource are the vector units, in order to prevent adding more
pressure on those units.

The implementation however simply computes the new critical resource
length by adding resource for another STP. If load/store units are the
critical resource, this means we increase that length by one, and
incorrectly prevent forming the STP.

This patch adjusts the resource computation by also removing 2 STRs, as
introducing a STP will remove 2 single stores. This should more
accurately reflect the resource usage after introducing an STP, and does
not prevent forming STPs if load/store units are the critical resources;
in those cases, STP can actually help to reduce resource usage.

PR: https://github.com/llvm/llvm-project/pull/81749
2024-02-16 16:10:10 +00:00
Florian Hahn
35f3298ead
[AArch64] Add extra stp suppression tests.
Add test case for store suppression that still trigger after
https://github.com/llvm/llvm-project/pull/81749
2024-02-16 14:58:54 +00:00
Yingwei Zheng
a300a1a711
[RISCV][ISel] Add codegen support for the experimental zabha extension (#80192)
This patch implements the codegen support of zabha (Byte and Halfword
Atomic Memory Operations) v1.0-rc1 extension.
See also https://github.com/riscv/riscv-zabha/blob/v1.0-rc1/zabha.adoc.

---------

Co-authored-by: Craig Topper <craig.topper@sifive.com>
2024-02-16 15:35:09 +08:00
Craig Topper
feee627974
[RISCV] Make sure ADDI replacement in optimizeCondBranch has a virtual reg destination. (#81938)
If it isn't virtual, we may extend the live range of the physical
register past were it is valid. For example, across a call.

Found while trying to enable -riscv-enable-sink-fold which enables some
copy propagation in machine sink that led to ADDIs with physical
register destinations.
2024-02-15 16:34:40 -08:00
Hiroshi Yamauchi
692566a8b2
Fix an assert failure with a funclet in a swifttailcc function. (#78806)
The failure happens in the livedebugvalues pass.
2024-02-15 15:54:03 -08:00
Arthur Eubanks
5b51d45f49
[X86] Use ".lrodata" prefix for large mergeable constants (#81900)
Otherwise with a small enough large-data-threshold, we can get .rodata.*
sections marked large, making .rodata large in the final binary.
2024-02-15 12:50:26 -08:00
Krzysztof Drewniak
b497234146
[AMDGPU] Make maximum hard clause size a subtarget feature (#81287)
gfx11 chips may, in some conditions, behave incorrectly with S_CLAUSE
instructions (hard clauses) containing more than 32 operations (that is,
whose arguments exceed 0x1f). However, gfx10 targets will work
successfully with clauses of up to length 63.

Therefore, define the MaxHardClauseLength property on GCNSubtarget and
make it a subtarget feature via tablegen, thus allowing us to specify,
both now and in the future, the maximum viable size of clauses on
various hardware from the tablegen definition. If MaxHardClauseLength is
0, which is the default, the hardware does not support hard clauses.
2024-02-15 13:58:31 -06:00
Craig Topper
b57ba8ec51
[RISCV] Use APInt in useInversedSetcc to prevent crashes when mask is larger than UINT64_MAX. (#81888)
There are no checks that the type is legal so we need to handle any
type.
2024-02-15 10:48:52 -08:00
David Green
3564000490
[AArch64][GlobalISel] FNeg constant materialization (#80643)
This is a Global ISel equivalent of #80641, creating fneg(movi) instead
of the alternative constant pool load or gpr dup.
2024-02-15 16:22:12 +00:00
ostannard
9cc98e3369
[AArch64] Add soft-float ABI (#74460)
This adds support for the AArch64 soft-float ABI. The specification for
this ABI was added by https://github.com/ARM-software/abi-aa/pull/232.

Because all existing AArch64 hardware has floating-point hardware, we
expect this to be a niche option, only used for embedded systems on
R-profile systems. We are going to document that SysV-like systems
should only ever use the base (hard-float) PCS variant:
https://github.com/ARM-software/abi-aa/pull/233. For that reason, I've
not added an option to select the ABI independently of the FPU hardware,
instead the new ABI is enabled iff the target architecture does not have
an FPU.

For testing, I have run this through an ABI fuzzer, but since this is
the first implementation it can only test for internal consistency
(callers and callees agree on the PCS), not for conformance to the ABI
spec.
2024-02-15 12:39:16 +00:00
Simon Pilgrim
b279ca2783 [DAG] visitCTPOP - CTPOP(SHIFT(X)) -> CTPOP(X) iff the shift doesn't affect any non-zero bits
If the source is being (logically) shifted, but doesn't affect any active bits, then we can call CTPOP on the shift source directly.
2024-02-15 10:41:08 +00:00
Jay Foad
2df652a691
[CodeGen] Simplify updateLiveIn in MachineSink (#79831)
When a whole register is added a basic block's liveins, use
LaneBitmask::getAll for the live lanes instead of trying to calculate an
accurate mask of the lanes that comprise the register.

This simplifies the code and matches other places where a whole register
is marked as livein.

This also avoids problems when regunits that are synthesized by TableGen
to represent ad hoc aliasing have a lane mask of 0.

Fixes #78942
2024-02-15 10:39:05 +00:00
Vyacheslav Levytskyy
9552a396ed
add support for the SPV_KHR_linkonce_odr extension (#81512)
This PR adds support for the SPV_KHR_linkonce_odr extension and modifies
existing negative test with a positive check for the extension and
proper linkage type in case when the extension is enabled.

SPV_KHR_linkonce_odr adds a "LinkOnceODR" linkage type, allowing proper
translation of, for example, C++ templates classes merging during
linking from different modules and supporting any other cases when a
global variable/function must be merged with equivalent global
variable(s)/function(s) from other modules during the linking process.
2024-02-15 11:30:17 +01:00
Vyacheslav Levytskyy
dfb9bf35c4
let a user select preferred/unpreferred capabilities in a list of enabling capabilities (#81476)
By SPIR-V specification: "If an instruction, enumerant, or other feature
specifies multiple enabling capabilities, only one such
capability needs to be declared to use the feature."

However, one capability may be preferred over another. One important
case is Shader capability that may not be supported by a backend, but
always is inserted if "OpDecorate SpecId" is found, because Enabling
Capabilities for the latter is the list of Shader and Kernel, where
Shader is coming first and thus always selected as the first available
option.

In this PR we address the problem by keeping current behaviour of
selecting the first option among enabling capabilities as is, but giving
a user a way to filter capabilities during the selection process via a
newly introduced "--avoid-spirv-capabilities" command line option. This
option is to avoid selection of certain capabilities if there are other
available enabling capabilities.

This PR is changing also existing pruneCapabilities() function. It
doesn't remove capability from module requirement anymore, but only adds
implicitly required capabilities recursively, so its name is changed
accordingly. This change fixes the present bug in collecting required by
a module capabilities. Before the change, introduced by this PR,
pruneCapabilities() function has been removing, for example, Kernel
capability from required by a module, because Kernel is initially
required and the second time it was needed pruneCapabilities() removed
it by mistake.
2024-02-15 11:28:58 +01:00
chuongg3
f6f8e202f5
[AArch64][GlobalISel] Refactor Combine G_CONCAT_VECTOR (#80866)
The combine now works using tablegen and checks if new instruction is
legal before creating it.
2024-02-15 10:09:20 +00:00
Luke Lau
cd55e230e6 [RISCV] Use $noreg in vsetvli-insert.mir test. NFC
This reflects what actually comes out of SelectionDAG after the noreg passthru
peephole added in a63bd7e99b00c.
2024-02-15 17:12:52 +08:00
Rohit Aggarwal
36adfec155
Adding support of AMDLIBM vector library (#78560)
Hi,

AMD has it's own implementation of vector calls. This patch include the
changes to enable the use of AMD's math library using -fveclib=AMDLIBM.
Please refer https://github.com/amd/aocl-libm-ose 

---------

Co-authored-by: Rohit Aggarwal <Rohit.Aggarwal@amd.com>
2024-02-15 12:13:07 +05:30
Philip Reames
8d32654292 [RISCV] Add coverage for an upcoming set of vector narrowing changes 2024-02-14 13:34:08 -08:00
YunQiang Su
c007fbb198
MipsAsmParser/O32: Don't add redundant $ to $-prefixed symbol in the la macro (#80644)
When parsing the `la` macro, we add a duplicate `$` prefix in
`getOrCreateSymbol`,
leading to `error: Undefined temporary symbol $$yy` for code like:

```
xx:
	la	$2,$yy
$yy:
	nop
```

Remove the duplicate prefix.

In addition, recognize `.L`-prefixed symbols as local for O32.

See: #65020.

---------

Co-authored-by: Fangrui Song <i@maskray.me>
2024-02-14 12:48:55 -08:00
sgundapa
de16a05af0
[Hexagon] Fix zero extension of bit predicates with vtrunehb (#81772)
vector extension from v4i1 to v4i8 generates an incorrect word. This
patch uses a vtrunehb for truncation to fix the bug.
2024-02-14 13:10:18 -06:00
Daniel Hoekwater
ea06384bf6
[CodeGen][AArch64] Only split safe blocks in BBSections (#81553)
Some types of machine function and machine basic block are unsafe to
split on AArch64: basic blocks that contain jump table dispatch or
targets (D157124), and blocks that contain inline ASM GOTO blocks
or their targets (D158647) all cause issues and have been excluded
from Machine Function Splitting on AArch64.

These issues are caused by any transformation pass that places
same-function basic blocks in different text sections
(MachineFunctionSplitter and BasicBlockSections) and must be
special-cased in both passes.
2024-02-14 10:58:07 -08:00
Philip Reames
275eeda32f
[RISCV] Split long build_vector sequences to reduce critical path (#81312)
If we have a long chain of vslide1down instructions to build e.g. a <16
x i8> from scalar, we end up with a critical path going through the
entire chain. We can instead build two halves, and then combine them
with a vselect. This costs one additional temporary register, but
reduces the critical path by roughly half.

To avoid needing to change VL, we fill each half with undefs for the
elements which will come from the other half. The vselect will at worst
become a vmerge, but is often folded back into the final instruction of
the sequence building the lower half.

A couple notes on the heuristic here:
* This is restricted to LMUL1 to avoid quadratic costing reasoning.
* This only splits once. In future work, we can explore recursive
splitting here, but I'm a bit worried about register pressure and thus
decided to be conservative. It also happens to be "enough" at the
default zvl of 128.
* "8" is picked somewhat arbitrarily as being "long". In practice, our
build_vector codegen for 2 defined elements in a VL=4 vector appears to
need some work. 4 defined elements in a VL=8 vector seems to generally
produce reasonable results.
* Halves may not be an optimal split point. I went down the rabit hole
of trying to find the optimal one, and decided it wasn't worth the
effort to start with.

---------

Co-authored-by: Luke Lau <luke_lau@icloud.com>
2024-02-14 10:15:24 -08:00
Pierre van Houtryve
43c7eb5d7b
[AMDGPU] Replace '.' with '-' in generic target names (#81718)
The dot is too confusing for tools. Output temporaries would have
'10.3-generic' so tools could parse it as an extension, device libs &
the associated clang driver logic are also confused by the dot.

After discussions, we decided it's better to just remove the '.' from
the target name than fix each issue one by one.
2024-02-14 15:19:04 +01:00
David Green
6c84709eff
[AArch64] Materialize constants via fneg. (#80641)
This is something that is already done as a special case for copysign,
this patch extends it to be more generally applied. If we are trying to
matrialize a negative constant (notably -0.0, 0x80000000), then there
may be no movi encoding that creates the immediate, but a fneg(movi)
might.

Some of the existing patterns for RADDHN needed to be adjusted to keep
them in line with the new immediates.
2024-02-14 13:55:51 +00:00
Philipp Tomsich
6cab375b4b
[AArch64] Add tests for fusion on Ampere1/1A/1B (#81725)
As commented on the PR #81293, the Ampere1-family does not have test
cases for the common fusion cases it implements. This adds the Ampere1
targets to the relevant misched-fusion testcases:
 * addadrp
 * addr
 * aes
2024-02-14 13:05:22 +01:00
Simon Pilgrim
f82e0809ba [X86] Add v8i64/v16i32/v16i64 ctpop reduction test coverage
Add test coverage for types wider than legal
2024-02-14 10:54:23 +00:00
Luke Lau
0fee2115bb
[RISCV] Remove -riscv-v-fixed-length-vector-lmul-max from tests. NFC (#78299)
Some fixed vector tests in test/CodeGen/RISCV/rvv have multiple run
lines that
check various configurations of -riscv-v-fixed-length-vector-lmul-max.
From
what I understand this flag was introduced in the early days of fixed
length
vector support, but now that fixed vector codegen has matured I'm not
sure if
it's as relevant today.

This patch proposes to remove the various lmul-max run lines from the
tests to
make them more readable, and any changes to fixed vector codegen easier
to
review.

We have removed them before for the same reason, so this would take care
of the
remaining test cases: https://reviews.llvm.org/D157973#4593268

(I don't have any strong motivation to remove the actual flag itself, my
own
personal motivation is just to clean up the tests)
2024-02-14 16:12:37 +08:00
Craig Topper
86ce491f30
[DAGCombiner] Remove unneeded commonAlignment from reduceLoadWidth. (#81707)
We already have the PtrOff factored into MachinePointerInfo. Any calls
to getAlign on the new load with do commonAlignment with the
MachinePointerInfo offset and the base alignment.
2024-02-13 23:26:25 -08:00
Jeffrey Byrnes
7180c23cf6
[SeparateConstOffsetFromGEP] Reland: Reorder trivial GEP chains to separate constants (#81671)
Actually update tests w.r.t
9e5a77f252
and reland https://github.com/llvm/llvm-project/pull/73056
2024-02-13 17:10:23 -08:00
Pranav Kant
21630efb5a
[X86][CodeGen] Restrict F128 lowering to GNU environment (#81664)
Otherwise it breaks some environment like X64 Android that doesn't have
f128 functions available in its libc.

Followup to #79611.
2024-02-13 16:39:59 -08:00
Craig Topper
0de2b26942
[RISCV] Register fixed stack slots for callee saved registers for -msave-restore/Zcmp (#81392)
PEI previously used fake frame indices for these callee saved registers.
These fake frame indices are not register with MachineFrameInfo. This
required them to be deleted form CalleeSavedInfo after PEI to avoid
breaking later passes. See #79535

Unfortunately, removing the registers from CalleeSavedInfo pessimizes
Interprocedural Register Allocation. The RegUsageInfoCollector pass runs
after PEI and uses CalleeSavedInfo.

This patch replaces #79535 by properly creating fixed stack objects
through MachineFrameInfo. This changes the stack size and offsets
returned by MachineFrameInfo which requires changes to how
RISCVFrameLowering uses that information.

In addition to the individual object for each register, I've also create
a single large fixed object that covers the entire stack area covered by
cm.push or the libcalls. cm.push must always push a multiple of 16 bytes
and the save restore libcall pushes a multiple of stack align. I think
this leaves holes in the stack where we could spill other registers, but
it matches what we did previously. Maybe we can optimize this in the
future.

The only test changes are due to stack alignment handling after the
callee save registers. Since we now have the fixed objects, on the stack
the offset is non-zero when an aligned object is processed so the offset
gets rounded up, increasing the stack size.

I suspect we might need some more updates for RVV related code. There is
very little or maybe even no testing of RVV mixed with Zcmp and
save-restore.
2024-02-13 14:59:28 -08:00
Heejin Ahn
473ef10b0f
[WebAssembly] Demote PHIs in catchswitch BB only (#81570)
`DemoteCatchSwitchPHIOnly` option in `WinEHPrepare` pass was added in
99d60e0dab,
because Wasm EH uses `WinEHPrepare`, but it doesn't need to demote all
PHIs. PHIs in `catchswitch` BBs have to be removed (= demoted) because
`catchswitch`s are removed in ISel and `catchswitch` BBs are removed as
well, so they can't have other instructions.

But because Wasm EH doesn't use funclets, so PHIs in `catchpad` or
`cleanuppad` BBs don't need to be demoted. That was the reason
`DemoteCatchSwitchPHIOnly` option was added, in order not to demote more
instructions unnecessarily.

The problem is it should have been set to `true` for Wasm EH. (Its
default value is `false` for WinEH) And I mistakenly set it to `false`
and wasn't aware about this for more than 5 years. This was not the end
of the world; it just means we've been demoting more instructions than
we should, possibly huting code size. In practice I think it would've
had hardly any effect in real performance given that the occurrence of
PHIs in `catchpad` or `cleanuppad` BBs are not very frequent and many
people run other optimizers like Binaryen anyway.
2024-02-13 13:43:21 -08:00
Philip Reames
99c5a66c62 Revert "[SeparateConstOffsetFromGEP] Reorder trivial GEP chains to separate constants (#73056)" and follow ups
"ninja check-llvm" is failing on tip of tree.

This reverts commit ec0aa1646e9953d1a8d0d15dc381d3250c854572.
This reverts commit 1b65742f8c71f576381fe85d5e34579b24f2d874.
2024-02-13 13:29:23 -08:00
James Y Knight
c1a99b2c77
[Sparc] limit MaxAtomicSizeInBitsSupported to 32 for 32-bit Sparc. (#81655)
When in 32-bit mode, the backend doesn't currently implement 64-bit
atomics, even though the hardware is capable if you have specified a V9
CPU. Thus, limit the width to 32-bit, for now, leaving behind a TODO.

This fixes a regression triggered by PR #73176.
2024-02-13 15:40:51 -05:00
Jeffrey Byrnes
1b65742f8c
[SeparateConstOffsetFromGEP] Reorder trivial GEP chains to separate constants (#73056)
In this case, a trivial GEP chain has the form:

```
%ptr = getelementptr sameType, %base, constant
%val = getelementptr sameType, %ptr, %variable
```

That is, a one-index GEP consumes another (of the same basis and result
type) one-index GEP, where the inner GEP uses a constant index and the
outer GEP uses a variable index. For chains of this type, it is trivial
to reorder them (by simply swapping the indexes). The result of doing so
is better AddrMode matching for users of the ultimate ptr produced by
GEP chain.

Future patches can extend this to support non-trivial GEP chains (e.g.
those with different basis types and/or multiple indices).
2024-02-13 11:22:49 -08:00
Danila Malyutin
e20462a069
[StatepointLowering] Use Constant instead of TargetConstant for undef value (#81635)
Prevents isel errors when trying to lower gc relocate of undef value
(which turns into CopyToReg of TargetConstant). Such relocates may occur
after DCE (e.g. after GVN removes some dead blocks) if there are not
passes like instcombine scheduled after to clean them up.

Fixes #80294

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2024-02-13 21:58:01 +03:00
Craig Topper
7d40ea85d5 [RISCV] Enable the TypePromotion pass from AArch64/ARM.
This pass looks for unsigned icmps that have illegal types and tries
to widen the use/def graph to improve the placement of the zero
extends that type legalization would need to insert.

I've explicitly disabled it for i32 by adding a check for
isSExtCheaperThanZExt to the pass.

The generated code isn't perfect, but my data shows a net
dynamic instruction count improvement on spec2017 for both base and
Zba+Zbb+Zbs.
2024-02-13 09:57:48 -08:00
Craig Topper
9838c8512b [RISCV] Copy typepromotion-overflow.ll from AArch64. NFC 2024-02-13 09:57:48 -08:00
Joseph Huber
11fcae69db
[LLVM] Add __builtin_readsteadycounter intrinsic and builtin for realtime clocks (#81331)
Summary:
This patch adds a new intrinsic and builtin function mirroring the
existing `__builtin_readcyclecounter`. The difference is that this
implementation targets a separate counter that some targets have which
returns a fixed frequency clock that can be used to determine elapsed
time, this is different compared to the cycle counter which often has
variable frequency.

This patch only adds support for the NVPTX and AMDGPU targets.

This is done as a new and separate builtin rather than an argument to
`readcyclecounter` to avoid needing to change existing code and to make
the separation more explicit.
2024-02-13 10:06:25 -06:00
Nikita Popov
25b9ed6e49
[DAGCombine] Fix multi-use miscompile in load combine (#81586)
The load combine replaces a number of original loads with one new loads
and also replaces the output chains of the original loads with the
output chain of the new load. This is incorrect if the original load is
retained (due to multi-use), as it may get incorrectly reordered.

Fix this by using makeEquivalentMemoryOrdering() instead, which will
create a TokenFactor with both chains.

Fixes https://github.com/llvm/llvm-project/issues/80911.
2024-02-13 16:41:00 +01:00