https://github.com/llvm/llvm-project/commit/1d27669e8ad07f8f2 add
support for fold 512-bit concat(blendi(x,y,c0),blendi(z,w,c1)) to
AVX512BW mask select.
But when the type of subvector is v16i16, we need to generate repeated
mask to make the result correct.
The subnode looks like t87: v16i16 = X86ISD::BLENDI t132, t58,
TargetConstant:i8<-86>.
This alters the lowering of G_COPYSIGN to support vector types. The
general idea is that we just lower it to vector operations using and/or
and a mask, which are now converted to a BIF/BIT/BSP.
In the process the existing AArch64LegalizerInfo::legalizeFCopySign can
be removed, replying on expanding the scalar versions to vector instead,
which just needs a small adjustment to allow widening scalars to
vectors.
If we have a abs(sext a) we can legally perform this as a zext (abs a).
(See the same combine in instcombine - note that the IntMinIsPoison flag
doesn't exist in SDAG yet.)
On RVV, this is likely profitable because it may allow us to perform the arithmetic
operations involved in the abs at a narrower LMUL before widening for the user.
We could arguably avoid narrowing below DLEN, but the transform should
at worst move around the extend and create one extra vsetvli toggle if the source
could previously be handled via loads explicit w/EEW.
This patch optimizes the post-increment instructions so that we can
packetize them together.
v1 = phi(v0, v3')
v2,v3 = post_load v1, 4
v2',v3'= post_load v3, 4
This can be optimized in two ways
v1 = phi(v0, v3')
v2,v3' = post_load v1, 8
v2' = load v1, 4
We were using the SDLoc corresponding to the original arithmetic
instruction, but here using the SDLoc corresponding to the original
extend if we need to introduce a new narrower extend seems cleaner.
As can be seen in the test diffs, this very minorly impacts scheduling
and register allocation by given the scheduler a hint from original
program order.
On some uArchs, `STP [s|d], [s|d]` first combines the 2 input registers
in a single register using a vector execution unit. IIUC
AArch64StorePairSuppress tries to prevent forming STPs in case the
critical resource are the vector units, in order to prevent adding more
pressure on those units.
The implementation however simply computes the new critical resource
length by adding resource for another STP. If load/store units are the
critical resource, this means we increase that length by one, and
incorrectly prevent forming the STP.
This patch adjusts the resource computation by also removing 2 STRs, as
introducing a STP will remove 2 single stores. This should more
accurately reflect the resource usage after introducing an STP, and does
not prevent forming STPs if load/store units are the critical resources;
in those cases, STP can actually help to reduce resource usage.
PR: https://github.com/llvm/llvm-project/pull/81749
This patch implements the codegen support of zabha (Byte and Halfword
Atomic Memory Operations) v1.0-rc1 extension.
See also https://github.com/riscv/riscv-zabha/blob/v1.0-rc1/zabha.adoc.
---------
Co-authored-by: Craig Topper <craig.topper@sifive.com>
If it isn't virtual, we may extend the live range of the physical
register past were it is valid. For example, across a call.
Found while trying to enable -riscv-enable-sink-fold which enables some
copy propagation in machine sink that led to ADDIs with physical
register destinations.
gfx11 chips may, in some conditions, behave incorrectly with S_CLAUSE
instructions (hard clauses) containing more than 32 operations (that is,
whose arguments exceed 0x1f). However, gfx10 targets will work
successfully with clauses of up to length 63.
Therefore, define the MaxHardClauseLength property on GCNSubtarget and
make it a subtarget feature via tablegen, thus allowing us to specify,
both now and in the future, the maximum viable size of clauses on
various hardware from the tablegen definition. If MaxHardClauseLength is
0, which is the default, the hardware does not support hard clauses.
This adds support for the AArch64 soft-float ABI. The specification for
this ABI was added by https://github.com/ARM-software/abi-aa/pull/232.
Because all existing AArch64 hardware has floating-point hardware, we
expect this to be a niche option, only used for embedded systems on
R-profile systems. We are going to document that SysV-like systems
should only ever use the base (hard-float) PCS variant:
https://github.com/ARM-software/abi-aa/pull/233. For that reason, I've
not added an option to select the ABI independently of the FPU hardware,
instead the new ABI is enabled iff the target architecture does not have
an FPU.
For testing, I have run this through an ABI fuzzer, but since this is
the first implementation it can only test for internal consistency
(callers and callees agree on the PCS), not for conformance to the ABI
spec.
When a whole register is added a basic block's liveins, use
LaneBitmask::getAll for the live lanes instead of trying to calculate an
accurate mask of the lanes that comprise the register.
This simplifies the code and matches other places where a whole register
is marked as livein.
This also avoids problems when regunits that are synthesized by TableGen
to represent ad hoc aliasing have a lane mask of 0.
Fixes#78942
This PR adds support for the SPV_KHR_linkonce_odr extension and modifies
existing negative test with a positive check for the extension and
proper linkage type in case when the extension is enabled.
SPV_KHR_linkonce_odr adds a "LinkOnceODR" linkage type, allowing proper
translation of, for example, C++ templates classes merging during
linking from different modules and supporting any other cases when a
global variable/function must be merged with equivalent global
variable(s)/function(s) from other modules during the linking process.
By SPIR-V specification: "If an instruction, enumerant, or other feature
specifies multiple enabling capabilities, only one such
capability needs to be declared to use the feature."
However, one capability may be preferred over another. One important
case is Shader capability that may not be supported by a backend, but
always is inserted if "OpDecorate SpecId" is found, because Enabling
Capabilities for the latter is the list of Shader and Kernel, where
Shader is coming first and thus always selected as the first available
option.
In this PR we address the problem by keeping current behaviour of
selecting the first option among enabling capabilities as is, but giving
a user a way to filter capabilities during the selection process via a
newly introduced "--avoid-spirv-capabilities" command line option. This
option is to avoid selection of certain capabilities if there are other
available enabling capabilities.
This PR is changing also existing pruneCapabilities() function. It
doesn't remove capability from module requirement anymore, but only adds
implicitly required capabilities recursively, so its name is changed
accordingly. This change fixes the present bug in collecting required by
a module capabilities. Before the change, introduced by this PR,
pruneCapabilities() function has been removing, for example, Kernel
capability from required by a module, because Kernel is initially
required and the second time it was needed pruneCapabilities() removed
it by mistake.
Hi,
AMD has it's own implementation of vector calls. This patch include the
changes to enable the use of AMD's math library using -fveclib=AMDLIBM.
Please refer https://github.com/amd/aocl-libm-ose
---------
Co-authored-by: Rohit Aggarwal <Rohit.Aggarwal@amd.com>
When parsing the `la` macro, we add a duplicate `$` prefix in
`getOrCreateSymbol`,
leading to `error: Undefined temporary symbol $$yy` for code like:
```
xx:
la $2,$yy
$yy:
nop
```
Remove the duplicate prefix.
In addition, recognize `.L`-prefixed symbols as local for O32.
See: #65020.
---------
Co-authored-by: Fangrui Song <i@maskray.me>
Some types of machine function and machine basic block are unsafe to
split on AArch64: basic blocks that contain jump table dispatch or
targets (D157124), and blocks that contain inline ASM GOTO blocks
or their targets (D158647) all cause issues and have been excluded
from Machine Function Splitting on AArch64.
These issues are caused by any transformation pass that places
same-function basic blocks in different text sections
(MachineFunctionSplitter and BasicBlockSections) and must be
special-cased in both passes.
If we have a long chain of vslide1down instructions to build e.g. a <16
x i8> from scalar, we end up with a critical path going through the
entire chain. We can instead build two halves, and then combine them
with a vselect. This costs one additional temporary register, but
reduces the critical path by roughly half.
To avoid needing to change VL, we fill each half with undefs for the
elements which will come from the other half. The vselect will at worst
become a vmerge, but is often folded back into the final instruction of
the sequence building the lower half.
A couple notes on the heuristic here:
* This is restricted to LMUL1 to avoid quadratic costing reasoning.
* This only splits once. In future work, we can explore recursive
splitting here, but I'm a bit worried about register pressure and thus
decided to be conservative. It also happens to be "enough" at the
default zvl of 128.
* "8" is picked somewhat arbitrarily as being "long". In practice, our
build_vector codegen for 2 defined elements in a VL=4 vector appears to
need some work. 4 defined elements in a VL=8 vector seems to generally
produce reasonable results.
* Halves may not be an optimal split point. I went down the rabit hole
of trying to find the optimal one, and decided it wasn't worth the
effort to start with.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
The dot is too confusing for tools. Output temporaries would have
'10.3-generic' so tools could parse it as an extension, device libs &
the associated clang driver logic are also confused by the dot.
After discussions, we decided it's better to just remove the '.' from
the target name than fix each issue one by one.
This is something that is already done as a special case for copysign,
this patch extends it to be more generally applied. If we are trying to
matrialize a negative constant (notably -0.0, 0x80000000), then there
may be no movi encoding that creates the immediate, but a fneg(movi)
might.
Some of the existing patterns for RADDHN needed to be adjusted to keep
them in line with the new immediates.
As commented on the PR #81293, the Ampere1-family does not have test
cases for the common fusion cases it implements. This adds the Ampere1
targets to the relevant misched-fusion testcases:
* addadrp
* addr
* aes
Some fixed vector tests in test/CodeGen/RISCV/rvv have multiple run
lines that
check various configurations of -riscv-v-fixed-length-vector-lmul-max.
From
what I understand this flag was introduced in the early days of fixed
length
vector support, but now that fixed vector codegen has matured I'm not
sure if
it's as relevant today.
This patch proposes to remove the various lmul-max run lines from the
tests to
make them more readable, and any changes to fixed vector codegen easier
to
review.
We have removed them before for the same reason, so this would take care
of the
remaining test cases: https://reviews.llvm.org/D157973#4593268
(I don't have any strong motivation to remove the actual flag itself, my
own
personal motivation is just to clean up the tests)
We already have the PtrOff factored into MachinePointerInfo. Any calls
to getAlign on the new load with do commonAlignment with the
MachinePointerInfo offset and the base alignment.
PEI previously used fake frame indices for these callee saved registers.
These fake frame indices are not register with MachineFrameInfo. This
required them to be deleted form CalleeSavedInfo after PEI to avoid
breaking later passes. See #79535
Unfortunately, removing the registers from CalleeSavedInfo pessimizes
Interprocedural Register Allocation. The RegUsageInfoCollector pass runs
after PEI and uses CalleeSavedInfo.
This patch replaces #79535 by properly creating fixed stack objects
through MachineFrameInfo. This changes the stack size and offsets
returned by MachineFrameInfo which requires changes to how
RISCVFrameLowering uses that information.
In addition to the individual object for each register, I've also create
a single large fixed object that covers the entire stack area covered by
cm.push or the libcalls. cm.push must always push a multiple of 16 bytes
and the save restore libcall pushes a multiple of stack align. I think
this leaves holes in the stack where we could spill other registers, but
it matches what we did previously. Maybe we can optimize this in the
future.
The only test changes are due to stack alignment handling after the
callee save registers. Since we now have the fixed objects, on the stack
the offset is non-zero when an aligned object is processed so the offset
gets rounded up, increasing the stack size.
I suspect we might need some more updates for RVV related code. There is
very little or maybe even no testing of RVV mixed with Zcmp and
save-restore.
`DemoteCatchSwitchPHIOnly` option in `WinEHPrepare` pass was added in
99d60e0dab,
because Wasm EH uses `WinEHPrepare`, but it doesn't need to demote all
PHIs. PHIs in `catchswitch` BBs have to be removed (= demoted) because
`catchswitch`s are removed in ISel and `catchswitch` BBs are removed as
well, so they can't have other instructions.
But because Wasm EH doesn't use funclets, so PHIs in `catchpad` or
`cleanuppad` BBs don't need to be demoted. That was the reason
`DemoteCatchSwitchPHIOnly` option was added, in order not to demote more
instructions unnecessarily.
The problem is it should have been set to `true` for Wasm EH. (Its
default value is `false` for WinEH) And I mistakenly set it to `false`
and wasn't aware about this for more than 5 years. This was not the end
of the world; it just means we've been demoting more instructions than
we should, possibly huting code size. In practice I think it would've
had hardly any effect in real performance given that the occurrence of
PHIs in `catchpad` or `cleanuppad` BBs are not very frequent and many
people run other optimizers like Binaryen anyway.
"ninja check-llvm" is failing on tip of tree.
This reverts commit ec0aa1646e9953d1a8d0d15dc381d3250c854572.
This reverts commit 1b65742f8c71f576381fe85d5e34579b24f2d874.
When in 32-bit mode, the backend doesn't currently implement 64-bit
atomics, even though the hardware is capable if you have specified a V9
CPU. Thus, limit the width to 32-bit, for now, leaving behind a TODO.
This fixes a regression triggered by PR #73176.
In this case, a trivial GEP chain has the form:
```
%ptr = getelementptr sameType, %base, constant
%val = getelementptr sameType, %ptr, %variable
```
That is, a one-index GEP consumes another (of the same basis and result
type) one-index GEP, where the inner GEP uses a constant index and the
outer GEP uses a variable index. For chains of this type, it is trivial
to reorder them (by simply swapping the indexes). The result of doing so
is better AddrMode matching for users of the ultimate ptr produced by
GEP chain.
Future patches can extend this to support non-trivial GEP chains (e.g.
those with different basis types and/or multiple indices).
Prevents isel errors when trying to lower gc relocate of undef value
(which turns into CopyToReg of TargetConstant). Such relocates may occur
after DCE (e.g. after GVN removes some dead blocks) if there are not
passes like instcombine scheduled after to clean them up.
Fixes#80294
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
This pass looks for unsigned icmps that have illegal types and tries
to widen the use/def graph to improve the placement of the zero
extends that type legalization would need to insert.
I've explicitly disabled it for i32 by adding a check for
isSExtCheaperThanZExt to the pass.
The generated code isn't perfect, but my data shows a net
dynamic instruction count improvement on spec2017 for both base and
Zba+Zbb+Zbs.
Summary:
This patch adds a new intrinsic and builtin function mirroring the
existing `__builtin_readcyclecounter`. The difference is that this
implementation targets a separate counter that some targets have which
returns a fixed frequency clock that can be used to determine elapsed
time, this is different compared to the cycle counter which often has
variable frequency.
This patch only adds support for the NVPTX and AMDGPU targets.
This is done as a new and separate builtin rather than an argument to
`readcyclecounter` to avoid needing to change existing code and to make
the separation more explicit.
The load combine replaces a number of original loads with one new loads
and also replaces the output chains of the original loads with the
output chain of the new load. This is incorrect if the original load is
retained (due to multi-use), as it may get incorrectly reordered.
Fix this by using makeEquivalentMemoryOrdering() instead, which will
create a TokenFactor with both chains.
Fixes https://github.com/llvm/llvm-project/issues/80911.