This brings a trivial improvement in the and-add-lsr.ll test case.
Signed-off-by: WANG Rui <wangrui@loongson.cn>
Reviewed By: SixWeining, xen0n
Differential Revision: https://reviews.llvm.org/D154762
During the construction of SelectionDAG, there are no explicit canonicalization rules to adjust the order of operands for AND nodes. This may prevent the optimization in DAGCombiner::visitANDLike from being triggered. This patch canonicalizes the operands before matches, which can be observed to improve optimization on the RISC-V target architecture.
Canonicalize:
```
and(x, add) -> and(add, x)
```
Signed-off-by: WANG Rui <wangrui@loongson.cn>
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154760
NEON `movi` is not valid in Streaming SVE mode, so use an `fmov`
instruction instead for zero-initializing a FP value.
Reviewed By: hassnaa-arm
Differential Revision: https://reviews.llvm.org/D155432
This patch supports register allocation for floating-point types when
`zfinx` and `zdinx` is specified in the architecture for the GHC
calling convention.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D155910
Summary: The source language ID and CPU version ID are required by debuggers on AIX. AIX's system assembler determines the source language ID based on the source file's name suffix, and the behavior in this patch is consistent with it.
Reviewed By: shchenz
Differential Revision: https://reviews.llvm.org/D155684
When input to intrinsic is uniform value, reduced value is
same as input whereas if input value is divergent we need
to iterate over all active lanes of WaveFront to perform
the reduction.
The control flow for a `loop` has been set up, which
iterates over `only` active lanes to perform reduction.
Introduced WAVE_REDUCE_UMIN_PSEUDO_U32 and
WAVE_REDUCE_UMAX_PSEUDO_U32 Pseudos which
are lowered Post-ISel (in `EmitInstrWithCustomInserter `).
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D154858
The implementation in https://reviews.llvm.org/D151313 is done for the circumstance without Zfbfmin. This patch adds codegen support for the 6 instructions provided in Zfbfmin extension.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D153234
This patch does two things. First it removes the tryHighFPExt DAG2DAG method
used to select fcvtl2 instructions, using tablegen patterns through
SelectExtractHigh instead. This essentially undoes D71515, in a way that should
hopefully avoid any regressions. The second is that a GI equivalent of
SelectExtractHigh is added in selectExtractHigh, from G_UNMERGE_VALUES. The
end result is that GlobalISel (and some constrained fpext) can now make use of
the fcvtl2 instructions, saving an extra dup/ext.
Differential Revision: https://reviews.llvm.org/D155871
Followup to af32e51a43fb4343f - where the undef funnel shift amounts (during widening from v2i32 -> v4i32) were being constant folded to 0 when the shift amounts are created during expansion, losing the splat'd shift amounts.
The gain is usually suffiscient to go the extra mile and reconstruct a carry in some cases.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154533
Fixes issue identified on #63980 where the undef rotate amounts (during widening from v2i32 -> v4i32) were being constant folded to 0 when the shift amounts are created during expansion, losing the splat'd shift amounts.
Fixes root cause of #63017.
The reason is similar to BUILD_VECTOR. We have legal vector type but
still soft promote for scalar type. So we need to customize these scalar
to vector nodes.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D155961
This reverts commit 086ee99564afbb11449c08ea2e094f7f49fadde5.
This patch causes an infinite loop when building arch/mips/mm/c-r4k.c in
the Linux kernel. See the comment in Phabricator for a reduced
reproducer: https://reviews.llvm.org/rG086ee99564afbb11449c08ea2e094f7f49fadde5
Extends the new frexp scaled reciprocal to the general case. The
reciprocal case is just the same thing when frexp of 1 is constant
folded. Could probably clean up the code to rely on that constant
folding.
Improves results for the IEEE path for the default OpenCL division. We
used to only emit the fdiv.fast intrinsic with a 2.5 ulp accuracy
threshold with DAZ, which uses explicit range checks. This gives us a
better fast option with the default IEEE behavior.
NFC-ish. Does trigger some reordering of the fdiv scalarization. Also
skips scalarizing in more cases where nothing was going to happen. We
can still scalarize in some no-op edge cases.
https://reviews.llvm.org/D155740
The highlight change is a new denormal safe 1ulp lowering which uses
rcp after using frexp to perform input scaling. This saves 2
instructions compared to other implementations which performed an
explicit denormal range change. This improves the OpenCL default, and
requires a flag for HIP. I don't believe there's any flag wired up for
OpenMP to emit the necessary fpmath metadata.
This provides several improvements and changes that were hard to
separate without regressing one case or another. Disturbingly the
OpenCL conformance test seems to have the reciprocal test commented
out. I locally hacked it back in to test this.
Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like
the rcp case, we could do this in codegen if !fpmath were preserved
(although we would lose some computeKnownFPClass tricks). Start
requiring contract flags to form rsq. The rsq fusion actually improves
the result from ~2ulp to ~1ulp. We have some older fusion in codegen
which only keys off unsafe math which should be refined.
Expand rsq patterns by checking for denormal inputs and pre/post
multiplying like the current library code does. We also take advantage
of computeKnownFPClass to avoid the scaling when we can statically
prove the input cannot be a denormal. We could do the same for the rcp
case, but unlike rsq a large input can underflow to denormal. We need
additional upper bound exponent checks on the input in order to do the
same for rcp.
This rsq handling also now starts handling the negated case. We
introduce rsq with an fneg. In the case the fneg doesn't fold into its
user, it's a neutral change but provides improvement if it is foldable
as a source modifier.
Also starts respecting the arcp attribute properly, and more strictly
interprets afn. We were previously interpreting afn as implying you
could do the reciprocal expansion of an fdiv. The codegen handling of
these also needs to be revisited.
This also effectively introduces the optimization
combineRepeatedFPDivisors enables, just done in the IR instead (and
only for f32).
This is almost across the board better. The one minor regression is
for gfx6/buggy frexp case where for multiple reciprocals, we could
previously reuse rematerialized constants per instance (it's neutral
for a single rcp).
The fdiv.fast and sqrt handling need to be revisited next.
https://reviews.llvm.org/D155593
Because branch relaxation needs to factor in if branches target
a block in the same section or a different one, it needs to run
after the Basic Block Sections / Machine Function Splitting passes.
Because Jump table compression relies on block offsets remaining
fixed after the table is compressed, we must also move the JT
compression pass.
The only tests affected are ones enforcing just the ordering and
the a few that have basic block ids changed because RenumberBlocks
hasn't run yet.
Differential Revision: https://reviews.llvm.org/D153829
Similar to Issue #63710 - by truncating the v8i16 result with a PACKSS node before type legalization, we fail to make use of various folds that rely on TRUNCATE nodes.
This required tweaks to LowerTruncateVecPackWithSignBits to recognise when the truncation source has been widened and to more closely match combineVectorSignBitsTruncation wrt truncating with PACKSS/PACKUS on AVX512 targets.
One of the last stages before we can finally get rid of combineVectorSignBitsTruncation.
ProgrammersManual.html says
> StringMap iteration order, however, is not guaranteed to be deterministic, so any uses which require that should instead use a std::map.
This patch makes -DLLVM_REVERSE_ITERATION=on (currently
-DLLVM_ENABLE_REVERSE_ITERATION=on works as well) shuffle StringMap
iteration order (actually flipping the hash so that elements not in the
same bucket are reversed) to catch violations, similar to D35043 for
DenseMap. This should help change the hash function (e.g., D142862,
D155781).
With a lot of fixes, there are still some violations. This patch
implements the "reverse_iteration" lit feature to skip such tests.
Eventually we should remove this feature.
`ninja check-{llvm,clang,clang-tools}` are clean with
`#define LLVM_ENABLE_REVERSE_ITERATION 1`.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D155789
For a label difference `A-B` in assembly, if A and B are separated by a
linker-relaxable instruction, we should emit a pair of ADD/SUB
relocations (e.g. R_RISCV_ADD32/R_RISCV_SUB32,
R_RISCV_ADD64/R_RISCV_SUB64).
However, the decision is made upfront at parsing time with inadequate
heuristics (`requiresFixup`). As a result, LLVM integrated assembler
incorrectly suppresses R_RISCV_ADD32/R_RISCV_SUB32 for the following
code:
```
// Simplified from a workaround https://android-review.googlesource.com/c/platform/art/+/2619609
// Both end and begin are not defined yet. We decide ADD/SUB relocations upfront and don't know they will be needed.
.4byte end-begin
begin:
call foo
end:
```
To fix the bug, make two primary changes:
* Delete `requiresFixups` and the overridden emitValueImpl (from D103539).
This deletion requires accurate evaluateAsAbolute (D153097).
* In MCAssembler::evaluateFixup, call handleAddSubRelocations to emit
ADD/SUB relocations.
However, there is a remaining issue in
MCExpr.cpp:AttemptToFoldSymbolOffsetDifference. With MCAsmLayout, we may
incorrectly fold A-B even when A and B are separated by a
linker-relaxable instruction. This deficiency is acknowledged (see
D153097), but was previously bypassed by eagerly emitting ADD/SUB using
`requiresFixups`. To address this, we partially reintroduce `canFold` (from
D61584, removed by D103539).
Some expressions (e.g. .size and .fill) need to take the `MCAsmLayout`
code path in AttemptToFoldSymbolOffsetDifference, avoiding relocations
(weird, but matching GNU assembler and needed to match user
expectation). Switch to evaluateKnownAbsolute to leverage the `InSet`
condition.
As a bonus, this change allows for the removal of some relocations for
the FDE `address_range` field in the .eh_frame section.
riscv64-64b-pcrel.s contains the main test.
Add a linker relaxable instruction to dwarf-riscv-relocs.ll to test what
it intends to test.
Merge fixups-relax-diff.ll into fixups-diff.ll.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D155357
We only attempted to determine KnownBits for uniform constant shift amounts, but ComputeKnownBits is able to handle some non-uniform cases as well that we can use as a fallback.
We're currently only matching scalar shift amounts where the type is the same
as the vector element type. But because only the bottom log2(2*SEW) bits are
used, only 7 bits will be used at most so we can use any scalar type >= i8.
This patch adds patterns for the case above, as well as for when the shift
amount type is the same as the widened element type and doesn't need extended.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D155698
These patterns of ([l,a]shr v, ([s,z]ext splat)) only pick up the cases where
the scalar has the same type as the vector element. However since only the low
log2(SEW) bits of the scalar are read, we could use any scalar type that has
been extended.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D155697
Reapply after fixing an issue in canonicalizeLogicFirst() exposed
by this change (218f97578b26f7a89f7f8ed0748c31ef0181f80a).
-----
In preparation for removing support for and expressions, mark them
as undesirable. As such, we will no longer implicitly create such
expressions, but they still exist.