This is a conservative workaround for broken liveness tracking of
SUBREG_TO_REG to speculatively fix all targets. The current reported
failures are on X86 only, but this issue should appear for all targets
that use SUBREG_TO_REG. The next minimally correct refinement would be
to disallow only implicit defs.
The coalescer now introduces implicit-defs of the super register to
track the dependency on other subregisters. If we see such an implicit
operand, we cannot simply treat the subregister def as the result
operand in case downstream users depend on the implicitly defined
parts. Really target implementations should be considering the
implicit defs and trying to interpret them appropriately (maybe with
some generic helpers). The full implicit def could possibly be
reported as the move result, rather than the subregister def but that
requires additional work.
Hopefully fixes #64060 as well.
This needs to be applied to the release branch.
https://reviews.llvm.org/D156346
Currently coalescing with SUBREG_TO_REG introduces an invisible
load-bearing undef. There is liveness for the super register that is
not represented in the MIR.
This is part 1 of a fix for regressions that appeared after
b7836d856206ec39509d42529f958c920368166b. The allocator started
recognizing undef-def subregister MOVs as copies. Since there was no
representation for the dependency on the high bits, different undef
segments of the super register ended up disconnected and downstream
users ended up observing different undefs than they did previously.
This does not yet fix the regression. The isCopyInstr handling needs
to start handling implicit-defs on any instruction.
I wanted to include an end to end IR test since the actual failure
only appeared with an interaction between the coalescer and the
allocator. It's a bit bigger than I'd like but I'm having a bit of
trouble reducing it to something which definitely shows a diff that's
meaningful.
The same problem likely exists everywhere trying to do anything with
SUBREG_TO_REG. I don't understand how this managed to be broken for so
long.
This needs to be applied to the release branch.
https://reviews.llvm.org/D156345
If this was coalescing a def of a subregister with a def of the super
register, it was introducing a redundant super-register def and
marking the subregister def as dead, resulting in something like:
dead $eax = MOVr0, implicit-def $rax, implicit-def $rax
Avoid this by checking if the new instruction already has the super
def, so we end up with this instead:
dead $eax = MOVr0, implicit-def $rax
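A minimal sketch of the check, as a toy model only: the operand list is a plain Python list of tuples, and `add_implicit_super_def` is a hypothetical helper name, not an LLVM API. It only shows the "append the super def unless it is already there" idea.

```python
def add_implicit_super_def(operands, super_reg):
    """Append an implicit-def of super_reg unless one is already present.

    operands is a toy stand-in for a machine instruction's operand list:
    a list of (kind, register) tuples.
    """
    implicit_def = ("implicit-def", super_reg)
    if implicit_def not in operands:
        operands.append(implicit_def)
    return operands


# Toy version of: dead $eax = MOVr0
mov = [("dead-def", "$eax")]
add_implicit_super_def(mov, "$rax")
add_implicit_super_def(mov, "$rax")  # second call is a no-op
```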
The dead flag looks suspicious to me; it seems easy to misinterpret a
dead def of a subregister alongside a non-dead def of an aliasing
register. It seems to be intentional though.
https://reviews.llvm.org/D156343
Permit an implicit-def of a virtual register when rematerializing if
it defines a super register of a subregister def. The
rematerialization pre-legality check should really have been checking
the implicit operands, but that should be fixed separately.
https://reviews.llvm.org/D156331
This change adds two related DAG combines which together will take a
left-reduce scalar add tree of an explode_vector, and will incrementally
form a vector reduction of the vector prefix. If the entire vector is
reduced, the result will be a reduction over the entire vector.
Profitability-wise, this relies on vredsum being cheaper than a pair of
extracts and scalar add. Given vredsum is linear in LMUL, and the
vslidedown required for the extract is *also* linear in LMUL, this is
clearly true at higher index values. At N=2, it's a bit questionable,
but I think the vredsum form is probably a better canonical form
anyways.
Note that this only matches left reduces. This happens to be the
motivating example I have (from spec2017 x264). This approach could be
generalized to handle right reduces without much effort, and could be
generalized to handle any reduce whose tree starts with adjacent
elements if desired. The approach fails for a reduce such as (A+C)+(B+D)
because we can't find a root to start the reduce with without scanning
the entire associative add expression. We could maybe explore using
masked reduces for the root node, but that seems of questionable
profitability. (As in, worth questioning - I haven't explored in any
detail.)
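To illustrate the shape of the two combines, here is a rough sketch on a toy expression tree. The node classes (`Extract`, `Add`, `ReducePrefix`) and the `combine` function are made up for illustration and are not the SelectionDAG API; all extracts are assumed to come from the same source vector.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Extract:
    index: int  # lane taken from the single (implicit) source vector


@dataclass(frozen=True)
class Add:
    lhs: "Node"
    rhs: "Node"


@dataclass(frozen=True)
class ReducePrefix:
    length: int  # sum of lanes [0, length)


Node = Union[Extract, Add, ReducePrefix]


def combine(node):
    """Fold a left-reduce add tree into a reduction of the vector prefix."""
    if not isinstance(node, Add):
        return node
    lhs = combine(node.lhs)
    rhs = node.rhs
    # Combine 1: the root of the tree, lane 0 + lane 1.
    if lhs == Extract(0) and rhs == Extract(1):
        return ReducePrefix(2)
    # Combine 2: grow an existing prefix reduction by the next lane.
    if isinstance(lhs, ReducePrefix) and rhs == Extract(lhs.length):
        return ReducePrefix(lhs.length + 1)
    return Add(lhs, rhs)


def evaluate(node, vec):
    """Reference semantics, used to check that the rewrite preserves the sum."""
    if isinstance(node, Extract):
        return vec[node.index]
    if isinstance(node, ReducePrefix):
        return sum(vec[:node.length])
    return evaluate(node.lhs, vec) + evaluate(node.rhs, vec)


left_reduce = Add(Add(Add(Extract(0), Extract(1)), Extract(2)), Extract(3))
balanced = Add(Add(Extract(0), Extract(2)), Add(Extract(1), Extract(3)))
```

In this model `combine(left_reduce)` collapses to `ReducePrefix(4)`, while `combine(balanced)`, the (A+C)+(B+D) shape, is returned unchanged because no root pair of adjacent lanes is found.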
This is covering up a deficiency in SLP. If SLP encounters the scalar
form of reduce_or(a) + reduce_sum(a) where a is some common
vectorizeable tree, SLP will sometimes fail to revisit one of the
reductions after vectorizing the other. Fixing this in SLP is hard, and
there's no good reason not to handle the easy cases in the backend.
Another option here would be to do this in VectorCombine or generic DAG.
I chose not to as the profitability of the non-legal typed prefix cases
is very target dependent. I think this makes sense as a starting point,
even if we move it elsewhere later.
This is currently restricted to add reduces only, but obviously makes
sense for any associative reduction operator. Once this is approved, I
plan to extend it in this manner. I'm simply staging work in case we
decide to go in another direction.
- Add tests for computeOverflowFor*Sub functions
- extend the computeOverflowForSignedSub/computeOverflowForUnsignedSub
implementations with ConstantRange (#37109)
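A rough sketch of the range-based reasoning for the unsigned case, assuming simple non-wrapping inclusive `(min, max)` ranges. The `Overflow` enum and `unsigned_sub_overflow` are illustrative names, not the LLVM ConstantRange API.

```python
from enum import Enum


class Overflow(Enum):
    NEVER = "never"
    ALWAYS = "always"
    MAY = "may"


def unsigned_sub_overflow(lhs_range, rhs_range):
    """Classify lhs - rhs for unsigned operands given inclusive (min, max) ranges.

    Assumption: ranges do not wrap around, unlike a full ConstantRange.
    """
    lhs_min, lhs_max = lhs_range
    rhs_min, rhs_max = rhs_range
    if lhs_max < rhs_min:
        return Overflow.ALWAYS  # every lhs is smaller than every rhs
    if lhs_min >= rhs_max:
        return Overflow.NEVER   # every lhs is at least as large as every rhs
    return Overflow.MAY
```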
We could maybe extend this by allowing the lowest subop to have multiple uses and extract the lowest subvector result of the concatenated op, but let's just get the fix in first.
Fixes #67333
This avoids some redundant spills of subranges, and avoids a compile failure.
This greatly reduces the number of spills in a loop.
The main range is not informative when multiple instructions are needed to fully define
a register. A common scenario is a lowered reg_sequence where every subregister
is sequentially defined, but each def changes the main range's value number. If
we look at specific lanes at the use index, we can see the value is actually the
same.
In this testcase, there are a large number of materialized 64-bit constant defs
which are hoisted outside of the loop by MachineLICM. These are feeding REG_SEQUENCEs,
which are not considered rematerializable inside the loop. After coalescing, the split
constant defs produce main ranges with an apparent phi def. There's no phi def if you look
at each individual subrange, and only half of the register is really redefined to a constant.
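As a toy model of why the main range is uninformative here (not the LiveInterval API; `value_number` and the `(slot, lane)` tuples are made up for illustration): each def gets a fresh value number in the main range, but a query restricted to one lane only sees the defs of that lane.

```python
def value_number(defs, at, lane=None):
    """Return the index of the latest def strictly before slot `at`.

    defs is a list of (slot, lane) pairs sorted by slot. With lane=None this
    models the main range; otherwise only defs of that lane are considered.
    """
    vn = None
    for i, (slot, def_lane) in enumerate(defs):
        if slot < at and (lane is None or def_lane == lane):
            vn = i
    return vn


# A lowered reg_sequence: sub0 defined at slot 10, sub1 at slot 20.
defs = [(10, "sub0"), (20, "sub1")]

main_before = value_number(defs, 15)         # main range, between the defs
main_after = value_number(defs, 25)          # main range, after both defs
sub0_before = value_number(defs, 15, "sub0")
sub0_after = value_number(defs, 25, "sub0")
```

The main range reports different values at the two points, while the sub0 lane reports the same value at both.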
Fixes: SWDEV-380865
https://reviews.llvm.org/D147079
SplitKit creates questionably formed bundles of copies
when it needs to copy a subset of live lanes and can't do
it with a single subregister index. These are merely marked
as part of a bundle, and don't start with a BUNDLE instruction.
Queries for the slot index would give the first copy in the
bundle, and we need to inspect the operands of all the other
bundled copies.
Also fix and simplify detection of read lane subsets. This causes
some RISCV test regressions, but these look like accidentally beneficial
splits. I don't see a subrange based reason to perform these splits.
Avoids some really ugly regressions in a future patch.
https://reviews.llvm.org/D146859
https://reviews.llvm.org/D152789 added an `exit` op before each
`unreachable`. This means we never get to the `trap` instruction.
This change limits the insertion of `exit` instructions to the cases
where `unreachable` is not lowered to `trap`. Trap itself is changed to
be emitted as `trap; exit;` to convey to `ptxas` that it exits the CFG.
The AArch64StorePairSuppress pass prevents the creation of STP under some
heuristics. Unfortunately it often prevents the creation of STP in cases where
it is obviously beneficial, and it doesn't match my understanding of
scheduling/cpu pipelining to prevent the creation of STP. From some
benchmarking, even on an in-order cpu where the scheduling is most important I
don't see it giving better results. In general the lower instruction count for
STP would be expected to give a slightly better cycle count.
As the pass specifically mentions the cyclone cpu, this patch adds a target
feature for FeatureStorePairSuppress, enabled for all the non-Arm cpus. This
has the effect of disabling it for all Arm cpus.
Differential Revision: https://reviews.llvm.org/D134646
The goal in #66818 was to capture function entry counts, but those are not the same as the frequency of the entry (machine) basic block. This fixes that, and adds explicit profiles to the test.
We also increase the precision of `MachineBlockFrequencyInfo::getBlockFreqRelativeToEntryBlock` to double. Existing code uses it as float so should be unaffected.
This fixes up the generation of 128bit atomic, volatile and non-temporal
loads/stores, under the assumption that they should usually be the same as
standard versions.
https://godbolt.org/z/xxc89eMKE
Fixes #64580
Closes #67413
Following on from D135150, this patch fixes another crash caused by this
DAG combine:
fadd (fma A, B, (fmul C, D)), E --> fma A, B, (fma C, D, E)
The combine calls ReplaceAllUsesOfValueWith to replace (fmul C, D) with
(fma C, D, E). This can cause nodes to get CSEd. In D135150 the problem
was that the (fma C, D, E) node got CSEd away. In this new case, the
problem is that the outer fadd node gets CSEd away. To fix it we have
to return SDValue(N, 0) from the combine and be careful not to add a
deleted node to the worklist.
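For reference, the algebra behind the combine can be checked with exact arithmetic. This is only a sketch: `fma`, `before` and `after` are illustrative helpers, and exact fractions deliberately ignore floating-point rounding, which can differ between the two forms on real hardware.

```python
from fractions import Fraction


def fma(a, b, c):
    """Exact multiply-add, standing in for a fused multiply-add node."""
    return a * b + c


def before(a, b, c, d, e):
    # fadd (fma A, B, (fmul C, D)), E
    return fma(a, b, c * d) + e


def after(a, b, c, d, e):
    # fma A, B, (fma C, D, E)
    return fma(a, b, fma(c, d, e))


values = [Fraction(1, 3), Fraction(5, 7), Fraction(-2, 9), Fraction(4), Fraction(1, 11)]
same = before(*values) == after(*values)
```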
Migrate creation of most casts to use the FoldXYZ rather than
CreateXYZ style APIs. This means that InstSimplifyFolder now
works for these, which is what accounts for the AMDGPU test changes.
In order to avoid duplicating every dpp pseudo opcode that has src1, we
allow it for all opcodes and add manual checks on subtargets that do not
support it.
There were a couple of issues with maintaining register def/uses held
in `MachineRegisterInfo`:
* when an operand is changed from one register to another, the
corresponding instruction must already be inserted into the function,
or MRI won't be updated
* when traversing the set of all uses of a register, that set must not
change
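The second point can be illustrated with a toy model, where a dict of sets stands in for the per-register use lists. This is not the MachineRegisterInfo API, and taking a snapshot is just one assumed way to respect the constraint, not necessarily what this change does.

```python
def replace_reg(uses, old_reg, new_reg):
    """Move every use of old_reg over to new_reg.

    uses maps a register name to the set of operands that use it. Moving an
    operand mutates uses[old_reg], so the walk is done over a snapshot.
    """
    for operand in list(uses.get(old_reg, ())):
        uses[old_reg].discard(operand)
        uses.setdefault(new_reg, set()).add(operand)
    return uses


use_lists = {"%0": {"op_a", "op_b"}, "%1": {"op_c"}}
replace_reg(use_lists, "%0", "%1")
```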
We expand aarch64_neon_rshrn intrinsics to trunc(srl(add)), having tablegen
patterns to combine the results back into rshrn. See D140297. Unfortunately,
but perhaps not surprisingly, other combines can happen that prevent us
converting back. For example sext(rshrn) becomes sext(trunc(srl(add))) which
will turn into sext_inreg(srl(add)).
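For reference, a scalar sketch of the trunc(srl(add)) form for one lane, assuming a 16-bit source narrowed to 8 bits. The function name and default width are illustrative only.

```python
def rshrn(x, shift, src_bits=16):
    """Rounding shift right narrow for one lane, as trunc(srl(add)).

    x is an unsigned src_bits-wide value; the result has src_bits // 2 bits.
    """
    dst_bits = src_bits // 2
    src_mask = (1 << src_bits) - 1
    dst_mask = (1 << dst_bits) - 1
    rounded = (x + (1 << (shift - 1))) & src_mask  # add the rounding constant
    shifted = rounded >> shift                     # srl
    return shifted & dst_mask                      # trunc
```

For example, `rshrn(0x0008, 4)` rounds up to 1, while `rshrn(0x0007, 4)` rounds down to 0.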
This patch just prevents the expansion of rshrn intrinsics, reinstating the old
tablegen patterns for selecting them. This should allow us to still recognize
the rshrn instructions from trunc+shift+add, without performing any negative
optimizations for the intrinsics.
Closes #67451
The legalizer currently generates lots of G_AND artifacts.
For example between boolean uses and defs there is always a G_AND with a mask of 1, but when the target uses ZeroOrOneBooleanContents, this is unnecessary.
Currently these artifacts have to be removed using post-legalize combines.
Omitting these artifacts at their source in the artifact combiner has a few advantages:
- We know that the emitted G_AND is very likely to be useless, so our KnownBits call is likely worth it.
- The G_AND and G_CONSTANT can interrupt e.g. G_UADDE/... sequences generated during legalization of wide adds which makes it harder to detect these sequences in the instruction selector (e.g. useful to prevent unnecessary reloading of AArch64 NZCV register).
- This cleans up a lot of legalizer output and even improves compilation-times.
AArch64 CTMark geomean: `O0` -5.6% size..text; `O0` and `O3` ~-0.9% compilation-time (instruction count).
Since this introduces KnownBits into code-paths used by `O0`, I reduced the default recursion depth.
This doesn't seem to make a difference in CTMark, but should prevent excessive recursive calls in the worst case.
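A minimal sketch of the redundancy test, assuming known bits are given as a plain known-zero bitmask. `and_is_redundant` is a hypothetical helper, not the GlobalISel KnownBits API.

```python
def and_is_redundant(known_zero, mask, width):
    """Return True if `x & mask` cannot change x.

    known_zero has a 1 for every bit of x that is known to be zero. The AND
    is a no-op when every bit it would clear is already known to be zero.
    """
    all_ones = (1 << width) - 1
    cleared_by_mask = ~mask & all_ones
    possibly_set = ~known_zero & all_ones
    return (cleared_by_mask & possibly_set) == 0


# An 8-bit boolean with ZeroOrOneBooleanContents: bits 1..7 are known zero,
# so a G_AND with mask 1 is redundant.
redundant = and_is_redundant(known_zero=0xFE, mask=0x01, width=8)
# With nothing known about the value, the mask has to stay.
needed = not and_is_redundant(known_zero=0x00, mask=0x01, width=8)
```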
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D159140
This fixes a crash seen in https://github.com/openxla/iree/issues/15038
and elsewhere. We were reducing the LMUL for inserts into undef at 0
without inserting it back into the original LMUL at the end. But we
don't actually perform the slidedown in this path, so we can just skip
reducing LMUL here.
The PowerPC backend generates calls to libc functions
for soft-float, regardless of the -nostdlib/-ffreestanding flags.
fma is not a function provided by compiler-rt builtins and
thus should not be generated here.
PR: https://github.com/llvm/llvm-project/issues/55230
Below is the patch given by @nemanjai.
Reviewed By: jhibbits
Differential Revision: https://reviews.llvm.org/D156344