This is a fix for a crash in the HexagonOptAddrMode pass that was looking
for the third operand (offset) in the following instruction that does not,
in fact, have a third operand:
$r1 = L2_loadw_locked $r1
Additionally, this patch also adds an addrMode value to vgather pseudos
in the Hexagon backend.
Differential Revision: https://reviews.llvm.org/D117133
This commit fixes a missed opportunity in merging consecutive stores.
The code that searches for stores skipped the case of stores that
directly connect to the root. The comment above the implementation lists
this case but the code did not handle it. I found this pattern when
looking into the shared_ptr destructor. GCC generates the right
sequence. Here is a small repo:
int foo(int* buff) {
buff[0] = 0;
int x = buff[1];
buff[1] = 0;
return x;
}
Differential Revision: https://reviews.llvm.org/D116895
Lower select(I1,Q,Q) by converting vector predicate Q to vector register V,
doing select(I1,V,V), and then converting the resulting V back to Q. Also,
try to avoid creating such situations in the first place.
This change extends the addressing mode optimization
pass to HVX vgather. This is specifically intended to
resolve compiler not generating indexed addresses for
vgather stores to vtcm. Changed the vgather pseudo
instructions to accept an immediate operand and handled
addition of appropriate immediate operand in addressing
mode optimization pass.
Ideally we should make USR as Def for these floating point instructions.
However, it violates some assembler MCChecker rules. This patch fixes
the issue by marking these FP instructions as non-sinkable.
For code below:
{
r7 = addasl(r3,r0,#2)
r8 = addasl(r3,r2,#2)
r5 = memw(r3+r0<<#2)
r6 = memw(r3+r2<<#2)
}
{
p1 = cmp.gtu(r6,r5)
if (p1.new) memw(r8+#0) = r5
if (p1.new) memw(r7+#0) = r6
}
{
r0 = mux(p1,r2,r4)
}
In packetizer, a new packet is created for the cmp instruction since
there arent enough resources in previous packet. Also it is determined
that the cmp stalls by 2 cycles since it depends on the prior load of r5.
In current packetizer implementation, the predicated store is evaluated
for whether it can go in the same packet as compare, and since the compare
stalls, the stall of the predicated store does not matter and it can go in
the same packet as the cmp. However the predicated store will stall for
more cycles because of its dependence on the addasl instruction and to
avoid that stall we can put it in a new packet.
Improve the packetizer to check if an instruction being added to packet
will stall longer than instruction already in packet and if so create a
new packet.
The HexagonVectorCombine pass was moving an instruction
incorrectly, which caused a use in a GEP that was not yet
defined.
HexagonVectorCombine removes a load from a group due to its
dependences, but in realignGroup, the load is processed anyways.
In realignGroup, when determining the maximum alignment, only
those instructions still in the group should be considered.
This reverts commit ba07f300c6d67a2c6dde8eef216b7a77ac4600bb.
A build-vector sequence is made of pairs: rotate+insert. When constructing
a single vector, this results in a chain of 2*N instructions. The rotate
operation is a permute operation, but the insert uses a multiplication
resource: insert and rotate can execute in the same cycle, but obviously
they cannot operate on the same vector. The original halving idea is still
beneficial since it does allow for insert/rotate overlap, and for hiding
insert's latency.
For vectors with repeating values, old codegen would rotate and insert
every duplicate element. This patch replaces that behavior with a splat
of the most common element, vinsert/vror only occur when needed.
They are the same as for the other HVX vectors, but types need to be
listed explicitly. Also, add a detailed codegen testcase.
Co-authored-by: Abhikrant Sharma <quic_abhikran@quicinc.com>
Adds x-ray support for hexagon to llvm codegen, clang driver,
compiler-rt libs.
Differential Revision: https://reviews.llvm.org/D113638
Reapplying this after 543a9ad7c460bb8d641b1b7c67bbc032c9bfdb45,
which fixes the leak introduced there.
Most changes are rewording comments but there are some assertions that I rephrased.
Reviewed By: kparzysz
Differential Revision: https://reviews.llvm.org/D114132
In TwoAddressInstructionPass::processTiedPairs when updating live
intervals after moving the last use of RegB back to the newly inserted
copy, update any affected subranges as well as the main range.
Differential Revision: https://reviews.llvm.org/D110411
In TwoAddressInstructionPass::processTiedPairs, update subranges of the
live interval for RegB as well as the main range.
This is a small step towards switching TwoAddressInstructionPass over
from LiveVariables to LiveIntervals. Currently this path is only tested
if you explicitly enable -early-live-intervals.
Differential Revision: https://reviews.llvm.org/D110526
In repairIntervalsInRange, if the new instructions refer to subregs but
the old instructions did not, make sure any existing live interval for
the superreg is updated to have subranges. Also skip repairing any range
that we have recalculated from scratch, partly for efficiency but also
to avoids some cases that repairOldRegInRange can't handle.
The existing test/CodeGen/AMDGPU/twoaddr-regsequence.mir provides some
test coverage for this change: when TwoAddressInstructionPass converts
REG_SEQUENCE into subreg copies, the live intervals will now get
subranges and MachineVerifier will verify that the subranges are
correct. Unfortunately MachineVerifier does not complain if the
subranges are not present, so the test also passed before this patch.
This patch also fixes ~800 of the ~1500 failures in the whole CodeGen
lit test suite when -early-live-intervals is forced on.
Differential Revision: https://reviews.llvm.org/D110328
The fmul is a canonicalizing operation, and fneg is not so this would
break denormals that need flushing and also would not quiet signaling
nans. Fold to fsub instead, which is also canonicalizing.
This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heuristic to
use. This was taking the raw number of registers in the class, even
though not all of them may be available. AMDGPU heavily relies on
dynamically reserved numbers of registers based on user attributes to
satisfy occupancy constraints, so the raw number is highly misleading.
There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.
The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst looking regression is likely
test/Thumb2/mve-vld4.ll which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.
The code was using getTypeStoreSize to calculate the difference
between consecutive objects. The calculation was incorrect due
to padding that is added between consecutive objects. The
getTypeAllocSize includes the padding amount. For example,
if the type is [19 x i8], the difference between consecutive
objects is 32 bytes, not 19 bytes.
A second case for getTypeAllocSize is needed when computing
the pointer values for the vector accesses. The calculation needs
to account for the padding as well.
Differential Revision: https://reviews.llvm.org/D109403
Currently, opaque pointers are supported in two forms: The
-force-opaque-pointers mode, where all pointers are opaque and
typed pointers do not exist. And as a simple ptr type that can
coexist with typed pointers.
This patch removes support for the mixed mode. You either get
typed pointers, or you get opaque pointers, but not both. In the
(current) default mode, using ptr is forbidden. In -opaque-pointers
mode, all pointers are opaque.
The motivation here is that the mixed mode introduces additional
issues that don't exist in fully opaque mode. D105155 is an example
of a design problem. Looking at D109259, it would probably need
additional work to support mixed mode (e.g. to generate GEPs for
typed base but opaque result). Mixed mode will also end up
inserting many casts between i8* and ptr, which would require
significant additional work to consistently avoid.
I don't think the mixed mode is particularly valuable, as it
doesn't align with our end goal. The only thing I've found it to
be moderately useful for is adding some opaque pointer tests in
between typed pointer tests, but I think we can live without that.
Differential Revision: https://reviews.llvm.org/D109290
We already have reassociation code for Adds and Ors separately in DAG
combiner, this adds it for the combination of the two where Ors act like
Adds. It reassociates (add (or (x, c), y) -> (add (add (x, y), c)) where
we know that the Ors operands have no common bits set, and the Or has
one use.
Differential Revision: https://reviews.llvm.org/D104765
This applies to memory accesses to (compile-time) constant addresses
(such as memory-mapped registers). Currently when a misaligned access
to such an address is detected, a fatal error is reported. This change
will emit a remark, and the compilation will continue with a trap,
and "undef" (for loads) emitted.
This fixes https://llvm.org/PR50838.
Differential Revision: https://reviews.llvm.org/D50524