This fixes an expensive chesk failure after 8476a5d480304. The issue
was essentially that getRegClassConstraintEffectForVReg was not doing
anything useful, sometimes. If the register passed to it is not present
in the instruction, it is a no-op and returns the original classe. The
Edit->getReg() register may not be the register as it appears in either
the use or def instruction. It may be some split register, so take
the register directly from the instruction being rematerialized.
Also directly query the constraint from the def instruction, with a
hardcoded operand index. This isn't ideal, but all the other
rematerialize
code makes the same assumption.
So far I've been unable to reproduce this with a standalone MIR test. In
the
original case, stop-before=greedy and running the one pass is not
working.
The module currently stores the target triple as a string. This means
that any code that wants to actually use the triple first has to
instantiate a Triple, which is somewhat expensive. The change in #121652
caused a moderate compile-time regression due to this. While it would be
easy enough to work around, I think that architecturally, it makes more
sense to store the parsed Triple in the module, so that it can always be
directly queried.
For this change, I've opted not to add any magic conversions between
std::string and Triple for backwards-compatibilty purses, and instead
write out needed Triple()s or str()s explicitly. This is because I think
a decent number of them should be changed to work on Triple as well, to
avoid unnecessary conversions back and forth.
The only interesting part in this patch is that the default triple is
Triple("") instead of Triple() to preserve existing behavior. The former
defaults to using the ELF object format instead of unknown object
format. We should fix that as well.
This option was added to improve test coverage for SVE lowering code
that is impossible to reach otherwise. Given it is not possible to
trigger a bug without it and the generated code is universally worse
with it, I figure the option has no value and should be removed.
Despite the name, the HasAddressTaken() heuristic identifies not only
allocas that have their address taken, but also those that have accesses
that cannot be proven to be in-bounds.
However, the current handling for phi nodes is incorrect. Phi nodes are
only visited once, and will perform the analysis using whichever
(remaining) allocation size is passed the first time the phi node is
visited. If it is later visited with a smaller remaining size, which may
lead to out of bounds accesses, it will not be detected.
Fix this by keeping track of the smallest seen remaining allocation size
and redo the analysis if it is decreased. To avoid degenerate cases
(including via loops), limit the number of allowed decreases to a small
number.
This patch makes it so that onEviction actually gets called when the
model ends up selecting the candidate to evict. Where we were handling
this previously ended up being dead code as we would return earlier with
MCRegister::NoRegister.
Fixes#129841.
The old hack of returning v5/v6i32 for the fat and strided buffer
pointers was causing issuse during vectorization queries that expected
to be able to construct a VectorType from the return value of `MVT
getPointerType()`. On example is in the test attached to this PR, which
used to crash.
Now, we define the custom MVT entries, the 160-bit
amdgpuBufferFatPointer and 192-bit amdgpuBufferStridedPointer, which are
used to represent ptr addrspace(7) and ptr addrspace(9) respectively.
Neither of these types will be present at the time of lowering to a
SelectionDAG or other MIR - MVT::amdgpuBufferFatPointer is eliminated by
the LowerBufferFatPointers pass and amdgpu::bufferStridedPointer is not
currently used outside of the SPIR-V translator (which does its own
lowering).
An alternative solution would be to add MVT::i160 and MVT::i192. We
elect not to do this now as it would require changes to unrelated code
and runs the risk of breaking any SelectionDAG code that assumes that
the MVT series are all powers of two (and so can be split apart and
merged back together) in ways that wouldn't be obvious if someone tried
to use MVT::i160 in codegen. If i160 is added at some future point,
these custom types can be retired.
Following 15e295d the machine scheduler no longer filters-out single-MI
regions when emitting regions to schedule. While this has no functional
impact at the moment, it generally has a negative compile-time impact
(see #128739).
Since all targets but AMDGPU do not care for this behavior, this
introduces an off-by-default flag to `ScheduleDAGInstrs` to control
whether such regions are going to be scheduled, effectively reverting
15e295d for all targets but AMDGPU (currently the only target enabling
this flag).
Add generic DAG combine for ISD::PARTIAL_REDUCE_U/SMLA nodes. Transforms
the DAG from:
PARTIAL_REDUCE_MLA(Acc, MUL(EXT(MulOpLHS), EXT(MulOpRHS)), Splat(1)) to
PARTIAL_REDUCE_MLA(Acc, MulOpLHS, MulOpRHS).
For most targets, the register class comes from the type so this
makes no difference. For AMDGPU, the selected register class depends
on the divergence of the value. For a constant phi input, this will
always be false. The heuristic for whether to treat the value as
a scalar or vector constant based on the uses would then incorrectly
think this is a scalar use, when really the phi is a copy from S to V.
This avoids an intermediate s_mov_b32 plus a copy in some cases. These
would often, but not always, fold out in mi passes.
This only adjusts the constant input case. It may make sense to do
this for the non-constant case as well.
This fixes an allocation failure in the new test.
In cases where getLargestLegalSuperClass can inflate the register class,
rematerialization could effectively undo a split which was done to
inflate
the register class, if the defining instruction can only write a
subclass
and the use can read the superclass.
Some of the x86 tests changes look like improvements, but some are
likely regressions.
I'm not entirely sure this is the correct place to fix this. It also
seems more complicated than necessary, but the decision to change
the register class is far removed from the point where the decision
to split the virtual register is made. I'm also also not sure if this
should be considering the register classes of all the use indexes
in getUseSlots, rather than just checking if this use index instruction
reads the register.
Failure to sink a candidate should not block us from attempting to sink
other candidates. There are mechanisms in place to handle the case where
the failed to be sunk instruction uses an instruction that gets sunk (we
do not delete the original instruction corresponding with the sunk
instruction if it still has uses).
If a (two-result) node like `FMODF` or `FFREXP` is expanded to a library
call, where said library has the function prototype like: `float(float,
float*)` -- that is it returns a float from the call and via an output
pointer. The first result of the node maps to the value returned by
value and the second result maps to the value returned via the output
pointer.
If only the second result is used after the expansion, we hit an issue
on x87 targets:
```
// Before expansion:
t0, t1 = fmodf x
return t1 // t0 is unused
```
Expanded result:
```
ptr = alloca
ch0 = call modf ptr
t0, ch1 = copy_from_reg, ch0 // t0 unused
t1, ch2 = ldr ptr, ch1
return t1
```
So far things are alright, but the DAGCombiner optimizes this to:
```
ptr = alloca
ch0 = call modf ptr
// copy_from_reg optimized out
t1, ch1 = ldr ptr, ch0
return t1
```
On most targets this is fine. The optimized out `copy_from_reg` is
unused and is a NOP. However, x87 uses a floating-point stack, and if
the `copy_from_reg` is optimized out it won't emit a pop needed to
remove the unused result.
The prior solution for this was to attach the chain from the
`copy_from_reg` to the root, which did work, however, the root is not
always available (it's set to null during legalize types). So the
alternate solution in this patch is to replace the `copy_from_reg` with
an `X86ISD::POP_FROM_X87_REG` within the X86 call lowering. This node is
the same as `copy_from_reg` except this node makes it explicit that it
may lower to an x87 FPU stack pop. Optimizations should be more cautious
when handling this node than a normal CopyFromReg to avoid removing a
required FPU stack pop.
```
ptr = alloca
ch0 = call modf ptr
t0, ch1 = pop_from_x87_reg, ch0 // t0 unused
t1, ch2 = ldr ptr, ch1
return t1
```
Using this node ensures a required x87 FPU pop is not removed due to the
DAGCombiner.
This is an alternate solution for #127976.