If `C0` is a mask and `C1` shifts out all the masked bits (to
essentially compare two subsets of `X`), we can arbitrarily re-order
shift as `srl` or `shl`.
If `C1` (shift amount) is a power of 2, we can replace the and+shift
with a rotate.
Otherwise, based on target preference we can arbitrarily swap `shl`
and `shl` in/out to get better constants.
On x86 we can use this re-ordering to:
1) get better `and` constants for `C0` (zero extended moves or
avoid imm64).
2) covert `srl` to `shl` if `shl` will be implementable with `lea`
or `add` (both of which can be preferable).
Proofs: https://alive2.llvm.org/ce/z/qzGM_w
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D152116
Enable FoldImmediate for X86 by implementing X86InstrInfo::FoldImmediate.
Also enhanced peephole by deleting identical instructions after FoldImmediate.
Differential Revision: https://reviews.llvm.org/D151848
The last use of isPHIJoin was removed by:
commit fac770b865f59cbe615241dad153ad20d5138b9e
Author: Jakob Stoklund Olesen <stoklund@2pi.dk>
Date: Sat Feb 9 00:04:07 2013 +0000
so there is no reason to maintain PHIJoins.
The BuildVectorSDNode::isConstantSplat function could depend on
endianness, and it takes a bool argument that can be used to indicate
if big or little endian should be considered when internally casting
from a vector to a scalar. However, that argument is default set to
false (= little endian). And in many situations, even in target
generic code such as DAGCombiner, the endianness isn't specified when
using the function.
The intent with this patch is to highlight that endianness doesn't
matter, depending on the context in which the function is used.
In DAGCombiner the code is slightly refactored. Back in the days when
the code was written it wasn't possible to request a MinSplatBits
size when calling isConstantSplat. Instead the code re-expanded the
found SplatValue to match with the EltBitWidth. Now we can just
provide EltBitWidth as MinSplatBits and remove the logic for doing
the re-expand.
While being at it, tidying up around isConstantSplat, this patch also
adds an explicit check in BuildVectorSDNode::isConstantSplat to break
out from the loop if trying to split an on VecWidth into two halves.
Haven't been able to prove that there could be miscompiles involved
if not doing so. There are lit tests that trigger that scenario,
although I think they happen to later discard the returned SplatValue
for other reasons.
This reverts commit b5743d4798b250506965e07ebab806a3c2d767cc.
This causes some minor compile-time impact. Revert for now, better
to do the change more gradually.
Remove use after free when attempting to update SlotIndexes in
MachineBasicBlock::SplitCriticalEdge.
Use MachineFunction delegate mechanism to capture target specific
manipulations of branch instructions and update SlotIndexes.
The current lowering of statepoints does not take into account return
attributes present on the `gc.result` leading to different code being
generated than if one were to not use statepoints. These return
attributes can affect the ABI which is why it is important that they are
applied in the lowering.
Update `LegalizerHelper::widenScalarMulo` to not create a mulo if we aren't going to use the overflow flag. This prevents needing to legalize the widened operation. This generates better code when we need to make a libcall for multiply.
When narrowing logic ops(OR/XOR) with constant rhs, `DAGCombiner` will fixup the constant rhs node.
It is incorrect when lhs is also a constant. For example, we will incorrectly replace `xor OpaqueConstant:i64<8191>, Constant:i64<-1>` with `xor (and OpaqueConstant:i64<8191>, Constant:i64<65535>), Constant:i64<-1>`.
Fixes#68855.
The optimization in CodeGenPrepare, where GEPs are unmerged across
indirect branches must respect the types of both GEPs and their sizes
when adjusting the indices.
The sample here shows the bug:
https://godbolt.org/z/8e9o5sYPP
The value `%elementValuePtr` addresses the second field of the
`%struct.Blub`. It is therefore a GEP with index 1 and type i8.
The value `%nextArrayElement` addresses the next array element. It is
therefore a GEP with index 1 and type `%struct.Blub`.
Both values point to completely different addresses, even if the indices
are the same, due to the types being different.
However, after CodeGenPrepare has run, `%nextArrayElement` is a bitcast
from `%elementValuePtr`, meaning both were treated as equal.
The cause for this is that the unmerging optimization does not take
types into consideration.
It sees both GEPs have `%currentArrayElement` as source operand and
therefore tries to rewrite `%nextArrayElement` in terms of
`%elementValuePtr`.
It changes the index to the difference of the two GEPs. As both indices
are `1`, the difference is `0`. As the indices are `0` the GEP is later
replaced with a simple bitcast in CodeGenPrepare.
Before adjusting the indices, the types of the GEPs would have to be
aligned and the indices scaled accordingly for the optimization to be
correct.
Due to the size of the struct being `16` and the `%elementValuePtr`
pointing to offset `1`, the correct index for the unmerged
`%nextArrayElement` would be 15.
I assume this bug emerged from the opaque pointer change as GEPs like
`%elementValuePtr` that access the struct field based of type i8 did not
naturally occur before.
In light of future migration to ptradd, simply not performing the
optimization if the types mismatch should be sufficient.
Following up on prior RFC
(https://lists.llvm.org/pipermail/llvm-dev/2020-September/145357.html)
we can now improve above our highly-optimized basic-block-sections
binary (e.g., 2% for clang) by applying path cloning. Cloning can
improve performance by reducing taken branches.
This patch prepares the profile format for applying cloning actions.
The basic block cloning profile format extends the basic block sections
profile in two ways.
1. Specifies the cloning paths with a 'p' specifier. For example, `p 1 4
5` specifies that blocks with BB ids 4 and 5 must be cloned along the
edge 1 --> 4.
2. For each cloned block, it will appear in the cluster info as
`<bb_id>.<clone_id>` where `clone_id` is the id associated with this
clone.
For example, the following profile specifies one cloned block (2) and
determines its cluster position as well.
```
f foo
p 1 2
c 0 1 2.1 3 2 5
```
This patch keeps backward-compatibility (retains the behavior for old
profile formats). This feature is only introduced for profile version >=
1.
Split a virtual register with hint may generate COPY instructions in
multiple cold basic blocks, and increase code size. So disable this
split when the function is optimized for size.
Rename advanceMBBIndex and findMBBIndex to getMBBLowerBound and add
getMBBUpperBound.
The motivations are:
- Make it clear what kind of search is being done, using names inspired
by std::upper/lower_bound.
- Simplify getMBBFromIndex which really wants an upper bound search and
previously had to work hard to get the result it wanted from a lower
bound search.
G_TRUNC will get lowered into trunc(merge(trunc(unmerge),
trunc(unmerge))) if the source is larger than 128 bits or the truncation
is more than half of the current bit size.
Now mirrors ZEXT/SEXT code more closely for vector types.
After a pass calls addRequired<X>() it is strange to call
getAnalysisIfAvailable<X>() because analysis X should always be
available. Use getAnalysis<X>() instead.
The change implements support of the intrinsics `get_fpmode`,
`set_fpmode` and `reset_fpmode` in Global Instruction Selector. Now they
are lowered into library function calls.
Differential Revision: https://reviews.llvm.org/D158260
Skip LICM if PHI belongs to the current loop, e.g. is in the
loop's header. This prevents LICM from bailing for CFGs like
L1:
R = LoopInvariant // can be LICM'd
BR L1
L2:
PHI(R, ..)
BR L2
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
Now that llvm::support::endianness has been renamed to
llvm::endianness, we can use the shorter form. This patch replaces
llvm::support::endianness with llvm::endianness.
After the SinkAndFold optimization was enabled, we saw some crashes with
GISel due to SinkAndFold erasing an MI while a reference was being held in a
cache.
This patch fixes:
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:10832:12: error:
variable 'Changed' set but not used
[-Werror,-Wunused-but-set-variable]
Temporal divergence that was present in input or introduced in IR
transforms, like code-sinking or LICM, is handled in SIFixSGPRCopies
by changing sgpr source instr to vgpr instr.
After 5b657f5, that moved LICM after AMDGPUCodeGenPrepare,
machine-sinking can introduce temporal divergence by sinking
instructions outside of the cycle.
Add isSafeToSink callback in TargetInstrInfo.
- Refactor the (Machine)BlockFrequencyInfo::printBlockFreq functions
into a `PrintBlockFreq()` function returning a `Printable` object. This
simplifies usage as it can be directly piped to a `raw_ostream` like
`dbgs() << PrintBlockFreq(MBFI, Freq) << '\n';`.
- Previously there was an interesting behavior where
`BlockFrequencyInfoImpl` stores frequencies both as a `Scaled64` number
and as an `uint64_t`. Most algorithms use the `BlockFrequency`
abstraction with the integers, the print function for basic blocks
printed the `Scaled64` number potentially showing higher accuracy than
was used by the algorithm. This changes things to only print
`BlockFrequency` values.
- Replace some instances of `dbgs() << Freq.getFrequency()` with the new
function.
The `BlockFrequency` class abstracts `uint64_t` frequency values. Use it
more consistently in various APIs and disable implicit conversion to
make usage more consistent and explicit.
- Use `BlockFrequency Freq` parameter for `setBlockFreq`,
`getProfileCountFromFreq` and `setBlockFreqAndScale` functions.
- Return `BlockFrequency` in `getEntryFreq()` functions.
- While on it change some `const BlockFrequency& Freq` parameters to
plain `BlockFreqency Freq`.
- Mark `BlockFrequency(uint64_t)` constructor as explicit.
- Add missing `BlockFrequency::operator!=`.
- Remove `uint64_t BlockFreqency::getMaxFrequency()`.
- Add `BlockFrequency BlockFrequency::max()` function.
Need to add NumSrcElts param to is..Mask functions in
ShuffleVectorInstruction class for better mask analysis. Mask.size() not
always matches the sizes of the permuted vector(s). Allows to better
estimate the cost in SLP and fix uses of the functions in other cases.
Differential Revision: https://reviews.llvm.org/D158449