We already have cost model code for detecting extending mull multiplies
for the form `mul(ext, ext)`. Since it was added, the codegen for mull
has been improved; this patch attempts to bring the cost model up to date.
The main idea is to incorporate extends of larger sizes. A vector `v8i32
mul(zext(v8i8), zext(v8i8))` is code-generated as `zext(v8i16
mul(zext(v8i8), zext(v8i8)))`, i.e. umull+ushll+ushll2.
So the total cost should be about 3 if each instruction costs 1. Where
exactly the costs are attributed is debatable; this patch opts to set
the cost of the extend to 0 (or to the cost of the part of the extend not
included in the mull), and the mul gets the cost of the mull plus the
extra extends.
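The cost split described above can be sketched as plain arithmetic. This is a minimal illustration, not the AArch64 TTI code: the unit costs, the 128-bit register assumption, the single extra widening step and the helper names (`extraExtendInsts`, `attributeMullCost`, `totalPatternCost`) are all hypothetical.

```cpp
#include <cassert>

// Hypothetical cost split for mul(ext, ext) lowered to a mull plus extra
// extends. Unit costs are assumptions, not real AArch64 TTI values.
struct MullCosts {
  int ExtendCost; // cost charged to each zext operand
  int MulCost;    // cost charged to the mul
};

// Number of ushll/ushll2 instructions needed after the mull. Assumes
// 128-bit NEON registers and at most one extra widening step
// (e.g. v8i16 -> v8i32), where each output register needs one instruction.
int extraExtendInsts(int NumElts, int SrcEltBits, int DstEltBits) {
  if (DstEltBits <= 2 * SrcEltBits)
    return 0; // the mull already produces the destination type
  assert(DstEltBits == 4 * SrcEltBits && "sketch handles one extra step only");
  return (NumElts * DstEltBits) / 128;
}

// The extends are free; the mul pays for the mull plus the extra extends.
MullCosts attributeMullCost(int MullCost, int NumElts, int SrcEltBits,
                            int DstEltBits, int ExtendInstCost) {
  return {0, MullCost + extraExtendInsts(NumElts, SrcEltBits, DstEltBits) *
                            ExtendInstCost};
}

// Total cost of the whole pattern: the mul plus its two extend operands.
int totalPatternCost(const MullCosts &C) {
  return C.MulCost + 2 * C.ExtendCost;
}
```

For the `v8i32` example, `attributeMullCost(1, 8, 8, 32, 1)` charges 3 to the mul and 0 to each zext, so the pattern as a whole still totals 3.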
isWideningInstruction is split into two functions, one for each of the two
operand kinds it supports. isSingleExtWideningInstruction now handles addw
instructions that extend the second operand, while
isBinExtWideningInstruction handles instructions like addl that extend
both operands.
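The distinction between the two helpers can be sketched with a simplified operand model. This is only an illustration: the `Operand` struct and the two predicates are hypothetical stand-ins, not the signatures of isSingleExtWideningInstruction or isBinExtWideningInstruction, and the sketch keeps the two cases disjoint by requiring the first operand of the addw form to be un-extended.

```cpp
#include <cassert>

// Hypothetical, simplified operand model: only records whether the operand
// is a sext/zext.
struct Operand {
  bool IsExtend;
};

// addw-style (e.g. uaddw): the first operand is already wide, only the
// second operand is extended.
bool isSingleExtWidening(const Operand &LHS, const Operand &RHS) {
  return !LHS.IsExtend && RHS.IsExtend;
}

// addl-style (e.g. uaddl): both operands are extended.
bool isBinExtWidening(const Operand &LHS, const Operand &RHS) {
  return LHS.IsExtend && RHS.IsExtend;
}
```

With this model, `add(x, zext(y))` classifies as the addw form, and `add(zext(x), zext(y))` as the addl form.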
If the parent node is non-schedulable (only externally used instructions), and at least one instruction has multiple uses and is used in the binop, such a copyable node should be created. Otherwise, the node may model a wrong def-use chain, which cannot be detected effectively.
Fixes #166035
If the alternate operation is stricter than the main operation, we
cannot rely on the analysis of the main operation. In such a case, it is
better to avoid doing the analysis at all, since it may affect the overall
result and lead to an incorrect optimization.
Fixes #165878
If the gather/buildvector node has a match, this matching node has
a scheduled copyable parent, and the parent node of the original node
has a last instruction that is non-schedulable and is part of the
scheduled copyable parent, such a matching node should be excluded as
non-matching, since it produces a wrong def-use chain.
Fixes #165435
The instruction with the non-schedulable parent needs to be re-checked
only if this parent has a user PHI node (i.e. it is used only outside the
block) and the user instruction has a unique parent instruction.
Fixes issue reported in 20675ee67d (commitcomment-168863594)
If the instructions in the node do not require scheduling and are used
only outside the basic block, we still need to check whether their
operands are non-instructions too. Such nodes should be emitted at the
beginning of the block.
Fixes #165151
If the parent node is non-schedulable and it includes several copies of
the same instruction, its operand might be replaced by the copyable
nodes in multiple children nodes, and if the instruction is commutative,
they can be used in different operands. The compiler must account for
this possibility, taking into account that non-copyable children are
scheduled only once for the same parent instruction.
Fixes #164242
If the main instruction in the copyables is a div-like instruction, the
compiler cannot pack duplicates by padding them with poison values: once
vectorized, these instructions would result in undefined behavior.
Fixes #164185
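The reason padding is unsafe comes from LLVM IR semantics: a poison divisor for `udiv`, `sdiv`, `urem` or `srem` is immediate undefined behavior, so a single poison lane makes the whole vector division UB. The following is a minimal sketch of the resulting rule; the `Opcode` enum and both helpers are hypothetical and are not LLVM's `Instruction::BinaryOps` API.

```cpp
#include <cassert>

// Hypothetical opcode enum for illustration only.
enum class Opcode { Add, Sub, Mul, UDiv, SDiv, URem, SRem };

// Division and remainder are UB when the divisor is poison (it may be 0).
bool isDivLike(Opcode Op) {
  switch (Op) {
  case Opcode::UDiv:
  case Opcode::SDiv:
  case Opcode::URem:
  case Opcode::SRem:
    return true;
  default:
    return false;
  }
}

// Duplicates in a copyable bundle may be packed, with the freed lanes
// filled with poison, only when the main opcode is not div-like.
bool canPadWithPoison(Opcode MainOp) { return !isDivLike(MainOp); }
```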
The `masked.load`, `masked.store`, `masked.gather` and `masked.scatter`
intrinsics currently accept a separate alignment immarg. Replace this
with an `align` attribute on the pointer / vector of pointers argument.
This is the standard representation for alignment information on
intrinsics, and is already used by all other memory intrinsics. This
means the signatures now match llvm.expandload, llvm.vp.load, etc.
(Things like llvm.memcpy used to have a separate alignment argument as
well, but were already migrated a long time ago.)
It's worth noting that the masked.gather and masked.scatter intrinsics
previously accepted a zero alignment to indicate the ABI type alignment
of the element type. This special case is gone now: If the align
attribute is omitted, the implied alignment is 1, as usual. If ABI
alignment is desired, it needs to be explicitly emitted (which the
IRBuilder API already requires anyway).
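The change in how the effective alignment is derived can be summarized in a short sketch. Both functions are hypothetical helpers modelling the rules stated above, not LLVM APIs: the old gather/scatter form treated an immarg of 0 as "ABI alignment of the element type", while the new form reads an optional `align` attribute and falls back to 1.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Old masked.gather/masked.scatter: separate alignment immarg, where 0
// meant the ABI alignment of the element type.
uint64_t oldGatherScatterAlign(uint64_t AlignImm, uint64_t ElemABIAlign) {
  return AlignImm == 0 ? ElemABIAlign : AlignImm;
}

// New form: optional `align` attribute on the pointer (or vector of
// pointers) argument; if it is omitted, the implied alignment is 1.
uint64_t newMaskedAlign(std::optional<uint64_t> AlignAttr) {
  return AlignAttr.value_or(1);
}
```

A frontend that relied on the zero special case therefore has to emit the element type's ABI alignment explicitly as an `align` attribute.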
The compiler should not check the counter of copyable operands for
overflow; otherwise, a non-copyable operand can be counted as copyable,
leading to a compiler crash.
Fixes #164164
If the copyable entry has a last instruction that is used only outside
the block, the insertion point for the vector code should be the last
instruction itself, not the following one. This prevents wrong def-use
sequences, which might otherwise be generated for the buildvector nodes.
Fixes #163404
The vector value for the postponed gather/buildvector node needs to be
inserted after all uses not only if the vector value of the user node is
a PHI, but also if the user node itself is a PHI node, which may produce
a vector PHI + shuffle.
Fixes #162799
If the non-commutative user has several identical operands and at least
one of them (but not the first) is copyable, this case must be considered
when calculating the number of dependencies. Otherwise, the schedule
bundle might not be scheduled correctly and cause a compiler crash.
Fixes #162925
If all operands of the non-schedulable nodes were previously only
copyables, we need to clear the dependencies of the original schedule
data for such copyable operands and recalculate them to correctly handle
the number of dependencies.
Fixes #159406
If the commutable instruction can be represented as a non-commutable
vector instruction (like add 0, %v, which can be represented as part of
sub nodes with operation sub %v, 0), its operands might still be
reordered, and this should be accounted for when checking for copyables
in operands.
Fixes #158293
If the original instruction is going to be scheduled after the same
instruction has been scheduled as a copyable, its dependencies need to
be recalculated. Otherwise, the dependencies may be calculated
incorrectly.
If the user node of the SExt/ZExt node is a bitcast to a floating-point
type, the node itself should not be considered legal to demote, since
the cast is still required to match the size of the floating-point type.
Fixes #157277
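The underlying constraint is that a bitcast needs source and destination of identical bit width. For example, demoting a `zext i16 to i32` that feeds `bitcast i32 to float` would leave a 16-bit value that cannot be bitcast to the 32-bit float. The sketch below models only this one rule; both helpers are hypothetical and other demotion legality checks are deliberately omitted.

```cpp
#include <cassert>

// A bitcast requires the source and destination types to have the same
// total bit width (e.g. i32 <-> float).
bool bitcastWidthsMatch(unsigned SrcBits, unsigned DstBits) {
  return SrcBits == DstBits;
}

// Hypothetical check modelling only the rule from this commit: a sext/zext
// whose user is a bitcast to a floating-point type is not demotable.
bool isLegalToDemoteExt(bool UserIsBitCastToFP) {
  return !UserIsBitCastToFP;
}
```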
If standalone schedule data relates to a vectorized instruction, it still
needs to be scheduled as part of a pseudo-bundle to correctly handle
dependencies between its child nodes.
The 256-bit maximum vector register size control was removed from the
AVX10 whitepaper, ref: https://cdrdv2.intel.com/v1/dl/getContent/784343
We have warned about these options in LLVM 21 through #132542. This patch
removes the underlying implementations in LLVM 22.
Commutable instructions can be reordered during tree building, and if
the parent node is not scheduled, its ScheduleData elements are
considered independent and the compiler does not look for reordered
operands. Scheduling of copyables needs to be canceled in this case.
In the initial patch for FMAD, potential FMAD nodes were completely
excluded from the reduction analysis to keep the patch smaller. But this
may cause regressions.
This patch adds better detection of scalar FMAD reduction operations and
tries to correctly calculate the costs of the FMAD reduction operations
(also excluding the costs of the scalar fmuls) and split reduction
operations, combined with regular FMADs.
Also fixed the handling of reduced values with many uses.
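Why the scalar fmul costs should be excluded can be seen with simple arithmetic on a dot-product style reduction `a[0]*b[0] + ... + a[N-1]*b[N-1]`. This is a minimal sketch with hypothetical unit costs; it does not reproduce the patch's actual cost queries. Once the fadds are fused into FMAs, all but the first fmul disappear, so counting them anyway overstates the scalar cost and biases the comparison against the scalar code.

```cpp
#include <cassert>

// Hypothetical per-instruction costs.
struct Costs {
  int FMul;
  int FAdd;
  int FMA;
};

// Scalar cost of reducing N products with fadd, without fusion:
// N fmuls plus (N - 1) fadds.
int scalarCostUnfused(const Costs &C, int N) {
  return N * C.FMul + (N - 1) * C.FAdd;
}

// With FMAD formation: the first product stays a plain fmul and each
// remaining product is fused with its reduction fadd into an FMA, so those
// fmuls are not counted separately.
int scalarCostFused(const Costs &C, int N) {
  return C.FMul + (N - 1) * C.FMA;
}
```

With every instruction costing 1 and N = 4, the unfused estimate is 7 while the fused scalar sequence costs only 4.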
Reviewers: RKSimon, gregbedwell, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/152787
Currently, stores are sorted by the instruction types of the stored
values, which does not include analysis for copyables. The compiler may
miss some potential vectorization opportunities because of that. This
patch adds detection of copyables in stored values.
Reviewers: hiraditya, HanKuanChen, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/153213
If the value is checked for the reduction and it is a copyable element
in a root node, it should not be deleted, since it may still be used
after vectorization.
Fixes #155512