Based on discussion from
https://discourse.llvm.org/t/rfc-vectorization-support-for-histogram-count-operations/74788
Current interface is:
llvm.experimental.histogram(<vecty> ptrs, <intty> inc_amount, <vecty> mask)
The integer type used by 'inc_amount' needs to match the type of the buckets in memory.
The intrinsic covers the following operations:
* Gather load
* histogram on the elements of 'ptrs'
* multiply the histogram results by 'inc_amount'
* add the result of the multiply to the values loaded by the gather
* scatter store the results of the add
Supports lowering to histcnt instructions for AArch64 targets, and scalarization for all others at present.
If we are lowering a frem and the divisor is known to be an integer power-2, we
can use the formula 'frem = x - trunc(x / d) * d'. This avoids the more
expensive call to fmod. The results are identical as fmod so long as d is a
power-2 (so the mul does not round incorrectly), and the sign of the return is
either always positive or not important for zeroes (nsz).
Unfortunately Alive2 does not handle this well at the moment. I was using
exhaustive checking to test this:
(https://gist.github.com/davemgreen/6078015f30d3bacd1e9572f8db5d4b64).
I found this in cpythons implementation of float_pow. I currently added it as a
DAG combine for frem with power-2 fp constants.
If the shuffle mask contains no undef elements, then we can move the freeze through a shuffle node.
This requires special case handling to create a new ShuffleVectorSDNode.
Includes VECTOR_SHUFFLE support for isGuaranteedNotToBeUndefOrPoison / canCreateUndefOrPoison.
If this flag is set, Xor will not be considered AddLike. If an Xor were
treated as an Add it may wrap. If we can prove there would be no carry out and
thus no wrap, the Xor would be turned into a disjoint Or by DAGCombine.
Use this new flag to fix a bug in X86 where an Xor is incorrectly being treated
as an NUWAdd.
Fixes#90668.
This reverts commit 16bd10a38730fed27a3bf111076b8ef7a7e7b3ee.
Re-applies:
b3c55b707110084a9f50a16aade34c3be6fa18da - "[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)"
8e2f6495c0bac1dd6ee32b6a0d24152c9c343624 - "[DAGCombiner] Do not always fold FREEZE over BUILD_VECTOR (#85932)"
73472c5996716cda0dbb3ddb788304e0e7e6a323 - "[SelectionDAG] Treat CopyFromReg as freezing the value (#85932)"
with a fix in DAGCombiner::visitFREEZE.
This reverts:
b3c55b707110084a9f50a16aade34c3be6fa18da - "[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)"
(because it updates a test case that I don't know how to resolve the conflict for)
8e2f6495c0bac1dd6ee32b6a0d24152c9c343624 - "[DAGCombiner] Do not always fold FREEZE over BUILD_VECTOR (#85932)"
73472c5996716cda0dbb3ddb788304e0e7e6a323 - "[SelectionDAG] Treat CopyFromReg as freezing the value (#85932)"
Due to a test suite failure on AArch64 when compiling for SVE.
https://lab.llvm.org/buildbot/#/builders/197/builds/13955
clang: ../llvm/llvm/include/llvm/CodeGen/ValueTypes.h:307: MVT llvm::EVT::getSimpleVT() const: Assertion `isSimple() && "Expected a SimpleValueType!"' failed.
[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison
Handle SELECT_CC similarly as SETCC.
Handle these operations that only propagate poison/undef based on the
input operands:
SADDSAT, UADDSAT, SSUBSAT, USUBSAT, MULHU, MULHS,
SMIN, SMAX, UMIN, UMAX
These operations may create poison based on shift amount and exact
flag being violated:
SRL, SRA
One goal here is to allow pushing freeze through these operations
when allowed, as well as letting analyses such as
isGuaranteedNotToBeUndefOrPoison to not break on such operations.
Since some problems have been observed with pushing freeze through
SRA/SRL we block that explicitly in DAGCombiner::visitFreeze now.
That way we can still model SRA/SRL properly in
SelectionDAG::canCreateUndefOrPoison, e.g. when used by
isGuaranteedNotToBeUndefOrPoison, even if we do not want to push
freeze through those instructions.
The description of CopyFromReg in ISDOpcodes.h says that the input
valus is defined outside the scope of the current SelectionDAG. I
think that means that we basically can treat it as a FREEZE in the
sense that it can be seen as neither being undef nor poison.
Being able to fold freeze(CopyFromReg) into CopyFromReg seems
useful to avoid regressions if we start to introduce freeze
instruction in DAGCombiner/foldBoolSelectToLogic, e.g. to solve
https://github.com/llvm/llvm-project/issues/84653
Things _not_ dealt with in this patch:
- Depending on calling convention an input argument can be passed
also on the stack and not in a register. If it is allowed to treat
an argument received in a register as not being poison, then I think
we want to treat arguments received on the stack the same way. But
then we need to attribute load instructions, or add explicit FREEZE
when lowering formal arguments.
- A common pattern is that there is an AssertZext or AssertSext just
after CopyFromReg. I think that if we treat CopyFromReg as never
being poison, then it should be allowed to fold
(freeze(AssertZext(CopyFromReg))) -> AssertZext(CopyFromReg))
This matches the style used in the Analysis version of this routine, and
makes it less likely we'll miss a poison generating flag in future
changes. Unlike IR, the check for poison generating flags doesn't need
to switch over opcode since all nodes have the SDFlags storage.
All of these constructors were creating a SDVTList using an EVT* created
by SDNode::getValueTypeList. This EVT needs to live at least as long as
the SDNode that uses it. To do this, SDNode::getValueTypeList contains
several function scoped static variables that hold the memory for the
EVT. So the EVT lives until global destructors run.
This is problematic since an EVT contains a Type* that points to memory
allocated by an LLVMContext. If multiple LLVMContexts are used that
don't have overlapping lifetimes, we can end up with stale or or
incorrect pointers cached in the EVTs owned by SDNode::getValueTypeList.
I want to try to make the EVTs be owned by SelectionDAG instead. This is
already done for SDVTLists with more than 1 VT. The single value case is
a very old optimizaton that should be re-evaluated. In order to do this,
I need the SDVTLists to be created by SelectionDAG rather than by the
SDNode itself.
This patch doesn't change how the allocation is done yet. It just moves
the code around.
This patch does reduce the number of calls to getVTList since we now
share with the call needed for the SDNode FoldingSet.
Part of fixing #88233.
It turns out that if any of the operations can be zero, and neither of
the operands can be proven to be positive, it is possible for smax to be
zero, and KnownBits cannot prove otherwise even with KnownBits::smax. In
fact, proving it based on the KnownBits itself at that point without
increasing the depth is actually, provably impossible.
Same with smin.
This covers all the possible cases and is proven to be complete.
This requires type legalization to keep them the same. This means we no
longer need to legalize the operand since it will be legalized when we
legalize the second result.
This applies the same rules we have for the scalar operands of a
BUILD_VECTOR where the scalar type must match the element type or for
integer vectors we allow the scalar type to be larger than the element
type. Hexagon uses i32 for an FP zero vector so we allow that as an
exception.
Reverse the fold with handling inside canCreateUndefOrPoison for cases where we know that the extract index is in bounds.
This exposed a number or regressions, and required some initial freeze handling of SCALAR_TO_VECTOR, which will require us to properly improve demandedelts support to handle its undef upper elements.
There is still one outstanding regression to be addressed in the future - how do we want to handle folds involving frozen loads?
Fixes#86968
On RV64, we legalize zexts of i1s to (vselect m, (splat_vector i64 1),
(splat_vector i64 0)), where the splat_vectors are implicitly
truncating.
When the vselect is used by a binop we want to pull the vselect out via
foldSelectWithIdentityConstant. But because vectors with an element size
< i64 will truncate, isNeutralConstant will return false.
This patch handles truncating splats by getting the APInt value and
truncating it. We almost don't need to do this since most of the neutral
elements are either one/zero/all ones, but it will make a difference for
smax and smin.
I wasn't able to figure out a way to write the tests in terms of select,
since we need the i1 zext legalization to create a truncating
splat_vector.
This supercedes #87236. Fixed vectors are unfortunately not handled by
this patch (since they get legalized to _VL nodes), but they don't seem
to appear in the wild.
Remove getSizeOrUnknown call when MachineMemOperand is created. For Scalable
TypeSize, the MemoryType created becomes a scalable_vector.
2 MMOs that have scalable memory access can then use the updated BasicAA that
understands scalable LocationSize.
Original Patch by Harvin Iriawan
Co-authored-by: David Green <david.green@arm.com>
Allow targets to rely on TargetLowering::isGuaranteedNotToBeUndefOrPoisonForTargetNode to test nodes for canCreateUndefOrPoisonForTargetNode + all arguments are isGuaranteedNotToBeUndefOrPoison.
Targets can still perform this themselves for specific special case nodes (e.g. target shuffles).
Matches the fallback in SelectionDAG::isGuaranteedNotToBeUndefOrPoison
- Instead of lowering float/double ISD::ATOMIC_LOAD / ISD::ATOMIC_STORE
nodes to regular LOAD/STORE nodes, make them legal and select those nodes
properly instead. This avoids exposing them to the DAGCombiner.
- AtomicExpand pass no longer casts float/double atomic load/stores to integer
(FP128 is still casted).
This is part of #70452 that changes the type used for the external
interface of MMO to LocationSize as opposed to uint64_t. This means the
constructors take LocationSize, and convert ~UINT64_C(0) to
LocationSize::beforeOrAfter(). The getSize methods return a
LocationSize.
This allows us to be more precise with unknown sizes, not accidentally
treating them as unsigned values, and in the future should allow us to
add proper scalable vector support but none of that is included in this
patch. It should mostly be an NFC.
Global ISel is still expected to use the underlying LLT as it needs, and
are not expected to see unknown sizes for generic operations. Most of
the changes are hopefully fairly mechanical, adding a lot of getValue()
calls and protecting them with hasValue() where needed.
When I created APIntOps::absdiff, I totally missed that we already have ISD::ABDS/ABDU nodes, and we use this term in other places/targets as well.
I've added the APIntOps::abds implementation and renamed APIntOps::absdiff to APIntOps::abdu.
Given that APIntOps::absdiff is so young I don't think we need to create a deprecation wrapper, but I can if anyone thinks it important.
I'll do a KnownBits rename patch after this.
No nans/infs in SelectionDAG is complicated. Hopefully I've captured
all of the cases. I've only applied to ConsiderFlags to the SDNodeFlags
since those are the only ones that will be droped by hoisting. The
condition code and TargetOptions would still be in effect.
Recovers some regression from #84232.
Teach canCreateUndefOrPoison that ISD::SETCC with integer operands can
never create undef/poison. FP SETCC is more complicated and will be
handled in a future patch.
Teach isGuaranteedNotToBeUndefOrPoison that ISD::CONDCODE is not
poison/undef. Its a special constant only used by setcc/select_cc like
nodes. This is needed since the hoisting will only hoist if exactly one
operand might be poison. setcc has 3 operand including the condition
code.
Recovers some regression from #84232.
This extends the known bits for extending loads which have range
metadata, handling the range metadata on the original memory type,
extending that to the correct BitWidth.
The base case of these call InferPtrInfo. This is dangerous due to
#82657, but it turns out none of these are used.
It seemed best to reduce the surface area until these are needed.
This handles two cases where we can work out some known-zero bits for
ISD::STEP_VECTOR.
The first case handles when we know the low bits are zero because the
step
amount is a power of two. This is taken from
https://reviews.llvm.org/D128159,
and even though the original patch didn't end up landing this case due
to it
not having any test difference, I've included it here for completeness's
sake.
The second case handles the case when we have an upper bound on
vscale_range.
We can use this to work out the upper bound on the number of elements,
and thus
what the maximum step will be. From the maximum step we then know which
hi bits
are zero.
On its own, computing the known hi bits results in some small
improvements for
RVV with -mrvv-vector-bits=zvl across the llvm-test-suite. However I'm
hoping
to be able to use this later to reduce the LMUL in index calculations
for
vrgather/indexed accesses.
---------
Co-authored-by: Philip Reames <preames@rivosinc.com>