SDNode::use_iterator now returns an SDUse& when dereferenced.
SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses
work on use_iterator. SDNode::user_begin/user_end/users work on
user_iterator.
We can now write range based for loops using SDUse& and SDNode::uses().
I've converted many of these in this patch. I didn't update loops that
have additional variables updated in their for statement.
Some loops use SDNode::use_iterator::getOperandNo() which also prevents
using range based for loops. I plan to move this into SDUse in a follow
up patch.
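To illustrate the difference between iterating uses and iterating users, here is a minimal standalone sketch. `Node`, `Use`, and `sumOperandNos` are hypothetical stand-ins, not the LLVM classes, and storing the operand number directly on the use is an assumption about where the follow-up patch is heading.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for SDNode/SDUse; not the LLVM classes.
struct Node;

struct Use {
  Node *User;         // node that consumes the value
  unsigned OperandNo; // operand slot of User that holds the value
};

struct Node {
  int Id;
  std::vector<Use> UseList; // every use of this node's value

  // Range over uses: each element is a Use&, so the operand index is at hand.
  std::vector<Use> &uses() { return UseList; }

  // Range over users: only the consuming nodes, one entry per use.
  std::vector<Node *> users() const {
    std::vector<Node *> Result;
    Result.reserve(UseList.size());
    for (const Use &U : UseList)
      Result.push_back(U.User);
    return Result;
  }
};

// Sum the operand indices at which Def is used, written as a range based loop.
unsigned sumOperandNos(Node &Def) {
  unsigned Sum = 0;
  for (Use &U : Def.uses())
    Sum += U.OperandNo;
  return Sum;
}
```

A loop over `users()` is enough when only the consuming node matters; a loop over `uses()` is needed as soon as the operand position is required.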
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.
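The efficiency argument can be sketched on a plain linked list. `UseLink`, `useSize`, and `hasOneUse` below are hypothetical and only assume that the use list is a singly linked chain, so counting it is linear while the one-use check is constant time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical intrusive use list; a stand-in, not LLVM's SDUse.
struct UseLink {
  UseLink *Next = nullptr;
};

// Counts every link: O(n) in the number of uses.
std::size_t useSize(const UseLink *Head) {
  std::size_t N = 0;
  for (const UseLink *U = Head; U; U = U->Next)
    ++N;
  return N;
}

// Inspects at most the first link and its successor pointer: O(1).
bool hasOneUse(const UseLink *Head) {
  return Head != nullptr && Head->Next == nullptr;
}
```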
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.
I've long been annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
Scalarize vector FPOWI instead of promoting the type. This allows the
scalar FPOWIs to be visited and converted to libcalls before promoting
the type.
FIXME: This should be done in LegalizeVectorOps/LegalizeDAG, but call
lowering needs the unpromoted EVT.
Without this patch, in some backends, such as RISCV64 and LoongArch64,
the i32 type is illegal and will be promoted. This causes the exponent
type check to fail when an ISD::FPOWI node generates a libcall.
Fixes https://github.com/llvm/llvm-project/issues/118079
This adds a new helper `canFoldStoreIntoLibCallOutputPointers()` to
check that it is safe to fold a store into a node that will expand to a
library call that takes output pointers. This requires checking for two
(independent) properties:
1. The store is not within a CALLSEQ_START..CALLSEQ_END pair
   * If it is, the expansion would lead to nested call sequences (which is
     invalid)
2. The node does not appear as a predecessor to the store
   * If it does, attempting to merge the store into the call would result
     in a cycle in the DAG
These two properties are checked as part of the same traversal in
`canFoldStoreIntoLibCallOutputPointers()`.
ISD::isBuildVectorAllOnes can peek through bitcasts, so this can match against FP NAN (ish) data (e.g. double (bitcast i64 -1)) under certain circumstances - bail if the type isn't an integer and let bitcast folding handle it first.
Fixes #120093
This patch makes a couple of improvements to ReduceLoadOpStoreWidth.
When determining the minimum size of "NewBW" we now take byte boundaries
into account. If we for example touch bits 6-10 we shouldn't accept
NewBW=8, because we would fail later when detecting that we can't access
bits from two different bytes in memory using a single load. Instead we
make sure to align LSB/MSB according to byte size boundaries up front
before searching for a viable "NewBW".
In the past we only tried to find a "ShAmt" that was a multiple of
"NewBW", but now we use a sliding window technique to scan for a viable
"ShAmt" that is a multiple of the byte size. This can help find more
opportunities for optimization (especially if the original type
isn't byte sized, and for big-endian targets when the original
load/store is aligned on the most significant bit).
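The two steps described above can be sketched as follows. This is an illustration under simplifying assumptions, not the DAGCombiner code: `alignToBytes` and `findShAmt` are hypothetical names, bits are numbered from the least significant bit, and endianness and memory bounds are ignored.

```cpp
#include <cassert>
#include <utility>

// Widen the touched bit range [LSB, MSB] (inclusive) to whole bytes.
// Returns {first bit of the lowest byte, one past the last bit of the
// highest byte}.
std::pair<unsigned, unsigned> alignToBytes(unsigned LSB, unsigned MSB) {
  unsigned Lo = LSB - (LSB % 8);
  unsigned Hi = MSB - (MSB % 8) + 8; // exclusive upper bound
  return {Lo, Hi};
}

// Slide a NewBW-wide window in byte steps and return the first ShAmt whose
// window [ShAmt, ShAmt + NewBW) covers [LSB, MSB] and stays inside the
// original OrigBW-wide value. Returns -1 if no such window exists.
int findShAmt(unsigned LSB, unsigned MSB, unsigned NewBW, unsigned OrigBW) {
  for (unsigned ShAmt = 0; ShAmt + NewBW <= OrigBW; ShAmt += 8)
    if (ShAmt <= LSB && MSB < ShAmt + NewBW)
      return static_cast<int>(ShAmt);
  return -1;
}
```

For bits 6-10 the aligned range is [0, 16), so no 8-bit window fits and 16 bits are needed. For bits 12-20 in a 32-bit value, a 16-bit window works at ShAmt=8 but at neither multiple of 16, which is the kind of opportunity the byte-step scan adds.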
DAGCombiner::ReduceLoadOpStoreWidth could replace memory accesses
with narrower loads/stores, although sometimes the new load/store
would touch memory outside the original object. That seemed wrong,
so this patch simply avoids doing the DAG combine in such
situations.
Also simplify the expression used to align ShAmt down to a multiple
of NewBW. Subtracting (ShAmt % NewBW) should do the same thing as the
old more complicated expression.
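A minimal sketch of the simplified align-down arithmetic. `alignDown` is a hypothetical helper name; the old expression is not reproduced here.

```cpp
#include <cassert>

// Round ShAmt down to a multiple of NewBW by dropping the remainder.
unsigned alignDown(unsigned ShAmt, unsigned NewBW) {
  assert(NewBW != 0 && "window width must be non-zero");
  return ShAmt - (ShAmt % NewBW);
}
```

For unsigned values this gives the same result as `(ShAmt / NewBW) * NewBW`.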
The intention is to follow up with a patch that makes more attempts,
trying to align the memory accesses at other offsets, so that the
transform triggers in more situations. The current strategy for deciding
size (NewBW) and offset (ShAmt) for the narrowed operations is a bit
ad hoc, and does not really consider big-endian memory order in the same
way as little-endian.
Adding test cases related to narrowing of load-op-store sequences.
ReduceLoadOpStoreWidth isn't careful enough, so it may end up
creating load/store operations that access memory outside the region
touched by the original load/store. Using ARM as a target for the
test cases to show what happens for both little-endian and big-endian.
This patch also adds a way to override the TLI.isNarrowingProfitable
check in DAGCombiner::ReduceLoadOpStoreWidth by using the option
-combiner-reduce-load-op-store-width-force-narrowing-profitable.
The idea is that this should make it simpler to, for example, add lit
tests verifying that the code is correct for big-endian (which is
otherwise difficult since there are no in-tree big-endian targets that
override TLI.isNarrowingProfitable).
This is a pre-commit for
https://github.com/llvm/llvm-project/pull/119203
Currently LLVMContext::emitError emits any error as an "inline asm"
error which does not make any sense. InlineAsm appears to be special,
in that it uses a "LocCookie" from srcloc metadata, which looks like
a parallel mechanism to ordinary source line locations. This meant
that other types of failures had degraded source information reported
when available.
Introduce some new generic error types, and only use inline asm
in the appropriate contexts. The DiagnosticInfo types are still
a bit of a mess, and I'm not sure why DiagnosticInfoWithLocationBase
exists instead of just having an optional DiagnosticLocation in the
base class.
DK_Generic is for any error that derives from an IR level instruction,
and thus can pull debug locations directly from it. DK_GenericWithLoc
is functionally the generic codegen error, since it does not depend
on the IR and instead can construct a DiagnosticLocation from the
MI debug location.
Previously we created an FP_TO_FP16 and legalized it in
SoftenFloatOp_FP_ROUND. This caused i16 to be sent to call lowering
instead of f16. This results in the ABI not being followed if f16 is
supposed to be passed in a different register than i16.
Looking at the libgcc binary for the library function it appears the value
is returned in xmm0 so the X86 test was being miscompiled before.
Fixes #107607.
Each call to push_back contains a check to see if the vector needs to
grow. Using resize or giving the size to the constructor can reduce
the number of checks for growing.
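As a small illustration of the difference (the function names are hypothetical and the element computation is arbitrary):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Every push_back first checks whether the vector has to grow.
std::vector<int> squaresPushBack(std::size_t N) {
  std::vector<int> V;
  for (std::size_t I = 0; I < N; ++I)
    V.push_back(static_cast<int>(I * I));
  return V;
}

// The size is given to the constructor, so storage is allocated once and the
// indexed writes involve no growth check.
std::vector<int> squaresSized(std::size_t N) {
  std::vector<int> V(N);
  for (std::size_t I = 0; I < N; ++I)
    V[I] = static_cast<int>(I * I);
  return V;
}
```

Both functions produce the same contents; only the number of capacity checks and reallocations differs.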
When expanding a load into two loads, use nuw for the add that computes
the offset from the base of the second load, because the original load
doesn't straddle the address space.
It turns out there's already a dedicated helper function for doing this,
`getObjectPtrOffset`.
This is in target-independent code, however in practice it only seems to
affect WebAssembly code, because WebAssembly load and store
instructions' constant offsets don't perform wrapping, so constant
folding often depends on the nuw flag being present.
This was noticed in the development of #119204.
[Reverts d57892a2a153ab71a796f07e39d939eae6910c21]
For IR like this:
%icmp = icmp ult <4 x i32> %a, splat (i32 5)
%res = extractelement <4 x i1> %icmp, i32 1
where there is only one use of %icmp we can take a similar approach
to what we already do for binary ops such as add, sub, etc. and convert
this into
%ext = extractelement <4 x i32> %a, i32 1
%res = icmp ult i32 %ext, 5
For AArch64 targets at least the scalar boolean result will almost
certainly need to be in a GPR anyway, since it will probably be
used by branches for control flow. I've tried to reuse existing code
in scalarizeExtractedBinop to also work for setcc.
NOTE: The optimisations don't apply for tests such as
extract_icmp_v4i32_splat_rhs in the file
CodeGen/AArch64/extract-vector-cmp.ll
because scalarizeExtractedBinOp only works if one of the input
operands is a constant.
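The equivalence behind this transform can be modelled outside of LLVM. In the sketch below, `V4`, `compareThenExtract`, and `extractThenCompare` are hypothetical names that mirror the two IR forms above for an unsigned less-than against a splat constant.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4 = std::array<std::uint32_t, 4>;

// Before: compare all four lanes (icmp ult), then extract one i1.
bool compareThenExtract(const V4 &A, std::uint32_t Splat, unsigned Lane) {
  std::array<bool, 4> Cmp{};
  for (unsigned I = 0; I < 4; ++I)
    Cmp[I] = A[I] < Splat;
  return Cmp[Lane];
}

// After: extract one i32, then do a single scalar compare.
bool extractThenCompare(const V4 &A, std::uint32_t Splat, unsigned Lane) {
  return A[Lane] < Splat;
}
```

The second form does one comparison instead of four and never materialises the vector of booleans.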
---------
Co-authored-by: Paul Walker <paul.walker@arm.com>
EVTs potentially contain a Type * that points into memory owned by an
LLVMContext. Storing them in a function scoped static means they may
outlive the LLVMContext they point to.
This std::set is used to unique single element VT lists containing a
single extended EVT. Single element VT list with a simple EVT are
uniqued by a separate cache indexed by the MVT::SimpleValueType enum. VT
lists with more than one element are uniqued by a FoldingSet owned by
the SelectionDAG object.
This patch moves the single element cache into SelectionDAG so that it
will be destroyed when SelectionDAG is destroyed.
Fixes #88233
Add DAG legalization support for expanding i1 SETCC nodes using
appropriate logical operations to simulate integer comparisons. Use
these expansions to handle i1 SETCC in NVPTX.
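One possible set of identities for such an expansion is sketched below. It is an assumption that the patch uses these exact forms; equivalent ones exist. An i1 is modelled as a `bool`, read as 0/1 for unsigned predicates and as 0/-1 for signed predicates. All names are hypothetical.

```cpp
#include <cassert>

enum class CC { EQ, NE, ULT, UGT, ULE, UGE, SLT, SGT, SLE, SGE };

// i1 comparison written purely with logical operations.
bool setccI1(CC Cond, bool A, bool B) {
  switch (Cond) {
  case CC::EQ:  return !(A != B); // not (A xor B)
  case CC::NE:  return A != B;    // A xor B
  case CC::ULT: return !A && B;
  case CC::UGT: return A && !B;
  case CC::ULE: return !A || B;
  case CC::UGE: return A || !B;
  case CC::SLT: return A && !B;   // true is -1, so it is the smaller value
  case CC::SGT: return !A && B;
  case CC::SLE: return A || !B;
  case CC::SGE: return !A || B;
  }
  return false;
}

// Reference result computed with ordinary integer comparisons.
bool referenceI1(CC Cond, bool A, bool B) {
  unsigned UA = A ? 1u : 0u, UB = B ? 1u : 0u; // zero-extended
  int SA = A ? -1 : 0, SB = B ? -1 : 0;        // sign-extended
  switch (Cond) {
  case CC::EQ:  return UA == UB;
  case CC::NE:  return UA != UB;
  case CC::ULT: return UA < UB;
  case CC::UGT: return UA > UB;
  case CC::ULE: return UA <= UB;
  case CC::UGE: return UA >= UB;
  case CC::SLT: return SA < SB;
  case CC::SGT: return SA > SB;
  case CC::SLE: return SA <= SB;
  case CC::SGE: return SA >= SB;
  }
  return false;
}

// Exhaustively compare the expansion with the reference.
bool expansionsAgree() {
  const CC All[] = {CC::EQ,  CC::NE,  CC::ULT, CC::UGT, CC::ULE,
                    CC::UGE, CC::SLT, CC::SGT, CC::SLE, CC::SGE};
  for (CC Cond : All)
    for (int A = 0; A < 2; ++A)
      for (int B = 0; B < 2; ++B)
        if (setccI1(Cond, A != 0, B != 0) != referenceI1(Cond, A != 0, B != 0))
          return false;
  return true;
}
```

Note that the signed and unsigned orderings are mirrored: for i1, SLT has the same truth table as UGT.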
Fixes #58428 and #57405
This DAG combine was incorrect for big-endian targets, because it
assumes that when a bitcast changes the lane width, the
least-significant bits of the wider lanes are in the lower-numbered
lanes of the smaller type, which is only true for little-endian.
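The lane-ordering difference can be made concrete with a small model. `bitcastToBytes` is a hypothetical helper that treats a bitcast from one 32-bit lane to four 8-bit lanes as a store followed by a reload with an explicit byte order, so the result does not depend on the host.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Model "bitcast i32 -> <4 x i8>" with an explicit byte order.
std::array<std::uint8_t, 4> bitcastToBytes(std::uint32_t X, bool BigEndian) {
  std::array<std::uint8_t, 4> Lanes{};
  for (unsigned I = 0; I < 4; ++I) {
    // Little-endian: lane 0 holds the least-significant byte.
    // Big-endian: lane 0 holds the most-significant byte.
    unsigned Shift = BigEndian ? (3 - I) * 8 : I * 8;
    Lanes[I] = static_cast<std::uint8_t>(X >> Shift);
  }
  return Lanes;
}
```

For 0x11223344, lane 0 is 0x44 on little-endian but 0x11 on big-endian, so a combine that assumes the low bits live in the lowest-numbered lane is only valid for little-endian.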
I want to use this function for GISel too so Type * is a better common
interface. All of the callers already convert EVT to Type * as needed
by calling lowering anyway.
It doesn't make sense to add a new generic ISD to handle riscv tuple
type. Instead we use `SPLAT_VECTOR` for ISD and further lower to
`VMV_V_X`.
Note: If there's `visitSPLAT_VECTOR` in generic DAG combiner, it needs
to skip riscv vector tuple type.
Stacked on https://github.com/llvm/llvm-project/pull/114329
Currently FastISel triggers a fallback if there is an unreachable
terminator and the TrapUnreachable option is enabled (the ISD::TRAP
selection does not actually work).
Add handling for NoTrapAfterNoReturn, in which case we don't actually
need to emit a trap. The test is just there to make sure there is no
FastISel fallback (which is why I'm not testing the case without
noreturn). We have other tests that check the actual unreachable codegen
variations.
When extracting a smaller integer from a scalar_to_vector source, we were limited to only folding/truncating the lowest bits of the scalar source.
This patch extends the fold to handle extraction of any other element, by right shifting the source before truncation.
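A minimal sketch of the shift-then-truncate idea, assuming a 64-bit scalar source viewed as four 16-bit elements with little-endian lane numbering (element 0 is the lowest 16 bits). `extractElt` is a hypothetical name, and the widths are chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Extract 16-bit element Idx from a 64-bit scalar source:
// shift the wanted element down to bit 0, then truncate.
std::uint16_t extractElt(std::uint64_t Src, unsigned Idx) {
  assert(Idx < 4 && "only four 16-bit elements fit in 64 bits");
  return static_cast<std::uint16_t>(Src >> (Idx * 16));
}
```

Element 0 needs no shift, which corresponds to the case the fold already handled; any other index adds the right shift.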
Fixes a regression from #117884
Handle @llvm.expect.with.probability in SelectionDAGBuilder, FastISel,
and IntrinsicLowering in the same way @llvm.expect is handled, where
the value is passed through as-is. This can be reached if the intrinsic
is used without optimizations, where it would otherwise be properly
transformed out.
Fixes #115411 for SelectionDAG. A similar patch is likely needed for
GlobalISel.
Assert that the passed value is a valid unsigned integer value for the
specified type.
For signed values getSignedConstant() / getSignedTargetConstant() should
be used instead.
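The kind of range check involved can be sketched as follows. `isValidUnsigned` and `isValidSigned` are hypothetical helpers for illustration and are not the assertion code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// True if Val is representable as an unsigned BitWidth-bit integer.
bool isValidUnsigned(std::uint64_t Val, unsigned BitWidth) {
  assert(BitWidth >= 1 && "bit width must be positive");
  return BitWidth >= 64 || (Val >> BitWidth) == 0;
}

// True if Val is representable as a signed BitWidth-bit integer.
bool isValidSigned(std::int64_t Val, unsigned BitWidth) {
  assert(BitWidth >= 1 && "bit width must be positive");
  if (BitWidth >= 64)
    return true;
  std::int64_t Min = -(std::int64_t(1) << (BitWidth - 1));
  std::int64_t Max = (std::int64_t(1) << (BitWidth - 1)) - 1;
  return Val >= Min && Val <= Max;
}
```

Passing -1 for an 8-bit type fails the unsigned check, because as a 64-bit unsigned value it has bits set above bit 7, but it passes the signed check. That is the case where the signed variant should be used.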
Fix all the places I could find that didn't do this. We were already
mostly correct for FP_ROUND after
9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.
A special case in type legalization wasn't accounting for different
operand numbering between FLDEXP and STRICT_FLDEXP.
AArch64 already asked STRICT_FLDEXP to be promoted, but had no test for
it.
Create signed constant using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().
This also touches some generic legalization code, which apparently only
AMDGPU tests.
Currently the function will walk the entire DAG to find other candidates
to perform a post-inc store. This leads to very long compilation times
on large functions. Added a MaxSteps limit to avoid this, which is also
aligned to how hasPredecessorHelper is used elsewhere in the code.
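A bounded predecessor search of this kind can be sketched as below. `DagNode`, `SearchResult`, and `findPredecessor` are hypothetical names, and the exact step accounting is an assumption rather than the behaviour of hasPredecessorHelper.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical DAG node: edges point from a node to its operands.
struct DagNode {
  std::vector<const DagNode *> Operands;
};

enum class SearchResult { Found, NotFound, GaveUp };

// Walk operands transitively from From, looking for Target.
// MaxSteps == 0 means no limit; otherwise the walk gives up once more than
// MaxSteps nodes have been expanded.
SearchResult findPredecessor(const DagNode *Target, const DagNode *From,
                             unsigned MaxSteps) {
  std::vector<const DagNode *> Worklist{From};
  std::unordered_set<const DagNode *> Visited;
  unsigned Steps = 0;
  while (!Worklist.empty()) {
    const DagNode *N = Worklist.back();
    Worklist.pop_back();
    if (!Visited.insert(N).second)
      continue; // already seen
    if (N == Target)
      return SearchResult::Found;
    if (MaxSteps != 0 && ++Steps > MaxSteps)
      return SearchResult::GaveUp;
    for (const DagNode *Op : N->Operands)
      Worklist.push_back(Op);
  }
  return SearchResult::NotFound;
}
```

A caller has to treat `GaveUp` conservatively, as if the target might be a predecessor, and skip the transform; that trades a few missed opportunities for bounded compile time.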
Add integer promotion support for VP_LOAD and VP_STORE via legalization of extend
and truncate of each form.
Patch commandeered from: https://reviews.llvm.org/D109377