Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing an easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
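A minimal sketch of the resulting shape (illustrative signature only, not the exact declaration from ValueTracking.h, which takes more analysis parameters):
```
// Depth is now the trailing, defaulted argument.
KnownBits computeKnownBits(const Value *V, const DataLayout &DL,
                           AssumptionCache *AC = nullptr,
                           const Instruction *CxtI = nullptr,
                           unsigned Depth = 0);

// Callers that used to pass an explicit 0 can simply drop the argument:
KnownBits Known = computeKnownBits(V, DL);
```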
AMDGPUCodeGenPrepareImpl::visitBinaryOperator() calls
Builder.CreateBinOp() and casts the resulting Value to a BinaryOperator
without checking, leading to an assert failure in a case found by
fuzzing. In this case the operands are constant, so CreateBinOp
constant-folds and returns a Constant instead of a BinaryOperator.
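The defensive pattern is roughly the following sketch (surrounding names like `Builder`, `Opc`, `LHS`, and `RHS` are assumed context, not the exact patch):
```
Value *NewVal = Builder.CreateBinOp(Opc, LHS, RHS);
if (auto *NewBO = dyn_cast<BinaryOperator>(NewVal)) {
  // Only here is it safe to treat the result as a BinaryOperator.
} else {
  // Constant-folded result: bail out of the BinaryOperator-specific path.
}
```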
These unused includes are identified by misc-include-cleaner. I've
filtered out removals that break builds. I'm also staying away from
llvm-config.h, config.h, and Compiler.h, whose removal would likely
cause platform- or compiler-specific build failures.
For the majority of cases, this is a neutral or positive change.
There are even testcases that greatly benefit from it, but some regressions are possible.
There is #140040 for GlobalISel that'd need to be fixed, but it's only a one-instruction regression and I think it can be fixed later.
Solves #64591
Most callers want a constant index. Instead of making every caller
create a ConstantInt, we can do it in IRBuilder. This is similar to
CreateInsertElement/CreateExtractElement.
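For reference, the existing convenience overload being mirrored looks roughly like this (paraphrased from IRBuilder.h):
```
Value *CreateExtractElement(Value *Vec, uint64_t Idx,
                            const Twine &Name = "") {
  // Wrap the integer in a ConstantInt so callers don't have to.
  return CreateExtractElement(Vec, getInt64(Idx), Name);
}
```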
Re-apply https://github.com/llvm/llvm-project/pull/130577
Which is reverted in https://github.com/llvm/llvm-project/pull/133880
The old application failed under AddressSanitizer because
`tryNarrowMathIfNoOverflow` was called after `I.eraseFromParent();` in
`AMDGPUCodeGenPrepareImpl::visitBinaryOperator`, creating a
use-after-free.
To fix this, `tryNarrowMathIfNoOverflow` is now called first, and the
visitor returns directly if it returns true.
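A minimal sketch of the new control flow (function names from the patch, body paraphrased):
```
bool AMDGPUCodeGenPrepareImpl::visitBinaryOperator(BinaryOperator &I) {
  if (tryNarrowMathIfNoOverflow(&I))
    return true; // Narrowing handled (and possibly erased) I; stop here.
  // ... the remaining folds, some of which call I.eraseFromParent() ...
  return false;
}
```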
The rework involves the following:
- Return an unsigned value: the number of div/rem bits actually needed.
- Change from an AtLeast (sign bits) parameter to a MaxDivBits hint.
- Use the MaxDivBits hint for the unsigned case.
- Remove the unnecessary second early exit.
Mostly NFC changes.
Previously, shrinkDivRem64 used the fixed value 32 for AtLeast, which
meant that sub-64-bit divisions would be rejected from shrinking, since
the logic depended only on the number of sign bits. For example,
'idiv i48 %0, %1' would report 24 sign bits when %0 and %1 both had 24
division bits, and was rejected.
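A minimal sketch of the reworked interface (simplified; the real helper takes additional parameters, and the exact contract around MaxDivBits is an assumption here):
```
// Returns the number of low bits the div/rem actually needs; the
// MaxDivBits hint lets the analysis stop early.
unsigned getDivNumBits(BinaryOperator &I, Value *Num, Value *Den,
                       unsigned MaxDivBits, bool IsSigned);

// shrinkDivRem64 can then simply compare against the hint:
unsigned DivBits = getDivNumBits(I, Num, Den, /*MaxDivBits=*/32, IsSigned);
if (DivBits > 32)
  return nullptr; // cannot shrink this 64-bit division
```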
With the introduction of CmpPredicate in 51a895a (IR: introduce struct
with CmpInst::Predicate and samesign), PatternMatch is one of the first
key pieces of infrastructure that must be updated to match a CmpInst
respecting samesign information. Implement this change to Cmp-matchers.
This is a preparatory step in migrating the codebase over to
CmpPredicate. Since no functional changes are desired at this stage,
we have chosen not to migrate CmpPredicate::operator==(CmpPredicate)
calls to use CmpPredicate::getMatching(), as that would have visible
impact on tests that are not yet written: instead, we call
CmpPredicate::operator==(Predicate), preserving the old behavior, while
also inserting a few FIXME comments for follow-ups.
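A minimal sketch of the distinction being preserved (the exact CmpPredicate API and signatures are an assumption here; treat this as illustrative only):
```
// Illustrative helpers, not real code from the patch.
static bool looseMatch(CmpPredicate A, CmpPredicate B) {
  // Old behavior, preserved for now: operator==(Predicate) ignores samesign.
  return A == static_cast<CmpInst::Predicate>(B);
}

static bool strictMatch(CmpPredicate A, CmpPredicate B) {
  // Samesign-aware alternative, deferred via FIXME comments in the patch.
  return CmpPredicate::getMatching(A, B).has_value();
}
```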
This reverts commit 254d206ee2a337cb38ba347c896f7c6a14c7f218.
Additionally, added a fix in ExpandDivRem24 to disqualify the expansion if DivNumBits exceeds 24.
Original commit & msg:
ce6e955ac374f2b86cbbb73b2f32174dffd85f25.
Handle the signed and unsigned paths differently in getDivNumBits. Using
computeKnownBits, this rejects shrinking unsigned div/rem if the
operands exceed the signed max, since we know NumSignBits will always be 0.
… (SignedMax,UnsignedMax] (#116733)"
This reverts commit 905e831f8c8341e53e7e3adc57fd20b8e08eb999.
Handle the signed and unsigned paths differently in getDivNumBits. Using
computeKnownBits, this rejects shrinking unsigned div/rem if the
operands exceed the signed max, since we know NumSignBits will always be 0.
Rebased and re-attempted after the first attempt was reverted due to an
unrelated failure in libc (which should be fixed by now, I'm told).
Use references instead of pointers for most state, initialize it all in
the constructor, and common up some of the initialization between the
legacy and new pass manager paths.
Do this by using ComputeKnownBits and checking for !isNonNegative and
isUnsigned. This rejects shrinking unsigned div/rem if the operands
exceed smax_bitwidth, since we know NumSignBits will always be 0.
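A minimal sketch of the check (helper name hypothetical; the real logic lives inside getDivNumBits):
```
// If the operand is not provably non-negative, its value may exceed the
// signed maximum, and the sign-bit count gives no usable shrinking bound.
static bool mayExceedSignedMax(const Value *Op, const DataLayout &DL) {
  KnownBits Known = computeKnownBits(Op, DL);
  return !Known.isNonNegative();
}
```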
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation for
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e., just lookup, no creation).
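A minimal sketch of the renamed call (argument list abbreviated; `M` and `Builder` are assumed context):
```
// Creates the declaration if it does not exist yet, mirroring
// Module::getOrInsertFunction.
Function *Decl = Intrinsic::getOrInsertDeclaration(
    M, Intrinsic::fabs, {Builder.getFloatTy()});
```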
This was assuming the source address space was at least as large as the
destination of the cast. I'm not sure why this was casting to begin
with; the assumption seems to be that the source address space from the
root addrspacecast matches the underlying object, so directly check that.
Fixes #97457
Uses the new InsertPosition class (added in #94226) to simplify some of
the IRBuilder interface, and removes the need to pass a BasicBlock
alongside a BasicBlock::iterator, using the fact that we can now get the
parent basic block from the iterator even if it points to the sentinel.
This patch removes the BasicBlock argument from each constructor or call
to SetInsertPoint.
This has no functional effect, but later on as we look to remove the
`Instruction *InsertBefore` argument from instruction-creation
(discussed
[here](https://discourse.llvm.org/t/psa-instruction-constructors-changing-to-iterator-only-insertion/77845)),
this will simplify the process by allowing us to deprecate the
InsertPosition constructor directly and catch all the cases where we use
instructions rather than iterators.
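A minimal sketch of the change at a call site (illustrative, not taken from the diff):
```
// Before: the block had to be passed alongside the iterator.
Builder.SetInsertPoint(BB, It);

// After: the iterator alone identifies the block, even at end().
Builder.SetInsertPoint(It);
```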
When i1 true is used as an index, SExt extends it to i32 -1, which would
cause BitVector to overflow. The language reference specifies that the
index shall be treated as an unsigned number; this patch fixes that.
(https://llvm.org/docs/LangRef.html#insertelement-instruction)
This patch fixes #85717
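A minimal sketch of the intended index handling (variable names hypothetical):
```
if (auto *IdxC = dyn_cast<ConstantInt>(IdxOp)) {
  // LangRef: the index is unsigned, so read it zero-extended; an i1 true
  // index becomes 1 rather than the sign-extended -1 that overflowed the
  // BitVector.
  uint64_t Idx = IdxC->getZExtValue();
  EnabledLanes.set(Idx); // hypothetical BitVector use
}
```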
---------
Signed-off-by: Peter Rong <PeterRong96@gmail.com>
Use IR analysis to infer when an addrspacecast operand is nonnull, then
lower it to an intrinsic that the DAG can use to skip the null check.
I did this using an intrinsic as it's non-intrusive. An alternative
would have been to allow something like `!nonnull` on `addrspacecast`
then lower that to a custom opcode (or add an operand to the
addrspacecast MIR/DAG opcodes), but it's a lot of boilerplate for just
one target's use case IMO.
I'm hoping that when we switch to GISel we can move all this logic
to the MIR level without losing info, but currently the DAG doesn't see
enough so we need to act in CGP.
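A minimal sketch of the CGP-side rewrite (the intrinsic ID and the analysis entry point used here are assumptions on my part):
```
if (isKnownNonZero(ASC->getPointerOperand(), SQ)) {
  // Assumed intrinsic name; ISel can lower it without the null check.
  Value *NonNullCast = Builder.CreateIntrinsic(
      ASC->getType(), Intrinsic::amdgcn_addrspacecast_nonnull,
      {ASC->getPointerOperand()});
  ASC->replaceAllUsesWith(NonNullCast);
  ASC->eraseFromParent();
}
```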
Fixes: SWDEV-316445
There was an error where a dividend of type i64 whose actual number of
used bits was 32 fell into the path that assumes only 24 bits are used.
Check that the AtLeast field is used correctly when using
computeNumSignBits, and add the necessary extend/trunc for the 32-bit path.
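Roughly, the 32-bit path now does the following (helper name and exact signature assumed):
```
Value *NarrowNum = Builder.CreateTrunc(Num, I32Ty);
Value *NarrowDen = Builder.CreateTrunc(Den, I32Ty);
Value *NarrowRes = expandDivRem32(Builder, I, NarrowNum, NarrowDen); // assumed helper
Value *Res = IsSigned ? Builder.CreateSExt(NarrowRes, Ty)
                      : Builder.CreateZExt(NarrowRes, Ty);
```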
Regolden and update testcases.
This patch refactors the interface of the `computeKnownFPClass` family
to pass `SimplifyQuery` directly.
The motivation of this patch is to compute known fpclass with
`DomConditionCache`, which was introduced by
https://github.com/llvm/llvm-project/pull/73662. With
`DomConditionCache`, we can do more optimization with context-sensitive
information.
Example (extracted from
[fmt/format.h](https://github.com/fmtlib/fmt/blob/e17bc67547/include/fmt/format.h#L3555-L3566)):
```
define float @test(float %x, i1 %cond) {
  %i32 = bitcast float %x to i32
  %cmp = icmp slt i32 %i32, 0
  br i1 %cmp, label %if.then1, label %if.else

if.then1:
  %fneg = fneg float %x
  br label %if.end

if.else:
  br i1 %cond, label %if.then2, label %if.end

if.then2:
  br label %if.end

if.end:
  %value = phi float [ %fneg, %if.then1 ], [ %x, %if.then2 ], [ %x, %if.else ]
  %ret = call float @llvm.fabs.f32(float %value)
  ret float %ret
}
```
We can prove the signbit of `%value` is always zero. Then the fabs can
be eliminated.
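A minimal sketch of a query through the new interface (parameter order abbreviated; see ValueTracking.h for the exact overloads):
```
// SQ can carry a DomConditionCache, so dominating branch conditions like
// the icmp above become visible to the fpclass computation.
KnownFPClass Known = computeKnownFPClass(V, fcAllFlags, /*Depth=*/0, SQ);
if (Known.SignBit && !*Known.SignBit) {
  // Sign bit known zero: fabs(V) folds to V.
}
```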
Instead, create a mul24 with a 64-bit result and let ISel take care of
it.
This allows patterns to simply match mul24 even for 64-bit muls, instead of having to match both mul/mulhi and a buildvector/bitconvert/etc.
Make codegen emit correctly rounded sqrt by default.
Emit the fast but only kind of fast expansion in AMDGPUCodeGenPrepare
based on !fpmath, like the fdiv case. Hack around visitation ordering
problems from AMDGPUCodeGenPrepare using forward iteration instead of
a well behaved combiner.
https://reviews.llvm.org/D158129
We were basing the decision to defer the fast case to codegen on the
fdiv itself, and not looking for a foldable sqrt input.
https://reviews.llvm.org/D158127
Otherwise stale pointers pollute the cache, and when a dead PHI's memory
is reused for another PHI, we can get a false-positive hit in the cache.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157711
Previous heuristics had a big flaw: they only looked at single PHI at a time, and didn't take into account the whole "chain".
The concept of "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector.
We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations.
The solution I found is to consider the whole "PHI chain" when looking at a PHI.
That is, we recursively look at the PHI's incoming values & users for other PHIs, then make a decision about the chain as a whole.
The current threshold requires that at least `ceil(chain size * (2/3))` PHIs have at least one interesting incoming value.
In simple terms, two-thirds (rounded up) of the PHIs should be breakable.
This seems to work well. A lower threshold such as 50% is too aggressive, because chains can often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those cases often causes performance issues.
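In code, the threshold amounts to (names hypothetical):
```
// Integer ceiling of ChainSize * 2/3: e.g. a chain of 7 PHIs needs at
// least 5 "interesting" PHIs before the whole chain is broken.
unsigned Threshold = (ChainSize * 2 + 2) / 3;
bool BreakChain = NumInterestingPHIs >= Threshold;
```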
Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156414
Extends the new frexp scaled reciprocal to the general case. The
reciprocal case is just the same thing when frexp of 1 is constant
folded. Could probably clean up the code to rely on that constant
folding.
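For reference, the scaling identity involved is the following (host-math sketch, not the emitted IR sequence):
```
#include <cmath>

// a/b == ldexp(MantA / MantB, ExpA - ExpB); the reciprocal is the
// a == 1.0f special case, where frexp(1.0f) folds to {0.5f, 1}.
float frexpScaledDiv(float a, float b) {
  int ExpA, ExpB;
  float MantA = std::frexp(a, &ExpA); // a == ldexp(MantA, ExpA)
  float MantB = std::frexp(b, &ExpB); // b == ldexp(MantB, ExpB)
  return std::ldexp(MantA / MantB, ExpA - ExpB);
}
```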
Improves results for the IEEE path for the default OpenCL division. We
used to only emit the fdiv.fast intrinsic with a 2.5 ulp accuracy
threshold with DAZ, which uses explicit range checks. This gives us a
better fast option with the default IEEE behavior.