This pattern often appears in Flang-generated LLVM IR,
for example in the counts of the loops generated for array
expressions like `a(x:x+y)` or `a(x+z:x+z)` or their variations.
To compute the loop count, Flang needs to subtract
the lower bound of the array slice from its upper bound.
To avoid sign wraps, it sign-extends the original values
(which may be of any user data type) to `i64`.
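A minimal sketch of the resulting IR shape (assuming `i32` bounds; names are illustrative):
```
; hypothetical loop-count computation for a slice a(lb:ub)
%lb.ext = sext i32 %lb to i64
%ub.ext = sext i32 %ub to i64
%diff   = sub nsw i64 %ub.ext, %lb.ext
%count  = add nsw i64 %diff, 1
```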
This peephole is particularly helpful in CPU2017/548.exchange2,
which contains many statements like this:
```
block(row+1:row+2, 7:9, i7) = block(row+1:row+2, 7:9, i7) - 10
```
While this is just a 2x3-iteration loop nest, LLVM cannot
figure that out and ends up vectorizing the inner loop
aggressively (with a vector epilogue and a scalar remainder).
This, in turn, causes problems for LSR, which ends up creating
too many loop-carried values in the loop containing the above
statement, which then cause too many spills/reloads.
Alive2: https://alive2.llvm.org/ce/z/gLgfYX
Related to #143219.
Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing an easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
The proposed patch, in general, tries to transform the below code
sequence:
  x = 1.0 / sqrt(a);
  r1 = x * x;        // same as 1.0 / a
  r2 = a / sqrt(a);  // same as sqrt(a)
to (if x, r1, and r2 are all used further in the code):
  r1 = 1.0 / a;
  r2 = sqrt(a);
  x = r1 * r2;
The transform makes the high-latency sqrt and div operations
independent and also saves one multiplication.
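A minimal IR sketch of the transform (illustrative; assumes `double` operands and fast-math flags):
```
; before (x, r1, r2 all have further uses)
%sqrt = call fast double @llvm.sqrt.f64(double %a)
%x    = fdiv fast double 1.0, %sqrt
%r1   = fmul fast double %x, %x
%r2   = fdiv fast double %a, %sqrt
; after
%r1.n = fdiv fast double 1.0, %a
%r2.n = call fast double @llvm.sqrt.f64(double %a)
%x.n  = fmul fast double %r1.n, %r2.n
```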
The patch was tested with the SPEC2017 suite with cpu=neoverse-v2. The
performance uplift achieved was:
544.nab_r ~4%
No regressions were observed on the other benchmarks. Also, no
compile-time differences were observed with the patch.
Closes #54652
Fixes #115767
This PR folds `X udiv Y` to `X lshr cttz(Y)` if Y is a power of two,
since bitwise operations are faster than division.
Proof: https://alive2.llvm.org/ce/z/qHmLta
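For example (a sketch, assuming `%y` is known to be a power of two):
```
; before
%r = udiv i32 %x, %y
; after
%tz = call i32 @llvm.cttz.i32(i32 %y, i1 true)
%r2 = lshr i32 %x, %tz
```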
Also add a `tryGetLog2` helper that encapsulates the common pattern:
```
if (takeLog2(..., /*DoFold=*/false)) {
  Value *Log2 = takeLog2(..., /*DoFold=*/true);
  ...
}
```
Closes #122498
We left some easy opportunities for further simplifications.
log2(trunc(x)) is simply trunc(log2(x)). This is safe if we know that
trunc is NUW because it means that the truncation didn't drop any bits.
It is also safe if the caller is OK with zero as a possible answer.
log2(x >>u y) is simply `log2(x) - y`.
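A sketch of how these apply to the udiv fold (illustrative; the first case relies on the `nuw` on the trunc):
```
; log2(trunc nuw x): y / trunc(1 << x) simplifies to y >> trunc(x)
%shl  = shl nuw i32 1, %x
%t    = trunc nuw i32 %shl to i8
%div1 = udiv i8 %y, %t        ; => lshr i8 %y, (trunc i32 %x to i8)

; log2(p >>u z): x / (16 >> z) simplifies to x >> (4 - z)
%s    = lshr i32 16, %z
%div2 = udiv i32 %x, %s       ; => lshr i32 %x, (sub i32 4, %z)
```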
log2(x & y) is a funny one. It comes up when doing something like:
```
unsigned int f(unsigned int x, unsigned int y) {
  unsigned char a = 1u << x;
  return y / a;
}
```
LLVM would canonicalize this to:
```
%shl = shl nuw i32 1, %x
%conv1 = and i32 %shl, 255
%div = udiv i32 %y, %conv1
```
In cases like these, we can ignore the mask entirely.
This is equivalent to `y >> x`.
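That is, the whole sequence reduces to:
```
%div = lshr i32 %y, %x
```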
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...
`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.
Use the constant folding API instead (we later call isNotMinSignedValue
on it, so we do need the Constant* return type here). Use ImmConstant
to guarantee that constant folding succeeds.
foldFDivPowDivisor can transform A / powi(x, y) into A * powi(x, -y),
but for a small constant y, for example y = 2, instcombine will
transform powi(x, 2) into fmul x, x, so the fold is not optimal for A / powi(x, 2).
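A sketch of the preferred result for y = 2 (illustrative; reassoc flags assumed):
```
; A / powi(x, 2) is better served by dividing by the expanded multiply
%xx  = fmul reassoc double %x, %x
%div = fdiv reassoc double %a, %xx
```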
Fixes https://github.com/llvm/llvm-project/issues/77171
Since PR86428, foldPowiReassoc has been called by both FMul and FDiv.
Because the FDiv optimization is placed after the FMul one, it is
currently correct not to check for FDiv in the powi(X, Y) / X fold.
But we may add more matching scenarios later, and checking the opcode
explicitly is easier to understand.
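For reference, a sketch of the powi(X, Y) / X fold in IR (illustrative; assuming reassoc and a powi(X, Y - 1) result):
```
%p   = call reassoc double @llvm.powi.f64.i32(double %x, i32 %y)
%div = fdiv reassoc double %p, %x
; => %ym1 = sub i32 %y, 1
;    %div = call reassoc double @llvm.powi.f64.i32(double %x, i32 %ym1)
```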
Now that we don't accept undef splats in PatternMatch, we can remove some
uses of replaceUndefsWith(). I believe that in all these cases only
poison splats are possible now, in which case no replacement is necessary.
In #88217 a large set of matchers was changed to only accept poison
values in splats, but not undef values. This is because we now use
poison for non-demanded vector elements, and allowing undef can cause
correctness issues.
This patch covers the remaining matchers by changing the AllowUndef
parameter of getSplatValue() to AllowPoison instead. We also carry out
corresponding renames in matchers.
As a followup, we may want to change the default for things like m_APInt
to m_APIntAllowPoison (as this is much less risky when only allowing
poison), but this change doesn't do that.
There is one caveat here: We have a single place
(X86FixupVectorConstants) which does require handling of vector splats
with undefs. This is because this works on backend constant pool
entries, which currently still use undef instead of poison for
non-demanded elements (because SDAG as a whole does not have an explicit
poison representation). As it's just the single use, I've open-coded a
getSplatValueAllowUndef() helper there, to discourage use in any other
places.
This change updates a few of the transformations in foldFMulReassoc to
respect absent fast-math flags. In cases where fmul and fdiv, fadd, or
fsub instructions were being folded, the code was only checking for
fast-math flags on the fmul instruction and was transferring flags to
the folded instruction that were not present on the other original
instructions.
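For example (a sketch of the hazard; flags are illustrative):
```
; %mul is fast, but %div carries no flags; the folded result
; must not inherit "fast" wholesale from %mul
%div = fdiv double %x, %y
%mul = fmul fast double %div, %z
```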
This fixes https://github.com/llvm/llvm-project/issues/82857
Remove the fold working on abs in SPF representation now that we
canonicalize SPF to intrinsics.
This is not strictly NFC because the SPF fold might fire for
non-canonical IR due to multi-use, but given the lack of test coverage,
I assume this is not important.
Try to transform powi(X, Y) * X into powi(X, Y + 1) with Ofast.
In this case, when Y is 3, the resulting powi(X, 4) is replaced by
X2 = X * X; X2 * X2 in a further step.
Similar to D109954, which requires reassoc.
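A sketch in IR (illustrative; reassoc assumed):
```
%p = call reassoc double @llvm.powi.f64.i32(double %x, i32 3)
%m = fmul reassoc double %p, %x
; => %m = call reassoc double @llvm.powi.f64.i32(double %x, i32 4)
```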
Fixes https://github.com/llvm/llvm-project/issues/69862.
This patch fixes the issues introduced in bb5c3899d1.
I moved the check that the instruction is a div before the check for
the fast-math flags, which resolves the crash in
```
float a, b;
double sqrt();
void c() { b = a / sqrt(a); }
```
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
The full fold is one of the following:
1) `(fp_binop ({s|u}itofp x), ({s|u}itofp y))`
-> `({s|u}itofp (int_binop x, y))`
2) `(fp_binop ({s|u}itofp x), FpC)`
-> `({s|u}itofp (int_binop x, (fpto{s|u}i FpC)))`
And support the following binops:
`fmul` -> `mul`
`fadd` -> `add`
`fsub` -> `sub`
Proofs: https://alive2.llvm.org/ce/z/zuacA8
The proofs time out, so they must be reproduced locally.
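An example instance of fold 1) (a sketch; assumes the `i8` values convert exactly and the integer add cannot overflow):
```
; before
%xf = sitofp i8 %x to float
%yf = sitofp i8 %y to float
%r  = fadd float %xf, %yf
; after
%s  = add nsw i8 %x, %y
%r2 = sitofp i8 %s to float
```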
Closes #82555
If there are two undef operands, the select would get folded away
entirely. One undef operand can occur if the other two operands
do not satisfy the poison implication check. However, I don't think
that handling this edge case is worthwhile in this fold. If we
wanted to handle this, it would be more natural to do so in the
simplifyValueKnownNonZero() fold (as this is actually the property
we would be exploiting -- this doesn't really have any relation
to taking the log2).
This patch cleans up the duplicate code for folding commutative binops
over `select/phi/minmax`.
Related commits:
+ select support: 88cc35b27e
+ phi support: 8674a023bc
+ minmax support: 624973806c