The load folding optimization was very conservative by requiring the
root OR instruction to have a single use. This prevented optimization
when to fold loads when only the root had multiple uses.
For example:
%val = or i32 ... ; Assembles 4 bytes to i32
%use1 = call @foo(%val)
%use2 = call @bar(%val)
When LOps.RootInsert comes after LI2, since we use LI2 as the new insert
point, we should make sure the memory region accessed by LOps isn't
modified. However, the original implementation passes the bit width
`LOps.LoadSize` as the number of bytes to be accessed, causing BasicAA
to return NoAlias:
a941e15074/llvm/lib/Analysis/BasicAliasAnalysis.cpp (L1658-L1667)
With `-aa-trace`, we get:
```
End ptr getelementptr inbounds nuw (i8, ptr @g, i64 4) @ LocationSize::precise(1), %gep1 = getelementptr i8, ptr %p, i64 4 @ LocationSize::precise(32) = NoAlias
```
This patch uses `getTypeStoreSize` to compute the correct access size
for LOps. Instead of modifying the MemoryLocation for End (i.e.,
`LOps.RootInsert`), it also uses the computed base and AATag for
correctness.
Closes https://github.com/llvm/llvm-project/issues/169921.
This patch adds recognition of high-half multiply by parts into a single
larger multiply.
Considering a multiply made up of high and low parts, we can split the
multiply into:
x * y == (xh*T + xl) * (yh*T + yl)
where `xh == x>>32` and `xl == x & 0xffffffff`. `T = 2^32`.
This expands to
xh*yh*T*T + xh*yl*T + xl*yh*T + xl*yl
which I find it helpful to be drawn as
[ xh*yh ]
[ xh*yl ]
[ xl*yh ]
[ xl*yl ]
We are looking for the "high" half, which is xh*yh + xh*yl>>32 + xl*yh>>32 +
carrys. The carry makes this difficult and there are multiple ways of
representing it. The ones we attempt to support here are:
Carry: xh*yh + carry + lowsum
carry = lowsum < xh*yl ? 0x1000000 : 0
lowsum = xh*yl + xl*yh + (xl*yl>>32)
Ladder: xh*yh + c2>>32 + c3>>32
c2 = xh*yl + (xl*yl >> 32); c3 = c2&0xffffffff + xl*yh
Carry4: xh*yh + carry + crosssum>>32 + (xl*yl + crosssum&0xffffffff) >> 32
crosssum = xh*yl + xl*yh
carry = crosssum < xh*yl ? 0x1000000 : 0
Ladder4: xh*yh + (xl*yh)>>32 + (xh*yl)>>32 + low>>32;
low = (xl*yl)>>32 + (xl*yh)&0xffffffff + (xh*yl)&0xfffffff
They all start by matching `xh*yh` + 2 or 3 other operands. The bottom of the
tree is `xh*yh`, `xh*yl`, `xl*yh` and `xl*yl`.
Based on #156879 by @c-rhodes
This picks up from #166028, making the `Function` argument optional:
most cases don't need to provide it, but in e.g. InstCombine's case,
where the instruction (select, branch) is not attached to a function
yet, the function needs to be passed explicitly.
Co-authored-by: Florian Hahn <flo@fhahn.com>
When lowering a `table-based cttz` calculation to the `llvm.cttz`
intrinsic, `AggressiveInstCombine` was not attaching profile metadata to
the newly generated `select` instruction.
This PR adds heuristic branch weights to the `select`. It uses a strong
100-to-1 probability favoring the `cttz` path over the zero-input case.
This allows later passes to optimize code layout and branch prediction.
The memchr inliner creates new switch branches but was failling to add
profile metada. This patch fixes the issue by explicitly adding unknown
branch weights to these branches.
Issue [#147390](https://github.com/llvm/llvm-project/issues/147390)
The strcmp/strncmp inliner creates new conditional branches but was
failing to add profile metadata. This caused the ProfileVerifierPass to
fail when profcheck is enabled.
This patch fixes the issue by explicitly adding unknown branch weights
to these branches.
Issue #147390
This patch was a part of
https://github.com/llvm/llvm-project/pull/154375.
Two functional changes:
1. Allow matching other commuted patterns.
2. Allow combining loads even if there are multiple uses on a load. It
is beneficial in practice.
Similar to #150639 this fixes the AggressiveInstCombine fold for convert
tables to cttz instructions if the gep types are not array types. i.e
`gep i16 @glob, i64 %idx` instead of `gep [64 x i16] @glob, i64 0, i64 %idx`.
This is a small extension of #147540, resolving one of the FIXMEs.
Instead of bailing out on any instruction that may read/write memory,
use AA to check whether it can alias the stored parts. Do this using a
crude check based on the underlying object only.
This pattern occurs rarely in practice, but at the same time it also doesn't
seem to add any compile-time cost, so it's probably worth handling.
This is a minor extension of #147540, resolving one of the FIXMEs. If
the collected parts contain some non-consecutive elements, we can still
handle smaller ranges that *are* consecutive.
This is not common in practice and mostly shows up when the same value
is stored at two different offsets.
Merge multiple small stores that were originally extracted from one
value into a single store.
This is the store equivalent of the load merge optimization that
AggressiveInstCombine already performs.
This implementation is something of an MVP, with various generalizations
possible.
Fixes https://github.com/llvm/llvm-project/issues/147456.
Seeing how we can't generate any debug intrinsics any more: delete a
variety of codepaths where they're handled. For the most part these are
plain deletions, in others I've tweaked comments to remain coherent, or
added a type to (what was) type-generic-lambdas.
This isn't all the DbgInfoIntrinsic call sites but it's most of the
simple scenarios.
Co-authored-by: Nikita Popov <github@npopov.com>
If the input has known zero bits, InstCombine may have simplied one
of the expected And masks. Teach AggressiveInstCombine to use
MaskedValueIsZero to recover these missing bits.
Fixes#142042.
Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing a easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
If we have `icmp eq or(a, shl(b)), 0` then the shift can be removed so
long as it is nuw or nsw. It is still comparing that some bits are
non-zero.
https://alive2.llvm.org/ce/z/nhrBVX.
This is also true of ne, and true for longer or chains.
When AggressiveInstCombine replaces a memchr with a switch instruction,
it currently drops the DILocation for that memchr. This patch changes
this, propagating the memchr DILocation to all the generated
instructions that replace it.
Found using https://github.com/llvm/llvm-project/pull/107279.
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
This replaces some of the most frequent offenders of using a DenseMap that
cause a malloc, where the typical element-count is small enough to fit in
an initial stack allocation.
Most of these are fairly obvious, one to highlight is the collectOffset
method of GEP instructions: if there's a GEP, of course it's going to have
at least one offset, but every time we've called collectOffset we end up
calling malloc as well for the DenseMap in the MapVector.
When AggressiveInstCombine inlines a strcmp call, we currently copy the
strcmp's DILocation only to the br instruction that jumps to the inline
code. While this is roughly analogous to the original call, it leaves
the generated code without any source location, which is precarious for
a memory operation. This patch copies the strcmp call's DILocation to
all the generated code.
An alternative solution would be to generate a new DILocation with a
line 0 location and an inlinedAt pointing to the original call location,
but this would still give limited attribution to the generated code
without traversing the DIE, whereas the submitted solution allows
attribution with just the line table; even though it would be
technically more accurate, pragmatically I believe that copying the
call's location will be more useful for users.
This patch converts memchr with a small constant string into a switch.
It will reduce overhead of libcall and enable more folds (e.g.,
comparing the result with null).
References: https://en.cppreference.com/w/c/string/byte/memchr
Uses the new InsertPosition class (added in #94226) to simplify some of
the IRBuilder interface, and removes the need to pass a BasicBlock
alongside a BasicBlock::iterator, using the fact that we can now get the
parent basic block from the iterator even if it points to the sentinel.
This patch removes the BasicBlock argument from each constructor or call
to setInsertPoint.
This has no functional effect, but later on as we look to remove the
`Instruction *InsertBefore` argument from instruction-creation
(discussed
[here](https://discourse.llvm.org/t/psa-instruction-constructors-changing-to-iterator-only-insertion/77845)),
this will simplify the process by allowing us to deprecate the
InsertPosition constructor directly and catch all the cases where we use
instructions rather than iterators.
Inline calls to strcmp(s1, s2) and strncmp(s1, s2, N), where N and
exactly one of s1 and s2 are constant.
For example:
```c
int res = strcmp(s, "ab");
```
is converted to
```c
int res = (int)s[0] - (int)'a';
if (res != 0)
goto END;
res = (int)s[1] - (int)'b';
if (res != 0)
goto END;
res = (int)s[2] - (int)'\0';
END:
```
Ported from a similar gcc feature [Inline strcmp with small constant
strings](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809).
This patch refactors the interface of the `computeKnownFPClass` family
to pass `SimplifyQuery` directly.
The motivation of this patch is to compute known fpclass with
`DomConditionCache`, which was introduced by
https://github.com/llvm/llvm-project/pull/73662. With
`DomConditionCache`, we can do more optimization with context-sensitive
information.
Example (extracted from
[fmt/format.h](e17bc67547/include/fmt/format.h (L3555-L3566))):
```
define float @test(float %x, i1 %cond) {
%i32 = bitcast float %x to i32
%cmp = icmp slt i32 %i32, 0
br i1 %cmp, label %if.then1, label %if.else
if.then1:
%fneg = fneg float %x
br label %if.end
if.else:
br i1 %cond, label %if.then2, label %if.end
if.then2:
br label %if.end
if.end:
%value = phi float [ %fneg, %if.then1 ], [ %x, %if.then2 ], [ %x, %if.else ]
%ret = call float @llvm.fabs.f32(float %value)
ret float %ret
}
```
We can prove the signbit of `%value` is always zero. Then the fabs can
be eliminated.
We previously included debug instructions when counting instructions when
looking for loads to combine. This meant that the presence of debug
instructions could affect optimization, as shown in the updated testcase.
This fixes#69925.
The comments contain IR where some instructions don't fit in
80 columns. The extra part of the line was placed in front of
the next IR instruction instead of on its own line.
This reverts commit 5dde755188e34c0ba5304365612904476c8adfda,
cbfcf90152de5392a36d0a0241eef25f5e159eef and
8981520b19f2d2fe3d2bc80cf26318ee6b5b7473 due to a miscompile introduced in
8981520b19f2d2fe3d2bc80cf26318ee6b5b7473 (see
https://reviews.llvm.org/D154725#4568845 for details)
Differential Revision: https://reviews.llvm.org/D157430
As pointed out in D125755 the operand of a call to getCastInstrCost had the Src
and Dst the wrong way around.
Differential Revision: https://reviews.llvm.org/D154841
Partial progress towards replacing in-tree uses of
`Type::getPointerTo()`.
If `getPointerTo()` is used solely to support an unnecessary bitcast,
remove the bitcast.
Reviewed By: barannikov88, nikic
Differential Revision: https://reviews.llvm.org/D153307
This seems to be an issue currently where there are nested/chained GEP/BitCast Pointers.
The patch generates a new GEP for the wider load to avoid dominance problems.
Differential Revision: https://reviews.llvm.org/D150864