Extend folding for `X Pred C2 ? X BOp C1 : C2 BOp C1` to `min/max(X, C2)
BOp C1` to allow min and max as `BOp`. This ensures a constant clamping
pattern is folded into a pair of min/max instructions. Here is a
simplified example where this folding does not currently occur:
    int clampToU8(int v) {
      if (v < 0) return 0;
      if (v > 255) return 255;
      return v;
    }
https://godbolt.org/z/78jhKPWbv
Generic proof: https://alive2.llvm.org/ce/z/cdpLYy
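As a rough illustration of the target form (not the actual InstCombine
output), the clamp above is equivalent to a max of a min against the two
constants. The sketch below uses the hypothetical name `clampToU8MinMax`
and only checks that the two forms agree:

```cpp
#include <algorithm>
#include <cassert>

// The branchy clamp from the example above.
int clampToU8(int v) {
  if (v < 0) return 0;
  if (v > 255) return 255;
  return v;
}

// Hypothetical name: the same clamp written as a min/max pair, which is the
// shape the fold is meant to produce.
int clampToU8MinMax(int v) {
  return std::max(std::min(v, 255), 0);
}

// Checks that both forms agree on a range that crosses both bounds.
void checkClampForms() {
  for (int v = -1000; v <= 1000; ++v)
    assert(clampToU8(v) == clampToU8MinMax(v));
}
```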
We currently generate code like this on x86 for a jump table with 5 elements,
assuming the call target is in rbx:
lea global_addr(%rip), %rax # initialize temporary rax with base address
mov %rbx, %rcx # initialize another temporary rcx for index (rbx will be used for the call, so it is still live)
sub %rax, %rcx # compute `address - base`
ror $0x3, %rcx # compute `(address - base) ror 3` i.e. index
cmp $0x4, %rcx # check index <= 4
ja .Ltrap
[...]
.Ltrap:
ud1
A more efficient instruction sequence, that only needs one temporary
register and one fewer instruction, is possible by subtracting the
address we are testing from the fixed address instead of vice versa:
lea (global_addr + 4*8)(%rip), %rax # initialize temporary rax with address of last element
sub %rbx, %rax # compute `last element - address`
ror $0x3, %rax # compute `(last element - address) ror 3` i.e. 4 - index
cmp $0x4, %rax # check 4 - index <= 4 (same as above)
ja .Ltrap
[...]
.Ltrap:
ud1
Change LowerTypeTests to generate that sequence. As a consequence, the
order of bits in the bitsets is reversed. Because it doesn't matter how we
do the subtraction on other architectures (to the best of my knowledge),
do so unconditionally.
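The equivalence of the two sequences can be checked with a small model.
This is a sketch under assumptions: `rotr64`, `inTableOld`, `inTableNew`
and the table address are illustrative names and values, not code from
LowerTypeTests. Both checks accept exactly the 8-byte-aligned addresses of
the 5 entries; the new form yields `4 - index` instead of `index`, which
is why the bit order in the bitsets is reversed.

```cpp
#include <cassert>
#include <cstdint>

// Rotate right on 64 bits; n is assumed to be in [1, 63].
constexpr uint64_t rotr64(uint64_t x, unsigned n) {
  return (x >> n) | (x << (64 - n));
}

// Old form: (address - base) ror 3 <= 4.
constexpr bool inTableOld(uint64_t addr, uint64_t base) {
  return rotr64(addr - base, 3) <= 4;
}

// New form: (last element - address) ror 3 <= 4, with last = base + 4*8.
constexpr bool inTableNew(uint64_t addr, uint64_t base) {
  return rotr64(base + 4 * 8 - addr, 3) <= 4;
}

// Compares both checks on every address around a hypothetical table.
void checkTableForms() {
  const uint64_t base = 0x1000; // hypothetical table address
  for (uint64_t addr = base - 64; addr <= base + 128; ++addr)
    assert(inTableOld(addr, base) == inTableNew(addr, base));
}
```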
Reviewers: fmayer, vitalybuka
Reviewed By: fmayer
Pull Request: https://github.com/llvm/llvm-project/pull/142887
To aid in debugging, (optionally) dump the dot graph immediately after
the stack update phase (which matches nodes to interior callsites) and
before we clean up mismatched callee edges (either via tail call fixup,
indirect call fixup, or nulling otherwise).
In the LowerTypeTests pass we used to create IR like this:
%3 = zext i8 ptrtoint (ptr @__typeid_allones7_align to i8) to i64
%4 = lshr i64 %2, %3
%5 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid_allones7_align to i8)) to i64
%6 = shl i64 %2, %5
%7 = or i64 %4, %6
This is because when this code was originally written there were no
funnel shifts and as I recall it was necessary to create an i8 and zext
to pointer width (instead of just having a ptrtoint of pointer width)
in order for the shl/shr/or to be pattern matched to ror. At the time
this caused no problems because there existed a zext ConstantExpr. But
after zext ConstantExpr was removed in #71040, the newly present zext
instruction can prevent pattern matching the rotate, for example if
the zext gets hoisted to a loop preheader or common ancestor of the
check. LowerTypeTests was made to use fshr in #141735 so now we can
ptrtoint to pointer width and stop creating the zext.
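For reference, the relationship between the old lshr/shl/or expansion and
a funnel shift can be modelled as follows. This is a sketch of the
arithmetic only; `fshr64` and `rotrViaShifts` are hypothetical helper
names, and the model assumes 64-bit pointers.

```cpp
#include <cassert>
#include <cstdint>

// Model of a 64-bit funnel shift right: concatenate hi:lo, shift right by
// (s mod 64), and keep the low 64 bits.
constexpr uint64_t fshr64(uint64_t hi, uint64_t lo, uint64_t s) {
  s %= 64;
  return s == 0 ? lo : (lo >> s) | (hi << (64 - s));
}

// The older lshr/shl/or expansion of a rotate; only valid for s in [1, 63].
constexpr uint64_t rotrViaShifts(uint64_t x, uint64_t s) {
  return (x >> s) | (x << (64 - s));
}

// With both funnel inputs equal, fshr is a rotate right.
void checkRotateForms() {
  const uint64_t x = 0x0123456789abcdefULL;
  for (uint64_t s = 1; s < 64; ++s)
    assert(fshr64(x, x, s) == rotrViaShifts(x, s));
}
```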
Reviewers: fmayer, nikic
Reviewed By: nikic
Pull Request: https://github.com/llvm/llvm-project/pull/142886
This pass figures out whether inlining has exposed a constant address to
a lowered type test, and removes the test if so and the address is known
to pass the test. Unfortunately this pass ends up needing to reverse
engineer what LowerTypeTests did; this is currently inherent to the design
of ThinLTO importing where LowerTypeTests needs to run at the start.
Reviewers: teresajohnson
Reviewed By: teresajohnson
Pull Request: https://github.com/llvm/llvm-project/pull/141327
This option complements -funique-source-file-names and allows the user
to use a different unique identifier than the source file path.
Reviewers: teresajohnson
Reviewed By: teresajohnson
Pull Request: https://github.com/llvm/llvm-project/pull/142901
Most of the recent development on the MemProfiler has been on the Use part. The instrumentation has been quite stable for a while. As the complexity of the use grows (with undrifting, diagnostics etc) I figured it would be good to separate these two implementations.
This patch fixes the handling of a confused `Dependence` object. Such an
object doesn’t contain any information about dependencies, so we must
process it conservatively. However, it was converted into a direction
vector like `[I I ... I]`. As a result, it was treated as if there are
no loop-carried dependencies, which can lead to illegal loop exchanges.
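A minimal sketch of the conservative handling, under stated assumptions:
`SimpleDependence`, `directionVector` and `mayInterchange` are
hypothetical names, `*` is assumed to mean "any direction", and the
legality check is a deliberately crude placeholder rather than the real
interchange legality test.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for a dependence query result.
struct SimpleDependence {
  bool Confused;          // the analysis could not describe the dependence
  std::string Directions; // one entry per loop level when not confused
};

// Builds a direction vector for a loop nest of the given depth. A confused
// dependence yields '*' (any direction) at every level instead of 'I'.
std::string directionVector(const SimpleDependence &D, unsigned Depth) {
  if (D.Confused)
    return std::string(Depth, '*');
  return D.Directions;
}

// Placeholder legality check: reject whenever any level is unknown.
bool mayInterchange(const std::string &DirVec) {
  return DirVec.find('*') == std::string::npos;
}
```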
Fixes #140238
This patch implements a simple pass that tries to de-duplicate packs. If
there are two packing patterns inserting the exact same values in the
exact same order, then we will keep the top-most one of them. Even
though such patterns may be optimized away by subsequent passes it is
still useful to do this within the vectorizer because otherwise the cost
estimation may be off, making the vectorizer overly conservative.
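The idea can be sketched with a map keyed on the ordered values of each
pack. This is an illustration only: `Pack` and `dedupPacks` are
hypothetical names, and values are modelled as plain integer ids rather
than IR values.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model: a pack is the ordered list of value ids it inserts.
using Pack = std::vector<int>;

// Returns, for each pack in program order, the index of the pack to use in
// its place: the top-most pack with identical values in identical order.
std::vector<std::size_t> dedupPacks(const std::vector<Pack> &Packs) {
  std::map<Pack, std::size_t> FirstSeen;
  std::vector<std::size_t> Replacement;
  Replacement.reserve(Packs.size());
  for (std::size_t I = 0; I < Packs.size(); ++I) {
    // try_emplace keeps the existing (earlier) entry if the key is present.
    auto It = FirstSeen.try_emplace(Packs[I], I).first;
    Replacement.push_back(It->second);
  }
  return Replacement;
}
```

Packs with the same values in a different order are kept distinct, which
matches the "exact same order" condition above.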
Reapply PR142507 with fix for test: add in the same x86_64-linux
requirement as other tests as the stack ids are currently computed
differently on big endian systems. This will be investigated separately.
In order to allow selective reporting of context hinting during the LTO
link, and in the future to allow selective more aggressive cloning, add
an option to specify a minimum percent of the max cold size in the
profile summary. Contexts that meet that threshold will get context size
info metadata (and ThinLTO summary information) on the associated
allocations.
Specifying -memprof-report-hinted-sizes during the pre-LTO compile step
will continue to cause all contexts to receive this metadata. But
specifying -memprof-report-hinted-sizes only during the LTO link will
cause only those that meet the new threshold and have the metadata to
get reported.
To support this, because the alloc info summary and associated bitcode
requires the context size information to be in the same order as the
other context information, 0s are inserted for contexts without this
metadata. The bitcode writer uses a more compact format for the context
ids to allow better compression of the 0s.
As part of this change several helper methods are added to query whether
metadata contains context size info on any or all contexts.
We'd hit an assertion checking proper alignment for an i8 when building
Chromium because we used the preferred alignment (which is 4 bytes)
instead of the ABI alignment (which is 1 byte). The ABI alignment should
be used because that's the actual alignment needed to load a constant
from the vtable.
This also updates the two `virtual-const-prop-small-alignment-*` tests
to explicitly give ABI alignments for i64s.
Currently if there's any memory access that AccessAnalysis couldn't
analyze then all of the runtime pointer check results are discarded.
This patch makes that behaviour controllable through the AllowPartial
option: when it is set, we still generate the runtime check information
for those pointers that we could analyze, as transformations may still
be able to make use of the partial information.
Of the transformations that use LoopAccessAnalysis, only
LoopVersioningLICM changes behaviour as a result of this change. This is
because the others either:
* Check canVectorizeMemory, which will return false when we have partial
pointer information as analyzeLoop() will return false.
* Examine the dependencies returned by getDepChecker(), which will be
empty as we exit analyzeLoop if we have partial pointer information
before calling areDepsSafe(), which is what fills in the dependency
information.
Before this patch, InstCombine hung because it replaced a value with a
more complex one:
```
%sel = select i1 %cmp, i32 %smax, i32 0 ->
%sel = select i1 %cmp, i32 %masked, i32 0 ->
%sel = select i1 %cmp, i32 %smax, i32 0 ->
...
```
This patch makes this replacement more conservative. It only performs
the replacement iff the new value is one of the operands of the original
value.
Closes https://github.com/llvm/llvm-project/issues/142405.
Consider the following case:
```
define i1 @src(i8 %x) {
%cmp = icmp slt i8 %x, -1
%not1 = xor i1 %cmp, true
%or = or i1 %cmp, %not1
%not2 = xor i1 %or, true
ret i1 %not2
}
```
`sinkNotIntoLogicalOp(%or)` calls `freelyInvert(%cmp,
/*IgnoredUser=*/%or)` first. However, as `%cmp` is also used by `Op1 =
%not1`, the RHS of `%or` is set to `%cmp.not = xor i1 %cmp, true`. Thus
`Op1` is out of date in the second call to `freelyInvert`. Similarly,
the second call may change `Op0`. According to the analysis above, I
decided to avoid this fold when one of the operands is also a user of
the other.
Closes https://github.com/llvm/llvm-project/issues/142518.
ShapeInfo for the store operand may be dropped, e.g. because the operand
got folded by transpose optimizations to another instruction w/o shape
info. This was exposed by the assertion added in
https://github.com/llvm/llvm-project/pull/142416.
This updates VisitStore to use the shape-info directly from the
instruction, which is in line with the other Visit* functions and
ensures that we won't lose shape info.
PR: https://github.com/llvm/llvm-project/pull/142664
Add an `alloc-variant-zeroed` function attribute which can be used to
inform the folding of allocation+memset. This addresses
https://github.com/rust-lang/rust/issues/104847, where LLVM does not
know how to perform this transformation for non-C languages.
Co-authored-by: Jamie <jamie@osec.io>
Commutative intrinsics go through a separate code path, which did not
check for attribute compatibility, resulting in a later assertion
failure.
Fixes https://github.com/llvm/llvm-project/issues/142462.
They are all DFS state related, like `Visited`. But `Visited` is already
a class member, so making them members as well is more consistent and
leaves fewer parameters to pass around.
By itself, the patch has little value, but it simplifies the changes in
#142474.
For #142461
If the input has known zero bits, InstCombine may have simplified one
of the expected And masks. Teach AggressiveInstCombine to use
MaskedValueIsZero to recover these missing bits.
Fixes #142042.
Just for consistency, to avoid confusing conditions.
`reverse` helps to avoid test updates, as nothing
changes for successor counts <= 2.
For #142461
'goto' is essentially a shortcut for push/pop for worklist.
It can be expensive if we copy vectors, but if we move them, it
should not be an issue.
Without 'goto' it's easier to reason about the
code, when `PromoteMem2Reg::RenamePass` processes
exactly one edge at a time.
The first successor is still processed out of order;
I keep that only to make this patch pure NFC and will
remove it in follow-up patches.
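The general shape of a goto-free traversal, where each loop iteration
handles exactly one edge popped from the worklist, might look like the
sketch below. It is not the code of `PromoteMem2Reg::RenamePass`; `CFG`
and `visitOrder` are hypothetical names, and blocks are modelled as
integer ids.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical CFG: Succs[B] lists the successors of block B.
using CFG = std::vector<std::vector<int>>;

// Visits blocks depth-first using an explicit worklist of (pred, block)
// edges. Each loop iteration pops and processes exactly one edge.
std::vector<int> visitOrder(const CFG &Succs, int Entry) {
  std::vector<bool> Visited(Succs.size(), false);
  std::vector<int> Order;
  std::vector<std::pair<int, int>> Worklist;
  Worklist.emplace_back(-1, Entry); // -1: the entry edge has no predecessor
  while (!Worklist.empty()) {
    auto [Pred, BB] = Worklist.back();
    Worklist.pop_back();
    (void)Pred; // a real rename pass would use the incoming edge here
    if (Visited[BB])
      continue;
    Visited[BB] = true;
    Order.push_back(BB);
    // Push in reverse so the first successor is popped first.
    for (auto It = Succs[BB].rbegin(); It != Succs[BB].rend(); ++It)
      Worklist.emplace_back(BB, *It);
  }
  return Order;
}
```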
For #142461
Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing an easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
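The calling convention this aims for can be illustrated with a toy
function. `isKnownNonNegativeSketch` and its limit of 6 are assumptions
made for this sketch and are not ValueTracking APIs.

```cpp
#include <cassert>

// Hypothetical stand-in for a ValueTracking-style query; not an LLVM API.
// Depth is the last parameter and defaults to 0, so most callers omit it.
bool isKnownNonNegativeSketch(int Value, unsigned Depth = 0) {
  const unsigned MaxDepth = 6; // assumed recursion limit for this sketch
  if (Depth >= MaxDepth)
    return false; // give up conservatively at the limit
  return Value >= 0;
}
```

Callers that previously passed `0` explicitly can drop the argument, so
removing the parameter later only touches the recursive call sites.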
Apart from the stylistic improvement, lookup has the nice property of
returning a default-constructed object on failure-to-find, while find
returns the end iterator, which cannot be dereferenced.
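The difference can be sketched with a standard container. The helper
`lookupOrDefault` below is a hypothetical stand-in written against
`std::unordered_map`, not the LLVM map API itself:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for a map `lookup`: returns a copy of the mapped value, or a
// default-constructed value when the key is absent.
template <typename Map>
typename Map::mapped_type lookupOrDefault(const Map &M,
                                          const typename Map::key_type &K) {
  auto It = M.find(K);
  return It == M.end() ? typename Map::mapped_type() : It->second;
}
```

With `find`, the caller has to compare against `end()` before
dereferencing; with a lookup-style helper, the missing-key case simply
yields the default value.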
After updating #118638 on tip of tree, expanding
VPWidenIntOrFpInductionRecipes fails because it needs the loop region to
get the latch to insert the increment into:
    VPBasicBlock *ExitingBB =
        Plan->getVectorLoopRegion()->getExitingBasicBlock();
    Builder.setInsertPoint(ExitingBB,
                           ExitingBB->getTerminator()->getIterator());
    auto *Next = Builder.createNaryOp(AddOp, {Prev, Inc}, Flags,
                                      WidenIVR->getDebugLoc(), "vec.ind.next");
However after #117506, the region is dissolved so it doesn't work.
This shuffles the dissolveLoopRegions steps to be after
convertToConcreteRecipes so we can use the region when expanding
VPWidenIntOrFpInductionRecipes.
If the control flow between `lifetime.start` and `lifetime.end` is too
complex, it is acceptable to give up the optimization opportunity and
collect the alloca to the frame. However, storing to the frame will
lengthen the lifetime of the alloca, and the sanitizer will complain. I
propose we always erase lifetime intrinsics of spilled allocas.
Fix #124612
---------
Co-authored-by: Chuanqi Xu <yedeng.yd@linux.alibaba.com>
Directly replace the canonical IV when we dissolve the containing
region. That ensures that it won't get removed before the region gets
removed, which would result in an invalid region.
This removes the current ordering constraint between
convertToConcreteRecipes and dissolving regions.
PR: https://github.com/llvm/llvm-project/pull/142372
The function currently checks only the command-line argument to
determine whether we are compiling for the kernel. This is incorrect, as
the setting can also be passed programmatically.
Move VPlan-based calculateRegisterUsage from LoopVectorize
to VPlanAnalysis.cpp. It is a VPlan-based analysis and this helps
to reduce the size of LoopVectorize.
PR: https://github.com/llvm/llvm-project/pull/135673
With:
commit 2425626d803002027cbf71c39df80cb7b56db0fb
Author: Kazu Hirata <kazu@google.com>
Date: Sun Jun 1 08:09:58 2025 -0700
we print out a lot of duplicate alloc site matches.
This patch partially reverts the patch above. The core idea of using
a map to deduplicate entries remains the same, but details are
different. Specifically:
- This PR uses the [FullStackID, MatchLength] as the key, where
MatchLength is the length of an alloc site match.
- AllocMatchInfo in this PR no longer has Matched because we always
report matches.
- AllocMatchInfo in this PR no longer has NumFramesMatched because it
has become part of the key.
This deduplication roughly halves the number of messages printed out.
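A rough sketch of keying on [FullStackID, MatchLength] is shown below.
The payload field `TotalSize`, the `RawMatch` input type and the function
`dedupMatches` are hypothetical; the real AllocMatchInfo contents are not
described here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical payload; the real AllocMatchInfo fields are not shown here.
struct AllocMatchInfoSketch {
  uint64_t TotalSize = 0;
};

// Key: [FullStackID, MatchLength].
using MatchKey = std::pair<uint64_t, std::size_t>;

// Hypothetical raw record, one per alloc site match found.
struct RawMatch {
  uint64_t FullStackID;
  std::size_t MatchLength;
  uint64_t TotalSize;
};

// Keeps one entry per distinct key; later duplicates are ignored.
std::map<MatchKey, AllocMatchInfoSketch>
dedupMatches(const std::vector<RawMatch> &Matches) {
  std::map<MatchKey, AllocMatchInfoSketch> Unique;
  for (const RawMatch &M : Matches) {
    MatchKey Key{M.FullStackID, M.MatchLength};
    Unique.try_emplace(Key, AllocMatchInfoSketch{M.TotalSize});
  }
  return Unique;
}
```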