Update HCFG builder to set the incoming values directly at construction
for non-header phis.
Simplification/clarification as suggested independently in
https://github.com/llvm/llvm-project/pull/126388.
To simplify debugging and analysis, particularly for very large
applications with large graphs, this patch adds support for either
highlighting a single context id or allocation's context ids, and/or
only exporting the nodes/edges for a single context id or allocation's
context ids. When highlighting, the specified nodes and edges are a
brighter color and larger.
This can be controlled by the new -memprof-dot-scope={all,alloc,context}
flag which controls how much to export, along with two companion flags:
-memprof-dot-alloc-id=ID
-memprof-dot-context-id=ID
These two are interpreted differently depending on the value of
-memprof-dot-scope (where "all" is the default).
If exporting all, one of the above flags can optionally be passed to
highlight the nodes/edges for the given context id or allocation's
context ids.
If exporting alloc scope, an alloc id must be provided. A context id can
optionally be provided to highlight that context.
If exporting context scope, a context id must be provided.
The ids to use can be obtained either by looking at the full graph, or a
context id can be identified from the -memprof-report-hinted-sizes
output after PR128188 is merged.
During the whole program reporting of contexts when hinted byte
reporting is enabled via -memprof-report-hinted-sizes, also print the
internal context id. This is useful for debugging, as well as for
guiding the dot file dumping with some upcoming changes that will
accept a context id to focus the graph on a context of interest.
This is a copy of #126177, since it was automatically and permanently
closed because I messed up the source branch on my remote
This patch proposes to avoid converting widening recipes to VP
intrinsics during the EVL transform.
IIUC we initially did this to avoid `vl` toggles on RISC-V. However we
now have the RISCVVLOptimizer pass which mostly makes this redundant.
Emitting regular IR instead of VP intrinsics allows more generic
optimisations, both in the middle end and DAGCombiner, and we generally
have better patterns in the RISC-V backend for non-VP nodes. Sticking to
regular IR instructions is likely a lot less work than reimplementing
all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get
noticeably better code generation.
Replace binary of of two reductions with one reduction of the binary op
applied to vectors. For example:
```
%v0_red = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %v0)
%v1_red = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %v1)
%res = add i32 %v0_red, %v1_red
```
gets transformed to:
```
%1 = add <16 x i32> %v0, %v1
%res = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %1)
```
Set VPBBs predecessors before creating VPInstructions, as setting
incoming values for non-header phis directly there will require
predecessors to be available.
Common code has been unified and generalized.
Original commit: 123dca9b56e1359d8ec7771ea3bd0afd4b1ea6af
Previously reverted due to accidentally merged incompletely. The issue has
been addressed by restoring missing code.
I noticed this when looking at all allocations by clang. For a medium
sized file this was around 6000 calls to operator new, although i
suspect there were more allocations in total as the SmallVectors in
DataLayout may have their own allocations in some cases.
In a follow-up i'm tempted to make the DataLayout copy constructor
private, to avoid this in future. There are a few tests which copy the
DataLayout, and perhaps need to (I didn't check yet), but we could
provide a clone() method for them if needed. Its only accidental copying
I think we should consider avoiding, not people who really do need to
copy it for reasons.
This reverts commit 123dca9b56e1359d8ec7771ea3bd0afd4b1ea6af.
This breaks building on macOS with clang and multiple build bots,
including https://lab.llvm.org/buildbot/#/builders/175/builds/13585
llvm-project/llvm/lib/Transforms/Utils/SimplifyCFG.cpp: In function ‘bool sinkCommonCodeFromPredecessors(llvm::BasicBlock*, llvm::DomTreeUpdater*)’:
/b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/lib/Transforms/Utils/SimplifyCFG.cpp:2503:3: error: reference to ‘LockstepReverseIterator’ is ambiguous
2503 | LockstepReverseIterator<true> LRI(UnconditionalPreds);
| ^~~~~~~~~~~~~~~~~~~~~~~
Invoke the backedge computation (refactored as a new method) at the end
of the graph construction, instead of at the start of cloning. That
makes more logical sense, and it also makes it easier to look at the
results in the postbuild dot graph with a follow on change to display
those differently.
Common code has been unified and generalized. Not sure if it may be
worth to generalize this further, since it looks closely tied to Blocks
(might make sense to rename it in `LockstepReverseInstructionIterator`).
Two misc cleanup/improvements to the dot printing.
Remove a redundant "style=filled" in the Node attributes. No effect on
resulting graph.
Add a "color" attribute to the Edge, with the same color name as
"fillcolor". The latter only fills in the arrowhead, and the former is
what affects the line. This makes the edge colors more visible,
previously it was a black edge with a colored in arrowhead.
For the second change, I added the new Edge color attributes to the
checking in the two "basic.ll" tests, so we get some testing coverage of
the full printing. For the other affected tests I removed the final "]'"
after the fillcolor so it matches up through that attribute and ignores
the rest of the line.
This patch implements the check for not allowing re-scheduling of
instructions that have already been scheduled in a scheduling bundle.
Rescheduling should only happen if the instructions were temporarily
scheduled in singleton bundles during a previous call to
`trySchedule()`.
Up until now the generation of vector instructions was taking place
during the top-down post-order traversal of vectorizeRec(). The issue
with this approach is that the vector instructions emitted during the
traversal can be reordered by the scheduler, making it challenging to
place them without breaking the def-before-uses rule.
With this patch we separate the vectorization decisions (done in
`vectorizeRec()`) from the code generation phase (`emitVectors()`). The
vectorization decisions are stored in the `Actions` vector and are used
by `emitVectors()` to drive code generation.
We already push the old shuffles to the worklist as part of the replaceValue calls, so we shouldn't need to add them to the deferred list as well - my guess is this was to ensure that the instructions got erased first to help cleanup unused instructions, but eraseInstruction should handle this now.
The assertion was left over from a time when VPBBs still had an
associated condition bit. This is not the case any more (comment was
stale). In case a branch on condition is needed, a BranchOnCond
VPInstruction is added when constructing recipes. That's also where it
is checked if the condition is available.
Exposed by 38376dee9.
Proof: https://alive2.llvm.org/ce/z/mzVj-u
I will add some follow-up patches to avoid duplicate code, support more
memory instructions, and bypass gep instructions.
In order to facilitate cloning of recursive cycles, we first identify
backedges using a standard DFS search from the root callers, then
initially defer recursively invoking the cloning function via those
edges. This is because the cloning opportunity along the backedge may
not be exposed until the current node is cloned for other non-backedge
callers that are cold after the earlier recursive cloning, resulting
in a cold predecessor of the backedge. So we recursively invoke the
cloning function for the backedges during the cloning of the current
node for its caller edges (which were sorted to enable handling cold
callers first).
There was no significant time or memory overhead measured for several
large applications.
In removePartiallyOverlappedStores we iterate over
InstOverlapIntervalsTy which is a DenseMap. Change that map into using
MapVector to ensure that we apply the transforms in a deterministic
order. I've only seen that the order matters if starting to use names
for the instructions created when doing the transforms. But such things
are a bit annoying when debugging etc.
Querying TTI creates a Subtarget object, but an llvm.memcpy declaration
doesn't have target-cpu and target-feature attributes like functions
with definitions. This can cause a warning to be printed on RISC-V
because the target-abi in the Module requires floating point, but the
subtarget features don't enable floating point. So far we've only seen
this in LTO when an -mcpu is not supplied for the TargetMachine.
To fix this, get TTI for the calling function instead.
Fixes the issue reported here
https://github.com/llvm/llvm-project/issues/69780#issuecomment-2665273161
The mapping of IR ExitBB to a VPBB isn't used. It also sets an incorrect
VPBB for the ExitBB; the regions successor is the middle block, no the
exit block.
It also unnecessarily triggers an assertion after 38376dee922.
for `trunc nuw` saves a instruction and otherwise only other
instructions without the select, same behavior as for bit test before.
proof: https://alive2.llvm.org/ce/z/a6QmyV
In a particular scenario (see test) we used to insert scheduled
instructions into the ready list. This patch fixes this by fixing the
trimSchedule() function.
This patch implements a small region pass that saves the context's
state. The patch is now used in the default pipeline to save the context
state instead of the hard-coded call to Context::save().
The concept behind this is that the passes themselves should not have to
do the actual saving/restoring of the IR state, because that would make
it challenging to reorder them in the pipeline. Having separate
save/restore passes makes the transformation passes more composable as
parts of arbitrary pipelines.
This patch moves the seed collection logic from the BottomUpVec pass
into a new Sandbox IR Function pass. The new "seed-collection" pass
collects the seeds, builds a region and runs the region pass pipeline.
The last use was removed in:
commit fa6ea7a419f37befbed04368bcb8af4c718facbb
Author: Arthur Eubanks <aeubanks@google.com>
Date: Mon Mar 20 11:18:35 2023 -0700
Consider IR like this
call void @llvm.memset.p0.i64(ptr dereferenceable(28) %p, i8 0, i64 28,
i1 false)
store i32 1, ptr %p
In the past it has been optimized like this:
%p2 = getelementptr inbounds i8, ptr %p, i64 4
call void @llvm.memset.p0.i64(ptr dereferenceable(28) %p2, i8 0, i64 24,
i1 false)
store i32 1, ptr %p
As the input IR doesn't guarantee that it is OK to deref 28 bytes
starting at the adjusted pointer %p2 the transformation has been a bit
flawed.
With this patch we make sure to drop any
dereferenceable/dereferenceable_or_null attributes when doing such
transforms. An alternative would have been to adjust the amount of
dereferenceable bytes, but since a memset with a constant length already
implies dereferenceability by itself it is simpler to just drop the
attributes.
The new filtering of attributes is done using a helper that only keep
attributes that we explicitly handle. For the adjusted mem instrinsic
pointers that currently involve "NonNull", "NoUndef" and "Alignment"
(when the alignment is known to be fulfilled also after offsetting the
pointer).
Fixes#115976