According to definition of canonical form, it is a canonical
if scale reg does not contain addrec for loop L then none of bases
should contain addrec for this loop.
The critical word here is "contains".
Current checker of canonical form checks not "containing" property
but "is". So it does not check whether it contains but whether it is.
Fix the checker and canonicalizing utility to follow definition.
Without this fix in the test attached the base formula looking as
reg((-1 * {0,+,8}<nuw><nsw><%bb2>)<nsw>) + 1*reg((8 * (%arg /u 8))<nuw>)
is considered as conanocial while base contains an addrec.
And modified formula we want to insert
reg({0,+,8}<nuw><nsw><%bb2>) + 1*reg((-8 * (%arg /u 8)))
is considered as not canonical.
Reviewed By: mkazantsev
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D122457
We have the same code repeated in both callers, sink it into callee.
The motivation here isn't just code style, we can also defer the relatively expensive aliasing checks until the cheap structural preconditions have been validated. (e.g. Don't bother aliasing if src is not an alloca.) This helps compile time significantly.
If we vectorize a e.g. store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well.
This is purely for test output quality and readability. It should have no effect in any sane pipeline.
Differential Revision: https://reviews.llvm.org/D122493
Inline assembly is scary but we need to support it for the OpenMP GPU
device runtime. The new assumption expresses the fact that it may not
have call semantics, that is, it will not call another function but
simply perform an operation or side-effect. This is important for
reachability in the presence of inline assembly.
Differential Revision: https://reviews.llvm.org/D109986
Before we gave up if a call through bitcast had parameter attributes.
Interestingly, we allowed attributes for the return value already. We
now handle both the same way, namely, we drop the ones that are
incompatible with the new type and keep the rest. This cannot cause
"more UB" than initially present.
Differential Revision: https://reviews.llvm.org/D119967
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D115907
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.
When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.
This change removes the requirement to create the latch block up-front
for VPWidenIntOrFpInductionRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.
As an alternative, the increment could be modeled as separate recipe,
but that would require more work and a bit of redundant code, as we need
to create the step-vector during VPWidenIntOrFpInductionRecipe::execute
anyways, to create the values for different parts.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121617
This was reusing a cast to GlobalVariable to check for an
Instruction, which means we'll try to dereference a null pointer
if it's not actually a GlobalVariable. We should be casting
MTI->getSource() instead.
I don't think this problem is really specific to opaque pointers,
but it certainly makes it a lot easier to reproduce.
Fixes https://github.com/llvm/llvm-project/issues/54572.
The current implementation of Function Specialization does not allow
specializing more than one arguments per function call, which is a
limitation I am lifting with this patch.
My main challenge was to choose the most suitable ADT for storing the
specializations. We need an associative container for binding all the
actual arguments of a specialization to the function call. We also
need a consistent iteration order across executions. Lastly we want
to be able to sort the entries by Gain and reject the least profitable
ones.
MapVector fits the bill but not quite; erasing elements is expensive
and using stable_sort messes up the indices to the underlying vector.
I am therefore using the underlying vector directly after calculating
the Gain.
Differential Revision: https://reviews.llvm.org/D119880
This simplifies the implementation of eraseInstruction by moving the odd-replace-users-with-undef handling back to the only caller which uses it. This handling was not obviously correct, so add the asserts which make it clear why this is safe to do at all. The result is simpler code and stronger assertions.
The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling). Most of these were fixed over the weekend and have had several days to bake. The last was fixed this morning after being noticed in manual review of test changes yesterday. See the review thread for links to each change.
Original commit message follows:
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.
This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:
Before this patch:
704357 SLP - Number of calcDeps actions
699021 SLP - Number of schedule calls
5598 SLP - Number of ReSchedule actions
59 SLP - Number of ReScheduleOnFail actions
10084 SLP - Number of schedule resets
8523 SLP - Number of vector instructions generated
After this patch:
102895 SLP - Number of calcDeps actions
161916 SLP - Number of schedule calls
5637 SLP - Number of ReSchedule actions
55 SLP - Number of ReScheduleOnFail actions
10083 SLP - Number of schedule resets
8403 SLP - Number of vector instructions generated
I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.
The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.
For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
Differential Revision: https://reviews.llvm.org/D118538
After writing the commit message for 4b1bace28, realized that the mentioned optimization was rather straight forward. We already have the code for scanning a block during region initialization, we can simply keep track if we've seen a stacksave or stackrestore. If we haven't, none of these dependencies are relevant and we can avoid the relatively expensive scans entirely.
This is an extension of commit b7806c to handle one last case noticed in test changes for D118538. Again, this is thought to be a latent bug in the existing code, though this time I have not managed to reduce tests for the original algoritthm.
The prior attempt had failed to account for this case:
%a = alloca i8
stacksave
stackrestore
store i8 0, i8* %a
If we allow '%a' to reorder into the stacksave/restore region, then the alloca will be deallocated before the use. We will have taken a well defined program, and introduced a use-after-free bug.
There's also an inverse case where the alloca originally follows the stackrestore, and we need to prevent the reordering it above the restore.
Compile time wise, we potentially do an extra scan of the block for each alloca seen in a bundle. This is significantly more expensive than the stacksave rooted version and is why I'd tried to avoid this in the initial patch. There is room to optimize this (by essentially caching a "has stacksave" bit per block), but I'm leaving that to future work if it actually shows up in practice. Since allocas in bundles should be rare in practice, I suspect we can defer the complexity for a long while.
CoverageMappingModuleGen generates a coverage mapping record
even for unused functions with internal linkage, e.g.
static int foo() { return 100; }
Clang frontend eliminates such functions, but InstrProfiling pass
still pulls in profile runtime since there is a coverage record.
Fuchsia uses runtime counter relocation, and pulling in profile
runtime for unused functions causes a linker error:
undefined hidden symbol: __llvm_profile_counter_bias.
Since 389dc94d4be7, we do not hook profile runtime for the binaries
that none of its translation units have been instrumented in Fuchsia.
This patch extends that for the instrumented binaries that
consist of only unused functions.
Differential Revision: https://reviews.llvm.org/D122336
Update all places that currently assume the entry block to the plan is
also the vector loop header to use getVectorLoopRegion instead.
getVectorLoopRegion will keep doing the right thing when the pre-header
is modeled explicitly (and becomes the new entry block in the plan).
Probe-based profile leads to a better performance when combined with profi and ext-tsp block layout. I'm turning them on by default.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D122442
We already do this for SelectionDAG, but we're missing it here.
Noticed while re-triaging PR21929
Differential Revision: https://reviews.llvm.org/D122340
We can not bitcast pointers across different address spaces. This was
previously fixed in D89577 but then in D93229 an enhancement was added
which peeks further through the ponter operand, opening up the
possibility that address-space violations could be introduced.
Instead of bailing as the previous fix did, simply insert an
addrspacecast cast instruction.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D121787
This reverts commit 7aea3ea8c3b33c9bb338d5d6c0e4832be1d09ac3 as it
breaks the buildbots.
I didn't see these failures in the pre-merge checks, looking into it.
Most intrinsics, especially "default" ones, will not call back into the
IR module. `nocallback` encodes this nicely. As it was not used before,
this patch also makes use of `nocallback` in the Attributor which
results in many more `norecurse` deductions.
Tablegen part is mechanical, test updates by script.
Differential Revision: https://reviews.llvm.org/D118680
This patch moves pointer induction handling from VPWidenPHIRecipe to its
own recipe. In the process, it adds all information required to generate
code for pointer inductions without relying on Legal to access the list
of induction phis.
Alternatively VPWidenPHIRecipe could also take an optional pointer to InductionDescriptor.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121615
Add the "(0 - X) --> (X * -1)" reverse identity to the list of alternate form binops.
We need a little hack to make the existing logic work because it does not expect to
move constants from op0 to op1, but the code comment hopefully makes that clear.
I don't think there are any other identities like that.
Fixes#54364
Differential Revision: https://reviews.llvm.org/D122390
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121736
There is potential for endless recursion if we try to determine the
underlying objects of a load, just to end up with the load as underlying
object. A proper solution will require us to pass a visited set around.
This will happen as we cleanup genericValueTraversal soon.
When --disable-sample-loader-inlining is true, skip inline transformation, but merge profiles of inlined instances to outlining versions.
Differential Revision: https://reviews.llvm.org/D121862
CoroSplit lowers various coroutine intrinsics. It's a CGSCC pass and
CGSCC passes don't run on unreachable functions. Normally GlobalDCE will
come along and delete unreachable functions, but we don't run GlobalDCE
under -O0, so an unreachable function with coroutine intrinsics may
never have CoroSplit run on it.
This patch adds GlobalDCE when coroutines intrinsics are present. It
also now runs all coroutine passes conditional when coroutine intrinsics
are present. This should also solve the -O0 regression reported in
D105877 due to LazyCallGraph construction.
Fixes https://github.com/llvm/llvm-project/issues/54117
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D122275
EarlyCSE currently optimizes all MemoryUses upfront. However,
EarlyCSE only actually queries the clobbering memory access for
a subset of uses, namely those where a CSE candidate has already
been identified. Delaying use optimization to the clobber query
improves compile-time in practice.
This change is not NFC because EarlyCSE has a limit on the number
of clobber queries (EarlyCSEMssaOptCap), in which case it falls
back to the defining access. The defining access for uses will now
no longer coincide with the optimized access.
If there are performance regressions from this change, we should
be able to address them by raising this limit.
Differential Revision: https://reviews.llvm.org/D121987
The first attempt at this missed a validity check.
This version includes a test of the narrow source
type for modulo-16-bits.
Original commit message:
This is the IR counterpart to 370ebc9d9a573d6
which provided a bswap narrowing fix for issue #53867.
Here we can be more general (although I'm not sure yet
what would happen for illegal types in codegen - too
rare to worry about?):
https://alive2.llvm.org/ce/z/3-CPfo
This will be more effective if we have moved the shift
after the bswap as proposed in D122010, but it is
independent of that patch.
Differential Revision: https://reviews.llvm.org/D122166