This was introduced ~5yrs ago (by me), and has never really gotten
any adoption. By now, it's significantly out of sync with new/changed
poison propoagation rules. The idea is still reasonable, but the
imagined use case is largely covered by alive2 these days anyways.
https://github.com/llvm/llvm-project/pull/65972 introduced
-ubsan-unique-traps and -bounds-checking-unique-traps, which attach the
function size to the ubsantrap intrinsic.
https://github.com/llvm/llvm-project/pull/117651 changed
ubsan-unique-traps to use nomerge instead of the function size, but did
not update -bounds-checking-unique-traps. This patch adds nomerge to
bounds-checking-unique-traps.
- update `VectorUtils:isVectorIntrinsicWithScalarOpAtArg` to use TTI for
all uses, to allow specifiction of target specific intrinsics
- add TTI to the `isVectorIntrinsicWithStructReturnOverloadAtField` api
- update TTI api to provide `isTargetIntrinsicWith...` functions and
consistently name them
- move `isTriviallyScalarizable` to VectorUtils
- update all uses of the api and provide the TTI parameter
Resolves#117030
Following on from https://github.com/llvm/llvm-project/pull/94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.
---------
Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>
This patch undrifts source locations in MemProfRecord before readMemprof
starts the matching process.
The thoery of operation is as follows:
1. Collect the lists of direct calls, one from the IR and the other
from the profile.
2. Compute the correspondence (called undrift map in the patch)
between the two lists with longestCommonSequence.
3. Apply the undrift map just before readMemprof consumes
MemProfRecord.
The new function gated by a flag that is off by default.
This sets up the initial blocks needed to initialize a VPlan directly
in the constructor. This will allow tracking of all created blocks
directly in VPlan, simplifying block deletion.
Don't unnecessarily clone for a caller that wasn't matched to a call
instruction.
This necessitated updated a couple of tests that were either
unnecessarily cloning or unnecessarily processing an allocation and
hinting it not cold.
Currently the addUsersInExitBlocks incorrectly assumes exit phis only
have a single operand, which may not be the case for loops with early
exits when they share a common exit block.
Also further relax the assertion in fixupIVUsers to allow exit values if
they come from theloop latch/middle.block.
PR: https://github.com/llvm/llvm-project/pull/120260
PR #112138 introduced initial support for dispatching to
multiple exit blocks via split middle blocks. This patch
fixes a few issues so that we can enable more tests to use
the new enable-early-exit-vectorization flag. Fixes are:
1. The code to bail out for any loop live-out values happens
too late. This is because collectUsersInExitBlocks ignores
induction variables, which get dealt with in fixupIVUsers.
I've moved the check much earlier in processLoop by looking
for outside users of loop-defined values.
2. We shouldn't yet be interleaving when vectorising loops
with uncountable early exits, since we've not added support
for this yet.
3. Similarly, we also shouldn't be creating vector epilogues.
4. Similarly, we shouldn't enable tail-folding.
5. The existing implementation doesn't yet support loops
that require scalar epilogues, although I plan to add that
as part of PR #88385.
6. The new split middle blocks weren't being added to the
parent loop.
insertelt DestVec, (fneg (extractelt SrcVec, Index)), Index
-> shuffle DestVec, (shuffle (fneg SrcVec), poison, SrcMask), Mask
Original combining left the combine between vectors of different lengths as a TODO.
When building SPEC CPU 2017 with RISC-V and EVL tail folding, this
assertion in VPTypeAnalysis would trigger during the transformation to
EVL recipes:
d8a0709b10/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (L135-L142)
It was caused by this recipe:
```
WIDEN ir<%shr> = vp.or ir<%add33>, ir<0>, vp<%6>
```
Having its type inferred as i16, when ir<%add33> and ir<0> had inferred
types of i32 somehow.
The cause of this turned out to be because the VPTypeAnalysis cache was
getting clobbered: In this transform we were erasing recipes but keeping
around the same mapping from VPValue* to Type*. In the meantime, new
recipes would be created which would have the same address as the old
value. They would then incorrectly get the old erased VPValue*'s cached
type:
```
--- before ---
0x600001ec5030: WIDEN ir<%mul21.neg> = vp.mul vp<%11>, ir<0>, vp<%6>
0x600001ec5450: <badref> <- some value that was erased
--- after ---
0x600001ec5030: WIDEN ir<%mul21.neg> = vp.mul vp<%11>, ir<0>, vp<%6>
0x600001ec5450: WIDEN ir<%shr> = vp.or ir<%add33>, ir<0>, vp<%6> <- a new value that happens to have the same address
```
This fixes this by deferring the erasing of recipes till after the
transformation.
The test case might be a bit flakey since it just happens to have the
right conditions to recreate this. I tried to add an assert in
inferScalarType that every VPValue in the cache was valid, but couldn't
find a way of telling if a VPValue had been erased.
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
This fixes a crash that shows up when building SPEC CPU 2017 with EVL
tail folding on RISC-V.
A VPWidenCastRecipe doesn't always have an underlying value, and in the
case of this crash this happens whenever a widened cast is created via
truncateToMinimalBitwidths.
Fix this by just using the opcode stored in the recipe itself.
I think a similar issue exists with VPWidenIntrinsicRecipe and how it's
widened, but I haven't run into any crashes with it just yet.
Optionally unconditionally hint allocations as cold or not cold during
the matching step if the percentage of bytes allocated is at least that
of the given threshold.
This was originally done to reduce the diff for the change. Remove it
and update the remaining tests. NFC modulo reordering of incoming
values.
Clean up after https://github.com/llvm/llvm-project/pull/114292.
We don't fold "shuffle (binop), (binop)" -> "binop (shuffle), (shuffle)" if the old/new costs are equal, but we can relax this if either new shuffle will constant fold as it will reduce instruction count.
This patch introduces the LLVM components of a type sanitizer: a
sanitizer for type-based aliasing violations.
It is based on Hal Finkel's https://reviews.llvm.org/D32198.
C/C++ have type-based aliasing rules, and LLVM's optimizer can exploit
these given TBAA metadata added by Clang. Roughly, a pointer of given
type cannot be used to access an object of a different type (with, of
course, certain exceptions). Unfortunately, there's a lot of code in the
wild that violates these rules (e.g. for type punning), and such code
often must be built with -fno-strict-aliasing. Performance is often
sacrificed as a result. Part of the problem is the difficulty of finding
TBAA violations. Hopefully, this sanitizer will help.
For each TBAA type-access descriptor, encoded in LLVM's IR using
metadata, the corresponding instrumentation pass generates descriptor
tables. Thus, for each type (and access descriptor), we have a unique
pointer representation. Excepting anonymous-namespace types, these
tables are comdat, so the pointer values should be unique across the
program. The descriptors refer to other descriptors to form a type
aliasing tree (just like LLVM's TBAA metadata does). The instrumentation
handles the "fast path" (where the types match exactly and no
partial-overlaps are detected), and defers to the runtime to handle all
of the more-complicated cases. The runtime, of course, is also
responsible for reporting errors when those are detected.
The runtime uses essentially the same shadow memory region as tsan, and
we use 8 bytes of shadow memory, the size of the pointer to the type
descriptor, for every byte of accessed data in the program. The value 0
is used to represent an unknown type. The value -1 is used to represent
an interior byte (a byte that is part of a type, but not the first
byte). The instrumentation first checks for an exact match between the
type of the current access and the type for that address recorded in the
shadow memory. If it matches, it then checks the shadow for the
remainder of the bytes in the type to make sure that they're all -1. If
not, we call the runtime. If the exact match fails, we next check if the
value is 0 (i.e. unknown). If it is, then we check the shadow for the
remainder of the byes in the type (to make sure they're all 0). If
they're not, we call the runtime. We then set the shadow for the access
address and set the shadow for the remaining bytes in the type to -1
(i.e. marking them as interior bytes). If the type indicated by the
shadow memory for the access address is neither an exact match nor 0, we
call the runtime.
The instrumentation pass inserts calls to the memset intrinsic to set
the memory updated by memset, memcpy, and memmove, as well as
allocas/byval (and for lifetime.start/end) to reset the shadow memory to
reflect that the type is now unknown. The runtime intercepts memset,
memcpy, etc. to perform the same function for the library calls.
The runtime essentially repeats these checks, but uses the full TBAA
algorithm, just as the compiler does, to determine when two types are
permitted to alias. In a situation where access overlap has occurred and
aliasing is not permitted, an error is generated.
Clang's TBAA representation currently has a problem representing unions,
as demonstrated by the one XFAIL'd test in the runtime patch. We'll
update the TBAA representation to fix this, and at the same time, update
the sanitizer.
When the sanitizer is active, we disable actually using the TBAA
metadata for AA. This way we're less likely to use TBAA to remove memory
accesses that we'd like to verify.
As a note, this implementation does not use the compressed shadow-memory
scheme discussed previously
(http://lists.llvm.org/pipermail/llvm-dev/2017-April/111766.html). That
scheme would not handle the struct-path (i.e. structure offset)
information that our TBAA represents. I expect we'll want to further
work on compressing the shadow-memory representation, but I think it
makes sense to do that as follow-up work.
It goes together with the corresponding clang changes
(https://github.com/llvm/llvm-project/pull/76260) and compiler-rt
changes (https://github.com/llvm/llvm-project/pull/76261)
PR: https://github.com/llvm/llvm-project/pull/76259
Summary:
Previously, we'd add all SPs distinct from the cloned one into a set.
Then when cloning a local scope we'd check if it's from one of those
'distinct' SPs by checking if it's in the set. We don't need to do that.
We can just check against the cloned SP directly and drop the set.
Test Plan:
ninja check-llvm-unit check-llvm
VPInstruction has a definition of mayWriteToMemory, which seems to only
be used by VPlanSLP. However VPInstructions are already handled in
VPRecipeBase::mayWriteToMemory, and everywhere else seems to use this
definition. I think these should be the same for all intents and
purposes. The VPRecipeBase definition is more conservative but returns
true for stores/calls/invokes/SLPStores.
Update VPReductionPHIRecipe::execute to use the start value from the
start value operand of the recipe. This is needed to make sure we resume
from the correct value during epilogue vectorization.
At the moment, the start value is set to the sentinel value in
adjustRecipesForReductions, as the original start value needs to be used
when creating ResumePhi recipes.
Fixes a mis-compile introduced by b3cba9be41bfa8 in SPEC2017 on AArch64.
Summary:
The new API expects the caller to populate the VMap. We need it this way
for a subsequent change around coroutine cloning.
Test Plan:
ninja check-llvm-unit check-llvm
Split off from https://github.com/llvm/llvm-project/pull/112145. This
ensures that getOrCreateVectorTripCount creates the trip count as needed
when induction resume value creation is moved to VPlan and no longer
creates the vector trip count early.
ListSeparator from StringExtras.h is essentially the same as
FieldSeparator being removed in this patch. ListSeparator returns the
empty string on the first use via "operator StringRef()". It returns
", " on subsequent uses.
Loop Optimizations expect the input loop to be in LCSSA form. But it
seems that LoopVersioning doesn't have any check to see if the loop is
actually in LCSSA form. As a result, if we give it a loop which is not
in LCSSA form but still correct semantically, the resulting
transformation fails to pass through verifier pass with the following
error.
Instruction does not dominate all uses!
%inc = add nsw i16 undef, 1
store i16 %inc, ptr @c, align 1
As the loop is not in LCSSA form, LoopVersioning's transformations leads
to invalid IR! As some instructions do not dominate all their uses.
This patch checks if a loop is in LCSSA form, if not it will call
formLCSSARecursively on the loop before passing it to LoopVersioning.
Fixes: #36998
I'm not sure if getStepVector was used for other things in the past
where StartIdx was non-zero, but nowadays VPWidenIntOrFpInductionRecipe
is the only user of it, and just passes zero to it. I presume
InstCombine was already catching this so hopefully removing this won't
affect codegen.
Modernize VPWidenIntOrFpInductionRecipe printing by including the result
VPValue and all operand VPValues, similar to VPScalarIVStepsRecipe and
VPDerivedIVRecipe.
When the samesign flag is present on an icmp, we can transfer all the
facts on the unsigned system to the signed system, and vice-versa: we do
this by specializing transferToOtherSystem when samesign is present.