This implements the first half of #151459 by changing the AVL so that it is
no longer computed as `trip-count - EVL-based IV`, but is instead a
separate scalar phi that is decremented by the EVL each iteration.
This shortens the dependency chain for computing the AVL and should
eventually allow us to convert the branch condition to `branch-count
avl-next, 0`.
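A rough sketch of the intended shape, in simplified IR (names, types, and the VF are illustrative, not the exact VPlan output):

```llvm
loop:
  ; AVL as its own phi, decremented by the EVL, instead of being recomputed
  ; each iteration as trip-count minus the EVL-based IV.
  %avl      = phi i64 [ %trip.count, %entry ], [ %avl.next, %loop ]
  %evl      = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 4, i1 true)
  ; ... vector work predicated on %evl ...
  %evl.zext = zext i32 %evl to i64
  %avl.next = sub i64 %avl, %evl.zext
  ; Eventually the exit test could become a direct compare of %avl.next with 0.
  %done     = icmp eq i64 %avl.next, 0
  br i1 %done, label %exit, label %loop
```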
`simplifyBranchConditionForVFAndUF` had to be updated to prevent a
regression because this introduces a VPPhi in the header block.
hwasan-globals does not instrument globals with custom sections, because
existing code may use `__start_`/`__stop_` symbols to iterate over
globals in a way that would trigger hwasan assertions.
Introduce a new hwasan-all-globals option, which instruments all
user-defined globals (but not the globals generated by the hwasan
instrumentation itself), including those with custom sections.
Fixes #142442
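For illustration only (hypothetical section and symbol names), this is the pattern that makes section-placed globals unsafe to instrument by default:

```llvm
; A global placed in a custom section. Other code may walk the section via the
; linker-generated __start_/__stop_ symbols and expects plain, contiguous,
; untagged objects, which hwasan instrumentation would break.
@entry = constant i32 1, section "my_registry"
@__start_my_registry = external global i32
@__stop_my_registry = external global i32
```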
This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights` metadata.
As [suggested in the PR #128785
review](https://github.com/llvm/llvm-project/pull/128785#discussion_r2151091036),
it does so via a new `PGOEstimateTripCountsPass` pass, which creates the
new metadata for each loop but omits the value if it cannot estimate a
trip count due to the loop's form.
An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip count,
but later passes can sometimes transform the loop in a way that makes it
possible. Currently, such passes do not necessarily update the metadata,
but eventually that should be fixed. Until then, if the new metadata has
no value, `llvm::getLoopEstimatedTripCount` disregards it and tries
again to estimate the trip count from the loop's current
`branch_weights` metadata.
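A hypothetical sketch of the metadata shape (the exact operand encoding is defined by the RFC/patch and may differ):

```llvm
; Loop metadata carrying an estimated trip count derived from the latch's
; branch_weights (99 backedges for every exit -> roughly 100 iterations).
latch:
  br i1 %exit.cond, label %exit, label %header, !prof !0, !llvm.loop !1

!0 = !{!"branch_weights", i32 1, i32 99}
!1 = distinct !{!1, !2}
!2 = !{!"llvm.loop.estimated_trip_count", i32 100}
```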
Similar to #150639, this fixes the AggressiveInstCombine fold that converts
lookup tables to cttz instructions when the gep source types are not array
types, i.e. `gep i16 @glob, i64 %idx` instead of `gep [64 x i16] @glob, i64 0, i64 %idx`.
Noticed this when checking the invariant that all phis in the header
block must be header phis. I think there's a missing set of parentheses
here, since otherwise the cast<VPInstruction> is only performed when
RecipeI isn't a VPInstruction.
Extend jump-threading to allow local defs that are live outside of the
threaded block. Allow threading to destinations where the local defs are
not live.
---------
Signed-off-by: John Lu <John.Lu@amd.com>
https://github.com/llvm/llvm-project/pull/147026 will enable sub
reductions, which require that the phi value is the first operand since
they aren't commutative. This re-orders the operands when executing
reductions, which actually matches other existing code in
VPReductionRecipe::execute.
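As a minimal sketch of why the order matters (illustrative scalar IR for the reduction update, not the actual generated code):

```llvm
loop:
  ; A sub reduction is not commutative: the accumulator (the phi) must be the
  ; first operand of the sub, otherwise a different value is computed.
  %acc      = phi i32 [ 0, %entry ], [ %acc.next, %loop ]
  ; ...
  %acc.next = sub i32 %acc, %x      ; correct: acc - x
  ; %wrong  = sub i32 %x, %acc      ; swapped operands compute x - acc
```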
Using GEP to index into a vector is not disallowed, but not recommended.
The SPIR-V backend needs to generate structured access into types, which
is impossible with an untyped GEP instruction unless we add more info to
the IR. Finding a solution is a work-in-progress, but in the meantime,
we'd like to reduce the number of failures.
Preventing this optimization from rewriting extract/insert
instructions into a GEP helps us lower more code to SPIR-V. This change
should be OK as it's only active when targeting SPIR-V and only disables a
non-recommended transformation.
Related to #145002
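An illustrative example (assumed, simplified) of the kind of rewrite that is now skipped when targeting SPIR-V:

```llvm
; Typed vector access that SPIR-V can lower structurally:
%vec = load <4 x float>, ptr %p
%elt = extractelement <4 x float> %vec, i64 %idx

; The scalarized form the optimization would otherwise produce, which loses
; the structure the SPIR-V backend needs:
%gep  = getelementptr float, ptr %p, i64 %idx
%elt2 = load float, ptr %gep
```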
With the current aliases metadata we lose information about which groups
of aliases survive symbol resolution. This causes various problems such
as #150075 where symbol resolution breaks the link between alias groups.
In this redesign of the aliases metadata, we stop representing the
individual aliases in !aliases. Instead, the individual aliases are
represented in !cfi.functions in the same way as functions, and the
alias groups (i.e. groups of symbols with the same address) are stored
in !aliases. At symbol resolution time, we filter out all non-prevailing
members of !aliases; the resulting set is used by LowerTypeTests to
recreate the aliases.
With this change it is now possible for a jump table entry to refer
to an alias in one of the ThinLTO object files (e.g. if a function is
non-prevailing but its alias is prevailing), so instead of deleting them,
rename them with the ".cfi" suffix.
Fixes #150070.
Fixes #150075.
Reviewers: teresajohnson, vitalybuka
Reviewed By: vitalybuka
Pull Request: https://github.com/llvm/llvm-project/pull/150690
e.g.,
<16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8>, <16 x i8>, i8)
<32 x i8> @llvm.x86.vgf2p8affineqb.256(<32 x i8>, <32 x i8>, i8)
<64 x i8> @llvm.x86.vgf2p8affineqb.512(<64 x i8>, <64 x i8>, i8)
where, labeling the return value and operands in order as Out, A, x, and b:
A and x are packed matrices, b is a vector, and Out = A * x + b in GF(2).
Multiplication in GF(2) is equivalent to bitwise AND. However, the
matrix computation also includes a parity calculation.
For the bitwise AND of bits V1 and V2, the exact shadow is:
Out_Shadow = (V1_Shadow & V2_Shadow) | (V1 & V2_Shadow) | (V1_Shadow & V2)
We approximate the shadow of gf2p8affine using:
Out_Shadow = _mm512_gf2p8affine_epi64_epi8(x_Shadow, A_shadow, 0)
| _mm512_gf2p8affine_epi64_epi8(x, A_shadow, 0)
| _mm512_gf2p8affine_epi64_epi8(x_Shadow, A, 0)
| _mm512_set1_epi8(b_Shadow)
This approximation has false negatives: if an intermediate dot-product
contains an even number of 1's, the parity is 0 and the uninitializedness
is missed. It has no false positives.
Updates the test from https://github.com/llvm/llvm-project/pull/149258
Split GEPs that have more than one variable index into two. This is in
preparation for the ptradd migration, which will not support multi-index
GEPs.
This also enables the split-off part to be CSE'd and LICM'd.
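A small sketch of the split (types and names are illustrative):

```llvm
; Before: one gep with two variable indices.
%addr  = getelementptr [64 x i32], ptr %base, i64 %i, i64 %j

; After: each gep has at most one variable index; the first gep is now a
; candidate for CSE/LICM if %i is loop-invariant.
%row   = getelementptr [64 x i32], ptr %base, i64 %i
%addr2 = getelementptr i32, ptr %row, i64 %j
```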
Loop regions require fixed-length steps and rounded-up trip counts, but
once dissolution has created explicit control flow, EVL loops can use
variable-length stepping with the original trip counts.
This patch adds a post-dissolution transform pass that converts EVL loops
from fixed-length to variable-length stepping.
With EVL tail folding, the EVL may not always be VF on the
second-to-last iteration.
Recipes that have been converted to VP intrinsics via optimizeMaskToEVL
account for this, but recipes that are left behind will still use the
old header mask which may end up having a different vector length.
This is effectively the same as #95368, and fixes the issue by converting
header masks from `icmp ule wide-canonical-iv, backedge-trip-count` to
`icmp ult step-vector, evl`. Without it, recipes that fall through
optimizeMaskToEVL may use the wrong vector length, e.g. in #150074 and
#149981.
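In simplified IR terms (illustrative names and a fixed VF), the conversion looks roughly like:

```llvm
; Before: header mask based on the wide canonical IV and backedge trip count.
%mask     = icmp ule <4 x i64> %wide.canonical.iv, %btc.splat

; After: header mask based on a step vector compared against the splatted EVL
; (spelled llvm.experimental.stepvector in older releases).
%step     = call <4 x i64> @llvm.stepvector.v4i64()
%mask.evl = icmp ult <4 x i64> %step, %evl.splat
```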
We really need to split off optimizeMaskToEVL into
VPlanTransforms::optimize and move transformRecipestoEVLRecipes into
tryToBuildVPlanWithVPRecipes, so we don't mix up what is needed for
correctness and what is needed to optimize away the mask computations.
We should still be able to generate a correct, albeit suboptimal, VPlan
without running optimizeMaskToEVL. I've added a TODO for this, which I
think we can do after #148274.

Fixes #150197
This partially reverts https://github.com/llvm/llvm-project/pull/140744,
restoring the original TheLoop->isLoopInvariant check instead of the more
powerful Legal->isInvariant, which uses SCEV.
The reverted change causes a mis-compile: SCEV can prove that the stored value
is loop-invariant, which in turn converts the store to a uniform store.
But in VPlan, we aren't yet able to determine that the stored value is
loop-invariant, so we extract the last lane, which is incorrect, because
it does not account for the mask of the store.
Restoring the original code is a safe fix and avoids this subtle
divergence.
Fixes https://github.com/llvm/llvm-project/issues/149347.
PR: https://github.com/llvm/llvm-project/pull/150828
I found the naming here confusing. This is not something generic
for intrinsics, it's specifically about predicates, and serves to
remember a previous swap choice.
There is no reason to use std::map for the call maps maintained for
function clones during function clone assignment, as we don't iterate
over them and don't need deterministic ordering, so use the more
efficient DenseMap.
Update getSmallConstantTripCount() to return scalable ElementCount
values, which are used to accurately determine the maximum value for UF,
namely:
TripCount / VF ==> X * VScale / Y * VScale ==> X / Y
This improves the chances of being able to remove the scalar loop and
also fixes an issue where UF=2 is chosen for a scalar loop with
exactly VF (= X * VScale) iterations.
When inferring attributes, we should not bail out early on unknown calls
(such as virtual calls), as we may still have call-site attributes that
can be used for inference.
Fixes https://github.com/llvm/llvm-project/issues/150817.
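For example (a hypothetical sketch), an indirect call whose callee is unknown but whose call site still carries attributes that can feed inference:

```llvm
define i32 @wrapper(ptr %obj) {
  ; Virtual-call style dispatch: the callee is unknown, but the call-site
  ; memory(read) attribute still bounds the call's effects.
  %vtable = load ptr, ptr %obj
  %slot   = load ptr, ptr %vtable
  %r      = call i32 %slot(ptr %obj) memory(read)
  ret i32 %r
}
```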
This PR adds a new interface to IRBuilder called CreateVectorInterleave,
which can be used to create vector.interleave intrinsics of factors 2-8.
For convenience I have also moved getInterleaveIntrinsicID and
getDeinterleaveIntrinsicID from VectorUtils.cpp to Intrinsics.cpp, where
they can be used by IRBuilder.
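For reference, a hedged sketch of the kind of call such a helper emits for factor 2 (values %a and %b are hypothetical; the exact name mangling follows the intrinsic's overload rules):

```llvm
; Interleave two <4 x i32> vectors into one <8 x i32>:
;   result = [a0, b0, a1, b1, a2, b2, a3, b3]
%il = call <8 x i32> @llvm.vector.interleave2.v8i32(<4 x i32> %a, <4 x i32> %b)
```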
This patch fixes:
llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp:4771:9:
error: non-void lambda does not return a value in all control paths
[-Werror,-Wreturn-type]
This reverts commit 314e22bcab2b0f3d208708431a14215058f0718f, reapplying
PR150735 with a fix for the unstable iteration order exposed by the new
tests (PR151039).
When compiling in `--hipstdpar` mode, the builtins corresponding to the
standard library might end up in code that is expected to execute on the
accelerator (e.g. by using the `std::` prefixed functions from
`<cmath>`). We do not have uniform handling for this in AMDGPU, and the
resulting errors are quite arcane. Furthermore, the user-space changes
required to work around this tend to be rather intrusive.
This patch adds an additional `--hipstdpar`-specific pass which forwards
to the run-time component of HIPSTDPAR the intrinsics / libcalls that
result from the use of the math builtins and that are not properly
handled. In the long run we will want to stop relying on this and handle
things in the compiler, but it is going to be a rather lengthy journey,
which makes this medium term escape hatch necessary.
The paired change in the run time component is here
<https://github.com/ROCm/rocThrust/pull/551>.
We iterate over a std::map indexed by FuncInfo, which is a pair of a
pointer and a clone number. In the ThinLTO case, this isn't an issue as
the function pointer always points to the same FunctionSummary object.
However, for regular LTO, this is a pointer to a Function object, which
is different for each clone. This will lead to unstable iteration order.
This was exposed in a test case added for PR150735, which added a new
instance of iteration over this map.
Since these function clones are added and numbered sequentially, change
this to a vector indexed by clone number, which points to a structure
containing the clone FuncInfo and the call map (the old map's key and
value, respectively).
shrinkSplatShuffle in InstCombine would only move truncs up through
shuffles if those shuffles' inputs had exactly the same type as their
output. This PR weakens that constraint to only require that the
scalar types of the input and output match.
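A sketch of the now-allowed case, where the shuffle's input and output vector lengths differ but the scalar type matches:

```llvm
; Before: trunc of a splat shuffle whose input (<2 x i64>) and output
; (<4 x i64>) vector types differ only in element count.
%splat = shufflevector <2 x i64> %v, <2 x i64> poison, <4 x i32> zeroinitializer
%t     = trunc <4 x i64> %splat to <4 x i32>

; After: the trunc is shrunk to operate on the narrower input, then splatted.
%t.nar = trunc <2 x i64> %v to <2 x i32>
%t2    = shufflevector <2 x i32> %t.nar, <2 x i32> poison, <4 x i32> zeroinitializer
```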
If we have instructions in the second loop's preheader which can be sunk, we
should also adjust PHI nodes to receive values from the fused loop's latch block.
Fixes #128600
Fix a bug in function assignment where we were not assigning all
callsite clones to a function clone. This led to incorrect call updates
because multiple callsite clones could look like they were assigned to
the same function clone.
Add a stat and a debug message to help identify and debug cases where
this is still happening.