For consistency with `ConstantInt::get()`, add an ImplicitTrunc
parameter to `ConstantInt::getSigned()` as well. It currently defaults
to true and will be flipped to false in the future (by #171456).
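
As an illustration of what an implicit-truncation switch controls, here is a standalone sketch that does not use LLVM. The helper names `fitsSigned` and `makeSigned` are hypothetical, and the choice to return `std::nullopt` when truncation is disallowed is an assumption made for this example; it is not a description of how `ConstantInt::getSigned()` reports the error.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: true if V is representable as a signed integer that is
// NumBits wide (1 <= NumBits <= 64).
bool fitsSigned(unsigned NumBits, int64_t V) {
  assert(NumBits >= 1 && NumBits <= 64 && "unsupported bit width");
  if (NumBits == 64)
    return true;
  int64_t Min = -(int64_t(1) << (NumBits - 1));
  int64_t Max = (int64_t(1) << (NumBits - 1)) - 1;
  return V >= Min && V <= Max;
}

// Hypothetical model of a getSigned-style constructor. Returns the low
// NumBits of V. If ImplicitTrunc is false and V does not fit, the value is
// rejected (modelled here as std::nullopt; this is an assumption).
std::optional<uint64_t> makeSigned(unsigned NumBits, int64_t V,
                                   bool ImplicitTrunc) {
  if (!ImplicitTrunc && !fitsSigned(NumBits, V))
    return std::nullopt;
  uint64_t Mask =
      NumBits >= 64 ? ~uint64_t(0) : (uint64_t(1) << NumBits) - 1;
  return static_cast<uint64_t>(V) & Mask;
}
```

In this model, `makeSigned(8, 300, true)` silently wraps to 44, while `makeSigned(8, 300, false)` rejects the value, which is the kind of accidental truncation a false default would surface.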
This change adds full support for the PTX `barrier.cta.red` instruction,
following the same conventions as are already used for
`barrier.cta.sync` and `barrier.cta.arrive`.
In addition, this MR removes the following intrinsics, which are no longer
needed:
* llvm.nvvm.barrier0.popc -->
llvm.nvvm.barrier.cta.red.popc.aligned.all(0, c)
* llvm.nvvm.barrier0.and -->
llvm.nvvm.barrier.cta.red.and.aligned.all(0, z)
* llvm.nvvm.barrier0.or -->
llvm.nvvm.barrier.cta.red.or.aligned.all(0, z)
This adds support for FatLTO to COFF targets in clang and lld.
The changes are adapted from 610fc5cbcc and 14e3bec8fc, but are much
smaller because only the COFF-specific parts needed to be wired in. I
tried my best to adapt the pre-existing ELF tests for the COFF version.
My main goal is to be able to use this for shipping pre-built
https://github.com/XboxDev/nxdk container images someday, which uses the
`i386-pc-win32` target.
This patch introduces VPInstruction::Reverse and extracts the reverse
operations of loaded/stored values from reverse memory accesses. This
extraction facilitates future support for permutation elimination within
VPlan.
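
The idea of splitting a reverse memory access into a plain access plus an explicit reverse can be modelled outside VPlan. The sketch below is only that: `wideLoad`, `reverseLanes` and `reverseLoad` are hypothetical names, and vectors are modelled as `std::vector<int>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Contiguous load of VF elements starting at index Base.
Vec wideLoad(const Vec &Mem, std::size_t Base, std::size_t VF) {
  assert(Base + VF <= Mem.size() && "load out of bounds");
  auto First = Mem.begin() + static_cast<std::ptrdiff_t>(Base);
  return Vec(First, First + static_cast<std::ptrdiff_t>(VF));
}

// Stand-alone lane reversal, modelling an explicit reverse operation.
Vec reverseLanes(Vec V) {
  std::reverse(V.begin(), V.end());
  return V;
}

// Fused reverse load: lane L reads Mem[Last - L].
Vec reverseLoad(const Vec &Mem, std::size_t Last, std::size_t VF) {
  assert(VF >= 1 && Last < Mem.size() && Last + 1 >= VF);
  Vec R(VF);
  for (std::size_t L = 0; L < VF; ++L)
    R[L] = Mem[Last - L];
  return R;
}
```

Here `reverseLoad(Mem, Last, VF)` gives the same lanes as `reverseLanes(wideLoad(Mem, Last + 1 - VF, VF))`. Once the reverse is a separate operation, two back-to-back reverses cancel, which is the sort of permutation elimination the extraction is meant to enable.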
If we're shuffling/concatenating the same operands then ensure we don't
duplicate the total cost, ensure we reuse the final shuffle and
recognise that we reduce the total instruction count (so fold even when
NewCost == OldCost, not just NewCost < OldCost).
Reapply 8a115b6934a90441 with an update to tests handling remarks.
The patch now directly emits a clear remark when we bail out
due to the memory check threshold.
Original message:
When GeneratedRTChecks::create bails out due to exceeding the cost
threshold, no runtime checks are generated and we must not proceed
assuming checks have been generated.
Mark the checks as never succeeding, to make sure we don't try to
vectorize assuming the runtime checks hold. This fixes a case where we
previously incorrectly vectorized assuming runtime checks had been
generated when forcing vectorization via metadata.
Fixes the mis-compile mentioned in
https://github.com/llvm/llvm-project/pull/166247#issuecomment-3631471588
This reapplies #171846 with a test case and fix for a legacy cost-model
mismatch assertion.
In the previous version of the patch, we only considered the plan to
contain simplifications when it had a VPBlendRecipe and VF.isScalar()
was true.
However for some VPlans we may have a blend with only the first lane
used:
BLEND ir<%phi> = ir<%foo.res> ir<%bar.res>/ir<%c>
CLONE ir<%gep> = getelementptr ir<%p>, ir<%phi>
vp<%5> = vector-pointer ir<%gep>
And in the legacy cost model we cost a blend as a phi if it's uniform:
// If we know that this instruction will remain uniform, check the cost of
// the scalar version.
if (isUniformAfterVectorization(I, VF))
  VF = ElementCount::getFixed(1);
So this replaces the VF.isScalar() check with
vputils::onlyFirstLaneUsed, which matches how the VPlan cost model
mirrored the legacy model beforehand.
A VPInstruction::Select will also emit a scalar select for a vector VF
if only the first lane is used, so this also updates
VPBlendRecipe::computeCost to reflect that.
This patch optimizes vector scatters that have a uniform (single-scalar)
address by replacing them with "extract-last-lane + scalar store" when
the scatter is unmasked.
Notes:
- The legacy cost model can scalarize a store if both the address and
the value are uniform. In VPlan we materialize the stored value via
ExtractLastLane, so only the address must be uniform.
- Some loops will no longer be vectorized, since no vector instructions
would be generated.
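
The equivalence behind this rewrite can be checked with a small standalone model. This is a sketch, not the VPlan transform: `scatter` and `storeLastLane` are hypothetical names, memory is a `std::vector<int>`, and the scatter is assumed to write lanes in increasing lane order so that the last lane wins when addresses collide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unmasked scatter: lane L stores Vals[L] to Mem[Addrs[L]]. Lanes are written
// in increasing lane order (an assumption of this model), so for colliding
// addresses the highest lane's value is the one that remains.
void scatter(std::vector<int> &Mem, const std::vector<std::size_t> &Addrs,
             const std::vector<int> &Vals) {
  assert(Addrs.size() == Vals.size());
  for (std::size_t L = 0; L < Vals.size(); ++L)
    Mem[Addrs[L]] = Vals[L];
}

// Replacement for the uniform-address case: extract the last lane of the
// stored vector and perform a single scalar store.
void storeLastLane(std::vector<int> &Mem, std::size_t Addr,
                   const std::vector<int> &Vals) {
  assert(!Vals.empty());
  Mem[Addr] = Vals.back();
}
```

With every lane pointing at the same address, both functions leave memory in the same state. The stored value itself does not need to be uniform, matching the note that only the address must be.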
Previously, we only synthesized VP metadata with the callee GUIDs from
the memprof profile if no VP metadata already existed (i.e. from PGO).
With this change, we also add any callee GUIDs that are not already in
the VP metadata, again with a count of 1.
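
A minimal sketch of the merging behaviour described above, written without LLVM. The struct `ValueProfEntry` and the function `mergeMemProfCallees` are hypothetical names for this example; the real code operates on VP metadata rather than plain vectors.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one value-profile record: a callee GUID and its
// profiled count.
struct ValueProfEntry {
  uint64_t Guid;
  uint64_t Count;
};

// Keep all existing (e.g. PGO-derived) entries untouched and append every
// memprof callee GUID that is not already present, with a count of 1.
std::vector<ValueProfEntry>
mergeMemProfCallees(std::vector<ValueProfEntry> Existing,
                    const std::vector<uint64_t> &MemProfGuids) {
  for (uint64_t G : MemProfGuids) {
    bool Present =
        std::any_of(Existing.begin(), Existing.end(),
                    [G](const ValueProfEntry &E) { return E.Guid == G; });
    if (!Present)
      Existing.push_back({G, 1});
  }
  return Existing;
}
```

Existing entries keep their original counts. Only GUIDs that are missing are added, and duplicates within the memprof list are added once.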
Keeping the extracted element in a natural position in the narrowed
vector has two beneficial effects:
1. It makes the narrowing shuffles cheaper (at least on AMDGPU), which
allows the insert/extract fold to trigger.
2. It makes the narrowing shuffles in a chain of extract/insert
compatible, which allows foldLengthChangingShuffles to successfully
recognize a chain that can be folded.
There are minor X86 test changes that look reasonable to me. The IR
change for AVX2 in
llvm/test/Transforms/VectorCombine/X86/extract-insert-poison.ll
doesn't change the assembly generated by
`llc -mtriple=x86_64-- -mattr=AVX2` at all.
Adds an option -module-summary-max-indirect-edges, and wiring into the
ICP logic that collects promotion candidates from VP metadata, to
support a larger number of promotion candidates for use in building the
ThinLTO summary. Also use this in the MemProf ThinLTO backend handling
where we perform memprof ICP during cloning.
The new option, essentially off by default, can be used to override the
value of -icp-max-prom, which is checked internally in ICP, with a
larger max value when collecting candidates from the VP metadata.
For MemProf in particular, where we synthesize new VP metadata targets
from allocation contexts, which may not be all that frequent, we need to
be able to include a larger set of these targets in the summary in order
to correctly handle indirect calls in the contexts. Otherwise we will
not set up the callsite graph edges correctly.
Check whether the extractelement instruction is part of another
buildvector node before marking it for deletion; otherwise the compiler
may reuse the deleted instruction.
Fixes #172221
This PR resolves [#170867](https://github.com/llvm/llvm-project/issues/170867).
The existing code recomputes the cost of creating a shuffle instruction even
for repeated intrinsic operand pairs. This inflates newCost, so the cost
comparison rejects the fold.
This change addresses the issue. When calculating newCost, we skip the
cost of an operand pair that has already been counted. When creating the
transformed code, we reuse the shuffle instruction already created for a
repeated operand pair.
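
The effect of skipping repeated operand pairs can be sketched in isolation. The code below is an illustrative model, not the VectorCombine implementation: operand pairs are represented by hypothetical integer ids, and every shuffle is assumed to have the same cost.

```cpp
#include <set>
#include <utility>
#include <vector>

// Hypothetical representation of one pair of operands to be shuffled,
// identified by integer value ids.
using OperandPair = std::pair<int, int>;

// Old behaviour in this model: every operand pair pays for its own shuffle,
// even if the same pair was already seen.
int newCostNaive(const std::vector<OperandPair> &Pairs, int ShuffleCost) {
  return static_cast<int>(Pairs.size()) * ShuffleCost;
}

// New behaviour in this model: a repeated pair reuses the shuffle created
// for its first occurrence, so it contributes no additional cost.
int newCostDeduped(const std::vector<OperandPair> &Pairs, int ShuffleCost) {
  std::set<OperandPair> Seen;
  int Cost = 0;
  for (const OperandPair &P : Pairs)
    if (Seen.insert(P).second) // true only the first time P is inserted
      Cost += ShuffleCost;
  return Cost;
}
```

For the pairs (1,2), (1,2), (3,4) and a shuffle cost of 2, the naive total is 6 and the deduplicated total is 4, which can be enough to tip the cost comparison in favour of the fold.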
In an effort to get rid of VPUnrollPartAccessor and directly unroll
recipes, start by directly unrolling VectorPointerRecipe, allowing for
VPlan-based simplifications and simplification of the corresponding
execute.
Pass backedge values directly to VPFirstOrderRecurrencePHIRecipe and
VPReductionPHIRecipe, as they must be provided and available.
Split off from https://github.com/llvm/llvm-project/pull/168291.
Use SCEV to simplify all live-ins during VPlan0 construction. This
enables us to remove special SCEV queries when constructing
VPWidenRecipes and improves results in some cases.
This leads to simplifications in a number of cases in real-world
applications (~250 files changed across LLVM, SPEC, ffmpeg)
PR: https://github.com/llvm/llvm-project/pull/155304
Considering that the current loop fusion only supports adjacent loops,
we are able to simplify the checks in this pass. By removing the
`isControlFlowEquivalent` check, this patch fixes multiple issues
including #166560, #166535, #165031, #80301 and #168263.
Now only the sequential/adjacent candidates are collected in the same
list. This patch is the implementation of approach 2 discussed in post
#171207.
As per @arsenm's instructions, I've separated the non-functional
changes from https://github.com/llvm/llvm-project/pull/169958.
Afterwards I'll tackle the functional ones one by one. I hope I did
everything right this time.
Full descriptions in the article:
https://pvs-studio.com/en/blog/posts/cpp/1318/
3. Array overrun is possible.
The PVS-Studio warning: V557 Array overrun is possible. The value of
'regIdx' index could reach 31. VEAsmParser.cpp 696
10. Excessive check.
The PVS-Studio warning: V547 Expression 'IsLeaf' is always false.
PPCInstrInfo.cpp 419
11. Doubling the same check.
The PVS-Studio warning: V581 The conditional expressions of the 'if'
statements situated alongside each other are identical. Check lines:
5820, 5823. PPCInstrInfo.cpp 5823
15. Excessive check.
The PVS-Studio warning: V547 Expression 'i != e' is always true.
MachineFunction.cpp 1444
17. Excessive assignment.
The PVS-Studio warning: V1048 The 'FirstOp' variable was assigned the
same value. MachineInstr.cpp 1995
18. Excessive check.
The PVS-Studio warning: V547 Expression 'AllSame' is always true.
SimplifyCFG.cpp 1914
19. Excessive check.
The PVS-Studio warning: V547 Expression 'AbbrevDecl' is always true.
LVDWARFReader.cpp 398
When an alloc slice's users include llvm.protected.field.ptr intrinsics
and their discriminators are consistent, drop the intrinsics in order
to avoid unnecessary pointer sign and auth operations.
Reviewers: nikic
Reviewed By: nikic
Pull Request: https://github.com/llvm/llvm-project/pull/151650
Always include the cost of the middle block in
isOutsideLoopWorkProfitable. This addresses the TODO from
https://github.com/llvm/llvm-project/pull/168949 and removes the
temporary restriction.
isOutsideLoopWorkProfitable already scales the cost outside loops
according to the expected trip counts.
In practice this increases the minimum iteration threshold in a few
cases. On a large IR corpus based on C/C++ workloads, ~50 out of 179450
vector loops have their thresholds increased slightly.
PR: https://github.com/llvm/llvm-project/pull/171102
Add the -memprof-print-matched-alloc-stack option to enable emitting the
full allocation call context (of stack ids) for each matched allocation
reported by -memprof-print-match-info. This is a no-op when the latter
is not enabled.