For Neon, the default nonconst stride cost is conservative,
and it is a local variable, which is not convenience to
to tune the loop vectorize.
So I try to use a option, which is similar to SVEGatherOverhead brought in D115143.
Fix https://github.com/llvm/llvm-project/issues/63082.
Reviewed By: dmgreen, fhahn
Differential Revision: https://reviews.llvm.org/D152253
AMDGPU has native instructions and target intrinsics for this, but
these really should be subject to legalization and generic
optimizations. This will enable legalization of f16->f32 on targets
without f16 support.
Implement a somewhat horrible inline expansion for targets without
libcall support. This could be better if we could introduce control
flow (GlobalISel version not yet implemented). Support for strictfp
legalization is less complete but works for the simple cases.
This patch implements the enhancement proposed by
https://github.com/llvm/llvm-project/issues/59312.
Suppose we have following code
v0 = load %addr
br %LoadBB
LoadBB:
v1 = load %addr
...
PredBB:
...
br %cond, label %LoadBB, label %SuccBB
SuccBB:
v2 = load %addr
...
Instruction v1 in LoadBB is partially redundant, edge (PredBB, LoadBB) is a
critical edge. SuccBB is another successor of PredBB, it contains another load
v2 which is identical to v1. Current GVN splits the critical edge
(PredBB, LoadBB) and inserts a new load in it. A better method is move the load
of v2 into PredBB, then v1 can be changed to a PHI instruction.
If there are two or more similar predecessors, like the test case in the bug
entry, current GVN simply gives up because otherwise it needs to split multiple
critical edges. But we can move all loads in successor blocks into predecessors.
Differential Revision: https://reviews.llvm.org/D141712
This reverts commit 02369b75fdd7b5fc5d9b47f1b60587c225918511.
At the moment, live-outs used *only* for the resume values in the scalar
loop are not modeled in VPlan yet. This means first-order recurrence
recipes could be removed, when a scalar epilogue is required and the
only use of a FOR is outside the loop.
Keep treating recurrence recipes as having side-effects for now, to
avoid them being removed.
Fixes#62954.
The logic and implementation follows the removal of no-op barriers. If
the fence is not making updates visible, either to the world or the
current thread, it is not needed. Said differently, the fences we remove
do not establish synchronization (happens-before) edges.
This allows us to eliminate some of the regression caused by:
https://reviews.llvm.org/D145290
Different offsets can be handled by expansion rather than defaulting to
an unknown offset. Thus, [4,4] & [8,8] will result in [4, 12] rather
than [unknown, unknown].
Derive the mustprogress attribute based on the willreturn attribute
or the fact that all callers are mustprogress.
Differential Revision: https://reviews.llvm.org/D94740
Define the function @llvm.amdgcn.make.buffer.rsrc, which take a 64-bit
pointer, the 16-bit stride/swizzling constant that replace the high 16
bits of an address in a buffer resource, the 32-bit extent/number of
elements, and the 32-bit flags (the latter two being the 3rd and 4th
wards of the resource), and combines them into a ptr addrspace(8).
This intrinsic is lowered during the early phases of the backend.
This intrinsic is needed so that alias analysis can correctly infer
that a certain buffer resource points to the same memory as some
global pointer. Previous methods of constructing buffer resources,
which relied on ptrtoint, would not allow for such an inference.
Depends on D148184
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D148957
In order to enable the LLVM frontend to better analyze buffer
operations (and to potentially enable more precise analyses on the
backend), define versions of the raw and structured buffer intrinsics
that use `ptr addrspace(8)` instead of `<4 x i32>` to represent their
rsrc arguments.
The new intrinsics are named by replacing `buffer.` with `buffer.ptr`.
One advantage to these intrinsic definitions is that, instead of
specifying that a buffer load/store will read/write some memory, we
can indicate that the memory read or written will be based on the
pointer argument. This means that, for example, a read from a
`noalias` buffer can be pulled out of a loop that is modifying a
distinct buffer.
In the future, we will define custom PseudoSourceValues that will
allow us to package up the (buffer, index, offset) triples that buffer
intrinsics contain and allow for more precise backend analysis.
This work also enables creating address space 7, which represents
manipulation of raw buffers using native LLVM load and store
instructions.
Where tests simply used a buffer intrinsic while testing some other
code path (such as the tests for VGPR spills), they have been updated
to use the new intrinsic form. Tests that are "about" buffer
intrinsics (for instance, those that ensure that they codegen as
expected) have been duplicated, either within existing files or into
new ones.
Depends on D145441
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D147547
Handling `true` and `false` constant replacements is now abstracted
out into a single lambda function `ReplaceCmpWithConstant`, so as to
reduce code duplication.
This removes BitCasts from isSource in Type Promotion, as I don't believe they
need to be treated as Sources. They will usually be from floats or hoisted
constants, where constants will be handled already.
This fixes#62513, but didn't otherwise cause any differences in the tests I
ran.
Differential Revision: https://reviews.llvm.org/D152112
For image and buffer stores the default behaviour on GFX11 and
older is to set all unset components to zero. So if we pass
only X component it will be the same as X000, or XY same as XY00.
This patch simplifies the passed vector of components in InstCombine
by removing zero components from the end.
For image stores it also trims DMask if necessary.
Reviewed by: arsenm, foad, nhaehnle, piotr
Reapply with extra check for struct types, which caused buildbot
failures last time.
-----
The freeze instruction has not been handled by SCCPInstVisitor.
This patch adds SCCPInstVisitor::visitFreezeInst(FreezeInst &I)
method to handle freeze instructions.
Differential Revision: https://reviews.llvm.org/D151659
`select i1 non-const, i1 true, i1 false` has been optimized to
`non-const`. There is no reason that we can not optimize `select i1
ConstExpr, i1 true, i1 false` to `ConstExpr`.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D151631
When we inline a callee into a caller, the compiler needs to make sure
that the caller supports a superset of instruction sets that the
callee is allowed to use. Normally, we check for the compatibility of
target features via functionsHaveCompatibleAttributes, but that
happens after we decide to honor call site attribute
Attribute::AlwaysInline. If the caller contains a call marked with
Attribute::AlwaysInline, which can happen with
__attribute__((flatten)) placed on the caller, the caller could end up
with code that cannot be lowered to assembly code.
This patch fixes the problem by checking the target feature
compatibility before we honor Attribute::AlwaysInline.
Fixes https://github.com/llvm/llvm-project/issues/62664
Differential Revision: https://reviews.llvm.org/D150396
Unlike every other analysis and transform, simplifyInstruction
permitted operating on instructions which are not inserted
into a function. This created an edge case no other code needs
to really worry about, and limited transforms in cases that
can make use of the context function. Only the inliner and a handful
of other utilities were making use of this, so just fix up these
edge cases. Results in some IR ordering differences since
cloned blocks are inserted eagerly now. Plus some additional
simplifications trigger (e.g. some add 0s now folded out that
previously didn't).
KCFI traps should always be recoverable, but as Intrinsic::trap
is marked noreturn, it's not possible to continue execution after
handling the trap as the compiler is free to assume we never
return. Switch to debugtrap instead to ensure we have the option
to resume execution after the trap.
This fixes the largest remaining discrepancy between results of
computeKnownBits() and SimplifyDemandedBits(). We only care about
the multi-use case here, because the assume necessarily introduces
an extra use.
This patch extends existing IR combines for: fmul, fsub and fadd,
relying on all active predicate to also apply to their equivalent
undef (_u) intrinsics.
Differential Revision: https://reviews.llvm.org/D150768
This barely matters since 99% are converted to the generic intrinsic now,
and the only real difference is the target intrinsic supports a variable
test mask. Start propagating poison. Prefer folding to a defined result (false)
for an undef test mask. Propagate undef for the first operand.
The SCCPSolver is using a structure (AnalysisResultsForFn) where it keeps
pointers to various analyses needed by the IPSCCP pass. These analyses are
requested all at the same time, which can become problematic in some cases.
For example one could be retrieved via getCachedAnalysis() prior to the
actual execution of the analysis. In more detail:
The IPSCCP pass uses a DomTreeUpdater to preserve the PostDominatorTree
in case the PostDominatorTreeAnalysis had run before IPSCCP. Starting with
commit 1b1232047e83b the IPSCCP pass may use BlockFrequencyAnalysis for
some functions in the module. As a result, the PostDominatorTreeAnalysis
may not run until the BlockFrequencyAnalysis has run, since the latter
analysis depends on the former. Currently, we setup the DomTreeUpdater
using getCachedAnalysis to retrieve a PostDominatorTree. This happens
before BlockFrequencyAnalysis has run, therefore the cached analysis can
become invalid by the time we use it.
Differential Revision: https://reviews.llvm.org/D151666
If the step is not loop-invariant, we cannot create a modified AddRec,
as the start needs to be loop-invariant. Mark those cases as
CannotAnalyze and bail out, to fix a crash.
Disable conversion of funnel shifts (fshl/fshr) into rotates
unless one of the operands is known to be a constant value.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D150670
The freeze instruction has not been handled by SCCPInstVisitor.
This patch adds SCCPInstVisitor::visitFreezeInst(FreezeInst &I)
method to handle freeze instructions.
Differential Revision: https://reviews.llvm.org/D151659
This patch allows us to gain all the benefits provided by
LoopLoadElimination pass to descending loops.
Differential Revision: https://reviews.llvm.org/D151448