MemorySSA considers any atomic a def to any operation it dominates
just like a barrier or fence. That is correct from memory state
perspective, but not required for the no-clobber metadata since
we are not using it for reordering. Skip such atomics during the
scan just like a barrier if it does not alias with the load.
Differential Revision: https://reviews.llvm.org/D118661
Currently we cannot convert a vector load into scalar if there
is dominating barrier or fence. It is considered a clobbering
memory access to prevent memory operations reordering. While
reordering is not possible the actual memory is not being clobbered
by a barrier or fence and we can still use a scalar load for a
uniform pointer.
The solution is not to bail on a first clobbering access but
traverse MemorySSA to the root excluding barriers and fences.
Differential Revision: https://reviews.llvm.org/D118419
These test cases all rely on a default target being specified. Adding
the requirement gets the tests properly skipped when
LLVM_DEFAULT_TARGET_TRIPLE is unset.
The first phase of the analysis can avoid a vsetvli if an earlier
instruction in the block used an SEW and LMUL that when combined with
the EEW of the load/store would produce the desired EMUL. If we
avoided a vsetvli this will affect the global analysis we do in the
second phase.
The third phase where we really insert the vsetvlis needs to agree
with the first phase. If it doesn't we can insert vsetvlis that
invalidate the global analysis.
In the test case there is a VSETVLI in the preheader that sets
SEW=64 and LMUL=1. Inside the loop there is a VADD with SEW=64 and LMUL=1.
This VADD is followed by a store that wants wants SEW=32 LMUL=1/2.
Because it has EEW=32 as part of the opcode the SEW=64 LMUL=1 from the
VADD can be become EMUL=1 for the store. So the first phase determines no
vsetvli is needed.
The third phase manages CurInfo differently than BBInfo.Change from the
first phase. CurInfo is only updated when we see a vsetvli or insert
a vsetvli. This was done to allow predecessor block information from
the global analysis to be applied to multiple instructions. Since the
loop body has no vsetvli we won't update CurInfo for either the VADD
or the VSE. This prevented us from checking the store vsetvli elision
for the VSE resulting in a vsetvli SEW=32 LMUL=1/2 being emitted which
invalidated the global analysis.
To mitigate this, I've added a BBLocalInfo variable that more closely
matches the first phase propagation. This gets updated based on the
VADD and prevents emitting a vsetvli for the store like we did in the
first phase.
I wonder if we should do an earlier phase to handle the load/store case
by adding more pseudo opcodes and changing the SEW/LMUL for those
instructions before the insertion analysis. That might be more robust
than trying to guarantee two phases make the same decision.
Fixes the test from D118629.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118667
This patch updates the P10 patterns with a load feeding into an insertelt to
utilize the refactored load and store infrastructure, as well as updating any
tests that exhibit any codegen changes.
Furthermore, custom legalization is added for v4f32 on Power9 and above to not
only assist with adjusting the refactored load/stores for P10 vector insert,
but also it enables the utilization of direct moves.
Differential Revision: https://reviews.llvm.org/D115691
Pointer element types do not imply that the pointer is ABI aligned.
We should be using either an explicit align attribute here, or fall
back to an alignment of 1. This fixes a new element type access
introduced in D117764.
I don't think this makes any practical difference though, as the
lowering does not depend on alignment.
Differential Revision: https://reviews.llvm.org/D118681
Inspired by a recent Discourse post on undef vs. poison usage, this
series of patches should reduce the number of undefs in LLVM tests by
around 10%.
Only undef vector operands to insertelement/shufflevector have been
handled, which are by far the most common we've got.
The switchover is split into 3 fairly arbitrary clusters to make it
slightly more manageable: vector predication, fixed-length vectors,
scalable vectors.
This test shows a loop, whose preheader uses a SEW=64, LMUL=1 vector
operation. The loop body starts off with another SEW=64, LMUL=1 VADD
vector operation, before switching to a SEW=32, LMUL=1/2 vector store
instruction.
We can see that the VSETVLI insertion pass omits a VSETVLI before the
VADD (thinking it inherits its configuration from the preheader) but
does place a SEW=32, LMUL=1/2 VSETVLI before the store. This results in
a miscompilation as when the loop comes back around, the VADD is
incorrectly configured with SEW=32, LMUL=1/2.
It appears to be a bad load/store optimization, as replacing the vector
store with an SEW=32, LMUL=1/2 VADD does correctly insert a VSETVLI. The
issue is therefore possibly arising from canSkipVSETVLIForLoadStore.
Differential Revision: https://reviews.llvm.org/D118629
Do "simplifyShift" and "FoldConstantArithmetic" folds for the SSHLSAT
and USHLSAT DAG nodes.
This includes folds such as:
(shlsat undef/poison, x) -> 0
(shlsat x, undef/poison) -> undef
(shlsat x, too_large_shamt) -> undef
(shlsat 0, x) -> 0
(shlsat x, 0) -> x
(shlsat c1, c2) -> c3
Differential Revision: https://reviews.llvm.org/D118603
I have updated TargetLowering::isConstTrueVal to also consider
SPLAT_VECTOR nodes with constant integer operands. This allows the
optimisation to also work for targets that support scalable vectors.
Differential Revision: https://reviews.llvm.org/D117210
In some cases StructurizeCFG inserts i1 xor instructions to invert
predicates. Add a quick loop to clean these up afterwards if we can get
away with modifying an existing compare instruction instead.
(StructurizeCFG is generally run late in the pipeline so instcombine
does not clean them up for us.)
Differential Revision: https://reviews.llvm.org/D118623
Summary:
Add code object v5 support (deafult is still v4)
Generate metadata for implicit kernel args for the new ABI
Set the metadata version to be 1.2
Reviewers:
t-tye, b-sumner, arsenm, and bcahoon
Fixes:
SWDEV-307188, SWDEV-307189
Differential Revision:
https://reviews.llvm.org/D118272
Factoring it out so we can subsequently cache it. This should be a NFC,
however, for the float quantities, we see small errors in the least
significant digits. This is because, before, we were summing up one by
one. Now, we sum up results of sums.
This shouldn't matter for ML, and will require rework when we do
quantization (avoiding floats altogether), but meanwhile, it did require
an update to the reference file used for testing.
The patch also bumps the precision of the variables involved in this, to
reduce the error (note they are casted back to float at the end by the
SET macro, since we only work with float and not double in TF)
Differential Revision: https://reviews.llvm.org/D118659
New target SDNodes are added: AArch64ISD::MOPS_MEMSET, etc.
Each intrinsic is translated to one of these in SelectionDAGBuilder
via EmitTargetCodeForMOPS.
A custom lowering routine for INTRINSIC_W_CHAIN is added to handle
llvm.aarch64.mops.memset.tag. This takes a separate path from the common
intrinsics but ultimately ends up in the same EmitMOPS().
This is part 4/4 of a series of patches split from
https://reviews.llvm.org/D117405 to facilitate reviewing.
Patch by Tomas Matheson, Lucas Prates and Son Tuan Vu.
Differential Revision: https://reviews.llvm.org/D117764
This implements codegen for Armv8.8/9.3 Memory Operations extension (MOPS).
Any memcpy/memset/memmov intrinsics will always be emitted as a series
of three consecutive instructions P, M and E which perform the
operation. The SelectionDAG implementation is split into a separate
patch.
AArch64LegalizerInfo will now consider the following generic opcodes
if +mops is available, instead of legalising by expanding them to
libcalls: G_BZERO, G_MEMCPY_INLINE, G_MEMCPY, G_MEMMOVE, G_MEMSET
The s8 value of memset is legalised to s64 to match the pseudos.
AArch64O0PreLegalizerCombinerInfo will still be able to combine
G_MEMCPY_INLINE even if +mops is present, as it is unclear whether it is
better to generate fixed length copies or MOPS instructions for the
inline code of small or zero-sized memory operations, so we choose to be
conservative for now.
AArch64InstructionSelector will select the above as new pseudo
instructions: AArch64::MOPSMemory{Copy/Move/Set/SetTagging} These are
each expanded to a series of three instructions (e.g. SETP/SETM/SETE)
which must be emitted together during code emission to avoid scheduler
reordering.
This is part 3/4 of a series of patches split from
https://reviews.llvm.org/D117405 to facilitate reviewing.
Patch by Tomas Matheson and Son Tuan Vu
Differential Revision: https://reviews.llvm.org/D117763
This reverts commit 00bf4755e90c89963a135739218ef49c2417109f.
This change broke the emscripten builder (among other things):
https://ci.chromium.org/ui/p/emscripten-releases/builders/try/linux/b8823500584349280721/overview
Sample failure:
```
test_unistd_unlink (test_core.core0) ...
wasm-ld: error: symbol type mismatch: __stdio_write
>>> defined as WASM_SYMBOL_TYPE_FUNCTION in /usr/local/google/home/sbc/dev/wasm/emscripten/cache/sysroot/lib/wasm32-emscripten/libc-debug.a(__stdio_write.o)
>>> defined as WASM_SYMBOL_TYPE_DATA in /usr/local/google/home/sbc/dev/wasm/emscripten/cache/sysroot/lib/wasm32-emscripten/libc-debug.a(stderr.o)
```
The optimization added in D118139 causes a crash on the added test case
while trying to zero extend an vector of floats.
Fix the crash by bailing out for floating point operands.
Reviewed By: DavidTruby
Differential Revision: https://reviews.llvm.org/D118615
The spec doesn't seem to be written as if Zfh implies Zfhmin. They
seem to be separate extensions.
This patch moves the instructions from Zfhmin to be enabled with
either the Zfh or Zfhmin extensions.
Reviewed By: achieveartificialintelligence
Differential Revision: https://reviews.llvm.org/D118581
This reverts commit a6b54ddaba2d5dc0f72dcc4591c92b9544eb0016.
Apparently it is not safe to modify the condition even if it passes the
hasOneUse test, because StructurizeCFG might have other references to
the condition that are not manifest in the IR use-def chains.
Fixes a crash ('Invalid size request on a scalable vector') in visitAlloca()
when we call this function for a scalable alloca instruction, caused
by the implicit conversion of TySize to uint64_t.
This patch changes TySize to a TypeSize as returned by getTypeAllocSize()
and ensures the allocation size is multiplied by vscale for scalable vectors.
Reviewed By: sdesmalen, david-arm
Differential Revision: https://reviews.llvm.org/D118372
We already call SimplifyDemandedVectorElts using whether each vector mask element is zero/nonzero, this just extends this to also try SimplifyDemandedBits using the demanded bits mask generated from the nonzero elements.
This also requires an additional TargetLowering::SimplifyDemandedBits DemandedBits/DemandedElts wrapper.
If AMDGPUAnnotateUniformValues finds a load from a uniform pointer with
no potentially clobbering stores between the kernel entry point and the
load instruction, it adds noclobber metadata to the *address*. This is
unsafe because it can get applied to other loads in the same which do
have aliasing stores.
Differential Revision: https://reviews.llvm.org/D118458
As raised by @efriedma on D117995 - the source must not be undef/poison to demand any bits in mul(x,x) other than bit[1]
https://alive2.llvm.org/ce/z/Cxkjen
This avoids various cases where StructurizeCFG would otherwise insert an
xor i1 instruction, and it since it generally runs late in the pipeline,
instcombine does not clean up the xor-of-cmp pattern.
Differential Revision: https://reviews.llvm.org/D118478
This patches fixes the visibility and linkage information of symbols
referring to IR globals.
Emission of external declarations is now done in the first execution
of emitConstantPool rather than in emitLinkage (and a few other
places). This is the point where we have already gathered information
about used symbols (by running the MC Lower PrePass) and not yet
started emitting any functions so that any declarations that need to
be emitted are done so at the top of the file before any functions.
This changes the order of a few directives in the final asm file which
required an update to a few tests.
Reviewed By: sbc100
Differential Revision: https://reviews.llvm.org/D118122
We can use the RISCVISD::GREV encoding that swaps the bits in
each byte. This allows it to use the existing computeKnownBits
support for RISCVISD::GREV.
Limit this to SSE41 - AVX1 targets to avoid UNPCKL(PSHUFB,PSHUFB), pre-SSE41 we don't have PACKUSDW/BLENDW and with AVX2 we can perform this as PERMQ(PSHUFB()).
Allows pow2 mask tests to avoid an unnecessary constant load.
Noticed while investigating how to extend MatchVectorAllZeroTest to support more allof/anyof patterns.
This would allow pow2 mask tests to avoid an unnecessary constant load.
Noticed while investigating how to extend MatchVectorAllZeroTest to support more allof/anyof patterns.