Suppress warnings like
WARNING: Prefix AVX had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
WARNING: Prefix AVX1 had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
WARNING: Prefix AVX2 had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
WARNING: Prefix AVX2-ONLY had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
WARNING: Prefix AVX512 had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
WARNING: Prefix AVX512F had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
WARNING: Prefix AVX512F-ONLY had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
WARNING: Prefix AVX512-FAST had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
WARNING: Prefix AVX512DQ-ONLY had conflicting output from different RUN lines for all functions in test vector-interleaved-store-i16-stride-7.ll
This fills out the fcmp handling to be more like the other instructions,
adding better support for fp16 and some larger vectors.
Select of f16 values is still not handled optimally in places as the
select is only legal for s32 values, not s16. This would be correct for
integer but not necessarily for fp. It is as if we need to do
legalization -> regbankselect -> extra legalization -> selection.
Suppress warnings like
WARNING: Prefix AVX had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
WARNING: Prefix AVX1 had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
WARNING: Prefix AVX2 had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
WARNING: Prefix AVX2-ONLY had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
WARNING: Prefix AVX512 had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
WARNING: Prefix AVX512F had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
WARNING: Prefix AVX512F-ONLY had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
WARNING: Prefix AVX512-FAST had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
WARNING: Prefix AVX512DQ-ONLY had conflicting output from different RUN lines for all functions in test vector-interleaved-load-i16-stride-7.ll
This patch resolves the missed-optimization case below.
### Code
```
define <8 x i64> @vwadd_mask_v8i32(<8 x i32> %x, <8 x i64> %y) {
%mask = icmp slt <8 x i32> %x, <i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42>
%a = select <8 x i1> %mask, <8 x i32> %x, <8 x i32> zeroinitializer
%sa = sext <8 x i32> %a to <8 x i64>
%ret = add <8 x i64> %sa, %y
ret <8 x i64> %ret
}
```
### Before this patch
[Compiler Explorer](https://godbolt.org/z/cd1bKTrx6)
```
vwadd_mask_v8i32:
li a0, 42
vsetivli zero, 8, e32, m2, ta, ma
vmslt.vx v0, v8, a0
vmv.v.i v10, 0
vmerge.vvm v16, v10, v8, v0
vwadd.wv v8, v12, v16
ret
```
### After this patch
```
vwadd_mask_v8i32:
li a0, 42
vsetivli zero, 8, e32, m2, ta, ma
vmslt.vx v0, v8, a0
vsetvli zero, zero, e32, m2, tu, mu
vwadd.wv v12, v12, v8, v0.t
vmv4r.v v8, v12
ret
```
This pattern can be found in a reduction with a widening destination.
Specifically, we first do a fold like `(vwadd.wv y, (vmerge cond, x, 0))
-> (vwadd.wv y, x, y, cond)`, then do pattern matching on it.
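
The fold holds because adding zero leaves `y` unchanged, so the merge-with-zero form and the masked form with `y` as passthru agree lane by lane. Below is a minimal Python model of the two forms, assuming plain integers with no wraparound. The helper names are hypothetical and are not LLVM or RVV APIs.

```python
def vmerge_then_vwadd(y, x, cond):
    # (vwadd.wv y, (vmerge cond, x, 0)): inactive lanes add a zero.
    merged = [xi if c else 0 for xi, c in zip(x, cond)]
    return [yi + mi for yi, mi in zip(y, merged)]


def masked_vwadd(y, x, passthru, cond):
    # (vwadd.wv y, x, passthru, cond): inactive lanes keep the passthru value.
    return [yi + xi if c else pi for yi, xi, pi, c in zip(y, x, passthru, cond)]


x = [1, 50, -7, 42]
y = [100, 200, 300, 400]
cond = [xi < 42 for xi in x]  # mirrors the icmp slt against 42 above
before = vmerge_then_vwadd(y, x, cond)
after = masked_vwadd(y, x, y, cond)
```

Both lists come out identical, which is why the masked instruction needs the mask-undisturbed policy and `y` as its passthru operand.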
Almost all loops that use getNumVirtRegs skip unused registers by means
of reg_nodbg_empty or an empty live interval. The exceptions are these two
cases, which are revealed by GlobalISel since it can skip RegClass
assignment for unused registers.
Closes #64452, closes #71926
Currently, recomputeLiveIns recomputes the live-in registers for a single
MachineBasicBlock, but the result depends on the order in which
recomputeLiveIns is called, which can result in incorrect register
allocations down the line.
This PR fixes that by simply recomputing the liveins for the entire CFG
until convergence is achieved. This makes it harder to introduce subtle
bugs which alter liveness.
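
To illustrate why iterating to convergence removes the order dependence, here is a generic backward-dataflow sketch in Python. It is not the LLVM implementation: the block representation (upward-exposed uses, defs, successor lists) and the function name are assumptions made for this example.

```python
def recompute_live_ins(blocks, succs):
    """blocks: name -> (uses, defs); succs: name -> list of successor names."""
    live_in = {name: set() for name in blocks}
    changed = True
    while changed:  # repeat over the whole CFG until nothing changes
        changed = False
        for name, (uses, defs) in blocks.items():
            live_out = set()
            for s in succs.get(name, []):
                live_out |= live_in[s]
            new_in = set(uses) | (live_out - set(defs))
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True
    return live_in


# A loop: entry -> header -> body -> header, and header -> exit.
blocks = {
    "entry": (set(), {"r1"}),
    "header": ({"r1"}, set()),
    "body": ({"r2"}, {"r1"}),
    "exit": ({"r3"}, set()),
}
succs = {"entry": ["header"], "header": ["body", "exit"],
         "body": ["header"], "exit": []}
live = recompute_live_ins(blocks, succs)
```

A single pass that visits `entry` first would see empty live-in sets for its successors and miss `r2` and `r3`. Repeating the pass until no set changes gives the same answer for any visiting order.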
glibc now adds the required minimum ISA level to libc-nonshared.a
(linked into all programs) [1], and this is done with inline asm along with
.note.gnu.property and .pushsection/.popsection. However, the x86
backend always ends the '.note.gnu.property' section when building with
-fcf-protection, leading to an assertion failure:
llvm/llvm-project-git/llvm/lib/MC/MCStreamer.cpp:1251: virtual void
llvm::MCStreamer::switchSection(llvm::MCSection*, const llvm::MCExpr*):
Assertion `!Section->hasEnded() && "Section already ended"' failed.
[1]
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/isa-level.c;h=3f1b269848a52f994275bab6f60dded3ded6b144;hb=HEAD
This patch disallows the use of the -maix-small-local-exec-tls and
-fno-data-sections options within clang, and also disallows the use of
the aix-small-local-exec-tls attribute with the -data-sections=false
option in llc.
This is because having data sections off when using the
aix-small-local-exec-tls feature is not ideal for performance. As the
small-local-exec-tls region is a limited resource, this space should not
be used for variables that may be replaced.
Note that on AIX, data sections are turned on by default, so this patch
makes it so that a diagnostic is emitted when users explicitly turn off
data sections while using the aix-small-local-exec-tls feature.
This test started to fail when LLVM created the release/18.x branch and
the main branch subsequently had the version number increased from 18 to
19.
I investigated this failure (it was blocking our internal automation)
and discovered that the CHECK statement on line 27 appeared to be
checking for the compiler version number (1800) encoded in octal.
I don't know if this is something that explicitly needs to be
checked, so I am leaving it in, but it should be more flexible so the
test doesn't fail anytime the version number is changed. To accomplish
that, I changed the check for the 4-digit version number to be a regex.
I originally updated this test for the 18->19 transition in
a01195ff5cc3d7fd084743b1f47007645bb385f4. This change makes the CHECK
line more flexible so it doesn't need to be continually updated.
Make __builtin_cpu_{init|supports|is} target independent and provide an
opt-in query for targets that want to support it. Each target is still
responsible for their specific lowering/code-gen. Also provide code-gen
for PowerPC.
I originally proposed this in https://reviews.llvm.org/D152914 and this
addresses the comments I received there.
---------
Co-authored-by: Nemanja Ivanovic <nemanjaivanovic@nemanjas-air.kpn>
Co-authored-by: Nemanja Ivanovic <nemanja@synopsys.com>
This commit extends separate-const-offset-from-gep to look at the
newly-added `disjoint` flag on `or` instructions so as to preserve
additional opportunities for optimization.
The tests were pre-committed in #76972.
Currently, when a string is merged, its uses in metadata are not
replaced. This patch adds support for replacing those metadata uses.
This fixes a miscompile from #79072 where we were taking the wrong SrcVec to do
the M1 shuffle. E.g. if the SrcVecIdx was 2 and we had 2 VRegsPerSrc, we ended
up taking it from V1 instead of V2.
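
The index arithmetic behind this example can be written out as a short sketch. It assumes source-register indices are numbered consecutively across V1 and then V2. The function name is hypothetical and is not taken from the patch.

```python
def pick_source(src_vec_idx, vregs_per_src):
    # Registers 0..vregs_per_src-1 belong to V1, the next vregs_per_src to V2.
    operand = "V1" if src_vec_idx < vregs_per_src else "V2"
    # Index of the M1 register within the chosen operand.
    sub_idx = src_vec_idx % vregs_per_src
    return operand, sub_idx
```

Under that numbering, `pick_source(2, 2)` selects the first register of V2, which is the case the miscompile got wrong.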
Improve codegen for (trunc X to <3 x i8>) by converting it to a sequence
of 3 ST1.b, but first converting the truncate operand to either v8i8 or
v16i8, extracting the lanes for the truncate results and storing them.
At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: https://github.com/llvm/llvm-project/pull/77790
PR: https://github.com/llvm/llvm-project/pull/78637
If we have a bitrotate shuffle, this is also by definition a vreg
splittable shuffle when exact VLEN is known. However, there's no profit
to be had from splitting the wider bitrotate lowering into individual m1
pieces. We'd rather leave it at the higher LMUL to reduce code size.
This is a general problem for any linear-in-LMUL shuffle expansion where
the vreg splitting still has to do linear work per piece. On first
reflection it seems like element rotation might have the same
interaction, but in that case, splitting can be done via a set of whole
register moves (which may get folded into the consumer), which is
at least as good as a pair of slideup/slidedown. I think that bitrotate
is the only shuffle expansion we have that actually needs to be handled here.
This reverts commit 72f10f7eb536da58cb79e13974895cd97d4e1a5f.
This change was causing a miscompile on an internal test and is being reverted at the author's request until it can be fixed.
In the past PerformSplittingToNarrowingStores handled both int and float
ops, but since the introduction of MVETRUNC it only operates on float
operations, creating VCVTN nodes. It should be guarded by hasMVEFloatOps
to prevent a failure to select.
Previously, RISCVInsertReadWriteCSR inserted an FRM swap for any value
other than 7 and restored the original value right after the vector
instruction. This is inefficient if multiple vector instructions use the
same rounding mode, or if the next vector instruction uses a different
explicit rounding mode.
This patch implements a local optimization to solve the above problem.
We assume the starting rounding mode of the basic block is "dynamic."
When iterating through a basic block and encountering an instruction
whose rounding mode is not the same as the current rounding mode, we
change the current rounding mode and save the original FRM value if
needed. We may also need to restore FRM when encountering a function
call, inline asm, or some uses of FRM.
A more advanced version would perform cross-basic-block analysis
to determine the starting rounding mode of each basic block.
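
The local scheme can be sketched as a single pass over one basic block. This is a simplified Python model, not the pass itself. The pseudo-ops `swap_frm`, `write_frm` and `restore_frm` are hypothetical names. The model also assumes that FRM is restored at the end of the block and that instructions using the dynamic rounding mode count as FRM users that need the original value.

```python
DYN = 7  # "dynamic" rounding mode: use whatever FRM currently holds


def insert_frm_writes(block):
    """block: list of ("vec", rm) or ("call",). Returns the rewritten list."""
    out = []
    current = DYN   # assumed rounding mode at block entry
    saved = False   # whether the original FRM value has been saved

    def restore():
        nonlocal current, saved
        if current != DYN:
            out.append(("restore_frm",))
            current, saved = DYN, False

    for inst in block:
        if inst[0] == "vec" and inst[1] != DYN:
            if inst[1] != current:
                # Save the original FRM only on the first switch.
                out.append(("write_frm", inst[1]) if saved
                           else ("swap_frm", inst[1]))
                current, saved = inst[1], True
        else:
            # Calls and dynamic-mode users need the original FRM back.
            restore()
        out.append(inst)
    restore()  # leave the block in the assumed "dynamic" state
    return out


block = [("vec", 1), ("vec", 1), ("vec", 2), ("call",), ("vec", 1)]
rewritten = insert_frm_writes(block)
```

In this example the two consecutive instructions with rounding mode 1 share one FRM write, and the switch to mode 2 is a plain write with no intermediate restore.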
If we're lowering an e8 m8 shuffle and we have an index value greater than
255, we have no available space to generate an e16 index vector. The
code had originally handled this correctly, but in a recent refactoring
I had moved the single-source code above the check, and thus broke the
single-source case by accident.
I have a change on review to rework this (https://github.com/llvm/llvm-project/pull/79330), but for now, go with the most obvious fix.
Triggered by discussion on https://github.com/llvm/llvm-project/pull/79330. In the process of writing this, I realized that one of my recent refactorings appears to have broken the legalization for the single-source case here. Fix to follow in a separate patch.
Machines with vector support handle i128 in vector registers and
therefore only have the small displacement available for memory
accesses. Update isLegalAddressingMode() to reflect this.
The PTX ISA specifies that initializers may be incomplete ([5.4.4.
Initializers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#initializers)):
> As in C, array initializers may be incomplete, i.e., the number of
initializer elements may be less than the extent of the corresponding
array dimension, with remaining array locations initialized to the
default value for the specified array type.
Emitting initializers in this form is preferable because it reduces the
size of the PTX, in some cases significantly, and can improve compile
time of ptxas as a result.
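
As a rough illustration of the size reduction, the sketch below drops trailing default-valued elements before printing an initializer. It is written in Python and is not the NVPTX backend code. The function names are hypothetical, and the printed declaration format is only an approximation of PTX syntax.

```python
def trim_initializer(values, default=0):
    # Drop trailing elements equal to the default; PTX fills them implicitly.
    end = len(values)
    while end > 0 and values[end - 1] == default:
        end -= 1
    return values[:end]


def emit_global(name, values):
    trimmed = trim_initializer(values)
    decl = f".global .u32 {name}[{len(values)}]"
    if not trimmed:
        return decl + ";"  # all elements are the default: omit the initializer
    return decl + " = {" + ", ".join(str(v) for v in trimmed) + "};"
```

The array extent is still printed in full, so only the initializer list shrinks. For a large array that is mostly zero this removes most of the emitted text.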
This builds on bdc41106ee48dce59c500c9a3957af947f30c8c3.
This change completes the migration to a recursive shuffle lowering
strategy: when we encounter an unknown two-argument shuffle, we
lower each operand as a single-source permute, and then use a vselect
(i.e. a vmerge) to combine the results. For code quality, this relies on
the post-isel combine, which will aggressively fold that vmerge back into
the materialization of the second operand if possible.
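
The decomposition can be modelled on plain lists. This is a sketch in Python with hypothetical helper names. It assumes the usual shufflevector convention that mask indices address the concatenation of the two operands and that -1 marks an undefined lane.

```python
def split_two_source_shuffle(mask, n):
    """Split a two-source mask into two single-source masks plus a select."""
    mask1 = [m if 0 <= m < n else -1 for m in mask]   # permute of V1
    mask2 = [m - n if m >= n else -1 for m in mask]   # permute of V2
    select = [m < n for m in mask]                    # True: take the V1 lane
    return mask1, mask2, select


def permute(v, mask):
    return [v[m] if m >= 0 else None for m in mask]


def shuffle(v1, v2, mask):
    both = v1 + v2
    return [both[m] if m >= 0 else None for m in mask]


v1, v2 = [10, 11, 12, 13], [20, 21, 22, 23]
mask = [0, 5, 2, 7]
m1, m2, sel = split_two_source_shuffle(mask, 4)
p1, p2 = permute(v1, m1), permute(v2, m2)
lowered = [a if s else b for a, b, s in zip(p1, p2, sel)]
```

The final list comprehension plays the role of the vselect. In the real lowering, that select is what the post-isel combine tries to fold into the second permute.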
Note: The change includes only the most immediately obvious of the
stylistic cleanup. There's a bunch of code movement that this enables
that I'll do as a separate patch as rolling it into this creates an
unreadable diff.