SVE has some non-temporal masked loads and stores. The metadata coming
from the nodes is not copied to the MMO at the moment though, meaning it
will generate a normal instruction. This patch ensures that the right
flags are set if the instruction has non-temporal metadata.
This attempts to standardize and extend some of the insert vector
element lowering. Most notably:
- More types are handled by splitting illegal vectors.
- The index type for G_INSERT_VECTOR_ELT is canonicalized to
TLI.getVectorIdxTy(), similar to extact_vector_element.
- Some of the existing patterns now have the index type specified to
make sure they can apply to GISel too.
- The C++ selection code has been removed, relying on tablegen patterns.
- G_INSERT_VECTOR_ELT with small GPR input elements are pre-selected to
use a i32 type, allowing the existing patterns to apply.
- Variable index inserts are lowered in post-legalizer lowering,
expanding into a stack store and reload.
The conversion expansions did not properly handle bfloat types.
I'm not certain that these expansions are completely correct;
I don't have any experience with AMDGPU or the ability to run
anything to test it.
Note that it doesn't seem like AMDGPU with GlobalISel can
handle fptrunc of float to bfloat, which is needed for itofp.
I've omitted the GISEL run for the bfloat case.
This fixes#85379.
When the encoding of register tuples are aligned, we can use a copy
with larger LMUL to reduce copies.
Reviewers: preames, topperc, lukel97
Reviewed By: topperc, lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/84455
In SplitCriticalEdge, DebugLoc of the branch instruction in new created
MBB was set to empty. It should be set and we can find proper DebugLoc
for it in most cases. This patch set it to non empty merged DebugLoc of
current MBB branches.
Adds two sets of tests. First, one for prolog/epilogue insertions where
the second stack adjustment can be done with shNadd for zba. Second, a
set of tests with offsets off SP in the same ranges, but also adding
varying alignments.
Align(8) is QWORD aligned, but this was checking to see if alignment was
greater than that, when it should have been checking for being greater
than OR EQUAL to Align(8).
This bug was introduced in
https://github.com/llvm/llvm-project/commit/6a6af30d433d7 during the
transition to the Align type.
Fold BICi if all destination bits are already known to be zeroes
```llvm
define <8 x i16> @haddu_known(<8 x i8> %a0, <8 x i8> %a1) {
%x0 = zext <8 x i8> %a0 to <8 x i16>
%x1 = zext <8 x i8> %a1 to <8 x i16>
%hadd = call <8 x i16> @llvm.aarch64.neon.uhadd.v8i16(<8 x i16> %x0, <8 x i16> %x1)
%res = and <8 x i16> %hadd, <i16 511, i16 511, i16 511, i16 511,i16 511, i16 511, i16 511, i16 511>
ret <8 x i16> %res
}
declare <8 x i16> @llvm.aarch64.neon.uhadd.v8i16(<8 x i16>, <8 x i16>)
```
```
haddu_known: // @haddu_known
ushll v0.8h, v0.8b, #0
ushll v1.8h, v1.8b, #0
uhadd v0.8h, v0.8h, v1.8h
bic v0.8h, #254, lsl #8 <-- this one will be removed as we know high bits are zero extended
ret
```
Fixes#53881Fixes#53622
Allow using atomicrmw fadd, fsub, fmin, and fmax with vectors of
floating-point type. AMDGPU supports atomic fadd for <2 x half> and <2 x
bfloat> on some targets and address spaces.
Note this only supports the proper floating-point operations; float
vector typed xchg is still not supported. cmpxchg still only supports
integers, so this inserts bitcasts for the loop expansion.
I have support for fp vector typed xchg, and vector of int/ptr
separately implemented but I don't have an immediate need for those
beyond feature consistency.
sed was being used to use the same test functions with eq/ne branch
condition.
This commit duplicates the test functions so that we have a version
with each condition. This allows us to remove 2 RUN lines.
I plan to add a Zca testing to this file which now requires 1 new
RUN line instead of 2.
Test description says constant does not fit in 12 bits, but the constant
used was -2048 which does fit in 12 bits. Update to -2049.
Also remove uses of -NOT in favor of positive checks. One of the -NOT
should have been using RESBROPT instead of "c.beqz" so that it would
check for the absense of the correct instruction based on the sed
replacement on the RUN line.
Consider the following:
ldr r0, [r4]
ldr r7, [r0, #4]
cmp r7, r3
bhi .LBB0_6
cmp r0, r2
push {r0}
pop {r4}
bne .LBB0_3
movs r0, r6
pop {r4, r5, r6, r7}
pop {r1}
bx r1
Here is a snippet of the generated THUMB1 code of the K&R malloc
function that clang currently compiles to.
push {r0} ends up being popped to pop {r4}.
movs r4, r0 would destroy the flags set by cmp right above.
The compiler has no alternative in this case, except one:
the only alternative is to transfer through a high register.
However, it seems like LLVM does not consider that this is a valid
approach, even though it is a free clobbering a high register.
This patch addresses the FIXME so the compiler can do that when it can
in r10 or r11, or r12.
This adds support for using the L and H argument modifiers for twinword
operands in inline asm code, such as in:
```
%1 = tail call i64 asm sideeffect "rd %pc, ${0:L} ; srlx ${0:L}, 32, ${0:H}", "={o4}"()
```
This is needed by the Linux kernel.
The existing heuristics were assuming that every core behaves like an
Apple A7, where any extend/shift costs an extra micro-op... but in
reality, nothing else behaves like that.
On some older Cortex designs, shifts by 1 or 4 cost extra, but all other
shifts/extensions are free. On all other cores, as far as I can tell,
all shifts/extensions for integer loads are free (i.e. the same cost as
an unshifted load).
To reflect this, this patch:
- Enables aggressive folding of shifts into loads by default.
- Removes the old AddrLSLFast feature, since it applies to everything
except A7 (and even if you are explicitly targeting A7, we want to
assume extensions are free because the code will almost always run on a
newer core).
- Adds a new feature AddrLSLSlow14 that applies specifically to the
Cortex cores where shifts by 1 or 4 cost extra.
I didn't add support for AddrLSLSlow14 on the GlobalISel side because it
would require a bunch of refactoring to work correctly. Someone can pick
this up as a followup.
Depends on #87545
Emit `GNU_PROPERTY_AARCH64_FEATURE_PAUTH` property in
`.note.gnu.property` section depending on
`aarch64-elf-pauthabi-platform` and `aarch64-elf-pauthabi-version` llvm
module flags.
Following https://github.com/llvm/llvm-project/pull/68313 this patch
extends the idea to M-profile PACBTI.
The Machine Scheduler can reorder instructions within a scheduling
region depending on the scheduling policy set. If a BTI-clearing
instruction happens to partake in one such region, it might be moved
around, therefore ending up where it shouldn't.
The solution is to mark all BTI-clearing instructions as scheduling
region boundaries. This essentially means that they must not be part of
any scheduling region, and as consequence never get moved:
- PAC
- PACBTI
- BTI
- SG
Note that PAC isn't BTI-clearing, but it's replaced by PACBTI late in
the compilation pipeline.
As far as I know, currently it isn't possible to organically obtain code
that's susceptible to the bug:
- Instructions that write to SP are region boundaries. PAC seems to
always be followed by the pushing of r12 to the stack, so essentially
PAC is always by itself in a scheduling region.
- CALL_BTI is expanded into a machine instruction bundle. Bundles are
unpacked only after the last machine scheduler run. Thus setjmp and BTI
can be separated only if someone deliberately run the scheduler once
more.
- The BTI insertion pass is run late in the pipeline, only after the
last machine scheduling has run. So once again it can be reordered only
if someone deliberately runs the scheduler again.
Nevertheless, one can reasonably argue that we should prevent the bug in
spite of the compiler not being able to produce the required conditions
for it. If things change, the compiler will be robust against this
issue.
The tests written for this are contrived: bogus MIR instructions have
been added adjacent to the BTI-clearing instructions in order to have
them inside non-trivial scheduling regions.
Reverse the fold with handling inside canCreateUndefOrPoison for cases where we know that the extract index is in bounds.
This exposed a number or regressions, and required some initial freeze handling of SCALAR_TO_VECTOR, which will require us to properly improve demandedelts support to handle its undef upper elements.
There is still one outstanding regression to be addressed in the future - how do we want to handle folds involving frozen loads?
Fixes#86968
Call generateWaitcnt unconditionally at the end of
SIInsertWaitcnts::insertWaitcntInBlock. Even if we don't need to
generate a new waitcnt instruction it has the effect of combining or
removing redundant waitcnts that were already present. Tests show
various small improvements in waitcnt placement.
This PR:
* fixes OpVariable instructions place in a function (see
https://github.com/llvm/llvm-project/issues/66261),
* improves type inference,
* helps avoiding unneeded bitcasts when validating function call's
This allows to improve existing and add new test cases with more strict
checks. OpVariable fix refers to "All OpVariable instructions in a
function must be the first instructions in the first block" requirement
from SPIR-V spec.
On RV64, we legalize zexts of i1s to (vselect m, (splat_vector i64 1),
(splat_vector i64 0)), where the splat_vectors are implicitly
truncating.
When the vselect is used by a binop we want to pull the vselect out via
foldSelectWithIdentityConstant. But because vectors with an element size
< i64 will truncate, isNeutralConstant will return false.
This patch handles truncating splats by getting the APInt value and
truncating it. We almost don't need to do this since most of the neutral
elements are either one/zero/all ones, but it will make a difference for
smax and smin.
I wasn't able to figure out a way to write the tests in terms of select,
since we need the i1 zext legalization to create a truncating
splat_vector.
This supercedes #87236. Fixed vectors are unfortunately not handled by
this patch (since they get legalized to _VL nodes), but they don't seem
to appear in the wild.
Fixed vectors have their sext/zext operands legalized to _VL nodes, so
we need to handle them in the patterns.
This adds a riscv_ext_vl_oneuse pattern since we don't care about the
type of extension used for the shift amount, and extends
Low8BitsSplatPat to handle other _VL nodes. We don't actually need to
check the mask or VL there since none of the _VL nodes have passthru
operands.
The remaining test cases that are widening from i8->i64 need to be
handled by extending combineBinOp_VLToVWBinOp_VL.
This also fixes Low8BitsSplatPat incorrectly checking the vector size
instead of the element size to determine if the splat value might have
been truncated below 8 bits.
- When both operands are constant, the matcher runs into an infinite
loop as the commutation should be applied only when LHS is a constant
and RHS is not.
Reviewers: arsenm
Reviewed By: arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/87426
This patch legalizes G_ZEXT, G_SEXT, and G_ANYEXT. If the type is a
legal mask type, then the instruction is legalized as the element-wise
select, where the condition on the select is the mask typed source
operand, and the true and false values are 1 or -1 (for
zero/any-extension and sign extension) and zero. If the type is a legal integer
or vector integer type, then the instruction is marked as legal.
The legalization of the extends may introduce a G_SPLAT_VECTOR, which
needs to be legalized in this patch for the extend test cases to pass.
A G_SPLAT_VECTOR is legal if the vector type is a legal integer or
floating point vector type and the source operand is sXLen type. This is
because the SelectionDAG patterns only support sXLen typed
ISD::SPLAT_VECTORS, and we'd like to reuse those patterns. A
G_SPLAT_VECTOR is cutom legalized if it has a legal s1 element vector
type and s1 scalar operand. It is legalized to G_VMSET_VL or G_VMCLR_VL
if the splat is all ones or all zeros respectivley. In the case of a
non-constant mask splat, we legalize by promoting the scalar value to
s8.
In order to get the s8 element vector back into s1 vector, we use a
G_ICMP. In order for the splat vector and extend tests to pass, we also
need to legalize G_ICMP in this patch.
A G_ICMP is legal if the destination type is a legal bool vector and the LHS and
RHS are legal integer vector types.
Based off #85592 - our truncation -> PACKSS/PACKUS folds should be able to use the nsw/nuw flags to recognise when we don't need to mask/sext_inreg prior to the PACKSS/PACKUS nodes.
Hi all,
This patch is a follow-up of #79101. It migrates logic from
`visitVSELECT` to `visitVP_SELECT` to simplify `vp.select`. With this
patch we can do the following combinations:
```
vp.select undef, T, F --> T (if T is a constant), F otherwise
vp.select <condition>, undef, F --> F
vp.select <condition>, T, undef --> T
vp.select false, T, F --> F
vp.select <condition>, T, T --> T
```
I'm a total newbie to llvm and I'm sure there's room for improvements in
this patch. Please let me know if you have any advice. Thank you in
advance!