The miscompile case's G_ZEXT has a G_FREEZE source. Similar to D127154, this patch removes isDef32, relying on the AArch64MIPeephole optimizer to remove redundant SUBREG_TO_REG nodes in GISel as well.
Fix #58431
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D136433
If OP in PTEST(PG, OP(PG, ...)) has a flag-setting variant, change the
opcode so that the PTEST becomes redundant. This patch extends this existing
optimization in AArch64InstrInfo::optimizePTestInstr to cover all
flag-setting opcodes.
Reviewed By: peterwaller-arm
Differential Revision: https://reviews.llvm.org/D136083
A follow-on patch will extend the existing
PTEST(PG, OP(PG, ...)) -> OP_FLAG_SETTING(PG, ...)
optimization in AArch64InstrInfo::optimizePTestInstr to cover more of
the flag-setting instructions.
Reviewed By: peterwaller-arm
Differential Revision: https://reviews.llvm.org/D136161
The Chain wasn't set correctly in the DAG for functions marked
with aarch64_pstate_sm_body, which meant that SelectionDAG would
dead-code some of the CopyToReg's. This didn't show up in the
existing tests because all uses were in the same block, but when
adding some control-flow, suddenly things would break.
Reviewed By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D136579
Recognize when the source could have been unmerged into pieces of DstTy
without first splitting it into smaller elements
and then merging those elements into DstTy pieces.
This happens when a vector was meant to be split into sub-vectors but there
was a leftover. At this point the artifact combiner has already dealt with
the leftover and we can continue to use sub-vectors.
Differential Revision: https://reviews.llvm.org/D109241
Recognize a copy that is represented as a split of a source register into
elements that were reassembled into another register of the same type.
Differential Revision: https://reviews.llvm.org/D109240
This reverts commit e8b3ffa532b8ebac5dcdf17bb91b47817382c14d.
The AMDGPU/mad_64_32.ll seems to fail on some of the build bots but
passes locally. I'm really confused.
(sra X, BW-1) is either 0 or -1. So the multiply is a conditional
negate of Y.
This pattern shows up when type legalizing wide multiplies involving
a sign extended value.
Fixes PR57549.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D133399
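The arithmetic behind this fold can be checked with a small model. The sketch below assumes a 32-bit width and models one multiply-free equivalent, neg(and(sra(X, BW-1), Y)); whether the patch emits exactly this form is an assumption, and all function names are hypothetical.

```python
BW = 32                      # assumed bit width for illustration
MASK = (1 << BW) - 1


def to_signed(v):
    """Interpret a BW-bit pattern as a two's-complement integer."""
    v &= MASK
    return v - (1 << BW) if v >> (BW - 1) else v


def sra(x, amount):
    """Arithmetic shift right on a BW-bit value, returned as a bit pattern."""
    return (to_signed(x) >> amount) & MASK


def mul_by_sign(x, y):
    """(mul (sra X, BW-1), Y), truncated to BW bits."""
    return (sra(x, BW - 1) * y) & MASK


def neg_and_sign(x, y):
    """Multiply-free form: the sign mask selects Y or 0, then negate."""
    return (-(sra(x, BW - 1) & (y & MASK))) & MASK
```

For non-negative X both forms give 0; for negative X both give -Y modulo 2^BW.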
The crash comes from a mismatch between the load count in the epilogue and the SEH instruction count.
The cause is again the AArch64LoadStoreOpt pass: it removes some loads in the epilogue but does not remove the corresponding SEH instructions.
This patch fixes the issue by not optimizing loads in the epilogue.
Fix: #58516
Reviewed By: mstorsjo
Differential Revision: https://reviews.llvm.org/D136430
Arm64EC has two different ways to refer to dllimport'ed functions in an
object file. One is using the usual __imp_ prefix, the other is using an
Arm64EC-specific prefix __imp_aux_. As far as I can tell, if a function
is in an x64 DLL, __imp_aux_ refers to the actual x64 address, while
__imp_ points to some linker-generated code that calls the exit thunk.
So __imp_aux_ is used to refer to the address in non-call contexts,
while __imp_ is used for calls to avoid the indirect call checker.
There's one twist to this, though: if an object refers to a symbol using
the __imp_aux_ prefix, the object file's symbol table must also contain
the symbol with the usual __imp_ prefix. The symbol doesn't actually
have to be used anywhere, it just has to exist; otherwise, the linker's
symbol lookup in x64 import libraries doesn't work correctly. Currently,
this is handled by emitting a .globl __imp_foo directive; we could try
to design some better way to handle this.
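The naming rule described above can be summarised in a short sketch. It only restates the behaviour given in this message; the function name and the returned dictionary layout are hypothetical and do not correspond to any LLVM API.

```python
def arm64ec_import_reference(name, is_call):
    """Pick the symbol used to reference a dllimport'ed function.

    Calls go through __imp_<name>; non-call (address-taking) uses go
    through __imp_aux_<name>, which additionally requires __imp_<name>
    to exist in the object's symbol table.
    """
    if is_call:
        return {"symbol": f"__imp_{name}", "directives": []}
    return {
        "symbol": f"__imp_aux_{name}",
        # The plain __imp_ symbol only has to exist, not be used.
        "directives": [f".globl __imp_{name}"],
    }
```

Taking the address of `foo` would therefore reference `__imp_aux_foo` and also emit `.globl __imp_foo`.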
One minor quirk I haven't figured out: apparently, in Arm64EC mode, MSVC
prefers to use a linker-synthesized stub to call dllimport'ed functions,
instead of branching directly. The linker stub appears to do the same
thing that inline code would do, so not sure if it's just a code-size
optimization, or if the synthesized stub can actually do something other
than just load from the import table in some circumstances.
Differential Revision: https://reviews.llvm.org/D136202
Make sure we don't call getReg() on the first operand of instruction
without knowing that operand is actually a register.
(This codepath isn't enabled for most CPUs; only triggers on certain
CPUs, like Cortex-X1.)
Differential Revision: https://reviews.llvm.org/D136296
Change the costmodel to lower a = b * C where C = (1 + 2^m) * (1 + 2^n) to
add w8, w0, w0, lsl #m
add w0, w8, w8, lsl #n
Note: The latency can vary depending on the shift amount.
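The identity behind the two-instruction sequence can be verified with a small model of 32-bit `add` with a shifted operand. This is an illustrative sketch only; the helper names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def add_lsl(a, b, shift):
    """Model of `add wd, wa, wb, lsl #shift` on 32-bit registers."""
    return (a + (b << shift)) & MASK32


def mul_by_two_adds(b, m, n):
    """Compute b * (1 + 2^m) * (1 + 2^n) with two shifted adds."""
    t = add_lsl(b, b, m)      # add w8, w0, w0, lsl #m  -> b * (1 + 2^m)
    return add_lsl(t, t, n)   # add w0, w8, w8, lsl #n  -> t * (1 + 2^n)
```

With m = 2 and n = 4 the constant is 5 * 17 = 85, so `mul_by_two_adds(3, 2, 4)` equals 255.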
Reviewed By: efriedma, dmgreen
Differential Revision: https://reviews.llvm.org/D135441
This intrinsic can be removed in favour of using a call to
__arm_sme_state() directly and testing the LSB of X0.
In IR that would look like:
%pstate = call aarch64_sme_preservemost_from_x2 {i64, i64} @__arm_sme_state()
%pstate.x0 = extractvalue {i64, i64} %pstate, 0
%pstate.sm = and i64 %pstate.x0, 1
This patch adds the assembly/disassembly for the following instructions:
For INT:
ADD(array results, multiple and single vector): Add replicated single
vector to multi-vector with ZA array vector results.
SUB(array results, multiple and single vector): Subtract replicated single
vector from multi-vector with ZA array vector results.
For FP:
FMLA (multiple and single vector): Multi-vector floating-point fused
multiply-add by vector.
FMLS (multiple and single vector): Multi-vector floating-point
multiply-subtract long by vector.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
The Matrix Operand has 2 new sizes: 32 (.s) and 64 (.d) bits
(MatrixOp32 and MatrixOp64)
Depends on: D135448
Depends on: D135952
Differential Revision: https://reviews.llvm.org/D135455
The names in developer.arm for these SME features are:
HaveSMEI16I64 and HaveSMEF64F64
so the new flag names are consistent with the documentation page
Reviewed By: sdesmalen, c-rhodes
Differential Revision: https://reviews.llvm.org/D135974
Before this patch (and refactor patch D135843), isBitfieldPositioningOp did not handle "and(any_extend(shl(val, N)), shifted-mask)" (it bailed out if the AND operand was not a SHL).
After this patch, isBitfieldPositioningOp sees through "any_extend" to find the "shl", and so finds more bit-field-positioning nodes.
https://gcc.godbolt.org/z/3ncGKbGW6 is a four-line LLVM IR example that can be optimized to UBFIZ (see added test case test_and_extended_shift_with_imm in llvm/test/CodeGen/AArch64/bitfield-insert.ll). One existing test case also improves.
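To see why such a pattern maps onto a single UBFIZ, the sketch below models UBFIZ and the and-of-extended-shift expression on plain integers. The constants (shift 3, mask 0x1F8) are hypothetical and are not taken from the linked example; any_extend is modelled as zero-extension, which is one legal choice since the high bits are unspecified and the mask discards them here.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def ubfiz(src, lsb, width):
    """Model of UBFIZ Xd, Xn, #lsb, #width: low `width` bits of src moved to bit `lsb`."""
    return ((src & ((1 << width) - 1)) << lsb) & MASK64


def and_of_extended_shift(val, n, mask):
    """and(any_extend(shl(val, n)), mask) with a 32-bit shl and a 64-bit and."""
    shifted = (val << n) & MASK32   # i32 shl
    extended = shifted              # any_extend, modelled as zext
    return extended & mask
```

With n = 3 and mask = 0x1F8 (six set bits starting at bit 3), the expression equals `ubfiz(val, 3, 6)` for every 32-bit `val`.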
Differential Revision: https://reviews.llvm.org/D135852
Before this patch (and D135844)
- Given DAG node shl(op, N), isBitfieldPositioningOp uses (optionally shifted [1] ) op as the Src (least significant bits of Src are inserted into DstLSB of Dst node).
After this patch
- If op is and(val, mask), isBitfieldPositioningOp tries to see through the and to find whether val is a simpler source than op.
This helps in a similar (probably symmetric) way to how isSeveralBitsExtractOpFromShr [2] optimizes isBitfieldExtractOpFromShr.
Existing test cases are improved without regressions.
[1] cbd8464595/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp (L2546)
[2] cbd8464595/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp (L2057)
Differential Revision: https://reviews.llvm.org/D135850
In https://github.com/llvm/llvm-project/issues/57452, we found that IRTranslator is translating `i1 true` into `i32 -1`.
This is because IRTranslator uses SExt for indices.
In this fix, we change the expected behavior of extractelement's index, moving from SExt to ZExt.
This change covers the documentation, SelectionDAG and IRTranslator.
We also added a test for AMDGPU and updated tests for AArch64, Mips, PowerPC, RISCV, VE, WebAssembly and X86.
This patch fixes issue #57452.
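The difference between the two extensions for an `i1` index is easy to reproduce on plain integers. The helpers below are a hypothetical model, not IRTranslator code.

```python
def sext(value, from_bits, to_bits):
    """Sign-extend a from_bits-wide pattern to to_bits, returned as a signed int."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):
        value -= 1 << from_bits
    # Wrap into to_bits and reinterpret as signed.
    value &= (1 << to_bits) - 1
    return value - (1 << to_bits) if value >> (to_bits - 1) else value


def zext(value, from_bits, to_bits):
    """Zero-extend a from_bits-wide pattern to to_bits."""
    assert to_bits >= from_bits
    return value & ((1 << from_bits) - 1)
```

Here `sext(1, 1, 32)` yields -1 while `zext(1, 1, 32)` yields 1, which is the index the IR author meant when writing `i1 true`.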
Differential Revision: https://reviews.llvm.org/D132978
Whenever a call to __chkstk was made, the frame lowering previously
omitted the aligning (as NumBytes was reset to zero before doing
alignment).
This fixes https://github.com/llvm/llvm-project/issues/56182.
The initial version of this produced invalid code for small
functions with no local stack allocations, if those functions
were marked with the "stackrealign" attribute. If building
with -mstack-alignment=16 (which otherwise mostly would be a
no-op), this attribute is added on the main function.
Differential Revision: https://reviews.llvm.org/D135687
This patch changes the preferred disassembly for SVE vector lists.
For instance, instead of printing this assembly:
ld4d { z1.d, z2.d, z3.d, z4.d }, p0/z, [x0]
it will print this:
ld4d { z1.d-z4.d }, p0/z, [x0]
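The two spellings of the list can be sketched with a pair of formatting helpers. These are illustrative only: the function names are hypothetical, and the sketch does not model how the real printer treats short lists or lists that wrap around from z31 to z0.

```python
def expanded_list(first, count, suffix):
    """Old style: every register spelled out."""
    regs = [f"z{first + i}.{suffix}" for i in range(count)]
    return "{ " + ", ".join(regs) + " }"


def ranged_list(first, count, suffix):
    """New style: first and last register joined by a dash."""
    last = first + count - 1
    assert count >= 2 and last <= 31, "sketch covers non-wrapping lists only"
    return f"{{ z{first}.{suffix}-z{last}.{suffix} }}"
```

For the `ld4d` example, `expanded_list(1, 4, "d")` gives `{ z1.d, z2.d, z3.d, z4.d }` and `ranged_list(1, 4, "d")` gives `{ z1.d-z4.d }`.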
Differential Revision: https://reviews.llvm.org/D135952
Add a compile-time flag for enabling streaming mode.
When streaming mode is enabled, lower basic loads and stores of fixed-width vectors
to generate code that is compatible with streaming mode.
Differential Revision: https://reviews.llvm.org/D133433
The crash case comes from #58350. It has two stores: one of type f32 and the other of type v1f32.
When we try to merge these two stores as v1f32, the memVT is a vector type, so the old code also uses ISD::EXTRACT_SUBVECTOR for the f32 value, and the compiler crashes.
This patch inserts a build_vector for the f32 store so that it also produces a v1f32 when memVT is v1f32.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D135954
Functions with `aarch64_sme_pstatesm_body` will emit a SMSTART at the start
of the function, and a SMSTOP at the end of the function, such that all
operations use the right value for vscale.
Because the placement of these nodes is critically important (i.e. no
vscale-dependent operations should be done before SMSTART has been issued),
we require glueing the CopyFromReg to the Entry node such that we can
insert the SMSTART as part of that glued chain.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131582
The second operand of CCMP/CCMN supports constants from 0 to 31. When CCMP's second operand is in the range [-31, -1], we can replace it with CCMN to avoid an extra mov.
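The selection rule can be written as a small decision function. This is a sketch of the rule as stated above, not the actual lowering code; the function name and the returned tuples are hypothetical.

```python
def select_conditional_compare(imm):
    """Choose the conditional-compare form for a constant second operand.

    Returns (opcode, encoded_immediate), or ("mov+ccmp", imm) when the
    constant has to be materialized in a register first.
    """
    if 0 <= imm <= 31:
        return ("ccmp", imm)
    if -31 <= imm <= -1:
        # CCMN compares against the negated immediate, so encode -imm.
        return ("ccmn", -imm)
    return ("mov+ccmp", imm)
```

For example, a comparison against -5 becomes `ccmn` with immediate 5, while a comparison against 32 still needs a register operand.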
Fix: #57034
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D135939
This reverts commit 50e0aced4521260af842dba73f1d8c50d36314ea.
This could accidentally start producing invalid code in some
cases (in particular, if compiling with -mstack-alignment=16, which
one could expect to be a no-op for a target where the stack always
is aligned to 16 bytes anyway).
This is a simple addition to the convertPhiTypes in CodeGenPrepare to
consider and convert constants as it converts the phi type. Someone
fixed the bug in the motivating example, so the undef is now a constant
0. This does mean converting between integer and floating point
constants, which may have different materialization.
Differential Revision: https://reviews.llvm.org/D135561
To achieve this, we need this observation:
`uzp1` is just a `xtn` that operates on two registers
For example, given the following register with type v2i64:
LSB_______MSB
x0 x1 x2 x3
Applying xtn on it we get:
x0 x2
This is equivalent to bitcasting it to v4i32 and then applying uzp1 to it:
x0 x1 x2 x3
|
uzp1
v
x0 x2 <value from other register>
We can transform xtn to uzp1 by this observation, and vice versa.
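The observation can be checked with a lane-level model. The sketch below assumes little-endian lane order and represents vectors as Python lists; the function names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def xtn(v2i64):
    """Narrow each 64-bit lane to its low 32 bits (v2i64 -> v2i32)."""
    return [x & MASK32 for x in v2i64]


def bitcast_v2i64_to_v4i32(v2i64):
    """Little-endian bitcast: each 64-bit lane becomes (low half, high half)."""
    out = []
    for x in v2i64:
        out += [x & MASK32, (x >> 32) & MASK32]
    return out


def uzp1(a, b):
    """Even-indexed lanes of a followed by even-indexed lanes of b."""
    return a[0::2] + b[0::2]
```

With two v2i64 inputs `a` and `b`, `uzp1(bitcast(a), bitcast(b))` equals `xtn(a) + xtn(b)`, so the low half of the uzp1 result is exactly `xtn(a)`.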
This observation only holds on little-endian targets. Big-endian targets have
a problem: uzp1 cannot be replaced by xtn, since uzp1 behaves differently
on little endian and big endian.
To illustrate, take the following for example:
LSB____________________MSB
x0 x1 x2 x3
On little endian, uzp1 grabs x0 and x2, which is right; on big endian, it
grabs x3 and x1, which doesn't match what I saw in the documentation. But, since
I'm new to AArch64, take my word with a pinch of salt. This behavior was
observed in gdb; maybe there's an issue with the order in which it prints the values?
Whatever the reason, the execution result given by qemu just doesn't match.
So I disable this on big-endian targets temporarily until we find the crux.
Fixes #57502
Reviewed By: dmgreen, mingmingl
Co-authored-by: Mingming Liu <mingmingl@google.com>
Differential Revision: https://reviews.llvm.org/D133850
This matches what was done for the ARM implementation (where getting
the instruction sizes right is even more tricky, and hence needed
tighter testing).
This will allow catching any future cases where prologs and epilogs
don't match the instructions within them.
Differential Revision: https://reviews.llvm.org/D131394
Without this, unwinding through functions that do use PAC
would fail if PAC was actually active.
Differential Revision: https://reviews.llvm.org/D135103
Pre-commit test cases to show cases when UZP1 (TRUNC, TRUNC) could be
combined into TRUNC (UZP1) (with some proper bit conversions in the middle) to generate more efficient code.
Differential Revision: https://reviews.llvm.org/D133280
After setting up the FP, the rest of the prologue doesn't need to
be replayed for unwinding the stack frame.
This allows reverting the functional parts of
2f7fbf837625267193351cc334e506a3a9161958 (but fixing inconsistent
duplicate setting of HasWinCFI).
Differential Revision: https://reviews.llvm.org/D135686
The BRKNS instruction is unlike the other instructions that set flags
since it has an all active implicit predicate, so the existing
PTEST(PG, BRKN(PG, A, B)) -> BRKNS(PG, A, B)
in AArch64InstrInfo::optimizePTestInstr is incorrect, however
PTEST(PTRUE_B(31), BRKN(PG, A, B)) -> BRKNS(PG, A, B)
is correct.
Spotted by @paulwalker-arm in D134946.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D135655
This reverts commit 0148df8157f05ecf3b1064508e6f012aefb87dad.
Getting lit test failures on AMDGPU, but I can't reproduce them so far.
Reverting to investigate.
(sra X, BW-1) is either 0 or -1. So the multiply is a conditional
negate of Y.
This pattern shows up when type legalizing wide multiplies involving
a sign extended value.
Fixes PR57549.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D133399
Reference: https://gcc.gnu.org/onlinedocs/gccint/Machine-Constraints.html
k: A memory operand whose address is formed by a base register and
(optionally scaled) index register.
m: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as st.w and ld.w.
ZB: An address that is held in a general-purpose register. The offset
is zero.
ZC: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as ll.w and sc.w.
Note:
The INLINEASM SDNode flags in the tests below are updated because the newly
introduced enum `Constraint_k` is added before `Constraint_m`.
llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-inline-asm.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
llvm/test/CodeGen/X86/callbr-asm-kill.mir
This patch passes `ninja check-all` on an X86 machine with all official
targets and the LoongArch target enabled.
Differential Revision: https://reviews.llvm.org/D134638
These are harmless for the unwinder - the unwinder doesn't need to
handle them for being able to unwind correctly.
Only add the opcodes when the branch target is in a SEH prologue;
for jumptables e.g. within a function, we shouldn't add any SEH
opcodes.
Differential Revision: https://reviews.llvm.org/D135277