Machine Copy Propagation Pass may lose some opportunities to further
remove the redundant copy instructions during the ForwardCopyPropagateBlock
procedure. When we Clobber a "Def" register, we also need to remove the record
from the copy maps that indicates "Src" defined "Def" to ensure the correct semantics
of the ClobberRegister function.
For more information, please see the C++ test case generated code in
"vector.body" after the MCP Pass: https://gcc.godbolt.org/z/nK4oMaWv5.
This patch add intrinsics of the form
sv<type>_t svld1q_gather_u64offset_<typ>(svbool_t pg, const <type>_t *base, svuint64_t offs);
void svst1q_scatter_u64offset_<typ>(sbvool_t, <type>_t *base, svuint64_t offst, sv<type>_t data);
as well as their short forms.
ACLE spec: ARM-software/acle#257
…d LLVM intrinisc (#70362)""
This reverts commit e1ee0e85104eed2c68b6821d9e5a2066e4154099.
The patch https://github.com/llvm/llvm-project/pull/70362 had a test in
LLDB failing. The Feature sve2p1 in AArch64InstrInfo.td uses
AssemblerPredicate instead of AssemblerPredicateWithAll This patch adds
again PR #70362 with the fix in the AArch64InstrInfo.td.
The floating-point and MVE features together specify the MVE
functionality that is supported on the Cortex-M85 processor. But the FPU
extension for the underlying architecture(armv8.1-m.main) is FPV5 which
does not include MVE-F. So Compiler's -S output and `-save-temps=obj`
loses MVE feature which leads to assembler error. What happening here is
.fpu directive overrides any previously set features by .cpu directive.
Since the the corresponding .fpu generated (.fpu fpv5-d16) does not
include MVE-F, it overrides those features even though it is supported
and set by the .cpu directive. Looks like .fpu is supposed to do this.
In this case, there should be an .arch_extension directive re-enabling
the relevant extensions after .fpu if the goal is to keep these
extensions enabled. GCC also does the same.
So this patch enables the MVE features by emitting the below arch
extension:
.fpu fpv5-d16
.arch_extension mve.fp
---------
Co-authored-by: Simi Pallipurath <simi.pallipurath.com>
DAGCombiner folds (select_cc seteq (and x, y), 0, 0, A) to (and (sra
(shl x)) A) where y has a single bit set. Previously, DAGCombiner relies
on `shouldAvoidTransformToShift` to decide when to do the combine, but
`shouldAvoidTransformToShift` is only about shift cost. This patch
introuduces a specific hook to decide when to do the combine and disable
the combine when Zicond enabled and AndMask <= 1024.
This is part-2 change to improve codegen for vec_fabs. In this patch,
v16f16 and v132f16 fabs are improved.
There will be at least two followups patches after this one.
1) fixing the ISEL crash when fabs.v32f16 uses custom lowering with
AVX512
2) better expansion for v16f16, v32f16 types on AVX1 subtargets.
On non-DQI AVX512 targets, X86InstrInfo::setExecutionDomainCustom will convert EVEX int-domain instructions to VEX fp-domain instructions. But, if we have the chance to use a broadcast fold we're better off using a EVEX instruction, so handle a reverse fold.
This patch adds the quadword gather load intrinsics of the form
sv<type>_t svld1q_gather_u64index_<typ>(svbool_t, const <type>_t *, svuint64_t);
sv<type>_t svld1q_gather_u64base_index_<typ>(svbool_t, svuint64_t, int64_t);
and the quadword scatter store intrinsics of the form
void svst1q_scatter_u64index_<typ>(svbool_t, <type>_t *, svuint64_t, sv<type>_t);
void svst1q_scatter_u64base_index_<typ>(svbool, svuint64_t, int64_t, sv<type>_t);
ACLE spec: https://github.com/ARM-software/acle/pull/257
Previously, bolt could not get FixupKind for BL correctly, because bolt
cannot get target-flags for BL. Here just add support in MCCodeEmitter.
Fixes https://github.com/llvm/llvm-project/pull/72826.
This patch added backend consumption on a new loop metadata:
!1 = !{!"llvm.loop.align", i32 64}
which is generated from clang's new loop attribute:
[[clang::code_align()]]
clang patch: #70762
This patch is used to test whether fixupkind for bl can be returned
correctly. When BL has target-flags(loongarch-call), there is no error.
But without this flag, an assertion error will appear. So the test is
just tagged as "Expectedly Failed" now until the following patch fix it.
`-debug-only=isel-dump` is the new debug type for printing SelectionDAG
after each ISel phase. This can be furthered filter by
`-filter-print-funcs=<function names>`.
Note that the existing `-debug-only=isel` will take precedence over the
new behavior and print SelectionDAG dumps of every single function
regardless of `-filter-print-funcs`'s values.
If we already have a YMM/ZMM constant that a smaller XMM/YMM has matching lower bits, then ensure we reuse the same constant pool entry.
Extends the similar combines we already have to reuse VBROADCAST_LOAD/SUBV_BROADCAST_LOAD constant loads.
This is a mainly a canonicalization, but should make it easier for us to merge constant loads in a future commit (related to both #70947 and better X86FixupVectorConstantsPass usage for #71078).
Reapplied with fix to ensure we don't 'flip-flop' between multiple matching constants - only perform the fold if the new constant pool entry is larger than the current entry.
When storing a subvector, too many element were written when the
size of the alloca is smaller than the size of the vector store.
This patch checks for the minimum of the alloca vector and the
store vector to determine the number of elements to store.
Add live-through register set printing, assuming live-through register
is in live-in and live-out sets, has no redefinitions but may have uses
in the block.
Avoids infinite issues in some upcoming patches to help D152928 - x86 sees a number of regressions that are addressed by extending SimplifyDemandedVectorEltsForTargetNode to cover more binop opcodes
When there is a COPY instruction in the loop with other uses, we want to
hoist the COPY, which in turn leads to the users being hoisted as well.
Co-authored-by David Green : David.Green@arm.com
The svldr_vnum and svstr_vnum builtins always modify the base register
and tile slice and provide immediate offsets of zero, even when the
offset provided to the builtin is an immediate. This patch optimises the
output of the builtins when the offset is an immediate, to pass it
directly to the instruction and to not need the base register and tile
slice updates.
For chain functions, PAL uses a `backend_stack_size` metadata item,
which at the moment has the same meaning as `stack_frame_size_in_bytes`.
We emit both for now in order to simplify coordination with PAL.
The new item must be emitted in the `shader_functions` section, just as
the metadata for other module entry functions. For simplicity, we mark
chain functions as module entry functions and emit the same metadata for
all of them.
InitUndef pass need replace the implicit def with Undef pseudo, but
current remove method will make noreg2implicit borken.
This patch postpone the removal until all basicblock be processed.