This commit introduces a helper function to DWARFAcceleratorTable::Entry
which follows DW_IDX_Parent attributes to returns the corresponding
parent Entry in the table.
It is tested by enhancing dwarfdump so that it now prints:
1. When data is corrupt.
2. When parent information is present, but the parent is not indexed.
3. The parent entry offset, when the parent is present and indexed. This
is printed in terms a real entry offset (the same that gets printed at
the start of each entry: "Entry @ 0x..."), instead of the encoded number
in the table (which is an offset from the start off the Entry list).
This makes it easy to visually inspect the dwarfdump and check what the
parent is.
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.
This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.
The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.
Fixes https://github.com/llvm/llvm-project/issues/69841.
If we're loading a vector constant that is known to be zero in the upper elements, then attempt to shrink the constant and just scalar load the lower 32/64 bits.
Always chose the vzload/broadcast with the smallest constant load, and prefer vzload over broadcasts for same bitwidth to avoid domain flips (mainly a AVX1 issue).
Fixes#73783
One V_NOP or unrelated VALU instruction in between is required for
correctness when matrix A or B of current WMMA instruction overlaps with
matrix D of previous WMMA instruction.
Remaining cases of WMMA operand overlaps are handled by the hardware and
do not require handling in hazard recognizer.
Hardware may stall in cases where:
- matrix C of current WMMA instruction overlaps with matrix D of
previous WMMA instruction
- VALU instruction reads matrix D of previous WMMA instruction
- matrix A,B or C of WMMA instruction reads result of previous VALU
instruction
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.
TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.
patch 3 from: https://github.com/llvm/llvm-project/pull/73337
The test contains a CHECK line which verifies an .ascii line which originally checks
for 18001970010100000000. After the bump of the compiler version to 19, the test
started to fail with the string now being 19001970010100000000.
This should fix this failing test on bots.
This patch allows VCIX instructions that have side effect to be
reordered
with memory and other side effecting instructions. However we don't want
VCIX instructions to be reordered with each other, so we propose a dummy
register called VCIX_STATE and make these instructions implicitly define
and use
it.
CSR SGPR spilling currently uses the early available physical VGPRs. It
currently imposes a high register pressure while trying to allocate
large VGPR tuples within the default register budget.
This patch changes the spilling strategy by picking the VGPRs in the
reverse order, the highest available VGPR first and later after regalloc
shift them back to the lowest available range. With that, the initial
VGPRs would be available for allocation and possibility
of finding large number of contiguous registers will be more.
Now that the work embedding PGO information in SHT_LLVM_BB_ADDR_MAP ELF
sections has landed, there is no longer a need to keep around the
mbb-profile-dump flag.
This patch adds basic TLSDESC support in the RISC-V backend.
Specifically, we add new relocation types for TLSDESC, as prescribed in
https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/373, and add a
new pseudo instruction to simplify code generation.
This patch does not try to optimize the local dynamic case, which can be
improved in separate patches.
Linker side changes will also be handled separately.
The current implementation is only enabled when passing the new
`-enable-tlsdesc` codegen flag.
Currently, the UnifiedLTO pipeline seems to have trouble with several
LTO features, like SplitLTO units, which means we cannot use important
optimizations like Whole Program Devirtualization or security hardening
instrumentation like CFI.
This patch reverts FatLTO to using distinct pipelines for Full LTO and
ThinLTO. It still avoids module cloning, since that was error prone.
This is the first step towards an alternate shuffle lowering design for
the general two vector argument case. The goal is to leverage the
existing lowering for single vector permutes to avoid as many of the
vrgathers as required - even if we do need the other.
This patch handles only the first argument, and is arguably a slightly
weird half-step. However, the test changes from the full two argument
recurse patch are a lot harder to reason about. Taking this half step
gives much more easily reviewable changes, and is thus worthwhile. I
intend to post the patch for the second argument once this has landed.
If we have a shuffle which is larger than m1, we may be able to split it
into a series of individual m1 shuffles. This patch starts with the
subcase where the mask allows a 1-to-1 mapping from source register to
destination register - each with a possible permutation of their own. We
can potentially extend this later, thought in practice this seems to
already catch a number of the most interesting cases.
int_amdgcn_global_load_tr did not specify non-temporal load transpose,
thus we should
not genetrate the non-temporal hint for the load. We need to implement
getTgtMemIntrinsic
to create the corresponding MemSDNode. And we don't set the non-temporal
flag because
the intrinsic did not specify it.
NOTE: We need to implement getTgtMemIntrinsic for any memory intrinsics.
Fixes#78897 - although the test case still has a number of poor codegen issues (in particular for i686 triples) that will need addressing (combining the nodes in topological order should help).
The RemoveDIs project is aiming to eliminate debug intrinsics like
dbg.value and dbg.declare from LLVM, and replace them with DPValue objects
attached to instructions. ISel is one of the "terminals" where that
information needs to be converted into MIR format: this patch implements
support for that in GlobalISel. We aim for the output of LLVM to be
identical with/without RemoveDIs debug-info.
This patch should be NFC, as we're handling the same data about variables
stored in a different format -- it now appears in a DPValue object rather
than as an intrinsic. To that end, I've refactored the handling of
dbg.values into a dedicated function, and call it whenever a dbg.value or a
DPValue is encountered. dbg.declare is handled in a similar way.
Testing: adding the --try-experimental-debuginfo-iterators switch to llc
causes it to try and convert to the "new" debug-info format if it's built
in (LLVM_EXPERIMENTAL_DEBUGINFO_ITERATORS=On), and it'll be covered by our
buildbot. One test has a few extra wildcard-regexes added: this is because
there's some extra data printed about attached debug-info, which is safe to
ignore.
This is /almost/ NFC - the only annoyance is that for some reason we were using "<C1,C2,..>" for ConstantVector types unlike all other cases - these now use the same "[C1,C2,..]" format as the other constant printers.
Allows shuffle to fold constant vectors that have already been lowered to constant pool - shuffle combining can then constant fold this.
Noticed while triaging #79100
This reverts commit 955417ade2648c2b1a4e5f0be697f61570590a88.
The problem with the previous version was that the bundle instruction
had arguments like "target arg1 arg2". When it's expanded we produced a
BL or BLR which can only accept one argument, the target of the branch.
Now I realise why expandCALL_RVMARKER has to copy them in mutiple steps.
The operands for the called function need to be changed to implicit
arguments of the branch instruction.
* Copy the branch target.
* Copy all register operands, marking them as implicit.
* Copy any other operands without modifying them.
Prior to any attempt to fix#77915:
BL @_setjmp, csr_aarch64_aapcs, implicit-def $lr, implicit $sp,
implicit-def dead $lr, implicit $sp, implicit-def $sp
Which is dropping the use of the arguments for the called function.
My first fix attempt produced:
BL @_setjmp, $x0, $w1, <regmask $fp ...>, implicit-def $lr, implicit
$sp, implicit-def dead $lr, implicit $sp, implicit-def $sp
It copied the arguments but as explicit arguments to the BL which only
expects 1, failing verification.
With this new change we produce:
BL @_setjmp, csr_aarch64_aapcs, implicit-def $lr, implicit $sp, implicit
$x0, implicit $w1, implicit-def dead $lr, implicit $sp, implicit-def $sp
Note specifically the added "implicit $x0, implicit $w1". So BL only has
1 explicit argument, but the arguments to the function are still used.
Ensure that getVLENFactoredAmount does not fail when the scale amount
requires the use of a non-trivial multiplication but the M extension is
not enabled. In such case, perform the multiplication using shifts and
adds.
This combines the previously posted patches with some additional work
I've done to more closely match MSVC output.
Most of the important logic here is implemented in
AArch64Arm64ECCallLowering. The purpose of the
AArch64Arm64ECCallLowering is to take "normal" IR we'd generate for
other targets, and generate most of the Arm64EC-specific bits:
generating thunks, mangling symbols, generating aliases, and generating
the .hybmp$x table. This is all done late for a few reasons: to
consolidate the logic as much as possible, and to ensure the IR exposed
to optimization passes doesn't contain complex arm64ec-specific
constructs.
The other changes are supporting changes, to handle the new constructs
generated by that pass.
There's a global llvm.arm64ec.symbolmap representing the .hybmp$x
entries for the thunks. This gets handled directly by the AsmPrinter
because it needs symbol indexes that aren't available before that.
There are two new calling conventions used to represent calls to and
from thunks: ARM64EC_Thunk_X64 and ARM64EC_Thunk_Native. There are a few
changes to handle the associated exception-handling info,
SEH_SaveAnyRegQP and SEH_SaveAnyRegQPX.
I've intentionally left out handling for structs with small
non-power-of-two sizes, because that's easily separated out. The rest of
my current work is here. I squashed my current patches because they were
split in ways that didn't really make sense. Maybe I could split out
some bits, but it's hard to meaningfully test most of the parts
independently.
Thanks to @dpaoliello for extensive testing and suggestions.
(Originally posted as https://reviews.llvm.org/D157547 .)
VGPRs used for spilling do not require explicit reservation with MRI.
freezeReservedRegs() executed before register allocation ensures these
are placed in the reserve set.
The only pass after SILowerSGPRSpills is SIPreAllocateWWMRegs which
explicitly tests for interference before register allocation so should
not reuse a WWM VGPR holding spill data. reserveReg prevents calculation
of correct liveness for physical registers which could be used to extend
SIPreAllocateWWMRegs.
Although there are predicated versions of minnum/maxnum, the ones for
minimum/maximum are currently missing. This patch introduces these
intrinsics and implements their lowering to RISC-V.
Some operations behave like selects. For example `or(zext(c), y)` is the
same as select(c, y|1, y)` and instcombine can canonicalize the select
to the or form. These operations can still be worthwhile converting to
branch as opposed to keeping as a select or or instruction.
This patch attempts to add some basic handling for them, creating a
SelectLike abstraction in the select optimization pass. The backend can
opt into handling `or(zext(c),x)` as a select if it could be profitable,
and the select optimization pass attempts to handle them in much the
same way as a `select(c, x|1, x)`. The Or(x, 1) may need to be added as
a new instruction, generated as the or is converted to branches.
This helps fix a regression from selects being converted to or's
recently.