The current implementation of aliases tries to remove all the aliases in
the module to prevent the generic version of `AsmPrinter` from emitting
them incorrectly. Unfortunately, if the aliases are used this will fail.
Instead let's override the function to print aliases directly.
In addition, the declarations of the alias functions must occur before
the uses. To fix this we emit alias declarations as part of
`emitDeclarations` and only emit the `.alias` directives at the end
(where we can assume the aliasee has also already been declared).
With this, I get a clean test suite running under RemoveDIs, the
non-intrinsic representation of debug-info, including under asan. We've
previously established that we generate identical binaries for some
large projects, so this i just edge-case cleanup. The changes:
* CodeGenPrepare fixups need to apply to dbg.assigns as well as
dbg.values (a dbg.assign is a dbg.value).
* Pin a test for constant-deletion to intrinsic debug-info: this very
rare scenario uses a different kill-location sigil in dbg.value mode to
RemoveDIs mode, which generates spurious test differences.
* Suppress a memory leak in a unit test: the code for dealing with
trailing debug-info in a block is necessarily fiddly, leading to this
leak when testing it. Developer-facing interfaces for moving
instructions around always deal with this behind the scenes.
* SROA, when replacing some vector-loads, needs to insert the
replacement loads ahead of any debug-info records so that their values
remain dominated by a definition. Set the head-bit indicating our
insertion should come before debug-info.
There are some rare circumstances where dbg.assign intrinsics can reach
FastISel. They are a more specialised kind of dbg.value intrinsic with
more information about the originating alloca. They only occur during
optimisation, but might reach FastISel through always_inlining an
optimised function into an optnone function.
This is a slight problem as it's not safe (for debug-info accuracy) to
ignore any intrinsics, and for RemoveDIs (the intrinsic-replacement
project) it causes a crash through an unhandled switch case. To get
around this, we can just treat the dbg.assign as a dbg.value (it's an
actual subclass) and use the variable location information from the
dbg.value fields. This loses a small amount of debug-info about stack
locations, but is more accurate than just ignoring the intrinsic.
(This has popped up deep in an LTO build of a large codebase while
testing RemoveDIs, I figured it'd be good to fix it for the
intrinsic-form at the same time, just to demonstrate the correct
behaviour).
This handles two cases where we can work out some known-zero bits for
ISD::STEP_VECTOR.
The first case handles when we know the low bits are zero because the
step
amount is a power of two. This is taken from
https://reviews.llvm.org/D128159,
and even though the original patch didn't end up landing this case due
to it
not having any test difference, I've included it here for completeness's
sake.
The second case handles the case when we have an upper bound on
vscale_range.
We can use this to work out the upper bound on the number of elements,
and thus
what the maximum step will be. From the maximum step we then know which
hi bits
are zero.
On its own, computing the known hi bits results in some small
improvements for
RVV with -mrvv-vector-bits=zvl across the llvm-test-suite. However I'm
hoping
to be able to use this later to reduce the LMUL in index calculations
for
vrgather/indexed accesses.
---------
Co-authored-by: Philip Reames <preames@rivosinc.com>
With the legacy pass manager, MachineModuleInfoWrapperPass owned the
MachineModuleInfo used in the codegen pipeline. It can do this since
it's an ImmutablePass that doesn't get invalidated.
However, with the new pass manager, it is legal for the
ModuleAnalysisManager to clear all of its analyses, regardless of if the
analysis does not want to be invalidated. So we must move ownership of
the MachineModuleInfo outside of the analysis (this is similar to
PassInstrumentation). For now, make the PassBuilder user register a
MachineModuleAnalysis that returns a reference to a MachineModuleInfo
that the user owns. Perhaps we can find a better place to own the
MachineModuleInfo to make using the codegen pass manager less cumbersome
in the future.
This gives the target a chance to keep an atomicrmw op that is smaller
than the minimum cmpxchg size. This is needed to support the Zabha
extension for RISC-V which provides i8/i16 atomicrmw operations, but
does not provide an i8/i16 cmpxchg or LR/SC instructions.
This moves the widening until after the target requests
LLSC/CmpXChg/MaskedIntrinsic expansion. Once we widen, we call
shouldExpandAtomicRMWInIR again to give the target another chance to
make a decision about the widened operation.
I considered making the targets return AtomicExpansionKind::Expand or a
new expansion kind for And/Or/Xor, but that required the targets to
special case And/Or/Xor which they weren't currently doing.
This function can be called from buildCopyToRegs where at least one of
the types is a scalable vector type. This function crashed because it
did not know how to handle scalable vector types.
This patch extends the functionality of getGCDType to handle when at
least one of the types is a scalable vector. getGCDType between a fixed
and scalable vector is not implemented since the docstring of the
function explains that getGCDType is used to build MERGE/UNMERGE
instructions and we will never build a MERGE/UNMERGE between fixed and
scalable vectors.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Don't call TLI.SimplifyDemandedVectorElts directly from every SimplifyDemandedBits call, use the more expressive wrappers instead first.
This reduces the number of places we call TLI.SimplifyDemandedVectorElts and CommitTargetLoweringOpt to make it easier to track.
Part of the work to process DAG nodes in topological order.
If part of a register (lowered from REG_SEQUENCE) is undefined then we
should propagate undef flags to uses of those lanes. This is only
performed when live intervals are present as it requires live intervals
to correctly match uses to defs, and the primary goal is to allow
precise computation of subrange intervals.
This function can be called from buildCopyToRegs where at least one of
the types is a scalable vector type. This function crashed because it
did not know how to handle scalable vector types.
This patch extends the functionality of getLCMType to handle when at
least one of the types is a scalable vector. getLCMType between a fixed
and scalable vector is not implemented since the docstring of the
function explains that getLCMType is used to build MERGE/UNMERGE
instructions and we will never build a MERGE/UNMERGE between fixed and
scalable vectors.
Since we used getNumRegisters right before this, I think this is the
correct interface we should be using here.
I'm experimenting with making i32 legal on RISC-V 64, but using i64 for
the register type between basic blocks. This was one of the first issues
I found trying to do that.
The comment and code here seems to match getTypeForExtReturn. The
history shows that at the time this code was added, similar code existed
in SelectionDAGBuilder. SelectionDAGBuiler code has since been
refactored into getTypeForExtReturn.
This patch makes FastISel match SelectionDAGBuilder.
The test changes are because X86 has customization of
getTypeForExtReturn. So now we only extend returns to i8.
Stumbled onto this difference by accident.
If we have a `SETCC (SETCC), 0, NE` and ZeroOrOneBooleanContent, we can remove
the outer setcc as it will produce the same value as the inner. This can be
generalized to anything where the top bits are known to be 0, as the value will
remain as 1 or 0.
Fixes https://github.com/llvm/llvm-project/issues/80744. This transform
doesn't handled vectors at all, The fixed length ones pass the first
check, but would fail the constant operand checks which immediate follow.
This patch takes the simplest approach, and just guards the transform
for scalar integers.
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.
TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.
patch 3 from: https://github.com/llvm/llvm-project/pull/73337
If we have a shifted mask, we may be able to reduce the load width
to the width of the non-zero part of the mask and use an offset
to the base address to remove the srl. The offset is given by
C+trailingzeros(ShiftedMask).
Then we add a final shl to restore the trailing zero bits.
I've use the ARM test because that's where the existing (and (srl
(load))) tests were.
The X86 test was modified to keep the H register.
The legacy version print machine functions to a string stream, then
output the module and string in `doFinalization`. This patch break
`MIRPrintingPass` into two parts `PrintMIRPreparePass` and
`PrintMIRPass`. `PrintMIRPreparePass` output the original IR in yaml
string, `PrintMIRPass` just print the machine function, so we can avoid
the `doFinalization`.
Prior to this patch, SelectionDAG generated aligned move onto stacks for
AVX registers when the function was marked as a no-realign-stack
function. This lead to misalignment between the stack and the
instruction generated. This patch fixes the issue.
Fixes#77730
Currently, half operations can be promoted in one of two ways.
* If softPromoteHalfType() returns false, fp16 values are passed around
in fp32 registers, and whole chains of fp16 operations are promoted to
fp32 in one go.
* If softPromoteHalfType() returns true, fp16 values are passed around
in i16 registers, and individual fp16 operations are promoted to fp32
and the result truncated to fp16 right away.
The softPromoteHalfType behavior is necessary for correctness, but
changing this for an existing target breaks the ABI. Therefore, this
commit adds a third option:
* If softPromoteHalfType() returns true and useFPRegsForHalfType()
returns true as well, fp16 values are passed around in fp32 registers,
but individual fp16 operations are promoted to fp32 and the result
truncated to fp16 right away.
This change does not yet update any target to make use of it.
Today `-split-machine-functions` and `-fbasic-block-sections={all,list}`
cannot be combined with `-basic-block-sections=labels` (the labels
option will be ignored).
The inconsistency comes from the way basic block address map -- the
underlying mechanism for basic block labels -- encodes basic block
addresses
(https://lists.llvm.org/pipermail/llvm-dev/2020-July/143512.html).
Specifically, basic block offsets are computed relative to the function
begin symbol. This relies on functions being contiguous which is not the
case for MFS and basic block section binaries. This means Propeller
cannot use binary profiles collected from these binaries, which limits
the applicability of Propeller for iterative optimization.
To make the `SHT_LLVM_BB_ADDR_MAP` feature work with basic block section
binaries, we propose modifying the encoding of this section as follows.
First let us review the current encoding which emits the address of each
function and its number of basic blocks, followed by basic block entries
for each basic block.
| | |
|--|--|
| Address of the function | Function Address |
| Number of basic blocks in this function | NumBlocks |
| BB entry 1
| BB entry 2
| ...
| BB entry #NumBlocks
To make this work for basic block sections, we treat each basic block
section similar to a function, except that basic block sections of the
same function must be encapsulated in the same structure so we can map
all of them to their single function.
We modify the encoding to first emit the number of basic block sections
(BB ranges) in the function. Then we emit the address map of each basic
block section section as before: the base address of the section, its
number of blocks, and BB entries for its basic block. The first section
in the BB address map is always the function entry section.
| | |
|--|--|
| Number of sections for this function | NumBBRanges |
| Section 1 begin address | BaseAddress[1] |
| Number of basic blocks in section 1 | NumBlocks[1] |
| BB entries for Section 1
|..................|
| Section #NumBBRanges begin address | BaseAddress[NumBBRanges] |
| Number of basic blocks in section #NumBBRanges |
NumBlocks[NumBBRanges] |
| BB entries for Section #NumBBRanges
The encoding of basic block entries remains as before with the minor
change that each basic block offset is now computed relative to the
begin symbol of its containing BB section.
This patch adds a new boolean codegen option `-basic-block-address-map`.
Correspondingly, the front-end flag `-fbasic-block-address-map` and LLD
flag `--lto-basic-block-address-map` are introduced.
Analogously, we add a new TargetOption field `BBAddrMap`. This means BB
address maps are either generated for all functions in the compiling
unit, or for none (depending on `TargetOptions::BBAddrMap`).
This patch keeps the functionality of the old
`-fbasic-block-sections=labels` option but does not remove it. A
subsequent patch will remove the obsolete option.
We refactor the `BasicBlockSections` pass by separating the BB address
map and BB sections handing to their own functions (named
`handleBBAddrMap` and `handleBBSections`). `handleBBSections` renumbers
basic blocks and places them in their assigned sections.
`handleBBAddrMap` is invoked after `handleBBSections` (if requested) and
only renumbers the blocks.
- New tests added:
- Two tests basic-block-address-map-with-basic-block-sections.ll and
basic-block-address-map-with-mfs.ll to exercise the combination of
`-basic-block-address-map` with `-basic-block-sections=list` and
'-split-machine-functions`.
- A driver sanity test for the `-fbasic-block-address-map` option
(basic-block-address-map.c).
- An LLD test for testing the `--lto-basic-block-address-map` option.
This reuses the LLVM IR from `lld/test/ELF/lto/basic-block-sections.ll`.
- Renamed and modified the two existing codegen tests for basic block
address map (`basic-block-sections-labels-functions-sections.ll` and
`basic-block-sections-labels.ll`)
- Removed `SHT_LLVM_BB_ADDR_MAP_V0` tests. Full deprecation of
`SHT_LLVM_BB_ADDR_MAP_V0` and `SHT_LLVM_BB_ADDR_MAP` version less than 2
will happen in a separate PR in a few months.
The stack slot coloring pass is concerned with optimizing spill
slots. If any change is a pass is made over the function to remove
stack stores that use the same register and stack slot as an
immediately preceding load.
The register check is too simple for constant registers like AArch64
and RISC-V's zero register. This register can be used as the result
of a load if we want to discard the result, but still have the memory
access performed. Like for a volatile or atomic load.
If the code sees a load from the zero register followed by a store
of the zero register at the same stack slot, the pass mistakenly
believes the store isn't needed.
Since the main stack coloring optimization is only concerned with
spill slots, it seems reasonable that RemoveDeadStores should
only be concerned with spills. Since we never generate a reload of
x0, this avoids the issue seen by RISC-V.
Test case concept is adapted from pr30821.mir from X86. That test
had to be updated to mark the stack slot as a spill slot.
Fixes#80052.
RegisterBank Selection for scalable vector G_ADD and G_SUB by creating
new mappings for different types of vector register banks.
Then implement Instruction Selection for the same operations by choosing
the correct RISC-V vector register class.
This patch adds support for common and local symbols in the TOC for AIX.
Note that we need to update isVirtualSection so as a common symbol in
TOC will have the symbol type XTY_CM and will be initialized when placed
in the TOC so sections with this type are no longer virtual.
---------
Co-authored-by: Zaara Syeda <syzaara@ca.ibm.com>
https://github.com/llvm/llvm-project/pull/78171 added support for
non-consecutive local value numbers. This extends the support for global
value numbers (for globals and functions).
This means that it is now possible to delete an unnamed global
definition/declaration without breaking the IR.
This is a lot less common than unnamed local values, but it seems like
something we should support for consistency. (Unnamed globals are used a
lot in Rust though.)
Extra space causes the checks generated by update_mir_test_checks to be
unavailable.
```
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s
---
name: foo
body: |
; CHECK-LABEL: name: foo
; CHECK: bb.0:
; CHECK-NEXT: successors:
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.1:
; CHECK-NEXT: RET 0, $eax
bb.0:
successors:
bb.1:
RET 0, $eax
...
```
The failure log is as follows:
```
llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match
; CHECK-NEXT: {{ $}}
^
<stdin>:21:13: note: 'next' match was here
successors:
^
<stdin>:21:13: note: previous match ended here
successors:
```
This reverts commit f8525030004f907cd108e7c18df255a6d3b23124.
It was supposed to speed things up but llvm-compile-time-tracker.com
showed a slight slow down.
Previously we called ignoreCSRForAllocationOrder on every alias of every
CSR which was expensive on targets like AMDGPU which define a very large
number of overlapping register tuples.
On such targets it is simpler and faster to call
ignoreCSRForAllocationOrder once for every physical register.
Differential Revision: https://reviews.llvm.org/D146735
This is a fix for the regression seen in
https://github.com/llvm/llvm-project/pull/79498
> Currently, the way that recomputeLiveIns works is that it will
recompute the livein registers for that MachineBasicBlock but it matters
what order you call recomputeLiveIn which can result in incorrect
register allocations down the line.
Now we do not recompute the entire CFG but we do ensure that the newly
added MBB do reach convergence.
32-bit ARMv6 with thumb doesn't support MULHS/MUL_LOHI as legal/custom
nodes during expansion which will cause fixed point multiplication of
_Accum types to fail with fixed point arithmetic. Prior to this, we just
happen to use fixed point multiplication on platforms that happen to
support these MULHS/MUL_LOHI.
This patch attempts to check if the multiplication can be done via
libcalls, which are provided by the arm runtime. These libcall attempts
are made elsewhere, so this patch refactors that libcall logic into its
own functions and the fixed point expansion calls and reuses that logic.
Change the implementation of getLastCalleeSavedAlias to use RegUnits
instead of register aliases. This is much faster on targets like AMDGPU
which define a very large number of overlapping register tuples.
No functional change intended. If PhysReg overlaps multiple CSRs then
getLastCalleeSavedAlias(PhysReg) could conceivably return a different
arbitrary one, but currently it is only used for some debug printing
anyway.
Differential Revision: https://reviews.llvm.org/D146734