The target should not have to construct MachineIRBuilders during
RegBankSelect (we should perhaps hide the constructors for it). The
pass should own the builder setup with the desired CSE configuration
(although currently the pass does not use the CSE builder, which is
what I want to fix).
https://reviews.llvm.org/D156479
If the pre-truncated value was the same width as the extension, and the assertzext guarantees that the extended bits are already zero, then skip the zext/trunc 'zero_extend_inreg' pattern.
Addresses several regressions noticed in D155472
This modifies the switch-statement generation in SelectionDAGBuilder,
specifically the part that generates case clusters of type CC_JumpTable.
A table-based branch of any kind is at risk of being a JOP gadget, if
it doesn't range-check the offset into the table. For some types of
table branch, such as Arm TBB/TBH, the impact of this is limited
because the value loaded from the table is a relative offset of
limited size; for others, such as a MOV PC,Rn computed branch into a
table of further branch instructions, the gadget is fully general.
When compiling for branch-target enforcement via Arm's BTI system,
many of these table branch idioms use branch instructions of types
that do not require a BTI instruction at the branch destination. This
avoids the need to put a BTI at the start of each case handler,
reducing the number of available gadgets //with// BTIs (i.e. ones
which could be used by a JOP attack in spite of the BTI system). But
without a range check, the use of a non-BTI-requiring branch also
opens up a larger range of followup gadgets for an attacker's use.
A defence against this is to avoid optimising away the range check on
the table offset, even if the compiler believes that no out-of-range
value should be able to reach the table branch. (Rationale: that may
be true for values generated legitimately by the program, but not
those generated maliciously by attackers who have already corrupted
the control flow.)
The effect of keeping the range check and branching to an unreachable
block is that no actual code is generated at that block, so it will
typically point at the end of the function. That may still cause some
kind of unpredictable code execution (such as executing data as code,
or falling through to the next function in the code section), but even
if so, there will only be //one// possible invalid branch target,
rather than giving an attacker the choice of many possibilities.
This defence is enabled only when branch target enforcement is in use.
Without branch target enforcement, the range check is easily bypassed
anyway, by branching in to a location just after it. But with
enforcement, the attacker will have to enter the jump table dispatcher
at the initial BTI and then go through the range check. (Or, if they
don't, it's because they //already// have a general BTI-bypassing
gadget.)
Reviewed By: MaskRay, chill
Differential Revision: https://reviews.llvm.org/D155485
The option is used to force the use of resource intervals
in the machine scheduler, effectively ignoring the value of
`EnableIntervals` in the instance of the `SchedMachineModel`.
Reviewed By: anemet
Differential Revision: https://reviews.llvm.org/D156540
Introduced the convergent equivalent of the existing G_INTRINSIC opcodes:
- G_INTRINSIC_CONVERGENT
- G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS
Out of the targets that currently have some support for GlobalISel, the patch
assumes that the convergent intrinsics only relevant to SPIRV and AMDGPU.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D154766
Change the scheduler's physical register dependency tracking from
registers-and-their-aliases to regunits. This has a couple of advantages
when subregisters are used:
- The dependency tracking is more accurate and creates fewer useless
edges in the dependency graph. An AMDGPU example, edited for clarity:
SU(0): $vgpr1 = V_MOV_B32 $sgpr0
SU(1): $vgpr1 = V_ADDC_U32 0, $vgpr1
SU(2): $vgpr0_vgpr1 = FLAT_LOAD_DWORDX2 $vgpr0_vgpr1, 0, 0
There is a data dependency on $vgpr1 from SU(0) to SU(1) and from
SU(1) to SU(2). But the old dependency tracking code also added a
useless edge from SU(0) to SU(2) because it thought that SU(0)'s def
of $vgpr1 aliased with SU(2)'s use of $vgpr0_vgpr1.
- On targets like AMDGPU that make heavy use of subregisters, each
register can have a huge number of aliases - it can be quadratic in
the size of the largest defined register tuple. There is a much lower
bound on the number of regunits per register, so iterating over
regunits is faster than iterating over aliases.
The LLVM compile-time tracker shows a tiny overall improvement of 0.03%
on X86. I expect a larger compile-time improvement on targets like
AMDGPU.
Differential Revision: https://reviews.llvm.org/D156552
The current implementation generates a csect with a
".rodata.str.x.y" prefix for a MergeableCString variable definition.
However, a reference to such variable does not get the prefix in its
name because there's not enough information in the containing IR.
In particular, without seeing the initializer and absent of some other
indicators, we cannot tell that the referenced variable is a null-
terminated string.
When the AIX codegen in llvm was being developed, the prefixing was copied
from ELF without having the linker take advantage of the info.
Currently, the AIX linker does not have the capability to merge
MergeableCString variables. If such feature would ever get implemented,
the contract between the linker and compiler would have to be reconsidered.
Here's the before and after of this change:
```
@a = global i64 320255973571806, align 8
@strA = unnamed_addr constant [7 x i8] c"hello\0A\00", align 1 ;; Mergeable1ByteCString
@strB = unnamed_addr constant [8 x i8] c"Blahah\0A\00", align 1 ;; Mergeable1ByteCString
@strC = unnamed_addr constant [2 x i16] [i16 1, i16 0], align 2 ;; Mergeable2ByteCString
@strD = unnamed_addr constant [2 x i16] [i16 1, i16 1], align 2 ;; !isMergeableCString
@strE = external unnamed_addr constant [2 x i16], align 2
-fdata-sections:
.text extern .rodata.str1.1strA .text extern strA
0 SD RO 0 SD RO
.text extern .rodata.str1.1strB .text extern strB
0 SD RO 0 SD RO
.text extern .rodata.str2.2strC ===> .text extern strC
0 SD RO 0 SD RO
.text extern strD .text extern strD
0 SD RO 0 SD RO
.data extern a .data extern a
0 SD RW 0 SD RW
undef extern strE undef extern strE
0 ER UA 0 ER UA
-fno-data-sections:
.text unamex .rodata.str1.1 .text unamex .rodata
0 SD RO 0 SD RO
.text extern strA .text extern strA
0 LD RO 0 LD RO
.text extern strB .text extern strB
0 LD RO 0 LD RO
.text unamex .rodata.str2.2 ===> .text extern strC
0 SD RO 0 LD RO
.text extern strC .text extern strD
0 LD RO 0 LD RO
.text unamex .rodata .data unamex .data
0 SD RO 0 SD RW
.text extern strD .data extern a
0 LD RO 0 LD RW
.data unamex .data undef extern strE
0 SD RW 0 ER UA
.data extern a
0 LD RW
undef extern strE
0 ER UA
```
Reviewed by: David Tenty, Fangrui Song
Differential Revision: https://reviews.llvm.org/D156202
This adds better support for call frame pseudos that adjust SP in
PEI::replaceFrameIndicesBackward.
Running frame index elimination backwards is preferred because it can
do backwards register scavenging (on targets that require scavenging)
which does not rely on accurate kill flags.
Differential Revision: https://reviews.llvm.org/D156434
Record the call frame size on entry to each basic block. This is usually
zero except when a basic block has been split in the middle of a call
sequence.
This simplifies PEI::replaceFrameIndices which previously had to visit
basic blocks in a specific order and had special handling for
unreachable blocks. More importantly it paves the way for an equally
simple implementation of a backwards version of replaceFrameIndices,
which is required to fully convert PrologEpilogInserter to backwards
register scavenging, which is preferred because it does not rely on
accurate kill flags.
Differential Revision: https://reviews.llvm.org/D156113
And dependent commits.
Details in D150388.
This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.
No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
Reviewed By: alexfh
Differential Revision: https://reviews.llvm.org/D156381
Fix the GenericSSAContext template so that it actually declares all the
necessary typenames and the methods that must be implemented by its
specializations SSAContext and MachineSSAContext.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156288
hasPredecessorHelper method, that is used by DAGCombiner to combine to pre-indexed and post-indexed load/stores, is a major source of slowdown while compiling a large function with MSan enabled on Arm. This patch caps the DFS-graph traversal for this method to 8192 which cuts compile time by 50% (4m -> 2m compile time) at the cost of less overall nodes combined.
Here's the summary of pre-index DAG nodes created and time it took to compile the pathological case with different MaxDepth limit:
1. With MaxDepth = 0 (unlimited): 1800, took 4m
2. With MaxDepth = 32k, 560, took 2m31s
3. With MaxDepth = 8k, 139, took 2m.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D154885
This reverts commit d20e4a1d68aa8e14c4e524e4d4eeb4445acac401.
After committing 2ee4d0386c783f58abe708298228de648239b435, We don't support subprogram definitions nested within `DICompositeType` when doing LTO builds.
For a detailed discussion, see https://reviews.llvm.org/D152095.
Use the maximum 64 for BitWidth of getVScaleRange to avoid returning an empty range.
the previous changes bring in a Buildbot failure because MinSVEVectorSize = MinSVEVectorSize.
error: explicitly assigning value of variable of type 'unsigned int' to itself [-Werror,-Wself-assign]
Reviewed By: sdesmalen, nikic, dmgreen
Differential Revision: https://reviews.llvm.org/D155708
Use the maximum 64 for BitWidth of getVScaleRange to
avoid returning an empty range.
Reviewed By: sdesmalen, nikic, dmgreen
Differential Revision: https://reviews.llvm.org/D155708
Summary: D80642 added support for emitting AvailableExternally Linkage on AIX. However, an assertion of "Trying to get csect representation of this symbol but none was set." occurred when a function is declared as available_externally. This is due to we missing to generate a csect for the function. This patch fixes it.
Reviewed By: hubert.reinterpretcast, shchenz
Differential Revision: https://reviews.llvm.org/D156213
Signed-off-by: Esme Yi <esme.yi@ibm.com>
When inline assembly code requests more registers than available, the
MachineInstr::emitError function in the RegAllocFast pass emits an error
but doesn't stop the pass, and then the compiler crashes later with an
assertion failure. This commit, mimicking the RegAllocGreedy pass, assigns
a random physical register, and therefore avoids the crash after producing
the diagnostic. This problem has been observed for both rustc and clang,
while it doesn't occur in gcc.
In this example an implicit def had live-out undef subrange
defs. After coalescing with the def from a previous block, the
undef-defed lanes are no longer live out of the block in the new
interval. An empty subrange was tenatively created for these lanes,
but it must be deleted.
A live out implicit_def wasn't deleted, but the subranges weren't
correctly updated. The main range was correct but the def
corresponding to the initial main range def instruction was missing
from the lanes redefined in another block.
The written lanes are not quite the same as the valid lanes in the
case of an implicit_def.
Fixes verifier error in blender. There is an additional verifier in
some of the testcase variants where an empty subrange remains.
During the construction of SelectionDAG, there are no explicit canonicalization rules to adjust the order of operands for AND nodes. This may prevent the optimization in DAGCombiner::visitANDLike from being triggered. This patch canonicalizes the operands before matches, which can be observed to improve optimization on the RISC-V target architecture.
Canonicalize:
```
and(x, add) -> and(add, x)
```
Signed-off-by: WANG Rui <wangrui@loongson.cn>
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154760
The gain is usually suffiscient to go the extra mile and reconstruct a carry in some cases.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154533
These can be triggered by in various ways when intrinsics are used wrong or a target doesn't correctly
not support something. Using a fatal error prevents strange behavior
like infinite loops.
We already do this for some of the vector type legalization handles.
Because branch relaxation needs to factor in if branches target
a block in the same section or a different one, it needs to run
after the Basic Block Sections / Machine Function Splitting passes.
Because Jump table compression relies on block offsets remaining
fixed after the table is compressed, we must also move the JT
compression pass.
The only tests affected are ones enforcing just the ordering and
the a few that have basic block ids changed because RenumberBlocks
hasn't run yet.
Differential Revision: https://reviews.llvm.org/D153829
We only attempted to determine KnownBits for uniform constant shift amounts, but ComputeKnownBits is able to handle some non-uniform cases as well that we can use as a fallback.
Entries of the same DJB hash are in the hash lookup table/name table are
ordered by the iteration order of `Entries` (a StringMap). Change
`Entries` to a MapVector to stabilize the order and simplify future
changes like D142862 that change the StringMap hash function.
Following recent changes switching from xxh64 to xxh32 for better
hashing performance. This particular instance may or may not have
noticeable performance difference, but this change makes us toward
removing xxHash64.
Refactor to use BasicBlockUtils functions and make life easier for
a subsequent patch for updating the dominator tree.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D154053
AArch64 introduced CMLA and CADD instructions as part of SVE2. This
change allows to generate such instructions when this architecture
feature is available.
Differential Revision: https://reviews.llvm.org/D153808
In D152577 @xur has a post-submit comment regarding to an awkward usage
of MFS for Autofdo - instead of just using -fsplit-machine-function, the
user needs to add "-mllvm -mfs-psi-cutoff=0" to choose the right logic
for AutoFDO. The compiler should choose the right default values for
such case.
This CL separate MFS logic for different profile types.
Reviewed By: xur, wenlei
Differential Revision: https://reviews.llvm.org/D155253
I was attempting to add llvm.reduce.fminimum/fmaximum support for GlobalISel.
In the process I noticed that llvm.reduce.fmin/fmax was missing, and could do
with being added first. That led on to adding additional vector support for
minnum/maxnum, which in turn led to needing to handle fptrunc and fpext for
some of the fp16 types. So this patch extends the vector handling for fptrunc,
adding support for f16 types which are clamped to 4 elements, and scalarizing
the rest.
I went round in circles a little with how smaller than legal vectors should be
handled, but this seems simple and seems to work, if not always optimally yet.
Differential Revision: https://reviews.llvm.org/D155311