Rather then defining these tags in each object file that requires them
we can can declare them as undefined and require that they defined
externally in, for example, compiler-rt or libcxxabi.
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strlen routine is a millicode implementation;
we use millicode for the strlen function instead of a library call to
improve performance.
In this commit:
(1) Added new pass manager support for `ReachingDefAnalysis`.
(2) Added printer pass.
(3) Make old pass manager use `ReachingDefInfoWrapperPass`
This is a generalization of the LookupPtrRegClass mechanism.
AMDGPU has several use cases for swapping the register class of
instruction operands based on the subtarget, but none of them
really fit into the box of being pointer-like.
The current system requires manual management of an arbitrary integer
ID. For the AMDGPU use case, this would end up being around 40 new
entries to manage.
This just introduces the base infrastructure. I have ports of all
the target specific usage of PointerLikeRegClass ready.
If we can't fold a PTRADD's offset into its users, lowering them to
disjoint ORs is preferable: Often, a 32-bit OR instruction suffices
where we'd otherwise use a pair of 32-bit additions with carry.
This needs to be a DAGCombine (and not a selection rule) because its
main purpose is to enable subsequent DAGCombines for bitwise operations.
We don't want to just turn PTRADDs into disjoint ORs whenever that's
sound because this transform loses the information that the operation
implements pointer arithmetic, which AMDGPU for instance needs when
folding constant offsets.
For SWDEV-516125.
This PR adds a TargetLowering hook, canTransformPtrArithOutOfBounds,
that targets can use to allow transformations to introduce out-of-bounds
pointer arithmetic. It also moves two such transformations from the
AMDGPU-specific DAG combines to the generic DAGCombiner.
This is motivated by target features like AArch64's checked pointer
arithmetic, CPA, which does not tolerate the introduction of
out-of-bounds pointer arithmetic.
There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp
that check for ISD::ADD in a pointer context, but as far as I can tell
those are only relevant for 32-bit pointer arithmetic (like frame
indices/scratch addresses and LDS), for which we don't enable PTRADD
generation yet.
For SWDEV-516125.
Previously this would only accept full copy hints. This relaxes
this to accept some subregister copies. Specifically, this now
accepts:
- Copies to/from physical registers if there is a compatible
super register
- Subreg-to-subreg copies
This has the potential to repeatedly add the same hint to the
hint vector, but not sure if that's a real problem.
This actually reverts 418120556398c01550d42500d56e6d328290185b.
The original commit omits unit with all symbols inlined into current
one, which leads to crash when a module using split-dwarf inlined a
function from another module with mismatched split-dwarf-inlining
option. This revert guarantees that DIEs are created in both DWO and the
skeleton sections whenever split-dwarf is active.
Most of the pass works in terms of MachineBasicBlock::iterator
(MachineInstrBundleIterator), but here one is constructed from an
arbitrary instruction which may be within a bundle, causing an
assertion.
As reported in https://github.com/llvm/llvm-project/issues/141034
SelectionDAG::getNode had some unexpected
behaviors when trying to create vectors with UNDEF elements. Since
we treat both UNDEF and POISON as undefined (when using isUndef())
we can't just fold away INSERT_VECTOR_ELT/INSERT_SUBVECTOR based on
isUndef(), as that could make the resulting vector more poisonous.
Same kind of bug existed in DAGCombiner::visitINSERT_SUBVECTOR.
Here are some examples:
This fold was done even if vec[idx] was POISON:
INSERT_VECTOR_ELT vec, UNDEF, idx -> vec
This fold was done even if any of vec[idx..idx+size] was POISON:
INSERT_SUBVECTOR vec, UNDEF, idx -> vec
This fold was done even if the elements not extracted from vec could
be POISON:
sub = EXTRACT_SUBVECTOR vec, idx
INSERT_SUBVECTOR UNDEF, sub, idx -> vec
With this patch we avoid such folds unless we can prove that the
result isn't more poisonous when eliminating the insert.
Fixes https://github.com/llvm/llvm-project/issues/141034
A common idiom is the usage of the PatternMatch match function within a
functional algorithm like all_of. Introduce a match functor to shorten
this idiom.
Co-authored-by: Luke Lau <luke@igalia.com>
With this change, construction of abstract subprogram DIEs is split in
two stages/functions:
creation of DIE (in DwarfCompileUnit::getOrCreateAbstractSubprogramDIE)
and its population with children (in
DwarfCompileUnit::constructAbstractSubprogramScopeDIE).
With that, abstract subprograms can be created/referenced from
DwarfDebug::beginModule, which should solve the issue with static local
variables DIE creation of inlined functons with optimized-out
definitions. It fixes https://github.com/llvm/llvm-project/issues/29985.
LexicalScopes class now stores mapping from DISubprograms to their
corresponding llvm::Function's. It is supposed to be built before
processing of each function (so, now LexicalScopes class has a method
for "module initialization" alongside the method for "function
initialization"). It is used by DwarfCompileUnit to determine whether a
DISubprogram needs an abstract DIE before DwarfDebug::beginFunction is
invoked.
DwarfCompileUnit::getOrCreateSubprogramDIE method is added, which can
create an abstract or a concrete DIE for a subprogram. It accepts
llvm::Function* argument to determine whether a concrete DIE must be
created.
This is a temporary fix for
https://github.com/llvm/llvm-project/issues/29985. Ideally, it will be
fixed by moving global variables and types emission to
DwarfDebug::endModule (https://reviews.llvm.org/D144007,
https://reviews.llvm.org/D144005).
Some code proposed by Ellis Hoag <ellis.sparky.hoag@gmail.com> in
https://github.com/llvm/llvm-project/pull/90523 was taken for this
commit.
Accept mismatched register size and type size if the type is legal
for the register class.
For AMDGPU boolean registers have 2 possible interpretations depending
on the use context type. e.g., these are both equally valid:
%0:_(s1) = COPY $vcc
%1:_(s64) = COPY $vcc
vcc is a 64-bit register, which can be interpreted as a 1-bit or 64-bit
value depending on the use context. SelectionDAG has never required
exact
match between the register size and the used value type. You can assign
a type with a smaller size to a larger register class. Relax the
verifier
to match. There are several hacks holding together these copies in
various places, and this is preparation to remove one of them.
The x86 test change is from what I would consider an X86 usage bug. X86
defines an FR32 register class and F16 register class, but the F16
register
class is functionally an alias of F32 with the same members and size.
There's
no need to have the F16 class.
The PowerPC changes are caused by shifts created by different IR
operations being CSEd now. This allows consecutive loads to be turned
into vectors earlier. This has effects on the ordering of other combines
and legalizations. This leads to some improvements and some regressions.
Hi, I compared the following LLVM IR with GCC and Clang, and there is a small difference between the two. The LLVM IR is:
```
define i64 @test_smin_neg_one(i64 %a) {
%1 = tail call i64 @llvm.smin.i64(i64 %a, i64 -1)
%retval.0 = xor i64 %1, -1
ret i64 %retval.0
}
```
GCC generates:
```
cmp x0, 0
csinv x0, xzr, x0, ge
ret
```
Clang generates:
```
cmn x0, #1
csinv x8, x0, xzr, lt
mvn x0, x8
ret
```
Clang keeps flipping x0 through x8 unnecessarily.
So I added the following folds to DAGCombiner:
fold (xor (smax(x, C), C)) -> select (x > C), xor(x, C), 0
fold (xor (smin(x, C), C)) -> select (x < C), xor(x, C), 0
alive2: https://alive2.llvm.org/ce/z/gffoir
---------
Co-authored-by: Yui5427 <785369607@qq.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Change shouldRewriteCopySrc to return the common register
class and expose it as a utility function. I've found myself
reproducing essentially the same logic in multiple places. The
purpose of this function is to jsut work through the API constraints
of which combination of register class and subreg indexes you have.
i.e. you need to use a different function if you have 0, 1, or 2
subregister indexes involved in a pair of copy-like operations.
There is a cache on the known-bit computed by global-isel. It only works
inside a single query to computeKnownBits, which limits its usefulness,
and according to the tests can sometimes limit the effectiveness of
known-bits queries. (Although some AMD tests look longer). Keeping the
cache valid and clearing it at the correct times can also require being
careful about the functions called inside known-bits queries.
I measured compile-time of removing it and came up with:
```
7zip 2.06405E+11 2.06436E+11 0.015018992
Bullet 1.01298E+11 1.01186E+11 -0.110236169
ClamAV 57942466667 57848066667 -0.16292023
SPASS 45444466667 45402966667 -0.091320249
consumer 35432466667 35381233333 -0.144594317
kimwitu++ 40858833333 40927933333 0.169118877
lencod 70022366667 69950633333 -0.102443457
mafft 38439900000 38413233333 -0.069372362
sqlite3 35822266667 35770033333 -0.145812474
tramp3d 82083133333 82045600000 -0.045726
Average -0.068828739
```
The last column is % difference between with / without the cache. So in
total it seems to be costing slightly more to keep the current
known-bits cache than if it was removed. (Measured in instruction count,
similar to llvm-compile-time-tracker).
The hit rate wasn't terrible - higher than I expected. In the
llvm-test-suite+external projects it was hit 4791030 times out of
91107008 queries, slightly more than 5%.
Note that as globalisel increases in complexity, more known bits calls
might be made and the numbers might shift. If that is the case it might
be better to have a cache that works across calls, providing it doesn't
make effectiveness worse.
This ensures we don't need to fixup the shift amount later.
Unfortunately, this enabled the
(SRA (SHL X, ShlConst), SraConst) -> (SRA (sext_in_reg X), SraConst -
ShlConst) combine in combineShiftRightArithmetic for some cases in
is_fpclass-fp80.ll. So we need to also update checkSignTestSetCCCombine
to look through sign_extend_inreg to prevent a regression.
This function contained old code for handling the case that the type
returned getScalarShiftAmountTy can't hold the shift amount.
These days this is handled by getShiftAmountTy which is used by
getShiftAmountConstant.
Tail duplicator removes the first PHI income from the predecessor basic
block, while it should remove all operands for this block.
PHI instructions happen to have duplicated values for the same
predecessor block:
* `UnreachableMachineBlockElim` assumes that PHI instruction might have
duplicates:
7289f2cd0c/llvm/lib/CodeGen/UnreachableBlockElim.cpp (L160)
* `AArch64` directly states that PHI instruction might have duplicates:
7289f2cd0c/llvm/lib/Target/AArch64/AArch64ConditionalCompares.cpp (L244)
* And `Hexagon`:
7289f2cd0c/llvm/lib/Target/Hexagon/HexagonConstPropagation.cpp (L844)
We have caught the bug on custom out-of-tree backend. `TailDuplicator`
should remove all operands corresponding to the removing block.
Please note, that bug likely does not affect in-tree backends, because:
* It happens only in scenario of **partial** tail duplication (i.e. tail
block is duplicated in some predecessors, but not in all of them)
* It happens in **Pre-RA** tail duplication only (Post-RA does not
contain PHIs, obviously)
* The only backend (I know) uses Pre-RA tail duplication is X86. It uses
tail duplication via `early-tailduplication` pass which declines partial
tail duplication via `canCompletelyDuplicateBB` check, because it uses
`TailDuplicator::tailDuplicateBlocks` public API.
So, bug happens only in the case of pre-ra partial tail duplication if
backend uses `TailDuplicator::tailDuplicate` public API directly.
That's why I can not add reproducer test for in-tree backends.
getPointerRegClass is a layering violation. Its primary purpose
is to determine how to interpret an MCInstrDesc's operands RegClass
fields. This should be context free, and only depend on the subtarget.
The model of this is also wrong, since this should be an
instruction / operand specific property, not a global pointer class.
Remove the the function argument to help stage removal of this hook
and avoid introducing any new obstacles to replacing it.
The remaining uses of the function were to get the subtarget, which
TargetRegisterInfo already belongs to. A few targets needed new
subtarget derived properties copied there.
This adds value types for representing capability types, enabling their use in instruction selection and other parts of the backend.
These types are distinguished from each other only by size. This is sufficient, at least today, because no existing CHERI configuration supports multiple capability sizes simultaneously. Hybrid configurations supporting intermixed integral pointers and capabilities do exist, and are one of the reasons why these value types are needed beyond existing integral types.
Co-authored-by: David Chisnall <theraven@theravensnest.org>
Co-authored-by: Jessica Clarke <jrtc27@jrtc27.com>
Change `SpecificConstantMatch` constructor and `isBuildVectorConstantSplat` overloads to take `const APInt&` instead of by value to avoid unnecessary copies, especially for wide integers.
Similar to the implementation in
https://github.com/llvm/llvm-project/pull/104411 , the `fmin.s`/`fmax.s`
instructions follow IEEE 754-2019 semantics, and
`G_FMINIMUMNUM`/`G_FMAXIMUMNUM` are legal.
This makes it easier to follow a value through the pass. If we pass the
original name to the create function, a number will be added as a suffix
since the original name is still used until it is replaced.
Align the syntax used for the optimization level argument of the
expand-fp pass in textual descriptions of pass pipelines with the syntax
used by other passes taking a similar argument. That is, use e.g.
`expand-fp<O1>` instead of `expand-fp<opt-level=1>`.
Checking `isOperationLegalOrCustom` instead of `isOperationLegal` allows
more optimization opportunities. In particular, if a target wants to
mark `extract_vector_elt` as `Custom` rather than `Legal` in order to
optimize some certain cases, this combiner would otherwise miss some
improvements.
Previously, using `isOperationLegalOrCustom` was avoided due to the risk
of getting stuck in infinite loops (as noted in
61ec738b60).
After testing, the issue no longer reproduces, but the coverage is
limited to the regression/unit tests and the test-suite.