The target should not have to construct MachineIRBuilders during
RegBankSelect (we should perhaps hide the constructors for it). The
pass should own the builder setup with the desired CSE configuration
(although currently the pass does not use the CSE builder, which is
what I want to fix).
https://reviews.llvm.org/D156479
Add a const qualifier to this API call, since this is a member of
MemoryDepChecker and LoopAccessInfo returns an object of this class as a
const, as follows:
const MemoryDepChecker &getDepChecker() const { return *DepChecker; }
If one tries to use function as follows:
LAI->getDepChecker().getMaxSafeDepDistBytes()
results in the following error:
passing ‘const llvm::MemoryDepChecker’ as ‘this’ argument discards
qualifiers
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D156304
InstCombine is a worklist-driven algorithm, which works roughly
as follows:
* All instructions are initially pushed to the worklist.
The initial order is in RPO program order.
* All newly inserted instructions get added to the worklist.
* When an instruction is folded, its users get added back to the
worklist.
* When the use-count of an instruction decreases, it gets added
back to the worklist.
* And a few of other heuristics on when we should revisit
instructions.
On top of the worklist algorithm, InstCombine layers an additional
fix-point iteration: If any fold was performed in the previous
iteration, then InstCombine will re-populate the worklist from
scratch and fold the entire function again. This continues until
a fix-point is reached.
In the vast majority of cases, InstCombine will reach a fix-point
within a single iteration: However, a second iteration is performed
to verify that this is indeed the fixpoint. We can see this in the
statistics for llvm-test-suite:
"instcombine.NumOneIteration": 411380,
"instcombine.NumTwoIterations": 117921,
"instcombine.NumThreeIterations": 236,
"instcombine.NumFourOrMoreIterations": 2,
The way to read these numbers is that in 411380 cases, InstCombine
performs no folds. In 117921 cases it performs a fold and reaches
the fix-point within one iteration (the second iteration verifies
the fixpoint). In the remaining 238 cases, more than one iteration
is needed to reach the fixpoint.
In other words, only in 0.04% of cases are additional iterations
needed to reach a fixpoint. Conversely, in 22.3% of cases InstCombine
performs a completely useless extra iteration to verify the fix point.
This patch removes the fixpoint iteration from InstCombine, and always
only perform a single iteration. This results in a major compile-time
improvement of around 4% at negligible codegen impact.
This explicitly does accept that we will not reach a fixpoint in all
cases. However, this is mitigated by two factors: First, the data
suggests that this happens very rarely in practice. Second,
InstCombine runs many times during the optimization pipeline
(8 times even without LTO), so there are many chances to recover
such cases.
In order to prevent accidental optimization regressions in the
future, this implements a verify-fixpoint option, which is enabled
by default when instcombine is specified in -passes and disabled
when InstCombinePass() is constructed from C++. This means that
test cases need to explicitly use the no-verify-fixpoint option
if they fail to reach a fixed point (for a well understand reason
we cannot / do not want to avoid).
Differential Revision: https://reviews.llvm.org/D154579
The option is used to force the use of resource intervals
in the machine scheduler, effectively ignoring the value of
`EnableIntervals` in the instance of the `SchedMachineModel`.
Reviewed By: anemet
Differential Revision: https://reviews.llvm.org/D156540
This patch allows constant folding of PHIs when estimating the user
bonus. Phi nodes are a special case since some of their inputs may
remain unresolved until all the specialization arguments have been
processed by the InstCostVisitor. Therefore, we keep a list of dead
basic blocks and then lazily visit the Phi nodes once the user bonus
has been computed for all the specialization arguments.
Differential Revision: https://reviews.llvm.org/D154852
Introduced the convergent equivalent of the existing G_INTRINSIC opcodes:
- G_INTRINSIC_CONVERGENT
- G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS
Out of the targets that currently have some support for GlobalISel, the patch
assumes that the convergent intrinsics only relevant to SPIRV and AMDGPU.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D154766
Comment mentions MIIdx, but the actual parameter is def.
Fix comment, but also rename parameter to Def to match current
coding standards while touching the code.
The ExecutionSession::removeJITDylibs operation will remove all JITDylibs in
the given list (i.e. first clear them, then remove them from the session).
ExecutionSession::endSession is updated to remove JITDylibs rather than just
clearing them. This prevents new code from being added to any JITDylib once
endSession has been called.
Change the scheduler's physical register dependency tracking from
registers-and-their-aliases to regunits. This has a couple of advantages
when subregisters are used:
- The dependency tracking is more accurate and creates fewer useless
edges in the dependency graph. An AMDGPU example, edited for clarity:
SU(0): $vgpr1 = V_MOV_B32 $sgpr0
SU(1): $vgpr1 = V_ADDC_U32 0, $vgpr1
SU(2): $vgpr0_vgpr1 = FLAT_LOAD_DWORDX2 $vgpr0_vgpr1, 0, 0
There is a data dependency on $vgpr1 from SU(0) to SU(1) and from
SU(1) to SU(2). But the old dependency tracking code also added a
useless edge from SU(0) to SU(2) because it thought that SU(0)'s def
of $vgpr1 aliased with SU(2)'s use of $vgpr0_vgpr1.
- On targets like AMDGPU that make heavy use of subregisters, each
register can have a huge number of aliases - it can be quadratic in
the size of the largest defined register tuple. There is a much lower
bound on the number of regunits per register, so iterating over
regunits is faster than iterating over aliases.
The LLVM compile-time tracker shows a tiny overall improvement of 0.03%
on X86. I expect a larger compile-time improvement on targets like
AMDGPU.
Differential Revision: https://reviews.llvm.org/D156552
Currently, m_Mul() style matchers also match constant expressions.
This is a regular source of assertion failures (usually by trying
to do a match and then cast to Instruction or BinaryOperator) and
infinite combine loops. At the same time, I don't think this provides
useful optimization capabilities (all of the tests affected here are
regression tests for crashes / infinite loops).
Long term, all of these constant expressions (apart from possibly
add/sub) are slated for removal per
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179
-- but doing those removals can itself expose new crashes and
infinite loops due to the current PatternMatch behavior.
Differential Revision: https://reviews.llvm.org/D156401
In SelectionDAG InstrEmitter automatically puts dead flags on unused
physreg defs everywhere. The generated selectors should also set dead
on physreg defs that were not used in the pattern.
This ensures that llvm-symbolizer ignores them for symbolization.
Note: unlike aarch64-mapping-symbol.s, the test included here does not
test if the mapping symbols are actually in the symbol table. The reason
is that llvm-mc support for RISC-V mapping symbols (D153260) has not
landed yet, so the mapping symbols simply aren't there. However, D153260
would like to depend on this patch together with D156190 to avoid having
to update a large amount of tests.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D156236
Similar to D96617 for llvm-symbolizer.
This patch matches the GNU objdump -d behavior to suppress printing
labels for mapping symbols. Mapping symbol names don't convey much
information.
When --show-all-symbols (not in GNU) is specified, we still print
mapping symbols.
Note: the `for (size_t SI = 0, SE = Symbols.size(); SI != SE;)` loops
needs to iterate all mapping symbols, even if they are not displayed.
We use the new field `IsMappingSymbol` to recognize mapping symbols.
This field also enables simplification after D139131.
ELF/ARM/disassemble-all-mapping-symbols.s is enhanced to add `.space 2`.
If `End = std::min(End, Symbols[SI].Addr);` is not correctly set, we
would print a `.word`.
Reviewed By: jhenderson, jobnoorman, peter.smith
Differential Revision: https://reviews.llvm.org/D156190
and omit them from llvm-nm output unless --special-syms is specified,
similar to ARM and AArch64.
This is a prerequisite of D156190 as llvm-objdump will only perform
mapping symbol recognition for SF_FormatSpecific symbols.
to prevent overload resolution confusion. In particular, if we add
another parameter to the generic constructor, MCDisassemblerTest.cpp
specified constructors will be resolve to the generic constructor, which
is unintended.
This patch adds codegen support for vector with bfloat16 type in llvm backend.
With this patch, Zvbfmin/Zvbfwma instructions as well as vle16/vse16 can generated from newly added bf16 IR intrinsics.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D156287
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact:
When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
Previously when objcopy generated section headers, it padded the LEB
that encodes the section size out to 5 bytes, matching the behavior of
clang. This is correct, but results in a binary that differs from the
input. This can sometimes have undesirable consequences (e.g. breaking
source maps).
This change makes the object reader remember the size of the LEB
encoding in the section header, so that llvm-objcopy can reproduce it
exactly. For sections not read from an object file (e.g. that
llvm-objcopy is adding itself), pad to 5 bytes.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D155535
This patch allows constant folding of PHIs when estimating the user
bonus. Phi nodes are a special case since some of their inputs may
remain unresolved until all the specialization arguments have been
processed by the InstCostVisitor. Therefore, we keep a list of dead
basic blocks and then lazily visit the Phi nodes once the user bonus
has been computed for all the specialization arguments.
In addition to the last revision this one fixes the bug reported on
Phabricator.
Differential Revision: https://reviews.llvm.org/D154852
We are bringing a new algorithm for function layout (reordering) based on the
call graph (extracted from a profile data). The algorithm is an improvement of
top of a known heuristic, C^3. It tries to co-locate hot and frequently executed
together functions in the resulting ordering. Unlike C^3, it explores a larger
search space and have an objective closely tied to the performance of
instruction and i-TLB caches. Hence, the name CDS = Cache-Directed Sort.
The algorithm can be used at the linking or post-linking (e.g., BOLT) stage.
The algorithm shares some similarities with C^3 and an approach for basic block
reordering (ext-tsp). It works with chains (ordered lists)
of functions. Initially all chains are isolated functions. On every iteration,
we pick a pair of chains whose merging yields the biggest increase in the
objective, which is a weighted combination of frequency-based and distance-based
locality. That is, we try to co-locate hot functions together (so they can share
the cache lines) and functions frequently executed together. The merging process
stops when there is only one chain left, or when merging does not improve the
objective. In the latter case, the remaining chains are sorted by density in the
decreasing order.
**Complexity**
We regularly apply the algorithm for large data-center binaries containing 10K+
(hot) functions, and the algorithm takes only a few seconds. For some extreme
cases with 100K-1M nodes, the runtime is within minutes.
**Perf-impact**
We extensively tested the implementation extensively on a benchmark of isolated
binaries and prod services. The impact is measurable for "larger" binaries that
are front-end bound: the cpu time improvement (on top of C^3) is in the range
of [0% .. 1%], which is a result of a reduced i-TLB miss rate (by up to 20%) and
i-cache miss rate (up to 5%).
Reviewed By: rahmanl
Differential Revision: https://reviews.llvm.org/D152834
* Expand understood `FileType`s that InterfaceFile class can represent.
* Add `hasTarget` function.
* Cleanup symbol `<` comparator to account for SymbolSet operations.
This allows us to handle dead blocks with multiple incoming edges,
where we can determine that all of those edges are dead (or cycles).
This allows InstCombine to handle certain dead code patterns that
can be produced by LoopVectorize in a single iteration.
This is in preparation for D154579.
Record the call frame size on entry to each basic block. This is usually
zero except when a basic block has been split in the middle of a call
sequence.
This simplifies PEI::replaceFrameIndices which previously had to visit
basic blocks in a specific order and had special handling for
unreachable blocks. More importantly it paves the way for an equally
simple implementation of a backwards version of replaceFrameIndices,
which is required to fully convert PrologEpilogInserter to backwards
register scavenging, which is preferred because it does not rely on
accurate kill flags.
Differential Revision: https://reviews.llvm.org/D156113
Some opcodes in generic MIR represent calls to intrinsics, where the intrinsic
ID is the first non-def operand to the instruction. These are now represented as
a subclass of GenericMachineInstr, and the method MachineInstr::getIntrinsicID()
is now moved to this subclass GIntrinsic.
Some target-defined instructions behave like GMIR intrinsics, and have an
Intrinsic::ID operand. But they should not be recognized as generic intrinsics,
and should not use GIntrinsic::getIntrinsicID(). Separated these out by
introducing a new AMDGPU::getIntrinsicID().
Reviewed By: arsenm, Pierre-vh
Differential Revision: https://reviews.llvm.org/D155556
This restores commit baa3386edb11a2f9bcadda8cf58d56f3707c39fa.
Originally reverted in d0f7850b01cf17e50a4f4b00e3b84dded94df6b8.
And dependent commits.
Details in D150388.
This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.
No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
Reviewed By: alexfh
Differential Revision: https://reviews.llvm.org/D156381
This reverts commit baa3386edb11a2f9bcadda8cf58d56f3707c39fa.
The changes did not cover all occurrences of the deteleted function
MachineInstr::getIntrinsicID().
Some opcodes in generic MIR represent calls to intrinsics, where the intrinsic
ID is the first non-def operand to the instruction. These are now represented as
a subclass of GenericMachineInstr, and the method MachineInstr::getIntrinsicID()
is now moved to this subclass GIntrinsic.
Some target-defined instructions behave like GMIR intrinsics, and have an
Intrinsic::ID operand. But they should not be recognized as generic intrinsics,
and should not use GIntrinsic::getIntrinsicID(). Separated these out by
introducing a new AMDGPU::getIntrinsicID().
Reviewed By: arsenm, Pierre-vh
Differential Revision: https://reviews.llvm.org/D155556
Fix the GenericSSAContext template so that it actually declares all the
necessary typenames and the methods that must be implemented by its
specializations SSAContext and MachineSSAContext.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156288
ELF object files can contain `.ctors` and `.dtors` sections that also
participate as initializers.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D154802
`inverse.ballot` checks if a cc bit is set for the current
lane. Therefore, it is not convergent.
Reviewed By: sameerds
Differential Revision: https://reviews.llvm.org/D156088
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Depend on D155886.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
Reverting due to the crash reported in D154852.
Also reverting the subsequent commit as collateral damage:
"[FuncSpec] Split the specialization bonus into CodeSize and Latency."
Currently we use a combined metric TargetTransformInfo::TCK_SizeAndLatency
when estimating the specialization bonus. This is suboptimal, and in some
cases erroneous. For example we shouldn't be weighting the codesize decrease
attributed to constant propagation by the block frequency of the dead code.
Instead only the latency savings should be weighted by block frequency. The
total codesize savings from all the specialization arguments should be
deducted from the specialization cost.
Differential Revision: https://reviews.llvm.org/D155103
We can improve our deduction if we stop at PHI and select instructions
and also iterate the returned values explicitly. The latter helps with
isImpliedByIR deductions.
After PrePrunePass `claimOrExternalizeWeakAndCommonSymbols`, a defined symbol might become external. So determine a function call is external or not when building the linkgraph is not accurate. This largely affects updating TOC pointer on PowerPC. TOC pointer is supposed to be the same in one object file(if no mulitple TOC appears) and is updated when control flow transferred to another object file.
This patch defers checking a function call is external or not, in `buildTables_ELF_ppc64` which is a PostPrunePass.
This patch fixes failures when `jitlink -orc-runtime=/path/to/libort_rt.a` is used.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D155925
As described in [1][2], `-mtune=` is used to select the type of target
microarchitecture, defaults to the value of `-march`. The set of
possible values should be a superset of `-march` values. Currently
possible values of `-march=` and `-mtune=` are `native`, `loongarch64`
and `la464`.
D136146 has supported `-march={loongarch64,la464}` and this patch adds
support for `-march=native` and `-mtune=`.
A new ProcessorModel called `loongarch64` is defined in LoongArch.td
to support `-mtune=loongarch64`.
`llvm::sys::getHostCPUName()` returns `generic` on unknown or future
LoongArch CPUs, e.g. the not yet added `la664`, leading to
`llvm::LoongArch::isValidArchName()` failing to parse the arch name.
In this case, use `loongarch64` as the default arch name for 64-bit
CPUs.
And these two preprocessor macros are defined:
- __loongarch_arch
- __loongarch_tune
[1]: https://github.com/loongson/LoongArch-Documentation/blob/2023.04.20/docs/LoongArch-toolchain-conventions-EN.adoc
[2]: https://github.com/loongson/la-softdev-convention/blob/v0.1/la-softdev-convention.adoc
Reviewed By: xen0n, wangleiat
Differential Revision: https://reviews.llvm.org/D155824
The current implementation has two primary issues:
* Matching `a*a*a*b` against `aaaaaa` has exponential complexity.
* BitVector harms data cache and is inefficient for literal matching.
and a minor issue that `\` at the end may cause an out of bounds access
in `StringRef::operator[]`.
Switch to an O(|S|*|P|) greedy algorithm instead: factor the pattern
into segments split by '*'. The segment is matched sequentianlly by
finding the first occurrence past the end of the previous match. This
algorithm is used by lots of fnmatch implementations, including musl and
NetBSD's.
In addition, `optional<StringRef> Exact, Suffix, Prefix` wastes space.
Instead, match the non-metacharacter prefix against the haystack, then
match the pattern with the rest.
In practice `*suffix` style patterns are less common and our new
algorithm is fast enough, so don't bother storing the non-metacharacter
suffix.
Note: brace expansions (D153587) can leverage the `matchOne` function.
Differential Revision: https://reviews.llvm.org/D156046
This restores D132311, which was reverted in
29c841ce93e087fa4e0c5f3abae94edd460bc24a (Sep 2022) due to certain files
not buildable with GCC 7.3.0. The previous attempt was reverted by
6cd9608fb37ca2418fb44b57ec955bb5efe10689 (Dec 2020).
This time, GCC 7.3.0 has existing build errors for a long time due to
structured bindings for many files, e.g.
```
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9098:13: error: cannot decompose class type ‘std::pair<llvm::Value*, const llvm::SCEV*>’: both it and it
s base class ‘std::pair<llvm::Value*, const llvm::SCEV*>’ have non-static data members
for (auto [_, Stride] : Legal->getLAI()->getSymbolicStrides()) {
^~~~~~~~~~~
```
... and also some `error: duplicate initialization of` instances due to llvm/Transforms/IPO/Attributor.h.
---
GCC 7.5.0 has a bug that, without this change, certain `SmallVector` with a `std::pair` element type like `SmallVector<std::pair<Instruction * const, Info>, 0> X;` lead to spurious
```
/tmp/opt/gcc-7.5.0/include/c++/7.5.0/type_traits:878:48: error: constructor required before non-static data member for ‘...’ has been parsed
```
Switching to std::is_trivially_{copy/move}_constructible fixes the error.
instead of DW_ELE_start_length in debug_rnglists section
This patch tries to reduce the size of the debug_rnglist section by
replacing the DW_RLE_start_length opcodes currently emitted by dsymutil
in favor of using DW_RLE_base_addressx + DW_RLE_offset_pair instead.
The DW_RLE_start_length is one AddressSize followed by a ULEB per entry,
whereas, the DW_RLE_base_addressx + DW_RLE_offset_pair will use one ULEB
for the base address, and then the DW_RLE_offset_pair is a pair of
ULEBs. This will be more efficient.
Differential Revision: https://reviews.llvm.org/D156166
Similar to D156016 for MapVector.
This brings back commit fae7b98c221b5b28797f7b56b656b6b819d99f27 with a
fix to llvm/unittests/Support/ThreadPool.cpp's `_WIN32` code path.
This reverts commit 0d12683046ca75fb08e285f4622f2af5c82609dc and
reapplies ef9ec4bbcca2fa4f64df47bc426f1d1c59ea47e2 with an extension to
fix the Flang build.
Differential Revision: https://reviews.llvm.org/D156184
Currently, objcopy cannot set the new flag SHF_X86_64_LARGE. This change introduces the named flag "large" which translates to that section flag.
An "invalid argument" error is produced if a user attempts to set the flag on an architecture other than X86_64.
Reviewed By: jhenderson, MaskRay
Differential Revision: https://reviews.llvm.org/D153262