Summary:
According to DWARF5 specification and gnu specification for DWARF4 the offset
entry in the CU/TU Index is 32 bits. This presents a problem when
.debug_info.dwo in DWP file grows beyond 4GB. The CU Index becomes partially
corrupted.
This diff adds manual parsing of .debug_info.dwo/.debug_abbrev.dwo to
reconstruct CU index in general, and TU index for DWARF5. This is a work around
until DWARF6 spec is finalized.
Next patch will change internal CU/TU struct to 64 bit, and change uses as
necessary. The plan is to land all the patches in one go after all are approved.
This patch originates from the discussion in: https://discourse.llvm.org/t/dwarf-dwp-4gb-limit/63902
Differential Revision: https://reviews.llvm.org/D137882
Noticed while trying to use llvm-exegesis to get some accurate capture numbers on some old Atom/Silverment hardware as part of the work with D103695.
These targets' frontends are particularly poor and the use of the xmm8-xmm15 SSE registers results in longer instruction encodings which were affecting the latency/throughput estimates.
Thanks to @lebedev.ri for the --skip-measurements command line argument which made testing much easier!
Differential Revision: https://reviews.llvm.org/D138832
Sometimes we only want to ensure that we can produce snippets (all the way
through `SnippetRepetitor`!), but don't care for the execution.
E.g. all of our tests are this way.
I've built LLVM without PFM and removed my CPU from `X86PfmCounters.td`,
and this produces the expected results in that configuration.
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D139448
Before performing this change, I checked that `ByteAlignment` was never `0` inside `MCAsmStreamer:emitZeroFill` and `MCAsmStreamer::emitLocalCommonSymbol`.
I believe it is NFC as `0` values are illegal in `emitZeroFill` anyways, `Log2(ByteAlignment)` would be undefined.
And currently, all calls to `emitLocalCommonSymbol` are provably `>0`.
Differential Revision: https://reviews.llvm.org/D139439
This breaks Windows bots with
`warning C4334: '<<': result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended?)`
Some shift operators are lacking a proper literal unit ('1ULL' instead of
'1'). Will reland once fixed.
This reverts commit c621c1a8e81856e6bf2be79714767d80466e9ede.
Before performing this change, I checked that `ByteAlignment` was never `0` inside `MCAsmStreamer:emitZeroFill` and `MCAsmStreamer::emitLocalCommonSymbol`.
I believe it is NFC as `0` values are illegal in `emitZeroFill` anyways, `Log2(ByteAlignment)` would be undefined.
And currently, all calls to `emitLocalCommonSymbol` are provably `>0`.
Differential Revision: https://reviews.llvm.org/D139439
Let Propeller use specialized IDs for basic blocks, instead of MBB number.
This allows optimizations not just prior to asm-printer, but throughout the entire codegen.
This patch only implements the functionality under the new `LLVM_BB_ADDR_MAP` version, but the old version is still being used. A later patch will change the used version.
####Background
Today Propeller uses machine basic block (MBB) numbers, which already exist, to map native assembly to machine IR. This is done as follows.
- Basic block addresses are captured and dumped into the `LLVM_BB_ADDR_MAP` section just before the AsmPrinter pass which writes out object files. This ensures that we have a mapping that is close to assembly.
- Profiling mapping works by taking a virtual address of an instruction and looking up the `LLVM_BB_ADDR_MAP` section to find the MBB number it corresponds to.
- While this works well today, we need to do better when we scale Propeller to target other Machine IR optimizations like spill code optimization. Register allocation happens earlier in the Machine IR pipeline and we need an annotation mechanism that is valid at that point.
- The current scheme will not work in this scenario because the MBB number of a particular basic block is not fixed and changes over the course of codegen (via renumbering, adding, and removing the basic blocks).
- In other words, the volatile MBB numbers do not provide a one-to-one correspondence throughout the lifetime of Machine IR. Profile annotation using MBB numbers is restricted to a fixed point; only valid at the exact point where it was dumped.
- Further, the object file can only be dumped before AsmPrinter and cannot be dumped at an arbitrary point in the Machine IR pass pipeline. Hence, MBB numbers are not suitable and we need something else.
####Solution
We propose using fixed unique incremental MBB IDs for basic blocks instead of volatile MBB numbers. These IDs are assigned upon the creation of machine basic blocks. We modify `MachineFunction::CreateMachineBasicBlock` to assign the fixed ID to every newly created basic block. It assigns `MachineFunction::NextMBBID` to the MBB ID and then increments it, which ensures having unique IDs.
To ensure correct profile attribution, multiple equivalent compilations must generate the same Propeller IDs. This is guaranteed as long as the MachineFunction passes run in the same order. Since the `NextBBID` variable is scoped to `MachineFunction`, interleaving of codegen for different functions won't cause any inconsistencies.
The new encoding is generated under the new version number 2 and we keep backward-compatibility with older versions.
####Impact on Size of the `LLVM_BB_ADDR_MAP` Section
Emitting the Propeller ID results in a 23% increase in the size of the `LLVM_BB_ADDR_MAP` section for the clang binary.
Reviewed By: tmsriram
Differential Revision: https://reviews.llvm.org/D100808
Adding a switch to drop profile symbol list during merging AutoFDO profiles. This is needed to minimize the impact on default profiles when the profile symbol list is enabled for the source input profiles. The symbol list is quite large and could potentially slow down the compiler.
Reviewed By: davidxl, wenlei
Differential Revision: https://reviews.llvm.org/D139486
The 3-parameter std::equal used in this code access FileBuffer from [0,
OutputBuffer->getBufferEnd() - OutputBuffer->getBufferStart()). If the
size of FileBuffer is shorter than OutputBuffer, this ends up
overflowing.
This wasn't found on the sanitizer buildbots as they use an instrumented
libcxx, and libcxx implements std::equal using a loop. libstdc++ on my
local macine finds the bug, as it implements std::equal using bcmp(),
which ASan intercepts and does a range check.
The existing test doesn't technically do a buffer-overflow, but the code
definitely can. If OutputBuffer was "AAABBB" and FileBuffer was "AAA",
then the code would overflow.
Reviewed By: abrachet
Differential Revision: https://reviews.llvm.org/D139457
The main motivation for this change is to avoid ambiguity because
mapping symbol names may not be unique across a binary and do not allow uniquely
identifying target address. So that mapping symbols used as branch target
labels make llvm-objdump output less readable.
Another point is that mapping symbols sometimes appear in
non-allocatable sections, like debug info sections which make objdump
output even more confusing.
For example, a small AArch64 executable may contain plenty of `$d[.*]`
symbols and none of them would be useful as a label for resolving
a branch or a memory operand target address:
```
0000000000000254 l .note.ABI-tag 0000000000000000 $d
00000000000008d4 l .eh_frame 0000000000000000 $d
0000000000000868 l .rodata 0000000000000000 $d
0000000000011028 l .data 0000000000000000 $d
0000000000010db8 l .fini_array 0000000000000000 $d
0000000000010db0 l .init_array 0000000000000000 $d
00000000000008e8 l .eh_frame 0000000000000000 $d
0000000000011034 l .bss 0000000000000000 $d
```
Note that GNU objdump doesn't use mapping symbols as branch target
labels for all targets that support such symbols (ARM, AArch64, CSKY).
Differential Revision: https://reviews.llvm.org/D139131
Currently, DWARFLinker receives kind of accel tables as predefined sets:
```
Apple, ///< .apple_names, .apple_namespaces, .apple_types, .apple_objc.
Dwarf, ///< DWARF v5 .debug_names.
Default, ///< Dwarf for DWARF5 or later, Apple otherwise.
Pub, ///< .debug_pubnames, .debug_pubtypes
```
This patch removes implicit sets of tables(Default, Dwarf) and allows to ask for several sets:
```
Apple, ///< .apple_names, .apple_namespaces, .apple_types, .apple_objc.
Pub, ///< .debug_pubnames, .debug_pubtypes
DebugNames ///< .debug_names.
```
It allows seamlessness adding more accel tables in the future: .gdb_index, .debug_cu_index...
Doing things that way, DWARFLinker will be independent of consumers' requirements.
f.e. dsymutil and llvm-dwarfutil may have different variants for Default set
(so, instead of implementing these differencies inside DWARFLinker it could be
implemented in the corresponding module).
Differential Revision: https://reviews.llvm.org/D132371
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
There's a data race on the UninterestingChunks set. The code seems to
be operating on the assumption that all the tasks completed, so ensure
the unused results do complete. This started showing up about 50% of
the time when running operands-skip-parallel.ll after the recent
switch to use DenseSet; previously it failed much less frequently with
std::set.
We should introduce a mechanism to early terminate unused
results. Alternatively, I've been thinking about ways to to make the
reduction order smarter. I frequently have tests that take multiple
minutes to compile and hit the failure. It may be helpful to see which
chunks took the least time and prefer those over just taking the first
result.
As stated in
https://discourse.llvm.org/t/rfc-llc-add-expandlargeintfpconvert-pass-for-fp-int-conversion-of-large-bitint/65528,
this implementation is very similar to ExpandLargeDivRem, which expands
‘fptoui .. to’, ‘fptosi .. to’, ‘uitofp .. to’, ‘sitofp .. to’ instructions
with a bitwidth above a threshold into auto-generated functions. This is
useful for targets like x86_64 that cannot lower fp convertions with more
than 128 bits. The expanded nodes are referring from the IR generated by
`compiler-rt/lib/builtins/floattidf.c`, `compiler-rt/lib/builtins/fixdfti.c`,
and etc.
Corner cases:
1. For fp16: as there is no related builtins added in compliler-rt. So I
mainly utilized the fp32 <-> fp16 lib calls to implement.
2. For fp80: as this pass is soft fp emulation and no fp80 instructions can
help in this problem. I recommend users to deprecate this usage. For now, the
implementation uses fp128 as the temporary conversion type and inserts
fptrunc/ext at top/end of the function.
3. For bf16: as clang FE currently doesn't support bf16 algorithm operations
(convert to int, float, +, -, *, ...), this patch doesn't consider bf16 for
now.
4. For unsigned FPToI: since both default hardware behaviors and libgcc are
ignoring "returns 0 for negative input" spec. This pass follows this old way
to ignore unsigned FPToI. See this example:
https://gcc.godbolt.org/z/bnv3jqW1M
The end-to-end tests are uploaded at https://reviews.llvm.org/D138261
Reviewed By: LuoYuanke, mgehre-amd
Differential Revision: https://reviews.llvm.org/D137241
We need to flatten the SampleFDO profile in profile supplementation
because the InstrFDO profile does not have inlined callsite counters.
Without flattening profile, FDO optimizations are not stable:
we will not supplement the second generation profile when the modified
functions are all inlined.
This patch fixes this issue: we will flatten the profile for functions
that appears in FDO profile.
Note that we only need to find the hot/warm functions in SampleFDO
profile, so we will not perform a full flatten. We will use
a DFS traversal to compute the accumulated entry count and max bodycount.
This is much cheaper than full flattening.
Differential Revision: https://reviews.llvm.org/D138893