Commit 1b531d54f623 (#74203) removed the usage of unique_ptrs of arrays
in favour of using vectors, but inadvertently increased peak memory
usage by removing the ability to deallocate vector memory that was no
longer needed mid-LDV.
In that same review, it was pointed out that `FuncValueTable` typedef
could be removed, since it was "just a vector".
This commit addresses both issues by making `FuncValueTable` a real data
structure, capable of mapping BBs to ValueTables and able to free
ValueTables as needed.
This reduces peak memory usage in the compiler by 10% in the benchmarks
flagged by the original review.
As a consequence, we had to remove a handful of instances of the
"declare-then-initialize" antipattern in unittests, as the
FuncValueTable class is no longer default-constructible.
The original brute force dominates algorithm is O(n) complexity so it is
very slow for very large machine basic block which is very common with
O0. This patch added InstrPosIndexes to assign index for each
instruction and use it to determine dominance. The complexity is now
O(1).
Instead, we add a `BranchOnly` parameter to indicate that only
branches with its predecessors will be fused.
X86 is the only user of `createBranchMacroFusionDAGMutation`.
Renaming a member variable from "Endoding" to "Encoding".
Also replace inlined code for "isNormalized" with a call to the
function, so that if the definition of normalization ever changes, we
only need to change the one place.
IR intrinsics were already defined, but no codegen support had been
added.
I extracted this code from our downstream. Some of it may have come from
https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi/ originally.
we need first judge N1.getNumOperands() > 0.
If Lowering Generated SDNode like.
```
v2i32 t20: TargetOpNode.
i32 t21: extract_vector_elt t20 0
i32 t22: extract_vector_elt t20 1
```
will cause a error.
[TLI] Pass replace-with-veclib works with Scalable Vectors.
The pass is heavily refactored.
It uses the Masked variant of a TLI method when the Intrinsic operates on Scalable Vectors.
Improve tests for ArmPL and SLEEF Intrinsics:
- Auto-generate test `armpl-intrinsics.ll`, and use active lane mask to have shorter `shufflevector` check lines.
- Update scripts now add `@llvm.compiler.used` instead of using the regex: `@[[LLVM_COMPILER_USED:[a-zA-Z0-9_$"\\.-]+]]`
- Add simplifycfg pass and noalias to ensure tail folding. `noalias` attribute was added only to the `%in.ptr` parameter of the ArmPL Intrinsics.
Pass the original MMO instead of different individual values.
getAlign() was used before where actually getOriginalAlign() would have been
better, and this patch has the same effect.
Avoid the pre-truncate of BUILD_VECTOR sources when there is more than
one use. This can avoid using unnecessary movs later down the
instruction selection pipeline.
The followed byte of `OPC_EmitRegister` is a MVT type, which is
usually i32 or i64.
We add `OPC_EmitRegisterI32` and `OPC_EmitRegisterI64` so that we
can reduce one byte.
Overall this reduces the llc binary size with all in-tree targets by
about 10K.
We were only folding cases which remained extloads, but DAG.getExtLoad can also handle the cases which don't need to extend at all (we just can't do truncloads).
reduceLoadWidth can handle this for scalar loads, but not for vectors.
Noticed while triaging D152928
The CheckInteger routine called from TableGen-generated selection logic
uses getSExtValue - which will abort if the underlying APInt does not
fit into an int64_t.
This case is now triggered by the SystemZ back-end since i128 is a legal
type on certain machines. While we do not have any regular instructions
that take 128-bit immediates (like most other platforms), there are
patterns in the .td files that recognize an i128 "xor ..., -1" as a
"not".
These patterns cause code to be generated that calls the CheckInteger
routine on some i128-valued integer, which may trigger the assert.
Fix by using trySExtValue instead.
Fixes https://github.com/llvm/llvm-project/issues/75710
This reverts commit 08b306dc8e7c0b2498f4f194a3c51686d56dbd20.
it causes the following assertion failure:
llvm/include/llvm/CodeGen/MachineFrameInfo.h:530: int64_t
llvm::MachineFrameInfo::getObjectOffset(int) const: Assertion
`!isDeadObjectIndex(ObjectIdx) && "Getting frame offset for a dead
object?"' failed.
If we're lowering a fixed length vector load or store which happens to
exactly VLEN in size (when VLEN is exactly known), we can use a whole
register load or store instead of the unit strided variants. This
doesn't require a vsetvli in some cases, allows additional flexibility
of vsetvli cases in others, and doesn't have a runtime dependency on the
value of VL.
These options are used by `TargetPassConfig` to build CodeGen pass
pipeline, add them to `CGPassBuilderOption` so `CodeGenPassBuilder` can
use them. Currently not all options are added, but it is enough to build
a prototype of `CodeGenPassBuilder`. Part of #69879.
Unfortunately, there are two `KCFI` passes in source tree,
`CodeGen/KCFI.cpp` and `Transforms/Instrumentation/KCFI.cpp`, use
`MachineKCFIPass` for machine function pass.
`MIRProfileLoaderPass` is resolved to the legacy one when
`LLVM_ENABLE_MODULES=ON`, use `MIRProfileLoaderNewPass` as a workaround.
This ensures we clip the index to be in bounds of the vector we are
inserting into. If the index is out of bounds the results of the insert
element is poison. If we don't clip the index we can write memory that
was not part of the original store.
Fixes#74248#75557.
For non-GlobalValue references, the small and medium code models can use
32 bit constants.
For GlobalValue references, use TargetMachine::isLargeGlobalObject().
Look through aliases for determining if a GlobalValue is small or large.
Even the large code model can reference small objects with 32 bit
constants as long as we're in no-pic mode, or if the reference is offset
from the GOT.
Original commit broke the build...
First reland broke large PIC builds referencing small data since it was using GOTOFF as a 32-bit constant.
Type dereferenced fragments are specified by offset and length in bits.
The representation in CodeView is defined in terms of byte offsets. If
the bit slice overlaps at a byte that is included, we would create
invalid definition ranges.
Consider the following scenario:
~~~
01234567 01234567
---------+---------
==== ======
~~~
Here bits 1-4 are marked as defined as well as bits 7-9. The byte range
for the second portion overlaps and so we would say that bytes 1 and 2
are valid though there is potentially a hole. There is no way to
represent this in the defined range for the local variable in CodeView.
We simply can drop the fragment definition in such a scenario with the
variables are "optimized out".
Thanks to @rnk and @hjyamauchi for the discussion around this.
... by lowering them as lazy resolve-on-first-use symbol resolvers. Note that this is subtly different timing than on ELF platforms, where ifunc resolution happens at load time.
Since ld64 and ld-prime don't support all the cases we need for these, we lower them manually in the AsmPrinter.
## Motivation
Since we don't need the metadata sections at runtime, we can somehow
offload them from memory at runtime. Initially, I explored [debug info
correlation](https://discourse.llvm.org/t/instrprofiling-lightweight-instrumentation/59113),
which is used for PGO with value profiling disabled. However, it
currently only works with DWARF and it's be hard to add such artificial
debug info for every function in to CodeView which is used on Windows.
So, offloading profile metadata sections at runtime seems to be a
platform independent option.
## Design
The idea is to use new section names for profile name and data sections
and mark them as metadata sections. Under this mode, the new sections
are non-SHF_ALLOC in ELF. So, they are not loaded into memory at runtime
and can be stripped away as a post-linking step. After the process
exits, the generated raw profiles will contains only headers + counters.
llvm-profdata can be used correlate raw profiles with the unstripped
binary to generate indexed profile.
## Data
For chromium base_unittests with code coverage on linux, the binary size
overhead due to instrumentation reduced from 64M to 38.8M (39.4%) and
the raw profile files size reduce from 128M to 68M (46.9%)
```
$ bloaty out/cov/base_unittests.stripped -- out/no-cov/base_unittests.stripped
FILE SIZE VM SIZE
-------------- --------------
+121% +30.4Mi +121% +30.4Mi .text
[NEW] +14.6Mi [NEW] +14.6Mi __llvm_prf_data
[NEW] +10.6Mi [NEW] +10.6Mi __llvm_prf_names
[NEW] +5.86Mi [NEW] +5.86Mi __llvm_prf_cnts
+95% +1.75Mi +95% +1.75Mi .eh_frame
+108% +400Ki +108% +400Ki .eh_frame_hdr
+9.5% +211Ki +9.5% +211Ki .rela.dyn
+9.2% +95.0Ki +9.2% +95.0Ki .data.rel.ro
+5.0% +87.3Ki +5.0% +87.3Ki .rodata
[ = ] 0 +13% +47.0Ki .bss
+40% +1.78Ki +40% +1.78Ki .got
+12% +1.49Ki +12% +1.49Ki .gcc_except_table
[ = ] 0 +65% +1.23Ki .relro_padding
+62% +1.20Ki [ = ] 0 [Unmapped]
+13% +448 +19% +448 .init_array
+8.8% +192 [ = ] 0 [ELF Section Headers]
+0.0% +136 +0.0% +80 [7 Others]
+0.1% +96 +0.1% +96 .dynsym
+1.2% +96 +1.2% +96 .rela.plt
+1.5% +80 +1.2% +64 .plt
[ = ] 0 -99.2% -3.68Ki [LOAD #5 [RW]]
+195% +64.0Mi +194% +64.0Mi TOTAL
$ bloaty out/cov-cor/base_unittests.stripped -- out/no-cov/base_unittests.stripped
FILE SIZE VM SIZE
-------------- --------------
+121% +30.4Mi +121% +30.4Mi .text
[NEW] +5.86Mi [NEW] +5.86Mi __llvm_prf_cnts
+95% +1.75Mi +95% +1.75Mi .eh_frame
+108% +400Ki +108% +400Ki .eh_frame_hdr
+9.5% +211Ki +9.5% +211Ki .rela.dyn
+9.2% +95.0Ki +9.2% +95.0Ki .data.rel.ro
+5.0% +87.3Ki +5.0% +87.3Ki .rodata
[ = ] 0 +13% +47.0Ki .bss
+40% +1.78Ki +40% +1.78Ki .got
+12% +1.49Ki +12% +1.49Ki .gcc_except_table
+13% +448 +19% +448 .init_array
+0.1% +96 +0.1% +96 .dynsym
+1.2% +96 +1.2% +96 .rela.plt
+1.2% +64 +1.2% +64 .plt
+2.9% +64 [ = ] 0 [ELF Section Headers]
+0.0% +40 +0.0% +40 .data
+1.2% +32 +1.2% +32 .got.plt
+0.0% +24 +0.0% +8 [5 Others]
[ = ] 0 -22.9% -872 [LOAD #5 [RW]]
-74.5% -1.44Ki [ = ] 0 [Unmapped]
[ = ] 0 -76.5% -1.45Ki .relro_padding
+118% +38.8Mi +117% +38.8Mi TOTAL
```
A few things to note:
1. llvm-profdata doesn't support filter raw profiles by binary id yet,
so when a raw profile doesn't belongs to the binary being digested by
llvm-profdata, merging will fail. Once this is implemented,
llvm-profdata should be able to only merge raw profiles with the same
binary id as the binary and discard the rest (with mismatched/missing
binary id). The workflow I have in mind is to have scripts invoke
llvm-profdata to get all binary ids for all raw profiles, and
selectively choose the raw pnrofiles with matching binary id and the
binary to llvm-profdata for merging.
2. Note: In COFF, currently they are still loaded into memory but not
used. I didn't do it in this patch because I noticed that `.lcovmap` and
`.lcovfunc` are loaded into memory. A separate patch will address it.
3. This should works with PGO when value profiling is disabled as debug
info correlation currently doing, though I haven't tested this yet.
For non-GlobalValue references, the small and medium code models can use
32 bit constants.
For GlobalValue references, use TargetMachine::isLargeGlobalObject().
Look through aliases for determining if a GlobalValue is small or large.
Even the large code model can reference small objects with 32 bit
constants as long as we're in no-pic mode, or if the reference is offset
from the GOT.
Original commit broke the build...
For non-GlobalValue references, the small and medium code models can use
32 bit constants.
For GlobalValue references, use TargetMachine::isLargeGlobalObject().
Look through aliases for determining if a GlobalValue is small or large.
Even the large code model can reference small objects with 32 bit
constants as long as we're in no-pic mode, or if the reference is offset
from the GOT.
These are usually difficult to reason about, and they were being used to
pass raw pointers around with array semantic (i.e., we were using
operator [] on raw pointers). To put it in InstrRef terminology: we were
passing a pointer to a ValueTable but using it as if it were a
FuncValueTable.
These could have easily been SmallVectors, which now allow us to have
reference semantics in some places, as well as simpler initialization.
In the future, we can use even more pass-by-reference with some extra
changes in the code.