## Purpose
This patch is one in a series of code-mods that annotate LLVM’s public
interface for export. This patch annotates the `llvm/Target` library.
These annotations currently have no meaningful impact on the LLVM build;
however, they are a prerequisite to support an LLVM Windows DLL (shared
library) build.
## Background
This effort is tracked in #109483. Additional context is provided in
[this
discourse](https://discourse.llvm.org/t/psa-annotating-llvm-public-interface/85307),
and documentation for `LLVM_ABI` and related annotations is found in the
LLVM repo
[here](https://github.com/llvm/llvm-project/blob/main/llvm/docs/InterfaceExportAnnotations.rst).
A sub-set of these changes were generated automatically using the
[Interface Definition Scanner (IDS)](https://github.com/compnerd/ids)
tool, followed formatting with `git clang-format`.
The bulk of this change is manual additions of `LLVM_ABI` to
`LLVMInitializeX` functions defined in .cpp files under llvm/lib/Target.
Adding `LLVM_ABI` to the function implementation is required here
because they do not `#include "llvm/Support/TargetSelect.h"`, which
contains the declarations for this functions and was already updated
with `LLVM_ABI` in a previous patch. I considered patching these files
with `#include "llvm/Support/TargetSelect.h"` instead, but since
TargetSelect.h is a large file with a bunch of preprocessor x-macro
stuff in it I was concerned it would unnecessarily impact compile times.
In addition, a number of unit tests under llvm/unittests/Target required
additional dependencies to make them build correctly against the LLVM
DLL on Windows using MSVC.
## Validation
Local builds and tests to validate cross-platform compatibility. This
included llvm, clang, and lldb on the following configurations:
- Windows with MSVC
- Windows with Clang
- Linux with GCC
- Linux with Clang
- Darwin with Clang
Currently, middle-end generates 'unreachable' insn if the compiler
feels the code is indeed unreachable or the code becomes invalid
due to some optimizaiton (e.g. code optimization with uninitialized
variables).
Right now BPF backend ignores 'unreachable' insn during selectiondag
lowering. For cases where 'unreachable' is due to invalid code
transformation, such a signal will be missed. Later on, users needs
some effort to debug it which impacts developer productivity.
This patch enabled selectiondag lowering for 'unreachable' insn.
Previous attempt ([1]) tries to have a backend IR pass to filter
out 'unreachable' insns in a number of cases. But such pattern
matching may misalign with future middle-end optimization with
'unreachable' insns.
This patch takes a different approach. The 'unreachable' insn is
lowered with special encoding in bpf object file and verifier
will do proper verification for the bpf prog. More specifically,
the 'unreachable' insn is replaced by a __bpf_trap() function.
This function will be a kfunc (in ".ksyms" section) with a weak
attribute, but does not have definition. The actual kfunc definition
is expected to be in kernel. The __bpf_trap() extern function
is also encoded in BTF. The name __bpf_trap() is chosen to satisfy
reserved identifier requirement.
Besides the uninitialized variable case, the builtin function
'__builtin_trap' can also generate kfunc __bpf_trap(). For example
in [3], we have
```
# define __bpf_unreachable() __builtin_trap()
```
If the compiler didn't remove __builtin_trap() during middle-end
optimization, compilation will fail.
With this patch, compilation will not fail and __builtin_trap()
is converted to __bpf_trap() kfunc. The eventual failure will be
in verifier instead of llvm compilation. To keep compilation
time failure, user can add an option like `-ftrap-function=<something>`.
I tested this patch on bpf selftests and all tests are passed.
I also tried original example in [2] and the code looks like below:
```
; {
0: bf 16 00 00 00 00 00 00 r6 = r1
; bpf_printk("Start");
1: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
0000000000000008: R_BPF_64_64 .rodata
3: b4 02 00 00 06 00 00 00 w2 = 0x6
4: 85 00 00 00 06 00 00 00 call 0x6
; DEFINE_FUNC_CTX_POINTER(data)
5: 61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
; bpf_printk("pre ipv6_hdrlen_offset");
6: 18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
0000000000000030: R_BPF_64_64 .rodata
8: b4 02 00 00 17 00 00 00 w2 = 0x17
9: 85 00 00 00 06 00 00 00 call 0x6
10: 85 10 00 00 ff ff ff ff call -0x1
0000000000000050: R_BPF_64_32 __bpf_trap
11: 95 00 00 00 00 00 00 00 exit
<END>
```
Eventually kernel verifier will emit the following logs:
```
10: (85) call __bpf_trap#74479
unexpected __bpf_trap() due to uninitialized variable?
```
In another internal sched-ext bpf prog, with the patch we have bpf code:
```
Disassembly of section .text:
0000000000000000 <scx_storage_init_single>:
; {
0: bc 13 00 00 00 00 00 00 w3 = w1
1: b4 01 00 00 00 00 00 00 w1 = 0x0
; const u32 zero = 0;
...
0000000000003a80 <create_dom>:
; {
1872: bc 16 00 00 00 00 00 00 w6 = w1
; bpf_printk("dom_id %d", dom_id);
1873: 18 01 00 00 3f 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x3f ll
0000000000003a88: R_BPF_64_64 .rodata
1875: b4 02 00 00 0a 00 00 00 w2 = 0xa
1876: bc 63 00 00 00 00 00 00 w3 = w6
1877: 85 00 00 00 06 00 00 00 call 0x6
; ret = scx_bpf_create_dsq(dom_id, 0);
1878: bc 61 00 00 00 00 00 00 w1 = w6
1879: b4 02 00 00 00 00 00 00 w2 = 0x0
1880: 85 10 00 00 ff ff ff ff call -0x1
0000000000003ac0: R_BPF_64_32 scx_bpf_create_dsq
; domc->node_cpumask = node_data[node_id];
1881: 85 10 00 00 ff ff ff ff call -0x1
0000000000003ac8: R_BPF_64_32 __bpf_trap
1882: 95 00 00 00 00 00 00 00 exit
<END>
```
The verifier can easily report the error too.
A bpf flag `-bpf-disable-trap-unreachable` is introduced to disable
trapping for 'unreachable' or __builtin_trap.
[1] https://github.com/llvm/llvm-project/pull/126858
[2] https://github.com/msune/clang_bpf/blob/main/Makefile#L3
[3] https://github.com/libbpf/libbpf/blob/master/src/bpf_helpers.h
Register assembly printer passes in the pass registry.
This makes it possible to use `llc -start-before=<target>-asm-printer ...` in tests.
Adds a `char &ID` parameter to the AssemblyPrinter constructor to allow
targets to use the `INITIALIZE_PASS` macros and register the pass in the
pass registry. This currently has a default parameter so it won't break
any targets that have not been updated.
Replace "concept based polymorphism" with simpler PImpl idiom.
This pursues two goals:
* Enforce static type checking. Previously, target implementations hid
base class methods and type checking was impossible. Now that they
override the methods, the compiler will complain on mismatched
signatures.
* Make the code easier to navigate. Previously, if you asked your
favorite LSP server to show a method (e.g. `getInstructionCost()`), it
would show you methods from `TTI`, `TTI::Concept`, `TTI::Model`,
`TTIImplBase`, and target overrides. Now it is two less :)
There are three commits to hopefully simplify the review.
The first commit removes `TTI::Model`. This is done by deriving
`TargetTransformInfoImplBase` from `TTI::Concept`. This is possible
because they implement the same set of interfaces with identical
signatures.
The first commit makes `TargetTransformImplBase` polymorphic, which
means all derived classes should `override` its methods. This is done in
second commit to make the first one smaller. It appeared infeasible to
extract this into a separate PR because the first commit landed
separately would result in tons of `-Woverloaded-virtual` warnings (and
break `-Werror` builds).
The third commit eliminates `TTI::Concept` by merging it with the only
derived class `TargetTransformImplBase`. This commit could be extracted
into a separate PR, but it touches the same lines in
`TargetTransformInfoImpl.h` (removes `override` added by the second
commit and adds `virtual`), so I thought it may make sense to land these
two commits together.
Pull Request: https://github.com/llvm/llvm-project/pull/136674
Following discussions in #110443, and the following earlier discussions
in https://lists.llvm.org/pipermail/llvm-dev/2017-October/117907.html,
https://reviews.llvm.org/D38482, https://reviews.llvm.org/D38489, this
PR attempts to overhaul the `TargetMachine` and `LLVMTargetMachine`
interface classes. More specifically:
1. Makes `TargetMachine` the only class implemented under
`TargetMachine.h` in the `Target` library.
2. `TargetMachine` contains target-specific interface functions that
relate to IR/CodeGen/MC constructs, whereas before (at least on paper)
it was supposed to have only IR/MC constructs. Any Target that doesn't
want to use the independent code generator simply does not implement
them, and returns either `false` or `nullptr`.
3. Renames `LLVMTargetMachine` to `CodeGenCommonTMImpl`. This renaming
aims to make the purpose of `LLVMTargetMachine` clearer. Its interface
was moved under the CodeGen library, to further emphasis its usage in
Targets that use CodeGen directly.
4. Makes `TargetMachine` the only interface used across LLVM and its
projects. With these changes, `CodeGenCommonTMImpl` is simply a set of
shared function implementations of `TargetMachine`, and CodeGen users
don't need to static cast to `LLVMTargetMachine` every time they need a
CodeGen-specific feature of the `TargetMachine`.
5. More importantly, does not change any requirements regarding library
linking.
cc @arsenm @aeubanks
The early simplication pipeline is used in non-LTO and (Thin/Full)LTO
pre-link
stage. There are some passes that we want them in non-LTO mode, but not
at LTO
pre-link stage. The control is missing currently. This PR adds the
support. To
demonstrate the use, we only enable the internalization pass in non-LTO
mode for
AMDGPU because having it run in pre-link stage causes some issues.
This patch contains two pars:
- first to revert the patch https://github.com/llvm/llvm-project/pull/101428.
- second to remove `atomic_fetch_and_*()` to `atomic_<op>()`
conversion (when return value is not used), but preserve
`__sync_fetch_and_add()` to locked insn with cpu v1/v2.
Peilen Ye reported an issue ([1]) where for __sync_fetch_and_add(...)
without return value like
__sync_fetch_and_add(&foo, 1);
llvm BPF backend generates locked insn e.g.
lock *(u32 *)(r1 + 0) += r2
If __sync_fetch_and_add(...) returns a value like
res = __sync_fetch_and_add(&foo, 1);
llvm BPF backend generates like
r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
The above generation of 'lock *(u32 *)(r1 + 0) += r2' caused a problem
in jit since proper barrier is not inserted.
The above discrepancy is due to commit [2] where it tries to maintain
backward compatability since before commit [2],
__sync_fetch_and_add(...) generates lock insn in BPF backend.
Based on discussion in [1], now it is time to fix the above discrepancy
so we can have proper barrier support in jit. This patch made sure that
__sync_fetch_and_add(...) always generates atomic_fetch_add(...) insns.
Now 'lock *(u32 *)(r1 + 0) += r2' can only be generated by inline asm. I
also removed the whole BPFMIChecking.cpp file whose original purpose is
to detect and issue errors if XADD{W,D,W32} may return a value used
subsequently. Since insns XADD{W,D,W32} are all inline asm only now,
such error detection is not needed.
[1]
https://lore.kernel.org/bpf/ZqqiQQWRnz7H93Hc@google.com/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
[2]
286daafd65
Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
On MSVC the `this` uses inside `decltype` require a lambda capture. On
clang they result in an unused capture warning instead. Add the capture
and suppress the warning with `(void)this`.
-----
Initializing this map is somewhat expensive (especially for O0), so we
currently only do it if certain flags are used. I would like to make use
of it for crash dumps (#96078), where we don't know in advance whether
it will be needed or not.
This patch changes the initialization to a lazy approach, where a
callback is registered that does the actual initialization. The
callbacks will be run the first time the pass name is requested.
This way there is no compile-time impact if the mapping is not used.
My attempt to fix the Windows build made things worse,
revert entirely for now.
This reverts commit e7137f2fed5cfee822ae3c4c6d39188adb59a16c.
This reverts commit 6eaf204dbb0a6a81cddfd02f625c130f7bb1aae5.
This reverts commit 957dc4366dd2ce9d5d2991c3ad76bbf438e9954e.
Initializing this map is somewhat expensive (especially for O0), so we
currently only do it if certain flags are used. I would like to make use
of it for crash dumps (#96078), where we don't know in advance whether
it will be needed or not.
This patch changes the initialization to a lazy approach, where a
callback is registered that does the actual initialization. The
callbacks will be run the first time the pass name is requested.
This way there is no compile-time impact if the mapping is not used.
- Fix build with `EXPENSIVE_CHECKS`
- Remove unused `PassName::ID` to resolve warning
- Mark `~SelectionDAGISel` virtual so AArch64 backend can work properly
This reverts commit de37c06f01772e02465ccc9f538894c76d89a7a1 to
de37c06f01772e02465ccc9f538894c76d89a7a1
It still breaks EXPENSIVE_CHECKS build. Sorry.
Port selection dag isel to new pass manager.
Only `AMDGPU` and `X86` support new pass version. `-verify-machineinstrs` in new pass manager belongs to verify instrumentation, it is enabled by default.
This commit aims to support BPF arena kernel side
[feature](https://lore.kernel.org/bpf/20240209040608.98927-1-alexei.starovoitov@gmail.com/):
- arena is a memory region accessible from both BPF program and
userspace;
- base pointers for this memory region differ between kernel and user
spaces;
- `dst_reg = addr_space_cast(src_reg, dst_addr_space, src_addr_space)`
translates src_reg, a pointer in src_addr_space to dst_reg, equivalent
pointer in dst_addr_space, {src,dst}_addr_space are immediate constants;
- number 0 is assigned to kernel address space;
- number 1 is assigned to user address space.
On the LLVM side, the goal is to make load and store operations on arena
pointers "transparent" for BPF programs:
- assume that pointers with non-zero address space are pointers to
arena memory;
- assume that arena is identified by address space number;
- assume that address space zero corresponds to kernel address space;
- assume that every BPF-side load or store from arena is done via
pointer in user address space, thus convert base pointers using
`addr_space_cast(src_reg, 0, 1)`;
Only load, store, cmpxchg and atomicrmw IR instructions are handled by
this transformation.
For example, the following C code:
```c
#define __as __attribute__((address_space(1)))
void copy(int __as *from, int __as *to) { *to = *from; }
```
Compiled to the following IR:
```llvm
define void @copy(ptr addrspace(1) %from, ptr addrspace(1) %to) {
entry:
%0 = load i32, ptr addrspace(1) %from, align 4
store i32 %0, ptr addrspace(1) %to, align 4
ret void
}
```
Is transformed to:
```llvm
%to2 = addrspacecast ptr addrspace(1) %to to ptr ;; !
%from1 = addrspacecast ptr addrspace(1) %from to ptr ;; !
%0 = load i32, ptr %from1, align 4, !tbaa !3
store i32 %0, ptr %to2, align 4, !tbaa !3
ret void
```
And compiled as:
```asm
r2 = addr_space_cast(r2, 0, 1)
r1 = addr_space_cast(r1, 0, 1)
r1 = *(u32 *)(r1 + 0)
*(u32 *)(r2 + 0) = r1
exit
```
Co-authored-by: Eduard Zingerman <eddyz87@gmail.com>
Targets affected:
- NVPTX and BPF: set to 64 bits.
- ARC, Lanai, and MSP430: set to 0 (they don't implement atomics).
Those which didn't yet add AtomicExpandPass to their pass pipeline now
do so.
This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass. On all these targets, this
now matches what Clang already does in the frontend.
The only targets which do not configure AtomicExpandPass now are:
- DirectX and SPIRV: they aren't normal backends.
- AVR: a single-cpu architecture with no privileged/user divide, which
could implement all atomics by disabling/enabling interrupts, regardless
of size/alignment. Will be addressed by future work.
This commit adds a new BPF specific structure attribte
`__attribute__((preserve_static_offset))` and a pass to deal with it.
This attribute may be attached to a struct or union declaration, where
it notifies the compiler that this structure is a "context" structure.
The following limitations apply to context structures:
- runtime environment might patch access to the fields of this type by
updating the field offset;
BPF verifier limits access patterns allowed for certain data
types. E.g. `struct __sk_buff` and `struct bpf_sock_ops`. For these
types only `LD/ST <reg> <static-offset>` memory loads and stores are
allowed.
This is so because offsets of the fields of these structures do not
match real offsets in the running kernel. During BPF program
load/verification loads and stores to the fields of these types are
rewritten so that offsets match real offsets. For this rewrite to
happen static offsets have to be encoded in the instructions.
See `kernel/bpf/verifier.c:convert_ctx_access` function in the Linux
kernel source tree for details.
- runtime environment might disallow access to the field of the type
through modified pointers.
During BPF program verification a tag `PTR_TO_CTX` is tracked for
register values. In case if register with such tag is modified BPF
programs are not allowed to read or write memory using register. See
kernel/bpf/verifier.c:check_mem_access function in the Linux kernel
source tree for details.
Access to the structure fields is translated to IR as a sequence:
- `(load (getelementptr %ptr %offset))` or
- `(store (getelementptr %ptr %offset))`
During instruction selection phase such sequences are translated as a
single load instruction with embedded offset, e.g. `LDW %ptr, %offset`,
which matches access pattern necessary for the restricted
set of types described above (when `%offset` is static).
Multiple optimizer passes might separate these instructions, this
includes:
- SimplifyCFGPass (sinking)
- InstCombine (sinking)
- GVN (hoisting)
The `preserve_static_offset` attribute marks structures for which the
following transformations happen:
- at the early IR processing stage:
- `(load (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.load`;
- `(store (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.store`;
- at the late IR processing stage this modification is undone.
Such handling prevents various optimizer passes from generating
sequences of instructions that would be rejected by BPF verifier.
The __attribute__((preserve_static_offset)) has a priority over
__attribute__((preserve_access_index)). When preserve_access_index
attribute is present preserve access index transformations are not
applied.
This addresses the issue reported by the following thread:
https://lore.kernel.org/bpf/CAA-VZPmxh8o8EBcJ=m-DH4ytcxDFmo0JKsm1p1gf40kS0CE3NQ@mail.gmail.com/T/#m4b9ce2ce73b34f34172328f975235fc6f19841b6
This is a second attempt to commit this change, previous reverted
commit is: cb13e9286b6d4e384b5d4203e853d44e2eff0f0f.
The following items had been fixed:
- test case bpf-preserve-static-offset-bitfield.c now uses
`-triple bpfel` to avoid different codegen for little/big endian
targets.
- BPFPreserveStaticOffset.cpp:removePAICalls() modified to avoid
use after free for `WorkList` elements `V`.
Differential Revision: https://reviews.llvm.org/D133361
This commit adds a new BPF specific structure attribte
`__attribute__((preserve_static_offset))` and a pass to deal with it.
This attribute may be attached to a struct or union declaration, where
it notifies the compiler that this structure is a "context" structure.
The following limitations apply to context structures:
- runtime environment might patch access to the fields of this type by
updating the field offset;
BPF verifier limits access patterns allowed for certain data
types. E.g. `struct __sk_buff` and `struct bpf_sock_ops`. For these
types only `LD/ST <reg> <static-offset>` memory loads and stores are
allowed.
This is so because offsets of the fields of these structures do not
match real offsets in the running kernel. During BPF program
load/verification loads and stores to the fields of these types are
rewritten so that offsets match real offsets. For this rewrite to
happen static offsets have to be encoded in the instructions.
See `kernel/bpf/verifier.c:convert_ctx_access` function in the Linux
kernel source tree for details.
- runtime environment might disallow access to the field of the type
through modified pointers.
During BPF program verification a tag `PTR_TO_CTX` is tracked for
register values. In case if register with such tag is modified BPF
programs are not allowed to read or write memory using register. See
kernel/bpf/verifier.c:check_mem_access function in the Linux kernel
source tree for details.
Access to the structure fields is translated to IR as a sequence:
- `(load (getelementptr %ptr %offset))` or
- `(store (getelementptr %ptr %offset))`
During instruction selection phase such sequences are translated as a
single load instruction with embedded offset, e.g. `LDW %ptr, %offset`,
which matches access pattern necessary for the restricted
set of types described above (when `%offset` is static).
Multiple optimizer passes might separate these instructions, this
includes:
- SimplifyCFGPass (sinking)
- InstCombine (sinking)
- GVN (hoisting)
The `preserve_static_offset` attribute marks structures for which the
following transformations happen:
- at the early IR processing stage:
- `(load (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.load`;
- `(store (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.store`;
- at the late IR processing stage this modification is undone.
Such handling prevents various optimizer passes from generating
sequences of instructions that would be rejected by BPF verifier.
The __attribute__((preserve_static_offset)) has a priority over
__attribute__((preserve_access_index)). When preserve_access_index
attribute is present preserve access index transformations are not
applied.
This addresses the issue reported by the following thread:
https://lore.kernel.org/bpf/CAA-VZPmxh8o8EBcJ=m-DH4ytcxDFmo0JKsm1p1gf40kS0CE3NQ@mail.gmail.com/T/#m4b9ce2ce73b34f34172328f975235fc6f19841b6
Differential Revision: https://reviews.llvm.org/D133361
This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
Replace `BPFMIPeepholeTruncElim` by adding an overload for
`TargetLowering::isZExtFree()` aware that zero extension is
free for `ISD::LOAD`.
Short description
=================
The `BPFMIPeepholeTruncElim` handles two patterns:
Pattern #1:
%1 = LDB %0, ... %1 = LDB %0, ...
%2 = AND_ri %1, 0xff -> %2 = MOV_ri %1 <-- (!)
Pattern #2:
bb.1: bb.1:
%a = LDB %0, ... %a = LDB %0, ...
br %bb3 br %bb3
bb.2: bb.2:
%b = LDB %0, ... -> %b = LDB %0, ...
br %bb3 br %bb3
bb.3: bb.3:
%1 = PHI %a, %b %1 = PHI %a, %b
%2 = AND_ri %1, 0xff %2 = MOV_ri %1 <-- (!)
Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32
Both patterns could be handled by built-in transformations at
instruction selection phase if suitable `isZExtFree()` implementation
is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.
When evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM
test cases this revisions performs slightly better than
BPFMIPeepholeTruncElim, see "Impact" section below for details.
Commit also adds a few test cases to make sure that patterns in
question are handled.
Long description
================
Why this works: Pattern #1
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:
...
Type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
t19: i64 = and t16, Constant:i64<255>
t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.1 t19: i64 = and t16, Constant:i64<255>
With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Optimized type-legalized selection DAG:
- `t19 = and t16, 255` had been replaced by `t16` (load).
- Patterns like `(and (load ... i8), 255)` are replaced by `load`
in `DAGCombiner::BackwardsPropagateMask` called from
`DAGCombiner::visitAND`.
- Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by
`(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge,
look for `TLI.shouldFoldConstantShiftPairToMask()` call).
Why this works: Pattern #2
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
br label %next
next:
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command.
Log for first basic block:
Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 9 nodes:
t0: ch,glue = EntryToken
t3: i64 = Constant<0>
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5
t8: ch = CopyToReg t0, Register:i64 %0, t6
...
Replacing.1 t6: i64 = zero_extend t5
With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 7 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t8: ch = CopyToReg t0, Register:i64 %0, t9
Note:
- Initial selection DAG:
- `%a = load ...` is lowered as `t6 = (zero_extend (load ...))`
w/o special `isZExtFree()` overload added by this commit
it is instead lowered as `t6 = (any_extend (load ...))`.
- The decision to generate `zero_extend` or `any_extend` is
done in `RegsForValue::getCopyToRegs` called from
`SelectionDAGBuilder::CopyValueToVirtualRegister`:
- if `isZExtFree()` for load returns true `zero_extend` is used;
- `any_extend` is used otherwise.
- Optimized lowered selection DAG:
- `t6 = (any_extend (load ...))` is replaced by
`t9 = load ..., zext from i8`
This is done by `DagCombiner.cpp:tryToFoldExtOfLoad()` called from
`DAGCombiner::visitZERO_EXTEND`.
Log for second basic block:
Initial selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t5: i8 = truncate t4
t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
t9: i64 = any_extend t8
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.2 t18: i64 = and t4, Constant:i64<255>
With: t4: i64 = AssertZext t2, ValueType:ch:i8
...
Type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t18: i64 = and t4, Constant:i64<255>
t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Optimized type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Initial selection DAG:
- `t0` is an input value for this basic block, it corresponds load
instruction (`t9`) from the first basic block.
- It is accessed within basic block via
`t4` (AssertZext (CopyFromReg t0, ...)).
- The `AssertZext` is generated by RegsForValue::getCopyFromRegs
called from SelectionDAGBuilder::getCopyFromRegs, it is generated
only when `LiveOutInfo` with known number of leading zeros is
present for `t0`.
- Known register bits in `LiveOutInfo` are computed by
`SelectionDAG::computeKnownBits` called from
`SelectionDAGISel::ComputeLiveOutVRegInfo`.
- `computeKnownBits()` generates leading zeros information for
`(load ..., zext from ...)` but *does not* generate leading zeros
information for `(load ..., anyext from ...)`.
This is why `isZExtFree()` added in this commit is important.
- Type-legalized selection DAG:
- `t5 = truncate t4` is replaced by `t18 = and t4, 255`
- Optimized type-legalized selection DAG:
- `t18 = and t4, 255` is replaced by `t4`, this is done by
`DAGCombiner::SimplifyDemandedBits` called from
`DAGCombiner::visitAND`, which simplifies patterns like
`(and (assertzext ...))`
Impact
------
This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and
remove_truncate_8 improved slightly, for other tests it is
unchanged.
For remove_truncate_7:
Before this revision After this revision
-------------------- -------------------
r1 <<= 0x20 r1 <<= 0x20
r1 >>= 0x20 r1 >>= 0x20
if r1 == 0x0 goto +0x2 <LBB0_2> if r1 == 0x0 goto +0x2 <LBB0_2>
r1 = *(u32 *)(r2 + 0x0) r0 = *(u32 *)(r2 + 0x0)
goto +0x1 <LBB0_3> goto +0x1 <LBB0_3>
<LBB0_2>: <LBB0_2>:
r1 = *(u32 *)(r2 + 0x4) r0 = *(u32 *)(r2 + 0x4)
<LBB0_3>: <LBB0_3>:
r0 = r1 exit
exit
For remove_truncate_8:
Before this revision After this revision
-------------------- -------------------
r2 = *(u32 *)(r1 + 0x0) r2 = *(u32 *)(r1 + 0x0)
r3 = r2 r3 = r2
r3 <<= 0x20 r3 <<= 0x20
r4 = r3 r3 s>>= 0x20
r4 s>>= 0x20
if r4 s> 0x2 goto +0x5 <LBB0_3> if r3 s> 0x2 goto +0x4 <LBB0_3>
r4 = *(u32 *)(r1 + 0x4) r3 = *(u32 *)(r1 + 0x4)
r3 >>= 0x20
if r3 >= r4 goto +0x2 <LBB0_3> if r2 >= r3 goto +0x2 <LBB0_3>
r2 += 0x2 r2 += 0x2
*(u32 *)(r1 + 0x0) = r2 *(u32 *)(r1 + 0x0) = r2
<LBB0_3>: <LBB0_3>:
r0 = 0x3 r0 = 0x3
exit exit
For kernel BPF selftests statistics is as follows: (-mcpu=v4):
- For -mcpu=v4: 9 out of 655 object files have differences,
in all cases total number of instructions marginally decreased
(-27 instructions).
- For -mcpu=v2: 9 out of 655 object files have differences:
- For 19 object files number of instruction decreased
(-129 instruction in total): some redundant `rX &= 0xffff`
and register to register assignments removed;
- For 2 object files number of instructions increased +2
instructions in each file.
Both -mcpu=v2 instruction increases could be reduced to the same
example:
define void @foo(ptr %p) {
entry:
%a = load i32, ptr %p, align 4
%b = sext i32 %a to i64
%c = icmp ult i64 1, %b
br i1 %c, label %next, label %end
next:
call void inttoptr (i64 62 to ptr)(i32 %a)
br label %end
end:
ret void
}
Note that this example uses value loaded to `%a` both as a sign
extended (`%b`) and as zero extended (`%a` passed as parameter).
Here is the difference in final assembly code:
Before this revision After this revision
-------------------- -------------------
r1 = *(u32 *)(r1 + 0) r1 = *(u32 *)(r1 + 0)
r1 <<= 32 r1 <<= 32
r1 s>>= 32 r1 s>>= 32
if r1 < 2 goto <LBB0_2> if r1 < 2 goto <LBB0_2>
r1 <<= 32
r1 >>= 32
call 62 call 62
<LBB0_2>: <LBB0_2>:
exit exit
Before this commit `%a` is passed to call as a sign extended value,
after this commit `%a` is passed to call as a zero extended value,
both are correct as 32-bit sub-register is the same.
The difference comes from `DAGCombiner` operation on the initial DAG:
Initial selection DAG before this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = any_extend t5 <--------------------- (1)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
Initial selection DAG after this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5 <--------------------- (2)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
The node `t9` is processed before node `t6` and `load` instruction is
combined to load with sign extension:
Replacing.1 t9: i64 = sign_extend t5
With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
and 0 other values
Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
With: t31: i32 = truncate t30
and 1 other values
This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from
`DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which
is `any_extend` in (1) and `zero_extend` in (2).
`tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:
- `any_extend` is simply removed
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later
converted to a pair of shifts. This pair of shifts survives till the
end of translation.
Differential Revision: https://reviews.llvm.org/D157870
D137796 made these passes unused.
`opt --bpf-ir-peephole` is specified in one test. Add a `registerPipelineParsingCallback`
so that we can use change the test to use `opt --passes=bpf-ir-peephole` instead.
Follow up to the series:
1. https://reviews.llvm.org/D140161
2. https://reviews.llvm.org/D140349
3. https://reviews.llvm.org/D140331
4. https://reviews.llvm.org/D140323
Completes the work from the previous two for remaining targets.
This creates the following named passes that can be run via
`llc -{start|stop}-{before|after}`:
- arc-isel
- arm-isel
- avr-isel
- bpf-isel
- csky-isel
- hexagon-isel
- lanai-isel
- loongarch-isel
- m68k-isel
- msp430-isel
- mips-isel
- nvptx-isel
- ppc-codegen
- riscv-isel
- sparc-isel
- systemz-isel
- ve-isel
- wasm-isel
- xcore-isel
A nice way to write tests for SelectionDAGISel might be to use a RUN:
line like:
llc -mtriple=<triple> -start-before=<arch>-isel -stop-after=finalize-isel -o -
Fixes: https://github.com/llvm/llvm-project/issues/59538
Reviewed By: asb, zixuan-wu
Differential Revision: https://reviews.llvm.org/D140364
Since opt no longer supports to run default (O0/O1/O2/O3/Os/Oz)
pipelines using the legacy PM, there are no in-tree uses of
TargetMachine::adjustPassManager remaining. This patch removes the
no longer used adjustPassManager functions.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D137796
Paul Chaignon reported a bpf verifier failure ([1]) due to using
non-ABI register R11. For the test case, llvm11 is okay while
llvm12 and later generates verifier unfriendly code.
The failure is related to variable length array size.
The following mimics the variable length array definition
in the test case:
struct t { char a[20]; };
void foo(void *);
int test() {
const int a = 8;
char tmp[AA + sizeof(struct t) + a];
foo(tmp);
...
}
Paul helped bisect that the following llvm commit is
responsible:
552c6c232872 ("PR44406: Follow behavior of array bound constant
folding in more recent versions of GCC.")
Basically, before the above commit, clang frontend did constant
folding for array size "AA + sizeof(struct t) + a" to be 68,
so used alloca for stack allocation. After the above commit,
clang frontend didn't do constant folding for array size
any more, which results in a VLA and llvm.stacksave/llvm.stackrestore
is generated.
BPF architecture API does not support stack pointer (sp) register.
The LLVM internally used R11 to indicate sp register but it should
not be in the final code. Otherwise, kernel verifier will reject it.
The early patch ([2]) tried to fix the issue in clang frontend.
But the upstream discussion considered frontend fix is really a
hack and the backend should properly undo llvm.stacksave/llvm.stackrestore.
This patch implemented a bpf IR phase to remove these intrinsics
unconditionally. If eventually the alloca can be resolved with
constant size, r11 will not be generated. If alloca cannot be
resolved with constant size, SelectionDag will complain, the same
as without this patch.
[1] https://lore.kernel.org/bpf/20210809151202.GB1012999@Mem/
[2] https://reviews.llvm.org/D107882
Differential Revision: https://reviews.llvm.org/D111897
This moves the registry higher in the LLVM library dependency stack.
Every client of the target registry needs to link against MC anyway to
actually use the target, so we might as well move this out of Support.
This allows us to ensure that Support doesn't have includes from MC/*.
Differential Revision: https://reviews.llvm.org/D111454
Pulled out the OptimizationLevel class from PassBuilder in order to be able to access it from within the PassManager and avoid include conflicts.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D107025
Printing pass manager invocations is fairly verbose and not super
useful.
This allows us to remove DebugLogging from pass managers and PassBuilder
since all logging (aside from analysis managers) goes through
instrumentation now.
This has the downside of never being able to print the top level pass
manager via instrumentation, but that seems like a minor downside.
Reviewed By: ychen
Differential Revision: https://reviews.llvm.org/D101797
This patch implemented TTI.IntImmCost() properly.
Each BPF insn has 32bit immediate space, so for any immediate
which can be represented as 32bit signed int, the cost
is technically free. If an int cannot be presented as
a 32bit signed int, a ld_imm64 instruction is needed
and a TCC_Basic is returned.
This change is motivated when we observed that
several bpf selftests failed with latest llvm trunk, e.g.,
#10/16 strobemeta.o:FAIL
#10/17 strobemeta_nounroll1.o:FAIL
#10/18 strobemeta_nounroll2.o:FAIL
#10/19 strobemeta_subprogs.o:FAIL
#96 snprintf_btf:FAIL
The reason of the failure is due to that
SpeculateAroundPHIsPass did aggressive transformation
which alters control flow for which currently verifer
cannot handle well. In llvm12, SpeculateAroundPHIsPass
is not called.
SpeculateAroundPHIsPass relied on TTI.getIntImmCost()
and TTI.getIntImmCostInst() for profitability
analysis. This patch implemented TTI.getIntImmCost()
properly for BPF backend which also prevented
transformation which caused the above test failures.
Differential Revision: https://reviews.llvm.org/D96448
Add an IR phase right before main module optimization.
This is to modify IR to restrict certain downward optimizations
in order to generate verifier friendly code.
> prevent certain instcombine optimizations, handling both
in-block/cross-block instcombines.
> avoid speculative code motion if the variable used in
condition is also used in the later blocks.
Internally, a bpf IR builtin
result = __builtin_bpf_passthrough(seq_num, result)
is used to enforce ordering. This builtin is only used
during target independent IR optimizations and it will
be removed at the beginning of target dependent IR
optimizations.
For example, removing the following workaround,
--- a/tools/testing/selftests/bpf/progs/test_sysctl_loop1.c
+++ b/tools/testing/selftests/bpf/progs/test_sysctl_loop1.c
@@ -47,7 +47,7 @@ int sysctl_tcp_mem(struct bpf_sysctl *ctx)
/* a workaround to prevent compiler from generating
* codes verifier cannot handle yet.
*/
- volatile int ret;
+ int ret;
this patch is able to generate code which passed the verifier.
To disable optimization, users need to use "opt" command like below:
clang -target bpf -O2 -S -emit-llvm -Xclang -disable-llvm-passes test.c
// disable icmp serialization
opt -O2 -bpf-disable-serialize-icmp test.ll | llvm-dis > t.ll
// disable avoid-speculation
opt -O2 -bpf-disable-avoid-speculation test.ll | llvm-dis > t.ll
llc t.ll
Differential Revision: https://reviews.llvm.org/D85570
This involves porting BPFAbstractMemberAccess and BPFPreserveDIType to
NPM, then adding them BPFTargetMachine::registerPassBuilderCallbacks
(the NPM equivalent of adjustPassManager()).
Reviewed By: yonghong-song, asbirlea
Differential Revision: https://reviews.llvm.org/D88855
Move abstractMemberAccess and PreserveDIType passes as early as
possible, right after clang code generation.
Currently, compiler may transform the above code
p1 = llvm.bpf.builtin.preserve.struct.access(base, 0, 0);
p2 = llvm.bpf.builtin.preserve.struct.access(p1, 1, 2);
a = llvm.bpf.builtin.preserve_field_info(p2, EXIST);
if (a) {
p1 = llvm.bpf.builtin.preserve.struct.access(base, 0, 0);
p2 = llvm.bpf.builtin.preserve.struct.access(p1, 1, 2);
bpf_probe_read(buf, buf_size, p2);
}
to
p1 = llvm.bpf.builtin.preserve.struct.access(base, 0, 0);
p2 = llvm.bpf.builtin.preserve.struct.access(p1, 1, 2);
a = llvm.bpf.builtin.preserve_field_info(p2, EXIST);
if (a) {
bpf_probe_read(buf, buf_size, p2);
}
and eventually assembly code looks like
reloc_exist = 1;
reloc_member_offset = 10; //calculate member offset from base
p2 = base + reloc_member_offset;
if (reloc_exist) {
bpf_probe_read(bpf, buf_size, p2);
}
if during libbpf relocation resolution, reloc_exist is actually
resolved to 0 (not exist), reloc_member_offset relocation cannot
be resolved and will be patched with illegal instruction.
This will cause verifier failure.
This patch attempts to address this issue by do chaining
analysis and replace chains with special globals right
after clang code gen. This will remove the cse possibility
described in the above. The IR typically looks like
%6 = load @llvm.sk_buff:0:50$0:0:0:2:0
%7 = bitcast %struct.sk_buff* %2 to i8*
%8 = getelementptr i8, i8* %7, %6
for a particular address computation relocation.
But this transformation has another consequence, code sinking
may happen like below:
PHI = <possibly different @preserve_*_access_globals>
%7 = bitcast %struct.sk_buff* %2 to i8*
%8 = getelementptr i8, i8* %7, %6
For such cases, we will not able to generate relocations since
multiple relocations are merged into one.
This patch introduced a passthrough builtin
to prevent such optimization. Looks like inline assembly has more
impact for optimizaiton, e.g., inlining. Using passthrough has
less impact on optimizations.
A new IR pass is introduced at the beginning of target-dependent
IR optimization, which does:
- report fatal error if any reloc global in PHI nodes
- remove all bpf passthrough builtin functions
Changes for existing CORE tests:
- for clang tests, add "-Xclang -disable-llvm-passes" flags to
avoid builtin->reloc_global transformation so the test is still
able to check correctness for clang generated IR.
- for llvm CodeGen/BPF tests, add "opt -O2 <ir_file> | llvm-dis" command
before "llc" command since "opt" is needed to call newly-placed
builtin->reloc_global transformation. Add target triple in the IR
file since "opt" requires it.
- Since target triple is added in IR file, if a test may produce
different results for different endianness, two tests will be
created, one for bpfeb and another for bpfel, e.g., some tests
for relocation of lshift/rshift of bitfields.
- field-reloc-bitfield-1.ll has different relocations compared to
old codes. This is because for the structure in the test,
new code returns struct layout alignment 4 while old code
is 8. Align 8 is more precise and permits double load. With align 4,
the new mechanism uses 4-byte load, so generating different
relocations.
- test intrinsic-transforms.ll is removed. This is used to test
cse on intrinsics so we do not lose metadata. Now metadata is attached
to global and not instruction, it won't get lost with cse.
Differential Revision: https://reviews.llvm.org/D87153
The following bpf linux kernel selftest failed with latest
llvm:
$ ./test_progs -n 7/10
...
The sequence of 8193 jumps is too complex.
verification time 126272 usec
stack depth 320
processed 114799 insns (limit 1000000)
...
libbpf: failed to load object 'pyperf600_nounroll.o'
test_bpf_verif_scale:FAIL:110
#7/10 pyperf600_nounroll.o:FAIL
#7 bpf_verif_scale:FAIL
After some investigation, I found the following llvm patch
https://reviews.llvm.org/D84108
is responsible. The patch disabled hoisting common instructions
in SimplifyCFG by default. Later on, the code changes and a
SimplifyCFG phase with hoisting on cannot do the work any more.
A test is provided to demonstrate the problem.
The IR before simplifyCFG looks like:
for.cond:
%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
%cmp = icmp ult i32 %i.0, 6
br i1 %cmp, label %for.body, label %for.cond.cleanup
for.cond.cleanup:
%2 = load i8*, i8** %frame_ptr, align 8, !tbaa !2
%cmp2 = icmp eq i8* %2, null
%conv = zext i1 %cmp2 to i32
call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %1) #3
call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %0) #3
ret i32 %conv
for.body:
%3 = load i8*, i8** %frame_ptr, align 8, !tbaa !2
%tobool.not = icmp eq i8* %3, null
br i1 %tobool.not, label %for.inc, label %land.lhs.true
The first two insns of `for.cond.cleanup` and `for.body`, load and
icmp, can be hoisted to `for.cond` block. With Patch D84108, the
optimization is delayed. But unfortunately, later on loop rotation
added addition phi nodes to `for.body` and hoisting cannot
be done any more.
Note such a hoisting is beneficial to bpf programs as
bpf verifier does path sensitive analysis and verification.
The hoisting preverts reloading from stack which will assume
conservative value and increase exploited insns. In this case,
it caused verifier failure.
To fix this problem, I added an IR pass from bpf target
to performance additional simplifycfg with hoisting common inst
enabled.
Differential Revision: https://reviews.llvm.org/D85434
The builtin function
u32 btf_type_id = __builtin_btf_type_id(param, 0)
can help preserve type info for the following use case:
extern void foo(..., void *data, int size);
int test(...) {
struct t { int a; int b; int c; } d;
d.a = ...; d.b = ...; d.c = ...;
foo(..., &d, sizeof(d));
}
The function "foo" in the above only see raw data and does not
know what type of the data is. In certain cases, e.g., logging,
the additional type information will help pretty print.
This patch handles the builtin in BPF backend. It includes
an IR pass to translate the IR intrinsic to a load of
a global variable which carries the metadata, and an MI
pass to remove the intermediate load of the global variable.
Finally, in AsmPrinter pass, proper instruction are generated.
In the above example, the second argument for __builtin_btf_type_id()
is 0, which means a relocation for local adjustment,
i.e., w.r.t. bpf program BTF change, will be generated.
The value 1 for the second argument means
a relocation for remote adjustment, e.g., against vmlinux.
Differential Revision: https://reviews.llvm.org/D74572