There are a few places where `arena` name is used for pointers in
non-zero address space in BPF backend, rename these to use a more
generic `address_space`:
- macro `__BPF_FEATURE_ARENA_CAST` -> `__BPF_FEATURE_ADDR_SPACE_CAST
- name for arena global variables section `.arena.N` ->
`.addr_space.N`
This commit aims to support BPF arena kernel side
[feature](https://lore.kernel.org/bpf/20240209040608.98927-1-alexei.starovoitov@gmail.com/):
- arena is a memory region accessible from both BPF program and
userspace;
- base pointers for this memory region differ between kernel and user
spaces;
- `dst_reg = addr_space_cast(src_reg, dst_addr_space, src_addr_space)`
translates src_reg, a pointer in src_addr_space to dst_reg, equivalent
pointer in dst_addr_space, {src,dst}_addr_space are immediate constants;
- number 0 is assigned to kernel address space;
- number 1 is assigned to user address space.
On the LLVM side, the goal is to make load and store operations on arena
pointers "transparent" for BPF programs:
- assume that pointers with non-zero address space are pointers to
arena memory;
- assume that arena is identified by address space number;
- assume that address space zero corresponds to kernel address space;
- assume that every BPF-side load or store from arena is done via
pointer in user address space, thus convert base pointers using
`addr_space_cast(src_reg, 0, 1)`;
Only load, store, cmpxchg and atomicrmw IR instructions are handled by
this transformation.
For example, the following C code:
```c
#define __as __attribute__((address_space(1)))
void copy(int __as *from, int __as *to) { *to = *from; }
```
Compiled to the following IR:
```llvm
define void @copy(ptr addrspace(1) %from, ptr addrspace(1) %to) {
entry:
%0 = load i32, ptr addrspace(1) %from, align 4
store i32 %0, ptr addrspace(1) %to, align 4
ret void
}
```
Is transformed to:
```llvm
%to2 = addrspacecast ptr addrspace(1) %to to ptr ;; !
%from1 = addrspacecast ptr addrspace(1) %from to ptr ;; !
%0 = load i32, ptr %from1, align 4, !tbaa !3
store i32 %0, ptr %to2, align 4, !tbaa !3
ret void
```
And compiled as:
```asm
r2 = addr_space_cast(r2, 0, 1)
r1 = addr_space_cast(r1, 0, 1)
r1 = *(u32 *)(r1 + 0)
*(u32 *)(r2 + 0) = r1
exit
```
Co-authored-by: Eduard Zingerman <eddyz87@gmail.com>
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.
This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.
The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.
Fixes https://github.com/llvm/llvm-project/issues/69841.
Targets affected:
- NVPTX and BPF: set to 64 bits.
- ARC, Lanai, and MSP430: set to 0 (they don't implement atomics).
Those which didn't yet add AtomicExpandPass to their pass pipeline now
do so.
This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass. On all these targets, this
now matches what Clang already does in the frontend.
The only targets which do not configure AtomicExpandPass now are:
- DirectX and SPIRV: they aren't normal backends.
- AVR: a single-cpu architecture with no privileged/user divide, which
could implement all atomics by disabling/enabling interrupts, regardless
of size/alignment. Will be addressed by future work.
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6
Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|
The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
Currently on mcpu=v3 we do not support sdiv, srem instructions. And the
backend crashes with stacktrace & coredump, which is misleading for end
users, as this is not a "bug"
Add llvm bug reporting for sdiv/srem on ISel legalize-op phase.
For clang frontend we can get detailed location & bug report.
$ build/bin/clang -g -target bpf -c local/sdiv.c
local/sdiv.c:1:35: error: unsupported signed division, please convert to
unsigned div/mod.
1 | int sdiv(int a, int b) { return a / b; }
| ^
1 error generated.
Fixes: #70433Fixes: #48647
This also improves error handling for dynamic stack allocation:
local/vla.c:2:3: error: unsupported dynamic stack allocation
2 | int b[n];
| ^
1 error generated.
Fixes: https://github.com/llvm/llvm-project/issues/57171
This commit adds a new BPF specific structure attribte
`__attribute__((preserve_static_offset))` and a pass to deal with it.
This attribute may be attached to a struct or union declaration, where
it notifies the compiler that this structure is a "context" structure.
The following limitations apply to context structures:
- runtime environment might patch access to the fields of this type by
updating the field offset;
BPF verifier limits access patterns allowed for certain data
types. E.g. `struct __sk_buff` and `struct bpf_sock_ops`. For these
types only `LD/ST <reg> <static-offset>` memory loads and stores are
allowed.
This is so because offsets of the fields of these structures do not
match real offsets in the running kernel. During BPF program
load/verification loads and stores to the fields of these types are
rewritten so that offsets match real offsets. For this rewrite to
happen static offsets have to be encoded in the instructions.
See `kernel/bpf/verifier.c:convert_ctx_access` function in the Linux
kernel source tree for details.
- runtime environment might disallow access to the field of the type
through modified pointers.
During BPF program verification a tag `PTR_TO_CTX` is tracked for
register values. In case if register with such tag is modified BPF
programs are not allowed to read or write memory using register. See
kernel/bpf/verifier.c:check_mem_access function in the Linux kernel
source tree for details.
Access to the structure fields is translated to IR as a sequence:
- `(load (getelementptr %ptr %offset))` or
- `(store (getelementptr %ptr %offset))`
During instruction selection phase such sequences are translated as a
single load instruction with embedded offset, e.g. `LDW %ptr, %offset`,
which matches access pattern necessary for the restricted
set of types described above (when `%offset` is static).
Multiple optimizer passes might separate these instructions, this
includes:
- SimplifyCFGPass (sinking)
- InstCombine (sinking)
- GVN (hoisting)
The `preserve_static_offset` attribute marks structures for which the
following transformations happen:
- at the early IR processing stage:
- `(load (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.load`;
- `(store (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.store`;
- at the late IR processing stage this modification is undone.
Such handling prevents various optimizer passes from generating
sequences of instructions that would be rejected by BPF verifier.
The __attribute__((preserve_static_offset)) has a priority over
__attribute__((preserve_access_index)). When preserve_access_index
attribute is present preserve access index transformations are not
applied.
This addresses the issue reported by the following thread:
https://lore.kernel.org/bpf/CAA-VZPmxh8o8EBcJ=m-DH4ytcxDFmo0JKsm1p1gf40kS0CE3NQ@mail.gmail.com/T/#m4b9ce2ce73b34f34172328f975235fc6f19841b6
This is a second attempt to commit this change, previous reverted
commit is: cb13e9286b6d4e384b5d4203e853d44e2eff0f0f.
The following items had been fixed:
- test case bpf-preserve-static-offset-bitfield.c now uses
`-triple bpfel` to avoid different codegen for little/big endian
targets.
- BPFPreserveStaticOffset.cpp:removePAICalls() modified to avoid
use after free for `WorkList` elements `V`.
Differential Revision: https://reviews.llvm.org/D133361
This commit adds a new BPF specific structure attribte
`__attribute__((preserve_static_offset))` and a pass to deal with it.
This attribute may be attached to a struct or union declaration, where
it notifies the compiler that this structure is a "context" structure.
The following limitations apply to context structures:
- runtime environment might patch access to the fields of this type by
updating the field offset;
BPF verifier limits access patterns allowed for certain data
types. E.g. `struct __sk_buff` and `struct bpf_sock_ops`. For these
types only `LD/ST <reg> <static-offset>` memory loads and stores are
allowed.
This is so because offsets of the fields of these structures do not
match real offsets in the running kernel. During BPF program
load/verification loads and stores to the fields of these types are
rewritten so that offsets match real offsets. For this rewrite to
happen static offsets have to be encoded in the instructions.
See `kernel/bpf/verifier.c:convert_ctx_access` function in the Linux
kernel source tree for details.
- runtime environment might disallow access to the field of the type
through modified pointers.
During BPF program verification a tag `PTR_TO_CTX` is tracked for
register values. In case if register with such tag is modified BPF
programs are not allowed to read or write memory using register. See
kernel/bpf/verifier.c:check_mem_access function in the Linux kernel
source tree for details.
Access to the structure fields is translated to IR as a sequence:
- `(load (getelementptr %ptr %offset))` or
- `(store (getelementptr %ptr %offset))`
During instruction selection phase such sequences are translated as a
single load instruction with embedded offset, e.g. `LDW %ptr, %offset`,
which matches access pattern necessary for the restricted
set of types described above (when `%offset` is static).
Multiple optimizer passes might separate these instructions, this
includes:
- SimplifyCFGPass (sinking)
- InstCombine (sinking)
- GVN (hoisting)
The `preserve_static_offset` attribute marks structures for which the
following transformations happen:
- at the early IR processing stage:
- `(load (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.load`;
- `(store (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.store`;
- at the late IR processing stage this modification is undone.
Such handling prevents various optimizer passes from generating
sequences of instructions that would be rejected by BPF verifier.
The __attribute__((preserve_static_offset)) has a priority over
__attribute__((preserve_access_index)). When preserve_access_index
attribute is present preserve access index transformations are not
applied.
This addresses the issue reported by the following thread:
https://lore.kernel.org/bpf/CAA-VZPmxh8o8EBcJ=m-DH4ytcxDFmo0JKsm1p1gf40kS0CE3NQ@mail.gmail.com/T/#m4b9ce2ce73b34f34172328f975235fc6f19841b6
Differential Revision: https://reviews.llvm.org/D133361
As far as I can tell, there's nothing in this code which actually
assumes the two predicates in (FoundLHS FoundPred FoundRHS) => (LHS Pred
RHS) are the same.
Noticed while investigating something else, this is purely an
oppurtunistic optimization while I'm looking at the code. Unfortunately,
this doesn't solve my original problem. :)
BPF upstream reported an inconsistent behavior w.r.t. BPF_TYPE_ID_LOCAL
vs. BPF_TYPE_ID_TARGET (or BPF_TYPE_ID_REMOTE in LLVM terminology).
For BPF_TYPE_ID_TARGET, all modifiers (like 'const' and 'volatile') are
ignored in the final type encoding. For example, for type
'const struct foo', the eventually encoding in BTF relocation
is 'struct foo'. This faciliates libbpf to match corresponding kernel
types with considering any modifiers.
Currently behavior for BPF_TYPE_ID_LOCAL is different. It will encode
'const struct foo' in BTF relocation and such discrepancy confused users
([1]).
This patch fixed this discrepancy by making BPF_TYPE_ID_LOCAL BTF type
representation the sams as BPF_TYPE_ID_TARGET. This should have minimum
user impact since ultimately user wants to get a real time not a 'const'
type modifier.
The selftest builtin-btf-type-id-2.ll is used to test BPF_TYPE_ID_TARGET
with 'const' modifier. Adapt the same test for BPF_TYPE_ID_LOCAL. And
the below diff shows now both BPF_TYPE_ID_LOCAL and BPF_TYPE_ID_TARGET
produces the same type:
$ diff test/CodeGen/BPF/BTF/builtin-btf-type-id-2.ll
test/CodeGen/BPF/BTF/builtin-btf-type-id-local.ll
--- test/CodeGen/BPF/BTF/builtin-btf-type-id-2.ll 2023-07-30
16:58:20.657528310 -0700
+++ test/CodeGen/BPF/BTF/builtin-btf-type-id-local.ll 2023-11-02
10:23:25.356959008 -0700
@@ -6,7 +6,7 @@
; int a;
; };
; int test(void) {
-; return __builtin_btf_type_id(*(const struct s *)0, 1);
+; return __builtin_btf_type_id(*(const struct s *)0, 0);
; }
; Compilation flag:
; clang -target bpf -O2 -g -S -emit-llvm -Xclang -disable-llvm-passes
test.c
$
[1]
https://lore.kernel.org/bpf/CAN+4W8h3yDjkOLJPiuKVKTpj_08pBz8ke6vN=Lf8gcA=iYBM-g@mail.gmail.com/
Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
As usual, making it easier for an upcoming test delta to be seen.
Note that several of these are examples of extremely bad testing practice.
Checking internal debug output (for no real purpose), and checking the
result of a fully O2 + llc run instead of reducing the specific problematic
pass.
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.
One thing that is not currently possible to differentiate from a missing
datalayout `target datalayout = ""` in the IR file since the current
APIs don't allow detecting this case. If it is considered useful to
support this case (instead of passing "-data-layout=" on the command
line), I can change IR parsers to track whether they have seen such a
directive and change the callback type.
Differential Revision: https://reviews.llvm.org/D141060
Use BatchAA with EarliestEscapeInfo instead of callCapturesBefore() in
MemDepAnalysis. The advantage of this is that it will also take
not-captured-before information into account for non-calls (see
test_store_before_capture for a representative example), and that this
is a cached analysis. The disadvantage is that EII is slightly less
precise than full CapturedBefore analysis.
In practice the impact is positive, with gvn.NumGVNLoad going from 22022
to 22808 on test-suite.
The impact to compile-time is also positive, mainly in the ThinLTO
configuration.
Reapply now that generation of incorrect debuginfo for FnDef
in rustc has been fixed.
-----
Add a check that the DILocalVariable fragment size in dbg.declare
does not exceed the size of the alloca.
This would have caught the invalid debuginfo regenerated by rustc
in https://github.com/llvm/llvm-project/issues/64149.
Differential Revision: https://reviews.llvm.org/D158743
This reverts commit 47324cfd7d8ca1a2a5cbb9f948ecff66a28ee6bc.
This exposed incorrect debuginfo in rustc. Revert the verification
until this has been fixed.
Reapply after fixing a clang bug this exposed in D158972 and
adjusting a number of tests that failed for 32-bit targets.
-----
Add a check that the DILocalVariable fragment size in dbg.declare
does not exceed the size of the alloca.
This would have caught the invalid debuginfo regenerated by rustc
in https://github.com/llvm/llvm-project/issues/64149.
Differential Revision: https://reviews.llvm.org/D158743
The issue is uncovered by #47698: for IR files without a target triple,
-mtriple= specifies the full target triple while -march= merely sets the
architecture part of the default target triple, leaving a target triple which
may not make sense, e.g. riscv64-apple-darwin.
Therefore, -march= is error-prone and not recommended for tests without a target
triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead
of rejecting it outrightly.
Replace `BPFMIPeepholeTruncElim` by adding an overload for
`TargetLowering::isZExtFree()` aware that zero extension is
free for `ISD::LOAD`.
Short description
=================
The `BPFMIPeepholeTruncElim` handles two patterns:
Pattern #1:
%1 = LDB %0, ... %1 = LDB %0, ...
%2 = AND_ri %1, 0xff -> %2 = MOV_ri %1 <-- (!)
Pattern #2:
bb.1: bb.1:
%a = LDB %0, ... %a = LDB %0, ...
br %bb3 br %bb3
bb.2: bb.2:
%b = LDB %0, ... -> %b = LDB %0, ...
br %bb3 br %bb3
bb.3: bb.3:
%1 = PHI %a, %b %1 = PHI %a, %b
%2 = AND_ri %1, 0xff %2 = MOV_ri %1 <-- (!)
Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32
Both patterns could be handled by built-in transformations at
instruction selection phase if suitable `isZExtFree()` implementation
is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.
When evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM
test cases this revisions performs slightly better than
BPFMIPeepholeTruncElim, see "Impact" section below for details.
Commit also adds a few test cases to make sure that patterns in
question are handled.
Long description
================
Why this works: Pattern #1
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:
...
Type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
t19: i64 = and t16, Constant:i64<255>
t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.1 t19: i64 = and t16, Constant:i64<255>
With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Optimized type-legalized selection DAG:
- `t19 = and t16, 255` had been replaced by `t16` (load).
- Patterns like `(and (load ... i8), 255)` are replaced by `load`
in `DAGCombiner::BackwardsPropagateMask` called from
`DAGCombiner::visitAND`.
- Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by
`(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge,
look for `TLI.shouldFoldConstantShiftPairToMask()` call).
Why this works: Pattern #2
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
br label %next
next:
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command.
Log for first basic block:
Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 9 nodes:
t0: ch,glue = EntryToken
t3: i64 = Constant<0>
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5
t8: ch = CopyToReg t0, Register:i64 %0, t6
...
Replacing.1 t6: i64 = zero_extend t5
With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 7 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t8: ch = CopyToReg t0, Register:i64 %0, t9
Note:
- Initial selection DAG:
- `%a = load ...` is lowered as `t6 = (zero_extend (load ...))`
w/o special `isZExtFree()` overload added by this commit
it is instead lowered as `t6 = (any_extend (load ...))`.
- The decision to generate `zero_extend` or `any_extend` is
done in `RegsForValue::getCopyToRegs` called from
`SelectionDAGBuilder::CopyValueToVirtualRegister`:
- if `isZExtFree()` for load returns true `zero_extend` is used;
- `any_extend` is used otherwise.
- Optimized lowered selection DAG:
- `t6 = (any_extend (load ...))` is replaced by
`t9 = load ..., zext from i8`
This is done by `DagCombiner.cpp:tryToFoldExtOfLoad()` called from
`DAGCombiner::visitZERO_EXTEND`.
Log for second basic block:
Initial selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t5: i8 = truncate t4
t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
t9: i64 = any_extend t8
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.2 t18: i64 = and t4, Constant:i64<255>
With: t4: i64 = AssertZext t2, ValueType:ch:i8
...
Type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t18: i64 = and t4, Constant:i64<255>
t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Optimized type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Initial selection DAG:
- `t0` is an input value for this basic block, it corresponds load
instruction (`t9`) from the first basic block.
- It is accessed within basic block via
`t4` (AssertZext (CopyFromReg t0, ...)).
- The `AssertZext` is generated by RegsForValue::getCopyFromRegs
called from SelectionDAGBuilder::getCopyFromRegs, it is generated
only when `LiveOutInfo` with known number of leading zeros is
present for `t0`.
- Known register bits in `LiveOutInfo` are computed by
`SelectionDAG::computeKnownBits` called from
`SelectionDAGISel::ComputeLiveOutVRegInfo`.
- `computeKnownBits()` generates leading zeros information for
`(load ..., zext from ...)` but *does not* generate leading zeros
information for `(load ..., anyext from ...)`.
This is why `isZExtFree()` added in this commit is important.
- Type-legalized selection DAG:
- `t5 = truncate t4` is replaced by `t18 = and t4, 255`
- Optimized type-legalized selection DAG:
- `t18 = and t4, 255` is replaced by `t4`, this is done by
`DAGCombiner::SimplifyDemandedBits` called from
`DAGCombiner::visitAND`, which simplifies patterns like
`(and (assertzext ...))`
Impact
------
This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and
remove_truncate_8 improved slightly, for other tests it is
unchanged.
For remove_truncate_7:
Before this revision After this revision
-------------------- -------------------
r1 <<= 0x20 r1 <<= 0x20
r1 >>= 0x20 r1 >>= 0x20
if r1 == 0x0 goto +0x2 <LBB0_2> if r1 == 0x0 goto +0x2 <LBB0_2>
r1 = *(u32 *)(r2 + 0x0) r0 = *(u32 *)(r2 + 0x0)
goto +0x1 <LBB0_3> goto +0x1 <LBB0_3>
<LBB0_2>: <LBB0_2>:
r1 = *(u32 *)(r2 + 0x4) r0 = *(u32 *)(r2 + 0x4)
<LBB0_3>: <LBB0_3>:
r0 = r1 exit
exit
For remove_truncate_8:
Before this revision After this revision
-------------------- -------------------
r2 = *(u32 *)(r1 + 0x0) r2 = *(u32 *)(r1 + 0x0)
r3 = r2 r3 = r2
r3 <<= 0x20 r3 <<= 0x20
r4 = r3 r3 s>>= 0x20
r4 s>>= 0x20
if r4 s> 0x2 goto +0x5 <LBB0_3> if r3 s> 0x2 goto +0x4 <LBB0_3>
r4 = *(u32 *)(r1 + 0x4) r3 = *(u32 *)(r1 + 0x4)
r3 >>= 0x20
if r3 >= r4 goto +0x2 <LBB0_3> if r2 >= r3 goto +0x2 <LBB0_3>
r2 += 0x2 r2 += 0x2
*(u32 *)(r1 + 0x0) = r2 *(u32 *)(r1 + 0x0) = r2
<LBB0_3>: <LBB0_3>:
r0 = 0x3 r0 = 0x3
exit exit
For kernel BPF selftests statistics is as follows: (-mcpu=v4):
- For -mcpu=v4: 9 out of 655 object files have differences,
in all cases total number of instructions marginally decreased
(-27 instructions).
- For -mcpu=v2: 9 out of 655 object files have differences:
- For 19 object files number of instruction decreased
(-129 instruction in total): some redundant `rX &= 0xffff`
and register to register assignments removed;
- For 2 object files number of instructions increased +2
instructions in each file.
Both -mcpu=v2 instruction increases could be reduced to the same
example:
define void @foo(ptr %p) {
entry:
%a = load i32, ptr %p, align 4
%b = sext i32 %a to i64
%c = icmp ult i64 1, %b
br i1 %c, label %next, label %end
next:
call void inttoptr (i64 62 to ptr)(i32 %a)
br label %end
end:
ret void
}
Note that this example uses value loaded to `%a` both as a sign
extended (`%b`) and as zero extended (`%a` passed as parameter).
Here is the difference in final assembly code:
Before this revision After this revision
-------------------- -------------------
r1 = *(u32 *)(r1 + 0) r1 = *(u32 *)(r1 + 0)
r1 <<= 32 r1 <<= 32
r1 s>>= 32 r1 s>>= 32
if r1 < 2 goto <LBB0_2> if r1 < 2 goto <LBB0_2>
r1 <<= 32
r1 >>= 32
call 62 call 62
<LBB0_2>: <LBB0_2>:
exit exit
Before this commit `%a` is passed to call as a sign extended value,
after this commit `%a` is passed to call as a zero extended value,
both are correct as 32-bit sub-register is the same.
The difference comes from `DAGCombiner` operation on the initial DAG:
Initial selection DAG before this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = any_extend t5 <--------------------- (1)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
Initial selection DAG after this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5 <--------------------- (2)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
The node `t9` is processed before node `t6` and `load` instruction is
combined to load with sign extension:
Replacing.1 t9: i64 = sign_extend t5
With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
and 0 other values
Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
With: t31: i32 = truncate t30
and 1 other values
This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from
`DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which
is `any_extend` in (1) and `zero_extend` in (2).
`tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:
- `any_extend` is simply removed
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later
converted to a pair of shifts. This pair of shifts survives till the
end of translation.
Differential Revision: https://reviews.llvm.org/D157870
Generate store immediate instruction when CPUv4 is enabled.
For example:
$ cat test.c
struct foo {
unsigned char b;
unsigned short h;
unsigned int w;
unsigned long d;
};
void bar(volatile struct foo *p) {
p->b = 1;
p->h = 2;
p->w = 3;
p->d = 4;
}
$ clang -O2 --target=bpf -mcpu=v4 test.c -c -o - | llvm-objdump -d -
...
0000000000000000 <bar>:
0: 72 01 00 00 01 00 00 00 *(u8 *)(r1 + 0x0) = 0x1
1: 6a 01 02 00 02 00 00 00 *(u16 *)(r1 + 0x2) = 0x2
2: 62 01 04 00 03 00 00 00 *(u32 *)(r1 + 0x4) = 0x3
3: 7a 01 08 00 04 00 00 00 *(u64 *)(r1 + 0x8) = 0x4
4: 95 00 00 00 00 00 00 00 exit
Take special care to:
- apply `BPFMISimplifyPatchable::checkADDrr` rewrite for BPF_ST
- validate immediate value when BPF_ST write is 64-bit:
BPF interprets `(BPF_ST | BPF_MEM | BPF_DW)` writes as writes with
sign extension. Thus it is fine to generate such write when
immediate is -1, but it is incorrect to generate such write when
immediate is +0xffff_ffff.
This commit was previously reverted in e66affa17e32.
The reason for revert was an unrelated bug in BPF backend,
triggered by test case added in this commit if LLVM is built
with LLVM_ENABLE_EXPENSIVE_CHECKS.
The bug was fixed in D157806.
Differential Revision: https://reviews.llvm.org/D140804
When LLVM is build with `LLVM_ENABLE_EXPENSIVE_CHECKS=ON` option the
following C code snippet:
struct t {
int a;
} __attribute__((preserve_access_index));
void test(struct t *t) {
t->a = 42;
}
Causes an assertion:
$ clang -g -O2 -c --target=bpf -mcpu=v2 t.c -o /dev/null
Function Live Ins: $r1 in %0
bb.0.entry:
liveins: $r1
DBG_VALUE $r1, $noreg, !"t", ...
%0:gpr = COPY $r1
DBG_VALUE %0:gpr, $noreg, !"t", ...
%1:gpr = LD_imm64 @"llvm.t:0:0$0:0"
%3:gpr = ADD_rr %0:gpr(tied-def 0), killed %1:gpr
%4:gpr = MOV_ri 42
CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...
RET debug-location !25; t.c:7:1
*** Bad machine code: Explicit definition marked as use ***
- function: test
- basic block: %bb.0 entry (0x6210000d8a90)
- instruction: CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...
- operand 0: killed %4:gpr
This happens because `CORE_MEM` instruction is defined to have output
operands:
def CORE_MEM : TYPE_LD_ST<BPF_MEM.Value, BPF_W.Value,
(outs GPR:$dst),
(ins u64imm:$opcode, GPR:$src, u64imm:$offset),
"$dst = core_mem($opcode, $src, $offset)",
[]>;
As documented in [1]:
> By convention, the LLVM code generator orders instruction operands
> so that all register definitions come before the register uses, even
> on architectures that are normally printed in other orders.
In other words, the first argument for `CORE_MEM` is considered to be
a "def", while in reality it is "use":
%1:gpr = LD_imm64 @"llvm.t:0:0$0:0"
%3:gpr = ADD_rr %0:gpr(tied-def 0), killed %1:gpr
%4:gpr = MOV_ri 42
'---------------.
v
CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...
Here is how `CORE_MEM` is constructed in
`BPFMISimplifyPatchable::checkADDrr()`:
BuildMI(*DefInst->getParent(), *DefInst, DefInst->getDebugLoc(), TII->get(COREOp))
.add(DefInst->getOperand(0)).addImm(Opcode).add(*BaseOp)
.addGlobalAddress(GVal);
Note that first operand is constructed as `.add(DefInst->getOperand(0))`.
For `LD{D,W,H,B}` instructions the `DefInst->getOperand(0)` is a
destination register of a load, so instruction is constructed in
accordance with `outs` declaration.
For `ST{D,W,H,B}` instructions the `DefInst->getOperand(0)` is a
source register of a store (value to be stored), so instruction
violates the `outs` declaration.
This commit fixes the issue by splitting `CORE_MEM` in three
instructions: `CORE_ST`, `CORE_LD64`, `CORE_LD32` with correct `outs`
specifications.
[1] https://llvm.org/docs/CodeGenerator.html#the-machineinstr-class
Differential Revision: https://reviews.llvm.org/D157806
When compiling Rust code we may end up with calls to functions provided
by other code units. Presently this code crashes on a null pointer
dereference - this patch avoids that crash and adds a test.
Reviewed By: ast
Differential Revision: https://reviews.llvm.org/D156446
This patch contains a number of uncontroversial changes:
- Replace all uses of
`errs`, `assert`, `llvm_unreachable` with `report_fatal_error` with
informative error strings.
- Replace calls to `fail` in loops with at most one call per error
instance. Previously a function with 19 arguments would log "too many
args" 14 times. This was not helpful.
- Change one `if (..) switch ...` to `if (..) { switch ...`. The added
brace is consistent with a near-identical switch immediately above.
- Elide one `SDValue` copy by using a reference rather than value. This
is consistent with a variable declared immediately before it.
Reviewed By: yonghong-song
Differential Revision: https://reviews.llvm.org/D156136
In [1], a few new insns are proposed to expand BPF ISA to
. fixing the limitation of existing insn (e.g., 16bit jmp offset)
. adding new insns which may improve code quality
(sign_ext_ld, sign_ext_mov, st)
. feature complete (sdiv, smod)
. better user experience (bswap)
This patch implemented insn encoding for
. sign-extended load
. sign-extended mov
. sdiv/smod
. bswap insns
. unconditional jump with 32bit offset
The new bswap insns are generated under cpu=v4 for __builtin_bswap.
For cpu=v3 or earlier, for __builtin_bswap, be or le insns are generated
which is not intuitive for the user.
To support 32-bit branch offset, a 32-bit ja (JMPL) insn is implemented.
For conditional branch which is beyond 16-bit offset, llvm will do
some transformation 'cond_jmp' -> 'cond_jmp + jmpl' to simulate 32bit
conditional jmp. See BPFMIPeephole.cpp for details. The algorithm is
hueristic based. I have tested bpf selftest pyperf600 with unroll account
600 which can indeed generate 32-bit jump insn, e.g.,
13: 06 00 00 00 9b cd 00 00 gotol +0xcd9b <LBB0_6619>
Eduard is working on to add 'st' insn to cpu=v4.
A list of llc flags:
disable-ldsx, disable-movsx, disable-bswap,
disable-sdiv-smod, disable-gotol
can be used to disable a particular insn for cpu v4.
For example, user can do:
llc -march=bpf -mcpu=v4 -disable-movsx t.ll
to enable cpu v4 without movsx insns.
References:
[1] https://lore.kernel.org/bpf/4bfe98be-5333-1c7e-2f6d-42486c8ec039@meta.com/
Differential Revision: https://reviews.llvm.org/D144829
Extended BPFCheckAndAdjustIR pass with sinkMinMax() transformation
that undoes LICM hoistMinMax pass.
The undo transformation converts the following patterns:
x < min(a, b) -> x < a && x < b
x > min(a, b) -> x > a || x > b
x < max(a, b) -> x < a || x < b
x > max(a, b) -> x > a && x > b
Where 'a' or 'b' is a constant.
Also supports `sext min(...) ...` and `zext min(...) ...`.
~~~
This was previously commited as 09feee559a29 and reverted in
0bf9bfeacc8c because of the testbot memory leak report:
https://lab.llvm.org/buildbot/#/builders/5/builds/34931
The memory leak issue was caused by incorrect instruction removal
sequence in skinMinMaxBB():
I->dropAllReferences(); --------> I->eraseFromParent();
I->removeFromParent(); fixed to
Differential Revision: https://reviews.llvm.org/D147990
Extended BPFCheckAndAdjustIR pass with sinkMinMax() transformation
that undoes LICM hoistMinMax pass.
The undo transformation converts the following patterns:
x < min(a, b) -> x < a && x < b
x > min(a, b) -> x > a || x > b
x < max(a, b) -> x < a || x < b
x > max(a, b) -> x > a && x > b
Where 'a' or 'b' is a constant.
Also supports `sext min(...) ...` and `zext min(...) ...`.
Differential Revision: https://reviews.llvm.org/D147990
`NoMerge` attribute on machine instructions prevents certain
transformations from merging these instructions.
One of such transformations is 'llvm/lib/CodeGen/BranchFolding.cpp'.
This attribute should be copied from IR `call` instructions to machine
level instructions. See `X86TargetLowering::LowerCall` as another
example.
Differential Revision: https://reviews.llvm.org/D152987
D137796 made these passes unused.
`opt --bpf-ir-peephole` is specified in one test. Add a `registerPipelineParsingCallback`
so that we can use change the test to use `opt --passes=bpf-ir-peephole` instead.
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
`BPF.td` is used to generate (among other things) `MCSubtargetInfo`
setup function for BPF target.
Specifically, the `BPFGenSubtargetInfo.inc` file:
enum {
ALU32 = 0,
...
};
...
extern const llvm::SubtargetSubTypeKV BPFSubTypeKV[] = {
{ "generic", { { { 0x0ULL, ... } } }, ... },
{ "probe", { { { 0x0ULL, ... } } }, ... },
{ "v1", { { { 0x0ULL, ... } } }, ... },
{ "v2", { { { 0x0ULL, ... } } }, ... },
{ "v3", { { { 0x1ULL, ... } } }, ... },
};
...
static inline MCSubtargetInfo *createBPFMCSubtargetInfoImpl(...) {
return new BPFGenMCSubtargetInfo(..., BPFSubTypeKV, ...);
}
The `SubtargetSubTypeKV` is defined in `MCSubtargetInfo.h` as:
/// Used to provide key value pairs for feature and CPU bit flags.
struct SubtargetSubTypeKV {
const char *Key; ///< K-V key string
FeatureBitArray Implies; ///< K-V bit mask
FeatureBitArray TuneImplies; ///< K-V bit mask
const MCSchedModel *SchedModel;
...
}
The first bit array specifies features enabled by default for a
specific CPU. This commit makes sure that this information is
communicated to `tablegen` and correct `BPFSubTypeKV` table is
generated. This allows tools like `objdump` to detect available
features when `--mcpu` flag is specified.
Differential Revision: https://reviews.llvm.org/D148037
Fixes BPF assembler parsing errors for the following instructions:
- atomic_fetch_add
- atomic_fetch_and
- atomic_fetch_xor
- atomic_fetch_or
- cmpxchg32_32
- cmpxchg_64
- xchg32_32
- xchg_64
Also add a test to verify that all instructions could be assembled and disassembled.
Differential Revision: https://reviews.llvm.org/D147421
Commit 3671bdbcd214("[BPF] Fix a BTF type pruning bug") fixed a
pruning bug to allow generate more types. But the commit has a bug
which permits to generate more types than necessary. The following
is an example to illustrate the problem.
struct t1 {
int a;
};
struct t2 {
struct t1 *p1;
struct t1 *p2;
int b;
};
int foo(struct t2 *arg) {
return arg->b;
}
The following is the part of BTF generation sequence:
(1). 'struct t2 *arg' -> 'struct t1 *p1'
In this step, the type 'struct t1' will be generated as
a forward decl and the ptr type (to 'struct t1') will
be stored in the internal type table.
(2). now the second field 'struct t1 *p2' will be processed.
Since the ptr type (to 'struct t1') already in the type
table, the existing logic strips out ptr modifier and
is able to generate BTF type for 'struct t1'.
In the above step (2), if CheckPointer is true (the type traversal
chain including a struct member), 'ptr' modifier should be checked
and the subsequent type generation should be skipped since
the same case has been processed in visitDerivedType().
The issue is exposed when I am trying to use llvm15 to compile
some internal bpf programs. The bpf skeleton put the whole
ELF section (after striping some sections like dwarf) as a string.
The large BTF section triggered the following error:
bpf_object_with_struct_ops_test_prog_bpf/BpfObjectWithStructOpsTestProg.skel.h:222:23:
error: string literal of length 140144 exceeds maximum length 65536 that C++ compilers
are required to support [-Werror,-Woverlength-strings]
return (const void *)"\
^~
1 error generated.
Although adding -Wno-overlength-strings could workaround the issue,
improving llvm BTF generation sounds better esp. for users using vmlinux.h.
Differential Revision: https://reviews.llvm.org/D145816
Alignment of an alloca in IR can be lower than the preferred alignment
on purpose, but this override essentially treats the preferred
alignment as the minimum alignment.
The patch changes this behavior to always use the specified
alignment. If alignment is not set explicitly in LLVM IR, it is set to
DL.getPrefTypeAlign(Ty) in computeAllocaDefaultAlign.
Tests are changed as well: explicit alignment is increased to match
the preferred alignment if it changes output, or omitted when it is
hard to determine the right value (e.g. for pointers, some structs, or
weird types).
Differential Revision: https://reviews.llvm.org/D135462
After frontend changes in the following commit:
"BPF: preserve btf_decl_tag for parameters of extern functions"
same mechanics could be used to get the list of function parameters
and associated btf_decl_tag entries for both extern and non-extern
functions.
This commit extracts this mechanics as a separate auxiliary function
BTFDebug::processDISubprogram(). The function is called for both
extern and non-extern functions in order to generated corresponding
BTF_DECL_TAG records.
Differential Revision: https://reviews.llvm.org/D140971
Use function TargetLoweringObjectFile::SectionForGlobal() to compute
section names for globals described in BTF_KIND_DATASEC records.
This fixes a discrepancy in section name computation between
BTFDebug::processGlobals and the rest of the LLVM pipeline.
Specifically, the following example illustrates the discrepancy
before this commit:
struct Foo {
int i;
} __attribute__((aligned(16)));
struct Foo foo = { 0 };
The initializer for 'foo' looks as follows:
%struct.Foo { i32 0, [12 x i8] undef }
TargetLoweringObjectFile::SectionForGlobal() classifies 'foo' as
a part of '.bss' section, while BTFDebug::processGlobals
classified it as a part of '.data' section because of the
following expression:
SecName = Global.getInitializer()->isZeroValue() ? ".bss" : ".data"
The isZeroValue() returns false because of the undef tail of the
initializer, while SectionForGlobal() allows such patterns in '.bss'.
Differential Revision: https://reviews.llvm.org/D140505