This commit introduces the bpf_fastcall attribute to declare BPF functions
that do not clobber some of the caller-saved registers (R0-R5).
The idea is to generate code complying with the generic BPF ABI, but
allow a compatible Linux kernel to remove unnecessary spills and fills of
non-scratched registers (given some compiler assistance).
For such functions, do register allocation as if caller-saved registers
are not clobbered, but later wrap the calls with spill and fill patterns
that are simple to recognize in the kernel.
For example, for the following C code:
#define __bpf_fastcall __attribute__((bpf_fastcall))
void bar(void) __bpf_fastcall;
void buz(long i, long j, long k);
void foo(long i, long j, long k) {
bar();
buz(i, j, k);
}
First allocate registers as if:
foo:
call bar # note: no spills for i,j,k (r1,r2,r3)
call buz
exit
And later insert spills/fills during the peephole phase:
foo:
*(u64 *)(r10 - 8) = r1; # Such call pattern is
*(u64 *)(r10 - 16) = r2; # correct when used with
*(u64 *)(r10 - 24) = r3; # old kernels.
call bar
r3 = *(u64 *)(r10 - 24); # But also allows new
r2 = *(u64 *)(r10 - 16); # kernels to recognize the
r1 = *(u64 *)(r10 - 8); # pattern and remove spills/fills.
call buz
exit
The offsets for generated spills/fills are picked as minimal stack
offsets for the function. Allocated stack slots are not used for any
other purposes, in order to simplify in-kernel analysis.
The corresponding functionality has been merged into the Linux kernel as
[this](https://lore.kernel.org/bpf/172179364482.1919.9590705031832457529.git-patchwork-notify@kernel.org/)
patch set (the patch set assumed that the `no_caller_saved_registers`
attribute would be used by LLVM; the naming does not matter to the kernel).
Targets affected:
- NVPTX and BPF: set to 64 bits.
- ARC, Lanai, and MSP430: set to 0 (they don't implement atomics).
Those which didn't yet add AtomicExpandPass to their pass pipeline now
do so.
This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass. On all these targets, this
now matches what Clang already does in the frontend.
The only targets which do not configure AtomicExpandPass now are:
- DirectX and SPIRV: they aren't normal backends.
- AVR: a single-cpu architecture with no privileged/user divide, which
could implement all atomics by disabling/enabling interrupts, regardless
of size/alignment. Will be addressed by future work.
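For illustration, the per-target change amounts to a one-line call to
setMaxAtomicSizeInBitsSupported() in each target's TargetLowering
constructor (a sketch, not the literal diff; exact placement varies per
target):
// In BPFTargetLowering's constructor: atomics wider than 64 bits are
// expanded to __atomic_* libcalls by AtomicExpandPass.
setMaxAtomicSizeInBitsSupported(64);
// In MSP430TargetLowering's constructor: no native atomics at all, so
// every atomic operation becomes a libcall.
setMaxAtomicSizeInBitsSupported(0);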
Currently, on mcpu=v3 we do not support the sdiv and srem instructions,
and the backend crashes with a stack trace and core dump, which is
misleading for end users, as this is not a compiler bug.
Add proper error reporting for sdiv/srem in the ISel legalize-op phase.
Through the clang frontend we get a detailed location and error report:
$ build/bin/clang -g -target bpf -c local/sdiv.c
local/sdiv.c:1:35: error: unsupported signed division, please convert to
unsigned div/mod.
1 | int sdiv(int a, int b) { return a / b; }
| ^
1 error generated.
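As the error message suggests, a workaround is to use unsigned division
when the operands are known to be non-negative, e.g. (illustrative):
unsigned int udiv(unsigned int a, unsigned int b) { return a / b; }
Unsigned div/mod are supported on all cpu versions; signed div/mod
require cpu=v4.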
Fixes: #70433
Fixes: #48647
This also improves error handling for dynamic stack allocation:
local/vla.c:2:3: error: unsupported dynamic stack allocation
2 | int b[n];
| ^
1 error generated.
Fixes: https://github.com/llvm/llvm-project/issues/57171
Replace `BPFMIPeepholeTruncElim` by adding an overload of
`TargetLowering::isZExtFree()` that is aware that zero extension is
free for `ISD::LOAD`.
Short description
=================
The `BPFMIPeepholeTruncElim` pass handles two patterns:
Pattern #1:
%1 = LDB %0, ... %1 = LDB %0, ...
%2 = AND_ri %1, 0xff -> %2 = MOV_ri %1 <-- (!)
Pattern #2:
bb.1: bb.1:
%a = LDB %0, ... %a = LDB %0, ...
br %bb3 br %bb3
bb.2: bb.2:
%b = LDB %0, ... -> %b = LDB %0, ...
br %bb3 br %bb3
bb.3: bb.3:
%1 = PHI %a, %b %1 = PHI %a, %b
%2 = AND_ri %1, 0xff %2 = MOV_ri %1 <-- (!)
Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32
Both patterns can be handled by built-in transformations at the
instruction selection phase if a suitable `isZExtFree()` implementation
is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.
When evaluated on the BPF kernel selftests and the remove_truncate_*.ll
LLVM test cases, this revision performs slightly better than
BPFMIPeepholeTruncElim; see the "Impact" section below for details.
The commit also adds a few test cases to make sure that the patterns in
question are handled.
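A sketch of the overload, following `ARMTargetLowering::isZExtFree`
(details of the actual patch may differ):
bool BPFTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
  EVT VT1 = Val.getValueType();
  if (isZExtFree(VT1, VT2))
    return true;
  // Zero extension of a load is free: BPF LDB/LDH/LDW already
  // zero-extend into the full 64-bit destination register.
  if (Val.getOpcode() != ISD::LOAD)
    return false;
  if (!VT1.isSimple() || !VT1.isInteger() ||
      !VT2.isSimple() || !VT2.isInteger())
    return false;
  return true;
}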
Long description
================
Why this works: Pattern #1
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:
...
Type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
t19: i64 = and t16, Constant:i64<255>
t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.1 t19: i64 = and t16, Constant:i64<255>
With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Optimized type-legalized selection DAG:
- `t19 = and t16, 255` had been replaced by `t16` (load).
- Patterns like `(and (load ... i8), 255)` are replaced by `load`
in `DAGCombiner::BackwardsPropagateMask` called from
`DAGCombiner::visitAND`.
- Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by
`(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge,
look for `TLI.shouldFoldConstantShiftPairToMask()` call).
Why this works: Pattern #2
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
br label %next
next:
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command.
Log for first basic block:
Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 9 nodes:
t0: ch,glue = EntryToken
t3: i64 = Constant<0>
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5
t8: ch = CopyToReg t0, Register:i64 %0, t6
...
Replacing.1 t6: i64 = zero_extend t5
With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 7 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t8: ch = CopyToReg t0, Register:i64 %0, t9
Note:
- Initial selection DAG:
- `%a = load ...` is lowered as `t6 = (zero_extend (load ...))`;
without the special `isZExtFree()` overload added by this commit
it would instead be lowered as `t6 = (any_extend (load ...))`.
- The decision to generate `zero_extend` or `any_extend` is
done in `RegsForValue::getCopyToRegs` called from
`SelectionDAGBuilder::CopyValueToVirtualRegister`:
- if `isZExtFree()` for load returns true `zero_extend` is used;
- `any_extend` is used otherwise.
- Optimized lowered selection DAG:
- `t6 = (any_extend (load ...))` is replaced by
`t9 = load ..., zext from i8`
This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad()` called from
`DAGCombiner::visitZERO_EXTEND`.
Log for second basic block:
Initial selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t5: i8 = truncate t4
t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
t9: i64 = any_extend t8
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.2 t18: i64 = and t4, Constant:i64<255>
With: t4: i64 = AssertZext t2, ValueType:ch:i8
...
Type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t18: i64 = and t4, Constant:i64<255>
t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Optimized type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Initial selection DAG:
- `t0` is an input value for this basic block; it corresponds to the
load instruction (`t9`) from the first basic block.
- It is accessed within basic block via
`t4` (AssertZext (CopyFromReg t0, ...)).
- The `AssertZext` is generated by `RegsForValue::getCopyFromRegs`
called from `SelectionDAGBuilder::getCopyFromRegs`; it is generated
only when `LiveOutInfo` with a known number of leading zeros is
present for `t0`.
- Known register bits in `LiveOutInfo` are computed by
`SelectionDAG::computeKnownBits` called from
`SelectionDAGISel::ComputeLiveOutVRegInfo`.
- `computeKnownBits()` generates leading zeros information for
`(load ..., zext from ...)` but *does not* generate leading zeros
information for `(load ..., anyext from ...)`.
This is why `isZExtFree()` added in this commit is important.
- Type-legalized selection DAG:
- `t5 = truncate t4` is replaced by `t18 = and t4, 255`
- Optimized type-legalized selection DAG:
- `t18 = and t4, 255` is replaced by `t4`; this is done by
`DAGCombiner::SimplifyDemandedBits` called from
`DAGCombiner::visitAND`, which simplifies patterns like
`(and (assertzext ...))`.
Impact
------
This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and
remove_truncate_8 improved slightly, for other tests it is
unchanged.
For remove_truncate_7:
Before this revision After this revision
-------------------- -------------------
r1 <<= 0x20 r1 <<= 0x20
r1 >>= 0x20 r1 >>= 0x20
if r1 == 0x0 goto +0x2 <LBB0_2> if r1 == 0x0 goto +0x2 <LBB0_2>
r1 = *(u32 *)(r2 + 0x0) r0 = *(u32 *)(r2 + 0x0)
goto +0x1 <LBB0_3> goto +0x1 <LBB0_3>
<LBB0_2>: <LBB0_2>:
r1 = *(u32 *)(r2 + 0x4) r0 = *(u32 *)(r2 + 0x4)
<LBB0_3>: <LBB0_3>:
r0 = r1 exit
exit
For remove_truncate_8:
Before this revision After this revision
-------------------- -------------------
r2 = *(u32 *)(r1 + 0x0) r2 = *(u32 *)(r1 + 0x0)
r3 = r2 r3 = r2
r3 <<= 0x20 r3 <<= 0x20
r4 = r3 r3 s>>= 0x20
r4 s>>= 0x20
if r4 s> 0x2 goto +0x5 <LBB0_3> if r3 s> 0x2 goto +0x4 <LBB0_3>
r4 = *(u32 *)(r1 + 0x4) r3 = *(u32 *)(r1 + 0x4)
r3 >>= 0x20
if r3 >= r4 goto +0x2 <LBB0_3> if r2 >= r3 goto +0x2 <LBB0_3>
r2 += 0x2 r2 += 0x2
*(u32 *)(r1 + 0x0) = r2 *(u32 *)(r1 + 0x0) = r2
<LBB0_3>: <LBB0_3>:
r0 = 0x3 r0 = 0x3
exit exit
For the kernel BPF selftests, the statistics are as follows:
- For -mcpu=v4: 9 out of 655 object files have differences;
in all cases the total number of instructions marginally decreased
(-27 instructions).
- For -mcpu=v2: 21 out of 655 object files have differences:
- For 19 object files the number of instructions decreased
(-129 instructions in total): some redundant `rX &= 0xffff`
and register-to-register assignments were removed;
- For 2 object files the number of instructions increased by 2
instructions in each file.
Both -mcpu=v2 instruction increases could be reduced to the same
example:
define void @foo(ptr %p) {
entry:
%a = load i32, ptr %p, align 4
%b = sext i32 %a to i64
%c = icmp ult i64 1, %b
br i1 %c, label %next, label %end
next:
call void inttoptr (i64 62 to ptr)(i32 %a)
br label %end
end:
ret void
}
Note that this example uses the value loaded into `%a` both as sign
extended (`%b`) and as zero extended (`%a` passed as a parameter).
Here is the difference in final assembly code:
Before this revision After this revision
-------------------- -------------------
r1 = *(u32 *)(r1 + 0) r1 = *(u32 *)(r1 + 0)
r1 <<= 32 r1 <<= 32
r1 s>>= 32 r1 s>>= 32
if r1 < 2 goto <LBB0_2> if r1 < 2 goto <LBB0_2>
r1 <<= 32
r1 >>= 32
call 62 call 62
<LBB0_2>: <LBB0_2>:
exit exit
Before this commit `%a` was passed to the call as a sign-extended value;
after this commit `%a` is passed to the call as a zero-extended value.
Both are correct, as the 32-bit sub-register is the same.
The difference comes from `DAGCombiner` operation on the initial DAG:
Initial selection DAG before this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = any_extend t5 <--------------------- (1)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
Initial selection DAG after this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5 <--------------------- (2)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
The node `t9` is processed before node `t6` and `load` instruction is
combined to load with sign extension:
Replacing.1 t9: i64 = sign_extend t5
With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
and 0 other values
Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
With: t31: i32 = truncate t30
and 1 other values
This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from
`DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which
is `any_extend` in (1) and `zero_extend` in (2).
`tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:
- `any_extend` is simply removed
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later
converted to a pair of shifts. This pair of shifts survives till the
end of translation.
Differential Revision: https://reviews.llvm.org/D157870
This patch contains a number of uncontroversial changes:
- Replace all uses of
`errs`, `assert`, and `llvm_unreachable` with `report_fatal_error` and
informative error strings.
- Replace calls to `fail` in loops with at most one call per error
instance. Previously a function with 19 arguments would log "too many
args" 14 times. This was not helpful.
- Change one `if (..) switch ...` to `if (..) { switch ...`. The added
brace is consistent with a near-identical switch immediately above.
- Elide one `SDValue` copy by using a reference rather than value. This
is consistent with a variable declared immediately before it.
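For illustration, the shape of the first change (a made-up example, not
the actual diff):
// before: user-facing failure reported as a compiler assertion
assert(VA.getLocVT() == MVT::i64 && "unhandled argument type");
// after: explicit fatal error with an informative string
if (VA.getLocVT() != MVT::i64)
  report_fatal_error("unsupported BPF argument type");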
Reviewed By: yonghong-song
Differential Revision: https://reviews.llvm.org/D156136
In [1], a few new insns are proposed to expand the BPF ISA by:
. fixing limitations of existing insns (e.g., 16-bit jmp offset)
. adding new insns which may improve code quality
(sign_ext_ld, sign_ext_mov, st)
. completing the feature set (sdiv, smod)
. improving user experience (bswap)
This patch implements insn encodings for
. sign-extended load
. sign-extended mov
. sdiv/smod
. bswap insns
. unconditional jump with 32bit offset
The new bswap insns are generated under cpu=v4 for __builtin_bswap.
For cpu=v3 or earlier, __builtin_bswap generates be or le insns,
which is not intuitive for the user.
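For example, for a source like the following (illustrative):
unsigned long long f(unsigned long long x) { return __builtin_bswap64(x); }
cpu=v4 emits an unconditional byte swap (roughly `r0 = bswap64 r0`),
while cpu=v3 or earlier emits `be64`/`le64`, which swaps bytes only on
one endianness and is a no-op on the other.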
To support 32-bit branch offset, a 32-bit ja (JMPL) insn is implemented.
For a conditional branch beyond the 16-bit offset range, llvm will do
a transformation 'cond_jmp' -> 'cond_jmp + jmpl' to simulate a 32-bit
conditional jmp. See BPFMIPeephole.cpp for details. The algorithm is
heuristic based. I have tested bpf selftest pyperf600 with unroll count
600, which can indeed generate the 32-bit jump insn, e.g.,
13: 06 00 00 00 9b cd 00 00 gotol +0xcd9b <LBB0_6619>
Eduard is working on adding the 'st' insn to cpu=v4.
A list of llc flags:
disable-ldsx, disable-movsx, disable-bswap,
disable-sdiv-smod, disable-gotol
can be used to disable a particular insn for cpu v4.
For example, the user can do:
llc -march=bpf -mcpu=v4 -disable-movsx t.ll
to enable cpu v4 without movsx insns.
References:
[1] https://lore.kernel.org/bpf/4bfe98be-5333-1c7e-2f6d-42486c8ec039@meta.com/
Differential Revision: https://reviews.llvm.org/D144829
`NoMerge` attribute on machine instructions prevents certain
transformations from merging these instructions.
One such transformation is branch folding ('llvm/lib/CodeGen/BranchFolding.cpp').
This attribute should be copied from IR `call` instructions to machine
level instructions. See `X86TargetLowering::LowerCall` as another
example.
Differential Revision: https://reviews.llvm.org/D152987
The term "next stack offset" is misleading because the next argument is
not necessarily allocated at this offset due to alignment constrains.
It also does not make much sense when allocating arguments at negative
offsets (introduced in a follow-up patch), because the returned offset
would be past the end of the next argument.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D149566
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduces the amount of boilerplate code a bit. It
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
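A sketch of the call-site simplification this enables (assumed shapes,
not a literal diff from any particular target):
// before: wrap the byte counts in pointer-sized target constants by hand
Chain = DAG.getCALLSEQ_END(Chain,
                           DAG.getIntPtrConstant(NumBytes, DL, /*isTarget=*/true),
                           DAG.getIntPtrConstant(0, DL, /*isTarget=*/true),
                           InGlue, DL);
// after: pass the raw uint64_t values
Chain = DAG.getCALLSEQ_END(Chain, NumBytes, 0, InGlue, DL);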
Delyan Kratunov reported an issue where __builtin_memcmp is
not inlined into simple load/compare instructions.
This is a known issue. In the current state, __builtin_memcmp
is converted to a memcmp call, which won't work for
bpf programs.
This patch adds support for expanding __builtin_memcmp into
actual loads and compares, currently up to a maximum of 128 total loads.
The implementation is identical to PowerPC's.
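For example, a small fixed-size compare like the following
(illustrative) can now be inlined into loads and compares instead of
being turned into a memcmp libcall:
int same8(const char *a, const char *b)
{
  return __builtin_memcmp(a, b, 8) == 0;
}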
Differential Revision: https://reviews.llvm.org/D122676
The latest upstream llvm caused the kernel bpf selftests to emit the
following warning:
In file included from progs/profiler3.c:6:
progs/profiler.inc.h:489:2: warning: loop not unrolled:
the optimizer was unable to perform the requested transformation;
the transformation might be disabled or specified as part of an unsupported
transformation ordering [-Wpass-failed=transform-warning]
for (int i = 0; i < MAX_PATH_DEPTH; i++) {
^
Further bisecting shows that SimplifyCFG patch [1] changed
the condition for folding a branch into a common destination. This
caused some unroll pragmas to not be honored in selftests/bpf.
Patch [1] tests getUserCost() as the condition to
perform certain basic block folding transformations.
For the above example, before the loop unroll pass, the control flow
looks like:
cond_block:
branch target: body_block, cleanup_block
body_block:
branch target: cleanup_block, end_block
end_block:
branch target: cleanup_block, end10_block
end10_block:
%add.ptr = getelementptr i8, i8* %payload.addr.0, i64 %call2
%inc = add nuw nsw i32 %i.0, 1
branch target: cond_block
In the above, %call2 is an unknown scalar.
Before patch [1], end10_block will be folded into end_block, forming
the code like
cond_block:
branch target: body_block, cleanup_block
body_block:
branch target: cleanup_block, end_block
end_block:
branch target: cleanup_block, cond_block
and the compiler is happy to perform unrolling.
With patch [1], getUserCost(), which calls getGEPCost(), which calls
isLegalAddressingMode() in TargetLoweringBase.cpp, considers the IR
%add.ptr = getelementptr i8, i8* %payload.addr.0, i64 %call2
free, so the above basic block folding transformation is not performed
and unrolling does not happen.
For the BPF target, the IR
%add.ptr = getelementptr i8, i8* %payload.addr.0, i64 %call2
is not free, and we don't have ld/st instructions addressing with 'r+r' mode.
This patch implements a BPF hook for isLegalAddressingMode(), which is
identical to the Mips isLegalAddressingMode() implementation, where
address patterns like 'r+r', 'r+r+i' or '2*r' are not allowed.
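A sketch of the hook, mirroring the Mips implementation (the actual
code may differ in details):
bool BPFTargetLowering::isLegalAddressingMode(const DataLayout &DL,
                                              const AddrMode &AM, Type *Ty,
                                              unsigned AS,
                                              Instruction *I) const {
  // No global is ever allowed as a base.
  if (AM.BaseGV)
    return false;
  switch (AM.Scale) {
  case 0: // "r+i" or just "i", depending on HasBaseReg.
    break;
  case 1:
    if (!AM.HasBaseReg) // allow "r+i".
      break;
    return false; // disallow "r+r" or "r+r+i".
  default: // disallow "2*r" and larger scales.
    return false;
  }
  return true;
}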
When testing the kernel bpf selftests, all 'loop not unrolled' warnings
are gone and all selftests run successfully.
[1] https://reviews.llvm.org/D108837
Differential Revision: https://reviews.llvm.org/D110789
Currently, the BPF backend does not support all variants of
atomic_load_{add,and,or,xor}, atomic_swap and atomic_cmp_swap.
For example, it only supports 32-bit (with alu32 mode) and 64-bit
operations for atomic_load_{and,or,xor}, atomic_swap and
atomic_cmp_swap. For historical reasons, atomic_load_add is
always supported with 32-bit and 64-bit.
If the user uses an unsupported atomic operation, codegen
(selectiondag) currently cannot find BPF support and issues
a fatal error. This is not user friendly, as the user may mistakenly
think this is a compiler bug.
This patch adds a Custom rule for unsupported atomic operations
and emits a better error message from the ReplaceNodeResults()
callback. The following is an example output.
$ cat t.c
short sync(short *p) {
return __sync_val_compare_and_swap (p, 2, 3);
}
$ clang -target bpf -O2 -g -c t.c
t.c:2:11: error: Unsupported atomic operations, please use 64 bit version
return __sync_val_compare_and_swap (p, 2, 3);
^
fatal error: error in backend: Cannot select: t19: i64,ch =
AtomicCmpSwap<(load store seq_cst seq_cst 2 on %ir.p)> t0, t2,
Constant:i64<2>, Constant:i64<3>, t.c:2:11
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t1: i64 = Register %0
t11: i64 = Constant<2>
t10: i64 = Constant<3>
In function: sync
PLEASE submit a bug report ...
A fatal error will still happen, since we do not really do proper
lowering for these unsupported atomic operations, but we do get
a much better error message first.
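As the new error message suggests, the workaround is to use the 64-bit
variant instead, e.g. (illustrative):
long sync64(long *p) {
  return __sync_val_compare_and_swap(p, 2L, 3L); /* 64-bit: supported */
}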
Differential Revision: https://reviews.llvm.org/D98471
The current pattern matching for zext results in the following code
snippet being produced:
w1 = w0
r1 <<= 32
r1 >>= 32
Because BPF implementations require zero extension on 32-bit loads, this
both adds a few extra unneeded instructions and makes it a bit
harder for the verifier to track the r1 register bounds. For example, in
the verifier trace below we see that at the end of the snippet the R2
offset is unknown. However, if we track this correctly, we see that w1
should have the same bounds as r8. R8's smax is less than the U32 max
value, so a zero-extending load should keep the same value. Adding a max
value of 800 (R8=inv(id=0,smax_value=800)) to an off=0, as seen in R7,
should create a max offset of 800. However, at the end of the snippet we
note the R2 max offset is 0xffffFFFF.
R0=inv(id=0,smax_value=800)
R1_w=inv(id=0,umax_value=2147483647,var_off=(0x0; 0x7fffffff))
R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R8_w=inv(id=0,smax_value=800,umax_value=4294967295,var_off=(0x0; 0xffffffff))
R9=inv800 R10=fp0 fp-8=mmmm????
58: (1c) w9 -= w8
59: (bc) w1 = w8
60: (67) r1 <<= 32
61: (77) r1 >>= 32
62: (bf) r2 = r7
63: (0f) r2 += r1
64: (bf) r1 = r6
65: (bc) w3 = w9
66: (b7) r4 = 0
67: (85) call bpf_get_stack#67
R0=inv(id=0,smax_value=800)
R1_w=ctx(id=0,off=0,imm=0)
R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0; 0xffffffff))
R3_w=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
R4_w=inv0 R6=ctx(id=0,off=0,imm=0)
R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R8_w=inv(id=0,smax_value=800,umax_value=4294967295,var_off=(0x0; 0xffffffff))
R9_w=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
R10=fp0 fp-8=mmmm????
After this patch the R1 bounds are not smashed by the <<=32 >>=32 shift
pair and we get correct bounds on R2 (umax_value=800).
Further, it reduces 3 insns to 1.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Differential Revision: https://reviews.llvm.org/D73985
Currently, BPF does not support dynamic stack allocation.
For a program like below:
extern void bar(int *);
void foo(int n) {
int a[n];
bar(a);
}
The current error message looks like:
unimplemented operand
UNREACHABLE executed at /.../llvm/lib/Target/BPF/BPFISelLowering.cpp:199!
Let us make the error message explicit so it will be clear to the user
what the problem is. With this patch, the error message looks like:
fatal error: error in backend: Unsupported dynamic stack allocation
...
Differential Revision: https://reviews.llvm.org/D74521
Currently, the isTruncateFree() and isZExtFree() callbacks return false
as they are not implemented in the BPF backend. This may cause suboptimal
code generation. For example, if the load in the context of zero extension
has more than one use, the pattern zextload{i8,i16,i32} will
not be generated. Rather, the load will be matched first and
then the result is zero extended.
For example, in the test together with this commit, we have
I1: %0 = load i32, i32* %data_end1, align 4, !tbaa !2
I2: %conv = zext i32 %0 to i64
...
I3: %2 = load i32, i32* %data, align 4, !tbaa !7
I4: %conv2 = zext i32 %2 to i64
...
I5: %4 = trunc i64 %sub.ptr.lhs.cast to i32
I6: %conv13 = sub i32 %4, %2
...
The I1 and I2 will match to one zextloadi32 DAG node, where SUBREG_TO_REG is
used to convert a 32bit register to 64bit one. During code generation,
SUBREG_TO_REG is a noop.
The %2 in I3 is used in both I4 and I6. If isTruncateFree() is false,
the current implementation will generate a SLL_ri and SRL_ri
for the zext part during lowering.
This patch implements isTruncateFree() in the BPF backend, so for the
above example, I3 and I4 will generate a zextloadi32 DAG node, and
SUBREG_TO_REG is generated during lowering to Machine IR.
isZExtFree() is also implemented, as it should help codegen as well.
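A sketch of the two hooks (a standard shape used by several backends;
the guards in the actual patch may differ):
bool BPFTargetLowering::isTruncateFree(Type *Ty1, Type *Ty2) const {
  if (!Ty1->isIntegerTy() || !Ty2->isIntegerTy())
    return false;
  // Truncating i64 -> i32 just reads the 32-bit sub-register.
  return Ty1->getPrimitiveSizeInBits() > Ty2->getPrimitiveSizeInBits();
}
bool BPFTargetLowering::isZExtFree(Type *Ty1, Type *Ty2) const {
  // Zero extending i32 -> i64 is free: 32-bit ALU operations already
  // clear the upper 32 bits of the 64-bit register.
  return Ty1->isIntegerTy(32) && Ty2->isIntegerTy(64);
}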
This patch also enables the change in https://reviews.llvm.org/D73985,
which otherwise won't kick in to generate the MOV_32_64 machine instruction.
Differential Revision: https://reviews.llvm.org/D74101
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: nemanjai, javed.absar, hiraditya, kbarton, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, s.egerton, pzheng, ychen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67267
llvm-svn: 371212
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: jyknight, sdardis, nemanjai, javed.absar, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, s.egerton, pzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67229
llvm-svn: 371200
Summary:
This patch renames functions that take or return alignment as log2; this will help with the transition to llvm::Align.
The renaming makes it explicit that we deal with log2(alignment) instead of a power-of-two alignment.
A few renames uncovered dubious assignments:
- `MirParser`/`MirPrinter` was expecting powers of two but `MachineFunction` and `MachineBasicBlock` were dealing with log2(align). This patch fixes it and updates the documentation.
- `MachineBlockPlacement` exposes two flags (`align-all-blocks` and `align-all-nofallthru-blocks`) supposedly interpreted as power-of-two alignments; internally these values are interpreted as log2(align). This patch updates the documentation.
- `MachineFunction` exposes `align-all-functions`, also interpreted as a power-of-two alignment; internally this value is interpreted as log2(align). This patch updates the documentation.
Reviewers: lattner, thegameg, courbet
Subscribers: dschuff, arsenm, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, javed.absar, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, dexonsmith, PkmX, jocewei, jsji, Jim, s.egerton, llvm-commits, courbet
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65945
llvm-svn: 371045
Summary:
This clang-tidy check is looking for unsigned integer variables whose initializer
starts with an implicit cast from llvm::Register and changes the type of the
variable to llvm::Register (dropping the llvm:: where possible).
Partial reverts in:
X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister
X86FixupLEAs.cpp - Some functions return unsigned and arguably should be MCRegister
X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister
HexagonBitSimplify.cpp - Function takes BitTracker::RegisterRef which appears to be unsigned&
MachineVerifier.cpp - Ambiguous operator==() given MCRegister and const Register
PPCFastISel.cpp - No Register::operator-=()
PeepholeOptimizer.cpp - TargetInstrInfo::optimizeLoadInstr() takes an unsigned&
MachineTraceMetrics.cpp - MachineTraceMetrics lacks a suitable constructor
Manual fixups in:
ARMFastISel.cpp - ARMEmitLoad() now takes a Register& instead of unsigned&
HexagonSplitDouble.cpp - Ternary operator was ambiguous between unsigned/Register
HexagonConstExtenders.cpp - Has a local class named Register, used llvm::Register instead of Register.
PPCFastISel.cpp - PPCEmitLoad() now takes a Register& instead of unsigned&
Depends on D65919
Reviewers: arsenm, bogner, craig.topper, RKSimon
Reviewed By: arsenm
Subscribers: RKSimon, craig.topper, lenary, aemerson, wuzish, jholewinski, MatzeB, qcolombet, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, wdng, nhaehnle, sbc100, jgravelle-google, kristof.beyls, hiraditya, aheejin, kbarton, fedor.sergeev, javed.absar, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, tpr, PkmX, jocewei, jsji, Petar.Avramovic, asbirlea, Jim, s.egerton, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65962
llvm-svn: 369041
JMP32 instructions have been added to the eBPF ISA. They are 32-bit
variants of existing BPF conditional jump instructions, but the
comparison happens on the low 32-bit sub-register only, therefore some
unnecessary extensions can be saved.
JMP32 instructions will only be available for -mcpu=v3. The host probe
hook has been updated accordingly.
JMP32 instructions will only be enabled in code-gen when -mattr=+alu32 is
enabled, meaning the program is compiled in sub-register mode.
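For example (illustrative):
int lt(unsigned a, unsigned b) { return a < b; }
With -mcpu=v3 -mattr=+alu32 the comparison can be emitted directly on
the 32-bit sub-registers (roughly `if w1 < w2 goto ...`), with no shifts
needed to clear the upper bits.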
For encoding, JMP32 is a new instruction class, using the
reserved eBPF class number 0x6.
This patch has been tested by compiling and running kernel bpf selftests
with JMP32 enabled.
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
llvm-svn: 353384
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
Some BPF JIT backends would want to optimize memcpy in their own
architecture specific way.
However, at the moment, there is no way for JIT backends to see memcpy
semantics in a reliable way. This is because the LLVM BPF backend expands
memcpy into load/store sequences and may schedule them apart from
each other further. So, BPF JIT backends inside the kernel can't reliably
recognize memcpy semantics by peepholing the BPF sequence.
This patch introduces new intrinsic expansion infrastructure for memcpy.
To get a stable in-order load/store sequence from memcpy, we first lower
memcpy into a BPF::MEMCPY node, which is then expanded into in-order
load/store sequences in the expandPostRAPseudo pass, which happens after
instruction scheduling. This way, kernel JIT backends can reliably
recognize memcpy by scanning the BPF sequence.
This new memcpy expand infrastructure is gated by a new option:
-bpf-expand-memcpy-in-order
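For example (illustrative):
$ llc -march=bpf -bpf-expand-memcpy-in-order t.ll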
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
llvm-svn: 337977
Summary:
They've been deprecated in favor of UADDO/ADDCARRY or USUBO/SUBCARRY for a while.
Targets that use these opcodes are changed in order to ensure their behavior doesn't change.
Reviewers: efriedma, craig.topper, dblaikie, bkramer
Subscribers: jholewinski, arsenm, jyknight, sdardis, nemanjai, nhaehnle, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, jordy.potman.lists, apazos, sabuasal, niosHD, jrtc27, zzheng, edward-jones, mgrang, atanasyan, llvm-commits
Differential Revision: https://reviews.llvm.org/D47422
llvm-svn: 333748
Commit 37962a331c77 ("bpf: Improve expanding logic in LowerSELECT_CC")
intended to improve code quality for certain jmp conditions. The
commit, however, has a couple of issues:
(1). In the code, just swapping the operands is not enough; the
ConditionalCode CC should also be swapped, otherwise incorrect
code will be generated.
(2). The ConditionalCode swap should be subject to
getHasJmpExt(). If getHasJmpExt() is false, certain
conditional codes are not supported and the swap
may generate incorrect code.
The original goal of this patch is to optimize jmp operations
which do not have JmpExt turned on. If JmpExt is on,
better code could be generated. For example, the test
select_ri.ll is introduced to demonstrate the optimization.
The same result can be achieved with -mcpu=v2 flag.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
llvm-svn: 329043
Currently EVT is in the IR layer only because Function.cpp needs a very small piece of the functionality of EVT::getEVTString(). The rest of EVT is used in codegen, making CodeGen a better place for it.
The previous code converted a Type* to EVT and then called getEVTString. This was only expected to handle the primitive types from Type*. Since there are only a few primitive types, we can just print them as strings directly.
Differential Revision: https://reviews.llvm.org/D45017
llvm-svn: 328806
Currently, there is no ALU32 bswap support in the eBPF ISA.
BSWAP on i32 was set to EXPAND, which would need about eight instructions
for a single BSWAP.
It is more efficient to promote it to i64, then do BSWAP on i64.
For eBPF programs, most of the promotions are zero extensions which are
likely to be eliminated later by peephole optimizations.
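A sketch of the lowering-action change (an excerpt; the exact
constructor context is assumed):
// Promote i32 BSWAP to i64 rather than expanding it to ~8 ALU insns.
setOperationAction(ISD::BSWAP, MVT::i32, Promote);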
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
llvm-svn: 327369
After all those preparation patches, we can now enable 32-bit subregister
support once -mattr=+alu32 is specified.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Yonghong Song <yhs@fb.com>
llvm-svn: 325989
The getScalarShiftAmount method should be implemented for the eBPF
backend to make sure the shift amount still gets the correct type once
32-bit subregister support is enabled.
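A sketch of the hook (presumably the getScalarShiftAmountTy override;
details assumed):
MVT BPFTargetLowering::getScalarShiftAmountTy(const DataLayout &DL,
                                              EVT VT) const {
  // With 32-bit subregisters enabled, an i32 shift takes an i32 shift
  // amount; everything else remains i64.
  return (HasAlu32 && VT == MVT::i32) ? MVT::i32 : MVT::i64;
}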
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Yonghong Song <yhs@fb.com>
llvm-svn: 325986
We need to support condition comparison on i32. All these comparisons are
supposed to be combined into BPF_J* instructions which only support i64.
For ISD::BR_CC we need to promote it to i64 first, then do custom lowering.
For ISD::SET_CC, just expand to SELECT_CC like what's been done for i64.
For ISD::SELECT_CC, we also want to do custom lowering for i32. However,
after 32-bit subregister support is enabled, it is possible that the
comparison operands are i32 while the selected values are i64, or that
the comparison operands are i64 while the selected values are i32. We
need to define extra instruction patterns and support them in the custom
instruction inserter.
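A sketch of the corresponding lowering actions (an excerpt; assumed
form):
if (STI.getHasAlu32()) {
  // Promote i32 BR_CC to i64, which is then custom lowered.
  setOperationAction(ISD::BR_CC, MVT::i32, Promote);
  // Expand i32 SETCC to SELECT_CC, as done for i64.
  setOperationAction(ISD::SETCC, MVT::i32, Expand);
  // Custom lower i32 SELECT_CC; mixed i32/i64 operand combinations are
  // handled by extra patterns and the custom instruction inserter.
  setOperationAction(ISD::SELECT_CC, MVT::i32, Custom);
}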
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Yonghong Song <yhs@fb.com>
llvm-svn: 325985
There is no eBPF ISA support for BSWAP, ROTR, ROTL, SREM, SDIVREM, MULHU,
ADDC, ADDE etc on i32.
They can be emulated by other basic BPF_ALU operations, so we set their
lowering action the same as for i64.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Yonghong Song <yhs@fb.com>
llvm-svn: 325984
This patch adds new calling conventions to allow GPR32RegClass as a valid
register class for arguments and return types.
The new calling convention will only be chosen when -mattr=+alu32 is specified.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Yonghong Song <yhs@fb.com>
llvm-svn: 325983
LowerSELECT_CC is not generating the optimal Select_Ri pattern at the
moment: it is not guaranteed to place the ConstantNode at the RHS, which
misses matching Select_Ri.
A new testcase is added to the existing select_ri.ll; there is also an
existing case in cmp.ll which is improved to use Select_Ri after this
patch, and it is adjusted accordingly.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reviewed-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
llvm-svn: 324560
The kernel verifier is becoming smarter and will soon support
direct and indirect function calls.
Remove the obsolete error from the BPF backend.
Make calls use the PCRel_4 fixup.
'bpf to bpf' calls are distinguished from 'bpf to kernel' calls
by insn->src_reg == BPF_PSEUDO_CALL == 1, which is used as a relocation
indicator, similar to ld_imm64->src_reg == BPF_PSEUDO_MAP_FD == 1.
The actual 'call' instruction remains the same for both
'bpf to kernel' and 'bpf to bpf' calls.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
llvm-svn: 318614
We came across an llvm bug when compiling some testcases: 64-bit
immediates were silently truncated into 32-bit and then packed into the
BPF_JMP | BPF_K encoding. This caused comparisons with the wrong value.
This bug looks to have been introduced by r308080. The Select_Ri pattern
is supposed to be lowered into J*_Ri, while the latter only supports
32-bit immediate encoding; therefore Select_Ri should have an immediate
predicate check similar to what J*_Ri does.
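An illustrative case (hypothetical source):
int cmp(unsigned long long x) { return x == (1ULL << 32); }
Without the predicate check, the immediate 0x100000000 is truncated to 0
and packed into a BPF_JMP | BPF_K compare, so the test wrongly succeeds
for x == 0; with the check, the immediate is first materialized via
ld_imm64 and a register-register compare is used.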
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Yonghong Song <yhs@fb.com>
llvm-svn: 315889