342 Commits

Author SHA1 Message Date
Alex Rønne Petersen
ad4a582fd9
[llvm] Consistently respect naked fn attribute in TargetFrameLowering::hasFP() (#106014)
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.

Note: I don't have commit access.
2024-10-18 09:35:42 +04:00
yonghong-song
38a8000f30
[BPF] Use mul for certain div/mod operations (#110712)
The motivation example likes below
```
  $ cat t1.c
  struct S {
    int var[3];
  };
  int foo1 (struct S *a, struct S *b)
  {
      return a - b;
  }
```
For cpu v1/v2/v3, the compilation will fail with the following error:
```
  $ clang --target=bpf -O2 -c t1.c -mcpu=v3
  t1.c:4:5: error: unsupported signed division, please convert to unsigned div/mod.
  4 | int foo1 (struct S *a, struct S *b)
        |     ^
  1 error generated.
```
The reason is that sdiv/smod is only supported at -mcpu=v4. At cpu
v1/v2/v3, only udiv/umod is supported.

But the above example (for func foo1()) is reasonable common and user
has to workaround the compilation failure by using udiv with
conditionals.

For x86, for the above t1.c, compile and dump the asm code like below:
```
  $ clang -O2 -c t1.c && llvm-objdump -d t1.o
  0000000000000000 <foo1>:
       0: 48 29 f7                      subq    %rsi, %rdi
       3: 48 c1 ef 02                   shrq    $0x2, %rdi
       7: 69 c7 ab aa aa aa             imull   $0xaaaaaaab, %edi, %eax # imm = 0xAAAAAAAB
       d: c3                            retq
```
Basically sdiv can be replaced with sub, shr and imul. Latest gcc-bpf is
also able to generate code similar to x86 with -mcpu=v1. See
https://godbolt.org/z/feP9ETbjj

Generally multiplication is cheaper than divide. LLVM SelectionDAG
already supports to generate alternative codes (mul etc.) for div/mod
operations if the divisor is constant and if the cost of division is not
cheap. For BPF backend, let us use multiplication instead of divide
whenever possible. The following is the list of div/mod operations which
can be converted to mul.
  - 32bit signed/unsigned div/mod with constant divisor
- 64bit signed/unsigned div with constant divisor and with 'exact' flag
in IR where 'exact' means if replacing div with mod, it is guaranteed
that modulo result will be 0. The above foo1() case belongs here.
2024-10-01 14:20:54 -07:00
eddyz87
2fca0effb4
[BPF] fix sub-register handling for bpf_fastcall (#110618)
bpf_fastcall induced spill/fill pairs should be generated for
sub-register as well as for sub-registers. At the moment this is not the
case, e.g.:

    $ cat t.c
    extern int foo(void) __attribute__((bpf_fastcall));

    int bar(int a) {
      foo();
      return a;
    }

    $ clang --target=bpf -mcpu=v3 -O2 -S t.c -o -
    ...
    call foo
    w0 = w1
    exit

Modify BPFMIPeephole.cpp:collectBPFFastCalls() to check sub-registers
liveness and thus produce correct code for example above:

    *(u64 *)(r10 - 8) = r1
    call foo
    r1 = *(u64 *)(r10 - 8)
    w0 = w1
    exit
2024-10-01 21:20:22 +03:00
eddyz87
26df43faeb
[BPF] fix print_btf.py test script on bigendian machines (#110332)
Make print_btf.py correctly detect endianness of the BTF input. Input
endianness is inferred from BTF magic word [2], which is a 2-byte
integer at offset 0 of the input:
- sequence `EB 9F` signals big-endian input;
- sequence `9F EB` signals little-endian input.

Before this commit the magic sequence was read using "H" format for
`unpack` method of python's `struct` module:
- if magic is `0xEB9F` assume little-endian;
- if magic is `0x9FEB` assume big-endian.

However, format `H` reads data in native endianness.

Thus the above logic would only be correct on little endian hosts:
- byte sequence `9F EB` read as `0xEB9F` -> little-endian input;
- byte sequence `EB 9F` read as `0x9FEB` -> big-endian input.

On the big-endian host the relation should be inverse.

Fix this by always reading magic in big-endian (format `>H`).

This fixes CI error reported for a BPF test using print_btf.py script in
[1]. The error happens on a s390 host, which is big-endian.

[1] https://lab.llvm.org/buildbot/#/builders/42/builds/1192
[2]
https://www.kernel.org/doc/html/latest/bpf/btf.html#btf-type-and-string-encoding
2024-09-29 08:58:56 +03:00
yonghong-song
4c4fb6ada7
[BPF] Do atomic_fetch_*() pattern matching with memory ordering (#107343)
Three commits in this pull request:
commit 1: implement pattern matching for memory ordering seq_cst,
acq_rel, release, acquire and monotonic. Specially, for monotonic memory
ordering (relaxed memory model), if no return value is used, locked insn
is used.
commit 2: add support to handle dwarf atomic modifier in BTF generation.
Actually atomic modifier is ignored in BTF.
commit 3: add tests for new atomic ordering support and BTF support with
_Atomic type.
I removed RFC tag as now patch sets are in reasonable states.

For atomic fetch_and_*() operations, do pattern matching with memory
ordering
seq_cst, acq_rel, release, acquire and monotonic (relaxed). For
fetch_and_*()
operations with seq_cst/acq_rel/release/acquire ordering,
atomic_fetch_*()
instructions are generated. For monotonic ordering, locked insns are
generated
if return value is not used. Otherwise, atomic_fetch_*() insns are used.
The main motivation is to resolve the kernel issue [1].
   
The following are memory ordering are supported:
  seq_cst, acq_rel, release, acquire, relaxed
Current gcc style __sync_fetch_and_*() operations are all seq_cst.

To use explicit memory ordering, the _Atomic type is needed. The
following is
an example:

```
$ cat test.c
\#include <stdatomic.h>
void f1(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}
void f2(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_acquire);
}
void f3(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test.c -o test.o && llvm-objdum
p -d test.o
$ ./run.sh
       
test.o: file format elf64-bpf
       
Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       1:       c3 21 00 00 50 00 00 00 lock *(u32 *)(r1 + 0x0) &= w2
       2:       95 00 00 00 00 00 00 00 exit
       
0000000000000018 <f2>:
       3:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       4:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       5:       95 00 00 00 00 00 00 00 exit
       
0000000000000030 <f3>:
       6:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       7:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       8:       95 00 00 00 00 00 00 00 exit
```    

The following is another example where return value is used:

```
$ cat test1.c
\#include <stdatomic.h>
int f1(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}  
int f2(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_acquire);
}  
int f3(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}  
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test1.c -o test1.o && llvm-objdump -d test1.o
$ ./run.sh

test.o: file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       1:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       2:       95 00 00 00 00 00 00 00 exit
       
0000000000000018 <f2>:
       3:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       4:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       5:       95 00 00 00 00 00 00 00 exit
       
0000000000000030 <f3>:
       6:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       7:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       8:       95 00 00 00 00 00 00 00 exit
```    

You can see that for relaxed memory ordering, if return value is used,
atomic_fetch_and()
insn is used. Otherwise, if return value is not used, locked insn is
used.

Here is another example with global _Atomic variable:

```
$ cat test3.c
\#include <stdatomic.h>

_Atomic int i;

void f1(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_relaxed);
}
void f2(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test3.c -o test3.o && llvm-objdump -d test3.o
$ ./run.sh

test3.o:        file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       1:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       3:       c3 12 00 00 50 00 00 00 lock *(u32 *)(r2 + 0x0) &= w1
       4:       95 00 00 00 00 00 00 00 exit
       
0000000000000028 <f2>:
       5:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       6:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       8:       c3 12 00 00 51 00 00 00 w1 = atomic_fetch_and((u32 *)(r2 + 0x0), w1)
       9:       95 00 00 00 00 00 00 00 exit
```    

Note that in the above compilations, '-g' is not used. The reason is due
to the following IR
related to _Atomic type:
```
$clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -g -S -emit-llvm test3.c
```
The related debug info for test3.c:
```
!0 = !DIGlobalVariableExpression(var: !1, expr: !DIExpression())
!1 = distinct !DIGlobalVariable(name: "i", scope: !2, file: !3, line: 3, type: !16, isLocal: false, isDefinition: true)
...
!16 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !17)
!17 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
```

If compiling test.c, the related debug info:
```
...
!19 = distinct !DISubprogram(name: "f1", scope: !1, file: !1, line: 3, type: !20, scopeLine: 3, flags: DIFlagPrototyped | DIFlagAllCallsDescribed, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !25)
!20 = !DISubroutineType(types: !21)
!21 = !{null, !22}
!22 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !23, size: 64)
!23 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !24)
!24 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
!25 = !{!26}
!26 = !DILocalVariable(name: "i", arg: 1, scope: !19, file: !1, line: 3, type: !22)
```

All the above suggests _Atomic behaves like a modifier (e.g. const,
restrict, volatile).
This seems true based on doc [1].

Without proper handling DW_TAG_atomic_type, llvm BTF generation will be
incorrect since
the current implementation assumes no existence of DW_TAG_atomic_type.
So we have
two choices here:
(1). llvm bpf backend processes DW_TAG_atomic_type but ignores it in BTF
encoding.
(2). Add another type, e.g., BTF_KIND_ATOMIC to BTF. BTF_KIND_ATOMIC
behaves as a
       modifier like const/volatile/restrict.

For choice (1), llvm bpf backend should skip dwarf::DW_TAG_atomic_type
during
BTF generation whenever necessary.

For choice (2), BTF_KIND_ATOMIC will be added to BTF so llvm backend and
kernel
needs to handle that properly. The main advantage of it probably is to
maintain
this atomic type so it is also available to skeleton. But I think for
skeleton
a raw type might be good enough unless user space intends to do some
atomic
operation with that, which is a unlikely case.
    
So I choose choice (1) in this RFC implementation. See the commit
message of the second commit for details.

[1]
https://lore.kernel.org/bpf/7b941f53-2a05-48ec-9032-8f106face3a3@linux.dev/
 [2] https://dwarfstd.org/issues/131112.1.html

---------
2024-09-24 15:55:50 -07:00
yonghong-song
7852ebc088
[BPF] Make -mcpu=v3 as the default (#107008)
Before llvm20, (void)__sync_fetch_and_add(...) always generates locked
xadd insns. In linux kernel upstream discussion [1], it is found that
for arm64 architecture, the original semantics of
(void)__sync_fetch_and_add(...), i.e., __atomic_fetch_add(...), is
preferred in order for jit to emit proper native barrier insns.

In llvm commits [2] and [3], (void)__sync_fetch_and_add(...) will
generate the following insns:
  - for cpu v1/v2: locked xadd insns to keep backward compatibility
  - for cpu v3/v4: __atomic_fetch_add() insns

To ensure proper barrier semantics for (void)__sync_fetch_and_add(...),
cpu v3/v4 is recommended.

This patch enables cpu=v3 as the default cpu version. For users wanting
to use cpu v1, -mcpu=v1 needs to be explicitly added to clang/llc
command line.

  [1]
https://lore.kernel.org/bpf/ZqqiQQWRnz7H93Hc@google.com/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
  [2] https://github.com/llvm/llvm-project/pull/101428
  [3] https://github.com/llvm/llvm-project/pull/106494
2024-09-03 07:15:18 -07:00
yonghong-song
06c531e808
BPF: Generate locked insn for __sync_fetch_and_add() with cpu v1/v2 (#106494)
This patch contains two pars:
- first to revert the patch https://github.com/llvm/llvm-project/pull/101428.
- second to remove `atomic_fetch_and_*()` to `atomic_<op>()`
  conversion (when return value is not used), but preserve 
  `__sync_fetch_and_add()` to locked insn with cpu v1/v2.
2024-08-30 14:00:33 -07:00
eddyz87
64e464349b
[BPF] introduce __attribute__((bpf_fastcall)) (#105417)
This commit introduces attribute bpf_fastcall to declare BPF functions
that do not clobber some of the caller saved registers (R0-R5).

The idea is to generate the code complying with generic BPF ABI,
but allow compatible Linux Kernel to remove unnecessary spills and
fills of non-scratched registers (given some compiler assistance).

For such functions do register allocation as-if caller saved registers
are not clobbered, but later wrap the calls with spill and fill
patterns that are simple to recognize in kernel.

For example for the following C code:

    #define __bpf_fastcall __attribute__((bpf_fastcall))

    void bar(void) __bpf_fastcall;
    void buz(long i, long j, long k);

    void foo(long i, long j, long k) {
      bar();
      buz(i, j, k);
    }

First allocate registers as if:

    foo:
      call bar    # note: no spills for i,j,k (r1,r2,r3)
      call buz
      exit

And later insert spills fills on the peephole phase:

    foo:
      *(u64 *)(r10 - 8) = r1;  # Such call pattern is
      *(u64 *)(r10 - 16) = r2; # correct when used with
      *(u64 *)(r10 - 24) = r3; # old kernels.
      call bar
      r3 = *(u64 *)(r10 - 24); # But also allows new
      r2 = *(u64 *)(r10 - 16); # kernels to recognize the
      r1 = *(u64 *)(r10 - 8);  # pattern and remove spills/fills.
      call buz
      exit

The offsets for generated spills/fills are picked as minimal stack
offsets for the function. Allocated stack slots are not used for any
other purposes, in order to simplify in-kernel analysis.
2024-08-22 03:40:56 +03:00
Eduard Zingerman
5e8f4618ce Revert "[BPF] introduce __attribute__((bpf_fastcall)) (#101228)"
This reverts commit e9b2e16dc98345bb1b91b1a6dacb3cec85f49e31.
Reverting because of the test failure:
https://lab.llvm.org/buildbot/#/builders/187/builds/509
2024-08-19 11:30:27 -07:00
eddyz87
e9b2e16dc9
[BPF] introduce __attribute__((bpf_fastcall)) (#101228)
This commit introduces attribute bpf_fastcall to declare BPF functions
that do not clobber some of the caller saved registers (R0-R5).

The idea is to generate the code complying with generic BPF ABI, but
allow compatible Linux Kernel to remove unnecessary spills and fills of
non-scratched registers (given some compiler assistance).

For such functions do register allocation as-if caller saved registers
are not clobbered, but later wrap the calls with spill and fill patterns
that are simple to recognize in kernel.

For example for the following C code:

     #define __bpf_fastcall __attribute__((bpf_fastcall))

     void bar(void) __bpf_fastcall;
     void buz(long i, long j, long k);

     void foo(long i, long j, long k) {
       bar();
       buz(i, j, k);
     }

First allocate registers as if:

    foo:
      call bar    # note: no spills for i,j,k (r1,r2,r3)
      call buz
      exit

And later insert spills fills on the peephole phase:

    foo:
      *(u64 *)(r10 - 8) = r1;  # Such call pattern is
      *(u64 *)(r10 - 16) = r2; # correct when used with
      *(u64 *)(r10 - 24) = r3; # old kernels.
      call bar
      r3 = *(u64 *)(r10 - 24); # But also allows new
      r2 = *(u64 *)(r10 - 16); # kernels to recognize the
      r1 = *(u64 *)(r10 - 8);  # pattern and remove spills/fills.
      call buz
      exit

The offsets for generated spills/fills are picked as minimal stack
offsets for the function. Allocated stack slots are not used for any
other purposes, in order to simplify in-kernel analysis.

Corresponding functionality had been merged in Linux Kernel as
[this](https://lore.kernel.org/bpf/172179364482.1919.9590705031832457529.git-patchwork-notify@kernel.org/)
patch set (the patch assumed that `no_caller_saved_regsiters` attribute
would be used by LLVM, naming does not matter for the Kernel).
2024-08-19 19:49:11 +03:00
Kevin McAfee
99a10f1fe8
Update load intrinsic attributes (#101562)
This patch adds default attributes to many intrinsics and the WillReturn
attribute to some as well. The defaults include WillReturn. The WillReturn
attribute is relevant for dead code elimination as intrinsics without
WillReturn are assumed to have side effects and cannot be removed even
if their return value is unused. It is also relevant for potential
changes to SDAG behavior regarding treatment of intrinsics that function
as loads.
2024-08-15 13:34:49 -07:00
yonghong-song
03958680b2
[BPF] Make llvm-objdump disasm default cpu v4 (#102166)
Currently, with the following example,
  $ cat t.c
  void foo(int a, _Atomic int *b)
  {
   *b &= a;
  }
  $ clang --target=bpf -O2 -c -mcpu=v3 t.c
  $ llvm-objdump -d t.o
  t.o:    file format elf64-bpf

  Disassembly of section .text:

  0000000000000000 <foo>:
       0:       c3 12 00 00 51 00 00 00 <unknown>
       1:       95 00 00 00 00 00 00 00 exit

Basically, the default cpu for llvm-objdump is v1 and it won't be able
to decode insn properly.

If we add --mcpu=v3 to llvm-objdump command line, we will have
  $ llvm-objdump -d --mcpu=v3 t.o

  t.o:    file format elf64-bpf

  Disassembly of section .text:

  0000000000000000 <foo>:
0: c3 12 00 00 51 00 00 00 w1 = atomic_fetch_and((u32 *)(r2 + 0x0), w1)
       1:       95 00 00 00 00 00 00 00 exit

The atomic_fetch_and insn can be decoded properly. Using latest cpu
version --mcpu=v4 can also decode properly like the above --mcpu=v3.

To avoid the above '<unknown>' decoding with common 'llvm-objdump -d
t.o', this patch marked the default cpu for llvm-objdump with the
current highest cpu number v4 in ELFObjectFileBase::tryGetCPUName(). The
cpu number in ELFObjectFileBase::tryGetCPUName() will be adjusted in the
future if cpu number is increased e.g. v5 etc. Such an approach also
aligns with gcc-bpf as discussed in [1].

Six bpf unit tests are affected with this change. I changed test output
for three unit tests and added --mcpu=v1 for the other three unit tests,
to demonstrate the default (cpu v4) behavior and explicit --mcpu=v1
behavior.

[1]
https://lore.kernel.org/bpf/6f32c0a1-9de2-4145-92ea-be025362182f@linux.dev/T/#m0f7e63c390bc8f5a5523e7f2f0537becd4205200

Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
2024-08-06 18:23:46 -07:00
yonghong-song
c566769d7c
BPF: Ensure __sync_fetch_and_add() always generate atomic_fetch_add insn (#101428)
Peilen Ye reported an issue ([1]) where for __sync_fetch_and_add(...)
without return value like
  __sync_fetch_and_add(&foo, 1);
llvm BPF backend generates locked insn e.g.
  lock *(u32 *)(r1 + 0) += r2

If __sync_fetch_and_add(...) returns a value like
  res = __sync_fetch_and_add(&foo, 1);
llvm BPF backend generates like
  r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)

The above generation of 'lock *(u32 *)(r1 + 0) += r2' caused a problem
in jit since proper barrier is not inserted.

The above discrepancy is due to commit [2] where it tries to maintain
backward compatability since before commit [2],
__sync_fetch_and_add(...) generates lock insn in BPF backend.

Based on discussion in [1], now it is time to fix the above discrepancy
so we can have proper barrier support in jit. This patch made sure that
__sync_fetch_and_add(...) always generates atomic_fetch_add(...) insns.
Now 'lock *(u32 *)(r1 + 0) += r2' can only be generated by inline asm. I
also removed the whole BPFMIChecking.cpp file whose original purpose is
to detect and issue errors if XADD{W,D,W32} may return a value used
subsequently. Since insns XADD{W,D,W32} are all inline asm only now,
such error detection is not needed.

[1]
https://lore.kernel.org/bpf/ZqqiQQWRnz7H93Hc@google.com/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
[2]
286daafd65

Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
2024-08-04 21:03:16 -07:00
Nick Zavaritsky
ae0d2244a2
[BPF] Fix linking issues in static map initializers (#91310)
When BPF object files are linked with bpftool, every symbol must be
accompanied by BTF info. Ensure that extern functions referenced by
global variable initializers are included in BTF.

The primary motivation is "static" initialization of PROG maps:

```c
extern int elsewhere(struct xdp_md *);

struct {
  __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
  __uint(max_entries, 1);
  __type(key, int);
  __type(value, int);
  __array(values, int (struct xdp_md *));
} prog_map SEC(".maps") = { .values = { elsewhere } };
```

BPF backend needs debug info to produce BTF. Debug info is not
normally generated for external variables and functions. Previously, it
was solved differently for variables (collecting variable declarations
    in ExternalDeclarations vector) and functions (logic invoked during
    codegen in CGExpr.cpp).

This patch generalises ExternalDefclarations to include both function
and variable declarations. This change ensures that function references
    are not missed no matter the context. Previously external functions
    referenced in constant expressions lacked debug info.
2024-07-05 07:32:09 -07:00
Alexis Engelke
6859685a87
[CodeGen] Use temp symbol for MBBs (#95031)
Internal label names never occur in the symbol table, so when using an
object streamer, there's no point in constructing these names and then
adding them to hash tables -- they are never visible in the output.

It's not possible to reuse createTempSymbol, because on BPF has a
different prefix for globals and basic blocks right now.
2024-06-20 13:18:41 +02:00
Stephen Tozer
094572701d
[RemoveDIs] Print IR with debug records by default (#91724)
This patch makes the final major change of the RemoveDIs project, changing the
default IR output from debug intrinsics to debug records. This is expected to
break a large number of tests: every single one that tests for uses or
declarations of debug intrinsics and does not explicitly disable writing
records. 

If this patch has broken your downstream tests (or upstream tests on a
configuration I wasn't able to run):
1. If you need to immediately unblock a build, pass
`--write-experimental-debuginfo=false` to LLVM's option processing for all
failing tests (remember to use `-mllvm` for clang/flang to forward arguments to
LLVM).
2. For most test failures, the changes are trivial and mechanical, enough that
they can be done by script; see the migration guide for a guide on how to do
this: https://llvm.org/docs/RemoveDIsDebugInfo.html#test-updates
3. If any tests fail for reasons other than FileCheck check lines that need
updating, such as assertion failures, that is most likely a real bug with this
patch and should be reported as such.

For more information, see the recent PSA:
https://discourse.llvm.org/t/psa-ir-output-changing-from-debug-intrinsics-to-debug-records/79578
2024-06-14 15:07:27 +01:00
Yingchi Long
c6486633d2
[BPF] report Invalid usage of the XADD return value" elegantly (#92742)
Previously `report_fatal_error` is used for reporting something goes
wrong in the backend, but this is confusing because `report_fatal_error`
basically means there are something unexpected & crashed in the backend.

So, turn this "crash" into an elegant error reporting. After this patch,
clang can diagnose it:

    bpf-crash.c:4:30: error: Invalid usage of the XADD return value
4 | u32 next_event_id() { return __sync_fetch_and_add(&GLOBAL_EVENT_ID,
1); }
        |                              ^
    1 error generated.
2024-05-21 01:57:56 +08:00
yonghong-song
4c701577cd
BPF: Use DebugLoc to find Filename for BTF line info (#90302)
Andrii found an issue where the BTF line info may have empty source
which seems wrong. The program is a Meta internal bpf program. I can
reproduce with latest upstream compiler as well. Let the bpf program
built without this patch and then with the following veristat check
where veristat is a bpf verifier tool to do kernel verification for bpf
programs:

  $ veristat -vl2 yhs.bpf.o --log-size=150000000 >& log
  $ rg '^;' log | sort | uniq -c | sort -nr | head -n10
   4206 ; } else if (action->dry_run) { @ src_mitigations.h:57
   3907 ; if (now < start_allow_time) { @ ban.h:17
   3674 ;  @ src_mitigations.h:0
3223 ; if (action->vip_id != ALL_VIPS_ID && action->vip_id != vip_id) {
@ src_mitigations.h:85
1737 ; pkt_info->is_dry_run_drop = action->dry_run; @
src_mitigations.h:26
   1737 ; if (mitigation == ALLOW) { @ src_mitigations.h:28
1737 ; enum match_action mitigation = action->action; @
src_mitigations.h:25
1727 ; void* res = bpf_map_lookup_elem(bpf_map, key); @
filter_helpers.h:498
1691 ; bpf_map_lookup_elem(&rate_limit_config_map, rule_id); @
rate_limit.h:76
   1688 ; if (throttle_cfg) { @ rate_limit.h:85

You can see

   3674 ;  @ src_mitigations.h:0

where we do not have proper line information and line number.

In LLVM Machine IR, some instructions may carry DebugLoc information
to specify where the corresponding source is for this instruction.
The information includes file_name, line_num and col_num.
Each instruction may also attribute to a function in debuginfo.
So there are two ways to find file_name for a particular insn:
  (1) find the corresponding function in debuginfo
      (MI->getMF()->getFunction().getSubprogram()) and then
      find the file_name from DISubprogram.
  (2) find the corresponding file_name from DebugLoc.

The option (1) is used in current implementation. This mostly works.
But if one instruction is somehow generated from multiple functions,
the compiler has to pick just one. This may cause a mismatch between
file_name and line_num/col_num.

Besides potential incorrect mismatch of file_name vs. line_num/col_num,
There is another issue where some DebugLoc has line number 0. For
example,
I dumped the dwarf line table for the above bpf program:
    
Address Line Column File ISA Discriminator OpIndex Flags
------------------ ------ ------ ------ --- ------------- -------
-------------
0x0000000000000000 96 0 17 0 0 0 is_stmt
0x0000000000000010 100 12 17 0 0 0 is_stmt prologue_end
      0x0000000000000020      0     12     17   0             0       0
0x0000000000000058 37 7 17 0 0 0 is_stmt
      0x0000000000000060      0      0     17   0             0       0
      0x0000000000000088     37      7     17   0             0       0
0x0000000000000090 42 75 17 0 0 0 is_stmt
      0x00000000000000a8     42     52     17   0             0       0
0x00000000000000c0 120 9 17 0 0 0 is_stmt
      0x00000000000000c8      0      9     17   0             0       0
0x00000000000000d0 106 21 17 0 0 0 is_stmt
      0x00000000000000d8    106      3     17   0             0       0
0x00000000000000e0 110 25 17 0 0 0 is_stmt
      0x00000000000000f8    110     36     17   0             0       0
      0x0000000000000100      0     36     17   0             0       0
      ...
    
These DebugLoc with line number 0 needs to be skipped since we cannot
map them to the correct source code. Note that selftest
offset-reloc-basic.ll
has this issue as well which is adjusted by this patch.

With the above two fixes, empty lines for source annotation are removed.

  $ veristat -vl2 yhs.bpf.o --log-size=150000000 >& log
  $ rg '^;' log.latest | sort | uniq -c | sort -nr | head -n10
   4206 ; } else if (action->dry_run) { @ src_mitigations.h:57
   3907 ; if (now < start_allow_time) { @ ban.h:17
3223 ; if (action->vip_id != ALL_VIPS_ID && action->vip_id != vip_id) {
@ src_mitigations.h:85
1737 ; pkt_info->is_dry_run_drop = action->dry_run; @
src_mitigations.h:26
   1737 ; if (mitigation == ALLOW) { @ src_mitigations.h:28
1737 ; enum match_action mitigation = action->action; @
src_mitigations.h:25
1727 ; void* res = bpf_map_lookup_elem(bpf_map, key); @
filter_helpers.h:498
1691 ; bpf_map_lookup_elem(&rate_limit_config_map, rule_id); @
rate_limit.h:76
   1688 ; if (throttle_cfg) { @ rate_limit.h:85
   1670 ; if (rl_cfg) { @ rate_limit.h:77

You can see that we do not have empty line any more.

3223 ; if (action->vip_id != ALL_VIPS_ID && action->vip_id != vip_id) {
@ src_mitigations.h:85

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
2024-04-29 09:20:56 -07:00
Yingchi Long
70deb7bfe9
[BPF] expand cttz, ctlz for i32, i64 (#73668)
Fixes: https://github.com/llvm/llvm-project/issues/62252

Depends on: #73667
2024-04-01 10:57:54 +08:00
eddyz87
65b123e287
[BPF] rename 'arena' to 'address_space' (#85161)
There are a few places where `arena` name is used for pointers in
non-zero address space in BPF backend, rename these to use a more
generic `address_space`:
- macro `__BPF_FEATURE_ARENA_CAST` -> `__BPF_FEATURE_ADDR_SPACE_CAST
- name for arena global variables section `.arena.N` ->
`.addr_space.N`
2024-03-14 19:20:06 -07:00
4ast
2aacb56e83
BPF address space insn (#84410)
This commit aims to support BPF arena kernel side
[feature](https://lore.kernel.org/bpf/20240209040608.98927-1-alexei.starovoitov@gmail.com/):
- arena is a memory region accessible from both BPF program and
userspace;
- base pointers for this memory region differ between kernel and user
spaces;
- `dst_reg = addr_space_cast(src_reg, dst_addr_space, src_addr_space)`
translates src_reg, a pointer in src_addr_space to dst_reg, equivalent
pointer in dst_addr_space, {src,dst}_addr_space are immediate constants;
- number 0 is assigned to kernel address space;
- number 1 is assigned to user address space.

On the LLVM side, the goal is to make load and store operations on arena
pointers "transparent" for BPF programs:
- assume that pointers with non-zero address space are pointers to
  arena memory;
- assume that arena is identified by address space number;
- assume that address space zero corresponds to kernel address space;
- assume that every BPF-side load or store from arena is done via
pointer in user address space, thus convert base pointers using
`addr_space_cast(src_reg, 0, 1)`;

Only load, store, cmpxchg and atomicrmw IR instructions are handled by
this transformation.

For example, the following C code:

```c
   #define __as __attribute__((address_space(1)))
   void copy(int __as *from, int __as *to) { *to = *from; }
```

Compiled to the following IR:

```llvm
    define void @copy(ptr addrspace(1) %from, ptr addrspace(1) %to) {
    entry:
      %0 = load i32, ptr addrspace(1) %from, align 4
      store i32 %0, ptr addrspace(1) %to, align 4
      ret void
    }
```

Is transformed to:

```llvm
    %to2 = addrspacecast ptr addrspace(1) %to to ptr     ;; !
    %from1 = addrspacecast ptr addrspace(1) %from to ptr ;; !
    %0 = load i32, ptr %from1, align 4, !tbaa !3
    store i32 %0, ptr %to2, align 4, !tbaa !3
    ret void
```

And compiled as:

```asm
    r2 = addr_space_cast(r2, 0, 1)
    r1 = addr_space_cast(r1, 0, 1)
    r1 = *(u32 *)(r1 + 0)
    *(u32 *)(r2 + 0) = r1
    exit
```

Co-authored-by: Eduard Zingerman <eddyz87@gmail.com>
2024-03-13 02:27:25 +02:00
Nikita Popov
ff9af4c43a [CodeGen] Convert tests to opaque pointers (NFC) 2024-02-05 14:07:09 +01:00
Nikita Popov
90ba33099c
[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.

This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.

The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.

Fixes https://github.com/llvm/llvm-project/issues/69841.
2024-01-24 15:25:29 +01:00
James Y Knight
b856e77b2d
Set MaxAtomicSizeInBitsSupported for remaining targets. (#75703)
Targets affected:

- NVPTX and BPF: set to 64 bits.
- ARC, Lanai, and MSP430: set to 0 (they don't implement atomics).

Those which didn't yet add AtomicExpandPass to their pass pipeline now
do so.

This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass. On all these targets, this
now matches what Clang already does in the frontend.

The only targets which do not configure AtomicExpandPass now are:
- DirectX and SPIRV: they aren't normal backends.
- AVR: a single-cpu architecture with no privileged/user divide, which
could implement all atomics by disabling/enabling interrupts, regardless
of size/alignment. Will be addressed by future work.
2024-01-08 22:34:28 -05:00
Yingwei Zheng
1228becf7d
[FuncAttrs] Deduce noundef attributes for return values (#76553)
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any 
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding 
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6

Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|

The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
2023-12-31 20:44:48 +08:00
Yingchi Long
ddf85b92aa
[BPF] improve error handling by custom lowering & fail() (#75088)
Currently on mcpu=v3 we do not support sdiv, srem instructions. And the
backend crashes with stacktrace & coredump, which is misleading for end
users, as this is not a "bug"

Add llvm bug reporting for sdiv/srem on ISel legalize-op phase.

For clang frontend we can get detailed location & bug report.

    $ build/bin/clang -g -target bpf -c local/sdiv.c
local/sdiv.c:1:35: error: unsupported signed division, please convert to
unsigned div/mod.
        1 | int sdiv(int a, int b) { return a / b; }
          |                                   ^
    1 error generated.

Fixes: #70433
Fixes: #48647

This also improves error handling for dynamic stack allocation:

    local/vla.c:2:3: error: unsupported dynamic stack allocation
        2 |   int b[n];
          |   ^
    1 error generated.

Fixes: https://github.com/llvm/llvm-project/issues/57171
2023-12-13 13:41:52 +08:00
Yingchi Long
c4ac1d239f
[BPF][GlobalISel] select non-PreISelGenericOpcode (#75034)
This selects non-PreISelGenericOpcode as-is.

Depends on: #74999

Co-authored-by: Origami404 <Origami404@foxmail.com>
2023-12-12 16:19:34 +08:00
Yingchi Long
2460bf2fac
[BPF][GlobalISel] add initial gisel support for BPF (#74999)
This adds initial codegen support for BPF backend.

Only implemented ir-translator for "RET" (but not support isel).

Depends on: #74998
2023-12-11 19:58:34 +08:00
Eduard Zingerman
030b8cb156 [BPF] Attribute preserve_static_offset for structs
This commit adds a new BPF specific structure attribte
`__attribute__((preserve_static_offset))` and a pass to deal with it.

This attribute may be attached to a struct or union declaration, where
it notifies the compiler that this structure is a "context" structure.
The following limitations apply to context structures:
- runtime environment might patch access to the fields of this type by
  updating the field offset;

  BPF verifier limits access patterns allowed for certain data
  types. E.g. `struct __sk_buff` and `struct bpf_sock_ops`. For these
  types only `LD/ST <reg> <static-offset>` memory loads and stores are
  allowed.

  This is so because offsets of the fields of these structures do not
  match real offsets in the running kernel. During BPF program
  load/verification loads and stores to the fields of these types are
  rewritten so that offsets match real offsets. For this rewrite to
  happen static offsets have to be encoded in the instructions.

  See `kernel/bpf/verifier.c:convert_ctx_access` function in the Linux
  kernel source tree for details.

- runtime environment might disallow access to the field of the type
  through modified pointers.

  During BPF program verification a tag `PTR_TO_CTX` is tracked for
  register values. In case if register with such tag is modified BPF
  programs are not allowed to read or write memory using register. See
  kernel/bpf/verifier.c:check_mem_access function in the Linux kernel
  source tree for details.

Access to the structure fields is translated to IR as a sequence:
- `(load (getelementptr %ptr %offset))` or
- `(store (getelementptr %ptr %offset))`

During instruction selection phase such sequences are translated as a
single load instruction with embedded offset, e.g. `LDW %ptr, %offset`,
which matches access pattern necessary for the restricted
set of types described above (when `%offset` is static).

Multiple optimizer passes might separate these instructions, this
includes:
- SimplifyCFGPass (sinking)
- InstCombine (sinking)
- GVN (hoisting)

The `preserve_static_offset` attribute marks structures for which the
following transformations happen:
- at the early IR processing stage:
  - `(load (getelementptr ...))` replaced by call to intrinsic
    `llvm.bpf.getelementptr.and.load`;
  - `(store (getelementptr ...))` replaced by call to intrinsic
    `llvm.bpf.getelementptr.and.store`;
- at the late IR processing stage this modification is undone.

Such handling prevents various optimizer passes from generating
sequences of instructions that would be rejected by BPF verifier.

The __attribute__((preserve_static_offset)) has a priority over
__attribute__((preserve_access_index)). When preserve_access_index
attribute is present preserve access index transformations are not
applied.

This addresses the issue reported by the following thread:

https://lore.kernel.org/bpf/CAA-VZPmxh8o8EBcJ=m-DH4ytcxDFmo0JKsm1p1gf40kS0CE3NQ@mail.gmail.com/T/#m4b9ce2ce73b34f34172328f975235fc6f19841b6

This is a second attempt to commit this change, previous reverted
commit is: cb13e9286b6d4e384b5d4203e853d44e2eff0f0f.
The following items had been fixed:
- test case bpf-preserve-static-offset-bitfield.c now uses
  `-triple bpfel` to avoid different codegen for little/big endian
  targets.
- BPFPreserveStaticOffset.cpp:removePAICalls() modified to avoid
  use after free for `WorkList` elements `V`.

Differential Revision: https://reviews.llvm.org/D133361
2023-12-05 19:21:42 +02:00
Eduard Zingerman
2484469803 Revert "[BPF] Attribute preserve_static_offset for structs"
This reverts commit cb13e9286b6d4e384b5d4203e853d44e2eff0f0f.
Buildbot reports MSAN failures in tests added in this commit:
https://lab.llvm.org/buildbot/#/builders/5/builds/38806

Failing tests:
  LLVM :: CodeGen/BPF/preserve-static-offset/load-arr-pai.ll
  LLVM :: CodeGen/BPF/preserve-static-offset/load-ptr-pai.ll
  LLVM :: CodeGen/BPF/preserve-static-offset/load-struct-pai.ll
  LLVM :: CodeGen/BPF/preserve-static-offset/load-union-pai.ll
  LLVM :: CodeGen/BPF/preserve-static-offset/store-pai.ll
2023-11-30 22:29:45 +02:00
Eduard Zingerman
cb13e9286b [BPF] Attribute preserve_static_offset for structs
This commit adds a new BPF specific structure attribte
`__attribute__((preserve_static_offset))` and a pass to deal with it.

This attribute may be attached to a struct or union declaration, where
it notifies the compiler that this structure is a "context" structure.
The following limitations apply to context structures:
- runtime environment might patch access to the fields of this type by
  updating the field offset;

  BPF verifier limits access patterns allowed for certain data
  types. E.g. `struct __sk_buff` and `struct bpf_sock_ops`. For these
  types only `LD/ST <reg> <static-offset>` memory loads and stores are
  allowed.

  This is so because offsets of the fields of these structures do not
  match real offsets in the running kernel. During BPF program
  load/verification loads and stores to the fields of these types are
  rewritten so that offsets match real offsets. For this rewrite to
  happen static offsets have to be encoded in the instructions.

  See `kernel/bpf/verifier.c:convert_ctx_access` function in the Linux
  kernel source tree for details.

- runtime environment might disallow access to the field of the type
  through modified pointers.

  During BPF program verification a tag `PTR_TO_CTX` is tracked for
  register values. In case if register with such tag is modified BPF
  programs are not allowed to read or write memory using register. See
  kernel/bpf/verifier.c:check_mem_access function in the Linux kernel
  source tree for details.

Access to the structure fields is translated to IR as a sequence:
- `(load (getelementptr %ptr %offset))` or
- `(store (getelementptr %ptr %offset))`

During instruction selection phase such sequences are translated as a
single load instruction with embedded offset, e.g. `LDW %ptr, %offset`,
which matches access pattern necessary for the restricted
set of types described above (when `%offset` is static).

Multiple optimizer passes might separate these instructions, this
includes:
- SimplifyCFGPass (sinking)
- InstCombine (sinking)
- GVN (hoisting)

The `preserve_static_offset` attribute marks structures for which the
following transformations happen:
- at the early IR processing stage:
  - `(load (getelementptr ...))` replaced by call to intrinsic
    `llvm.bpf.getelementptr.and.load`;
  - `(store (getelementptr ...))` replaced by call to intrinsic
    `llvm.bpf.getelementptr.and.store`;
- at the late IR processing stage this modification is undone.

Such handling prevents various optimizer passes from generating
sequences of instructions that would be rejected by BPF verifier.

The __attribute__((preserve_static_offset)) has a priority over
__attribute__((preserve_access_index)). When preserve_access_index
attribute is present preserve access index transformations are not
applied.

This addresses the issue reported by the following thread:

https://lore.kernel.org/bpf/CAA-VZPmxh8o8EBcJ=m-DH4ytcxDFmo0JKsm1p1gf40kS0CE3NQ@mail.gmail.com/T/#m4b9ce2ce73b34f34172328f975235fc6f19841b6

Differential Revision: https://reviews.llvm.org/D133361
2023-11-30 19:45:03 +02:00
Philip Reames
a7f35d54ee
[SCEV] Extend isImpliedCondOperandsViaRanges to independent predicates (#71110)
As far as I can tell, there's nothing in this code which actually
assumes the two predicates in (FoundLHS FoundPred FoundRHS) => (LHS Pred
RHS) are the same.

Noticed while investigating something else, this is purely an
oppurtunistic optimization while I'm looking at the code. Unfortunately,
this doesn't solve my original problem. :)
2023-11-07 07:25:47 -08:00
yonghong-song
32e35b21b5
[BPF] Skip modifiers for __builtin_btf_type_id() local type (#71094)
BPF upstream reported an inconsistent behavior w.r.t. BPF_TYPE_ID_LOCAL
vs. BPF_TYPE_ID_TARGET (or BPF_TYPE_ID_REMOTE in LLVM terminology).

For BPF_TYPE_ID_TARGET, all modifiers (like 'const' and 'volatile') are
ignored in the final type encoding. For example, for type
 'const struct foo', the eventually encoding in BTF relocation
is 'struct foo'. This faciliates libbpf to match corresponding kernel
types with considering any modifiers.

Currently behavior for BPF_TYPE_ID_LOCAL is different. It will encode
'const struct foo' in BTF relocation and such discrepancy confused users
([1]).

This patch fixed this discrepancy by making BPF_TYPE_ID_LOCAL BTF type
representation the sams as BPF_TYPE_ID_TARGET. This should have minimum
user impact since ultimately user wants to get a real time not a 'const'
type modifier.

The selftest builtin-btf-type-id-2.ll is used to test BPF_TYPE_ID_TARGET
with 'const' modifier. Adapt the same test for BPF_TYPE_ID_LOCAL. And
the below diff shows now both BPF_TYPE_ID_LOCAL and BPF_TYPE_ID_TARGET
produces the same type:

$ diff test/CodeGen/BPF/BTF/builtin-btf-type-id-2.ll
test/CodeGen/BPF/BTF/builtin-btf-type-id-local.ll
--- test/CodeGen/BPF/BTF/builtin-btf-type-id-2.ll 2023-07-30
16:58:20.657528310 -0700
+++ test/CodeGen/BPF/BTF/builtin-btf-type-id-local.ll 2023-11-02
10:23:25.356959008 -0700
  @@ -6,7 +6,7 @@
   ;     int a;
   ;   };
   ;   int test(void) {
  -;     return __builtin_btf_type_id(*(const struct s *)0, 1);
  +;     return __builtin_btf_type_id(*(const struct s *)0, 0);
   ;   }
   ; Compilation flag:
; clang -target bpf -O2 -g -S -emit-llvm -Xclang -disable-llvm-passes
test.c
  $

[1]
https://lore.kernel.org/bpf/CAN+4W8h3yDjkOLJPiuKVKTpj_08pBz8ke6vN=Lf8gcA=iYBM-g@mail.gmail.com/

Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
2023-11-03 12:52:16 -07:00
Philip Reames
f6f769203d [tests] Autogenerate a couple of tests
As usual, making it easier for an upcoming test delta to be seen.

Note that several of these are examples of extremely bad testing practice.
Checking internal debug output (for no real purpose), and checking the
result of a fully O2 + llc run instead of reducing the specific problematic
pass.
2023-11-03 08:42:23 -07:00
Alex Richardson
e39f6c1844 [opt] Infer DataLayout from triple if not specified
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.

One thing that is not currently possible to differentiate from a missing
datalayout `target datalayout = ""` in the IR file since the current
APIs don't allow detecting this case. If it is considered useful to
support this case (instead of passing "-data-layout=" on the command
line), I can change IR parsers to track whether they have seen such a
directive and change the callback type.

Differential Revision: https://reviews.llvm.org/D141060
2023-10-26 12:07:37 -07:00
Nikita Popov
2ad9fde418
[MemDep] Use EarliestEscapeInfo (#69727)
Use BatchAA with EarliestEscapeInfo instead of callCapturesBefore() in
MemDepAnalysis. The advantage of this is that it will also take
not-captured-before information into account for non-calls (see
test_store_before_capture for a representative example), and that this
is a cached analysis. The disadvantage is that EII is slightly less
precise than full CapturedBefore analysis.

In practice the impact is positive, with gvn.NumGVNLoad going from 22022
to 22808 on test-suite. 

The impact to compile-time is also positive, mainly in the ThinLTO
configuration.
2023-10-23 09:57:26 +02:00
Nikita Popov
a72d88fb4f Revert "Reapply [Verifier] Sanity check alloca size against DILocalVariable fragment size"
This reverts commit 8840da2db237cd714d975c199d5992945d2b71e9.

This results in verifier failures during LTO, see #68929.
2023-10-16 12:17:24 +02:00
Nikita Popov
8840da2db2 Reapply [Verifier] Sanity check alloca size against DILocalVariable fragment size
Reapply now that generation of incorrect debuginfo for FnDef
in rustc has been fixed.

-----

Add a check that the DILocalVariable fragment size in dbg.declare
does not exceed the size of the alloca.

This would have caught the invalid debuginfo regenerated by rustc
in https://github.com/llvm/llvm-project/issues/64149.

Differential Revision: https://reviews.llvm.org/D158743
2023-10-09 14:22:12 +02:00
Alex Richardson
83c4227ab7 Auto-generate test checks for tests affected by D141060
These files had manual CHECK lines which make the diff from D141060
very difficult to review.
2023-10-04 10:51:35 -07:00
Nikita Popov
38c59b9f53 Revert "Reapply [Verifier] Sanity check alloca size against DILocalVariable fragment size"
This reverts commit 47324cfd7d8ca1a2a5cbb9f948ecff66a28ee6bc.

This exposed incorrect debuginfo in rustc. Revert the verification
until this has been fixed.
2023-09-18 17:24:53 +02:00
Nikita Popov
47324cfd7d Reapply [Verifier] Sanity check alloca size against DILocalVariable fragment size
Reapply after fixing a clang bug this exposed in D158972 and
adjusting a number of tests that failed for 32-bit targets.

-----

Add a check that the DILocalVariable fragment size in dbg.declare
does not exceed the size of the alloca.

This would have caught the invalid debuginfo regenerated by rustc
in https://github.com/llvm/llvm-project/issues/64149.

Differential Revision: https://reviews.llvm.org/D158743
2023-09-15 14:51:50 +02:00
Fangrui Song
806761a762 [test] Change llc -march= to -mtriple=
The issue is uncovered by #47698: for IR files without a target triple,
-mtriple= specifies the full target triple while -march= merely sets the
architecture part of the default target triple, leaving a target triple which
may not make sense, e.g. riscv64-apple-darwin.

Therefore, -march= is error-prone and not recommended for tests without a target
triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead
of rejecting it outrightly.
2023-09-11 14:42:37 -07:00
Nikita Popov
98cf20f890 Revert "[Verifier] Sanity check alloca size against DILocalVariable fragment size"
This reverts commit 183f49c3e0f4a7facf237581f83ae07e7f4544ab.

The lang/cpp/trivial_abi/TestTrivialABI.py lldb test fails on
buildbots.
2023-08-28 09:44:51 +02:00
Nikita Popov
183f49c3e0 [Verifier] Sanity check alloca size against DILocalVariable fragment size
Add a check that the DILocalVariable fragment size in dbg.declare
does not exceed the size of the alloca.

This would have caught the invalid debuginfo regenerated by rustc
in https://github.com/llvm/llvm-project/issues/64149.

Differential Revision: https://reviews.llvm.org/D158743
2023-08-28 09:16:33 +02:00
Eduard Zingerman
651e644595 [BPF] Replace BPFMIPeepholeTruncElim by custom logic in isZExtFree()
Replace `BPFMIPeepholeTruncElim` by adding an overload for
`TargetLowering::isZExtFree()` aware that zero extension is
free for `ISD::LOAD`.

Short description
=================

The `BPFMIPeepholeTruncElim` handles two patterns:

Pattern #1:

    %1 = LDB %0, ...              %1 = LDB %0, ...
    %2 = AND_ri %1, 0xff      ->  %2 = MOV_ri %1    <-- (!)

Pattern #2:

    bb.1:                         bb.1:
      %a = LDB %0, ...              %a = LDB %0, ...
      br %bb3                       br %bb3
    bb.2:                         bb.2:
      %b = LDB %0, ...        ->    %b = LDB %0, ...
      br %bb3                       br %bb3
    bb.3:                         bb.3:
      %1 = PHI %a, %b               %1 = PHI %a, %b
      %2 = AND_ri %1, 0xff          %2 = MOV_ri %1  <-- (!)

Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32

Both patterns could be handled by built-in transformations at
instruction selection phase if suitable `isZExtFree()` implementation
is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.

When evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM
test cases this revisions performs slightly better than
BPFMIPeepholeTruncElim, see "Impact" section below for details.

Commit also adds a few test cases to make sure that patterns in
question are handled.

Long description
================

Why this works: Pattern #1
--------------------------

Consider the following example:

    define i1 @foo(ptr %p) {
    entry:
      %a = load i8, ptr %p, align 1
      %cond = icmp eq i8 %a, 0
      ret i1 %cond
    }

Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:

    ...
    Type-legalized selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
          t19: i64 = and t16, Constant:i64<255>
        t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Replacing.1 t19: i64 = and t16, Constant:i64<255>
    With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
     and 0 other values
    ...
    Optimized type-legalized selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 11 nodes:
      t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
        t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...

Note:
- Optimized type-legalized selection DAG:
  - `t19 = and t16, 255` had been replaced by `t16` (load).
  - Patterns like `(and (load ... i8), 255)` are replaced by `load`
    in `DAGCombiner::BackwardsPropagateMask` called from
    `DAGCombiner::visitAND`.
  - Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by
    `(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge,
    look for `TLI.shouldFoldConstantShiftPairToMask()` call).

Why this works: Pattern #2
--------------------------

Consider the following example:

    define i1 @foo(ptr %p) {
    entry:
      %a = load i8, ptr %p, align 1
      br label %next

    next:
      %cond = icmp eq i8 %a, 0
      ret i1 %cond
    }

Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command.
Log for first basic block:

    Initial selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 9 nodes:
      t0: ch,glue = EntryToken
      t3: i64 = Constant<0>
            t2: i64,ch = CopyFromReg t0, Register:i64 %1
          t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
        t6: i64 = zero_extend t5
      t8: ch = CopyToReg t0, Register:i64 %0, t6
    ...
    Replacing.1 t6: i64 = zero_extend t5
    With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
     and 0 other values
    ...
    Optimized lowered selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 7 nodes:
      t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %1
        t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
      t8: ch = CopyToReg t0, Register:i64 %0, t9

Note:
- Initial selection DAG:
  - `%a = load ...` is lowered as `t6 = (zero_extend (load ...))`
    w/o special `isZExtFree()` overload added by this commit
    it is instead lowered as `t6 = (any_extend (load ...))`.
  - The decision to generate `zero_extend` or `any_extend` is
    done in `RegsForValue::getCopyToRegs` called from
    `SelectionDAGBuilder::CopyValueToVirtualRegister`:
    - if `isZExtFree()` for load returns true `zero_extend` is used;
    - `any_extend` is used otherwise.
- Optimized lowered selection DAG:
  - `t6 = (any_extend (load ...))` is replaced by
    `t9 = load ..., zext from i8`
    This is done by `DagCombiner.cpp:tryToFoldExtOfLoad()` called from
    `DAGCombiner::visitZERO_EXTEND`.

Log for second basic block:

    Initial selection DAG: %bb.1 'foo:next'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
                t2: i64,ch = CopyFromReg t0, Register:i64 %0
              t4: i64 = AssertZext t2, ValueType:ch:i8
            t5: i8 = truncate t4
          t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
        t9: i64 = any_extend t8
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Replacing.2 t18: i64 = and t4, Constant:i64<255>
    With: t4: i64 = AssertZext t2, ValueType:ch:i8
    ...
    Type-legalized selection DAG: %bb.1 'foo:next'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t4: i64 = AssertZext t2, ValueType:ch:i8
          t18: i64 = and t4, Constant:i64<255>
        t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Optimized type-legalized selection DAG: %bb.1 'foo:next'
    SelectionDAG has 11 nodes:
      t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t4: i64 = AssertZext t2, ValueType:ch:i8
        t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...

Note:
- Initial selection DAG:
  - `t0` is an input value for this basic block, it corresponds load
    instruction (`t9`) from the first basic block.
  - It is accessed within basic block via
    `t4` (AssertZext (CopyFromReg t0, ...)).
  - The `AssertZext` is generated by RegsForValue::getCopyFromRegs
    called from SelectionDAGBuilder::getCopyFromRegs, it is generated
    only when `LiveOutInfo` with known number of leading zeros is
    present for `t0`.
  - Known register bits in `LiveOutInfo` are computed by
    `SelectionDAG::computeKnownBits` called from
    `SelectionDAGISel::ComputeLiveOutVRegInfo`.
  - `computeKnownBits()` generates leading zeros information for
    `(load ..., zext from ...)` but *does not* generate leading zeros
    information for `(load ..., anyext from ...)`.
    This is why `isZExtFree()` added in this commit is important.
- Type-legalized selection DAG:
  - `t5 = truncate t4` is replaced by `t18 = and t4, 255`
- Optimized type-legalized selection DAG:
  - `t18 = and t4, 255` is replaced by `t4`, this is done by
    `DAGCombiner::SimplifyDemandedBits` called from
    `DAGCombiner::visitAND`, which simplifies patterns like
    `(and (assertzext ...))`

Impact
------

This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and
  remove_truncate_8 improved slightly, for other tests it is
  unchanged.

For remove_truncate_7:

    Before this revision                 After this revision
    --------------------                 -------------------
        r1 <<= 0x20                          r1 <<= 0x20
        r1 >>= 0x20                          r1 >>= 0x20
        if r1 == 0x0 goto +0x2 <LBB0_2>      if r1 == 0x0 goto +0x2 <LBB0_2>
        r1 = *(u32 *)(r2 + 0x0)              r0 = *(u32 *)(r2 + 0x0)
        goto +0x1 <LBB0_3>                   goto +0x1 <LBB0_3>
    <LBB0_2>:                            <LBB0_2>:
        r1 = *(u32 *)(r2 + 0x4)              r0 = *(u32 *)(r2 + 0x4)
    <LBB0_3>:                            <LBB0_3>:
        r0 = r1                              exit
        exit

For remove_truncate_8:

    Before this revision                 After this revision
    --------------------                 -------------------
        r2 = *(u32 *)(r1 + 0x0)              r2 = *(u32 *)(r1 + 0x0)
        r3 = r2                              r3 = r2
        r3 <<= 0x20                          r3 <<= 0x20
        r4 = r3                              r3 s>>= 0x20
        r4 s>>= 0x20
        if r4 s> 0x2 goto +0x5 <LBB0_3>      if r3 s> 0x2 goto +0x4 <LBB0_3>
        r4 = *(u32 *)(r1 + 0x4)              r3 = *(u32 *)(r1 + 0x4)
        r3 >>= 0x20
        if r3 >= r4 goto +0x2 <LBB0_3>       if r2 >= r3 goto +0x2 <LBB0_3>
        r2 += 0x2                            r2 += 0x2
        *(u32 *)(r1 + 0x0) = r2              *(u32 *)(r1 + 0x0) = r2
    <LBB0_3>:                            <LBB0_3>:
        r0 = 0x3                             r0 = 0x3
        exit                                 exit

For kernel BPF selftests statistics is as follows: (-mcpu=v4):
- For -mcpu=v4: 9 out of 655 object files have differences,
  in all cases total number of instructions marginally decreased
  (-27 instructions).
- For -mcpu=v2: 9 out of 655 object files have differences:
  - For 19 object files number of instruction decreased
    (-129 instruction in total): some redundant `rX &= 0xffff`
    and register to register assignments removed;
  - For 2 object files number of instructions increased +2
    instructions in each file.

Both -mcpu=v2 instruction increases could be reduced to the same
example:

    define void @foo(ptr %p) {
    entry:
      %a = load i32, ptr %p, align 4
      %b = sext i32 %a to i64
      %c = icmp ult i64 1, %b
      br i1 %c, label %next, label %end

    next:
      call void inttoptr (i64 62 to ptr)(i32 %a)
      br label %end

    end:
      ret void
    }

Note that this example uses value loaded to `%a` both as a sign
extended (`%b`) and as zero extended (`%a` passed as parameter).
Here is the difference in final assembly code:

    Before this revision          After this revision
    --------------------          -------------------
        r1 = *(u32 *)(r1 + 0)         r1 = *(u32 *)(r1 + 0)
        r1 <<= 32                     r1 <<= 32
        r1 s>>= 32                    r1 s>>= 32
        if r1 < 2 goto <LBB0_2>       if r1 < 2 goto <LBB0_2>
                                      r1 <<= 32
                                      r1 >>= 32
        call 62                       call 62
    <LBB0_2>:                     <LBB0_2>:
        exit                          exit

Before this commit `%a` is passed to call as a sign extended value,
after this commit `%a` is passed to call as a zero extended value,
both are correct as 32-bit sub-register is the same.

The difference comes from `DAGCombiner` operation on the initial DAG:

Initial selection DAG before this commit:

    t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
          t6: i64 = any_extend t5         <--------------------- (1)
        t8: ch = CopyToReg t0, Register:i64 %0, t6
            t9: i64 = sign_extend t5
          t12: i1 = setcc Constant:i64<1>, t9, setult:ch

Initial selection DAG after this commit:

    t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
          t6: i64 = zero_extend t5        <--------------------- (2)
        t8: ch = CopyToReg t0, Register:i64 %0, t6
            t9: i64 = sign_extend t5
          t12: i1 = setcc Constant:i64<1>, t9, setult:ch

The node `t9` is processed before node `t6` and `load` instruction is
combined to load with sign extension:

    Replacing.1 t9: i64 = sign_extend t5
    With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
     and 0 other values
    Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
    With: t31: i32 = truncate t30
     and 1 other values

This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from
`DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which
is `any_extend` in (1) and `zero_extend` in (2).
`tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:
- `any_extend` is simply removed
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later
  converted to a pair of shifts. This pair of shifts survives till the
  end of translation.

Differential Revision: https://reviews.llvm.org/D157870
2023-08-22 00:04:51 +03:00
Eduard Zingerman
8f28e8069c [BPF] support for BPF_ST instruction in codegen
Generate store immediate instruction when CPUv4 is enabled.
For example:

    $ cat test.c
    struct foo {
      unsigned char  b;
      unsigned short h;
      unsigned int   w;
      unsigned long  d;
    };
    void bar(volatile struct foo *p) {
      p->b = 1;
      p->h = 2;
      p->w = 3;
      p->d = 4;
    }

    $ clang -O2 --target=bpf -mcpu=v4 test.c -c -o - | llvm-objdump -d -
    ...
    0000000000000000 <bar>:
           0:	72 01 00 00 01 00 00 00	*(u8 *)(r1 + 0x0) = 0x1
           1:	6a 01 02 00 02 00 00 00	*(u16 *)(r1 + 0x2) = 0x2
           2:	62 01 04 00 03 00 00 00	*(u32 *)(r1 + 0x4) = 0x3
           3:	7a 01 08 00 04 00 00 00	*(u64 *)(r1 + 0x8) = 0x4
           4:	95 00 00 00 00 00 00 00	exit

Take special care to:
- apply `BPFMISimplifyPatchable::checkADDrr` rewrite for BPF_ST
- validate immediate value when BPF_ST write is 64-bit:
  BPF interprets `(BPF_ST | BPF_MEM | BPF_DW)` writes as writes with
  sign extension. Thus it is fine to generate such write when
  immediate is -1, but it is incorrect to generate such write when
  immediate is +0xffff_ffff.

This commit was previously reverted in e66affa17e32.
The reason for revert was an unrelated bug in BPF backend,
triggered by test case added in this commit if LLVM is built
with LLVM_ENABLE_EXPENSIVE_CHECKS.
The bug was fixed in D157806.

Differential Revision: https://reviews.llvm.org/D140804
2023-08-16 17:51:28 +03:00
Eduard Zingerman
08d92dedd2 [BPF] Fix in/out argument constraints for CORE_MEM instructions
When LLVM is build with `LLVM_ENABLE_EXPENSIVE_CHECKS=ON` option the
following C code snippet:

    struct t {
      int a;
    } __attribute__((preserve_access_index));

    void test(struct t *t) {
      t->a = 42;
    }

Causes an assertion:

$ clang -g -O2 -c --target=bpf -mcpu=v2 t.c -o /dev/null

Function Live Ins: $r1 in %0

bb.0.entry:
  liveins: $r1
  DBG_VALUE $r1, $noreg, !"t", ...
  %0:gpr = COPY $r1
  DBG_VALUE %0:gpr, $noreg, !"t", ...
  %1:gpr = LD_imm64 @"llvm.t:0:0$0:0"
  %3:gpr = ADD_rr %0:gpr(tied-def 0), killed %1:gpr
  %4:gpr = MOV_ri 42
  CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...
  RET debug-location !25; t.c:7:1

*** Bad machine code: Explicit definition marked as use ***
- function:    test
- basic block: %bb.0 entry (0x6210000d8a90)
- instruction: CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...
- operand 0:   killed %4:gpr

This happens because `CORE_MEM` instruction is defined to have output
operands:

  def CORE_MEM : TYPE_LD_ST<BPF_MEM.Value, BPF_W.Value,
                            (outs GPR:$dst),
                            (ins u64imm:$opcode, GPR:$src, u64imm:$offset),
                            "$dst = core_mem($opcode, $src, $offset)",
                            []>;

As documented in [1]:

> By convention, the LLVM code generator orders instruction operands
> so that all register definitions come before the register uses, even
> on architectures that are normally printed in other orders.

In other words, the first argument for `CORE_MEM` is considered to be
a "def", while in reality it is "use":

  %1:gpr = LD_imm64 @"llvm.t:0:0$0:0"
  %3:gpr = ADD_rr %0:gpr(tied-def 0), killed %1:gpr
  %4:gpr = MOV_ri 42
   '---------------.
                   v
  CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...

Here is how `CORE_MEM` is constructed in
`BPFMISimplifyPatchable::checkADDrr()`:

    BuildMI(*DefInst->getParent(), *DefInst, DefInst->getDebugLoc(), TII->get(COREOp))
        .add(DefInst->getOperand(0)).addImm(Opcode).add(*BaseOp)
        .addGlobalAddress(GVal);

Note that first operand is constructed as `.add(DefInst->getOperand(0))`.

For `LD{D,W,H,B}` instructions the `DefInst->getOperand(0)` is a
destination register of a load, so instruction is constructed in
accordance with `outs` declaration.

For `ST{D,W,H,B}` instructions the `DefInst->getOperand(0)` is a
source register of a store (value to be stored), so instruction
violates the `outs` declaration.

This commit fixes the issue by splitting `CORE_MEM` in three
instructions: `CORE_ST`, `CORE_LD64`, `CORE_LD32` with correct `outs`
specifications.

[1] https://llvm.org/docs/CodeGenerator.html#the-machineinstr-class

Differential Revision: https://reviews.llvm.org/D157806
2023-08-15 02:34:21 +03:00
Eduard Zingerman
27026fe563 [BPF] Reset machine register kill mark in BPFMISimplifyPatchable
When LLVM is build with `LLVM_ENABLE_EXPENSIVE_CHECKS=ON` option
the following C code snippet:

    struct t {
      unsigned long a;
    } __attribute__((preserve_access_index));

    void foo(volatile struct t *t, volatile unsigned long *p) {
      *p = t->a;
      *p = t->a;
    }

Causes an assertion:

    $ clang -g -O2 -c --target=bpf -mcpu=v2 t2.c -o /dev/null

    # After BPF PreEmit SimplifyPatchable
    # Machine code for function foo: IsSSA, TracksLiveness
    Function Live Ins: $r1 in %0, $r2 in %1

    bb.0.entry:
      liveins: $r1, $r2
      DBG_VALUE $r1, $noreg, !"t", !DIExpression()
      DBG_VALUE $r2, $noreg, !"p", !DIExpression()
      %1:gpr = COPY $r2
      DBG_VALUE %1:gpr, $noreg, !"p", !DIExpression()
      %0:gpr = COPY $r1
      DBG_VALUE %0:gpr, $noreg, !"t", !DIExpression()
      %2:gpr = LD_imm64 @"llvm.t:0:0$0:0"
      %4:gpr = ADD_rr %0:gpr(tied-def 0), killed %2:gpr
      %5:gpr = CORE_LD 344, %0:gpr, @"llvm.t:0:0$0:0"
      STD killed %5:gpr, %1:gpr, 0
      %7:gpr = ADD_rr %0:gpr(tied-def 0), killed %2:gpr
      %8:gpr = CORE_LD 344, %0:gpr, @"llvm.t:0:0$0:0"
      STD killed %8:gpr, %1:gpr, 0
      RET

    # End machine code for function foo.

    *** Bad machine code: Using a killed virtual register ***
    - function:    foo
    - basic block: %bb.0 entry (0x6210000e6690)
    - instruction: %7:gpr = ADD_rr %0:gpr(tied-def 0), killed %2:gpr
    - operand 2:   killed %2:gpr

This happens because of the way
BPFMISimplifyPatchable::processDstReg() updates second operand of the
`ADD_rr` instruction. Code before `BPFMISimplifyPatchable`:

    .-> %2:gpr = LD_imm64 @"llvm.t:0:0$0:0"
    |
    |`----------------.
    |   %3:gpr = LDD %2:gpr, 0
    |   %4:gpr = ADD_rr %0:gpr(tied-def 0), killed %3:gpr <--- (1)
    |   %5:gpr = LDD killed %4:gpr, 0       ^^^^^^^^^^^^^
    |   STD killed %5:gpr, %1:gpr, 0        this is updated
     `----------------.
        %6:gpr = LDD %2:gpr, 0
        %7:gpr = ADD_rr %0:gpr(tied-def 0), killed %6:gpr <--- (2)
        %8:gpr = LDD killed %7:gpr, 0       ^^^^^^^^^^^^^
        STD killed %8:gpr, %1:gpr, 0        this is updated

Instructions (1) and (2) would be updated to:

    ADD_rr %0:gpr(tied-def 0), killed %2:gpr

The `killed` mark is inherited from machine operands `killed %3:gpr`
and `killed %6:gpr` which are updated inplace by `processDstReg()`.

This commit updates `processDstReg()` reset kill marks for updated
machine operands to keep liveness information conservatively correct.

Differential Revision: https://reviews.llvm.org/D157805
2023-08-15 02:23:38 +03:00
Eduard Zingerman
e66affa17e Revert "[BPF] support for BPF_ST instruction in codegen"
This reverts commit 92e28e397d4ccf1bff075f48e22cf1e23a7d02bf.

Reverting to investigate buildbot failure reported in [1].

    field-reloc-st-imm.ll:
    *** Bad machine code: Explicit definition must be a register ***
    - function:    bar
    - basic block: %bb.0 entry (0x742f318)
    - instruction: CORE_MEM 3, 416, %0:gpr, @"llvm.foo:0:4$0:2", ...
    - operand 0:   3
    *** Bad machine code: Explicit definition must be a register ***
    - function:    bar
    - basic block: %bb.0 entry (0x742f318)
    - instruction: CORE_MEM 4, 410, %0:gpr, @"llvm.foo:0:8$0:3", ...
    - operand 0:   4
    LLVM ERROR: Found 4 machine code errors.

[1] https://lab.llvm.org/buildbot/#/builders/16/builds/52877
2023-08-11 02:23:40 +03:00
Eduard Zingerman
92e28e397d [BPF] support for BPF_ST instruction in codegen
Generate store immediate instruction when CPUv4 is enabled.
For example:

    $ cat test.c
    struct foo {
      unsigned char  b;
      unsigned short h;
      unsigned int   w;
      unsigned long  d;
    };
    void bar(volatile struct foo *p) {
      p->b = 1;
      p->h = 2;
      p->w = 3;
      p->d = 4;
    }

    $ clang -O2 --target=bpf -mcpu=v4 test.c -c -o - | llvm-objdump -d -
    ...
    0000000000000000 <bar>:
           0:	72 01 00 00 01 00 00 00	*(u8 *)(r1 + 0x0) = 0x1
           1:	6a 01 02 00 02 00 00 00	*(u16 *)(r1 + 0x2) = 0x2
           2:	62 01 04 00 03 00 00 00	*(u32 *)(r1 + 0x4) = 0x3
           3:	7a 01 08 00 04 00 00 00	*(u64 *)(r1 + 0x8) = 0x4
           4:	95 00 00 00 00 00 00 00	exit

Take special care to:
- apply `BPFMISimplifyPatchable::checkADDrr` rewrite for BPF_ST
- validate immediate value when BPF_ST write is 64-bit:
  BPF interprets `(BPF_ST | BPF_MEM | BPF_DW)` writes as writes with
  sign extension. Thus it is fine to generate such write when
  immediate is -1, but it is incorrect to generate such write when
  immediate is +0xffff_ffff.

Differential Revision: https://reviews.llvm.org/D140804
2023-08-11 02:07:29 +03:00