Before llvm20, (void)__sync_fetch_and_add(...) always generates locked
xadd insns. In linux kernel upstream discussion [1], it is found that
for arm64 architecture, the original semantics of
(void)__sync_fetch_and_add(...), i.e., __atomic_fetch_add(...), is
preferred in order for jit to emit proper native barrier insns.
In llvm commits [2] and [3], (void)__sync_fetch_and_add(...) will
generate the following insns:
- for cpu v1/v2: locked xadd insns to keep backward compatibility
- for cpu v3/v4: __atomic_fetch_add() insns
To ensure proper barrier semantics for (void)__sync_fetch_and_add(...),
cpu v3/v4 is recommended.
This patch enables cpu=v3 as the default cpu version. For users wanting
to use cpu v1, -mcpu=v1 needs to be explicitly added to clang/llc
command line.
[1]
https://lore.kernel.org/bpf/ZqqiQQWRnz7H93Hc@google.com/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
[2] https://github.com/llvm/llvm-project/pull/101428
[3] https://github.com/llvm/llvm-project/pull/106494
This patch adds a common lower action for `G_FABS`, which generates `and
x8, x8, #0x7fffffffffffffff` to reset the sign bit. The action does not
support vectors since `G_AND` does not support fp128.
This approach is different than what SDAG is doing. SDAG stores the
value onto stack, clears the sign bit in the most significant byte, and
loads the value back into register. This involves multiple memory ops
and sounds slower.
The LegalizeDAG expansion will go through memory since i16 isn't a legal
type. Avoid this by using FMV nodes.
Similar to what we did for #106886 for FNEG and FABS. Special care is
needed to handle the Sign operand being a different type.
This patch introduces lowering of the partial add reduction intrinsic to
a udot or svdot for AArch64. This also involves adding a
`shouldExpandPartialReductionIntrinsic` target hook, which AArch64 will
return false from in the cases that it can be lowered.
This test case was failing to compile with a "ran out of registers
during register allocation" error at -O0. This was because CMP_SWAP_64
has 3 operands which must be an even-odd register pair, and two other
GPR operands. All of the def operands are also early-clobber, so
registers can't be shared between uses and defs. Because the function
has an over-aligned alloca it needs frame and base pointers, so r6 and
r11 are both reserved. That leaves r0/r1, r2/r3, r4/r5 and r8/r9 as the
only valid register pairs, and if the two individual GPR operands happen
to get allocated to registers in different pairs then only 2 pairs will
be available for the three GPRPair operands.
To fix this, I've merged the two GPR operands into a single GPRPair
operand. This means that the instruction now has 4 GPRPair operands,
which can always be allocated without relying on luck. This does
constrain register allocation a bit more, but this pseudo instruction is
only used at -O0, so I don't think that's a problem.
Most of it is redundant with bfloat-convert.ll. One testcase is found in
bfloat-imm.ll. The load and stores are more thoroughly tested in
bfloat-mem.ll.
This is for an upcoming change to the threshold on Apple targets for
using a constant pool for FP literals versus building them with integer
moves.
This file is based on literal_pools_float.ll. I tried to bolt on to the
existing test, but it got messy as that file is already testing a matrix
of combinations, so creating this new file instead.
Basic infrastructure to collect Function properties in Metadata Analysis
- Add a `SmallVector` of entry properties to the metadata information.
- Add a structure to represent function properties. Currently
`numthreads` and shader kind properties of shader entry functions are
represented.
Stop adding liveins for virtual registers. In the livein interface, the
register goes through a MCPhysReg which is uint16_t. This causes the
virtual register bit to be dropped making it alias to some nonsense
physical register.
Recompute the liveins for the continue block to handle any live
registers that are needed by instructions that were spliced from the
original block. This fixing the machine verifier error so we can remove
that fixme now.
This patch handles target lowering and calling convention.
For target lowering, the vector tuple type represented as multiple
scalable vectors is now changed to a single `MVT`, each `MVT` has a
corresponding register class.
The load/store of vector tuples are handled as the same way but need
another vector insert/extract instructions to get sub-register group.
Inline assembly constraint for vector tuple type can directly be modeled
as "vr" which is identical to normal vector registers.
For calling convention, it no longer needs an alternative algorithm to
handle register allocation, this makes the code easier to maintain and
read.
Stacked on https://github.com/llvm/llvm-project/pull/97994
vmerge.vvm needs to have an all ones mask, so nothing is taken from the
false operand. So instead of checking that the passthru is the same as
false, just use the passthru directly for the tail elements.
This supersedes the convertVMergeToVMv part of #105788, as noted in
https://github.com/llvm/llvm-project/pull/105788/files#r1731683971
This patch contains two pars:
- first to revert the patch https://github.com/llvm/llvm-project/pull/101428.
- second to remove `atomic_fetch_and_*()` to `atomic_<op>()`
conversion (when return value is not used), but preserve
`__sync_fetch_and_add()` to locked insn with cpu v1/v2.
Reverts llvm/llvm-project#103371
There is `heap-use-after-free`, commented on
206b5aff44a95754f6dd7a5696efa024e983ac59
Maybe `if (Next == E || BB != Next->getParent()) {` is enough,
but not sure, what was the intent there,
Similar to what we do in foldVMV_V_V with the passthru, if we end up
changing the Src's VL in tryReduceVL we need to make sure it dominates.
Fixes#106735
Getting this to work required a few additional changes:
- Add builtins for any instructions that can't be done with plain C
currently.
- Add support for the saturating version of fp_to_<s,i>_I16x8. Other
vector sizes supported this already.
- Support bitcast of f16x8 to v128. Needed to return a __f16x8 as
v128_t.
If a lowering changed control flow, resume the legalization
loop at the first newly inserted block.
This will allow incrementally legalizing atomicrmw and cmpxchg.
The AArch64 test might be a bugfix. Previously it would lower
the vector FP case as a cmpxchg loop, but cmpxchgs get lowered
but previously weren't. Maybe it shouldn't be reporting cmpxchg
for the expand type in the first place though.
When getting a neutral value, we can prefer using a positive zero over a
negative zero if nsz is set on the FADD (or reduction). A positive zero
should be cheaper to materialize on basically all targets.
Arguably, we should be doing this kind of canonicalization in
DAGCombine, but we don't do that for any of the other reduction
variants, so this seems like path of least resistance. This does mean
that we can only do this for "fast" reductions. Just nsz isn't enough,
as that goes through the SEQ_FADD path where the IR level start value
isn't folded away.
If folks think this is to RISCV specific, let me know. There's a trivial
RISCV specific implementation. I went with the generic one as I through
this might benefit other targets.
Ever since 6859685a87ad093d60c8bed60b116143c0a684c7 (or, precisely,
84428dafc0941e3a31303fa1b286835ab2b8e234) relative jumps emitted by the
AVR codegen are off by two bytes - this pull request fixes it.
## Abstract
As compared to absolute jumps, relative jumps - such as rjmp, rcall or
brsh - have an implied `pc+2` behavior; that is, `jmp 100` is `pc =
100`, but `rjmp 100` gets understood as `pc = pc + 100 + 2`.
This is not reflected in the AVR codegen:
f95026dbf6/llvm/lib/Target/AVR/MCTargetDesc/AVRAsmBackend.cpp (L89)
... which always emits relative jumps that are two bytes too far - or
rather it _would_ emit such jumps if not for this check:
f95026dbf6/llvm/lib/Target/AVR/MCTargetDesc/AVRAsmBackend.cpp (L517)
... which causes most of the relative jumps to be actually resolved
late, by the linker, which applies the offsetting logic on its own,
hiding the issue within LLVM.
[Some time
ago](697a162fa6)
we've had a similar "jumps are off" problem that got solved by touching
`shouldForceRelocation()`, but I think that has worked only by accident.
It's exploited the fact that absolute vs relative jumps in the parsed
assembly can be distinguished through a "side channel" check relying on
the existence of labels (i.e. absolute jumps happen to named labels, but
relative jumps are anonymous, so to say). This was an alright idea back
then, but it got broken by 6859685a87ad093d60c8bed60b116143c0a684c7.
I propose a different approach:
- when emitting relative jumps, offset them by `-2` (well, `-1`,
strictly speaking, because those instructions rely on right-shifted
offset),
- when parsing relative jumps, treat `.` as `+2` and read `rjmp .+1234`
as `rjmp (1234 + 2)`.
This approach seems to be sound and now we generate the same assembly as
avr-gcc, which can be confirmed with:
```cpp
// avr-gcc test.c -O3 && avr-objdump -d a.out
int main() {
asm(
" foo:\n\t"
" rjmp .+2\n\t"
" rjmp .-2\n\t"
" rjmp foo\n\t"
" rjmp .+8\n\t"
" rjmp end\n\t"
" rjmp .+0\n\t"
" end:\n\t"
" rjmp .-4\n\t"
" rjmp .-6\n\t"
" x:\n\t"
" rjmp x\n\t"
" .short 0xc00f\n\t"
);
}
```
avr-gcc is also how I got the opcodes for all new tests like `inst-brbc.s`, so we should be good.
Adds support for wider-than-legal vector types for the histogram
intrinsic (llvm.experimental.vector.histogram.add) by splitting the
vector. Also adds integer promotion for the Inc operand.
Address space information may be encoded anywhere along the use-def
chain. Take advantage of this by traversing the chain until we find a
non-generic addrspace.
When [lowering return
values](99a10f1fe8/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (L3422))
from LLVM IR to SelectionDAG, we check that [the number of values
`SelectionDAG` tells us to return is equal to the number of values that
`ComputePTXValueVTs()` tells us to
return](99a10f1fe8/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (L3441)).
However, this check can fail on valid IR. For example:
```
define <6 x half> @foo() {
ret <6 x half> zeroinitializer
}
```
`ComputePTXValueVTs()` tells us to return ***3*** `v2f16` values, while
`SelectionDAG` tells us to return ***6*** `f16` values. Thus, the
compiler will crash.
`ComputePTXValueVTs()` [supports all `half` element vectors with an even
number of
elements](99a10f1fe8/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (L213)).
Whereas `SelectionDAG` [only supports power-of-2 sized
vectors](4e078e3797/llvm/lib/CodeGen/TargetLoweringBase.cpp (L1580)).
This is the root of the discrepancy.
Assuming that the developers who added the code to
`ComputePTXValueVTs()` overlooked this, I've restricted
`ComputePTXValueVTs()` to compute the same number of return values as
`SelectionDAG`, instead of extending `SelectionDAG` to support
non-power-of-2 sized vectors.
In #106110 we had to mark v[f]slide1down.vx as
ActiveElementsAffectResult since the elements in the body depend on VL.
However it doesn't depend on the mask, so this was overly conservative
and broke the vmerge peephole.
We can recover this by splitting up ActiveElementsAffectResult into VL
and Mask bits, so we can more accurately model v[f]slide1down.vx and
re-enable the peephole.
This preserves the semantis of fneg and matches what we do in
LegalizeDAG.
I kept the legal FSUB check to force unrolling for some targets that
don't have FSUB but have XOR. On Aarch64, using xor broke some tests that
expected to see a (v1f64 (fma (insertvector_elt (f64 (fneg
(extractvectorelt X)))))) pattern.
One of the tests for the new fake use intrinsic are failing on darwin
buildbots due to relying on behaviour for their expected triple; this
commit adds explicit triples to the few remaining fake-use tests that
didn't have them.
Fixes commit 3d08ade (#86149).
Buildbot failures:
https://lab.llvm.org/buildbot/#/builders/23/builds/2505