In a `__asm` block, a symbol reference is usually a memory constraint
(indirect TargetLowering::C_Memory) [LOOP]. CALL and JUMP instructions are special
that `__asm call k` can be an address constraint, if `k` is a function.
Clang always gives us indirect TargetLowering::C_Memory and need to convert it
to direct TargetLowering::C_Address. D133914 implements this conversion, but
does not consider JMP or case-insensitive CALL. This patch implements the missing
cases, so that `__asm jmp k` (`jmp ${0:P}`) will correctly lower to `jmp _k`
instead of `jmp dword ptr [_k]`.
(`__asm call k` lowered to `call dword ptr ${0:P}` and is fixed by D149695 to
lower to `call ${0:P}` instead.)
[LOOP]: Some instructions like LOOP{,E,NE} and Jcc always use an address
constraint (`loop _k` instead of `loop dword ptr [_k]`).
After this patch and D149579, all the following cases will be correct.
```
int k(int);
int (*kptr)(int);
...
__asm call k; // correct without this patch
__asm CALL k; // correct, but needs this patch to be compatible with D149579
__asm jmp k; // correct, but needs this patch to be compatible with D149579
__asm call kptr; // will be fixed by D149579. "Broken case" in clang/test/CodeGen/ms-inline-asm-functions.c
__asm jmp kptr; // will be fixed by this patch and D149579
```
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D149920
This intrinsic is meant to lower directly to the i8x16.shuffle instruction,
which takes its lane index arguments as immmediates. The ISel for the intrinsic
assumed that the lane index arguments were constants, so bitcode that
"incorrectly" used this intrinsic with non-immediate arguments caused an
assertion failure in the backend.
Avoid the crash by defining the lane index arguments to be immediates, matching
the underlying instruction. Update ISel accordingly. This change means that the
bitcode that previously caused a crash will now fail to validate.
Fixes#55559.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D149898
If vector has two different values and it can be splitted into two sub
vectors with same length, generate two DUP and CONCAT_VECTORS/VECTOR_SHUFFLE.
For example,
t22: v16i8 = BUILD_VECTOR t23, t23, t23, t23, t23, t23, t23, t23,
t24, t24, t24, t24, t24, t24, t24, t24
==>
t26: v8i8 = AArch64ISD::DUP t23
t28: v8i8 = AArch64ISD::DUP t24
t29: v16i8 = concat_vectors t26, t28
Differential Revision: https://reviews.llvm.org/D148347
Emit FNMADD instead of FNEG(FMADD) for optimization levels
above Oz when fast-math flags (nsz+contract) permit it.
Differential Revision: https://reviews.llvm.org/D149260
Instead of equality comparison of value to preferred zero we can check just
the sign of value and if sign is set we should put this value as second operand for minimum
and first operand for maximum.
In this case FMIN/FMAX will choose the right result for 0.f and -0.f comparison.
This allows us:
1. avoid loading of big 64-bit constant for fminimum.
2. for double on non-64-nib platform we need to check only high part of value.
3. test against zero to check sign takes less size of instruction
Additionally, if we know that any of value is guaranteed to be non-zero
we should not care about 0.f and -0.f comparison.
Reviewed By: e-kud
Differential Revision: https://reviews.llvm.org/D149812
After applying FMIN/FMAX, if any of operands is NaN, the second operand will be the result.
So all we need is to check whether first operand is NaN and return it or result of FMIN/FMAX.
So we avoid usage of constant NaN in the lowering.
Additionally we can avoid handling NaN after FMIN/FMAX if we are sure that first operand is not NaN.
Reviewed By: e-kud
Differential Revision: https://reviews.llvm.org/D149729
Emit FNMADD instead of FNEG(FMADD) for optimization levels
above Oz when fast-math flags (nsz+contract) permit it.
Differential Revision: https://reviews.llvm.org/D149260
Otherwise I think we extract and use a build_vector. There may be
some more improvements that can be made and there might be some
cases that we should do something different for, but this seemed like a
decent starting point.
Reviewed By: luke
Differential Revision: https://reviews.llvm.org/D149724
The patch makes visitFSUBForFMACombine serve vp.fsub too. It helps DAGCombiner
to fuse vp.fsub and vp.fmul patterns to vp.fma.
Reviewed By: luke
Differential Revision: https://reviews.llvm.org/D149821
This patch mostly adapts the existing AMDGPUCtorDtorLoweringPass for use
by the Nvidia backend. This pass transforms the ctor / dtor list into a
kernel call that can be used to invoke those functinos. Furthermore, we
emit globals such that the names and addresses of these constructor
functions can be found by the driver. Unfortunately, since NVPTX has no
way to emit variables at a named section, nor a functioning linker to
provide the begin / end symbols, we need to mangle these names and have
an external application find them.
This work is related to the work in D149398 and D149340.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D149451
It looks like the intention here is to truncate a XLenVT -> i1, in
which case we should be emitting snez instead of sneq if I'm understanding
correctly.
Reviewed By: jacquesguan, frasercrmck
Differential Revision: https://reviews.llvm.org/D149732
Some debuggers like DBX on AIX assume the address in debug line
entries is always incremental. But clang generates two entries (entry
for file scope line and entry for prologue end) with same address if
prologue is empty
And if the prologue is empty, seems the first debug line entry for the
function is unnecessary(i.e. removing the first entry won't impact the
behavior in GDB on Linux), so I implement this for all debuggers.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D147506
The wasm64 versions of the v128.storeX_lane instructions was incorrectly defined
as returning a v128 value, which resulted in spurious drop instructions being
emitted and causing validation to fail. This was not caught earlier because
wasm64 has been experimental and not well tested. Update the relevant test file
to test both wasm32 and wasm64.
Fixes#62443.
Differential Revision: https://reviews.llvm.org/D149780
Re-land D145441 with data layout upgrade code fixed to not break OpenMP.
This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27.
Differential Revision: https://reviews.llvm.org/D149776
Instead of checking if the given bitwidth is less or equal to a bitwidth of an existing RegClass,
check if it has the exact same value.
For LLVM vector types that don't have a corresponding Register Class, widen them during legalization.
That goes for G_EXTRACT_VECTOR_ELT, G_INSERT_VECTOR_ELT and G_BUILD_VECTOR.
Differential revision: https://reviews.llvm.org/D148096
Reviewers: foad, arsenm
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.
The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.
The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them instead of <4 x i32> as resource arguments. ptr
addrspace(8). These pointers are 128-bits long (with the same
alignment). They must not be used as the arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.
Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.
This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.
Depends on D143437
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D145441
Improves the codegen for VECREDUCE_{AND,OR,XOR} operations on AArch64.
Currently, these are fully scalarized, except if the vector is a <N x i1>. This
patch improves the codegen down to O(log(N)) where N is the length of the
vector for vectors whose elements are not i1, by repeatedly applying the
bitwise operations to the two halves of the vector. <N x i1> bitwise reductions
are handled using VECREDUCE_{UMAX,UMIN,ADD} instead.
I had to update quite a few codegen tests with these changes, with a general
downward trend in instruction count. Since the vector reductions already have
tests, I haven't added any new tests myself.
Differential Revision: https://reviews.llvm.org/D148185
This is a follow up to D149263 which extends the generic vslide1down handling to use vslidedown (without the one) for undef elements, and in particular for undef sub-sequences. This both removes the domain crossing, and for undef subsequences results in fewer instructions over all.
Differential Revision: https://reviews.llvm.org/D149658#inline-1446673
When the values are in GPRs, the vslide1down lowering is always better. We need to greatly improve the splat-and-mask cost model to handle constants in a meaningful way, so for now, limit this to non-constant vectors.
This does send the "partially constant" case down the vslide1down path. This could cause some regressions, though I don't see any in practice.
The cost modeling for the general case is annoyingly tricky. We have a great amount of inconsistency around immediate operands, and as a result, the exact constant and exact lowering choice matters a lot. I'm hoping that we get a "good enough" result without modeling this exactly, but we may need to do something analogous to getIntMatCost (i.e. a search w/costing).
Differential Revision: https://reviews.llvm.org/D149667
This tries to push the concat in trunc(concat(rshr, rshr)) into the leaves, so
that we can generate rshrn(concat). This helps improve the codegen for small
types, using the existing rshrn patterns.
Differential Revision: https://reviews.llvm.org/D149636
Allow shrink-wrapping past memory accesses that only access globals or
function arguments. This patch uses getUnderlyingObject to try to
identify the accessed object by a given memory operand. If it is a
global or an argument, it does not access the stack of the current
function and should not block shrink wrapping.
Note that the caller's stack may get accessed when passing an argument
via the stack, but not the stack of the current function.
This addresses part of the TODO from D63152.
Reviewed By: thegameg
Differential Revision: https://reviews.llvm.org/D149668
Small split change from D146023.
Migrate elf-notes to v4 and fix llvm-readobj to work with PAL metadata.
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D146119
Adds pre-SSE4 v8i16 abdu handling - we already have something similar for umax(x,y) -> add(x,usubsat(y,x)) / umin(x,y) -> sub(x,usubsat(x,y))
(I'm starting to look at adding generic TargetLowering expandABD() handling and came across this missed opportunity).
Inspiration: http://0x80.pl/notesen/2018-03-11-sse-abs-unsigned.html
Alive2: https://alive2.llvm.org/ce/z/gMhaTa