Rather than having a separate pass to add the hint instructions,
emit them directly into the streamer during asm printing.
Reviewed By: BeMg, kito-cheng
Differential Revision: https://reviews.llvm.org/D149511
This AND immediately gets legalized to RISCVISD::VMAND_VL and we don't
yet have DAG combine to optimize that away. So this is a quick fix to
improve generated code.
Instruction selection was failing when trying to zero extend a value
loaded from a PC-relative address. This adds support for zero extension
using the "program counter indirect with displacement" addressing mode.
It also adds a test with code that was previously failing to compile.
This fixes a compile error in Rust's libcore.
Differential Revision: https://reviews.llvm.org/D149034
This uses the patterns defined in MVE_TwoOpPattern to add predicated patterns
for vshls/u instructions.
Differnetial Revision: https://reviews.llvm.org/D149366
This function gets called for vectors and ISD::SELECT_CC was never
intended to support vectors. Some updates were made to support
it when this function started getting used for vectors.
Overall, using separate ISD::SETCC and ISD::SELECT looks like an
improvement even for scalar.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D149481
For the GPU, we emit external kernels that call the initializers and
constructors, however if we had a persistent kernel like in the `_start`
kernel for the `libc` project, we could initialize the standard way of
calling constructors. This patch adds new global variables containing
pointers to the constructors to be called. If these are placed in the
`.init_array` and `.fini_array` sections, then the backend will handle
them specially. The linker will then provide the `__init_array_` and
`__fini_array_` sections to traverse them. An implementation would look
like this.
```
extern uintptr_t __init_array_start[];
extern uintptr_t __init_array_end[];
extern uintptr_t __fini_array_start[];
extern uintptr_t __fini_array_end[];
using InitCallback = void(int, char **, char **);
using FiniCallback = void(void);
extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel]] void
_start(int argc, char **argv, char **envp) {
uint64_t init_array_size = __init_array_end - __init_array_start;
for (uint64_t i = 0; i < init_array_size; ++i)
reinterpret_cast<InitCallback *>(__init_array_start[i])(argc, argv, env);
uint64_t fini_array_size = __fini_array_end - __fini_array_start;
for (uint64_t i = 0; i < fini_array_size; ++i)
reinterpret_cast<FiniCallback *>(__fini_array_start[i])();
}
```
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D149340
This is stricter than the default "ieee", and should probably be the
default. This patch leaves the default alone. I can change this in a
future patch.
There are non-reversible transforms I would like to perform which are
legal under IEEE denormal handling, but illegal with flushing zero
behavior. Namely, conversions between llvm.is.fpclass and fcmp with
zeroes.
Under "ieee" handling, it is legal to translate between
llvm.is.fpclass(x, fcZero) and fcmp x, 0.
Under "preserve-sign" handling, it is legal to translate between
llvm.is.fpclass(x, fcSubnormal|fcZero) and fcmp x, 0.
I would like to compile and distribute some math library functions in
a mode where it's callable from code with and without denormals
enabled, which requires not changing the compares with denormals or
zeroes.
If an IEEE function transforms an llvm.is.fpclass call into an fcmp 0,
it is no longer possible to call the function from code with denormals
enabled, or write an optimization to move the function into a denormal
flushing mode. For the original function, if x was a denormal, the
class would evaluate to false. If the function compiled with denormal
handling was converted to or called from a preserve-sign function, the
fcmp now evaluates to true.
This could also be of use for strictfp handling, where code may be
changing the denormal mode.
Alternative name could be "unknown".
Replaces the old AMDGPU custom inlining logic with more conservative
logic which tries to permit inlining for callees with dynamic handling
and avoids inlining other mismatched modes.
The previous patch (D148980) didn't set the InstrIdxForVirtReg correctly
in genAlternativeDpCodeSequence(). It causes vnni lit test failure when
LLVM_ENABLE_EXPENSIVE_CHECKS is on.
This allows us to model and thus test transforms which are legal only when a vector load with less than element alignment are supported. This was originally part of D126085, but was split out as we didn't have a good example of such a transform. As can be seen in the test diffs, we have the recently added concat_vector(loads) -> strided_load transform (from D147713) which now benefits from the unaligned support.
While making this change, I realized that we actually *do* support unaligned vector loads and stores of all types via conversion to i8 element type. For contiguous loads and stores without masking, we actually already implement this in the backend - though we don't tell the optimizer that. For indexed, lowering to i8 requires complicated addressing. For indexed and segmented, we'd have to use indexed. All around, doesn't seem worthwhile pursuing, but makes for an interesting observation.
Differential Revision: https://reviews.llvm.org/D149375
We already have tablegen patterns for a lot of these, but performing the
combine earlier in DAG can help in a few extra cases.
Differential Revision: https://reviews.llvm.org/D149269
"convergent" is documented as meaning that the call cannot be made
control-dependent on more values, but in practice we also require that
it cannot be made control-dependent on fewer values, e.g. it cannot be
hoisted out of the body of an "if" statement.
In code like this, if we allow CSE to combine the two calls:
x = convergent_call();
if (cond) {
y = convergent_call();
use y;
}
then we get this:
x = convergent_call();
if (cond) {
use x;
}
This is conceptually equivalent to moving the second call out of the
body of the "if", up to the location of the first call, so it should be
disallowed.
Differential Revision: https://reviews.llvm.org/D149348
Clang accepts preserve_all for AArch64 while it is missing form the backed.
Fixes#58145
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D135652
D37076 makes LICM duplicate instructions into exit blocks if the
instruction is free. For GEPs, the motivation appears to be that
this allows the GEP to be folded into addressing modes, while
non-foldable users outside the loop might prevent this. TBH I don't
think LICM is the place to do this (why doesn't CGP apply this
heuristic itself?) but at least I understand the motivation.
However, the transform is also applied to all other "free"
instructions, which are just that (removed during lowering and not
"folded" in some way). For such instructions, this transform seems
somewhere between useless, counter-productive (undoing CSE/GVN) and
actively incorrect. For example, this transform can duplicate freeze
instructions, which is illegal.
This patch limits the transform to just foldable GEPs, though we
might want to drop it from LICM entirely as a followup.
This is a small compile-time improvement, because querying TTI cost
model for every single instruction is expensive.
Differential Revision: https://reviews.llvm.org/D149136
Following the changes in D145301, we now also support the efficient bitcast
when storing the bool vector. Previously, this was expanded.
Differential Revision: https://reviews.llvm.org/D148316
With a similar reason as D148023; some applications make heavy use of
the CRC32 intrinsic (e.g., as part of a hash function) and therefore
benefit from avoiding frequent SelectionDAG fallbacks. In our
application, we get a 2% compile-time improvement.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D148917
When hwasan-match-all-tag flag is enabled and short granules are used, at the point checking if this is a short tag case, the tag from pointer is stored in X16 register,
which breaks the assumption that tag from shadow memory is stored in X16 register, this will cause a false positive.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D149252
Even if optimizing for ILP, it is still useful to track RP to avoid spilling. Given that, we need to maintin consistent liveness state with the RP tracker. This patch makes RP tracking consistent by updating for liveins.
Otherwise, we should completely eliminate RP tracking for this scheduler (checkScheduling, initCandidate).
Differential Revision: https://reviews.llvm.org/D149358
If the removable definition resides in an INLINEASM_BR target, the
reuseable candidate might not dominate the INLINEASM_BR.
bb0:
INLINEASM_BR &"" %bb.1
renamable $x8 = MOVi64imm 29273397577910035
B %bb.2
...
bb1:
renamable $x8 = MOVi64imm 29273397577910035
renamable $x8 = ADDXri killed renamable $x8, 2048, 0
bb2:
Removing the second mov is a hazard when the inline asm branches to bb1.
Skip such replacements when the to be removed instruction is in the
target of such an INLINEASM_BR instruction.
We could get more aggressive about this in the future, but for now
simply abort.
This is causing a boot failure on linux-4.19.y branches of the LTS Linux
kernel for ARCH=arm64 with CONFIG_RANDOMIZE_BASE=y (KASLR) and
CONFIG_UNMAP_KERNEL_AT_EL0=y (KPTI).
Link: https://reviews.llvm.org/D123394
Link: https://github.com/ClangBuiltLinux/linux/issues/1837
Thanks to @nathanchance for the report, and @ardb for debugging.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D149191
Summary:
Registers for tail call return should not be clobbered by callee.
So we need a sub-class of SGPR_64 (excluding callee saved registers (CSR)) to hold
the tail call return address.
Because GFX and C calling conventions have different CSR, we need to define
the sub-class separately. This work is an extension of D147096 with the
consideration of GFX calling convention.
Based on the calling conventions, different instructions will be selected with
different sub-class of SGPR_64 as the input.
Reviewers: arsenm, cdevadas and sebastian-ne
Differential Revision: https://reviews.llvm.org/D148824
The -mattr=+global-isel is not valid syntax, so those lines have been removed.
With Global-ISel there is currently missing vector legalization for wide G_EXT,
and it does not support BE.
With this patch an undefined mask in a shufflevector will be printed as poison.
This change is done to support the new shufflevector semantics
for undefined mask elements.
Differential Revision: https://reviews.llvm.org/D149210
The code closely follows the X86 back-end. Applications that make heavy
use of {i64, i64} returns to use two registers strongly benefit from the
reduced number of SelectionDAG fallbacks.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D148346
"vpmaddwd + vpaddd" can be combined to vpdpwssd and the latency is
reduced after combination. However when vpdpwssd is in a critical path
the combination get less ILP. It happens when vpdpwssd is in a loop, the
vpmaddwd can be executed in parallel in multi-iterations while vpdpwssd
has data dependency for each iterations. If vpaddd is in a critical path
while vpmaddwd is not, it is profitable to split vpdpwssd into "vpmaddwd
+ vpaddd".
This patch is based on the machine combiner framework to acheive decision
on "vpmaddwd + vpaddd" combination. The typical example code is as
below.
```
__m256i foo(int cnt, __m256i c, __m256i b, __m256i *p) {
for (int i = 0; i < cnt; ++i) {
__m256i a = p[i];
__m256i m = _mm256_madd_epi16 (b, a);
c = _mm256_add_epi32(m, c);
}
return c;
}
```
Differential Revision: https://reviews.llvm.org/D148980
As the comment notes, the shader results in an INSERT_SUBREG with
"undef" (dead) operand in the Endif block. The same can happen with
REG_SEQUENCE. The register is considered dead from a liveness
analysis perspective. The correct thing to do seems to be nothing:
we keep the undef use of the register, the register allocator should
still be able to take the liveness into account correctly.
Differential Revision: https://reviews.llvm.org/D149161
These functions where missing support but are used enough that it
makes sense to track them.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D148963
The patch basically models custom lowering of base rounding operations to expand
rounding by coverting to ingter and coverting back to FP. The other one thing
the patch does is to covert sNan of the source to qNan.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D148519
This was part of the N extension which did not make it into
version 1.12 of the privilege specification.
Reviewed By: jrtc27
Differential Revision: https://reviews.llvm.org/D149308
On 64-bit target, when doing i64 BR_CC where one of the comparison operands is a
constant zero, try to fold the compare and BPcc into a BPr instruction.
For all integers, EQ and NE comparison are available, additionally for signed
integers, GT, GE, LT, and LE is also available.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D142461
A neon smull/umull should be preferred over a sve v2i64 mul with two extends.
It will be both less instructions and a lower cost multiply instruction.
Differential Revision: https://reviews.llvm.org/D148248
Zicntr and Zihpm are names for groups of CSRs so they should imply
that CSRs exist.
Reviewed By: asb, kito-cheng
Differential Revision: https://reviews.llvm.org/D148962
Prior to this patch the SCS prologue used the following instruction
sequence.
```
s[w|d] ra, 0(gp)
addi gp, gp, [4|8]
```
The problem with this sequence is that an interrupt occurring between the
store and the increment could clobber the value just written to the SCS.
https://reviews.llvm.org/D84414#inline-813203 pointed out a similar
issues that could have affected the epilogue.
This patch changes the instruction sequence in the prologue to:
```
addi gp, gp, [4|8]
s[w|d] ra, -[4|8](gp)
```
The downside to this is that there is now a data dependency between the
add and the store.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D149099
There are no VOP2 or VOP2 with dpp forms of v_cndmask_b16. Delete the
test. NFC.
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D149184
The offset values may result in an erroneous scheduling of a load before write for a memory location if the offset values are represented as negative values in MIR, despite actually being unsigned values. This representation in MIR happens as SelectionDAG::getConstant could go through APInt to represent the encoding which assumes the MSB of the encoding as a sign-bit, regardless of whether it is supposed to be a signed value. The 8-bit negative (interpreted) value gets cast to an unsigned 32 bit value in getMemOperandsWithOffset used for comparisons in areMemAccessesTriviallyDisjoint eventually leading to an erroneous schedule in the machine scheduler.
Reviewed By: arsenm, foad
Differential Revision: https://reviews.llvm.org/D149080