If fmul's constant operand is the reciprocal of a power of 2 (i.e. 1/2^n) or
fdiv's constant operand is a power of 2, we can try to match patterns with
[su]int_to_fp for [su]cvtf.
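As a rough illustration of the constant check described above, the following standalone sketch tests whether a floating-point constant is exactly 1/2^n and models the arithmetic of the matched pattern. The helper names `recipPow2Exponent` and `fixedToDouble` are hypothetical and are not part of the patch; the sketch only models the arithmetic identity, not the actual ISel patterns.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: returns n if c == 1/2^n for some n >= 1, else -1.
int recipPow2Exponent(double c) {
  if (!std::isfinite(c) || !(c > 0.0))
    return -1;
  int exp = 0;
  // frexp decomposes c into mant * 2^exp with mant in [0.5, 1).
  double mant = std::frexp(c, &exp);
  if (mant != 0.5)
    return -1; // not an exact power of two
  int n = 1 - exp; // c == 2^(exp - 1) == 1/2^n
  return n >= 1 ? n : -1;
}

// Models (fmul (sint_to_fp x), 1/2^n): x is read as a fixed-point
// number with n fractional bits.
double fixedToDouble(int32_t x, int n) {
  return std::ldexp(static_cast<double>(x), -n);
}
```

For instance, multiplying a converted integer by 0.25 is the same as interpreting that integer as fixed-point with 2 fractional bits.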
Differential Revision: https://reviews.llvm.org/D156538
D144048 added preferred function and loop alignment to
RISCVSubtarget, but now we need to set them manually for
different processors.
This patch adds tune features that set the preferred function/loop
alignment to values in [2, 64] bytes (alignment 1 is omitted since
the minimum alignment is 2). These features can be used in processor
definitions.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D157832
When LLVM is built with the `LLVM_ENABLE_EXPENSIVE_CHECKS=ON` option, the
following C code snippet:
struct t {
int a;
} __attribute__((preserve_access_index));
void test(struct t *t) {
t->a = 42;
}
Causes an assertion:
$ clang -g -O2 -c --target=bpf -mcpu=v2 t.c -o /dev/null
Function Live Ins: $r1 in %0
bb.0.entry:
liveins: $r1
DBG_VALUE $r1, $noreg, !"t", ...
%0:gpr = COPY $r1
DBG_VALUE %0:gpr, $noreg, !"t", ...
%1:gpr = LD_imm64 @"llvm.t:0:0$0:0"
%3:gpr = ADD_rr %0:gpr(tied-def 0), killed %1:gpr
%4:gpr = MOV_ri 42
CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...
RET debug-location !25; t.c:7:1
*** Bad machine code: Explicit definition marked as use ***
- function: test
- basic block: %bb.0 entry (0x6210000d8a90)
- instruction: CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...
- operand 0: killed %4:gpr
This happens because the `CORE_MEM` instruction is defined to have output
operands:
def CORE_MEM : TYPE_LD_ST<BPF_MEM.Value, BPF_W.Value,
(outs GPR:$dst),
(ins u64imm:$opcode, GPR:$src, u64imm:$offset),
"$dst = core_mem($opcode, $src, $offset)",
[]>;
As documented in [1]:
> By convention, the LLVM code generator orders instruction operands
> so that all register definitions come before the register uses, even
> on architectures that are normally printed in other orders.
In other words, the first argument of `CORE_MEM` is considered to be
a "def", while in reality it is a "use":
%1:gpr = LD_imm64 @"llvm.t:0:0$0:0"
%3:gpr = ADD_rr %0:gpr(tied-def 0), killed %1:gpr
%4:gpr = MOV_ri 42
'---------------.
                v
CORE_MEM killed %4:gpr, 411, %0:gpr, @"llvm.t:0:0$0:0", ...
Here is how `CORE_MEM` is constructed in
`BPFMISimplifyPatchable::checkADDrr()`:
BuildMI(*DefInst->getParent(), *DefInst, DefInst->getDebugLoc(), TII->get(COREOp))
.add(DefInst->getOperand(0)).addImm(Opcode).add(*BaseOp)
.addGlobalAddress(GVal);
Note that the first operand is constructed as `.add(DefInst->getOperand(0))`.
For `LD{D,W,H,B}` instructions, `DefInst->getOperand(0)` is the
destination register of a load, so the instruction is constructed in
accordance with the `outs` declaration.
For `ST{D,W,H,B}` instructions, `DefInst->getOperand(0)` is the
source register of a store (the value to be stored), so the instruction
violates the `outs` declaration.
This commit fixes the issue by splitting `CORE_MEM` into three
instructions: `CORE_ST`, `CORE_LD64`, `CORE_LD32` with correct `outs`
specifications.
[1] https://llvm.org/docs/CodeGenerator.html#the-machineinstr-class
Differential Revision: https://reviews.llvm.org/D157806
Check iterator validity before use; fixes a crash seen in the RISC-V
Zcmp Push/Pop optimization pass when compiling an internal benchmark.
Reviewed By: asb, wangpc
Differential Revision: https://reviews.llvm.org/D157674
OpenCL loses fast math information by going through libcall wrappers
around intrinsics.
Do this to preserve call site flags which are lost when inlining. It's
not safe in general to propagate flags during inline, so avoid dealing
with this by just special casing some of the useful calls.
We already do this in getNode, but the undef might appear during
another DAGCombine.
While here, remove the code for handling noop truncates. getNode checks
the types and won't create a noop truncate.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157910
In a recent series of refactorings (described here: https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295), I greatly increased the number of IMPLICIT_DEF operands to our vector instructions. This has turned out to have an unexpected negative impact because MachineCSE does not CSE IMPLICIT_DEFs, and thus does not CSE any instruction with an IMPLICIT_DEF operand. SelectionDAG *does* CSE the same case, but that only covers the same-block case, not the cross-block case. This led to the performance regression reported in https://github.com/llvm/llvm-project/issues/64282.
This change is a slightly ugly hack to sidestep the issue. Instead of fixing the root cause (lack of CSE for IMPLICIT_DEF) or undoing the operand changes, we leave the extra operand in place, and use NoReg in place of IMPLICIT_DEF. I then convert back to IMPLICIT_DEF just before register allocation so that ProcessImplicitDefs and TwoAddressInstructions can do the normal transforms to Undef tied registers.
We may end up backporting this into the 17.x release branch. Given how late in the release cycle this is landing, that's much less likely now, but still a possibility.
Differential Revision: https://reviews.llvm.org/D156909
The comment was out of date; the device libs build does provide all
the pointer overloads. An extremely pedantic interpretation of the
spec would suggest only the flat version exists, but the overloads do
exist in the implementation.
https://reviews.llvm.org/D156720
If we can fit an entire vector of i1 into a single element, e.g. v32i1 ->
v1i32, then we can reverse it via vbrev.v.
We need to handle the case where the vector doesn't exactly fit into the larger
element type, e.g. v4i1 -> v1i8. In this case we shift up the reversed bits
afterwards.
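The idea can be sketched on a plain byte, as a model of one i8 element holding a packed i1 vector. This is only an illustration under stated assumptions: element 0 is assumed to live in bit 0, `reverseBits8` and `reverseMask` are hypothetical names, and the post-reversal adjustment is modeled as a logical right shift by the number of unused bits.

```cpp
#include <cassert>
#include <cstdint>

// Reverse all 8 bits of a byte (models a bit reverse of one i8 element).
uint8_t reverseBits8(uint8_t v) {
  v = static_cast<uint8_t>(((v & 0xF0) >> 4) | ((v & 0x0F) << 4));
  v = static_cast<uint8_t>(((v & 0xCC) >> 2) | ((v & 0x33) << 2));
  v = static_cast<uint8_t>(((v & 0xAA) >> 1) | ((v & 0x55) << 1));
  return v;
}

// Reverse a mask of numElts (1..8) i1 elements packed into the low bits
// of a byte. Assumption of this sketch: element 0 is stored in bit 0.
uint8_t reverseMask(uint8_t mask, unsigned numElts) {
  uint8_t rev = reverseBits8(mask);
  // After a full-byte reversal the elements sit in the top numElts bits,
  // so move them back to the low bits.
  return static_cast<uint8_t>(rev >> (8 - numElts));
}
```

With four elements, a mask with only element 0 set (0x01) becomes a mask with only element 3 set (0x08).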
Reviewed By: fakepaper56, 4vtomat
Differential Revision: https://reviews.llvm.org/D157614
Match how the generic implementation handles this. We will now leave
the other, dead user behind for later passes to deal with.
https://reviews.llvm.org/D156707
For scalar integer-to-float conversions for Streaming Compatible SVE, use
the non-NEON version of the convert instruction.
Differential Revision: https://reviews.llvm.org/D157698
This is a lot of copy-pasting of the existing handling of
G_VECREDUCE_FMAX/G_VECREDUCE_FMIN to add handling for
G_VECREDUCE_FMAXIMUM/G_VECREDUCE_FMINIMUM in the same way.
Differential Revision: https://reviews.llvm.org/D156615
Currently when widening operands for insert_subvector nodes, we check
first that the indices are valid by seeing if the subvector is
statically known to be smaller than or equal to the in-place vector.
However, if we're inserting a fixed subvector into a scalable vector we rely on
the minimum vector length of the latter. This patch extends the widening logic
to also take into account the minimum vscale from the vscale_range attribute,
so we can handle more scenarios where we know the scalable vector is large
enough to contain the subvector.
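A minimal sketch of the bounds reasoning, assuming a scalable vector type of the form `<vscale x minElts x T>` and a known lower bound on vscale. The function name and parameters are hypothetical; this is not the code in the legalizer, only the arithmetic behind the check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: can a fixed subvector of subElts elements inserted
// at element index idx be proven to fit inside <vscale x minElts x T>,
// given that vscale >= vscaleMin (e.g. taken from vscale_range)?
bool fitsInScalableVector(unsigned minElts, unsigned vscaleMin,
                          unsigned idx, unsigned subElts) {
  // The smallest possible element count of the scalable vector.
  uint64_t knownMinLength = static_cast<uint64_t>(minElts) * vscaleMin;
  return static_cast<uint64_t>(idx) + subElts <= knownMinLength;
}
```

With a minimum vscale of 1, a 4-element subvector cannot be proven to fit in `<vscale x 2 x T>`; with a minimum vscale of 2 it can.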
Fixes https://github.com/llvm/llvm-project/issues/63437
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D153519
This adds legalization for G_VECREDUCE_FMIN and G_VECREDUCE_FMAX, where the
selection can go via tablegen patterns. I haven't tried to get non-power2 types
working yet, just the more legal types.
Differential Revision: https://reviews.llvm.org/D156614
D141386 changed the semantics of !range metadata to return poison
on violation. If !range is combined with !noundef, violation is
immediate UB instead, matching the old semantics.
In theory, these IR semantics should also carry over into SDAG.
In practice, DAGCombine has at least one key transform that is
invalid in the presence of poison, namely the conversion of logical
and/or to bitwise and/or (c7b537bf09/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (L11252)).
Ideally, we would fix this transform, but this will require
substantial work to avoid codegen regressions.
In the meantime, avoid transferring !range metadata without
!noundef, effectively restoring the old !range metadata semantics
on the SDAG layer.
Fixes https://github.com/llvm/llvm-project/issues/64589.
Differential Revision: https://reviews.llvm.org/D157685
This patch matches the SDNode pattern "trunc (srem(sext, ext))" to vrem.vv. This removes the extra "vsext", "vnsrl" and "vsetvli" instructions in cases like "c[i] = a[i] % b[i]", where the array element types are all int8_t or all int16_t.
For element types like uint8_t or uint16_t, the redundant "zext + zext + urem + trunc" IR has already been removed by the InstCombine pass, because the urem operation cannot overflow in LLVM IR. For signed types, however, InstCombine cannot remove such patterns because of the potential for undefined behavior in LLVM IR. For example, -128 % -1 is undefined behavior (overflow) for the i8 type in LLVM IR, but not for i32. To address this, LLVM first sign-extends the operands of srem to i32 to prevent the UB.
For RVV, such overflowing operations are already defined by the specification and yield deterministic output for extreme inputs. For example, per the spec, -128 % -1 on the i8 type yields 0 in the overflow case. Therefore, the RVV backend can match such patterns in the instruction selection phase rather than having them removed in target-independent optimization passes like InstCombine.
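The equivalence can be checked with a small scalar model. The sketch below is an illustration only: `sremViaI32` models the widened IR pattern, and `rvvRem8` models one 8-bit element of a signed remainder with the RISC-V results for extreme inputs (overflow gives 0, division by zero returns the dividend). Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models trunc(srem(sext a, sext b)) computed in 32 bits.
// Precondition: b != 0 (srem by zero is undefined in LLVM IR).
int8_t sremViaI32(int8_t a, int8_t b) {
  int32_t wide = static_cast<int32_t>(a) % static_cast<int32_t>(b);
  return static_cast<int8_t>(wide);
}

// Models one 8-bit element of a signed remainder with the results
// RISC-V defines for extreme inputs.
int8_t rvvRem8(int8_t a, int8_t b) {
  if (b == 0)
    return a; // division by zero: remainder is the dividend
  if (a == INT8_MIN && b == -1)
    return 0; // signed overflow: remainder is 0
  return static_cast<int8_t>(a % b);
}
```

For every non-zero divisor the two functions agree, including the -128 % -1 case, which is why the narrow instruction can stand in for the widened sequence.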
This patch only handles the sign-extension case for srem. For more information about the C test cases compared with GCC, please see: https://gcc.godbolt.org/z/MWzE7WaT4
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D156685
Add baseline test for [[ https://reviews.llvm.org/D156685 | D156685 ]].
In LLVM, such a signed 8-bit remainder operation first sign-extends the operands to 32 bits, and then the CorrelatedValuePropagation pass narrows the operands to a smaller data type such as 16 bits to optimize the final data storage size.
This sign extension for srem in LLVM is there to prevent undefined behavior. For example, -128 % -1 is undefined behavior for the i8 type in LLVM IR, but not for i32, so such a pattern cannot be eliminated in the platform-independent InstCombine pass. The LLVM IR for these sext/trunc operations is translated one by one during RVV backend code generation, and redundant vsetvli instructions are inserted.
In fact, according to the RVV instruction manual, the vrem.vv instruction already specifies the final output value for this type of overflow. For example, -128 % -1 yields 0 according to the RISC-V spec, so through this patch, I think we can optimize away this redundant RVV code with SDNode pattern matching in the instruction selection phase.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D157592
The constants can have a larger bit width, so we need to truncate
them to EltSize or we will exceed the width of the fixed-length vector.
Fixes #64588
Reviewed By: luke, craig.topper, bjope, michaelmaitland
Differential Revision: https://reviews.llvm.org/D157603
There are no patterns for ADCX/ADOX, and they are never selected during
ISel, so this patch removes their cases from some MIR optimizations.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D157717
This fixes https://github.com/llvm/llvm-project/issues/60766
With MSVC style exception-handling (funclets), no registers are
alive when entering the funclet so they must be reloaded from the
stack. MachineLICM can sometimes hoist such reloads out of the
funclet, which is not correct: the register will have been clobbered
when entering the funclet. This can happen in any loop that
contains a try-catch.
This has been tested on x86_64-pc-windows-msvc. I'm not sure if
funclets work the same on the other Windows architectures.
Reviewed By: rnk, arsenm
Differential Revision: https://reviews.llvm.org/D153337
There is a problem with the
SILoadStoreOptimizer::dmasksCanBeCombined() function that can lead to
UB.
This boolean function decides if two masks can be combined into one. The
idea here is that the bits which are "on" in one mask don't overlap
with the "on" bits of the other. Consider an example (10 bits for
simplicity):
Mask 1: 0101101000
Mask 2: 0000000110
Those can be combined into a single mask: 0101101110.
To check if such an operation is possible, the code takes the mask
which is greater and counts how many 0s there are, starting from the
LSB and stopping at the first 1. Then, it shifts 1u by this number and
compares it with the smaller mask. The problem is that when both masks
are 0, the counter will find 32 zeroes in the first mask and will try
to do a shift by 32 positions which leads to UB.
The fix is a simple sanity check of whether the bigger mask is 0.
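The check and its guard can be sketched as standalone code. This is a simplified model rather than the actual SILoadStoreOptimizer code: the function names are hypothetical, and returning false when the bigger mask is 0 is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Counts zero bits starting from the LSB; returns 32 for an input of 0.
unsigned countTrailingZeros32(uint32_t v) {
  if (v == 0)
    return 32;
  unsigned n = 0;
  while ((v & 1u) == 0) {
    v >>= 1;
    ++n;
  }
  return n;
}

// Simplified model of the combinability check on two dmasks.
bool dmasksCanBeCombinedSketch(uint32_t a, uint32_t b) {
  uint32_t maxMask = std::max(a, b);
  uint32_t minMask = std::min(a, b);
  // Guard: without it, a zero maxMask gives a count of 32 and the shift
  // below would be undefined behavior. Rejecting is an assumption here.
  if (maxMask == 0)
    return false;
  unsigned allowedBits = countTrailingZeros32(maxMask); // always < 32 here
  // The smaller mask must fit entirely below the lowest set bit of the
  // bigger mask.
  return (1u << allowedBits) > minMask;
}
```

Using the masks from the example above, 0x168 (0101101000) and 0x6 (0000000110) can be combined, while two zero masks are rejected before any shift happens.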
https://reviews.llvm.org/D155051
While they represent 32/16-bit immediate values, they are already
included in the encoding of the instructions that use them and are not true
literals. The FMAMK and FMAAK instructions that use them are marked with a
fixed size, so getInstSizeInBytes will not increase the size for these operands.
We also add tests whose logic relies on KIMM16 and KIMM32 being considered
not inlinable.
Differential Revision: https://reviews.llvm.org/D157624
We must also track the super sources of a copy, otherwise we introduce a subtle bug.
Consider:
1. DEF r0:r1
2. USE r1
3. r6:r9 = COPY r10:r13
4. r14:r15 = COPY r0:r1
5. USE r6
6. r1:r4 = COPY r6:r9
BackwardCopyPropagateBlock processes the instructions from the bottom up. After processing 6., we will have a propagatable copy for r1-r4 and r6-r9. After 5., we invalidate and erase the propagatable copy for r1-r4 and r6 but not for r7-r9.
The issue is that when processing 3., data structures still say we have valid copies for dest regs r7-r9 (from 6.). The corresponding defs for these registers in 6. are r1:r4, which we mark as registers to invalidate. When invalidating, we find the copy that corresponds to r1 is 4. (this was added when processing 4.), and we say that r1 now maps to unpropagatable copies. Thus, when we process 2., we do not have a valid copy, but when we process 1. we do -- because the mapped copy for subregister r0 was never invalidated.
The net result is to propagate the copy from 4. to 1., and replace DEF r0:r1 with DEF r14:r15. Then, we have a use before def in 2.
The main issue is that we have an inconsistent state between which def regs and which src regs are valid. When processing 5., we mark all the defs in 6. as invalid, but only the subreg use as invalid. Either we must only invalidate the individual subreg for both uses and defs, or the super register for both.
Differential Revision: https://reviews.llvm.org//D157564
Change-Id: I99d5e0b1a0d735e8ea3bd7d137b6464690aa9486
The 16-bit VAddr arguments to A16 image instructions are packed into
legal VGPR_32 operands in AMDGPULegalizerInfo::legalizeImageIntrinsic on
all subtargets. With True16, we also need to pack if the number of VAddr is one
because VGPR_16 is not a legal argument to those Image instructions.
No change to emitted code intended on subtargets pre-GFX11, and none on GFX11
until True16 is active.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D157426
Some instructions such as multi-vector LD1 only accept a range
of PN8-PN15 predicate-as-counter. This new constraint allows more
refined parsing and better decision making when parsing these
instructions from ASM, instead of defaulting to Upa which incorrectly
uses the whole range of registers P0-P15 from the register class PPR.
Differential Revision: https://reviews.llvm.org/D157517
Not sure if the only valid use is to have stackrestore directly
consume stacksave outputs or not. Handled exactly like a regular stack
pointer so all the edge cases theoretically should work.
https://reviews.llvm.org/D156669