In order to ensure the correctness of ptr addrspace(7) lowering, we need
a backwards-compatible way to flag buffer intrinsics as volatile that
can't be dropped (unlike metadata).
To acheive this in a backwards-compatible way, we use bit 31 of the
auxilliary immediates of buffer intrinsics as the volatile flag. When
this bit is set, the MachineMemOperand for said intrinsic is marked
volatile. Existing code will ensure that this results in the appropriate
use of flags like glc and dlc.
This commit also harmorizes the handling of the auxilliary immediate for
atomic intrinsics, which new go through extract_cpol like loads and
stores, which masks off the volatile bit.
Always set the upper bound for VS_CNT higher than the lower bound.
Before #77439 this code was only executed on function entry where the
lower bound was 0 so it was not a problem.
Fixes#77931
The existing logic in isKnownNonZero relies on unsigned ranges, which
can be problematic when our range calculation is imprecise. Consider the
following:
%offset.nonzero = or i32 %offset, 1
--> %offset.nonzero U: [1,0) S: [1,0)
%offset.i64 = sext i32 %offset.nonzero to i64
--> (sext i32 %offset.nonzero to i64) U: [-2147483648,2147483648)
S: [-2147483648,2147483648)
Note that the unsigned range for the sext does contain zero in this case
despite the fact that it can never actually be zero.
Instead, we can push the query down one level - relying on the fact that
the sext is an invertible operation and that the result can only be zero
if the input is. We could likely generalize this reasoning for other
invertible operations, but special casing sext seems worthwhile.
Add Float16 to Vulkan's available capabilities, and guard Float16Buffer
(Kernel-only capability) against being added outside OpenCL
environments.
Add tests to verify half and half vector types, and validate with
spirv-val.
Fixes#66398
This fixes cvt multi vector builtins that erroneously had inverted
return vectors and vector parameters. This caused the incorrect
instructions to be emitted.
Force live interval recomputation for a register if its definition is
narrowed to become partial. The live interval repair process cannot
otherwise detect these changes.
Since we already know which register we want to extend, we don't have to
ask its defining MI about it
---------
Co-authored-by: Emil Tywoniak <Emil.Tywoniak@hightec-rt.com>
This reverts commit fdb87640ee2be63af9b0e0cd943cb13d79686a03, and thus
re-enables terminator folding for RISCV. The reported miscompile has
been fixed in f5dd70c58277d925710e5a7c25c86d7565cc3c6c.
Bitwise exclusive OR and rotate right by immediate
Select xar (x, y, imm) for the following pattern:
or (shl (xor x, y), nBits-imm), (shr (xor x, y), imm)
This is essentially:
rotr (xor(x, y), imm)
This already gets converted to a strided intrinsic because we currently call
haveNoCommonBitsSet when checking or instructions, but an upcoming patch will
change this logic and we want to preserve this case.
Note that this IR is in the form that comes from instcombine. The splats need
to be inline constexprs, otherwise isSplatValue() will fail. (It can't
currently handle splats where the shufflevector is an instruction, and the
insertelement is a constexpr.
When removing an empty machine basic block, all of its successors should
be inherited by its fall through MBB. This keeps CFG as only have one
entry which is required by LiveDebugValues.
Reland #77441 as LiveDebugValues test.
Modify GCNHazardRecognizer::fixLdsDirectVMEMHazard() so the waitvsrc
operand
in gfx12 DS_PARAM_LOAD or DS_DIRECT_LOAD instructions is set
appropriately
depending on whether a hazard is found or not, rather than inserting an
S_WAITCNT_DEPCTR instruction if a hazard needs to be mitigated.
Co-authored-by: Stephen Thomas <Stephen.Thomas@amd.com>
-mbranch-protection=gcs (enabled by -mbranch-protection=standard) causes
generated objects to be marked with the gcs feature. This is done via
the guarded-control-stack module flag, in a similar way to
branch-target-enforcement and sign-return-address.
Enabling GCS causes the GNU_PROPERTY_AARCH64_FEATURE_1_GCS bit to be set
on generated objects. No code generation changes are required, as GCS
just requires that functions are called using BL and returned from using
RET (or other similar variant instructions), which is already the case.
We previously had a heuristic that if a value V was used multiple times
in a single PHI, then to avoid potentially rematerializing into many predecessors
we bail out. The phi uses only counted as a single use in the shouldLocalize() hook
because it counted the PHI as a single instruction use, not factoring in it may
have many incoming edges.
It turns out this heuristic is slightly too pessimistic, and allowing a small number
of these uses to be localized can improve code size due to shortening live ranges,
especially if those ranges span a call.
This change results in some improvements in size on CTMark -Os:
```
Program size.__text
before after diff
kimwitu++/kc 451676.00 451860.00 0.0%
mafft/pairlocalalign 241460.00 241540.00 0.0%
tramp3d-v4/tramp3d-v4 389216.00 389208.00 -0.0%
7zip/7zip-benchmark 587528.00 587464.00 -0.0%
Bullet/bullet 457424.00 457348.00 -0.0%
consumer-typeset/consumer-typeset 405472.00 405376.00 -0.0%
SPASS/SPASS 410288.00 410120.00 -0.0%
lencod/lencod 426396.00 426108.00 -0.1%
ClamAV/clamscan 380108.00 379756.00 -0.1%
sqlite3/sqlite3 283664.00 283372.00 -0.1%
Geomean difference -0.0%
```
I experimented with different variations and thresholds. Using 3 instead
of 2 resulted in a further 0.1% improvement on ClamAV but also regressed
sqlite3 by the same %.
Target hook `canPairLdStOpc` is missing quite a few opcodes for which
LDPs/STPs can created. I was hoping that it would not be necessary to
add these missing opcodes here and that the attached motivating test
case would be handled by the LoadStoreOptimiser (especially after
#71908), but it's not. The problem is that after register allocation
some things are a lot harder to do. Consider this for the motivating
example
```
[1] renamable $q1 = LDURQi renamable $x9, -16 :: (load (s128) from %ir.r51, align 8, !tbaa !0)
[2] renamable $q2 = LDURQi renamable $x0, -16 :: (load (s128) from %ir.r53, align 8, !tbaa !4)
[3] renamable $q1 = nnan ninf nsz arcp contract afn reassoc nofpexcept FMLSv2f64 killed renamable $q1(tied-def 0), killed renamable $q2, renamable $q0, implicit $fpcr
[4] STURQi killed renamable $q1, renamable $x9, -16 :: (store (s128) into %ir.r51, align 1, !tbaa !0)
[5] renamable $q1 = LDRQui renamable $x9, 0 :: (load (s128) from %ir.r.G0001_609.0, align 8, !tbaa !0)
```
We can't combine the the load in line [5] into the load on [1]:
regisister q1 is used in between. And we can can't combine [1] into
[5]: it is aliasing with the STR on line [4].
So, adding some missing opcodes here seems the best/easiest approach.
I will follow up to add some more missing cases here.
Instcombine canonicalizes selects to floating point and integer minmax.
This and the dag combiner canonicalize to floating point minmax. None of
them canonicalizes to integer minmax. On Neoverse V2 basic integer
arithmetic and integer minmax have the same costs.
Calls do not have to wait for VsCnt, so after they return there might
still be scratch stores in progress. It's important that we don't send
the DEALLOC_VGPR message in that case, since that might release the
VGPRs and scratch allocation before those stores are complete.
Both RISC-V and AMDGPU(GCN) deploy two VirtRegRewriter in their codegen
pipeline. This test prematurely stops at the first one, which doesn't
cleanup the virtual register map and cause an assertion failure. Ideally
we can solve this by teaching `-stop-after` how to stop at the last
instance of a Pass, but we're just marking XFAIL for these two targets
for now.
This adds new isel patterns for Zacas that take priority over the
pseudoinstructions we use for the A extension.
Support for 2x XLen types will come in a separate patch since they need
to be done differently.
This patch changes the following intrinsic
```svst1uwq[_{d}] replaced by svst1wq[_{d}]
svst1uwq_vnum[_{d}] replaced by svst1wq_vnum[_{d}]
svst1udq[_{d}] replaced by svst1dq[_{d}]
svst1udq_vnum[_{d}] replaced by svst1dq_vnum[_{d}]
```
Drops 'u' from the quadword stores because it is simply truncating the
quadwords to 32 bits
```
svextq_lane[_{d}] replaced by svextq[_{d}]
```
EXTQ follows the previous defined EXT intrinsics
```
svdot[_{d}_{2}_{3}] replaced by svdot[_{d}_{2}]
```
Introduced with the latest SME2 ACLE change
[1]https://github.com/ARM-software/acle/pull/257
This test tracks BranchFolding pass which removes fall through jump and
leaves landing-pad to be machine basic block of no predecessors. It
would raise bug as introduced in #77441.
When i128 is a legal type, SelectionDAG now attempts to use
SRL_PARTS etc. with type i128, which is not implemented. Fix
by marking those as Expand, just like we do for i64.
Fixes https://github.com/llvm/llvm-project/issues/77132
As not all fake instructions have their real counterparts implemented
yet, we specify no AssemblerPredicate for UseFakeTrue16Insts to allow
both fake and real True16 instructions in assembler and disassembler
tests in the -mattr=+real-true16 mode during the transition period.
Source DPP and desitnation VOPDstOperand_t16 operands are still not
supported and will be addressed separately.
The intrinsics get_fpenv, set_fpenv and reset_fpenv in this change are
implemented as calls to math library functions. Target specific lowering
will be implemented later on.
This patch fixes WebAssembly's FastISel pass to correctly consider
signext/zeroext parameter flags at function declaration.
Previously, the flags at call sites were only considered during code
generation, which caused an interesting bug report #63388 .
This is problematic especially because in WebAssembly's ABI, either
signext or zeroext can be tagged to a function argument, and it must be
correctly reflected in the generated code. Unit test
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/WebAssembly/signext-zeroext.ll
shows that `i8 zeroext %t` and `i8 signext %t`'s code gen are different.
* Add IRTranslate tests for ADD, SUB, AND, OR, and XOR with scalable
vector types to show that they work as expected.
* Legalize G_ADD, G_SUB, G_AND, G_OR, and G_XOR of scalable vector
type for the RISC-V vector extension.
Similar to #76550, but for `ISD::AVGCEILU`.
Specifically, this patch aims to use `vaaddu` with rounding mode rnu
(i.e `vxrm[1:0] = 0b00`) for `ISD::AVGCEILU`.
### Source code
```
define <vscale x 8 x i8> @vaaddu_vv_nxv8i8_ceil(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y) {
%xzv = zext <vscale x 8 x i8> %x to <vscale x 8 x i16>
%yzv = zext <vscale x 8 x i8> %y to <vscale x 8 x i16>
%add = add nuw nsw <vscale x 8 x i16> %xzv, %yzv
%one = insertelement <vscale x 8 x i16> poison, i16 1, i32 0
%splat = shufflevector <vscale x 8 x i16> %one, <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
%add1 = add nuw nsw <vscale x 8 x i16> %add, %splat
%div = lshr <vscale x 8 x i16> %add1, %splat
%ret = trunc <vscale x 8 x i16> %div to <vscale x 8 x i8>
ret <vscale x 8 x i8> %ret
}
```
### Before this patch
```
vaaddu_vv_nxv8i8_ceil:
vsetvli a0, zero, e8, m1, ta, ma
vwaddu.vv v10, v8, v9
vsetvli zero, zero, e16, m2, ta, ma
vadd.vi v10, v10, 1
vsetvli zero, zero, e8, m1, ta, ma
vnsrl.wi v8, v10, 1
ret
```
### After this patch
```
vaaddu_vv_nxv8i8_ceil:
vsetvli a0, zero, e8, m1, ta, ma
csrwi vxrm, 0
vaaddu.vv v8, v8, v9
ret
```
`llvm.trap` is lowered to `PPC::TRAP` and `PPC::TRAP` is set as
terminator. Verifier complains about terminator should not lie in the
middle of an MBB. See #77095.
Fix it by removing `isTerminator` and `isBarrier` and then set `isTrap`
which was introduced by https://reviews.llvm.org/D48836# and is being
used by X86 and AArch64.
`PPC::TRAP` is not a hardware memory barrier and `llvm.trap` doesn't
indicate a memory barrier either.