Continuing on from #65997, if the index of insert_vector_elt is a
constant then we can work out the minimum number of registers
needed for the slideup and choose a smaller type to operate on.
This reduces the LMUL not just for the slideup but also for the scalar
insert.
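A minimal sketch of the register-count reasoning above, not the actual lowering code. It assumes a guaranteed minimum VLEN, counts only whole registers (no fractional LMUL), and uses a hypothetical helper name `minRegsForIndex`:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest power-of-two register group (LMUL >= 1)
// that is guaranteed to contain lane `idx`, given the element width and
// the minimum VLEN the subtarget guarantees.
unsigned minRegsForIndex(uint64_t idx, unsigned sewBits, unsigned minVLenBits) {
  assert(sewBits != 0 && minVLenBits >= sewBits);
  uint64_t elemsPerReg = minVLenBits / sewBits;
  // Registers needed to cover lanes [0, idx].
  uint64_t regs = idx / elemsPerReg + 1;
  // Register groups come in powers of two, so round up.
  unsigned lmul = 1;
  while (lmul < regs)
    lmul *= 2;
  return lmul;
}
```

For example, with 32-bit elements and a minimum VLEN of 128, an insert at index 4 only needs a two-register group even if the original vector type spans eight registers.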
…ectorOfConstants.
ComputeNumSignBits can return an answer for FP constants based on
bitcasting them to int.
Check for an integer type so we don't create an illegal truncate.
We could support this case with bitcasts, but I leave that to a separate
patch.
Vector pseudos with scalar operands only use the lower SEW bits (or less in the
case of shifts and clips). This patch accounts for this in hasAllNBitUsers for
both SDNodes in RISCVISelDAGToDAG. We also need to handle this in
RISCVOptWInstrs otherwise we introduce slliw instructions that are less
compressible than their original slli counterpart.
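To illustrate why only the low SEW bits of the scalar operand matter, here is a small model of one lane of an add with a scalar operand at SEW=8. The function names are made up for this sketch and it does not mirror the real hasAllNBitUsers code:

```cpp
#include <cstdint>

// Models one lane of a vadd.vx-style operation at SEW=8: the lane result
// is truncated to 8 bits, so upper scalar bits cannot affect it.
uint8_t addLaneSew8(uint8_t lane, uint64_t scalar) {
  return static_cast<uint8_t>(lane + scalar);
}

// Checks, for every possible lane value, that a scalar agreeing only on
// its low 8 bits produces the same lane result.
bool lowBitsSuffice(uint64_t scalar) {
  uint64_t truncated = scalar & 0xFF;
  for (unsigned lane = 0; lane < 256; ++lane) {
    uint8_t l = static_cast<uint8_t>(lane);
    if (addLaneSew8(l, scalar) != addLaneSew8(l, truncated))
      return false;
  }
  return true;
}
```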
This is a reland of aff6ffc8760b99cc3d66dd6e251a4f90040c0ab9 with the
refactoring omitted.
This reverts commit aff6ffc8760b99cc3d66dd6e251a4f90040c0ab9. The version landed differs from the version reviewed in a (stylistic) manner worthy of separate review.
Vector pseudos with scalar operands only use the lower SEW bits (or less in the
case of shifts and clips). This patch accounts for this in hasAllNBitUsers for
both SDNodes in RISCVISelDAGToDAG. We also need to handle this in
RISCVOptWInstrs otherwise we introduce slliw instructions that are less
compressible than their original slli counterpart.
Because `x0` is not listed in the clobber list, regalloc could (one day
when #20571 is fixed) allocate `$0` to `x0`:
ldr x0, x0
This will produce an error when validating the instruction. The intent
of this test FWICT is to check that the parameter in w0 is stored to a
stack slot using w0, since this target triple is the exotic arm64_32
(ILP32). Update the test to simply use "m" constraint. The clobber list
is underconstrained otherwise.
llvm.ptrmask is currently limited to pointers only, and does not accept
vectors of pointers. This is an unnecessary limitation, especially as
the underlying instructions (getelementptr etc) do support vectors of
pointers.
We should relax this sooner rather than later, to avoid introducing code
that assumes non-vectors (#67166).
To invert the result, we can profitably commute a PCMPGT node if the LHS was a constant (C > min_signed_value): https://alive2.llvm.org/ce/z/LxcPqm
This allows the constant to fold and helps reduce register pressure.
Fixes #67347
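One reading of the identity behind this fold is that, for a constant C above the minimum signed value, NOT(C > X) equals X > (C - 1). The exhaustive check below over 8-bit lanes is a sketch under that assumption; the function name is hypothetical and this is not the X86 combine itself:

```cpp
#include <cstdint>

// Exhaustively checks !(C > X) == (X > C - 1) for all 8-bit signed X and
// every C strictly greater than INT8_MIN. C == INT8_MIN is excluded
// because C - 1 would wrap around.
bool invertedCompareMatches() {
  for (int c = INT8_MIN + 1; c <= INT8_MAX; ++c) {
    for (int x = INT8_MIN; x <= INT8_MAX; ++x) {
      int8_t C = static_cast<int8_t>(c);
      int8_t X = static_cast<int8_t>(x);
      int8_t CMinus1 = static_cast<int8_t>(c - 1);
      bool inverted = !(C > X);
      bool commuted = X > CMinus1;
      if (inverted != commuted)
        return false;
    }
  }
  return true;
}
```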
This patch supports VP_MERGE, VP_SELECT, SELECT, and SELECT_CC for fp16 vectors when only Zvfhmin is available.
Reviewed By: michaelmaitland
Differential Revision: https://reviews.llvm.org/D159053
I accidentally introduced this in
commit 330fa7d2a4e0 ("[TargetLowering] Deduplicate choosing InlineAsm
constraint between ISels (#67057)")
Fix forward.
…tasync argument
swiftasync introduces a number of frame adjustments which are
incompatible with the current implementation of the
HomogeneousPrologEpilog pass.
Splitting a register generated from another split usually doesn't bring
much benefit. It may also cause an infinite loop, as pr67188 shows, if the
heuristic cost always satisfies the split condition. So prevent such
splitting.
This fixes pr67188.
If we have a build_vector of identical binops, we'd prefer to have a
single vector binop in most cases. We do need to make sure that the two
build_vectors aren't more difficult to materialize than the original
build_vector. To start with, let's restrict ourselves to the case where
one build_vector is a fully constant vector.
Note that we don't need to worry about speculation safety here. We are
not speculating any of the lanes, and thus none of the typical - e.g.
div-by-zero - concerns apply.
I'll highlight that the constant build_vector heuristic is just one we
could choose here. We just need some way to be reasonably sure the cost
of the two build_vectors isn't going to completely outweigh the savings
from the binop formation. I'm open to alternate heuristics here - both
more restrictive and more permissive.
As noted in comments, we can extend this in a number of ways. I decided
to start small as a) that helps keep things understandable in review and
b) it covers my actual motivating case.
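A small model of the rewrite described above, using plain arrays in place of DAG nodes. `Vec4`, `scalarFirst`, and `vectorFirst` are names invented for this sketch, and unsigned adds stand in for an arbitrary binop:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<uint32_t, 4>;

// Original form: a build_vector whose lanes are each a scalar binop with
// a constant operand.
Vec4 scalarFirst(const Vec4 &x, const Vec4 &c) {
  return {x[0] + c[0], x[1] + c[1], x[2] + c[2], x[3] + c[3]};
}

// Rewritten form: two build_vectors (one fully constant) feeding a single
// vector binop.
Vec4 vectorFirst(const Vec4 &x, const Vec4 &c) {
  Vec4 lhs = {x[0], x[1], x[2], x[3]}; // build_vector of the variable operands
  Vec4 rhs = c;                        // fully constant build_vector
  Vec4 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = lhs[i] + rhs[i];          // one lane-wise vector add
  return out;
}
```

Each lane computes exactly what it computed before, which is why no speculation concerns arise.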
This commit adds 2 new instructions in the selector:
- OpAccessChain
- OpInBoundsAccessChain.
The choice between the two relies on the `inbounds` marker.
These instructions are not used for OpenCL, to maintain the same
behavior as previously. They are only added when building for logical
SPIR-V, as it doesn't support the pointer equivalent.
Because logical SPIR-V doesn't support pointer casts either, the
assign_ptr_type intrinsic needs to be generated so OpAccessChain gets
lowered with the correct pointer type, instead of i8*.
Fixes #66107
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
This is a re-commit of reverted commit 3454cf67bd0a650097dc6ca99874a34e1d59b500.
Following discussion on https://reviews.llvm.org/D154205, make the MachineLICM pass
handle subloops by visiting the outermost loop's blocks only once.
Differential Revision: https://reviews.llvm.org/D154205
This reverts commit fc86d031fec5e47c6811efd3a871742ad244afdd.
This change breaks LLVM buildbot clang-aarch64-sve-vls-2stage
https://lab.llvm.org/buildbot/#/builders/176/builds/5474
I am going to revert this patch as the bot has been failing for more than a day without a fix.
The patch failed in test-suite due to a liveness error after rebasing on https://reviews.llvm.org/D133103; this is now fixed.
```
[PowerPC][Peephole] Combine rldicl/rldicr and andi/andis after isel.
Summary: rldicl/rldicr can be eliminated if it's used to clear the high-order or low-order n bits and all bits cleared will be ANDed with 0 by andi/andis. Or they can be folded to `andi 0` if all bits to AND are already zero in the input.
Reviewed By: qiucf, shchenz
Differential Revision: https://reviews.llvm.org/D159073
```
Implement a new pass to combine multiple image_load_2dmsaa and
2darraymsaa intrinsic calls into a single image_msaa_load if:
- they refer to the same vaddr except for sample_id,
- they use a constant sample_id and they fall into the same group,
- they have the same dmask and the number of instructions and the
number of vaddr/vdata dword transfers is reduced by the combine
This should be valid on all GFX11 but a hardware bug renders it
unworkable on GFX11.0.* so it is only enabled for GFX11.5.
Based on a patch by Rodrigo Dominguez!
There is a lot of information that can be used for tuning, such as
alignments, cache line size, etc. But we can't make all of it a
`SubtargetFeature` because some of it does not have enumerable
values, for example, `PrefetchDistance` used by `LoopDataPrefetch`.
In this patch, a searchable table `RISCVTuneInfoTable` is added,
in which each entry contains the CPU name and all tune information
defined in `RISCVTuneInfo`. Each field of `RISCVTuneInfo` should
have a default value and processor definitions can override the
default value via `let` statements.
We don't need to define a `RISCVTuneInfo` for each processor;
a processor uses the default values (which are for `generic`) if no
`RISCVTuneInfo` is defined.
For processors in the same series, a subclass can inherit from
`RISCVTuneInfo` and override the fields. And we can also override
the fields in processor definitions if there are some differences
in the same processor series.
When initializing `RISCVSubtarget`, we use `TuneCPU` as the
key to search the tune info table. So if we don't specify the
tune CPU, we use the specified `CPU`, which is the expected
behavior, I think.
This patch almost undoes 61ab106, in which I added tune features
of preferred function/loop alignments. More tune information can
be added in the future.
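The lookup behavior described above can be sketched as follows. This is an illustration only: the struct, table entries, field values, and function names are all hypothetical and are not the TableGen-generated `RISCVTuneInfoTable`:

```cpp
#include <string>

// Hypothetical stand-in for one row of the tune info table.
struct TuneInfo {
  const char *Name;
  unsigned PrefFunctionAlign;
  unsigned PrefetchDistance;
};

// Entry 0 plays the role of `generic` and supplies the default values.
// The CPU names and numbers below are invented for the example.
static const TuneInfo TuneInfoTable[] = {
    {"generic", 1, 0},
    {"hypothetical-cpu-a", 16, 4},
    {"hypothetical-cpu-b", 32, 8},
};

// Searches the table by name and falls back to the generic entry.
const TuneInfo &lookupTuneInfo(const std::string &TuneCPU) {
  for (const TuneInfo &TI : TuneInfoTable)
    if (TuneCPU == TI.Name)
      return TI;
  return TuneInfoTable[0];
}

// If no tune CPU is given, the specified CPU is used as the key.
const TuneInfo &tuneInfoFor(const std::string &CPU, const std::string &TuneCPU) {
  return lookupTuneInfo(TuneCPU.empty() ? CPU : TuneCPU);
}
```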
Summary: rldicl/rldicr can be eliminated if it's used to clear the high-order or low-order n bits and all bits cleared will be ANDed with 0 by andi/andis. Or they can be folded to `andi 0` if all bits to AND are already zero in the input.
Reviewed By: qiucf, shchenz
Differential Revision: https://reviews.llvm.org/D159073
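The two conditions in the summary can be modelled with plain 64-bit masks. The sketch below assumes a zero rotate amount and a 16-bit andi immediate; the helper names are made up and this is not the peephole's actual code:

```cpp
#include <cstdint>

// rldicl with a zero rotate: clear the n high-order bits.
uint64_t clearHigh(uint64_t x, unsigned n) {
  return n >= 64 ? 0 : x & (~uint64_t{0} >> n);
}

// rldicr with a zero rotate: clear the n low-order bits.
uint64_t clearLow(uint64_t x, unsigned n) {
  return n >= 64 ? 0 : x & (~uint64_t{0} << n);
}

// andi: AND with a zero-extended 16-bit immediate.
uint64_t andi(uint64_t x, uint16_t imm) { return x & imm; }

// The clear is redundant when every bit it zeroes is ANDed with 0 anyway.
bool clearHighIsRedundant(unsigned n, uint16_t imm) {
  uint64_t cleared = ~clearHigh(~uint64_t{0}, n);
  return (cleared & imm) == 0;
}

// The pair folds to `andi 0` when every bit selected by the immediate has
// already been zeroed by the clear.
bool clearLowFoldsToZero(unsigned n, uint16_t imm) {
  uint64_t kept = clearLow(~uint64_t{0}, n);
  return (kept & imm) == 0;
}
```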
After https://reviews.llvm.org/D142953, the float value 1.0 can be
optimized as lui+fmv.w.x. But this test aims to test the constantpool
lowering under different code models. Change the float value to one that
cannot be optimized to lui+fmv.w.x.
G_SEXT and G_ZEXT are supported via patterns imported from SDISel;
G_SEXT_INREG is selected using hand-written code as there is no
(functional) rule at this moment to import G_SEXT_INREG from
ISD::SEXT_INREG.
Thanks to @topperc for help with G_SEXT and G_ZEXT.
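For reference, the semantics that a G_SEXT_INREG selection has to implement can be written as a shift-left followed by an arithmetic shift-right. This is a generic sketch of the operation, not the hand-written selection code, and `sextInReg` is a name chosen for the example:

```cpp
#include <cstdint>

// Sign-extends the low `bits` bits of x to 64 bits (1 <= bits <= 64).
// Relies on arithmetic right shift of signed values, which is guaranteed
// from C++20 and is what mainstream compilers do before that.
int64_t sextInReg(int64_t x, unsigned bits) {
  unsigned shift = 64 - bits;
  uint64_t shifted = static_cast<uint64_t>(x) << shift;
  return static_cast<int64_t>(shifted) >> shift;
}
```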
Bitwise logical ops can always be done as b32, regardless of the
availability of other v2i16 ops, which would need a newer GPU.
Includes the missing lowering for 2-argument register operation variants
and additional tests for `and`.
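The reason a v2i16 bitwise op can be done as a single b32 op is that bitwise operations never carry between bits, so lane boundaries do not matter. A small check of that claim for `and`, with helper names invented for the sketch:

```cpp
#include <cstdint>

// Packs two 16-bit lanes into one 32-bit register value.
uint32_t pack(uint16_t lo, uint16_t hi) {
  return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}

// Lane-wise v2i16 AND: each 16-bit half is handled separately.
uint32_t andLanewise(uint32_t a, uint32_t b) {
  uint16_t lo = static_cast<uint16_t>((a & 0xFFFF) & (b & 0xFFFF));
  uint16_t hi = static_cast<uint16_t>((a >> 16) & (b >> 16));
  return pack(lo, hi);
}

// The same operation as one 32-bit AND.
uint32_t andB32(uint32_t a, uint32_t b) { return a & b; }
```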