This tries to fill in the basic vector handling for sadd_sat/uadd_sat
and ssub_sat/usub_sat. It just handles the basics, marking legal types
and clamping illegally sized vectors to legal ones.
If the odd vector is undef or poison, the widening add and multiply trick
doesn't work unless we freeze the odd vector.
Unfortunately, freezing doesn't work when the operand is provably
undef/poison. MIR doesn't have a representation for freeze so it
just becomes a COPY from IMPLICIT_DEF which freely propagates undef
to each operand independently.
To work around this, check for undef explicitly and lower to a VZEXT_VL
of the even vector. This produces better code than we'd get from a
freeze anyway.
I've left a FIXME for adding a freeze. I'll do that as a separate patch
as it affects other tests and doesn't help with the new test.
The interleave lowering relies on a math trick that requires passing
the odd vector to two math instructions. In order to be correct
these instructions must see the same value.
If the odd vector is provably poison or undef, SelectionDAG will
create a vwadd and vwmaccu where the operand is a copy from IMPLICIT_DEF.
Later this will become just the undef flag on the operand. This
gives the register allocator freedom to pick a different register
for each instruction.
VPERMI (VPERMQ/PD) is nearly always lane-crossing and poorly merges with target shuffles (other than itself).
For now, I've restricted VPERMI to only merge with itself, constants, loads and splats.
We might be able to merge with a few other special cases (AND/ANDNP with constant?), which could help the shuffle-vs-trunc-256.ll AVX512VL regression, but since that now gives similar codegen to the other AVX512 variants, I'd prefer to improve the shuffle lowering for that properly.
SelectionDAG marks ISD::BITCAST as legal between scalable vector types
and ISelDAGToDAG deletes them.
We mark G_BITCAST between scalable vectors as legal in GISel. A future
patch will handle what to do with them after the legalizer (likley
either drop them in a isel-preprocess or convert them to COPYs).
BITCAST is needed for legalization of G_INSERT and G_EXTRACT. This is a
precommit for legalization of G_INSERT and G_EXTRACT.
expand-large-divrem cannot handle vector types.
If overly large vector element types survive into
isel, they will likely be scalarized there, but since
isel cannot handle scalar integer types of that size,
it will assert.
Handle vector types in expand-large-divrem by
scalarizing them and then expanding the scalar type
operation. For large vectors, this results in a
*massive* code expansion, but it's better than
asserting.
The ID argument of `gc.statepoint` gets incorrectly truncated to 32 bits
during code generation.
This is fixed by using `uint64_t` instead of `unsigned` for the `ID`
member in `SelectionDAGBuilder::StatepointLoweringInfo`, and a
`patchpoint` test case is extended to check for 64 bit ID generation in
stackmaps.
`RegisterClassInfo::getRegPressureSetLimit` has been changed to return a
smaller value than before so the limit may become negative in later
calculations. As a workaround, change to use
`TargetRegisterInfo::getRegPressureSetLimit`.
Also improve tests.
The earlier implementation on AMDGPU used explicit token operands at
SI_CALL and SI_CALL_ISEL. This is now replaced with CONVERGENCECTRL_GLUE
operands, with the following effects:
- The treatment of tokens at call-like operations is now consistent with
the treatment at intrinsics.
- Support for tail calls using implicit tokens at SI_TCRETURN "just
works".
- The extra parameter at call-like instructions is eliminated, thus
restoring those instructions and their handling to the original state.
The new glue node is placed after the existing glue node for the
outgoing call parameters, which seems to not interfere with selection of
the call-like nodes.
We currently just use mangled name. This works fine, because linker
should detect that and demangle it for the export table. However, on
MSVC, the compiler is more specific and passes demangled name as well,
with EXPORTAS. This PR aims to match that. MSVC doesn't use quotes in
this case, so I added '#' to the list of characters that don't need it.
Switch from `.weak` to `.common` linkage for common global variables
where possible. The `.common` linkage is described in
[PTX ISA 11.6.4. Linking Directives: .common]
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common)
> Declares identifier to be globally visible but “common”.
>
>Common symbols are similar to globally visible symbols. However
multiple object files may declare the same common symbol and they may
have different types and sizes and references to a symbol get resolved
against a common symbol with the largest size.
>
>Only one object file can initialize a common symbol and that must have
the largest size among all other definitions of that common symbol from
different object files.
>
>.common linking directive can be used only on variables with .global
storage. It cannot be used on function symbols or on symbols with opaque
type.
I've updated the logic and tests to only use `.common` for PTX 5.0 or
greater and verified that the new tests now pass with `ptxas`.
DAGCombiner (or InstCombine) will convert an add to an or if the bits
are disjoint, which can prevent what was originally an (add {s,z}ext,
{s,z}ext) from being selected as a vwadd.
This teaches combineBinOp_VLToVWBinOp_VL to recover it by treating it as
an add.
We currently check if the source and promoted types are not equal before
generating truncate instructions. This does not work for RV64 where the
promoted type is i64 and this lead to a crash due to the generation of
truncate instructions from i32 to i64.
Fixes#86400
This narrows unsigned and signed div and rem nodes via
combineBinOpOfZExt.
Unlike other binary ops, there are no widening div or rem instructions.
So we will end up with an extra vzext.vf2.
However I'm assuming that div/rem are expensive enough that by reducing
their EMUL we will gain back the cost.
Alive2 proof: https://alive2.llvm.org/ce/z/Et_L6y
Previously we wouldn't remove dead copies from basic blocks with
successors. The comment said we didn't want to trust the live-in lists.
The comment is very old so I'm not sure if that's still a concern today.
This patch checks the live-in lists and removes copies from
MaybeDeadCopies if they are referenced by any live-ins in any
successors. We only do this if the tracksLiveness property is set. If
that property is not set, we retain the old behavior.
On SystemZ, the outgoing argument area which is big enough for all calls
in the function is created once during the prolog, as opposed to
adjusting the stack around each call. The call-sequence instructions are
therefore not really useful any more than to compute the maximum call
frame size, which has so far been done by PEI, but can just as well be
done at an earlier point.
This patch removes the mapping of the CallFrameSetupOpcode and
CallFrameDestroyOpcode and instead computes the MaxCallFrameSize
directly after instruction selection and then removes the ADJCALLSTACK
pseudos. This removes the confusing pseudos and also avoids the problem
of having to keep the call frame size accurate when creating new MBBs.
This fixes#76618 which exposed the need to maintain the call frame size
when splitting blocks (which was not done).