Currently a cttz.elts of e.g. nxv32i1 will get expanded to a reduction
of nxv32i64 or equivalent, but we can split it into two legal nxv16i1
cttz.elts once we have dedicated SelectionDAG nodes.
This implements the splitting for them the same way we implement type
splitting for vp.cttz.elts, i.e. check if the low result is VF, and if
so add it to the result of the high result. It also implements operand
type promotion for NEON which needs to promote i1 vectors to something
larger first.
We also need to move expansion into LegalizeVectorOps so it doesn't get
expanded before type legalization can do splitting. This uses
LegalizeVectorOps in case the scalar reduction type, which depends on
the minimum bitwidth needed to store the result, still needs type
promotion.
The TTI costs should be updated after this to reflect the more efficient
codegen, but that is deferred to another PR.
Currently llvm.experimental.cttz.elts are directly lowered from the
intrinsic.
If the type isn't legal then the target tells SelectionDAGBuilder to
expand it into a reduction, but this means we can't split the operation.
E.g. it's possible to split a cttz.elts nxv32i1 into two nxv16i1,
instead of expanding it into a nxv32i64 reduction.
vp.cttz.elts can be split because it has a dedicated SelectionDAG node.
This adds CTTZ_ELTS and CTTZ_ELTS[_ZERO_POISON] nodes and just enough
legalization to get tests passing. A follow up patch will add splitting
and move the expansion into LegalizeDAG.
BranchInst currently represents both unconditional and conditional
branches. However, these are quite different operations that are often
handled separately. Therefore, split them into separate opcodes and
classes to allow distinguishing these operations in the type system.
Additionally, this also slightly improves compile-time performance.
The expansion converts arbitrary-precision FP represented as integer
following these algorithm:
1. Extract sign, exponent, and mantissa bit fields via masks and shifts.
2. Classify the input (zero, denormal, normal, Inf, NaN) using the
exponent and mantissa fields.
3. Normal path: adjusting the exponent bias and left-shifting the
mantissa to fit the wider destination format.
4. Denormal path: normalizing by finding the MSB position of the
mantissa (via count-leading-zeros), computing the correct exponent from
that position, stripping the implicit leading 1, and shifting the
fraction into the destination mantissa field.
5. Assemble the destination IEEE bit pattern (sign | exponent |
mantissa) and select among the normal, denormal, and special-value
results.
Currently only conversions from OCP floats are covered, in LLVM terms
these are: Float8E5M2, Float8E4M3FN, Float6E3M2FN, Float6E2M3FN,
Float4E2M1FN.
OCP spec:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
AI has assisted in X86 E2E testing.
This patch teaches the backend how to lower the FCBRT DAG node to the
vector math library function when using ArmPL. This is similar to what
we already do for llvm.pow/FPOW, however the only way to expose this is
via a DAG combine that converts
FPOW(<2 x double> %x, <2 x double> <double 1.0/3.0, double 1.0/3.0>)
into
FCBRT(<2 x double> %x)
when the appropriate fast math flags are present on the node. I've
updated the DAG combine to handle vector types and only perform the
transformation if there exists a vector library variant of cbrt.
This patch is split off from PR #183319 and teaches the backend how to
lower the FPOW DAG node to the vector math library function when using
ArmPL. This is similar to what we already do for llvm.sincos/FSINCOS
today.
Instead of directly checking if the target is android, check if
__safestack_pointer_address is available and configure android
to have the call. Maintain the -safestack-use-pointer-address cl::opt
in an unclean way by ignoring libcall availability.
Also add a RuntimeLibcallsInfo entry for __safestack_unsafe_stack_ptr,
similar to other special globals. Also add this unconditionally to most targets,
even though this seems contrary to reality. A few tests rely on unsupported OSes, so
leave that alone for now.
Previously, we can't compile the program which convert 256 bits to
floating points and vice versa(we'll crash). After this, we're able to
compile them.
Add support for `__builtin_stack_address` builtin. The semantics match
those of GCC's builtin with the same name.
`__builtin_stack_address` returns the starting address of the stack
region that may be used by called functions. It may or may not include
the space used for on-stack arguments passed to a callee (See [GCC
Bug/121013](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121013)).
Fixes#82632.
Generalise the Hexagon cmdline options to control if memset, memcpy or memmove intrinsics should be inlined versus calling library functions, so they can be used by all backends:
• -max-store-memset
• -max-store-memcpy
• -max-store-memmove
These flags override the target-specific defaults set in TargetLowering (e.g., MaxStoresPerMemcpy) and allow fine-tuning of the inlining threshold for performance analysis and optimization.
The optsize variants (-max-store-memset-Os, -max-store-memcpy-Os, max-store-memmove-Os) from the Hexagon backend were removed, and now the above options control both.
The threshold is specified as a number of store operations, which is backend-specific. Operations requiring more stores than the threshold will call the corresponding library function instead of being inlined.
This PR implements the first change outlined in
https://discourse.llvm.org/t/rfc-allow-non-constant-offsets-in-llvm-vector-splice/88974?u=lukel
In order to allow non-immediate offsets in the llvm.vector.splice
intrinsic, we need to separate out the "shift left" and "shift right"
modes into two separate intrinsics, which were previously determined by
whether or not the offset is positive or negative.
The description in the LangRef has also been reworded in terms of
sliding elements left or right and extracting either the upper or lower
half as opposed to extracting from a certain index, which brings it
inline with the definition of `llvm.fshr.*`/`llvm.fshl.*`.
This patch teaches AutoUpgrade.cpp to upgrade the old intrinsics into
their new equivalent one based on their offset, so existing uses of
vector.splice should still work.
Uses of llvm.vector.splice in `llvm/test/CodeGen` haven't been replaced
in this PR to keep the diff small and kick the tyres on the AutoUpgrader
a bit. I planned to do this in a follow up NFC but can include it in
this PR if reviewers prefer.
Similarly the shuffle costing kind `SK_Splice` has just been kept the
same for now, to be split into `SK_SpliceLeft` and `SK_SpliceRight`
later.
In line with a std proposal to introduce the llvm.clmul family of
intrinsics corresponding to carry-less multiply operations. This work
builds upon 727ee7e ([APInt] Introduce carry-less multiply primitives),
and follow-up patches will introduce custom-lowering on supported
targets, replacing target-specific clmul intrinsics.
Testing is done on the RISC-V target, which should be sufficient to
prove that the intrinsics work, since no RISC-V specific lowering has
been added.
Ref: https://isocpp.org/files/papers/P3642R3.html
Co-authored-by: Craig Topper <craig.topper@sifive.com>
The RISC-V P extension adds an instruction equivalent to
__builtin_clrsb. AArch64 has a similar instruction that we currently fail to
select when using the builtin.
This patch adds a combine based on the canonical version of the pattern
emitted by clang for the builtin, (add (ctlz (xor x, (sra x, bw-1)))),
-1). I'm starting the combine at the ctlz because the outer add can
easily be combined into other nodes obscuring the full pattern. So we
generate (add (ctls x), 1) and hope the add will be combined away.
I've also added a combine for the pattern AArch64 recognizes
(ctlz_zero_undef (or (shl (xor x, (sra x, bw-1)), 1), 1)).
I've only enabled the combines when the target has a Legal or Custom
action for the operation, taking into account type promotion. We
can relax this in the future by adding a default expansion to
LegalizeDAG and adding more type legalization rules.
fixes https://github.com/llvm/llvm-project/issues/98389
As the issue describes, promoting `llvm.fma.f16` to `llvm.fma.f32` does
not work, because there is not enough precision to handle the repeated
rounding. `f64` does have sufficient space. So this PR explicitly
promotes the 16-bit fma to a 64-bit fma.
I could not find examples of a libcall being used for fma, but that's
something that could be looked in separately to work around code size
issues.
This continues the replacement of TargetLibraryInfo uses in codegen
with RuntimeLibcallsInfo started in
821d2825a4f782da3da3c03b8a002802bff4b95c.
The series there handled all of the multiple result calls. This
extends for the other handled case, which happened to be frem.
For some reason the Libcall for these are prefixed with "REM_", for
the instruction "frem", which maps to the libcall "fmod".
We had a set of utilities which was only used for some set of
floating point libcalls. Add more, most of which are for integer
operations.
Ideally we would generate these functions from tablegen.
Previously libcall lowering decisions were made directly
in the TargetLowering constructor. Pull these into the subtarget
to facilitate turning LibcallLoweringInfo into a separate analysis
in the future.
Query RuntimeLibcalls for the support and the name. The check
that the implementation is exactly __guard_local instead of
unsupported feels a bit strange.
Currently LibcallLoweringInfo is defined inside of TargetLowering,
which is owned by the subtarget. Pass in the subtarget so we can
construct LibcallLoweringInfo with the subtarget. This is a temporary
step that should be revertable in the future, after LibcallLoweringInfo
is moved out of TargetLowering.
After the base branch was moved to main, this somehow ended up
adding a second definition of RTLCI, instead of modifying the
existing one.
Also fix other build error with gcc bots.
This fixes the -fveclib flag getting lost on its way to the backend.
Previously this was its own cl::opt with a random boolean. Move the
flag handling into CommandFlags with other backend ABI-ish options,
and have clang directly set it, rather than forcing it to go through
command line parsing.
Prior to de68181d7f, codegen used TargetLibraryInfo to find the vector
function. Clang has special handling for TargetLibraryInfo, where it
would
directly construct one with the vector library in the pass pipeline.
RuntimeLibcallsInfo currently is not used as an analysis in codegen, and
needs to know the vector library when constructed.
RuntimeLibraryAnalysis could follow the same trick that
TargetLibraryInfo is using in the future, but a lot more boilerplate changes
are needed to thread that analysis through codegen. Ideally this would come
from an IR module flag, and nothing would be in TargetOptions. For now, it's
better for all of these sorts of controls to be consistent.
Add libcall entries for sleef and armpl sincospi implementations.
This is the start of adding the vector library functions; eventually
they should all be tracked here.
I'm starting with this case because this is a prerequisite to fix
reporting sincospi calls which do not exist on any common targets
without
regressing vector codegen when these libraries are available.
Introduce a new class for the TargetLowering usage. This tracks the
subtarget specific lowering decisions for which libcall to use.
RuntimeLibcallsInfo is a module level property, which may have multiple
implementations of a particular libcall available. This attempts to be
a minimum boilerplate patch to introduce the new concept.
In the future we should have a tablegen way of selecting which
implementations should be used for a subtarget. Currently we
do have some conflicting implementations added, it just happens
to work out that the default cases to prefer is alphabetically
first (plus some of these still are using manual overrides
in TargetLowering constructors).
Currently it is considered suitable to lower to a bit test for a set of
switch case clusters when the the number of unique destinations
(`NumDests`) and the number of total comparisons (`NumCmps`) satisfy:
`(NumDests == 1 && NumCmps >= 3) || (NumDests == 2 && NumCmps >= 5) ||
(NumDests == 3 && NumCmps >= 6)`
However it is found for some cases on powerpc, for example, when
NumDests is 3, and the number of comparisons for each destination is all
2, it's not profitable to lower the switch to bit test. This is to add
an option to set the minimum of largest number of comparisons to use bit
test for switch lowering.
---------
Co-authored-by: Shimin Cui <scui@xlperflep9.rtp.raleigh.ibm.com>
Reland #161355, after fixing up the cross-projects-tests for the wasm
simd intrinsics.
Original commit message:
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
As noted in #153256, TableGen is generating reserved names for
RuntimeLibcalls, which resulted in a build failure for Arm64EC since
`vcruntime.h` defines `__security_check_cookie` as a macro.
To avoid using reserved names, all impl names will now be prefixed with
`Impl_`.
`NumLibcallImpls` was lifted out as a `constexpr size_t` instead of
being an enum field.
While I was churning the dependent code, I also removed the TODO to move
the impl enum into its own namespace and use an `enum class`: I
experimented with using an `enum class` and adding a namespace, but we
decided it was too verbose so it was dropped.
It can be unsafe to load a vector from an address and write a vector to
an address if those two addresses have overlapping lanes within a
vectorised loop iteration.
This PR adds intrinsics designed to create a mask with lanes disabled if
they overlap between the two pointer arguments, so that only safe lanes
are loaded, operated on and stored. The `loop.dependence.war.mask`
intrinsic represents cases where the store occurs after the load, and
the opposite for `loop.dependence.raw.mask`. The distinction between
write-after-read and read-after-write is important, since the ordering
of the read and write operations affects if the chain of those
instructions can be done safely.
Along with the two pointer parameters, the intrinsics also take an
immediate that represents the size in bytes of the vector element types.
This will be used by #100579.
Add entries for_stack_chk_guard, __ssp_canary_word, __security_cookie,
and __guard_local. As far as I can tell these are all just different
names for the same shaped functionality on different systems.
These aren't really functions, but special global variable names. They
should probably be treated the same way; all the same contexts that
need to know about emittable function names also need to know about
this. This avoids a special case check in IRSymtab.
This isn't a complete change, there's a lot more cleanup which
should be done. The stack protector configuration system is a
complete mess. There are multiple overlapping controls, used in
3 different places. Some of the target control implementations overlap
with conditions used in the emission points, and some use correlated
but not identical conditions in different contexts.
i.e. useLoadStackGuardNode, getIRStackGuard, getSSPStackGuardCheck and
insertSSPDeclarations are all used in inconsistent ways so I don't know
if I've tracked the intention of the system correctly.
The PowerPC test change is a bug fix on linux. Previously the manual
conditions were based around !isOSOpenBSD, which is not the condition
where __stack_chk_guard are used. Now getSDagStackGuard returns the
proper global reference, resulting in LOAD_STACK_GUARD getting a
MachineMemOperand which allows scheduling.
https://github.com/llvm/llvm-project/pull/152709 exposed the original IR
argument type to the CC lowering logic. However, in SDAG, this used the
raw type, prior to aggregate splitting. This PR changes it to use the
non-aggregate type instead. (This matches what happened in the
GlobalISel case already.)
I've also added some more detailed documentation on the
InputArg/OutputArg fields, to explain how they differ. In most cases
ArgVT is going to be the EVT of OrigTy, so they encode very similar
information (OrigTy just preserves some additional information lost in
EVTs, like pointer types). One case where they do differ is in
post-legalization lowering of libcalls, where ArgVT is going to be a
legalized type, while OrigTy is going to be the original non-legalized
type.
It is common to have ABI requirements for illegal types: For example,
two i64 argument parts that originally came from an fp128 argument may
have a different call ABI than ones that came from a i128 argument.
The current calling convention lowering does not provide access to this
information, so backends come up with various hacks to support it (like
additional pre-analysis cached in CCState, or bypassing the default
logic entirely).
This PR adds the original IR type to InputArg/OutputArg and passes it
down to CCAssignFn. It is not actually used anywhere yet, this just does
the mechanical changes to thread through the new argument.
This introduces a new `ptrtoaddr` instruction which is similar to
`ptrtoint` but has two differences:
1) Unlike `ptrtoint`, `ptrtoaddr` does not capture provenance
2) `ptrtoaddr` only extracts (and then extends/truncates) the low
index-width bits of the pointer
For most architectures, difference 2) does not matter since index (address)
width and pointer representation width are the same, but this does make a
difference for architectures that have pointers that aren't just plain
integer addresses such as AMDGPU fat pointers or CHERI capabilities.
This commit introduces textual and bitcode IR support as well as basic code
generation, but optimization passes do not handle the new instruction yet
so it may result in worse code than using ptrtoint. Follow-up changes will
update capture tracking, etc. for the new instruction.
RFC: https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/54
Reviewed By: nikic
Pull Request: https://github.com/llvm/llvm-project/pull/139357
The information whether a specific argument is vararg or fixed is
currently stored separately from all the other argument information in
ArgFlags. This means that it is not accessible from CCAssign, and
backends have developed all kinds of workarounds for how they can access
it after all.
Move this information to ArgFlags to make it directly available in all
relevant places.
I've opted to invert this and store it as IsVarArg, as I think that both
makes the meaning more obvious and provides for a better default (which
is IsVarArg=false).
Compare sinking is selectable based on the result of
hasMultipleConditionRegisters. This function is too coarse grained by
not taking into account the differences between scalar and vector
compares. This PR extends the interface to take an EVT to allow finer
control.
The new interface is used by AArch64 to disable sinking of scalable
vector compares, but with isProfitableToSinkOperands updated to maintain
the cases that are specifically tested.
Add a new combine to replace
```
(store ch (vselect cond truevec (load ch ptr offset)) ptr offset)
```
to
```
(mstore ch truevec ptr offset cond)
```
This saves a blend operation on targets that support conditional stores.
Cygwin and MinGW share the auto import behavior that could result in
__stack_check_guard being non-dso-local. Allow windres to assume a
Cygwin target as well as a MinGW one, so defines like _WIN32 would not
be present on Cygwin.