We can support these via libcalls in libgcc/compiler-rt or integer
operations for fneg/fabs/fcopysign. fp128 values will be passed in two
64-bit GPRs according to the psABI.
Supporting RV32 requires sret which is not supported by libcall handling
in LegalizerHelper.cpp yet. It doesn't call canLowerReturn.
The AMDGPUAnnotateKernelFeatures pass infers the "amdgpu-calls" and
"amdgpu-stack-objects" attributes, which are used to infer whether we
need to initialize flat scratch. This is, however, not precise. Instead,
we should use AMDGPUAttributor and infer amdgpu-no-flat-scratch-init on
kernels. Refer to https://github.com/llvm/llvm-project/issues/63586 .
If parts of a physical register for a given liverange, as assigned by
the register allocator, can be used to store other values not
represented by this liverange, then `LiveIntervals::addKillFlags`
normally avoids adding a kill flag on the use of this register
when the value's liverange ends.
However, if all the other regunits are artificial, then we can
still safely add the kill flag, since those parts of the register
can never be accessed independently.
A spread(2) shuffle is just a interleave with an undef lane. The
existing lowering was reusing the even lane for the undef value. This
was entirely legal, but non-optimal.
When we have legal instructions we want to promote to sXLen and let isel
pattern matching removing the and/sext_inreg.
When using a libcall we want to use a 'si' libcall for small types
instead of 'di'. To match the RV64 ABI, we need to sign extend `unsigned
int` arguments. We reuse the shouldSignExtendTypeInLibCall hook from
SelectionDAG.
The spec is available here:
https://github.com/intel/llvm/pull/12497
The PR doesn't add OpCooperativeMatrixApplyFunctionINTEL instruction as
it's still experimental and not properly tested E2E.
The PR also fixes few bugs in the related code:
1. CooperativeMatrixMulAddKHR optional operand must be literal, not a
constant;
2. Fixed available capabilities table creation for a case, when a single
extension adds few capabilities, that occupy not contiguous op codes.
---------
Signed-off-by: Sidorov, Dmitry <dmitry.sidorov@intel.com>
This commit ensures than noundef (which is frequently a prerequisite for
other annotations) and range() annotations on kernel arguments are
copied onto their corresponding load from the kernel argument structure.
This DAG combine was incorrect for big-endian targets, because it
assumes that when a bitcast changes the lane width, the
least-significant bits of the wider lanes are in the lower-numbered
lanes of the smaller type, which is only true for little-endian.
The DAG has the same instructions: the signed and unsigned absolute
difference of it's input. For AArch64, they map to uabd and sabd for
Neon and SVE. The Neon and SVE instructions will require custom
patterns.
They are pseudo opcodes and are not imported by the IRTranslator. We
need combines to create them.
PowerPC, ARM, and AArch64 have native instructions.
/// i.e trunc(abs(sext(Op0) - sext(Op1))) becomes abds(Op0, Op1)
/// or trunc(abs(zext(Op0) - zext(Op1))) becomes abdu(Op0, Op1)
For GlobalISel, we are going to write the combines in MIR patterns.
see:
llvm/test/CodeGen/AArch64/abd-combine.ll
- [ ] combine into abd
- [ ] legalize and add td patterns
For IR like this:
%icmp = icmp ult <4 x i32> %a, splat (i32 5)
%res = extractelement <4 x i1> %icmp, i32 1
where there is only one use of %icmp we can take a similar approach
to what we already do for binary ops such add, sub, etc. and convert
this into
%ext = extractelement <4 x i32> %a, i32 1
%res = icmp ult i32 %ext, 5
For AArch64 targets at least the scalar boolean result will almost
certainly need to be in a GPR anyway, since it will probably be
used by branches for control flow. I've tried to reuse existing code
in scalarizeExtractedBinop to also work for setcc.
NOTE: The optimisations don't apply for tests such as
extract_icmp_v4i32_splat_rhs in the file
CodeGen/AArch64/extract-vector-cmp.ll
because scalarizeExtractedBinOp only works if one of the input
operands is a constant.
It doesn't make sense to add a new generic ISD to handle riscv tuple
type. Instead we use `SPLAT_VECTOR` for ISD and further lower to
`VMV_V_X`.
Note: If there's `visitSPLAT_VECTOR` in generic DAG combiner, it needs
to skip riscv vector tuple type.
Stack on https://github.com/llvm/llvm-project/pull/114329
We can extend the existing SHL+TRUNC lowering used for deinterleave2 for
deinterleave4, and deinterleave8 when the result types are small enough
to allow the shift to be legal. On RV64, this means i8 and i16 results
for deinterleave4 and i8 results for deinterleave8.
This adds WebAssembly support for the new [Lime1 CPU].
First, this defines some new target features. These are subsets of
existing
features that reflect implementation concerns:
- "call-indirect-overlong" - implied by "reference-types"; just the
overlong
encoding for the `call_indirect` immediate, and not the actual reference
types.
- "bulk-memory-opt" - implied by "bulk-memory": just `memory.copy` and
`memory.fill`, and not the other instructions in the bulk-memory
proposal.
Next, this defines a new target CPU, "lime1", which enables
mutable-globals,
bulk-memory-opt, multivalue, sign-ext, nontrapping-fptoint,
extended-const,
and call-indirect-overlong. Unlike the default "generic" CPU, "lime1" is
meant
to be frozen, and followed up by "lime2" and so on when new features are
desired.
[Lime1 CPU]:
https://github.com/WebAssembly/tool-conventions/blob/main/Lime.md#lime1
---------
Co-authored-by: Heejin Ahn <aheejin@gmail.com>
This update enhances the implementation of structural hashing for global
variables, using their initial contents. Private global variables or
constants are often used for metadata, where their names are not unique.
This can lead to the creation of different hash results although they
could be merged by the linker as they are effectively identical.
- Refine the hashing of GlobalVariables for strings or certain
Objective-C metadata cases that have section names. This can be further
extended to other scenarios.
- Expose StructuralHash for GlobalVariable so that this API can be
utilized by MachineStableHashing, which is also employed in the global
function outliner.
This change significantly improves size reduction by an additional 1% on
the LLD binary when the global function outliner and merger are enabled
together. As discussed in the RFC
https://discourse.llvm.org/t/loh-conflicting-with-machineoutliner/83279/8?u=kyulee-com,
if we disable or relocate the LOH pass, the size impact could increase
to 4%.
A `concat(extract-high(x), extract-high(y))` is the top half of x
inserted into the bottom half of y. This patch adds a tablegen pattern
to make sure that we generate a single i64 lane insert.
New register bank select for AMDGPU will be split in two passes:
- AMDGPURegBankSelect: select banks based on machine uniformity analysis
- AMDGPURegBankLegalize: lower instructions that can't be inst-selected
with register banks assigned by AMDGPURegBankSelect.
AMDGPURegBankLegalize is similar to legalizer but with context of
uniformity analysis. Does not change already assigned banks.
Main goal of AMDGPURegBankLegalize is to provide high level table-like
overview of how to lower generic instructions based on available target
features and uniformity info (uniform vs divergent).
See RegBankLegalizeRules.
Summary of new features:
At the moment register bank select assigns register bank to output
register using simple algorithm:
- one of the inputs is vgpr output is vgpr
- all inputs are sgpr output is sgpr.
When function does not contain divergent control flow propagating
register banks like this works. In general, first point is still correct
but second is not when function contains divergent control flow.
Examples:
- Phi with uniform inputs that go through divergent branch
- Instruction with temporal divergent use.
To fix this AMDGPURegBankSelect will use machine uniformity analysis
to assign vgpr to each divergent and sgpr to each uniform instruction.
But some instructions are only available on VALU (for example floating
point instructions before gfx1150) and we need to assign vgpr to them.
Since we are no longer propagating register banks we need to ensure that
uniform instructions get their inputs in sgpr in some way.
In AMDGPURegBankLegalize uniform instructions that are only available on
VALU will be reassigned to vgpr on all operands and read-any-lane vgpr
output to original sgpr output.
I'd already included a few cases for spread(N) in the decompress(N) variants,
but rename for clarity and add a couple more edge cases.
i.e. spread(N, 0) produces a, undef, b, undef, ...
This PR fixes:
* emission of OpNames (added newly inserted internal intrinsics and
basic blocks)
* emission of function attributes (SRet is added)
* implementation of SPV_INTEL_optnone so that it emits OptNoneINTEL
Function Control flag, and add implementation of the SPV_EXT_optnone
SPIR-V extension.
This PR resolved the following issues:
(1) There are rare but possible cases when there are bitcasts between
pointers intertwined in a sophisticated way with loads, stores, function
calls and other instructions that are part of type deduction. In this
case we must account for inserted bitcasts between pointers rather than
just ignore them.
(2) Null pointers have the same constant representation but different
types. Type info from Intrinsic::spv_track_constant() refers to the
opaque (untyped) pointer, so that each MF/v-reg pair would fall into the
same Const record in Duplicate Tracker and would be represented by a
single OpConstantNull instruction, unless we use precise pointee type
info. We must be able to distinguish one constant (null) pointer from
another to avoid generating invalid code with inconsistent types of
operands.
Re-land of #116636
Adds a new address spaces: hlsl_private. Variables with such address
space will be emitted with a Private storage class.
This is useful for variables global to a SPIR-V module, since up to now,
they were still emitted with a Function storage class, which is wrong.
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
Currently FastISel triggers a fallback if there is an unreachable
terminator and the TrapUnreachable option is enabled (the ISD::TRAP
selection does not actually work).
Add handling for NoTrapAfterNoReturn, in which case we don't actually
need to emit a trap. The test is just there to make sure there is no
FastISel fallback (which is why I'm not testing the case without
noreturn). We have other tests that check the actual unreachable codegen
variations.
When merging STG instructions used for AArch64 stack tagging, we were
stopping on reaching a load or store instruction, but not calls, so it
was possible for an STG to be moved past a call to memcpy.
This test case (reduced from fuzzer-generated C code) was the result of
StackColoring merging allocas A and B into one stack slot, and
StackSafetyAnalysis proving that B does not need tagging, so we end up
with tagged and untagged objects in the same stack slot. The tagged
object (A) is live first, so it is important that it's memory is
restored to the background tag before it gets reused to hold B.
Alternative for https://github.com/llvm/llvm-project/pull/113764
It builds on a minimalistic approach with the legality check in match
and a blind apply. The precise patterns are used for better compile-time
and modularity. It also moves the pattern check into combiner. While
unary_undef_to_zero and propagate_undef_any_op rely on custom C++ code
for pattern matching.
Is there a limit on the number of patterns?
G_ANYEXT of undef -> undef
G_SEXT of undef -> 0
G_ZEXT of undef -> 0
The combine is not a member of the post legalizer combiner for AArch64.
Test:
llvm/test/CodeGen/AArch64/GlobalISel/combine-cast.mir
This is analogous to febbf91 which added shuffle lowering using
vcompress; we can do the same thing in the deinterleave2 lowering path
which is used for scalable vectors.
Note that we can further improve this for high lmul usage by adjusting
how we materialize the mask (whose result is at most m1 with a known bit
pattern). I am deliberately staging the work so that the changes to
reduce register pressure are more easily evaluated on their own merit.
This defines some new target features. These are subsets of existing
features that reflect implementation concerns:
- "call-indirect-overlong" - implied by "reference-types"; just the
overlong encoding for the `call_indirect` immediate, and not the actual
reference types.
- "bulk-memory-opt" - implied by "bulk-memory": just `memory.copy` and
`memory.fill`, and not the other instructions in the bulk-memory
proposal.
This is split out from https://github.com/llvm/llvm-project/pull/112035.
---------
Co-authored-by: Heejin Ahn <aheejin@gmail.com>