- update `VectorUtils:isVectorIntrinsicWithScalarOpAtArg` to use TTI for
all uses, to allow specification of target-specific intrinsics
- add TTI to the `isVectorIntrinsicWithStructReturnOverloadAtField` api
- update TTI api to provide `isTargetIntrinsicWith...` functions and
consistently name them
- move `isTriviallyScalarizable` to VectorUtils
- update all uses of the api and provide the TTI parameter
Resolves #117030
Prior to this patch, we required that all users had the same VL in order
to optimize. But as the FIXME said, we can use the largest VL to
optimize, as long as we can determine what the largest is. This patch
implements the FIXME.
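The idea of using the largest VL among the users can be sketched outside of LLVM as follows. This is only an illustrative model, not the code from the patch: the `VL` struct, `isKnownLE` and `largestUserVL` are hypothetical names, and the comparison rule (two immediates, or the same register) is an assumption about what "can determine what the largest is" means.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a VL operand: either a known immediate or an
// opaque register (modelled here by an integer id).
struct VL {
  bool IsImm;
  uint64_t Value; // immediate value if IsImm, otherwise a register id
};

// Returns true if A is known to be <= B. Only comparisons that can be decided
// statically are accepted: two immediates, or the very same register.
bool isKnownLE(const VL &A, const VL &B) {
  if (A.IsImm && B.IsImm)
    return A.Value <= B.Value;
  return !A.IsImm && !B.IsImm && A.Value == B.Value;
}

// Returns the largest VL demanded by any user, or std::nullopt if the users
// cannot be ordered, in which case no optimization is attempted.
std::optional<VL> largestUserVL(const std::vector<VL> &UserVLs) {
  if (UserVLs.empty())
    return std::nullopt;
  VL Max = UserVLs.front();
  for (const VL &U : UserVLs) {
    if (isKnownLE(U, Max))
      continue;
    if (isKnownLE(Max, U)) {
      Max = U;
      continue;
    }
    return std::nullopt; // incomparable, e.g. immediate vs. register
  }
  return Max;
}
```

With users demanding VLs 4, 8 and 2 the sketch picks 8; with a mix of an immediate and a register it gives up, which corresponds to the case where the largest VL cannot be determined.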
This reverts commit e0526b0780f56eede09b05a859a93626ecdc6e4d.
The `v_minmax/maxmin_f16` (GFX11) instructions need to be updated to t16
together with `v_minmax/maxmin_num_f16` (GFX12), since they share the same
codegen pattern. Revert the old patch and resubmit.
SDNode::use_iterator now returns an SDUse& when dereferenced.
SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses
work on use_iterator. SDNode::user_begin/user_end/users work on
user_iterator.
We can now write range based for loops using SDUse& and SDNode::uses().
I've converted many of these in this patch. I didn't update loops that
have additional variables updated in their for statement.
Some loops use SDNode::use_iterator::getOperandNo() which also prevents
using range based for loops. I plan to move this into SDUse in a follow
up patch.
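The difference between iterating uses and iterating users can be shown with a toy model. This is not the SelectionDAG API: `Node`, `Use` and `countUsesAtOperand` are hypothetical stand-ins, and storing the operand number directly in `Use` mirrors the planned follow-up rather than the state after this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node;

// Toy analogue of SDUse: one edge from a value to a node that consumes it.
struct Use {
  Node *User;         // the consuming node
  unsigned OperandNo; // operand slot of User that holds the value
};

struct Node {
  std::vector<Use> UseList;

  // Analogue of uses(): iteration yields Use&.
  const std::vector<Use> &uses() const { return UseList; }

  // Analogue of users(): iteration yields Node*.
  std::vector<Node *> users() const {
    std::vector<Node *> Result;
    for (const Use &U : UseList) {
      assert(U.User && "use without a user");
      Result.push_back(U.User);
    }
    return Result;
  }
};

// Counts the uses of N that feed operand slot OpNo of their user. With the
// operand number on the use itself, a range-based loop is sufficient.
std::size_t countUsesAtOperand(const Node &N, unsigned OpNo) {
  std::size_t Count = 0;
  for (const Use &U : N.uses())
    if (U.OperandNo == OpNo)
      ++Count;
  return Count;
}
```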
Avoid introducing `ProxyReg` and `MOV` nodes during ISel when lowering
`bitconvert` or similar operations. These nodes are all erased by a
later pass but not introducing them in the first place is simpler and
likely saves compile time.
Also remove redundant `MOV` instruction definitions.
This patch introduces a scheduling model for the MIPS p8700, an out-of-order
RISC-V processor. The model includes pipelines for the following units:
- 2 Integer Arithmetic/Logical Units (ALU and AL2)
- Multiply/Divide Unit (MDU)
- Branch Unit (CTI)
- Load/Store Unit (LSU)
- Short Floating-Point Pipe (FPUS)
- Long Floating-Point Pipe (FPUL)
For additional details, refer to the official product page:
https://mips.com/products/hardware/p8700/.
Also adds `UnsupportedSchedZfhmin` to handle cases like `WriteFCvtF16ToF32`
that previously caused build failures.
In streaming[-compatible] functions, use SVE for scalar FP conversions
to/from integer types. This can help avoid moves between FPRs and GPRs,
which could be costly.
This patch also updates definitions of SCVTF_ZPmZ_StoD and
UCVTF_ZPmZ_StoD to disallow lowering to them from ISD nodes, as doing so
requires creating a [U|S]INT_TO_FP_MERGE_PASSTHRU node with inconsistent
types.
Follow up to #112213.
Note: This PR does not include support for f64 <-> i32 conversions (like
#112564), which needs a bit more work to support.
If the shuffle split results in referencing a single legalised whole vector (i.e. no permutation), then this can be treated as free.
We already do something similar for broadcasts / whole subvector insertion + extraction - it's purely an issue for register allocation.
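A rough sketch of the check, assuming the shuffle mask is split into chunks of one legal register each and that a permuted chunk costs 1 (both the unit cost and the names `isWholeLegalVectorCopy` and `splitShuffleCost` are hypothetical, not taken from the cost model):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if the sub-mask for one legal-width result register simply
// reads one whole legal-width source register in order. Negative mask
// entries are undef and match anything.
bool isWholeLegalVectorCopy(const std::vector<int> &SubMask) {
  const int N = static_cast<int>(SubMask.size());
  int Base = -1;
  for (int I = 0; I < N; ++I) {
    int M = SubMask[I];
    if (M < 0)
      continue;
    int ThisBase = M - I;
    // The source must start on a legal-register boundary.
    if (ThisBase < 0 || ThisBase % N != 0)
      return false;
    if (Base >= 0 && Base != ThisBase)
      return false;
    Base = ThisBase;
  }
  return true;
}

// Sums a hypothetical unit cost over every chunk that needs a real permute;
// chunks that reference a single whole legal vector are treated as free.
unsigned splitShuffleCost(const std::vector<int> &Mask, std::size_t LegalElts) {
  assert(LegalElts > 0 && Mask.size() % LegalElts == 0);
  unsigned Cost = 0;
  for (std::size_t I = 0; I < Mask.size(); I += LegalElts) {
    std::vector<int> Sub(Mask.begin() + I, Mask.begin() + I + LegalElts);
    if (!isWholeLegalVectorCopy(Sub))
      ++Cost;
  }
  return Cost;
}
```

For example, an 8-element mask `{4,5,6,7,0,1,2,3}` split into two 4-element registers only swaps whole registers, so the sketch reports a cost of 0.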
These registers require an extra byte to encode in certain memory
forms. Putting them later in the list reduces code size when EGPR is
enabled. Also align the order across the GR8, GR16 and GR32 lists.
Example:
movq (%r20), %r11 # encoding: [0xd5,0x1c,0x8b,0x1c,0x24]
movq (%r22), %r11 # encoding: [0xd5,0x1c,0x8b,0x1e]
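The extra byte in the first encoding is the SIB byte (`0x24`) that follows the ModRM byte `0x1c`. The following sketch captures the general x86 rule for a register-indirect operand with no displacement; `extraMemOperandBytes` is a hypothetical helper, and which registers the patch actually reorders is not restated here.

```cpp
#include <cassert>

// Extra bytes needed to encode `(%reg)` as a memory operand without a
// displacement, given the hardware register number (0-31 with EGPR).
// Only the low three bits of the register number go into ModRM.r/m.
unsigned extraMemOperandBytes(unsigned RegNo) {
  assert(RegNo < 32 && "expected a GPR number");
  switch (RegNo & 7) {
  case 4:
    // r/m == 100b selects a SIB byte (rsp, r12, r20, r28).
    return 1;
  case 5:
    // mod == 00b with r/m == 101b does not mean "base register", so a
    // zero disp8 has to be emitted instead (rbp, r13, r21, r29).
    return 1;
  default:
    return 0;
  }
}
```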
Following on from https://github.com/llvm/llvm-project/pull/94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.
---------
Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>
== We were previously returning an invalid cost when truncating
anything to <vscale x 2 x i1>, which is incorrect since we can
generate perfectly good code for this.
== The costs for truncating legal or unpacked types to predicates
seemed overly optimistic. For example, when truncating
<vscale x 8 x i16> to <vscale x 8 x i1> we typically do
something like
and z0.h, z0.h, #0x1
cmpne p0.h, p0/z, z0.h, #0
I guess it might depend upon whether the input value is
generated in the same block or not and if we can avoid the
inreg zero-extend. However, it feels safe to take the more
conservative cost here.
== The costs for some truncates such as
trunc <vscale x 2 x i32> %a to <vscale x 2 x i16>
were 1, whereas in actual fact they are free and no instructions
are required.
== Also, for this
trunc <vscale x 8 x i32> %a to <vscale x 8 x i16>
it's just a single uzp1 instruction so I reduced the cost to 1.
In general, I've added costs for all cases where the destination
type is legal or unpacked. One unfortunate side effect of this
is the costs for some fixed-width truncates when using SVE now
look too optimistic.
Support the following relocations and assembly operators:
- `R_AARCH64_AUTH_TLSDESC_ADR_PAGE21` (`:tlsdesc_auth:` for `adrp`)
- `R_AARCH64_AUTH_TLSDESC_LD64_LO12` (`:tlsdesc_auth_lo12:` for `ldr`)
- `R_AARCH64_AUTH_TLSDESC_ADD_LO12` (`:tlsdesc_auth_lo12:` for `add`)
ZPR2StridedOrContiguous loads used by a FORM_TRANSPOSED_REG_TUPLE
pseudo should attempt to assign a strided register to avoid unnecessary
copies, even though this may overlap with the list of SVE callee-saved registers.
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
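Why `hasOneUse()` is cheaper than `use_size() == 1` can be seen on any singly linked list: counting walks every element, while the one-use check looks at no more than two. The sketch below uses `std::forward_list` as a stand-in for the use list; `useSize` and `hasOneUse` here are free functions written for illustration, not the SDNode members.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>

// Toy use list: a singly linked list, so its length is not stored anywhere.
using UseList = std::forward_list<int>;

// O(n): has to walk the entire list to count it.
std::size_t useSize(const UseList &L) {
  return static_cast<std::size_t>(std::distance(L.begin(), L.end()));
}

// O(1): inspects at most the first two links.
bool hasOneUse(const UseList &L) {
  auto It = L.begin();
  return It != L.end() && std::next(It) == L.end();
}
```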
This patch adds basic support for `MachinePipeliner` and disables
it by default.
The functionality should be OK and all llvm-test-suite tests have
passed.
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.
I've long been annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
The default legalization uses vmslt with a vector of XLen to compute a
mask. This doesn't work if the type isn't legal. For fixed vectors it
will scalarize. For scalable vectors it crashes the compiler.
This patch uses an alternate strategy that promotes the i1 vector to an
i8 vector and does the merge. I don't claim this to be the best
lowering. I wrote it quickly almost 3 years ago when a crash was
reported in our downstream.
Fixes #120405.
Support true16 format for v_minmax/maxmin_f16 in MC.
Since we are replacing `v_minmax/maxmin_f16` with
`v_minmax/maxmin_f16_t16 / v_minmax/maxmin_f16_fake16` post-GFX11, the
CodeGen pattern for `v_minmax/maxmin_f16` has to be updated to keep the
CodeGen tests passing.
Two bugs here. First, calling `Inst->getFunction()` has undefined
behavior if the instruction is not tracked to a function. I suspect the
`replaceAllUsesWith` was leaving the GEPs in a weird ghost parent
situation. I switched up the visitor to be able to `eraseFromParent` as
part of visiting and then everything started working.
The second bug was in `DXILFlattenArrays.cpp`. I was unaware that you
can have multidimensional arrays of `zeroinitializer` and `undef`, so I
fixed up the initializer to handle these two cases.
Fixes #117273
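The second case can be illustrated with a small standalone model of initializer flattening. Everything here (`InitKind`, `ArrayInit`, `FlatElem`, `flatten`) is a hypothetical sketch and not the code in `DXILFlattenArrays.cpp`; the point is only that a `zeroinitializer` or `undef` may stand for a whole multidimensional sub-array and has to be expanded to one entry per scalar.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class InitKind { Zero, Undef, Explicit };

// Toy initializer for a (possibly nested) array of ints.
struct ArrayInit {
  InitKind Kind = InitKind::Explicit;
  std::vector<ArrayInit> Elems; // sub-arrays of an explicit, non-leaf level
  int Value = 0;                // scalar value of an explicit leaf
};

struct FlatElem {
  bool IsUndef;
  int Value;
};

// Number of scalars in an array with dimensions Dims[From..].
std::size_t flatSize(const std::vector<std::size_t> &Dims, std::size_t From) {
  std::size_t N = 1;
  for (std::size_t I = From; I < Dims.size(); ++I)
    N *= Dims[I];
  return N;
}

// Appends the row-major flattening of Init, an array with dimensions
// Dims[Depth..], to Out.
void flatten(const ArrayInit &Init, const std::vector<std::size_t> &Dims,
             std::size_t Depth, std::vector<FlatElem> &Out) {
  if (Init.Kind != InitKind::Explicit) {
    // zeroinitializer / undef covering a whole sub-array.
    Out.insert(Out.end(), flatSize(Dims, Depth),
               FlatElem{Init.Kind == InitKind::Undef, 0});
    return;
  }
  if (Depth == Dims.size()) {
    Out.push_back(FlatElem{false, Init.Value});
    return;
  }
  assert(Init.Elems.size() == Dims[Depth] && "initializer shape mismatch");
  for (const ArrayInit &E : Init.Elems)
    flatten(E, Dims, Depth + 1, Out);
}
```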
Move the DXILOpLoweringPass after DXILTranslateMetadata, and add asserts
in DXILShaderFlags to ensure it isn't scheduled after op lowering. This
will allow us to rely on DirectX intrinsics in the shader flags analysis
rather than having to recover information from lowered operations.
Fixes #120119.
Support true16 format for v_pack_b32_f16 in MC.
Since we are replacing v_alignbit_b32 with
`v_pack_b32_f16_t16/v_pack_b32_f16_fake16` post-GFX11, the CodeGen pattern
for `v_pack_b32_f16_fake16` has to be updated to keep the CodeGen tests
passing. No pattern is modified or created; `v_pack_b32_f16` is simply
replaced with its fake16 format.
Some of the true16 CodeGen tests are impacted since `v_pack_b32_f16`
selection is removed post-GFX11 while `v_pack_b32_f16_t16` is not
yet supported. The CodeGen patch for `v_pack_b32_f16_t16` will be done
in a following patch.
This is an NFC change. Update the MC test for v_subrev_f16 in true16 format.
The MC source change was done by a previous patch and automatically enabled
by the t16 pseudo.
This is an NFC change. Update the MC test for v_ldexp_f16 in true16 format.
The MC source change was done by a previous patch and automatically enabled
by the t16 pseudo.
We need to create symbols with "the original shape of resource and
element type" to put in the resource metadata in order to generate valid
DXIL.
Note that DXC generally doesn't emit an actual symbol outside of library
shaders (it emits an undef of a pointer to the type), but since we have
to deal with opaque pointers we would need a way to smuggle the type
through to match that. Instead, we simply emit symbols for now.
Fixes #116849
When splitting 2 unique amount shifts to shuffle(shift(x,c1),shift(x,c2)), don't use getTargetVShiftByConstNode directly to lower, use generic shifts to ensure we make use of any further canonicalization: shl(X,1) to add(X,X) etc. - this can have notably better throughput on some x86 targets.
Noticed on #120270
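The shuffle(shift(x,c1),shift(x,c2)) strategy, and why going through a generic shift helps, can be modelled on plain arrays. This is only a scalar illustration with hypothetical names (`V4`, `shlUniform`, `shlTwoAmounts`); the `x + x` special case stands in for the shl(X,1) to add(X,X) canonicalization that a generic shift node can pick up.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4 = std::array<uint32_t, 4>;

// Uniform shift of every lane. A shift by 1 is written as x + x to mirror
// the canonicalization a generic shift can benefit from.
V4 shlUniform(const V4 &X, unsigned Amt) {
  assert(Amt < 32 && "shift amount out of range");
  V4 R{};
  for (std::size_t I = 0; I < X.size(); ++I)
    R[I] = (Amt == 1) ? X[I] + X[I] : X[I] << Amt;
  return R;
}

// Per-lane shift where Amts holds at most two distinct values: perform two
// uniform shifts and blend the results lane by lane.
V4 shlTwoAmounts(const V4 &X, const std::array<unsigned, 4> &Amts) {
  unsigned C1 = Amts[0], C2 = Amts[0];
  for (unsigned A : Amts)
    if (A != C1)
      C2 = A;
  for (unsigned A : Amts)
    assert((A == C1 || A == C2) && "more than two unique shift amounts");
  V4 S1 = shlUniform(X, C1);
  V4 S2 = shlUniform(X, C2);
  V4 R{};
  for (std::size_t I = 0; I < X.size(); ++I)
    R[I] = (Amts[I] == C1) ? S1[I] : S2[I]; // the "shuffle" (a blend)
  return R;
}
```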
This splits the DXILResourceAnalysis pass into TypeAnalysis and
BindingAnalysis passes. The type analysis pass is made immutable and
populated lazily so that it can be used earlier in the pipeline without
needing to carefully maintain the invariants of the binding analysis.
Fixes #118400
We have several vector shift lowering strategies that have to analyse
the distribution of non-uniform constant vector shift amounts, but at the
moment there is very little sharing of data between these analyses.
This patch creates a SmallDenseMap of the different LEGAL constant shift
amounts used, with a mask of which elements they are used in. So far
I've only updated the shuffle(immshift(x,c1),immshift(x,c2)) lowering
pattern to use it for clarity; there are several more that can be done in
followups. It's hoped that the proposed patch #117980 can be simplified
after this patch as well.
vec_shift6.ll - the existing shuffle(immshift(x,c1),immshift(x,c2))
lowering bails on out of range shift amounts, while this patch now skips
them and treats them as UNDEF - this means we manage to fold more cases
that before would have to lower to a SHL->MUL pattern, including some
legalized cases.
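The map described above can be sketched as follows, using `std::map` as a standalone stand-in for `SmallDenseMap`. The function name `collectLegalShiftAmounts` and the 64-lane limit are assumptions made for the illustration; out-of-range amounts are skipped, matching the "treat as UNDEF" behaviour described for vec_shift6.ll.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Maps each in-range (legal) constant shift amount to a bitmask of the
// lanes that use it. Out-of-range amounts are skipped, i.e. their lanes
// are treated as undef and appear in no mask.
std::map<unsigned, uint64_t>
collectLegalShiftAmounts(const std::vector<unsigned> &Amts, unsigned EltBits) {
  assert(Amts.size() <= 64 && "sketch supports at most 64 lanes");
  std::map<unsigned, uint64_t> UniqueAmts;
  for (std::size_t I = 0; I < Amts.size(); ++I) {
    if (Amts[I] >= EltBits)
      continue; // out of range
    UniqueAmts[Amts[I]] |= uint64_t{1} << I;
  }
  return UniqueAmts;
}
```

A lowering that only handles two unique amounts can then check `UniqueAmts.size() == 2` and read the blend mask directly from either entry.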