This patch ports the AMDGPURewriteUndefForPHI pass to the new pass
manager. With this, the pass is supported under both the legacy and the
new pass managers.
---------
Co-authored-by: Jun Wang <jun.wang7@amd.com>
Make codegen emit correctly rounded sqrt by default.
Emit the fast but only kind of fast expansion in AMDGPUCodeGenPrepare
based on !fpmath, like the fdiv case. Hack around visitation ordering
problems from AMDGPUCodeGenPrepare using forward iteration instead of
a well behaved combiner.
https://reviews.llvm.org/D158129
Fix some problems in hand written MIR tests that only showed up when I
tried to run LiveIntervals on them, after which they failed machine
verification with "Use not jointly dominated by defs" errors.
By default, `DAGCombiner` folds `sext x` to `zext x` when `x` is
non-negative. It will generate redundant `zext` inst seq on riscv64
(typically `slli (srli x, 32), 32`).
godbolt: https://godbolt.org/z/osf6adP1o
This patch applies the transform iff `zext` is **cheaper** than `sext`.
The non-constant bit/bif/bsp already work through tablegen patterns, this
patch handles the constant case, mirroring the basic support for
`or(and(X, C), and(Y, ~C))` from ISel tryCombineToBSL. BSP gets expanded
to either BIT, BIF or BSL depending on the best register allocation.
G_BIT can be replaced with G_BSP as a more general alternative.
This adds some more extensive test coverage for fpow intrinsics through global
isel, and removes the unused vector libcall types. All types get scalarized,
fp16 will be expanded to fp32 and then we lower to a libcall from there.
Live out implicit_defs need to be kept, but the check for this only
checked if the block parent was the same. This doesn't work if the
parent blocks are the same but the value is live. Fixes verifier error
"Instruction ending live segment doesn't read the register", which
would appear at the coalesced non-implicit_def def.
Fixes#38788https://reviews.llvm.org/D158882
This fixes some verifier errors when live out implicit defs are
coalesced with identity copies. Fixes some reduced testcases from
issue #38788 but doesn't solve the original failure.
I was surprised this seems to obviate the special casing in
analyzeValue that's been there since the subregister liveness support
went in.
https://reviews.llvm.org/D158850
The issue is uncovered by #47698: for IR files without a target triple,
-mtriple= specifies the full target triple while -march= merely sets the
architecture part of the default target triple, leaving a target triple which
may not make sense, e.g. riscv64-apple-darwin.
Therefore, -march= is error-prone and not recommended for tests without a target
triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead
of rejecting it outrightly.
Add a DAG combine to form a masked.load from a masked_strided_load
intrinsic with stride equal to element size. This covers a couple of
extra test cases, and allows us to simplify and common some existing
code on the concat_vector(load, ...) to strided load transform.
This is the first in a mini-patch series to try and generalize our
strided load and gather matching to handle more cases, and common up
different approaches to the same problems in different places.
When IRBuilder is given an insertion position and there is debug-info, it
sets the DebugLoc of newly inserted instructions to the DebugLoc of the
insertion position. Unfortunately, that means if you insert in front of a
debug intrinsics, your "real" instructions get potentially-misleading
source locations from the debug intrinsics. Worse, if you compile -gmlt to
get source locations but no variable locations, you'll get different source
locations to a normal -g build, which is silly.
Rectify this with the getStableDebugLoc method, which skips over any debug
intrinsics to find the next "real" instruction. This is the source location
that you would get if you compile with -gmlt, and it remains stable in the
presence of debug intrinsics. The changed tests show a few locations where
this has been happening, for example selecting line-zero locations for
instrumentation on a perfectly valid call site.
Differential Revision: https://reviews.llvm.org/D159485
This is the extract side of D159332. The goal is to avoid non-linear costing on patterns where an entire vector is split back into scalars. This is an idiomatic pattern for SLP.
Each vslide operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do a VL unique extracts, each with a cost linear in LMUL, the overall cost is O(LMUL2) * VLEN/ETYPE. To avoid the degenerate case, fallback to the stack if we're beyond LMUL2.
There's a subtly here. For this to work, we're *relying* on an optimization in LegalizeDAG which tries to reuse the stack slot from a previous extract. In practice, this appear to trigger for patterns within a block, but if we ended up with an explode idiom split across multiple blocks, we'd still be in quadratic territory. I don't think that variant is fixable within SDAG.
It's tempting to think we can do better than going through the stack, but well, I haven't found it yet if it exists. Here's the results for sifive-s280 on all the variants I wrote (all 16 x i64 with V):
output/sifive-x280/linear_decomp_with_slidedown.mca:Total Cycles: 20703
output/sifive-x280/linear_decomp_with_vrgather.mca:Total Cycles: 23903
output/sifive-x280/naive_linear_with_slidedown.mca:Total Cycles: 21604
output/sifive-x280/naive_linear_with_vrgather.mca:Total Cycles: 22804
output/sifive-x280/recursive_decomp_with_slidedown.mca:Total Cycles: 15204
output/sifive-x280/recursive_decomp_with_vrgather.mca:Total Cycles: 18404
output/sifive-x280/stack_by_vreg.mca:Total Cycles: 12104
output/sifive-x280/stack_element_by_element.mca:Total Cycles: 4304
I am deliberately excluding scalable vectors. It functionally works, but frankly, the code quality for an idiomatic explode loop is so terrible either way that it felt better to leave that for future work.
Differential Revision: https://reviews.llvm.org/D159375
As noted in
https://github.com/llvm/llvm-project/pull/65392#discussion_r1316259471,
when lowering an extract of a fixed length vector from another vector,
we don't need to perform the vslidedown on the full vector type. Instead
we can extract the smallest subregister that contains the subvector to
be extracted and perform the vslidedown with a smaller LMUL. E.g, with
+Zvl128b:
v2i64 = extract_subvector nxv4i64, 2
is currently lowered as
vsetivli zero, 2, e64, m4, ta, ma
vslidedown.vi v8, v8, 2
This patch shrinks the vslidedown to LMUL=2:
vsetivli zero, 2, e64, m2, ta, ma
vslidedown.vi v8, v8, 2
Because we know that there's at least 128*2=256 bits in v8 at LMUL=2,
and we only need the first 256 bits to extract a v2i64 at index 2.
lowerEXTRACT_VECTOR_ELT already has this logic, so this extracts it out
and reuses it.
I've split this out into a separate PR rather than include it in #65392,
with the hope that we'll be able to generalize it later.
This patch refactors extract_subvector lowering to lower to
extract_subreg directly, and to shortcut whenever the index is 0 when
extracting a scalable vector. This doesn't change any of the existing
behaviour, but makes an upcoming patch that extends the scalable path
slightly easier to read.
This is partly a precommit for an upcoming patch, and partly to remove
the fixed length LMUL restriction similarly to what was done in
https://reviews.llvm.org/D158270, since it's no longer that relevant.
This commits adds the minimal required bits to build a logical SPIR-V
compute shader using LLC.
- Skip OpenCL-only capabilities & extensions for Logical SPIR-V.
- Generate required metadata for entrypoints from HLSL frontend.
- Fix execution mode to GLCompute in logical.
The main issue is the lack of "vulkan" bit in the triple.
This might need to be added as a vendor?
Because as-is, SPIRV32/64 assumes OpenCL, and then, SPIRV assumes
Vulkan. This is ok-ish today, but not correct.
Differential Revision: https://reviews.llvm.org/D156424
Add handling for subrange updates in LiveInterval preservation.
This requires extending MachineBasicBlock::SplitCriticalEdge
to also update subrange intervals.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D158144
If there are copy instructions between uaddlv and urshr for transfer from gpr
to fpr, and vice versa, try to remove them.
Differential Revision: https://reviews.llvm.org/D159265
In emitElse live interval for SI_ELSE source must be recalculated
as SI_ELSE is removed, and new user is placed at block start.
In emitIfBreak live interval for new created AndReg must be
computed.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D158141
This matches how a SelectionDAG::getExternalSymbol node is lowered. On x86-32, a
function call in -fno-pic code should emit R_386_PC32 (since ebx is not set up).
When linked as -shared (problematic!), the generated text relocation will work.
Ideally, we should mark IR intrinsics created in
CodeGenFunction::EmitBuiltinExpr as dso_local, but the code structure makes it
not very feasible.
Fix#51078
Currently s_getreg_b32 is missing the possible mode use. Really we
need separate pseudos for mode-only accesses, but leave this as a
pre-existing issue.
https://reviews.llvm.org/D152710