Inspired by some of the cases from D145468
Let SimplifyDemandedBits handle the narrowing of lshr to half-width if we don't require the upper bits, the narrowed shift is profitable and the zext/trunc are free.
A future patch will propose the equivalent shl narrowing combine.
Differential Revision: https://reviews.llvm.org/D146121
- Switch to generated checks
- Use a different run line per denormal mode to reduce test duplication
- Add test coverage for rsqrt cases
- Add test coverage for repeated arcp denominator
- Fix the optnone test
The previous heuristic rejected a PHI if one of its user was an unbreakable PHI, no matter what the other users were.
This worked well in most cases, but there's one case in rocRAND where
it doesn't work. In that case, a PHI node has 2 PHI users where one is
breakable but not the other. When that PHI node isn't broken performance falls by 35%.
Relaxing the restriction to "require that half of the PHI node users are breakable" fixes the issue, and seems like a sensible change.
Solves SWDEV-409648, SWDEV-398393
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D155184
Do the LDS frame calculation once, in the IR pass, instead of repeating the work in the backend.
Prior to this patch:
The IR lowering pass sets up a per-kernel LDS frame and annotates the variables with absolute_symbol
metadata so that the assembler can build lookup tables out of it. There is a fragile association between
kernel functions and named structs which is used to recompute the frame layout in the backend, with
fatal_errors catching inconsistencies in the second calculation.
After this patch:
The IR lowering pass additionally sets a frame size attribute on kernels. The backend uses the same
absolute_symbol metadata that the assembler uses to place objects within that frame size.
Deleted the now dead allocation code from the backend. Left for a later cleanup:
- enabling lowering for anonymous functions
- removing the elide-module-lds attribute (test churn, it's not used by llc any more)
- adjusting the dynamic alignment check to not use symbol names
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D155190
This adds the IGLP strategy for single-wave gemms. The SchedGroup pipeline is laid out in multiple phases, with each phase corresponding to a distinct pattern present in gemm kernels. The resilience of the optimization is dependent upon IR (as seen by pre-RA scheduling) continuing to have these patterns (as defined by instruction class and dependencies) in their current relative ordering.
The kernels of interest have these specific phases:
NT: 1, 2a, 2c
NN: 1, 2a, 2b
TT: 1, 2b, 2c
TN: 1, 2b
The general approach taken was to have a long SchedGroup pipeline. In this way the scheduler will have less capability of doing the wrong thing. In order to resolve the challenge of correctly fitting these long pipelines, we leverage the rules infrastructure to help the solver.
Differential Revision: https://reviews.llvm.org/D149773
Change-Id: I1a35962a95b4bdf740602b8f110d3297c6fb9d96
This time without the extra `->dump()`
A recent addition to the device libs, `__ockl_dm_trim`, caused a series of
failures at O0 due to a i64 ballot intrinsic being inlined into a wave32 function.
The quick fix for this is to support codegen for this rare case.
A proper long-term fix for this type of issue is still being discussed.
Fixes SWDEV-408929, SWDEV-408957, SWDEV-409885, SWDEV-410193
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D155050
A recent addition to the device libs, `__ockl_dm_trim`, caused a series of
failures at O0 due to a i64 ballot intrinsic being inlined into a wave32 function.
The quick fix for this is to support codegen for this rare case.
A proper long-term fix for this type of issue is still being discussed.
Fixes SWDEV-408929, SWDEV-408957, SWDEV-409885, SWDEV-410193
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D155050
These aren't implemented. They could be at moderate implementation
complexity. Raising an error is better than silently miscompiling.
Patching now because the patch at D155125 is a step towards using this metadata
more extensively as part of the lowering path and that will interact badly with
input variables with this annotation.
Lowering user defined variables at specific addresses would drop this error,
put them at the requested position in the frame during this pass, and then
use the same codegen that will be used for the kernel specific struct shortly.
Reviewed By: jmmartinez
Differential Revision: https://reviews.llvm.org/D155132
MachineLICM pass handles inner loops only when outmost loop does not have unique
predecessor. If the loop has preheader and there is loop invariant code, the
invariant code can be hoisted to the preheader in general. This patch makes the
pass handle inner loops in general.
Differential Revision: https://reviews.llvm.org/D154205
This is a preparatory patch for extending DWARFDebugLine to properly
parse line number programs with maximum_operations_per_instruction > 1
for VLIW targets.
Add some scaffolding for handling op-index in line number programs, and
add printouts for that in the table. As this affects a lot of tests,
this is done in a separate commit to get a cleaner review for the actual
op-index implementation.
Verbose printouts are not present in many tests, and adding op-index to
those will require a bit more code changes, so that is done in the
actual implementation patch.
Reviewed By: StephenTozer
Differential Revision: https://reviews.llvm.org/D152535
Documentation for TargetLowering::getShiftAmountTy says that LegalTypes
should generally be true during type legalization, so this patch does
that.
On AMDGPU the effect is that we use i32 (a sane type) instead of i64
(pointer sized type) for more shift amounts, which in turn allows more
formation of rotates and funnel shifts pre-legalization.
Differential Revision: https://reviews.llvm.org/D154960
More robust association between the kernels and lds struct.
Use poison instead of value() for lookup table elements introduced by dynamic lds lowering.
Extracted from D154946, new test from there verbatim. Segv fixed.
Fixes issues/63338
Fixes SWDEV-404491
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D154972
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
This check was unnecessary/incorrect, it was already being done by the target
hook default implementation, and the one in the matcher was checking for a
completely different thing. This change:
1) Removes the check and updates affected tests which now do some more reassociations.
2) Modifies the AMDGPU hooks which were stubbed with "return true" to also do the oneuse
check. Not sure why I didn't do this the first time.
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.
This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.
Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.
Differential Revision: https://reviews.llvm.org/D124196
So far, we haven't exposed the allocation of whole-wave
registers to regalloc. We hand-picked them for various
whole wave mode operations. With a future patch, we
want the allocator to efficiently allocate them rather
than using the custom pre-allocation pass.
Any liverange split of virtual registers involved in
whole-wave operations require the resulting COPY
introduced with the split to be performed for all
lanes. It isn't implemented in the compiler yet.
This patch would identify all such copies and
manipulate the exec mask around them to enable all
lanes without affecting the value of exec mask
elsewhere.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D143762
To reduce the register pressure during allocation,
when the allocator spills a virtual register that
corresponds to a whole wave mode operation, the
spill loads and restores should be activated for
all lanes by temporarily flipping all bits in exec
register to one just before the spills. It is not
implemented in the compiler as of today and this
patch enables the necessary support.
This is a pre-patch before the SGPR spill to virtual
VGPR lanes that would eventually causes the whole
wave register spills during allocation.
Reviewed By: arsenm, cdevadas
Differential Revision: https://reviews.llvm.org/D143759
This avoids computing the dominator tree by removing the
simplifyInstruction use.
This was applying simplification with some kind of questionable
load-store forwarding and looking for the global. This had to have
been an ancient hack copied from previous backends. In the OpenCL
case, this is always emitted as required the direct global reference
anyway.
Combine the two checks into a check if the exponent bits are 0. The
inverted case isn't reachable until a future change, and GlobalISel
currently doesn't attempt the inversion optimization.
https://reviews.llvm.org/D143182
Fixes 3c848194f28decca41b7362f9dd35d4939797724. The TTI hook name got
renamed at some point in the process and the target implementation was
left behind.
Fixes: SWDEV-407329
The sign bit has no impact on the exponent, so strip these away. Saves
on the source modifier encoding cost. I left the GlobalISel handling
until there's a resolution to issue #62628.
We should do this in instcombine too, but legalization should be
introducing more frexps than it currently is where this would occur.
1. Improved code that deduces register class from instruction definitions. Previously if some instruction didn't contain a reg class for an operand it was considered as no information on register class even if other instructions specified the class.
2. Added check on required size of resulting register because in some cases classes with smaller registers had been selected (for example VReg_1).
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D152832
The library expansion has too many paths for all the permutations of
DAZ, unsafe and the 3 exp functions. It's easier to expand it in the
backend when we know all of these things. The library currently misses
the no-infinity check on the overflow, which this handles optimizing
out.
Some of the <3 x half> fast tests regress due to vector widening
dropping flags which will be fixed separately.
Apparently there is no exp10 intrinsic, but there should be. Adds some
deadish code in preparation for adding one while I'm following along
with the current library expansion.
These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've
been a bit nervous about changing this as the OpenCL conformance test
does not cover half. Brute force produces identical values compared to
a reference host implementation for all values.
Previously we expanded these in a fast-math way and the device
libraries were relying on this behavior. The libraries have a pending
change to switch to the new target intrinsic.
Unlike the library version, this takes advantage of no-infinities on
the result overflow check.
GCNHazardRecognizer::fixVcmpxExecWARHazard() mitigates a specific hazard
by inserting a wait on sa_sdst==0 if such a wait isn't already present.
Unfortunately, the check for an existing wait incorrectly checks for one
that doesn't actually care about sa_sdst itself, but requires that no
other counters are waited for.
Once the check is performed correctly, a lit test needs to be updated,
since it is currently testing for the incorrect behaviour.
Differential Revision: https://reviews.llvm.org/D154438
SIInsertWaitcnts inserts waitcnt instructions to resolve data
dependencies. The GFX10+ vscnt (VMEM store count) counter is never used
in this way. It is only used to resolve memory dependencies, and that is
handled by SIMemoryLegalizer. Hence there is no need to conservatively
wait for vscnt to be 0 on function entry and before returns.
Differential Revision: https://reviews.llvm.org/D153537