Add custom lowering for ctlz.i8 to avoid multiple add/sub operations.
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
There are some intrinsics are using i16 vectors in place of bfloat
vectors.
Move towards making bf16 vectors legal so these can migrate. Leave the
larger vectors for a later change.
Depends #76213#76214
This helper function shortens examples like
`cast<ConstantSDNode>(Node->getOperand(1))->getZExtValue();` to
`Node->getConstantOperandVal(1);`.
Implemented with:
`git grep -l
"cast<ConstantSDNode>\(.*->getOperand\(.*\)\)->getZExtValue\(\)" | xargs
sed -E -i
's/cast<ConstantSDNode>\((.*)->getOperand\((.*)\)\)->getZExtValue\(\)/\1->getConstantOperandVal(\2)/`
and `git grep -l
"cast<ConstantSDNode>\(.*\.getOperand\(.*\)\)->getZExtValue\(\)" | xargs
sed -E -i
's/cast<ConstantSDNode>\((.*)\.getOperand\((.*)\)\)->getZExtValue\(\)/\1.getConstantOperandVal(\2)/'`.
With a couple of simple manual fixes needed. Result then processed by
`git clang-format`.
This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass, which matches what Clang
already does in the frontend.
While AMDGPU currently disables the use of all libcalls, I've changed it
to instead disable all of them _except_ the atomic ones. Those are
already be emitted by the Clang frontend, and enabling them in the
backend allows the same behavior there.
It seems TypeSize is currently broken in the sense that:
TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8)
without failing its assert that explicitly tests for this case:
assert(LHS.Scalable == RHS.Scalable && ...);
The reason this fails is that `Scalable` is a static method of class
TypeSize,
and LHS and RHS are both objects of class TypeSize. So this is
evaluating
if the pointer to the function Scalable == the pointer to the function
Scalable,
which is always true because LHS and RHS have the same class.
This patch fixes the issue by renaming `TypeSize::Scalable` ->
`TypeSize::getScalable`, as well as `TypeSize::Fixed` to
`TypeSize::getFixed`,
so that it no longer clashes with the variable in
FixedOrScalableQuantity.
The new methods now also better match the coding standard, which
specifies that:
* Variable names should be nouns (as they represent state)
* Function names should be verb phrases (as they represent actions)
The @llvm.amdgcn.cs.chain intrinsic is essentially a call. The call
parameters are bundled up into 2 intrinsic arguments, one for those that
should go in the SGPRs (the 3rd intrinsic argument), and one for those
that should go in the VGPRs (the 4th intrinsic argument). Both will
often be some kind of aggregate.
Both instruction selection frameworks have some internal representation
for intrinsics (G_INTRINSIC[_WITH_SIDE_EFFECTS] for GlobalISel,
ISD::INTRINSIC_[VOID|WITH_CHAIN] for DAGISel), but we can't use those
because aggregates are dissolved very early on during ISel and we'd lose
the inreg information. Therefore, this patch shortcircuits both the
IRTranslator and SelectionDAGBuilder to lower this intrinsic as a call
from the very start. It tries to use the existing infrastructure as much
as possible, by calling into the code for lowering tail calls.
This has already gone through a few rounds of review in Phab:
Differential Revision: https://reviews.llvm.org/D153761
Make codegen emit correctly rounded sqrt by default.
Emit the fast but only kind of fast expansion in AMDGPUCodeGenPrepare
based on !fpmath, like the fdiv case. Hack around visitation ordering
problems from AMDGPUCodeGenPrepare using forward iteration instead of
a well behaved combiner.
https://reviews.llvm.org/D158129
Avoiding warnings like this when building with GCC:
warning: enumeral mismatch in conditional expression:
'llvm::AMDGPUISD::NodeType' vs 'llvm::ISD::NodeType'
[-Wenum-compare]
Lower formal arguments and returns for functions with the
`amdgpu_cs_chain` and `amdgpu_cs_chain_preserve` calling conventions:
* Put `inreg` arguments into SGPRs, starting at s0, and other arguments
into VGPRs, starting at v8. No arguments should end up on the stack, if
we don't have enough registers we should error out.
* Lower the return (which is always void) as an S_ENDPGM.
* Set the ScratchRSrc register to s48:51, as described in the docs.
* Set the SP to s32, matching amdgpu_gfx. This might be revisited in a
future patch.
Differential Revision: https://reviews.llvm.org/D153517
Mirror of the previous log changes, OpenCL conformance doesn't like
interpreting afn as ignore denormal handling but was previously hidden
by flag dropping.
Apparently afn doesn't allow you to drop the denormal handling
according to OpenCL conformance. This was hidden by losing the flags
during the library linking process. Fast log is still broken and needs
more work.
https://reviews.llvm.org/D157936
Not sure if the only valid use is to have stackrestore directly
consume stacksave outputs or not. Handled exactly like a regular stack
pointer so all the edge cases theoretically should work.
https://reviews.llvm.org/D156669
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
This check was unnecessary/incorrect, it was already being done by the target
hook default implementation, and the one in the matcher was checking for a
completely different thing. This change:
1) Removes the check and updates affected tests which now do some more reassociations.
2) Modifies the AMDGPU hooks which were stubbed with "return true" to also do the oneuse
check. Not sure why I didn't do this the first time.
The sign bit has no impact on the exponent, so strip these away. Saves
on the source modifier encoding cost. I left the GlobalISel handling
until there's a resolution to issue #62628.
We should do this in instcombine too, but legalization should be
introducing more frexps than it currently is where this would occur.
The library expansion has too many paths for all the permutations of
DAZ, unsafe and the 3 exp functions. It's easier to expand it in the
backend when we know all of these things. The library currently misses
the no-infinity check on the overflow, which this handles optimizing
out.
Some of the <3 x half> fast tests regress due to vector widening
dropping flags which will be fixed separately.
Apparently there is no exp10 intrinsic, but there should be. Adds some
deadish code in preparation for adding one while I'm following along
with the current library expansion.
Previously we expanded these in a fast-math way and the device
libraries were relying on this behavior. The libraries have a pending
change to switch to the new target intrinsic.
Unlike the library version, this takes advantage of no-infinities on
the result overflow check.
We previously directly codegened to v_log_f32, which is broken for
denormals. The lowering isn't complicated, you simply need to scale
denormal inputs and adjust the result. Note log and log10 are still
not accurate enough, and will be fixed separately.
Provide direct access to v_exp_f32 and v_exp_f16, so we can start
correctly lowering the generic exp intrinsics.
Unfortunately have to break from the usual naming convention of
matching the instruction name and stripping the v_ prefix. exp is
already taken by the export intrinsic. On the clang builtin side, we
have a choice of maintaining the convention to the instruction name,
or following the intrinsic name.
The division between float constants was done with less
precision. Performing the divide in double and truncating to float
provides the same value as used in the library fast math expansion.
This will map directly to the hardware instruction which does not
handle denormals for f32. This will allow moving the generic intrinsic
to be lowered correctly. Also handles selecting the f16 version, but
there's no reason to use it over the generic intrinsic.
This was inserting an s_endpgm in the middle of the block when it has
to be a terminator. Split the block and insert a branch to a new block
with the trap if it's not in a terminator position.
Fixes verifier error on LDS in function with no trap support (and
other trap sources).
- (op (op X, C1), C2) -> (op X, (op C1, C2))
- (op (op X, C1), Y) -> (op (op X, Y), C1)
Some code duplication with the G_PTR_ADD reassociations unfortunately but no
easy way to avoid it that I can see.
Differential Revision: https://reviews.llvm.org/D150230
Undo the canonicalize done in
0cfc6510323fbb5a56a5de23cbc65f7cc30fd34c. Restores some regressed
matching of integer mad. The selection patterns fo the actual mads
don't seem to be properly commuting, so some of the commuted cases are
still missed.
Fixes: SWDEV-363009