The intention is to use a "copy" instead of a "sub" to handle the high
parts of 64-bit multiply for this specific case.
This unlocks copy prop use cases where the copy can be reused by later
multiply+add sequences if possible.
Fixes: SWDEV-487672, SWDEV-487669
This avoids regressions in a future AMDGPU commit. Previously we
would have a build_vector (extract_vector_elt x), undef with free
access to the elements bloated into a shuffle of one element + undef,
which has much worse combine support than the extract.
Alternatively could check aggressivelyPreferBuildVectorSources, but
I'm not sure it's really different than isExtractVecEltCheap.
Once again we have excessive TLI hooks with bad defaults. Permit this
for 32-bit element vectors, which are just use-different-register.
We should permit 16-bit vectors as cheap with legal packed instructions,
but I see some mixed improvements and regressions that need investigation.
This avoids some of the pending regressions after AMDGPU implements
isExtractVecEltCheap.
In a case like shl <value, undef>, splat k, because the second operand
was fully defined, we would fall through and use the splat value for the
first operand, losing the undef high bits. This would result in an additional
instruction to handle the high bits. Add some reduced testcases for different
opcodes for one of the regressions.
Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and
fake16 flow.
Since we are replacing `v_cndmask_b16` to `v_cndmask_b16_t16/fake16`, we
have to at least update the fake16 codeGen to get codeGen test passing.
For this case, we have to update the true16 and with fake16 together,
otherwise some of the true16 tests will fail
APInt will fail when given a negative offset. SelectScratchSVAddr
utilizes this function and can be given a negative offset as well, so
this change modifies it to use APSInt instead.
This change adds IntrConvergent property to image.getlod intrinsic and
to several image.sample intrinsics. All image.sample intrinsics apart
from LOD(_L), Level 0(_LZ), Derivative(_D) will be marked as Convergent.
Preparation for #121124
This PR provides tests added into
[PR](https://github.com/llvm/llvm-project/pull/121124) that add
selection patterns for instruction `v_sat_pk`, in order to specify the
change of the tests before and after the commit.
Pre-commit tests PR for #121124 : Add selection patterns for instruction
`v_sat_pk`
Replace S_XOR with S_ANDN2 when computing the kill mask in demote/kill
lowering. This has the effect of AND'ing demote/kill condition with exec
which is needed for proper live mask update.
The S_XOR is inadequate because it may return true for lane with exec=0.
This patch fixes an image corruption in game.
I think the issue went unnoticed because demote/kill condition is often
naturally dependent on exec, so AND'ing with exec is usually not
required.
scalar_to_vector is difficult to make appear and test,
but I found one case where this makes an observable difference.
It fires more often than this in the test suite, but most of them
have no net result in the final code. This helps reduce regressions
in a future commit.
We want special handing for IGLP instructions in the scheduler but they
should still be treated like they have side effects by other passes. Add
a target hook to the ScheduleDAGInstrs DAG builder so that we have more
control over this.
Add a prologue to the kernel entry to handle cases where code designed
for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware the 256 bytes
at the start of the kernel entry will be skipped. This skipping is done
automatically by hardware that supports the feature.
A pass is added which is intended to be run at the very end of the
pipeline to avoid any optimizations that would assume the prologue is a
real predecessor block to the actual code start. In reality we have two
possible entry points for the function. 1. The optimized path that
supports kernarg preloading which begins at an offset of 256 bytes. 2.
The backwards compatible entry point which starts at offset 0.
`RegisterClassInfo::getRegPressureSetLimit` is a wrapper of
`TargetRegisterInfo::getRegPressureSetLimit` with some logics to
adjust the limit by removing reserved registers.
It seems that we shouldn't use
`TargetRegisterInfo::getRegPressureSetLimit`
directly, just like the comment "This limit must be adjusted
dynamically for reserved registers" said.
Separate from https://github.com/llvm/llvm-project/pull/118787
If one of the inputs has all 0 bits, the low part cannot
carry and we can just pass through the original value.
Add case: https://alive2.llvm.org/ce/z/TNc7hf
Sub case: https://alive2.llvm.org/ce/z/AjH2-J
We could do this in the general case with computeKnownBits,
but add is so common this could be potentially expensive for
something which will fire infrequently.
One potential concern is this could break the 64-bit add
we expect to see for addressing mode matching, but these
constants shouldn't appear often in addressing expressions.
One test for large offset expressions changes but isn't worse.
Fixes https://github.com/ROCm/llvm-project/issues/237
The test added in f6365a47a1ad9ab6d432f6e40d14a11419e21282 fails the verifier
for the reason noted in the comment, but we need to skip the verifier
error in EXPENSIVE_CHECKS builds
True16 support for v_minmax/maxmin_f16(GFX11) and
v_minmax/maxmin_num_f16(GFX12).
These insts are updated at the same time since we are replacing the
`v_minmax/maxmin_f16` to `v_minmax/maxmin_fake16_f16` while
`v_minmax/maxmin_num_f16` are alias insts and share the same CodeGen
pattern.
Added a GFX12 runline in minmax.ll in fake16 flow
In SIFoldOperands, leave copies for moving between agpr and vgpr
registers. The register coalescer is able to handle the copies
more efficiently than v_accvgpr_mov, v_accvgpr_write, and
v_accvgpr_read. Otherwise, the compiler generates unneccesary
instructions such as v_accvgpr_mov a0, a0.
Previously in shrinkDivRem64, it used fixed value 32 for AtLeast which
meant that <64bit divisions would be rejected from shrinking since logic
depended only on number of sign bits. I.e. 'idiv i48 %0, %1' would
return 24 for number of sign bits if %0,%1 both had 24 division bits,
and was rejected.
Do not try to rematerialize a super-register def used by a subregister
extract copy into a copy to a physical register if the other pieces of
the
full physreg are live at the rematerialization point. It would insert
the
super-register def at the rematerialization point, and assert since the
other half of the register was already live.
This is analagous to the undef subregister def handling above,
which handled the virtual register case.
Fixes#120970
This reverts a8f3ebaf11c3745e5123054776eb71755d16f2f9. You need to
use -verify-regalloc to get a MachineVerifier run with LiveIntervals,
otherwise cases not covered by the basic liveness implementation
in the verifier are passed through (which covers most use of undefined
subrange errors).
The stack case uses a physical register and should not ordinarily
reach here, but strange things happen at -O0. The testcase still
errors because we do not yet attempt to handle arbitrary dynamic
sized allocas yet.
Fixes: SWDEV-503538
Support true16 format for v_fma_f16 in MC.
Since we are replacing v_fma_f16 to v_fma_f16_t16/v_fma_f16_fake16 in
Post-GFX11, have to update the CodeGen pattern for v_fma_f16_fake16 to
get CodeGen test passing. There is no pattern modified/created, but just
replacing the v_fma_f16 with fake16 format.
Previously, registers and subregisters mapped to the same Dwarf
encoding. We don't really have any way to refer to subregisters directly
from Dwarf, the expression emitter should instead use DW_OPs to stencil
out the subregister from the whole register. This was also confusing
tools that need to map back to the llvm reg (e.g. dwarfdump), since
getLLVMRegNum() would arbitrarily return the _LO16 register.
Last chance recoloring can delete the current fixed interval
during recursive assignment of interfering live intervals. Check
if the virtual register value was assigned before attempting the
unassignment, as is done in other scenarios. This relies on the fact
that we do not recycle virtual register numbers.
I have only seen this occur in error situations where the allocation
will fail, but I think this can theoretically happen in working
allocations.
This feels very brute force, but I've spent over a week debugging
this and this is what works without any lit regressions. The surprising
piece to me was that unspillable live ranges may be spilled, and
a number of tests rely on optimizations occurring on them. My other
attempts to fixed this mostly revolved around not identifying unspillable
live ranges as snippet copies. I've also discovered we're making some
unproductive live range splits with subranges. If we avoid such splits,
some of the unspillable copies disappear but mandating that be precise
to fix a use after free doesn't sound right.
This combine pattern perform the below transformation.
fmul x, select(y, A, B) -> fldexp (x, select i32 (y, a, b))
fmul x, select(y, -A, -B) -> fldexp ((fneg x), select i32 (y, a, b))
where, A=2^a & B=2^b ; a and b are integers.
It is a follow-up PR to implement the above combine for globalIsel, as
the corresponding DAG combine has been done for SelectionDAG Isel
(#111109)
Currently, AMDGPU backend can handle uniform-sized dynamic allocas.
This patch extends support for divergent-sized dynamic allocas.
When the size argument of a dynamic alloca is divergent,
a wave-wide reduction is performed to get the required stack space.
`@llvm.amdgcn.wave.reduce.umax` is used to perform the
wave reduction.
Dynamic allocas are not completely supported yet,
as the stack is not properly restored on function exit.
This patch doesn't attempt to address the aforementioned issue.
Note: Compiler already Zero-Extends or Truncates all other
types(of alloca size arg) to i32.
This would see if there are mixed integer and FP types and pick an
equivalently sized FP type to use as the vector element type, and only
cast if there were mixed integers. We need to insert a cast if the types
are mixed, which may include different FP types.
Fixes#121601