Use GCNPat instead of Custom Lowering to select instructions for
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" is developed to translate the round
modes to the corresponding ones that hardware recognizes.
This patch is part of a set of patches that add an `-fextend-lifetimes`
flag to clang, which extends the lifetimes of local variables and
parameters for improved debuggability. In addition to that flag, the
patch series adds a pragma to selectively disable `-fextend-lifetimes`,
and an `-fextend-this-ptr` flag which functions as `-fextend-lifetimes`
for this pointers only. All changes and tests in these patches were
written by Wolfgang Pieb (@wolfy1961), while Stephen Tozer (@SLTozer)
has handled review and merging. The extend lifetimes flag is intended to
eventually be set on by `-Og`, as discussed in the RFC
here:
https://discourse.llvm.org/t/rfc-redefine-og-o1-and-add-a-new-level-of-og/72850
This patch implements a new intrinsic instruction in LLVM,
`llvm.fake.use` in IR and `FAKE_USE` in MIR, that takes a single operand
and has no effect other than "using" its operand, to ensure that its
operand remains live until after the fake use. This patch does not emit
fake uses anywhere; the next patch in this sequence causes them to be
emitted from the clang frontend, such that for each variable (or this) a
fake.use operand is inserted at the end of that variable's scope, using
that variable's value. This patch covers everything post-frontend, which
is largely just the basic plumbing for a new intrinsic/instruction,
along with a few steps to preserve the fake uses through optimizations
(such as moving them ahead of a tail call or translating them through
SROA).
Co-authored-by: Stephen Tozer <stephen.tozer@sony.com>
For some reason, isOperationLegalOrCustom is not the same as
isOperationLegal || isOperationCustom. Unfortunately, it checks
if the type is legal which makes it uesless for custom lowering
on non-legal types (which is always ppcf128).
Really the DAG builder shouldn't be going to expand this in the
builder, it makes it difficult to work with. It's only here to work
around the DAG requiring legal integer types the same size as
the FP type after type legalization.
The implementation of `AAPointerInfo::RangeList::set_difference` doesn't
consider the case where two ranges have the same offset but different
sizes.
This could cause `AccessList` and `OffsetBins` out of sync because a
range has
been already updated in `AccessList` but missing in `ToRemove`.
I do have a reproducer but the reproducer itself is 248kb. `llvm-reduce`
can't
further reduce it. Not sure how I can make a smaller reproducer.
Fixes: SWDEV-479757.
Fixes 106412. The logic that skips the pass on already-lowered variables
doesn't cover the path that increases alignment of variables. If a
variable is allocated at 24 and then given 16 byte alignment, the
backend notices and fatal-errors on the inconsistency.
Make some minor tweaks to AMDGPU tests to ensure they still work as
intended after https://github.com/llvm/llvm-project/pull/97762. These
tests can be radically simplified after bitcast aware fpclass deduction.
This patch removes the conservative uniformity check in the indirect
call
specialization callback, as whether the function pointer is uniform
doesn't
matter too much. Instead, we add an argument to control specialization.
SMUL_LOHI and UMUL_LOHI are different operations because the high part
of the result is different, so it is not OK to optimize the signed
version to MUL_U24/MULHI_U24 or the unsigned version to
MUL_I24/MULHI_I24.
Run latest update_mir_test_checks.py and update one codeGen test file
This is to clean up the mir test files diff generated by python script
version update
[FixIrreducible] Use CycleInfo instead of a custom SCC traversal
1. CycleInfo efficiently locates all cycles in a single pass, while the
SCC is
repeated inside every natural loop.
2. CycleInfo provides a hierarchy of irreducible cycles, and the new
implementation transforms each cycle in this hierarchy separately
instead of
reducing an entire irreducible SCC in a single step. This reduces the
number
of control-flow paths that pass through the header of each newly created
loop. This is evidenced by the reduced number of predecessors on the
"guard"
blocks in the lit tests, and fewer operands on the corresponding PHI
nodes.
3. When an entry of an irreducible cycle is the header of a child
natural loop,
the original implementation destroyed that loop. This is now preserved,
since the incoming edges on non-header entries are not touched.
4. In the new implementation, if an irreducible cycle is a superset of a
natural
loop with the same header, then that natural loop is destroyed and
replaced
by the newly created loop.
This PR introduces new pass "amdgpu-sw-lower-lds".
This pass lowers the local data store, LDS, uses in kernel and
non-kernel functions in module to use dynamically allocated global
memory. Packed LDS Layout is emulated in the global memory.
The lowered memory instructions from LDS to global memory are then
instrumented for address sanitizer, to catch addressing errors.
This pass only work when address sanitizer has been enabled and has
instrumented the IR. It identifies that IR has been instrumented using
"nosanitize_address" module flag.
For a kernel, LDS access can be static or dynamic which are direct
(accessed within kernel) and indirect (accessed through non-kernels).
**Replacement of Kernel LDS accesses:**
- All the LDS accesses corresponding to kernel will be packed together,
where all static LDS accesses will be allocated first and then dynamic
LDS follows. The total size with alignment is calculated. A new LDS
global will be created for the kernel called "SW LDS" and it will have
the attribute "amdgpu-lds-size" attached with value of the size
calculated. All the LDS accesses in the module will be replaced by GEP
with offset into the "Sw LDS".
- A new "llvm.amdgcn.<kernel>.dynlds" is created per kernel accessing
the dynamic LDS. This will be marked used by kernel and will have
MD_absolue_symbol metadata set to total static LDS size, Since dynamic
LDS allocation starts after all static LDS allocation.
- A device global memory equal to the total LDS size will be allocated.
At the prologue of the kernel, a single work-item from the work-group,
does a "malloc" and stores the pointer of the allocation in "SW LDS". To
store the offsets corresponding to all LDS accesses, another global
variable is created which will be called "SW LDS metadata" in this pass.
- **SW LDS:**
It is LDS global of ptr type with name
"llvm.amdgcn.sw.lds.<kernel-name>".
- **SW LDS Metadata:**
It is of struct type, with n members. n equals the number of LDS globals
accessed by the kernel(direct and indirect). Each member of struct is
another struct of type {i32, i32, i32}. First member corresponds to
offset, second member corresponds to size of LDS global being replaced
and third represents the total aligned size. It will have name
"llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have an
intializer with static LDS related offsets and sizes initialized. But
for dynamic LDS related entries, offsets will be intialized to previous
static LDS allocation end offset. Sizes for them will be zero initially.
These dynamic LDS offset and size values will be updated with in the
kernel, since kernel can read the dynamic LDS size allocation done at
runtime with query to "hidden_dynamic_lds_size" hidden kernel argument.
- At the epilogue of kernel, allocated memory would be made free by the
same single work-item.
**Replacement of non-kernel LDS accesses:**
- Multiple kernels can access the same non-kernel function. All the
kernels accessing LDS through non-kernels are sorted and assigned a
kernel-id. All the LDS globals accessed by non-kernels are sorted.
- This information is used to build two tables:
- **Base table:**
Base table will have single row, with elements of the row placed as per
kernel ID. Each element in the row corresponds to ptr of "SW LDS"
variable created for that kernel.
- **Offset table:**
Offset table will have multiple rows and columns. Rows are assumed to be
from 0 to (n-1). n is total number of kernels accessing the LDS through
non-kernels. Each row will have m elements. m is the total number of
unique LDS globals accessed by all non-kernels. Each element in the row
correspond to the ptr of the replacement of LDS global done by that
particular kernel.
- A LDS variable in non-kernel will be replaced based on the information
from base and offset tables. Based on kernel-id query, ptr of "SW LDS"
for that corresponding kernel is obtained from base table. The Offset
into the base "SW LDS" is obtained from corresponding element in offset
table. With this information, replacement value is obtained.
When a loop contains a VMEM load whose result is only used outside the
loop, do not bother to flush vmcnt in the loop head on GFX12. A wait for
vmcnt will be required inside the loop anyway, because VMEM instructions
can write their VGPR results out of order.
The generic subtarget has neither of these features. Rather than forcing
HasMovrel on, it is simpler to expand dynamic vector indexing to a
sequence of compare/select instructions.
NFC for real subtargets.
MI300 ISA section 4.5 states there is a hazard between "VALU op which
uses OPSEL or SDWA with changes the result’s bit position" and "VALU op
consumes result of that op"
This includes the case where the second op is SDWA with same dest and
dst_sel != DWORD && dst_unused == UNUSED_PRESERVE. In this case, there
is an implicit read of the first op dst and the compiler needs to
resolve this hazard. Confirmed with HW team.
We model dst_unused == UNUSED_PRESERVE as tied-def of implicit operand,
so this PR checks for that.
MI300_SP_MAS section 1.3.9.2 specifies that CVT_SR_FP8_F32 and
CVT_SR_BF8_F32 with opsel[3:2] !=0 have dest forwarding issue.
Currently, we only add check for CVT_SR_FP8_F32 with opsel[3] != 0 --
this PR adds support opsel[2] != 0 as well
Disable fixed-point iteration in all AMDGPU Combiners after #102163.
This saves around 2% compile time in ad hoc testing on some large
graphics shaders. I did not notice any regressions in the generated
code, just a bunch of harmless differences in instruction selection and
register allocation.
CreateControlFlowHub is a method that redirects control flow edges from a set of
incoming blocks to a set of outgoing blocks through a new set of "guard" blocks.
This is now refactored into a separate file with one enhancement: The input to
the method is now a set of branches rather than two sets of blocks.
The original implementation reroutes every edge from incoming blocks to outgoing
blocks. But it is possible that for some incoming block InBB, some successor S
might be in the set of outgoing blocks, but that particular edge should not be
rerouted. The new implementation makes this possible by allowing the user to
specify the targets of each branch that need to be rerouted.
This is needed when improving the implementation of FixIrreducible #101386.
Current use in FixIrreducible does not demonstrate this finer control over the
edges being rerouted. But in UnifyLoopExits, when only one successor of an
exiting block is an exit block, this refinement now reroutes only the relevant
control-flow through the edge; the non-exit successor is not rerouted. This
results in fewer branches and PHI nodes in the hub.
VOPC instructions were defined with HasTrue16BitInst flag while these
true16 instructions are actually implemented with fake16 profile.
Seperate them to true16 version and fake16 version by adding
UseRealTrue16 and UseFakeTrue16 flag and fake16 instructions.
The code default to use fake16. This is preparing for the upcoming
changes in MC to support realtrue 16bit operands and vdst. The true16
and fake16 profile will be modified in the later patches.
The idea behind this canonicalization is that it allows us to handle less
patterns, because we know that some will be canonicalized away. This is
indeed very useful to e.g. know that constants are always on the right.
However, this is only useful if the canonicalization is actually
reliable. This is the case for constants, but not for arguments: Moving
these to the right makes it look like the "more complex" expression is
guaranteed to be on the left, but this is not actually the case in
practice. It fails as soon as you replace the argument with another
instruction.
The end result is that it looks like things correctly work in tests,
while they actually don't. We use the "thwart complexity-based
canonicalization" trick to handle this in tests, but it's often a
challenge for new contributors to get this right, and based on the
regressions this PR originally exposed, we clearly don't get this right
in many cases.
For this reason, I think that it's better to remove this complexity
canonicalization. It will make it much easier to write tests for
commuted cases and make sure that they are handled.
Simplify by using KnownBits::add.
Fix GlobalISel path which was ignoring the known bits of src1.
Improve analysis of mbcnt.hi which adds at most 31 even in wave64.
This work simplifies and generalizes the instruction definition for
intrinsic llvm.fptrunc.round. We no longer name the instruction with the
rounding mode. Instead, we introduce an immediate operand for the
rounding mode for the pseudo instruction. This immediate will be used to
set up the hardware mode register at the time the real instruction is
generated. We name the pseudo instruction as FPTRUNC_ROUND_F16_F32 (for
f32 -> f16), which is easy to generalize for other types.
"round.towardzero" and "round.tonearest" are added for f32 -> f16
truncating, in addition to the existing "round.upward" and
"round.downward". Other rounding modes are not supported by hardware at
this moment.
We cannot promote this case unless we know the value is only
observed through flat operations. We cannot analyze this through
a call. PointerMayBeCaptured was an imprecise check for this.
A callee with a nocapture attribute may still cast to private and
observe the address space, so really we need a different notion
of nocapture.
I doubt this was of any use anyway. The promotable cases should
have optimized out addrspacecast to begin earlier.
Fixes#66669Fixes#104035
This specific callback should now be at parity with the old
pass manager version. There are still some missing IR passes
before this point.
Also I don't understand the need for the RequiresAnalysisPass at the
end. SelectionDAG should just be using the uncached getResult?
If the upper bits of the shr aren't demanded.
This helps with cases where the outer srl was originally an sra and was
converted to a srl by SimplifyDemandedBits before it had a chance to
combine with the inner sra. This can occur when the inner sra was part
of a sign_extend_inreg expansion.
There are some regressions in ARM and Thumb2.