The main improvement is to the mfma tests. There are some
mild regressions scattered around, and a few major ones.
The worst regressions are in some of the bitcast tests;
these are cases where the SGPR argument list runs out
and uses VGPRs, and the copies-from-VGPR are misidentified
as divergent. Most of the shufflevector tests are also
regressions. These end up with cleaner MIR, but then get poor
regalloc decisions.
(#159884)
This eliminates the pseudo registerclasses used to hack the
wave register class, which are now replaced with RegClassByHwMode,
so most of the diff is from register class ID renumbering.
This change enables update_llc_test_checks.py to automatically generate
MIR checks for RUN lines that use `-stop-before` or `-stop-after` flags
allowing tests to verify intermediate compilation stages (e.g., after
instruction selection but before peephole optimizations) alongside the
final assembly output. If `-debug-only` flag is present in the run line it's
considered as the main point of interest for testing and stop flags above
are ignored (that is no MIR checks are generated).
This resulted from the scenario, when I needed to test two instruction
matching patterns where the later pattern in the peepholer reverts the
earlier pattern in the instruction selector and distinguish it from the
case when the earlier pattern didn't worked at all.
Initially created by Claude Sonnet 4.5 it was improved later to handle
conflicts in MIR <-> ASM prefixes and formatting.
This change enables update_llc_test_checks.py to automatically generate
MIR checks for RUN lines that use `-stop-before` or `-stop-after` flags
allowing tests to verify intermediate compilation stages (e.g., after
instruction selection but before peephole optimizations) alongside the
final assembly output. If `-debug-only` flag is present in the run line it's
considered as the main point of interest for testing and stop flags above
are ignored (that is no MIR checks are generated).
This resulted from the scenario, when I needed to test two instruction
matching patterns where the later pattern in the peepholer reverts the
earlier pattern in the instruction selector and distinguish it from the
case when the earlier pattern didn't worked at all.
Initially created by Claude Sonnet 4.5 it was improved later to handle
conflicts in MIR <-> ASM prefixes and formatting.
Previously, any blank lines in IR were ignored by UTC, leading to more
fragile `CHECK`s being generated.
This change lets UTC, 1) emit `CHECK-EMPTY` to check blank lines, and 2)
generate more `CHECK-NEXT`s, landing the discussion
https://github.com/llvm/llvm-project/pull/165419#issuecomment-3457572422.
Moreover, this change also aligns the behavior of IR check-gen to ASM
check-gen, which has been emitting `CHECK-EMPTY` since
a8a89c77ea.
LLVM prints switch cases indented by 2 additional spaces, as follows:
```LLVM
switch i32 %x, label %default [
i32 0, label %phi
i32 1, label %phi
]
```
Since this only changes the output IR of update_test_checks.py and does
not change the logic of the File Check Pattern, there seems to be no
need to update the existing test cases.
Adds `arm64-apple-darwin` support to `asm.py` matching and removes now
invalidated `target-triple-mismatch` test (I dont have another triple
supported by llc but not the autogenerator that make this test useful).
There is a warning that triggers if you (for instance) run
`update_llc_test_checks.py` on an input where _all_ functions have
conflicting check lines and so no checks are generated. However, there
are no warnings emitted at all for the case where some functions have
non-conflicting check lines but others don't. This is a source of
frustration because running update_llc_test_checks can result in all
check lines being removed for certain functions when such a conflict
exists with no warning, meaning we have to be extra vigilant inspecting
the diff. I've also personally wasted time tracking down the source of
the dropped lines assuming that update_test_checks would emit a warning
in such cases.
This change adds logic to emit warnings on a function-by-function basis
for any RUN that has conflicting prefixes meaning no output is
generated. This subsumes the previous warning for when _all_ functions
conflict.
UpdateTestChecks have been updated to take into account TBAA
semantics as well, when emitting checks. This is achieved by
parsing TBAA metadata for each tool invocation – whose tool
is identified by their prefixes –, and maintaining a global
dict of prefixes, TBAA nodes.
…d loops
Add a count of the number of partitions LoopDist made when distributing
a loop in meta data, then check for loops which are already distributed
to prevent reprocessing.
We see this happen on some spec apps, LD is on by default at SiFive.
When assigning numbers to registers, skip any with neither uses nor
defs. This is will not have any impact at all on the final SASS but it
makes for slightly more readable PTX. This change should also ensure
that future minor changes are less likely to cause noisy diffs in
register numbering.
In review of bbde6b, I had originally proposed that we support the
legacy text format. As review evolved, it bacame clear this had been a
bad idea (too much complexity), but in order to let that patch finally
move forward, I approved the change with the variant. This change undoes
the variant, and updates all the tests to just use the array form.
Aims to improve error reporting by printing a warning if the target
function regex that has been selected finds no matches. For example, a
`-mtriple=arm64-apple-darwin` runline, would map to the `arm64` prefix
by `update_llc_test_checks.py` and wouldn't match Apple's function
layout, generating some not understandable garbage checks.
The implementation changes `common.process_run_line` to return an
abstract indicator of number of functions processed, without breaking
the drivers. Then `update_llc_test_checks.py` prints a driver specific
error message.
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
This patch removes all uses of %T from within LLVM tests. %T has been
deprecated for about seven years and use is not advised given it is not
unique per test and can thus lead to races. The goal of this is to
eventually remove support for %T from lit.
This changes how LLVM constructs certain data structures that relate to
exception handling (EH) on Windows. Specifically this changes how
IP2State tables for functions are constructed. The purpose of this
change is to align LLVM to the requires of the Windows AMD64 ABI, which
requires that the IP2State table entries point to the boundaries between
instructions.
On most Windows platforms (AMD64, ARM64, ARM32, IA64, but *not* x86-32),
exception handling works by looking up instruction pointers in lookup
tables. These lookup tables are stored in `.xdata` sections in
executables. One element of the lookup tables are the `IP2State` tables
(Instruction Pointer to State).
If a function has any instructions that require cleanup during exception
unwinding, then it will have an IP2State table. Each entry in the
IP2State table describes a range of bytes in the function's instruction
stream, and associates an "EH state number" with that range of
instructions. A value of -1 means "the null state", which does not
require any code to execute. A value other than -1 is an index into the
State table.
The entries in the IP2State table contain byte offsets within the
instruction stream of the function. The Windows ABI requires that these
offsets are aligned to instruction boundaries; they are not permitted to
point to a byte that is not the first byte of an instruction.
Unfortunately, CALL instructions present a problem during unwinding.
CALL instructions push the address of the instruction after the CALL
instruction, so that execution can resume after the CALL. If the CALL is
the last instruction within an IP2State region, then the return address
(on the stack) points to the *next* IP2State region. This means that the
unwinder will use the wrong cleanup funclet during unwinding.
To fix this problem, compilers should insert a NOP after a CALL
instruction, if the CALL instruction is the last instruction within an
IP2State region. The NOP is placed within the same IP2State region as
the CALL, so that the return address points to the NOP and the unwinder
will locate the correct region.
This PR modifies LLVM so that it inserts NOP instructions after CALL
instructions, when needed. In performance tests, the NOP has no
detectable significance. The NOP is rarely inserted, since it is only
inserted when the CALL is the last instruction before an IP2State
transition or the CALL is the last instruction before the function
epilogue.
NOP padding is only necessary on Windows AMD64 targets. On ARM64 and
ARM32, instructions have a fixed size so the unwinder knows how to "back
up" by one instruction.
Interaction with Import Call Optimization (ICO):
Import Call Optimization (ICO) is a compiler + OS feature on Windows
which improves the performance and security of DLL imports. ICO relies
on using a specific CALL idiom that can be replaced by the OS DLL
loader. This removes a load and indirect CALL and replaces it with a
single direct CALL.
To achieve this, ICO also inserts NOPs after the CALL instruction. If
the end of the CALL is aligned with an EH state transition, we *also*
insert a single-byte NOP. **Both forms of NOPs must be preserved.** They
cannot be combined into a single larger NOP; nor can the second NOP be
removed.
This is necessary because, if ICO is active and the call site is
modified by the loader, the loader will end up overwriting the NOPs that
were inserted for ICO. That means that those NOPs cannot be used for the
correct termination of the exception handling region (the IP2State
transition), so we still need an additional NOP instruction. The NOPs
cannot be combined into a longer NOP (which is ordinarily desirable)
because then ICO would split one instruction, producing a malformed
instruction after the ICO call.
This change fixes v2i8 lowering for parameters and returned values. As
part of this work, I move the lowering for return values to use generic
ISD::STORE nodes as these are more flexible and have existing
legalization handling.
Note that calling a function with v2i8 arguments or returns is still not
working but this is left for a subsequent change as this MR is already
fairly large.
Partially addresses #128853
This change consolidates and cleans up various NVPTXISD target-specific
nodes in order to simplify SDAG ISel. While there are some whitespace
changes in the emitted PTX it is otherwise a non-functional change.
NVPTXISD::Wrapper - This node was used to wrap external-symbol and
global-address nodes. It is redundant and has been removed. Instead we
use the non-target versions of these nodes and convert them
appropriately during ISel.
NVPTXISD::CALL - Much of the family of nodes used to represent a PTX
call instruction have been replaced by this new single node. It
corresponds to a single instruction and is therefore much simpler to
create and lower.
As with the other update scripts this takes the output of
-passes=print<gisel-value-tracking> and inserts the results into an
existing mir file. This means that the input is a lot like
update_analysis_test_checks.py, and the output needs to insert into a
mir file similarly to update_mir_test_checks.py. The code used to do the
inserting has been moved to common, to allow it to be reused. Otherwise
it tries to reuse the existing infrastructure, and
update_givaluetracking_test_checks is kept relatively short.
UpdateTestChecks has a make_analyzer_generalizer to replace pointer
addressess from the debug output of LAA with a pattern, which is an
acceptable solution when there is one RUN line. However, when there are
multiple RUN lines with a common pattern, UTC fails to recognize common
output due to mismatched pointer addresses. Instead of hacking UTC scrub
the output before comparing the outputs from the different RUN lines,
fix the issue once and for all by making LAA not output unstable pointer
addresses in the first place.
The removal of the now-dead make_analyzer_generalizer is left as a
non-trivial exercise for a follow-up.
According to the offical LoongArch reference manual, the 32-bit
LoongArch is divied into two variants: the Reduced version (LA32R) and
Standard version (LA32S). LA32S extends LA32R by adding additional
instructions, and the 64-bit version (LA64) fully includes the LA32S
instruction set.
This patch introduces a new target feature `32s` for the LoongArch
backend, enabling support for instructions specific to the LA32S
variant.
The LA32S exntension includes the following additional instructions:
- ALSL.W
- {AND,OR}N
- B{EQ,NE}Z
- BITREV.{4B,W}
- BSTR{INS,PICK}.W
- BYTEPICK.W
- CL{O,Z}.W
- CPUCFG
- CT{O,Z}.W
- EXT.W,{B,H}
- F{LD,ST}X.{D,S}
- MASK{EQ,NE}Z
- PC{ADDI,ALAU12I}
- REVB.2H
- ROTR{I},W
Additionally, LA32R defines three new instruction aliases:
- RDCNTID.W RJ => RDTIMEL.W ZERO, RJ
- RDCNTVH.W RD => RDTIMEH.W RD, ZERO
- RDCNTVL.W RD => RDTIMEL.W RD, ZERO
In most cases, the type information attached to load and store
instructions is meaningless and inconsistently applied. We can usually
use ".b" loads and avoid the complexity of trying to assign the correct
type. The one expectation is sign-extending load, which will continue to
use ".s" to ensure the sign extension into a larger register is done
correctly.
Avoid baking in absolute paths in check lines generated for DIFile
metadata. Generated test checks cannot be sensitive to absolute paths
anyway, as those vary with the environment, but there could be
situations where some sensitivity to partial paths is required for
certain tests. This implementation just assumes such tests aren't worth
the effort to support, but it could be supported in the future.
This is most useful for update_cc_test_checks with debug info enabled,
where the test writer cannot manipulate the paths within the generated
IR directly.
This change introduces a new pattern for lowering kernel byval
parameters in `NVPTXLowerArgs`. Each byval argument is wrapped in a call
to a new intrinsic, `@llvm.nvvm.internal.addrspace.wrap`. This intrinsic
explicitly equates to no instructions and is removed during operation
legalization in SDAG. However, it allows us to change the addrspace of
the arguments to 101 to reflect the fact that they will occupy this
space when lowered by `LowerFormalArgs` in `NVPTXISelLowering`.
Optionally, if a generic pointer to a param is needed, a standard
`addrspacecast` is used. This approach offers several advantages:
- Exposes addrspace optimizations: By using a standard `addrspacecast`
back to generic space we allow InferAS to optimize this instruction,
potentially sinking it through control flow or in other ways unsupported
by `NVPTXLowerArgs`. This is demonstrated in several existing tests.
- Clearer, more consistent semantics: Previously an `addrspacecast` from
generic to param space was implicitly a no-op. This is problematic
because it's not reciprocal with the inverse cast, violating LLVM
semantics. Further it is very confusing given the existence of
`cvta.to.param`. After this change the cast equates to this instruction.
- Allow for the removal of all nvvm.ptr.* intrinsics: In a follow-up
change the nvvm.ptr.gen.to.param and nvvm.ptr.param.to.gen intrinsics
may be removed.
During the transition from debug intrinsics to debug records, we used
several different command line options to customise handling: the
printing of debug records to bitcode and textual could be independent of
how the debug-info was represented inside a module, whether the
autoupgrader ran could be customised. This was all valuable during
development, but now that totally removing debug intrinsics is coming
up, this patch removes those options in favour of a single flag
(experimental-debuginfo-iterators), which enables autoupgrade, in-memory
debug records, and debug record printing to bitcode and textual IR.
We need to do this ahead of removing the
experimental-debuginfo-iterators flag, to reduce the amount of
test-juggling that happens at that time.
There are quite a number of weird test behaviours related to this --
some of which I simply delete in this commit. Things like
print-non-instruction-debug-info.ll , the test suite now checks for
debug records in all tests, and we don't want to check we can print as
intrinsics. Or the update_test_checks tests -- these are duplicated with
write-experimental-debuginfo=false to ensure file writing for intrinsics
is correct, but that's something we're imminently going to delete.
A short survey of curious test changes:
* free-intrinsics.ll: we don't need to test that debug-info is a zero
cost intrinsic, because we won't be using intrinsics in the future.
* undef-dbg-val.ll: apparently we pinned this to non-RemoveDIs in-memory
mode while we sorted something out; it works now either way.
* salvage-cast-debug-info.ll: was testing intrinsics-in-memory get
salvaged, isn't necessary now
* localize-constexpr-debuginfo.ll: was producing "dead metadata"
intrinsics for optimised-out variable values, dbg-records takes the
(correct) representation of poison/undef as an operand. Looks like we
didn't update this in the past to avoid spurious test differences.
* Transforms/Scalarizer/dbginfo.ll: this test was explicitly testing
that debug-info affected codegen, and we deferred updating the tests
until now. This is just one of those silent gnochange issues that get
fixed by RemoveDIs.
Finally: I've added a bitcode test, dbg-intrinsics-autoupgrade.ll.bc,
that checks we can autoupgrade debug intrinsics that are in bitcode into
the new debug records.
Set the default processor version to v68 when the user does not specify
one in the command line. This includes changes in the LLVM backed and
linker (lld). Since lld normally sets the version based on inputs, this
change will only affect cases when there are no inputs.
Fixes#127558
Whilst trying to clean up some loop vectoriser IR tests (see
test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll
for example) a reviewer on PR #129047 suggested it would be
nice to have an option to stop generating CHECK lines after a
certain point. Typically when performing a transformation with
the loop vectoriser we don't usually care about any CHECK lines
generated for the scalar tail of the loop, since the scalar
loop is kept intact. Previously if you wanted to eliminate such
unwanted CHECK lines you had to run the update script, then
manually delete all the lines corresponding to the scalar loop.
This can be very time consuming if the tests ever need changing.
What I've tried to do here is add a new --filter-out-after
option alongside the existing --filter* options that provides
support for stopping the generation of any CHECK lines beyond
the line that matches the filter. With the existing filter
options we never generate CHECK-NEXT lines, but we still care
about ordering with --filter-out-after so I've amended the
code to ensure we treat this filter differently.
PTX supports 2 methods of accessing device function parameters:
- "simple" case: If a parameters is only loaded, and all loads can
address the parameter via a constant offset, then the parameter may be
loaded via the ".param" address space. This case is not possible if the
parameters is stored to or has it's address taken. This method is
preferable when possible.
- "move param" case: For more complex cases the address of the param may
be placed in a register via a "mov" instruction. This mov also
implicitly moves the param to the ".local" address space and allows for
it to be written to. This essentially defers the responsibilty of the
byval copy to the PTX calling convention.
The handling of these cases in the NVPTX backend for byval pointers has
some major issues. We currently attempt to determine if a copy is
necessary in NVPTXLowerArgs and either explicitly make an additional
copy in the IR, or insert "addrspacecast" to move the param to the param
address space. Unfortunately the criteria for determining which case is
possible are not correct, leading to miscompilations
(https://godbolt.org/z/Gq1fP7a3G). Further, the criteria for the
"simple" case aren't enforceable in LLVM IR across other transformations
and instruction selection, making deciding between the 2 cases in
NVPTXLowerArgs brittle and buggy.
This patch aims to fix these issues and improve address space related
optimization. In NVPTXLowerArgs, we conservatively assume that all
parameters will use the "move param" case and the local address space.
Responsibility for switching to the "simple" case is given to a new
MachineIR pass, NVPTXForwardParams, which runs once it has become clear
whether or not this is possible. This ensures that the correct address
space is known for the "move param" case allowing for optimization,
while still using the "simple" case where ever possible.
When --check-globals none, we skipped all the globals in check lines.
However, we are still checking the meta info in function defintion.
The generated checks are still sensitive to metadata changes.
This is to scrub the meta info and match them with {{.*}} instead.
Windows x64 Unwind V2 adds epilog information to unwind data:
specifically, the length of the epilog and the offset of each epilog.
The first step to do this is to add markers to the beginning and end of
each epilog when generating Windows x64 code. I've modelled this after
how LLVM was marking ARM and AArch64 epilogs in Windows (and unified the
code between the three).
getPtrStride returns 0 when the PtrScev is loop-invariant, and this is
not an erroneous value: it returns std::nullopt to communicate that it
was not able to find a valid pointer stride. In analyzeLoop, we call
getPtrStride with a value_or(0) conflating the zero return value with
std::nullopt. Fix this, handling loop-invariant loads correctly.
Currently, the AMDGPU backend bumps the Stack Pointer
by fixed size offsets in the prolog of device functions, and
restores it by the same amount in the epilog.
Prolog:
sp += frameSize
Epilog:
sp -= frameSize
If a function has dynamic stack realignment,
Prolog:
sp += frameSize + max_alignment
Epilog:
sp -= frameSize + max_alignment
These calculations are not optimal in case of dynamic
stack realignment, and completely fail in case of
dynamic stack readjustment.
This patch uses the saved Frame Pointer to restore SP.
Prolog:
fp = sp
sp += frameSize
Epilog:
sp = fp
In case of dynamic stack realignment, SP is restored from
the saved Base Pointer.
Prolog:
fp = sp + (max_alignment - 1)
fp = fp & (-max_alignment)
bp = sp
sp += frameSize + max_alignment
Epilog:
sp = bp
(Note: The presence of BP has been enforced in case of any
dynamic stack realignment.)
---------
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>