Metadata is still 1.2, not 1.3 after V6.
I thought that amdhsa.version mapped to the COV version but it's
separate, and there are no MD changes in V6, hence it doesn't need to be
updated.
Reland the original patch with additional commit containing fix for two
issues:
1. Attempting to bitcast using MVTs with no corresponding LLVM type.
getDWordFromOffset now works directly with the original vector to get
the corresponding elements given the DWordOffset.
2. Improper bit tracking in CalculateByteProvider for vector types using
certain ops. Previously, bit tracking for certain ops (e.g.
ISD::TRUNCATE) assumed operands were scalar types, which is not correct
since these ops have different semantics depending on vector / scalar.
CalculateByteProvider / CalculateSrcByte now exit on vector types,
handling which is a TODO.
If part of a register (lowered from REG_SEQUENCE) is undefined then we
should propagate undef flags to uses of those lanes. This is only
performed when live intervals are present as it requires live intervals
to correctly match uses to defs, and the primary goal is to allow
precise computation of subrange intervals.
There was an error where dividend of type i64 and actual used number of
bits of 32 fell into path that assumes only 24 bits being used. Check
that AtLeast field is used correctly when using computeNumSignBits and
add necessary extend/trunc for 32 bits path.
Regolden and update testcases.
@jrbyrnes @bcahoon @arsenm @rampitec
PAL Metadata 3.0 introduces an explicit structure in metadata for the
programmable registers written out by the compiler backend.
The previous approach used opaque registers which can change between different
architectures and required encoding the bitfield information in the backend,
which may change between versions.
This change is an extension the previously added support - which only handled
entry functions. This adds support for all functions.
The change also includes some re-factoring to separate common code.
Correct AMDGPU strictfp tests to follow the rules documented in the
LangRef:
https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics
These tests needed the strictfp attribute added to function calls and
some declarations.
Some of the tests now pass with D146845, others get farther along and
fail with D146845. The tests revealed that further work is required
in mostly AMDGPU atomics to get the tests passing.
Since I was here anyway I removed the strictfp attribute from some
constrained intrinsic declarations. They have this attribute by default.
Test changes verified with D146845.
This enables IR expansion for i128 divisions. The vector case is still
broken because ExpandLargeDivRem doesn't try to handle them.
Fixes: SWDEV-426193
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.
TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.
patch 3 from: https://github.com/llvm/llvm-project/pull/73337
The SGPR registers used for preserving EXEC mask while lowering the
whole-wave register spills and copies should be preserved at the prolog
and epilog if they are in the CSR range. It isn't happening when there
is only wwm-copy lowered and there are no wwm-spills. This patch
addresses that problem.
Introduce Code Object V6 in Clang, LLD, Flang and LLVM. This is the same
as V5 except a new "generic version" flag can be present in EFLAGS. This
is related to new generic targets that'll be added in a follow-up patch.
It's also likely V6 will have new changes (possibly new metadata
entries) added later.
Docs change are part of the follow-up patch #76955
Currently machine LICM hoist PC_ADD_REL_OFFSET out of loops, causes
register pressure when function calls are deep in loops. This is a main
cause of sgpr spill for programs containing large number of function
calls in loops.
This patch marks PC_ADD_REL_OFFSET as rematerializable, which eliminates
sgpr spills due to function calls in loops.
Summary:
The old `s_memtime` instruction was supported until the GFX10
architecture. Although this instruction has a higher latency than the
new shader counter, it's much more usable as a processor clock as it is
a full 64-bit counter. The new shader counter is only a 20-bit counter,
which makes it difficult to use as a standard cycle counter as it will
overflow in a few milliseconds. This patch suggests preferring
`s_memtime` for this instrinsic if it is still available.
Extra space causes the checks generated by update_mir_test_checks to be
unavailable.
```
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s
---
name: foo
body: |
; CHECK-LABEL: name: foo
; CHECK: bb.0:
; CHECK-NEXT: successors:
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.1:
; CHECK-NEXT: RET 0, $eax
bb.0:
successors:
bb.1:
RET 0, $eax
...
```
The failure log is as follows:
```
llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match
; CHECK-NEXT: {{ $}}
^
<stdin>:21:13: note: 'next' match was here
successors:
^
<stdin>:21:13: note: previous match ended here
successors:
```
PAL uses ELF REL (not RELA) relocations which can only store a 32-bit
addend in the instruction, even for reloc types like R_AMDGPU_ABS32_HI
which require the upper 32 bits of a 64-bit address calculation to be
correct. This means that it is not safe to fold an arbitrary offset into
a GlobalAddressSDNode, so stop doing that.
In practice this is mostly a problem for small negative offsets which do
not work as expected because PAL treats the 32-bit addend as unsigned.
Make the wmma intrinsic type signatures to be canonical. We need
a type signature as long as the type is not fixed. However, when an
argument's type matches a previous argument's type, we do not need the
signature for this argument.
This patch fixes three general cases:
1. add missing signatures
2. remove signatures for matching arguments
3. reorer the signatures -- return type signature should always appear
first
MUL24 can now return a i64 for i32 operands, but the combine was never
updated to handle this case. Extend the operand when rewriting the ADD
to handle it.
Fixes SWDEV-436654
This commit extends separate-const-offset-from-gep to look at the
newly-added `disjoint` flag on `or` instructions so as to preserve
additional opportunities for optimization.
The tests were pre-committed in #76972.
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.
This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.
The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.
Fixes https://github.com/llvm/llvm-project/issues/69841.
One V_NOP or unrelated VALU instruction in between is required for
correctness when matrix A or B of current WMMA instruction overlaps with
matrix D of previous WMMA instruction.
Remaining cases of WMMA operand overlaps are handled by the hardware and
do not require handling in hazard recognizer.
Hardware may stall in cases where:
- matrix C of current WMMA instruction overlaps with matrix D of
previous WMMA instruction
- VALU instruction reads matrix D of previous WMMA instruction
- matrix A,B or C of WMMA instruction reads result of previous VALU
instruction
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.
TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.
patch 3 from: https://github.com/llvm/llvm-project/pull/73337
CSR SGPR spilling currently uses the early available physical VGPRs. It
currently imposes a high register pressure while trying to allocate
large VGPR tuples within the default register budget.
This patch changes the spilling strategy by picking the VGPRs in the
reverse order, the highest available VGPR first and later after regalloc
shift them back to the lowest available range. With that, the initial
VGPRs would be available for allocation and possibility
of finding large number of contiguous registers will be more.
int_amdgcn_global_load_tr did not specify non-temporal load transpose,
thus we should
not genetrate the non-temporal hint for the load. We need to implement
getTgtMemIntrinsic
to create the corresponding MemSDNode. And we don't set the non-temporal
flag because
the intrinsic did not specify it.
NOTE: We need to implement getTgtMemIntrinsic for any memory intrinsics.
VGPRs used for spilling do not require explicit reservation with MRI.
freezeReservedRegs() executed before register allocation ensures these
are placed in the reserve set.
The only pass after SILowerSGPRSpills is SIPreAllocateWWMRegs which
explicitly tests for interference before register allocation so should
not reuse a WWM VGPR holding spill data. reserveReg prevents calculation
of correct liveness for physical registers which could be used to extend
SIPreAllocateWWMRegs.
This patch tweaks two AMDGPU passes to use iterators rather than
instruction pointers for expressing an insertion point. This is needed
to accurately support DPValues, the non-instruction storage object for
debug-info.
Two tests were sensitive to this change (variable assignments were being
put in the wrong place), and I've added extra run-lines with the "try
new debug-info..." flag. These get tested on our public buildbot to
ensure they continue to work accurately.
SIMemoryLegalizer tests were slow, with most of them taking 4.5 to 5.3s
to complete and that's on a fast machine. I also recall seeing them in
the slowest tests list on build bots.
This removes the verify-machineinstrs option from these tests to speed
them up, bringing the slowest test down to +-2s.
Verifier still runs in EXPENSIVE_CHECKS builds.
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.
This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.