There should not be both `n8` and `n16:8`. This is a single list of
legal integers. Additionally, this should use the standard order in
increasing size `n8:16`.
The resource binding analysis was incorrectly reducing the size of the
`Bindings` vector by one element after sorting and de-duplication. This
led to an inaccurate setting of the `HasOverlappingBinding` flag in the
`DXILResourceBindingInfo` analysis, as the truncated vector no longer
reflected the true binding state.
This update corrects the shrink logic and introduces an `assert` in the
`DXILPostOptimizationValidation` pass. The assertion will trigger if
`HasOverlappingBinding` is set but no corresponding error is detected,
helping catch future inconsistencies.
The bug surfaced when the `srv_metadata.hlsl` and `uav_metadata.hlsl`
tests were updated to include unbounded resource arrays as part of
https://github.com/llvm/llvm-project/issues/145422. These updated test
files are included in this PR, as they would cause the new assertion to
fire if the original issue remained unresolved.
Depends on #152250
If we have a floating point vector and no zve32f/zve64f/zve64d, we can
end up with an invalid type-legalization cost from
getTypeLegalizationCost.
Previously this triggered an assertion that the type must have been
legalized if the "legal" type is a vector, but in this case when it's
not possible to legalize the original type is spat back out.
This fixes it by just checking that the legalization cost is valid.
We don't have much testing for zve64x, so we may have other places in
the cost model with this issue.
Fixes#153008
The first commit is identical to
69bec0afbb8f2aa0021d18ea38768360b16583a9.
The second commit fixes the instruction verification failures by
replacing the erroneous instruction with a trap after the error is
reported and adds `-verify-machineinstrs` to the tests added in the
original PR to catch the issue sooner.
After that change, all tests pass with both
`LLVM_ENABLE_EXPENSIVE_CHECKS={On,Off}`.
cc @RKSimon @e-kud @phoebewang @arsenm as reviewers on the original PR
Similar to the PACKH+PACK pattern for RV32. We can end up with the
shift left by 32 neeed by our PACK pattern hidden behind an OR that
packs 2 half words.
In some places we were passing the type of value being accessed, in
other cases we were passing the type of the pointer for the access.
The most "involved" user is
LoopVectorizationCostModel::getMemInstScalarizationCost, which is the
only call site that passes in the SCEV, and it passes along the pointer
type.
This changes call sites to consistently pass the pointer type, and
renames the arguments to clarify this.
No target actually checks the contents of the type passed, only to see
if it's a vector or not, so this shouldn't have an effect.
It is common to have ABI requirements for illegal types: For example,
two i64 argument parts that originally came from an fp128 argument may
have a different call ABI than ones that came from a i128 argument.
The current calling convention lowering does not provide access to this
information, so backends come up with various hacks to support it (like
additional pre-analysis cached in CCState, or bypassing the default
logic entirely).
This PR adds the original IR type to InputArg/OutputArg and passes it
down to CCAssignFn. It is not actually used anywhere yet, this just does
the mechanical changes to thread through the new argument.
Now, why would we want to do this?
There are a small number of places where this works:
1. It helps peepholeopt when less flag checking.
2. It allows the folding of things such as x - 0x80000000 < 0 to be
folded to cmp x, register holding this value
3. We can refine the other passes over time for this.
The default `half` legalization has some issues with quieting NaNs and
carrying excess precision. As has been done for various other targets,
update AVR to use `softPromoteHalfType` which avoids these issues.
The most obvious corrected test below is `test_load_store`, which no
longer contains calls to extend and trunc (this passing through libcalls
means that `f16` does not round trip).
Fixes the AVR part of https://github.com/llvm/llvm-project/issues/97975
Fixes the AVR part of https://github.com/llvm/llvm-project/issues/97981
If the upper 32 bits are demanded, we might have a sext_inreg in
the pattern on the byte shifted by 24. We can also match this case
since packw sign extends from bit 31.
For example:
/usr/bin/ld: lib/Target/AVR/CMakeFiles/LLVMAVRCodeGen.dir/AVRTargetMachi
ne.cpp.o: in function `llvm::TargetTransformInfoImplCRTPBase<llvm::AVRTT
IImpl>::~TargetTransformInfoImplCRTPBase()':
AVRTargetMachine.cpp:(.text._ZN4llvm31TargetTransformInfoImplCRTPBaseINS
_10AVRTTIImplEED2Ev[_ZN4llvm31TargetTransformInfoImplCRTPBaseINS_10AVRTT
IImplEED5Ev]+0x13): undefined reference to `llvm::TargetTransformInfoImp
lBase::~TargetTransformInfoImplBase()'
Add missing dependencies to CMakeLists.txt.
The end register of the tuple shall be below the last existing
register. The check does not work on something like {v[255:256]}.
Overall it works correctly because if fails later at the
getMatchingSuperReg() call.
Fixes#152754
- Fixes the ArgOperand index in `DXILOpLowering.cpp` used to obtain the
pointer operand of a lifetime intrinsic.
- Updates the tests
`llvm/test/CodeGen/DirectX/legalize-lifetimes-valver-1.5.ll`,
`llvm/test/CodeGen/DirectX/legalize-lifetimes-valver-1.6.ll`,
`llvm/test/CodeGen/DirectX/ShaderFlags/lifetimes-noint64op.ll`, and
`llvm/test/tools/dxil-dis/lifetimes.ll` to use the new size-less
lifetime intrinsic
- Removes lifetime intrinsics from the test
`llvm/test/CodeGen/DirectX/legalize-memset.ll` to be consistent with the
corresponding memcpy test which does not have lifetime intrinsics.
(Removal of lifetime intrinsics from tests like this was suggested here
in the past:
https://github.com/llvm/llvm-project/pull/139173#discussion_r2091778868)
- Rewrites the lifetime legalization functions in the EmbedDXILPass to
re-add the explicit size argument for DXIL
This PR adds hardware-measured latencies for all instructions defined in
Section 12 of the RVV specification: "Vector Fixed-Point Arithmetic
Instructions" to the SpacemiT-X60 scheduling model.
This extracts the code for modelling a fp16 operation as
`fptrunc(fpop(fpext,fpext))` into a new function named
getFP16BF16PromoteCost so that it can be reused by the
arithmetic instructions. The function takes a lambda to
calculate the cost of the operation with the promoted type.
The `RPTarget`'s way of determining whether VGPRs are beneficial to save
and whether the target has been reached w.r.t. VGPR usage currently
assumes, if `CombinedVGPRSavings` is true, that free slots in one VGPR
RC can always be used for the other. Implicitly, this makes the
rematerialization stage (only current user of `RPTarget`) follow a
different occupancy calculation than the "regular one" that the
scheduler uses, one that assumes that ArchVGPR/AGPR usage can be
balanced perfectly and at no cost, which is untrue in general. This
ultimately yields suboptimal rematerialization decisions that require
cross-VGPR-RC copies unnecessarily.
This fixes that, making the `RPTarget`'s internal model of occupancy
consistent with the regular one. The `CombinedVGPRSavings` flag is
removed, and a form of cross-VGPR-RC saving implemented only for unified
RFs, which is where it makes the most sense. Only when the amount of
free VGPRs in a given VGPR RC (ArchVPGR or AGPR) is lower than the
excess VGPR usage in the other VGPR RC does the `RPTarget` consider that
a pressure reduction in the former will be beneficial to the latter.
When computing the number of registers required by entry functions, the
`AMDGPUAsmPrinter` needs to take into account both the register usage
computed by the `AMDGPUResourceUsageAnalysis` pass, and the number
of registers initialized by the hardware. At the moment, the way it
computes the latter is different for graphics vs compute, due to differences in
the implementation. For kernels, all the information needed is available in
the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over
the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much
repeats some of the logic from instruction selection.
This patch introduces 2 new members to `SIMachineFunctionInfo`, one
for SGPRs and one for VGPRs. Both will be computed during instruction
selection and then used during `AMDGPUAsmPrinter`, removing the need
to refer to the `Function` when printing assembly.
This patch is NFC except for the fact that we now add the extra SGPRs
(VCC, XNACK etc) to the number of SGPRs computed for graphics entry points.
I'm not sure why these weren't included before. It would be nice if
someone could confirm if that was just an oversight or if we have some docs
somewhere that I haven't managed to find. Only one test is affected (its SGPR
usage increases because we now take into account the XNACK registers).
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
This reverts https://github.com/llvm/llvm-project/pull/132976.
The PR incorrectly claimed that this makes inlining more liberal,
referencing the string comparison in TargetTransformInfoImpl.h.
However, the implementation that actually applies is the one in
BasicTTIImpl.h, which performs a feature subset comparison. As such,
this regressed inlining, most concerningly of functions without +vector
into functions with +vector.
Revert the change to restore the previous behavior.
If a copy exists between creation of a crbit and a spill, machine-cp
may delete the copy since it seems unaware of the relation between a cr
and crbit. A fix was previously made for the generic ppc64 lowering. It
should be applied to the pwr9 and pwr10 variants too.
Likewise, relax and extend the pwr8 test to verify pwr9 and pwr10
codegen too.
This fixes#143989.
This prevents cases where some of the operands match from hitting
verifier errors with kill flags. These nodes should have been removed
earlier in most cases.
Fixes the direct issue from #149380. #151855 cleans up the codegen.
We have been tracking the performance of EVL tail folding in the loop
vectorizer on RISC-V for a while now, and after much hard work from
various contributors we think it should be generally profitable to
enable by default now.
With tail folding there is a 21% improvement on 525.x264_r on SPEC CPU
2017 on the BPI-F3 (-march=rva22u64_v -O3 -flto), as well as a 30%
geomean codesize reduction on SPEC and TSVC, with no significant
regressions detected.
Now that we are early into the LLVM 22.x development cycle it seems like
a good time to enable it to catch any issues. There are still more EVL
related items of work being tracked in #123069, which should continue to
improve performance.
Each section now tracks the index of the first linker-relaxable
fragment, enabling two changes:
* Delete redundant ALIGN relocations before the first linker-relaxable
instruction in a section. The primary example is the offset 0
R_RISCV_ALIGN relocation for a text section aligned by 4.
* For alignments larger than the NOP size after the first
linker-relaxable instruction, ALIGN relocations are now generated, even in
norelax regions. This fixes the issue #150159.
The new test llvm/test/MC/RISCV/Relocations/align-after-relax.s
verifies the required ALIGN in a norelax region following
linker-relaxable instructions.
By using a fragment index within the subsection (which is less than or
equal to the section's index), the implementation may generate redundant
ALIGN relocations in lower-numbered subsections before the first
linker-relaxable instruction.
align-option-relax.s demonstrates the ALIGN optimization.
Add an initial `call` to a few tests to prevent the ALIGN optimization.
---
When the alignment exceeds 2, we insert $alignment-2 bytes of NOPs, even
in non-RVC code. This enables non-RVC code following RVC code to handle
a 2-byte adjustment without requiring an additional state in MCSection
or AsmParser.
```
.globl _start
_start:
// GNU ld can relax this to 6505 lui a0, 0x1
// LLD hasn't implemented this transformation.
lui a0, %hi(foo)
.option push
.option norelax
.option norvc
// Now we generate R_RISCV_ALIGN with addend 2, even if this is a norvc region.
.balign 4
b0:
.word 0x3a393837
.option pop
foo:
```
Pull Request: https://github.com/llvm/llvm-project/pull/150816
Sec. 4.6.7.1 of the gfx1250 SPG states that if an SGPR is used
as an operand, only one SGPR will be read for both the low and high
operations. As a result, the corresponding bits in `op_sel` and
`op_sel_hi` must be the same when the operand is an SGPR.
Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>
Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>
Change from GFX12: Relax S_CLAUSE rules to all all non-flat memory types
in
the same clause, and all Flat types in the same.
For VMEM/FLAT clause types now look like:
- Non-Flat (load, store, atomic): buffer, global, scratch, TDM, Async
- Flat: load, store, atomic
I fixed support for varargs functions
(previously it didn't crash but the codegen was incorrect).
I added tests for structs and unions which already work. With the
multivalue abi they crash in the backend, so I added a sema check that
rejects structs and unions for that abi.
It will also crash in the backend if passed an int128 or float128 type.