- Moved existing llvm/test/CodeGen/X86/powi.ll file to
llvm/test/CodeGen/X86/powi-const.ll.
- Added new testcases for powi into llvm/test/CodeGen/X86/powi.ll.
The original code is essentially performing isel during legalisation
with the AArch64 specific nodes offering no additional value compared to
ISD::SETCC.
Resolves#99221
Key points: For SPIRV backend, it decompose into a `dot` followed a
`add`.
- [x] Implement dot2add clang builtin,
- [x] Link dot2add clang builtin with hlsl_intrinsics.h
- [x] Add sema checks for dot2add to CheckHLSLBuiltinFunctionCall in
SemaHLSL.cpp
- [x] Add codegen for dot2add to EmitHLSLBuiltinExpr in CGBuiltin.cpp
- [x] Add codegen tests to clang/test/CodeGenHLSL/builtins/dot2add.hlsl
- [x] Add sema tests to clang/test/SemaHLSL/BuiltIns/dot2add-errors.hlsl
- [x] Create the int_dx_dot2add intrinsic in IntrinsicsDirectX.td
- [x] Create the DXILOpMapping of int_dx_dot2add to 162 in DXIL.td
- [x] Create the dot2add.ll and dot2add_errors.ll tests in
llvm/test/CodeGen/DirectX/
ISD::ADDC, ISD::ADDE, ISD::SUBC and ISD::SUBE are being deprecated,
using ISD::UADDO_CARRY,ISD::USUBO_CARRY instead. Lowering the UADDO,
UADDO_CARRY, USUBO, USUBO_CARRY in the patch.
With AVX512VL targets, use 128/256-bit VPERMV/VPERMV3 nodes when we only need the lower elements.
Reapplied version of #133923 with fix for typo in the VPERMV3 mask adjustment
This is a follow up PR from
https://github.com/llvm/llvm-project/pull/132089.
When a V2S copy and its useMI are lowered to VALU, this patch check:
If the generated new VALU is a true16 inst. Add subreg access on all
operands if necessary.
an example MIR looks like:
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:sreg_32 = COPY %1:vgpr_32
%3:sreg_32 = S_FLOOR_F16 %2:sreg_32, ...
```
currently lowered to
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1:vgpr_32, 0, 0, 0 ...
```
after this patch
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1.lo16:vgpr_32, 0, 0, 0 ...
```
This PR updates AMDGPULowerBufferFatPointers to use the
InstSimplifyFolder
when creating IR during buffer fat pointer lowering.
This shouldn't cause any large functional changes and might improve the
quality of the generated code.
The pseudo-instruction LCMPXCHG16B_SAVE_RBX is used when RBX serves as
frame base pointer. At a very late stage it is then translated into a
regular LCMPXCHG16B, preceded by copying the actual argument into RBX,
and followed by restoring the register to the base pointer.
However, in case the `cmpxchg` operates on a local variable, RBX might
also be used as a base for the memory operand in frame finalization, and
we've overwritten RBX with the input operand for `cmpxchg16b`. So we
have to rewrite the memory operand base to use the saved value of RBX.
Fixes#119959.
Adds code to expand the `llvm.spv.resource.handlefrombinding` and
`llvm.spv.resource.getpointer` when the resource type is
`spirv.VulkanBuffer`.
It gets expanded as a storage buffer or uniform buffer denpending on the
storage class used.
This is implementing part of
https://github.com/llvm/wg-hlsl/blob/main/proposals/0018-spirv-resource-representation.md.
Several implementations have zero-latency instructions to zero
registers. To-date no implementation has a dedicated SVE instruction but
we can use the NEON equivalent because it is defined to zero bits
128..VL regardless of the immediate used.
NOTE: The relevant instruction is not available in streaming mode, where
the original SVE DUP instruction remains in use.
This adds a call to SimplifyDemandedBits from bitcasts with scalar input
types in SimplifyDemandedVectorElts, which can help simplify the input
scalar.
Some new registers are reused when replacing some old ones in
certain use case of ModuloScheduleExpander. It is necessary to
avoid repeated interval calculations for these registers.
llc attempts to create an empty file in the current directory, but it
can't do that on a read-only file system. Send that empty-output to
stdout, which prevents this failure.
With -fpatchable-function-entry (or the patchable_function_entry
function attribute), we emit records of patchable entry locations to the
__patchable_function_entries section. Add an additional parameter to the
command line option that allows one to specify a different default
section name for the records, and an identical parameter to the function
attribute that allows one to override the section used.
The main use case for this change is the Linux kernel using prefix NOPs
for ftrace, and thus depending on__patchable_function_entries to locate
traceable functions. Functions that are not traceable currently disable
entry NOPs using the function attribute, but this creates a
compatibility issue with -fsanitize=kcfi, which expects all indirectly
callable functions to have a type hash prefix at the same offset from
the function entry.
Adding a section parameter would allow the kernel to distinguish between
traceable and non-traceable functions by adding entry records to
separate sections while maintaining a stable function prefix layout for
all functions. LKML discussion:
https://lore.kernel.org/lkml/Y1QEzk%2FA41PKLEPe@hirez.programming.kicks-ass.net/
We haven't implemented 16 bit SGPRs. Currently allow 32-bit SGPRs to be
folded into True16 bit instructions taking 16 bit values. Also use
sgpr_32 when Imm is copied to spgr_lo16 so it could be further folded.
This improves generated code quality.
Currently iterators over EquivalenceClasses will iterate over std::set,
which guarantees the order specified by the comperator. Unfortunately in
many cases, EquivalenceClasses are used with pointers, so iterating over
std::set of pointers will not be deterministic across runs.
There are multiple places that explicitly try to sort the equivalence
classes before using them to try to get a deterministic order
(LowerTypeTests, SplitModule), but there are others that do not at the
moment and this can result at least in non-determinstic value naming in
Float2Int.
This patch updates EquivalenceClasses to keep track of all members via a
extra SmallVector and removes code from LowerTypeTests and SplitModule
to sort the classes before processing.
Overall it looks like compile-time slightly decreases in most cases, but
close to noise:
https://llvm-compile-time-tracker.com/compare.php?from=7d441d9892295a6eb8aaf481e1715f039f6f224f&to=b0c2ac67a88d3ef86987e2f82115ea0170675a17&stat=instructions
PR: https://github.com/llvm/llvm-project/pull/134075
This patch introduces the `vmem-to-lds-load-insts` target feature, which
can be used to enable builtins `__builtin_amdgcn_global_load_lds` and
`__builtin_amdgcn_raw_ptr_buffer_load_lds` on platforms which have this
feature.
This feature is only available on gfx9/10.
A limitation of using a common target feature for both builtins is that
we could have made `__builtin_amdgcn_raw_ptr_buffer_load_lds` available
on gfx6,7,8.
Previously we only marked fixed length vector extracts as cheap, so this
extends it to any extract at index 0 which should just be a subreg
extract.
This allows extracts of i1 vectors to be considered for DAG combines,
but also scalable vectors too.
This causes some slight improvements with large legalized fixed-length
vectors, but the underlying motiviation for this is to actually prevent
an unprofitable DAG combine on a scalable vector in an upcoming patch.
Fixes#130510.
In RISCV, modify the folding of (X ^ Y == 0) -> (X == Y) to account for
cases where the (X ^ Y) will be re-used.
If a constant is being used for the XOR before a branch, ensure that it
is small enough to fit within a 12-bit immediate field. Otherwise, the
equality check is more efficient than the check against 0, see the
following:
```
# %bb.0:
lui a1, 5
addiw a1, a1, 1365
xor a0, a0, a1
beqz a0, .LBB0_2
# %bb.1:
ret
.LBB0_2:
```
```
# %bb.0:
lui a1, 5
addiw a1, a1, 1365
beq a0, a1, .LBB0_2
# %bb.1:
xor a0, a0, a1
ret
.LBB0_2:
```
Similarly, if the XOR is between 1 and a size one integer, we should
still fold away the XOR since that comparison can be optimized as a
comparison against 0.
```
# %bb.0:
slt a0, a0, a1
xor a0, a0, 1
beqz a0, .LBB0_2
# %bb.1:
ret
.LBB0_2:
```
```
# %bb.0:
slt a0, a0, a1
bnez a0, .LBB0_2
# %bb.1:
xor a0, a0, 1
ret
.LBB0_2:
```
One question about my code is that I used a hard-coded value for the
width of a RISCV ALU immediate. Do you know of a way that I can gather
this from the `context`, I was unable to devise one.
The llvm.metadata section is not emitted and has special semantics. We
should not merge globals in it, similarly to how we already skip merging
of `llvm.xyz` globals.
Fixes https://github.com/llvm/llvm-project/issues/131394.
Cleanup lowering of atomic instructions and intrninsics. The TableGen
changes are primarily a refactor, though sub variants are now lowered
via operation legalization, potentially allowing for more DAG
optimization.
There are V2S copies between vpgr16 and spgr32 in true16 mode. This is
caused by vgpr16 and sgpr32 both selectable by 16bit src in ISel.
When a V2S copy and its useMI are lowered to VALU, this patch check
1. If the generated new VALU is used by a true16 inst. Add subreg access
if necessary.
2. Legalize the V2S copy by replacing it to subreg_to_reg
an example MIR looks like:
```
%2:sgpr_32 = COPY %1:vgpr_16
%3:sgpr_32 = S_OR_B32 %2:sgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3:sgpr_32, ...
```
currently lowered to
```
%2:vgpr_32 = COPY %1:vgpr_16
%3:vgpr_32 = V_OR_B32 %2:vgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3:vgpr_32, ...
```
after this patch
```
%2:vgpr_32 = SUBREG_TO_REG 0, %1:vgpr_16, lo16
%3:vgpr_32 = V_OR_B32 %2:vgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3.lo16:vgpr_32, ...
```
Implement all single-multi {BF/F/S/U/SU/US}MOP4{A/S} instructions in
clang and llvm following the acle in
https://github.com/ARM-software/acle/pull/381/files.
This PR depends on https://github.com/llvm/llvm-project/pull/127797
This patch updates the semantics of template arguments in intrinsic
names for clarity and ease of use. Previously, template argument numbers
indicated which character in the prototype string determined the final
type suffix, which was confusing—especially for intrinsics using
multiple prototype modifiers per operand (e.g., intrinsics operating on
arrays of vectors). The number had to reference the correct character in
the prototype (e.g., the ‘u’ in “2.u”), making the system cumbersome and
error-prone.
With this patch, template argument numbers now refer to the operand
number that determines the final type suffix, providing a more intuitive
and consistent approach.
During the transition from debug intrinsics to debug records, we used
several different command line options to customise handling: the
printing of debug records to bitcode and textual could be independent of
how the debug-info was represented inside a module, whether the
autoupgrader ran could be customised. This was all valuable during
development, but now that totally removing debug intrinsics is coming
up, this patch removes those options in favour of a single flag
(experimental-debuginfo-iterators), which enables autoupgrade, in-memory
debug records, and debug record printing to bitcode and textual IR.
We need to do this ahead of removing the
experimental-debuginfo-iterators flag, to reduce the amount of
test-juggling that happens at that time.
There are quite a number of weird test behaviours related to this --
some of which I simply delete in this commit. Things like
print-non-instruction-debug-info.ll , the test suite now checks for
debug records in all tests, and we don't want to check we can print as
intrinsics. Or the update_test_checks tests -- these are duplicated with
write-experimental-debuginfo=false to ensure file writing for intrinsics
is correct, but that's something we're imminently going to delete.
A short survey of curious test changes:
* free-intrinsics.ll: we don't need to test that debug-info is a zero
cost intrinsic, because we won't be using intrinsics in the future.
* undef-dbg-val.ll: apparently we pinned this to non-RemoveDIs in-memory
mode while we sorted something out; it works now either way.
* salvage-cast-debug-info.ll: was testing intrinsics-in-memory get
salvaged, isn't necessary now
* localize-constexpr-debuginfo.ll: was producing "dead metadata"
intrinsics for optimised-out variable values, dbg-records takes the
(correct) representation of poison/undef as an operand. Looks like we
didn't update this in the past to avoid spurious test differences.
* Transforms/Scalarizer/dbginfo.ll: this test was explicitly testing
that debug-info affected codegen, and we deferred updating the tests
until now. This is just one of those silent gnochange issues that get
fixed by RemoveDIs.
Finally: I've added a bitcode test, dbg-intrinsics-autoupgrade.ll.bc,
that checks we can autoupgrade debug intrinsics that are in bitcode into
the new debug records.
If the 512-bit unary shuffle is a concatenation of 128/256-bit subvectors then we're better off using a X86ISD::SHUF128 node so we can fold the concatenation into the shuffle as well.
Bug relates to `early-if-predicator` and `early-ifcvt` passes. If
virtual register has "killed" flag in both basic blocks to be merged
into head, both instructions in head basic block will have "killed" flag
for this register. It makes MIR incorrect.
Example:
```
bb.0: ; if
...
%0:intregs = COPY $r0
J2_jumpf %2, %bb.2, implicit-def dead $pc
J2_jump %bb.1, implicit-def dead $pc
bb.1: ; if.then
...
S4_storeiri_io killed %0, 0, 1
J2_jump %bb.3, implicit-def dead $pc
bb.2: ; if.else
...
S4_storeiri_io killed %0, 0, 1
J2_jump %bb.3, implicit-def dead $pc
```
After early-if-predicator will become:
```
bb.0:
%0:intregs = COPY $r0
S4_storeirif_io %1, killed %0, 0, 1
S4_storeirit_io %1, killed %0, 0, 1
```
Having `killed` flag set twice in bb.0 for `%0` is an incorrect MIR.