This introduces the DXILMemIntrinsics pass and moves memset and memcpy
handling from DXILLegalize to here. We need to do this so that we can
handle memory intrinsics before the DXILResourceAccess pass so that we
can properly deal with arrays and large structures in resources.
BUILD_VECTOR is combined to SPLAT_VECTOR if the operation action of
SPLAT_VECTOR is not Expand. However, we already have custom handling of
BUILD_VECTOR for fixed-length vectors, which uses an explicit constant VL
instead of the VLMAX it would get if lowered through SPLAT_VECTOR.
This change adds full support for the PTX `barrier.cta.red` instruction,
following the same conventions as are already used for
`barrier.cta.sync` and `barrier.cta.arrive`.
In addition this MR removes the following intrinsics which are no longer
needed:
* llvm.nvvm.barrier0.popc -->
llvm.nvvm.barrier.cta.red.popc.aligned.all(0, c)
* llvm.nvvm.barrier0.and -->
llvm.nvvm.barrier.cta.red.and.aligned.all(0, z)
* llvm.nvvm.barrier0.or -->
llvm.nvvm.barrier.cta.red.or.aligned.all(0, z)
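As a rough model of what the three reductions compute, the sketch below simulates one CTA in plain Python: every thread contributes a predicate at the barrier and all threads receive the combined result. The function name `barrier_red` and the list-of-predicates representation are illustrative assumptions, not part of LLVM or PTX.

```python
def barrier_red(op, predicates):
    """Model a barrier reduction over one predicate per participating thread."""
    preds = [bool(p) for p in predicates]
    if op == "popc":
        # Number of threads whose predicate is true.
        return sum(preds)
    if op == "and":
        # True only if every thread's predicate is true.
        return all(preds)
    if op == "or":
        # True if at least one thread's predicate is true.
        return any(preds)
    raise ValueError(f"unknown reduction: {op}")


thread_preds = [1, 0, 1, 1]
popc_result = barrier_red("popc", thread_preds)
and_result = barrier_red("and", thread_preds)
or_result = barrier_red("or", thread_preds)
```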
Currently, the register coalescer may try to commute an instruction
like:
```
%0.sub_lo32:gpr64 = AND %0.sub_lo32:gpr64(tied-def 0), %1.sub_lo32:gpr64
USE %0:gpr64
```
resulting in:
```
%1.sub_lo32:gpr64 = AND %1.sub_lo32:gpr64(tied-def 0), %0.sub_lo32:gpr64
USE %1:gpr64
```
However, this is not correct if the instruction doesn't define the
entire register, as the value of the upper 32-bits
of the register used in `USE` will not be the same.
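The problem can be reproduced with a small bit-level model. This is only a sketch: 64-bit registers are modeled as Python integers, and `and_lo32` is a hypothetical helper that mimics a tied sub-register AND writing only the low 32 bits of its destination.

```python
MASK32 = 0xFFFF_FFFF
HI_MASK = 0xFFFF_FFFF_0000_0000


def and_lo32(dst, src):
    """Model `dst.sub_lo32 = AND dst.sub_lo32, src.sub_lo32`.

    Only the low 32 bits of dst are written; the high 32 bits are preserved.
    """
    return (dst & HI_MASK) | ((dst & src) & MASK32)


r0 = 0xAAAA_AAAA_0000_FFFF
r1 = 0xBBBB_BBBB_0000_0F0F

original = and_lo32(r0, r1)  # value of %0 seen by USE before commuting
commuted = and_lo32(r1, r0)  # value of %1 seen by USE after commuting
```

The low halves agree, but the high half read by `USE` comes from `%0` in one case and `%1` in the other, so the two forms are only interchangeable when the instruction defines the whole register.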
IFUNCs require loader support, so for arbitrary environments, the safe
assumption is that they are not supported. In particular,
aarch64-linux-pauthtest may be used with musl, and was wrongly detected
as supporting IFUNCs.
With IFUNC support now being detected more reliably, this also removes
the check for PAuth support. If both are supported, either would work.
Instead of trying to precalculate GEP offsets ahead of time and then
process resource accesses based off of these offsets, traverse the GEP
chain inline for each access. This makes it easier to get the types
correct when translating GEPs for cbuffer and structured buffer
accesses, which in turn lets us access individual elements of those
structures directly.
Fixes #160208, #164517, and #169430
Refines OpName emission to only target Global Variables, Functions,
Function Parameters, Local Variables (allocas/phis), and Basic Blocks.
This reduces binary size and clutter by avoiding OpName for every
intermediate instruction (arithmetic, casts, etc.), while preserving
readability for interfaces and program structure.
Also updates the test suite to align with this change:
- Removes OpName checks for intermediate instructions.
- Adds side-effects (e.g., volatile stores) to tests where instructions
were previously kept alive solely by their OpName usage.
- Updates checks to use generic ID matching where specific names are no
longer available.
- Adds debug-info/opname-filtering.ll to verify the new policy.
Otherwise we were seeing "unsupported relocation" errors when
referencing a small symbol under the large code model.
This regresses some cases where a large function references a small
global (e.g. relocimm-code-model.ll), but that's probably not super
important.
SelectionDAG offered no way to widen TRUNCATE for pathological types
like <vscale x 1 x ...>, as they do not allow scalarisation.
One way forward is to widen to an intermediate type, which allows the
element type to be promoted in a later run of legalisation.
When a merge-like instruction has all readanylane sources and the result
is copied to VGPRs, eliminate the readanylanes by either using the
original unmerge source directly or building a new merge with the VGPR
sources.
Currently, we assign the same scheduling info to COPY regardless of
whether it's a scalar or vector one. But this might cause vector COPY
from physical registers to be scheduled too close to its consumer,
prolonging the physical register live range and running out of registers
during RA, as seen in #167008.
This patch addresses this issue by creating schedule variants for COPY
instructions of vector register classes so that they can have the same
latency as simple vector arithmetic (WriteVIALUV). It is worth noting
that we _only_ need latency in this case -- keeping processor resources
in (vector) COPYs still causes the aforementioned register shortage
issue, because these COPYs might then be blocked by structural hazards
and, again, get sunk further down than we want.
With recent refactoring, LDS promotion worklists for all allocas are
populated upfront. In some cases, this results in a User in multiple
lists. Then as each list is processed, a User might get deleted via
removeFromParent, potentially leaving a dangling pointer in a subsequent
worklist.
Currently this only occurs for memcpy and memmove. Prior to refactoring,
these were handled by DeferredInstr, and were processed after the last
use of the then singular worklist.
This change moves processing of DeferredInstr to after all worklists
have been processed.
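The ordering issue can be sketched outside of LLVM. In the toy model below, the `User` class, its `erased` flag, and `process` are all hypothetical stand-ins: `erased` plays the role of a pointer that would dangle after `removeFromParent`, and `deferred` marks the memcpy/memmove-like users.

```python
class User:
    def __init__(self, name, deferred=False):
        self.name = name
        self.deferred = deferred  # handled via the deferred list
        self.erased = False       # stands in for a dangling pointer

    def erase(self):
        self.erased = True


def process(worklists, defer_until_end):
    """Return the names of users that were visited after being erased."""
    stale = []
    deferred = []
    for worklist in worklists:
        local_deferred = []
        for user in worklist:
            if user.erased:
                stale.append(user.name)  # would be a use-after-free
                continue
            if user.deferred:
                local_deferred.append(user)
        if defer_until_end:
            deferred.extend(local_deferred)
        else:
            # Old behaviour in this model: erase right after each worklist.
            for user in local_deferred:
                user.erase()
    # New behaviour: erase only once every worklist has been walked.
    for user in deferred:
        if not user.erased:
            user.erase()
    return stale


def make_worklists():
    memcpy = User("memcpy", deferred=True)
    return [[User("load"), memcpy], [memcpy, User("store")]]


stale_eager = process(make_worklists(), defer_until_end=False)
stale_deferred = process(make_worklists(), defer_until_end=True)
```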
I recently observed that LLVM generates the following code:
```
addi a1, a0, -1
sltu a0, a0, a1
addi a0, a0, -1
and a0, a0, a1
ret
```
This could be optimized using the snez instruction instead.
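To check the claim, here is a small simulation of the sequence above on 64-bit values, next to one possible `snez`-based replacement. The two-instruction `snez`/`sub` form is an assumption about what the optimized code could look like (it is not quoted from the patch), and RV64 is assumed for the register width.

```python
MASK = (1 << 64) - 1  # assumes 64-bit registers (RV64)


def current_sequence(a0):
    a1 = (a0 - 1) & MASK   # addi a1, a0, -1
    a0 = int(a0 < a1)      # sltu a0, a0, a1
    a0 = (a0 - 1) & MASK   # addi a0, a0, -1
    return a0 & a1         # and  a0, a0, a1


def snez_sequence(a0):
    a1 = int(a0 != 0)        # snez a1, a0
    return (a0 - a1) & MASK  # sub  a0, a0, a1


samples = [0, 1, 2, 5, 1 << 63, MASK]
results = [(current_sequence(x), snez_sequence(x)) for x in samples]
```

Both versions compute an unsigned saturating decrement: 0 stays 0 and every other value becomes `x - 1`.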
Tests have to perform an additional FADD to prevent
combineConcatVectorOfCasts from performing the fold - we're trying to
show when this fails to occur during a combineConcatVectorOps recursion.
Interestingly, due to uitofp expansion AVX1/2 is often managing to
concat where AVX512 can't.
This occurs after type legalization, so the index type can be i32 or
i64. This patch simplifies the matching and checks for the optional zero
extend.
Also, a few tests from when this fold was added had broken due to
incorrectly adding `nuw` to the `add <eltCount>, #-1`, which this patch
corrects.
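For reference, `nuw` asserts that the addition does not wrap when both operands are treated as unsigned. Since -1 is the all-ones bit pattern, adding it wraps for every nonzero element count. A quick check, using a hypothetical helper `add_wraps_unsigned`:

```python
def add_wraps_unsigned(x, y, bits):
    """True if x + y does not fit in `bits` bits when both are unsigned."""
    modulus = 1 << bits
    return (x % modulus) + (y % modulus) >= modulus


# For each width, does `add eltCount, -1` wrap for these element counts?
wraps = {
    bits: [add_wraps_unsigned(n, -1, bits) for n in (0, 1, 2, 16)]
    for bits in (32, 64)
}
```

Only an element count of 0 avoids the unsigned wrap, so marking the add `nuw` would make the result poison for every realistic count.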
The pass now contains a non-fp expansion and should
be used for any similar expansions regardless of the
types involved. Hence a generic name seems apt.
Rename the source files, pass, and adjust the pass
description. Move all tests for the expansions
that have previously been merged into the pass
to a single directory.
Implements initial support for vk::push_constant.
As is, this allows handling simple push constants, but has one
main issue: layout can be incorrect (see #168401). Since the layout
issue is not specific to push constants, it is ignored in this PR.
The frontend part of the implementation is straightforward:
- adding a new attribute
- when targeting vulkan/spirv, we process it
- global variables with this attribute get a new AS:
hlsl_push_constant
The IR has nothing specific, only some RO globals in this new AS.
On the SPIR-V side, we now convert this AS into a PushConstant storage
class. But this creates some issues: the variables in this storage class
must have a specific set of decorations to define their layout.
Current infra to create the SPIR-V types lacks the context required to
make this decision: no indication on the AS or context around the type
being created. Refactoring this would be a heavy task as it would
require getting this information in every place using the GR for type
creation.
Instead, we do something similar to CBuffers:
- find all globals with this address space, and change their type to
a target-specific type.
- insert a new intrinsic in place of every reference to this global
variable.
This allows the backend to handle both layout variables loads and type
lowering independently.
Type lowering has nothing specific: when we encounter a target extension
type with spirv.PushConstant, we lower this to the correct SPIR-V type
with the proper offset & block decorations.
As for the intrinsic, it's mostly a no-op, but required since we have
this target-specific type.
Note: this implementation prevents the static declaration of multiple
push constants in a single shader module. The actual specification is
more relaxed: there can be only one **used** push constant block per
entrypoint. To correctly implement this, we'd need to keep some
additional state to determine the list of statically used resources per
entrypoint. This shall be addressed as a follow-up (see #170310).
This removes the use of `LowerVECTOR_COMPRESS` in `ReplaceNodeResults`
(which was used to promote illegal integer VTs), and instead only marks
the legal VTs as "Custom" (allowing for standard type legalization).
This patch also simplifies the lowering by using the existing
fixed-length <-> SVE conversion helpers.
This was intended to be an NFC, but it appears to have caused some minor
code-gen changes/improvements.
The crash happens because the cast in `Mask =
cast<ShuffleVectorSDNode>(Res)->getMask();` fails for node `t197: v16i8
= vector_shuffle<16,17,18,19,4,5,6,7,8,9,10,11,u,u,u,u> t196, t196`.
Here both `LHS` and `RHS` are the same node, so
`DAG.getCommutedVectorShuffle` doesn't return a `ShuffleVectorSDNode`
and the cast crashes. The fix is to add a check before the cast is
performed.
Closes https://github.com/llvm/llvm-project/issues/172265
Both passes expand instructions at the IR level.
They use the same kind of instruction visitation
logic and contain significant code duplication e.g.
for scalarization.
The passthru operand is a tuple. We need to extract the correct field
vector from it.
Existing tests only handled the undef passthru case, which accidentally
worked, possibly due to IMPLICIT_DEF being converted to noreg.
Fixes #172628.
The enablePExtCodeGen was only intended to block vector code while
it is still in development. This code uses scalar types so we only
need to check for the extension.
This patch supports both scalable and fixed-length vectors.
It also enables fsetcc pattern match for zvfbfa to make fminimum and
fmaximum work correctly.
This PR implements emission of the post-link CFG information in the PGO
analysis map, as explained in the
[RFC](https://discourse.llvm.org/t/rfc-extending-the-pgo-analysis-map-with-propeller-cfg-frequencies/88617).
This is enabled by a flag `pgo-analysis-map-emit-bb-sections-cfg`.
This PR bumps the SHT_LLVM_BB_ADDR_MAP version to 5.
Also includes some refactoring changes related to storing the CFG in the
Basic block sections profile reader.
Fixes https://github.com/llvm/llvm-project/issues/98389
As the issue describes, promoting `llvm.fma.f16` to `llvm.fma.f32` does
not work, because there is not enough precision to handle the repeated
rounding. `f64` does have sufficient space. So this PR explicitly
promotes the 16-bit fma to a 64-bit fma.
I could not find examples of a libcall being used for fma, but that's
something that could be looked into separately to work around code size
issues.
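The double-rounding problem can be demonstrated with the standard library alone. This is a sketch under stated assumptions: the inputs below are hand-picked for illustration (they are not taken from the linked issue), and `f16`/`f32` are hypothetical helpers that round a Python float to half or single precision through `struct`.

```python
import struct
from fractions import Fraction


def _round_trip(fmt, x):
    # Packing rounds the double to the target format; unpacking widens it back.
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def f16(x):
    return _round_trip("<e", x)


def f32(x):
    return _round_trip("<f", x)


# All three inputs are exactly representable in f16.
a, b, c = 3.0, 341.5, 2.0 ** -14

# Exact result: 1024.5 + 2**-14, just above the f16 halfway point 1024.5.
exact = Fraction(a) * Fraction(b) + Fraction(c)

# Promote to f32: the add rounds to 1024.5 (a tie in f32), and the final
# rounding to f16 then ties to even, giving 1024.
fma_via_f32 = f16(f32(f32(a * b) + c))

# Promote to f64: the sum is exact, so only one rounding happens.
fma_via_f64 = f16(a * b + c)
```

For these inputs the f32 path returns 1024.0 while the f64 path returns 1025.0, which is the correctly rounded result.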
This patch improves the legalization of vector operations, particularly
focusing on vectors that exceed the maximum supported size (e.g., 4
elements for shaders). This includes better handling for insert and
extract element operations, which facilitates the legalization of loads
and stores for long vectors, a common pattern when compiling HLSL
matrices with Clang.
Key changes include:
- Adding legalization rules for G_FMA, G_INSERT_VECTOR_ELT, and various
arithmetic operations to handle splitting of large vectors.
- Updating G_CONCAT_VECTORS and G_SPLAT_VECTOR to be legal for allowed
types.
- Implementing custom legalization for G_INSERT_VECTOR_ELT using the
spv_insertelt intrinsic.
- Enhancing SPIRVPostLegalizer to deduce types for arithmetic
instructions and vector element intrinsics (spv_insertelt,
spv_extractelt).
- Refactoring legalizeIntrinsic to uniformly handle vector legalization
requirements.
The strategy for insert and extract operations mirrors that of bitcasts:
incoming intrinsics are converted to generic MIR instructions
(G_INSERT_VECTOR_ELT and G_EXTRACT_VECTOR_ELT) to leverage standard
legalization rules (like splitting). After legalization, they are
converted back to their respective SPIR-V intrinsics (spv_insertelt,
spv_extractelt) because later passes in the backend expect these
intrinsics rather than the generic instructions.
This ensures that operations on large vectors (e.g., <16 x float>) are
correctly broken down into legal sub-vectors.
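Conceptually, the splitting works like the sketch below, which models vectors as Python lists. `MAX_LANES`, `split`, and `legalize_fma` are hypothetical names used only for illustration; the real legalizer operates on MIR, not on lists.

```python
MAX_LANES = 4  # assumed limit, matching the "4 elements for shaders" example


def split(vec, max_lanes=MAX_LANES):
    """Break a long vector into consecutive sub-vectors of legal size."""
    return [vec[i:i + max_lanes] for i in range(0, len(vec), max_lanes)]


def legalize_fma(a, b, c):
    """Apply a lane-wise multiply-add piecewise, then concatenate the parts."""
    parts = [
        [x * y + z for x, y, z in zip(pa, pb, pc)]
        for pa, pb, pc in zip(split(a), split(b), split(c))
    ]
    return [lane for part in parts for lane in part]


a = list(range(16))
b = [2] * 16
c = [1] * 16

pieces = split(a)
result = legalize_fma(a, b, c)
```

A 16-lane operation becomes four 4-lane operations whose results are concatenated, and the lane-wise values match what the unsplit operation would produce.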