Certain non-standard float types were directly passed through in the
LLVM type converter, resulting in invalid IR or failed assertions:
```
mlir-opt: mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp:638: FailureOr<Type> mlir::LLVMTypeConverter::convertVectorType(VectorType) const: Assertion `LLVM::isCompatibleVectorType(vectorType) && "expected vector type compatible with the LLVM dialect"' failed.
```
The LLVM type converter should not define invalid type conversion rules
for such types. If there is no type conversion rule, conversion patterns
will not apply to ops with such operand types.
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
Add the `convergent` attribute to builtin functions and builtin function
calls when lowering SPIR-V non-uniform group functions to LLVM dialect.
---------
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
This commit is a further incremental step toward moving the whole
mlir-vulkan-runner MLIR pass pipeline into mlir-opt (see #73457). The
previous step was b225b3adf7b78387c9fcb97a3ff0e0a1e26eafe2, which moved
all device passes prior to SPIR-V serialization into a new mlir-opt test
pass, `-test-vulkan-runner-pipeline`.
This commit changes how SPIR-V serialization is accomplished for Vulkan
runner tests. Until now, this was done by the Vulkan-specific
ConvertGpuLaunchFuncToVulkanLaunchFunc pass. With this commit, this
responsibility is removed from that pass, and is instead done with the
existing generic GpuModuleToBinaryPass. In addition, the SPIR-V
serialization step is no longer done inside mlir-vulkan-runner, but
rather inside mlir-opt (in the `-test-vulkan-runner-pipeline` pass).
Both of these changes represent a greater alignment between
mlir-vulkan-runner and the other GPU integration tests. Notably, the IR
shapes produced by the mlir-opt pipelines for the Vulkan and SYCL
runners are now much more similar, with both using a gpu.binary op for
the serialized SPIR-V kernel.
In order to enable this, this commit includes these supporting changes:
- ConvertToSPIRVPass is enhanced to support producing the IR shape where
a spirv.module is nested inside a gpu.module, since this is what
GpuModuleToBinaryPass expects.
- ConvertGPULaunchFuncToVulkanLaunchFunc is changed to remove its SPIR-V
serialization functionality, and instead now extracts the SPIR-V from a
gpu.binary operation (as produced by ConvertToSPIRVPass).
- `-test-vulkan-runner-pipeline` now attaches SPIR-V target information
required by GpuModuleToBinaryPass.
- The WebGPU pass option, which had been removed from mlir-vulkan-runner
in the previous commit in this series, is restored as an option to
`-test-vulkan-runner-pipeline` instead, so that the WebGPU pass
continues being inserted into the pipeline just before SPIR-V
serialization.
This PR adds support to the `bf16` and `i1` data types when converting
`gpu::shuffle` to the `LLVMSPV` dialect, by inserting `bitcast` to/from
`i16` (for `bf16`) and extending/truncating to `i8` (for `i1`).
This PR adds a missing verifier for `tosa.pad`, ensuring that the
padding shape matches [2*rank(shape1)] according to V1.0.0
Specification. Fixes#119840.
This commit add an NVIDIA-specific lowering of `cf.assert` to to
`__assertfail`.
Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and
`getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can
be reused.
This commit updates the internal `ConversionValueMapping` data structure
in the dialect conversion driver to support 1:N replacements. This is
the last major commit for adding 1:N support to the dialect conversion
driver.
Since #116470, the infrastructure already supports 1:N replacements. But
the `ConversionValueMapping` still stored 1:1 value mappings. To that
end, the driver inserted temporary argument materializations (converting
N SSA values into 1 value). This is no longer the case. Argument
materializations are now entirely gone. (They will be deleted from the
type converter after some time, when we delete the old 1:N dialect
conversion driver.)
Note for LLVM integration: Replace all occurrences of
`addArgumentMaterialization` (except for 1:N dialect conversion passes)
with `addSourceMaterialization`.
---------
Co-authored-by: Markus Böck <markus.boeck02@gmail.com>
Switch from rewrite patterns to conversion patterns. This allows to
perform type conversions together with other parts of the IR. For
example, this allows to convert from index to emit.size_t types.
/llvm-project/mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp:51:2:
error: extra ';' outside of a function is incompatible with C++98 [-Werror,-Wc++98-compat-extra-semi]
};
^
/llvm-project/mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp:97:2:
error: extra ';' outside of a function is incompatible with C++98 [-Werror,-Wc++98-compat-extra-semi]
};
^
2 errors generated.
Fix a use-after-return in #117513. Free-standing lambdas should not be
defined inside of the `LLVMTypeConverter` constructor because they go
out of scope.
During a 1:N replacement (`applySignatureConversion` or
`replaceOpWithMultiple`), the dialect conversion driver used to insert
two materializations:
* Argument materialization: convert N replacement values to 1 SSA value
of the original type `S`.
* Target materialization: convert original type to legalized type `T`.
The target materialization is unnecessary. Subsequent patterns receive
the replacement values via their adaptors. These patterns have their own
type converter. When they see a replacement value of type `S`, they will
automatically insert a target materialization to type `T`. There is no
reason to do this already during the 1:N replacement. (The functionality
used to be duplicated in `remapValues` and `insertNTo1Materialization`.)
Special case: If a subsequent pattern does not have a type converter, it
does *not* insert any target materializations. That's because the
absence of a type converter indicates that the pattern does not care
about type legality. Therefore, it is correct to pass an SSA value of
type `S` (or any other type) to the pattern.
Note: Most patterns in `TestPatterns.cpp` run without a type converter.
To make sure that the tests still behave the same, some of these
patterns now have a type converter.
This commit is in preparation of adding 1:N support to the conversion
value mapping. Before making any further changes to the mapping
infrastructure, I'd like to make sure that the code base around it (that
uses the mapping) is robust.
This patch fixes:
mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp:535:13: error:
'applyPatternsAndFoldGreedily' is deprecated: Use
applyPatternsGreedily() instead [-Werror,-Wdeprecated-declarations]
The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior 1) how ops are matched, 2)
folding wherever it can.
These are independent forms of greedy and leads to inefficiency. E.g.,
cases where one need to create different phases in lowering and is
required to applying patterns in specific order split across different
passes. Using the driver one ends up needlessly retrying folding/having
multiple rounds of folding attempts, where one final run would have
sufficed.
Of course folks can locally avoid this behavior by just building their
own, but this is also a common requested feature that folks keep on
working around locally in suboptimal ways.
For downstream users, there should be no behavioral change. Updating
from the deprecated should just be a find and replace (e.g., `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety)
as the API arguments hasn't changed between the two.
1. We can use `getNumElements()` only for memrefs with trivial layout.
2. Buffer ops expecting sizes in i32 but descriptor values can be either
i32 or i64, add appropriate casts. This implementation is not ideal as
it can overflow, but it's still better than generating broken IR.
Do not run `cf-to-llvm` as part of `func-to-llvm`. This commit fixes
https://github.com/llvm/llvm-project/issues/70982.
This commit changes the way how `func.func` ops are lowered to LLVM.
Previously, the signature of the entire region (i.e., entry block and
all other blocks in the `func.func` op) was converted as part of the
`func.func` lowering pattern.
Now, only the entry block is converted. The remaining block signatures
are converted together with `cf.br` and `cf.cond_br` as part of
`cf-to-llvm`. All unstructured control flow is not converted as part of
a single pass (`cf-to-llvm`). `func-to-llvm` no longer deals with
unstructured control flow.
Also add more test cases for control flow dialect ops.
Note: This PR is in preparation of #120431, which adds an additional
GPU-specific lowering for `cf.assert`. This was a problem because
`cf.assert` used to be converted as part of `func-to-llvm`.
Note for LLVM integration: If you see failures, add
`-convert-cf-to-llvm` to your pass pipeline.
Do not run `arith-to-llvm` as part of `func-to-llvm`. This commit partly
fixes#70982.
Also simplify the pass pipeline for two math dialect integration tests.
Note for LLVM integration: If you see failures, add `arith-to-llvm` to your pass pipeline.
Clean up `populateVectorToLLVMConversionPatterns` so that it populates
only conversion patterns. All rewrite patterns that do not lower to LLVM
should be populated into a separate greedy pattern rewrite.
The current combination of rewrite patterns and conversion patterns
triggered an edge case when merging the 1:1 and 1:N dialect conversions.
Depends on #119973.
The mask materialization patterns during `VectorToLLVM` are rewrite
patterns. They should run as part of the greedy pattern rewrite and not
the dialect conversion. (Rewrite patterns and conversion patterns are
not generally compatible.)
The current combination of rewrite patterns and conversion patterns
triggered an edge case when merging the 1:1 and 1:N dialect conversions.
When running `convert-to-llvm`, `ceildiv` and `floordiv` ops, which do not
have direct llvm conversion pattern, would not get lowered to llvm
dialect. This patch adds CeilFloorDivExpandOpsPatterns to both
`convert-to-llvm` and `arith-to-llvm` (deprecated) lowering those ops to
lower level arith ops which can be lowered to llvm using LLVM
conversion.
Reland of https://github.com/llvm/llvm-project/pull/117305 after
buildbot failures.
See:
https://lab.llvm.org/buildbot/#/builders/80/builds/7168https://lab.llvm.org/buildbot/#/builders/130/builds/7036https://lab.llvm.org/buildbot/#/builders/138/builds/7290
Added dependence to ArithTransforms in ArithToLLVM. In previous
discussion, it has been suggested to move the
CeilFloorDivExpandOpsPatterns to ArithUtils but I think linking
ArithTransforms makes more sense as otherwise :
* ArithToLLVM needs a new dependency to ArithUtils
* ArithUtils needs new dependency to ArithTransforms or move the
patterns as well which will create more dependencies
* It creates lots of code motion which makes it hard to review.
Constrains Vector lowering to apply boundary checks only to data
transfers operating on block shapes.
This further aligns lowering with the current Xe instructions'
restrictions.
A complex multiplication should lower simply to the familiar 4 real
multiplications, 1 real addition, 1 real subtraction. No special-casing
of infinite or NaN values should be made, instead the complex numbers
should be thought as just vectors of two reals, naturally bottoming out
on the reals' semantics, IEEE754 or otherwise. That is what nearly
everybody else is doing ("nearly" because at the end of this PR
description we pinpoint the actual source of this in C99 `_Complex`),
and this pattern, by trying to do something different, was generating
much larger code, which was much slower and a departure from the
naturally expected floating-point behavior.
This code had originally been introduced in
https://reviews.llvm.org/D105270, which stated this rationale:
> The lowering handles special cases with NaN or infinity like C++.
I don't think that the C++ standard is a particularly important thing to
follow in this instance. What matters more is what people actually do in
practice with complex numbers, which rarely involves the C++
`std::complex` library type.
But out of curiosity, I checked, and the above statement seems
incorrect. The [current C++
standard](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/n4928.pdf)
library specification for `std::complex` does not say anything about the
implementation of complex multiplication: paragraph `[complex.ops]`
falls back on `[complex.member.ops]` which says:
> Effects: Multiplies the complex value rhs by the complex value *this
and stores the product in *this.
I also checked cppreference which often has useful information in case
something changed in a c++ language revision, but likewise, nothing at
all there:
https://en.cppreference.com/w/cpp/numeric/complex/operator_arith3
Finally, I checked in Compiler Explorer what Clang 19 currently
generates:
https://godbolt.org/z/oY7Ks4j95
That is just the familiar 4 multiplications.... and then there is some
weird check (`fcmp`) and conditionally a call to an external `__mulsc3`.
Googled that, found this StackOverflow answer:
https://stackoverflow.com/a/49438578
Summary: this is not about C++ (this post confirms my reading of the C++
standard not mandating anything about this). This is about C, and it
just happens that this C++ standard library implementation bottoms out
on code shared with the C `_Complex` implementation.
Another nuance missing in that SO answer: this is actually
[implementation-defined
behavior](https://en.cppreference.com/w/c/preprocessor/impl). There are
two modes, controlled by
```c
#pragma STDC CX_LIMITED_RANGE {ON,OFF,DEFAULT}
```
It is implementation-defined which is the default. Clang defaults to
OFF, but that's just Clang. In that mode, the check is required:
https://en.cppreference.com/w/c/language/arithmetic_types#Complex_floating_types
And the specific point in the [C99
standard](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf)
is: `G.5.1 Multiplicative operators`.
But set it to ON and the check is gone:
https://godbolt.org/z/aG8fnbYoP
Summary: the argument has moved from C++ to C --- and even there, to
implementation-defined behavior with a standard opt-out mechanism.
Like with C++, I maintain that the C standard is not a particularly
meaningful thing for MLIR to follow here, because people doing business
with complex numbers tend to lower them to real numbers themselves, or
have their own specialized complex types, either way not relying on
C99's `_Complex` type --- and the very poor performance of the
`CX_LIMITED_RANGE OFF` behavior (default in Clang) is certainly a key
reason why people who care prefer to stay away from `_Complex` and
`std::complex`.
A good example that's relevant to MLIR's space is CUDA's `cuComplex`
type (used in the cuBLAS CGEMM interface). Here is its multiplication
function. The comment about competitiveness is interesting: it's not a
quirk of this particular function, it's the spirit underpinning
numerical code that matters.
1bf5cd15c5/v8.0/include/cuComplex.h (L106-L120)
```c
/* This implementation could suffer from intermediate overflow even though
* the final result would be in range. However, various implementations do
* not guard against this (presumably to avoid losing performance), so we
* don't do it either to stay competitive.
*/
__host__ __device__ static __inline__ cuFloatComplex cuCmulf (cuFloatComplex x,
cuFloatComplex y)
{
cuFloatComplex prod;
prod = make_cuFloatComplex ((cuCrealf(x) * cuCrealf(y)) -
(cuCimagf(x) * cuCimagf(y)),
(cuCrealf(x) * cuCimagf(y)) +
(cuCimagf(x) * cuCrealf(y)));
return prod;
}
```
Another instance in CUTLASS:
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/complex.h#L231-L236
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
## Description
This PR updates the `ConvertGpuOpsToLLVMSPVOps`'s option by replacing
the `index-bitwidth` with a boolean option `use-64bit-index` (similar to
the `ConvertGPUToSPIRV` option).
The reason for this modification is because the
`ConvertGpuOpsToLLVMSPVOps`:
> Generate LLVM operations to be ingested by a SPIR-V backend for gpu
operations
In the context of SPIR-V specifications only two physical addressing
models are allowed: `Physical32` and `Physical64`.
This change guarantees output sanity by preventing invalid or
unsupported index bitwidths from being specified.
`arith.fptoui %arg0 : f32 to i16` was lowered to
```
%0 = emitc.cast %arg0 : f32 to ui32
emitc.cast %0 : ui32 to i16
```
and is now lowered to
```
%0 = emitc.cast %arg0 : f32 to ui16
emitc.cast %0 : ui16 to i16
```
The current implementation of lowering to llvm for vector.extract
incorrectly assumes that if the number of indices is zero, the operation
can be folded away. This PR removes this condition and relies on the
folder to do it instead.
This PR also unifies the logic for scalar extracts and slice extracts,
which as a side effect also enables vector.extract lowering for n-d
vector.extract with dynamic inner most dimension. (This was only
prevented by a conservative check in the old implementation)
Unsigned integer types are uncommon enough in MLIR that there is no
operation to cast a scalar from signless to unsigned and vice versa.
Currently tosa.rescale uses builtin.unrealized_conversion_cast which
does not lower. Instead, this commit introduces optional attributes to
indicate unsigned input or output, named similarly to those in the TOSA
specification. This is more in line with the rest of MLIR where specific
operations rather than values are signed/unsigned.
Use `llvm.func`'s `intel_reqd_sub_group_size` attribute instead of
SPIR-V environment attributes in the `gpu.shuffle` conversion pattern.
This metadata is needed to check the semantics of the operation are
supported, i.e., it has a constant width and its value is equal to the
sub-group size.
As the pass also converts `gpu.func` to `llvm.func`, adding a
discardable attribute of name `intel_reqd_sub_group_size` attribute to
the latter is enough for this pattern to work.
We no longer have a notion of "default" sub-group size, so this
attribute needs to be set in the parent function for `gpu.shuffle`
operations to be converted.
Drop dependency on the SPIR-V dialect as we no longer require creating
attributes from this dialect to lower `gpu.shuffle` instances.
---------
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
Now that LLVM allows a operand bundle on assume calls to directly
specify alignment assumptions, change the lowering of
memref.assume_alignment to use that feature instead of the ptrtoint
method.
This makes LLVM's job easier and prevents issues when dealing with
cases where ptrtoint isn't a desired operation (like those with poiner
provenance)
Add support for denormal in the Arith dialect (binary and unary
operations).
Denormal are attached to every operation, and they can be of three
different
kinds:
1) ieee, denormal are preserved and processed as defined by IEEE 754
rules.
2) preserve sign, a mode where denormal numbers are flushed to zero, but
the
sign of the zero (+0 or -0) is preserved.
3) positive zero, a mode where all denormal numbers are flushed to
positive zero
(+0), ignoring the sign of the original number.
Denormal refers to both the operands and the result. Currently only
lowering for
ieee is supported.
Currently, `GeneralizePadOpPattern` is grouped under
`populatePadOpVectorizationPatterns`. However, as noted in #111349, this
transformation "decomposes" rather than "vectorizes" `tensor.pad`. As
such, it functions as:
* a vectorization _pre-processing_ transformation, not
* a vectorization transformation itself.
To clarify its purpose, this PR turns `GeneralizePadOpPattern` into a
standalone transformation by:
* introducing a dedicated `populateDecomposePadPatterns` method,
* adding a `apply_patterns.linalg.decompose_pad` Transform Dialect Op,
* removing it from `populatePadOpVectorizationPatterns`.
In addition, to better reflect its role, it is renamed as "decomposition"
rather then "generalization". This is in line with the recent renaming
of similar ops, i.e. tensor.pack/tensor.unpack Ops in #116439.
This patch adds the `ConvertToLLVMAttrInterface` and
`ConvertToLLVMOpInterface` interfaces. It also modifies the
`convert-to-llvm` pass to use these interfaces when available.
The `ConvertToLLVMAttrInterface` interfaces allows attributes to
configure conversion to LLVM, including the conversion target, LLVM type
converter, and populating conversion patterns. See the `NVVMTargetAttr`
implementation of this interface for an example of how this interface
can be used to configure conversion to LLVM.
The `ConvertToLLVMOpInterface` interface collects all convert to LLVM
attributes stored in an operation.
Finally, the `convert-to-llvm` pass was modified to use these interfaces
when available. This allows applying `convert-to-llvm` to GPU modules
and letting the `NVVMTargetAttr` decide which patterns to populate.
This commit adds extra checks to the MemRef argument materializations in
the LLVM type converter. These materializations construct a
`MemRefType`/`UnrankedMemRefType` from the unpacked elements of a MemRef
descriptor or from a bare pointer.
The extra checks ensure that the inputs to the materialization function
are correct. It is possible that a user added extra type conversion
rules that convert MemRef types in a different way and the extra checks
ensure that we construct a MemRef descriptor only if the inputs are what
we expect.
This commit also drops a check around bare pointer materializations:
```
// This is a bare pointer. We allow bare pointers only for function entry
// blocks.
```
This check should not be part of the materialization function. Whether a
MemRef block argument is converted into a MemRef descriptor or a bare
pointer is decided in the lowering pattern. At the point of time when
materialization functions are executed, we already made that decision
and we should just materialize regardless of the input format.
Add conversion to LLVM for `SPV_INTEL_split_barrier` operations via
conversion to SPIR-V built-ins.
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
Adds initial support for lowering the `loop` directive to MLIR.
The PR includes basic suport and testing for the following clauses:
* `collapse`
* `order`
* `private`
* `reduction`
Parent PR: #113911, only the latest commit is relevant to this PR.
/llvm-project/mlir/lib/Conversion/ComplexToStandard/ComplexToStandard.cpp:529:21:
error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
529 | for (int i = 1; i < coefficients.size(); ++i) {
| ~ ^ ~~~~~~~~~~~~~~~~~~~
1 error generated.