### Problem
PR #142944 introduced a new canonicalization pattern which caused
failures in the following GPU-related integration tests:
- mlir/test/Integration/GPU/CUDA/TensorCore/sm80/transform-mma-sync-matmul-f16-f16-accum.mlir
- mlir/test/Integration/GPU/CUDA/TensorCore/sm80/transform-mma-sync-matmul-f32.mlir
The issue occurs because the new canonicalization pattern can generate
multi-dimensional `vector.from_elements` operations (rank > 1), but the
GPU lowering pipelines were not equipped to handle these during the
conversion to LLVM.
### Fix
This PR adds `vector::populateVectorFromElementsLoweringPatterns` to the
GPU lowering passes integrated in `gpu-lower-to-nvvm-pipeline` (see the
sketch after this list):
- `GpuToLLVMConversionPass`: the general GPU-to-LLVM conversion pass.
- `LowerGpuOpsToNVVMOpsPass`: the NVVM-specific lowering pass.
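A minimal sketch of the wiring, assuming the populate function is declared
in the Vector dialect's lowering-patterns header (the helper name here is
hypothetical; each pass calls the populate function while assembling its
pattern set):

```cpp
#include "mlir/Dialect/Vector/Transforms/LoweringPatterns.h"

using namespace mlir;

// Decomposes rank > 1 vector.from_elements into rank-1 operations that the
// vector-to-LLVM conversion patterns can handle.
static void populateFromElementsLowering(RewritePatternSet &patterns) {
  vector::populateVectorFromElementsLoweringPatterns(patterns);
}
```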
Co-authored-by: Yang Bai <yangb@nvidia.com>
- Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and
`rocdl.permlane32.swap`
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
This PR adds support for complex power operations (`cpow`) in the
`ComplexToROCDLLibraryCalls` conversion pass, specifically targeting
AMDGPU architectures. The implementation optimises complex
exponentiation by using mathematical identities and special-case
handling for small integer powers.
- For the `amdgcn-amd-amdhsa` target, lower through `complex.pow`
operations instead of emitting library calls
- Convert `complex.pow(z, w)` to `complex.exp(w * complex.log(z))` using
the mathematical identity (illustrated below)
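As an illustration (types chosen for the example), the identity-based
expansion rewrites a power into log/multiply/exp:

```mlir
// Before:
%r = complex.pow %z, %w : complex<f32>

// After (conceptually):
%log = complex.log %z : complex<f32>
%mul = complex.mul %w, %log : complex<f32>
%res = complex.exp %mul : complex<f32>
```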
Add support for 1:N type conversions to the `ControlFlowToLLVM` lowering
patterns. Not applicable to `cf.switch` and `cf.assert`.
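A sketch of what 1:N means here, assuming a hypothetical type `!test.pair`
whose conversion yields two `i64` values:

```mlir
// Before: one block argument of the original type.
cf.br ^bb1(%p : !test.pair)

// After: the single value is expanded into its two converted values.
llvm.br ^bb1(%a, %b : i64, i64)
```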
---------
Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
Add support for 1:N type conversions to the `FuncToLLVM` lowering
patterns. This commit does not change the lowering of any types (such as
`MemRefType`). It just sets up the infrastructure, such that 1:N type
conversions can be used during `FuncToLLVM`.
Note: When the converted result types of a `func.func` have more than 1
type, then the results are wrapped in an `llvm.struct`. That's because
`llvm.func` does not support multiple result values. This "wrapping" was
already implemented for cases where the original `func.func` has
multiple results. With 1:N conversions, even a single result can now
expand to multiple converted results, triggering the same wrapping
mechanism.
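A sketch under the same kind of assumption, with a hypothetical type
`!test.pair` that 1:N-converts to `(i64, i64)`:

```mlir
// Before: a single original result.
func.func private @f() -> !test.pair

// After: two converted results, wrapped in a struct because llvm.func
// supports only one result value.
llvm.func @f() -> !llvm.struct<(i64, i64)>
```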
The test cases are exercised with both the old and the new no-rollback
conversion driver.
Currently, the modifier is printed as an address, so it is neither readable
nor useful. This PR adds readable printing for it.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Lower `transfer_read`/`transfer_write` to `load_gather`/`store_scatter`
when the target uArch doesn't support `load_nd`/`store_nd`. The high-level
steps (sketched below):
1. compute strides;
2. compute offsets;
3. collapse the memref to 1D;
4. create the `load_gather` or `store_scatter` op.
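A minimal sketch of steps 1-3 using standard memref ops (the ops the
pattern actually uses may differ, and step 4's gather/scatter syntax is
dialect-specific, so it is elided):

```mlir
// Steps 1-2: strides and the base offset come from the memref metadata.
%base, %offset, %sizes:2, %strides:2 = memref.extract_strided_metadata %src
    : memref<8x16xf32> -> memref<f32>, index, index, index, index, index

// Step 3: collapse to 1-D so per-element offsets can be linearized as
// %offset + i * %strides#0 + j * %strides#1.
%flat = memref.collapse_shape %src [[0, 1]]
    : memref<8x16xf32> into memref<128xf32>

// Step 4: issue the load_gather / store_scatter on %flat using the
// linearized offsets.
```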
Split the function into two: one that copies a single unranked
descriptor and one that copies multiple unranked descriptors. This is in
preparation for adding 1:N support to the Func->LLVM lowering patterns.
Rename `computeSizes` to `computeSize` and make it compute just a single
size. This is in preparation for adding 1:N support to the Func->LLVM
lowering patterns.
Now that we have general support for setting argument and result
attributes on LLVM intrinsics, extend the definitions of mbcnt.lo and
mbcnt.hi to carry such attributes. With that, update the construction of
the mbcnt.lo/mbcnt.hi calls used to get the lane ID so that the result is
`noundef` (since the lane ID is always defined) and annotated with the
correct ranges (so that generic LLVM passes can correctly optimize based
on the fact that there are never more than 32/64 lanes).
(Also, handle a pattern that wasn't using getLaneId() and get rid of a
dead argument)
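For a wave64 target, the annotated calls in the generated LLVM IR should
look roughly like this (operand values illustrative; the exact ranges are
whatever the PR attaches per intrinsic):

```llvm
%lo = call noundef i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%id = call noundef range(i32 0, 64) i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)
```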
This commit improves the `allowPatternRollback` flag handling in the
dialect conversion driver. Previously, this flag was used to merely
detect cases that are incompatible with the new One-Shot Dialect
Conversion driver. This commit implements the driver itself: when the
flag is set to "false", all IR changes are materialized immediately,
bypassing the `IRRewrite` and `ConversionValueMapping` infrastructure.
A few selected test cases now run with both the old and the new driver.
RFC:
https://discourse.llvm.org/t/rfc-a-new-one-shot-dialect-conversion-driver/79083
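A sketch of how a pass opts in, assuming the dialect-conversion API at the
time of writing (`ConversionConfig` carries the flag); this fragment would
live inside a pass's `runOnOperation`:

```cpp
// Opt into the one-shot driver: IR changes are materialized immediately,
// so patterns must not rely on rollback.
ConversionConfig config;
config.allowPatternRollback = false;
if (failed(applyPartialConversion(op, target, std::move(patterns), config)))
  signalPassFailure();
```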
Adds a pass for Vector to AMX operation conversion.
Initially, a direct rewrite for vector contraction in packed VNNI layout
is supported. Operations are expected to already be in AMX-compatible
shapes for the rewriting to occur.
Previously, the NVVM dialect's ldmatrix operation could only generate a
limited subset of the available NVVM ldmatrix intrinsics; the new intrinsic
variants introduced with Blackwell were not accessible through the NVVM
ops. This commit extends the ldmatrix operation to support all available
ldmatrix intrinsics.
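Illustrative use of the op (this variant already existed; the newly exposed
variants differ in their shape/type attributes):

```mlir
%l = nvvm.ldmatrix %ptr {num = 4 : i32, layout = #nvvm.mma_layout<row>}
    : (!llvm.ptr<3>) -> !llvm.struct<(i32, i32, i32, i32)>
```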
Add the missing xor AtomicRMWKind enum value in arith. Also add support for
xor to memref.atomic_rmw so the change can be tested.
This does NOT add it for all users of the enum (e.g., Affine, Vector).
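A hedged example, assuming the new kind is spelled `xori` in line with the
existing `andi`/`ori` kinds:

```mlir
%old = memref.atomic_rmw xori %val, %mem[%i] : (i32, memref<16xi32>) -> i32
```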
These commits continue the work done in
https://github.com/llvm/llvm-project/pull/144344 of adding alignment
attributes to operations in the vector and memref dialects. These commits
focus
on adding the alignment attribute to the `maskedload` and `maskedstore`
operations. The `VectorLoadConversion` pattern in VectorToLLVM is a
template for `load`, `store`, `maskedload` and `maskedstore` operations.
Having the alignment attribute in all these operations would allow for
an easy way to propagate the alignment attribute from the vector dialect
to the LLVM dialect.
This patchset also includes changes to the VectorToLLVM conversion to
propagate the alignment attribute for the vector.{,masked}{load,store}
operations (see the example below).
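An illustrative masked load carrying the attribute (alignment in bytes; the
printed position of the attribute dictionary may vary):

```mlir
%l = vector.maskedload %base[%c0], %mask, %pass {alignment = 8 : i64}
    : memref<?xf32>, vector<16xi1>, vector<16xf32> into vector<16xf32>
```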
This patchset:
* propagates alignment attributes from memref operations into the SPIR-V
dialect,
* fixes an error in the logic that previously propagated alignment
attributes but did not add other MemoryAccess attributes, and
* adds a failure condition for the case where the alignment attribute
from the memref dialect (64-bit wide) does not fit in SPIR-V's alignment
attribute (specified to be 32-bit wide).
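A sketch of the resulting access on the SPIR-V side (storage class and type
illustrative):

```mlir
%v = spirv.Load "StorageBuffer" %ptr ["Aligned", 4] : f32
```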
Propagate alignment attribute from operations in the memref dialect to
the LLVM dialect.
Possible improvement: maybe the alignment attribute in LLVM's store and
load operations should be confined to i64? I believe that way one can avoid
spelling out the type in the attribute dictionary, i.e., going from
`{ alignment = 32 : i64 }` to `{ alignment = 32 }`.
Since `resultTy` might be nullptr, we should use `dyn_cast` instead of
`cast`. Additionally, `typeConverter->convertType<T>` is more
appropriate in this context.
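A sketch of the fixed pattern (op and type names illustrative):

```cpp
// convertType<T>() returns a null T when conversion fails or yields a
// different type, folding the dyn_cast into the conversion itself.
auto resultTy = typeConverter->convertType<VectorType>(op.getType());
if (!resultTy)
  return rewriter.notifyMatchFailure(op, "unsupported result type");
```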
Adds an initial conversion in the Memref -> SPIR-V lowering for images.
Any memref in the "Image" storage-class/address-space will be considered
for lowering to the `!spirv.image` type during Memref to SPIR-V
conversion. Initially, only reading of sampled images is supported, and
images are read via the `OpImageFetch` instruction. Future work should
expand the conversion patterns to target non-sampled images and add
support for image write operations.
Images are supported for fp32, fp16, int32, uint32, int16, and uint16
types. Lit tests have been added to verify these cases, along with negative
tests checking the cases where images aren't supported.
---------
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
The `gpu.module` operation can contain `spirv.target_env` attributes
within an array attribute named `"targets"`. This patch accounts for that
case by iterating over the `"targets"` attribute, if present, and looking
up `spirv.target_env`.
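The handled form looks like this (VCE triple and resource limits
illustrative):

```mlir
gpu.module @kernels [#spirv.target_env<#spirv.vce<v1.0, [Shader], []>,
                                       #spirv.resource_limits<>>] {
}
```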
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
This patch adds support for lowering several float classification ops
from the Math dialect to the SPIR-V dialect.
### Highlights:
- Introduced a new `spirv.IsFinite` operation corresponding to the
SPIR-V `OpIsFinite` instruction.
- Lowered `math.isfinite`, `math.isinf`, and `math.isnan` to SPIR-V
using `CheckedElementwiseOpPattern`.
- Added corresponding tests for op definition and conversion lowering.
This addresses the discussion in:
https://github.com/llvm/llvm-project/issues/150778
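Illustrative conversions (each op returns `i1`):

```mlir
%nan = math.isnan %x : f32     // -> spirv.IsNan %x : f32
%inf = math.isinf %x : f32     // -> spirv.IsInf %x : f32
%fin = math.isfinite %x : f32  // -> spirv.IsFinite %x : f32
```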
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
`offset` and `width` must be constants, and there are constraints on their
values. Update the operation definition to use attributes instead of
operands.
Do not access the erased `scf.while` operation in the lowering pattern.
That won't work anymore in a One-Shot Dialect Conversion and triggers a
use-after-free sanitizer error.
After the One-Shot Dialect Conversion refactoring, a
`ConversionPatternRewriter` will behave more like a normal
`PatternRewriter`.
This patch removes spurious includes of `llvm/IR` files, and unnecessary
link components in the LLVM dialect.
The only major dependencies still coming from LLVM are
`llvm::DataLayout`, which is used by `verifyDataLayoutString` and some
`dwarf` symbols in some attributes. Both of them should likely be
removed in the future.
Finally, I also removed one constructor from `LLVM::AssumeOp` that used
[OperandBundleDefT](https://llvm.org/doxygen/classllvm_1_1OperandBundleDefT.html)
without good reason and introduced a header unnecessarily.
Fixes
https://github.com/llvm/llvm-project/pull/149879#issuecomment-3117145615
Note this happens because overload resolution can't disambiguate between
`mlir::DenseI64ArrayAttr` and `llvm::ArrayRef<int64_t>` **for the literal
0**: `0` is a null pointer constant (and attributes wrap a pointer), while
it is also an integer convertible to `int64_t`. This only occurs with the
value 0.
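An illustrative reduction of the ambiguity (not the actual call site):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "mlir/IR/BuiltinAttributes.h"

void f(mlir::DenseI64ArrayAttr attr);
void f(llvm::ArrayRef<int64_t> values);

// f(0); // error: ambiguous. The literal 0 is a null pointer constant
//       // (viable for the attribute overload) and also an int convertible
//       // to int64_t (viable for ArrayRef's one-element constructor).
```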
This PR uses `val.getDefiningOp<OpTy>()` to replace `dyn_cast<OpTy>(val.getDefiningOp())`, `dyn_cast_or_null<OpTy>(val.getDefiningOp())`, and `dyn_cast_if_present<OpTy>(val.getDefiningOp())`.
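For example (op name illustrative):

```cpp
// Before: two steps, with the null check baked into the cast.
auto cst = dyn_cast_or_null<arith::ConstantOp>(val.getDefiningOp());
// After: one call that is null-safe by construction.
auto cst2 = val.getDefiningOp<arith::ConstantOp>();
```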
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
The current arith-to-amdgpu patterns for scaling_extf and scaling_truncf
don't take full advantage of the native packing ability of the
intrinsics being targeted. Scaling extension takes the location of the
two elements to be extended as a constant argument (a byte for fp4, a half
for fp8), and scaling truncation takes a 32-bit input register and a byte
or half to write the truncated values to.
Not using these features would cause excess unneeded register pressure.
This PR resolves the inefficiency.
It also adds a test for the expected use case of extending or truncating a
block of 32 values to/from fp4 with a uniform scale, to ensure that this
usage involves a minimal amount of vector shuffling.