This patch updates the TMA Tensor prefetch Op
to add support for im2col_w/w128 and tile_gather4 modes.
This completes support for all modes available in Blackwell.
* lit tests are added for all possible combinations.
* The invalid tests are moved to a separate file with more coverage.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
### Problem
PR #142944 introduced a new canonicalization pattern which caused
failures in the following GPU-related integration tests:
-
mlir/test/Integration/GPU/CUDA/TensorCore/sm80/transform-mma-sync-matmul-f16-f16-accum.mlir
-
mlir/test/Integration/GPU/CUDA/TensorCore/sm80/transform-mma-sync-matmul-f32.mlir
The issue occurs because the new canonicalization pattern can generate
multi-dimensional `vector.from_elements` operations (rank > 1), but the
GPU lowering pipelines were not equipped to handle these during the
conversion to LLVM.
### Fix
This PR adds `vector::populateVectorFromElementsLoweringPatterns` to the
GPU lowering passes that are integrated in `gpu-lower-to-nvvm-pipeline`:
- `GpuToLLVMConversionPass`: the general GPU-to-LLVM conversion pass.
- `LowerGpuOpsToNVVMOpsPass`: the NVVM-specific lowering pass.
Co-authored-by: Yang Bai <yangb@nvidia.com>
- Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and
`rocdl.permlane32.swap`
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Rename the pass `LinalgNamedOpConversionPass` to
`SimplifyDepthwiseConvPass` to avoid conflating it with the new
morphisms we are creating between the norms.
The viewLikeOpInterface abstracts the behavior of an operation view one
buffer as another. However, the current interface only includes a
"getViewSource" method and lacks a "getViewDest" method.
Previously, it was generally assumed that viewLikeOpInterface operations
would have only one return value, which was the view dest. This
assumption was broken by memref.extract_strided_metadata, and more
operations may break these silent conventions in the future. Calling
"viewLikeInterface->getResult(0)" may lead to a core dump at runtime.
Therefore, we need 'getViewDest' method to standardize our behavior.
This patch adds the getViewDest function to viewLikeOpInterface and
modifies the usage points of viewLikeOpInterface to standardize its use.
Currently, Flang can generate no-loop kernels for all OpenMP target
kernels in the program if the flags
-fopenmp-assume-teams-oversubscription or
-fopenmp-assume-threads-oversubscription are set.
If we add an additional parameter, we can choose
in the future which OpenMP kernels should be generated as no-loop
kernels.
This PR doesn't modify current behavior of oversubscription flags.
RFC for no-loop kernels:
https://discourse.llvm.org/t/rfc-no-loop-mode-for-openmp-gpu-kernels/87517
Adds the `#llvm.target<triple = $TRIPLE, chip = $CHIP, features =
$FEATURES>` attribute and along with a `-llvm-target-to-data-layout`
pass to derive a MLIR data layout from the LLVM data layout string
(using the existing `DataLayoutImporter`). The attribute implements the
relevant DLTI-interfaces, to expose the `triple`, `chip` (AKA `cpu`) and
`features` on `#llvm.target` and the full `DataLayoutSpecInterface`. The
pass combines the generated `#dlti.dl_spec` with an existing `dl_spec`
in case one is already present, e.g. a `dl_spec` which is there to
specify size of the `index` type.
Adds a `TargetAttrInterface` which can be implemented by all attributes
representing LLVM targets.
Similar to the Draft PR https://github.com/llvm/llvm-project/pull/78073.
RFC on which this PR is based:
https://discourse.llvm.org/t/mandatory-data-layout-in-the-llvm-dialect/85875
Add `mlir-capi-global-constructors-test` to `MLIR_TEST_DEPENDS` when
`MLIR_ENABLE_EXECUTION_ENGINE` is enabled, to ensure that it is also
built during standalone builds, and therefore fix test failure due to
the executable being missing.
I don't understand the purpose of `LLVM_ENABLE_PIC AND TARGET
${LLVM_NATIVE_ARCH}` block, but the condition is not true in standalone
builds.
Fixes 7610b1372955da55e3dc4e2eb1440f0304a56ac8.
This PR adds support for complex power operations (`cpow`) in the
`ComplexToROCDLLibraryCalls` conversion pass, specifically targeting
AMDGPU architectures. The implementation optimises complex
exponentiation by using mathematical identities and special-case
handling for small integer powers.
- Force lowering to `complex.pow` operations for the `amdgcn-amd-amdhsa`
target instead of using library calls
- Convert `complex.pow(z, w)` to `complex.exp(w * complex.log(z))` using
mathematical identity
A CMake change included in CMake 4.0 makes `AIX` into a variable
(similar to `APPLE`, etc.)
ff03db6657
However, `${CMAKE_SYSTEM_NAME}` unfortunately also expands exactly to
`AIX` and `if` auto-expands variable names in CMake. That means you get
a double expansion if you write:
`if (${CMAKE_SYSTEM_NAME} MATCHES "AIX")`
which becomes:
`if (AIX MATCHES "AIX")`
which is as if you wrote:
`if (ON MATCHES "AIX")`
You can prevent this by quoting the expansion of "${CMAKE_SYSTEM_NAME}",
due to policy
[CMP0054](https://cmake.org/cmake/help/latest/policy/CMP0054.html#policy:CMP0054)
which is on by default in 4.0+. Most of the LLVM CMake already does
this, but this PR fixes the remaining cases where we do not.
Blocks without a terminator are not handled correctly by
`Block::without_terminator`: the last operation is excluded, even when
it is not a terminator. With this commit, only terminators are excluded.
If the last operation is unregistered, it is included for safety.
I have seen misuse of the `hasEffect` API in downstream projects: users
sometimes think that `hasEffect == false` indicates that the operation
does not have a certain memory effect. That's not necessarily the case.
When the op does not implement the `MemoryEffectsOpInterface`, it is
unknown whether it has the specified effect. "false" can also mean
"maybe".
This commit clarifies the semantics in the documentation. Also adds
`hasUnknownEffects` and `mightHaveEffect` convenience functions. Also
simplifies a few call sites.
This is helping some windows users, here is the doc:
**LLVM_LIT_TOOLS_DIR**:PATH
The path to GnuWin32 tools for tests. Valid on Windows host. Defaults to
the empty string, in which case lit will look for tools needed for tests
(e.g. ``grep``, ``sort``, etc.) in your ``%PATH%``. If GnuWin32 is not
in your
``%PATH%``, then you can set this variable to the GnuWin32 directory so
that
lit can find tools needed for tests in that directory.
Fixes#150163
MLIR bytecode does not preserve alias definitions, so each attribute
encountered during deserialization is treated as a new one. This can
generate duplicate `DISubprogram` nodes during deserialization.
The patch adds a `StringMap` cache that records attributes and fetches
them when encountered again.
This is a cherry pick of #154053 with a fix for bad handling of
endianess when loading float and double litteral from the binary.
---------
Co-authored-by: Ferdinand Lemaire <ferdinand.lemaire@woven-planet.global>
Co-authored-by: Jessica Paquette <jessica.paquette@woven-planet.global>
Co-authored-by: Luc Forget <luc.forget@woven.toyota>
Exposes the `tensor.extract_slice` reshaping logic in
`BubbleUpExpandShapeThroughExtractSlice` and
`BubbleUpCollapseShapeThroughExtractSlice` through two corresponding
utility functions. These compute the offsets/sizes/strides of an extract
slice after either collapsing or expanding.
This should also make it easier to implement the two other bubbling
cases: (1) the `collapse_shape` is a consumer or (2) the `expand_shape`
is a consumer.
---------
Signed-off-by: Ian Wood <ianwood@u.northwestern.edu>
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
{};
We only have 30 instances that rely on this "redirection". Since the
redirection doesn't improve readability, this patch replaces SmallSet
with SmallPtrSet for pointer element types.
I'm planning to remove the redirection eventually.
## Description
This change introduces a new canonicalization pattern for the MLIR
Vector dialect that optimizes chains of insertions. The optimization
identifies when a vector is **completely** initialized through a series
of vector.insert operations and replaces the entire chain with a
single `vector.from_elements` operation.
Please be aware that the new pattern **doesn't** work for poison vectors
where only **some** elements are set, as MLIR doesn't support partial
poison vectors for now.
**New Pattern: InsertChainFullyInitialized**
* Detects chains of vector.insert operations.
* Validates that all insertions are at static positions, and all
intermediate insertions have only one use.
* Ensures the entire vector is **completely** initialized.
* Replaces the entire chain with a
single vector.from_elementts operation.
**Refactored Helper Function**
* Extracted `calculateInsertPosition` from
`foldDenseElementsAttrDestInsertOp` to avoid code duplication.
## Example
```
// Before:
%v1 = vector.insert %c10, %v0[0] : i64 into vector<2xi64>
%v2 = vector.insert %c20, %v1[1] : i64 into vector<2xi64>
// After:
%v2 = vector.from_elements %c10, %c20 : vector<2xi64>
```
It also works for multidimensional vectors.
```
// Before:
%v1 = vector.insert %cv0, %v0[0] : vector<3xi64> into vector<2x3xi64>
%v2 = vector.insert %cv1, %v1[1] : vector<3xi64> into vector<2x3xi64>
// After:
%0:3 = vector.to_elements %arg1 : vector<3xi64>
%1:3 = vector.to_elements %arg2 : vector<3xi64>
%v2 = vector.from_elements %0#0, %0#1, %0#2, %1#0, %1#1, %1#2 : vector<2x3xi64>
```
---------
Co-authored-by: Yang Bai <yangb@nvidia.com>
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
This is the continuation of #152131
This PR adds support for parsing the global initializer and function
body, and support for decoding scalar numerical instructions and
variable related instructions.
---------
Co-authored-by: Ferdinand Lemaire <ferdinand.lemaire@woven-planet.global>
Co-authored-by: Jessica Paquette <jessica.paquette@woven-planet.global>
Co-authored-by: Luc Forget <luc.forget@woven.toyota>
In FoldArithToVectorOuterProduct pattern, static cast to vector type
causes assertion when a scalar type was encountered. It seems the author
meant to have a dyn_cast instead.
This NFC patch handles it by using dyn_cast.
This patch is forcing all values to be initialized by the
LivenessAnalysis, even in dead blocks. The dataflow framework will skip
visiting values when its already knows that a block is dynamically
unreachable, so this requires specific handling.
Downstream code could consider that the absence of liveness is the same
a "dead".
However as the code is mutated, new value can be introduced, and a
transformation like "RemoveDeadValue" must conservatively consider that
the absence of liveness information meant that we weren't sure if a
value was dead (it could be a newly introduced value.
Fixes#153906
This is similar to the fix to the greedy driver in #153957 ; except that
instead of removing unreachable code, we just ignore it.
Operations like:
```
%add = arith.addi %add, %add : i64
```
are legal in unreachable code.
Unfortunately many patterns would be unsafe to apply on such IR and can
lead to crashes or infinite loops.
This PR adds pattern to distribute the load/store/prefetch nd ops with
offsets from workgroup to subgroup IR. This PR is part of the transition
to move offsets from create_nd to load/store/prefetch nd ops.
Create_nd PR : #152351