#190563 re-enabled FileCheck on `Integration/GPU/CUDA/async.mlir`, but
the buildbot has shown intermittent wrong-output failures
([example](https://lab.llvm.org/buildbot/#/builders/116/builds/27026)):
the test produces `[42, 42]` instead of the expected `[84, 84]`.
This wrong-output flakiness is distinct from the cleanup-time
`cuModuleUnload` errors that #190563 actually fixes — it's the
underlying issue tracked by #170833. The merged commit message for
#190563 incorrectly says `Fixes #170833`; that issue should be reopened,
since the cleanup-error fix doesn't address the wrong-output behavior.
This PR puts the test back in its previously-disabled state. The runtime
cleanup fix in #190563 is unaffected.
`mgpuModuleUnload` may be called from a global destructor (registered by
`SelectObjectAttr`'s `appendToGlobalDtors`) after the CUDA primary
context has already been destroyed during program shutdown. In this
case, `cuModuleUnload` returns `CUDA_ERROR_DEINITIALIZED`, which is
benign since the module's resources are already freed with the context.
## Reproduction
Any program that uses `gpu.launch_func` and is AOT-compiled (via
`mlir-translate --mlir-to-llvmir | llc | cc -lmlir_cuda_runtime`) will
print `'cuModuleUnload(module)' failed with '<unknown>'` on exit. This
is because `SelectObjectAttr` registers the module unload as a global
destructor, which runs after the CUDA primary context is released.
This script reproduces the error message from `mgpuModuleUnload` on my
system:
```
#!/bin/bash
set -e
LLVM_BUILD=${LLVM_BUILD:-$HOME/dev/git/llvm-project-22/build}
cat > /tmp/repro.mlir << 'MLIR'
func.func @main() {
%c1 = arith.constant 1 : index
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
threads(%tx, %ty, %tz) in (%bsx = %c1, %bsy = %c1, %bsz = %c1) {
gpu.terminator
}
return
}
MLIR
$LLVM_BUILD/bin/mlir-opt /tmp/repro.mlir \
-gpu-lower-to-nvvm-pipeline="cubin-format=fatbin" \
| $LLVM_BUILD/bin/mlir-translate --mlir-to-llvmir -o /tmp/repro.ll
$LLVM_BUILD/bin/llc -relocation-model=pic -filetype=obj /tmp/repro.ll -o /tmp/repro.o
cc /tmp/repro.o \
-L$LLVM_BUILD/lib -Wl,-rpath,$LLVM_BUILD/lib \
-lmlir_cuda_runtime -lmlir_runner_utils -o /tmp/repro
echo "Running:"
/tmp/repro 2>&1
echo "Exit code: $?"
```
## Context
This matches how other projects handle the same shutdown ordering issue:
- Clang CUDA (D48613) switched module cleanup from
`__attribute__((destructor))` to `atexit()`
- GCC libgomp checks context validity before `cuModuleUnload`
- Apache TVM silently ignores `CUDA_ERROR_DEINITIALIZED` on module
unload
Fixes#170833
This PR enhance the multi-reduction op pattern of wg-to-sg distribution
pass:
1. allows each sg have multiple distribution of sg_data tiles.
2. expand the slm buffer size.
3. construct the layout based on the partial reduced vector and use
layout.computeDistributedCoords() to compute coordinates. the layout is
constructed so that the store is cooperative, and load overlapps with
neighbour threads.
4. perform save and load.
The MLIR LLVM dialect is missing support for several parameter
attributes that
exist in LLVM IR: `writable`, `dead_on_unwind`, `dead_on_return`, and
`nofpclass`. This adds them to the kind-to-name mapping in
`AttrKindDetail.h`
and the corresponding name accessors in `LLVMDialect.td`.
The existing generic conversion infrastructure in `ModuleTranslation`
and
`ModuleImport` picks them up automatically — `writable` and
`dead_on_unwind`
round-trip as `UnitAttr`, while `dead_on_return` and `nofpclass`
round-trip as
`IntegerAttr`.
CIR needs these to match classic codegen's ABI output (sret gets
`writable
dead_on_unwind`, indirect args get `dead_on_return`, fast-math FP args
get
`nofpclass`).
This change adds the following NVVM Ops for new narrow FP conversions
introduced in PTX 9.1:
- `convert.{f32x2/bf16x2}.to.s2f6x2`
- `convert.s2f6x2.to.bf16x2`
- `convert.bf16x2.to.f8x2` (extended for `f8E4M3FN` and `f8E5M2` types)
- `convert.{f16x2/bf16x2}.to.f6x2`
- `convert.{f16x2/bf16x2}.to.f4x2`
PTX ISA Reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt
Currently, the opt-reduction-pass only supports inputting the
optimization pipeline via the command line, which becomes cumbersome
when the pipeline is long. To address this, this PR introduces the
opt-pass-file option. This allows users to save the pipeline in a file
and provide the filename to parse the pipeline.
To avoid potential memory leaks, this PR defers the ModuleOp cloning
until after the verification check. If the check fails, the
moduleVariant might not be properly deallocated(original
implementation), leading to a memory leak. Therefore, this PR ensures
that the clone operation is only performed after a successful check. It
is part of https://github.com/llvm/llvm-project/pull/189353.
The pivot used to fix divisibility in Smith normal form is stale. This
will not affect correctness, but can lower efficiency since the outer
loop will be executed more times.
Thanks for @benquike of discovering this.
The verifyLayouts function walked the IR before distribution and failed
the pass if any XeGPU anchor op or vector-typed result was missing a
layout attribute. This was added as a temporary guard while the pass was
being developed.
Now we add target check for each op.
This adds a new method Verifier::visitDIType, and then changes method
for subclasses of DIType to call it. The new method just dispatches to
DIScope and adds a file/line check inspired by
Verifier::visitDISubprogram.
When JITing SPIR-V using LevelZero API, it expects the length of the
string since passed input data is a `void *`. Problem is, getting the
length of the string is not possible using something like
`strlen(reinterpret_cast<char *>(data))` in `mgpuModuleLoadJIT`
implementation. Becasuse the SPIR-V binary contains null bytes (i.e.,
the data is binary SPIR-V, not null-terminated text).
As a result we need to pass the `assmeblySize` via the
`mgpuModuleLoadJIT(void* data, int optLevel, size_t assmeblySize)`.
The patterns `IndexCastOfIndexCast` and `IndexCastUIOfIndexCastUI` in
ArithCanonicalization.td incorrectly eliminated a pair of index casts
whenever the outer result type equalled the original source type,
without verifying that the intermediate cast was lossless.
For example, the following was wrong folded to `%arg0`:
%0 = index_castui %arg0 : i64 to index
%1 = index_castui %0 : index to i8 ← truncates to 8 bits
%2 = index_castui %1 : i8 to index ← incorrectly removed
The pattern matched `%1`/`%2` because `i8.to(index)` has the same result
type as `i64.to(index)`, even though the i8 intermediate silently drops
56 bits. The same bug existed for the signed `index_cast` variant.
Fix: move the optimization into the `fold` methods of `IndexCastOp` and
`IndexCastUIOp` with an explicit check that the intermediate type is at
least as wide as the source type (using
`IndexType::kInternalStorageBitWidth` as the representative width for
`index`). Only then is the round-trip guaranteed lossless and the chain
can be collapsed.
Fixes#90238Fixes#90296
Assisted-by: Claude Code
`generateInitialTensorForPartialReduction` and the `getInitSliceInfo*`
helpers unconditionally cast every result expression of the partial
result AffineMap to `AffineDimExpr`. When the original output indexing
map contains a constant (e.g. `affine_map<(d0,d1,d2)->(d0,0,d2)>`), the
constant expression propagates into the partial map and the cast
triggers an assertion.
Fixes#173025
Assisted-by: Claude Code
The affine-loop-invariant-code-motion pass was hoisting side-effectful
operations (e.g. affine.store) out of loops whose trip count is
statically known to be zero. This caused stores to execute
unconditionally even though the loop body should never run, producing
incorrect results.
The fix skips hoisting of non-memory-effect-free ops when
getConstantTripCount returns 0. Pure/side-effect-free ops are still
eligible for hoisting because they cannot change observable program
state.
Fixes#128273
Assisted-by: Claude Code
The generic MemRefRewritePattern handles AllocOp/AllocaOp by calling
getFlattenMemrefAndOffset with the op's own result as the source memref.
This inserts ExtractStridedMetadataOp and ReinterpretCastOp that consume
op.result before the alloc op itself in the block. After
replaceOpWithNewOp, op.result is RAUW'd to the new ReinterpretCastOp
result, leaving those earlier ops with forward references — a domination
violation caught by MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS.
Replace the AllocOp/AllocaOp cases in MemRefRewritePattern with a
dedicated AllocLikeFlattenPattern that never touches op.result until the
final replaceOpWithNewOp:
- sizes come from op.getMixedSizes() (operands, not the result)
- strides come from getStridesAndOffset on the MemRefType
- the flat allocation size is computed via
getLinearizedMemRefOffsetAndSize plus the static base offset so the
buffer covers [0, offset+extent)
- castAllocResult is simplified to take the pre-computed sizes and
strides rather than inserting an ExtractStridedMetadataOp on the
original op
- non-zero static base offsets are now correctly preserved in the
reinterpret_cast (the old code hardcoded offset=0, which was a verifier
error for layouts with offset \!= 0)
- dynamic offsets or strides bail out via notifyMatchFailure
Also remove the now-dead AllocOp/AllocaOp branches from replaceOp() and
the constexpr specialisation in getIndices().
Assisted-by: Claude Code
`mlir::affine::simplifyConstrainedMinMaxOp` called
`canonicalizeMapAndOperands` with `newOperands` that could contain null
`Value()`s. These nulls came from
`unpackOptionalValues(constraints.getMaybeValues(), newOperands)` where
internal constraint variables added by `appendDimVar` (for `dimOp`,
`dimOpBound`, and `resultDimStart*`) have no associated SSA values.
Passing null Values to `canonicalizeMapAndOperands` risks undefined
behavior:
- `seenDims.find(null_value)` in the DenseMap causes all null operands
to collide at the same key, producing incorrect dim remapping.
- Any null operand that remains referenced in the result map would
propagate as a null Value into `AffineValueMap`, crashing callers that
try to use those operands to create ops.
Fix: Before calling `canonicalizeMapAndOperands`, filter null operands
from `newOperands` by replacing their dim/symbol positions in `newMap`
with constant 0 (safe because internal constraint dims should not appear
in the bound map expression) and compacting `newOperands` to contain
only non-null Values.
Fixes#127436
Assisted-by: Claude Code
The "small range with constant divisor" optimization in
`inferAffineExpr` for `AffineExprKind::Mod` assumed that if the dividend
range span (`lhsMax - lhsMin`) is less than the divisor, then the mod
results form a contiguous range. This is not always true, as the range
can straddle a modulus boundary.
For example, `[14, 17] mod 8`:
- Span is 3 < 8, so the old condition passed
- But `14%8=6` and `17%8=1` (wraps at 16)
- `umin=6, umax=1` → assertion `umin.ule(umax)` fails
The fix adds a same-quotient check (`lhsMin/rhs == lhsMax/rhs`) to
ensure both endpoints fall within the same modular period. When they
don't, we fall back to the conservative `[0, divisor-1]` range.
Assisted-by: Cursor (Claude)
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
This PR aims to make the pass more generic by removing the ModuleOp
restriction. This PR reimplements the logic using a standalone
PassManager. Additionally, the isInteresting method has been updated to
accept Operation* for better flexibility. Finally, a dedicated test
directory has been added to improve the organization of OptReductionPass
tests.
Change `NVVM_SyncWarpOp` base class from `NVVM_Op` to
`NVVM_IntrOp<"bar.warp.sync">`, which auto-generates `llvmEnumName =
nvvm_bar_warp_sync` and registers it with
`-gen-intr-from-llvmir-conversions` and
`-gen-convertible-llvmir-intrinsics`. This enables LLVM IR to MLIR
import. The hand-written `llvmBuilder` is removed as the default
`LLVM_IntrOpBase` builder is equivalent.
Add a FileCheck test covering the 'expected at least one input' error in
ReduceOp::verify(). The companion 'expected at least one output' check
was dead code: SameVariadicOperandSize fires first whenever
inputs.size() \!= inits.size(), and when both are empty the input check
fires first; remove the unreachable branch.
Assisted-by: Claude Code
This reverts commit 1418f80.
The change can cause an infinite rewrite loop when
ForwardConcatInsertSliceDest interacts with
FoldEmptyTensorWithExtractSliceOp.
Handle specialization of `linalg.generic` ops representing a unary
elementwise computation to the `linalg.elementwise` category op. This
implements a previously absent path in the linalg morphism.
The walk in SincosFusion may detect a cos within a nested region of the
sin block. This triggers an assertion in `isBeforeInBlock` later on.
Added a check within the walk so it filters operations in nested
regions, which are not in the same block and should not be fused anyway.
---------
Co-authored-by: Yebin Chon <ychon@nvidia.com>
Previously, UnrollMultiReductionPattern bailed out when all the
dimensions were reduced to a scalar. This PR adds support for this case
by tiling the source vector and chaining partial reductions through the
accumulator operand.
This PR add Layout Propagation support for multi-reduction/reduction op
with scalar result:
1) Enhance setupMultiReductionResultLayout() and
LayoutInfoPropagation::visitVectorMultiReductionOp() to support scalar
result
2) Add propagation support for vector.reduction op at the lane level,
since the op is only introduced at the lane level.
The `ReductionDataSize` field in `KernelEnvironmentTy` and the
`MaxDataSize` used to compute the `reduce_data_size` argument to
`__kmpc_nvptx_teams_reduce_nowait_v2` were both computed using pointer
types for by-ref reductions instead of the actual element types. This
caused the global teams reduction buffer to be undersized relative to
the offsets used by the copy/reduce callbacks, resulting in
out-of-bounds accesses faults at runtime.
For example, a by-ref reduction over `[4 x i32]` (16 bytes) would
allocate buffer slots based on `sizeof(ptr)` = 8 bytes, but the
generated callbacks would access 16 bytes per slot.
Fix both computation sites:
1. In MLIR's `getReductionDataSize()`, use
`DeclareReductionOp::getByrefElementType()` instead of `getType()` when
the reduction is by-ref, so the reduction buffer struct layout (and more
importantly its size) matches that emitted by the `OMPIRBuilder`.
2. In `OMPIRBuilder::createReductionsGPU()`, use
`ReductionInfo::ByRefElementType` instead of `ElementType` for by-ref
reductions when computing `MaxDataSize`. It seems that `MaxDataSize`
isn't actually used in the deviceRTL, but it's better to fix it to avoid
future propagation of this bug.
Finally, add CHECK lines to the existing array-descriptor reduction test
to verify both the kernel environment `ReductionDataSize` and the
`reduce_data_size` call argument reflect the actual element type size.
Assisted-by: Claude Opus 4.6
---------
Co-authored-by: Jeffrey Sandoval <jeffrey.sandoval@hpe.com>
Add support for the `nvvm.managed` attribute on `llvm.mlir.global` ops.
When present, the LLVM IR translation emits `!nvvm.annotations` metadata
with `!"managed"` for the global variable, which the NVPTX backend uses
to generate `.attribute(.managed)` in PTX output.
This enables CUDA managed memory support for frontends that lower
through MLIR.
This change adds a `gpu-num-threads` option to the sparsifier. This
allows users to specify the number of threads used for GPU codegen,
similar to the `num-threads` option in the `-sparse-gpu-codegen` pass.
This patch enables the MarkDeclareTarget for CIR by adding the pass to
the lowerings and attaching the declare target interface to the
cir::FuncOp. The MarkDeclareTarget is also generalized to work on the
FunctionOpInterface instead of func::Op since it needs to be able to
handle cir::FuncOp as well.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This PR extends flang's alias analysis so it can reason about values
that originate from OpenACC data and privatization operations, including
values passed through block arguments.
The `emitSummaryAndDescComments` is used to generate summary and
description for tablegen generated classes and structs such as Dialects
and Interfaces. The generated summary and description is indented
incorrectly in the output generated file. For example
`NVGPUDialect.h.inc ` looks like the following:
```cpp
namespace mlir::nvgpu {
/// The `NVGPU` dialect provides a bridge between higher-level target-agnostic
/// dialects (GPU and Vector) and the lower-level target-specific dialect
/// (LLVM IR based NVVM dialect) for NVIDIA GPUs. This allow representing PTX
/// specific operations while using MLIR high level dialects such as Memref
/// and Vector for memory and target-specific register operands, respectively.
class NVGPUDialect : public ::mlir::Dialect {
...
};
} // namespace mlir::nvgpu
```
This is because the `emitSummaryAndDescComments` trims the summary and
description from both sides, rendering the re-indentation useless. This
PR resolves this bug.
Generalize TransferReadDropUnitDimsPattern to also drop unit dimensions
when `vector::ConstantMaskOp` is used.
Previously TransferReadDropUnitDimsPattern would only drop unit
dimensions when `vector::CreateMaskOp` with a statically known operand
was used.
Assisted-by: Cursor
affine-scalrep's findUnusedStore incorrectly classified an
affine.vector_store as dead when a subsequent store wrote to the same
base index but with a smaller vector type. A vector<1xi64> store at
[0,0] does not fully overwrite a vector<5xi64> store at [0,0], so the
first store must be preserved.
The loadCSE function in the same file already had the correct
type-equality check for loads; this patch adds the analogous check for
stores in findUnusedStore.
Fixes#113687
Assisted-by: Claude Code
This PR fixes a regression where the numCSE statistic was being
incremented twice for a single operation elimination. The numCSE counter
is already internally incremented within the replaceUsesAndDelete
function. Manually incrementing it again after the function call leads
to an inaccurate total count. This is part of the
https://github.com/llvm/llvm-project/pull/180556.
`arith.index_cast` and `arith.index_castui` accept memref operands (via
`IndexCastTypeConstraint`), but `IndexCastOpLowering::matchAndRewrite`
did not handle this case. When the operand was a memref, the conversion
framework substituted the converted LLVM struct type, and the lowering
incorrectly attempted to emit `llvm.sext`/`llvm.zext`/`llvm.trunc` on a
struct value, producing invalid LLVM IR.
Since LLVM uses opaque pointers, all memrefs with integer or index
element types lower to the same `\!llvm.struct<(ptr, ptr, i64, ...)>`
type, making `arith.index_cast` on memrefs a no-op at the LLVM level.
Add a check that treats the memref case as an identity conversion (same
as the same-bit-width path).
Fixes#92377
Assisted-by: Claude Code
IntegerAttr::get(Type, APInt) did not validate that the APInt's bit
width matched the expected bit width for the given type. For integer
types, the APInt width must equal the integer type's width. For index
types, the APInt width must equal IndexType::kInternalStorageBitWidth
(64 bits).
Passing an APInt with the wrong bit width could cause a
non-deterministic crash in StorageUniquer when comparing two IntegerAttr
instances for the same type but with different APInt widths.
This commit adds assertions in the get(Type, APInt) builder to catch
such misuse early in debug builds, providing a clear error message at
the call site rather than a cryptic crash in the storage uniquer.
Fixes#56401
Assisted-by: Claude Code
The `RawType` type alias is unused (`-Wunused-local-typedef`) in build
with asserts deactivated. In combination with `-Werror`, this causes
builds to fail.
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>