26579 Commits

Author SHA1 Message Date
Jared Hoberock
90ec5f2f62
[MLIR][test] Re-disable FileCheck on async.mlir integration test (#190702)
#190563 re-enabled FileCheck on `Integration/GPU/CUDA/async.mlir`, but
the buildbot has shown intermittent wrong-output failures
([example](https://lab.llvm.org/buildbot/#/builders/116/builds/27026)):
the test produces `[42, 42]` instead of the expected `[84, 84]`.

This wrong-output flakiness is distinct from the cleanup-time
`cuModuleUnload` errors that #190563 actually fixes — it's the
underlying issue tracked by #170833. The merged commit message for
#190563 incorrectly says `Fixes #170833`; that issue should be reopened,
since the cleanup-error fix doesn't address the wrong-output behavior.

This PR puts the test back in its previously-disabled state. The runtime
cleanup fix in #190563 is unaffected.
2026-04-07 01:14:56 +02:00
Jared Hoberock
7087ece044
[MLIR][ExecutionEngine] Tolerate CUDA_ERROR_DEINITIALIZED in mgpuModuleUnload (#190563)
`mgpuModuleUnload` may be called from a global destructor (registered by
`SelectObjectAttr`'s `appendToGlobalDtors`) after the CUDA primary
context has already been destroyed during program shutdown. In this
case, `cuModuleUnload` returns `CUDA_ERROR_DEINITIALIZED`, which is
benign since the module's resources are already freed with the context.

## Reproduction

Any program that uses `gpu.launch_func` and is AOT-compiled (via
`mlir-translate --mlir-to-llvmir | llc | cc -lmlir_cuda_runtime`) will
print `'cuModuleUnload(module)' failed with '<unknown>'` on exit. This
is because `SelectObjectAttr` registers the module unload as a global
destructor, which runs after the CUDA primary context is released.

This script reproduces the error message from `mgpuModuleUnload` on my
system:

```
#!/bin/bash
set -e

LLVM_BUILD=${LLVM_BUILD:-$HOME/dev/git/llvm-project-22/build}

cat > /tmp/repro.mlir << 'MLIR'
func.func @main() {
  %c1 = arith.constant 1 : index
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%bsx = %c1, %bsy = %c1, %bsz = %c1) {
    gpu.terminator
  }
  return
}
MLIR

$LLVM_BUILD/bin/mlir-opt /tmp/repro.mlir \
  -gpu-lower-to-nvvm-pipeline="cubin-format=fatbin" \
  | $LLVM_BUILD/bin/mlir-translate --mlir-to-llvmir -o /tmp/repro.ll

$LLVM_BUILD/bin/llc -relocation-model=pic -filetype=obj /tmp/repro.ll -o /tmp/repro.o

cc /tmp/repro.o \
  -L$LLVM_BUILD/lib -Wl,-rpath,$LLVM_BUILD/lib \
  -lmlir_cuda_runtime -lmlir_runner_utils -o /tmp/repro

echo "Running:"
/tmp/repro 2>&1
echo "Exit code: $?"
```
## Context

This matches how other projects handle the same shutdown ordering issue:
- Clang CUDA (D48613) switched module cleanup from
`__attribute__((destructor))` to `atexit()`
- GCC libgomp checks context validity before `cuModuleUnload`
- Apache TVM silently ignores `CUDA_ERROR_DEINITIALIZED` on module
unload

Fixes #170833
2026-04-06 21:11:58 +00:00
Jianhui Li
9bddf47198
[MLIR][XeGPU] Extend Wg-to-Sg Distribution of Multi-Reduction Op for round-robin layout (#189988)
This PR enhances the multi-reduction op pattern of the wg-to-sg
distribution pass:
1. allows each sg to have multiple distributions of sg_data tiles.
2. expands the slm buffer size.
3. constructs the layout based on the partially reduced vector and uses
layout.computeDistributedCoords() to compute coordinates. The layout is
constructed so that the store is cooperative and the load overlaps with
neighbouring threads.
4. performs the save and load.
2026-04-06 14:07:50 -07:00
adams381
9265f9284c
[mlir][ABI] Add writable, dead_on_unwind, dead_on_return, nofpclass param attrs to LLVM dialect (#188374)
The MLIR LLVM dialect is missing support for several parameter
attributes that exist in LLVM IR: `writable`, `dead_on_unwind`,
`dead_on_return`, and `nofpclass`. This adds them to the kind-to-name
mapping in `AttrKindDetail.h` and the corresponding name accessors in
`LLVMDialect.td`.

The existing generic conversion infrastructure in `ModuleTranslation`
and `ModuleImport` picks them up automatically: `writable` and
`dead_on_unwind` round-trip as `UnitAttr`, while `dead_on_return` and
`nofpclass` round-trip as `IntegerAttr`.

CIR needs these to match classic codegen's ABI output (sret gets
`writable dead_on_unwind`, indirect args get `dead_on_return`, fast-math
FP args get `nofpclass`).
2026-04-06 11:26:11 -05:00
Eric Feng
930ef7736e
[mlir][amdgpu] Add optional write mask to amdgpu.global_load_async_to_lds (#190498) 2026-04-06 09:21:32 -07:00
Srinivasa Ravi
63231ebfe7
[MLIR][NVVM] Add new narrow FP convert Ops (#184291)
This change adds the following NVVM Ops for new narrow FP conversions
introduced in PTX 9.1:
- `convert.{f32x2/bf16x2}.to.s2f6x2`
- `convert.s2f6x2.to.bf16x2`
- `convert.bf16x2.to.f8x2` (extended for `f8E4M3FN` and `f8E5M2` types)
- `convert.{f16x2/bf16x2}.to.f6x2`
- `convert.{f16x2/bf16x2}.to.f4x2`

PTX ISA Reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt
2026-04-06 12:06:25 +05:30
lonely eagle
ce1a9fd766
Reland "[mlir][reducer] Add eraseRedundantBlocksInRegion and getSuccessorForwardOperands API to BranchOpInterface" (#189253)
After fixing the undefined symbol and memory leak issues (see the
previous issue, https://github.com/llvm/llvm-project/pull/189150), this
PR relands https://github.com/llvm/llvm-project/pull/187864.
2026-04-06 12:54:15 +08:00
lonely eagle
d08bb68080
[mlir][reducer] Add opt-pass-file option to opt-reduction pass (#189353)
Currently, the opt-reduction-pass only supports inputting the
optimization pipeline via the command line, which becomes cumbersome
when the pipeline is long. To address this, this PR introduces the
opt-pass-file option. This allows users to save the pipeline in a file
and provide the filename to parse the pipeline.
2026-04-05 12:55:21 +08:00
Ivan R. Ivanov
420111e9e4
[mlir][LLVM] Fix incorrect verification of atomicrmw f{min,max}imumnum (#190474)
Fix llvm.atomicrmw fminimumnum and fmaximumnum to correctly take the
float operation verification path.
2026-04-04 17:17:37 +00:00
lonely eagle
230d757a13
[mlir][reducer] make opt-reduction pass clone topOp after check (NFC) (#189356)
To avoid potential memory leaks, this PR defers the ModuleOp cloning
until after the verification check. In the original implementation, if
the check failed, the moduleVariant might not be properly deallocated,
leading to a memory leak. This PR therefore ensures that the clone
operation is only performed after a successful check. It is part of
https://github.com/llvm/llvm-project/pull/189353.
2026-04-04 22:08:34 +08:00
Yue Huang
38f8945362
[MLIR][Presburger] Fix stale pivot in Smith normal form (#189789)
The pivot used to fix divisibility in Smith normal form is stale. This
will not affect correctness, but can lower efficiency since the outer
loop will be executed more times.

Thanks to @benquike for discovering this.
2026-04-03 23:17:57 -07:00
Nishant Patel
6f68502a98
[MLIR][XeGPU] Port tests from the XeGPUSubgroupDistribute to XeGPUSgToWiDistributeExperimental (#189747)
This PR ports tests from subgroup-distribute.mlir (old pass) to
sg-to-wi-experimental.mlir (new pass)
2026-04-03 17:14:16 -07:00
Nishant Patel
b5936d4989
[MLIR][XeGPU] Remove verifyLayouts from sg to wi pass (#190360)
The verifyLayouts function walked the IR before distribution and failed
the pass if any XeGPU anchor op or vector-typed result was missing a
layout attribute. This was added as a temporary guard while the pass was
being developed.
Now we add a target check for each op instead.
2026-04-03 17:00:05 -07:00
Nishant Patel
9c0a9bb3c6
[MLIR][XeGPU] Add support for reducing to scalar in sg to wi pass (#190193) 2026-04-03 16:56:28 -07:00
Tom Tromey
8d34545792
Introduce and use Verifier::visitDIType (#189067)
This adds a new method Verifier::visitDIType, and then changes the
methods for subclasses of DIType to call it. The new method just dispatches to
DIScope and adds a file/line check inspired by
Verifier::visitDISubprogram.
2026-04-03 12:40:37 -06:00
Md Abdullah Shahneous Bari
ffd29734cc
[mlir][gpu] Extend mgpumoduleLoadJIT API to add assemblySize parameter (#189429)
When JITing SPIR-V using the LevelZero API, the API expects the length
of the input, since the data is passed as a `void *`. The length cannot
be recovered with something like
`strlen(reinterpret_cast<char *>(data))` in the `mgpuModuleLoadJIT`
implementation, because the SPIR-V binary contains null bytes (i.e.,
the data is binary SPIR-V, not null-terminated text).

As a result, we need to pass the `assemblySize` via
`mgpuModuleLoadJIT(void* data, int optLevel, size_t assemblySize)`.
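
The problem with `strlen` on binary data can be illustrated with a short Python sketch. This is not the runtime's code: `c_strlen` is a hypothetical stand-in for C's `strlen`, and the two-word buffer (the SPIR-V magic number followed by a version word) is only an assumed, truncated module prefix.

```python
import struct

# First two words of a little-endian SPIR-V module: the magic number and a
# version word (1.0 here; the version is chosen for illustration only).
SPIRV_MAGIC = 0x07230203
module_prefix = struct.pack("<II", SPIRV_MAGIC, 0x00010000)


def c_strlen(data: bytes) -> int:
    """Hypothetical model of C strlen: count bytes before the first NUL.

    Real strlen would read past the buffer if no NUL were present; this
    model just returns the buffer length in that case.
    """
    end = data.find(b"\x00")
    return len(data) if end == -1 else end


# The buffer holds 8 bytes, but a strlen-style scan stops at the first
# zero byte of the version word.
actual_size = len(module_prefix)
scanned_size = c_strlen(module_prefix)
```

Because the scan stops early, the size has to travel alongside the pointer, which is what the extra parameter provides.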
2026-04-03 12:45:40 -05:00
Mehdi Amini
8c81064169
[MLIR][Arith] Fix index_cast/index_castui chain folding to check intermediate width (#189042)
The patterns `IndexCastOfIndexCast` and `IndexCastUIOfIndexCastUI` in
ArithCanonicalization.td incorrectly eliminated a pair of index casts
whenever the outer result type equalled the original source type,
without verifying that the intermediate cast was lossless.

For example, the following was wrongly folded to `%arg0`:
  %0 = index_castui %arg0 : i64 to index
  %1 = index_castui %0    : index to i8    ← truncates to 8 bits
  %2 = index_castui %1    : i8 to index    ← incorrectly removed

The pattern matched `%1`/`%2` because `i8.to(index)` has the same result
type as `i64.to(index)`, even though the i8 intermediate silently drops
56 bits. The same bug existed for the signed `index_cast` variant.

Fix: move the optimization into the `fold` methods of `IndexCastOp` and
`IndexCastUIOp` with an explicit check that the intermediate type is at
least as wide as the source type (using
`IndexType::kInternalStorageBitWidth` as the representative width for
`index`). Only then is the round-trip guaranteed lossless and the chain
can be collapsed.
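
The losslessness condition can be sketched with a small integer model in Python. This is an illustration under assumptions, not the MLIR implementation: `castui`, `round_trip`, and `can_fold_chain` are hypothetical helpers, and `index` is modelled as 64 bits wide.

```python
INDEX_BITS = 64  # assumed width of `index`, as in kInternalStorageBitWidth


def castui(value: int, to_bits: int) -> int:
    """Model an unsigned cast: keep only the low `to_bits` bits."""
    return value & ((1 << to_bits) - 1)


def round_trip(value: int, src_bits: int, mid_bits: int) -> int:
    """Cast src -> mid -> src, as in a chain of two index_castui ops."""
    return castui(castui(value, mid_bits), src_bits)


def can_fold_chain(src_bits: int, mid_bits: int) -> bool:
    """The chain is an identity only if the intermediate type is at least
    as wide as the source type."""
    return mid_bits >= src_bits


# index -> i8 -> index drops the high bits, so folding would be wrong.
lossy = round_trip(300, INDEX_BITS, 8)
# i8 -> index -> i8 preserves every value, so folding is safe.
lossless = round_trip(200, 8, INDEX_BITS)
```

In this model, `300` comes back as `44` after passing through `i8`, which is exactly the kind of silent truncation the old pattern ignored.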

Fixes #90238
Fixes #90296


Assisted-by: Claude Code
2026-04-03 16:05:08 +02:00
Mehdi Amini
c2ec012098
[mlir][linalg] Fix crash in tile_reduction when output map has constant exprs (#189166)
`generateInitialTensorForPartialReduction` and the `getInitSliceInfo*`
helpers unconditionally cast every result expression of the partial
result AffineMap to `AffineDimExpr`. When the original output indexing
map contains a constant (e.g. `affine_map<(d0,d1,d2)->(d0,0,d2)>`), the
constant expression propagates into the partial map and the cast
triggers an assertion.


Fixes #173025

Assisted-by: Claude Code
2026-04-03 11:09:26 +00:00
Mehdi Amini
73bcfb6824
[mlir][Affine] Fix LICM incorrectly hoisting stores from zero-trip-count loops (#189165)
The affine-loop-invariant-code-motion pass was hoisting side-effectful
operations (e.g. affine.store) out of loops whose trip count is
statically known to be zero. This caused stores to execute
unconditionally even though the loop body should never run, producing
incorrect results.

The fix skips hoisting of non-memory-effect-free ops when
getConstantTripCount returns 0. Pure/side-effect-free ops are still
eligible for hoisting because they cannot change observable program
state.
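
A minimal Python sketch of why hoisting the store is wrong for a zero-trip loop. The function names are hypothetical, and `can_hoist` models only the new trip-count guard, assuming all other invariance conditions already hold.

```python
from typing import List, Optional


def run_original(memory: List[int], trip_count: int) -> List[int]:
    """The store sits inside the loop, so it runs only if the loop does."""
    for _ in range(trip_count):
        memory[0] = 42
    return memory


def run_hoisted(memory: List[int], trip_count: int) -> List[int]:
    """The store was hoisted above the loop and now runs unconditionally."""
    memory[0] = 42
    for _ in range(trip_count):
        pass
    return memory


def can_hoist(side_effect_free: bool, constant_trip_count: Optional[int]) -> bool:
    """Model of the guard: ops with side effects stay put when the trip
    count is statically known to be zero (None means unknown)."""
    if constant_trip_count == 0 and not side_effect_free:
        return False
    return True


before = run_original([0], trip_count=0)
after = run_hoisted([0], trip_count=0)
```

With a trip count of zero, the two versions leave memory in different states, so the transformation is observable and must be rejected for stores.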

Fixes #128273

Assisted-by: Claude Code
2026-04-03 13:07:26 +02:00
Mehdi Amini
ff86be21de
[MLIR][MemRef] Fix AllocOp/AllocaOp flattening domination violation (#188980)
The generic MemRefRewritePattern handles AllocOp/AllocaOp by calling
getFlattenMemrefAndOffset with the op's own result as the source memref.
This inserts ExtractStridedMetadataOp and ReinterpretCastOp that consume
op.result before the alloc op itself in the block. After
replaceOpWithNewOp, op.result is RAUW'd to the new ReinterpretCastOp
result, leaving those earlier ops with forward references — a domination
violation caught by MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS.

Replace the AllocOp/AllocaOp cases in MemRefRewritePattern with a
dedicated AllocLikeFlattenPattern that never touches op.result until the
final replaceOpWithNewOp:
- sizes come from op.getMixedSizes() (operands, not the result)
- strides come from getStridesAndOffset on the MemRefType
- the flat allocation size is computed via
getLinearizedMemRefOffsetAndSize plus the static base offset so the
buffer covers [0, offset+extent)
- castAllocResult is simplified to take the pre-computed sizes and
strides rather than inserting an ExtractStridedMetadataOp on the
original op
- non-zero static base offsets are now correctly preserved in the
reinterpret_cast (the old code hardcoded offset=0, which was a verifier
error for layouts with offset != 0)
- dynamic offsets or strides bail out via notifyMatchFailure

Also remove the now-dead AllocOp/AllocaOp branches from replaceOp() and
the constexpr specialisation in getIndices().

Assisted-by: Claude Code
2026-04-03 11:21:00 +02:00
Mehdi Amini
d725513e7d
[MLIR][Affine] Fix null operands in simplifyConstrainedMinMaxOp (#189246)
`mlir::affine::simplifyConstrainedMinMaxOp` called
`canonicalizeMapAndOperands` with `newOperands` that could contain null
`Value()`s. These nulls came from
`unpackOptionalValues(constraints.getMaybeValues(), newOperands)` where
internal constraint variables added by `appendDimVar` (for `dimOp`,
`dimOpBound`, and `resultDimStart*`) have no associated SSA values.

Passing null Values to `canonicalizeMapAndOperands` risks undefined
behavior:
- `seenDims.find(null_value)` in the DenseMap causes all null operands
to collide at the same key, producing incorrect dim remapping.
- Any null operand that remains referenced in the result map would
propagate as a null Value into `AffineValueMap`, crashing callers that
try to use those operands to create ops.

Fix: Before calling `canonicalizeMapAndOperands`, filter null operands
from `newOperands` by replacing their dim/symbol positions in `newMap`
with constant 0 (safe because internal constraint dims should not appear
in the bound map expression) and compacting `newOperands` to contain
only non-null Values.

Fixes #127436

Assisted-by: Claude Code
2026-04-03 10:17:50 +02:00
Zhewen Yu
a7bf24919f
[mlir][IntRangeAnalysis] Fix assertion in inferAffineExpr for mod with range crossing modulus boundary (#188842)
The "small range with constant divisor" optimization in
`inferAffineExpr` for `AffineExprKind::Mod` assumed that if the dividend
range span (`lhsMax - lhsMin`) is less than the divisor, then the mod
results form a contiguous range. This is not always true, as the range
can straddle a modulus boundary.

For example, `[14, 17] mod 8`:
- Span is 3 < 8, so the old condition passed
- But `14%8=6` and `17%8=1` (wraps at 16)
- `umin=6, umax=1` → assertion `umin.ule(umax)` fails

The fix adds a same-quotient check (`lhsMin/rhs == lhsMax/rhs`) to
ensure both endpoints fall within the same modular period. When they
don't, we fall back to the conservative `[0, divisor-1]` range.
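
The same-quotient check can be sketched in Python as follows. This is a simplified model for non-negative dividends and a positive constant divisor; `infer_mod_range` is a hypothetical name, not the function in the analysis.

```python
from typing import Tuple


def infer_mod_range(lhs_min: int, lhs_max: int, divisor: int) -> Tuple[int, int]:
    """Return (umin, umax) bounding x % divisor for x in [lhs_min, lhs_max].

    Assumes 0 <= lhs_min <= lhs_max and divisor > 0.
    """
    small_span = lhs_max - lhs_min < divisor
    same_period = lhs_min // divisor == lhs_max // divisor
    if small_span and same_period:
        # Both endpoints lie in one modular period, so the results are
        # contiguous and ordered.
        return (lhs_min % divisor, lhs_max % divisor)
    # The range straddles a multiple of the divisor: be conservative.
    return (0, divisor - 1)


straddling = infer_mod_range(14, 17, 8)   # 14%8=6, 17%8=1: wraps at 16
contained = infer_mod_range(9, 12, 8)     # stays inside one period
```

Without the `same_period` test, the first call would return `(6, 1)`, the inverted range that tripped the assertion.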

Assisted-by: Cursor (Claude)

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
2026-04-03 10:15:52 +02:00
lonely eagle
8db1f6492a
[mlir][reducer] Remove the restriction that OptReductionPass must be a ModuleOp (#189038)
This PR aims to make the pass more generic by removing the ModuleOp
restriction. This PR reimplements the logic using a standalone
PassManager. Additionally, the isInteresting method has been updated to
accept Operation* for better flexibility. Finally, a dedicated test
directory has been added to improve the organization of OptReductionPass
tests.
2026-04-03 14:49:01 +08:00
xys-syx
1474e3e4f4
[MLIR][NVVM] Derive NVVM_SyncWarpOp from NVVM_IntrOp for import support (#188415)
Change `NVVM_SyncWarpOp` base class from `NVVM_Op` to
`NVVM_IntrOp<"bar.warp.sync">`, which auto-generates `llvmEnumName =
nvvm_bar_warp_sync` and registers it with
`-gen-intr-from-llvmir-conversions` and
`-gen-convertible-llvmir-intrinsics`. This enables LLVM IR to MLIR
import. The hand-written `llvmBuilder` is removed as the default
`LLVM_IntrOpBase` builder is equivalent.
2026-04-02 15:41:50 -04:00
Bangtian Liu
86b5f11ecc
[mlir][GPU] Add constant address space to GPU dialect (#190211)
This PR adds a `constant` address space to the GPU dialect and
lowerings to all GPU backends.

Signed-off-by: Bangtian Liu <liubangtian@gmail.com>
2026-04-02 15:02:12 -04:00
Leandro Lupori
d59356aac5
Revert "Reland "[flang][OpenMP] Fix lowering of LINEAR iteration variables (#183794)"" (#190180)
Reverts llvm/llvm-project#188851
2026-04-02 11:09:30 -03:00
Mehdi Amini
a36f821e77
[mlir][linalg] Add test for ReduceOp empty-input verifier; remove dead empty-output check (#189614)
Add a FileCheck test covering the 'expected at least one input' error in
ReduceOp::verify(). The companion 'expected at least one output' check
was dead code: SameVariadicOperandSize fires first whenever
inputs.size() != inits.size(), and when both are empty the input check
fires first; remove the unreachable branch.

Assisted-by: Claude Code
2026-04-02 15:57:53 +02:00
Dhruv Chauhan
b87be02cc7
Revert "[mlir][tensor] Forward concat insert_slice destination into DPS provider" (#190143)
This reverts commit 1418f80.

The change can cause an infinite rewrite loop when
ForwardConcatInsertSliceDest interacts with
FoldEmptyTensorWithExtractSliceOp.
2026-04-02 14:48:44 +01:00
Julian Oppermann
018e048daf
[MLIR][Linalg] Generic to category specialization for unary elementwise ops (#187217)
Handle specialization of `linalg.generic` ops representing a unary
elementwise computation to the `linalg.elementwise` category op. This
implements a previously absent path in the linalg morphism.
2026-04-02 10:50:21 +02:00
yebinchon
495e1a4257
[mlir] added a check in the walk to prevent catching a cos in a nested region (#190064)
The walk in SincosFusion may detect a cos within a nested region of the
sin block. This triggers an assertion in `isBeforeInBlock` later on.
Added a check within the walk so it filters operations in nested
regions, which are not in the same block and should not be fused anyway.

---------

Co-authored-by: Yebin Chon <ychon@nvidia.com>
2026-04-01 20:10:56 -07:00
Nishant Patel
b3ca423a78
[MLIR][Vector] Enhance vector.multi_reduction unrolling to handle scalar result (#188633)
Previously, UnrollMultiReductionPattern bailed out when all the
dimensions were reduced to a scalar. This PR adds support for this case
by tiling the source vector and chaining partial reductions through the
accumulator operand.
2026-04-01 14:59:08 -07:00
Jianhui Li
1a1fbf967a
[MLIR][XeGPU] Support round-robin layout for constant and broadcast in wg-to-sg distribution (#189798)
As title.
2026-04-01 14:58:01 -07:00
Nishant Patel
9f50004651
[MLIR][XeGPU] Enhance the peephole optimization to remove the convert_layout after multi-reduction rewrite (#188849) 2026-04-01 13:55:11 -07:00
Jianhui Li
401ba6df84
[MLIR][XeGPU] Add Layout Propagation support for multi-reduction/reduction op with scalar result (#189133)
This PR adds Layout Propagation support for multi-reduction/reduction ops
with scalar result:
1) Enhance setupMultiReductionResultLayout() and
LayoutInfoPropagation::visitVectorMultiReductionOp() to support scalar
result
2) Add propagation support for vector.reduction op at the lane level,
since the op is only introduced at the lane level.
2026-04-01 13:01:34 -07:00
Jeff Sandoval
95a76886c1
[OpenMP][MLIR] Fix GPU teams reduction buffer size for by-ref reductions (#185460)
The `ReductionDataSize` field in `KernelEnvironmentTy` and the
`MaxDataSize` used to compute the `reduce_data_size` argument to
`__kmpc_nvptx_teams_reduce_nowait_v2` were both computed using pointer
types for by-ref reductions instead of the actual element types. This
caused the global teams reduction buffer to be undersized relative to
the offsets used by the copy/reduce callbacks, resulting in
out-of-bounds access faults at runtime.

For example, a by-ref reduction over `[4 x i32]` (16 bytes) would
allocate buffer slots based on `sizeof(ptr)` = 8 bytes, but the
generated callbacks would access 16 bytes per slot.
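
The size mismatch in this example can be reproduced with a small arithmetic model in Python. It is only a sketch: `reduction_data_size` is a hypothetical helper, an 8-byte pointer is assumed, and alignment and padding of the real buffer struct are ignored.

```python
from typing import List, Tuple

PTR_SIZE = 8  # assumed sizeof(ptr) on a 64-bit target


def reduction_data_size(reductions: List[Tuple[int, bool]],
                        use_element_type: bool) -> int:
    """Sum per-slot sizes for a list of (element_size, is_by_ref) pairs.

    With use_element_type=False, by-ref reductions are sized as pointers
    (the behaviour described as buggy); with True, they are sized by the
    element type they point to.
    """
    total = 0
    for element_size, is_by_ref in reductions:
        if is_by_ref and not use_element_type:
            total += PTR_SIZE
        else:
            total += element_size
    return total


# One by-ref reduction over [4 x i32]: 4 elements * 4 bytes = 16 bytes.
array_reduction = [(4 * 4, True)]
undersized = reduction_data_size(array_reduction, use_element_type=False)
correct = reduction_data_size(array_reduction, use_element_type=True)
```

The callbacks touch 16 bytes per slot, so a buffer sized from the pointer type leaves half of each slot out of bounds.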

Fix both computation sites:

1. In MLIR's `getReductionDataSize()`, use
`DeclareReductionOp::getByrefElementType()` instead of `getType()` when
the reduction is by-ref, so the reduction buffer struct layout (and more
importantly its size) matches that emitted by the `OMPIRBuilder`.

2. In `OMPIRBuilder::createReductionsGPU()`, use
`ReductionInfo::ByRefElementType` instead of `ElementType` for by-ref
reductions when computing `MaxDataSize`. It seems that `MaxDataSize`
isn't actually used in the deviceRTL, but it's better to fix it to avoid
future propagation of this bug.

Finally, add CHECK lines to the existing array-descriptor reduction test
to verify both the kernel environment `ReductionDataSize` and the
`reduce_data_size` call argument reflect the actual element type size.

Assisted-by: Claude Opus 4.6

---------

Co-authored-by: Jeffrey Sandoval <jeffrey.sandoval@hpe.com>
2026-04-01 14:59:16 -05:00
Nishant Patel
150aa6f2d3
[MLIR][XeGPU] Add support for convert layout with scalar in Sg to WI distribution (#189721) 2026-04-01 12:05:32 -07:00
Zhen Wang
3b3b556a12
[mlir][NVVM] Add managed attribute for global variables (#189751)
Add support for the `nvvm.managed` attribute on `llvm.mlir.global` ops.
When present, the LLVM IR translation emits `!nvvm.annotations` metadata
with `!"managed"` for the global variable, which the NVPTX backend uses
to generate `.attribute(.managed)` in PTX output.

This enables CUDA managed memory support for frontends that lower
through MLIR.
2026-04-01 18:18:20 +00:00
Vito Secona
fbf484009c
[mlir][sparse] add GPU num threads to sparsifier options (#189078)
This change adds a `gpu-num-threads` option to the sparsifier. This
allows users to specify the number of threads used for GPU codegen,
similar to the `num-threads` option in the `-sparse-gpu-codegen` pass.
2026-04-01 10:42:26 -07:00
Jan Leyonberg
91adaeceb1
[CIR][MLIR][OpenMP] Enable the MarkDeclareTarget pass for ClangIR (#189420)
This patch enables the MarkDeclareTarget for CIR by adding the pass to
the lowerings and attaching the declare target interface to the
cir::FuncOp. The MarkDeclareTarget is also generalized to work on the
FunctionOpInterface instead of func::Op since it needs to be able to
handle cir::FuncOp as well.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-01 12:50:09 -04:00
Razvan Lupusoru
9506f20b4d
[flang][acc] Add AA implementation for acc operations (#189772)
This PR extends flang's alias analysis so it can reason about values
that originate from OpenACC data and privatization operations, including
values passed through block arguments.
2026-04-01 09:03:47 -07:00
Ming Yan
158f10fe24
[mlir][memref] Fold memref.reinterpret_cast operations with valid offset or size constants. (#189533)
When encountering an invalid offset or size, we only skip the current
invalid value and continue attempting to fold other valid offsets or
sizes.
2026-04-01 15:30:12 +02:00
AidinT
461a1c51bf
[mlir][ods] resolve the wrong indent issue (#189277)
The `emitSummaryAndDescComments` is used to generate summary and
description for tablegen generated classes and structs such as Dialects
and Interfaces. The generated summary and description is indented
incorrectly in the output generated file. For example
`NVGPUDialect.h.inc` looks like the following:

```cpp
namespace mlir::nvgpu {

/// The `NVGPU` dialect provides a bridge between higher-level target-agnostic
///     dialects (GPU and Vector) and the lower-level target-specific dialect
///     (LLVM IR based NVVM dialect) for NVIDIA GPUs. This allow representing PTX
///     specific operations while using MLIR high level dialects such as Memref
///     and Vector for memory and target-specific register operands, respectively.
class NVGPUDialect : public ::mlir::Dialect {
    ...
  };

} // namespace mlir::nvgpu
```

This is because the `emitSummaryAndDescComments` trims the summary and
description from both sides, rendering the re-indentation useless. This
PR resolves this bug.
2026-04-01 15:20:54 +02:00
Erick Ochoa Lopez
5072c020aa
[mlir][vector] Drop trailing 1-dims from constant_mask (#187383)
Generalize TransferReadDropUnitDimsPattern to also drop unit dimensions
when `vector::ConstantMaskOp` is used.

Previously TransferReadDropUnitDimsPattern would only drop unit
dimensions when `vector::CreateMaskOp` with a statically known operand
was used.

Assisted-by: Cursor
2026-04-01 09:05:25 -04:00
Mehdi Amini
f6ffdbcbae
[MLIR][Affine] Fix dead store elimination for vector stores with different types (#189248)
affine-scalrep's findUnusedStore incorrectly classified an
affine.vector_store as dead when a subsequent store wrote to the same
base index but with a smaller vector type. A vector<1xi64> store at
[0,0] does not fully overwrite a vector<5xi64> store at [0,0], so the
first store must be preserved.

The loadCSE function in the same file already had the correct
type-equality check for loads; this patch adds the analogous check for
stores in findUnusedStore.
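
The following Python sketch models why the first store is not dead. Memory is a flat list, a store is a `(base, values)` pair, and `is_dead_store` is a hypothetical stand-in for the check; it mirrors the described same-type condition by comparing vector lengths, under the assumption that both stores use the same base index and element type.

```python
from typing import List, Optional, Tuple

Store = Tuple[int, List[int]]  # (base index, vector of values)


def apply_stores(stores: List[Store], size: int = 8) -> List[Optional[int]]:
    """Apply vector stores in order to a flat memory of `size` slots."""
    memory: List[Optional[int]] = [None] * size
    for base, values in stores:
        memory[base:base + len(values)] = values
    return memory


def is_dead_store(earlier: Store, later: Store) -> bool:
    """Treat `earlier` as dead only if `later` writes the same base index
    with the same vector length (the analogue of the type-equality check)."""
    return earlier[0] == later[0] and len(earlier[1]) == len(later[1])


wide = (0, [1, 2, 3, 4, 5])   # models a vector<5xi64> store at index 0
narrow = (0, [9])             # models a vector<1xi64> store at index 0

with_both = apply_stores([wide, narrow])
without_wide = apply_stores([narrow])
```

Dropping the wide store changes four of the five elements it wrote, so it has to be kept even though a later store hits the same base index.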

Fixes #113687

Assisted-by: Claude Code
2026-04-01 10:40:53 +00:00
lonely eagle
6b2b0da40d
[mlir][CSE] Fix double-counting of numCSE statistic (#189802)
This PR fixes a regression where the numCSE statistic was being
incremented twice for a single operation elimination. The numCSE counter
is already internally incremented within the replaceUsesAndDelete
function. Manually incrementing it again after the function call leads
to an inaccurate total count. This is part of
https://github.com/llvm/llvm-project/pull/180556.
2026-04-01 17:10:20 +08:00
Mehdi Amini
249e871fa4
[MLIR][ArithToLLVM] Fix index_cast on memref types generating invalid LLVM IR (#189227)
`arith.index_cast` and `arith.index_castui` accept memref operands (via
`IndexCastTypeConstraint`), but `IndexCastOpLowering::matchAndRewrite`
did not handle this case. When the operand was a memref, the conversion
framework substituted the converted LLVM struct type, and the lowering
incorrectly attempted to emit `llvm.sext`/`llvm.zext`/`llvm.trunc` on a
struct value, producing invalid LLVM IR.

Since LLVM uses opaque pointers, all memrefs with integer or index
element types lower to the same `!llvm.struct<(ptr, ptr, i64, ...)>`
type, making `arith.index_cast` on memrefs a no-op at the LLVM level.
Add a check that treats the memref case as an identity conversion (same
as the same-bit-width path).

Fixes #92377

Assisted-by: Claude Code
2026-04-01 11:03:14 +02:00
Mehdi Amini
b1f8c28559
[MLIR] Validate APInt bitwidth in IntegerAttr::get(Type, APInt) (#188725)
IntegerAttr::get(Type, APInt) did not validate that the APInt's bit
width matched the expected bit width for the given type. For integer
types, the APInt width must equal the integer type's width. For index
types, the APInt width must equal IndexType::kInternalStorageBitWidth
(64 bits).

Passing an APInt with the wrong bit width could cause a
non-deterministic crash in StorageUniquer when comparing two IntegerAttr
instances for the same type but with different APInt widths.

This commit adds assertions in the get(Type, APInt) builder to catch
such misuse early in debug builds, providing a clear error message at
the call site rather than a cryptic crash in the storage uniquer.

Fixes #56401

Assisted-by: Claude Code
2026-04-01 10:47:36 +02:00
Lukas Sommer
6a31be68e3
[mlir][NFC] Remove conditionally unused type alias (#189894)
The `RawType` type alias is unused (`-Wunused-local-typedef`) in builds
with assertions disabled. In combination with `-Werror`, this causes
builds to fail.

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
2026-04-01 10:25:58 +02:00
AidinT
585e2a015b
[MLIR] Convert BytecodeDialectInterface to ods (#188852)
This PR converts `BytecodeDialectInterface` to ODS.
2026-04-01 05:41:07 +02:00
AidinT
d52a5e8a5a
[MLIR] convert ConvertToEmitCPatternInterface to ODS (#188621)
This PR converts `ConvertToEmitCPatternInterface` dialect interface to ODS. Also makes changes to derived classes.
2026-04-01 05:30:12 +02:00