lifetime.start and lifetime.end are primarily intended for use on
allocas, to enable stack coloring and other liveness optimizations. This
is necessary because all (static) allocas are hoisted into the entry
block, so lifetime markers are the only way to convey the actual
lifetimes.
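For reference, the intended usage brackets the live range of an entry-block alloca (a minimal sketch; names and sizes are illustrative):
```llvm
declare void @llvm.lifetime.start.p0(i64, ptr)
declare void @llvm.lifetime.end.p0(i64, ptr)

define void @f() {
entry:
  %buf = alloca [16 x i8]
  call void @llvm.lifetime.start.p0(i64 16, ptr %buf)
  ; %buf is considered live only between the two markers
  call void @llvm.lifetime.end.p0(i64 16, ptr %buf)
  ret void
}
```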
However, lifetime.start and lifetime.end are currently *allowed* to be
used on non-alloca pointers. We don't actually do this in practice, but
the mere fact that it is possible breaks the core purpose of the
lifetime markers, which is stack coloring of allocas. Stack coloring can
only work correctly if all lifetime markers for an alloca are
analyzable.
* If a lifetime marker may operate on multiple allocas via a select/phi,
we don't know which lifetime actually starts/ends, so we handle it
incorrectly (https://github.com/llvm/llvm-project/issues/104776; see the
sketch after this list).
* Stack coloring operates on the assumption that all lifetime markers
are visible, and not, for example, hidden behind a function call or
escaped pointer. It's not possible to change this, as part of the
purpose of lifetime markers is that they work even in the presence of
escaped pointers, where simple use analysis is insufficient.
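To illustrate the first bullet (a minimal sketch, names illustrative): once the marker's operand is a select, stack coloring cannot tell whose lifetime starts:
```llvm
; %a and %b are both allocas.
%p = select i1 %c, ptr %a, ptr %b
call void @llvm.lifetime.start.p0(i64 8, ptr %p)
```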
I don't think there is any way to have coherent semantics for lifetime
markers on allocas, while also permitting them on arbitrary pointer
values.
This PR restricts lifetimes to operate on allocas only. As a follow-up,
I will also drop the size argument, which is superfluous if we always
operate on an alloca. (This change also renders dead various code that
handles lifetime markers on non-alloca pointers; I plan to clean that up
after dropping the size argument as well.)
In practice, I've only found a few places that currently produce
lifetimes on non-allocas:
* CoroEarly replaces the promise alloca with the result of an intrinsic,
which will later be replaced back with an alloca. I think this is the
only place where there is some legitimate loss of functionality, but I
don't think this is particularly important (I don't think we'd expect
the promise in a coroutine to admit useful lifetime optimization.)
* SafeStack moves unsafe allocas onto a separate frame. We can safely
drop lifetimes here, as SafeStack performs its own stack coloring.
* Similarly for AddressSanitizer, which also moves allocas into separate
memory.
* LSR sometimes replaces the lifetime argument with a GEP chain of the
alloca (where the offsets ultimately cancel out). This is just
unnecessary. (Fixed separately in
https://github.com/llvm/llvm-project/pull/149492.)
* InferAddrSpaces sometimes makes lifetimes operate on an addrspacecast
of an alloca. I don't think this is necessary.
Before this PR, this was valid
```
%0 = llvm.mlir.constant(dense<[1, 2]> : vector<2xi32>) : vector<2xf32>
```
but this was not:
```
%0 = llvm.mlir.constant(1 : i32) : f32
```
because only scalar types were checked for compatibility, not the
element types of nested types. This PR additionally verifies the float
semantics of the attribute and result types. Before this PR,
```
%cst = llvm.mlir.constant(1.0 : bf16) : f16
```
was considered valid (because bf16 and f16 both have 16 bits), but with
this PR it is not. This PR also moves all tests for the verifier of the
LLVM constant op into a single file. To summarize the state after this
PR:
Invalid:
```mlir
%0 = llvm.mlir.constant(dense<[128, 1024]> : vector<2xi32>) : vector<2xf32>
%0 = llvm.mlir.constant(dense<[128., 1024.]> : vector<2xbf16>) : vector<2xf16>
```
Valid:
```mlir
%0 = llvm.mlir.constant(dense<[128., 1024.]> : vector<2xf32>) : vector<2xi32>
%0 = llvm.mlir.constant(dense<[128, 1024]> : vector<2xi64>) : vector<2xi8>
```
and the same valid/invalid distinctions apply in the scalar case.
Clean up the outer logic of the affine loop tiling pass. A wrongly named
temporary method was exposed publicly; fix that. Remove the
unconditional emission of remarks.
This commit fixes a check in the validation pass that is intended to
validate whether a `tosa.cond_if` operation conforms to the
specification. The specification requires that all values used in the
then/else regions be explicitly declared within those regions. This
change checks that these regions are 'isolated from above' to enforce
this requirement.
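For example, a `cond_if` in the following form (mirroring the dialect documentation's example) captures `%arg0` and `%arg1` from the enclosing region, so the new check flags it as non-conformant:
```mlir
%0 = tosa.cond_if %cond -> (tensor<f32>) {
  %1 = tosa.add %arg0, %arg1 : (tensor<f32>, tensor<f32>) -> tensor<f32>
  tosa.yield %1 : tensor<f32>
} else {
  %1 = tosa.sub %arg0, %arg1 : (tensor<f32>, tensor<f32>) -> tensor<f32>
  tosa.yield %1 : tensor<f32>
}
```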
This patch deletes `vector.matrix_multiply` and `vector.flat_transpose`,
which are thin wrappers around the corresponding LLVM intrinsics:
- `llvm.intr.matrix.multiply`
- `llvm.intr.matrix.transpose`
These Vector dialect ops did not provide additional semantics or
abstraction beyond the LLVM intrinsics. Their removal simplifies the
lowering pipeline without losing any functionality.
The lowering chains:
- `vector.contract` → `vector.matrix_multiply` →
`llvm.intr.matrix.multiply`
- `vector.transpose` → `vector.flat_transpose` →
`llvm.intr.matrix.transpose`
are now replaced with:
- `vector.contract` → `llvm.intr.matrix.multiply`
- `vector.transpose` → `llvm.intr.matrix.transpose`
This was accomplished by directly replacing:
- `vector::MatrixMultiplyOp` with `LLVM::MatrixMultiplyOp`
- `vector::FlatTransposeOp` with `LLVM::MatrixTransposeOp`
Note: to avoid a build-time dependency from `Vector` to `LLVM`, the
relevant transformations are moved from `Vector/Transforms` to
`Conversion/VectorToLLVM`.
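For reference, the LLVM dialect op that `vector.contract` now targets directly looks like this (shapes illustrative, following the LLVM dialect documentation):
```mlir
// Multiply a 2x3 matrix by a 3x2 matrix, both flattened into 1-D vectors.
%c = llvm.intr.matrix.multiply %a, %b
  {lhs_rows = 2 : i32, lhs_columns = 3 : i32, rhs_columns = 2 : i32}
  : (vector<6xf32>, vector<6xf32>) -> vector<4xf32>
```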
This PR fixes a missing validation in `isaCopyOpInterface` by checking
that the `linalg.yield` operand is identical to the first block
argument, indicating a direct copy. Fixes #130002.
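For illustration, a hypothetical `linalg.generic` that qualifies: its region yields the first block argument unchanged, making it a direct copy:
```mlir
#map = affine_map<(d0) -> (d0)>
%res = linalg.generic
    {indexing_maps = [#map, #map], iterator_types = ["parallel"]}
    ins(%in : tensor<8xf32>) outs(%out : tensor<8xf32>) {
^bb0(%arg0: f32, %arg1: f32):
  // Yielding the first block argument (%arg0) makes this a direct copy.
  linalg.yield %arg0 : f32
} -> tensor<8xf32>
```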
In accordance with the semantics of forall, its body is executed in
parallel by multiple threads. We should not expect to branch back into
the forall body after the region's execution is complete.
getMixedOffsets() calls getMixedValues() with `static_offsets` and
`offsets`. It is assumed that the number of dynamic entries in
`static_offsets` equals the number of values in `offsets`; otherwise, we
fail on an assert when trying to access the array out of its bounds. The
same applies to getMixedSizes() and getMixedStrides().
A verification of this assumption is added to
verifyOffsetSizeAndStrideOp(), and a clear assert is added in
getMixedValues().
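For context, these accessors back ops like `memref.subview`, which mix static and dynamic offsets; in the sketch below, `static_offsets` holds a dynamic sentinel plus the constant 4, while `offsets` holds only `%o`:
```mlir
%sv = memref.subview %src[%o, 4] [8, 8] [1, 1]
    : memref<64x64xf32> to memref<8x8xf32, strided<[64, 1], offset: ?>>
```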
This PR updates create APIs for arith and affine; specifically, these
are the only in-tree dialects/ops with "custom" builders:
```
AffineDmaStartOp
AffineDmaWaitOp
ConstantIntOp
ConstantFloatOp
ConstantIndexOp
```
See https://github.com/llvm/llvm-project/pull/147168 for more info.
This patch specializes the Python bindings for ForallOp and
InParallelOp, similar to the existing one for ForOp. These bindings
create the regions and blocks properly and expose some additional
helpers.
This change addresses the performance issue in the
**--tosa-reduce-transposes** implementation by working directly with the
raw tensor data, eliminating the need to create costly intermediate
attributes that lead to a bottleneck.
This happens only when the tile size is greater than or equal to the
dimension size; in this case, it is a full slice, so it is fusible.
Such IR can be generated during the TileAndFuse process. It is hard to
fix in that driver, so we enable the naive fusion for this case.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
This adds a new SPIR-V dialect-level conversion pass,
`ConversionToReplicatedConstantCompositePass`. The pass looks for splat
composite `spirv.Constant` and `spirv.SpecConstantComposite` ops and
rewrites them into `spirv.EXT.ConstantCompositeReplicate` or
`spirv.EXT.SpecConstantCompositeReplicate`, respectively.
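For illustration, a splat constant and the replicated form it is rewritten into (a sketch; the EXT op's exact assembly is assumed from the dialect definition):
```mlir
// Before: a splat composite constant.
%0 = spirv.Constant dense<1> : vector<2xi32>
// After: the replicated-composite equivalent.
%1 = spirv.EXT.ConstantCompositeReplicate [1 : i32] : vector<2xi32>
```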
---------
Signed-off-by: Mohammadreza Ameri Mahabadian <mohammadreza.amerimahabadian@arm.com>
If a dimension is not tiled, it is always valid to fuse the pack op,
even if it has padding semantics, because fusion always generates a full
slice along that dimension. If a dimension is tiled and does not need
extra padding, the fusion is also valid.
The revision also formats corresponding tests for consistency.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
This PR addresses the following issues:
1. Add the missing attributes when creating a new GPU funcOp in the
`MoveFuncBodyToWarpExecuteOnLane0` pattern.
2. Fix a bug in LoadNd distribution to make sure the LoadOp is the last
op in the warpOp region before it is distributed (needed to preserve
memory op ordering during distribution).
3. Add a utility for removing OpOperand or OpResult layout attributes.
This op doesn't have any rank or index restrictions on the src/dst
memrefs, but it was using `SameVariadicOperandSize`, which was causing
issues. Also fix some other issues while we're at it.
Some platforms print `{anonymous}` instead of the other two forms
accepted by the test regex. This PR just removes the attempt to guess
how the anonymous namespace will be printed.
@Kewen12 is there a way to trigger the particular CIs that failed in
https://github.com/llvm/llvm-project/pull/146228 on this PR?
Co-authored-by: Jeremy Kun <j2kun@users.noreply.github.com>
This patch adds support for scalable vectorization of linalg.mmt4d. The
key design change is the introduction of a new vectorizer state variable:
* `assumeDynamicDimsMatchVecSizes`
...along with the corresponding Transform dialect attribute:
* `assume_dynamic_dims_match_vec_sizes`.
This flag instructs the vectorizer to assume that dynamic memref/tensor
dimensions match the corresponding vector sizes (fixed or scalable). With this
assumption, masking becomes unnecessary, which simplifies the lowering pipeline
significantly.
While this assumption is not universally valid, it typically holds for
`linalg.mmt4d`. Inputs and outputs are explicitly packed using `linalg.pack`,
and this packing includes padding, ensuring that dimension sizes align with
vector sizes (*).
* Related discussion: https://github.com/llvm/llvm-project/issues/143920
An upcoming patch will include an end-to-end test that leverages scalable
vectorization of linalg.mmt4d to demonstrate the newly enabled functionality.
This would not be feasible without the changes introduced here, as it would
otherwise require additional logic to handle complex but ultimately
redundant masks.
(*) This holds provided that the tile sizes used for packing match the vector
sizes used during vectorization. It is the user’s responsibility to enforce
this.
Alignment information is important to allow LLVM backends such as AMDGPU
to select wide memory accesses (e.g., dwordx4 or b128). Since this info
is not always inferable, it's better to inform LLVM backends explicitly
about it. Furthermore, alignment is not necessarily a property of the
element type, but of each individual memory access op (we can have
overaligned and underaligned accesses compared to the natural/preferred
alignment of the element type).
This patch introduces an `alignment` attribute to the memref and vector
load/store ops.
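For illustration, a minimal sketch of what the annotated ops might look like (attribute spelling per this patch; the `i64` attribute type is an assumption):
```mlir
// Scalar and vector loads carrying an explicit 16-byte alignment.
%s = memref.load %mem[%i] {alignment = 16 : i64} : memref<64xf32>
%v = vector.load %mem[%i] {alignment = 16 : i64} : memref<64xf32>, vector<4xf32>
```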
Follow-up PRs will
1. Propagate the attribute to LLVM/SPIR-V.
2. Introduce `alignment` attribute to other vector memory access ops:
vector.gather + vector.scatter
vector.transfer_read + vector.transfer_write
vector.compressstore + vector.expandload
vector.maskedload + vector.maskedstore
3. Replace `--convert-vector-to-llvm='use-vector-alignment=1'` with a
simple pass to populate alignment attributes based on the vector
types.
- Introduces a `large_resource_limit` parameter across Python bindings,
enabling the eliding of resource strings exceeding a specified character
limit during IR printing.
- To maintain backward compatibility, when using the `operation.print()`
API, if `large_resource_limit` is None and `large_elements_limit` is
set, the latter will be used to elide the resource strings as well. This
behavior was introduced by
https://github.com/llvm/llvm-project/pull/125738.
- For printing using pass manager, the `large_resource_limit` and
`large_elements_limit` are completely independent of each other.
Based on the OpenACC specification — which states that if the bind name
is given as an identifier it should be resolved according to the
compiled language, and if given as a string it should be used unmodified
— we introduce two distinct `bindName` representations for `acc routine`
to handle each case appropriately: one as an array of `SymbolRefAttr`
for identifiers and another as an array of `StringAttr` for strings.
To ensure correct correspondence between bind names and devices, this
patch also introduces two separate sets of device attributes. The
routine operation is extended accordingly, along with the necessary
updates to the OpenACC dialect and its lowering.
This PR adds a feature that attaches a listener to all RewritePatterns;
the listener logs information about the operations each pattern
modifies. When the MLIR test suite is run, these debug outputs can be
filtered and combined into an index linking operations to the patterns
that insert, modify, or replace them. This index is intended to be used
to create a website that lets one look up the patterns associated with
an operation name.
The debug logs can be viewed with `--debug-only=generate-pattern-catalog`,
and the lit config is modified to do this when the env var
`MLIR_GENERATE_PATTERN_CATALOG` is set.
Example usage:
```
mkdir build && cd build
cmake -G Ninja ../llvm \
-DLLVM_ENABLE_PROJECTS="mlir" \
-DLLVM_TARGETS_TO_BUILD="host" \
-DCMAKE_BUILD_TYPE=DEBUG
ninja -j 24 check-mlir
MLIR_GENERATE_PATTERN_CATALOG=1 bin/llvm-lit -j 24 -v -a tools/mlir/test | grep 'pattern-logging-listener' | sed 's/^# | [pattern-logging-listener] //g' | sort | uniq > pattern_catalog.txt
```
Sample pattern catalog output (that fits in a gist):
https://gist.github.com/j2kun/02d1ab8d31c10d71027724984c89905a
---------
Co-authored-by: Jeremy Kun <j2kun@users.noreply.github.com>
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
This PR fixes a use-after-free error that happens when `DistinctAttr`
instances are created within a `PassManager` running with crash recovery
enabled. The root cause is that `DistinctAttr` storage is allocated in a
thread_local allocator, which is destroyed when the crash recovery
thread joins, invalidating the storage.
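For reference, `DistinctAttr`s are the builtin `distinct` attributes, which carry an identity in addition to a payload (example per the builtin dialect docs):
```mlir
#id0 = distinct[0]<42.0 : f32>
// Same payload, but a distinct identity: #id1 != #id0.
#id1 = distinct[1]<42.0 : f32>
```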
Moreover, even without crash recovery, disabling multithreading on the
context will destroy the context's thread pool and in turn delete the
thread-local storage. This means a call to
`ctx->disableMultithreading()` breaks the IR.
This PR replaces the thread-local allocator with a synchronised
allocator that's shared between threads, tying the lifetime of allocated
DistinctAttr storage instances to the lifetime of the context.
### Problem Details:
The `DistinctAttributeAllocator` uses a
`ThreadLocalCache<BumpPtrAllocator>` for lock-free allocation of
`DistinctAttr` storage in a multithreaded context. When a `PassManager`
is run with crash recovery (`runWithCrashRecovery`), the pass pipeline
is executed on a temporary thread spawned by
`llvm::CrashRecoveryContext`. Any `DistinctAttr`s created during this
execution have their storage allocated in the thread_local cache of this
temporary thread. When the thread joins, the thread_local storage is
destroyed, freeing the `DistinctAttr`s' memory. If such an attribute is
accessed later, e.g. when printing, the result is a use-after-free.
As mentioned previously, this is also seen after creating some
`DistinctAttr`s and then calling `ctx->disableMultithreading()`.
### Solution
`DistinctAttrStorageAllocator` uses a synchronised, shared allocator
instead of one wrapped in a `ThreadLocalCache`; it is the latter that
stored the allocator in transient thread_local storage.
### Testing:
A C++ unit test has been added to validate this fix. (I was previously
reproducing this failure with `mlir-opt` but I can no longer do so and I
am unsure why.)
-----
Note: This is a 2nd attempt at my previous PR
https://github.com/llvm/llvm-project/pull/128566 that was reverted in
https://github.com/llvm/llvm-project/pull/133000. I believe I've
addressed the TSAN and race condition concerns.
This commit fixes a transpose_conv2d verifier check which compares the
output channel size to the bias size. The check didn't make sure the
output channels were static before performing the comparison. This led
to failures such as:
```
'tosa.transpose_conv2d' op bias channels expected to be equal to output channels (-9223372036854775808) or 1, got 5
```
when the output channels size was dynamic.
The requirement for a boolean condition is already checked for both
operators elsewhere. `cond_if` requires a boolean condition at
construction. `while_loop` cond_graph is checked in the verifier for a
scalar boolean output type.
Previously the cast folder would sign-extend boolean values, causing
"true" to be cast to -1 instead of 1. This change ensures i1 values are
zero-extended, since i1 is used as a boolean value in TOSA. According to
the TOSA spec, casting a boolean "true" to another integer type should
give a result of 1.
Fixes https://github.com/llvm/llvm-project/issues/57951
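For example (a sketch; the `tosa.const` spelling is approximate), folding the cast below now yields 1 rather than -1:
```mlir
%b = "tosa.const"() {value = dense<true> : tensor<i1>} : () -> tensor<i1>
// i1 is zero-extended when folding, so the result is dense<1> : tensor<i32>.
%r = tosa.cast %b : (tensor<i1>) -> tensor<i32>
```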
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Add a default null initialization for `mlir::affine::MemRefAccess`. This
is consistent with various other MLIR structures and had been missing
here.