Adds `mapa` and `mapa.shared.cluster` MLIR Ops to generate mapa
instructions.
`mapa` - Map the address of the shared variable in the target CTA.
- `mapa` - source is a register containing generic address pointing to
shared memory.
- `mapa.shared.cluster` - source is a shared memory variable or a
register containing a valid shared memory address.
PTX Spec Reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-mapa
Implement the integer range inference niterface for memref.dim and
tetnor.dim using shared code. The inference will infer the `dim` of
dynamic dimensions to [0, index_max] and take the union of all the
dimensions that the `dim` argument could be validly referring to.
Update to use getConstShapeValue to collect shape information along the
graph.
Change-Id: Ic6fc2341e3bcfbec06a1d08986e26dd08573bd9c
Co-authored-by: TatWai Chong <tatwai.chong@arm.com>
A new op that allows for representing arbitrary contractions on operands
of arbitrary rank, with arbitrary transposes and arbitrary broadcasts
specified through its indexing_maps attribute.
Supports the expected lowerings to linalg.generic and to
vector.contract.
Corresponding RFC is here:
https://discourse.llvm.org/t/mlir-rfc-introduce-linalg-contract/83589
With the removal of mlir-vulkan-runner (as part of #73457) in
e7e3c45bc70904e24e2b3221ac8521e67eb84668, this pass no longer has to be
public (previously it had to be so the runner could use it). This commit
makes it instead only available for use by mlir-opt.
This is a recommit of 058d183980a2f334d085a46c32abded0557aa789 (#124301)
which had been reverted in 4573c857da88b3210d497d9a88a89351a74b5964 due
to a missing linker dependency on MLIRSPIRVTransforms in
mlir/test/lib/Pass/CMakeLists.txt (fixed in this commit).
With the removal of mlir-vulkan-runner (as part of #73457) in
e7e3c45bc70904e24e2b3221ac8521e67eb84668, this pass no longer has to be
public (previously it had to be so the runner could use it). This commit
makes it instead only available for use by mlir-opt.
The tablgen definition `TypedStrAttr` is an attribute constraints that
is meant to restrict the type of a `StringAttr` to the type given as
parameter. However, the definition did not previously restrict the type;
any `StringAttr` was accepted. This PR makes the definition actually
enforce the type.
To test the constraints, the PR also changes the test op that was
previously used to test this constraint such that the enforced type is
`AnyInteger` instead of `AnyType`. The latter allowed any type, so not
enforcing that constraint had no observable effect. The PR then adds a
test case with a wrong type and ensures that diagnostics are produced.
Signed-off-by: Ingo Müller <ingomueller@google.com>
Adds AVX512 bf16 dot-product operation and defines lowering to LLVM
intrinsics.
AVX512 intrinsic operation definition is extended with an optional
extension field that allows specifying necessary LLVM mnemonic suffix
e.g., `"bf16"` for `x86_avx512bf16_` intrinsics.
This resolves the same issue addressed in
https://github.com/llvm/llvm-project/pull/124286, but for invoke
operations. The issue arose from duplicated logic for both imports. This
PR also refactors the common import code for call and invoke
instructions to mitigate issues in the future.
Adds `griddepcontrol.wait` and `griddepcontrol.launch.dependents`
MLIR Ops to generate griddepcontrol instructions.
`griddepcontrol` - Allows dependent and prerequisite grids as defined by
the runtime to control execution in the following ways:
- `griddepcontrol.wait` - causes the executing thread to wait until all
prerequisite grids in flight have completed and all the memory
operations from the prerequisite grids are performed and made visible
to the current grid.
- `griddepcontrol.launch.dependents` - signals that specific dependents
the runtime system designated to react to this instruction can be
scheduled as soon as all other CTAs in the grid issue the same
instruction or have completed.
PTX Spec Reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol
Following up on #122188, this PR adds support for poison indices to
`ExtractOp` and `InsertOp`. It also includes canonicalization patterns
to turn extract/insert ops with poison indices into `ub.poison`.
Model the `IndexType` as `uint64_t` when converting to a python integer.
With the python bindings,
```python
DenseIntElementsAttr(op.attributes["attr"])
```
used to `assert` when `attr` had `index` type like `dense<[1, 2, 3, 4]>
: vector<4xindex>`.
---------
Co-authored-by: Christopher McGirr <christopher.mcgirr@amd.com>
Co-authored-by: Tiago Trevisan Jost <tiago.trevisanjost@amd.com>
The TOSA-v1.0 specification makes the shift attribute of the MUL
(Hammard product) operator an input. Move the `shift` parameter of the
MUL operator in the MILR TOSA dialect from an attribute to an input and
update any lit tests appropriately.
Expand the verifier of the `tosa::MulOp` operation to check the various
constraints defined in the TOSA-v1.0 specification. Specifically, ensure
that all input operands (excluding the optional shift) are of the same
rank. This means that broadcasting tests which previously checked rank-0
tensors would be broadcast are no longer valid and are removed.
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
Co-authored-by: TatWai Chong <tatwai.chong@arm.com>
On lowering from `memref` to LLVM, `malloc` and other intrinsic
functions from `libc` will be declared in the current module. User's
redefinition of these reserved functions will poison the internal
analysis with wrong prototype. This patch adds assertion on the found
function's type and reports if it mismatch with the intended type.
Related to #120950
---------
Co-authored-by: Luohao Wang <Luohaothu@users.noreply.github.com>
Detailed writeup is in https://github.com/google/heir/issues/1153. See
also https://github.com/llvm/llvm-project/pull/120881. In short,
`propagateIfChanged` is used outside of the `DataFlowAnalysis` scope,
because it is public, but it does not propagate as expected as the
`DataFlowSolver` has stopped running.
To solve such misuse, `propagateIfChanged` should be made
protected/private.
For downstream users affected by this, to correctly propagate the
change, the Analysis should be re-run (check #120881) instead of just a
`propagateIfChanged`
The change to `IntegerRangeAnalysis` is just a expansion of the
`solver->propagateIfChanged`. The `Lattice` has already been updated by
the `join`. Propagation is done by `onUpdate`.
Cc @Mogball for review
See
https://discourse.llvm.org/t/rfc-introduce-opasm-type-attr-interface-for-pretty-print-in-asmprinter/83792
for detailed introduction.
This PR acts as the first part of it
* Add `OpAsmTypeInterface` and `getAsmName` API for deducing ASM name
from type
* Add default impl in `OpAsmOpInterface` to respect this API when
available.
The `OpAsmAttrInterface` / hooking into Alias system part should be
another PR, using a `getAlias` API.
### Discussion
* Instead of using `StringRef getAsmName()` as the API, I use `void
getAsmName(OpAsmSetNameFn)`, as returning StringRef might be unsafe
(std::string constructed inside then returned a _ref_; and this aligns
with the design of `getAsmResultNames`.
* On the result packing of an op, the current approach is that when not
all of the result types are `OpAsmTypeInterface`, then do nothing (old
default impl)
### Review
Cc @j2kun and @Alexanderviand-intel for downstream; Cc @River707 and
@joker-eph for relevent commit history; Cc @ftynse for discourse.
The recipe operations should have AutomaticAllocationScope so recipes can
be converted using operators that require parent ops to have
AutomaticAllocationScope
The op carries the output-shape directly. This can be used directly.
Also adds a method to get the shape as a `SmallVector<OpFoldResult>`.
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
# Overview
This PR adds 2:4 structured sparsity (sparse A, dense B) matrix multiply
instructions to ROCDL.
# Testing
I've added tests to Dialect/mlir and Target/mlir
This was discussed during the original review but I made it stricter
than discussed. Making it a pure view but adding a helper for bytecode
serialization (I could avoid the helper, but it ends up with more logic
and stronger coupling).
The convertFloorOp pattern incurs precision loss when floating-point
numbers exceed the representable range of int64. This pattern should be
removed.
Fixes https://github.com/llvm/llvm-project/issues/119836
This PR adds mfma.scale.f32.32x32x64.f8f6f4 and
mfma.scale.f32.16x16x128.f8f6f4 to the ROCDL dialect. They are converted
to the corresponding intrinsics in the mlir-to-llvmir pass.
Intrinsics are available for the 'cpSize'
variants also. So, this patch migrates the Op
to lower to the intrinsics for all cases.
* Update the existing tests to check the lowering to intrinsics.
* Add newer cp_async_zfill tests to verify the lowering for the 'cpSize'
variants.
* Tidy-up CHECK lines in cp_async() function in nvvmir.mlir (NFC)
PTX spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
The TOSA-V1.0 specification adds "nan propagation" modes as attributes
for several operators. Adjust the ODS definitions of the relevant
operations to include this attribute.
The defined modes are "PROPAGATE" and "IGNORE" and the PROPAGATE mode is
set by default.
MAXIMUM, MINIMUM, REDUCE_MAX, REDUCE_MIN, MAX_POOL, CLAMP, and ARGMAX
support this attribute.
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
Co-authored-by: TatWai Chong <tatwai.chong@arm.com>
This patch changes PadOp's padding input to type !tosa.shape<2 * rank>,
(where rank is the rank of the PadOp's input), instead of a <rank x 2>
tensor.
This patch is also a part of TOSA v1.0 effort:
https://discourse.llvm.org/t/rfc-tosa-dialect-increment-to-v1-0/83708
This patch updates the PadOp to match all against the TOSA v1.0 form.
Original Authors include:
@Tai78641
@wonjeon
Co-authored-by: Tai Ly <tai.ly@arm.com>
Scan directive allows to specify scan reductions within an worksharing
loop, worksharing loop simd or simd directive which should have an
`InScan` modifier associated with it. This change adds the mlir support
for the same.
Related PR: [Parsing and Semantic Support for
scan](https://github.com/llvm/llvm-project/pull/102792)
This is hopefully the first patch in the series of patches adding some
missing SPIR-V ops to MLIR over the next weeks/months, starting with
something simple: `OpEmitVertex` and `OpEndPrimitive`. Since the ops
have no input and outputs, and the only condition is "This instruction
must only be used when only one stream is present.", which I don't think
can be validate at the instruction level in isolation, I set
`hasVerifier` to 0. I hope I didn't miss anything, but I'm more than
happy to address any comments.
…the distributed IR case.
This patch allows `nd_load` and `nd_store` to preserve the tensor
descriptor shape during distribution to SIMT. The validation now expects
the distributed instruction to retain the `sg_map` attribute and uses it
to verify the consistency.
[note: this is blocked by:
https://github.com/tensorflow/tensorflow/pull/73891 otherwise tensorflow
may have lit test failures]
This patch adds SameOperandsAndResultRank trait to TOSA operators with
ResultsBroadcastableShape trait. SameOperandsAndResultRank trait
requiring that all operands and results have matching ranks unless the
operand/result is unranked.
This also renders the TosaMakeBroadcastable pass unnecessary - but this
pass is left in for now just in case it is still used in some flows. The
lit test, broadcast.mlir, is removed.
This also adds verify of the SameOperandsAndResultRank trait in the
TosaInferShapes pass to validate inferred shapes.
Signed-off-by: Tai Ly <tai.ly@arm.com>
This flag enables the configuration of some transformation such as the
lowering of contractions and transposes. The default configuration
preserves the existing behavior.
Add ROCDL support to the following instructions:
V_MFMA_F32_16X16X32_BF16
V_MFMA_I32_16X16X64_I8
V_MFMA_F32_16X16X32_F16
V_MFMA_F32_32X32X16_BF16
V_MFMA_I32_32X32X32_I8
V_MFMA_F32_32X32X16_F16
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Jungwook Park <jungwook.park@amd.com>
This PR adds missing ds\.read.tr4\.b64, ds\.read\.tr8\.b64,
ds\.read\.tr6\.b96,
ds\.read\.tr16\.b64 and global\.load\.lds ops to
the ROCDL dialect.
The ops are converted to the corresponding intrinsic calls during the
translation from MLIR to LLVM IRs.
---------
Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
This follows up on 733be4ed7dcf976719f424c0cb81b77a14f91f5a, which made
mlir-vulkan-runner and its associated passes redundant, and completes
the main goal of #73457. The mlir-vulkan-runner tests become part of the
integration test suite, and the Vulkan runner runtime components become
part of ExecutionEngine, just as was done when removing other
target-specific runners.
This commit is a follow-up to 99a562b3cb17e89273ba0fe77129f2fb17a19381,
which migrated some of the mlir-vulkan-runner tests to mlir-cpu-runner
using a new pipeline and set of wrappers. That commit could not migrate
all the tests, because the existing calling conventions/ABIs for kernel
arguments generated by GPUToLLVMConversionPass were not a good fit for
the Vulkan runtime. This commit fixes this and migrates the remaining
tests. With this commit, mlir-vulkan-runner and many related components
are now unused, and they will be removed in a later commit (see #73457).
The old calling conventions require both the caller (host LLVM code) and
callee (device code) to have compile-time knowledge of the precise
argument types. This works for CUDA, ROCm and SYCL, where there is a
C-like calling convention agreed between the host and device code, and
the runtime passes through arguments as raw data without comprehension.
For Vulkan, however, the interface declared by the shader/kernel is in a
more abstract form, so the device code has indirect access to the
argument data, and the runtime must process the arguments to set up and
bind appropriately-sized buffer descriptors.
This commit introduces a new calling convention option to meet the
Vulkan runtime's needs. It lowers memref arguments to {void*, size_t}
pairs, which can be trivially interpreted by the runtime without it
needing to know the original argument types. Unlike the stopgap measure
in the previous commit, this system can support memrefs of various ranks
and element types, which unblocked migrating the remaining tests.
PR #122344 adds intrinsics for Bulk Async Copy
(non-tensor variants) using TMA. This patch
adds the corresponding NVVM Dialect Ops.
lit tests are added to verify the lowering to all
variants of the intrinsics.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>