23 Commits

Author SHA1 Message Date
Krzysztof Drewniak
a4dd51d72f
[mlir][ArithToAMDGPU] Use native packing support (#150342)
The current arith-to-amdgpu patterns for scaling_extf and scaling_truncf
don't take full advantage of the native packing ability of the
intrinsics being targeted. Scaling extension takes the location of the
two elements to be extended as a constant argument (byte for fp4, half
for fp8), and scaling truncation takes a 32-bit input register and a
byte or half to write the truncated values to.
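For intuition only, here is a plain-C++ sketch of the packed-register behavior described above; it models the byte/half selection, not the actual intrinsics or their MLIR wrappers:

```cpp
#include <cstdint>

// A 32-bit register holds four fp8 bytes (or eight fp4 nibbles). Scaling
// extension picks which byte (an fp4 pair) or half (an fp8 pair) of the
// register to extend; scaling truncation writes its packed result into one
// byte/half of an existing 32-bit register, leaving the other lanes untouched.
inline uint8_t selectPackedByte(uint32_t reg, unsigned byteIndex) {
  return static_cast<uint8_t>(reg >> (8 * byteIndex)); // byteIndex in [0, 3]
}

inline uint32_t insertPackedHalf(uint32_t reg, uint16_t half,
                                 unsigned halfIndex) {
  uint32_t mask = 0xFFFFu << (16 * halfIndex); // halfIndex in [0, 1]
  return (reg & ~mask) | (uint32_t(half) << (16 * halfIndex));
}
```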

Not using these features causes unneeded register pressure; this PR
resolves that inefficiency.

It also adds a test for the expected use case of extending or
truncating a block of 32 values to/from fp4 with a uniform scale, to
ensure that this usage requires only a minimal amount of vector shuffling.
2025-07-24 12:26:03 -05:00
Maksim Levental
b0434925c9
[mlir][NFC] update Conversion create APIs (4/n) (#149879)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
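A minimal sketch of what this migration looks like at a call site, as I understand the linked PR (the exact `create` spelling is an assumption, not quoted from this commit):

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"

using namespace mlir;

// Op creation moves from the builder-templated call to a static `create`
// method on the op class; the arguments stay the same apart from the builder.
static Value makeConstant(OpBuilder &b, Location loc, TypedAttr attr) {
  // Old style: return b.create<arith::ConstantOp>(loc, attr);
  return arith::ConstantOp::create(b, loc, attr);
}
```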
2025-07-23 10:49:35 -05:00
Krzysztof Drewniak
5ebbc258d4
[mlir][ArithToAMDGPU][NFC] Add PatternBenefit (#150091)
Since there may be cases where these patterns are run alongside the
generic patterns from ArithExpandOps, add a PatternBenefit argument to
allow these architecture-specific patterns to be prioritized.
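A sketch of how such a benefit is threaded through; the pattern and function names here are illustrative, not the ones in the pass:

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// The key point: forward a PatternBenefit to the pattern constructor so the
// greedy driver prefers this pattern over the default-benefit (1) generic
// expansion patterns.
struct ScalingTruncFLowering : OpRewritePattern<arith::ScalingTruncFOp> {
  ScalingTruncFLowering(MLIRContext *ctx, PatternBenefit benefit)
      : OpRewritePattern(ctx, benefit) {}

  LogicalResult matchAndRewrite(arith::ScalingTruncFOp op,
                                PatternRewriter &rewriter) const override {
    // ... build the packed AMDGPU truncation ops here ...
    return failure();
  }
};

static void populateWithHighBenefit(RewritePatternSet &patterns) {
  patterns.add<ScalingTruncFLowering>(patterns.getContext(),
                                      /*benefit=*/PatternBenefit(10));
}
```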
2025-07-23 07:56:05 -07:00
James Newling
ac6e2ee39b
[mlir][vector] Support direct broadcast conversion (LLVM & SPIRV) (#148027)
Add conversions for broadcast from scalar to LLVM and SPIRV. Also includes
some miscellaneous replacements of vector.splat with vector.broadcast in
VectorToGPU and ArithToAMDGPU.
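As a minimal sketch of that replacement, assuming a builder-based call site (`vector.splat` and `vector.broadcast` both build a vector filled with one scalar):

```cpp
#include "mlir/Dialect/Vector/IR/VectorOps.h"

using namespace mlir;

static Value broadcastScalar(OpBuilder &b, Location loc, Value scalar,
                             int64_t numElements) {
  auto vecType = VectorType::get({numElements}, scalar.getType());
  // Previously built with vector::SplatOp; vector.broadcast is the preferred
  // replacement as vector.splat is being deprecated.
  return b.create<vector::BroadcastOp>(loc, vecType, scalar);
}
```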

Part of deprecation of vector.splat RFC:
https://discourse.llvm.org/t/rfc-mlir-vector-deprecate-then-remove-vector-splat/87143/4
2025-07-21 11:23:59 -07:00
Tim Gymnich
6f291cb099
[mlir][amdgpu] Add conversion from arith.scaling_extf / arith.scaling_truncf to amdgpu (#146372)
- add conversion from arith.scaling_extf to amdgpu.scaled_ext_packed
- add conversion from arith.scaling_truncf to amdgpu.packed_scaled_trunc
2025-07-08 21:45:23 +02:00
Yi Qian
0ea4fb9264
[AMD][ROCDL] Add packed conversions fp8/bf8->bf16 and fp8/bf8->fp32 in ROCDL dialect (#131850)
- Add packed conversions fp8/bf8->bf16 for gfx950 and fp8/bf8->fp32 for
gfx942 in the ROCDL dialect
- Update the amdgpu.ext_packed_fp8 lowering to use the ROCDL packed
fp8/bf8->f32 conversions for vector target types and the ROCDL scalar
fp8/bf8->fp32 conversion for scalar target types.

---------

Co-authored-by: Jungwook Park <jungwook.park@amd.com>
2025-03-21 14:49:50 +00:00
Matthias Springer
a21cfca320
[mlir][IR] Deprecate match and rewrite functions (#130031)
Deprecate the `match` and `rewrite` functions. They mainly exist for
historical reasons. This PR also updates all remaining uses in the MLIR
codebase.

This is addressing a
[comment](https://github.com/llvm/llvm-project/pull/129861#pullrequestreview-2662696084)
on an earlier PR.

Note for LLVM integration: `SplitMatchAndRewrite` will be deleted soon,
update your patterns to use `matchAndRewrite` instead of separate
`match` / `rewrite`.
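A minimal sketch of that migration (the pattern and the folded logic are illustrative, not taken from this PR): the former `match` body becomes the early-exit checks and the former `rewrite` body becomes the mutation.

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/Matchers.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

struct FoldAddZero : OpRewritePattern<arith::AddIOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(arith::AddIOp op,
                                PatternRewriter &rewriter) const override {
    // Former `match` logic: bail out if the pattern does not apply.
    if (!matchPattern(op.getRhs(), m_Zero()))
      return failure();
    // Former `rewrite` logic: perform the IR mutation.
    rewriter.replaceOp(op, op.getLhs());
    return success();
  }
};
```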

---------

Co-authored-by: Jakub Kuderski <jakub@nod-labs.com>
2025-03-07 08:43:01 +01:00
Matthias Springer
a6151f4e23
[mlir][IR] Move match and rewrite functions into separate class (#129861)
The vast majority of rewrite / conversion patterns use a combined
`matchAndRewrite` instead of separate `match` and `rewrite` functions.

This PR optimizes the code base for the most common case where users
implement a combined `matchAndRewrite`. There are no longer any `match`
and `rewrite` functions in `RewritePattern`, `ConversionPattern` and
their derived classes. Instead, there is a `SplitMatchAndRewriteImpl`
class that implements `matchAndRewrite` in terms of `match` and
`rewrite`.

Details:
* The `RewritePattern` and `ConversionPattern` classes are simpler
(fewer functions). Especially the `ConversionPattern` class, which now
has 5 fewer functions. (There were various `rewrite` overloads to
account for 1:1 / 1:N patterns.)
* There is a new class `SplitMatchAndRewriteImpl` that derives from
`RewritePattern` / `OpRewritePattern` / ..., along with a type alias
`RewritePattern::SplitMatchAndRewrite` for convenience.
* Fewer `llvm_unreachable` calls are needed throughout the code base. Instead,
we can use pure virtual functions. (In cases where users previously had
to implement `rewrite` or `matchAndRewrite`, etc.)
* This PR may also reduce the number of [`-Woverload-virtual`
warnings](https://discourse.llvm.org/t/matchandrewrite-hiding-virtual-functions/84933)
produced by GCC. (To be confirmed...)

Note for LLVM integration: Patterns with separate `match` / `rewrite`
implementations must derive from `X::SplitMatchAndRewrite` instead of
`X`.
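A minimal sketch of that migration path, assuming the `SplitMatchAndRewrite` alias is available on the pattern base as described above (the specific pattern is illustrative):

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/Matchers.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Here X = OpRewritePattern<arith::AddIOp>, so the pattern now derives from
// X::SplitMatchAndRewrite instead of X itself.
using Base = OpRewritePattern<arith::AddIOp>::SplitMatchAndRewrite;

struct AddZeroSplit : Base {
  using Base::Base;

  LogicalResult match(arith::AddIOp op) const override {
    return success(matchPattern(op.getRhs(), m_Zero()));
  }
  void rewrite(arith::AddIOp op, PatternRewriter &rewriter) const override {
    rewriter.replaceOp(op, op.getLhs());
  }
};
```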

---------

Co-authored-by: River Riddle <riddleriver@gmail.com>
2025-03-06 08:48:51 +01:00
Mirza Halilčević
1fc49ff593
[MLIR][AMDGPU] Add OCP FP8 support for new hardware (#127728)
(Continuing from #106160)

This PR addresses remaining review comments from the original PR.

Original PR Description
---
Upcoming hardware (gfx12 and some future gfx9) will support the OCP
8-bit float formats for their matrix multiplication intrinsics and
conversion operations, retaining existing opcodes and compiler builtins.

This commit adds support for these types to the MLIR wrappers around
such operations, ensuring that the OCP types aren't used to generate
those builtins on hardware that doesn't expect that format and,
conversely, that the pre-OCP formats aren't used on new hardware.
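For illustration only (assumed helper functions, not from this PR), the distinction boils down to which builtin MLIR float types a given element type is:

```cpp
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

// OCP 8-bit float formats (gfx12 and some future gfx9).
static bool isOcpF8(Type t) {
  return isa<Float8E4M3FNType, Float8E5M2Type>(t);
}

// Pre-OCP (FNUZ) formats used by earlier gfx9 hardware.
static bool isPreOcpF8(Type t) {
  return isa<Float8E4M3FNUZType, Float8E5M2FNUZType>(t);
}
```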

---------

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Paul Fuqua <pf@acm.org>
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
2025-03-03 14:10:31 -06:00
Fabian Ritter
8900e412ae
[AMDGPU][MLIR] Replace gfx940 and gfx941 with gfx942 in MLIR (#125836)
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.

For SWDEV-512631
2025-02-19 10:05:45 +01:00
Prashant Kumar
97b08b8ee0
[mlir][amdgpu] Support for 8bit extf for 0d vector type (#126102)
Previously, the rewrite crashed for 0-D vector types.
2025-02-07 04:15:22 +05:30
Matthias Springer
7a77f14c0a
[mlir][IR] Remove isF...() type API for low-precision FP types (#123326)
Remove `type.isFloat4E2M1FN()` etc. Use `isa<Float4E2M1FNType>(type)`
instead.

For details, see:
https://discourse.llvm.org/t/rethink-on-approach-to-low-precision-fp-types/82361/28
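The replacement spelled out above, as a one-function sketch:

```cpp
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

static bool isFp4E2M1(Type type) {
  // Previously: return type.isFloat4E2M1FN();
  return isa<Float4E2M1FNType>(type);
}
```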
2025-01-20 09:22:53 +01:00
Jacques Pienaar
09dfc5713d
[mlir] Enable decoupling two kinds of greedy behavior. (#104649)
The greedy rewriter is used in many different flows and it has a lot of
conveniences (worklist management, debugging actions, tracing, etc.). But
it combines two kinds of greedy behavior: 1) how ops are matched, and 2)
folding wherever it can.

These are independent forms of greediness, and combining them leads to
inefficiency, e.g., in cases where one needs to create different phases in
lowering and is required to apply patterns in a specific order split across
different passes. Using the driver, one ends up needlessly retrying folding
or running multiple rounds of folding attempts where one final run would
have sufficed.

Of course, folks can avoid this behavior locally by building their own
driver, but this is also a commonly requested feature that folks keep
working around locally in suboptimal ways.

For downstream users, there should be no behavioral change. Updating from
the deprecated API should just be a find-and-replace (e.g., of the `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety),
as the API arguments haven't changed between the two.
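The rename, as a minimal sketch (same arguments, new name):

```cpp
#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

static LogicalResult runPatterns(Operation *op, RewritePatternSet &&patterns) {
  // Previously: applyPatternsAndFoldGreedily(op, std::move(patterns));
  return applyPatternsGreedily(op, std::move(patterns));
}
```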
2024-12-20 08:15:48 -08:00
Kunwar Grover
8e66303916
[mlir][Vector] Remove trivial uses of vector.extractelement/vector.insertelement (1/N) (#116053)
This patch removes trivial usages of
vector.extractelement/vector.insertelement. These operations can be
fully represented by vector.extract/vector.insert. See
https://discourse.llvm.org/t/rfc-psa-remove-vector-extractelement-and-vector-insertelement-ops-in-favor-of-vector-extract-and-vector-insert-ops/71116
for more information.

Further patches will remove more usages of these ops.
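A sketch of what the trivial replacements look like for a constant lane index (builder-based call site assumed):

```cpp
#include "mlir/Dialect/Vector/IR/VectorOps.h"

using namespace mlir;

static Value extractLane0(OpBuilder &b, Location loc, Value vec) {
  // Previously built with vector::ExtractElementOp plus a constant index op;
  // vector.extract takes the static position directly.
  return b.create<vector::ExtractOp>(loc, vec, ArrayRef<int64_t>{0});
}
```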
2024-11-13 15:45:59 +00:00
Jakub Kuderski
763bc9249c
[mlir][amdgpu] Align Chipset with TargetParser (#107720)
Update the Chipset struct to follow the `IsaVersion` definition from
llvm's `TargetParser`. This is a follow-up to
https://github.com/llvm/llvm-project/pull/106169#discussion_r1733955012.

* Add the stepping version. Note: This may break downstream code that
compares against the minor version directly.
* Use comparisons with the full Chipset version where possible.

Note that we can't use the code in `TargetParser` directly because the
chipset utility lives outside of `mlir/Target`, which re-exports llvm's
target library.
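A minimal sketch of the full-version comparison (the threshold chipset is illustrative):

```cpp
#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"

using namespace mlir;

// Compare complete Chipset values (major, minor, stepping) rather than
// looking at the minor version alone.
static bool isGfx942OrNewer(const amdgpu::Chipset &chipset) {
  return chipset >= amdgpu::Chipset(9, 4, 2);
}
```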
2024-09-09 11:12:26 -04:00
Giuseppe Rossini
1387ba48a3
[MLIR][AMDGPU] Introduce fp16 packed arithmetic (#105688)
This PR introduces rocdl.cvt.pkrtz in the ROCDL dialect and uses that
instruction when lowering `arith::TruncFOp`.
2024-08-26 12:48:57 -05:00
Rob Suderman
f35318e828
[mlir][amdgpu] Add support for multi-dim arith.truncf/extf fp8 lowering (#98074)
The existing `fp8` lowering from `arith` to `amdgpu` bails out on the
multidimensional case. We can handle this by using `vector.shape_cast` to
collapse to the 1-D case on extraction and re-casting back to the desired
output shape.
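A hedged sketch of that collapse/re-cast structure; the 1-D lowering itself is elided behind a callback, and the helper name is illustrative:

```cpp
#include "mlir/Dialect/Vector/IR/VectorOps.h"

using namespace mlir;

static Value lowerViaShapeCast(OpBuilder &b, Location loc, Value input,
                               llvm::function_ref<Value(Value)> lower1D) {
  auto srcType = cast<VectorType>(input.getType());
  // Collapse the n-D vector to 1-D.
  auto flatType =
      VectorType::get({srcType.getNumElements()}, srcType.getElementType());
  Value flat = b.create<vector::ShapeCastOp>(loc, flatType, input);
  // Run the existing 1-D fp8 extf/truncf lowering.
  Value lowered = lower1D(flat);
  // Re-cast back to the original shape with the new element type.
  auto dstType =
      VectorType::get(srcType.getShape(),
                      cast<VectorType>(lowered.getType()).getElementType());
  return b.create<vector::ShapeCastOp>(loc, dstType, lowered);
}
```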
2024-07-09 14:59:58 -07:00
Ramkumar Ramachandra
db791b278a
mlir/LogicalResult: move into llvm (#97309)
This patch is part of a project to move the Presburger library into
LLVM.
2024-07-02 10:42:33 +01:00
Christian Sigg
a5757c5b65
Switch member calls to isa/dyn_cast/cast/... to free function calls. (#89356)
This change cleans up call sites. The next step is to mark the member
functions as deprecated.

See https://mlir.llvm.org/deprecation and
https://discourse.llvm.org/t/preferred-casting-style-going-forward.
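The call-site change, as a minimal sketch:

```cpp
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

static bool isScalarInteger(Type type) {
  // Member-call style (to be deprecated): type.isa<IntegerType>()
  return isa<IntegerType>(type);
}
```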
2024-04-19 15:58:27 +02:00
Victor Perez
8827ff92b9
[MLIR][Arith] Add rounding mode attribute to truncf (#86152)
Add a rounding mode attribute to `arith`. This attribute can be used in
different FP `arith` operations to control the rounding mode; the modes
correspond to the IEEE 754-specified rounding modes. It is used in
`arith.truncf` folding.

As this is not supported in dialects other than LLVM, conversion should fail
for now if this attribute is present.

---------

Signed-off-by: Victor Perez <victor.perez@codeplay.com>
2024-04-01 11:57:14 +02:00
Hugo Trachino
65066c0277
[mlir] Use create instead of createOrFold for ConstantOp as folding has no effect (NFC) (#80129)
This aims to clean up confusing uses of
builder.createOrFold<ConstantOp>, since folding has no effect for constants.
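A minimal sketch of the cleanup (the float attribute is just an example):

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"

using namespace mlir;

static Value makeF32Constant(OpBuilder &b, Location loc, float value) {
  // Previously: b.createOrFold<arith::ConstantOp>(...), which folds nothing
  // useful for a constant; plain create is clearer.
  return b.create<arith::ConstantOp>(loc, b.getF32FloatAttr(value));
}
```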
2024-01-31 23:40:37 -08:00
Krzysztof Drewniak
750e90e440
[mlir][ArithToAMDGPU] Add option for saturating truncation to fp8 (#74153)
Many machine-learning applications (and most software written at AMD)
expect the operation that truncates floats to 8-bit floats to be
saturating. That is, they expect `truncf 256.0 : f32 to f8E4M3FNUZ` to
yield `240.0`, not `NaN`, and similarly for negative numbers. However,
the underlying hardware instruction that can be used for this truncation
implements overflow-to-NaN semantics.

To enable handling this use case, we add the saturate-fp8-truncf option
to ArithToAMDGPU (off by default), which causes the requisite clamping
code to be emitted. Said clamping code ensures that Inf and NaN are
passed through exactly (and thus truncate to NaN).

Per review feedback, this commit refactors
createScalarOrSplatConstant() into the Arith dialect utilities and uses
it in this code. It also fixes the naming of existing patterns and
switches from vector.extractelement/insertelement to
vector.extract/insert.
2024-01-23 16:52:21 -06:00
Krzysztof Drewniak
2ebd633f14 [mlir][AMDGPU] Add packed 8-bit float conversion ops and lowering
Define operations that wrap gfx940's new instructions for converting
between f32 and registers containing packed sets of four 8-bit floats.

Define rocdl operations for the intrinsics and an AMDGPU dialect
wrapper around them (to account for the fact that MLIR distinguishes
the two float formats at the type level but that the LLVM IR does
not).

Define an ArithToAMDGPU pass, meant to run before conversion to LLVM,
that replaces relevant calls to arith.extf and arith.truncf with the
packed operations in the AMDGPU dialect. Note that the conversion
currently only handles scalars and vectors of rank <= 1, as we do not
have a use case for multi-dimensional vector support right now.

Reviewed By: jsjodin

Differential Revision: https://reviews.llvm.org/D152457
2023-09-28 14:44:16 +00:00