llvm-project

Author	SHA1	Message	Date
Krzysztof Drewniak	149fa17adf	[mlir][AMDGPU] Update gather_to_lds with explicit-async support (#181082 ) This commit takes advantage of the new `load.async.to.lds` intrinsic in order to add an `async` mode to `gather_to_lds`. In this mode, completion of the load needs to be managed with `asyncmark` and `wait.asyncmark` intrinsics instead of being implicitly derived by alias analysis. This commit adds the flag, a lowering for it, and updates tests. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-16 20:52:35 +00:00
Krzysztof Drewniak	cba0e6ad8e	[mlir][AMDGPU] Change width of LDS barrier count (#180554 ) Whoops, turns out I was off by 1 on how many bits are in the counts and phases in these new LDS barriers. This commit fixes this. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 13:37:14 -08:00
Krzysztof Drewniak	762c32aa08	[mlir][AMDGPU] Add wrappers for in-memory barriers on gfx1250 (#180112 ) This commit introduces the `!amdgpu.ds_barrier_state` type and operations on that type, including extracting its components, and (more importantly) provides wrappers around the upcoming barrier-management instructions that will be added in gfx1250. This commit is loosely based on work done for Triton, but it provides slightly lower-level primitives (namely a known-atomic load for getting the barrier state instead of a `wait` operation that includes an entire spin-loop, though if people want one we could consider adding it). These operations will allow LDS barriers to be interacted with in a more type-safe manner. The types and operations use the Ds naming scheme to match the underlying instructions and to avoid confusion with the "LDS barrier" already present in the AMDGPU dialect that was a workaround for LLVM's memory fencing support. (To summarize a potential usage pattern, one can use a pair of these barriers to communicate between wave(s) in a workgroup that load data into memory and separate wave(s) that compute with that data.) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 17:43:12 +00:00
Ravil Dorozhinskii	b1907c109c	[ROCDL] Refactored MFMA ops in ODS; added constraints (#175775 ) This PR improves the ROCDL MFMA intrinsics by making their operand and result types explicit in the IR and by modeling immediate arguments (immargs) as attributes rather than opaque operands. This brings MFMA intrinsics in line with recent changes made to ROCDL WMMA operations, where intrinsic signatures were clarified to avoid treating them as an unstructured “blob of arguments”.	2026-01-21 21:50:25 +01:00
Erick Ochoa Lopez	aba7d72c8d	[mlir][amdgpu] gfx1250+ lower fat_raw_pointer_cast (#175047 ) * numRecords are set to all 1s if out of bounds is not requested. * set flags correctly to zero.	2026-01-08 16:47:24 -05:00
Erick Ochoa Lopez	19089fa13b	[mlir][amdgpu] Fix DMA lowerings. (#174008 ) * Fixes off by one error where tensor_dim_0_stride was always set to 1. * Instead of always setting this value to 1, tensor_dim_0_stride is the stride across the last dimension.	2025-12-30 13:33:35 -05:00
Eric Feng	24c7b4ea48	[mlir][amdgpu] implement amdgpu.sparse_mfma wrapper for smfmac instructions (#171968 ) Signed-off-by: Eric Feng <Eric.Feng@amd.com>	2025-12-18 20:16:14 -06:00
Erick Ochoa Lopez	5f15fee8ac	[mlir][amdgpu] Add tensor load store operations (#172686 ) Reland https://github.com/llvm/llvm-project/pull/170918 This PR differs from the original one by making the target materialization more restrictive.	2025-12-17 12:37:27 -05:00
Erick Ochoa Lopez	b9d6ad9ce9	Revert "[mlir][amdgpu] Add tensor load store operations (#170918 )" (#172671 ) This reverts commit ecbb44464a3a5fad090be8c19632b9046f8eb109. Broke ROCM integration tests. Will reland in future commit.	2025-12-17 15:06:22 +00:00
Mehdi Amini	8cc9c690eb	[MLIR] Fix clang-tidy fixes for llvm-prefer-isa-or-dyn-cast-in-conditionals in AMDGPUToROCDL.cpp (NFC) The cast can't fail, the `if` checks are spurious.	2025-12-17 05:30:01 -08:00
Ivan Butygin	ce553ab69f	Revert "[mlir][amdgpu] Expose waitcnt bitpacking infra (#172313 )" (#172636 ) This reverts commit 93013817afabe23a07073528481856b3507b6faf. Revert https://github.com/llvm/llvm-project/pull/172313 Missing libraries, again	2025-12-17 12:13:44 +00:00
Ivan Butygin	93013817af	[mlir][amdgpu] Expose waitcnt bitpacking infra (#172313 ) So we can get rid of our copy in `AMDGPUToROCDL`.	2025-12-17 14:32:30 +03:00
Erick Ochoa Lopez	ecbb44464a	[mlir][amdgpu] Add tensor load store operations (#170918 ) * removes unused code. * lowers tensor load and store operations.	2025-12-16 09:09:28 -05:00
Justin Rosner	3a88bb90bb	[mlir][AMDGPU] Add scaled wmma ops for gfx1250 (#169854 ) This PR adds scaled WMMA ops (available on gfx1250) and the lowering to the AMDGPU dialect, wrapping the underlying intrinsics.	2025-12-15 15:44:36 -08:00
Erick Ochoa Lopez	5123d36c02	[mlir][amdgpu] Lower make_gather_dma_descriptor. (#172083 ) * Makes `MakeDescriptorOp` a template for `make_dma_descriptor` and `make_gather_dma_descriptor`. * Makes verification and folder for `make_dma_descriptor` a template. * Adds custom verification and folder for `make_dma_gather_descriptor` based on template. * Adds `make_gather_dma_descriptor` op. * Lowers `make_gather_dma_descriptor` to ROCDL.	2025-12-15 13:32:57 -05:00
Ivan Butygin	f785ca0d72	[mlir][nvgpu] Move memref memspace attributes conversion to single place (#172156 ) Also, some fixes for AMDGPU part for better naming.	2025-12-14 12:44:47 +03:00
Erick Ochoa Lopez	5ebb928532	[mlir][amdgpu] Adds make_dma_gather_base (#171857 ) * Adds `tdm_gather_base` type. * Adds `make_dma_gather_base` op. * Adds `make_dma_gather_base` lowering to ROCDL.	2025-12-12 09:20:38 -05:00
Erick Ochoa Lopez	2f9b8b7428	[mlir][amdgpu] Continue lowering make_tdm_descriptor. (#171498 ) * changes workgroup mask's type from i16 to vector<16xi1> * changes pad_amount and pad_interval from Index to I32 * adds lit tests for padEnable, iteration and dynamic cases * adds TODO for a future instrumentation pass to validate inputs * adds descriptor groups 2 and 3	2025-12-11 15:49:50 -05:00
Ivan Butygin	c22d82a1d4	[mlir][amdgpu] Move GPU memory spaces conversion to single place (#171876 )	2025-12-11 21:39:57 +03:00
Erick Ochoa Lopez	87345d2ad4	[mlir][amdgpu] Add type conversion to populate method (NFC) (#171708 ) * Renames populateAMDGPUMemorySpaceAttributeConversions to populateAMDGPUTypeAndAttributeConversions. * Adds TDMBaseType conversion to populateAMDGPUTypeAndAttributeConversions.	2025-12-11 08:44:19 -05:00
Ivan Butygin	c9c4e6eb58	Reland [mlir][amdgpu] Add common gpu mem space conversions to convert-amdgpu-to-rocdl (#171599 ) Reland https://github.com/llvm/llvm-project/pull/171543 Added missing GPU lib `MLIRGPUToGPURuntimeTransforms`.	2025-12-10 17:33:51 +03:00
Ivan Butygin	467af2715a	Revert "[mlir][amdgpu] Add common gpu mem space conversions to `conve… (#171594 ) …rt-amdgpu-to-rocdl` (#171543)" This reverts commit fd0fb05ae196cb664ebdd8940aad20f9606c62f7. Forgot to link GPU lib and shared lib build failed.	2025-12-10 10:47:07 +00:00
Ivan Butygin	fd0fb05ae1	[mlir][amdgpu] Add common gpu mem space conversions to `convert-amdgpu-to-rocdl` (#171543 ) Without it `convert-amdgpu-to-rocdl` will fail to convert `amdgpu.gather_to_lds` with `#gpu.address_space<workgroup>` mem space.	2025-12-10 13:15:10 +03:00
Ivan Butygin	f88d060c41	[mlir][amdgpu] `memory_counter_wait` tensor counter support (#171153 )	2025-12-08 20:02:40 +03:00
Tim Gymnich	0487154588	[mlir][amdgpu] Add workgroup_mask to MakeDmaDescriptorOp (#171103 ) - add `workgroup_mask` and `early_timeout`	2025-12-08 16:02:18 +01:00
Kazu Hirata	29fa151a07	[mlir] Fix a warning This patch fixes: mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp:2666:10: error: unused variable 'v4i32' [-Werror,-Wunused-variable]	2025-12-05 11:28:36 -08:00
Erick Ochoa Lopez	5dfd9c4f84	[mlir][amdgpu] Add lowering for make_dma_descriptor (#169955 ) * Adds initial lowering for make_dma_descriptor supporting tensors of rank 2. * Adds folders for make_dma_descriptor allowing statically known operands to be folded into attributes. * Add AllElementTypesMatch<["lds", "global"]> to make_dma_base. * Rename pad to pad_amount * Rename pad_every to pad_interval	2025-12-05 14:24:23 -05:00
Krzysztof Drewniak	e209b8bc2f	[mlir][AMDGPU] Rename gfx1250 packed extension ops, change firstScaleLane (#170718 ) The current name of scaled_ext_packed816 was, in retrospect, bothering me, since it just has a bunch of numbers on the end and doesn't really reflect the wave-wide nature of the operation. On top of that, the fact that firstScaleLane was 0 or 1, which might be read as the first lane being 1 (and not what it actually was, 16), also seemed weird. Therefore, before this op sees any use, 1. Rename it to scaled_ext_packed_matrix 2. Change the semantics of firstScaleLane to actually point at the lane where the scales start (valid options currently are 0 or 16, the two halves of a wave32 wave). (Disclaimer: the mechanical updates were done via AI.) --------- Co-authored-by: Erick Ochoa Lopez <eochoalo@amd.com>	2025-12-04 14:35:16 -08:00
Erick Ochoa Lopez	73979c1df9	[mlir][amdgpu] Lower amdgpu.make_dma_base (#169817 ) * Adds lowering for `amdgpu.make_dma_base`	2025-12-02 13:48:31 -05:00
Erick Ochoa Lopez	1fcfd5c67b	[mlir][amdgpu] Sink op creation in scaled conversion intrinsics (NFC) (#168542 ) Where possible: * notifyMatchFailure happen first * then op.emitOpError * finally assertions / op creation. --------- Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>	2025-11-18 10:35:05 -05:00
Erick Ochoa Lopez	909c9aacea	[mlir][amdgpu] Add lowerings for ScaledExtPacked816 (#168123 ) * Adds lowerings for amdgpu.scaled_ext_packed816 * updates verifiers	2025-11-17 16:51:52 -05:00
Muzammiluddin Syed	b1262d13e0	[mlir][ROCDL] Refactor wmma intrinsics to use attributes not operands where possible (#167041 ) The current implementation of the WMMA intrinsic ops as they are defined in the ROCDL tablegen is incorrect. They represent as operands what should be attributes such as `clamp`, `opsel`, `signA/signB`. This change performs a refactoring to bring it in line with what we expect. --------- Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>	2025-11-13 19:50:02 -05:00
Jakub Kuderski	ba0be89cd2	[mlir] Simplify Default cases in type switches. NFC. (#165767 ) Use default values instead of lambdas when possible. `std::nullopt` and `nullptr` can be used now because of https://github.com/llvm/llvm-project/pull/165724.	2025-10-30 15:10:59 -04:00
Jakub Kuderski	3167752f29	[mlir][amdgpu][rocdl] Allow for graceful wmma conversion failures (#165616 )	2025-10-29 16:09:59 -04:00
Jakub Kuderski	466c526714	[mlir][amdgpu][rocdl] Add gfx1250 wmma ops (#165064 ) Update `amdgpu.wmma` op definition and implement amdgpu to rocdl conversion for new variants.	2025-10-28 12:42:39 -04:00
Jakub Kuderski	dc5f274560	[mlir][amdgpu] Add explicit intrinsic shape to wmma (#164920 ) This is in preparation for adding support for gfx1250 wmma intrinsics that include much more possible shapes. Instead of guessing the wave32/wave64 mode based on element types and vector sizes, require the intrinsic shapes to be set explicitly as attributes.	2025-10-24 12:21:33 -04:00
Jakub Kuderski	ae11c5c2c4	[mlir] Switch uses of deprecated .create methods to free function. NFC. (#164635 ) See https://discourse.llvm.org/t/psa-opty-create-now-with-100-more-tab-complete/87339.	2025-10-22 14:51:03 +00:00
Shilei Tian	2195fe7e01	[AMDGPU] Add the support for 45-bit buffer resource (#159702 ) On new targets like `gfx1250`, the buffer resource (V#) now uses this format: ``` base (57-bit): resource[56:0] num_records (45-bit): resource[101:57] reserved (6-bit): resource[107:102] stride (14-bit): resource[121:108] ``` This PR changes the type of `num_records` from `i32` to `i64` in both builtin and intrinsic, and also adds the support for lowering the new format. Fixes SWDEV-554034. --------- Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>	2025-09-24 11:12:02 -04:00
Krzysztof Drewniak	5ecc6d1951	[mlir][AMDGPU] Use LDS-only MMRA fences for lds_barrier (#157919 ) The previous lowering strategy for amdgpu.lds_barrier (an operation whose semantics are "s.barrier, and all LDS operations before this happen-before LDS operations after this, and there must not be an inherent fence/forcing-to-completion of global memory (for performance)") was implemented using manual calls to waitcnt() intrinsics and the s_barrier intrinsic(s). The lack of explicit fencing enabled miscompiles (where LDS accesses were reordered with the barrier) on gfx12. Since LLVM now allows MMRA annotations to ensure that only LDS accesses are fenced by a pair of fences, we can now use these fences to explicitly represent the semantics we want instead of trying to prescribe the method of their implementation. Note that the gfx908 workaround of hiding the s_barrier in inline assembly in order to prevent spurious vmem barriers remains in place, but it is removed for gfx11 because the fences have recently been changed to give us the effect we want.	2025-09-23 14:00:09 -05:00
Gaurav Verma	a2a9601ea4	[mlir][AMDGPU] Updated `PermlaneSwapOp` to select correct val (#157586 ) * as per the instruction description, updated `PermlaneSwapOp` to select correct val * updated corresponding lit tests Issue it resolves: the block reduction was failing otherwise as we were selecting the `{0}` always. --------- Signed-off-by: xintin <gaurav.verma@amd.com>	2025-09-12 13:45:56 +02:00
Tim Gymnich	003cbbd4ca	[mlir][amdgpu] Promote gpu.shuffle to amdgpu.permlane_swap (#154933 ) - promote `gpu.shuffle %src xor {16,32} 64` to `amdgpu.permlane_swap %src {16,32}`	2025-08-24 12:41:09 +02:00
Tim Gymnich	e20fa4f412	[mlir][AMDGPU] Add PermlaneSwapOp (#154345 ) - Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and `rocdl.permlane32.swap` --------- Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>	2025-08-21 18:21:43 +02:00
Maksim Levental	c610b24493	[mlir][NFC] update `mlir/Dialect` create APIs (27/n) (#150638 ) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-25 11:48:32 -05:00
Maksim Levental	8e8f195322	[mlir][amd] fix LLVM::InsertValueOp::create failure to disambiguate (#150605 ) fixes https://github.com/llvm/llvm-project/pull/149879#issuecomment-3117145615 Note this happens because ADL can't disambiguate between `mlir::DenseI64ArrayAttr` and `llvm::ArrayRef<int64_t>` for the value 0 which I guess is equal to nullptr on some (most?) systems. Note, this only occurs with the value 0.	2025-07-25 07:56:27 -04:00
Maksim Levental	b0434925c9	[mlir][NFC] update `Conversion` create APIs (4/n) (#149879 ) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-23 10:49:35 -05:00
Ivan Butygin	4977100624	[mlir][amdgpu] Add `rocdl.s.waitcnt` wrapper (#149670 ) The main motivation is to pass vmcnt/expcnt/lgkmcnt values directly (similar to the asm format) and delegate architecture-dependent bitpacking to the amdgpu->rocdl lowering. --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>	2025-07-22 23:37:56 +03:00
Daniel Hernandez-Juarez	668c964282	[AMDGPU] [MLIR] Add 96 and 128 bit GatherToLDS for gfx950 (#147496 ) This PR adds 96 and 128 gather_to_lds support for gfx950. Updating lowering, verifier and tests.	2025-07-09 11:53:26 -04:00
Alan Li	3f3282cee8	[AMDGPU] Adding AMDGPU dialect wrapper for ROCDL transpose loads. (#145395 ) * 1-to-1 mapping wrapper op. * Direct lowering from AMDGPU wrapper to ROCDL intrinsics.	2025-06-25 22:58:14 -04:00
Umang Yadav	836201f117	Allow bf16 operands on new MFMAs (#144925 ) New gfx950 MFMA allows bf16 operands. `c0cc81cdc0/llvm/include/llvm/IR/IntrinsicsAMDGPU.td (L3434)` When running `amdgpu-to-rocdl`, the current logic always converts bf16 to i16, which fails to compile for newer bf16 MFMAs, e.g. `v_mfma_f32_16x16x32bf16`. The backend expects the bf16 type for the operands of those newer MFMAs. This patch fixes it. CC: @krzysz00 @dhernandez0 @giuseros @antiagainst @kuhar	2025-06-19 12:52:31 -05:00
Daniel Hernandez-Juarez	68b6f392ed	[MLIR][AMDGPU] Fix bug in GatherToLDSOpLowering, get the correct MemRefType for destination (#142915 ) This PR fixes a bug in GatherToLDSOpLowering: we were getting the MemRefType of the source for the destination. Additionally, some related typos are corrected. CC: @krzysz00 @umangyadav @lialan	2025-06-13 11:33:51 -05:00


122 Commits