This is a takeover of PR ##110527
This commit adds an optional list of memory fences to gpu.barrier,
allowing users to specify which memory scopes they wish to fence
explicitly, while leaving the default semantics (which are equivalent to
calling for a global and local fence by analogy to CUDA's __syncthreads)
unchanged. The new expanded semantics are implemented for SPIR-V and for
the AMDGPU backend.
See also
https://discourse.llvm.org/t/rfc-add-memory-scope-to-gpu-barrier/81021/2?u=fmarno,
where the default behavior of a gpu.barrier was hashed out (though note
that the examples based on VMCNT are outdated for AMDGPU in that memory
fences can now be annotated with the correct set of address spaces).
This commit also deprecates amdgpu.lds_barrier for usecases that don't
involve targeting a gfx908.
Assisted-by: Cursor/Claude code (tests and extending amdgpu.lds_barrier
pattern while copying it over)
---------
Co-authored-by: Finlay Marno <finlay.marno@codeplay.com>
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Alan Li <alan.li@me.com>
This lets us properly annotate ranges for gpu.cluster_block_id and
gpu.cluster_dim_blocks. It also allows us to fill in the
nvvm.cluster_dim attribute for use in the NVVM backend.
This PR deletes the `createLowerGpuOpsToROCDLOpsPass` constructor from
the .td file, making the `createConvertGpuOpsToROCDLOps` pass available
to users. This has the following effects:
1. `createLowerGpuOpsToROCDLOpsPass` is not available anymore. Instead,
`createConvertGpuOpsToROCDLOps` should be used. This makes the interface
consistent with ConvertGpuOpsToNVVMOps.
2. To call `createConvertGpuOpsToROCDLOps`, the options must be passed
via ConvertGpuOpsToROCDLOpsOptions. This has the side effect of
making the `allowed-dialects` option available, which was not accessible
via C++ before.
`subgroup_broadcast` allow to broadcast the value from one lane to all
lanes in subgroup.
Supported modes:
* `first_active_lane` - broadcast value from the first active lane in
subgroup.
* `specific_lane` - broadcast value from the specified lane, lane index
must be within subgroup.
* `any_lane` - if `src` value is uniform across all the subgroup lanes
return it unchanged, otherwise result is poison. This variant
essentially an uniformity hint for the compiler, conveying that specific
value is uniform across all subgroup lanes. Dropping `any_lane`
broadcast should not change the code semantics.
Now that we have general support for setting argument and result
attributes on LLVM intrinsics, extend the definitions of mbcnt.lo and
mbcnt.hi to carry such attributes. With that, update the construction of
the mbcnt.lo/mbcnt.hi calls used to get the lane ID to be `noundef`
(since the lane ID is always defined) and to be annotated with the
correct ranges (so that generic LLVM passes can correctly optimized
based on the fact that there are never more than 32/64 lanes).
(Also, handle a pattern that wasn't using getLaneId() and get rid of a
dead argument)
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Of the 128-bits of buffer descriptor only 48 bits are address bits, so
following the discussion on https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/54,
the logic conclusion is to set the index width to 48 bits instead of
the current value of 128.
Most of the test changes are mechanical datalayout updates, but there
is one actual change: the ptrmask test now uses .i48 instead of .i128
and I had to update SelectionDAGBuilder to correctly extend the mask.
Reviewed By: krzysz00
Pull Request: https://github.com/llvm/llvm-project/pull/139419
The umbrella pass fol lowering GPU ops to ROCDL (aka lowering to LLVM
+ the AMDGPU-specific setup) would call the arith patterns that manually
implemented extf and truncf on bfloat because the LLVM AMDGPU backend
used to not suppport those operaitons.
Since the backend does now support these operations and has for quite
some time, remove these patterns from the default lowering flow.
This PR addressed a bug from llvm/llvm-project#137360. which was using
GPUSubgroupSizeToROCDL to patterns function that do not have a valid
constructor for it. This is causing compilation error below:
error: constructor inherited by 'GPUSubgroupSizeOpToROCDL' from base
class 'ConvertOpToLLVMPattern<mlir::gpu::SubgroupSizeOp>' is implicitly
deleted
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
This patch fixes:
mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp:170:5:
error: default label in switch which covers all enumeration values
[-Werror,-Wcovered-switch-default]
Using `ConvertToLLVMPatternInterface` allows to unhardcode specific
dialect conversions from passes and, more importantly, allows downstream
projects to inject their ops/types translation here by registering
corresponding interface.
Add `allowed-dialects` option so user can control which dialects can be
used to populate conversions.
This commit add an NVIDIA-specific lowering of `cf.assert` to to
`__assertfail`.
Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and
`getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can
be reused.
The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior 1) how ops are matched, 2)
folding wherever it can.
These are independent forms of greedy and leads to inefficiency. E.g.,
cases where one need to create different phases in lowering and is
required to applying patterns in specific order split across different
passes. Using the driver one ends up needlessly retrying folding/having
multiple rounds of folding attempts, where one final run would have
sufficed.
Of course folks can locally avoid this behavior by just building their
own, but this is also a common requested feature that folks keep on
working around locally in suboptimal ways.
For downstream users, there should be no behavioral change. Updating
from the deprecated should just be a find and replace (e.g., `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety)
as the API arguments hasn't changed between the two.
This commit marks the type converter in `populate...` functions as
`const`. This is useful for debugging.
Patterns already take a `const` type converter. However, some
`populate...` functions do not only add new patterns, but also add
additional type conversion rules. That makes it difficult to find the
place where a type conversion was added in the code base. With this
change, all `populate...` functions that only populate pattern now have
a `const` type converter. Programmers can then conclude from the
function signature that these functions do not register any new type
conversion rules.
Also some minor cleanups around the 1:N dialect conversion
infrastructure, which did not always pass the type converter as a
`const` object internally.
- Replace hand-written parser/printer with auto-generated assembly
format.
- Remove implicit `gpu.module_end` terminator and use the `NoTerminator`
trait instead. (Same as `builtin.module`.)
- Turn the region into a graph region. (Same as `builtin.module`.)
Add support in `-convert-gpu-to-llvm-spv` to convert `gpu.func` to
`llvm.func` operations.
- `spir_kernel`/`spir_func` calling conventions used for
kernels/functions.
- `workgroup` attributions encoded as additional `llvm.ptr<3>`
arguments.
- No attribute used to annotate kernels
- `reqd_work_group_size` attribute using to encode
`gpu.known_block_size`.
- `llvm.mlir.workgroup_attrib_size` used to encode workgroup attribution
sizes. This will be attached to the pointer argument workgroup
attributions lower to.
**Note**: A notable missing feature that will be addressed in a
follow-up PR is a `-use-bare-ptr-memref-call-conv` option to replace
MemRef arguments with bare pointers to the MemRef element types instead
of the current MemRef descriptor approach.
---------
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
This patch refactors the conversion of math operations to ROCDL library
calls. This pass will also be used in flang to lower Fortran
intrinsics/math functions for OpenMP target offloading codgen.
This change reworks how range information for GPU dispatch IDs (block
IDs, thread IDs, and so on) is handled.
1. `known_block_size` and `known_grid_size` become inherent attributes
of GPU functions. This makes them less clunky to work with. As a
consequence, the `gpu.func` lowering patterns now only look at the
inherent attributes when setting target-specific attributes on the
`llvm.func` that they lower to.
2. At the same time, `gpu.known_block_size` and `gpu.known_grid_size`
are made official dialect-level discardable attributes which can be
placed on arbitrary functions. This allows for progressive lowerings
(without this, a lowering for `gpu.thread_id` couldn't know about the
bounds if it had already been moved from a `gpu.func` to an `llvm.func`)
and allows for range information to be provided even when
`gpu.*_{id,dim}` are being used outside of a `gpu.func` context.
3. All of these index operations have gained an optional `upper_bound`
attribute, allowing for an alternate mode of operation where the bounds
are specified locally and not inherited from the operation's context.
These also allow handling of cases where the precise launch sizes aren't
known, but can be bounded more precisely than the maximum of what any
platform's API allows. (I'd like to thank @benvanik for pointing out
that this could be useful.)
When inferring bounds (either for range inference or for setting `range`
during lowering) these sources of information are consulted in order of
specificity (`upper_bound` > inherent attribute > discardable attribute,
except that dimension sizes check for `known_*_bounds` to see if they
can be constant-folded before checking their `upper_bound`).
This patch also updates the documentation about the bounds and inference
behavior to clarify what these attributes do when set and the
consequences of setting them up incorrectly.
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
This change updates the dataLayout string to ensure alignment with the
latest LLVM TargetMachine configuration. The aim is to
maintain consistency and prevent potential compilation issues related to
memory address space handling.
In order to ensure operations lower correctly (especially
memref.addrspacecast, which relies on the data layout benig set
correctly then dealing with dynamic memrefs) and to prevent compilation
issues later down the line, set the `llvm.data_layout` attribute on GPU
modules when lowering their contents to a ROCDL / AMDGPU target.
If there's a good way to test the embedded string to prevent it from
going out of sync with the LLVM TargetMachine, I'd appreciate hearing
about it. (Or, alternatively, if there's a place I could farctor the
string out to).
This is a new ODS feature that allows dialects to define a list of
key/value pair representing an attribute type and a name.
This will generate helper classes on the dialect to be able to
manage discardable attributes on operations in a type safe way.
For example the `test` dialect can define:
```
let discardableAttrs = (ins
"mlir::IntegerAttr":$discardable_attr_key,
);
```
And the following will be generated in the TestDialect class:
```
/// Helper to manage the discardable attribute `discardable_attr_key`.
class DiscardableAttrKeyAttrHelper {
::mlir::StringAttr name;
public:
static constexpr ::llvm::StringLiteral getNameStr() {
return "test.discardable_attr_key";
}
constexpr ::mlir::StringAttr getName() {
return name;
}
DiscardableAttrKeyAttrHelper(::mlir::MLIRContext *ctx)
: name(::mlir::StringAttr::get(ctx, getNameStr())) {}
mlir::IntegerAttr getAttr(::mlir::Operation *op) {
return op->getAttrOfType<mlir::IntegerAttr>(name);
}
void setAttr(::mlir::Operation *op, mlir::IntegerAttr val) {
op->setAttr(name, val);
}
bool isAttrPresent(::mlir::Operation *op) {
return op->hasAttrOfType<mlir::IntegerAttr>(name);
}
void removeAttr(::mlir::Operation *op) {
assert(op->hasAttrOfType<mlir::IntegerAttr>(name));
op->removeAttr(name);
}
};
DiscardableAttrKeyAttrHelper getDiscardableAttrKeyAttrHelper() {
return discardableAttrKeyAttrName;
}
```
User code having an instance of the TestDialect can then manipulate this
attribute on operation using:
```
auto helper = testDialect.getDiscardableAttrKeyAttrHelper();
helper.setAttr(op, value);
helper.isAttrPresent(op);
...
```
Reduction is heavily used for many DL workload especially with
softmax/Attention layers. Wave/Warp shuffle and reduction is known to be
a speedy/efficient way to do these reductions.
In this patch we introduce AMD shuffle intrinsic Ops to ROCDL, along with it's corresponding lowering from gpu.shuffle. This should speed up a lot of DL workloads on ROCM backend. Currently, we have support for xor and idx, which are the more common ones. In the future, we plan on adding support for Down and Up, as well as using the ds_swizzle to further enhance it's performance when width and offsets are constant.
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D158684