Replaces https://github.com/llvm/llvm-project/pull/121886
Fixes https://github.com/llvm/llvm-project/issues/120254 (hopefully 🤞)
## Problem
Consider the following example:
```fortran
program test
real :: x(1)
integer :: i
!$omp parallel do reduction(+:x)
do i = 1,1
x = 1
end do
!$omp end parallel do
end program
```
The HLFIR+OMP IR for this example looks like this:
```mlir
func.func @_QQmain() {
...
omp.parallel {
%5 = fir.embox %4#0(%3) : (!fir.ref<!fir.array<1xf32>>, !fir.shape<1>) -> !fir.box<!fir.array<1xf32>>
%6 = fir.alloca !fir.box<!fir.array<1xf32>>
...
omp.wsloop private(@_QFEi_private_ref_i32 %1#0 -> %arg0 : !fir.ref<i32>) reduction(byref @add_reduction_byref_box_1xf32 %6 -> %arg1 : !fir.ref<!fir.box<!fir.array<1xf32>>>) {
omp.loop_nest (%arg2) : i32 = (%c1_i32) to (%c1_i32_0) inclusive step (%c1_i32_1) {
...
omp.yield
}
}
omp.terminator
}
return
}
```
The problem addressed by this PR is related to: the `alloca` in the
`omp.parallel` region + the related `reduction` clause on the
`omp.wsloop` op. When we try translate the reduction from MLIR to LLVM,
we have to choose an `alloca` insertion point. This happens in
`convertOmpWsloop` where at entry to that function, this is what the
LLVM module looks like:
```llvm
define void @_QQmain() {
%tid.addr = alloca i32, align 4
...
entry:
%omp_global_thread_num = call i32 @__kmpc_global_thread_num(ptr @1)
br label %omp.par.entry
omp.par.entry:
%tid.addr.local = alloca i32, align 4
...
br label %omp.par.region
omp.par.region:
br label %omp.par.region1
omp.par.region1:
...
%5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
```
Now, when we choose an `alloca` insertion point for the reduction, this
is the chosen block `omp.par.entry` (without the changes in this PR).
The problem is that the allocation needed for the reduction needs to
reference the `%5` SSA value. This results in inserting allocations in
`omp.par.entry` that reference allocations in a later block
`omp.par.region1` which causes the `Instruction does not dominate all
uses!` error.
## Possible solution - take 2:
This PR contains a more localized solution than
https://github.com/llvm/llvm-project/pull/121886. It makes sure that on
entry to `initReductionVars`, the IR builder is at a point where we can
starting inserting initialization region; to make things cleaner, we
still split the builder insertion point to a dedicated
`omp.reduction.init`. This way we avoid splitting after the latest
allocation block; which is what causing the issue.
the `ptx_kernel` calling convention is a more idiomatic and standard way
of specifying a NVPTX kernel than using the metadata which is not
supposed to change the meaning of the program. Further, checking the
calling convention is significantly faster than traversing the metadata,
improving compile time.
This change updates the clang and mlir frontends as well as the
NVPTXCtorDtorLowering pass to emit kernels using the calling convention.
In addition, this updates all NVPTX unit tests to use the calling
convention as well.
Currently the compiler will ICE in programs like the following on the
device lowering pass:
```
program main
implicit none
type i1_t
integer :: val(1000)
end type i1_t
integer :: i
type(i1_t), pointer :: newi1
type(i1_t), pointer :: tab=>null()
integer, dimension(:), pointer :: tabval
!$omp THREADPRIVATE(tab)
allocate(newi1)
tab=>newi1
tab%val(:)=1
tabval=>tab%val
!$omp target teams distribute parallel do
do i = 1, 1000
tabval(i) = i
end do
!$omp end target teams distribute parallel do
end program main
```
This is due to the fact that THREADPRIVATE returns a result operation,
and this operation can actually be used by other LLVM dialect (or other
dialect) operations. However, we currently skip the lowering of
threadprivate, so we effectively never generate and bind an LLVM-IR
result to the threadprivate operation result. So when we later go on to
lower dependent LLVM dialect operations, we are missing the required
LLVM-IR result, try to access and use it and then ICE. The fix in this
particular PR is to allow compilation of threadprivate for device as
well as host, and simply treat the device compilation as a no-op,
binding the LLVM-IR result of threadprivate with no alterations and
binding it, which will allow the rest of the compilation to proceed,
where we'll eventually discard the host segment in any case.
The other possible solution to this I can think of, is doing something
similar to Flang's passes that occur prior to CodeGen to the LLVM
dialect, where they erase/no-op certain unrequired operations or
transform them to lower level series of operations. And we would
erase/no-op threadprivate on device as we'd never have these in target
regions.
The main issues I can see with this are that we currently do not
specialise this stage based on wether we're compiling for device or
host, so it's setting a precedent and adding another point of having to
understand the separation between target and host compilation. I am also
not sure we'd necessarily want to enforce this at a dialect level incase
someone else wishes to add a different lowering flow or translation
flow. Another possible issue is that a target operation we have/utilise
would depend on the result of threadprivate, meaning we'd not be allowed
to entirely erase/no-op it, I am not sure of any situations where this
may be an issue currently though.
This patch,
- Added a translation support for aligned clause in SIMD directive by passing the alignment details to "llvm.assume" intrinsic.
- Updated the insertion point for llvm.assume intrinsic call in "OMPIRBuilder.cpp".
- Added a check in aligned clause MLIR lowering, to ensure that the alignment value must be a power of 2.
Implementatoin details:
Both Mutexinoutset and Inoutset is recognized as flag=0x4
and 0x8 respectively, the flags is set to `kmp_depend_info` and
passed as argument to `__kmpc_omp_task_with_deps` runtime call
Implementation details:
The UNTIED clause is recognized by setting the flag=0 for the default
case or performing logical OR to flag if other clauses are specified,
and this flag is passed as an argument to the `__kmpc_omp_task_alloc`
runtime call.
This removes a runtime dependency on the CUDA Toolkit path, instead of
looking up the filesystem we use a version of libdevice embedded in the
binary at build time.
This change allows to expose through an interface attributes wrapping
content as external resources, and the usage inside the ModuleToObject
show how we will be able to provide runtime libraries without relying on
the filesystem.
This is a follow-up of #117246.
I thought then it would be easy to edit a DictionaryAttr but it turns
out that these attributes are immutable and need to be passed during the
construction of the gpu.binary Op.
The first commit was using the NVVMTargetAttr to pass the information.
After feedback from @fabianmcg, this PR now passes the information
through a new option of the gpu-module-to-binary pass.
Please add reviewers, as you see fit.
### Background
In `lib/Target/LLVM/NVVM/Target.cpp`, `NVPTXSerializer` compile PTX to
binary with two different flows controlled by
`MLIR_ENABLE_NVPTXCOMPILER`.
If building mlir with `-DMLIR_ENABLE_NVPTXCOMPILER=ON`, the flow does
not check if the target is `gpu::CompilationTarget::Fatbin`, and compile
PTX to cubin directly, which is not consistent with another flow.
### Implement
Use static [nvfatbin](https://docs.nvidia.com/cuda/nvfatbin/index.html)
library.
I have tested it locally, the two flows can return the same Fatbin
result after inputing the same `GpuModule`.
This reapplies PR #118463 after introducing a fix for a bug uncovere by
the test suite. The problem is that when the alloca block is terminated
with a conditional branch, this violates a pre-condition of
`allocatePrivateVars` (which assumes the alloca block has a single
successor). This new PR includes a test that reproduces the issue.
Extend MLIR to LLVM lowering by adding support for `omp.wsloop` for
delayed privatization. This also refactors a few bit of code to isolate
the logic needed for `firstprivate` initialization in a shared util that
can be used across constructs that need it. The same is done for
`dealloc` regions.
This PR adds translation support for task detach. Essentially, if the
`detach` clause is present on a task, emit a
`__kmpc_task_allow_completion_event` on it, and store its return (of
type `kmp_event_t*`) into the `event_handle`.
Extend MLIR to LLVM lowering by adding support for `omp.wsloop` for
delayed privatization. This also refactors a few bit of code to isolate
the logic needed for `firstprivate` initialization in a shared util that
can be used across constructs that need it. The same is done for
`dealloc`
regions.
Parent PR: https://github.com/llvm/llvm-project/pull/118447. Only latest
commit is relevant for this PR.
This refactors the logic needed to emit init logic for reductions by
moving some duplicated code into a shared util. The logic for doing is
quite involved and is needed for any construct that has reductions.
Moreover, when a construct has both private and reduction clauses, both
sets of clauses need to cooperate with each other when emitting the
logic needed for allocation and initialization. Therefore, this PR
clearly sets the boundaries for the logic needed to initialize
reductions.
We've had the ability to define LLVM's `range` attribute through
#llvm.constant_range for some time, and have used this for some GPU
intrinsics. This commit allows using `llvm.range` as a parameter or
result attribute on function declarations and definitions.
Add FIR generation and LLVMIR translation support for mergeable clause
on task construct. If mergeable clause is present on a task, the
relevant flag in `ompt_task_flag_t` is set and passed to
`__kmpc_omp_task_alloc`.
Here is the [merged
MR](https://github.com/llvm/llvm-project/pull/116007) which caused a
failure and [was
reverted](https://github.com/llvm/llvm-project/pull/116811).
Thanks to @joker-eph for the help, I fix it (miss constructing
`ModuleObject` with callback functions in
`mlir/lib/Target/LLVM/NVVM/Target.cpp`) and split unit tests from origin
test which don't need `ptxas` to make the test runs more widely.
This is one of 3 PRs in a PR stack that aims to add support for explicit
mapping of allocatable members in derived types.
The primary changes in this PR are the OpenMPToLLVMIRTranslation.cpp
changes, which are small and seek to alter the current member mapping to
add an additional map insertion for pointers. Effectively, if the member
is a pointer (currently indicated by having a varPtrPtr field) we add an
additional map for the pointer and then alter the subsequent mapping of
the member (the data) to utilise the member rather than the parents base
pointer. This appears to be necessary in certain cases when mapping
pointer data within record types to avoid segfaulting on device (due to
incorrect data mapping). In general this record type mapping may be
simplifiable in the future.
There are also additions of tests which should help to showcase the
affect of the changes above.
The implementation is mostly based on the one existing for the exact
flag.
disjoint means that for each bit, that bit is zero in at least one of
the inputs. This allows the Or to be treated as an Add since no carry
can occur from any bit. If the disjoint keyword is present, the result
value of the or is a [poison
value](https://llvm.org/docs/LangRef.html#poisonvalues) if both inputs
have a one in the same bit position. For vectors, only the element
containing the bit is poison.
This patch primarily updates the MapInfoFinalization pass to utilise the
BlockArgument interface. It also shuffles newly added arguments the
MapInfoFinalization passes to the end of the BlockArg/Relevant MapInfo
lists, instead of one prior to the owning descriptor type.
During this it was noted that the use_device_ptr/addr handling of target
data was a little bit too order dependent so I've attempted to make it
less so, as we cannot depend on argument ordering to be the same as
Fortran for any future frontends.
This implementation is based on the existing one for the exact flag.
If the nneg flag is set and the argument is negative, the result is a
poison value.
The implementation is mostly based on the one existing for the nsw and
nuw flags.
If the exact flag is present, the corresponding operation returns a
poison value when the result is not exact. (For a division, if rounding
happens; for a right shift, if a non-zero bit is shifted out.)
Upstreaming some code cleanups ahead of supporting delayed task
execution.
- Make allocatePrivateVars not need to be a template (it will need to
operate separately on firstprivate and private variables for delayed
task execution so it can't index into lists of all variables in the
operation).
- Use llvm::SmallVectorImpl for function arguments
- collectPrivatizationDecls already reserves size for privateDecls so we
don't need to do that in callers
- Use llvm::zip_equal instead of C-style array indexing
This patch is intended to resolve#109481 and improve the usability of
the AMX dialect.
In LLVM IR, AMX intrinsics use `x86_amx` which is one of the primitive
types. This type is supposed to be used for AMX intrinsic calls and no
other operations. AMX dialect of MLIR uses regular 2D vector types,
which are then lowered to arrays of vectors in the LLVMIR dialect. This
creates an inconsistency in the types used in the LLVMIR dialect and
LLVMIR. Translation of AMX intrinsic calls to LLVM IR doesn't require
result types to match and that is where tile loads and mul operation
results get `x86_amx` type. This works in very simple cases when mul and
tile store operations directly consume the result of another AMX
intrinsic call, but it doesn't work when an argument is a block argument
(phi node).
In addition to translation problems, this inconsistency between types
used in MLIR and LLVM IR makes MLIR verification and transformation
quite problematic. Both `amx.tileload` and `vector::transfer_read` can
load values of the same type, but only one of them can be used in AMX
operations. In general, by looking at a type of value, we cannot
determine if it can only be used for AMX operations or contrary can be
used in other operations but AMX ones.
To remove this inconsistency and make AMX operations more explicit in
their limitations, I propose to add `LLVMX86AMXType` type to the LLVMIR
dialect to match `x86_amx` type in LLVM IR, and introduce
`amx::TileType` to be used by AMX operations in MLIR. This resolves
translation problems for AMX usage with phi nodes and provides proper
type verification in MLIR for AMX operations.
P.S. This patch also adds missing FP16 support. It's trivial but
unrelated to type system changes, so let me know if I should submit it
separately.
---------
Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
This uses essentially an identical implementation to that used for
ParallelOp. The private variable allocation and deallocation use shared
functions to avoid code duplication. FIRSTPRIVATE variable copying uses
duplicated code for now because I anticipate the implementation
diverging in the near future once I store data for firstprivate
variables in the task description structure.
After enabling delayed privatisation for TASK in flang, one more test in
the fujitsu test suite passes (I haven't looked into why).
Now when building llvm-solid with `-DMLIR_ENABLE_NVPTXCOMPILER=ON`,
there will be an absolute path (`/path/to/libnvptxcompiler_static.a`) in
MLIRNVVMTarget dependencies (in
`/build/path/install/lib/cmake/mlir/MLIRTargets.cmake`). For example,
```cmake
set_target_properties(MLIRNVVMTarget PROPERTIES
INTERFACE_LINK_LIBRARIES "MLIRIR;MLIRExecutionEngineUtils;MLIRSupport;MLIRGPUDialect;MLIRTargetLLVM;MLIRNVVMToLLVMIRTranslation;LLVMSupport;/path/to/libnvptxcompiler_static.a"
)
```
If downstream project uses pre-built llvm and depends on MLIRNVVMTarget,
it may fail to build due to the absence of the
`libnvptxcompiler_static.a` absolute path.
After this commit, there will no absolute path in
`/build/path/install/lib/cmake/mlir/MLIRTargets.cmake`
```cmake
set_target_properties(MLIRNVVMTarget PROPERTIES
INTERFACE_LINK_LIBRARIES "MLIRIR;MLIRExecutionEngineUtils;MLIRSupport;MLIRGPUDialect;MLIRTargetLLVM;MLIRNVVMToLLVMIRTranslation;LLVMSupport;\$<LINK_ONLY:MLIR_NVPTXCOMPILER_LIB>"
)
```
Then downstream project can modify `libnvptxcompiler_static.a` path and
use cmake to build. For example,
```cmake
# find_library(...)
add_library(MLIR_NVPTXCOMPILER_LIB STATIC IMPORTED GLOBAL)
set_property(TARGET MLIR_NVPTXCOMPILER_LIB PROPERTY IMPORTED_LOCATION ${...})
```
This patch improves not-yet-implemented error diagnostics to more
closely follow the format used by Flang lowering for the same kind of
errors. This helps keep some level of uniformity from a user
perspective.
When composite constructs are lowered, clauses for each leaf construct
are lowered before creating the set of loop wrapper operations, using
these outside values to populate their operand lists. Then, when the
loop nest associated to that composite construct is lowered, the binding
of Fortran symbols to the entry block arguments defined by these loop
wrappers is performed, resulting in the creation of `hlfir.declare`
operations in the entry block of the `omp.loop_nest`.
This approach prevents `hlfir.declare` operations related to the binding
and other operations resulting from the evaluation of the clauses from
being inserted between loop wrapper operations, which would be an
illegal MLIR representation. However, this introduces the problem of
entry block arguments defined by a wrapper that then should be used by
one of its nested wrappers, because the corresponding Fortran symbol
would still be mapped to an outside value at the time of gathering the
list of operands for the nested wrapper.
This patch adds operand re-mapping logic to update wrappers without
changing when clauses are evaluated or where the `hlfir.declare`
creation is performed.
This patch improves error reporting in the MLIR to LLVM IR translation
pass for the 'omp' dialect by emitting descriptive errors when
encountering clauses not yet supported by that pass.
Additionally, not-yet-implemented errors previously missing for some
clauses are added, to avoid silently ignoring them.
Error messages related to inlining of `omp.private` and
`omp.declare_reduction` regions have been updated to use the same
format.
This patch unifies the handling of errors passed through the
OpenMPIRBuilder and removes some redundant error messages through the
introduction of a custom `ErrorInfo` subclass.
Additionally, the current list of operations and clauses unsupported by
the MLIR to LLVM IR translation pass is added to a new Lit test to check
they are being reported to the user.
`DIGenericSubrange` is used when the dimensions of the arrays are
unknown at build time (e.g. assumed-rank arrays in Fortran). It has same
`lowerBound`, `upperBound`, `count` and `stride` fields as in
`DISubrange` and its translation looks quite similar as a result.
---------
Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
This patch updates the translation of `omp.wsloop` with a nested
`omp.simd` to prevent uses of block arguments defined by the latter from
triggering null pointer dereferences.
This happens because the inner `omp.simd` operation representing
composite `do simd` constructs is currently skipped and not translated,
but this results in block arguments defined by it not being mapped to an
LLVM value. The proposed solution is to map these block arguments to the
LLVM value associated to the corresponding operand, which is defined
above.
This patch implements an approach to communicate errors between the
OMPIRBuilder and its users. It introduces `llvm::Error` and
`llvm::Expected` objects to replace the values returned by callbacks
passed to `OMPIRBuilder` codegen functions. These functions then check
the result for errors when callbacks are called and forward them back to
the caller, which has the flexibility to recover, exit cleanly or dump a
stack trace.
This prevents a failed callback to leave the IR in an invalid state and
still continue the codegen process, triggering unrelated assertions or
segmentation faults. In the case of MLIR to LLVM IR translation of the
'omp' dialect, this change results in the compiler emitting errors and
exiting early instead of triggering a crash for not-yet-implemented
errors. The behavior in Clang and openmp-opt stays unchanged, since
callbacks will continue always returning 'success'.
Extends `nowait` support for other device directives. This PR refactors
the task generation utils used for the `target` directive so that they
are general enough to be reused for other device directives as well.