Existing pattern for lowering gpu.printf op to LLVM call uses fixed
function name and calling convention.
Those two should be exposed as pass option to allow supporting Intel
Compute Runtime for GPU.
Also adds gpu.printf op pattern to GPU to LLVMSPV pass.
It may appear out of place, but integration test is added to XeVM
integration test as that is the current best folder for testing with
Intel Compute Runtime.
Test should be moved in the future if a better test folder is added.
In order to make the gpu.printf => [various LLVM calls] passes less
order-dependent and to allow downstreams that don't use gpu.module to
use gpu.printf, allow the flowerings for such prints to target the
nearest `SymbolTable` instead.
Bug description: Global variables and functions created during
gpu.printf conversion to NVVM may contain debug info metadata from
function containing the gpu.printf which cannot be used out of that
function.
`FunctionOpInterface` assumed the fact that the function type (attribute
of the operation) can be cloned with arbirary lists of function
arguments and results to support argument and result list mutation. This
is not always correct, in particular, LLVM dialect functions require
exactly one result making it impossible to erase the result.
Allow function type cloning to fail and propagate this failure through
various APIs that use it. The common assumption is that existing IR has
not been modified.
Fixes#131142.
Reland a8c7ecdcbc3e89b493b495c6831cc93671c3b844 / #136300.
`FunctionOpInterface` assumed the fact that the function type (attribute
of the operation) can be cloned with arbirary lists of function
arguments and results to support argument and result list mutation. This
is not always correct, in particular, LLVM dialect functions require
exactly one result making it impossible to erase the result.
Allow function type cloning to fail and propagate this failure through
various APIs that use it. The common assumption is that existing IR has
not been modified.
Fixes#131142.
This commit adds 1:N support to `SignatureConversion::remapInputs`. This
API allows users to replace a block argument with multiple replacement
values. (And the block argument is dropped.) The API already supported
"bbarg --> multiple bbargs" mappings, but "bbarg --> multiple SSA
values" was missing.
---------
Co-authored-by: Markus Böck <markus.boeck02@gmail.com>
This is a follow-up to #127844. That PR got vectors of arbitrary rank
working, but I hadn't thought about the rank-0 case.
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
There was a discrepancy between the type-converter and rewrite-pattern
parts of conversion to LLVM used in various GPU targets, at least ROCDL
and NVVM:
- The TypeConverter part was handling vectors of arbitrary rank,
converting them to nests of `!llvm.array< ... >` with a vector at the
inner-most dimension:
8337d01e30/mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp (L629-L655)
- The rewrite pattern part was not handling `llvm.array`:
8337d01e30/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp (L594-L596)
That led to conversion failures when lowering `math` dialect ops on
rank-2 vectors, as in the testcase being added in this PR.
This PR fixes this by reusing a shared utility already used in other
conversions to LLVM:
8337d01e30/mlir/lib/Conversion/LLVMCommon/VectorPattern.cpp (L80-L104)
---------
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
LLVM itself is generally moving away from using `undef` and towards
using `poison`, to the point of having a lint that caches new uses of
`undef` in tests.
In order to not trip the lint on new patterns and to conform to the
evolution of LLVM
- Rename valious ::undef() methods on StructBuilder subclasses to
::poison()
- Audit the uses of UndefOp in the MLIR libraries and replace almost all
of them with PoisonOp
The remaining uses of `undef` are initializing `uninitialized` memrefs,
explicit conversions to undef from SPIR-V, and a few cases in
AMDGPUToROCDL where usage like
%v = insertelement <M x iN> undef, iN %v, i32 0
%arg = bitcast <M x iN> %v to i(M * N)
is used to handle "i32" arguments that are are really packed vectors of
smaller types that won't always be fully initialized.
This commit add an NVIDIA-specific lowering of `cf.assert` to to
`__assertfail`.
Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and
`getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can
be reused.
Add support in `-convert-gpu-to-llvm-spv` to convert `gpu.func` to
`llvm.func` operations.
- `spir_kernel`/`spir_func` calling conventions used for
kernels/functions.
- `workgroup` attributions encoded as additional `llvm.ptr<3>`
arguments.
- No attribute used to annotate kernels
- `reqd_work_group_size` attribute using to encode
`gpu.known_block_size`.
- `llvm.mlir.workgroup_attrib_size` used to encode workgroup attribution
sizes. This will be attached to the pointer argument workgroup
attributions lower to.
**Note**: A notable missing feature that will be addressed in a
follow-up PR is a `-use-bare-ptr-memref-call-conv` option to replace
MemRef arguments with bare pointers to the MemRef element types instead
of the current MemRef descriptor approach.
---------
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
Before this commit, there used to be a workaround in the
`func.func`/`gpu.func` op lowering when the bare-pointer calling
convention is enabled. This workaround "patched up" the argument
materializations for memref arguments. This can be done directly in the
argument materialization functions (as the TODOs in the code base
indicate).
This commit effectively reverts back to the old implementation
(a664c14001fa2359604527084c91d0864aa131a4) and adds additional checks to
make sure that bare pointers are used only for function entry block
arguments.
The `gpu.func` op lowering accounts for memref arguments/results (both
"normal" and bare-pointer supported), but the `gpu.return` op lowering
did not. The lowering produced invalid IR that did not verify.
This commit uses the same lowering strategy as for `func.return` in the
`gpu.return` lowering. (The C++ implementation is copied. We may want to
share some code between `func` and `gpu` lowerings in the future.)
This change reworks how range information for GPU dispatch IDs (block
IDs, thread IDs, and so on) is handled.
1. `known_block_size` and `known_grid_size` become inherent attributes
of GPU functions. This makes them less clunky to work with. As a
consequence, the `gpu.func` lowering patterns now only look at the
inherent attributes when setting target-specific attributes on the
`llvm.func` that they lower to.
2. At the same time, `gpu.known_block_size` and `gpu.known_grid_size`
are made official dialect-level discardable attributes which can be
placed on arbitrary functions. This allows for progressive lowerings
(without this, a lowering for `gpu.thread_id` couldn't know about the
bounds if it had already been moved from a `gpu.func` to an `llvm.func`)
and allows for range information to be provided even when
`gpu.*_{id,dim}` are being used outside of a `gpu.func` context.
3. All of these index operations have gained an optional `upper_bound`
attribute, allowing for an alternate mode of operation where the bounds
are specified locally and not inherited from the operation's context.
These also allow handling of cases where the precise launch sizes aren't
known, but can be bounded more precisely than the maximum of what any
platform's API allows. (I'd like to thank @benvanik for pointing out
that this could be useful.)
When inferring bounds (either for range inference or for setting `range`
during lowering) these sources of information are consulted in order of
specificity (`upper_bound` > inherent attribute > discardable attribute,
except that dimension sizes check for `known_*_bounds` to see if they
can be constant-folded before checking their `upper_bound`).
This patch also updates the documentation about the bounds and inference
behavior to clarify what these attributes do when set and the
consequences of setting them up incorrectly.
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
GEPArg can only be constructed from int32_t and mlir::Value. Explicitly
cast other types (e.g. unsigned, size_t) to int32_t to avoid narrowing
conversion warnings on MSVC. Some recent examples of such are:
```
mlir\lib\Dialect\LLVMIR\Transforms\TypeConsistency.cpp: error C2398:
Element '1': conversion from 'size_t' to 'T' requires a narrowing
conversion
with
[
T=mlir::LLVM::GEPArg
]
mlir\lib\Dialect\LLVMIR\Transforms\TypeConsistency.cpp: error C2398:
Element '1': conversion from 'unsigned int' to 'T' requires a narrowing
conversion
with
[
T=mlir::LLVM::GEPArg
]
```
Co-authored-by: Nikita Kudriavtsev <nikita.kudriavtsev@intel.com>
Setting thread block size with `maxntid` on the kernel has great
performance benefits. In this way, downstream PTX compiler can do better
register allocation.
MLIR's `gpu.launch` and `gpu.launch_func` already has an attribute
(`known_block_size`) that keeps the thread block size when it is known.
This PR simply uses this attribute to set `maxntid`.
Expand the copying of attributes on GPU kernel arguments during LLVM
lowering.
Support copying attributes from values that are already LLVM pointers.
Support copying attributes, like `noundef`, that aren't specific to (the
pointer parts of) arguments.
While the `gpu.launch` Op allows setting the size via the
`dynamic_shared_memory_size` argument, accessing the dynamic shared
memory is very convoluted. This PR implements the proposed Op,
`gpu.dynamic_shared_memory` that aims to simplify the utilization of
dynamic shared memory.
RFC:
https://discourse.llvm.org/t/rfc-simplifying-dynamic-shared-memory-access-in-gpu/
**Proposal from RFC**
This PR `gpu.dynamic.shared.memory` Op to use dynamic shared memory
feature efficiently. It is is a powerful feature that enables the
allocation of shared memory at runtime with the kernel launch on the
host. Afterwards, the memory can be accessed directly from the device. I
believe similar story exists for AMDGPU.
**Current way Using Dynamic Shared Memory with MLIR**
Let me illustrate the challenges of using dynamic shared memory in MLIR
with an example below. The process involves several steps:
- memref.global 0-sized array LLVM's NVPTX backend expects
- dynamic_shared_memory_size Set the size of dynamic shared memory
- memref.get_global Access the global symbol
- reinterpret_cast and subview Many OPs for pointer arithmetic
```
// Step 1. Create 0-sized global symbol. Manually set the alignment
memref.global "private" @dynamicShmem : memref<0xf16, 3> { alignment = 16 }
func.func @main() {
// Step 2. Allocate shared memory
gpu.launch blocks(...) threads(...)
dynamic_shared_memory_size %c10000 {
// Step 3. Access the global object
%shmem = memref.get_global @dynamicShmem : memref<0xf16, 3>
// Step 4. A sequence of `memref.reinterpret_cast` and `memref.subview` operations.
%4 = memref.reinterpret_cast %shmem to offset: [0], sizes: [14, 64, 128], strides: [8192,128,1] : memref<0xf16, 3> to memref<14x64x128xf16,3>
%5 = memref.subview %4[7, 0, 0][7, 64, 128][1,1,1] : memref<14x64x128xf16,3> to memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3>
%6 = memref.subview %5[2, 0, 0][1, 64, 128][1,1,1] : memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> to memref<64x128xf16, strided<[128, 1], offset: 73728>, 3>
%7 = memref.subview %6[0, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>
%8 = memref.subview %6[32, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>
// Step.5 Use
"test.use.shared.memory"(%7) : (memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>) -> (index)
"test.use.shared.memory"(%8) : (memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>) -> (index)
gpu.terminator
}
```
Let’s write the program above with that:
```
func.func @main() {
gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 {
%i = arith.constant 18 : index
// Step 1: Obtain shared memory directly
%shmem = gpu.dynamic_shared_memory : memref<?xi8, 3>
%c147456 = arith.constant 147456 : index
%c155648 = arith.constant 155648 : index
%7 = memref.view %shmem[%c147456][] : memref<?xi8, 3> to memref<64x64xf16, 3>
%8 = memref.view %shmem[%c155648][] : memref<?xi8, 3> to memref<64x64xf16, 3>
// Step 2: Utilize the shared memory
"test.use.shared.memory"(%7) : (memref<64x64xf16, 3>) -> (index)
"test.use.shared.memory"(%8) : (memref<64x64xf16, 3>) -> (index)
}
}
```
This PR resolves#72513
This relands fbde19a664e5fd7196080fb4ff0aeaa31dce8508, which was broken due to incorrect GEP element type creation.
This commit changes the builders of the `llvm.mlir.addressof` operations
to no longer produce typed pointers.
As a consequence, a GPU to NVVM pattern had to be updated, that still
relied on typed pointers.
This commit changes the builders of the `llvm.mlir.addressof` operations
to no longer produce typed pointers.
As a consequence, a GPU to NVVM pattern and the toy example LLVM lowerings had to be updated, as they still relied on typed pointers.
ConversionPatterns do not (and should not) modify the type converter that they are using.
* Make `ConversionPattern::typeConverter` const.
* Make member functions of the `LLVMTypeConverter` const.
* Conversion patterns take a const type converter.
* Various helper functions (that are called from patterns) now also take a const type converter.
Differential Revision: https://reviews.llvm.org/D157601
This revision untangles a few more conversion pieces and allows rewriting
the relatively intricate (and somewhat inconsistent) LowerGpuOpsToNVVMOpsPass
in a declarative fashion that provides a much better understanding and control.
Differential Revision: https://reviews.llvm.org/D157617
The common GPU operation transformation that lowers `math` operations
to function calls in the `gpu-to-nvvm` and `gpu-to-rocdl` passes handles
`vector` types by applying the function to each scalar and returning a
new vector. However, there was a typo that results in incorrectly
accumulating the result vector, and the rewrite returns an `llvm.mlir.undef`
result instead of the correct vector. A patch is added and tests are
strengthened.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D154269
This revision adds comdat support to functions. Additionally,
it ensures only comdats that have uses are imported/exported and
only non-empty global comdat operations are created.
Reviewed By: Dinistro
Differential Revision: https://reviews.llvm.org/D153739
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call as has been decided as more consistent.
Note that there still exist classes that only define methods directly,
such as AffineExpr, and this does not include work currently to support
a functional cast/isa call.
Caveats include:
- This clang-tidy script probably has more problems.
- This only touches C++ code, so nothing that is being generated.
Context:
- https://mlir.llvm.org/deprecation/ at "Use the free function variants
for dyn_cast/cast/isa/…"
- Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This first patch was created with the following steps. The intention is
to only do automated changes at first, so I waste less time if it's
reverted, and so the first mass change is more clear as an example to
other teams that will need to follow similar steps.
Steps are described per line, as comments are removed by git:
0. Retrieve the change from the following to build clang-tidy with an
additional check:
https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
1. Build clang-tidy
2. Run clang-tidy over your entire codebase while disabling all checks
and enabling the one relevant one. Run on all header files also.
3. Delete .inc files that were also modified, so the next build rebuilds
them to a pure state.
4. Some changes have been deleted for the following reasons:
- Some files had a variable also named cast
- Some files had not included a header file that defines the cast
functions
- Some files are definitions of the classes that have the casting
methods, so the code still refers to the method instead of the
function without adding a prefix or removing the method declaration
at the same time.
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
-header-filter=mlir/ mlir/* -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
git restore mlir/lib/IR mlir/lib/Dialect/DLTI/DLTI.cpp\
mlir/lib/Dialect/Complex/IR/ComplexDialect.cpp\
mlir/lib/**/IR/\
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp\
mlir/lib/Dialect/Vector/Transforms/LowerVectorMultiReduction.cpp\
mlir/test/lib/Dialect/Test/TestTypes.cpp\
mlir/test/lib/Dialect/Transform/TestTransformDialectExtension.cpp\
mlir/test/lib/Dialect/Test/TestAttributes.cpp\
mlir/unittests/TableGen/EnumsGenTest.cpp\
mlir/test/python/lib/PythonTestCAPI.cpp\
mlir/include/mlir/IR/
```
Differential Revision: https://reviews.llvm.org/D150123
The following pattern is common in the llvm codebase, as well as in downstream projects:
```
llvm::to_vector(llvm::map_range(container, lambda))
```
This patch introduces a shortcut for this called `map_to_vector`.
This template depends on both `llvm/ADT/SmallVector.h` and `llvm/ADT/STLExtras.h`, and since these are both relatively large and do not depend on each other, the `map_to_vector` helper is placed in a new header under `llvm/ADT/SmallVectorExtras.h`. Only a handful of use cases have been updated to use the new helper.
Differential Revision: https://reviews.llvm.org/D145390
Add support for argument attributes on workgroup and private
attributions for GPU functions. These arguments are outside the range
of getNumArguments() and get printed separately, so the default
mechanism for function argument attributes can't be used on them.
Having done this, check for the `llvm.align` attribute on workgroup or
private attributions in a `gpu.func` and pass it through to the
relevant allocation op (creating a global or alloca). This allows
people creating kernels that use multiple workgroup buffers to set an
alignment.
(This could, in the future, be a GPU dialect `alignment` attribute,
but I've taken the simpler route of using the LLVM version instead for
simplicity and because I don't know how this might impact backends
like Vulkan)
Reviewed By: nirvedhmeshram
Differential Revision: https://reviews.llvm.org/D148965
Currently the use of bare pointer calling convention is controlled
globally through use of an option in the `LLVMTypeConverter`. To allow
more fine-grained control use an attribute on a function to drive the
calling convention to use.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D147494
Part of https://discourse.llvm.org/t/rfc-switching-the-llvm-dialect-and-dialect-lowerings-to-opaque-pointers/68179
This patch adds the new pass option `use-opaque-pointers` to the GPU to LLVM lowerings (including ROCD and NVVM) and adapts the code to support using opaque pointers in addition to typed pointers.
The required changes mostly boil down to avoiding `getElementType` and specifying base types in GEP and Alloca.
In the future opaque pointers will be the only supported model, hence tests have been ported to using opaque pointers by default. Additional regression tests for typed-pointers have been added to avoid breaking existing clients.
Note: This does not yet port the `GpuToVulkan` passes.
Differential Revision: https://reviews.llvm.org/D144448
Remapping memory spaces is a function often needed in type
conversions, most often when going to LLVM or to/from SPIR-V (a future
commit), and it is possible that such remappings may become more
common in the future as dialects take advantage of the more generic
memory space infrastructure.
Currently, memory space remappings are handled by running a
special-purpose conversion pass before the main conversion that
changes the address space attributes. In this commit, this approach is
replaced by adding a notion of type attribute conversions
TypeConverter, which is then used to convert memory space attributes.
Then, we use this infrastructure throughout the *ToLLVM conversions.
This has the advantage of loosing the requirements on the inputs to
those passes from "all address spaces must be integers" to "all
memory spaces must be convertible to integer spaces", a looser
requirement that reduces the coupling between portions of MLIR.
ON top of that, this change leads to the removal of most of the calls
to getMemorySpaceAsInt(), bringing us closer to removing it.
(A rework of the SPIR-V conversions to use this new system will be in
a folowup commit.)
As a note, one long-term motivation for this change is that I would
eventually like to add an allocaMemorySpace key to MLIR data layouts
and then call getMemRefAddressSpace(allocaMemorySpace) in the
relevant *ToLLVM in order to ensure all alloca()s, whether incoming or
produces during the LLVM lowering, have the correct address space for
a given target.
I expect that the type attribute conversion system may be useful in
other contexts.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D142159
This is a purely mechanical change that introduces an enum attribute in the GPU
dialect to represent the various memref memory spaces as opposed to the
hard-coded integer attributes that are currently used.
The following steps were taken to make the transition across the codebase:
1. Introduce a pass "gpu-lower-memory-space-attributes":
The pass updates all memref types that have a memory space attribute that is a
`gpu::AddressSpaceAttr`. These attributes are changed to `IntegerAttr`'s using a
mapping that is given by the caller. This pass is based on the
"map-memref-spirv-storage-class" pass and the common functions can probably
be refactored into a set of utilities under the MemRef dialect.
2. Update the verifiers of GPU/NVGPU dialect operations.
If a verifier currently checks the address space of an operand using
e.g.`getWorkspaceAddressSpace`, then it can continue to do so. However, the
checks are changed to only fail if the memory space is either missing or a wrong
value of type `gpu::AddressSpaceAttr`. Otherwise, it just assumes the address
space is correct because it was specifically lowered to something other than a
`gpu::AddressSpaceAttr`.
3. Update existing gpu-to-llvm conversion infrastructure.
In the existing gpu-to-X passes, we add a full conversion equivalent to
`gpu-lower-memory-space-attributes` just before doing the conversion to the
LLVMDialect. This is done because currently both the gpu-to-llvm passes
(rocdl,nvvm) run gpu-to-gpu rewrites within the pass, which introduce
`AddressSpaceAttr` memory space annotations. Therefore, I inserted the
memory space conversion between the gpu-to-gpu rewrites and the LLVM
conversion.
For more context see the below discourse discussion:
https://discourse.llvm.org/t/gpu-workgroup-shared-memory-address-space-is-hard-coded/
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D140644
When converting to nvvm lowering gpu.printf to vprintf allows us to
support printing when running on cuda.
Differential Revision: https://reviews.llvm.org/D141049
Reland D139447, D139471 With flang actually working
- FunctionOpInterface: make get/setFunctionType interface methods
This patch removes the concept of a `function_type`-named type attribute
as a requirement for implementors of FunctionOpInterface. Instead, this
type should be provided through two interface methods, `getFunctionType`
and `setFunctionTypeAttr` (*Attr because functions may use different
concrete function types), which should be automatically implemented by
ODS for ops that define a `$function_type` attribute.
This also allows FunctionOpInterface to materialize function types if
they don't carry them in an attribute, for example.
Importantly, all the function "helper" still accept an attribute name to
use in parsing and printing functions, for example.
- FunctionOpInterface: arg and result attrs dispatch to interface
This patch removes the `arg_attrs` and `res_attrs` named attributes as a
requirement for FunctionOpInterface and replaces them with interface
methods for the getters, setters, and removers of the relevent
attributes. This allows operations to use their own storage for the
argument and result attributes.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D139736
and "[mlir] Fix examples build"
This reverts commit fbc253fe81da4e1d6bfa2519e01e03f21d8c40a8 and
96cf183bccd7d1c3083f169a89a6af1f263b3aae.
Which I missed in the first revert in f3379feabe38fd3711b13ffcf6de4aab03b7ccdc.
This patch removes the concept of a `function_type`-named type attribute
as a requirement for implementors of FunctionOpInterface. Instead, this
type should be provided through two interface methods, `getFunctionType`
and `setFunctionTypeAttr` (*Attr because functions may use different
concrete function types), which should be automatically implemented by
ODS for ops that define a `$function_type` attribute.
This also allows FunctionOpInterface to materialize function types if
they don't carry them in an attribute, for example.
Importantly, all the function "helper" still accept an attribute name to
use in parsing and printing functions, for example.
Reviewed By: rriddle, lattner
Differential Revision: https://reviews.llvm.org/D139447
Improve type conversion error propagation/failure during LLVM lowering.
BEFORE
```
llvm-mlir/mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp:304: SmallVector<mlir::Type, 5> mlir::LLVMTypeConverter::getMemRefDescriptorFields(mlir::MemRefType, bool): Assertion `isStrided(type) && "Non-strided layout maps must have been normalized away"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
...
```
AFTER
```
<unknown>:0: error: integer overflow during size computation
<unknown>:0: error: Conversion to strided form failed either due to non-strided layout maps (which should have been normalized away) or other reasons
<unknown>:0: error: failed to legalize operation 'gpu.func' that was explicitly marked illegal
<unknown>:0: note: see current operation:
"gpu.func"() ( {
...
```
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D139072
Unroll ops that map to intrinsics when lowering to LLVM, because intrinsics don't support vector operands/results.
Reviewed By: herhut
Differential Revision: https://reviews.llvm.org/D136345