This PR makes the following improvements to `vector.scatter` and its
lowering pipeline:
- In addition to `memref`, accept a ranked `tensor` as the base operand
of `vector.scatter`, similar to `vector.transfer_write`.
- Implement bufferization support for `vector.scatter`, so that
tensor-based scatter ops can be fully lowered to memref-based forms.
This helps complete the functionality of the map_scatter decomposition.
Full discussion can be found here:
https://github.com/iree-org/iree/issues/21135
---------
Signed-off-by: Ryutaro Okada <1015ryu88@gmail.com>
Uniform values should not be distributed during vector distribution. An
example would be a reduction result where the reduction happens across
lanes.
However, the current `getDistributedType` does not accept a zero-result
affine map (i.e. no distributed dims) when describing the distributed
dimensions. This results in a null type being returned, which crashes
vector distribution in some cases. An example case would be an `scf.for`
op (about to be distributed) in which one of the loop results is a
uniform value that has no user outside the warp op. This necessitates
querying `getDistributedType` to figure out the distributed type of
this value.
The PR https://github.com/llvm/llvm-project/pull/162167 removed a
pattern to linearize vector.splat, without adding the equivalent pattern
for vector.broadcast. This PR adds such a pattern, hopefully bringing
vector.broadcast up to full parity with vector.splat, which has now been
removed.
---------
Signed-off-by: James Newling <james.newling@gmail.com>
In some cases, loop bounds (lower, upper and step) of `scf.for` can come
locally from the parent warp op of the `scf.for`. The current logic does
not yield the loop bounds in the new warp op generated during lowering,
causing the sunk `scf.for` to have non-dominating uses.
This PR adds logic to yield loop bounds by default (treating them like
the other operands of `scf.for`), which fixes this bug.
This PR enhances the elementwise unrolling pattern to support unrolling
from a higher rank to a lower rank. The approach is to add leading unit
dims to the lower-rank targetShape to match the rank of the original
vector (because ExtractStridedSlice requires the same rank to extract
slices), extract the slice, reshape it to targetShape's rank, and
perform the operation.
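The shape arithmetic behind this approach can be sketched in plain C++. This is only an illustration under stated assumptions: `padWithLeadingUnitDims` and `numSlices` are hypothetical helper names, not functions from the PR, and the sketch assumes the padded target shape evenly divides the original shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Hypothetical helper: prepend unit dims to `targetShape` until it has
// `rank` dimensions, so it can be used to slice a rank-`rank` vector.
Shape padWithLeadingUnitDims(const Shape &targetShape, std::size_t rank) {
  assert(targetShape.size() <= rank && "target rank must not exceed rank");
  Shape padded(rank - targetShape.size(), 1);
  padded.insert(padded.end(), targetShape.begin(), targetShape.end());
  return padded;
}

// Hypothetical helper: number of slices needed to cover `original` with
// tiles of shape `paddedTarget` (assumes exact divisibility per dim).
std::int64_t numSlices(const Shape &original, const Shape &paddedTarget) {
  assert(original.size() == paddedTarget.size());
  std::int64_t count = 1;
  for (std::size_t i = 0; i < original.size(); ++i) {
    assert(original[i] % paddedTarget[i] == 0);
    count *= original[i] / paddedTarget[i];
  }
  return count;
}
```

For example, unrolling a `2x4x8` vector with a rank-2 target shape `4x4` pads the target to `1x4x4`, which yields four slices; each slice would then be reshaped from `1x4x4` back to `4x4` before the elementwise op is applied.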
Use wrappers around `std::accumulate` to make the code more concise and
less bug-prone: https://github.com/llvm/llvm-project/pull/162129.
With `std::accumulate`, it's the initial value that determines the
accumulator type. `llvm::sum_of` and `llvm::product_of` pick the right
accumulator type based on the range element type.
Found some funny bugs like a local accumulate helper that calculated a
sum with initial value of 1 -- we didn't hit the bug because the code
was actually dead...
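The accumulator-type pitfall can be reproduced with the standard library alone. The sketch below is an illustration, not the LLVM implementation: `sumOfElements` and `productOfElements` are hypothetical stand-ins that mimic the idea of deriving the accumulator type from the range element type.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <iterator>
#include <numeric>
#include <type_traits>
#include <utility>
#include <vector>

using Dims = std::vector<std::int64_t>;

// With std::accumulate, the literal `0` makes the accumulator (and the
// result) an `int`, even though the elements are 64-bit.
static_assert(
    std::is_same_v<decltype(std::accumulate(
                       std::declval<Dims::const_iterator>(),
                       std::declval<Dims::const_iterator>(), 0)),
                   int>,
    "accumulator type follows the initial value");

// Hypothetical stand-in: the accumulator type follows the element type.
template <typename Range>
auto sumOfElements(const Range &range) {
  using Elem = std::decay_t<decltype(*std::begin(range))>;
  return std::accumulate(std::begin(range), std::end(range),
                         static_cast<Elem>(0));
}

// Hypothetical stand-in: same idea for products, starting from 1.
template <typename Range>
auto productOfElements(const Range &range) {
  using Elem = std::decay_t<decltype(*std::begin(range))>;
  return std::accumulate(std::begin(range), std::end(range),
                         static_cast<Elem>(1), std::multiplies<Elem>{});
}
```

Summing `{0.5, 0.5, 0.5}` with `std::accumulate(..., 0)` truncates every partial sum to `int` and returns `0`, whereas the element-typed helper returns `1.5`.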
Change remaining OpBuilder methods to use `llvm::MaybeAlign` instead of
`uint64_t` for alignment parameters.
---------
Co-authored-by: Erick Ochoa Lopez <erick.ochoalopez@amd.com>
This PR moves the patterns that unroll vector.to_elements and
vector.from_elements into the file with other vector unrolling
operations. This PR also adds these unrolling patterns to
`populateVectorUnrollPatterns` and renames
`populateVectorToElementsLoweringPatterns` and
`populateVectorFromElementsLoweringPatterns` to
`populateVectorToElementsUnrollPatterns` and
`populateVectorFromElementsUnrollPatterns`, respectively.
This PR adds patterns to lower `vector.shuffle` ops whose inputs have
different vector sizes more efficiently. The current LLVM lowering for
these cases degenerates to a sequence of `vector.extract` and
`vector.insert` operations. With this PR, the smaller input is promoted
to the larger vector size by introducing an extra `vector.shuffle`.
Addresses: https://github.com/llvm/llvm-project/issues/115653
We already have utilities to flatten memrefs into 1-D. This change makes
memref flattening a prerequisite for vector narrow type emulation,
ensuring that emulation patterns only need to handle 1-D scenarios.
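As a reminder of what flattening to 1-D means for an access, here is a minimal sketch of index linearization. `rowMajorStrides` and `linearizeIndex` are hypothetical helpers written for illustration; they are not the MLIR utilities the message refers to, and the sketch assumes static shapes and strides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = std::vector<std::int64_t>;

// Hypothetical helper: strides of a contiguous row-major buffer.
Index rowMajorStrides(const Index &shape) {
  Index strides(shape.size(), 1);
  for (std::size_t i = shape.size(); i-- > 1;)
    strides[i - 1] = strides[i] * shape[i];
  return strides;
}

// Hypothetical helper: 1-D position of a multi-dimensional access,
// computed as offset + sum(indices[i] * strides[i]).
std::int64_t linearizeIndex(const Index &indices, const Index &strides,
                            std::int64_t offset = 0) {
  assert(indices.size() == strides.size());
  std::int64_t linear = offset;
  for (std::size_t i = 0; i < indices.size(); ++i)
    linear += indices[i] * strides[i];
  return linear;
}
```

Once every access is expressed through a single linear index like this, a narrow-type emulation pattern only has to reason about one dimension when mapping sub-byte elements onto wider container elements.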
This patch updates the following ops to use `source` (instead of
`vector`) as the name for their source argument:
* `vector.extract`
* `vector.scalable.extract`
* `vector.extract_strided_slice`
This change ensures naming consistency with the "builders" for these Ops
that already use the name `source` rather than `vector`. It also
addresses part of:
* https://github.com/llvm/llvm-project/issues/131602
Specifically, it ensures that we use `source` and `dest` for read and
write operations, respectively (as opposed to `vector` and `dest`).
This PR adds `scf.if` op distribution to the existing `VectorDistribute`
patterns. The logic mostly follows that of `scf.for`: move op outside, wrap each
branch with `gpu.warp_execute_on_lane_0`. A notable difference to `scf.for` is
that each branch has its own set of escaping values, and `scf.if` itself does not
have block arguments.
This PR adds a distribution pattern for
[`vector.step`](https://mlir.llvm.org/docs/Dialects/Vector/#vectorstep-vectorstepop)
op.
The result of the step op is a vector containing a sequence
`[0,1,...,N-1]`. For the warp distribution, we consider a vector with `N
== warp_size` (think SIMD). Distributing it to SIMT means that each
lane is represented by a thread/lane id scalar.
More complex cases with the support for warp size multiples (e.g.,
`[0,1,...,2*N-1]`) require additional layout information to be handled
properly. Such support may be added later.
The lane id scalar is wrapped into a `vector<1xindex>` to emulate the
sequence distribution result.
Other than that, the distribution is similar to that of
`arith.constant`.
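The intended result of the distribution can be modelled outside MLIR. The sketch below is a simulation under simple assumptions (one element per lane, `N == warp_size`); `distributeStep` and `gatherLanes` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models the per-lane `vector<1xindex>` value.
using LaneValue = std::vector<std::int64_t>;

// Each lane holds a single-element vector containing its own lane id.
std::vector<LaneValue> distributeStep(std::int64_t warpSize) {
  assert(warpSize > 0);
  std::vector<LaneValue> perLane;
  perLane.reserve(static_cast<std::size_t>(warpSize));
  for (std::int64_t laneId = 0; laneId < warpSize; ++laneId)
    perLane.push_back(LaneValue{laneId});
  return perLane;
}

// Concatenating the per-lane values in lane order recovers the original
// undistributed sequence [0, 1, ..., N-1].
std::vector<std::int64_t> gatherLanes(const std::vector<LaneValue> &perLane) {
  std::vector<std::int64_t> sequence;
  for (const LaneValue &value : perLane)
    sequence.insert(sequence.end(), value.begin(), value.end());
  return sequence;
}
```

With a warp size of 4, lane 2 holds `[2]` and gathering all lanes gives `[0, 1, 2, 3]`, which matches the undistributed `vector.step` result.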
This PR is a follow-up to #151175, which added support for lowering
multi-dimensional `vector.from_elements` ops to LLVM by introducing an
unrolling pattern.
## Changes
### Add `vector.shape_cast`-based flattening pattern for `vector.from_elements`
This change introduces a new linearization pattern that uses
`vector.shape_cast` to flatten multi-dimensional `vector.from_elements`
operations. This provides an alternative approach to the unrolling-based
method introduced in #151175.
**Example:**
```mlir
// Before
%v = vector.from_elements %e0, %e1, %e2, %e3 : vector<2x2xf32>
// After
%flat = vector.from_elements %e0, %e1, %e2, %e3 : vector<4xf32>
%result = vector.shape_cast %flat : vector<4xf32> to vector<2x2xf32>
```
---------
Co-authored-by: Yang Bai <yangb@nvidia.com>
Co-authored-by: James Newling <james.newling@gmail.com>
Adds a utility getter to `warp_execute_on_lane_0` which simplifies
access to the op's terminator.
Uses are refactored to utilize the new terminator getter.
The viewLikeOpInterface abstracts the behavior of an operation that views
one buffer as another. However, the current interface only includes a
"getViewSource" method and lacks a "getViewDest" method.
Previously, it was generally assumed that viewLikeOpInterface operations
would have only one return value, which was the view dest. This
assumption was broken by memref.extract_strided_metadata, and more
operations may break these silent conventions in the future. Calling
"viewLikeInterface->getResult(0)" may lead to a core dump at runtime.
Therefore, we need a 'getViewDest' method to standardize our behavior.
This patch adds the getViewDest function to viewLikeOpInterface and
modifies the usage points of viewLikeOpInterface to standardize its use.
In the FoldArithToVectorOuterProduct pattern, a static cast to a vector
type triggers an assertion when a scalar type is encountered. It seems
the author meant to use a dyn_cast instead.
This NFC patch handles it by using dyn_cast.
This patch extends the pattern that rewrites elementwise operations
whose inputs are all broadcast from the same shape to handle mixed
types, such as when the result and input types don't match, or when the
inputs have multiple types.
PR #150867 failed to check for the possibility of type mismatches when
rewriting splat constants. In order to fix that issue, we add support
for mixed-type operations more generally.
There is a pattern that rewrites
elementwise_op(broadcast(x1 : T to U), broadcast(x2 : T to U), ...) to
broadcast(elementwise_op(x1, x2, ...) : T to U).
This pattern did not, however, account for the case where a broadcast
constant is represented as a SplatElementsAttr, which can safely be
reshaped or scalarized but is not a `vector.broadcast` or `vector.splat`
operation.
This patch fixes this oversight, preventing premature broadcasting.
This did result in the need to update some linalg dialect tests, which
now feature a less-broadcast computation and/or more constant folding.
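The identity that this rewrite relies on can be checked on plain arrays. The sketch below is an illustration, not the MLIR pattern: `broadcast` and `elementwise` are hypothetical helpers, and the example deliberately uses mixed operand and result types.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical helper: replicate a scalar into an n-element vector.
template <typename T>
std::vector<T> broadcast(T x, std::size_t n) {
  return std::vector<T>(n, x);
}

// Hypothetical helper: apply a binary op lane by lane. Operand and result
// element types may all differ.
template <typename A, typename B, typename Op>
auto elementwise(const std::vector<A> &a, const std::vector<B> &b, Op op) {
  using R = std::decay_t<decltype(op(a[0], b[0]))>;
  assert(a.size() == b.size());
  std::vector<R> out;
  out.reserve(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out.push_back(op(a[i], b[i]));
  return out;
}
```

For an op such as `[](int a, float b) { return a * b; }`, applying it to two broadcast vectors gives the same values as applying it once to the scalars and broadcasting the result, which is why the op can be moved ahead of the broadcast. A splat constant behaves like a broadcast scalar for this purpose.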
The crash occurs because, during IR transformation, the
vector-unrolling pass (using ExtractStridedSliceOp) attempts to slice an
input vector of higher rank using a target vector of lower rank, which
is not supported. Fixes #148368.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
This patch deletes `vector.matrix_multiply` and `vector.flat_transpose`,
which are thin wrappers around the corresponding LLVM intrinsics:
- `llvm.intr.matrix.multiply`
- `llvm.intr.matrix.transpose`
These Vector dialect ops did not provide additional semantics or
abstraction beyond the LLVM intrinsics. Their removal simplifies the
lowering pipeline without losing any functionality.
The lowering chains:
- `vector.contract` → `vector.matrix_multiply` →
`llvm.intr.matrix.multiply`
- `vector.transpose` → `vector.flat_transpose` →
`llvm.intr.matrix.transpose`
are now replaced with:
- `vector.contract` → `llvm.intr.matrix.multiply`
- `vector.transpose` → `llvm.intr.matrix.transpose`
This was accomplished by directly replacing:
- `vector::MatrixMultiplyOp` with `LLVM::MatrixMultiplyOp`
- `vector::FlatTransposeOp` with `LLVM::MatrixTransposeOp`
Note: To avoid a build-time dependency from `Vector` to `LLVM`,
relevant transformations are moved from "Vector/Transforms" to
`Conversion/VectorToLLVM`.
Reapply attempt for: https://github.com/llvm/llvm-project/pull/148291
Fix for the build failure reported in:
https://lab.llvm.org/buildbot/#/builders/116/builds/15477
-----
This crash is caused by a mismatch between the distributed type returned
by `getDistributedType` and the intended distributed type for forOp
results.
Solution diff:
20c2cf6766
Example:
```mlir
func.func @warp_scf_for_broadcasted_result(%arg0: index) -> vector<1xf32> {
  %c128 = arith.constant 128 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %2 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<1xf32>) {
    %ini = "some_def"() : () -> (vector<1xf32>)
    %0 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini) -> (vector<1xf32>) {
      %1 = "some_op"(%arg4) : (vector<1xf32>) -> (vector<1xf32>)
      scf.yield %1 : vector<1xf32>
    }
    gpu.yield %0 : vector<1xf32>
  }
  return %2 : vector<1xf32>
}
```
In this case the distributed type for the forOp result is `vector<1xf32>`
(the result is not distributed and is instead broadcast to all lanes).
However, `getDistributedType` will return a null type here.
Therefore, if the distributed type can be recovered from the warpOp, we
should always do that first before using `getDistributedType`.
This enables memref.load/store + vector.load/store support for sub-byte
float types. Since the memref types don't matter for loads/stores, we
still use the same types as integers with equivalent widths, with a few
extra bitcasts needed around certain operations.
There is no direct change needed for vector.load/store support. The
tests added for them are to verify that float types are
supported as well.