This PR continues the introduction of poison as an initialization
vector, in this case in LowerVectorBitCast,
LowerVectorBroadcast and LowerVectorTranspose.
This is the first PR that introduces `ub.poison` vectors as part of a
rewrite/conversion pattern in the Vector dialect. It replaces the
`arith.constant dense<0>` vector initialization for
`vector.insert_strided_slice` ops with a poison vector.
This PR depends on all the previous PRs that introduced support for
poison in Vector operations such as `vector.shuffle`, `vector.extract`,
`vector.insert`, including ODS, canonicalization and lowering support.
This PR may improve end-to-end compilation time through LLVM, depending
on the workloads.
This PR adds a folder for `vector.extract(ub.poison) -> ub.poison`. It
also replaces `create` with `createOrFold` insert/extract ops in vector
unroll and transpose lowering patterns to trigger the poison foldings
introduced recently.
This is PR 2 in a series of N patches aimed at improving
"VectorEmulateNarrowType.cpp". This is mainly minor refactoring, no
major functional changes are made/added.
**CHANGE 1**
Renames the variable "scale". Note, "scale" could mean either:
* "container-elements-per-emulated-type", or
* "emulated-elements-per-container-type".
While from the context it is clear that it's always the latter (the
emulated type is always a sub-byte type and the container type is usually
`i8`), this PR reduces the cognitive load by making this explicit.
**CHANGE 2**
Replaces `isUnalignedEmulation` with `isFullyAligned`.
Note, `isUnalignedEmulation` is always computed following a
"per-element-alignment" condition:
```cpp
// Check per-element alignment.
if (containerBits % emulatedBits != 0) {
  return rewriter.notifyMatchFailure(
      op, "impossible to pack emulated elements into container elements "
          "(bit-wise misalignment)");
}
// (...)
bool isUnalignedEmulation = origElements % emulatedPerContainerElem != 0;
```
Given that `isUnalignedEmulation` captures only one of the two conditions
required for "full alignment", a more accurate name would be
`isPartiallyUnalignedEmulation`. Instead, I've flipped the condition and
renamed it to `isFullyAligned`:
```cpp
bool isFullyAligned = origElements % emulatedPerContainerElem == 0;
```
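As a rough illustration of how the two conditions combine, the sketch below models them with plain integers instead of MLIR types. The struct and function names are hypothetical and do not appear in VectorEmulateNarrowType.cpp; only `containerBits`, `emulatedBits`, `origElements`, `emulatedPerContainerElem` and `isFullyAligned` mirror the snippets above.

```cpp
#include <cassert>

// Hypothetical helper (not part of MLIR): models the two alignment checks
// with plain integers.
struct AlignmentInfo {
  bool perElementAligned;       // container width is a multiple of emulated width
  int emulatedPerContainerElem; // e.g. 4 for i2 packed into i8; 0 if misaligned
  bool isFullyAligned;          // the vector fills whole container elements
};

inline AlignmentInfo computeAlignment(int emulatedBits, int containerBits,
                                      int origElements) {
  assert(emulatedBits > 0 && containerBits > 0 && origElements > 0);
  // Condition 1: per-element alignment. If it fails, packing is impossible.
  if (containerBits % emulatedBits != 0)
    return AlignmentInfo{false, 0, false};
  int emulatedPerContainerElem = containerBits / emulatedBits;
  // Condition 2: the element count is a multiple of the packing factor.
  bool isFullyAligned = origElements % emulatedPerContainerElem == 0;
  return AlignmentInfo{true, emulatedPerContainerElem, isFullyAligned};
}
```

With these definitions, `computeAlignment(4, 8, 8)` (a `vector<8xi4>` in `i8` containers) is fully aligned, while `computeAlignment(2, 8, 7)` (a `vector<7xi2>`) passes the per-element check but is not fully aligned.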
**CHANGE 3**
* Unifies various comments throughout the file (for consistency).
* Adds new comments throughout the file and adds TODOs where high-level
comments are missing.
**GitHub issue to track this work**:
https://github.com/llvm/llvm-project/issues/123630
Updates `emulatedVectorLoad` that was introduced in #115922.
Specifically, `emulatedVectorLoad` currently mixes "emulated type" and
"container type". This only became clear after #123526, in which the
concepts of "emulated" and "container" types were introduced.
This is an NFC change and simply updates the variable naming.
This is PR 1 in a series of N patches aimed at improving
"VectorEmulateNarrowType.cpp". This is mainly minor refactoring, no
major functional changes are made/added.
This PR renames `srcBits`/`dstBits` and `oldElementType`/`newElementType`
to use the "emulated"/"container" terminology, which improves naming
consistency within the file. This is illustrated below:
```cpp
// Extracted from VectorEmulateNarrowType.cpp
// BEFORE (mixing old/new and src/dst):
// Type oldElementType = op.getType().getElementType();
// Type newElementType = convertedType.getElementType();
// int srcBits = oldElementType.getIntOrFloatBitWidth();
// int dstBits = newElementType.getIntOrFloatBitWidth();
// AFTER (consistently using emulated/container):
Type emulatedElemType = op.getType().getElementType();
Type containerElemType = convertedType.getElementType();
int emulatedBits = emulatedElemType.getIntOrFloatBitWidth();
int containerBits = containerElemType.getIntOrFloatBitWidth();
```
Also adds some comments and unifies related "rewriter notification"
messages.
**GitHub issue to track this work:**
* https://github.com/llvm/llvm-project/issues/123630
It looks like scalable `vector.insert_strided_slice`/`vector.extract_strided_slice`
ops made their way through lowering patterns that generate `vector.shuffle`
ops. I'm not sure why this wasn't caught by the verifier, probably because the
shuffle op was folded into something else as part of the same rewrite
and the IR wasn't verified.
This PR fixes the issue by preventing scalable
`vector.insert_strided_slice`/`vector.extract_strided_slice` ops from being
lowered to vector shuffles. Instead, they are now lowered to a
sequence of element-wise insert/extract ops using an existing pattern.
This patch enables unaligned, statically indexed stores of vectors whose
element types are narrower than the emulation width.
To illustrate the mechanism, consider the example of storing
`vector<7xi2>` into `memref<3x7xi2>` at `[1, 0]`.
In this case the linearized indices of those bits being overwritten are
[14, 28), which are:
* the last 2 bits of byte no.2
* byte no.3
* first 4 bits of byte no.4
Because memory is accessed in bytes, bytes no.2 and no.4 in the above
example are only partially modified.
In a multi-threaded scenario, these two bytes must be updated atomically
in order to avoid data contention.
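The byte arithmetic in this example can be sketched with plain integers. The helper below is hypothetical (it is not part of the patch or of MLIR) and only reports which bytes a store touches and whether the first and last of them are partially written. It uses 0-based byte indices, whereas the description above counts bytes from 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of MLIR): describes the bytes touched by a
// store of `numElems` sub-byte elements that starts at linearized element
// index `linearIndex`. Byte indices are 0-based.
struct StoreFootprint {
  int64_t firstBit;      // inclusive
  int64_t endBit;        // exclusive
  int64_t firstByte;     // first byte touched
  int64_t lastByte;      // last byte touched
  bool firstBytePartial; // the store starts in the middle of firstByte
  bool lastBytePartial;  // the store ends in the middle of lastByte
};

inline StoreFootprint computeStoreFootprint(int64_t linearIndex,
                                            int64_t numElems,
                                            int64_t elemBits) {
  assert(linearIndex >= 0 && numElems > 0);
  assert(elemBits > 0 && elemBits < 8);
  StoreFootprint fp;
  fp.firstBit = linearIndex * elemBits;
  fp.endBit = fp.firstBit + numElems * elemBits;
  fp.firstByte = fp.firstBit / 8;
  fp.lastByte = (fp.endBit - 1) / 8;
  fp.firstBytePartial = fp.firstBit % 8 != 0;
  fp.lastBytePartial = fp.endBit % 8 != 0;
  return fp;
}
```

For the example above, index `[1, 0]` in a `3x7` memref linearizes to element 7, so `computeStoreFootprint(7, 7, 2)` covers bits [14, 28) and 0-based bytes 1 through 3 (bytes no.2 to no.4 in the 1-based numbering), with both boundary bytes partial. Those two partial bytes are the ones that would need atomic handling.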
This PR implements a generalization of the existing, more efficient
lowering of shape casts from 2-D to 1-D and 1-D to 2-D vectors. This
significantly reduces code size and generates more performant code for
n-D shape casts that make their way to LLVM/SPIR-V.
`BreakDownVectorBitCast` leverages
* `vector.extract_strided_slice` + `vector.insert_strided_slice`
As these ops do not support extracting scalable sub-vectors (i.e.
extracting/inserting a fraction of a scalable dim), it's best to bail
out.
0-d vectors are supported now and so these patterns are no longer
required. This covers a part of this issue
https://github.com/llvm/llvm-project/issues/112913 . Additionally this
removes %arg2 in mlir/test/Conversion/GPUCommon/transfer_write.mlir and
renames %arg3 to %arg2 as %arg2 was originally not required.
In the LLVM style guide, we prefer not using braced initializer lists to
call a constructor. Also, we prefer using an equals sign before the open curly
brace if we use a braced initializer list when initializing a variable.
See
https://llvm.org/docs/CodingStandards.html#do-not-use-braced-initializer-lists-to-call-a-constructor
for more details.
The style guide does not explain the reason well. There is an article
from abseil that mentions a few benefits, e.g., we can avoid the most
vexing parse. See https://abseil.io/tips/88 for more details.
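A small standalone sketch of the distinction, using only the standard library. The function names are made up for illustration and are not taken from the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative only: the function names are hypothetical.

// Parentheses call a constructor: "3 copies of the value 7".
inline std::vector<int> makeWithParens() {
  std::vector<int> v(3, 7);
  assert(v.size() == 3);
  return v;
}

// '=' followed by braces reads as a list of values: the two elements 3 and 7.
// This is the spelling preferred when a braced list initializes a variable.
inline std::vector<int> makeWithBraces() {
  std::vector<int> v = {3, 7};
  assert(v.size() == 2);
  return v;
}

// A constructor call that performs logic is spelled with parentheses.
inline std::string makeDashes() {
  std::string s(4, '-');
  return s;
}
```

The two vector spellings look similar but build different containers, which is one reason to keep "call a constructor" and "aggregate a list of values" visually distinct.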
Signed-off-by: hanhanW <hanhan0912@gmail.com>
Adds rewrites for i2 to i8 signed and unsigned extension, similar to the
ones that already exist for i4 to i8 conversion.
I use this for i6 quantized models, and this gives me roughly a 2x
speedup for an i6 4096x4096 dequantization-matmul on an AMD 5950x.
I didn't add the rewrite for i8 to i2 truncation because I currently
don't use it, but if this is needed, I can add it as well.
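To make the semantics of the extension concrete, here is a scalar model of unpacking one byte into four `i2` values. It is a sketch, not the rewrite itself: the actual patterns operate on vectors with MLIR ops, the function names are hypothetical, and the element order (element 0 in the least-significant bits) is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model (not the MLIR rewrite): unpacks the four i2
// values stored in one byte. Assumes element 0 lives in the two
// least-significant bits.
inline std::array<int8_t, 4> extendSignedI2(uint8_t packed) {
  std::array<int8_t, 4> out{};
  for (int k = 0; k < 4; ++k) {
    int field = (packed >> (2 * k)) & 0x3; // raw 2-bit value in [0, 3]
    // Sign-extend: (x ^ 2) - 2 maps 0, 1, 2, 3 to 0, 1, -2, -1.
    out[k] = static_cast<int8_t>((field ^ 0x2) - 0x2);
  }
  return out;
}

// Unsigned extension keeps the raw 2-bit value.
inline std::array<uint8_t, 4> extendUnsignedI2(uint8_t packed) {
  std::array<uint8_t, 4> out{};
  for (int k = 0; k < 4; ++k)
    out[k] = static_cast<uint8_t>((packed >> (2 * k)) & 0x3);
  return out;
}
```

For the byte `0xE4` (bit fields `11 10 01 00` from high to low), the signed variant yields `{0, 1, -2, -1}` and the unsigned variant yields `{0, 1, 2, 3}`.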
---------
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
In `Gather1DToConditionalLoads`, we currently check whether the stride
of the most minor dim of the input memref is 1. If it is not, the
rewrite pattern is not applied. However, according to the
verification of `vector.load` here:
4e32271e8b/mlir/lib/Dialect/Vector/IR/VectorOps.cpp (L4971-L4975)
if the output vector type of `vector.load` contains only one element,
the stride requirement on the input memref can be ignored, i.e.
the input memref may have any stride layout attribute in that case.
So we can allow more cases when lowering `vector.gather` by relaxing
this check.
As shown in the test case attached in this patch
[here](1933fbad58/mlir/test/Dialect/Vector/vector-gather-lowering.mlir (L151)),
now `vector.gather` of memref with non-trivial stride can be lowered
successfully if the result vector contains only one element.
---------
Signed-off-by: PragmaTwice <twice@apache.org>
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
This commit updates the internal `ConversionValueMapping` data structure
in the dialect conversion driver to support 1:N replacements. This is
the last major commit for adding 1:N support to the dialect conversion
driver.
Since #116470, the infrastructure already supports 1:N replacements. But
the `ConversionValueMapping` still stored 1:1 value mappings. To that
end, the driver inserted temporary argument materializations (converting
N SSA values into 1 value). This is no longer the case. Argument
materializations are now entirely gone. (They will be deleted from the
type converter after some time, when we delete the old 1:N dialect
conversion driver.)
Note for LLVM integration: Replace all occurrences of
`addArgumentMaterialization` (except for 1:N dialect conversion passes)
with `addSourceMaterialization`.
---------
Co-authored-by: Markus Böck <markus.boeck02@gmail.com>
The greedy rewriter is used in many different flows and it offers a lot of
convenience (work list management, debugging actions, tracing, etc.). But
it combines two kinds of greedy behavior: 1) how ops are matched, and 2)
folding wherever it can.
These are independent forms of greediness, and combining them leads to
inefficiency, e.g., in cases where one needs to create different phases in
lowering and is required to apply patterns in a specific order split across
different passes. Using the driver, one ends up needlessly retrying folding
or having multiple rounds of folding attempts where one final run would have
sufficed.
Of course, folks can locally avoid this behavior by building their
own driver, but this is also a commonly requested feature that folks keep
working around locally in suboptimal ways.
For downstream users, there should be no behavioral change. Updating
from the deprecated API should just be a find and replace (e.g., of the `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety),
as the API arguments haven't changed between the two.
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
Continue the move of `warp_execute_on_lane_0` op to the gpu dialect
(#116994). This patch creates a utils library in GPU and moves generic
helper functions there.
This patch simplifies and extends the logic used when compressing masks
emitted by `vector.constant_mask` to support extracting 1-D vectors from
multi-dimensional vector loads. It streamlines mask computation, making
it applicable for multi-dimensional mask generation, improving the
overall handling of masked load operations.
Previously, when `numFrontPadElems` was not zero, `getCompressedMaskOp`
produced a wrong result if the mask generator op was a
`vector.create_mask`.
This patch resolves the issue by including `numFrontPadElems` in the
mask generation.
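The effect of the front padding on a prefix mask can be sketched with scalar arithmetic. The function below is a hypothetical model, not the code in `getCompressedMaskOp`: it assumes the mask enables the first `numActiveElems` narrow elements, and that the compressed mask must enable every container element that holds at least one active narrow element.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of compressing a prefix mask (the kind that
// `vector.create_mask` produces). `elemsPerContainer` narrow elements share
// one container element, and the data starts `numFrontPadElems` narrow
// elements into the first container element.
inline int64_t compressedMaskSize(int64_t numActiveElems,
                                  int64_t numFrontPadElems,
                                  int64_t elemsPerContainer) {
  assert(numActiveElems >= 0 && elemsPerContainer > 0);
  assert(numFrontPadElems >= 0 && numFrontPadElems < elemsPerContainer);
  if (numActiveElems == 0)
    return 0;
  // One past the last active narrow element, counted from the start of the
  // first container element.
  int64_t endElem = numFrontPadElems + numActiveElems;
  // Ceiling division: every container element holding an active narrow
  // element must be enabled.
  return (endElem + elemsPerContainer - 1) / elemsPerContainer;
}
```

With four `i2` elements per `i8`, three active elements and no front padding fit in one container element, but the same three elements behind two padding elements spill into a second one: `compressedMaskSize(3, 0, 4)` is 1, while `compressedMaskSize(3, 2, 4)` is 2. Ignoring the padding would leave the second container element masked off.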
Signed-off-by: Alan Li <me@alanli.org>
Previously, non-static indices were not expected: when the index was a
non-constant value, the code crashed.
This patch makes sure it returns gracefully instead of crashing.
This is a NFC-ish change that moves
vector.extractelement/vector.insertelement vector distribution patterns
to vector.insert/vector.extract.
Before:
0-d/1-d vector.extract -> vector.extractelement -> distributed
vector.extractelement
2-d+ vector.extract -> distributed vector.extract
After:
scalar input vector.extract -> distributed vector.extract
vector.extractelement -> distributed vector.extract
2-d+ vector.extract -> distributed vector.extract
The same changes are done for insertelement/insert. The change allows us
to remove reliance on vector.extractelement/vector.insertelement, which
are soon to be deprecated:
https://discourse.llvm.org/t/rfc-psa-remove-vector-extractelement-and-vector-insertelement-ops-in-favor-of-vector-extract-and-vector-insert-ops/71116/8
No extra tests are included because this patch doesn't introduce or
remove any functionality. It only changes the chain of lowerings. This
change could be completely NFC if we made the distributed operation
vector.extractelement/vector.insertelement, but that would be slightly weird,
because it would go from extractelement -> extract -> extractelement.
This commit adds support for handling mask constants generated by the
`arith.constant` op in the `VectorEmulateNarrowType` pattern.
Previously, this pattern would not match due to the lack of mask
constant handling in `getCompressedMaskOp`.
The changes include:
1. Updating `getCompressedMaskOp` to recognize and handle
`arith.constant` ops as mask value sources.
2. Handling cases where the mask is not aligned with the emulated load
width. The compressed mask is adjusted to account for the offset.
Limitations:
- The arith.constant op can only have 1-dimensional constant values.
Resolves: #115742
Signed-off-by: Alan Li <me@alanli.org>
Add a new helper function `isReachable` to `Block`. This function
traverses all successors of a block to determine if another block is
reachable from the current block.
This functionality has been reimplemented in multiple places in MLIR,
and possibly in downstream projects as well. Therefore, this PR moves it
to a common place.
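The traversal described above can be sketched on a toy control-flow graph. This is not the MLIR `Block` API: the `Cfg` alias and the integer block ids are stand-ins chosen for illustration, and treating a block as reachable from itself only through a cycle is an assumption of this sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a control-flow graph: block `i` has the successor
// indices listed in `successors[i]`. This is not the MLIR `Block` API.
using Cfg = std::vector<std::vector<int>>;

// Returns true if `to` can be reached from `from` by following one or more
// successor edges. Iterative depth-first search with a visited set, so
// cycles terminate.
inline bool isReachable(const Cfg &successors, int from, int to) {
  assert(from >= 0 && from < static_cast<int>(successors.size()));
  assert(to >= 0 && to < static_cast<int>(successors.size()));
  std::vector<bool> visited(successors.size(), false);
  std::vector<int> worklist(successors[from].begin(), successors[from].end());
  while (!worklist.empty()) {
    int block = worklist.back();
    worklist.pop_back();
    if (block == to)
      return true;
    if (visited[block])
      continue;
    visited[block] = true;
    for (int succ : successors[block])
      worklist.push_back(succ);
  }
  return false;
}
```

For a graph with edges 0 -> 1, 1 -> 2 and 2 -> 1, block 2 is reachable from block 0 and block 1 is reachable from itself through the loop, but block 0 is not reachable from block 2.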
In `staticallyExtractSubvector`, when the extracted slice is the same
as the source vector, there is no need to emit `vector.extract_strided_slice`.
This fixes the lit test case `@vector_store_i4` in
`mlir/test/Dialect/Vector/vector-emulate-narrow-type.mlir`, where
converting from `vector<8xi4>` to `vector<4xi8>` does not need slice
extraction.
The issue was introduced in #113411 and #115070, CI failure link:
https://buildkite.com/llvm-project/github-pull-requests/builds/118845
This PR does not include a lit test case because it is a fix and the
above mentioned `@vector_store_i4` test actually tests the mechanism.
Signed-off-by: Alan Li <me@alanli.org>
All patterns in populateVectorNarrowTypeEmulationPatterns currently
assume a 1-D vector load/store rather than an n-D vector load/store.
This assumption is evident in ConvertVectorTransferRead, for example,
here (extracted from `ConvertVectorTransferRead`):
```cpp
auto newRead = rewriter.create<vector::TransferReadOp>(
    loc, VectorType::get(numElements, newElementType), adaptor.getSource(),
    getValueOrCreateConstantIndexOp(rewriter, loc, linearizedIndices),
    newPadding);

auto bitCast = rewriter.create<vector::BitCastOp>(
    loc, VectorType::get(numElements * scale, oldElementType), newRead);
```
Both invocations of `VectorType::get()` here generate a 1-D vector.
Attempts to use these patterns with more generic cases, such as 2-D
vectors, fail. For example, trying to cast the following 2-D case to
`i32`:
```mlir
func.func @vector_maskedload_2d_i8_negative(
    %idx1: index,
    %idx2: index,
    %num_elems: index,
    %passthru: vector<2x4xi8>) -> vector<2x4xi8> {
  %0 = memref.alloc() : memref<3x4xi8>
  %mask = vector.create_mask %num_elems, %num_elems : vector<2x4xi1>
  %1 = vector.maskedload %0[%idx1, %idx2], %mask, %passthru :
    memref<3x4xi8>, vector<2x4xi1>, vector<2x4xi8> into vector<2x4xi8>
  return %1 : vector<2x4xi8>
}
```
Casting this to `i32` produces the following error:
```bash
error: 'vector.bitcast' op failed to verify that all of {source, result} have same rank
%1 = vector.maskedload %0[%idx1, %idx2], %mask, %passthru :
^
```
Instead of reworking these patterns (that's going to require much more
effort), I’ve marked them as 1-D only and extended
"TestEmulateNarrowTypePass" with an option to disable the Memref type
converter - that's to be able to add negative tests (otherwise, the type
converter throws an error we can't really test for). While not ideal,
this workaround should suit a test pass.
Based on the existing emulation scheme, this patch adds support for
dynamic indexing by dynamically creating an intermediate new mask and new
passthru vector, and dynamically inserting the result into the destination
vector. The dynamic parts are constructed with multiple `vector.extract` and
`vector.insert` ops that rearrange the original mask/passthru vector, as
`vector.insert_strided_slice` and `vector.extract_strided_slice` only
take static offsets and indices.
Note: currently this only supports `vector.maskedload` with masks created
by `vector.constant_mask`. `vector.create_mask` is currently not
working.
---------
Co-authored-by: hasekawa-takumi <167335845+hasekawa-takumi@users.noreply.github.com>
The documentation for narrow-type emulation was sparse, so I’ve expanded
it with additional clarifications (e.g., specifying that the example
discusses `i4` -> `i8` emulation).
I also noticed some inconsistencies in testing for narrow-type
emulation, with several cases covered only for "loading" and missing for
"storing." To address this, I’ve:
* Added comments in the test file for easier reference,
* Added the missing tests for `vector.maskedstore`.
Additionally, I’ve renamed tests for `vector.masked{load|store}` for
clarity:
* `@vector_cst_maskedload_i8` -> `@vector_maskedload_i8_constant_mask`.
This makes it easier to contrast with similar functions, such as
`@vector_maskedload_i8`.
Lastly, I’ve added a high-level comment in VectorEmulateNarrowType.cpp
to clarify the overall design and intent of the file.
This patch fixes:
mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp:202:2:
error: extra ';' outside of a function is incompatible with C++98
[-Werror,-Wc++98-compat-extra-semi]
* Supports `vector.load` and `vector.transfer_read` ops.
* In the case of dynamic indexing, use per-element insertion/extraction
to build desired narrow type vectors.
* Fixed wrong function comment of `getCompressedMaskOp`.
---------
Co-authored-by: Han-Chung Wang <hanhan0912@gmail.com>
Currently, the lowering for vector.step lives
under a folder. This is not ideal if we want
to do transformations on it and defer the
materialization of the constants until much later.
This commit adds a rewrite pattern that
can be applied via the
`transform.structured.vectorize_children_and_apply_patterns`
transform dialect operation.
Moreover, the rewriter of vector.step is
now also used in the -convert-vector-to-llvm pass, where
it handles scalable and non-scalable types as
LLVM expects.
As a consequence of removing the vector.step
lowering from its folder, linalg vectorization
will keep vector.step intact.
Previously, the pass only supported emulated loads of vectors whose size
is a multiple of the emulated data type. This patch adds support for
sizes that are not multiples of the byte size. In
such cases, the element values are packed back-to-back to preserve
memory space.
To give a concrete example: if an input has type `memref<3x3xi2>`, it is
actually occupying 3 bytes in memory, with the first 18 bits storing the
values and the last 6 bits as padding. The slice of `vector<3xi2>` at
index `[2, 0]` is stored in memory from bit 12 to bit 18. To properly
load the elements from bit 12 to bit 18 from memory, first load byte 2
and byte 3, and convert it to a vector of `i2` type; then extract bits 4
to 10 (element index 2-5) to form a `vector<3xi2>`.
A limitation of this patch is that the linearized index of the unaligned
vector has to be known at compile time. Extra code needs to be emitted
to handle it if the condition does not hold.
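The arithmetic behind the `memref<3x3xi2>` example can be sketched with plain integers. The helper below is hypothetical and not part of the patch: it computes which bytes to load and how many elements to skip at the front of the loaded vector once it is viewed as sub-byte elements. Byte indices are 0-based here, whereas the example above counts bytes from 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of MLIR): plans an emulated load of
// `numElems` sub-byte elements starting at linearized element index
// `linearIndex`. Byte indices are 0-based.
struct LoadPlan {
  int64_t firstByte;        // first byte to load
  int64_t numBytes;         // number of whole bytes to load
  int64_t numFrontPadElems; // elements to skip in the loaded narrow vector
};

inline LoadPlan planUnalignedLoad(int64_t linearIndex, int64_t numElems,
                                  int64_t elemBits) {
  assert(linearIndex >= 0 && numElems > 0);
  assert(elemBits > 0 && 8 % elemBits == 0);
  int64_t elemsPerByte = 8 / elemBits;
  LoadPlan plan;
  plan.firstByte = linearIndex / elemsPerByte;
  plan.numFrontPadElems = linearIndex % elemsPerByte;
  // Round up so that a trailing, partially used byte is loaded as well.
  plan.numBytes =
      (plan.numFrontPadElems + numElems + elemsPerByte - 1) / elemsPerByte;
  return plan;
}
```

Index `[2, 0]` in a `3x3` memref linearizes to element 6, so `planUnalignedLoad(6, 3, 2)` loads 2 bytes starting at 0-based byte 1 (bytes 2 and 3 in the 1-based numbering) and skips 2 elements, which selects elements 2 to 4 of the loaded `vector<8xi2>`.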
The following ops are updated:
* `vector::LoadOp`
* `vector::TransferReadOp`
* `vector::MaskedLoadOp`
Since ddf2d62c7d, 0-d vectors are supported in VectorType. This patch
removes 0-d vector
handling with scalars for the TransferOpReduceRank pattern. This pattern
specifically introduces tensor.extract_slice during vectorization,
causing vectorization to not fold transfer_read/transfer_write slices
properly. The changes in vectorization test files reflect this.
There are other places where lowering patterns are still side-stepping
from handling 0-d vectors properly, by turning them into scalars, but
this patch only focuses on the vector.transfer_x patterns.
This is a reasonable canonicalization because `extract` is more
constrained than `extract_strided_slice`, so there is no loss of
semantics here, just lifting an op to a special-case higher/constrained
op. And the additional `shape_cast` is merely adding leading unit dims
to match the original result type.
Context: discussion on #111541. I wasn't sure how this would turn out,
but in the process of writing this PR, I discovered at least 2 bugs in
the pattern introduced in #111541, which shows the value of shared
canonicalization patterns that are exercised on a large number of
test cases.
---------
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>