- print multiplication by -1 as unary negate; expressions like s0 * -1, d0 * -1
+ d1 will now appear as -s0, -d0 + d1 resp.
- a minor cleanup while on printAffineExprInternal
PiperOrigin-RevId: 230222151
This CL fixes a misunderstanding in how to build DimOp which triggered
execution issues in the CPU path.
The problem is that, given a `memref<?x4x?x8x?xf32>`, the expressions to
construct the dynamic dimensions should be:
`dim %arg, 0 : memref<?x4x?x8x?xf32>`
`dim %arg, 2 : memref<?x4x?x8x?xf32>`
and
`dim %arg, 4 : memref<?x4x?x8x?xf32>`
Before this CL, we wold construct:
`dim %arg, 0 : memref<?x4x?x8x?xf32>`
`dim %arg, 1 : memref<?x4x?x8x?xf32>`
`dim %arg, 2 : memref<?x4x?x8x?xf32>`
and expect the other dimensions to be constants.
This assumption seems consistent at first glance with the syntax of alloc:
```
%tensor = alloc(%M, %N, %O) : memref<?x4x?x8x?xf32>
```
But this was actuallyincorrect.
This CL also makes the relevant functions available to EDSCs and removes
duplication of the incorrect function.
PiperOrigin-RevId: 229622766
This CL adds a short term remedy to an issue that was found during execution
tests.
Lowering of vector transfer ops uses the permutation map to determine which
ForInst have been super-vectorized. During materialization to HW vector sizes
however, some of those dimensions may be fully unrolled and do not appear in
the permutation map.
Such dimensions were then not clipped and may have accessed out of bounds.
This CL conservatively clips all dimensions to ensure no out of bounds access.
The longer term solution is still up for debate but will probably require
either passing more information between Materialization and lowering, or just
merging the 2 passes.
PiperOrigin-RevId: 228980787
This CL is the 5th on the path to simplifying AffineMap composition.
This removes the distinction between normalized single-result AffineMap and
more general composed multi-result map.
One nice byproduct of making the implementation driven by single-result is
that the multi-result extension is a trivial change: the implementation is
still single-result and we just use:
```
unsigned idx = getIndexOf(...);
map.getResult(idx);
```
This CL also fixes an AffineNormalizer implementation issue related to symbols.
Namely it stops performing substitutions on symbols in AffineNormalizer and
instead concatenates them all to be consistent with the call to
`AffineMap::compose(AffineMap)`. This latter call to `compose` cannot perform
simplifications of symbols coming from different maps based on positions only:
i.e. dims are applied and renumbered but symbols must be concatenated.
The only way to determine whether symbols from different AffineApply are the
same is to look at the concrete values. The canonicalizeMapAndOperands is thus
extended with behavior to support replacing operands that appear multiple
times.
Lastly, this CL demonstrates that the implementation is correct by rewriting
ComposeAffineMaps using only `makeComposedAffineApply`. The implementation
uses a matcher because AffineApplyOp are introduced as composed operations on
the fly instead of iteratively forwardSubstituting. For this purpose, a walker
would revisit freshly introduced AffineApplyOp. Regardless, ComposeAffineMaps
is scheduled to disappear, this CL replaces the implementation based on
iterative `forwardSubstitute` by a composed-by-construction
`makeComposedAffineApply`.
Remaining calls to `forwardSubstitute` will be removed in the next CL.
PiperOrigin-RevId: 228830443
Supervectorization does not plan on handling multi-result AffineMaps and
non-canonical chains of > 1 AffineApplyOp.
This CL uses the simpler single-result unbounded AffineApplyOp in the
MaterializeVectors pass.
PiperOrigin-RevId: 228469085
This CL is the 2nd on the path to simplifying AffineMap composition.
This CL uses the now accepted `AffineExpr::compose(AffineMap)` to
implement `AffineMap::compose(AffineMap)`.
Implications of keeping the simplification function in
Analysis are documented where relevant.
PiperOrigin-RevId: 228276646
This CL is the 1st on the path to simplifying AffineMap composition.
This CL uses the now accepted AffineExpr.replaceDimsAndSymbols to
implement `AffineExpr::compose(AffineMap)`.
Arguably, `simplifyAffineExpr` should be part of IR and not Analysis but
this CL does not yet pull the trigger on that.
PiperOrigin-RevId: 228265845
Supervectorization does not plan on handling multi-result AffineMaps and
non-canonical chains of > 1 AffineApplyOp.
This CL introduces a simpler abstraction and composition of single-result
unbounded AffineApplyOp by using the existing unbound AffineMap composition.
This CL adds a simple API call and relevant tests:
```c++
OpPointer<AffineApplyOp> makeNormalizedAffineApply(
FuncBuilder *b, Location loc, AffineMap map, ArrayRef<Value*> operands);
```
which creates a single-result unbounded AffineApplyOp.
The operands of AffineApplyOp are not themselves results of AffineApplyOp by
consrtuction.
This represent the simplest possible interface to complement the composition
of (mathematical) AffineMap, for the cases when we are interested in applying
it to Value*.
In this CL the composed AffineMap is not compressed (i.e. there exist operands
that are not part of the result). A followup commit will compress to normal
form.
The single-result unbounded AffineApplyOp abstraction will be used in a
followup CL to support the MaterializeVectors pass.
PiperOrigin-RevId: 227879021
The entire compiler now looks at structural properties of the function (e.g.
does it have one block, does it contain an if/for stmt, etc) so the only thing
holding up this difference is round tripping through the parser/printer syntax.
Removing this shrinks the compile by ~140LOC.
This is step 31/n towards merging instructions and statements. The last step
is updating the docs, which I will do as a separate patch in order to split it
from this mostly mechanical patch.
PiperOrigin-RevId: 227540453
This CL introduces a simple set of Embedded Domain-Specific Components (EDSCs)
in MLIR components:
1. a `Type` system of shell classes that closely matches the MLIR type system. These
types are subdivided into `Bindable` leaf expressions and non-bindable `Expr`
expressions;
2. an `MLIREmitter` class whose purpose is to:
a. maintain a map of `Bindable` leaf expressions to concrete SSAValue*;
b. provide helper functionality to specify bindings of `Bindable` classes to
SSAValue* while verifying comformable types;
c. traverse the `Expr` and emit the MLIR.
This is used on a concrete example to implement MemRef load/store with clipping in the
LowerVectorTransfer pass. More specifically, the following pseudo-C++ code:
```c++
MLFuncBuilder *b = ...;
Location location = ...;
Bindable zero, one, expr, size;
// EDSL expression
auto access = select(expr < zero, zero, select(expr < size, expr, size - one));
auto ssaValue = MLIREmitter(b)
.bind(zero, ...)
.bind(one, ...)
.bind(expr, ...)
.bind(size, ...)
.emit(location, access);
```
is used to emit all the MLIR for a clipped MemRef access.
This simple EDSL can easily be extended to more powerful patterns and should
serve as the counterpart to pattern matchers (and could potentially be unified
once we get enough experience).
In the future, most of this code should be TableGen'd but for now it has
concrete valuable uses: make MLIR programmable in a declarative fashion.
This CL also adds Stmt, proper supporting free functions and rewrites
VectorTransferLowering fully using EDSCs.
The code for creating the EDSCs emitting a VectorTransferReadOp as loops
with clipped loads is:
```c++
Stmt block = Block({
tmpAlloc = alloc(tmpMemRefType),
vectorView = vector_type_cast(tmpAlloc, vectorMemRefType),
ForNest(ivs, lbs, ubs, steps, {
scalarValue = load(scalarMemRef, accessInfo.clippedScalarAccessExprs),
store(scalarValue, tmpAlloc, accessInfo.tmpAccessExprs),
}),
vectorValue = load(vectorView, zero),
tmpDealloc = dealloc(tmpAlloc.getLHS())});
emitter.emitStmt(block);
```
where `accessInfo.clippedScalarAccessExprs)` is created with:
```c++
select(i + ii < zero, zero, select(i + ii < N, i + ii, N - one));
```
The generated MLIR resembles:
```mlir
%1 = dim %0, 0 : memref<?x?x?x?xf32>
%2 = dim %0, 1 : memref<?x?x?x?xf32>
%3 = dim %0, 2 : memref<?x?x?x?xf32>
%4 = dim %0, 3 : memref<?x?x?x?xf32>
%5 = alloc() : memref<5x4x3xf32>
%6 = vector_type_cast %5 : memref<5x4x3xf32>, memref<1xvector<5x4x3xf32>>
for %i4 = 0 to 3 {
for %i5 = 0 to 4 {
for %i6 = 0 to 5 {
%7 = affine_apply #map0(%i0, %i4)
%8 = cmpi "slt", %7, %c0 : index
%9 = affine_apply #map0(%i0, %i4)
%10 = cmpi "slt", %9, %1 : index
%11 = affine_apply #map0(%i0, %i4)
%12 = affine_apply #map1(%1, %c1)
%13 = select %10, %11, %12 : index
%14 = select %8, %c0, %13 : index
%15 = affine_apply #map0(%i3, %i6)
%16 = cmpi "slt", %15, %c0 : index
%17 = affine_apply #map0(%i3, %i6)
%18 = cmpi "slt", %17, %4 : index
%19 = affine_apply #map0(%i3, %i6)
%20 = affine_apply #map1(%4, %c1)
%21 = select %18, %19, %20 : index
%22 = select %16, %c0, %21 : index
%23 = load %0[%14, %i1, %i2, %22] : memref<?x?x?x?xf32>
store %23, %5[%i6, %i5, %i4] : memref<5x4x3xf32>
}
}
}
%24 = load %6[%c0] : memref<1xvector<5x4x3xf32>>
dealloc %5 : memref<5x4x3xf32>
```
In particular notice that only 3 out of the 4-d accesses are clipped: this
corresponds indeed to the number of dimensions in the super-vector.
This CL also addresses the cleanups resulting from the review of the prevous
CL and performs some refactoring to simplify the abstraction.
PiperOrigin-RevId: 227367414
printing the entry block in a CFG function's argument line. Since I'm touching
all of the testcases anyway, change the argument list from printing as
"%arg : type" to "%arg: type" which is more consistent with bb arguments.
In addition to being more consistent, this is a much nicer look for cfg functions.
PiperOrigin-RevId: 227240069
Supervectorization uses null pointers to SSA values as a means of communicating
the failure to vectorize. In operation vectorization, all operations producing
the values of operation arguments must be vectorized for the given operation to
be vectorized. The existing check verified if any of the value "def"
statements was vectorized instead, sometimes leading to assertions inside `isa`
called on a null pointer. Fix this to check that all "def" statements were
vectorized.
PiperOrigin-RevId: 226941552
This operation is produced and used by the super-vectorization passes and has
been emitted as an abstract unregistered operation until now. For end-to-end
testing purposes, it has to be eventually lowered to LLVM IR. Matching
abstract operation by name goes into the opposite direction of the generic
lowering approach that is expected to be used for LLVM IR lowering in the
future. Register vector_type_cast operation as a part of the SuperVector
dialect.
Arguably, this operation is a special case of the `view` operation from the
Standard dialect. The semantics of `view` is not fully specified at this point
so it is safer to rely on a custom operation. Additionally, using a custom
operation may help to achieve clear dialect separation.
PiperOrigin-RevId: 225887305
This CL adds a pass that lowers VectorTransferReadOp and VectorTransferWriteOp
to a simple loop nest via local buffer allocations.
This is an MLIR->MLIR lowering based on builders.
A few TODOs are left to address in particular:
1. invert the permutation map so the accesses to the remote memref are coalesced;
2. pad the alloc for bank conflicts in local memory (e.g. GPUs shared_memory);
3. support broadcast / avoid copies when permutation_map is not of full column rank
4. add a proper "element_cast" op
One notable limitation is this does not plan on supporting boundary conditions.
It should be significantly easier to use pre-baked MLIR functions to handle such paddings.
This is left for future consideration.
Therefore the current CL only works properly for full-tile cases atm.
This CL also adds 2 simple tests:
```mlir
for %i0 = 0 to %M step 3 {
for %i1 = 0 to %N step 4 {
for %i2 = 0 to %O {
for %i3 = 0 to %P step 5 {
vector_transfer_write %f1, %A, %i0, %i1, %i2, %i3 {permutation_map: (d0, d1, d2, d3) -> (d3, d1, d0)} : vector<5x4x3xf32>, memref<?x?x?x?xf32, 0>, index, index, index, index
```
lowers into:
```mlir
for %i0 = 0 to %arg0 step 3 {
for %i1 = 0 to %arg1 step 4 {
for %i2 = 0 to %arg2 {
for %i3 = 0 to %arg3 step 5 {
%1 = alloc() : memref<5x4x3xf32>
%2 = "element_type_cast"(%1) : (memref<5x4x3xf32>) -> memref<1xvector<5x4x3xf32>>
store %cst, %2[%c0] : memref<1xvector<5x4x3xf32>>
for %i4 = 0 to 5 {
%3 = affine_apply (d0, d1) -> (d0 + d1) (%i3, %i4)
for %i5 = 0 to 4 {
%4 = affine_apply (d0, d1) -> (d0 + d1) (%i1, %i5)
for %i6 = 0 to 3 {
%5 = affine_apply (d0, d1) -> (d0 + d1) (%i0, %i6)
%6 = load %1[%i4, %i5, %i6] : memref<5x4x3xf32>
store %6, %0[%5, %4, %i2, %3] : memref<?x?x?x?xf32>
dealloc %1 : memref<5x4x3xf32>
```
and
```mlir
for %i0 = 0 to %M step 3 {
for %i1 = 0 to %N {
for %i2 = 0 to %O {
for %i3 = 0 to %P step 5 {
%f = vector_transfer_read %A, %i0, %i1, %i2, %i3 {permutation_map: (d0, d1, d2, d3) -> (d3, 0, d0)} : (memref<?x?x?x?xf32, 0>, index, index, index, index) -> vector<5x4x3xf32>
```
lowers into:
```mlir
for %i0 = 0 to %arg0 step 3 {
for %i1 = 0 to %arg1 {
for %i2 = 0 to %arg2 {
for %i3 = 0 to %arg3 step 5 {
%1 = alloc() : memref<5x4x3xf32>
%2 = "element_type_cast"(%1) : (memref<5x4x3xf32>) -> memref<1xvector<5x4x3xf32>>
for %i4 = 0 to 5 {
%3 = affine_apply (d0, d1) -> (d0 + d1) (%i3, %i4)
for %i5 = 0 to 4 {
for %i6 = 0 to 3 {
%4 = affine_apply (d0, d1) -> (d0 + d1) (%i0, %i6)
%5 = load %0[%4, %i1, %i2, %3] : memref<?x?x?x?xf32>
store %5, %1[%i4, %i5, %i6] : memref<5x4x3xf32>
%6 = load %2[%c0] : memref<1xvector<5x4x3xf32>>
dealloc %1 : memref<5x4x3xf32>
```
PiperOrigin-RevId: 224552717
This CL adds the following free functions:
```
/// Returns the AffineExpr e o m.
AffineExpr compose(AffineExpr e, AffineMap m);
/// Returns the AffineExpr f o g.
AffineMap compose(AffineMap f, AffineMap g);
```
This addresses the issue that AffineMap composition is only available at a
distance via AffineValueMap and is thus unusable on Attributes.
This CL thus implements AffineMap composition in a more modular and composable
way.
This CL does not claim that it can be a good replacement for the
implementation in AffineValueMap, in particular it does not support bounded
maps atm.
Standalone tests are added that replicate some of the logic of the AffineMap
composition pass.
Lastly, affine map composition is used properly inside MaterializeVectors and
a standalone test is added that requires permutation_map composition with a
projection map.
PiperOrigin-RevId: 224376870
This CL hooks up and uses permutation_map in vector_transfer ops.
In particular, when going into the nuts and bolts of the implementation, it
became clear that cases arose that required supporting broadcast semantics.
Broadcast semantics are thus added to the general permutation_map.
The verify methods and tests are updated accordingly.
Examples of interest include.
Example 1:
The following MLIR snippet:
```mlir
for %i3 = 0 to %M {
for %i4 = 0 to %N {
for %i5 = 0 to %P {
%a5 = load %A[%i4, %i5, %i3] : memref<?x?x?xf32>
}}}
```
may vectorize with {permutation_map: (d0, d1, d2) -> (d2, d1)} into:
```mlir
for %i3 = 0 to %0 step 32 {
for %i4 = 0 to %1 {
for %i5 = 0 to %2 step 256 {
%4 = vector_transfer_read %arg0, %i4, %i5, %i3
{permutation_map: (d0, d1, d2) -> (d2, d1)} :
(memref<?x?x?xf32>, index, index) -> vector<32x256xf32>
}}}
````
Meaning that vector_transfer_read will be responsible for reading the 2-D slice:
`%arg0[%i4, %i5:%15+256, %i3:%i3+32]` into vector<32x256xf32>. This will
require a transposition when vector_transfer_read is further lowered.
Example 2:
The following MLIR snippet:
```mlir
%cst0 = constant 0 : index
for %i0 = 0 to %M {
%a0 = load %A[%cst0, %cst0] : memref<?x?xf32>
}
```
may vectorize with {permutation_map: (d0) -> (0)} into:
```mlir
for %i0 = 0 to %0 step 128 {
%3 = vector_transfer_read %arg0, %c0_0, %c0_0
{permutation_map: (d0, d1) -> (0)} :
(memref<?x?xf32>, index, index) -> vector<128xf32>
}
````
Meaning that vector_transfer_read will be responsible of reading the 0-D slice
`%arg0[%c0, %c0]` into vector<128xf32>. This will require a 1-D vector
broadcast when vector_transfer_read is further lowered.
Additionally, some minor cleanups and refactorings are performed.
One notable thing missing here is the composition with a projection map during
materialization. This is because I could not find an AffineMap composition
that operates on AffineMap directly: everything related to composition seems
to require going through SSAValue and only operates on AffinMap at a distance
via AffineValueMap. I have raised this concern a bunch of times already, the
followup CL will actually do something about it.
In the meantime, the projection is hacked at a minimum to pass verification
and materialiation tests are temporarily incorrect.
PiperOrigin-RevId: 224376828