In the transform dialect, a transform IR handle may be pointing to a payload IR
operation that is an ancestor of another payload IR operation pointed to by
another handle. If such a "parent" handle is consumed by a transformation, this
indicates that the associated operation is likely rewritten, which in turn
means that the "child" handle may now be associated with a dangling pointer or
a pointer to a different operation than originally. Add a handle invalidation
mechanism to guard against such situations by reporting errors at runtime.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D127480
There are various shortcuts in `BufferizationState::getBuffer` that avoid a buffer copy when we just need an allocation (and no initialization). This change adds those shortcuts to the TensorCopyInsertion pass, so that `getBuffer` can be simplified in a subsequent change.
Differential Revision: https://reviews.llvm.org/D126821
The constructor already supports passing an ostream as argument,
so let's make the create function support it too.
Differential Revision: https://reviews.llvm.org/D127449
It was a StructAttr. Also adds a FieldParser for AffineMap.
Depends on D127348
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D127350
The TensorCopyInsertion pass resolves out-of-place bufferization decisions by inserting explicit `bufferization.alloc_tensor` ops. This change moves that functionality into a new BufferizableOpInterface method, so that it can be overridden by op implementations. Some op bufferizations must insert additional `alloc_tensor` ops to make sure that certain aliasing invariants are not violated (e.g., scf::ForOp). This will be addressed in a subsequent change.
Differential Revision: https://reviews.llvm.org/D126817
Fixed issue with vector.contract default unroll permutation.
Adds support for vector unroll transformations to unroll in different
orders. For example, the vector.contract can be unrolled into a
smaller set of contractions. There is a choice of how to unroll the
decomposition based on the traversal order of (dim0, dim1, dim2).
The choice of traversal order can now be specified by a callback which
given by the caller of the transform. For now, only the
vector.contract, vector.transfer_read/transfer_write operations
support the callback.
Differential Revision: https://reviews.llvm.org/D127004
This pass runs the One-Shot Analysis to find out which tensor OpOperands must bufferize out-of-place. It then rewrites those tensor OpOperands to explicit allocations with a copy in the form of `bufferization.alloc_tensor`. The resulting IR can then be bufferized without having to care about read-after-write conflicts.
This change makes it possible to connect One-Shot Analysis to other bufferizations such as the sparse compiler.
Differential Revision: https://reviews.llvm.org/D126573
If `copy` is specified, the newly allocated buffer is initialized with the given contents. Also add an optional `escape` attribute to indicate whether the buffer of the tensor may be returned from the parent block (aka. "escape") after bufferization.
This change is in preparation of connecting One-Shot Bufferize to the sparse compiler.
Differential Revision: https://reviews.llvm.org/D126570
This simplifies the bufferization itself and is in preparation of connecting with the sparse compiler.
Differential Revision: https://reviews.llvm.org/D126814
Users should explicitly run `-buffer-results-to-out-params` instead.
The purpose of this change is to remove `finalizeBuffers`, which made it difficult to extend the bufferization to custom buffer types.
Differential Revision: https://reviews.llvm.org/D126253
The buffer deallocation pass must now be run explicitly when `allow-return-alloc` is set.
This results in a few extra buffer copies in unoptimized test cases. The proper way to avoid such copies is to relax the OpOperand/OpResult aliasing contract on ops such as scf.for. Some of these copies can also be avoided by improving the buffer deallocation pass.
Differential Revision: https://reviews.llvm.org/D126252
The operation `shape.concat` was used for type shape only.
We now enable it for extent tensors.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D127321
This commit allows for One-Shot Bufferize to be used through the transform dialect. No op handle is currently returned for the bufferized IR.
Differential Revision: https://reviews.llvm.org/D125098
This relies on the existing TileAndFuse pattern for tensor-based structured
ops. It complements pure tiling, from which some utilities are generalized.
Depends On D127300
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D127319
Introduce transform ops for "for" loops, in particular for peeling, software
pipelining and unrolling, along with a couple of "IR navigation" ops. These ops
are intended to be generalized to different kinds of loops when possible and
therefore use the "loop" prefix. They currently live in the SCF dialect as
there is no clear place to put transform ops that may span across several
dialects, this decision is postponed until the ops actually need to handle
non-SCF loops.
Additionally refactor some common utilities for transform ops into trait or
interface methods, and change the loop pipelining to be a returning pattern.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D127300
When constraints in the two operands make each other redundant, prefer constraints of the second because this affects the number of sets in the output at each level; reducing these can help prevent exponential blowup.
This is accomplished by adding extra overloads to Simplex::detectRedundant that only scan a subrange of the constraints for redundancy.
Reviewed By: Groverkss
Differential Revision: https://reviews.llvm.org/D127237
Reduces repetition in tablegen files for defining various tensor types. In particular the goal is to reduce the repetition when defining new tensor types (e.g., D126994).
Reviewed By: aartbik, rriddle
Differential Revision: https://reviews.llvm.org/D127039
Implement the C-API and Python bindings for the builtin opaque type, which was previously missing.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D127303
When `RegionBranchOpInterface::getSuccessorRegions` is called for anything other than the parent op, it expects the operands of the terminator of the source region to be passed, not the operands of the parent op. This was not always respected.
This fixes a bug in integer range inference and ForwardDataFlowSolver and changes `scf.while` to allow narrowing of successors using constant inputs.
Fixes#55873
Reviewed By: mehdi_amini, krzysz00
Differential Revision: https://reviews.llvm.org/D127261
This is the first PR to add `F16` and `BF16` support to the sparse codegen. There are still problems in supporting these two data types, such as `BF16` is not quite working yet.
Add tests cases.
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D127010
Introduce RoundOp in the math dialect. The operation rounds the operand to the
nearest integer value in floating-point format. RoundOp lowers to LLVM
intrinsics 'llvm.intr.round' or as a function call to libm (round or roundf).
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D127286
Find writability conflicts (writes to buffers that are not allowed to be written to) by checking SSA use-def chains. This is better than the current writability analysis, which is too conservative and finds false positives.
Differential Revision: https://reviews.llvm.org/D127256
The `init` and `tensor` ops are renamed (and one moved to another dialect).
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D127169
Reverts commit 1469ebf8382107e0344173f362b690d19e24029d (original commit)
Reverts commit a392a39f75af586e3d3cd046a8361939277e067f (build fix for above commit)
The commit broke tests in out-of-tree projects, indicating that some logical
error was made in the previous change but not covered by current tests.
A few OpenMP tests were retaining the FIR operands even after running
the LLVM conversion pass. To fix these tests the legality checkes for
OpenMP conversion are made stricter to include operands and results.
The Flush, Single and Sections operations are added to conversions or
legality checks. The RegionLessOpConversion is appropriately renamed
to clarify that it works only for operations with Variable operands.
The operands of the flush operation are changed to match those of
Variable Operands.
Fix for an OpenMP issue mentioned in
https://github.com/llvm/llvm-project/issues/55210.
Reviewed By: shraiysh, peixin, awarzynski
Differential Revision: https://reviews.llvm.org/D127092
Four leading spaces are interpreted as a code block in markdown. Unless
used consistently in ODS op description, they cannot be stripped away by
the tablegen backend, which results in malformed markdown being
generated.
Add complex.conj op to calculate the complex conjugate which is widely used for the mathematical operation on the complex space.
Reviewed By: pifon2a
Differential Revision: https://reviews.llvm.org/D127181
Transpose operations on constant data were getting folded during the
canonicalization process. This has compile time cost proportional to
the constant size. Moving this to a separate pass to enable optionality
and flexibility of how such scenarios can be handled.
Reviewed By: rsuderman, jpienaar, stellaraccident
Differential Revision: https://reviews.llvm.org/D124685
Adds supprot for vector unroll transformations to unroll in different
orders. For example, the `vector.contract` can be unrolled into a
smaller set of contractions. There is a choice of how to unroll the
decomposition based on the traversal order of (dim0, dim1, dim2).
The choice of traversal order can now be specified by a callback which
given by the caller of the transform. For now, only the
`vector.contract`, `vector.transfer_read/transfer_write` operations
support the callback.
Differential Revision: https://reviews.llvm.org/D127004
This operation should be supported as a named op because
when the operands are viewed as having canonical layouts
with decreasing strides, then the "reduction" dimensions
of the filter (h, w, and c) are contiguous relative to each
output channel. When lowered to a matrix multiplication,
this layout is the simplest to deal with, and thus future
transforms/vectorizations of `conv2d` may find using this
named op convenient.
Differential Revision: https://reviews.llvm.org/D126995
This is correct for all values, i.e. the same as promoting the division to fp32 in the NVPTX backend. But it is faster (~10% in average, sometimes more) because:
- it performs less Newton iterations
- it avoids the slow path for e.g. denormals
- it allows reuse of the reciprocal for multiple divisions by the same divisor
Test program:
```
#include <stdio.h>
#include "cuda_fp16.h"
// This is a variant of CUDA's own __hdiv which is fast than hdiv_promote below
// and doesn't suffer from the perf cliff of div.rn.fp32 with 'special' values.
__device__ half hdiv_newton(half a, half b) {
float fa = __half2float(a);
float fb = __half2float(b);
float rcp;
asm("{rcp.approx.ftz.f32 %0, %1;\n}" : "=f"(rcp) : "f"(fb));
float result = fa * rcp;
auto exponent = reinterpret_cast<const unsigned&>(result) & 0x7f800000;
if (exponent != 0 && exponent != 0x7f800000) {
float err = __fmaf_rn(-fb, result, fa);
result = __fmaf_rn(rcp, err, result);
}
return __float2half(result);
}
// Surprisingly, this is faster than CUDA's own __hdiv.
__device__ half hdiv_promote(half a, half b) {
return __float2half(__half2float(a) / __half2float(b));
}
// This is an approximation that is accurate up to 1 ulp.
__device__ half hdiv_approx(half a, half b) {
float fa = __half2float(a);
float fb = __half2float(b);
float result;
asm("{div.approx.ftz.f32 %0, %1, %2;\n}" : "=f"(result) : "f"(fa), "f"(fb));
return __float2half(result);
}
__global__ void CheckCorrectness() {
int i = threadIdx.x + blockIdx.x * blockDim.x;
half x = reinterpret_cast<const half&>(i);
for (int j = 0; j < 65536; ++j) {
half y = reinterpret_cast<const half&>(j);
half d1 = hdiv_newton(x, y);
half d2 = hdiv_promote(x, y);
auto s1 = reinterpret_cast<const short&>(d1);
auto s2 = reinterpret_cast<const short&>(d2);
if (s1 != s2) {
printf("%f (%u) / %f (%u), got %f (%hu), expected: %f (%hu)\n",
__half2float(x), i, __half2float(y), j, __half2float(d1), s1,
__half2float(d2), s2);
//__trap();
}
}
}
__device__ half dst;
__global__ void ProfileBuiltin(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = x / x;
}
dst = x;
}
__global__ void ProfilePromote(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_promote(x, x);
}
dst = x;
}
__global__ void ProfileNewton(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_newton(x, x);
}
dst = x;
}
__global__ void ProfileApprox(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_approx(x, x);
}
dst = x;
}
int main() {
CheckCorrectness<<<256, 256>>>();
half one = __float2half(1.0f);
ProfileBuiltin<<<1, 1>>>(one); // 1.001s
ProfilePromote<<<1, 1>>>(one); // 0.560s
ProfileNewton<<<1, 1>>>(one); // 0.508s
ProfileApprox<<<1, 1>>>(one); // 0.304s
auto status = cudaDeviceSynchronize();
printf("%s\n", cudaGetErrorString(status));
}
```
Reviewed By: herhut
Differential Revision: https://reviews.llvm.org/D126158
This patch adds the knobs to use peeling in the codegen strategy
infrastructure.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D126842
This reverts commit 4e5ce2056e3e85f109a074e80bdd23a10ca2bed9.
This relands commit 1350c9887dca5ba80af8e3c1e61b29d6696eb240.
Reinstates the range analysis with the build issue fixed.
Differential Revision: https://reviews.llvm.org/D126926
`scf.foreach_thread` results alias with the underlying `scf.foreach_thread.parallel_insert_slice` destination operands
and they bufferize to equivalent buffers in the absence of other conflicts.
`scf.foreach_thread.parallel_insert_slice` conflict detection is similar to `tensor.insert_slice` conflict detection.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D126769
Add an option to predicate the epilogue within the kernel instead of
peeling the epilogue. This is a useful option to prevent generating
large amount of code for deep pipeline. This currently require a user
lamdba to implement operation predication.
Differential Revision: https://reviews.llvm.org/D126753
This commit enables providing long-form documentation more seamlessly to the LSP
by revamping decl documentation. For ODS imported constructs, we now also import
descriptions and attach them to decls when possible. For PDLL constructs, the LSP will
now try to provide documentation by parsing the comments directly above the decls
location within the source file. This commit also adds a new parser flag
`enableDocumentation` that gates the import and attachment of ODS documentation,
which is unnecessary in the normal build process (i.e. it should only be used/consumed
by tools).
Differential Revision: https://reviews.llvm.org/D124881