This reverts commit db5d845c73ee2d64f1a5bab3fc72edece9e3a7ba.
As per PR discussion "Looks like we've missed lowering of bitcasts
between v2f16 and v2i16 and it breaks XLA."
On sm_90 some instructions now support i16x2 which allows hardware to
execute more efficiently add, min and max instructions.
In order to support that we need to make i16x2 a native type in the
backend. This does the necessary changes to make i16x2 a native type and
adds support for the instructions natively supporting i16x2.
This caused a negative test in nvptx slp to start passing. Changed the
test to a positive one as the IR is correctly vectorized.
CUDA-12 no longer supports 32-bit compilation.
Tests agnostic to 32/64 compilation mode are switched to use nvptx64.
Tests that do care about it have 32-bit ptxas compilation disabled with cuda-12+.
Differential Revision: https://reviews.llvm.org/D152199
ptxas is a proprietary compiler from Nvidia that can compile PTX to
machine code (SASS). It has a lot of diagnostics to catch errors
in PTX, which can be used to verify PTX output from llc.
Set -DPXTAS_EXECUTABLE=/path/to/ptxas CMake option to enable it.
If this option is not set, then ptxas is substituted to true which
effectively disables all ptxas RUN lines.
LLVM_PTXAS_EXECUTABLE environment variable takes precedence over
the CMake option, and allows to override ptxas executable that is used for LIT
without complete re-configuration.
Differential Revision: https://reviews.llvm.org/D121727
Original code only used vector loads/stores for explicit vector arguments.
It could also do more loads/stores than necessary (e.g v5f32 would
touch 8 f32 values). Aggregate types were loaded one element at a time,
even the vectors contained within.
This change attempts to generalize (and simplify) parameter space
loads/stores so that vector loads/stores can be used more broadly.
Functionality of the patch has been verified by compiling thrust
test suite and manually checking the differences between PTX
generated by llvm with and without the patch.
General algorithm:
* ComputePTXValueVTs() flattens input/output argument into a flat list
of scalars to load/store and returns their types and offsets.
* VectorizePTXValueVTs() uses that data to create vectorization plan
which returns an array of flags marking boundaries of vectorized
load/stores. Scalars are represented as 1-element vectors.
* Code that generates loads/stores implements a simple state machine
that constructs a vector according to the plan.
Differential Revision: https://reviews.llvm.org/D30011
llvm-svn: 295784