This patch adds support for i64 and f64 values in `gpu.shuffle` by rewriting each 64-bit shuffle into two 32-bit shuffles.
The motivation for this change is that both CUDA and HIP support 64-bit shuffles.
The implementation is based on the LLVM IR that clang emits for 64-bit shuffles when compiling with `-O3`.
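
As an illustration only, the idea of lowering a 64-bit shuffle to two 32-bit shuffles can be sketched on the host in C++. This is not the code in this patch: the warp is simulated as an array, and `shuffleXor32`, `shuffleXor64`, `doubleToBits`, and `bitsToDouble` are hypothetical names. The low/high split shown here is one possible way to divide the value, and the exact IR sequence the patch emits may differ.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Assumption: a warp of 32 lanes, simulated as a host-side array.
constexpr int kWarpSize = 32;
using Warp32 = std::array<uint32_t, kWarpSize>;
using Warp64 = std::array<uint64_t, kWarpSize>;

// Hypothetical stand-in for a 32-bit hardware shuffle in xor mode:
// each lane reads the value held by lane (lane ^ mask).
Warp32 shuffleXor32(const Warp32 &in, int mask) {
  Warp32 out{};
  for (int lane = 0; lane < kWarpSize; ++lane)
    out[lane] = in[(lane ^ mask) & (kWarpSize - 1)];
  return out;
}

// A 64-bit shuffle expressed as two 32-bit shuffles:
// split into low and high halves, shuffle each, then recombine.
Warp64 shuffleXor64(const Warp64 &in, int mask) {
  Warp32 lo{}, hi{};
  for (int lane = 0; lane < kWarpSize; ++lane) {
    lo[lane] = static_cast<uint32_t>(in[lane]);
    hi[lane] = static_cast<uint32_t>(in[lane] >> 32);
  }
  const Warp32 loShuffled = shuffleXor32(lo, mask);
  const Warp32 hiShuffled = shuffleXor32(hi, mask);
  Warp64 out{};
  for (int lane = 0; lane < kWarpSize; ++lane)
    out[lane] = (static_cast<uint64_t>(hiShuffled[lane]) << 32) | loShuffled[lane];
  return out;
}

// f64 values take the same path after a bitcast to a 64-bit integer.
uint64_t doubleToBits(double value) {
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

double bitsToDouble(uint64_t bits) {
  double value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}
```

Because both halves are shuffled with the same mask, every lane receives the two halves of the same source value, so the recombined result matches what a native 64-bit shuffle would produce.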
Reviewed By: makslevental
Differential Revision: https://reviews.llvm.org/D148974