Generally, the cost of a memory op will scale with the number of vector registers accessed. Machines might exist which have a narrow memory access than vector register width, but machines with a wider memory access width than vector register width seem unlikely.
I noticed this because we were preferring wide loads + deinterleaves on examples where the cost of a short gather (actually a strided load) would be better. Touching 8 vector registers instead of doing a 4 element gather is not a good tradeoff.
Differential Revision: https://reviews.llvm.org/D147470
If the legalized type is a legal interleaved access type (i.e. there's a
supported vlseg/vsseg instruction for it), the interleaved access pass
will pick any interleaved memory op (wide load + shuffles) and lower it
into a vlseg/vsseg intrinsic.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D146522
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D145155
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D145155