llvm-project

Author	SHA1	Message	Date
Nikita Popov	5ddce70ef0	[AArch64] Convert some tests to opaque pointers (NFC)	2022-12-19 12:36:19 +01:00
David Green	1206f72e31	[AArch64] Fold Mul(And(Srl(X, 15), 0x10001), 0xffff) to CMLTz This folds a v4i32 Mul(And(Srl(X, 15), 0x10001), 0xffff) into a v8i16 CMLTz instruction. The Srl and And extract the top bit (whether the input is negative) and the Mul sets all values in the i16 half to all 1/0 depending on if that top bit was set. This is equivalent to a v8i16 CMLTz instruction. The same applies to other sizes with equivalent constants. Differential Revision: https://reviews.llvm.org/D130874	2022-08-02 13:01:59 +01:00
David Green	03c65c0d32	[AArch64] Convert vector add(ext, ext) into ext(add(ext, ext)) Given a vector add or sub from extends that needs more than one 'step' (i.e. i8 to i32 or i16 to i64), we can transform the sequence to sext(add(ext, ext)), to allow the add(ext, ext) to become a single uaddl and a larger extend, producing fewer instructions in total. https://alive2.llvm.org/ce/z/S2T4k- Differential Revision: https://reviews.llvm.org/D128426	2022-06-24 10:04:28 +01:00
David Green	4ea1b43527	[AArch64] Generate ADDP from shuffled add This adds a fold of add(x, shuffle(x, <1,0,3,2,5,4,...>)) into shuffle(addp(x), <0,0,1,1,2,2,...>). The ADDP instruction takes two vectors and returns one, adding adjacent pairs. So we match x in a custom combine as it is lowered from a v8i32. The original code would be 2 rev64 and 2 add, with the new code being a single addp with a zip1;zip2 shuffle, producing smaller code. Differential Revision: https://reviews.llvm.org/D126686	2022-06-06 11:39:51 +01:00
David Green	115c188807	[DAG][PowerPC] Combine shuffle(bitcast(X), Mask) to bitcast(shuffle(X, Mask')) If the mask is made up of elements that form a mask in the higher type we can convert shuffle(bitcast into the bitcast type, simplifying the instruction sequence. A v4i32 2,3,0,1 for example can be treated as a 1,0 v2i64 shuffle. This helps clean up some of the AArch64 concat load combines, along with helping simplify a number of other tests. The PowerPC combine for v16i8 splat vector loads needed some fixes to keep it working for v16i8 vectors. This improves the handling of v2i64 shuffles to match too, hopefully improving them in general. Differential Revision: https://reviews.llvm.org/D123801	2022-05-06 10:50:31 +01:00
Alexey Bataev	2cca53c815	[DAG]Introduce llvm::processShuffleMasks and use it for shuffles in DAG Type Legalizer. We can process the long shuffles (working across several actual vector registers) in the best way if we take the actual register representation into account. We can build a more correct representation of register shuffles and improve the number of recognised buildvector sequences. Also, the same function can be used to improve the cost model for the shuffles in future patches. Part of D100486 Differential Revision: https://reviews.llvm.org/D115653	2022-04-20 09:37:16 -07:00
Alexey Bataev	5f7ac15912	Revert "[DAG]Introduce llvm::processShuffleMasks and use it for shuffles in DAG Type Legalizer." This reverts commit 2f49163b3365e5dc046b03e422a048dd45aee3f0 to fix a buildbot failure. Reported in https://lab.llvm.org/buildbot#builders/105/builds/24284	2022-04-20 06:35:55 -07:00
Alexey Bataev	2f49163b33	[DAG]Introduce llvm::processShuffleMasks and use it for shuffles in DAG Type Legalizer. We can process the long shuffles (working across several actual vector registers) in the best way if we take the actual register representation into account. We can build a more correct representation of register shuffles and improve the number of recognised buildvector sequences. Also, the same function can be used to improve the cost model for the shuffles in future patches. Part of D100486 Differential Revision: https://reviews.llvm.org/D115653	2022-04-20 05:32:56 -07:00
David Green	73dc996428	[AArch64] Add lane moves to PerfectShuffle tables This teaches the perfect shuffle tables about lane inserts, that can help reduce the cost of many entries. Many of the shuffle masks are one-away from being correct, and a simple lane move can be a lot simpler than trying to use ext/zip/etc. Because they are not exactly like the other masks handled in the perfect shuffle tables, they require special casing to generate them, with a special InsOp Operator. The lane to insert into is encoded as the RHSID, and the move from is grabbed from the original mask. This helps reduce the maximum perfect shuffle entry cost to 3, with many more shuffles being generatable in a single instruction. Differential Revision: https://reviews.llvm.org/D123386	2022-04-19 14:49:50 +01:00
David Green	50af82701c	[AArch64] Cost all perfect shuffles entries as cost 1 A brief introduction to perfect shuffles - AArch64 NEON has a number of shuffle operations - dups, zips, exts, movs etc that can in some way shuffle around the lanes of a vector. Given a shuffle of size 4 with 2 inputs, some shuffle masks can be easily codegen'd to a single instruction. A <0,0,1,1> mask for example is a zip LHS, LHS. This is great, but some masks are not so simple, like a <0,0,1,2>. It turns out we can generate that from zip LHS, <0,2,0,2>, having generated <0,2,0,2> from uzp LHS, LHS, producing the result in 2 instructions. It is not obvious from a given mask how to get there though. So we have a simple program (PerfectShuffle.cpp in the util folder) that can scan through all combinations of 4-element vectors and generate the perfect combination of results needed for each shuffle mask (for some definition of perfect). This is run offline to generate a table that is queried for generating shuffle instructions. (Because the table could get quite big, it is limited to 4 element vectors). In the perfect shuffle tables zip, uzp and trn shuffles were being costed as 2, which is higher than needed and skews the perfect shuffle tables to create inefficient combinations. This sets them to 1 and regenerates the tables. The codegen will usually be better and the costs should be more precise (but it can get less second-order re-use of values from multiple shuffles; these cases should be fixed up in subsequent patches). Differential Revision: https://reviews.llvm.org/D123379	2022-04-19 12:05:05 +01:00
David Green	1ba8f4f67d	[AArch64] Move v4i8 concat load lowering to a combine. The existing code was not updating the uses of loads that it recreated, leading to incorrect chains which could break the ordering between nodes. This moves the code to a combine instead, and makes sure we update the chain references. This does mean it happens earlier - potentially before the concats are simplified. This can lead to inefficiencies in the codegen, which will be fixed in followups.	2022-04-14 15:19:33 +01:00
David Green	fe6057a293	[AArch64] Custom lower concat(v4i8 load, ...) We already have custom lowering for v4i8 load, which loads as a f32, converts to a vector and bitcasts and extends the result to a v4i16. This adds some custom lowering of concat(v4i8 load, ...) to keep the result as an f32 and create a buildvector of the resulting f32 loads. This helps not create all the extends and bitcasts, which are often difficult to fully clean up. Differential Revision: https://reviews.llvm.org/D121400	2022-03-18 11:58:02 +00:00
David Green	0fa4aeb453	[AArch64] Add extra insert-subvector tests. NFC	2022-03-17 15:29:07 +00:00
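
The two-instruction example in the message of 50af82701c can be checked by hand. The following is a minimal Python sketch, not code from the patch: `zip1` and `uzp1` are hypothetical helpers that model 4-lane ZIP1 and UZP1 on plain lists, and it assumes that "zip" and "uzp" in the message refer to those two variants. Lane indices of LHS stand in for the lane values, so each result reads directly as a shuffle mask.

```python
def zip1(a, b):
    """Interleave the low halves of a and b: [a0, b0, a1, b1]."""
    out = []
    for i in range(len(a) // 2):
        out += [a[i], b[i]]
    return out


def uzp1(a, b):
    """Take the even-numbered lanes of a, then of b: [a0, a2, b0, b2]."""
    return a[0::2] + b[0::2]


# Use lane indices as values so that a result is its own shuffle mask.
lhs = [0, 1, 2, 3]

single = zip1(lhs, lhs)        # one instruction
tmp = uzp1(lhs, lhs)           # first of two instructions
double = zip1(lhs, tmp)        # second of two instructions

print(single, tmp, double)
```

Under these assumptions, `single` is the `<0,0,1,1>` mask, `tmp` is `<0,2,0,2>`, and `double` is `<0,0,1,2>`, matching the sequence the message describes.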

13 Commits
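
The equivalence claimed in 1206f72e31 can be illustrated on a single 32-bit lane. This is a hedged Python sketch rather than anything taken from the patch: `mul_and_srl` and `cmltz_i16_pair` are hypothetical names, and the second function only models the "all ones if negative, else zero" result of comparing each i16 half with zero.

```python
MASK32 = 0xFFFFFFFF


def mul_and_srl(x):
    """One i32 lane of Mul(And(Srl(X, 15), 0x10001), 0xffff)."""
    top_bits = ((x & MASK32) >> 15) & 0x10001  # bit 0 = bit 15 of x, bit 16 = bit 31 of x
    return (top_bits * 0xFFFF) & MASK32


def cmltz_i16_pair(x):
    """The same 32 bits viewed as two i16 lanes, each set to all ones if negative."""
    out = 0
    for shift in (0, 16):
        half = (x >> shift) & 0xFFFF
        if half & 0x8000:  # sign bit of the i16 half
            out |= 0xFFFF << shift
    return out


# Compare both forms on a few representative lanes.
samples = [0x00000000, 0x00008000, 0x80000000, 0x80008000, 0x7FFF7FFF, 0x1234ABCD]
for x in samples:
    assert mul_and_srl(x) == cmltz_i16_pair(x)
```

The multiply cannot carry between halves because `0x10001 * 0xffff` is exactly `0xffffffff`, which is why each set top bit expands to a full 16-bit mask.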