llvm-project

Author	SHA1	Message	Date
Stanislav Mekhanoshin	c80d8a8cea	[AMDGPU] MachineLICM cannot hoist VALU MachineLoop::isLoopInvariant() returns false for all VALU because of the exec use. Check TII::isIgnorableUse() to allow hoisting. That unfortunately results in higher register consumption since MachineLICM does not adequately estimate pressure. Therefore I think it shall only be enabled after D107677 even though it does not depend on it. Differential Revision: https://reviews.llvm.org/D107859	2021-10-20 11:47:24 -07:00
Stanislav Mekhanoshin	6185835656	[AMDGPU] Allow rematerialization of SOP with virtual registers D106408 was doing this for all targets although it was reverted due to a couple of performance regressions on some targets. The difference for AMDGPU is the ability to rematerialize SOP instructions with virtual register uses like we already do for VOP. Differential Revision: https://reviews.llvm.org/D110743	2021-10-20 11:46:50 -07:00
Sanjay Patel	c1ca9e3077	[AMDGPU] add test for usubsat; NFC	2021-10-19 13:05:23 -04:00
Arthur Eubanks	15fefcb9eb	[opt] Directly translate -O# to -passes='default<O#>' Right now when we see -O# we add the corresponding 'default<O#>' into the list of passes to run when translating legacy -pass-name. This has the side effect of not using the default AA pipeline. Instead, treat -O# as -passes='default<O#>', but don't allow any other -passes or -pass-name. I think we can keep `opt -O#` as shorthand for `opt -passes='default<O#>` but disallow anything more than just -O#. Tests need to be updated to not use `opt -O# -pass-name`. Reviewed By: asbirlea Differential Revision: https://reviews.llvm.org/D112036	2021-10-18 16:48:10 -07:00
Anshil Gandhi	0567f03331	[HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols By default clang emits complete constructors as alias of base constructors if they are the same. The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols. @yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true` and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had to be extended to support aliases to functions. inline-calls.ll was corrected appropriately. Reviewed By: yaxunl, #amdgpu Differential Revision: https://reviews.llvm.org/D109707	2021-10-18 16:53:15 -06:00
Jon Roelofs	1300677f97	[AArch64][GlobalISel] combine and + [la]sr => ubfx https://godbolt.org/z/h8ejrG4hb rdar://83597585 Differential Revision: https://reviews.llvm.org/D111839	2021-10-18 10:33:01 -07:00
Piotr Sobczak	d869921004	[AMDGPU] Add patterns for i8/i16 local atomic load/store Add patterns for i8/i16 local atomic load/store. Added tests for new patterns. Copied atomic_[store/load]_local.ll to GlobalISel directory. Differential Revision: https://reviews.llvm.org/D111869	2021-10-18 11:23:10 +02:00
Stanislav Mekhanoshin	7cdb1df8c7	[AMDGPU] Divergence driven selection for fused bitlogic The change adds divergence predicates for fused logical operations. The problem with selecting a scalar fused op such as S_NOR_B32 is that it does not have a VALU counterpart and will be split in moveToVALU. At the same time it prevents selection of a better opcode on the VALU side (such as V_OR3_B32) which does not have a counterpart on SALU side. XNOR opcodes are left as is and selected as scalar to get advantage of the SIInstrInfo::lowerScalarXnor() code which can commute operations to keep one of two opcodes on SALU if possible. See xnor.ll test for this. Differential Revision: https://reviews.llvm.org/D111907	2021-10-18 01:44:25 -07:00
Anshil Gandhi	1830ec94ac	Revert "[HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols" This reverts commit 03375a3fb33b11e1249d9c88070b7f33cb97802a.	2021-10-15 16:16:18 -06:00
Stanislav Mekhanoshin	cd538a6b14	[AMDGPU] Precommit fused-bitlogic.ll test. NFC.	2021-10-15 13:56:24 -07:00
Anshil Gandhi	03375a3fb3	[HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols By default clang emits complete constructors as alias of base constructors if they are the same. The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols. @yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true` and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had to be extended to support aliases to functions. inline-calls.ll was corrected appropriately. Reviewed By: yaxunl, #amdgpu Differential Revision: https://reviews.llvm.org/D109707	2021-10-15 11:39:15 -06:00
Michael Liao	bacddf47a8	[amdgpu] Fix a crash case when preserving MDT in SILowerControlFlow - When a redundant MBB is being erased from MDT, check whether its single successor is dominated by it. If yes, update that successor's idom before erasing MBB; otherwise, it implies MBB is a leaf node and could be erased directly. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D111831	2021-10-15 13:21:53 -04:00
Abinav Puthan Purayil	0379263f23	[AMDGPU] Fix width check for signed mul24 generation. This change fixes a case in which the highest set bit of the original result is at bit 31 and sign-extending the mul24 for it would make the result negative. Differential Revision: https://reviews.llvm.org/D111823	2021-10-15 18:53:41 +05:30
Julien Pages	e4e48e2f02	[AMDGPU] Add more tests for build_vector Differential Revision: https://reviews.llvm.org/D111652	2021-10-14 11:54:17 -04:00
Abinav Puthan Purayil	b3c9d84e5a	[AMDGPU] Fix 24-bit mul intrinsic generation for > 32-bit result. The 24-bit mul intrinsics yield the low-order 32 bits. We should only do the transformation if the operands are known to be not wider than 24 bits and the result is known to be not wider than 32 bits. Differential Revision: https://reviews.llvm.org/D111523	2021-10-14 09:00:19 +05:30
Jay Foad	c885857e9d	[AMDGPU] Enable load clustering in the post-RA scheduler This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions. Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%. Differential Revision: https://reviews.llvm.org/D111646	2021-10-13 17:12:26 +01:00
Amara Emerson	5abce56edb	[GlobalISel] Add support for constant vector folding of binops in CSEMIRBuilder. Differential Revision: https://reviews.llvm.org/D111524	2021-10-12 11:31:22 -07:00
Stanislav Mekhanoshin	9cf995be6b	[AMDGPU] Promote generic pointer kernel arguments into global The new pass walks kernel's pointer arguments, then loads from them. If a loaded value is a pointer and loaded pointer is unmodified in the kernel before the load, then promote loaded pointer to global. Then recursively continue. Differential Revision: https://reviews.llvm.org/D111464	2021-10-12 10:07:33 -07:00
Jay Foad	66ce1015af	Revert "[AMDGPU] Enable load clustering in the post-RA scheduler" This reverts commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec. It was committed by accident.	2021-10-12 16:19:35 +01:00
Jay Foad	66e13c7f43	[AMDGPU] Enable load clustering in the post-RA scheduler This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions. Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%.	2021-10-12 16:09:04 +01:00
hsmahesha	52cb3af08c	[AMDGPU] Remove dead frame indices after sgpr spill. All those frame indices which are dead after sgpr spill should be removed from the function frame. Otherwise, there is a side effect such as re-mapping of free frame index ids by the later pass(es) like "stack slot coloring" which in turn could mess up the bookkeeping of "frame index to VGPR lane". Reviewed By: cdevadas Differential Revision: https://reviews.llvm.org/D111150	2021-10-12 09:58:49 +05:30
Amara Emerson	da904719e9	[GlobalISel] Regenerate some MIR tests with CHECK-NEXT for another patch.	2021-10-11 14:40:34 -07:00
Jay Foad	2e1ad93201	[AMDGPU] Fix copying a machine operand Without this I get: * Bad machine code: Instruction has operand with wrong parent set * - function: available_externally_test - basic block: %bb.0 (0x7dad598) - instruction: %0:r600_treg32_x = MOV 1, 0, 0, 0, $alu_literal_x, 0, 0, 0, -1, 1, $pred_sel_off, @available_externally, 0 Differential Revision: https://reviews.llvm.org/D111549	2021-10-11 20:22:47 +01:00
Qiu Chaofan	573531fb1f	Fix typo of colon to semicolon in lit tests	2021-10-09 10:03:50 +08:00
David Stuttard	69f7d81d0a	[AMDGPU] Set number vgprs used in PS shaders based on input registers actually used For PS shaders we can use the input SPI_PS_INPUT_ENA and SPI_PS_INPUT_ADDR registers Calculate the number of VGPR registers used as input VGPRs based on these registers rather than the arguments passed in (this conservatively always allocates the maximum). Differential Revision: https://reviews.llvm.org/D101633 Change-Id: Idf7c060cbbd5f7e3300102c55ecee3c07f209de6	2021-10-08 14:24:35 +01:00
Mirko Brkusanin	d20840c937	[GlobalISel] Combine for eliminating redundant operand negations Differential Revision: https://reviews.llvm.org/D111319	2021-10-08 14:29:22 +02:00
Amara Emerson	08b3c0d995	[GlobalISel] Combine G_UMULH x, (1 << c)) -> x >> (bitwidth - c) In order to not generate an unnecessary G_CTLZ, I extended the constant folder in the CSEMIRBuilder to handle G_CTLZ. I also added some extra handling of vector constants too. It seems we don't have any support for doing constant folding of vector constants, so the tests show some other useless G_SUB instructions too. Differential Revision: https://reviews.llvm.org/D111036	2021-10-07 23:51:37 -07:00
Jay Foad	e996cf7dce	[AMDGPU] Preserve MachineDominatorTree in SILowerControlFlow Updating the MachineDominatorTree is easy since SILowerControlFlow only splits and removes basic blocks. This should save a bit of compile time because previously we would recompute the dominator tree from scratch after this pass. Another reason for doing this is that SILowerControlFlow preserves LiveIntervals which transitively requires MachineDominatorTree. I think that means that SILowerControlFlow is obliged to preserve MachineDominatorTree too as explained here: https://lists.llvm.org/pipermail/llvm-dev/2020-November/146923.html although it does not seem to have caused any problems in practice yet. Differential Revision: https://reviews.llvm.org/D111313	2021-10-07 21:30:26 +01:00
Amara Emerson	8bfc0e06dc	[GlobalISel] Port the udiv -> mul by constant combine. This is a straight port from the equivalent DAG combine. Differential Revision: https://reviews.llvm.org/D110890	2021-10-07 11:37:17 -07:00
Carl Ritson	b5d6ad20e1	[MachineCopyPropagation] Handle propagation of undef copies When propagating undefined copies the undef flag must also be propagated. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D111219	2021-10-07 20:34:27 +09:00
Philip Reames	d652724c0b	[test] refresh a couple of autogen tests	2021-10-05 18:41:24 -07:00
Carl Ritson	adf7043a9f	[AMDGPU] Only remove branches in SIInstrInfo::removeBranch Without this change _term instructions can be removed during critical edge splitting. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D111126	2021-10-06 10:34:26 +09:00
kpyzhov	095c48fdf3	[AMDGPU] Use "hostcall" module flag instead of searching for ockl_hostcall_internal() declaration. The current way to detect hostcalls by looking for "ockl_hostcall_internal()" function in the module seems to be not reliable enough. The LTO may rename the "ockl_hostcall_internal()" function when an application is compiled with "-fgpu-rdc", causing the MetadataStreamer pass to fail to detect hostcalls, so it does not set the "hidden_hostcall_buffer" kernel argument. This change adds a new module flag: hostcall that can be used to detect whether GPU functions use host calls for printf. Differential revision: https://reviews.llvm.org/D110337	2021-10-05 09:56:04 -04:00
Mirko Brkusanin	40e00063bc	[GlobalISel] Combine fabs(fneg(x)) to fabs(x) Differential Revision: https://reviews.llvm.org/D110943	2021-10-05 13:43:39 +02:00
Jay Foad	9ce4f37206	[AMDGPU][GlobalISel] Fix legalization of G_UMULH Scalarize before narrowing because the narrowing implementation does not work on vectors. This matches what we do for regular G_MUL. Differential Revision: https://reviews.llvm.org/D111129	2021-10-05 10:56:02 +01:00
Carl Ritson	e86d45ec00	[AMDGPU] Pre-commit test for D111126 (NFC)	2021-10-05 18:13:54 +09:00
Amara Emerson	8bde5e58c0	Delay outgoing register assignments to last. The delayed stack protector feature which is currently used for SDAG (and thus allows for more commonly generating tail calls) depends on being able to extract the tail call into a separate return block. To do this it also has to extract the vreg->physreg copies that set up the call's arguments, since if it doesn't then the call inst ends up using undefined physregs in its new spliced block. SelectionDAG implementations can do this because they delay emitting register copies until after the stack arguments are set up. GISel however just processes and emits the arguments in IR order, so stack arguments always end up last, and thus this breaks the code that looks for any register arg copies that precede the call instruction. This patch adds a thunk argument to the assignValueToReg() and custom assignment hooks. For outgoing arguments, register assignments use this return param to return a thunk that does the actual generating of the copies. We collect these until all the outgoing stack assignments have been done and then execute them, so that the copies (and perhaps some artifacts like G_SEXTs) are placed after any stores. Differential Revision: https://reviews.llvm.org/D110610	2021-10-04 12:33:20 -07:00
Jay Foad	24688f8fdf	Revert "[GlobalISel] Support vectors in LegalizerHelper::narrowScalarMul" This reverts commit 90da0b9a5a5322f5a48574274421357d7b22f2cb. It was causing an LLVM_ENABLE_EXPENSIVE_CHECKS buildbot failure.	2021-10-04 20:26:30 +01:00
Amara Emerson	dafcbfdaa0	[GlobalISel] Widen G_EXTRACT_VECTOR_ELT using anyext instead of sext. G_SEXT seems to be unnecessary here, anyext will do. Differential Revision: https://reviews.llvm.org/D110469	2021-10-04 12:19:19 -07:00
Jay Foad	90da0b9a5a	[GlobalISel] Support vectors in LegalizerHelper::narrowScalarMul Also remove some redundancy because the source and result types of any multiply are always the same. Differential Revision: https://reviews.llvm.org/D110926	2021-10-04 19:33:38 +01:00
Jay Foad	dff3454bda	[TwoAddressInstruction] Tweak constraining of tied operands In collectTiedOperands, when handling an undef use that is tied to a def, constrain the dst reg with the actual register class of the src reg, instead of with the register class from the instruction's MCInstrDesc. This makes a difference in some AMDGPU test cases like this, before: %16:sgpr_96 = INSERT_SUBREG undef %15:sgpr_96_with_sub0_sub1(tied-def 0), killed %11:sreg_64_xexec, %subreg.sub0_sub1 After, without this patch: undef %16.sub0_sub1:sgpr_96 = COPY killed %11:sreg_64_xexec This fails machine verification if you force it to run after TwoAddressInstruction (currently it is disabled) with: * Bad machine code: Invalid register class for subregister index * - function: s_load_constant_v3i32_align4 - basic block: %bb.0 (0xa011a88) - instruction: undef %16.sub0_sub1:sgpr_96 = COPY killed %11:sreg_64_xexec - operand 0: undef %16.sub0_sub1:sgpr_96 Register class SGPR_96 does not fully support subreg index 4 After, with this patch: undef %16.sub0_sub1:sgpr_96_with_sub0_sub1 = COPY killed %11:sreg_64_xexec See also svn r159120 which introduced the code to handle tied undef uses. Differential Revision: https://reviews.llvm.org/D110944	2021-10-01 20:57:58 +01:00
Jay Foad	61ecfc6f9d	[TwoAddressInstruction] Pre-commit a test case for D110944	2021-10-01 20:57:57 +01:00
Jay Foad	156d7d2df7	[LiveIntervals] Remove unused subreg ranges in repairIntervalsInRange If the old instructions mentioned a subreg that the new instructions do not, remove the subrange for that subreg. For example, in TwoAddressInstructionPass::eliminateRegSequence, if a use operand in the REG_SEQUENCE has the undef flag then we don't generate a copy for it so after the elimination there should be no live interval at all for the corresponding subreg of the def. This is a small step towards switching TwoAddressInstructionPass over from LiveVariables to LiveIntervals. Currently this path is only tested if you explicitly enable -early-live-intervals. Differential Revision: https://reviews.llvm.org/D110542	2021-09-30 09:15:10 +01:00
Ruiling Song	52785989e9	AMDGPU: Broadcast scalar boolean to vector boolean explicitly This is used to fix wrong code generation of s_add_co_select_user in test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll s_addc_u32 s4, s6, 0 s_cselect_b64 vcc, 1, 0 <-- vcc set as 0x1 if SCC==1 v_mov_b32_e32 v1, s4 s_cmp_gt_u32 s6, 31 v_cndmask_b32_e32 v1, 0, v1, vcc If the s_addc_u32 set SCC, then we will get value 0x1 in VCC. The v_cndmask will do per thread selection with VCC as condition register. As VCC only gets the first bit being set, only the first thread/lane in destination register can get correct result if the very first lane is active. In fact, we should broadcast the value to all active lanes of the final register. The idea here is doing this broadcast to vector boolean explicitly instead of lowering it into a COPY from SCC which would be interpreted as selecting between 0/1. This is used to replace D109754. Reviewed-by: foad, alex-t Differential Revision: https://reviews.llvm.org/D109889	2021-09-30 10:15:01 +08:00
Praveen Velliengiri	e90b512c4d	[AMDGPU] Change ASAN init/fini kernels linkage to external. HSA runtime fails to find the symbols for Init and Fini kernels as they are marked with internal linkage; change the linkage to external to fix those errors. Differential Revision: https://reviews.llvm.org/D110054	2021-09-27 11:50:37 -06:00
Sebastian Neubauer	bf980930e5	[AMDGPU] Ignore KILLs when forming clauses KILL instructions are sometimes present and prevented hard clauses from being formed. Fix this by ignoring all meta instructions in clauses. Differential Revision: https://reviews.llvm.org/D106042	2021-09-27 16:33:52 +02:00
Amara Emerson	acd13994d1	[GlobalISel] Re-generate some call lowering tests with the new CHECK-NEXT behaviour.	2021-09-26 17:25:38 -07:00
Amara Emerson	f4cfda03d6	[AArch64][AMDGPU] Re-generate some tests with CHECK-NEXT to prepare for a patch.	2021-09-24 18:26:08 -07:00
Stanislav Mekhanoshin	cf74ef134c	[AMDGPU] Limit promote alloca max size in functions Non-entry functions have 32 caller saved VGPRs available. If we promote alloca to consume more registers we will have to spill CSRs. There is no reason to eliminate scratch access to get another scratch access instead. Differential Revision: https://reviews.llvm.org/D110372	2021-09-24 13:38:39 -07:00
Stanislav Mekhanoshin	08d7eec06e	Revert "Allow rematerialization of virtual reg uses" Reverted due to two distinct performance regression reports. This reverts commit 92c1fd19abb15bc68b1127a26137a69e033cdb39.	2021-09-24 10:26:11 -07:00
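
Note on commit 08b3c0d995 (Combine G_UMULH x, (1 << c)) -> x >> (bitwidth - c)): the identity behind that combine can be checked with a small model. The following is only an illustrative Python sketch, not LLVM code; the function names `umulh` and `umulh_pow2_as_shift` are hypothetical, and the restriction 0 < c < bits is an assumption of this sketch rather than something stated in the commit message.

```python
def umulh(x: int, y: int, bits: int = 32) -> int:
    """High half of the unsigned (2*bits)-wide product of x and y."""
    mask = (1 << bits) - 1
    return ((x & mask) * (y & mask)) >> bits


def umulh_pow2_as_shift(x: int, c: int, bits: int = 32) -> int:
    """Shift form of umulh(x, 1 << c); this sketch assumes 0 < c < bits."""
    if not 0 < c < bits:
        raise ValueError("c must satisfy 0 < c < bits")
    mask = (1 << bits) - 1
    # (x * 2**c) >> bits equals x >> (bits - c) for unsigned x.
    return (x & mask) >> (bits - c)
```

Multiplying by 2**c shifts x left by c bits, so the high half of the double-width product is exactly the top c bits of x, which is what the right shift by (bits - c) extracts.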

4917 Commits