llvm-project

Author	SHA1	Message	Date
Fangrui Song	9e9907f1cf	[AMDGPU,test] Change llc -march= to -mtriple= (#75982) Similar to 806761a7629df268c8aed49657aeccffa6bca449. For IR files without a target triple, -mtriple= specifies the full target triple while -march= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. amdgpu-apple-darwin. Therefore, -march= is error-prone and not recommended for tests without a target triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outrightly. This patch changes AMDGPU tests to not rely on the default OS/environment components. Tests that need fixes are not changed: ``` LLVM :: CodeGen/AMDGPU/fabs.f64.ll LLVM :: CodeGen/AMDGPU/fabs.ll LLVM :: CodeGen/AMDGPU/floor.ll LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll LLVM :: CodeGen/AMDGPU/fneg-fabs.ll LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll LLVM :: CodeGen/AMDGPU/schedule-if-2.ll ```	2024-01-16 21:54:58 -08:00
Jay Foad	7b3bbd83c0	Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038)" This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c. Reverted due to various buildbot failures.	2023-10-09 12:31:32 +01:00
Jay Foad	2501ae58e3	[CodeGen] Really renumber slot indexes before register allocation (#67038) PR #66334 tried to renumber slot indexes before register allocation, but the numbering was still affected by list entries for instructions which had been erased. Fix this to make the register allocator's live range length heuristics even less dependent on the history of how instructions have been added to and removed from SlotIndexes's maps.	2023-10-09 11:44:41 +01:00
Jay Foad	e0919b189b	[CodeGen] Renumber slot indexes before register allocation (#66334) RegAllocGreedy uses SlotIndexes::getApproxInstrDistance to approximate the length of a live range for its heuristics. Renumbering all slot indexes with the default instruction distance ensures that this estimate will be as accurate as possible, and will not depend on the history of how instructions have been added to and removed from SlotIndexes's maps. This also means that enabling -early-live-intervals, which runs the SlotIndexes analysis earlier, will not cause large amounts of churn due to different register allocator decisions.	2023-09-19 11:18:12 +01:00
skc7	b434051dc8	[AMDGPU] Introduce SIInstrWorklist to process instructions in moveToVALU Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D147168	2023-04-10 11:34:14 +05:30
Nicolai Hähnle	10cef708a7	AMDGPU: Clean up LDS-related occupancy calculations Occupancy is expressed as waves per SIMD. This means that we need to take into account the number of SIMDs per "CU" or, to be more precise, the number of SIMDs over which a workgroup may be distributed. getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs into account at all. At the same time, we need to take into account that WGP mode offers access to a larger total amount of LDS, since this can affect how non-power-of-two LDS allocations are rounded. To make this work consistently, we distinguish between (available) local memory size and addressable local memory size (which is always limited by 64kB on gfx10+, even with WGP mode). This change results in a massive amount of test churn. A lot of it is caused by the fact that the default work group size is 1024, which means that (due to rounding effects) the default occupancy on older hardware is 8 instead of 10, which affects scheduling via register pressure estimates. I've adjusted most tests by just running the UTC tools, but in some cases I manually changed the work group size to 32 or 64 to make sure that work group size chunkiness has no effect. Differential Revision: https://reviews.llvm.org/D139468	2023-01-23 21:43:06 +01:00
Nikita Popov	bdf2fbba9c	[AMDGPU] Convert some tests to opaque pointers (NFC)	2022-12-19 12:41:13 +01:00
Alexander Timofeev	fbdea5a2e9	[AMDGPU] Always select s_cselect_b32 for uniform 'select' SDNode This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler. It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D133593	2022-09-15 22:03:56 +02:00
Simon Pilgrim	5ec47c6dc5	[DAG] Add MERGE_VALUE computeKnownBits/ComputeNumSignBits handling. Just forward the value tracking to the operand specified by the ResNo	2022-07-17 11:58:08 +01:00
Alexander Timofeev	2e29b0138c	[AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable. Since the divergence-driven instruction selection has been enabled for AMDGPU, all the uniform instructions are expected to be selected to SALU form, except those not having one. VGPR to SGPR copies appear in MIR to connect values producers and consumers. This change implements an algorithm that evolves a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form and overhead introduced by the data transfer between the VGPRs and SGPRs. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D128252	2022-07-14 23:59:02 +02:00
Jay Foad	e2926501d8	[AMDGPU] Aggressively fold immediates in SIShrinkInstructions Fold immediates regardless of how many uses they have. This is expected to increase overall code size, but decrease register usage. Differential Revision: https://reviews.llvm.org/D114644	2022-05-18 11:04:33 +01:00
Benjamin Kramer	0776f6e04d	[LSV] Vectorize loads of vectors by turning it into a larger vector Use shufflevector to do the subvector extracts. This allows a lot more load merging on AMDGPU and also on NVPTX when <2 x half> is involved. Differential Revision: https://reviews.llvm.org/D117219	2022-01-26 11:38:41 +01:00
Austin Kerbow	da067ed569	[AMDGPU] Set most sched model resource's BufferSize to one Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking so that stalls are guaranteed when waiting for data to be retrieved from memory. Since we don't actually track waitcnt here, this should do a better job at modeling their behavior. Practically, this means that the scheduler will trigger the 'STALL' heuristic more often. This type of change needs to be evaluated experimentally. Preliminary results are positive. Fixes: SWDEV-282962 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114777	2021-12-01 22:31:28 -08:00
Joe Nash	3ce1b9631a	[AMDGPU] Switch PostRA sched to MachineSched Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D109536 Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde	2021-09-14 15:11:27 -04:00
alex-t	ed0f4415f0	[AMDGPU] Divergence-driven compare operations instruction selection Description: This change enables the compare operations to be selected to SALU/VALU form dependent of the SDNode divergence flag. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D106079	2021-08-25 18:30:49 +03:00
Mircea Trofin	e881a25f1e	[NFC] Removed unused prefixes in CodeGen/AMDGPU This covers tests starting with s. Differential Revision: https://reviews.llvm.org/D94184	2021-01-07 08:00:11 -08:00
Piotr Sobczak	62d8b8a225	Fix 64-bit copy to SCC Fix 64-bit copy to SCC by restricting the pattern resulting in such a copy to subtargets supporting 64-bit scalar compare, and mapping the copy to S_CMP_LG_U64. Before introducing the S_CSELECT pattern with explicit SCC (0045786f146e78afee49eee053dc29ebc842fee1), there was no need for handling 64-bit copy to SCC ($scc = COPY sreg_64). The proposed handling to read only the low bits was however based on a false premise that it is only one bit that matters, while in fact the copy source might be a vector of booleans and all bits need to be considered. The practical problem of mapping the 64-bit copy to SCC is that the natural instruction to use (S_CMP_LG_U64) is not available on old hardware. Fix it by restricting the problematic pattern to subtargets supporting the instruction (hasScalarCompareEq64). Differential Revision: https://reviews.llvm.org/D85207	2020-08-09 20:50:30 +02:00
Jay Foad	62fd7f767c	[MachineScheduler] Fix the TopDepth/BotHeightReduce latency heuristics tryLatency compares two sched candidates. For the top zone it prefers the one with lesser depth, but only if that depth is greater than the total latency of the instructions we've already scheduled -- otherwise its latency would be hidden and there would be no stall. Unfortunately it only tests the depth of one of the candidates. This can lead to situations where the TopDepthReduce heuristic does not kick in, but a lower priority heuristic chooses the other candidate, whose depth is greater than the already scheduled latency, which causes a stall. The fix is to apply the heuristic if the depth of either candidate is greater than the already scheduled latency. All this also applies to the BotHeightReduce heuristic in the bottom zone. Differential Revision: https://reviews.llvm.org/D72392	2020-07-17 11:02:13 +01:00
Jay Foad	ecac951be9	[AMDGPU] Fix and simplify AMDGPUTargetLowering::LowerUDIVREM Use the algorithm from AMDGPUCodeGenPrepare::expandDivRem32. Differential Revision: https://reviews.llvm.org/D83382	2020-07-08 19:14:49 +01:00
Jay Foad	f4bd01c191	[AMDGPU] Fix and simplify AMDGPUCodeGenPrepare::expandDivRem32 Fix the division/remainder algorithm by adding a second quotient refinement step, which is required in some cases like 0xFFFFFFFFu / 0x11111111u (https://bugs.llvm.org/show_bug.cgi?id=46212). Also document, rewrite and simplify it by ensuring that we always have a lower bound on inv(y), which simplifies the UNR step and the quotient refinement steps. Differential Revision: https://reviews.llvm.org/D83381	2020-07-08 19:14:48 +01:00
Piotr Sobczak	0045786f14	[AMDGPU] Select s_cselect Summary: Add patterns to select s_cselect in the isel. Handle more cases of implicit SCC accesses in si-fix-sgpr-copies to allow new patterns to work. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits Tags: #llvm Re-commit D81925 with a bugfix D82370. Differential Revision: https://reviews.llvm.org/D81925 Differential Revision: https://reviews.llvm.org/D82370	2020-06-25 10:38:23 +02:00
Piotr Sobczak	6d9565d6d5	Revert "[AMDGPU] Select s_cselect" This caused some failures detected by the buildbot with expensive checks enabled. This reverts commit 4067de569f119a81419fbf2e79d5f3307dfdda5b.	2020-06-19 16:41:04 +02:00
Piotr Sobczak	4067de569f	[AMDGPU] Select s_cselect Summary: Add patterns to select s_cselect in the isel. Handle more cases of implicit SCC accesses in si-fix-sgpr-copies to allow new patterns to work. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D81925	2020-06-19 16:17:46 +02:00
Simon Pilgrim	ea80b40669	[DAG] SimplifyDemandedBits - peek through SHL if we only demand sign bits. If we're only demanding the (shifted) sign bits of the shift source value, then we can use the value directly. This handles SimplifyDemandedBits/SimplifyMultipleUseDemandedBits for both ISD::SHL and X86ISD::VSHLI. Differential Revision: https://reviews.llvm.org/D80869	2020-06-03 16:11:54 +01:00
Tim Renouf	1e926a9f9c	[AMDGPU] Fix some tests that did not specify -mcpu Summary: This fixes some tests that did not specify -mcpu. Doing that disables all subtarget features, which gives behavior that (a) does not necessarily correspond to any actual target, and (b) can change as we add new subtarget features. Also added gfx1010 to memtime test. Differential Revision: https://reviews.llvm.org/D74594 Change-Id: I8c0fe4fa03e9a93ef8bb722cd42d22e064526309	2020-02-17 14:02:32 +00:00
Jay Foad	2252cac694	[ANDGPU] getMemOperandsWithOffset: support BUF non-stack-access instructions with resource but no vaddr Summary: This enables clustering for many more BUF instructions. Reviewers: rampitec, arsenm, nhaehnle Subscribers: jvesely, wdng, hiraditya, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D73868	2020-02-03 22:49:30 +00:00
Simon Pilgrim	743d45ee25	[TargetLowering] Add SimplifyMultipleUseDemandedBits This patch introduces the DAG version of SimplifyMultipleUseDemandedBits, which attempts to peek through ops (mainly and/or/xor so far) that don't contribute to the demandedbits/elts of a node - which means we can do this even in cases where we have multiple uses of an op, which normally requires us to demanded all bits/elts. The intention is to remove a similar instruction - SelectionDAG::GetDemandedBits - once SimplifyMultipleUseDemandedBits has matured. The InstCombine version of SimplifyMultipleUseDemandedBits can constant fold which I haven't added here yet, and so far I've only wired this up to some basic binops (and/or/xor/add/sub/mul) to demonstrate its use. We do see a couple of regressions that need to be addressed: AMDGPU unsigned dot product codegen retains an AND mask (for ZERO_EXTEND) that it previously removed (but otherwise the dotproduct codegen is a lot better). X86/AVX2 has poor handling of vector ANY_EXTEND/ANY_EXTEND_VECTOR_INREG - it prematurely gets converted to ZERO_EXTEND_VECTOR_INREG. The code owners have confirmed its ok for these cases to fixed up in future patches. Differential Revision: https://reviews.llvm.org/D63281 llvm-svn: 366799	2019-07-23 12:39:08 +00:00
Simon Pilgrim	cd1878d0f9	[AMDGPU] Regenerate SDIV tests for an upcoming patch llvm-svn: 362303	2019-06-01 18:27:06 +00:00
Alexander Timofeev	db7ee7660a	[AMDGPU] Preliminary patch for divergence driven instruction selection. Immediate selection predicate changed Differential revision: https://reviews.llvm.org/D51734 Reviewers: rampitec llvm-svn: 341928	2018-09-11 11:56:50 +00:00
Stanislav Mekhanoshin	1a1687f1bb	[AMDGPU] Convert rcp to rcp_iflag If a source of rcp instruction is a result of any conversion from an integer convert it into rcp_iflag instruction. No FP exception can ever happen except division by zero if a single precision rcp argument is a representation of an integral number. Differential Revision: https://reviews.llvm.org/D48569 llvm-svn: 335742	2018-06-27 15:33:33 +00:00
Matt Arsenault	84445dd13c	AMDGPU: Use gfx9 carry-less add/sub instructions llvm-svn: 319491	2017-11-30 22:51:26 +00:00
Dmitry Preobrazhensky	a0342dc9eb	[AMDGPU][MC][GFX8][GFX9] Corrected names of integer v_{add/addc/sub/subrev/subb/subbrev} See bug 34765: https://bugs.llvm.org//show_bug.cgi?id=34765 Reviewers: tamazov, SamWot, arsenm, vpykhtin Differential Revision: https://reviews.llvm.org/D40088 llvm-svn: 318675	2017-11-20 18:24:21 +00:00
Alexander Timofeev	982aee6a38	[AMDGPU] Switch scalarize global loads ON by default Differential revision: https://reviews.llvm.org/D34407 llvm-svn: 307097	2017-07-04 17:32:00 +00:00
NAKAMURA Takumi	e4a741376b	Revert r307026, "[AMDGPU] Switch scalarize global loads ON by default" It broke a testcase. Failing Tests (1): LLVM :: CodeGen/AMDGPU/alignbit-pat.ll llvm-svn: 307054	2017-07-04 02:14:18 +00:00
Alexander Timofeev	ea7f08bee5	[AMDGPU] Switch scalarize global loads ON by default Differential revision: https://reviews.llvm.org/D34407 llvm-svn: 307026	2017-07-03 14:54:11 +00:00
Stanislav Mekhanoshin	56ea488d8b	[AMDGPU] Allow SDWA in instructions with immediates and SGPRs An encoding does not allow to use SDWA in an instruction with scalar operands, either literals or SGPRs. That is however possible to copy these operands into a VGPR first. Several copies of the value are produced if multiple SDWA conversions were done. To cleanup MachineLICM (to hoist copies out of loops), MachineCSE (to remove duplicate copies) and SIFoldOperands (to replace SGPR to VGPR copy with immediate copy right to the VGPR) runs are added after the SDWA pass. Differential Revision: https://reviews.llvm.org/D33583 llvm-svn: 304219	2017-05-30 16:49:24 +00:00
Matt Arsenault	3dbeefa978	AMDGPU: Mark all unspecified CC functions in tests as amdgpu_kernel Currently the default C calling convention functions are treated the same as compute kernels. Make this explicit so the default calling convention can be changed to a non-kernel. Converted with perl -pi -e 's/define void/define amdgpu_kernel void/' on the relevant test directories (and undoing in one place that actually wanted a non-kernel). llvm-svn: 298444	2017-03-21 21:39:51 +00:00
Matt Arsenault	7aad8fd8f4	Enable FeatureFlatForGlobal on Volcanic Islands This switches to the workaround that HSA defaults to for the mesa path. This should be applied to the 4.0 branch. Patch by Vedran Miletić <vedran@miletic.net> llvm-svn: 292982	2017-01-24 22:02:15 +00:00
Valery Pykhtin	8a89d3662a	[AMDGPU] Expand vector mulhu/mulhs Differential revision: https://reviews.llvm.org/D26077 llvm-svn: 285684	2016-11-01 10:26:48 +00:00
Changpeng Fang	71369b3a39	AMDGPU/SI: Enable load-store-opt by default. Summary: Enable load-store-opt by default, and update LIT tests. Reviewers: arsenm Differential Revision: http://reviews.llvm.org/D20694 llvm-svn: 270894	2016-05-26 19:35:29 +00:00
Matt Arsenault	81a709503d	AMDGPU: Fix high bits after division optimization This is essentially doing a 24-bit signed division with FP. We need to truncate to the N bit result. llvm-svn: 270305	2016-05-21 01:53:33 +00:00
Tom Stellard	45bb48ea19	R600 -> AMDGPU rename llvm-svn: 239657	2015-06-13 03:28:10 +00:00

42 Commits