llvm-project

Author	SHA1	Message	Date
Christudasan Devadasan	042104985c	[AMDGPU][NewPM] Port SIShrinkInstructions to new pass manager. (#106967 )	2024-09-03 10:52:50 +05:30
Shilei Tian	cb949b74e8	[NFC][FIX] Work around update_test_checks bug	2024-09-02 12:33:24 -04:00
Shilei Tian	f32f0289fd	[NFC] Update check lines of the test case `llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll`	2024-09-02 12:23:26 -04:00
Akshat Oke	da13754103	AMDGPU/NewPM Port SILoadStoreOptimizer to NPM (#106362 )	2024-09-02 11:41:56 +05:30
Changpeng Fang	26b0bef192	AMDGPU: Use pattern to select instruction for intrinsic llvm.fptrunc.round (#105761 ) Use GCNPat instead of Custom Lowering to select instructions for intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as a predicate to select only when the rounding mode is supported. "as_hw_round_mode : SDNodeXForm" is developed to translate the round modes to the corresponding ones that hardware recognizes.	2024-08-29 11:43:58 -07:00
Stephen Tozer	3d08ade7bd	[ExtendLifetimes] Implement llvm.fake.use to extend variable lifetimes (#86149 ) This patch is part of a set of patches that add an `-fextend-lifetimes` flag to clang, which extends the lifetimes of local variables and parameters for improved debuggability. In addition to that flag, the patch series adds a pragma to selectively disable `-fextend-lifetimes`, and an `-fextend-this-ptr` flag which functions as `-fextend-lifetimes` for this pointers only. All changes and tests in these patches were written by Wolfgang Pieb (@wolfy1961), while Stephen Tozer (@SLTozer) has handled review and merging. The extend lifetimes flag is intended to eventually be set on by `-Og`, as discussed in the RFC here: https://discourse.llvm.org/t/rfc-redefine-og-o1-and-add-a-new-level-of-og/72850 This patch implements a new intrinsic instruction in LLVM, `llvm.fake.use` in IR and `FAKE_USE` in MIR, that takes a single operand and has no effect other than "using" its operand, to ensure that its operand remains live until after the fake use. This patch does not emit fake uses anywhere; the next patch in this sequence causes them to be emitted from the clang frontend, such that for each variable (or this) a fake.use operand is inserted at the end of that variable's scope, using that variable's value. This patch covers everything post-frontend, which is largely just the basic plumbing for a new intrinsic/instruction, along with a few steps to preserve the fake uses through optimizations (such as moving them ahead of a tail call or translating them through SROA). Co-authored-by: Stephen Tozer <stephen.tozer@sony.com>	2024-08-29 17:53:32 +01:00
Pierre van Houtryve	1f8f2ed66a	[NFC][AMDGPU] Autogenerate tests for uniform i32 promo in ISel (#106382 ) Many tests were easy to update, but these are quite big and I think it's better to autogenerate them to see the difference well.	2024-08-29 15:20:32 +02:00
Matt Arsenault	7b7b0b95b2	DAG: Check if is_fpclass is custom, instead of isLegalOrCustom (#105577 ) For some reason, isOperationLegalOrCustom is not the same as isOperationLegal || isOperationCustom. Unfortunately, it checks if the type is legal which makes it useless for custom lowering on non-legal types (which is always ppcf128). Really the DAG builder shouldn't be going to expand this in the builder, it makes it difficult to work with. It's only here to work around the DAG requiring legal integer types the same size as the FP type after type legalization.	2024-08-29 14:05:43 +04:00
Akshat Oke	fdca2c33a1	AMDGPU/NewPM Port GCNDPPCombine to NPM (#105816 ) Co-authored-by: Akshat Oke <Akshat.Oke@amd.com>	2024-08-29 14:49:52 +05:30
Akshat Oke	2adc94cd6c	AMDGPU/NewPM: Port SIFoldOperands to new pass manager (#105801 )	2024-08-29 11:34:54 +05:30
Shilei Tian	572d2fd327	[Attributor] Fix an issue that could potentially cause `AccessList` and `OffsetBins` out of sync (#106187 ) The implementation of `AAPointerInfo::RangeList::set_difference` doesn't consider the case where two ranges have the same offset but different sizes. This could cause `AccessList` and `OffsetBins` out of sync because a range has been already updated in `AccessList` but missing in `ToRemove`. I do have a reproducer but the reproducer itself is 248kb. `llvm-reduce` can't further reduce it. Not sure how I can make a smaller reproducer. Fixes: SWDEV-479757.	2024-08-29 01:02:19 -04:00
Changpeng Fang	53d95f3056	AMDGPU: Rename fail.llvm.fptrunc.round.ll to llvm.fptrunc.round.err.ll (#106452 ) Also correct the suffix of the intrinsic	2024-08-28 13:52:07 -07:00
Jon Chesterfield	1bde8e0b80	[AMDGPU] Don't realign already allocated LDS. Point fix for 106412 (#106421 ) Fixes 106412. The logic that skips the pass on already-lowered variables doesn't cover the path that increases alignment of variables. If a variable is allocated at 24 and then given 16 byte alignment, the backend notices and fatal-errors on the inconsistency.	2024-08-28 18:30:48 +01:00
Alex MacLean	4c4908cd5d	[AMDGPU] adjust tests to prevent fpclass bitcast folding (#106268 ) Make some minor tweaks to AMDGPU tests to ensure they still work as intended after https://github.com/llvm/llvm-project/pull/97762. These tests can be radically simplified after bitcast aware fpclass deduction.	2024-08-27 13:20:44 -07:00
Shilei Tian	d880f5a4c9	[AMDGPU][Attributor] Remove uniformity check in the indirect call specialization callback (#106177 ) This patch removes the conservative uniformity check in the indirect call specialization callback, as whether the function pointer is uniform doesn't matter too much. Instead, we add an argument to control specialization.	2024-08-27 12:27:17 -04:00
Jay Foad	d0fe52d951	[AMDGPU] Fix sign confusion in performMulLoHiCombine (#105831 ) SMUL_LOHI and UMUL_LOHI are different operations because the high part of the result is different, so it is not OK to optimize the signed version to MUL_U24/MULHI_U24 or the unsigned version to MUL_I24/MULHI_I24.	2024-08-27 17:09:40 +01:00
Brox Chen	2e0583ef8b	[AMDGPU][CodeGen][NFC] update a mir test file with latest update_mir_test_check script (#106073 ) Run latest update_mir_test_checks.py and update one codeGen test file. This is to clean up the mir test files diff generated by python script version update	2024-08-26 16:45:12 -04:00
Sameer Sahasrabuddhe	fa4cc9ddd5	[FixIrreducible] Use CycleInfo instead of a custom SCC traversal (#101386 ) 1. CycleInfo efficiently locates all cycles in a single pass, while the SCC is repeated inside every natural loop. 2. CycleInfo provides a hierarchy of irreducible cycles, and the new implementation transforms each cycle in this hierarchy separately instead of reducing an entire irreducible SCC in a single step. This reduces the number of control-flow paths that pass through the header of each newly created loop. This is evidenced by the reduced number of predecessors on the "guard" blocks in the lit tests, and fewer operands on the corresponding PHI nodes. 3. When an entry of an irreducible cycle is the header of a child natural loop, the original implementation destroyed that loop. This is now preserved, since the incoming edges on non-header entries are not touched. 4. In the new implementation, if an irreducible cycle is a superset of a natural loop with the same header, then that natural loop is destroyed and replaced by the newly created loop.	2024-08-26 15:51:34 +05:30
Chaitanya	1f02be2e17	[AMDGPU] Enable "amdgpu-sw-lower-lds" pass in pipeline. (#89206 ) This PR enables "amdgpu-sw-lower-lds" pass in the pipeline. Also introduces "amdgpu-enable-sw-lower-lds" cmd line flag to enable/disable the pass.	2024-08-26 14:21:19 +05:30
Chaitanya	7bc9d95b7e	[AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. (#87265 ) This PR introduces new pass "amdgpu-sw-lower-lds". This pass lowers the local data store, LDS, uses in kernel and non-kernel functions in module to use dynamically allocated global memory. Packed LDS Layout is emulated in the global memory. The lowered memory instructions from LDS to global memory are then instrumented for address sanitizer, to catch addressing errors. This pass only works when address sanitizer has been enabled and has instrumented the IR. It identifies that IR has been instrumented using "nosanitize_address" module flag. For a kernel, LDS access can be static or dynamic which are direct (accessed within kernel) and indirect (accessed through non-kernels). Replacement of Kernel LDS accesses: - All the LDS accesses corresponding to kernel will be packed together, where all static LDS accesses will be allocated first and then dynamic LDS follows. The total size with alignment is calculated. A new LDS global will be created for the kernel called "SW LDS" and it will have the attribute "amdgpu-lds-size" attached with value of the size calculated. All the LDS accesses in the module will be replaced by GEP with offset into the "SW LDS". - A new "llvm.amdgcn.<kernel>.dynlds" is created per kernel accessing the dynamic LDS. This will be marked used by kernel and will have MD_absolute_symbol metadata set to total static LDS size, since dynamic LDS allocation starts after all static LDS allocation. - A device global memory equal to the total LDS size will be allocated. At the prologue of the kernel, a single work-item from the work-group, does a "malloc" and stores the pointer of the allocation in "SW LDS". To store the offsets corresponding to all LDS accesses, another global variable is created which will be called "SW LDS metadata" in this pass. - SW LDS: It is LDS global of ptr type with name "llvm.amdgcn.sw.lds.<kernel-name>". 
- SW LDS Metadata: It is of struct type, with n members. n equals the number of LDS globals accessed by the kernel(direct and indirect). Each member of struct is another struct of type {i32, i32, i32}. First member corresponds to offset, second member corresponds to size of LDS global being replaced and third represents the total aligned size. It will have name "llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have an initializer with static LDS related offsets and sizes initialized. But for dynamic LDS related entries, offsets will be initialized to previous static LDS allocation end offset. Sizes for them will be zero initially. These dynamic LDS offset and size values will be updated within the kernel, since kernel can read the dynamic LDS size allocation done at runtime with query to "hidden_dynamic_lds_size" hidden kernel argument. - At the epilogue of kernel, allocated memory would be made free by the same single work-item. Replacement of non-kernel LDS accesses: - Multiple kernels can access the same non-kernel function. All the kernels accessing LDS through non-kernels are sorted and assigned a kernel-id. All the LDS globals accessed by non-kernels are sorted. - This information is used to build two tables: - Base table: Base table will have single row, with elements of the row placed as per kernel ID. Each element in the row corresponds to ptr of "SW LDS" variable created for that kernel. - Offset table: Offset table will have multiple rows and columns. Rows are assumed to be from 0 to (n-1). n is total number of kernels accessing the LDS through non-kernels. Each row will have m elements. m is the total number of unique LDS globals accessed by all non-kernels. Each element in the row corresponds to the ptr of the replacement of LDS global done by that particular kernel. - An LDS variable in non-kernel will be replaced based on the information from base and offset tables. 
Based on kernel-id query, ptr of "SW LDS" for that corresponding kernel is obtained from base table. The Offset into the base "SW LDS" is obtained from corresponding element in offset table. With this information, replacement value is obtained.	2024-08-26 08:59:26 +05:30
Austin Kerbow	ceb587a16c	[AMDGPU] Fix crash in allowsMisalignedMemoryAccesses with i1 (#105794 )	2024-08-23 11:51:37 -07:00
Jay Foad	fa2dccb377	[AMDGPU] Remove one case of vmcnt loop header flushing for GFX12 (#105550 ) When a loop contains a VMEM load whose result is only used outside the loop, do not bother to flush vmcnt in the loop head on GFX12. A wait for vmcnt will be required inside the loop anyway, because VMEM instructions can write their VGPR results out of order.	2024-08-23 10:31:33 +01:00
Jay Foad	b02b5b7b59	[AMDGPU] Simplify use of hasMovrel and hasVGPRIndexMode (#105680 ) The generic subtarget has neither of these features. Rather than forcing HasMovrel on, it is simpler to expand dynamic vector indexing to a sequence of compare/select instructions. NFC for real subtargets.	2024-08-23 09:59:19 +01:00
Matt Arsenault	ee08d9cba5	AMDGPU: Remove global/flat atomic fadd intrinsics (#97051 ) These have been replaced with atomicrmw.	2024-08-22 23:27:33 +04:00
Jeffrey Byrnes	7bcf4d63cf	[AMDGPU] Correctly insert s_nops for dst forwarding hazard (#100276 ) MI300 ISA section 4.5 states there is a hazard between "VALU op which uses OPSEL or SDWA with changes the result’s bit position" and "VALU op consumes result of that op" This includes the case where the second op is SDWA with same dest and dst_sel != DWORD && dst_unused == UNUSED_PRESERVE. In this case, there is an implicit read of the first op dst and the compiler needs to resolve this hazard. Confirmed with HW team. We model dst_unused == UNUSED_PRESERVE as tied-def of implicit operand, so this PR checks for that. MI300_SP_MAS section 1.3.9.2 specifies that CVT_SR_FP8_F32 and CVT_SR_BF8_F32 with opsel[3:2] !=0 have dest forwarding issue. Currently, we only add check for CVT_SR_FP8_F32 with opsel[3] != 0 -- this PR adds support for opsel[2] != 0 as well	2024-08-22 11:38:24 -07:00
Jay Foad	2012b25420	[AMDGPU][GlobalISel] Disable fixed-point iteration in all Combiners (#105517 ) Disable fixed-point iteration in all AMDGPU Combiners after #102163. This saves around 2% compile time in ad hoc testing on some large graphics shaders. I did not notice any regressions in the generated code, just a bunch of harmless differences in instruction selection and register allocation.	2024-08-22 17:14:53 +01:00
Jay Foad	c4c5fdd933	[AMDGPU] Generate checks for vector indexing. NFC. (#105668 ) This allows combining some test files that were only split because adding new RUN lines introduced too much churn in the checks.	2024-08-22 16:11:12 +01:00
Jay Foad	5506831f7b	[AMDGPU] GFX12 VMEM loads can write VGPR results out of order (#105549 ) Fix SIInsertWaitcnts to account for this by adding extra waits to avoid WAW dependencies.	2024-08-22 11:46:51 +01:00
Jay Foad	61194617ad	[AMDGPU] Add GFX12 test coverage for vmcnt flushing in loop headers (#105548 )	2024-08-22 11:42:57 +01:00
Sameer Sahasrabuddhe	5f6172f068	[Transforms] Refactor CreateControlFlowHub (#103013 ) CreateControlFlowHub is a method that redirects control flow edges from a set of incoming blocks to a set of outgoing blocks through a new set of "guard" blocks. This is now refactored into a separate file with one enhancement: The input to the method is now a set of branches rather than two sets of blocks. The original implementation reroutes every edge from incoming blocks to outgoing blocks. But it is possible that for some incoming block InBB, some successor S might be in the set of outgoing blocks, but that particular edge should not be rerouted. The new implementation makes this possible by allowing the user to specify the targets of each branch that need to be rerouted. This is needed when improving the implementation of FixIrreducible #101386. Current use in FixIrreducible does not demonstrate this finer control over the edges being rerouted. But in UnifyLoopExits, when only one successor of an exiting block is an exit block, this refinement now reroutes only the relevant control-flow through the edge; the non-exit successor is not rerouted. This results in fewer branches and PHI nodes in the hub.	2024-08-22 12:18:01 +05:30
Matt Arsenault	8039886e6d	AMDGPU: Handle folding frame indexes into s_add_i32 (#101694 ) This does not yet enable producing direct frame index references in s_add_i32, only the lowering.	2024-08-22 09:16:37 +04:00
Jay Foad	a6bae5cb37	[AMDGPU] Split GCNSubtarget into its own file. NFC. (#105525 )	2024-08-21 19:11:02 +01:00
Sumanth Gundapaneni	e78156a0e2	Scalarize the vector inputs to llvm.lround intrinsic by default. (#101054 ) Verifier is updated in a different patch to allow vector types for the llvm.lround and llvm.llround intrinsics.	2024-08-21 12:13:56 -05:00
Brox Chen	ddb5480e67	[AMDGPU][True16][MC] added VOPC realtrue/faketrue flag and fake16 instructions (#104739 ) VOPC instructions were defined with HasTrue16BitInst flag while these true16 instructions are actually implemented with fake16 profile. Separate them to true16 version and fake16 version by adding UseRealTrue16 and UseFakeTrue16 flag and fake16 instructions. The code defaults to using fake16. This is preparing for the upcoming changes in MC to support realtrue 16bit operands and vdst. The true16 and fake16 profile will be modified in the later patches.	2024-08-21 17:47:36 +03:00
Simon Pilgrim	8109e5de57	[DAG] Add select_cc -> abd folds (#102137 ) Fixes #100810	2024-08-21 12:07:40 +01:00
Matt Arsenault	9d364286f3	AMDGPU: Remove flat/global atomic fadd v2bf16 intrinsics (#97050 ) These are now fully covered by atomicrmw.	2024-08-21 14:26:42 +04:00
Nikita Popov	a105877646	[InstCombine] Remove some of the complexity-based canonicalization (#91185 ) The idea behind this canonicalization is that it allows us to handle fewer patterns, because we know that some will be canonicalized away. This is indeed very useful to e.g. know that constants are always on the right. However, this is only useful if the canonicalization is actually reliable. This is the case for constants, but not for arguments: Moving these to the right makes it look like the "more complex" expression is guaranteed to be on the left, but this is not actually the case in practice. It fails as soon as you replace the argument with another instruction. The end result is that it looks like things correctly work in tests, while they actually don't. We use the "thwart complexity-based canonicalization" trick to handle this in tests, but it's often a challenge for new contributors to get this right, and based on the regressions this PR originally exposed, we clearly don't get this right in many cases. For this reason, I think that it's better to remove this complexity canonicalization. It will make it much easier to write tests for commuted cases and make sure that they are handled.	2024-08-21 12:02:54 +02:00
Stanislav Mekhanoshin	5fcd05967a	[AMDGPU] Add VOPD combine dependency tests. NFC. (#104841 )	2024-08-19 15:43:27 -07:00
Jay Foad	564bd20658	[AMDGPU][GlobalISel] Save a copy in one case of addrspacecast (#104789 ) Refactor legalization of addrspacecast local/private -> flat to avoid building a copy in the nonnull case.	2024-08-19 18:22:29 +01:00
Jay Foad	2258bc429b	[AMDGPU] Simplify, fix and improve known bits for mbcnt (#104768 ) Simplify by using KnownBits::add. Fix GlobalISel path which was ignoring the known bits of src1. Improve analysis of mbcnt.hi which adds at most 31 even in wave64.	2024-08-19 18:22:06 +01:00
Austin Kerbow	7d5281a66d	[AMDGPU][NFC] Fix preload-kernarg.ll test after attributor move (#98840 ) Update was to a stale version of the test with missing functions and extra runlines that had been removed.	2024-08-18 17:04:27 -07:00
Changpeng Fang	16929219b0	AMDGPU: Add tonearest and towardzero roundings for intrinsic llvm.fptrunc.round (#104486 ) This work simplifies and generalizes the instruction definition for intrinsic llvm.fptrunc.round. We no longer name the instruction with the rounding mode. Instead, we introduce an immediate operand for the rounding mode for the pseudo instruction. This immediate will be used to set up the hardware mode register at the time the real instruction is generated. We name the pseudo instruction as FPTRUNC_ROUND_F16_F32 (for f32 -> f16), which is easy to generalize for other types. "round.towardzero" and "round.tonearest" are added for f32 -> f16 truncating, in addition to the existing "round.upward" and "round.downward". Other rounding modes are not supported by hardware at this moment.	2024-08-17 11:22:47 -07:00
Carl Ritson	fc6300a5f7	[AMDGPU] Disable inline constants for pseudo scalar transcendentals (#104395 ) Prevent operand folding from inlining constants into pseudo scalar transcendental f16 instructions. However still allow literal constants.	2024-08-17 16:52:38 +09:00
Jeffrey Byrnes	fcefe957dd	[LegalizeTypes][AMDGPU]: Allow for scalarization of insert_subvector (#104236 ) Legalization for when the inserted subvector is to be scalarized. https://godbolt.org/z/vx3joWqoh	2024-08-15 08:07:05 -07:00
Shilei Tian	1ca9fe6db3	Reapply "[Attributor][AMDGPU] Enable AAIndirectCallInfo for AMDAttributor (#100952 )" This reverts commit 36467bfe89f231458eafda3edb916c028f1f0619.	2024-08-14 17:16:47 -04:00
Matt Arsenault	21cea3f3be	AMDGPU: Stop promoting allocas with addrspacecast users (#104051 ) We cannot promote this case unless we know the value is only observed through flat operations. We cannot analyze this through a call. PointerMayBeCaptured was an imprecise check for this. A callee with a nocapture attribute may still cast to private and observe the address space, so really we need a different notion of nocapture. I doubt this was of any use anyway. The promotable cases should have optimized out addrspacecast to begin earlier. Fixes #66669 Fixes #104035	2024-08-14 21:53:38 +04:00
Matt Arsenault	36a0f20ac3	AMDGPU/NewPM: Fill out addPreISelPasses (#102814 ) This specific callback should now be at parity with the old pass manager version. There are still some missing IR passes before this point. Also I don't understand the need for the RequiresAnalysisPass at the end. SelectionDAG should just be using the uncached getResult?	2024-08-14 20:57:00 +04:00
Craig Topper	abc1acf8df	[TargetLowering][AMDGPU][ARM][RISCV][X86] Teach SimplifyDemandedBits to combine (srl (sra X, C1), ShAmt) -> sra(X, C1+ShAmt) (#101751 ) If the upper bits of the shr aren't demanded. This helps with cases where the outer srl was originally an sra and was converted to a srl by SimplifyDemandedBits before it had a chance to combine with the inner sra. This can occur when the inner sra was part of a sign_extend_inreg expansion. There are some regressions in ARM and Thumb2.	2024-08-14 08:44:57 -07:00
Jay Foad	df57833ea8	[AMDGPU] Generate checks for llvm.amdgcn.is.private/shared (#103859 ) Also combine the GlobalISel tests into the SelectionDAG ones.	2024-08-14 13:23:33 +01:00
Matt Arsenault	edded8d7b5	AMDGPU: Stop handling legacy amdgpu-unsafe-fp-atomics attribute (#101699 ) This is now autoupgraded to annotate atomicrmw instructions in old bitcode.	2024-08-13 22:02:25 +04:00
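
Note on d0fe52d951 (sign confusion in performMulLoHiCombine): the message says SMUL_LOHI and UMUL_LOHI differ because the high part of the result differs. The following is a minimal sketch of that arithmetic fact only, not the LLVM implementation; the helper name `mul_lohi` and the 32-bit width are assumptions chosen for illustration.

```python
def mul_lohi(a, b, bits=32, signed=False):
    """Return (lo, hi) halves of the full 2*bits-wide product of a and b.

    Hypothetical helper for illustration; it models the arithmetic of a
    widening multiply, not any LLVM API.
    """
    mask = (1 << bits) - 1

    def to_signed(x):
        # Reinterpret the low `bits` bits of x as a two's-complement value.
        x &= mask
        return x - (1 << bits) if x >> (bits - 1) else x

    x = to_signed(a) if signed else a & mask
    y = to_signed(b) if signed else b & mask
    full = x * y
    lo = full & mask
    # Python's >> on negative ints is an arithmetic shift, so masking the
    # result yields the two's-complement high half.
    hi = (full >> bits) & mask
    return lo, hi


u_lo, u_hi = mul_lohi(0xFFFFFFFF, 2, signed=False)
s_lo, s_hi = mul_lohi(0xFFFFFFFF, 2, signed=True)
print(hex(u_lo), hex(u_hi))
print(hex(s_lo), hex(s_hi))
```

With the same input bit patterns the low halves match while the high halves do not, which is why a signed high-half multiply cannot be swapped for an unsigned one (or vice versa).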

7728 Commits
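
Note on abc1acf8df (SimplifyDemandedBits combine): the message describes folding (srl (sra X, C1), ShAmt) into sra(X, C1+ShAmt) when the upper bits of the outer shift are not demanded. The sketch below checks that claim on plain 32-bit integers. It is an illustration under assumptions (the helper names `sra`, `srl`, `low_bits_agree` and the fixed 32-bit width are made up here, and C1 + ShAmt is assumed to stay below the bit width); it is not the TargetLowering code.

```python
BITS = 32
MASK = (1 << BITS) - 1


def sra(x, n):
    """Arithmetic shift right of a 32-bit value (sign bit is replicated)."""
    x &= MASK
    signed = x - (1 << BITS) if x >> (BITS - 1) else x
    return (signed >> n) & MASK


def srl(x, n):
    """Logical shift right of a 32-bit value (zeros are shifted in)."""
    return (x & MASK) >> n


def low_bits_agree(x, c1, sh_amt):
    """True if both forms agree on every bit except the top sh_amt bits."""
    demanded = MASK >> sh_amt  # the low BITS - sh_amt bits
    before = srl(sra(x, c1), sh_amt)
    after = sra(x, c1 + sh_amt)
    return (before & demanded) == (after & demanded)


x = 0x80000000
print(hex(srl(sra(x, 4), 4)))   # zeros in the top 4 bits
print(hex(sra(x, 8)))           # sign bits in the top 4 bits
print(low_bits_agree(x, 4, 4))
```

The two forms differ only in the top ShAmt bits, so the rewrite is safe exactly when those bits are not demanded by any user.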