llvm-project

Author	SHA1	Message	Date
Juan Manuel Martinez Caamaño	0375ef07c3	[Clang][AMDGPU] Add __builtin_amdgcn_cvt_off_f32_i4 (#133741 ) This built-in maps to `V_CVT_OFF_F32_I4` which treats its input as a 4-bit signed integer and returns `0.0625f * src`. SWDEV-518861	2025-04-02 19:51:40 +02:00
Tim Gymnich	1d0005a69a	[GlobalISel][NFC] Rename GISelKnownBits to GISelValueTracking (#133466 ) - rename `GISelKnownBits` to `GISelValueTracking` to analyze more than just `KnownBits` in the future	2025-03-29 11:51:29 +01:00
LU-JOHN	827f2ad643	AMDGPU: Convert vector 64-bit shl to 32-bit if shift amt >= 32 (#132964 ) Convert vector 64-bit shl to 32-bit if shift amt is known to be >= 32. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-03-28 23:46:35 +07:00
Diana Picus	e17b3cdfb3	[AMDGPU] Dynamic VGPR support for llvm.amdgcn.cs.chain (#130094 ) The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may indicate that we want to reallocate the VGPRs before performing the call. A call with the following arguments: ``` llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args, /flags/0x1, %num_vgprs, %fallback_exec, %fallback_callee ``` is supposed to do the following: - copy the SGPR and VGPR args into their respective registers - try to change the VGPR allocation - if the allocation has succeeded, set EXEC to %exec and jump to %callee, otherwise set EXEC to %fallback_exec and jump to %fallback_callee This patch implements the dynamic VGPR behaviour by generating an S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and callee. The rest of the call sequence is left undisturbed (i.e. identical to the case where the flags are 0 and we don't use dynamic VGPRs). We achieve this by introducing some new pseudos (SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason is so that we don't risk other passes (particularly the PostRA scheduler) introducing instructions between the S_ALLOC_VGPR and the jump. Such instructions might end up using VGPRs that have been deallocated, or the wrong EXEC mask. Once the whole backend treats S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use VGPRs, we could in principle move the expansion earlier (but in the absence of a good reason for that my personal preference is to keep it later in order to make debugging easier). Since the expansion happens after register allocation, we're careful to select constants to immediate operands instead of letting ISel generate S_MOVs which could interfere with register allocation (i.e. make it look like we need more registers than we actually do). 
For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during ISel in wave64 mode. However, we can define the pseudos for wave64 too so it's easy to handle if future generations support it. --------- Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com> Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-03-20 08:38:04 +01:00
David Green	bd1be8a242	[CodeGen][GlobalISel] Add a getVectorIdxWidth and getVectorIdxLLT. (#131526 ) From #106446, this adds a variant of getVectorIdxTy that returns an LLT. Many uses only look at the width, so a getVectorIdxWidth was added as the common base.	2025-03-18 08:31:11 +00:00
Jay Foad	44607666b3	[AMDGPU] Simplify conditional expressions. NFC. (#129228 ) Simplify `cond ? val : false` to `cond && val` and similar.	2025-03-03 10:40:49 +00:00
LU-JOHN	5decab178f	AMDGPU: Reduce shl64 to shl32 if shift range is [63-32] (#125574 ) Reduce: DST = shl i64 X, Y where Y is in the range [63-32] to: DST = [0, shl i32 X, (Y & 0x1F)] Alive2 analysis: https://alive2.llvm.org/ce/z/w_u5je --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-02-13 13:40:25 -06:00
Matt Arsenault	077e0c134a	AMDGPU: Generalize truncate of shift of cast build_vector combine (#125617 ) Previously we only handled cases that looked like the high element extract of a 64-bit shift. Generalize this to handle any multiple indexing. I was hoping this would help avoid some regressions, but it did not. It does however reduce the number of steps the DAG takes to process these cases. NFC-ish, I have yet to find an example where this changes the final output.	2025-02-04 11:46:30 +07:00
Sergei Barannikov	9ae92d7056	[SelectionDAG] Virtualize isTargetStrictFPOpcode / isTargetMemoryOpcode (#119969 ) With this change, targets are no longer required to put memory / strict-fp opcodes after special `ISD::FIRST_TARGET_MEMORY_OPCODE`/`ISD::FIRST_TARGET_STRICTFP_OPCODE` markers. This will also allow autogenerating `isTargetMemoryOpcode`/`isTargetStrictFPOpcode` (#119709). Pull Request: https://github.com/llvm/llvm-project/pull/119969	2024-12-21 05:29:51 +03:00
Craig Topper	bd261ecc5a	[SelectionDAG] Add SDNode::user_begin() and use it in some places (#120509 ) Most of these are just places that want the first user and aren't iterating over the whole list. While there I changed some use_size() == 1 to hasOneUse() which is more efficient. This is part of an effort to rename use_iterator to user_iterator and provide a use_iterator that dereferences to SDUse&. This patch helps reduce the diff on later patches.	2024-12-18 22:13:04 -08:00
Craig Topper	104ad9258a	[SelectionDAG] Rename SDNode::uses() to users(). (#120499 ) This function is most often used in range based loops or algorithms where the iterator is implicitly dereferenced. The dereference returns an SDNode * of the user rather than SDUse * so users() is a better name. I've long been annoyed that we can't write a range based loop over SDUse when we need getOperandNo. I plan to rename use_iterator to user_iterator and add a use_iterator that returns SDUse& on dereference. This will make it more like IR.	2024-12-18 20:09:33 -08:00
LiqinWeng	3083acc215	[DAGCombine] Remove oneuse restrictions for RISCV in folding (shl (add_nsw x, c1)), c2) and folding (shl(sext(add x, c1)), c2) in some scenarios (#101294 ) This patch removes the restriction for folding (shl (add_nsw x, c1)), c2) and folding (shl(sext(add x, c1)), c2). The test case is from dhrystone; see these links: riscv32: https://godbolt.org/z/o8GdMKrae riscv64: https://godbolt.org/z/Yh5bPz56z	2024-12-10 11:17:54 +08:00
Jon Chesterfield	4e0ba801ea	Revert "[amdgpu][lds] Simplify error diag path - lds variable names are no longer special" Test case didn't run locally, investigating This reverts commit 7bad469182ff2f6423ea209d5a1e81acca600568.	2024-12-08 12:00:13 +00:00
Jon Chesterfield	7bad469182	[amdgpu][lds] Simplify error diag path - lds variable names are no longer special	2024-12-08 11:26:33 +00:00
Nikita Popov	3317c9ceac	[AMDGPU] Use getSignedConstant() where necessary (#117328 ) Create signed constant using getSignedConstant(), to avoid future assertion failures when we disable implicit truncation in getConstant(). This also touches some generic legalization code, which apparently only AMDGPU tests.	2024-11-25 09:49:34 +01:00
Kazu Hirata	be187369a0	[AMDGPU] Remove unused includes (NFC) (#116154 ) Identified with misc-include-cleaner.	2024-11-13 21:10:03 -08:00
Sergei Barannikov	3d73dbe7f0	[AMDGPU] Remove unused AMDGPUISD enum members (NFC) (#115582 ) Those were only used in `getTargetNodeName`.	2024-11-11 23:39:20 +03:00
Gang Chen	8c752900dd	[AMDGPU] modify named barrier builtins and intrinsics (#114550 ) Use a local pointer type to represent the named barrier in builtin and intrinsic. This makes the definitions more user friendly because users do not need to worry about the hardware ID assignment. Also this approach is more like the other popular GPU programming languages. Named barriers should be represented as global variables of addrspace(3) in LLVM-IR. The compiler assigns the special LDS offsets for those variables during the AMDGPULowerModuleLDS pass. Those addresses are converted to hw barrier ID during instruction selection. The rest of the instruction-selection changes are primarily due to the intrinsic-definition changes.	2024-11-06 10:37:22 -08:00
Matt Arsenault	88e23eb2cf	DAG: Fix legalization of vector addrspacecasts (#113964 )	2024-10-29 08:08:50 -05:00
Jeffrey Byrnes	853c43d04a	[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564 ) Porting to TTI provides direct access to the instruction cost model, which can enable instruction cost based sinking without introducing code duplication.	2024-10-09 14:30:09 -07:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Jay Foad	39babbffc9	[AMDGPU] Implement isSDNodeAlwaysUniform for INTRINSIC_W_CHAIN (#110114 ) There are no always uniform side-effecting intrinsics upstream to test this with, but we have examples downstream.	2024-09-26 14:44:14 +01:00
Pierre van Houtryve	758444ca3e	[AMDGPU] Promote uniform ops to I32 in DAGISel (#106383 ) Promote uniform binops, selects and setcc between 2 and 16 bits to 32 bits in DAGISel Solves #64591	2024-09-19 09:00:21 +02:00
Stanislav Mekhanoshin	0745219d4a	[AMDGPU] Add target intrinsic for s_buffer_prefetch_data (#107293 )	2024-09-06 11:41:21 -07:00
Changpeng Fang	26b0bef192	AMDGPU: Use pattern to select instruction for intrinsic llvm.fptrunc.round (#105761 ) Use GCNPat instead of Custom Lowering to select instructions for intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as a predicate to select only when the rounding mode is supported. "as_hw_round_mode : SDNodeXForm" is developed to translate the round modes to the corresponding ones that hardware recognizes.	2024-08-29 11:43:58 -07:00
Matt Arsenault	7b7b0b95b2	DAG: Check if is_fpclass is custom, instead of isLegalOrCustom (#105577 ) For some reason, isOperationLegalOrCustom is not the same as isOperationLegal || isOperationCustom. Unfortunately, it checks if the type is legal which makes it useless for custom lowering on non-legal types (which is always ppcf128). Really the DAG builder shouldn't be going to expand this in the builder, it makes it difficult to work with. It's only here to work around the DAG requiring legal integer types the same size as the FP type after type legalization.	2024-08-29 14:05:43 +04:00
Jay Foad	d0fe52d951	[AMDGPU] Fix sign confusion in performMulLoHiCombine (#105831 ) SMUL_LOHI and UMUL_LOHI are different operations because the high part of the result is different, so it is not OK to optimize the signed version to MUL_U24/MULHI_U24 or the unsigned version to MUL_I24/MULHI_I24.	2024-08-27 17:09:40 +01:00
Changpeng Fang	16929219b0	AMDGPU: Add tonearest and towardzero roundings for intrinsic llvm.fptrunc.round (#104486 ) This work simplifies and generalizes the instruction definition for intrinsic llvm.fptrunc.round. We no longer name the instruction with the rounding mode. Instead, we introduce an immediate operand for the rounding mode for the pseudo instruction. This immediate will be used to set up the hardware mode register at the time the real instruction is generated. We name the pseudo instruction as FPTRUNC_ROUND_F16_F32 (for f32 -> f16), which is easy to generalize for other types. "round.towardzero" and "round.tonearest" are added for f32 -> f16 truncating, in addition to the existing "round.upward" and "round.downward". Other rounding modes are not supported by hardware at this moment.	2024-08-17 11:22:47 -07:00
Craig Topper	51bad732dc	[SelectionDAG] Replace EVTToAPFloatSemantics with MVT/EVT::getFltSemantics. (#103001 )	2024-08-13 11:35:28 -07:00
Kazu Hirata	f4fb735840	[llvm] Construct SmallVector<SDValue> with ArrayRef (NFC) (#102578 )	2024-08-09 09:15:42 -07:00
Matt Arsenault	88a85942ce	AMDGPU: Directly handle all atomicrmw cases in SIISelLowering (#102439 )	2024-08-08 22:45:43 +04:00
Matt Arsenault	1d2b2d29d7	AMDGPU: Cleanup extract_subvector actions (NFC) (#101454 ) The base AMDGPUISelLowering was setting custom action on 16-bit vector types, but also set in SIISelLowering.	2024-08-01 10:55:28 +04:00
Matt Arsenault	e24dc34aa0	AMDGPU: Fix asserting in DAG kernel argument lowering on v6i32 (#100528 ) Remove this pointless assertion for the number of vector elements.	2024-07-25 14:03:28 +04:00
Sumanth Gundapaneni	0ee32c4573	[AMDGPU] Implement llvm.lrint intrinsic lowering (#98931 ) This patch enables the target-independent lowering of llvm.lrint via GlobalISel. For SelectionDAG, the intrinsic is custom lowered for AMDGPU.	2024-07-24 23:34:31 +04:00
Sumanth Gundapaneni	fc832d5349	[AMDGPU] Implement llvm.lround intrinsic lowering. (#98970 ) This patch enables the target-independent lowering of llvm.lround via GlobalISel. For SelectionDAG, the intrinsic is custom lowered for AMDGPU. In order to support vector floating point input for llvm.lround, this patch extends the target independent APIs and provides support for scalarizing. pr98950 is needed to let the verifier allow vector floating point types	2024-07-23 20:34:34 +04:00
Jay Foad	c7309dadbf	[AMDGPU] Use range-based for loops. NFC. (#99047 )	2024-07-17 10:18:03 +01:00
Jay Foad	38a1dec30b	[AMDGPU] Use std::min with initializer list. NFC.	2024-07-16 15:25:08 +01:00
Joseph Huber	3f1a767572	[LLVM] Factor disabled Libcalls into the initializer (#98421 ) Summary: These Libcalls represent which functions are available to the backend. If a runtime call is not available, the target sets the name to `nullptr`. Currently, this logic is spread around the various targets. This patch pulls all of the locations that disable libcalls into the initializer. This patch is effectively NFC. The motivation behind this patch is that currently the LTO handling uses the list of all runtime calls to determine which functions cannot be internalized and must be extracted from static libraries. We do not want this to happen for libcalls that are not emitted by the backend. A follow-up patch will move out this logic so the LTO pass can know which rtlib calls are actually used by the backend.	2024-07-11 12:59:25 -05:00
Fabian Ritter	e1094dd889	[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU (#96185 ) In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: Ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 16 performed well in microbenchmarks; values of 8, 32, or 64 would perform similarly. The testcase updates are autogenerated by utils/update_llc_test_checks.py. See also: - PR introducing this optimization: https://reviews.llvm.org/D46477 Part of SWDEV-455845.	2024-07-03 08:32:35 +02:00
Nikita Popov	9df71d7673	[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919 ) Similar to https://github.com/llvm/llvm-project/pull/96902, this adds `getDataLayout()` helpers to Function and GlobalValue, replacing the current `getParent()->getDataLayout()` pattern.	2024-06-28 08:36:49 +02:00
Matt Arsenault	8520061281	AMDGPU: Support local atomicrmw fmin/fmax for float/double (#95590 ) This has always been supported. Somehow, we ended up with 2 copies of clang builtins for this case, and the newer one erroneously requires gfx8-insts.	2024-06-18 18:34:34 +02:00
Matt Arsenault	3b997294d6	AMDGPU: Remove .v2bf16 buffer atomic fadd intrinsics (#95783 ) These are redundant with the unsuffixed versions, and have a name collision with surprising behavior when the base intrinsic is used with v2bf16. The global and flat variants should be removed too, but those are complicated due to using v2i16 in place of the natural v2bf16. Those cases can soon be completely deleted in favor of atomicrmw. The GlobalISel codegen change is broken and substitutes handling as bf16 for handling as f16, but it's a bug that this passed the IRTranslator in the first place.	2024-06-17 21:44:52 +02:00
Matt Arsenault	c894f90c58	AMDGPU: Do not assert on v6x16 buffer load intrinsics (#94966 ) Just use the original type and let it hit a standard legalization error.	2024-06-10 16:38:06 +02:00
Mirko Brkušanin	1e6a82b8ef	[AMDGPU] Legalize and select raw/struct_buffer_load with tfe (#93310 )	2024-05-27 14:09:17 +02:00
Leon Clark	fb2c6597e3	[AMDGPU] Use LSH for lowering ctlz_zero_undef.i8/i16 (#88512 ) Use LSH to lower ctlz_zero_undef instead of subtracting leading zeros for i8 and i16. Related to [77615](https://github.com/llvm/llvm-project/pull/77615). --------- Co-authored-by: Leon Clark <leoclark@amd.com>	2024-05-19 21:45:24 +01:00
Stanislav Mekhanoshin	5d18d575d8	[AMDGPU] Make fneg/fabs/copysign legal for bf16 (#91676 ) These are just bit operations, exactly the same as with f16.	2024-05-10 14:33:47 -07:00
Matt Arsenault	82bb2534d4	AMDGPU: Don't bitcast float typed atomic store in IR (#90116 ) Implement the promotion in the DAG. Depends #90113	2024-05-07 21:43:22 +02:00
Matt Arsenault	7927bcdb8a	AMDGPU: Do not bitcast atomicrmw in IR (#90045 ) This is the first step to eliminating shouldCastAtomicRMWIInIR. This and the other atomic expand casting hooks should be removed. This adds duplicate legalization machinery and interfaces. This is already what codegen is supposed to do, and already does for the promotion case. In the case of atomicrmw xchg, there seems to be some benefit to having the bitcasts moved outside of the cmpxchg loop on targets with separate int and FP registers, which we should be able to deal with by directly checking for the legality of the underlying operation. The casting path was also losing metadata when it recreated the instruction.	2024-05-07 18:26:32 +02:00
Kazu Hirata	c18bcd0a57	[Target] Use StringRef::operator== instead of StringRef::equals (NFC) (#91072 ) (#91138 ) I'm planning to remove StringRef::equals in favor of StringRef::operator==. - StringRef::operator==/!= outnumber StringRef::equals by a factor of 38 under llvm/ in terms of their usage. - The elimination of StringRef::equals brings StringRef closer to std::string_view, which has operator== but not equals. - S == "foo" is more readable than S.equals("foo"), especially for !Long.Expression.equals("str") vs Long.Expression != "str".	2024-05-05 13:43:10 -07:00
Shilei Tian	d47c4984e9	[AMDGPU][ISel] Add more trunc store actions regarding bf16 (#90493 )	2024-04-29 18:27:52 -04:00
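
The shift reduction described in commits 5decab178f and 827f2ad643 above can be checked numerically. The following is only an illustrative Python sketch, not the LLVM implementation: the function names are hypothetical, and it assumes the 32-bit shift amount is the low five bits of Y (equal to Y - 32 on the range 32..63).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def shl64(x: int, y: int) -> int:
    """Reference: 64-bit left shift, truncated to 64 bits."""
    return (x << y) & MASK64


def shl64_via_shl32(x: int, y: int) -> int:
    """Reduced form for 32 <= y <= 63 (hypothetical helper).

    The low 32 bits of the result are always zero; the high 32 bits are a
    32-bit shift of the low half of x by the low five bits of y.
    """
    if not 32 <= y <= 63:
        raise ValueError("reduction only applies when 32 <= y <= 63")
    lo = 0
    hi = ((x & MASK32) << (y & 0x1F)) & MASK32
    return (hi << 32) | lo
```

Only the low half of X contributes, because every bit of the high half is shifted out once the shift amount reaches 32.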

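
Commit 0375ef07c3 above describes `V_CVT_OFF_F32_I4` as treating its input as a 4-bit signed integer and returning `0.0625f * src`. A minimal Python model of that description might look as follows; the function name is hypothetical, and ignoring the bits above the low nibble is an assumption, not something the commit message states.

```python
def cvt_off_f32_i4(src: int) -> float:
    """Model of the described behaviour (hypothetical helper).

    Assumption: only the low 4 bits of src are read. They are sign-extended
    as a two's-complement 4-bit value and scaled by 1/16.
    """
    nibble = src & 0xF
    signed = nibble - 16 if nibble & 0x8 else nibble
    return 0.0625 * signed
```

All sixteen results are multiples of 1/16 in the range -0.5 to 0.4375, so they are exactly representable both as Python floats and as 32-bit floats.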
640 Commits