llvm-project

Author	SHA1	Message	Date
Matt Arsenault	160d7227e0	DAG: Fix libcall expansion for frexp on ARM The ExpandLibcallResult result was a bitcast and not the direct call result, so we couldn't find the chain. Use the new separate chain return value instead.	2023-06-30 09:03:45 -04:00
Jingu Kang	eadfa2801b	[tests] precommit test for MachineLICM subloops	2023-06-30 12:37:34 +01:00
David Green	d36c81e7f6	[AArch64] Fold tree of offset loads combine This attempts to fold trees of add(ext(load p), shl(ext(load p+4)) into a single load of twice the size, that we extract the bottom part and top part so that the shl can start to use a shll2 instruction. The two loads in that example can also be larger trees of instructions, which are identical except for the leaves which are all loads offset from the LHS, including buildvectors of multiple loads. For example: sub(zext(buildvec(load p+4, load q+4)), zext(buildvec(load r+4, load s+4))) Whilst it can be common for the larger loads to replace LDP instructions (which doesn't gain anything on its own), the larger loads in buildvectors can help create more efficient code, and prevent the need for ld1 lane inserts which can be more expensive than continuous loads. This creates a fairly niche, fairly large combine that attempts to be fairly general where it is beneficial. It helps some SLP vectorized code to avoid the use of the more expensive ld1 lane inserting loads. Differential Revision: https://reviews.llvm.org/D153972	2023-06-30 12:25:07 +01:00
David Green	09f4cedd61	[AArch64] Codegen tests for fold from D153972. NFC	2023-06-30 12:25:06 +01:00
David Green	14f54a594e	[DAG][AArch64] Fold shuffle_vector<4,5,6,7> to extract_subvector During legalization, we can end up with shuffles that are identity masks, so act like extract_subvector, but do not simplify to extract_subvector. This adjusts the profitability heuristic in foldExtractSubvectorFromShuffleVector to allow identity vectors that do not start at element 0. Undef mask elements are excluded as it can be more useful to keep the undef elements. Differential Revision: https://reviews.llvm.org/D153504	2023-06-30 11:13:39 +01:00
Sameer Sahasrabuddhe	7a101798b7	Revert "[AMDGPU] Mark mbcnt as convergent" This reverts commit 37114036aa57e53217a57afacd7f47b36114edfb. The output of mbcnt does not depend on other active lanes, and hence it is not convergent. The original change was made as a possible fix for https://github.com/ROCm-Developer-Tools/HIP/issues/3172 But changing mbcnt does not fix that issue. Reviewed By: ruiling, foad, yaxunl Differential Revision: https://reviews.llvm.org/D153953	2023-06-30 13:10:44 +05:30
OverMighty	ea045b99da	[AArch64] Add patterns for scalar FMUL, FMULX Scalar FMUL, FMULX instructions perform better or the same compared to indexed FMUL, FMULX. For example, the Arm Cortex-A55 Software Optimization Guide lists the following instructions with a throughput of 2 IPC: - "FP multiply" FMUL - "ASIMD FP multiply" FMULX whereas it lists the following with a throughput of 1 IPC: - "ASIMD FP multiply, by element" FMUL, FMULX The Arm Cortex-A510 Software Optimization Guide, however, does not separately list "by element" variants of the "ASIMD FP multiply" instructions, which are listed with the same throughput as the non-ASIMD ones. Fixes #60817. Differential Revision: https://reviews.llvm.org/D153207	2023-06-30 08:34:20 +01:00
Ian Douglas Scott	6a4e72b232	[M68k][MC] Add support for 32 bit register-register multiply/divide Previously when targeting 68020+, instruction selection attempted to emit a 32-bit register-register multiplication, but failed at instruction selection. With this, it succeeds. Differential Revision: https://reviews.llvm.org/D152120	2023-06-29 21:39:41 -07:00
Ben Shi	f40682a930	[CSKY] Optimize subtraction with SUBI32/SUBI16 Reviewed By: zixuan-wu Differential Revision: https://reviews.llvm.org/D153326	2023-06-30 11:33:20 +08:00
Ting Wang	0b955fee90	[PowerPC][NFC] add SADDO/SSUBO test case Differential Revision: https://reviews.llvm.org/D152339 Reviewed By: qiucf	2023-06-29 20:35:59 -04:00
Ting Wang	919588fd10	[PowerPC][NFC] expose issue on absol-jump-table-enabled.ll (relocation-model=pic + ppc-use-absolute-jumptables) Differential Revision: https://reviews.llvm.org/D154047	2023-06-29 20:32:15 -04:00
Brendon Cahoon	853b2a84cb	[AMDGPU] Reserve SGPR pair when long branches are present Branch relaxation requires 2 additional SGPRs for AMDGPU to handle the case when an indirect branch target is too far away. The register scavenger may not find available registers, which causes a “did not find scavenging index” assert to occur in assignRegToScavengingIndex. In this patch, we estimate before register allocation whether an indirect branch is likely to be needed, and reserve 2 SGPRs if the branch distance is found to be above a threshold. The distance threshold is an approximation as the exact code size and branch distance are unknown prior to register allocation. Patch by Corbin Robeck. Thanks! Differential Review: https://reviews.llvm.org/D149775	2023-06-29 16:50:46 -05:00
Johannes Doerfert	d33bca840a	[Attributor] Introduce helpers to judge AAs prior to creation This is a partial cleanup to centralize the initialization and update decisions for AAs, lifting the burden and boilerplate from users and making it harder to accidentally perform unsound deductions. The two static helpers show how we can lift the decisions to generate an AA into the Attributor, avoiding trivial AAs that just cost us compile time and maintenance code (to check for pre-conditions).	2023-06-29 12:32:45 -07:00
Philip Reames	92b5a3405d	[RISCV] Remove legacy TA/TU pseudo distinction for unary instructions This change continues with the line of work discussed in https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295. In D153155, we started removing the legacy distinction between unsuffixed (TA) and _TU pseudos. This patch continues that effort for the unary instruction families. The change consists of a few interacting pieces: * Adding a vector policy operand to VPseudoUnaryNoMaskTU. * Then using VPseudoUnaryNoMaskTU for all cases where VPseudoUnaryNoMask was previously used and deleting the unsuffixed form. * Then renaming VPseudoUnaryNoMaskTU to VPseudoUnaryNoMask, and adjusting the RISCVMaskedPseudo table to use the combined pseudo. * Fixing up two places in C++ code which manually construct VMV_V_* instructions. Normally, I'd try to factor this into a couple of changes, but in this case, the table structure is tied to naming and thus we can't really separate the otherwise NFC bits. As before, we see codegen changes (some improvements and some regressions) due to scheduling differences caused by the extra implicit_def instructions. Differential Revision: https://reviews.llvm.org/D153899	2023-06-29 07:34:14 -07:00
Simon Pilgrim	34961c600d	[X86] LowerTRUNCATE - attempt to use PACKSS/PACKUS on AVX512 targets if the truncation source is concatenating from smaller subvectors Don't just use AVX512 truncation ops if PACKSS/PACKUS can do this more cheaply	2023-06-29 15:27:41 +01:00
Nikita Popov	d95c2c27f8	[X86] Add tests for PR63475 (NFC)	2023-06-29 15:31:45 +02:00
pvanhout	c59f9eada1	[MCP] Optimize copies from undef Revert D152502 and instead optimize away copy from undefs, but clear the undef flag on the original copy. Apparently, not optimizing the COPY can cause performance issues in some cases. Fixes SWDEV-405813, SWDEV-405899 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D153838	2023-06-29 15:12:27 +02:00
pvanhout	026fc9e9c4	[AMDGPU] Handle Additional Cases in tryFoldPhiAGPR Sometimes PHI have different incoming values, such as: ``` %1:vgpr_256 = COPY %0:agpr_256 %2:vgpr_32 = COPY %1:vgpr_256.sub0 ``` Those weren't handled, which could lead to massive performance issues if break-large-PHIs kicked in + AGPRs were used (MFMA) Fixes SWDEV-407986 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D153879	2023-06-29 14:49:18 +02:00
Ben Shi	6cda80b918	[CSKY][test][NFC] Add tests of ANDI/ORI These tests will be optimized with BSETI32/BCLRI32 in the future. Reviewed By: zixuan-wu Differential Revision: https://reviews.llvm.org/D153613	2023-06-29 19:31:49 +08:00
Yunze Zhu	9d22b54d6b	[RISCV] Use temporary stack in expanding SPLAT_VECTOR_SPLIT_I64_VL node There is an issue: https://github.com/llvm/llvm-project/issues/63515 The issue arises because, when expanding a SPLAT_VECTOR_SPLIT_I64_VL node, only the memoperand is used to create the dependency. However, in ScheduleDAGNodes the dependency is checked with the chain only, which breaks the order of the store/load instructions. I think in the llvm.bitreverse.nxv2i64 intrinsic the SPLAT_VECTOR_SPLIT_I64_VL nodes are processed in parallel, so no chain should be added to these nodes. Using a temporary stack slot when expanding the SPLAT_VECTOR_SPLIT_I64_VL node lets the vlse instruction get the correct value even if the order of the store instructions changes. Differential Revision: https://reviews.llvm.org/D153743	2023-06-29 16:45:16 +08:00
Michael Platings	54c79fa53c	[test] Replace aarch64--eabi with aarch64 Also replace aarch64_be--eabi with aarch64_be Using "eabi" for aarch64 targets is a common mistake and warned by Clang Driver. We want to avoid it elsewhere as well. Just use the common "aarch64" without other triple components. Reviewed By: MaskRay Differential Revision: https://reviews.llvm.org/D153943	2023-06-29 09:06:00 +01:00
Juan Manuel MARTINEZ CAAMAÑO	dd1df099ae	[InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (2/2) Before this patch, the compiler gave a bump to the inline-threshold when the total size of the allocas passed as arguments to the callee was below 256 bytes. This heuristic ignores that some of these allocas could have been removed by SROA if inlining was applied. Ideally, this bonus would be attributed to the threshold once the size of all the allocas that could not be handled by SROA is known: at the end of the InlineCost analysis. However, we may never reach this point if the inline-cost analysis exits early when the inline cost goes over the threshold mid-analysis. This patch proposes: * Attribute the bonus in the inline-threshold when allocas are passed as arguments (regardless of their total size). * Assign a cost to each alloca proportional to its size, such that the cost of all the allocas cancels the bonus. Potential problems: * This patch assumes that removing alloca instructions with SROA is always profitable. This may not be the case if the total size of the allocas is still too big to be promoted to registers/LDS. * Redundant calls to getTotalAllocaSize * Awkwardly, the threshold attributed contributes to the single-bb and vector bonus. Reviewed By: scchan Differential Revision: https://reviews.llvm.org/D149741	2023-06-29 09:49:16 +02:00
Craig Topper	1c676e08d0	[RISCV] Do a more complete job of disabling extending loads and truncating stores for fixed vector types. We weren't marking some combinations as Expand if one of the types wasn't legal. Fixes #63596.	2023-06-29 00:23:16 -07:00
Jianjian GUAN	a09a19be58	[RISCV] Update computeKnownBitsForTargetNode for FPCLASS. The fclass instruction only sets one of the low 10 bits. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D154040	2023-06-29 14:13:01 +08:00
4vtomat	02f94a655f	[RISCV] Bump vector crypto to v1.0.0-rc1 Differential Revision: https://reviews.llvm.org/D153836	2023-06-28 19:53:07 -07:00
Luke Lau	699e0bed4b	[RISCV] Add test cases for vmv.v.vs which could be combined Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D153350	2023-06-28 22:52:51 +01:00
Luke Lau	3e1a75109f	[RISCV] Add test cases for insert subvector shuffles for fixed vectors These cases could have the vmv.v.v folded into the VL of the previous instruction. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D153030	2023-06-28 22:52:48 +01:00
Luke Lau	742fb8b5c7	[DAGCombine] Fold (store (insert_elt (load p)) x p) -> (store x) If we have a store of a load with no other uses in between it, it's considered dead and is removed. So sometimes when legalizing a fixed length vector store of an insert, we end up producing better code through scalarization than without. An example is the following: %a = load <4 x i64>, ptr %x %b = insertelement <4 x i64> %a, i64 %y, i32 2 store <4 x i64> %b, ptr %x If this is scalarized, then DAGCombine successfully removes 3 of the 4 stores which are considered dead, and on RISC-V we get: sd a1, 16(a0) However if we make the vector type legal (-mattr=+v), then we lose the optimisation because we don't scalarize it. This patch attempts to recover the optimisation for vectors by identifying patterns where we store a load with a single insert in between, replacing it with a scalar store of the inserted element. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D152276	2023-06-28 22:45:04 +01:00
Luke Lau	9ad29e7b3d	[RISCV] Add fixed vector insert tests that are pass by value So we can still test insert_vector_elt lowering with D152276 Reviewed By: frasercrmck, craig.topper Differential Revision: https://reviews.llvm.org/D153964	2023-06-28 22:45:02 +01:00
Paul Kirth	75a1797044	Reland [llvm] Preliminary fat-lto-objects support Fat LTO objects contain both LTO compatible IR, as well as generated object code. This allows users to defer the choice of whether to use LTO or not to link-time. This is a feature available in GCC for some time, and makes the existing -ffat-lto-objects flag functional in the same way as GCC's. Within LLVM, we add a new EmbedBitcodePass that serializes the module to the object file, and expose a new pass pipeline for compiling fat objects. The new pipeline initially clones the module and runs the selected (Thin)LTOPrelink pipeline, after which it will serialize the module into a `.llvm.lto` section of an ELF file. When compiling for (Thin)LTO, this is normally the point at which the compiler would emit an object file containing the bitcode and metadata. After that point we compile the original module using the PerModuleDefaultPipeline used for non-LTO compilation. We generate standard object files at the end of this pipeline, which contain machine code and the new `.llvm.lto` section containing bitcode. Since the two pipelines operate on different copies of the module, we can be sure that the bitcode in the `.llvm.lto` section and object code in `.text` are congruent with the existing output produced by the default and LTO pipelines. Original RFC: https://discourse.llvm.org/t/rfc-ffat-lto-objects-support/63977 Earlier versions of this patch were missing REQUIRES lines for llc related tests in Transforms/EmbedBitcode. Those tests are now under CodeGen/X86, which should avoid running the check on unsupported platforms. The EmbedBitcodePass also returned PreservedAnalyses::all when adding a metadata section, which failed expensive checks, since it modified the module. This is now corrected. Reviewed By: tejohnson, MaskRay, nikic Differential Revision: https://reviews.llvm.org/D146776	2023-06-28 21:37:50 +00:00
root	250f2bb2c6	adding bf16 support to NVPTX Currently, bf16 has been scatteredly added to the PTX codegen. This patch aims to complete the set of instructions and code path required to support bf16 data type. Reviewed By: tra Differential Revision: https://reviews.llvm.org/D144911 Co-authored-by: Artem Belevich <tra@google.com>	2023-06-28 11:57:13 -07:00
Matt Arsenault	003b58f65b	IR: Add llvm.frexp intrinsic Add an intrinsic which returns the two pieces as multiple return values. Alternatively could introduce a pair of intrinsics to separately return the fractional and exponent parts. AMDGPU has native instructions to return the two halves, but could use some generic legalization and optimization handling. For example, we should be able to handle legalization of f16 on older targets, and for bf16. Additionally antique targets need a hardware workaround which would be better handled in the backend rather than in library code where it is now.	2023-06-28 14:50:16 -04:00
Matt Arsenault	d7d4aa539c	AMDGPU: Move AMDGPUAttributor run earlier Move it up with other module passes. It's a higher level optimization that should probably be done before hacking up the IR for codegen. It should really be done earlier than this. We could possibly move this with other IPO passes, but we'd have to stop inferring the lack of lds.kernel.id calls and have the LDS module pass mark functions which don't need the ID. The one test change is because that pass is relying on the backend run of SROA (which we ideally wouldn't have).	2023-06-28 12:42:40 -04:00
Yusra Syeda	1bfdc534aa	Revert "[SystemZ][z/OS] This patch adds support for the ADA (associated data area), doing the following:" This reverts commit 9df0f66af5462e23216eae31aedbd4d2f459cc3d.	2023-06-28 11:18:12 -04:00
Jeffrey Byrnes	be92848ea4	[AMDGPU] NFC: Add schedule-relaxed-occupancy to relax occupancy targets for wave-limited/membound kernels Default scheduling behavior for these types of kernels is to chase high occupancy goals with scheduling heuristics, but allow occupancy drops if we are unable to reach the target. This (experimental, off-by-default) feature relaxes occupancy target from the beginning, which enables scheduler to produce better ILP schedules. Differential Revision: https://reviews.llvm.org/D153925 Change-Id: I112833214e2db869704591f4df3c4574d0fcbb1b	2023-06-28 08:12:31 -07:00
Yusra Syeda	9df0f66af5	[SystemZ][z/OS] This patch adds support for the ADA (associated data area), doing the following: - Creates the ADA table to handle displacements - Emits the ADA section in the SystemZAsmPrinter - Lowers the ADA_ENTRY node into the appropriate load instruction Differential Revision: https://reviews.llvm.org/D153788	2023-06-28 10:13:10 -04:00
Nikita Popov	8de9c1ab51	[AArch64] Make tests more robust (NFC)	2023-06-28 14:51:55 +02:00
Jingu Kang	0e4d5b1398	[AArch64] Remove vector shift intrinsic with shift amount zero Differential Revision: https://reviews.llvm.org/D153847	2023-06-28 13:41:19 +01:00
John Brawn	4fb0e0114f	[ARM] Generate out-of-line jump tables for XO without 32-bit branch When we only have a 16-bit pc-relative branch instruction we generate a table of addresses for a jump table. Currently this is placed inline, but this won't work with execute-only memory. In this case generate the jump table out-of-line. Differential Revision: https://reviews.llvm.org/D153774	2023-06-28 13:30:39 +01:00
Francesco Petrogalli	f0a290faf8	[MISched] Fix bug(s) in bottom-up scheduling. BUG 1 - choosing the right cycle when booking a resource. --------------------------------------------------------- Bottom up scheduling should take into account the current cycle at the scheduling boundary when determining at what cycle a resource can be issued. Suppose the scheduling boundary is at cycle `C`, and that we want to check at what cycle a 3-cycle resource can be instantiated. We have two cases: A, in which the last seen resource cycle LSRC in which the resource is known to be used is greater than or equal to 3 cycles away from the current cycle `C` (`C - LSRC >= 3`), and B, in which the LSRC is less than 3 cycles away from C (`C - LSRC < 3`). Note that, in bottom-up scheduling, LSRC is always smaller than or equal to the current cycle `C`. The two cases can be schematized as follows: ``` ... | C + 1 | C | C - 1 | C - 2 | C - 3 | C - 4 | ... | | | | | | LSRC | -> Case A | | | | LSRC | | | -> Case B // Before allocating the resource LSRC(A) = C - 4 LSRC(B) = C - 2 ``` In case A, the scheduler sees cycles `C`, `C-1` and `C-2` being available for booking the 3-cycle resource. Therefore the LSRC can be updated to be `C`, and the resource can be scheduled from cycle `C` (the `X` in the table): ``` ... | C + 1 | C | C - 1 | C - 2 | C - 3 | C - 4 | ... | | X | X | X | | | -> Case A // After allocating the resource LSRC(A) = C ``` In case B, the 3-cycle resource usage would clash with the LSRC if allocated starting from cycle C: ``` ... | C + 1 | C | C - 1 | C - 2 | C - 3 | C - 4 | ... | | X | X | X | | | -> clash at cycle C - 2 | | | | LSRC | | | -> Case B ``` Therefore, the cycle in which the resource can be scheduled needs to be greater than `C`. For the example, the resource is booked in cycle `C + 1`. ``` ... | C + 1 | C | C - 1 | C - 2 | C - 3 | C - 4 | ... 
| X | X | X | | | | // After allocating the resource LSRC(B) = C + 1 ``` The behavior we need to correctly support cases A and B is obtained by computing the next value of the LSRC as the maximum between: 1. the current cycle `C`; 2. and the previous LSRC plus the number of cycles CYCLES the resource will need. In formula: ``` LSRC(next) = max(C, LSRC(previous) + CYCLES) ``` BUG 2 - booking the resource for the correct number of cycles. -------------------------------------------------------------- When storing the next LSRC, the function `getNextResourceCycle` was being invoked setting to 0 the number of cycles a resource was using. The invocation of `getNextResourceCycle` is now using the value of `Cycles` instead of 0. Effects on code generation -------------------------- This fix has effects only on AArch64, for the Cortex-A55 scheduling model (`-mcpu=cortex-a55`). The changes in the MIR tests caused by this patch show that the value now reported by `getNextResourceCycle` is correct. Other cortex-a55 tests have been touched by this change, where some instructions have been swapped. The final generated code is equivalent in terms of the total number of cycles. The test `llvm/test/CodeGen/AArch64/misched-detail-resource-booking-02.mir` shows in detail the correctness of the bottom up scheduling, and the effect on the codegen changes that are visible in the test `llvm/test/CodeGen/AArch64/aarch64-smull.ll`. Reviewed By: andreadb, dmgreen Differential Revision: https://reviews.llvm.org/D153117	2023-06-28 13:27:02 +02:00
Ties Stuij	4f19c6a7c7	[ARM] allow long-call codegen for armv6-M eXecute Only (XO) Recently eXecute Only (XO) codegen was also allowed for armv6-M. Previously this was only implemented for ~armv7+, effectively if MOVW/MOVT is available. Regarding long calls, we remove the check for MOVW/MOVT when generating code for XO, which already was redundant as in the subtarget initialization we already check if XO is valid for the target. And targets that generate valid XO code should be able to handle the (wrapper globaladdress) node. Reviewed By: efriedma Differential Revision: https://reviews.llvm.org/D153782	2023-06-28 10:50:24 +01:00
pvanhout	7007b99340	Revert "[AMDGPU] Use SSAUpdater in PromoteAlloca" This reverts commit 091bfa76db64fbe96d0e53d99b2068cc05f6aa16.	2023-06-28 11:14:17 +02:00
pvanhout	091bfa76db	[AMDGPU] Use SSAUpdater in PromoteAlloca This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly. Note PromoteAlloca is still reliant on SROA running first to canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D152706	2023-06-28 08:12:22 +02:00
Fangrui Song	d39b4ce3ce	[test] Replace aarch64-*-eabi with aarch64 Using "eabi" for aarch64 targets is a common mistake and warned by Clang Driver. We want to avoid it elsewhere as well. Just use the common "aarch64" without other triple components.	2023-06-27 20:02:52 -07:00
Fangrui Song	ebbfdca586	[test] Replace aarch64-arm-none-eabi with aarch64 Similar to 02e9441d6ca73314afa1973a234dce1e390da1da, but for llvm/test and one lld/test/ELF test.	2023-06-27 19:36:27 -07:00
WANG Xuerui	c88f27fe1e	[LoongArch] Add back SDNPSideEffect properties to CSR and IOCSR read ops In general, CSR and IOCSR reads should be treated as volatile because: * there may well be intervening writes between seemingly common expressions; * the stateful entity behind a given (IO)CSR may well be volatile. Confirmed to fix broken Clang Linux/LoongArch builds (dying when a userspace process tries to use FPU, panicking when that process happens to be PID 1) with this patch. Fixes: https://github.com/llvm/llvm-project/issues/63549 Fixes: 2efdacf74c54 ("[LoongArch] Add missing chains and remove unnecessary `SDNPSideEffect` property for some intrinsic nodes") Reviewed By: SixWeining, hev Differential Revision: https://reviews.llvm.org/D153865	2023-06-28 08:25:49 +08:00
Rahman Lavaee	c13b046de3	[Propeller] Match debug info filenames from profiles to distinguish internal linkage functions with the same names. Basic block sections profiles are ingested based on the function name. However, conflicts may occur when internal linkage functions with the same symbol name are linked into the binary (for instance static functions defined in different modules). Currently, these functions cannot be optimized unless we use `-funique-internal-linkage-names` (D89617) to enforce unique symbol names. However, we have found that `-funique-internal-linkage-names` does not play well with inline assembly code which refers to the symbol via its symbol name. For example, the Linux kernel does not build with this option. This patch implements a new feature which allows differentiating profiles based on the debug info filenames associated with each function. When specified, the given path is compared against the debug info filename of the matching function and profile is ingested only when the debug info filenames match. Backward-compatibility is guaranteed as omitting the specifiers from the profile would allow them to be matched by function name only. Also specifiers can be included for a subset of functions only. Reviewed By: shenhan Differential Revision: https://reviews.llvm.org/D146770	2023-06-27 21:28:28 +00:00
FLZ101	32e4013dd4	[AArch64][SelectionDAG] fix infinite loop caused by legalizing & combining CONCAT_VECTORS Legalizing in `AArch64TargetLowering::LowerCONCAT_VECTORS()` and combining in `DAGCombiner::visitCONCAT_VECTORS()` could cause an infinite loop. This commit fixes that issue by conditionally skipping the combining. Fix https://github.com/llvm/llvm-project/issues/63322 Reviewed By: RKSimon, MaskRay Differential Revision: https://reviews.llvm.org/D153316	2023-06-27 13:57:41 -07:00
Simon Pilgrim	d07ff1d109	[X86] LowerABD - improve pre-SSE41 handling for v16i8/v4i32 nodes The generic expansion still causes a problem for SSE targets without BLENDV/select node, but we can create a custom lowering until that can be addressed.	2023-06-27 18:39:47 +01:00
Simon Pilgrim	2d3792bef0	[X86] Fold ANDNP(x, -1) -> NOT(x) -> XOR(x, -1) Prefer XOR to ANDNP as it's commutative	2023-06-27 18:16:51 +01:00
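
Note on f0a290faf8 ([MISched] Fix bug(s) in bottom-up scheduling): the update rule quoted in that commit message, `LSRC(next) = max(C, LSRC(previous) + CYCLES)`, can be checked against its two worked cases with a small sketch. This is only an illustration of the formula, not LLVM's code: the function name `next_lsrc` and the concrete value `C = 10` are assumptions made for the example, and the signature does not mirror `getNextResourceCycle`.

```python
def next_lsrc(current_cycle: int, previous_lsrc: int, cycles: int) -> int:
    """Hypothetical helper: LSRC(next) = max(C, LSRC(previous) + CYCLES)."""
    return max(current_cycle, previous_lsrc + cycles)


C = 10       # assumed current cycle at the scheduling boundary
CYCLES = 3   # the 3-cycle resource from the commit message

# Case A: LSRC = C - 4, far enough away, so the resource is booked at C.
case_a = next_lsrc(C, C - 4, CYCLES)

# Case B: LSRC = C - 2, booking at C would clash, so the result is C + 1.
case_b = next_lsrc(C, C - 2, CYCLES)

# BUG 2 as described: passing 0 cycles drops the C + 1 adjustment in case B.
case_b_zero_cycles = next_lsrc(C, C - 2, 0)

print(case_a, case_b, case_b_zero_cycles)
```

With `CYCLES` passed as 0, case B collapses back to `C`, which is the clash the commit message describes under BUG 2.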

52796 Commits