52796 Commits

Author SHA1 Message Date
Matt Arsenault
160d7227e0 DAG: Fix libcall expansion for frexp on ARM
The ExpandLibcallResult result was a bitcast and not the direct call
result, so we couldn't find the chain. Use the new separate chain
return value instead.
2023-06-30 09:03:45 -04:00
Jingu Kang
eadfa2801b [tests] precommit test for MachineLICM subloops 2023-06-30 12:37:34 +01:00
David Green
d36c81e7f6 [AArch64] Fold tree of offset loads combine
This attempts to fold trees of add(ext(load p), shl(ext(load p+4)) into a
single load of twice the size, that we extract the bottom part and top part so
that the shl can start to use a shll2 instruction. The two loads in that
example can also be larger trees of instructions, which are identical except
for the leaves which are all loads offset from the LHS, including buildvectors
of multiple loads. For example:
sub(zext(buildvec(load p+4, load q+4)), zext(buildvec(load r+4, load s+4)))

Whilst it can be common for the larger loads to replace LDP instructions (which
doesn't gain anything on its own), the larger loads in buildvectors can help
create more efficient code, and prevent the need for ld1 lane inserts which can
be more expensive than continuous loads.

This creates a fairly niche, fairly large combine that attempts to be fairly
general where it is beneficial. It helps some SLP vectorized code to avoid the
use of the more expensive ld1 lane inserting loads.

Differential Revision: https://reviews.llvm.org/D153972
2023-06-30 12:25:07 +01:00
David Green
09f4cedd61 [AArch64] Codegen tests for fold from D153972. NFC 2023-06-30 12:25:06 +01:00
David Green
14f54a594e [DAG][AArch64] Fold shuffle_vector<4,5,6,7> to extract_subvector
During legalization, we can end up with shuffles that are identity masks, so
act like extract_subvector, but do not simplify to extract_subvector. This
adjusts the profitability heuristic in foldExtractSubvectorFromShuffleVector to
allow identity vectors that do not start at element 0. Undef masks elements are
excluded as it can be more useful to keep the undef elements.

Differential Revision: https://reviews.llvm.org/D153504
2023-06-30 11:13:39 +01:00
Sameer Sahasrabuddhe
7a101798b7 Revert "[AMDGPU] Mark mbcnt as convergent"
This reverts commit 37114036aa57e53217a57afacd7f47b36114edfb.

The output of mbcnt does not depend on other active lanes, and hence it is not
convergent. The original change was made as a possible fix for

https://github.com/ROCm-Developer-Tools/HIP/issues/3172

But changing mbcnt does not fix that issue.

Reviewed By: ruiling, foad, yaxunl

Differential Revision: https://reviews.llvm.org/D153953
2023-06-30 13:10:44 +05:30
OverMighty
ea045b99da [AArch64] Add patterns for scalar FMUL, FMULX
Scalar FMUL, FMULX instructions perform better or the same compared to indexed
FMUL, FMULX.

For example, the Arm Cortex-A55 Software Optimization Guide lists the following
instructions with a throughput of 2 IPC:
 - "FP multiply" FMUL
 - "ASIMD FP multiply" FMULX

whereas it lists the following with a throughput of 1 IPC:
 - "ASIMD FP multiply, by element" FMUL, FMULX

The Arm Cortex-A510 Software Optimization Guide, however, does not separately
list "by element" variants of the "ASIMD FP multiply" instructions, which are
listed with the same throughput as the non-ASIMD ones.

Fixes #60817.

Differential Revision: https://reviews.llvm.org/D153207
2023-06-30 08:34:20 +01:00
Ian Douglas Scott
6a4e72b232 [M68k][MC] Add support for 32 bit register-register multiply/divide
Previously when targeting 68020+, instruction selection attempted to
emit a 32-bit register-register multiplication, but failed at instruction
selection. With this, it succeeds.

Differential Revision: https://reviews.llvm.org/D152120
2023-06-29 21:39:41 -07:00
Ben Shi
f40682a930 [CSKY] Optimize subtraction with SUBI32/SUBI16
Reviewed By: zixuan-wu

Differential Revision: https://reviews.llvm.org/D153326
2023-06-30 11:33:20 +08:00
Ting Wang
0b955fee90 [PowerPC][NFC] add SADDO/SSUBO test case
Differential Revision: https://reviews.llvm.org/D152339

Reviewed By: qiucf
2023-06-29 20:35:59 -04:00
Ting Wang
919588fd10 [PowerPC][NFC] expose issue on absol-jump-table-enabled.ll (relocation-model=pic + ppc-use-absolute-jumptables)
Differential Revision: https://reviews.llvm.org/D154047
2023-06-29 20:32:15 -04:00
Brendon Cahoon
853b2a84cb [AMDGPU] Reserve SGPR pair when long branches are present
Branch relaxation requires 2 additional SGPRs for AMDGPU to handle the
case when an indirect branch target is too far away. The register
scavanger may not find available registers, which causes a “did not find
scavenging index” assert to occur in assignRegToScavengingIndex.

In this patch, we estimate before register allocation whether an
indirect branch is likely to be needed, and reserve 2 SGPRs if the
branch distance is found to be above a threshold. The distance threshold
is an approximation as the exact code size and branch distance are
unknown prior to register allocation.

Patch by Corbin Robeck. Thanks!

Differential Review: https://reviews.llvm.org/D149775
2023-06-29 16:50:46 -05:00
Johannes Doerfert
d33bca840a [Attributor] Introduce helpers to judge AAs prior to creation
This is a partial cleanup to centralize the initialization and update
decisions for AAs. Lifting the burdon and boilerplate on users and
making it harder to accidentally perform unsound deductions.

The two static helpers show how we can lift the decisions to generate an
AA into the Attributor, avoiding trivial AAs that just cost us compile
time and maintenance code (to check for pre-conditions).
2023-06-29 12:32:45 -07:00
Philip Reames
92b5a3405d [RISCV] Remove legacy TA/TU pseudo distinction for unary instructions
This change continues with the line of work discussed in https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295. In D153155, we started removing the legacy distinction between unsuffixed (TA) and _TU pseudos. This patch continues that effort for the unary instruction families.

The change consists of a few interacting pieces:
* Adding a vector policy operand to VPseudoUnaryNoMaskTU.
* Then using VPseudoUnaryNoMaskTU for all cases where VPseudoUnaryNoMask was previously used and deleting the unsuffixed form.
* Then renaming VPseudoUnaryNoMaskTU to VPseudoUnaryNoMask, and adjusting the RISCVMaskedPseudo table to use the combined pseudo.
* Fixing up two places in C++ code which manually construct VMV_V_* instructions.

Normally, I'd try to factor this into a couple of changes, but in this case, the table structure is tied to naming and thus we can't really separate the otherwise NFC bits.

As before, we see codegen changes (some improvements and some regressions) due to scheduling differences caused by the extra implicit_def instructions.

Differential Revision: https://reviews.llvm.org/D153899
2023-06-29 07:34:14 -07:00
Simon Pilgrim
34961c600d [X86] LowerTRUNCATE - attempt to use PACKSS/PACKUS on AVX512 targets if the truncation source is concatenating from smaller subvectors
Don't just use AVX512 truncation ops if PACKSS/PACKUS can do this more cheaply
2023-06-29 15:27:41 +01:00
Nikita Popov
d95c2c27f8 [X86] Add tests for PR63475 (NFC) 2023-06-29 15:31:45 +02:00
pvanhout
c59f9eada1 [MCP] Optimize copies from undef
Revert D152502 and instead optimize away copy from undefs, but clear the undef flag on the original copy.
Apparently, not optimizing the COPY can cause performance issues in some cases.

Fixes SWDEV-405813, SWDEV-405899

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D153838
2023-06-29 15:12:27 +02:00
pvanhout
026fc9e9c4 [AMDGPU] Handle Additional Cases in tryFoldPhiAGPR
Sometimes PHI have different incoming values, such as:
 ```
%1:vgpr_256 = COPY %0:agpr_256
%2:vgpr_32 = COPY %1:vgpr_256.sub0
```

Those weren't handled, which could lead to massive performance issues if break-large-PHIs kicked in + AGPRs were used (MFMA)

Fixes SWDEV-407986

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D153879
2023-06-29 14:49:18 +02:00
Ben Shi
6cda80b918 [CSKY][test][NFC] Add tests of ANDI/ORI
These tests will be optimized with BSETI32/BCLRI32
in the future.

Reviewed By: zixuan-wu

Differential Revision: https://reviews.llvm.org/D153613
2023-06-29 19:31:49 +08:00
Yunze Zhu
9d22b54d6b [RISCV] Use temporary stack in expanding SPLAT_VECTOR_SPLIT_I64_VL node
There is an issue: https://github.com/llvm/llvm-project/issues/63515
The issue is because when expanding SPLAT_VECTOR_SPLIT_I64_VL node, only memoperand is used to create dependency.
However in ScheduleDAGNodes, dependency is checked with chain only, and breaks order of store/load instructions.
I think in llvm.bitreverse.nxv2i64 intrinsic SPLAT_VECTOR_SPLIT_I64_VL nodes are parallel processed,
so no chain should be add to these nodes.
Using temporary in expanding SPLAT_VECTOR_SPLIT_I64_VL node can keep vlse instruction get correct value
no matter order of store instructions is changed.

Differential Revision: https://reviews.llvm.org/D153743
2023-06-29 16:45:16 +08:00
Michael Platings
54c79fa53c [test] Replace aarch64-*-eabi with aarch64
Also replace aarch64_be-*-eabi with aarch64_be

Using "eabi" for aarch64 targets is a common mistake and warned by Clang Driver.
We want to avoid it elsewhere as well. Just use the common "aarch64" without
other triple components.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D153943
2023-06-29 09:06:00 +01:00
Juan Manuel MARTINEZ CAAMAÑO
dd1df099ae [InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (2/2)
Before this patch, the compiler gave a bump to the inline-threshold
when the total size of the allocas passed as arguments to the
callee was below 256 bytes.
This heuristic ignores that some of these allocas could have be removed
by SROA if inlining was applied.

Ideally, this bonus would be attributed to the threshold once the
size of all the allocas that could not be handled by SROA is known:
at the end of the InlineCost analysis.
However, we may never reach this point if the inline-cost analysis exits
early when the inline cost goes over the threshold mid-analysis.

This patch proposes:
* Attribute the bonus in the inline-threshold when allocas are passed
  as arguments (regardless of their total size).
* Assigns a cost to each alloca proportional to its size,
  such that the cost of all the allocas cancels the bonus.

Potential problems:
* This patch assumes that removing alloca instructions with SROA is
  always profitable. This may not be the case if the total size of the
  allocas is still too big to be promoted to registers/LDS.
* Redundant calls to getTotalAllocaSize
* Awkwardly, the threshold attributed contributes to the single-bb and
  vector bonus.

Reviewed By: scchan

Differential Revision: https://reviews.llvm.org/D149741
2023-06-29 09:49:16 +02:00
Craig Topper
1c676e08d0 [RISCV] Do a more complete job of disabling extending loads and truncating stores for fixed vector types.
We weren't marking some combinations as Expand if ones of the
types wasn't legal.

Fixes #63596.
2023-06-29 00:23:16 -07:00
Jianjian GUAN
a09a19be58 [RISCV] Update computeKnownBitsForTargetNode for FPCLASS.
The fclass instruction only set one of the low 10 bits.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D154040
2023-06-29 14:13:01 +08:00
4vtomat
02f94a655f [RISCV] Bump vector crypto to v1.0.0-rc1
Differential Revision: https://reviews.llvm.org/D153836
2023-06-28 19:53:07 -07:00
Luke Lau
699e0bed4b [RISCV] Add test cases for vmv.v.vs which could be combined
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D153350
2023-06-28 22:52:51 +01:00
Luke Lau
3e1a75109f [RISCV] Add test cases for insert subvector shuffles for fixed vectors
These cases could have the vmv.v.v folded into the VL of the previous
instruction.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D153030
2023-06-28 22:52:48 +01:00
Luke Lau
742fb8b5c7 [DAGCombine] Fold (store (insert_elt (load p)) x p) -> (store x)
If we have a store of a load with no other uses in between it, it's
considered dead and is removed. So sometimes when legalizing a fixed
length vector store of an insert, we end up producing better code
through scalarization than without.
An example is the follow below:

  %a = load <4 x i64>, ptr %x
  %b = insertelement <4 x i64> %a, i64 %y, i32 2
  store <4 x i64> %b, ptr %x

If this is scalarized, then DAGCombine successfully removes 3 of the 4
stores which are considered dead, and on RISC-V we get:

  sd a1, 16(a0)

However if we make the vector type legal (-mattr=+v), then we lose the
optimisation because we don't scalarize it.

This patch attempts to recover the optimisation for vectors by
identifying patterns where we store a load with a single insert
inbetween, replacing it with a scalar store of the inserted element.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D152276
2023-06-28 22:45:04 +01:00
Luke Lau
9ad29e7b3d [RISCV] Add fixed vector insert tests that are pass by value
So we can still test insert_vector_elt lowering with D152276

Reviewed By: frasercrmck, craig.topper

Differential Revision: https://reviews.llvm.org/D153964
2023-06-28 22:45:02 +01:00
Paul Kirth
75a1797044 Reland [llvm] Preliminary fat-lto-objects support
Fat LTO objects contain both LTO compatible IR, as well as generated
object code. This allows users to defer the choice of whether to use LTO
or not to link-time. This is a feature available in GCC for some time,
and makes the existing -ffat-lto-objects flag functional in the same
way as GCC's.

Within LLVM, we add a new EmbedBitcodePass that serializes the module to
the object file, and expose a new pass pipeline for compiling fat
objects. The new pipeline initially clones the module and runs the
selected (Thin)LTOPrelink pipeline, after which it will serialize the
module into a `.llvm.lto` section of an ELF file. When compiling for
(Thin)LTO, this normally the point at which the compiler would emit a
object file containing the bitcode and metadata.

After that point we compile the original module using the
PerModuleDefaultPipeline used for non-LTO compilation. We generate
standard object files at the end of this pipeline, which contain machine
code and the new `.llvm.lto` section containing bitcode.

Since the two pipelines operate on different copies of the module, we
can be sure that the bitcode in the `.llvm.lto` section and object code
in  `.text` are congruent with the existing output produced by the
default and LTO pipelines.

Original RFC: https://discourse.llvm.org/t/rfc-ffat-lto-objects-support/63977

Earlier versions of this patch were missing REQUIRES lines for llc
related tests in Transforms/EmbedBitcode. Those tests are now under
CodeGen/X86, which should avoid running the check on unsupported
platforms.

The EmbedbBitcodePass also returned PreservedAnalyses::all when adding a
metadata section, which failed expensive checks, since it modified the
module. This is now corrected.

Reviewed By: tejohnson, MaskRay, nikic

Differential Revision: https://reviews.llvm.org/D146776
2023-06-28 21:37:50 +00:00
root
250f2bb2c6 adding bf16 support to NVPTX
Currently, bf16 has been scatteredly added to the PTX codegen. This patch aims to complete the set of instructions and code path required to support bf16 data type.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D144911

Co-authored-by: Artem Belevich <tra@google.com>
2023-06-28 11:57:13 -07:00
Matt Arsenault
003b58f65b IR: Add llvm.frexp intrinsic
Add an intrinsic which returns the two pieces as multiple return
values. Alternatively could introduce a pair of intrinsics to
separately return the fractional and exponent parts.

AMDGPU has native instructions to return the two halves, but could use
some generic legalization and optimization handling. For example, we
should be able to handle legalization of f16 on older targets, and for
bf16. Additionally antique targets need a hardware workaround which
would be better handled in the backend rather than in library code
where it is now.
2023-06-28 14:50:16 -04:00
Matt Arsenault
d7d4aa539c AMDGPU: Move AMDGPUAttributor run earlier
Move it up with other module passes. It's a higher level optimization
that should probably be done before hacking up the IR for codegen. It
should really be done earlier than this. We could possibly move this
with other IPO passes, but we'd have to stop inferring the lack of
lds.kernel.id calls and have the LDS module pass mark functions which
don't need the ID.

The one test change is because that pass is relying on the backend run
of SROA (which we ideally wouldn't have).
2023-06-28 12:42:40 -04:00
Yusra Syeda
1bfdc534aa Revert "[SystemZ][z/OS] This patch adds support for the ADA (associated data area), doing the following:"
This reverts commit 9df0f66af5462e23216eae31aedbd4d2f459cc3d.
2023-06-28 11:18:12 -04:00
Jeffrey Byrnes
be92848ea4 [AMDGPU] NFC: Add schedule-relaxed-occupancy to relax occupancy targets for wave-limited/membound kernels
Default scheduling behavior for these types of kernels is to chase high occupancy goals with scheduling heuristics, but allow occupancy drops if we are unable to reach the target.

This (experimental, off-by-default) feature relaxes occupancy target from the beginning, which enables scheduler to produce better ILP schedules.

Differential Revision: https://reviews.llvm.org/D153925

Change-Id: I112833214e2db869704591f4df3c4574d0fcbb1b
2023-06-28 08:12:31 -07:00
Yusra Syeda
9df0f66af5 [SystemZ][z/OS] This patch adds support for the ADA (associated data area), doing the following:
- Creates the ADA table to handle displacements
- Emits the ADA section in the SystemZAsmPrinter
- Lowers the ADA_ENTRY node into the appropriate load instruction

Differential Revision: https://reviews.llvm.org/D153788
2023-06-28 10:13:10 -04:00
Nikita Popov
8de9c1ab51 [AArch64] Make tests more robust (NFC) 2023-06-28 14:51:55 +02:00
Jingu Kang
0e4d5b1398 [AArch64] Remove vector shift instrinsic with shift amount zero
Differential Revision: https://reviews.llvm.org/D153847
2023-06-28 13:41:19 +01:00
John Brawn
4fb0e0114f [ARM] Generate out-of-line jump tables for XO without 32-bit branch
When we only have a 16-bit pc-relative branch instruction we generate
a table of address for a jump table. Currently this is placed inline,
but this won't work with execute-only memory. In this case generate
the jump table out-of-line.

Differential Revision: https://reviews.llvm.org/D153774
2023-06-28 13:30:39 +01:00
Francesco Petrogalli
f0a290faf8 [MISched] Fix bug(s) in bottom-up scheduling.
BUG 1 - choosing the right cycle when booking a resource.
---------------------------------------------------------

Bottom up scheduling should take in account the current cycle at
the scheduling boundary when determing at what cycle a resource can be
issued. Supposed the schedule boundary is at cycle `C`, and that we
want to check at what cycle a 3 cycles resource can be instantiated.

We have two cases: A, in which the last seen resource cycle LSRC in
which the resource is known to be used is more than oe euqual to 3
cycles away from current cycle `C`, (`C - LSRC >=3`) and B in which
the LSRC is less than 3 cycles away from C (`C - LSRC < 3`). Note
that, in bottom-up scheduling LRS is always smaller or eaual to the
current cycle `C`.

The two cases can be schematized as follow:

```
... | C + 1 | C    | C - 1 | C - 2 | C - 3 | C - 4 | ...
    |       |      |       |       |       | LSRC  |   -> Case A
    |       |      |       | LSRC  |       |       |   -> Case B

// Before allocating the resource
LSRC(A) = C - 4
LSRC(B) = C - 2
```

In case A, the scheduler sees cycles `C`, `C-1` and `C-2` being
available for booking the 3-cycles resource. Therefore the LSRC can be
updated to be `C`, and the resource can be scheduled from cycle `C`
(the `X` in the table):

```
... | C + 1 | C    | C - 1 | C - 2 | C - 3 | C - 4 | ...
    |       | X    | X     | X     |       |       |  -> Case A
// After allocating the resource
LSRC(A) = C
```

In case B, the 3-cycle resource usage would clash with the LSRC if
allocated starting from cycle C:

```
... | C + 1 | C    | C - 1 | C - 2 | C - 3 | C - 4 | ...
    |       | X    | X     | X     |       |       |   -> clash at cycle C - 2
    |       |      |       | LSRC  |       |       |   -> Case B
```

Therefore, the cycle in which the resource can be scheduled needs to
be greater than `C`. For the example, the resource is booked
in cycle `C + 1`.

```
... | C + 1 | C    | C - 1 | C - 2 | C - 3 | C - 4 | ...
    | X     | X    | X     |       |       |       |
// After allocating the resource
LSRC(B) = C + 1
```

The behavior we need to correctly support cases A and B is obtained by
computing the next value of the LSRC as the maximum between:

1. the current cycle `C`;

2. and the previous LSRC plus the number of cycle CYCLES the resource will need.

In formula:

```
LSRC(next) = max(C, LSRC(previous) + CYCLES)
```

BUG 2 - booking the resource for the correct number of cycles.
--------------------------------------------------------------

When storing the next LSRC, the funcion `getNextResourceCycle` was
being invoked setting to 0  the number of cycles a resource was using.
The invocation of `getNextResourceCycle` is now using the values of
`Cycles` instead of 0.

Effects on code generation
--------------------------

This fix have effects only on AArch64, for the Cortex-A55
scheduling model (`-mcpu=cortex-a55`).

The changes in the MIR tests caused by this patch show that the value
now reported by `getNextResourceCycle` is correct.

Other cortex-a55 tests have been touched by this change, where some
instructions have been swapped. The final generated code is equivalent
in term of the total number of cycles. The test
`llvm/test/CodeGen/AArch64/misched-detail-resource-booking-02.mir`
shows in details the correctness of the bottom up scheduling, and the
effect on the codegen change that are visible in the test
`llvm/test/CodeGen/AArch64/aarch64-smull.ll`.

Reviewed By: andreadb, dmgreen

Differential Revision: https://reviews.llvm.org/D153117
2023-06-28 13:27:02 +02:00
Ties Stuij
4f19c6a7c7 [ARM] allow long-call codegen for armv6-M eXecute Only (XO)
Recently eXecute Only (XO) codegen was also allowed for armv6-M. Previously this
was only implemented for ~armv7+, effectively if MOVW/MOVT is
available. Regarding long calls, we remove the check for MOVW/MOVT when
generating code for XO, which already was redundant as in the subtarget
initialization we already check if XO is valid for the target. And targets that
generate valid XO code should be able to handle the (wrapper globaladdress)
node.

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D153782
2023-06-28 10:50:24 +01:00
pvanhout
7007b99340 Revert "[AMDGPU] Use SSAUpdater in PromoteAlloca"
This reverts commit 091bfa76db64fbe96d0e53d99b2068cc05f6aa16.
2023-06-28 11:14:17 +02:00
pvanhout
091bfa76db [AMDGPU] Use SSAUpdater in PromoteAlloca
This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly.

Note PromoteAlloca is still reliant on SROA running first to
canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D152706
2023-06-28 08:12:22 +02:00
Fangrui Song
d39b4ce3ce [test] Replace aarch64-*-eabi with aarch64
Using "eabi" for aarch64 targets is a common mistake and warned by Clang Driver.
We want to avoid it elsewhere as well. Just use the common "aarch64" without
other triple components.
2023-06-27 20:02:52 -07:00
Fangrui Song
ebbfdca586 [test] Replace aarch64-arm-none-eabi with aarch64
Similar to 02e9441d6ca73314afa1973a234dce1e390da1da, but for llvm/test and one
lld/test/ELF test.
2023-06-27 19:36:27 -07:00
WANG Xuerui
c88f27fe1e [LoongArch] Add back SDNPSideEffect properties to CSR and IOCSR read ops
In general, CSR and IOCSR reads should be treated as volatile because:

* there may well be intervening writes between seemingly common
  expressions;
* the stateful entity behind a given (IO)CSR may well be volatile.

Confirmed to fix broken Clang Linux/LoongArch builds (dying when a
userspace process tries to use FPU, panicking when that process happens
to be PID 1) with this patch.

Fixes: https://github.com/llvm/llvm-project/issues/63549
Fixes: 2efdacf74c54 ("[LoongArch] Add missing chains and remove unnecessary `SDNPSideEffect` property for some intrinsic nodes")

Reviewed By: SixWeining, hev

Differential Revision: https://reviews.llvm.org/D153865
2023-06-28 08:25:49 +08:00
Rahman Lavaee
c13b046de3 [Propeller] Match debug info filenames from profiles to distinguish internal linkage functions with the same names.
Basic block sections profiles are ingested based on the function name. However, conflicts may occur when internal linkage functions with the same symbol name are linked into the binary (for instance static functions defined in different modules). Currently, these functions cannot be optimized unless we use `-funique-internal-linkage-names` (D89617) to enforce unique symbol names.

However, we have found that `-funique-internal-linkage-names` does not play well with inline assembly code which refers to the symbol via its symbol name. For example, the Linux kernel does not build with this option.

This patch implements a new feature which allows differentiating profiles based on the debug info filenames associated with each function. When specified, the given path is compared against the debug info filename of the matching function and profile is ingested only when the debug info filenames match. Backward-compatibility is guaranteed as omitting the specifiers from the profile would allow them to be matched by function name only. Also specifiers can be included for a subset of functions only.

Reviewed By: shenhan

Differential Revision: https://reviews.llvm.org/D146770
2023-06-27 21:28:28 +00:00
FLZ101
32e4013dd4 [AArch64][SelectionDAG] fix infinite loop caused by legalizing & combining CONCAT_VECTORS
Legalizing in `AArch64TargetLowering::LowerCONCAT_VECTORS()` and combining in `DAGCombiner::visitCONCAT_VECTORS()` could cause an infinite loop.
This commit fixes that issue by conditionally skipping the combining.

Fix https://github.com/llvm/llvm-project/issues/63322

Reviewed By: RKSimon, MaskRay

Differential Revision: https://reviews.llvm.org/D153316
2023-06-27 13:57:41 -07:00
Simon Pilgrim
d07ff1d109 [X86] LowerABD - improve pre-SSE41 handling for v16i8/v4i32 nodes
The generic expansion still causes a problem for SSE targets without BLENDV/select node, but we can create a custom lowering until that can be addressed.
2023-06-27 18:39:47 +01:00
Simon Pilgrim
2d3792bef0 [X86] Fold ANDNP(x, -1) -> NOT(x) -> XOR(x, -1)
Prefer XOR to ANDNP as its commutative
2023-06-27 18:16:51 +01:00