30286 Commits

Author SHA1 Message Date
Alexey Bataev
76422385c3
[SLP]Support reordered buildvector nodes for better clustering
Patch adds reordering of the buildvector nodes for better clustering of
the compatible operations and future vectorization. Includes basic cost
estimation and if the transformation is not profitable - reverts it.

AVX512, -O3+LTO
Metric: size..text

Program                                                                          size..text
                                                                                       results     results0    diff
                        test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test    74565.00    75701.00  1.5%
                test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    75773.00    76397.00  0.8%
               test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    75773.00    76397.00  0.8%
               test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2014462.00  2024494.00  0.5%
                         test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   395219.00   396979.00  0.4%
                         test-suite :: MultiSource/Applications/JM/lencod/lencod.test   857795.00   859667.00  0.2%
                    test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   800472.00   802440.00  0.2%
                       test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   590699.00   591403.00  0.1%
        test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   203006.00   203102.00  0.0%
            test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    42408.00    42424.00  0.0%
            test-suite ::  External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12451575.00  12451927.00  0.0%
            test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1396480.00  1396448.00 -0.0%
             test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1396480.00  1396448.00 -0.0%
                        test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1047708.00  1047580.00 -0.0%
        test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   111344.00   111328.00 -0.0%
                test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1087660.00  1087500.00 -0.0%
       test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   280664.00   280616.00 -0.0%
                          test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   502646.00   502006.00 -0.1%
                      test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1033135.00  1031567.00 -0.2%
        test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2070917.00  2065845.00 -0.2%
       test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2070917.00  2065845.00 -0.2%
                        test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test    33893.00    33797.00 -0.3%
          test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39677.00    39549.00 -0.3%
                 test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39674.00    39546.00 -0.3%
test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test    11560.00    11512.00 -0.4%
                 test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   653867.00   649275.00 -0.7%
                  test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   653867.00   649275.00 -0.7%

CINT2006/401.bzip2 - extra code vectorized
CINT2017rate/541.leela_r
CINT2017speed/641.leela_s - function
_ZN9FastBoard25get_pattern3_augment_specEiib not inlined anymore, better
vectorization
CFP2017rate/510.parest_r - better vectorization
JM/ldecod - better vectorization
JM/lencod - same
CINT2006/464.h264ref - extra code vectorized
CFP2006/447.dealII - extra vector code
MiBench/consumer-lame - vectorized 2 loops previously scalar
DOE-ProxyApps-C/miniGMG - small changes
Benchmarks/7zip - extra code vectorized, better vectorization
CFP2017rate/526.blender_r - extra vectorization
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - extra vectorization
MiBench/consumer-jpeg - extra vectorization
CINT2006/400.perlbench - extra vectorization
Prolangs-C/TimberWolfMC - small variations
Applications/sqlite3 - extra function vectorized and inlined
Benchmarks/tramp3d-v4 - extra code vectorized
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - extra code vectorized, function digcpy gets
vectorized and inlined
CINT2006/473.astar - extra code vectorized
MiBench/telecomm-gsm - extra code vectorized, better vector code
mediabench/gsm - same
MiBench/security-blowfish - extra code vectorized
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - sub4x4_dct function vectorized and gets
inlined

RISCV-V, SiFive-p670, O3+LTO

CFP2017rate/510.parest_r - extra vectorization
CFP2017rate/526.blender_r - extra vectorization
MiBench/consumer-lame - extra vectorized code

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/114284
2024-11-06 10:51:15 -05:00
Hari Limaye
fbd89bcc66
Reland "[LTO] Run Argument Promotion before IPSCCP" (#111853)
Run ArgumentPromotion before IPSCCP in the LTO pipeline, to expose more
constants to be propagated. We also run PostOrderFunctionAttrs to
improve the information available to ArgumentPromotion's alias analysis,
and SROA to clean up allocas.

Relands #111163.
2024-11-06 13:54:48 +00:00
Hari Limaye
88e9b373c0
[FuncSpec] Query SCCPSolver in more places (#114964)
When traversing the use-def chain of an Argument in a candidate
specialization, also query the SCCPSolver to see if a Value is constant.
This allows us to better estimate the codesize savings of a candidate in
the presence of instructions that are a user of the argument we are
estimating savings for which also use arguments that have been found
constant by IPSCCP.

Similarly when estimating the dead basic blocks from branch and switch
instructions which become constant, also query the SCCPSolver to see if
a predecessor is unreachable.
2024-11-06 13:25:15 +00:00
Simon Pilgrim
e3a0775651 [VectorCombine] foldExtractedCmps - (re-)enable fold on non-commutative binops
#114901 exposed that foldExtractedCmps didn't account for non-commutative binops, and were disabled by 05e838f428555bcc4507bd37912da60ea9110ef6

This patch re-enables support for non-commutative binops by ensuring that the LHS/RHS arg order of the binop is retained.
2024-11-06 12:10:31 +00:00
Paul Walker
38fffa630e
[LLVM][IR] Use splat syntax when printing Constant[Data]Vector. (#112548) 2024-11-06 11:53:33 +00:00
Simon Pilgrim
d8354d63db [VectorCombine] Extend test coverage for #114901 with commuted test case 2024-11-06 11:27:05 +00:00
Mel Chen
4480a22c2b
[LV][EVL] Emit vp.merge intrinsic to enable out-loop reduction in EVL vectorization. (#101641)
Following #90184, this patch emits vp.merge intrinsic, which is used to
set the inactive lanes in a select operation to the RHS instead of
undef. Currently, it is applied to out-loop reduction for EVL
vectorization.

This patch performs transformation to convert 
  select(header_mask, LHS, RHS)
into
  vp.merge(all-true, LHS, RHS, EVL) 
And always use the predicated reduction select to set the incoming value
of the reduction phi to support out-loop reduction when using tail
folding with EVL.

TODO: Postpone the adjustment of the predicated reduction select to
VPlanTransform. The current adjustment might be too early, which could
lead to a situation where the predicated reduction select is adjusted,
but the EVL recipes cannot be successfully generated during
VPlanTransform.
2024-11-06 14:53:49 +08:00
vporpo
320389d428
[SandboxVec][BottomUpVec] Generate vector instructions (#115087)
This patch implements some very basic code generation, for some opcodes.
2024-11-05 16:27:24 -08:00
Alexey Bataev
c1cec8c0dc [SLP][NFC]Add a test with missed splat ordering for loads, NFC 2024-11-05 14:08:17 -08:00
Florian Hahn
a353e258ba
[LAA] Don't require Stride == 1/-1 for inbounds pointer AddRecs nowrap. (#113126)
If we have a pointer AddRec, the maximum increment is
2^(pointer-index-wdith - 1) - 1. This means that if incrementing the
AddRec wraps, the distance between the previously accessed location and
the wrapped location is > 2^(pointer-index-wdith - 1), i.e. if the GEP
for the AddRec is inbounds, this would be poison due to the object being
larger than half the pointer index type space. The poison would be
immediate UB when the memory access gets executed..

Similar reasoning can be applied for decrements.

PR: https://github.com/llvm/llvm-project/pull/113126
2024-11-05 22:45:56 +01:00
Andreas Jonson
6d6287af84
[NFC] Fix test for zext(shl(trunc)) fold (#113778)
This fold already exist but there is a call to [shouldChangeType
](91fdfec263/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp (L1202))
that blocks it if the target layout is missing a definition of the types
in the casts.

closes https://github.com/llvm/llvm-project/issues/61650
2024-11-05 18:13:33 +01:00
Alexey Bataev
0c18def2c1 [SLP]Allow interleaving check only if it is less than number of elements
Need to check if the interleaving factor is less than total number of
elements in loads slice to handle it correctly and avoid compiler crash.

Fixes report https://github.com/llvm/llvm-project/pull/112361#issuecomment-2457227670
2024-11-05 07:06:15 -08:00
Nuno Lopes
6070aeb3b7 [Coro] Use poison instead of undef as placeholder [NFC] 2024-11-05 13:51:36 +00:00
Nuno Lopes
928460afc1 [ArgPromotion] Use poison instead of undef as placeholder in deleted metadata [NFC] 2024-11-05 13:44:34 +00:00
Yongtao Huang
f6948e8f9d
[LoopVectorize] Fix typo in branch-weights.ll test CEHCK->CHECK (NFC) (#113574)
Fix the typo CEHCK.

Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
2024-11-05 14:22:24 +01:00
Simon Pilgrim
05e838f428 [VectorCombine] foldExtractedCmps - disable fold on non-commutative binops
The fold needs to be adjusted to correctly track the LHS/RHS operands, which will take some refactoring, for now just disable the fold in this case.

Fixes #114901
2024-11-05 11:42:30 +00:00
Simon Pilgrim
a88be11eef [VectorCombine] Add test coverage for #114901 2024-11-05 11:42:30 +00:00
Nikita Popov
46ccefb123 [CVP] Fix APInt ctor assertion with i1 urem
This is creating an APInt with value 2, which may not be well-defined
for i1 values. Fix this by replacing the Y.umul_sat(2) with
Y.uadd_sat(Y).
2024-11-05 10:03:37 +01:00
Matt Arsenault
ea859005b5
SafeStack: Respect alloca addrspace (#112536)
Just insert addrspacecast in cases where the alloca uses a
different address space, since I don't know what else you
could possibly do.
2024-11-04 17:51:30 -08:00
Matt Arsenault
30dd1297fa
AMDGPU: Custom expand flat cmpxchg which may access private (#109410)
64-bit flat cmpxchg instructions do not work correctly for scratch
addresses, and need to be expanded as non-atomic.

Allow custom expansion of cmpxchg in AtomicExpand, as is
already the case for atomicrmw.
2024-11-04 09:29:38 -08:00
Ramkumar Ramachandra
cd16b077bf
IR: introduce CmpInst::isEquivalence (#111979)
Steal impliesEquivalanceIf{True,False} (sic) from GVN, and extend it for
floating-point constant vectors, and accounting for denormal values.
Since InstCombine also performs GVN-like replacements, introduce
CmpInst::isEquivalence, and remove the corresponding code in GVN, with
the intent of using it in more places.

The code in GVN also has a bad FIXME saying that the optimization may be
valid in the nsz case, but this is not the case.

Alive2 proof: https://alive2.llvm.org/ce/z/vEaK8M
2024-11-04 15:54:59 +00:00
Jay Foad
f8559751fc
[llvm-project] Fix typo "propogate" (#114795) 2024-11-04 15:33:19 +00:00
zhijian lin
a51712751c
[PowerPC][LLC] Utilize PPC::getNormalizedPPCTargetCPU() to set CPU (#113943)
Utilize common API in PPCTargetParser
(https://github.com/llvm/llvm-project/pull/97541) to set default CPU
with same interfaces for LLC.
This will update AIX default CPU to pwr7 and LoP powerppc64 default CPU
to ppc64.
2024-11-04 09:40:54 -05:00
Alexey Bataev
899336735a [SLP]Be more pessimistic about poisonous reductions
Consider all possible reductions ops as being non-poisoning boolean
logical operations, which require freeze to be fully correct.

https://alive2.llvm.org/ce/z/TKWDMP

Fixes #114738
2024-11-04 06:13:52 -08:00
Alexey Bataev
a15bf88d53 [SLP][NFC]Add a test with missing freeze instruction before reduction, NFC 2024-11-04 04:38:09 -08:00
Hari Limaye
5ed3f46359
[FuncSpec] Improve handling of Comparison Instructions (#114073)
When visiting comparison instructions during computation of a
specializations's bonus, make use of information from the lattice value
of the other operand in the case where we have not found this to have a
specific constant value.
2024-11-04 11:39:00 +00:00
Simon Pilgrim
ac1869aa70
[CostModel][X86] Add initial costs for non-lane-crossing one/two input shuffles (#114680)
Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fallback to worst case shuffles that reference elements from any lane.

This patch detects shuffle masks that we know are "inlane" and enable us to assume a cheaper shuffle cost.
2024-11-04 10:19:02 +00:00
Nikita Popov
8851ea64a5 [ConstantHoist] Fix APInt ctor assertion
The result here may require truncation. Fix this by removing the
calculateOffsetDiff() helper entirely. As far as I can tell, this
code does not actually have to deal with different bitwidths.

findBaseConstants() will produce ranges of constants with equal
types, which is what maximizeConstantsInRange() will then work
on.

Fixes assertion reported at:
https://github.com/llvm/llvm-project/pull/114539#issuecomment-2453008679
2024-11-04 11:17:24 +01:00
Hari Limaye
daa9af179f
[FuncSpec] Handle ssa_copy intrinsic calls in InstCostVisitor (#114247)
Look through ssa_copy intrinsic calls when computing codesize bonus for
a specialization.

Also remove redundant logic to skip computing codesize bonus for
ssa_copy intrinsics, now these are considered zero-cost by TTI (in PR
#75294).
2024-11-04 10:09:50 +00:00
Luke Lau
beb12f92c7
[RISCV] Add +optimized-nfN-segment-load-store (#114414)
This is a follow up to #111511, where after benchmarking we learnt that
the Banana Pi F3 has fast segmented loads for not just NF=2, but also
NF=3 and NF=4:
https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput

This adds tuning features to allow these segment loads and stores to be
costed cheaper and enables it for the spacemit-x60.

It also enables +optimized-nf2-segment-load-store by default in the
generic tuning to maintain the previous behaviour when compiled without
-mcpu or -mtune.
2024-11-04 06:43:58 +08:00
Hubert Tong
5091a359d9
[ConstantFold] Special case log1p +/-0.0 (#114635)
C's Annex F specifies that log1p +/-0.0 returns the input value;
however, this behavior is optional and host C libraries may behave
differently. This change applies the Annex F behavior to constant
folding by LLVM.
2024-11-02 20:06:39 -04:00
Jorge Botto
fcd51dee42
[InstCombine] Factorise Add and Min/Max using distributivity (#101717)
This PR fixes part of https://github.com/llvm/llvm-project/issues/92433.

It specifically adds the 4 cases mentioned in
https://github.com/llvm/llvm-project/issues/92433#issuecomment-2117064459.

I've added 8 positive tests, 4 of which are mentioned in the comment
above and 4 which are their commutative equivalents. Alive proof:
https://alive2.llvm.org/ce/z/z6eFTb
I've also added 8 negative tests, because we want to make sure we do not
optimise if the relevant flags are not relevant because the optimisation
wouldn't be sound. Alive proof that the optimisation is invalid:
https://alive2.llvm.org/ce/z/NvNjTD
I did have to make the integer types `i4` to make Alive not timeout and
to fit them all on one page.
2024-11-02 17:08:12 +01:00
serge-sans-paille
01a103b0b9
[llvm] Fix __builtin_object_size interaction between Negative Offset … (#111827)
…and Select/Phi

When picking a SizeOffsetAPInt through combineSizeOffset, the behavior
differs if we're going to apply a constant offset that's positive or
negative: If it's positive, then we need to compare the remaining bytes
(i.e. Size
- Offset), but if it's negative, we need to compare the preceding bytes
(i.e. Offset).

Fix #111709
2024-11-02 09:14:35 +00:00
Yingwei Zheng
f1e1055c84
[ValueTracking] Compute known bits from recursive select/phi (#113707)
This patch is inspired by
https://github.com/llvm/llvm-project/pull/113686. I found that it
removes a lot of unnecessary "and X, 1" in some applications that
represent boolean values with int.
2024-11-02 15:45:46 +08:00
Florian Hahn
17bad1a9da
[LV] Bail out on header phis in shouldConsiderInvariant.
This fixes an infinite recursion in rare cases.

Fixes https://github.com/llvm/llvm-project/issues/113794.
2024-11-01 20:51:25 +00:00
Han-Kuan Chen
a795a18bba
[SLP][REVEC] VF should be scaled when ScalarTy is FixedVectorType. (#114551) 2024-11-02 03:03:52 +08:00
Simon Pilgrim
718d50d6d0 [VectorCombine] foldPermuteOfBinops - prefer the new fold for matching costs.
Minor tweak to #114101 - as we're reducing the instruction count, we should prefer the fold if the old/new costs are the same.
2024-11-01 17:28:37 +00:00
Min-Yih Hsu
64314dedeb
[InlineCost] Print inline cost for invoke call sites as well (#114476)
Previously InlineCostAnnotationPrinter only prints inline cost for call
instructions. I don't think there is any reason not to analyze invoke
and its callee, and this patch adds such support.
2024-11-01 09:55:17 -07:00
Yingwei Zheng
a77dedcacb
[InstSimplify][InstCombine][ConstantFold] Move vector div/rem by zero fold to InstCombine (#114280)
Previously we fold `div/rem X, C` into `poison` if any element of the
constant divisor `C` is zero or undef. However, it is incorrect when
threading udiv over an vector select:
https://alive2.llvm.org/ce/z/3Ninx5
```
define <2 x i32> @vec_select_udiv_poison(<2 x i1> %x) {
  %sel = select <2 x i1> %x, <2 x i32> <i32 -1, i32 -1>, <2 x i32> <i32 0, i32 1>
  %div = udiv <2 x i32> <i32 42, i32 -7>, %sel
  ret <2 x i32> %div
}
```
In this case, `threadBinOpOverSelect` folds `udiv <i32 42, i32 -7>, <i32
-1, i32 -1>` and `udiv <i32 42, i32 -7>, <i32 0, i32 1>` into
`zeroinitializer` and `poison`, respectively. One solution is to
introduce a new flag indicating that we are threading over a vector
select. But it requires to modify both `InstSimplify` and
`ConstantFold`.

However, this optimization doesn't provide benefits to real-world
programs:

https://dtcxzyw.github.io/llvm-opt-benchmark/coverage/data/zyw/opt-ci/actions-runner/_work/llvm-opt-benchmark/llvm-opt-benchmark/llvm/llvm-project/llvm/lib/IR/ConstantFold.cpp.html#L908

https://dtcxzyw.github.io/llvm-opt-benchmark/coverage/data/zyw/opt-ci/actions-runner/_work/llvm-opt-benchmark/llvm-opt-benchmark/llvm/llvm-project/llvm/lib/Analysis/InstructionSimplify.cpp.html#L1107

This patch moves the fold into InstCombine to avoid breaking numerous
existing tests.

Fixes #114191 and #113866 (only poison-safety issue).
2024-11-01 22:56:22 +08:00
Yingwei Zheng
e577f14b67
[InstCombine] Use m_NotForbidPoison when folding (X u< Y) ? -1 : (~X + Y) --> uadd.sat(~X, Y) (#114345)
Alive2: https://alive2.llvm.org/ce/z/mTGCo-
We cannot reuse `~X` if `m_AllOnes` matches a vector constant with some
poison elts. An alternative solution is to create a new not instead of
reusing `~X`. But it doesn't worth the effort because we need to add a
one-use check.

Fixes https://github.com/llvm/llvm-project/issues/113869.
2024-11-01 22:18:44 +08:00
David Green
0f919444ad
[ValueTracking] Handle recursive phis in knownFPClass (#114008)
As a follow-on to 113686, this breaks the recursion between phi nodes
that have p1 = phi(x, p2) and p2 = phi(y, p1). The knownFPClass can be
calculated from the classes of p1 and p2.
2024-11-01 13:38:29 +00:00
Han-Kuan Chen
e4aeeba84c
[SLP][REVEC] When ScalarTy is FixedVectorType, the insertion index should consider the number of elements of ScalarTy. (#114526) 2024-11-01 21:17:57 +08:00
Nuno Lopes
344d972736 AssumeBundleBuilder: switch placeholder from undef to poison [NFC] 2024-11-01 10:12:10 +00:00
Yingwei Zheng
f16bff1261
[GVN][NewGVN][Local] Handle attributes for function calls after CSE (#114011)
This patch intersects attributes of two calls to avoid introducing UB.
It also skips incompatible call pairs in GVN/NewGVN. However, I cannot
provide negative tests for these changes.

Fixes https://github.com/llvm/llvm-project/issues/113997.
2024-11-01 12:44:33 +08:00
Lei Wang
bef3b54ea1
[InstrPGO] Avoid using global variable to fix potential data race (#114364)
In https://github.com/llvm/llvm-project/pull/109837, it sets a global
variable(`PGOInstrumentColdFunctionOnly`) in PassBuilderPipelines.cpp
which introduced a data race detected by TSan. To fix this, I decouple
the flag setting, the flags are now set
separately(`instrument-cold-function-only-path` is required to be used
with `--pgo-instrument-cold-function-only`).
2024-10-31 21:28:13 -07:00
Yingwei Zheng
96b14f2ccb
[Reland][InstCombine] Fix FMF propagation in foldSelectIntoOp (#114499)
Relands #114356. Compared to the last version, this patch only merges
poison-generating/nsz flags from the select to fix LV regression in
`llvm/test/Transforms/PhaseOrdering/AArch64/predicated-reduction.ll`.
2024-11-01 12:22:57 +08:00
c8ef
cf0b6cc711
Revert "[ConstantFold] Fold tgamma and tgammaf when the input parameter is a constant value." (#114496)
Reverts llvm/llvm-project#114065
2024-11-01 09:26:11 +08:00
c8ef
1f07f995cc
[ConstantFold] Fold tgamma and tgammaf when the input parameter is a constant value. (#114065)
This patch adds support for constant folding for the `tgamma` and
`tgammaf` libc functions.
2024-11-01 09:07:55 +08:00
Ruiling, Song
54d31bde32
Reapply "StructurizeCFG: Optimize phi insertion during ssa reconstruction (#101301)" (#114347)
This reverts commit be40c723ce2b7bf2690d22039d74d21b2bd5b7cf.
2024-11-01 08:29:59 +08:00
Florian Hahn
b021464d35
[VPlan] Introduce scalar loop header in plan, remove VPLiveOut. (#109975)
Update VPlan to include the scalar loop header. This allows retiring
VPLiveOut, as the remaining live-outs can now be handled by adding
operands to the wrapped phis in the scalar loop header.

Note that the current version only includes the scalar loop header, no
other loop blocks and also does not wrap it in a region block.

PR: https://github.com/llvm/llvm-project/pull/109975
2024-10-31 21:36:44 +01:00