130 Commits

Drew Kersnar
8e7461e29a
[LoadStoreVectorizer] Batch alias analysis results to improve compile time (#147555)
This should be generally good for a lot of LSV cases, but the attached
test demonstrates a specific compile time issue that appears when the
`CaptureTracking` default max-uses limit is raised.

Without using batching alias analysis, this test takes 6 seconds to
compile in a release build; with it, less than a second. This is because
the mechanism that proves `NoAlias` in this case is very expensive
(`CaptureTracking.cpp`), and caching the result leads to 2 calls to that
mechanism instead of ~300,000 (run with -stats to see the difference).

This test only demonstrates the compile time issue if
`capture-tracking-max-uses-to-explore` is set to at least 1024, because
with the default value of 100, the `CaptureTracking` analysis is not
run, `NoAlias` is not proven, and the vectorizer gives up early.
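
For illustration, a minimal sketch of the batching idea, with a simplified pairwise scan standing in for the vectorizer's real chain logic:

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Route repeated alias queries through BatchAAResults, which caches
// results for as long as the IR is not modified. An expensive NoAlias
// proof (e.g. one that walks uses via CaptureTracking) then runs once
// instead of once per query.
static bool chainsMayAlias(AAResults &AA, ArrayRef<LoadInst *> ChainA,
                           ArrayRef<LoadInst *> ChainB) {
  BatchAAResults BatchAA(AA); // valid while the IR stays unchanged
  for (LoadInst *I : ChainA)
    for (LoadInst *J : ChainB)
      if (BatchAA.alias(MemoryLocation::get(I), MemoryLocation::get(J)) !=
          AliasResult::NoAlias)
        return true;
  return false;
}
```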
2025-07-10 11:23:33 -05:00
Drew Kersnar
a1e1a84d2c
[NVPTX] Vectorize and lower 256-bit global loads/stores for sm_100+/ptx88+ (#139292)
PTX 8.8+ introduces 256-bit-wide vector loads/stores under certain
conditions. This change extends the backend to lower these loads/stores.
It also overrides getLoadStoreVecRegBitWidth for NVPTX, allowing the
LoadStoreVectorizer to create these wider vector operations.

See the spec for the three relevant PTX instructions here:
- https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld
- https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld-global-nc
- https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st
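
A hedged sketch of the TTI side of this: getLoadStoreVecRegBitWidth is the real hook named above, while the gating parameters and the address-space constant are stand-ins for the actual NVPTX subtarget checks.

```cpp
// The LoadStoreVectorizer caps chain width at this value (in bits), so
// returning 256 is what allows it to form 32-byte vector accesses.
static unsigned getLoadStoreVecRegBitWidth(unsigned AddrSpace, bool HasSM100,
                                           bool HasPTX88) {
  const unsigned GlobalAS = 1; // NVPTX global address space
  // 256-bit vector loads/stores only exist for global memory on
  // sm_100+/ptx88+ (see the ld/st/ld.global.nc specs linked above).
  if (AddrSpace == GlobalAS && HasSM100 && HasPTX88)
    return 256;
  return 128;
}
```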
2025-05-13 13:36:09 -07:00
Anshil Gandhi
dadd91e793
[NFC] Precommit tests for an LSV patch (#138167)
Autogenerate checks for merge-vectors.ll and introduce
merge-vectors-complex.ll with mismatched types.
Related PR: https://github.com/llvm/llvm-project/pull/134436

This is a reland of https://github.com/llvm/llvm-project/pull/138155,
which was reverted due to missed nits.
2025-05-01 12:50:31 -04:00
Anshil Gandhi
a7aca819d4
Revert "[NFC] Precommit: Autogenerate checks for an LSV test" (#138161)
Reverts llvm/llvm-project#138155
2025-05-01 12:09:51 -04:00
Anshil Gandhi
0e9740ea17
[NFC] Precommit: Autogenerate checks for an LSV test (#138155)
Related PR: https://github.com/llvm/llvm-project/pull/134436
2025-05-01 12:00:43 -04:00
Alexander Richardson
a57847232f
[LoadStoreVectorizer] Remove more unnecessary data layouts from tests
The tests in this directory all depend on the AMDGPU target being
present, so we can let opt infer the data layout.

Reviewed By: arsenm

Pull Request: https://github.com/llvm/llvm-project/pull/137924
2025-04-30 10:58:33 -07:00
Piotr Sobczak
170c0dac44
[AMDGPU] Fix edge case of buffer OOB handling (#115479)
Strengthen out-of-bounds guarantees for buffer accesses by disallowing
buffer accesses with alignment lower than natural alignment.

This is needed specifically to address the edge case where an access
starts out-of-bounds and then enters in-bounds, as the hardware would
treat the entire access as out-of-bounds. Most users do not need this,
but at least one graphics device extension (VK_EXT_robustness2) has
very strict requirements: in-bounds accesses must return the correct
value, and out-of-bounds accesses must return zero.

The direct consequence of the patch is that a buffer access at negative
address is not merged by load-store-vectorizer with one at a positive
address, which fixes a CTS test.

Targets that do not care about the new behavior can use the new target
feature relaxed-buffer-oob-mode, which preserves the behavior from
before this patch.
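
A minimal sketch of the resulting legality rule, assuming it is phrased as an allowsMisalignedMemoryAccesses-style check:

```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Type.h"

using namespace llvm;

// Reject buffer accesses below natural (ABI) alignment, so an access
// can never straddle the out-of-bounds boundary; relaxed-buffer-oob-mode
// opts back into the pre-patch behavior.
static bool allowsBufferAccess(const DataLayout &DL, Type *AccessTy,
                               Align A, bool RelaxedBufferOOBMode) {
  if (RelaxedBufferOOBMode)
    return true; // pre-patch behavior: alignment does not gate legality
  return A >= DL.getABITypeAlign(AccessTy);
}
```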
2025-03-07 08:56:44 +01:00
Fabian Ritter
a33a84ee63
[AMDGPU][NFC] Replace gfx940 and gfx941 with gfx942 in llvm/test (#125711)

gfx940 and gfx941 are no longer supported. This is one of a series of PRs to remove them from the code base.

This PR uses gfx942 instead of gfx940 and gfx941 in the test RUN-lines (unless there is already a RUN-line for gfx942).

The only notable difference in the test output is that gfx942 does not force the use of sc0 and sc1 on stores while gfx940 and gfx941 do (cf. https://reviews.llvm.org/D149986).

For SWDEV-512631
2025-02-13 15:17:12 +01:00
Vyacheslav Klochkov
9184c42869
[LoadStoreVectorizer] Postprocess and merge equivalence classes (#121861)
This patch introduces a new method:

void Vectorizer::mergeEquivalenceClasses(EquivalenceClassMap &EQClasses) const;

The method is called at the end of
Vectorizer::collectEquivalenceClasses() and is needed to merge
equivalence classes that differ only by their underlying objects (UO1
and UO2), where UO1 is the 1-level-indirection underlying base for UO2.
This situation arises due to the limited lookup depth used during the
search for underlying bases with llvm::getUnderlyingObject(ptr).

Using any fixed lookup depth can result in the creation of multiple
equivalence classes that differ only by 1-level-indirection bases.

The new approach merges equivalence classes if they have adjacent bases
(1-level indirection). If a series of equivalence classes forms a ladder
of 1-level indirections, they are all merged into a single equivalence
class. This provides more opportunities for the load-store vectorizer
to generate better vectors.
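
As a simplified model (the real EquivalenceClassMap key and merge rules are richer), the ladder merge looks roughly like:

```cpp
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

using EqClassMap = MapVector<const Value *, SmallVector<Instruction *, 8>>;

// If one class's base is a single getUnderlyingObject step away from
// another class's base, both describe the same memory: fold the inner
// class into the outer one. Repeated folds collapse a whole ladder of
// 1-level indirections into a single class.
static void mergeAdjacentClasses(EqClassMap &Classes) {
  for (auto &Entry : Classes) {
    const Value *Outer = getUnderlyingObject(Entry.first, /*MaxLookup=*/1);
    if (Outer == Entry.first)
      continue;
    auto It = Classes.find(Outer);
    if (It == Classes.end())
      continue;
    It->second.append(Entry.second.begin(), Entry.second.end());
    Entry.second.clear();
  }
}
```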

---------

Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
2025-01-07 17:17:26 -08:00
Fangrui Song
133352feb3 [test] Remove redundant -march= when target triple is specified in IR 2024-12-15 12:42:17 -08:00
Michal Paszkowski
04313b86a5
Revert "[LoadStoreVectorizer] Postprocess and merge equivalence classes" (#119657)
Reverts llvm/llvm-project#114501, due to the following failure:
https://lab.llvm.org/buildbot/#/builders/55/builds/4171
2024-12-11 20:36:23 -08:00
Vyacheslav Klochkov
fd2f8d485d
[LoadStoreVectorizer] Postprocess and merge equivalence classes (#114501)
This patch introduces a new method:
void Vectorizer::mergeEquivalenceClasses(EquivalenceClassMap &EQClasses) const

The method is called at the end of
Vectorizer::collectEquivalenceClasses() and is needed to merge
equivalence classes that differ only by their underlying objects (UO1
and UO2), where UO1 is the 1-level-indirection underlying base for UO2.
This situation arises due to the limited lookup depth used during the
search for underlying bases with llvm::getUnderlyingObject(ptr).

Using any fixed lookup depth can result in the creation of multiple
equivalence classes that differ only by 1-level-indirection bases.

The new approach merges equivalence classes if they have adjacent bases
(1-level indirection). If a series of equivalence classes forms a ladder
of 1-level indirections, they are all merged into a single equivalence
class. This provides more opportunities for the load-store vectorizer
to generate better vectors.

---------

Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
2024-12-11 19:01:35 -08:00
Danial Klimkin
9671ed1afc
Revert "LSV: forbid load-cycles when vectorizing; fix bug (#104815)" (#106245)
This reverts commit c46b41aaa6eaa787f808738d14c61a2f8b6d839f.

Multiple tests time out, either due to a performance hit (see comment)
or a cycle.
2024-08-27 18:45:22 +02:00
Austin Kerbow
ceb587a16c
[AMDGPU] Fix crash in allowsMisalignedMemoryAccesses with i1 (#105794) 2024-08-23 11:51:37 -07:00
Ramkumar Ramachandra
c46b41aaa6
LSV: forbid load-cycles when vectorizing; fix bug (#104815)
Forbid load-load cycles which would crash LoadStoreVectorizer when
reordering instructions.

Fixes #37865.
2024-08-22 11:37:19 +01:00
Ramkumar Ramachandra
fff78a51ee
LSV/test/AArch64: add missing lit.local.cfg; fix build (#102607)
Follow up on 199d6f2 (LSV: document hang reported in #37865) to fix the
build when omitting the AArch64 target. Add the missing lit.local.cfg.
2024-08-09 14:17:13 +01:00
Ramkumar Ramachandra
199d6f2c0c
LSV: document hang reported in #37865 (#102479)
LoadStoreVectorizer hangs on certain examples when its reorder function
goes into a cycle. Detect this cycle and explicitly forbid it using an
assert, and document the resulting crash in a test case under AArch64.
2024-08-09 11:34:09 +01:00
Nikita Popov
2d69827c5c [Transforms] Convert tests to opaque pointers (NFC) 2024-02-05 11:57:34 +01:00
Nick Anderson
f1ec0d12bb
Port CodeGenPrepare to new pass manager (and BasicBlockSectionsProfil… (#77182)
Port CodeGenPrepare to the new pass manager, along with its dependency
BasicBlockSectionsProfileReader.
Fixes: #75380

Co-authored-by: Krishna-13-cyber <84722531+Krishna-13-cyber@users.noreply.github.com>
2024-01-09 13:32:59 +07:00
Simon Pilgrim
7648371c25 Revert 4d7c5ad58467502fcbc433591edff40d8a4d697d "[NewPM] Update CodeGenPreparePass reference in CodeGenPassBuilder (#77054)"
Revert e0c554ad87d18dcbfcb9b6485d0da800ae1338d1 "Port CodeGenPrepare to new pass manager (and BasicBlockSectionsProfil… (#75380)"

Revert #75380 and #77054 as they were breaking EXPENSIVE_CHECKS buildbots: https://lab.llvm.org/buildbot/#/builders/104
2024-01-05 12:28:10 +00:00
Nick Anderson
e0c554ad87
Port CodeGenPrepare to new pass manager (and BasicBlockSectionsProfil… (#75380)
Port CodeGenPrepare to the new pass manager, along with its dependency
BasicBlockSectionsProfileReader.
Fixes: #64560

Co-authored-by: Krishna-13-cyber <84722531+Krishna-13-cyber@users.noreply.github.com>
2024-01-05 13:47:56 +07:00
Nikita Popov
eecb99c5f6 [Tests] Add disjoint flag to some tests (NFC)
These tests rely on SCEV recognizing an "or" with no common
bits as an "add". Add the disjoint flag to the relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with the disjoint flag matches what
InstCombine would produce.
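
A short sketch of what tagging such an instruction looks like in C++; the no-common-bits fact is assumed to have been established by the caller:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstrTypes.h"

using namespace llvm;

// An `or disjoint` is equivalent to an `add`, which is what SCEV will
// rely on once it switches from the ValueTracking query to the flag.
static Value *buildDisjointOr(IRBuilder<> &B, Value *LHS, Value *RHS) {
  Value *Or = B.CreateOr(LHS, RHS, "offset");
  if (auto *PDI = dyn_cast<PossiblyDisjointInst>(Or))
    PDI->setIsDisjoint(true); // CreateOr may constant-fold, hence the cast
  return Or;
}
```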
2023-12-05 14:09:36 +01:00
Nikita Popov
edb2fc6dab [llvm] Remove explicit -opaque-pointers flag from tests (NFC)
Opaque pointers mode is enabled by default, no need to explicitly
enable it.
2023-07-12 14:35:55 +02:00
Matt Arsenault
a393870085 AMDGPU: Extract test out of old patch
Don't think the patch is still useful but I don't see equivalent tests
for decreased alloca alignment.

https://reviews.llvm.org/D23908
2023-06-09 21:53:33 -04:00
Bjorn Pettersson
263bc7f905 [LoadStoreVectorizer] Only upgrade align for alloca
In commit 2be0abb7fe72ed453 (D149893) the load-store vectorizer was
reimplemented. One thing that can happen with the new LSV is that
it can increase the alignment of alloca and global objects. However,
the code comments indicate that the intention was only to increase
the alignment of allocas.
Now we use stripPointerCasts to analyse whether the load/store really
is accessing an alloca (the same approach getOrEnforceKnownAlignment
uses), and we only try to change the alignment if we find an alloca
instruction. This way the code matches the code comments better,
and we won't change the alignment of non-stack variables to the
"StackAdjustedAlignment".

Differential Revision: https://reviews.llvm.org/D152386
2023-06-09 15:33:35 +02:00
Krzysztof Drewniak
e7acd8bdf7 [LoadStoreVectorizer] Fix index width != pointer width case
Fixes https://github.com/llvm/llvm-project/issues/62856

Reviewed By: jlebar

Differential Revision: https://reviews.llvm.org/D151754
2023-05-31 17:27:26 +00:00
Krzysztof Drewniak
087b67cc06 [AMDGPU][LoadStoreVectorizer] Pre-commit test for addrspace 7 crash
Differential Revision: https://reviews.llvm.org/D151751
2023-05-30 21:15:21 +00:00
Justin Lebar
420cf6927c
[LSV] Return same bitwidth from getConstantOffset.
Previously, getConstantOffset could return an APInt with a different
bitwidth than the input pointers.  For example, we might be loading an
opaque 64-bit pointer, but stripAndAccumulateInBoundsConstantOffsets
might give a 32-bit offset.

This was OK in most cases because in gatherChains, we cast the APInt
back to the original ASPtrBits.

But it was not OK when considering selects.  We'd call getConstantOffset
twice and compare the resulting APInts, which might not have the same
bit width.

This fixes that.  Now getConstantOffset always returns offsets with the
correct width, so we don't need the hack of casting it in gatherChains,
and it works correctly when we're handling selects.
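
A hedged sketch of the invariant (not the actual LSV code): accumulate the offset at the index width the stripping API expects, then normalize to the pointer width before returning.

```cpp
#include "llvm/ADT/APInt.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Value.h"

using namespace llvm;

static APInt getConstantOffsetSketch(Value *Ptr, const DataLayout &DL) {
  unsigned PtrBits = DL.getPointerTypeSizeInBits(Ptr->getType());
  unsigned IdxBits = DL.getIndexTypeSizeInBits(Ptr->getType());
  APInt Offset(IdxBits, 0); // the width the stripping API works in
  Ptr->stripAndAccumulateInBoundsConstantOffsets(DL, Offset);
  // Return a consistent width so two offsets for the same address space
  // can always be compared without APInt bitwidth assertions.
  return Offset.sextOrTrunc(PtrBits);
}
```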

Differential Revision: https://reviews.llvm.org/D151640
2023-05-29 08:43:47 -07:00
Justin Lebar
f225471c68
[LSV] Fix the ContextInst for computeKnownBits.
Previously we used the later of GEPA or GEPB.  This is hacky because
really we should be using the later of the two load/store instructions
being considered.  But also it's flat-out incorrect, because GEPA and
GEPB might be in different BBs, in which case we cannot ask which one
comes last (assertion failure,
https://reviews.llvm.org/D149893#4378332).

Fixed, now we use the correct context instruction.
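
The shape of the fix, sketched under the assumption that both candidates sit in one block (which is how the vectorizer scans them):

```cpp
#include "llvm/IR/Instruction.h"
#include <cassert>

using namespace llvm;

// Pick the later of the two memory instructions as the context for
// computeKnownBits. Instruction::comesBefore is only meaningful within
// a single basic block, which is exactly why asking the same question
// of GEPA/GEPB could assert: the GEPs may live in different blocks.
static const Instruction *laterContext(const Instruction *MemA,
                                       const Instruction *MemB) {
  assert(MemA->getParent() == MemB->getParent() &&
         "candidates are scanned per basic block");
  return MemA->comesBefore(MemB) ? MemB : MemA;
}
```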

Differential Revision: https://reviews.llvm.org/D151630
2023-05-28 08:00:52 -07:00
Justin Lebar
2be0abb7fe
Rewrite load-store-vectorizer.
The motivation for this change is a workload generated by the XLA compiler
targeting nvidia GPUs.

This kernel has a few hundred i8 loads and stores.  Merging is critical for
performance.

The current LSV doesn't merge these well because it only considers instructions
within a block of 64 loads+stores.  This limit is necessary to contain the
O(n^2) behavior of the pass.  I'm hesitant to increase the limit, because this
pass is already one of the slowest parts of compiling an XLA program.

So we rewrite basically the whole thing to use a new algorithm.  Before, we
compared every load/store to every other to see if they're consecutive.  The
insight (from tra@) is that this is redundant.  If we know the offset from PtrA
to PtrB, then we don't need to compare PtrC to both of them in order to tell
whether C may be adjacent to A or B.

So that's what we do.  When scanning a basic block, we maintain a list of
chains, where we know the offset from every element in the chain to the first
element in the chain.  Each instruction gets compared only to the leaders of
all the chains.

In the worst case, this is still O(n^2), because all chains might be of length
1.  To prevent compile time blowup, we only consider the 64 most recently used
chains.  Thus we do no more comparisons than before, but we have the potential
to make much longer chains.
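
In pseudocode terms, a simplified model of the scan (OffsetFrom is a hypothetical oracle for the statically known byte offset between two addresses):

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

struct Access {}; // stand-in for a load/store instruction

// A chain records the constant offset of every member to its leader.
struct Chain {
  Access *Leader;
  std::vector<std::pair<Access *, int64_t>> Members;
};

using OffsetOracle = std::function<std::optional<int64_t>(Access *, Access *)>;

static void addAccess(std::deque<Chain> &Chains, Access *New,
                      const OffsetOracle &OffsetFrom,
                      unsigned MaxChains = 64) {
  for (Chain &C : Chains) {
    // One comparison against the leader fixes New's offset relative to
    // every member of the chain.
    if (std::optional<int64_t> Off = OffsetFrom(C.Leader, New)) {
      C.Members.push_back({New, *Off});
      return;
    }
  }
  // No chain claimed the access: it leads a new chain. Bound compile
  // time by keeping only the most recent chains.
  Chains.push_front(Chain{New, {{New, 0}}});
  if (Chains.size() > MaxChains)
    Chains.pop_back();
}
```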

This rewrite affects many tests.  The changes to tests fall into two
categories.

1. The old code had what appears to be a bug when deciding whether a misaligned
   vectorized load is fast.  Suppose TTI reports that load <i32 x 4> align 4
   has relative speed 1, and suppose that load i32 align 4 has relative speed
   32.

   The intent of the code seems to be that we prefer the scalar load, because
   it's faster.  But the old code would choose the vectorized load.
   accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not
   even call into TTI to get the relative speed), because the scalar load is
   aligned.

   After this patch, we will prefer the scalar load if it's faster.

2. This patch changes the logic for how we vectorize.  Usually this results in
   vectorizing more.

Explanation of changes to tests:

 - AMDGPU/adjust-alloca-alignment.ll: #1
 - AMDGPU/flat_atomic.ll: #2, we vectorize more.
 - AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this.  Before, we'd vectorize in case 1 and not case 2.  Now we vectorize in case 2 and not case 1.  So we just move the call.
 - AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more
 - AMDGPU/insertion-point.ll: #2 we vectorize more
 - AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117d476, which appear to have hit the bug from #1)
 - AMDGPU/multiple_tails.ll: #1
 - AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above).
 - AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more.
 - NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above).
 - NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split)
 - X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic.
 - X86/subchain-interleaved.ll: #2, we vectorize more
 - X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float>
 - X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical.  It was doing `add nuw %v0, -1`, but this is equivalent to `add nuw %v0, 0xffff'ffff`, which is equivalent to asserting that %v0 == 0.
 - X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll

Differential Revision: https://reviews.llvm.org/D149893
2023-05-26 15:15:39 -07:00
Tobias Hieta
f84bac329b
[NFC][Py Reformat] Reformat lit.local.cfg python files in llvm
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.

Reformatting is done with `black`.

If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.

If you run into any problems, post to discourse about it and
we will try to help.

RFC Thread below:

https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style

Reviewed By: barannikov88, kwk

Differential Revision: https://reviews.llvm.org/D150762
2023-05-17 17:03:15 +02:00
Krzysztof Drewniak
f0415f2a45 Re-land "[AMDGPU] Define data layout entries for buffers"
Re-land D145441 with data layout upgrade code fixed to not break OpenMP.

This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27.

Differential Revision: https://reviews.llvm.org/D149776
2023-05-03 19:43:56 +00:00
Krzysztof Drewniak
3f2fbe92d0 Revert "[AMDGPU] Define data layout entries for buffers"
This reverts commit f9c1ede2543b37fabe9f2d8f8fed5073c475d850.

Differential Revision: https://reviews.llvm.org/D149758
2023-05-03 16:11:00 +00:00
Krzysztof Drewniak
f9c1ede254 [AMDGPU] Define data layout entries for buffers
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.

The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.

The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them as resource arguments (ptr addrspace(8)) instead of
<4 x i32>. These pointers are 128 bits long (with the same
alignment). They must not be used as arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.
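
Concretely, data layout entries matching this description would read as below (the component order is p<AS>:<size>:<abi-align>:<pref-align>:<index-width>, all in bits); this is shown in isolation, not the full AMDGCN layout string:

```cpp
#include "llvm/IR/DataLayout.h"

using namespace llvm;

DataLayout BufferLayouts(
    "p7:160:256:256:32" // AS7: 160-bit fat pointer, 256-bit aligned,
                        //      indexed by a 32-bit offset
    "-p8:128:128");     // AS8: 128-bit, 128-bit-aligned buffer resource
```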

Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.

This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.

Depends on D143437

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D145441
2023-05-03 15:25:58 +00:00
Artem Belevich
faa631f939 [LSV] Improve chain splitting in some corner cases.
Currently we happen to split a chain of 12xi8 accesses into 6xi8 + 6xi8, which
produces rather suboptimal code.

This change attempts to split off non-multiples of 4 bytes at the end,
and if that does not work, splits on the smaller power-of-2 boundary.
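
A sketch of the arithmetic for byte-sized elements; with it, 12 bytes split as 8 + 4 instead of 6 + 6:

```cpp
// First peel any non-multiple-of-4-bytes tail; if the length is already
// a multiple of 4, fall back to the largest power-of-2 prefix.
static unsigned splitPoint(unsigned ChainBytes) {
  if (unsigned Rem = ChainBytes % 4)
    return ChainBytes - Rem; // e.g. 13 -> 12 + 1
  unsigned P = 1;
  while (P * 2 < ChainBytes)
    P *= 2;
  return P; // e.g. 12 -> 8 + 4
}
```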

Differential Revision: https://reviews.llvm.org/D147976
2023-04-17 13:42:00 -07:00
Krzysztof Drewniak
916425b2d1 [llvm] Use pointer index type for more GEP offsets (pre-codegen)
Many uses of getIntPtrType() were using that type to calculate the
needed type for GEP offset arguments. However, some time ago,
DataLayout was extended to support pointers where the size of the
pointer is not equal to the size of the values used to index it.

Much code was already migrated to, for example, use getIndexSizeInBits
instead of getPtrSizeInBits, but some rewrites still used
getIntPtrType() to get the type for GEP offsets.

This commit changes uses of getIntPtrType() to getIndexType() where
they are involved in a GEP-related calculation.

In at least one case (bounds check insertion) this resolves a compiler
crash that the new test added here would previously trigger.
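
A sketch of the corrected pattern; with, say, p7:160:256:256:32 from the buffer work above, getIntPtrType would produce a 160-bit integer where the GEP needs a 32-bit index:

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Build a byte-offset GEP using the pointer's index type, not the
// integer type that merely matches the pointer's storage size.
static Value *emitByteGEP(IRBuilder<> &B, const DataLayout &DL, Value *Ptr,
                          uint64_t ByteOff) {
  Type *IdxTy = DL.getIndexType(Ptr->getType()); // not getIntPtrType()
  return B.CreateGEP(B.getInt8Ty(), Ptr, ConstantInt::get(IdxTy, ByteOff));
}
```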

This commit does not impact
- C library-related rewriting (memcpy()), which operates under
the assumption that intptr_t == size_t. While all the mechanisms for
breaking this assumption now exist, doing so is outside the scope of
this commit.
- Code generation and below. Note that the use of getIntPtrType() in
CodeGenPrepare will be changed in a future commit.
- Usage of getIntPtrType() in any backend

Depends on D143435

Reviewed By: arichardson

Differential Revision: https://reviews.llvm.org/D143437
2023-03-28 16:41:02 +00:00
Jeffrey Byrnes
b89236a96f [AMDGPU] Vectorize misaligned global loads & stores
Based on experimentation on gfx906, 908, 90a, and 1030, wider global loads/stores are more performant than multiple narrower ones, independent of alignment -- this is especially true when combining 8-bit loads/stores, in which case the speedup was usually 2x across all alignments.

Differential Revision: https://reviews.llvm.org/D145170

Change-Id: I6ee6c76e6ace7fc373cc1b2aac3818fc1425a0c1
2023-03-03 13:18:25 -08:00
Nikita Popov
5867241eac [Transforms] Convert some tests to opaque pointers (NFC) 2023-01-06 12:14:45 +01:00
Nikita Popov
0d18d36b18 [LoadStoreVectorizer] Convert tests to opaque pointers (NFC) 2022-12-27 13:13:56 +01:00
Nikita Popov
314d0dbb20 [LoadStoreVectorize] Regenerate test checks (NFC) 2022-12-27 13:08:57 +01:00
Nikita Popov
ba1759c498 [LoadStoreVectorizer] Convert some tests to opaque pointers (NFC) 2022-12-27 12:57:01 +01:00
Roman Lebedev
c37dfd0fae
[NFC] Port last few Transforms tests to -passes= syntax 2022-12-09 02:07:27 +03:00
Roman Lebedev
b1a9584818
[opt] Disincentivize new tests from using old pass syntax
Over the past day or so, I've taken a large swing at our tests
and reduced the number of tests that were still using the old syntax
from ~1800 to just 200.

Left to handle: (as it is seen in this patch)
* Transforms/LSR
* Transforms/CGP
* Transforms/TypePromotion
* Transforms/HardwareLoops
* Analysis/*
* some misc.

I think this is the right point to start actively refusing
to honor the old syntax, except for the old tests,
to prevent the old syntax from creeping back in.

Thus, let's add a temporary default-off flag, and if it is not
passed, refuse to accept the old syntax.
The tests that still need porting are annotated with this flag.

Reviewed By: aeubanks

Differential Revision: https://reviews.llvm.org/D139647
2022-12-08 23:54:03 +03:00
Roman Lebedev
d6e7e477ee
[NFC] Port all LoadStoreVectorizer tests to -passes= syntax 2022-12-08 02:38:45 +03:00
Arthur Eubanks
f3a928e233 [opt] Don't translate legacy -analysis flag to require<analysis>
Tests relying on this should explicitly use -passes='require<analysis>,foo'.
2022-10-07 14:54:34 -07:00
Arthur Eubanks
d3d8465446 [opt] Stop treating alias analysis specially when translating legacy opt syntax
I've attempted to keep AA tests as close to their original intent as possible.
2022-10-07 11:50:43 -07:00
Johannes Doerfert
1fb415fee9 [AMDGPU][FIX] Proper load-store-vectorizer result with opaque pointers
The original code relied on the fact that we needed a bitcast
instruction (for non-constant base objects). With opaque pointers there
might not be a bitcast. Always check whether reordering is required
instead.

Fixes: https://github.com/llvm/llvm-project/issues/54896

Differential Revision: https://reviews.llvm.org/D123694
2022-04-15 13:42:46 -05:00
Stanislav Mekhanoshin
a41a676e8a [AMDGPU] Check SI LDS offset bug in the allowsMisalignedMemoryAccesses
Differential Revision: https://reviews.llvm.org/D123268
2022-04-06 18:05:02 -07:00
Benjamin Kramer
0776f6e04d [LSV] Vectorize loads of vectors by turning it into a larger vector
Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.
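
For example, when two adjacent <2 x half> loads are merged into one <4 x half> load, the original values come back as shuffles:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Re-extract the two original <2 x half> values from the merged wide
// load; the single-operand shufflevector form pads with poison.
static void extractHalves(IRBuilder<> &B, Value *Wide /* <4 x half> */,
                          Value *&Lo, Value *&Hi) {
  Lo = B.CreateShuffleVector(Wide, ArrayRef<int>{0, 1}, "lo");
  Hi = B.CreateShuffleVector(Wide, ArrayRef<int>{2, 3}, "hi");
}
```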

Differential Revision: https://reviews.llvm.org/D117219
2022-01-26 11:38:41 +01:00
Nikita Popov
330cb03269 [LoadStoreVectorizer] Check for guaranteed-to-transfer (PR52950)
Rather than checking for nounwind in particular, make sure the
instruction is guaranteed to transfer execution, which will also
handle non-willreturn calls correctly.
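
A sketch of the barrier test, hedged as a stand-alone helper rather than the pass's actual scan:

```cpp
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// A nounwind-but-not-willreturn call may simply never return, so code
// cannot safely be reordered across it; this predicate covers that case
// where a bare nounwind check does not.
static bool mayReorderAcross(const Instruction &I) {
  return isGuaranteedToTransferExecutionToSuccessor(&I);
}
```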

Fixes https://github.com/llvm/llvm-project/issues/52950.
2022-01-03 10:55:47 +01:00