13 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin
ac94073daa [AMDGPU] Refine 64 bit misaligned LDS ops selection
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64                       aligned by  8:  3.2 sec
ds_write2_b32                      aligned by  8:  3.2 sec
ds_write_b16 * 4                   aligned by  8:  7.0 sec
ds_write_b8 * 8                    aligned by  8: 13.2 sec
ds_write_b64                       aligned by  1:  7.3 sec
ds_write2_b32                      aligned by  1:  7.5 sec
ds_write_b16 * 4                   aligned by  1: 14.0 sec
ds_write_b8 * 8                    aligned by  1: 13.2 sec
ds_write_b64                       aligned by  2:  7.3 sec
ds_write2_b32                      aligned by  2:  7.5 sec
ds_write_b16 * 4                   aligned by  2:  7.1 sec
ds_write_b8 * 8                    aligned by  2: 13.3 sec
ds_write_b64                       aligned by  4:  4.6 sec
ds_write2_b32                      aligned by  4:  3.2 sec
ds_write_b16 * 4                   aligned by  4:  7.1 sec
ds_write_b8 * 8                    aligned by  4: 13.3 sec
ds_read_b64                        aligned by  8:  2.3 sec
ds_read2_b32                       aligned by  8:  2.2 sec
ds_read_u16 * 4                    aligned by  8:  4.8 sec
ds_read_u8 * 8                     aligned by  8:  8.6 sec
ds_read_b64                        aligned by  1:  4.4 sec
ds_read2_b32                       aligned by  1:  7.3 sec
ds_read_u16 * 4                    aligned by  1: 14.0 sec
ds_read_u8 * 8                     aligned by  1:  8.7 sec
ds_read_b64                        aligned by  2:  4.4 sec
ds_read2_b32                       aligned by  2:  7.3 sec
ds_read_u16 * 4                    aligned by  2:  4.8 sec
ds_read_u8 * 8                     aligned by  2:  8.7 sec
ds_read_b64                        aligned by  4:  4.4 sec
ds_read2_b32                       aligned by  4:  2.3 sec
ds_read_u16 * 4                    aligned by  4:  4.8 sec
ds_read_u8 * 8                     aligned by  4:  8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64                       aligned by  8:  4.4 sec
ds_write2_b32                      aligned by  8:  4.3 sec
ds_write_b16 * 4                   aligned by  8:  7.9 sec
ds_write_b8 * 8                    aligned by  8: 13.0 sec
ds_write_b64                       aligned by  1: 23.2 sec
ds_write2_b32                      aligned by  1: 23.1 sec
ds_write_b16 * 4                   aligned by  1: 44.0 sec
ds_write_b8 * 8                    aligned by  1: 13.0 sec
ds_write_b64                       aligned by  2: 23.2 sec
ds_write2_b32                      aligned by  2: 23.1 sec
ds_write_b16 * 4                   aligned by  2:  7.9 sec
ds_write_b8 * 8                    aligned by  2: 13.1 sec
ds_write_b64                       aligned by  4: 13.5 sec
ds_write2_b32                      aligned by  4:  4.3 sec
ds_write_b16 * 4                   aligned by  4:  7.9 sec
ds_write_b8 * 8                    aligned by  4: 13.1 sec
ds_read_b64                        aligned by  8:  3.5 sec
ds_read2_b32                       aligned by  8:  3.4 sec
ds_read_u16 * 4                    aligned by  8:  5.3 sec
ds_read_u8 * 8                     aligned by  8:  8.5 sec
ds_read_b64                        aligned by  1: 13.1 sec
ds_read2_b32                       aligned by  1: 22.7 sec
ds_read_u16 * 4                    aligned by  1: 43.9 sec
ds_read_u8 * 8                     aligned by  1:  7.9 sec
ds_read_b64                        aligned by  2: 13.1 sec
ds_read2_b32                       aligned by  2: 22.7 sec
ds_read_u16 * 4                    aligned by  2:  5.6 sec
ds_read_u8 * 8                     aligned by  2:  7.9 sec
ds_read_b64                        aligned by  4: 13.1 sec
ds_read2_b32                       aligned by  4:  3.4 sec
ds_read_u16 * 4                    aligned by  4:  5.6 sec
ds_read_u8 * 8                     aligned by  4:  7.9 sec
```

GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, where on GFX10 even a full split
is better. However, this is a theoretical only gain because splitting
an access to a sub-dword level will require more registers and packing/
unpacking logic, so ignoring this option it is better to use a single
64 bit instruction on a misaligned data with the exception of 4 byte
aligned data where ds_read2_b32/ds_write2_b32 is better.

Differential Revision: https://reviews.llvm.org/D123956
2022-04-21 09:37:16 -07:00
Stanislav Mekhanoshin
f6462a26f0 [AMDGPU] Split unaligned 4 DWORD DS operations
Similarly to 3 DWORD operations it is better for performance
to split unlaligned operations as long a these are at least
DWORD alignmened. Performance data:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b128                      aligned by 16:  4.9 sec
ds_write2_b64                      aligned by 16:  5.1 sec
ds_write2_b32 * 2                  aligned by 16:  5.5 sec
ds_write_b128                      aligned by  1:  8.1 sec
ds_write2_b64                      aligned by  1:  8.7 sec
ds_write2_b32 * 2                  aligned by  1: 14.0 sec
ds_write_b128                      aligned by  2:  8.1 sec
ds_write2_b64                      aligned by  2:  8.7 sec
ds_write2_b32 * 2                  aligned by  2: 14.0 sec
ds_write_b128                      aligned by  4:  5.6 sec
ds_write2_b64                      aligned by  4:  8.7 sec
ds_write2_b32 * 2                  aligned by  4:  5.6 sec
ds_write_b128                      aligned by  8:  5.6 sec
ds_write2_b64                      aligned by  8:  5.1 sec
ds_write2_b32 * 2                  aligned by  8:  5.6 sec
ds_read_b128                       aligned by 16:  3.8 sec
ds_read2_b64                       aligned by 16:  3.8 sec
ds_read2_b32 * 2                   aligned by 16:  4.0 sec
ds_read_b128                       aligned by  1:  4.6 sec
ds_read2_b64                       aligned by  1:  8.1 sec
ds_read2_b32 * 2                   aligned by  1: 14.0 sec
ds_read_b128                       aligned by  2:  4.6 sec
ds_read2_b64                       aligned by  2:  8.1 sec
ds_read2_b32 * 2                   aligned by  2: 14.0 sec
ds_read_b128                       aligned by  4:  4.6 sec
ds_read2_b64                       aligned by  4:  8.1 sec
ds_read2_b32 * 2                   aligned by  4:  4.0 sec
ds_read_b128                       aligned by  8:  4.6 sec
ds_read2_b64                       aligned by  8:  3.8 sec
ds_read2_b32 * 2                   aligned by  8:  4.0 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b128                      aligned by 16:  6.2 sec
ds_write2_b64                      aligned by 16:  7.1 sec
ds_write2_b32 * 2                  aligned by 16:  7.6 sec
ds_write_b128                      aligned by  1: 24.1 sec
ds_write2_b64                      aligned by  1: 25.2 sec
ds_write2_b32 * 2                  aligned by  1: 43.7 sec
ds_write_b128                      aligned by  2: 24.1 sec
ds_write2_b64                      aligned by  2: 25.1 sec
ds_write2_b32 * 2                  aligned by  2: 43.7 sec
ds_write_b128                      aligned by  4: 14.4 sec
ds_write2_b64                      aligned by  4: 25.1 sec
ds_write2_b32 * 2                  aligned by  4:  7.6 sec
ds_write_b128                      aligned by  8: 14.4 sec
ds_write2_b64                      aligned by  8:  7.1 sec
ds_write2_b32 * 2                  aligned by  8:  7.6 sec
ds_read_b128                       aligned by 16:  6.2 sec
ds_read2_b64                       aligned by 16:  6.3 sec
ds_read2_b32 * 2                   aligned by 16:  7.5 sec
ds_read_b128                       aligned by  1: 12.5 sec
ds_read2_b64                       aligned by  1: 24.0 sec
ds_read2_b32 * 2                   aligned by  1: 43.6 sec
ds_read_b128                       aligned by  2: 12.5 sec
ds_read2_b64                       aligned by  2: 24.0 sec
ds_read2_b32 * 2                   aligned by  2: 43.6 sec
ds_read_b128                       aligned by  4: 12.5 sec
ds_read2_b64                       aligned by  4: 24.0 sec
ds_read2_b32 * 2                   aligned by  4:  7.5 sec
ds_read_b128                       aligned by  8: 12.5 sec
ds_read2_b64                       aligned by  8:  6.3 sec
ds_read2_b32 * 2                   aligned by  8:  7.5 sec
```

Differential Revision: https://reviews.llvm.org/D123634
2022-04-12 16:07:13 -07:00
Stanislav Mekhanoshin
65b8a43243 [AMDGPU] Update ds-alignment.ll test checks. NFC. 2022-04-12 12:06:02 -07:00
Stanislav Mekhanoshin
3870b36025 [AMDGPU] Split unaligned 3 DWORD DS operations
I have written a minitest to check the performance. Overall
the benefit of aligned b96 operations on data which is not
known but happens to be aligned is small, while performance
hit of using b96 operations on a really unaligned memory is
high.

The only exception is when data is not aligned even by 4, it
is better to use b96 in this case.

Here is the test output on Vega and Navi:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b96                                  aligned: 3.4 sec
ds_write_b32 + ds_write_b64                   aligned: 4.5 sec
ds_write_b32 * 3                              aligned: 4.8 sec
ds_write_b96                          misaligned by 1: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 1: 7.2 sec
ds_write_b32 * 3                      misaligned by 1: 10.0 sec
ds_write_b96                          misaligned by 2: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 2: 7.2 sec
ds_write_b32 * 3                      misaligned by 2: 10.1 sec
ds_write_b96                          misaligned by 4: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 4: 4.2 sec
ds_write_b32 * 3                      misaligned by 4: 4.9 sec
ds_write_b96                          misaligned by 8: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 8: 4.6 sec
ds_write_b32 * 3                      misaligned by 8: 4.9 sec
ds_read_b96                                   aligned: 3.3 sec
ds_read_b32 + ds_read_b64                     aligned: 4.9 sec
ds_read_b32 * 3                               aligned: 2.6 sec
ds_read_b96                           misaligned by 1: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 1: 7.2 sec
ds_read_b32 * 3                       misaligned by 1: 10.1 sec
ds_read_b96                           misaligned by 2: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 2: 7.2 sec
ds_read_b32 * 3                       misaligned by 2: 10.1 sec
ds_read_b96                           misaligned by 4: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 4: 2.6 sec
ds_read_b32 * 3                       misaligned by 4: 2.6 sec
ds_read_b96                           misaligned by 8: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 8: 4.9 sec
ds_read_b32 * 3                       misaligned by 8: 2.6 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b96                                  aligned: 4.1 sec
ds_write_b32 + ds_write_b64                   aligned: 13.0 sec
ds_write_b32 * 3                              aligned: 4.5 sec
ds_write_b96                          misaligned by 1: 12.5 sec
ds_write_b32 + ds_write_b64           misaligned by 1: 22.0 sec
ds_write_b32 * 3                      misaligned by 1: 31.5 sec
ds_write_b96                          misaligned by 2: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 2: 22.0 sec
ds_write_b32 * 3                      misaligned by 2: 31.5 sec
ds_write_b96                          misaligned by 4: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 4: 4.0 sec
ds_write_b32 * 3                      misaligned by 4: 4.5 sec
ds_write_b96                          misaligned by 8: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 8: 13.0 sec
ds_write_b32 * 3                      misaligned by 8: 4.5 sec
ds_read_b96                                   aligned: 3.8 sec
ds_read_b32 + ds_read_b64                     aligned: 12.8 sec
ds_read_b32 * 3                               aligned: 4.4 sec
ds_read_b96                           misaligned by 1: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 1: 21.8 sec
ds_read_b32 * 3                       misaligned by 1: 31.5 sec
ds_read_b96                           misaligned by 2: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 2: 21.9 sec
ds_read_b32 * 3                       misaligned by 2: 31.5 sec
ds_read_b96                           misaligned by 4: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 4: 3.8 sec
ds_read_b32 * 3                       misaligned by 4: 4.5 sec
ds_read_b96                           misaligned by 8: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 8: 12.8 sec
ds_read_b32 * 3                       misaligned by 8: 4.5 sec
```

Fixes: SWDEV-330802

Differential Revision: https://reviews.llvm.org/D123524
2022-04-12 07:52:39 -07:00
Stanislav Mekhanoshin
e66f0edb40 [AMDGPU] Split unaligned LDS access instead of scalarizing
There is no need to fully scalarize an unaligned operation in
some case, just split it to alignment.

Differential Revision: https://reviews.llvm.org/D123330
2022-04-07 14:27:37 -07:00
Abinav Puthan Purayil
d8b690409d [AMDGPU] Set MemoryVT for truncstores in tblgen.
GlobalISelEmitter was skipping these patterns when its predicates were
checked. This patch should allow us to select d16_hi stores in
GlobalISel.

Differential Revision: https://reviews.llvm.org/D117762
2022-01-20 19:05:12 +05:30
Matt Arsenault
c222972442 AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting
These actions should only be used for adjusting the register types
(and the memory type as needed to satisfy the register
type). Unaligned accesses should be split as a type of lowering.

This has the effect of improving the code in many cases since now we
produce zextloads instead of separate loads with ands. The load/store
legality rules still seem far more complicated than necessary though.
2021-12-20 18:07:11 -05:00
Austin Kerbow
da067ed569 [AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retreaved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

Fixes: SWDEV-282962

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114777
2021-12-01 22:31:28 -08:00
Konstantin Schwarz
d2e66d7fa4 [GlobalISel] Add a combine for and(load , mask) -> zextload
This only handles simple masks, not shifted masks, for now.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D109357
2021-09-16 10:42:46 +02:00
Joe Nash
3ce1b9631a [AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109536

Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
2021-09-14 15:11:27 -04:00
Matt Arsenault
31a9659de5 GlobalISel: Avoid use of G_INSERT in insertParts
G_INSERT legalization is incomplete and doesn't work very
well. Instead try to use sequences of G_MERGE_VALUES/G_UNMERGE_VALUES
padding with undef values (although this can get pretty large).

For the case of load/store narrowing, this is still performing the
load/stores in irregularly sized pieces. It might be cleaner to split
this down into equal sized pieces, and rely on load/store merging to
optimize it.
2021-06-08 14:44:24 -04:00
hsmahesha
ac64995ceb [AMDGPU] Only use ds_read/write_b128 for alignment >= 16
PS: Submitting on behalf of Jay.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100008
2021-04-08 08:12:05 +05:30
hsmahesha
d5fee599c5 [AMDGPU] Add some exhaustive ds read/write alignment tests
PS: Submitting on behalf of Jay.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100007
2021-04-08 08:08:49 +05:30