Nikita Popov
bdf2fbba9c
[AMDGPU] Convert some tests to opaque pointers (NFC)
2022-12-19 12:41:13 +01:00
Jay Foad
898b18844c
[AMDGPU] Add GFX11 to some tests with manual checks
...
Differential Revision: https://reviews.llvm.org/D138138
2022-11-17 09:42:28 +00:00
Jay Foad
462e461666
[AMDGPU] Reinstate some dwordx3 tests
2022-11-16 14:57:01 +00:00
Stanislav Mekhanoshin
f6462a26f0
[AMDGPU] Split unaligned 4 DWORD DS operations
...
Similarly to 3 DWORD operations it is better for performance
to split unlaligned operations as long a these are at least
DWORD alignmened. Performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-
ds_write_b128 aligned by 16: 4.9 sec
ds_write2_b64 aligned by 16: 5.1 sec
ds_write2_b32 * 2 aligned by 16: 5.5 sec
ds_write_b128 aligned by 1: 8.1 sec
ds_write2_b64 aligned by 1: 8.7 sec
ds_write2_b32 * 2 aligned by 1: 14.0 sec
ds_write_b128 aligned by 2: 8.1 sec
ds_write2_b64 aligned by 2: 8.7 sec
ds_write2_b32 * 2 aligned by 2: 14.0 sec
ds_write_b128 aligned by 4: 5.6 sec
ds_write2_b64 aligned by 4: 8.7 sec
ds_write2_b32 * 2 aligned by 4: 5.6 sec
ds_write_b128 aligned by 8: 5.6 sec
ds_write2_b64 aligned by 8: 5.1 sec
ds_write2_b32 * 2 aligned by 8: 5.6 sec
ds_read_b128 aligned by 16: 3.8 sec
ds_read2_b64 aligned by 16: 3.8 sec
ds_read2_b32 * 2 aligned by 16: 4.0 sec
ds_read_b128 aligned by 1: 4.6 sec
ds_read2_b64 aligned by 1: 8.1 sec
ds_read2_b32 * 2 aligned by 1: 14.0 sec
ds_read_b128 aligned by 2: 4.6 sec
ds_read2_b64 aligned by 2: 8.1 sec
ds_read2_b32 * 2 aligned by 2: 14.0 sec
ds_read_b128 aligned by 4: 4.6 sec
ds_read2_b64 aligned by 4: 8.1 sec
ds_read2_b32 * 2 aligned by 4: 4.0 sec
ds_read_b128 aligned by 8: 4.6 sec
ds_read2_b64 aligned by 8: 3.8 sec
ds_read2_b32 * 2 aligned by 8: 4.0 sec
Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030
ds_write_b128 aligned by 16: 6.2 sec
ds_write2_b64 aligned by 16: 7.1 sec
ds_write2_b32 * 2 aligned by 16: 7.6 sec
ds_write_b128 aligned by 1: 24.1 sec
ds_write2_b64 aligned by 1: 25.2 sec
ds_write2_b32 * 2 aligned by 1: 43.7 sec
ds_write_b128 aligned by 2: 24.1 sec
ds_write2_b64 aligned by 2: 25.1 sec
ds_write2_b32 * 2 aligned by 2: 43.7 sec
ds_write_b128 aligned by 4: 14.4 sec
ds_write2_b64 aligned by 4: 25.1 sec
ds_write2_b32 * 2 aligned by 4: 7.6 sec
ds_write_b128 aligned by 8: 14.4 sec
ds_write2_b64 aligned by 8: 7.1 sec
ds_write2_b32 * 2 aligned by 8: 7.6 sec
ds_read_b128 aligned by 16: 6.2 sec
ds_read2_b64 aligned by 16: 6.3 sec
ds_read2_b32 * 2 aligned by 16: 7.5 sec
ds_read_b128 aligned by 1: 12.5 sec
ds_read2_b64 aligned by 1: 24.0 sec
ds_read2_b32 * 2 aligned by 1: 43.6 sec
ds_read_b128 aligned by 2: 12.5 sec
ds_read2_b64 aligned by 2: 24.0 sec
ds_read2_b32 * 2 aligned by 2: 43.6 sec
ds_read_b128 aligned by 4: 12.5 sec
ds_read2_b64 aligned by 4: 24.0 sec
ds_read2_b32 * 2 aligned by 4: 7.5 sec
ds_read_b128 aligned by 8: 12.5 sec
ds_read2_b64 aligned by 8: 6.3 sec
ds_read2_b32 * 2 aligned by 8: 7.5 sec
```
Differential Revision: https://reviews.llvm.org/D123634
2022-04-12 16:07:13 -07:00
Stanislav Mekhanoshin
3870b36025
[AMDGPU] Split unaligned 3 DWORD DS operations
...
I have written a minitest to check the performance. Overall
the benefit of aligned b96 operations on data which is not
known but happens to be aligned is small, while performance
hit of using b96 operations on a really unaligned memory is
high.
The only exception is when data is not aligned even by 4, it
is better to use b96 in this case.
Here is the test output on Vega and Navi:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-
ds_write_b96 aligned: 3.4 sec
ds_write_b32 + ds_write_b64 aligned: 4.5 sec
ds_write_b32 * 3 aligned: 4.8 sec
ds_write_b96 misaligned by 1: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 1: 7.2 sec
ds_write_b32 * 3 misaligned by 1: 10.0 sec
ds_write_b96 misaligned by 2: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 2: 7.2 sec
ds_write_b32 * 3 misaligned by 2: 10.1 sec
ds_write_b96 misaligned by 4: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 4: 4.2 sec
ds_write_b32 * 3 misaligned by 4: 4.9 sec
ds_write_b96 misaligned by 8: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 8: 4.6 sec
ds_write_b32 * 3 misaligned by 8: 4.9 sec
ds_read_b96 aligned: 3.3 sec
ds_read_b32 + ds_read_b64 aligned: 4.9 sec
ds_read_b32 * 3 aligned: 2.6 sec
ds_read_b96 misaligned by 1: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 1: 7.2 sec
ds_read_b32 * 3 misaligned by 1: 10.1 sec
ds_read_b96 misaligned by 2: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 2: 7.2 sec
ds_read_b32 * 3 misaligned by 2: 10.1 sec
ds_read_b96 misaligned by 4: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 4: 2.6 sec
ds_read_b32 * 3 misaligned by 4: 2.6 sec
ds_read_b96 misaligned by 8: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 8: 4.9 sec
ds_read_b32 * 3 misaligned by 8: 2.6 sec
Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030
ds_write_b96 aligned: 4.1 sec
ds_write_b32 + ds_write_b64 aligned: 13.0 sec
ds_write_b32 * 3 aligned: 4.5 sec
ds_write_b96 misaligned by 1: 12.5 sec
ds_write_b32 + ds_write_b64 misaligned by 1: 22.0 sec
ds_write_b32 * 3 misaligned by 1: 31.5 sec
ds_write_b96 misaligned by 2: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 2: 22.0 sec
ds_write_b32 * 3 misaligned by 2: 31.5 sec
ds_write_b96 misaligned by 4: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 4: 4.0 sec
ds_write_b32 * 3 misaligned by 4: 4.5 sec
ds_write_b96 misaligned by 8: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 8: 13.0 sec
ds_write_b32 * 3 misaligned by 8: 4.5 sec
ds_read_b96 aligned: 3.8 sec
ds_read_b32 + ds_read_b64 aligned: 12.8 sec
ds_read_b32 * 3 aligned: 4.4 sec
ds_read_b96 misaligned by 1: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 1: 21.8 sec
ds_read_b32 * 3 misaligned by 1: 31.5 sec
ds_read_b96 misaligned by 2: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 2: 21.9 sec
ds_read_b32 * 3 misaligned by 2: 31.5 sec
ds_read_b96 misaligned by 4: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 4: 3.8 sec
ds_read_b32 * 3 misaligned by 4: 4.5 sec
ds_read_b96 misaligned by 8: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 8: 12.8 sec
ds_read_b32 * 3 misaligned by 8: 4.5 sec
```
Fixes: SWDEV-330802
Differential Revision: https://reviews.llvm.org/D123524
2022-04-12 07:52:39 -07:00
hsmahesha
ac64995ceb
[AMDGPU] Only use ds_read/write_b128 for alignment >= 16
...
PS: Submitting on behalf of Jay.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100008
2021-04-08 08:12:05 +05:30
Mirko Brkusanin
0c7cce54eb
[AMDGPU] Resolve issues when picking between ds_read/write and ds_read2/write2
...
Both ds_read_b128 and ds_read2_b64 are valid for 128bit 16-byte aligned
loads but the one that will be selected is determined either by the order in
tablegen or by the AddedComplexity attribute. Currently ds_read_b128 has
priority.
While ds_read2_b64 has lower alignment requirements, we cannot always
restrict ds_read_b128 to 16-byte alignment because of unaligned-access-mode
option. This was causing ds_read_b128 to be selected for 8-byte aligned
loads regardles of chosen access mode.
To resolve this we use two patterns for selecting ds_read_b128. One
requires alignment of 16-byte and the other requires
unaligned-access-mode option.
Same goes for ds_write2_b64 and ds_write_b128.
Differential Revision: https://reviews.llvm.org/D92767
2020-12-10 12:40:49 +01:00
Mirko Brkusanin
ae36c02ad0
[AMDGPU] Set DS alignment requirements to be more strict
...
Alignment requirements for ds_read/write_b96/b128 for gfx9 and onward are
now the same as for other GCN subtargets. This way we can avoid any
unintentional use of these instructions on systems that do not support dword
alignment and instead require natural alignment.
This also makes 'SH_MEM_CONFIG.alignment_mode == STRICT' the default.
Differential Revision: https://reviews.llvm.org/D87821
2020-09-18 15:26:24 +02:00
Mirko Brkusanin
43af2a6faa
[AMDGPU] Workaround for LDS Misalignment bug on GFX10
...
Add subtarget feature check to avoid using ds_read/write_b96/128 with too
low alignment if a bug is present on that specific hardware.
Add this "feature" to GFX 10.1.1 as it is also affected.
Add global-isel test.
2020-09-09 11:46:09 +02:00
Mirko Brkusanin
0654ff703d
[AMDGPU] Use ds_read/write_b96/b128 when possible for SDag
...
Do not break down local loads and stores so ds_read/write_b96/b128 in
ISelLowering can be selected on subtargets that support them and if align
requirements allow them.
Differential Revision: https://reviews.llvm.org/D84403
2020-08-21 12:26:31 +02:00
Stanislav Mekhanoshin
c43e67bfff
[AMDGPU] gfx1011/gfx1012 targets
...
Differential Revision: https://reviews.llvm.org/D63307
llvm-svn: 363344
2019-06-14 00:33:31 +00:00
Stanislav Mekhanoshin
a224f68a10
[AMDGPU] gfx1010 DS implementation
...
Differential Revision: https://reviews.llvm.org/D61332
llvm-svn: 359696
2019-05-01 16:11:11 +00:00