406 Commits

Author SHA1 Message Date
Dávid Bolvanský
f02a0a69af [NFCI] Fixed missing colon in CHECK directives 2022-04-03 11:52:38 +02:00
Daniil Kovalev
a8c277041a [NVPTX] Fix poorly designed assertion introduced in D120129
NVPTXTargetLowering::getFunctionParamOptimizedAlign, which was introduces in
D120129, contained a poorly designed assertion checking that a function with
internal or private linkage is not a kernel. It relied on invariants that
were not actually guaranteed, and that resulted in compiler crash with some
CUDA versions (see discussion with @jdoerfert in D120129). This patch changes
that assertion and makes it use isKernelFunction which is designed exactly for
such checks. This patch also includes a test with IR that caused compiler crash
before.

Differential Revision: https://reviews.llvm.org/D122562
2022-03-28 17:34:58 +03:00
Daniil Kovalev
828b63c309 [NVPTX] Enhance vectorization of ld.param & st.param
Since function parameters and return values are passed via param space, we
can force special alignment for values hold in it which will add vectorization
options. This change may be done if the function has private or internal
linkage. Special alignment is forced during 2 phases.

1) Instruction selection lowering. Here we use special alignment for function
   prototypes (changing both own return value and parameters alignment), call
   lowering (changing both callee's return value and parameters alignment).

2) IR pass nvptx-lower-args. Here we change alignment of byval parameters that
   belong to param space (or are casted to it). We only handle cases when all
   uses of such parameters are loads from it. For such loads, we can change the
   alignment according to special type alignment and the load offset. Then,
   load-store-vectorizer IR pass will perform vectorization where alignment
   allows it.

Special alignment calculated as maximum from default ABI type alignment and
alignment 16. Alignment 16 is chosen because it's the maximum size of
vectorized ld.param & st.param.

Before specifying such special alignment, we should check if it is a multiple
of the alignment that the type already has. For example, if a value has an
enforced alignment of 64, default ABI alignment of 4 and special alignment
of 16, we should preserve 64.

This patch will be followed by a refactoring patch that removes duplicating
code in handling byval and non-byval arguments.

Differential Revision: https://reviews.llvm.org/D120129
2022-03-24 12:36:52 +03:00
Daniil Kovalev
a034878564 Revert "[NVPTX] Enhance vectorization of ld.param & st.param"
This reverts commit f854434f0f2a01027bdaad8e6fdac5a782fce291.

Placed URL to wrong differential revision in commit message.
2022-03-24 12:32:06 +03:00
Daniil Kovalev
f854434f0f [NVPTX] Enhance vectorization of ld.param & st.param
Since function parameters and return values are passed via param space, we
can force special alignment for values hold in it which will add vectorization
options. This change may be done if the function has private or internal
linkage. Special alignment is forced during 2 phases.

1) Instruction selection lowering. Here we use special alignment for function
   prototypes (changing both own return value and parameters alignment), call
   lowering (changing both callee's return value and parameters alignment).

2) IR pass nvptx-lower-args. Here we change alignment of byval parameters that
   belong to param space (or are casted to it). We only handle cases when all
   uses of such parameters are loads from it. For such loads, we can change the
   alignment according to special type alignment and the load offset. Then,
   load-store-vectorizer IR pass will perform vectorization where alignment
   allows it.

Special alignment calculated as maximum from default ABI type alignment and
alignment 16. Alignment 16 is chosen because it's the maximum size of
vectorized ld.param & st.param.

Before specifying such special alignment, we should check if it is a multiple
of the alignment that the type already has. For example, if a value has an
enforced alignment of 64, default ABI alignment of 4 and special alignment
of 16, we should preserve 64.

This patch will be followed by a refactoring patch that removes duplicating
code in handling byval and non-byval arguments.

Differential Revision: https://reviews.llvm.org/D121549
2022-03-24 12:25:36 +03:00
Igor Kudrin
d7681d9f77 [NVPTX] Avoid a crash when 'llc' is called with '-filetype=null'
For '-filetype=null', 'NVPTXTargetStreamer' is not created, so the
return value of 'OutStreamer->getTargetStreamer()' should be checked
before calling the methods.

Differential Revision: https://reviews.llvm.org/D122001
2022-03-22 16:46:47 +04:00
Kristina Bessonova
57aaab3b17 [NVPTX] Fix nvvm.match.sync*.i64 intrinsics return type (i64 -> i32)
NVVM IR specification defines them with i32 return type:

  declare i32 @llvm.nvvm.match.any.sync.i64(i32 %membermask, i64 %value)
  declare {i32, i1} @llvm.nvvm.match.all.sync.i64(i32 %membermask, i64 %value)
  ...
  The i32 return value is a 32-bit mask where bit position in mask corresponds
  to thread’s laneid.

as well as PTX ISA:

  9.7.12.8. Parallel Synchronization and Communication Instructions: match.sync

  match.any.sync.type  d, a, membermask;
  match.all.sync.type  d[|p], a, membermask;
  ...
  Destination d is a 32-bit mask where bit position in mask corresponds
  to thread’s laneid.

Additionally, ptxas doesn't accept intructions, produced by NVPTX backend.
After this patch, it compiles with no issues.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D120499
2022-03-01 12:26:16 +02:00
Kristina Bessonova
3fe6f9388f [NVPTX][AsmPrinter] Emit .attribute(.managed) for global variable declarations
Declaration and definition attributes must match,
otherwise it may cause issues on linking.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D120493
2022-02-25 10:21:31 +02:00
Nicolas Miller
69a8350c23 [NVPTX] Add ex2.approx.f16/f16x2 support
his patch adds builtins and intrinsics for the f16 and f16x2 variants of the ex2
instruction.

These two variants were added in PTX7.0, and are supported by sm_75 and above.

Note that this isn't wired with the exp2 llvm intrinsic because the ex2
instruction is only available in its approx variant.

Running ptxas on the assembly generated by the test f16-ex2.ll works as
expected.

Differential Revision: https://reviews.llvm.org/D119157
2022-02-23 13:56:53 -08:00
Jakub Chlanda
be672934ff [NVPTX] Add more FMA intriniscs/builtins
This patch adds builtins/intrinsics for the following variants of FMA:

- f16, f16x2
  - rn
  - rn_ftz
  - rn_sat
  - rn_ftz_sat
  - rn_relu
  - rn_ftz_relu
- bf16, bf16x2
  - rn
  - rn_relu

ptxas (Cuda compilation tools, release 11.0, V11.0.194) is happy with the generated assembly.

Differential Revision: https://reviews.llvm.org/D118977
2022-02-23 13:56:53 -08:00
Jakub Chlanda
e0dc4ac28f [NVPTX] Expose float tys min, max, abs, neg as builtins
Adds support for the following builtins:

- abs, neg:
- .bf16,
- .bf16x2
- min, max
- {.ftz}{.NaN}{.xorsign.abs}.f16
- {.ftz}{.NaN}{.xorsign.abs}.f16x2
- {.NaN}{.xorsign.abs}.bf16
- {.NaN}{.xorsign.abs}.bf16x2
- {.ftz}{.NaN}{.xorsign.abs}.f32

Differential Revision: https://reviews.llvm.org/D117887
2022-02-23 13:56:53 -08:00
Dmitry Vassiliev
885140171a [NVPTX] Fix NVPTXReplaceImageHandles for multiple uses of a texref
The texsurf_handle is removed by NVPTXReplaceImageHandles.cpp. There are more than one uses of the texsurf_handle, one of them is a regular function call, and one of them is a texture intrinsic.
The current hacky logic in NVPTXReplaceImageHandles.cpp for CUDA cannot handle such a mixed use. This patch fixes this issue.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D119635
2022-02-15 01:30:13 +03:00
Dmitry Vassiliev
6645bfa8f5 [NVPTX] Fix bug with int_nvvm_rotate_b64 when operand immediate
Need to subract from 64, not 32.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D119639
2022-02-15 01:23:11 +03:00
Fangrui Song
a00ae86ab2 Revert D119669 "[NVPTX] Prefix "$L__" for branch label names"
This reverts commit cccef321096c20825fe8738045c1d91d3b9fd57d.

Broke clang-cuda-t4

```
/buildbot/cuda-t4-0/work/clang-cuda-t4/clang/bin/clang++  -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time -UNDEBUG --cuda-path=/buildbot/cuda-t4-0/work/clang-cuda-t4/external/cuda/cuda-11.0 -I/buildbot/cuda-t4-0/work/clang-cuda-t4/external/cuda/cuda-11.0/include --cuda-gpu-arch=sm_75 -std=c++20 -stdlib=libstdc++ --gcc-toolchain=/buildbot/cuda-t4-0/work/clang-cuda-t4/external/cuda/gcc-8 -DSTDLIB_VERSION=2014 -MD -MT External/CUDA/CMakeFiles/complex-cuda-11.0-c++20-libstdc++-8.dir/complex.cu.o -MF External/CUDA/CMakeFiles/complex-cuda-11.0-c++20-libstdc++-8.dir/complex.cu.o.d -o External/CUDA/CMakeFiles/complex-cuda-11.0-c++20-libstdc++-8.dir/complex.cu.o -c /buildbot/cuda-t4-0/work/clang-cuda-t4/llvm-test-suite/External/CUDA/complex.cu
ptxas /tmp/complex-cfa050/complex-sm_75.s, line 250; fatal   : Parsing error near '$L__BB6_2': syntax error
ptxas fatal   : Ptx assembly aborted due to errors
```
2022-02-14 13:23:22 -08:00
Dmitry Vassiliev
cccef32109 [NVPTX] Prefix "$L__" for branch label names
A global variable may have the same name as a label, and ptxas does not accept it.
Prefix labels with $L__ to fix this.

Reviewed By: MaskRay, tra

Differential Revision: https://reviews.llvm.org/D119669
2022-02-14 23:51:36 +03:00
Nikita Popov
1c729d719a [NVPTX] Use align attribute for kernel pointer arg alignment
Instead of determining the alignment based on the pointer element
type (which is incompatible with opaque pointers), make use of
alignment annotations added by the frontend.

In particular, clang will add alignment attributes to OpenCL kernels
since D118894. Other frontends might need to be adjusted to add
the attribute as well.

Differential Revision: https://reviews.llvm.org/D119247
2022-02-10 11:56:48 +01:00
Daniil Kovalev
0f9109cc9d [NVPTX] Eliminate StoreRetval instructions with undef operand
Previously a lot of StoreRetval instructions with undef operand were
generated on NVPTX target when a big struct was returned by value.
It resulted in a lot of unneeded st.param.* instructions in final
assembly. The patch solves the issue by implementing the logic in
NVPTX-specific part of DAG combiner.

Differential Revision: https://reviews.llvm.org/D118973
2022-02-10 11:39:43 +03:00
Christian Sigg
f7da4a5d4d [NVPTX] Remove fmin/fmax.NaN.f64 again
Added in https://reviews.llvm.org/D117204, but it does not exist.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D118398
2022-01-28 07:46:16 +01:00
Christian Sigg
dc441d776f [NVPTX] NFC: Remove unused arguments and attribute from test 2022-01-26 15:57:27 +01:00
Jack Kirk
bef3eb8344 [Clang][NVPTX]Add NVPTX intrinsics and builtins for CUDA PTX cvt sm80 instructions
Adds NVPTX intrinsics and builtins for CUDA PTX cvt instructions for sm80
architectures and above. Requires ptx 7.0.

PTX ISA description of cvt instructions :
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cvt

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Differential Revision: https://reviews.llvm.org/D116673
2022-01-13 13:29:48 -08:00
Christian Sigg
ffee3b2f7a [NVPTX] Add version test for sm_75, sm_80, sm_86.
Combine the sm-version tests into a single file.

Reviewed By: bkramer, tra

Differential Revision: https://reviews.llvm.org/D117198
2022-01-13 20:24:09 +01:00
Christian Sigg
efb8d4cff3 [NVPTX] Add fmin/fmax.NaN lowering for sm_80+.
Reviewed By: bkramer, tra

Differential Revision: https://reviews.llvm.org/D117204
2022-01-13 20:22:41 +01:00
Christian Sigg
cc1b9acf55 [NVPTX] Lower fp16 fminnum, fmaxnum to native on sm_80.
Reviewed By: bkramer, tra

Differential Revision: https://reviews.llvm.org/D117122
2022-01-13 08:52:31 +01:00
Andrew Savonichev
e29ba97d23 [NVPTX] Auto-generate tests for sufrace and texture instructions
The patch adds LIT tests for SULD, SUST, TEX and TLD4 instructions as
a follow up for D112232. There are a number of FIXME marks that
highlight possible bugs or missed instruction variants.

Differential Revision: https://reviews.llvm.org/D114367
2021-12-07 15:27:51 +03:00
Andrew Savonichev
00aa0aeb06 [NVPTX] Add imm variants for surface and texture instructions
Texture/sampler/surface operands can be either a register or an
immediate (an index of .texref, .samplerref or .surfref).

TableGen declarations for these instructions used to only have
Int64Regs operands, so this caused issues when machine verifier
is turned on:

    *** Bad machine code: Expected a register operand. ***
    - function:    bar
    - basic block: %bb.0  (0x55b144d99ab8)
    - instruction: %4:int32regs = SULD_1D_I32_TRAP 0, killed %2:int32regs
    - operand 1:   0

The solution is to duplicate these instructions for all possible
operand types (i16imm and Int64Regs). Since this would
essentially double the amount code in TableGen, the patch also
does some refactoring for the original instructions to keep
things manageable.

Differential Revision: https://reviews.llvm.org/D112232
2021-11-10 19:05:03 +03:00
Andrew Savonichev
123ad720f1 [NVPTX] Mark special registers as reserved
A reserved register:
 - is not allocatable
 - is considered always live
 - is ignored by liveness tracking

NVPTX special registers match the criteria, and marking them as
reserved helps to avoid machine verifier error:

    *** Bad machine code: Using an undefined physical register ***
    - function:    foo
    - basic block: %bb.0  (0x557bb178b708)
    - instruction: %0:int32regs = MOV_SPECIAL $envreg0
    - operand 1:   $envreg0

Differential Revision: https://reviews.llvm.org/D113008
2021-11-03 15:48:04 +03:00
Andrew Savonichev
0e70785538 [NVPTX] Add MoveParam instruction for TargetExternalSymbol operand
TargetExternalSymbol is considered to be an immediate and not a
register, so machine verifier emits an error:

    *** Bad machine code: Expected a register operand. ***
    - function:    static_offset
    - basic block: %bb.0 bb (0x560e9b306028)
    - instruction: %3:int64regs = MoveParamI64 &static_offset_param_1
    - operand 1:   &static_offset_param_1

The patch adds variants of this instruction with an immediate operand
for byval arguments on 64-bit and 32-bit targets.

Differential Revision: https://reviews.llvm.org/D113006
2021-11-03 14:43:41 +03:00
Andrew Savonichev
30a3a17df8 [NVPTX] Copy machine operand flags in TII::insertBranch
Before this patch, flags such as undef were dropped by TII::insertBranch
(used by BranchFolding pass), resulting in the following error from
machine verifier:

    *** Bad machine code: Reading virtual register without a def ***
    - function:    hoge
    - basic block: %bb.0 bb (0x562e9c240e68)
    - instruction: CBranch %2:int1regs, %bb.3
    - operand 0:   %2:int1regs

Differential Revision: https://reviews.llvm.org/D113001
2021-11-03 12:38:27 +03:00
Artem Belevich
b6b7fe60a4 [NVPTX] Add a late SROA pass which allows optimizing away more allocas.
Fixes performance regression https://bugs.llvm.org/show_bug.cgi?id=52037

Differential Revision: https://reviews.llvm.org/D111471
2021-10-19 16:18:28 -07:00
Arthur Eubanks
15fefcb9eb [opt] Directly translate -O# to -passes='default<O#>'
Right now when we see -O# we add the corresponding 'default<O#>' into
the list of passes to run when translating legacy -pass-name. This has
the side effect of not using the default AA pipeline.

Instead, treat -O# as -passes='default<O#>', but don't allow any other
-passes or -pass-name. I think we can keep `opt -O#` as shorthand for
`opt -passes='default<O#>` but disallow anything more than just -O#.

Tests need to be updated to not use `opt -O# -pass-name`.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D112036
2021-10-18 16:48:10 -07:00
Andrew Savonichev
51eefa8164 [NVPTX] Add VRFrame and VRFrameLocal to integer register classes
These registers are used as operands for instructions that expect an
integer register, so they should be added to Int32Regs or Int64Regs
register classes. Otherwise the machine verifier emits an error for
the following LIT tests when LLVM_ENABLE_MACHINE_VERIFIER=1
environment variable is set:

*** Bad machine code: Illegal physical register for instruction ***
- function:    kernel_func
- basic block: %bb.0 entry (0x55c8903d5438)
- instruction: %3:int64regs = LEA_ADDRi64 $vrframelocal, 0
- operand 1:   $vrframelocal
$vrframelocal is not a Int64Regs register.

    CodeGen/NVPTX/call-with-alloca-buffer.ll
    CodeGen/NVPTX/disable-opt.ll
    CodeGen/NVPTX/lower-alloca.ll
    CodeGen/NVPTX/lower-args.ll
    CodeGen/NVPTX/param-align.ll
    CodeGen/NVPTX/reg-types.ll
    DebugInfo/NVPTX/dbg-declare-alloca.ll
    DebugInfo/NVPTX/dbg-value-const-byref.ll

Differential Revision: https://reviews.llvm.org/D110164
2021-10-14 16:19:03 +03:00
Jake Egan
56049b7129 Fix tests defaulting to incorrect triples on AIX
The tests only specify -march, so when the tests are run on AIX the target OS defaults to AIX, which causes the tests to misbehave.

This patch constrains the tests by specifying -mtriple instead of -march.

Reviewed By: daltenty, jsji, MaskRay

Differential Revision: https://reviews.llvm.org/D110186
2021-09-27 11:30:45 -04:00
Artem Belevich
d99a83b4e5 [NVPTX] Simplify and generalize constant printer.
This allows handling i128 values and fixes
https://bugs.llvm.org/show_bug.cgi?id=51789.

Differential Revision: https://reviews.llvm.org/D109458
2021-09-09 11:30:19 -07:00
Steffen Larsen
1b4c85fc02 [NVPTX] Add NVPTX intrinsics for CUDA PTX 6.5 ldmatrix instructions
Adds NVPTX intrinsics for the CUDA PTX `ldmatrix.sync.aligned` instructions added in PTX 6.5.

PTX ISA description of `ldmatrix.sync.aligned`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix

Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D107046
2021-08-06 16:13:35 -07:00
Eli Friedman
bdd55b2f18 Fix the default alignment of i1 vectors.
Currently, the default alignment is much larger than the actual size of
the vector in memory.  Fix this to use a sane default.

For SVE, temporarily remove lowering of load/store operations for
predicates with less than 16 elements. The layout the backend was
assuming for SVE predicates with less than 16 elements doesn't agree
with the frontend. More work probably needs to be done here.

This change is, strictly speaking, not backwards-compatible at the
bitcode level. But probably nobody is actually depending on that; i1
vectors in memory are rare, and the code that does use them probably
ends up forcing the alignment to something sane anyway.  If we think
this is a concern, I can restrict this to scalable vectors for now
(where it's actually causing issues for me at the moment).

Differential Revision: https://reviews.llvm.org/D88994
2021-07-31 14:09:59 -07:00
Simon Pilgrim
fcb710a7ad [NVPTX] Add select(cc,binop(),binop()) fast-math tests
As discussed on D106058 - we're not propagating the common flags to the merged binop
2021-07-18 15:30:24 +01:00
Artem Belevich
d774b4aa5e [NVPTX, CUDA] Add .and.popc variant of the b1 MMA instruction.
That should allow clang to compile mma.h from CUDA-11.3.

Differential Revision: https://reviews.llvm.org/D105384
2021-07-15 12:02:09 -07:00
Simon Pilgrim
3cc38703d5 [NVPTX] Tweak fast-math tests to avoid select(binop(x,y),binop(x,z)) fold
As suggested on D106058, tweak the tests to keep the combineRepeatedFPDivisors test coverage.
2021-07-15 15:42:25 +01:00
Simon Pilgrim
e21663d32b [NVPTX] Add selp.f32 checks to select(cond,fpbinop(),fpbinop()) tests
Will help show codegen diffs in an upcoming patch
2021-07-15 12:42:29 +01:00
Tom Stellard
7f1c077c30 tests/CodeGen: Use %python lit substitution when invoking python
This will use the python that LLVM was configured to use rather than
python from PATH.

Reviewed By: serge-sans-paille

Differential Revision: https://reviews.llvm.org/D105224
2021-07-06 18:46:36 -07:00
Steffen Larsen
3644726a78 [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA and MMA instructions
Adds NVPTX builtins and intrinsics for the CUDA PTX `wmma.load`, `wmma.store`, `wmma.mma`, and `mma` instructions added in PTX 6.5 and 7.0.

PTX ISA description of

  - `wmma.load`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-ld
  - `wmma.store`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-st
  - `wmma.mma`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma
  - `mma`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma

Overview of `wmma.mma` and `mma` matrix shape/type combinations added with specific PTX versions: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape

Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>
Co-Authored-by: Stuart Adams <stuart.adams@codeplay.com>

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D104847
2021-06-29 15:44:07 -07:00
Bjorn Pettersson
4c7f820b2b Update @llvm.powi to handle different int sizes for the exponent
This can be seen as a follow up to commit 0ee439b705e82a4fe20e2,
that changed the second argument of __powidf2, __powisf2 and
__powitf2 in compiler-rt from si_int to int. That was to align with
how those runtimes are defined in libgcc.
One thing that seem to have been missing in that patch was to make
sure that the rest of LLVM also handle that the argument now depends
on the size of int (not using the si_int machine mode for 32-bit).
When using __builtin_powi for a target with 16-bit int clang crashed.
And when emitting libcalls to those rtlib functions, typically when
lowering @llvm.powi), the backend would always prepare the exponent
argument as an i32 which caused miscompiles when the rtlib was
compiled with 16-bit int.

The solution used here is to use an overloaded type for the second
argument in @llvm.powi. This way clang can use the "correct" type
when lowering __builtin_powi, and then later when emitting the libcall
it is assumed that the type used in @llvm.powi matches the rtlib
function.

One thing that needed some extra attention was that when vectorizing
calls several passes did not support that several arguments could
be overloaded in the intrinsics. This patch allows overload of a
scalar operand by adding hasVectorInstrinsicOverloadedScalarOpd, with
an entry for powi.

Differential Revision: https://reviews.llvm.org/D99439
2021-06-17 09:38:28 +02:00
serge-sans-paille
4ab3041acb Revert "[NFC] remove explicit default value for strboolattr attribute in tests"
This reverts commit bda6e5bee04c75b1f1332b4fd1ac4e8ef6c3c247.

See https://lab.llvm.org/buildbot/#/builders/109/builds/15424 for instance
2021-05-24 19:43:40 +02:00
serge-sans-paille
bda6e5bee0 [NFC] remove explicit default value for strboolattr attribute in tests
Since d6de1e1a71406c75a4ea4d5a2fe84289f07ea3a1, no attributes is quivalent to
setting attribute to false.

This is a preliminary commit for https://reviews.llvm.org/D99080
2021-05-24 19:31:04 +02:00
thomasraoux
505933a489 [NVPTX] Fix lowering of frem for negative values
to match fmod frem result must have the dividend sign. Previous implementation
had the wrong sign when passing negative numbers. For ex: frem(-16, 7) was
returning 5 instead of -2. We should just a ftrunc instead of floor when
lowering to get the right behavior.

Differential Revision: https://reviews.llvm.org/D102528
2021-05-24 07:45:03 -07:00
Steffen Larsen
f226e28a88 [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX redux.sync instructions
Adds NVPTX builtins and intrinsics for the CUDA PTX `redux.sync` instructions
for `sm_80` architecture or newer.

PTX ISA description of `redux.sync`:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-redux-sync

Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>

Differential Revision: https://reviews.llvm.org/D100124
2021-05-17 09:46:59 -07:00
Stuart Adams
02c2468864 [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX cp.async instructions
Adds NVPTX builtins and intrinsics for the CUDA PTX `cp.async` instructions for
`sm_80` architecture or newer.

PTX ISA description of `cp.async`:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-cp-async-mbarrier-arrive

Authored-by: Stuart Adams <stuart.adams@codeplay.com>
Co-Authored-by: Alexander Johnston <alexander@codeplay.com>

Differential Revision: https://reviews.llvm.org/D100394
2021-05-17 09:46:59 -07:00
William S. Moses
7aa3cad46a [NVPTX] Enable lowering of atomics on local memory
LLVM does not have valid assembly backends for atomicrmw on local memory. However, as this memory is thread local, we should be able to lower this to the relevant load/store.

Differential Revision: https://reviews.llvm.org/D98650
2021-04-26 20:12:12 -04:00
William S. Moses
8ede96493c Revert "[NVPTX] Enable lowering of atomics on local memory"
This reverts commit fede99d386ec9e7bab2762aa16cb10c0513ae464.
2021-04-26 19:33:01 -04:00
William S. Moses
fede99d386 [NVPTX] Enable lowering of atomics on local memory
LLVM does not have valid assembly backends for atomicrmw on local memory. However, as this memory is thread local, we should be able to lower this to the relevant load/store.

Differential Revision: https://reviews.llvm.org/D98650
2021-04-26 19:27:27 -04:00