24 Commits

Author SHA1 Message Date
Matt Arsenault
51b4ada458
clang/AMDGPU: Set noalias.addrspace metadata on atomicrmw (#102462) 2024-10-17 17:10:45 +04:00
Matt Arsenault
d50302f31c
clang/AMDGPU: Stop emitting amdgpu-unsafe-fp-atomics attribute (#111579) 2024-10-09 08:52:32 +04:00
Matt Arsenault
e108853ac8
clang: Allow targets to set custom metadata on atomics (#96906)
Use this to replace the emission of the amdgpu-unsafe-fp-atomics
attribute in favor of per-instruction metadata. In the future
new fine grained controls should be introduced that also cover
the integer cases.

Add a wrapper around CreateAtomicRMW that appends the metadata,
and update a few use contexts to use it.
2024-07-26 09:57:28 +04:00
Mariya Podchishchaeva
6d973b4548
[clang][CodeGen] Return RValue from EmitVAArg (#94635)
This should simplify handling of resulting value by the callers.
2024-06-17 13:29:20 +02:00
Jon Chesterfield
8516f54e6a
[AMDGPU] Implement variadic functions by IR lowering (#93362)
This is a mostly-target-independent variadic function optimisation and
lowering pass. It is only enabled for AMDGPU in this initial commit.

The purpose is to make C style variadic functions a zero cost
abstraction. They are lowered to equivalent IR which is then amenable to
other optimisations. This is inherently slightly target specific but
much less so than one might expect - the C varargs interface heavily
constrains the ABI design divergence.

The pass is primarily tested from webassembly. This is because wasm has
a straightforward variadic lowering strategy which coincides exactly
with what this pass transforms code into and a struct passing convention
with few cases to check. Adding further targets conventions is
straightforward and elided from this patch primarily to simplify the
review. Implemented in other branches are Linux X86, AMD64, AArch64 and
NVPTX.

Testing for targets that have existing lowering for va_arg from clang is
most efficiently done by checking that clang | opt completely elides the
variadic syntax from test cases. The lowering produces a struct for each
call site which can be inspected to check the various alignment and
indirections are correct.

AMDGPU presently has no variadic support other than some ad hoc printf
handling. Combined with the pass being inactive on all other targets
landing this represents strict increase in capability with zero risk.
Testing and refining will continue post commit.

In addition to the compiler tests included here, a self contained x64
clang/musl toolchain was constructed using the "lowering" instead of the
systemv ABI and used to build various C programs like lua and libxml2.
2024-06-06 10:44:53 +01:00
Jon Chesterfield
794457f6f9
[amdgpu] Pass variadic arguments without splitting (#94083)
Pass variadic arguments without changing their type, unlike the fixed
ones.

Fixed arguments are modified to better fit into registers. This patch
leaves those unchanged.

Splitting struct types into individual fields and packing small structs
into integers works well for passing via registers. Variadic arguments
are currently unimplemented in the backend. They're likely to be
implemented as a pointer to stack memory in which case register-themed
optimisations are inapplicable.

Splitting the struct into fields makes it difficult to implement va_arg
robustly. The rules around padding and alignment to inverse the struct
splitting could be constructed, but at high complexity and no particular
advantage.

Passing types as-is means there is a 1:1 correspondence with the type
information va_arg has to work with and the parameter type at the call
site.

This is an ABI change, but as the only functions affected are variadic
ones which are presently a compilation error, not a functional break.
Factored out of the larger #93362 and can land independently.
2024-06-04 13:10:10 +01:00
Jun Wang
c4e517f59c
[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2024-03-12 10:30:39 -07:00
Joseph Huber
4e80bc7d71
[Clang] Introduce scoped variants of GNU atomic functions (#72280)
Summary:
The standard GNU atomic operations are a very common way to target
hardware atomics on the device. With more heterogenous devices being
introduced, the concept of memory scopes has been in the LLVM language
for awhile via the `syncscope` modifier. For targets, such as the GPU,
this can change code generation depending on whether or not we only need
to be consistent with the memory ordering with the entire system, the
single GPU device, or lower.

Previously these scopes were only exported via the `opencl` and `hip`
variants of these functions. However, this made it difficult to use
outside of those languages and the semantics were different from the
standard GNU versions. This patch introduces a `__scoped_atomic` variant
for the common functions. There was some discussion over whether or not
these should be overloads of the existing ones, or simply new variants.
I leant towards new variants to be less disruptive.

The scope here can be one of the following

```
__MEMORY_SCOPE_SYSTEM // All devices and systems
__MEMORY_SCOPE_DEVICE // Just this device
__MEMORY_SCOPE_WRKGRP // A 'work-group' AKA CUDA block
__MEMORY_SCOPE_WVFRNT // A 'wavefront' AKA CUDA warp
__MEMORY_SCOPE_SINGLE // A single thread.
```
Naming consistency was attempted, but it is difficult to capture to full
spectrum with no many names. Suggestions appreciated.
2023-12-07 13:40:25 -06:00
Dominik Adamski
95943d2fab [Flang] Add code-object-version option (#72638)
Information about code object version can be configured by the user for
AMD GPU target and it needs to be placed in LLVM IR generated by Flang.

Information about code object version in MLIR generated by the parser
can be reused by other tools. There is no need to specify extra flags if
we want to invoke MLIR tools (like fir-opt) separately.

Changes in comparison to a8ac93:
 * added information about required targets for test
   flang/test/Driver/driver-help.f90
2023-11-29 03:01:01 -06:00
Dominik Adamski
f00ffcdb58 Revert "[Flang] Add code-object-version option (#72638)"
This commit causes test errors on buildbots.

This reverts commit a8ac930b99d93b2a539ada7e566993d148899144.
2023-11-28 13:18:46 -06:00
Dominik Adamski
a8ac930b99
[Flang] Add code-object-version option (#72638)
Information about code object version can be configured by the user for
AMD GPU target and it needs to be placed in LLVM IR generated by Flang.

Information about code object version in MLIR generated by the parser
can be reused by other tools. There is no need to specify extra flags if
we want to invoke MLIR tools (like fir-opt) separately.
2023-11-28 19:57:36 +01:00
Saiyedul Islam
21861991e7
[OpenMP] Cleanup and fixes for ABI agnostic DeviceRTL (#71234)
Fixes the DeviceRTL compilation to ensure it is ABI agnostic. Uses
already available global variable "oclc_ABI_version" instead of
"llvm.amdgcn.abi.verion".

It also adds some minor fields in ImplicitArg structure.
2023-11-09 10:34:35 +05:30
Johannes Doerfert
0ba57c8bba
[OpenMP] Pass min/max thread and team count to the OMPIRBuilder (#70247)
We now provide the information about the min/max thread and team count
from to the OMPIRBuilder, no matter what the source was. That means we
unify `thread_limit`, `num_teams`, `num_threads` handling with the
target specific attriutes (`__launch_bounds__` and
`amdgpu_flat_work_group_size`). This is in preparation to pass the
values to the runtime, and to allow the middle-end (OpenMP-opt) to
tighten the values if it seems appropriate. There is no "real" change
after this commit.
2023-10-26 14:45:07 -07:00
Joseph Huber
1d959f9327
[OpenMP] Prevent AMDGPU from overriding visibility on DT_nohost variables (#68264)
Summary:
There's some logic in the AMDGPU target that manually resets the
requested visibility of certain variables. This was triggering when we
set a constant variable in OpenMP. However, we shouldn't do this for
OpenMP when the variable has the `nohost` type. That implies that the
variable is not visible to the host and therefore does not need to be
visible, so we should respect the original value of it.
2023-10-05 17:10:03 -05:00
Joseph Huber
1b7a095e27
[Clang][AMDGPU] Permit language address spaces for AMDGPU globals (#66205)
Summary:
Currently, there is an assertion that prevents us from emitting an
AMDGPU global with a non-target specific address space (i.e. numerical
attribute). I'm unsure what the original intentions of this assertion
were, but we should be able to use OpenCL address spaces when compiling
directly to AMDGPU from C++. This is permitted on NVPTX so I'm unsure
what this assertion is guarding. The patch simply removes the assertion
and adds a test to ensure that these emit the expected address spaces.

Fixes https://github.com/llvm/llvm-project/issues/65069
2023-09-13 08:43:01 -05:00
Joseph Huber
49ff6a96a7
[Clang] Define AMDGPU ABI when referenced in CodeGen for ABI "none" (#66162)
Summary:
We use the `llvm.amgcn.abi.version` varaible to control code generation.
This is emitted in every module now to indicate what should be used when
compiling. Previously, the logic caused us to emit an external reference
to this variable when creating the code for the `none` type. This would
then cause us not to emit the actual definition. This patch refines the
logic to create the external reference, and then update it if it is
found unset by the time we emit the global. I had to remove the
reference to `GetOrCreateLLVmGlobal` because it did not accept the
proper address space.
2023-09-13 08:31:31 -05:00
Saiyedul Islam
f616c3eeb4
[OpenMP][DeviceRTL][AMDGPU] Support code object version 5
Update DeviceRTL and the AMDGPU plugin to support code
object version 5. Default is code object version 4.

CodeGen for __builtin_amdgpu_workgroup_size generates code
for cov4 as well as cov5 if -mcode-object-version=none
is specified. DeviceRTL compilation passes this argument
via Xclang option to generate abi-agnostic code.

Generated code for the above builtin uses a clang
control constant "llvm.amdgcn.abi.version" to branch on
the abi version, which is available during linking of
user's OpenMP code. Load of this constant gets eliminated
during linking.

AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.

Differential Revision: https://reviews.llvm.org/D139730

Reviewed By: jhuber6, yaxunl
2023-08-29 06:35:44 -05:00
David Blaikie
19f2b68095 Make globals with mutable members non-constant, even in custom sections
Turned out we were making overly simple assumptions about which sections (& section flags) would be used when emitting a global into a custom section. This lead to sections with read-only flags being used for globals of struct types with mutable members.

Fixed by porting the codegen function with the more nuanced handling/checking for mutable members out of codegen for use in the sema code that does this initial checking/mapping to section flags.

Differential Revision: https://reviews.llvm.org/D156726
2023-08-14 22:25:42 +00:00
Changpeng Fang
d77c62053c [clang][AMDGPU]: Don't use byval for struct arguments in function ABI
Summary:
  Byval requires allocating additional stack space, and always requires an implicit copy to be inserted in codegen,
where it can be difficult to optimize. In this work, we use byref/IndirectAliased promotion method instead of
byval with the implicit copy semantics.

Reviewers:
  arsenm

Differential Revision:
  https://reviews.llvm.org/D155986
2023-08-11 16:37:42 -07:00
Yaxun (Sam) Liu
ac72531043 [Driver] Add -f[no-]offload-uniform-block
By default, clang assumes HIP kernels are launched with uniform block size,
which is the case for kernels launched through triple chevron or
hipLaunchKernelGGL. Clang adds uniform-work-group-size function attribute
to HIP kernels to allow the backend to do optimizations on that.

However, in some rare cases, HIP kernels can be launched
through hipExtModuleLaunchKernel where global work size is specified,
which may result in non-uniform block size.

To be able to support non-uniform block size for HIP kernels,
an option `-f[no-]offload-uniform-block is added. This option
is generic for offloading languages. Its default value is on for
CUDA/HIP and off otherwise.

Make -cl-uniform-work-group-size an alias to -foffload-uniform-block.

Reviewed by: Siu Chi Chan, Matt Arsenault, Fangrui Song, Johannes Doerfert

Differential Revision: https://reviews.llvm.org/D155213

Fixes: SWDEV-406592
2023-07-27 16:36:02 -04:00
Johannes Doerfert
08a220764b Reapply "[OpenMP] Add the ompx_attribute clause for target directives"
This reverts commit 0d12683046ca75fb08e285f4622f2af5c82609dc and
reapplies ef9ec4bbcca2fa4f64df47bc426f1d1c59ea47e2 with an extension to
fix the Flang build.

Differential Revision: https://reviews.llvm.org/D156184
2023-07-25 10:40:35 -07:00
Aaron Ballman
0d12683046 Revert "[OpenMP] Add the ompx_attribute clause for target directives"
This reverts commit ef9ec4bbcca2fa4f64df47bc426f1d1c59ea47e2.

The changes broke several bots:
https://lab.llvm.org/buildbot/#/builders/176/builds/3408
https://lab.llvm.org/buildbot/#/builders/198/builds/4028
https://lab.llvm.org/buildbot/#/builders/197/builds/8491
https://lab.llvm.org/buildbot/#/builders/197/builds/8491
2023-07-25 07:57:36 -04:00
Johannes Doerfert
ef9ec4bbcc [OpenMP] Add the ompx_attribute clause for target directives
CUDA and HIP have kernel attributes to tune the code generation (in the
backend). To reuse this functionality for OpenMP target regions we
introduce the `ompx_attribute` clause that takes these kernel
attributes and emits code as if they had been attached to the kernel
fuction (which is implicitly generated).

To limit the impact, we only support three kernel attributes:
`amdgpu_waves_per_eu`, for AMDGPU
`amdgpu_flat_work_group_size`, for AMDGPU
`launch_bounds`, for NVPTX

The existing implementations of those attributes are used for error
checking and code generation. `ompx_attribute` can be attached to any
executable target region and it can hold more than one kernel attribute.

Differential Revision: https://reviews.llvm.org/D156184
2023-07-24 22:04:45 -07:00
Sergei Barannikov
992cb98462 [clang][CodeGen] Break up TargetInfo.cpp [8/8]
This commit breaks up CodeGen/TargetInfo.cpp into a set of *.cpp files,
one file per target. There are no functional changes, mostly just code moving.

Non-code-moving changes are:
* A virtual destructor has been added to DefaultABIInfo to pin the vtable to a cpp file.
* A few methods of ABIInfo and DefaultABIInfo were split into declaration + definition
  in order to reduce the number of transitive includes.
* Several functions that used to be static have been placed in clang::CodeGen
  namespace so that they can be accessed from other cpp files.

RFC: https://discourse.llvm.org/t/rfc-splitting-clangs-targetinfo-cpp/69883

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D148094
2023-06-17 07:14:50 +03:00