Summary:
The changes in https://www.github.com/llvm/llvm-project/pull/185552
allowed us to start building the standard `libclang_rt.profile.a` for GPU
targets.
This PR expands this by adding an optimized GPU routine for counter
increment and removing the special-case handling of these functions in
the OpenMP runtime.
The vast majority of these functions are boilerplate, but we should be able
to do more interesting things with this in the future, like value or
memory profiling.
Summary:
The module splitting used for `-flto-partitions=8` support (which is
passed by default) did not correctly handle aliases. We mainly need to
do two things: keep the aliases in the module they are used in and externalize
them. Internalize linkage needs to be handled conservatively.
This is needed because these aliases show up in PGO contexts.
---------
Co-authored-by: Shilei Tian <i@tianshilei.me>
Extend the generic or_disjoint pattern to call haveNoCommonBitsSet. This
allows us to remove the similar x86 or_is_add pattern, use or_disjoint
directly, and merge some add/or_is_add matching patterns to use an
add_like wrapper pattern instead.
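The reason a disjoint `or` can be matched like an `add` can be illustrated with a small sketch. The helper `have_no_common_bits_set` below is a hypothetical stand-in that works on concrete integers; the real `haveNoCommonBitsSet` reasons about known bits of IR values, which this sketch does not attempt to model.

```python
def have_no_common_bits_set(a: int, b: int) -> bool:
    """Hypothetical stand-in: True when a and b share no set bit."""
    return (a & b) == 0


def or_is_add(a: int, b: int) -> bool:
    """With no shared bits there are no carries, so a | b equals a + b."""
    return (a | b) == (a + b)


# Exhaustive check over all pairs of 8-bit values.
for a in range(256):
    for b in range(256):
        if have_no_common_bits_set(a, b):
            assert or_is_add(a, b)
```

Since `a + b == (a | b) + (a & b)`, the two operations agree exactly when the operands have no bit in common.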
This change adds a rewrite that simplifies `memref.copy` operations whose
destination is a scalar view produced by `memref.reinterpret_cast`.
The pattern matches cases where a reinterpret cast creates a scalar view
(`sizes = [1, ..., 1]`) into a memref that has a single non-unit dimension. In
this situation the view refers to exactly one element in the base buffer, so
the accessed address depends only on the base pointer and the offset.
The stride information of the view does not affect the accessed element,
because the only valid index into the view is `[0, ..., 0]`.
Therefore the copy can be rewritten into a direct load from the source and a
store into the base memref using the offset from the reinterpret cast.
This makes the `memref.reinterpret_cast` redundant for the copy and simplifies
the IR.
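The argument that strides are irrelevant for a scalar view can be sketched numerically. This is only an illustration, assuming the base memref has the default contiguous row-major layout; the function names `linear_address` and `delinearize` are hypothetical and not part of the patch.

```python
from typing import Sequence


def linear_address(offset: int, indices: Sequence[int], strides: Sequence[int]) -> int:
    """Element address of a strided view: offset + sum(index * stride)."""
    return offset + sum(i * s for i, s in zip(indices, strides))


def delinearize(offset: int, shape: Sequence[int]) -> list[int]:
    """Convert a linear offset into row-major indices.

    Assumes the base buffer uses the default contiguous row-major layout.
    """
    indices = []
    for dim in reversed(shape):
        indices.append(offset % dim)
        offset //= dim
    return list(reversed(indices))


# The only valid index into a [1, 1] view is [0, 0], so any strides give
# the same address, namely the offset itself.
assert linear_address(0, [0, 0], [108, 1]) == linear_address(0, [0, 0], [1, 1]) == 0
```

Under that assumption, an offset of 5 into a `1x108` base would correspond to the base indices `delinearize(5, [1, 108])`, that is `[0, 5]`.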
Assisted-by: ChatGPT (refine implementation + tests). I reviewed all code and
tests before submission.
### Example
Before:
```mlir
func.func private @concat() {
%src = memref.alloc() : memref<1x1xf32>
%base = memref.alloc() : memref<1x108xf32>
%view = memref.reinterpret_cast %base
to offset: [0], sizes: [1, 1], strides: [108, 1]
: memref<1x108xf32>
to memref<1x1xf32, strided<[108, 1]>>
memref.copy %src, %view
: memref<1x1xf32>
to memref<1x1xf32, strided<[108, 1]>>
}
```
After:
```mlir
func.func private @concat() {
%src = memref.alloc() : memref<1x1xf32>
%base = memref.alloc() : memref<1x108xf32>
%c0 = arith.constant 0 : index
%v = memref.load %src[%c0, %c0] : memref<1x1xf32>
memref.store %v, %base[%c0, %c0] : memref<1x108xf32>
}
```
### Motivation
This rewrite simplifies IR and helps eliminate `memref.reinterpret_cast`
operations in preparation for later lowerings (e.g. EmitC lowering), where
pointer-based access patterns are easier to handle once scalar accesses are
explicit.
### Scope
This rewrite is intentionally narrow:
- It only applies when both source and destination reduce to scalar accesses.
- It does not attempt to rewrite general `memref.copy` operations.
- It does not introduce loops or handle multi-element copies.
The pass currently performs only this transformation, so it is expected to be
used intentionally rather than as part of a broad optimization pipeline.
### Why not use `memref.copy` directly?
`memref.copy` requires source and destination memrefs to have the same shape.
The destination of the copy here is a scalar view derived from a larger memref,
so copying directly into the base memref would violate this requirement.
Instead, the rewrite loads the scalar value from the source and stores it into
the base memref, at the index determined by the reinterpret cast offset.
## Problem
In the MemRef → EmitC conversion, `memref.load` and `memref.store`
assume that the converted memref operand is an `emitc.array`, as defined
by the type conversion in `populateMemRefToEmitCTypeConversion`.
However, `memref.alloc` is lowered to a `malloc` call returning
`emitc.ptr`. When such values are used by `memref.load` or
`memref.store`, the conversion framework inserts a bridging
`builtin.unrealized_conversion_cast` from `emitc.ptr` to `emitc.array`.
These casts have no EmitC representation and therefore remain in the IR
after conversion, preventing valid C/C++ emission.
## Solution
Extend the `memref.load` and `memref.store` conversions to handle
pointer-backed buffers.
If the memref operand is defined by an `UnrealizedConversionCastOp`
whose input is an `emitc.ptr`, the cast is stripped and the underlying
pointer operand is used directly. Since pointer subscripting in EmitC is
one-dimensional, the multi-dimensional memref indices are converted to a
row-major linear index (matching the default memref layout) using the
original `MemRefType` shape before emitting `emitc.subscript`.
The existing array-based lowering path remains unchanged.
This patch intentionally does ***not*** modify the MemRef → EmitC type
conversion rule (`memref → emitc.array`). Instead, the mismatch
introduced by `memref.alloc` returning a pointer is handled locally in
the `LoadOp` and `StoreOp` conversions.
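The row-major linearization described above can be sketched as follows. This is a plain-integer illustration of the index arithmetic only, not the conversion pattern itself; `linearize_row_major` is a hypothetical name.

```python
from typing import Sequence


def linearize_row_major(indices: Sequence[int], shape: Sequence[int]) -> int:
    """Fold multi-dimensional indices into one row-major linear index."""
    if len(indices) != len(shape):
        raise ValueError("rank mismatch between indices and shape")
    linear = 0
    for index, dim in zip(indices, shape):
        # Horner-style accumulation: ((i0 * d1 + i1) * d2 + i2) ...
        linear = linear * dim + index
    return linear
```

For a `4x8` shape and indices `[i, j]` this evaluates to `i * 8 + j`, so `linearize_row_major([2, 3], [4, 8])` gives `19`. For a rank-1 memref the index is passed through unchanged.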
## Example 1: Single-dimensional store
### Input
```mlir
func.func @alloc_store(%arg0: i32, %i: index) {
%alloc = memref.alloc() : memref<999xi32>
memref.store %arg0, %alloc[%i] : memref<999xi32>
return
}
```
### Current lowering
```mlir
// AllocOp conversion unchanged -> excluded for brevity
%5 = builtin.unrealized_conversion_cast %4 : !emitc.ptr<i32> to !emitc.array<999xi32>
%6 = subscript %5[%arg1]
assign %arg0 to %6 : <i32>
```
The `unrealized_conversion_cast` remains in the IR.
### Lowering after this patch
```mlir
%5 = subscript %4[%arg1] : (!emitc.ptr<i32>, !emitc.size_t) -> !emitc.lvalue<i32>
assign %arg0 : i32 to %5 : <i32>
```
The cast is eliminated and pointer subscripting is used directly.
## Example 2: Multi-dimensional store
### Input
```mlir
func.func @memref_alloc_store(%v : f32, %i : index, %j : index) {
%alloc = memref.alloc() : memref<4x8xf32>
memref.store %v, %alloc[%i, %j] : memref<4x8xf32>
return
}
```
### Current lowering
```mlir
// AllocOp conversion unchanged -> excluded for brevity
%5 = builtin.unrealized_conversion_cast %4 : !emitc.ptr<f32> to !emitc.array<4x8xf32>
%6 = subscript %5[%arg1, %arg2] : (!emitc.array<4x8xf32>, !emitc.size_t, !emitc.size_t) -> !emitc.lvalue<f32>
assign %arg0 : f32 to %6 : <f32>
```
### Lowering after this patch
```mlir
%5 = "emitc.constant"() <{value = 8 : index}> : () -> !emitc.size_t
%6 = mul %arg1, %5 : (!emitc.size_t, !emitc.size_t) -> !emitc.size_t
%7 = add %6, %arg2 : (!emitc.size_t, !emitc.size_t) -> !emitc.size_t
%8 = subscript %4[%7] : (!emitc.ptr<f32>, !emitc.size_t) -> !emitc.lvalue<f32>
assign %arg0 : f32 to %8 : <f32>
```
The multi-dimensional indices are converted into a linear row-major
index before pointer subscripting.
Assisted-by: ChatGPT (refine implementation + tests). I reviewed all
code and tests before submission.
Without this, we were assuming that __ulock was unavailable on visionOS
and falling back to the manual implementation, when in reality we can
always rely on the existence of ulock.
Fixes #186467
This patch adds a check to ensure that the addrecs have nsw flags at the
beginning of the Exact SIV test. If either of them lacks the flag, the
analysis bails out. This check is necessary because the subsequent
process in the Exact SIV test assumes that they don't wrap.
This is a follow-up to
https://github.com/swiftlang/llvm-project/pull/12317#discussion_r2850297229
Per that discussion, given that deserializers *can* fail given a corrupt
PDB, it's preferable to handle the error instead of crashing.
This specific change is limited to "easy" changes (read: I have high
confidence in their correctness). The ideal end state is funneling all
errors to a few central places in `SymbolFileNativePDB`.
This PR relaxes the 2d reduction lowering in the peephole optimization
pass to allow the source tensor to have an n-d shape.
It also fixes a minor bug in accumulator lowering in the current
implementation.
Since https://reviews.llvm.org/D144004, DwarfDebug asserts if
function-local imported entities are present in the imports field of
DICompileUnit.
This patch adds a Verifier check to detect such invalid IR earlier.
Incorrect occurrences of imported entities in DICompileUnit's imports
field in llvm/test/Bitcode/DIImportedEntity_elements.ll,
llvm/test/Bitcode/DIModule-fortran-external-module.ll are fixed.
This change is extracted from https://reviews.llvm.org/D144008.
When a friend function template is defined inline inside a
[[clang::suppress]]-annotated class but was forward-declared at
namespace scope, the instantiation's lexical DeclContext was the
namespace (from the forward-declaration), not the class.
The lexical parent chain walk in BugSuppression::isSuppressed therefore
never reached the class and suppression did not apply.
Fix by extending preferTemplateDefinitionForTemplateSpecializations to
handle FunctionDecl instances: calling getTemplateInstantiationPattern()
maps the instantiation back to the primary template FunctionDecl,
whose lexical DC is the class where the friend was defined inline.
The existing parent-chain walk then finds the suppression attribute.
Assisted-By: claude
The OpenACC `routine` directive may specify a `bind(name)` clause to
associate the routine with a different symbol for device code. The
`ACCBindRoutine` pass finds calls inside offload regions that target such
routines and rewrites the callee to the bound symbol.
---------
Co-authored-by: Delaram Talaashrafi <dtalaashrafi@nvidia.com>
Summary:
Currently, the GPU targets ignore the standard profiling arguments. This
PR changes the behavior to use the standard handling, which links in
the now-present `libclang_rt.profile.a` if the user built with the
compiler-rt support enabled. If it is not present this is a linker error
and we can always suppress with `-Xarch_host` and `-Xarch_device`.
Hopefully this doesn't cause some people pain if they're used to doing
`-fprofile-generate` on a CPU unguarded since it was a strange mix of a
no-op and not a no-op on the GPU until now.
Recursively splitting out some work from #183318; this covers
the enums for early exit loop type (none, readonly, readwrite)
and the style used (just readonly and
masked-handle-ee-in-scalar-tail for now) and refactoring for
basic use of those enums.
Summary:
Follow up on removal of OPENMP_STANDALONE_BUILD in openmp (#149878).
This build method is redundant and can be accomplished via runtimes.
Removes support for:
`cmake -S <llvm-project>/offload ...`
Switches over to:
`cmake -S <llvm-project>/runtimes -DLLVM_ENABLE_RUNTIMES=openmp;offload
...`
Libomptarget has a dependency on libomp.so and requires the omp cmake
target to exist at build time, which is why both runtimes are listed.
Updates cmake compiler logic in offload/CMakeLists.txt to mirror openmp
changes:
[openmp] Allow testing OpenMP without a full clang build tree (#182470)
User will still need to have a separate invocation to build openmp
DeviceRTL via:
`-DLLVM_ENABLE_RUNTIMES=openmp`
`-DLLVM_DEFAULT_TARGET_TRIPLE=<amdgcn-amd-amdhsa|nvptx64-nvidia-cuda>`
This test is failing on the llvm-clang-x-aarch64 buildbot due to what
looks like a difference in rounding behaviour when printing estimated
cost per lane. Solve this by removing the fractional part, which is what
we've done in the past when this has happened (e.g. commit aeb88f677).
The OpenMP API does not allow a THREADPRIVATE variable to appear in
an EQUIVALENCE statement. It has been requested by the community to
extend Flang such that it permits these non-conforming patterns. This PR
changes Flang to inherit the DSA of the base object of the EQUIVALENCE
statement to the equivalenced variables. The original error message is
turned into a warning.
This PR contains code from downstream PR
https://github.com/arm/arm-toolchain/pull/755 that @tblah pointed to
during the review.
Fixes https://github.com/llvm/llvm-project/issues/180493
Assisted-by: Claude Code, Opus 4.6
Transform MachineCycleInfo into a class that can be declared and remove
include from many source files.
Similar to 810ba55de9159932d498e9387d031f362b93fbea.
If `.Label` is not within the ±4KiB range, we convert
```
beqi/bnei reg, imm, .Label
```
to
```
bnei/beqi reg, imm, 8
j .Label
```
This is similar to what is done for the RISCV conditional branches
and `Xqcibi` conditional branches.
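The relaxation above can be sketched as a small rewrite function. This is an illustration only: the exact encodable bounds are an assumption (the description only says ±4KiB), and `relax_branch` with its string-based instruction format is hypothetical rather than the backend's actual code.

```python
# Inverting the condition lets the short branch skip over the long jump.
INVERTED = {"beqi": "bnei", "bnei": "beqi"}

# Assumed encodable byte range for the branch offset (roughly +-4KiB).
MIN_OFFSET, MAX_OFFSET = -4096, 4095


def relax_branch(opcode: str, reg: str, imm: int, offset: int, label: str) -> list[str]:
    """Return the instruction sequence for a conditional branch to `label`.

    `offset` is the byte distance from the branch to the label.
    """
    if MIN_OFFSET <= offset <= MAX_OFFSET:
        # Target is reachable: keep the original branch.
        return [f"{opcode} {reg}, {imm}, {label}"]
    # Target is out of range: branch over an unconditional jump instead.
    return [f"{INVERTED[opcode]} {reg}, {imm}, 8", f"j {label}"]
```

For example, `relax_branch("beqi", "a0", 5, 8192, ".Label")` yields the inverted `bnei a0, 5, 8` followed by `j .Label`, while an in-range offset leaves the branch untouched.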
---------
Co-authored-by: Sudharsan Veeravalli <svs@qti.qualcomm.com>
FastISel was dropping llvm.fake.use calls because they are not meant to be
generated at O0 with clang.
This patch adds support in FastISel to generate FAKE_USE for llvm.fake.use.
The handling is simpler than in SelectionDagBuilder because no attempt is made to
get rid of useless FAKE_USE (e.g. for constant SSA values) to keep FastISel simple.
The motivation is that flang will generate llvm.fake.use for function arguments under
`-g` (and O0) because Fortran arguments are not copied to the stack (they are
reference like arguments in most cases) and one should be able to access these
variables from the debugger at any point of the function, even after their last use in the
function.
This reverts commit 91b928f919364b29e241821fc639b9ef56dab1a5.
This complicates some analyses that need to happen on the scalar VPlan,
before regions have been created, e.g.
https://github.com/llvm/llvm-project/pull/185323/.
Alive2 proof:
https://alive2.llvm.org/ce/z/bK93Cn
I've implemented a fold in `InstCombineAndOrXor.cpp` to canonicalize `~x
| (x - 1)` to `~(x & -x)` which enables the CodeGen to emit the `blsi`
instruction.
I've also added a test in `CodeGen/X86`.
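Alongside the Alive2 proof, the identity can be spot-checked with a brute-force sketch over small bit widths. The function name `check_fold` is hypothetical, and Python's unbounded integers are masked to emulate fixed-width two's-complement arithmetic.

```python
def check_fold(bits: int = 8) -> bool:
    """Check that ~x | (x - 1) equals ~(x & -x) for every `bits`-wide x."""
    mask = (1 << bits) - 1
    for x in range(1 << bits):
        lhs = (~x | (x - 1)) & mask  # original pattern
        rhs = ~(x & -x) & mask       # canonical form; x & -x is what blsi computes
        if lhs != rhs:
            return False
    return True
```

The identity follows from De Morgan's law together with `~(-x) == x - 1`, so `~(x & -x)` expands to `~x | (x - 1)`.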
Fixes #184055
---------
Co-authored-by: Tim Gymnich <tim@gymni.ch>
CycleInfo currently has a second map that stores the top-level cycle
for a block. I don't think storing this per-block makes a lot of sense,
because the top-level cycle is always the same for all blocks in a
cycle.
So instead store it as a member of the cycle.
This patch improves the lowering of 128-bit unsigned division and
remainder by constants (UDIV/UREM) by avoiding a fallback to libcall
(__udivti3/__umodti3) for specific divisors.
When a divisor D satisfies the condition (1 << ChunkWidth) % D == 1, the
128-bit value is split into fixed-width chunks (e.g., 30-bit) and summed
before applying a smaller UDIV/UREM. This transformation is based on the
"remainder by summing digits" trick described in Hacker’s Delight.
This fixes #137514 for some constants.
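The chunk-summing idea for the remainder can be sketched on plain integers. This is an illustration of the arithmetic only, not the lowering itself: it covers the remainder case, and `urem_by_chunk_sum` is a hypothetical name.

```python
def urem_by_chunk_sum(value: int, divisor: int, chunk_width: int) -> int:
    """Compute value % divisor by summing fixed-width chunks first.

    Valid only when (1 << chunk_width) % divisor == 1, because then each
    chunk's positional weight is congruent to 1 modulo the divisor.
    """
    if (1 << chunk_width) % divisor != 1:
        raise ValueError("divisor does not satisfy (1 << chunk_width) % divisor == 1")
    mask = (1 << chunk_width) - 1
    total = 0
    while value:
        total += value & mask   # add the low chunk
        value >>= chunk_width   # move on to the next chunk
    # The sum is far narrower than the input, so a small remainder suffices.
    return total % divisor
```

With 30-bit chunks, divisors such as 3, 7, and 11 satisfy the condition, since each divides `2**30 - 1`. A 128-bit input splits into at most five chunks, so the sum fits comfortably in 64 bits.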
If the current buildvector node is part of the combined nodes of the
matching candidate node, this matching candidate must be considered
non-matching to prevent a wrong def-use chain.
Pull Request: https://github.com/llvm/llvm-project/pull/187491
When matching scalar steps of the canonical IV, also match a derived IV
of the canonical IV if the derivation is essentially a no-op. Fixes a
failure in the mve-reg-pressure-spills.ll test when expensive checks are
enabled.
Correctly inform transform passes about our registers; this prevents the
issue with the `find-last` test where the loop vectorizer pass
mistakenly thinks that the backend has vector capabilities and generates
vector types, which causes the backend to crash.
See also: https://github.com/sparclinux/issues/issues/69
When a virtual destructor is encountered before any module providing
std::align_val_t is loaded, DeclareGlobalNewDelete() implicitly creates
a std::align_val_t EnumDecl. However, this EnumDecl was not added to the
std namespace's DeclContext -- it was only stored in the
Sema::StdAlignValT field.
Later, when a module containing an explicit std::align_val_t definition
is loaded, ASTReaderDecl::findExisting() attempts to find the implicit
decl via DeclContext::noload_lookup() on the std namespace. Since the
implicit EnumDecl was never added to that DeclContext, the lookup fails,
and the two align_val_t declarations are not merged into a single
redeclaration chain. This results in two distinct types both named
std::align_val_t.
The implicitly declared operator delete overloads (also created by
DeclareGlobalNewDelete) use the implicit align_val_t type for their
aligned-deallocation parameter. When module code (e.g. std::allocator::
deallocate) calls __builtin_operator_delete with the module's
align_val_t, overload resolution fails because the two align_val_t types
are not the same, producing:
error: no matching function for call to 'operator delete'
note: no known conversion from 'std::align_val_t' to 'std::align_val_t'
The fix adds the implicit align_val_t EnumDecl to the std namespace
DeclContext via getOrCreateStdNamespace()->addDecl(AlignValT), so the
module merger can find it via noload_lookup and merge the two
declarations.
This bug was exposed by a libc++ change (2b01e7cf2b70) that removed the
#include <__new/global_new_delete.h> line from allocate.h, which meant
modules no longer had explicit operator delete declarations to paper
over the type mismatch.
Assisted-by: Claude Code