The ASYNC_CNT is used to track the progress of asynchronous copies
between global and LDS memories. By including it in asyncmark, the
compiler can now assist the programmer in generating waits for
ASYNC_CNT.
Assisted-By: Claude Sonnet 4.5
This is part of a stack:
- #185813
- #185810
Fixes: LCOMPILER-332
The previous position of llvm.protected.field.ptr lowering for loads
and stores was problematic as it not only inhibited optimizations such
as DSE (as stores to a llvm.protected.field.ptr were not considered to
must-alias stores to the non-protected.field pointer) but also required
changes to other optimization passes to avoid transformations that would
reduce PFP coverage.
Address this by moving the load/store part of the lowering to
InstCombine, where it will run earlier than the PFP-breaking and
AA-relying transformations. The deactivation symbol, null comparison
and EmuPAC parts of the lowering remain in PreISelLowering.
Now that the transformation inhibitions are no longer needed, remove them
(i.e. partially revert #151649, and revert #182976).
This change resulted in a 2.4% reduction in Fleetbench .text size and
the following improvements to PFP performance overhead for BM_PROTO_Arena
on various microarchitectures:
before after
Apple M2 Ultra 3.5% 3.3%
Google Axion C4A 3.3% 2.9%
Google Axion N4A 2.7% 2.2%
Reviewers: fmayer, nikic, vitalybuka
Reviewed By: fmayer
Pull Request: https://github.com/llvm/llvm-project/pull/186548
The VPatBinaryV_VI_VROL multiclass was using InvRot64Imm for all SEW
widths when converting vrol immediate intrinsics to vror.vi. This
produced unnecessarily large immediates for narrower element types
(e.g., 61 instead of 5 for SEW=8 rotate-left by 3).
Use the appropriate InvRot{SEW}Imm transform to match what the SDNode
patterns already do.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
(Reland of #190092 with verifier change to look through GlobalAliases)
So that it's preserved across all inline invocations rather than just
one inliner pass run.
This prevents cases where devirtualization in the simplification
pipeline uncovers inlining opportunities that should be discarded due to
inline history, but we dropped the inline history between inliner pass
runs, causing code size to blow up, sometimes exponentially.
For compile time reasons, we want to limit this to only call sites that
have the potential to inline through SCCs, potentially with the help of
devirtualization. This means that the callee is in a non-trivial
(Ref)SCC, or the call site was previously an indirect call, which can
potentially be devirtualized to call any function.
The CGSCCUpdater::InlinedInternalEdges logic still seems to be relevant
even with this change, as monster_scc.ll blows up if I remove that code.
http://llvm-compile-time-tracker.com/compare.php?from=e830d88e8ae5f44a97cc76136a0a4e83aa9157c0&to=ed535e732fc41b79ab8efda2417886cbd0812f7f&stat=instructions:uFixes#186926.
This adds CIR handling for the __builtin_flt_rounds and
__builtin_set_flt_rounds builtin functions. Because the LLVM dialect
does not have dedicated operations for these, I have chosen not to
implement them as operations in CIR either. Instead, we just call the
LLVM intrinsic.
This change introduces TableGen support for indicating CIR attributes
that require a CIRAttrToValue visitor, adds the new flag to all
attributes to which it applies, and replaces the explicit declarations
with the tablegen output.
This fixes a rematerializer issue wherein re-creating the interval of a
non-rematerializable super-register defined over multiple MIs, some of
which defining entirely dead sub-registers, could cause a crash when
changing the order of sub-definitions (for example during scheduling)
because the re-created interval could end up with multiple connected
components, which is illegal. The solution is to split separate
components of the interval in such cases. The added unit test crashes
without that added behavior.
#190563 re-enabled FileCheck on `Integration/GPU/CUDA/async.mlir`, but
the buildbot has shown intermittent wrong-output failures
([example](https://lab.llvm.org/buildbot/#/builders/116/builds/27026)):
the test produces `[42, 42]` instead of the expected `[84, 84]`.
This wrong-output flakiness is distinct from the cleanup-time
`cuModuleUnload` errors that #190563 actually fixes — it's the
underlying issue tracked by #170833. The merged commit message for
#190563 incorrectly says `Fixes #170833`; that issue should be reopened,
since the cleanup-error fix doesn't address the wrong-output behavior.
This PR puts the test back in its previously-disabled state. The runtime
cleanup fix in #190563 is unaffected.
This adds the handling for the case where the address of a static local
variable is used to initialize another static local. In this case, the
address of the first variable is emitted as a constant in the
initializer of the second variable.
Today, on Darwin platforms, almost every test binary in our test suite
loads two copies of libc++, libc++abi, and libunwind. This is because
each of the test binaries explicitly link against a just-built libc++
(which is explicitly required on Darwin right now) but we don't take the
correct steps to replace the system libc++. Doing so is unnecessary and
potentially error-prone, so most tests should link against the system
libc++ where possible.
Background:
The lldb test suite has a collection of tests that rely on libc++
explicitly. The two biggest categories are data formatter tests (which
make sure that we can correctly display values for std types) and
import-std-module tests (which test that we can import the libc++ std
module). To make sure these tests are run, we require a just-built
libc++ to be used.
All of the test binaries link against the just-built libc++, so it gets
loaded. However, when any system library tries to load libc++, it
attempts to load the system one. dyld checks loaded libraries against
the request to load a new one using the full path, meaning anyone
linking against `/usr/lib/libc++.1.dylib` will get it no matter what
other libc++ dylib is already loaded.
The proper way to handle this is using `DYLD_LIBRARY_PATH`, which
switches dyld to checking the leaf name of a dylib instead of the full
path. In theory this works, but we run into an issue where the system
libc++ has additional symbols and many system libraries fail to load.
Louis Dionne added stubs in libc++abi for these missing symbols, meaning
it would be possible to make this scenario work. This may be useful for
the existing libc++ tests.
The backward matching loop in `matchNonCallsiteLocs` was ineffective
because `InsertMatching` used `std::unordered_map::insert()` which does
not overwrite existing entries. Since forward matching already inserted
entries for all non-anchor locations, the backward matching for the
second half was silently ignored.
The backward matching can update forward mappings in
`IRToProfileLocationMap` in 2 ways:
- The IR location maps a new different profile location. Change
`insert()` to `insert_or_assign()` so that entry overwrite can happen.
- The IR location maps the same profile location. Add `erase()` to
remove such mapping.
### Summary
`SBModuleSpecList` already supports `len()` and iteration via `__len__`
and `__iter__`, but is not subscriptable — `specs[0]` raises
`TypeError`.
This adds a `__getitem__` method that supports integer indexing (with
negative index support) and string lookup using `endswith()` matching,
which works for both Unix and Windows paths.
### Supported key types
| Key type | Example | Behavior |
|---|---|---|
| `int` | `specs[0]`, `specs[-1]` | Direct index with negative index
support |
| `str` | `specs['a.out']`, `specs['/usr/lib/liba.dylib']` | Lookup by
basename or partial/full path via `endswith()`. Returns first match or
`None` |
### Error handling
- **`IndexError`** for out-of-bounds integer indices
- **`TypeError`** for unsupported key types (e.g., `float`)
- **`None`** for string lookups with no match
### Before
```python
>>> specs = lldb.SBModuleSpecList.GetModuleSpecifications('/bin/ls')
>>> specs[0]
Traceback (most recent call last):
File "<console>", line 1, in <module>
TypeError: 'SBModuleSpecList' object is not subscriptable
```
### After
```python
>>> import lldb, re
>>> specs = lldb.SBModuleSpecList.GetModuleSpecifications('/bin/ls')
>>> specs[0]
file = '/bin/ls', arch = x86_64-*-linux, uuid = 3CCC0D8A-..., object size = 140928
>>> specs[-1]
file = '/bin/ls', arch = x86_64-*-linux, uuid = 3CCC0D8A-..., object size = 140928
>>> specs['ls']
file = '/bin/ls', arch = x86_64-*-linux, uuid = 3CCC0D8A-..., object size = 140928
>>> specs[999]
IndexError: list index out of range
>>> specs[1.5]
TypeError: unsupported index type: <class 'float'>
```
### Test plan
Added test_module_spec_list_indexing to TestSBModule.py covering:
- Positive and negative integer indexing
- Out-of-bounds raises IndexError
- Unsupported key type raises TypeError
- String lookup by basename and full path (endswith() matching)
- Missing key returns None
```
bin/llvm-lit -sv lldb/test/API/python_api/sbmodule/TestSBModule.py
```
```
PASS: LLDB :: test_GetObjectName_dwarf (TestSBModule.SBModuleAPICase)
PASS: LLDB :: test_GetObjectName_dwo (TestSBModule.SBModuleAPICase)
PASS: LLDB :: test_module_spec_list_indexing_dwarf (TestSBModule.SBModuleAPICase)
PASS: LLDB :: test_module_spec_list_indexing_dwo (TestSBModule.SBModuleAPICase)
----------------------------------------------------------------------
Ran 12 tests in 1.854s
OK (skipped=6)
```
Co-authored-by: Piyush Jaiswal <piyushjais@meta.com>
Allow `parseString()` to return an empty `StringRef` when the delimiter
appears at position 0. This enables parsing pre-aggregated profile
addresses with an omitted buildid but preserved colon (`:addr` format),
where the empty buildid corresponds to the main binary.
Previously, `parseString()` rejected zero-length fields by treating
`StringEnd == 0` the same as `StringRef::npos` (delimiter not found).
These are distinct situations: `npos` means no delimiter exists, while
`0` means the field before the delimiter is empty. The fix removes the
`StringEnd == 0` sub-condition so only the missing-delimiter case
errors.
The existing test for buildid-prefixed addresses is extended to also
verify that `:addr` input produces identical output to the plain-address
and non-empty-buildid variants.
Test Plan:
Added empty-buildid input file and extended
`pre-aggregated-perf-buildid.test` to run perf2bolt with `:addr` format
and diff the fdata output against the existing buildid-prefixed result.
Similar to other tests, we are adding code that the AddRecs used in GCD
test are `nsw`. In this case, all recursively identified `AddRec`s are
also checked. Note that there is already a similar check in
`getConstantCoefficient` for expressions processed in that function.
`isRISCV()` check always returns false because we only get here if
`min_op_byte_size` and `max_op_byte_size` are equal, which is not true
for RISC-V.
Also, replase `if (!got_op)` check with an `else`. The check is
equivalent to
`if (min_op_byte_size != max_op_byte_size)`, and the `if` above checks
for the opposite condition.
`mgpuModuleUnload` may be called from a global destructor (registered by
`SelectObjectAttr`'s `appendToGlobalDtors`) after the CUDA primary
context has already been destroyed during program shutdown. In this
case, `cuModuleUnload` returns `CUDA_ERROR_DEINITIALIZED`, which is
benign since the module's resources are already freed with the context.
## Reproduction
Any program that uses `gpu.launch_func` and is AOT-compiled (via
`mlir-translate --mlir-to-llvmir | llc | cc -lmlir_cuda_runtime`) will
print `'cuModuleUnload(module)' failed with '<unknown>'` on exit. This
is because `SelectObjectAttr` registers the module unload as a global
destructor, which runs after the CUDA primary context is released.
This script reproduces the error message from `mgpuModuleUnload` on my
system:
```
#!/bin/bash
set -e
LLVM_BUILD=${LLVM_BUILD:-$HOME/dev/git/llvm-project-22/build}
cat > /tmp/repro.mlir << 'MLIR'
func.func @main() {
%c1 = arith.constant 1 : index
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
threads(%tx, %ty, %tz) in (%bsx = %c1, %bsy = %c1, %bsz = %c1) {
gpu.terminator
}
return
}
MLIR
$LLVM_BUILD/bin/mlir-opt /tmp/repro.mlir \
-gpu-lower-to-nvvm-pipeline="cubin-format=fatbin" \
| $LLVM_BUILD/bin/mlir-translate --mlir-to-llvmir -o /tmp/repro.ll
$LLVM_BUILD/bin/llc -relocation-model=pic -filetype=obj /tmp/repro.ll -o /tmp/repro.o
cc /tmp/repro.o \
-L$LLVM_BUILD/lib -Wl,-rpath,$LLVM_BUILD/lib \
-lmlir_cuda_runtime -lmlir_runner_utils -o /tmp/repro
echo "Running:"
/tmp/repro 2>&1
echo "Exit code: $?"
```
## Context
This matches how other projects handle the same shutdown ordering issue:
- Clang CUDA (D48613) switched module cleanup from
`__attribute__((destructor))` to `atexit()`
- GCC libgomp checks context validity before `cuModuleUnload`
- Apache TVM silently ignores `CUDA_ERROR_DEINITIALIZED` on module
unload
Fixes#170833
The mov64 pseudo is split into two 32 bit movs, but those 32 bit movs
had the full 64-bit register still implicitly defined. VOPD formation is
affected, so we can emit more of them.
This PR enhance the multi-reduction op pattern of wg-to-sg distribution
pass:
1. allows each sg have multiple distribution of sg_data tiles.
2. expand the slm buffer size.
3. construct the layout based on the partial reduced vector and use
layout.computeDistributedCoords() to compute coordinates. the layout is
constructed so that the store is cooperative, and load overlapps with
neighbour threads.
4. perform save and load.
Standard porting (note that TargetPassConfig dependency was [removed
earlier](e27e7e4339)).
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
When processing coverage generated from branch coverage mode, some
functions can reach findMainViewFileID with an empty CountedRegions
list. In that case the current logic still proceeds to infer the main
view file, even though there is no regular counted region available to
do so.
Return std::nullopt early when CountedRegions is empty.
This was observed when reproducing issue #189169 with:
cargo llvm-cov --lib --branch
The issue appears related to branch-only coverage information being
recorded separately in CountedBranchRegions, while
findMainViewFileID currently only consults CountedRegions.
This patch is a defensive fix for the empty-region case; further
investigation may still be needed to determine whether branch regions
should participate in main view file selection.
Co-authored-by: Zile Xiong <xiongzile99@gmail.com>
Block pointers are only stored while constructing the analysis, so the
value handle to catch erased blocks is no longer needed when using
stable block numbers.
This is fix for
[187902](https://github.com/llvm/llvm-project/issues/187902), where
`LoopInfo` is not in a valid state at the beginning of `ScalarEvolution::createSCEVIter`.
The reason for the bug is that, `mergeLatch()` is called at a place
where control flow and dominator trees have been updated but `LoopInfo`
has not completed the update yet. `mergeLatch()` calls into
`ScalarEvolution` that uses `LoopInfo`, where out-of-date `LoopInfo` would
result in crash or unpredictable results.
This patch moves `mergeLatch()` to the place where `LoopInfo` has
completed its update and hence is in a valid state.
Fixes found by fuzzer:
OnDiskTrieRawHashMap:
- Bounds-check data slot offsets in TrieVerifier::visitSlot() before
calling getRecord(), preventing asData() assertion on out-of-bounds
trie entries.
- Validate subtrie headers (NumBits, bounds) before constructing
SubtrieHandle, preventing SEGV in getSlots() from corrupt NumBits.
- Validate arena bump pointer alignment, catching misaligned BumpPtr
that would crash store() with an alignment assertion.
- Fix comma operator bug in getOrCreateRoot() where the
compare_exchange_strong result was discarded, causing asSubtrie()
assertion when RootTrieOffset was corrupted to zero.
OnDiskGraphDB:
- Reject invalid (zero) ref offsets in validate callback, preventing
asData() assertion when corrupt data pool refs are resolved via
recoverFromFileOffset().
- Validate DataRecordHandle layout flags before calling getTotalSize(),
preventing llvm_unreachable on corrupt NumRefsFlags/DataSizeFlags.
- Validate data pool bump pointer alignment, catching misaligned
BumpPtr that would crash store() in DataRecordHandle::constructImpl().
- Check data record refs offset alignment before calling getRefs(),
preventing PointerUnion assertion from misaligned refs pointer.
MappedFileRegionArena:
- Convert assertions in initializeHeader() to errors so corrupted
arena headers return an error on CAS open instead of crashing.
Assisted-By: Claude
Add a fuzzer that creates an on-disk CAS database, stores objects, then
corrupts the on-disk data files using fuzzer-provided bytes and calls
validate(). The goal is that validate() should either succeed or return
an error, never crash.
The fuzzer supports 6 corruption modes: byte-level mutations, file
truncation, appending garbage, zeroing ranges, standalone file
corruption, and combined mutations with continued CAS operations.
Assisted-By: Claude
This changes `DenseMapInfo<UUID>::getTombstoneKey()` to return a 1-byte
`{0xFF}` sentinel instead of the empty, default constructed UUID().
Returning the same key for the empty and tombstone value apparently
violates the `DenseMap` invariant.
The rematerializer implements support for rolling back
rematerializations by modifying MIs that should normally be deleted in
an attempt to make them "transparent" to other analyses. This involves:
1. setting their opcode to DBG_VALUE and
2. setting their read register operands to the sentinel register.
This approach has several drawbacks.
1. It forces the rematerializer to support tracking these "dead MIs"
(even if support is optional, these data-structures have to exist).
2. It is not actually clear whether this mechanism will interact well
with all other analyses. This is an issue since the intent of the
rematerializer is to be usable in as many contexts as possible.
3. In practice, it has shown itself to be relatively error-prone.
This commit removes rollback support from the rematerializer and moves
those capabilities to a rematerializer listener than can be instantiated
on-demand and implements the same functionality on top of standard
rematerializer operations. The rematerializer now actually deletes MIs
that are no longer useful after rematerializations, and has support for
re-creating them on-demand without requiring additional tracking on its
part.
Since #189401, LLVM and Clang generate `S_DEFRANGE_REGISTER_REL_INDIR`
for indirect locations. This adds support in LLDB.
The offset added after dereferencing is signed here - unlike in
`S_REGREL32_INDIR` (at least that's the assumption). So I updated
`MakeRegisterBasedIndirectLocationExpressionInternal` to handle the
signedness. This is the reason the MSVC test was changed here.
I didn't find a test case where LLVM emits the record with the `VFRAME`
register. Other than that, the clang test is similar to the MSVC one
except that the locations are slightly different.
When a VFS overlay YAML file contains malformed content such as tabs,
the YAML parser can produce KeyValueNode entries where `getKey` returns
nullptr. The VFS overlay parser then passes the nullptr to
`parseScalarString`, which then calls dyn_cast.
Switch to `dyn_cast_if_present` for the above callsites and a few more.
The Hexagon driver only checked for an explicit -pie flag when
constructing the link command, ignoring the toolchain's PIE default. For
linux-musl targets, isPIEDefault() returns true (via the Linux toolchain
base class), so the compiler generates PIC/PIE code (-pic-level 2
-pic-is-pie) but the linker never received -pie.
This mismatch caused LTO failures: without -pie the linker sets
Reloc::Static for the LTO backend, which generates GP-relative
(small-data) references that lld cannot resolve.
Use hasFlag() to respect the toolchain default, and guard the -pie
emission against -shared and -r (relocatable) modes.
16-bit instructions there are in fake16 mode and shall also be
compatible with older targets. The purpose of the test is to
check literals, so fake16 or real16 is not important.
The visited set can grow rather large and we can use an unused field in
SDNode to store the same information without the use of a hash set.
This improves compile times: stage2-O3 -0.14%.