`mgpuModuleUnload` may be called from a global destructor (registered by
`SelectObjectAttr`'s `appendToGlobalDtors`) after the CUDA primary
context has already been destroyed during program shutdown. In this
case, `cuModuleUnload` returns `CUDA_ERROR_DEINITIALIZED`, which is
benign since the module's resources are already freed with the context.
## Reproduction
Any program that uses `gpu.launch_func` and is AOT-compiled (via
`mlir-translate --mlir-to-llvmir | llc | cc -lmlir_cuda_runtime`) will
print `'cuModuleUnload(module)' failed with '<unknown>'` on exit. This
is because `SelectObjectAttr` registers the module unload as a global
destructor, which runs after the CUDA primary context is released.
This script reproduces the error message from `mgpuModuleUnload` on my
system:
```
#!/bin/bash
set -e
LLVM_BUILD=${LLVM_BUILD:-$HOME/dev/git/llvm-project-22/build}
cat > /tmp/repro.mlir << 'MLIR'
func.func @main() {
%c1 = arith.constant 1 : index
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
threads(%tx, %ty, %tz) in (%bsx = %c1, %bsy = %c1, %bsz = %c1) {
gpu.terminator
}
return
}
MLIR
$LLVM_BUILD/bin/mlir-opt /tmp/repro.mlir \
-gpu-lower-to-nvvm-pipeline="cubin-format=fatbin" \
| $LLVM_BUILD/bin/mlir-translate --mlir-to-llvmir -o /tmp/repro.ll
$LLVM_BUILD/bin/llc -relocation-model=pic -filetype=obj /tmp/repro.ll -o /tmp/repro.o
cc /tmp/repro.o \
-L$LLVM_BUILD/lib -Wl,-rpath,$LLVM_BUILD/lib \
-lmlir_cuda_runtime -lmlir_runner_utils -o /tmp/repro
echo "Running:"
/tmp/repro 2>&1
echo "Exit code: $?"
```
## Context
This matches how other projects handle the same shutdown ordering issue:
- Clang CUDA (D48613) switched module cleanup from
`__attribute__((destructor))` to `atexit()`
- GCC libgomp checks context validity before `cuModuleUnload`
- Apache TVM silently ignores `CUDA_ERROR_DEINITIALIZED` on module
unload
Fixes#170833
The mov64 pseudo is split into two 32 bit movs, but those 32 bit movs
had the full 64-bit register still implicitly defined. VOPD formation is
affected, so we can emit more of them.
This PR enhance the multi-reduction op pattern of wg-to-sg distribution
pass:
1. allows each sg have multiple distribution of sg_data tiles.
2. expand the slm buffer size.
3. construct the layout based on the partial reduced vector and use
layout.computeDistributedCoords() to compute coordinates. the layout is
constructed so that the store is cooperative, and load overlapps with
neighbour threads.
4. perform save and load.
Standard porting (note that TargetPassConfig dependency was [removed
earlier](e27e7e4339)).
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
When processing coverage generated from branch coverage mode, some
functions can reach findMainViewFileID with an empty CountedRegions
list. In that case the current logic still proceeds to infer the main
view file, even though there is no regular counted region available to
do so.
Return std::nullopt early when CountedRegions is empty.
This was observed when reproducing issue #189169 with:
cargo llvm-cov --lib --branch
The issue appears related to branch-only coverage information being
recorded separately in CountedBranchRegions, while
findMainViewFileID currently only consults CountedRegions.
This patch is a defensive fix for the empty-region case; further
investigation may still be needed to determine whether branch regions
should participate in main view file selection.
Co-authored-by: Zile Xiong <xiongzile99@gmail.com>
Block pointers are only stored while constructing the analysis, so the
value handle to catch erased blocks is no longer needed when using
stable block numbers.
This is fix for
[187902](https://github.com/llvm/llvm-project/issues/187902), where
`LoopInfo` is not in a valid state at the beginning of `ScalarEvolution::createSCEVIter`.
The reason for the bug is that, `mergeLatch()` is called at a place
where control flow and dominator trees have been updated but `LoopInfo`
has not completed the update yet. `mergeLatch()` calls into
`ScalarEvolution` that uses `LoopInfo`, where out-of-date `LoopInfo` would
result in crash or unpredictable results.
This patch moves `mergeLatch()` to the place where `LoopInfo` has
completed its update and hence is in a valid state.
Fixes found by fuzzer:
OnDiskTrieRawHashMap:
- Bounds-check data slot offsets in TrieVerifier::visitSlot() before
calling getRecord(), preventing asData() assertion on out-of-bounds
trie entries.
- Validate subtrie headers (NumBits, bounds) before constructing
SubtrieHandle, preventing SEGV in getSlots() from corrupt NumBits.
- Validate arena bump pointer alignment, catching misaligned BumpPtr
that would crash store() with an alignment assertion.
- Fix comma operator bug in getOrCreateRoot() where the
compare_exchange_strong result was discarded, causing asSubtrie()
assertion when RootTrieOffset was corrupted to zero.
OnDiskGraphDB:
- Reject invalid (zero) ref offsets in validate callback, preventing
asData() assertion when corrupt data pool refs are resolved via
recoverFromFileOffset().
- Validate DataRecordHandle layout flags before calling getTotalSize(),
preventing llvm_unreachable on corrupt NumRefsFlags/DataSizeFlags.
- Validate data pool bump pointer alignment, catching misaligned
BumpPtr that would crash store() in DataRecordHandle::constructImpl().
- Check data record refs offset alignment before calling getRefs(),
preventing PointerUnion assertion from misaligned refs pointer.
MappedFileRegionArena:
- Convert assertions in initializeHeader() to errors so corrupted
arena headers return an error on CAS open instead of crashing.
Assisted-By: Claude
Add a fuzzer that creates an on-disk CAS database, stores objects, then
corrupts the on-disk data files using fuzzer-provided bytes and calls
validate(). The goal is that validate() should either succeed or return
an error, never crash.
The fuzzer supports 6 corruption modes: byte-level mutations, file
truncation, appending garbage, zeroing ranges, standalone file
corruption, and combined mutations with continued CAS operations.
Assisted-By: Claude
This changes `DenseMapInfo<UUID>::getTombstoneKey()` to return a 1-byte
`{0xFF}` sentinel instead of the empty, default constructed UUID().
Returning the same key for the empty and tombstone value apparently
violates the `DenseMap` invariant.
The rematerializer implements support for rolling back
rematerializations by modifying MIs that should normally be deleted in
an attempt to make them "transparent" to other analyses. This involves:
1. setting their opcode to DBG_VALUE and
2. setting their read register operands to the sentinel register.
This approach has several drawbacks.
1. It forces the rematerializer to support tracking these "dead MIs"
(even if support is optional, these data-structures have to exist).
2. It is not actually clear whether this mechanism will interact well
with all other analyses. This is an issue since the intent of the
rematerializer is to be usable in as many contexts as possible.
3. In practice, it has shown itself to be relatively error-prone.
This commit removes rollback support from the rematerializer and moves
those capabilities to a rematerializer listener than can be instantiated
on-demand and implements the same functionality on top of standard
rematerializer operations. The rematerializer now actually deletes MIs
that are no longer useful after rematerializations, and has support for
re-creating them on-demand without requiring additional tracking on its
part.
Since #189401, LLVM and Clang generate `S_DEFRANGE_REGISTER_REL_INDIR`
for indirect locations. This adds support in LLDB.
The offset added after dereferencing is signed here - unlike in
`S_REGREL32_INDIR` (at least that's the assumption). So I updated
`MakeRegisterBasedIndirectLocationExpressionInternal` to handle the
signedness. This is the reason the MSVC test was changed here.
I didn't find a test case where LLVM emits the record with the `VFRAME`
register. Other than that, the clang test is similar to the MSVC one
except that the locations are slightly different.
When a VFS overlay YAML file contains malformed content such as tabs,
the YAML parser can produce KeyValueNode entries where `getKey` returns
nullptr. The VFS overlay parser then passes the nullptr to
`parseScalarString`, which then calls dyn_cast.
Switch to `dyn_cast_if_present` for the above callsites and a few more.
The Hexagon driver only checked for an explicit -pie flag when
constructing the link command, ignoring the toolchain's PIE default. For
linux-musl targets, isPIEDefault() returns true (via the Linux toolchain
base class), so the compiler generates PIC/PIE code (-pic-level 2
-pic-is-pie) but the linker never received -pie.
This mismatch caused LTO failures: without -pie the linker sets
Reloc::Static for the LTO backend, which generates GP-relative
(small-data) references that lld cannot resolve.
Use hasFlag() to respect the toolchain default, and guard the -pie
emission against -shared and -r (relocatable) modes.
16-bit instructions there are in fake16 mode and shall also be
compatible with older targets. The purpose of the test is to
check literals, so fake16 or real16 is not important.
The visited set can grow rather large and we can use an unused field in
SDNode to store the same information without the use of a hash set.
This improves compile times: stage2-O3 -0.14%.
From #185382
Lower `vduph_lane_f16` and `vduph_laneq_f16` to `cir::VecExtractOp`
Tests moved from `v8.2a-neon-instrinsics-generic.c` to a new CIR-enabled
test file.
I tried following from notes made in #185852 (BF16)
Previously, `computeProcResourceMasks()` would print resource masks on
debug mode from multiple call sites, creating noise in the debug output.
This patch aims to fix this and also print more info about the
resources.
It splits to 2 types of debug prints for resources:
1. No simulation - mask only
2. Simulation - mask + other info
For 2, it shares printing on a single place in `ResourceManager`
constructor, that should cover all the other simulation cases
indirectly:
1. `llvm/lib/MCA/HardwareUnits/ResourceManager` - covered
2. `llvm/lib/MCA/InstrBuilder.c` - should be covered indirectly - only
used by `llvm-mca` before simulation that constructs a `ResourceManager`
3. `llvm/tools/llvm-mca/Views/SummaryView.cpp` - after simulation that
constructs a `ResourceManager`
4. `llvm/tools/llvm-mca/Views/BottleneckAnalysis.cpp` - after simulation
that constructs a `ResourceManager`
It also adds `BufferSize` to the output, which should be useful to debug
scheduling model + MCA integration.
For 1, it inlines mask-only printing into 2 other callers:
1. `llvm/include/llvm/MCA/Stages/InstructionTables.h`
2. `llvm/tools/llvm-exegesis/lib/SchedClassResolution.cpp`
as they only use the masks there. I think this is a reasonable
duplication across distinguishably different users/tools.
Now every pair of callers, even across groups (1 and 2), effectively
print in a mutually exclusive way.
The patch adds debug tests for the 3 new callers, in the corresponding
root test directories, to drive further location of logically
target-independent tests that just require some target at the root. I
think this convention is more discoverable, and is pretty widely used in
the project.
So that it's preserved across all inline invocations rather than just
one inliner pass run.
This prevents cases where devirtualization in the simplification
pipeline uncovers inlining opportunities that should be discarded due to
inline history, but we dropped the inline history between inliner pass
runs, causing code size to blow up, sometimes exponentially.
For compile time reasons, we want to limit this to only call sites that
have the potential to inline through SCCs, potentially with the help of
devirtualization. This means that the callee is in a non-trivial
(Ref)SCC, or the call site was previously an indirect call, which can
potentially be devirtualized to call any function.
The CGSCCUpdater::InlinedInternalEdges logic still seems to be relevant
even with this change, as monster_scc.ll blows up if I remove that code.
http://llvm-compile-time-tracker.com/compare.php?from=e830d88e8ae5f44a97cc76136a0a4e83aa9157c0&to=ed535e732fc41b79ab8efda2417886cbd0812f7f&stat=instructions:uFixes#186926.
This is needed so that `allowsMemoryAccessForAlignment` checks for
unaligned vector memory
support instead of unaligned scalar memory support when called from
`RISCVTargetLowering::expandUnalignedVPStore`
While there remove incorrect setting of the truncating store flag
on the vector instruction. And restrict the transform to simple stores
since we don't have tests for volatile or atomic.
Fixes#189037
The RV32 macc*.h00 instructions take the lower half words from rs1 and
rs2, compute the full word product by extending the inputs, and
add to rd. The RV64 macc*.w00 is similar but operates on words
and produces a double word result.
I've restricted this to case where the multiply has a single use.
We don't have a general macc that multiplies the full xlen bits
of rs1 and rs2, so I'm allowing the input to be sext_inreg/and or
have sufficient sign/zero bits according to
ComputeNumSignBits/computeKnownBits.
We should also add mul*.h00/mul.*w00 patterns, but those we should
restrict to at least one input being sext_inreg/and and prefer
regular mul when there are no sext_inreg/and.
Add ThreeOp_v2i32_Pats pattern class to support v2i32 vector operations
for AND_OR_B32 and OR3_B32 instructions. The new patterns check the
v2i32 and-or or or-or instruction sequence, extract individual 32-bit
elements from v2i32 operands, and applies the and_or or or3 vop3
operations.
This issue was discovered during some downstream work around Vulkan CTS
tests, specifically
`dEQP-VK.subgroups.arithmetic.compute.subgroupadd_float`
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
The MLIR LLVM dialect is missing support for several parameter
attributes that
exist in LLVM IR: `writable`, `dead_on_unwind`, `dead_on_return`, and
`nofpclass`. This adds them to the kind-to-name mapping in
`AttrKindDetail.h`
and the corresponding name accessors in `LLVMDialect.td`.
The existing generic conversion infrastructure in `ModuleTranslation`
and
`ModuleImport` picks them up automatically — `writable` and
`dead_on_unwind`
round-trip as `UnitAttr`, while `dead_on_return` and `nofpclass`
round-trip as
`IntegerAttr`.
CIR needs these to match classic codegen's ABI output (sret gets
`writable
dead_on_unwind`, indirect args get `dead_on_return`, fast-math FP args
get
`nofpclass`).
Add skip_tail_padding property to cir.copy to handle
potentially-overlapping
subobject copies directly, instead of falling back to cir.libc.memcpy.
When
set, the lowering uses the record's data size (excluding tail padding)
for
the memcpy length. This keeps typed semantics and promotability of
cir.copy.
Also fix CXXABILowering to preserve op properties when recreating
operations,
and expose RecordType::computeStructDataSize() for computing data size
of
padded record types.
Added vector intrinsics for
vshlq_n_s8
vshlq_n_s16
vshlq_n_s32
vshlq_n_s64
vshlq_n_u8
vshlq_n_u16
vshlq_n_u32
vshlq_n_u64
vshl_n_s8
vshl_n_s16
vshl_n_s32
vshl_n_s64
vshl_n_u8
vshl_n_u16
vshl_n_u32
vshl_n_u64
these cover all the vector intrinsics for constant shift
the method followed
1) the vectors for quad words are of the form `64x2`, `32x4`, `16x8`,
`8x16` and the shift is a constant value but for shift left we need both
of them to be vectors so we take the constant shift and convert it into
a vector of respective form, for `64x2` we convert the constant to
`64x2`, I have learnt that this process is also called **splat**
2) After splat we have that the lhs and rhs are of the same size hence
the shift left can be applied
3) There is one issue though, the ops[0] is not of the right size, for
quad words it falls back to the default int8*16 in the function, so I am
converting it to the required size using bit casting, `8x16` = `64x2` so
we can bitcast and get the vector array in the right form.
Wrote the test cases for all the intrinsics listed above
#185382
Some functions used `new`/`delete` to allocate/free arrays. To avoid
memory leaks, it would be better to avoid using raw pointers. This patch
replaces the use of them with `SmallVector`.
It seems the known bits handling added in
686987a540bc176bceaad43ffe530cb3e88796d5
is insufficient to perform many range based optimizations. For some
reason
computeConstantRange doesn't fall back on KnownBits, and has a separate,
less used form which tries to use computeKnownBits.