Previously, UnrollMultiReductionPattern bailed out when all the
dimensions were reduced to a scalar. This PR adds support for this case
by tiling the source vector and chaining partial reductions through the
accumulator operand.
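The chaining idea can be illustrated outside of MLIR. The following is a minimal Python sketch, not the pattern's actual implementation: `tiled_reduce`, `tile_size`, and `combine` are hypothetical names, and a flat list stands in for the source vector.

```python
from functools import reduce
from operator import add


def tiled_reduce(values, tile_size, acc, combine=add):
    """Reduce `values` to a scalar one tile at a time, threading `acc` through."""
    for start in range(0, len(values), tile_size):
        tile = values[start:start + tile_size]
        # Each partial reduction takes the previous result as its accumulator,
        # so the final value equals a single reduction over the whole input.
        acc = reduce(combine, tile, acc)
    return acc


total = tiled_reduce(list(range(8)), tile_size=4, acc=0)
```

Because every partial reduction starts from the previous partial result, no separate step is needed to combine the tiles at the end.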
Note that the immediates for these two instructions both use the type
width in their MachineInstr representations. The SATI_RV64 binary
encoding and the RISCVISD::SATI encoding use the type width minus one.
Assisted-by: Claude Sonnet 4.5
The FMulAdd (CombinedVectorize) transformation in transformNodes() marks
an FMul child entry with zero cost, assuming it is fully absorbed into
the fmuladd intrinsic. However, when any FMul scalar has multiple uses
(e.g., also stored separately), the FMul must survive as a separate
node.
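One way to read the condition above is sketched below in Python. This is an illustration under assumptions, not the SLP vectorizer's cost model: `fmul_child_cost`, `fmul_use_counts`, and `fmul_cost` are hypothetical names.

```python
def fmul_child_cost(fmul_use_counts, fmul_cost):
    """Cost charged for the FMul child entry of a combined fmuladd node."""
    # Zero cost is only justified when every FMul scalar feeds the fmuladd alone.
    if all(uses == 1 for uses in fmul_use_counts):
        return 0
    # At least one FMul has another user (e.g. a store), so it must survive
    # as a separate node and its cost cannot be dropped.
    return fmul_cost
```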
Reviewers: hiraditya, RKSimon, bababuck
Pull Request: https://github.com/llvm/llvm-project/pull/189692
Tests with custom a.out targets in their Makefile (e.g.
`TestBSDArchives.py`) bypass the standard Makefile.rules linking step
where `CODESIGN` is applied. This leaves the binary unsigned, causing
the process to be killed on remote Darwin devices.
This adds a codesigning step to the `all` target in Makefile.rules that
signs both $(EXE) and a.out if they exist. This ensures all test
binaries are signed regardless of how they were built.
rdar://173840592
Signed-off-by: Med Ismail Bennani <ismail@bennani.ma>
This reverts commit 873d6bc3b415f1c2d942bbf4e4219c4bdcd4f2f8.
This causes the Linux kernel build to fail because it relies on
alias-invalidation in kernel/core/sched.c.
The `run-clang-tidy.py` script now uses `shlex.join()` to construct the
command string for printing.
This ensures that arguments containing shell metacharacters, such as the
asterisk in `--warnings-as-errors=*`, are correctly quoted. This allows
the command to be safely copied and pasted into any shell for manual
execution, fixing errors previously seen with shells like `fish` that
are strict about wildcard expansion.
Before:
```
[ 1/15][0.2s] /usr/bin/clang-tidy -p=/home/user/work/project/build --warnings-as-errors=* /home/user/work/project/src/main.cpp
```
Note: running this command in the fish shell fails with an error like
``fish: No matches for wildcard '--warnings-as-errors=*'. See `help
wildcards-globbing` ``
After:
```
[ 1/15][0.2s] /usr/bin/clang-tidy -p=/home/user/work/project/build '--warnings-as-errors=*' /home/user/work/project/src/main.cpp
```
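The difference between the two outputs can be reproduced with the standard library alone. This is a minimal sketch using the argument list from the example above; it is not the script's actual code, and the variable names are illustrative.

```python
import shlex

args = [
    "/usr/bin/clang-tidy",
    "-p=/home/user/work/project/build",
    "--warnings-as-errors=*",
    "/home/user/work/project/src/main.cpp",
]

# Plain joining leaves the asterisk exposed to the shell.
unquoted = " ".join(args)

# shlex.join() (Python 3.8+) quotes only the arguments that need it.
quoted = shlex.join(args)

# The quoted form splits back into the original argument list.
round_trip = shlex.split(quoted)
```

Only the argument containing `*` is wrapped in single quotes; the other arguments consist solely of characters that `shlex.quote()` treats as safe.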
Introduce a new formatting option to control spacing before enum
underlying type colons. This preserves existing behavior while
allowing independent control from inheritance colon spacing.
Previously, enum underlying type colons were not configurable.
Fixes #188734
---------
Co-authored-by: Tharun V K <Tharun.V.K@ibm.com>
This PR adds layout propagation support for multi-reduction/reduction
ops with a scalar result:
1) Enhance setupMultiReductionResultLayout() and
LayoutInfoPropagation::visitVectorMultiReductionOp() to support a scalar
result.
2) Add propagation support for the vector.reduction op at the lane
level, since the op is only introduced at the lane level.
This introduces a new `ModuleCache` interface for writing PCM files.
Together with #188876, this will enable adding a caching layer into the
`InProcessModuleCache` implementation, hopefully reducing IO cost.
Moreover, this makes it super explicit that the PCM is written before
its timestamp, which is an important invariant that we've broken before.
The `ReductionDataSize` field in `KernelEnvironmentTy` and the
`MaxDataSize` used to compute the `reduce_data_size` argument to
`__kmpc_nvptx_teams_reduce_nowait_v2` were both computed using pointer
types for by-ref reductions instead of the actual element types. This
caused the global teams reduction buffer to be undersized relative to
the offsets used by the copy/reduce callbacks, resulting in
out-of-bounds access faults at runtime.
For example, a by-ref reduction over `[4 x i32]` (16 bytes) would
allocate buffer slots based on `sizeof(ptr)` = 8 bytes, but the
generated callbacks would access 16 bytes per slot.
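The size mismatch in this example can be worked through numerically. The sketch below is plain Python arithmetic, not OMPIRBuilder code: `slot_size`, `overrun`, the 8-byte pointer size, and the slot count of 1024 are assumptions made for illustration.

```python
POINTER_SIZE = 8  # bytes; assumes a 64-bit target


def slot_size(element_size, by_ref, fixed):
    """Bytes reserved per reduction slot in the teams reduction buffer."""
    if by_ref and not fixed:
        return POINTER_SIZE  # old behavior: sized from the pointer type
    return element_size      # sized from the actual element type


def overrun(num_slots, element_size, by_ref, fixed):
    """Bytes the callbacks touch beyond the end of the allocated buffer."""
    allocated = num_slots * slot_size(element_size, by_ref, fixed)
    accessed = num_slots * element_size
    return max(0, accessed - allocated)


ARRAY_4_X_I32 = 4 * 4  # [4 x i32] occupies 16 bytes

before_fix = overrun(1024, ARRAY_4_X_I32, by_ref=True, fixed=False)
after_fix = overrun(1024, ARRAY_4_X_I32, by_ref=True, fixed=True)
```

With these assumed numbers, the callbacks would touch twice as many bytes as were allocated before the fix, and exactly as many as were allocated after it.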
Fix both computation sites:
1. In MLIR's `getReductionDataSize()`, use
`DeclareReductionOp::getByrefElementType()` instead of `getType()` when
the reduction is by-ref, so the reduction buffer struct layout (and more
importantly its size) matches that emitted by the `OMPIRBuilder`.
2. In `OMPIRBuilder::createReductionsGPU()`, use
`ReductionInfo::ByRefElementType` instead of `ElementType` for by-ref
reductions when computing `MaxDataSize`. It seems that `MaxDataSize`
isn't actually used in the deviceRTL, but it's better to fix it to avoid
future propagation of this bug.
Finally, add CHECK lines to the existing array-descriptor reduction test
to verify both the kernel environment `ReductionDataSize` and the
`reduce_data_size` call argument reflect the actual element type size.
Assisted-by: Claude Opus 4.6
---------
Co-authored-by: Jeffrey Sandoval <jeffrey.sandoval@hpe.com>
See discussion in #183347.
Added a separate test case rather than reusing
destructor-dead-on-return.cpp because we need to test the functionality
of the deleting destructor, for which update_cc_test_checks.py does not
add check lines.
Local variables passed by non-const pointer or reference to a function
were previously invalidated in the LocalVariableMap (VarMapBuilder), on
the assumption that the callee might change what they point to. This
caused false positives when the function also carries ACQUIRE/RELEASE
annotations: handleCall translates those annotations with the pre-call
context, while subsequent guard checks use the post-invalidation
context, producing an expansion mismatch and a spurious warning.
The invalidation rules were a heuristic with significant complexity
(including a special-case carve-out for std::bind/bind_front) and
unclear benefit. Instead of adding more heuristics, drop the
alias-invalidation rules entirely.
Discussion: https://github.com/llvm/llvm-project/pull/183640
This PR introduces a new `ModuleCache` API for reading PCM files. This
makes it so that we don't go through the `FileManager` and VFS, which is
problematic downstream. We interpose a VFS that unintentionally shuffles
implicitly-built modules in and out of the CAS database, leading to some
unnecessary storage and runtime overhead. Moreover, this (together with
a writing API) will enable adding a caching layer into the
`InProcessModuleCache` implementation, hopefully reducing IO cost.
Summary:
The RPC headers are completely freestanding and can be installed and
used. This just places them in the standard compiler install directory
along with everything else. We put it under `shared/` so the usage
matches the upstream version. People using the installed version will
need to manually pass `-isystem` for the include directory, but this is
par for the course for these LLVM extra headers.
This patch adds `olGetEventElapsedTime` to the new LLVM Offload API, as
requested in
[#185728](https://github.com/llvm/llvm-project/issues/185728), and adds
the corresponding support in `plugins-nextgen`.
A main motivation for this change is to make it possible to measure the
elapsed time of work submitted to a queue, especially kernel launches.
This is relevant to the intended use of the new Offload API for
microbenchmarking GPU libc math functions.
### Summary
The new API returns the elapsed time, in milliseconds, between two
events on the same device.
To support the common pattern `create start event → enqueue kernel →
create end event → sync end event → get elapsed time`, `olCreateEvent`
now always creates and records a backend event through the device
interface. For backends that materialize real event state, this gives
the event concrete backend state that can be used for elapsed-time
measurement. For backends that do not materialize backend event state,
`EventInfo` may still remain null and existing event operations continue
to treat such events as trivially complete.
Previously, an event created on an empty queue could be represented only
as a logical event. That representation was sufficient for sync and
completion queries, but it was not suitable for elapsed-time measurement
because there was no backend event state to timestamp. The new behavior
preserves the meaning of completion of prior work while also allowing
backends with timing support to attach real event state.
### Changes in `plugins-nextgen`
#### Common interface
Add elapsed-time support to the common device and plugin interfaces:
* `GenericPluginTy::get_event_elapsed_time`
* `GenericDeviceTy::getEventElapsedTime`
* `GenericDeviceTy::getEventElapsedTimeImpl`
#### AMDGPU
* Add the required ROCr declarations and wrappers.
* Enable queue profiling at queue creation time.
* Record events by enqueuing a real barrier marker packet on the stream.
* Retain the timing signal needed to query the recorded marker later.
* Implement `getEventElapsedTimeImpl` using
`hsa_amd_profiling_get_dispatch_time`, converting the result to
milliseconds with `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`.
This follows the ROCm/HIP approach of enabling queue profiling at HSA
queue creation time, while keeping the AMDGPU queue path simpler than
the lazy-enable alternative discussed during review.
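The tick-to-millisecond conversion described above amounts to the following arithmetic. This is a minimal Python sketch rather than the plugin's code: `ticks_to_ms` and its parameters are hypothetical names, with `ticks_per_second` standing in for the value reported by `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`.

```python
def ticks_to_ms(start_ticks, end_ticks, ticks_per_second):
    """Convert a pair of timestamp-counter readings to elapsed milliseconds."""
    elapsed_ticks = end_ticks - start_ticks
    # ticks / (ticks per second) gives seconds; scale by 1000 for milliseconds.
    return elapsed_ticks * 1000.0 / ticks_per_second


# Example with an assumed 1 MHz timestamp frequency.
elapsed = ticks_to_ms(1000, 1500, 1_000_000)
```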
#### CUDA
* Add the required CUDA driver declarations and wrappers.
* Implement `getEventElapsedTimeImpl` with `cuEventElapsedTime`.
#### Host
* Add `getEventElapsedTimeImpl` that stores `0.0f` in the output
pointer, when present, and returns success.
Reason: the host plugin does not materialize backend event state and
already treats event operations as trivially successful. Returning
`0.0f` preserves that model without introducing a new failure mode.
#### Level Zero
* Add `getEventElapsedTimeImpl`, but leave it unimplemented.
Reason: the Level Zero plugin currently does not provide standalone
backend event support for this event model. For example, `waitEventImpl`
/ `syncEventImpl` are still unimplemented there.
---------
Signed-off-by: Leandro Augusto Lacerda Campos <leandrolcampos@yahoo.com.br>
Signed-off-by: Leandro A. Lacerda Campos <leandrolcampos@yahoo.com.br>
Several key targets unconditionally depend on the `OSUtil.osutil`
target, causing errors when it is linked unnecessarily or is not
available. This PR fine-tunes the dependency on `OSUtil.osutil` to
cleanly decouple those targets and gracefully skip targets that need
`osutil`. The main changes include:
* Make `LIBC_COPT_USE_C_ASSERT` a CMake config option, allowing
`LIBC_ASSERT` to use the system's `assert` without depending on `osutil`.
* Adjust cmake dependency for the following targets:
- libc.src.__support.libc_assert
- libc.src.__support.time.*
- libc.src.time.linux.*
- libc.src.unistd.*
- LibcTest
* Give `TestLogger` an option to use the system's `fprintf` instead of
`osutil`.
Add support for non-allocatable module-level CUDA managed variables
using pointer indirection through a companion global in
__nv_managed_data__. The CUDA runtime populates this pointer with the
unified memory address via __cudaRegisterManagedVar and
__cudaInitModule.
- Create a .managed.ptr companion global in the __nv_managed_data__
section and register it with _FortranACUFRegisterManagedVariable
- Call __cudaInitModule once after all variables are registered, only
when non-allocatable managed globals are present, to populate managed
pointers
- Annotate managed globals in gpu.module with nvvm.managed for PTX
.attribute(.managed) generation
- Suppress cuf.data_transfer for assignments to/from non-allocatable
module managed variables, since cudaMemcpy would target the shadow
address rather than the actual unified memory
- Preserve cuf.data_transfer for device_var = managed_var assignments
where explicit transfer is still required
Note: This PR depends on
[#189751](https://github.com/llvm/llvm-project/pull/189751) (MLIR:
nvvm.managed attribute).
Use the new input file system for `sifive-p670`'s llvm-mca tests. Some
of the vector crypto extension tests are left intact, due to the lack of
corresponding input files, and moved under the `rvv` sub-directory.
When subtracting the constant part of two addrecs, we need to ensure the
calculation won't overflow. If it may overflow, we conservatively stop
the analysis and return false.
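The check can be sketched as a range test on the exact difference. This Python example is an illustration only, assuming a signed interpretation of the constants at a fixed bit width; `checked_signed_sub` is a hypothetical helper, not an LLVM API.

```python
def checked_signed_sub(a, b, bits):
    """Return a - b if it fits in a signed `bits`-wide integer, else None."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    diff = a - b  # Python integers are unbounded, so this is the exact result
    return diff if lo <= diff <= hi else None


def constant_parts_differ_by(a, b, bits):
    """Conservative analysis step: give up (None) when the subtraction may overflow."""
    diff = checked_signed_sub(a, b, bits)
    if diff is None:
        return None  # cannot reason about the result, so stop the analysis
    return diff
```

Returning a sentinel instead of a wrapped value mirrors the conservative choice described above: when the result cannot be trusted, the analysis stops rather than continuing with a wrong difference.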
Add support for the `nvvm.managed` attribute on `llvm.mlir.global` ops.
When present, the LLVM IR translation emits `!nvvm.annotations` metadata
with `!"managed"` for the global variable, which the NVPTX backend uses
to generate `.attribute(.managed)` in PTX output.
This enables CUDA managed memory support for frontends that lower
through MLIR.
This fixes an issue with using_if_exists where we would hit `conflicts
with target of using declaration already in scope` with a
using_if_exists attribute referring to a declaration which did not
exist. That is, if we have `using ::bar
__attribute__((using_if_exists))` but `bar` is not in the global
namespace, then nothing should actually be declared here.
This PR contains the following changes:
1. Ensure we only diagnose this error if the target decl and [Non]Tag
decl can be substitutes for each other.
2. Prevent LookupResult from considering UnresolvedUsingIfExistsDecls in
the event of ambiguous results.
3. Update tests. This includes the minimal repro for a regression test,
and changes to existing tests which also seem to exhibit this bug.
Fixes #85335
---------
Co-authored-by: Petr Hosek <phosek@google.com>
Normally, sane front-ends using the common calling conventions avoid
having multiple sret with a return value, so this is NFCI. However,
multiple sret can be valid. This rewrites an odd-looking DenseMap of one
element that was needed for iteration into a more sensible vector.
Noted in https://github.com/llvm/llvm-project/pull/181740 review.
This PR extracts the non-visitor methods of class InstExecutor into a
separate class ExecutorAPI. This reorganization allows library functions
(and any future extensions) to reuse the functionality of InstExecutor
without introducing cyclic dependencies.
See also #185645 and #185817.
Currently, we do not check the module for requires directives, which
means we miss them and do not set them on the OpenMP module. In
addition, because symbols are currently checked on a first-come,
first-served basis, certain program layouts cause the compiler to miss
that a user specified requires somewhere in the module. The Semantics
layer partially, but not fully, avoids this by pushing the requires onto
the topmost PFT symbol: it is entirely possible to write a legal Fortran
program that has two or more of these (e.g. a module and a main program
in one file, or standalone functions intermixed with modules or a main
program). Some examples of this are shown in the added Fortran test.
This PR resolves the problem by gathering all of the relevant symbols
and processing them.
Also removed gathering from BlockDataUnit, as I don't think these
symbols ever get the requires applied.
This change adds a `gpu-num-threads` option to the sparsifier. This
allows users to specify the number of threads used for GPU codegen,
similar to the `num-threads` option in the `-sparse-gpu-codegen` pass.
This crashes Clang 19, 21, and 22 on x86-64, which are the versions I've
tested. I don't have a ready-to-test build of any other version, but it
seems much safer to just disable this for now.
getUsableSize returns the actual capacity of the underlying block, which
may be larger than the size originally requested by the user. If the
user writes data into this extra space accessible via getUsableSize and
subsequently calls reallocate, the existing implementation only copies
the original requested number of bytes. This resulted in data loss for
any information stored beyond the requested size but within the usable
bounds.
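The data loss can be reproduced with a toy model. The sketch below is plain Python, not the allocator's code: `Block`, `reallocate`, and the `copy_usable` flag are hypothetical, and copying up to the usable size is an assumed fix that the text above does not spell out.

```python
class Block:
    """Toy allocation: `requested` bytes asked for, `usable` bytes of real capacity."""

    def __init__(self, requested, usable):
        assert usable >= requested
        self.requested = requested
        self.data = bytearray(usable)


def reallocate(old, new_size, copy_usable):
    """Move `old` into a fresh block of `new_size` bytes."""
    new = Block(new_size, new_size)
    # The problematic variant copies only what was originally requested;
    # the assumed fix copies everything reachable via the usable size.
    count = len(old.data) if copy_usable else old.requested
    count = min(count, new_size)
    new.data[:count] = old.data[:count]
    return new


old = Block(requested=8, usable=16)
old.data[:] = bytes(range(16))  # the user fills the whole usable region

lossy = reallocate(old, 32, copy_usable=False)
kept = reallocate(old, 32, copy_usable=True)
```

In this model, `lossy` keeps only the first 8 bytes and leaves the next 8 zeroed, while `kept` preserves all 16 bytes the user wrote.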
Update the Flang Extension doc to remove a note about a warning that was
removed in a previous PR (#178088). This doc change should have been
made in that PR; the oversight was only recently discovered.