Resolves #186801
Removed the non-standard member type `iterator_type` from `__wrap_iter`.
This member exposed the underlying iterator type, and its removal
prevents users from relying on the implementation detail.
This PR refactors `getExpandedSizes` and `getExpandedStrides` to compute
their results directly from the `output_shape` of `memref.expand_shape`.
Instead of reconstructing expanded sizes/strides through manual
inference, we now rely on the operation’s explicit shape information.
The previous implementation imposed the restriction that there must be
at most one dynamic size per reassociation group. This limitation is
removed by the new approach: any number of dynamic dimensions within a
group is now supported, as long as they are represented in the
`output_shape`.
As a result, the code becomes both simpler and more expressive, while
better matching the semantics of `memref.expand_shape`.
Windows EH requires exception objects to be allocated on the stack, but
there is no reliable way to identify them. CoroSplit employs a best-effort
algorithm to determine whether allocas persist on the stack or the
frame, which may result in miscompilation when Windows exceptions are
used.
This patch proposes that we treat allocas used by catchpad as exception
objects and never place them on the frame. A verifier check is added to
enforce that operands of catchpad are either constants or allocas.
Close #143235
Close #153949
Close #182584
Adds a new EPCGenericJITLinkMemoryManager convenience constructor that
constructs an instance by looking up the given symbol names in the
bootstrap JITDylib of the given ExecutionSession.
The symbol names default to the SimpleNativeMemoryMap SPS-interface
symbol names provided by the new ORC runtime.
The current implementation of getSqrtEstimate() has incorrect semantics
when using `FRSQRTE`.
`FRSQRTE` computes an approximation to 1/sqrt(x), but the existing code
multiplies the estimate by the operand when Reciprocal is true. This
results in returning sqrt(x) instead of 1/sqrt(x), effectively reversing
the intended semantics of the 'Reciprocal' flag.
Additionally, the implementation does not properly account for LLVM's
Newton-Raphson refinement pipeline. When refinement steps are requested,
the initial estimate must be in reciprocal form so that the generic
DAGCombiner can apply NR iterations correctly.
This patch fixes the behavior by:
- Returning the raw FRSQRTE result when Reciprocal is true, or when
refinement steps are required.
- Only reconstructing sqrt(x) via x * rsqrt(x) when no refinement is
requested.
- Refactoring type checks into a helper function
(isSupportedReciprocalEstimateType)
for improved readability and maintainability.
The updated implementation aligns with the expectations of LLVM's
reciprocal estimate framework and matches the behavior of other targets
such as X86 and AArch64.
There is no functional change when reciprocal estimates are disabled; the
patch fixes incorrect results when fast-math enables reciprocal sqrt estimates.
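The relationship between the reciprocal estimate, Newton-Raphson refinement, and the reconstruction of sqrt(x) can be sketched in plain C++. This is not the backend code: `roughRsqrt` is a hypothetical stand-in for a hardware estimate such as `FRSQRTE`, and `sqrtEstimate` folds the refinement into one function for illustration, whereas in LLVM the refinement is applied by the generic DAGCombiner.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a hardware estimate such as FRSQRTE:
// returns 1/sqrt(x) with a deliberate 1% relative error.
inline double roughRsqrt(double x) {
  assert(x > 0.0 && "sketch only handles positive inputs");
  return (1.0 / std::sqrt(x)) * 1.01;
}

// One Newton-Raphson step for y ~= 1/sqrt(x):
//   y' = y * (1.5 - 0.5 * x * y * y)
inline double refineRsqrt(double x, double y) {
  return y * (1.5 - 0.5 * x * y * y);
}

// Mirrors the flag semantics described above: reciprocal == true yields
// 1/sqrt(x); otherwise sqrt(x) is reconstructed as x * rsqrt(x).
inline double sqrtEstimate(double x, bool reciprocal, int refinementSteps) {
  double y = roughRsqrt(x); // the estimate always starts in reciprocal form
  for (int i = 0; i < refinementSteps; ++i)
    y = refineRsqrt(x, y);
  return reciprocal ? y : x * y;
}
```

The point of the sketch is that refinement operates on the reciprocal form, so multiplying by `x` before refining (the old behavior) would feed the wrong quantity into the iteration.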
Fixes #186328
Example:
```fortran
!$ACC KERNELS PRESENT(CG, W1)
CG(1:W1%WDES1%NPL, NN) = W1%CPTWFP(1:W1%WDES1%NPL)
CPROJ(:, NN) = W1%CPROJ(1:SIZE(CPROJ,1))
!$ACC END KERNELS
```
When compiling OpenACC kernels containing array section assignments of
rank-2 arrays with a scalar index in one dimension (e.g. `CG(1:NPL,
NN)`), the Fortran lowering creates a `fir.slice` where collapsed
(scalar) dimensions use `fir.undefined index` as the stop/step values.
`SliceOp::getOutputRank()` relies on `getDefiningOp()` returning
`fir::UndefOp` to identify these collapsed dimensions and compute the
correct output rank.
When `fir.undefined` values defined outside an offload region are used
inside it, `gpu-kernel-outlining` turns them into function arguments.
Since function arguments have no defining op (`getDefiningOp()` returns
`nullptr`), `getOutputRank()` no longer recognizes the collapsed
dimensions, computing rank 2 instead of rank 1. This causes the
`fir.rebox` verifier to fail with:
```
'fir.rebox' op result type rank and rank after applying slice operand must match
```
Fix: Register `OutlineRematerializationOpInterface` for `fir::UndefOp`
(and `fir::SliceOp`) in `RegisterOpenACCExtensions.cpp`. This causes
`OffloadLiveInValueCanonicalization` to clone these operations inside
the offload region before outlining, preserving the `fir::UndefOp`
identity so that `getOutputRank()` correctly identifies collapsed
dimensions.
Currently the guards for `totalorderbf16` and `totalordermagbf16` are as
follows:
```
#ifndef LLVM_LIBC_SRC_MATH_TOTALORDERMAGF16_H
#define LLVM_LIBC_SRC_MATH_TOTALORDERMAGF16_H
-
#endif // LLVM_LIBC_SRC_MATH_TOTALORDERMAGF16_H
```
and
```
#ifndef LLVM_LIBC_SRC_MATH_TOTALORDERF16_H
#define LLVM_LIBC_SRC_MATH_TOTALORDERF16_H
-
#endif // LLVM_LIBC_SRC_MATH_TOTALORDERF16_H
```
As we can see, these are for F16 and not BF16.
This PR fixes that by using the correct guards, `TOTALORDERBF16` and
`TOTALORDERMAGBF16`.
This PR fixes a small nit introduced in
[1c1135b](1c1135b3fc) by changing
```
#endif // LLVM_LIBC_SRC_MATH_ASINIF16_H
```
to
```
#endif // LLVM_LIBC_SRC_MATH_ATANPIF16_H
```
When compiling WebAssembly with ThinLTO, functions are partitioned into
isolated `.bc` modules and dispatched to individual LTO backend threads.
During code generation, the `CoalesceFeaturesAndStripAtomics` pass
iterates over the module to gather the union of target features (like
`+atomics`) attached to defined functions. In particular when not using
threads, it lowers away atomics and TLS variables to their
single-threaded equivalents.
However, if a partitioned module only contains globally defined TLS
variables (e.g. there are no functions, or all functions were fully
inlined or stripped by dropDeadSymbols before ThinLTO optimization), the
module becomes completely devoid of function definitions. The coalescing
pass then falls back to fetching features from the `TargetMachine`.
Because in LTO the `TargetMachine` defaults to a generic target without
atomics enabled, the TLS is lowered away and the `wasm-feature-atomics`
flag is omitted from the resulting ThinLTO object partition, causing
`wasm-ld` to immediately reject it.
To fix this we take advantage of the fact that the linker always knows
whether threads are being used (via the --shared-memory flag). When
using shared memory, we enable +atomics and +bulk-memory in the
TargetMachine that is used for the backend, and the feature coalescing
pass will correctly detect the use of threads.
This only makes sense for atomics because of the global linker
configuration; for other features we wouldn't be able to do this, but we
don't rewrite away any other features anyway.
Added missing ``#include`` insertion when the format function call
appears as an argument to a macro.
Part of #175183
---------
Co-authored-by: Victor Chernyakin <chernyakin.victor.j@outlook.com>
[PyObject members are not to be accessed
directly](https://docs.python.org/3/c-api/structures.html#c.PyObject),
but rather through macros, in this case `Py_REFCNT()`.
In most CPython builds, i.e. those with the Global Interpreter Lock enabled,
`Py_REFCNT()` expands to accessing `ob_refcnt` anyway. However, in a
free-threaded CPython, combined with disabling the limited API (since it
requires the GIL for now), the direct member does not exist, causing the
build to fail. The macro expands to the correct access method in the
free-threaded configuration.
The type `__gnu_cxx::hash_{,multi}map` creates objects of type
`std::pair<Key, Value>` and returns pointers to them of type
`std::pair<const Key, Value>`. If either `Key` or `Value` is
non-standard-layout, this is UB, and is furthermore considered by
pointer field protection to be a type confusion, which leads to a
program crash. Fix it by using the correct type for the pair's storage
and using const_cast to form a pointer to the key in the one place where
that is needed.
Reviewers: ldionne
Reviewed By: ldionne
Pull Request: https://github.com/llvm/llvm-project/pull/183223
We need to add the regions with the direct uses into the list
for processing, otherwise the direct uses will not be removed
and will use the slot after the promotion.
The added LIT test was triggering "after promotion, the slot pointer
should not be used anymore" assertion.
Sample addresses belonging to external DSOs (buildid doesn't match the
current file) are treated as external (0).
Buildid for the main binary is expected to be omitted.
Test Plan:
added pre-aggregated-perf-buildid.test
This reinforces what is already true in the codebase: all uses of
`build()` use keyword arguments.
With this change, it will be an error to call `build` using positional
arguments:
```
TypeError: build() takes 1 positional argument but 2 were given
```
Sample addresses belonging to external DSOs (buildid doesn't match the
current file) are treated as external (0).
Buildid for the main binary is expected to be omitted.
Test Plan: added pre-aggregated-perf-buildid.test
Reviewers:
paschalis-mpeis, maksfb, yavtuk, ayermolo, yozhu, rafaelauler, yota9
Reviewed By: paschalis-mpeis
Pull Request: https://github.com/llvm/llvm-project/pull/186931
Create bolt/docs/profiles.md documenting all accepted profile formats:
perf.data, fdata, YAML, and pre-aggregated. Covers collection methods,
format syntax, examples, and known limitations.
Add reference from bolt/docs/index.rst.
This PR adds a new SPIRV pass that generates a kernel named
"spirv$device$init" that iterates the pointers in the table pointed by
__init_array_start and __init_array_end and executes them. It also
generates symbols for each constructor with the form
__init_array_object_NAME_PRIORITY.
These symbols will be used by the Level Zero plugin in the liboffload
runtime (with the support introduced by #187510) to generate the
aforementioned table as spirv-link cannot create the table itself.
It also does the same thing for destructors, with the kernel name being
"spirv$device$fini", the table pointers __fini_array_start and
__fini_array_end, and the generated symbols prefix __fini_array_object.
The code was mostly generated by Claude 4.5 and has been reviewed by me
to the best of my ability.
This reverts commit b1aa6a45060bb9f89efded9e694503d6b4626a4a and commit
ce44d63e0d14039f1e8f68e6b7c4672457cabd4e.
This fails the build with some older gcc:
llvm/include/llvm/CodeGenTypes/LowLevelType.h:501:35: error: call to
non-constexpr function ‘static llvm::LLT llvm::LLT::integer(unsigned
int)’
return integer(getSizeInBits());
^
We have two tests that use FileCheck for diagnostics and which try to
check that the output contains no compiler errors by checking for the
string 'error'. The issue with this approach is that this also causes
those tests to fail if the *path* contains the word 'error', which can
happen e.g. if the branch name contains the word 'error'.
Instead, we now check for `error:` since that string is much less likely
to appear in a path.
The ScopedTimeout was created as a temporary, causing it to be destroyed
immediately and the timeout to have no effect. Give it a name so it
lives until the end of the function scope.
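The difference between an unnamed temporary and a named guard can be shown with a minimal sketch. `ScopedValue` below is a hypothetical RAII class standing in for `ScopedTimeout`; its interface is an assumption and does not reflect the real class.

```cpp
#include <cassert>

// Hypothetical RAII guard standing in for ScopedTimeout: it overrides a
// value on construction and restores the previous one on destruction.
class ScopedValue {
  int &Slot;
  int Saved;

public:
  ScopedValue(int &SlotRef, int NewValue) : Slot(SlotRef), Saved(SlotRef) {
    assert(NewValue >= 0 && "timeout must be non-negative");
    Slot = NewValue;
  }
  ~ScopedValue() { Slot = Saved; }
  ScopedValue(const ScopedValue &) = delete;
  ScopedValue &operator=(const ScopedValue &) = delete;
};

// Bug pattern: the guard is an unnamed temporary, so it is destroyed at the
// end of the full expression and the old value is already restored when the
// next statement runs.
inline int observedWithTemporary(int &Timeout) {
  ScopedValue(Timeout, 30);
  return Timeout;
}

// Fixed pattern: naming the guard keeps it alive until the end of the scope.
inline int observedWithNamedGuard(int &Timeout) {
  ScopedValue Guard(Timeout, 30);
  return Timeout;
}
```

With an initial timeout of 5, the first function observes 5 (the override is already gone), while the second observes 30 and restores 5 on return.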
The ordered reduction support introduced in 94e366ef2060 can cause an
infinite loop when processing complex reduction chains. The worklist
algorithm re-adds instructions from PossibleOrderedReductionOps when
switching to ordered mode, but doesn't track which instructions have
already been processed. This allows instructions to be re-added and
processed multiple times, creating cycles.
Add a Visited set to track processed instructions and skip any that
have already been handled, preventing the infinite loop.
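The general pattern of guarding a worklist with a visited set can be sketched as follows. The graph representation and the function name `processAll` are hypothetical; the sketch only illustrates why skipping already-processed items guarantees termination when items can be re-added.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical graph: Succs[N] lists the nodes that processing N re-queues
// (all entries are assumed to be valid indices into Succs). Cycles are
// allowed, which is what makes an unguarded worklist loop forever.
inline std::size_t processAll(const std::vector<std::vector<int>> &Succs,
                              int Start) {
  assert(Start >= 0 && static_cast<std::size_t>(Start) < Succs.size());
  std::vector<int> Worklist{Start};
  std::unordered_set<int> Visited;
  std::size_t Processed = 0;
  while (!Worklist.empty()) {
    int Node = Worklist.back();
    Worklist.pop_back();
    // insert().second is false when Node was already handled: skip it.
    if (!Visited.insert(Node).second)
      continue;
    ++Processed;
    for (int Next : Succs[Node])
      Worklist.push_back(Next);
  }
  return Processed;
}
```

Each node is processed at most once, so the loop terminates after at most `Succs.size()` productive iterations even if the graph contains cycles.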
Create takes a JITDylib and a SymbolNames struct, looks up the
implementation symbol addresses in the given JITDylib, and uses them to
construct an EPCGenericJITLinkMemoryManager instance. This makes it
easier for ORC clients to construct the memory manager from named
symbols (e.g. in a bootstrap JITDylib) rather than raw addresses.
The `PointerLikeTypeTraits` for `LazyGenerationalUpdatePtr` claimed
`PointerLikeTypeTraits<T>::NumLowBitsAvailable - 1` spare low bits. This
assumed that the inner `PointerUnion<T, LazyData*>` has `T_bits - 1`
spare bits, which is only true when `alignof(LazyData) >= alignof(*T)`.
On 32-bit systems, `LazyData` (containing pointers and `uint32_t`) has
`alignof = 4`, giving `LazyData*` only 2 low bits. With `T = Decl*` (3
bits due to `alignas(8)`), the inner `PointerUnion` has `min(3,2) - 1 =
1` spare bit, but the PLTT claimed `3 - 1 = 2`.
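The arithmetic above can be written out as compile-time constants. This is only a sketch of the bit counting for the 32-bit case; the helper `unionSpareBits` and the constant names are hypothetical and do not appear in the code base.

```cpp
#include <algorithm>

// Spare low bits of a two-member pointer union: the smaller of the two
// members' low-bit counts, minus one bit for the union's own tag.
constexpr int unionSpareBits(int BitsA, int BitsB) {
  return std::min(BitsA, BitsB) - 1;
}

// Values taken from the 32-bit case described above.
constexpr int DeclPtrBits = 3;     // Decl is alignas(8)
constexpr int LazyDataPtrBits = 2; // alignof(LazyData) == 4

// Old formula: derived from T alone.
constexpr int ClaimedBits = DeclPtrBits - 1;
// Actual capacity of the inner PointerUnion<T, LazyData*>.
constexpr int ActualBits = unionSpareBits(DeclPtrBits, LazyDataPtrBits);

static_assert(ClaimedBits == 2 && ActualBits == 1,
              "the old formula over-claims by one bit on 32-bit targets");
```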
Historically, the formula was correct when introduced in 053f6c6c9e4d --
at that time `Decl` had no alignment annotation, so `T_bits ==
LazyData*_bits` on all platforms. It became outdated when 771721cb35f3
added `LLVM_ALIGNAS(8)` to `Decl`, raising `Decl*` to 3 bits on 32-bit
while `LazyData*` stayed at 2. The old `PointerIntPair`-based
`PointerUnion::doCast` happened to mask with `minLowBitsAvailable()`
(tolerant of overclaims), so this was never exposed until the
`PunnedPointer` refactoring changed `doCast` to mask with
`To::NumLowBitsAvailable`.
Fixes: https://github.com/llvm/llvm-project/issues/188269
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the explicit `if constexpr` branching for big and little
endianness with compile-time calculated shift constants `VAL_SHIFT` and
`NEXT_SHIFT`. This simplifies the logic and reduces code duplication,
relying on the compiler to constant-fold the zero shifts into no-ops.
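A minimal sketch of the idea, assuming a hypothetical layout in which two 32-bit fields share one 64-bit word and the value field occupies the first four bytes in memory. Only the names `VAL_SHIFT` and `NEXT_SHIFT` come from the description; the struct, the field widths, and the template parameter used in place of real endianness detection are illustrative assumptions.

```cpp
#include <cstdint>

// Hypothetical layout: two 32-bit fields packed into one 64-bit word.
// Endianness is a template parameter here so both cases can be exercised.
template <bool IsBigEndian> struct PackedEntry {
  // Shift amounts are chosen once at compile time; the accessors below
  // contain no endianness branches, and the zero shift folds to a no-op.
  static constexpr unsigned VAL_SHIFT = IsBigEndian ? 32 : 0;
  static constexpr unsigned NEXT_SHIFT = IsBigEndian ? 0 : 32;

  static constexpr std::uint64_t pack(std::uint32_t Val, std::uint32_t Next) {
    return (static_cast<std::uint64_t>(Val) << VAL_SHIFT) |
           (static_cast<std::uint64_t>(Next) << NEXT_SHIFT);
  }
  static constexpr std::uint32_t val(std::uint64_t Word) {
    return static_cast<std::uint32_t>(Word >> VAL_SHIFT);
  }
  static constexpr std::uint32_t next(std::uint64_t Word) {
    return static_cast<std::uint32_t>(Word >> NEXT_SHIFT);
  }
};
```

Compared with an `if constexpr` per accessor, the single pair of constants keeps each accessor to one expression that is identical for both byte orders.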
Initialize makes sure that it calls RegisterPlugin only once, but
Terminate always calls UnregisterPlugin. This is a problem for tests
that call Initialize/Terminate before and after each test case: the
second case will fail because the trace plugin won't be loaded.
This fixes a test failure introduced by #187768, which adds a test case
that passes on its own but fails when run after the previous test case.
This PR introduces a TableGen-based code generation system for HLSL
intrinsic overloads as described in proposal
[[0043]](https://github.com/llvm/wg-hlsl/blob/main/proposals/0043-hlsl-intrinsic-tablegen.md)
for replacing hand-written boilerplate with declarative .td definitions.
Actual changes to `hlsl_intrinsics.h` and `hlsl_alias_intrinsics.h` to
replace handwritten HLSL intrinsic overloads with TableGen are left to
follow-up PRs.
Assisted-by: GitHub Copilot (powered by Claude Opus 4.6)
Up until now the bottom-up vectorizer pass would not delete the scalar
instructions that have external uses after being vectorized, because it
lacked the ability to generate extracts from the vectors.
With the term "external uses", we refer to uses outside the currently
vectorized graph.
This patch fixes this. We can now properly handle external uses by
extracting from the vectors.
This change relies on the recent changes to the DAG's callbacks because
the external user may not be within the current DAG's interval.
Add the -remove flag to llvm-lipo. This matches the existing Darwin lipo
tool:
```
% xcrun lipo 2>&1 | grep remove
error: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/lipo: one of -create, -thin <arch_type>, -extract <arch_type>, -remove <arch_type>, -replace <arch_type> <file_name>, -verify_arch <arch_type> ... , -archs, -info, or -detailed_info must be specified
-remove <arch_type> [-remove <arch_type> ...]
```
Assisted-by: claude-code
Fixes #188259
---
This PR fixes a crash in duplicate attribute checking when arguments
have different integer signedness. It changes the check to use
`isSameValue` (which uses `compareValues`) instead of `llvm::APSInt`
equality.
Fix lowering so post declare-actions are not attached to acc.terminator
(step back from trailing fir.result/acc.terminator before attaching).
---------
Co-authored-by: Slava Zakharin <szakharin@nvidia.com>
`transformCFGToSCF` would crash with a use-list assertion when it
encountered an op like `spirv.BranchConditional` that implements
`BranchOpInterface` (passing the existing precondition checks) but is
not handled by `createStructuredBranchRegionOp`. The algorithm mutated
the IR significantly before discovering the op was unsupported, leaving
it in a corrupt half-transformed state that triggered the assertion on
teardown.
Fix by adding `canConvertBranchOp` to `CFGToSCFInterface` (default:
accept all ops) and calling it inside `checkTransformationPreconditions`
for every block with more than one successor, before any IR
modifications are made. `ControlFlowToSCFTransformation` overrides the
method to accept only `cf.cond_br` and `cf.switch`.
Fixes #173566
Assisted-by: Claude Code
Division by zero is undefined behavior, so these two ops cannot be pure.
This commit marks them as conditionally speculatable, similar to
`arith.divsi` and `arith.divui`.
The encoded immediate is the number of trailing 1s in the maximum value.
The mailing list discussion prefers printing and parsing this value
plus one: https://lists.riscv.org/g/tech-p-ext/message/910
With this patch, saturating to a signed 8-bit integer would be "sati a0,
a0, 8". Previously it was "sati a0, a0, 7".
This is reflected in version 0.20 of the adoc spec here
https://github.com/riscv/riscv-p-spec/pull/226. I have updated our
RISCVUsage.rst to point to the adoc version of the spec.
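The mapping between the encoded immediate and the assembly operand can be sketched as follows. The helper names are hypothetical and are not the ones used in the RISC-V backend; the sketch only reproduces the arithmetic described above for signed saturation.

```cpp
#include <cstdint>

// Number of trailing 1 bits in Value (e.g. binary 0111 1111 gives 7).
constexpr unsigned countTrailingOnes(std::uint64_t Value) {
  unsigned Count = 0;
  while (Value & 1) {
    ++Count;
    Value >>= 1;
  }
  return Count;
}

// Encoded immediate for saturating to a signed integer of `Bits` bits
// (1 <= Bits <= 64): the trailing-ones count of the maximum value
// 2^(Bits-1) - 1.
constexpr unsigned encodedImm(unsigned Bits) {
  return countTrailingOnes((std::uint64_t{1} << (Bits - 1)) - 1);
}

// Assembly operand after this patch: the encoded value plus one.
constexpr unsigned printedOperand(unsigned Encoded) { return Encoded + 1; }
constexpr unsigned parsedEncoding(unsigned Printed) { return Printed - 1; }

// Signed 8-bit saturation: maximum value 127 has 7 trailing ones, so the
// encoding is 7 and the operand written in assembly is 8.
static_assert(encodedImm(8) == 7, "encoding for signed 8-bit");
static_assert(printedOperand(encodedImm(8)) == 8, "operand for signed 8-bit");
```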
This reverts commit 040b7e0a1deb and re-lands the PointerUnion
refactoring from #187950, with a fix for the 32-bit crash.
The bug was in doCast: it masked with
PointerLikeTypeTraits<To>::NumLowBitsAvailable to strip tag bits, but
the old PointerIntPair-based code masked with minLowBitsAvailable() (the
minimum across all union members). When a member type's PLTT over-claims
spare low bits, the new mask was too aggressive and cleared bits
belonging to a nested PointerUnion's tag.
Concretely, on 32-bit systems, Redeclarable::DeclLink nests
LazyGenerationalUpdatePtr (LGUP) whose PLTT claims 2 spare bits (Decl*
has alignas(8) = 3 bits, minus 1). But LGUP's inner PointerUnion<Decl*,
LazyData*> only has 1 spare bit on 32-bit (alignof(LazyData) = 4 gives
LazyData* only 2 low bits, tagShift = 1). Extracting LGUP from the outer
PointerUnion cleared bit 1 (the inner PU's type tag), corrupting the
discriminator and breaking redeclaration chains.
Added new unit tests that cover this scenario, but we will want to fix
the clang side too.
Issue: #188269
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>