This reverts commit 1418f80.
The change can cause an infinite rewrite loop when
ForwardConcatInsertSliceDest interacts with
FoldEmptyTensorWithExtractSliceOp.
The isTreeTinyAndNotFullyVectorizable check for 2-node trees
(insertelement root + gather child) was too aggressive: it rejected
trees even when LoadEntriesToVectorize was non-empty, preventing
gathered loads from being vectorized into masked loads/strided loads, etc.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/190181
The Exact SIV test and the Exact RDIV test behave almost identically,
except that the Exact SIV test also explores the directions in the final
step. This patch consolidates the two duplicate implementations into a
single function that can be used by both tests. While this change
slightly affects things like debug output and metrics, it is not
intended to alter the actual test results.
The main change is to eliminate the use of "Argument" terminology when
dealing with overloaded types since overloaded types can be either
argument or return values, and some additional renaming for clarity.
1. Rename `Tys` argument to various intrinsic APIs to `OverloadTys` to
better reflect its meaning.
2. Rename `IITDescriptorKind::Argument` to
`IITDescriptorKind::Overloaded` to better convey that it's an overloaded
type. Removed "Argument" suffix for other kinds for dependent types.
3. Rename `ArgKind` to `AnyKind`, `getArgumentNumber` to
`getOverloadIndex`, `getArgumentKind` to `getOverloadKind`,
`getRefArgNumber` to `getRefOverloadIndex`, and `IIT_ARG` to `IIT_ANY`.
4. Rename `IIT_ANYPTR` (used to represent a pointer qualified with
address space) to `IIT_PTR_AS` to clearly distinguish it from
`llvm_anyptr_ty`
5. Change the packing of [ref overload index & overload index] for
`VecOfAnyPtrsToElt` to pack the overload index into the lower bits, so
we can use the `getOverloadIndex` function to get the overload index.
This ends up being a pretty trivial amount of work, since we just have
to forward the initialization for a union on to the 'active' field,
which this patch does.
This functionality is described in the Itanium C++ABI 2.5.2 (and is also
where the test comes from). See also VTableBuilder.cpp's documentation
on the declaration of IsOverriderUsed for further details.
However, the explaination is:
When B and C are declared, A is a primary base in each case, so although
vcall offsets are allocated in the A-in-B and A-in-C vtables, no this
adjustment is required and no thunk is generated. However, inside D
objects, A is no longer a primary base of C, so if we allowed calls to
C::f() to use the copy of A's vtable in the C subobject, we would need
to adjust this from C* to B::A*, which would require a third-party
thunk. Since we require that a call to C::f() first convert to A*,
C-in-D's copy of A's vtable is never referenced, so this is not
necessary.
The short of that is: there is no way to call these, so we just emit a
nullptr rather than the required thunk.
This adds the CalculateLevelOfDetail and CalculateLevelOfDetailUnclamped
methods to Texture2D using the establish pattern used for other methods.
Assisted-by: Gemini
Sometimes we use array of bytes to represent `_BitInt` types in memory.
When this is the case the lowered array filler expression reaches
`ConstantEmitter::emitForMemory` already with memory type which will be
array of i8 instead of a single iN, so `cast<llvm::ConstantInt>` was
failing within `ConstantEmitter::emitForMemory`. This patch fixes the
assertion failure by not attempting any type changes if the type is
right already.
Fixes https://github.com/llvm/llvm-project/issues/189643
Assisted-by: claude in FileCheck CHECK lines fixing
Fixes#181643
For queries like `isKnownToBeAPowerOfTwo(V, OrZero=true)`, if an operand
is known to be "pow2-or-zero" but not strictly non-zero power-of-two,
the min/max case currently returns false even when the result remains
pow2-or-zero.
For instance:
- `A = select cond, 4, 0` (A is pow2-or-zero)
- `R = umin(A, 16)`
`R` is always in `{0, 4}` and querying `isKnownToBeAPowerOfTwo(R,
OrZero=true)` should be true.
Added unitests for baseline and failing case and now propagating
correctly to `OrZero` and `DemandedElts`
For all targets, (V)PADDB is always as fast as (V)PSLLW (usually faster)
- and usually as fast as (V)PAND, and avoids having to load a mask - so
for shift lefts by 2, a pair of (V)PADDB is a better choice vs
(V)PSLLW+(V)PAND
This is only necessary if we're avoiding a (V)PAND mask - otherwise we
just need a single (V)PSLLW.
The pass pipeline differs across targets, so make this test use
one specific pipeline, instead of trying to cater to cross-target
differences. Those differences are not relevant to the intent of
the test.
In this patch I am adding the missing target hooks required for the
liveness analysis to run on AArch64. These are
- getFlagsReg()
- getRegsUsedAsParams()
- getDefaultLiveOut()
- getGPRegs()
- isCleanRegXOR()
I am also introducing the following API in LivenessAnalysis
- BitVector getLiveIn/Out(const MCInst &)
- MCPhysReg scavengeRegFromState(BitVector &)
My intention is to allow the LongJmp pass scavenge usable registers when
injecting code.
The isTreeTinyAndNotFullyVectorizable check for 2-node trees
(insertelement root + gather child) was too aggressive: it rejected
trees even when LoadEntriesToVectorize was non-empty, preventing
gathered loads from being vectorized into masked loads/strided loads, etc.
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/190040
Extend the existing any_extend(bswap i16) -> rev16 combine to also
handle zero_extend. REV16 preserves a zero upper half, so for i16 loads
this saves one instruction: ldrh+rev+lsr#16 -> ldrh+rev16.
Introduce common infrastructure for runtimes that determines compiler
resource path locations. These variables introduced are:
* RUNTIMES_OUTPUT_RESOURCE_DIR
* RUNTIMES_INSTALL_RESOURCE_PATH
That contain the location for the compiler resource path (typically
`lib/clang/<version>`) in the build tree and the install tree (the
latter relative to CMAKE_INSTALL_PREFIX).
Additionally, define
* RUNTIMES_OUTPUT_RESOURCE_LIB_DIR
* RUNTIMES_INSTALL_RESOURCE_LIB_PATH
as for the location of clang/flang version-locked libraries (typically
`lib${LLVM_LIBDIR_SUFFIX}/<targer-triple>`, but also depends on `APPLE`
and `LLVM_ENABLE_PER_TARGET_RUNTIME_DIR`). This code is moved from
flang-rt and initially becomes its only user.
Refactored out of #171610 as requested
[here](https://github.com/llvm/llvm-project/pull/171610#discussion_r2687382481).
Extracted `get_runtimes_target_libdir_common` from compiler-rt as
requested
[here](https://github.com/llvm/llvm-project/pull/171610#discussion_r2689565634).
Added TODO comments to all runtimes as requested
[here](https://github.com/llvm/llvm-project/pull/171610#issuecomment-3789598635).
Implements isMemcpyEquivalentSpecialMember in CIR codegen so that
trivial copy/move constructors and defaulted union copy/move ops emit a
cir.copy directly instead of making a real constructor call. The logic
is shared with OG codegen by moving the implementation into ASTContext,
where it also gains the pointer field protection (PFP) check that was
previously missing in CIR.
Initially added in #187709. It was reverted in #188833, because
[llvm-clang-x86_64-sie-win](https://lab.llvm.org/buildbot/#/builders/46/builds/32873)
was failing in
`cross-project-tests/debuginfo-tests/dexter-tests/nrvo.cpp`.
The test passed for me locally. After checking on another machine, I
found that `S_DEFRANGE_REGISTER_REL_INDIR` is only supported by
dbgeng/WinDbg from Windows 10.0 Build 19041 (released 2020) onwards.
SDKs before this will fail to read the value. That buildbot is on
Windows 10.0 Build 17763.
I'm not sure if we should make the generation of that record
conditional. Debuggers that can't read the record will skip it. They'll
still see that there's some local variable, but won't be able to display
the value.
As far as I know, users of older Windows 10 builds should be able to
install a newer Windows SDK and use the WinDbg from that version. But I
haven't tested that.
For whatever reason we ended up with register/register but the first
register just had the second register folder in it.
Move the files up one level so we have register/<test files>.
This patch uses DestructiveBinaryShImmUnpred (which was previously
unused as far as I could tell) to define pseudos for arithmetic
immediate instructions such as ADD (immediate), which allows using
MOVPRFX with these instructions.
Reland of #172062 (a71b1d2), which was reverted in b0234d1.
This patch makes essential Bitset member functions constexpr (`set()`,
`any()`, `none()`, `count()`, `operator==`, `!=`, `<`, `\~`) and adds a
new `all()` method. It also introduces a `maskLastWord()` invariant to
ensure unused high bits in the last word are always zero, which is
required for correctness of `operator~`, `set()`, `all()`, and
comparisons on non-word-aligned sizes (e.g., `Bitset<33>`).
Changes from the original reverted PR:
- Replaced `llvm::any_of` with an inline loop to avoid depending on
constexpr `any_of`/`none_of` from `STLExtras` (#172536), which was also
reverted due to a GCC 15.2.1 bootstrap miscompile.
- The patch is now fully self-contained with no prerequisite changes.
Motivation: This is a prerequisite for making `LaneBitmask` a wrapper
around `Bitset`, enabling scalable lane bitmasks beyond 64 bits
(https://discourse.llvm.org/t/rfc-out-of-lanebitmask-bits-again/88613).
This patch fixes https://github.com/llvm/llvm-project/issues/187033
In BE8 mode, instruction bytes are reversed for sections containing
code. This logic currently assumes that arm mapping symbols (e.g. $a,
$t, $d) are always associated with InputSections.
However, mapping symbols can also be defined in other section types such
as mergeable sections (SHF_MERGE). These are not represented as
InputSection, and attempting to cast them using
cast_if_present<InputSection> results in an assertion failure.
Handle specialization of `linalg.generic` ops representing a unary
elementwise computation to the `linalg.elementwise` category op. This
implements a previously absent path in the linalg morphism.
This patch contains two parts.
1. Update costs reflect to the codegen changes. This is not that
accurate since the step vector can use smaller type if there is a
vscale_range attribute. But we cannot get that in the type-based query
in TTI.
2. Return invalid cost for the vector.extract.last.active that needs
vector split for the step vector. But currently this is not handled
correctly and will hit the assertion.
For not blocking the FindLast reduction in LV
(https://github.com/llvm/llvm-project/pull/184931). We should land this
first and fix the SelectionDAG for vector.extract.last.active lowering.
This is a follow-up of the suggestion left here:
https://github.com/llvm/llvm-project/pull/181707#discussion_r2995733831
The override functions in AMDGPU/ARM/SystemZ/X86 are required to avoid
enabling partial reductions where they were previously disabled (I've
added this for all targets that implement getArithmeticReductionCost).
Fixes#165413. Where a build failure was reported:
```
/b/s/w/ir/x/w/llvm-llvm-project/lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp:1182:9: error: unknown type name 'user_sve_header'; did you mean 'sve::user_sve_header'?
1182 | user_sve_header *header =
| ^~~~~~~~~~~~~~~
| sve::user_sve_header
```
To fix this, add sve:: as we do for all other uses of this.
This is LLDB's copy of a structure that Linux also defines. I think the
build worked on some machines because that version ended up being
included, but with a more isolated build, it may not.
We have our own definition of it so we can be sure what we're using in
case Linux extends it later.
Update the built-in signatures in BuiltinsLoongArchLSX.def and
BuiltinsLoongArchLASX.def to precisely match the vector types used in
the corresponding intrinsic headers (lsxintrin.h and lasxintrin.h).
This alignment ensures that these intrinsics can be compiled
successfully even when -flax-vector-conversions=none is specified, since
the built-in arguments no longer rely on implicit vector type
conversions.
Added new test cases to verify the macro-defined LSX/LASX
intrinsic interfaces under -flax-vector-conversions=none.
Fixes#189898
Forward all CTU-import failures as diagnostics (remarks, warnings,
errors), except for `index_error_code::missing_definition` which has the
potential of generating too many diagnostics.
--
CPP-7804
Moving these into the middle-end pipeline will allow for additional
optimization of the expansion result, such as CSE of redundant loads
(c.f. https://godbolt.org/z/bEna4Md9r). For now, we conservatively place
the passes at the end of the middle-end pipeline, so we mostly don't
benefit from additional optimizations yet. The pipeline position will be
moved in a future change.
This builds on work done by legrosbuffle in
https://reviews.llvm.org/D60318.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make this dependent only on the minsize attribute and drop the pipeline
handling.
Rename the enable-loop-header-duplication option to
enable-loop-header-duplication-at-minsize to clarify that it controls
header duplication at minsize only (in other cases it is enabled by
default, independently of this option).
Instead of using the Os/Oz level during pass pipeline construction,
query the optsize/minsize attribute on the function to determine whether
specialization is allowed to take place. This ensures consistent
behavior for per-function attributes.
It's worth noting that FuncSpec *already* checks for minsize, but at the
call-site level.
rather than using absolute symbol constants on ELF/x86.
This leads to better codegen as the absolute symbol constants were not
resolved until link time (see bug for example).
Fixes#188470
Add `markParallel` using level-synchronized `parallelFor`. Each BFS
level is processed in parallel; newly discovered sections are collected
in per-thread queues and merged for the next level.
The parallel path is used when `!TrackWhyLive && partitions.size()==1`.
`parallelFor` naturally degrades to serial when `--threads=1`.
Uses depth-limited inline recursion (depth<3) and optimistic
load-then-exchange dedup for best performance.
Linking a Release+Asserts clang (--gc-sections, --time-trace) on an old
x86-64:
8 threads: markLive 315ms -> 82ms (-234ms). Total 1562ms -> 1350ms
(1.16x).
16 threads: markLive 199ms -> 50ms (-149ms). Total 1017ms -> 862ms
(1.18x).
and on Apple M4: markLive 61ms -> 13ms. Total 317.3ms -> 272.7ms
(1.16x).