This lowers the splitting as:
```
any_active(hi_mask)
? (find_last_active(hi_mask) + lo_mask.getVectorElementCount())
: find_last_active(lo_mask)
```
It also trivially lowers `<1 x i1>` scalarization to returning zero,
which is a natural result of the splitting (and of the lack of a sentinel
"none-active" result value).
The lowerings likely can be improved. This patch is for completeness.
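The split form above can be modelled on plain boolean vectors. This is only a sketch: `findLastActive`, `anyActive` and `findLastActiveSplit` are hypothetical names, and the reference result for an all-false mask is assumed to be zero, since there is no "none-active" sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference model (assumption): index of the last set lane, or 0 when no
// lane is set, because there is no separate "none-active" result value.
std::size_t findLastActive(const std::vector<bool> &mask) {
  for (std::size_t i = mask.size(); i-- > 0;)
    if (mask[i])
      return i;
  return 0;
}

bool anyActive(const std::vector<bool> &mask) {
  for (bool lane : mask)
    if (lane)
      return true;
  return false;
}

// Split form: halve the mask until a single lane remains.
std::size_t findLastActiveSplit(const std::vector<bool> &mask) {
  assert(!mask.empty());
  if (mask.size() == 1)
    return 0; // <1 x i1>: the result is zero whether or not the lane is set
  std::size_t half = mask.size() / 2;
  std::vector<bool> lo(mask.begin(), mask.begin() + half);
  std::vector<bool> hi(mask.begin() + half, mask.end());
  return anyActive(hi) ? findLastActiveSplit(hi) + lo.size()
                       : findLastActiveSplit(lo);
}
```

Under these assumptions the split form agrees with the reference model, including for odd element counts.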
Should fix:
https://github.com/llvm/llvm-project/pull/178862#issuecomment-3862310334
Fixes #180212
When there is no target-specific lowering of @llvm.cond.loop, it is
lowered into a simple loop by PreISelIntrinsicLowering. Mark the branch
weights into the no-return loop as unknown, given that we do not have
value metadata, to fix the profcheck test for this feature.
Reviewers: mtrofin, alanzhao1, snehasish, pcc
Pull Request: https://github.com/llvm/llvm-project/pull/180390
Addresses one of the subtasks of #150515.
The code is ported from `SelectionDAG::computeKnownBits` and tests are
loosely based on `AArch64/GlobalISel/knownbits-shl.mir`.
The llvm.cond.loop intrinsic is semantically equivalent to a conditional
branch conditioned on ``pred`` to a basic block consisting only of an
unconditional branch to itself. Unlike such a branch, it is guaranteed
to use specific instructions. This allows an interrupt handler or
other introspection mechanism to straightforwardly detect whether
the program is currently spinning in the infinite loop and possibly
terminate the program if so. The intent is that this intrinsic may
be used as a more efficient alternative to a conditional branch to
a call to ``llvm.trap`` in circumstances where the loop detection
is guaranteed to be present. This construct has been experimentally
determined to be executed more efficiently (when the branch is not taken)
than a conditional branch to a trap instruction on AMD and older Intel
microarchitectures, and is also more code size efficient by avoiding the
need to emit a trap instruction and possibly a long branch instruction.
On i386 and x86_64, the infinite loop is guaranteed to consist of a short
conditional branch instruction that branches to itself. Specifically,
the first byte of the instruction will be between 0x70 and 0x7F, and
the second byte will be 0xFE.
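As an illustration of how an interrupt handler might use this guarantee, the check below tests the two bytes at a saved instruction pointer. This is a sketch only; `isCondLoopSpin` is a hypothetical name, and how the handler obtains the instruction pointer is outside its scope.

```cpp
#include <cstdint>

// Returns true if the two bytes at `pc` encode a short conditional branch
// (first byte 0x70..0x7F) whose 8-bit displacement is 0xFE, i.e. -2, so the
// branch targets its own first byte.
bool isCondLoopSpin(const std::uint8_t *pc) {
  return pc[0] >= 0x70 && pc[0] <= 0x7F && pc[1] == 0xFE;
}
```

An unconditional short jump to itself (`0xEB 0xFE`) is deliberately not matched, since the text above only guarantees the conditional form.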
Part of this RFC:
https://discourse.llvm.org/t/rfc-optimizing-conditional-traps/89456
Reviewers: arsenm, RKSimon, fmayer, vitalybuka
Pull Request: https://github.com/llvm/llvm-project/pull/177686
Profile data for instructions (e.g., branch weights) is automatically
preserved via `splice()` which moves the basic blocks along with their
instruction metadata. However, entry count is stored as function
metadata, which was dropped when creating merged function and thunks.
The fix is to explicitly set entry count for both merged function (.Tgm)
and thunks via `setEntryCount()`.
This patch fixes a regression introduced by PR #175022, where
a freeze was introduced with the following transformation:
ext(freeze(load(x))) -> freeze(extload(x))
If a new extend is introduced afterwards we then have
ext(freeze(extload(x)))
which doesn't get picked up by existing DAG combines due to
the freeze getting in the way.
Previously, the DAG combiner did not optimize exact signed division by a
power-of-two constant divisor for integer types exceeding the size of
division supported by the target architecture (e.g., i128 on x86-64).
However, such an optimization was expected by the division expansion
logic, leading to unsupported division operations making it to
instruction selection.
This commit addresses this issue by making an exception to the existing
exclusion of signed division with the exact flag for the aforementioned
operations. That is, the DAG combiner will now optimize exact signed
division if the divisor is a power-of-two constant and the integer type
exceeds the size of division supported by the target architecture.
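The underlying identity is that an exact signed division by a power of two equals an arithmetic right shift, because no rounding correction is needed when the remainder is known to be zero. A minimal sketch, using `int64_t` as a stand-in for the wide type (`exactSDivPow2` is a hypothetical helper, not the combiner's code):

```cpp
#include <cstdint>

// Exact signed division by 2^k. The caller promises that x is a multiple of
// 2^k (the "exact" flag), so the shift produces the same quotient as the
// division. Right shift of a negative value is arithmetic on all mainstream
// compilers and is guaranteed to be so from C++20 onward.
std::int64_t exactSDivPow2(std::int64_t x, unsigned k) {
  return x >> k;
}
```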
---------
Signed-off-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
Handle this case by extending the integer to a wider type. This can
probably be handled more optimally, but this is conservatively correct.
Proof: https://alive2.llvm.org/ce/z/0RwDO1
GISel CallLowering currently does a Type -> EVT -> Type roundtrip early
on when populating ArgInfo in splitToValueType(). This is a bit odd as
this structure operates at the IR Type level. Keep the original type
there and only convert to EVT when performing assignments.
I don't think anything here requires the integer bit width to be
strictly larger. It's fine if it's the same (in which case some zexts
just go away).
Add tests on half + i32 that can be verified by alive2. Note that half
is handled via float, so the minimum supported type is i32 rather than
i16.
Proof (uitofp): https://alive2.llvm.org/ce/z/CsMfkU
Proof (sitofp): https://alive2.llvm.org/ce/z/jzuxyt
Convert "denormal-fp-math" and "denormal-fp-math-f32" into a first
class denormal_fpenv attribute. Previously the query for the effective
denormal mode involved two string attribute queries with parsing. I'm
introducing more uses of this, so it makes sense to convert this
to a more efficient encoding. The old representation was also awkward
since it was split across two separate attributes. The new encoding
just stores the default and float modes as bitfields, largely avoiding
the need to consider if the other mode is set.
The syntax in the common cases looks like this:
`denormal_fpenv(preservesign,preservesign)`
`denormal_fpenv(float: preservesign,preservesign)`
`denormal_fpenv(dynamic,dynamic float: preservesign,preservesign)`
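To illustrate the idea of storing both modes as bitfields, here is a sketch of one possible packing. The bit layout, the type and function names, and the (output, input) ordering of each pair are assumptions for illustration, not the encoding used by the patch.

```cpp
#include <cstdint>

// Denormal handling kinds. preservesign and dynamic appear in the syntax
// above; IEEE and PositiveZero are LLVM's remaining modes.
enum class Mode : std::uint8_t { IEEE, PreserveSign, PositiveZero, Dynamic };

// Hypothetical layout: two bits per component, four components in one byte.
//   [1:0] default output, [3:2] default input,
//   [5:4] float output,   [7:6] float input
struct DenormalEnv {
  std::uint8_t bits;
};

constexpr DenormalEnv pack(Mode defOut, Mode defIn, Mode f32Out, Mode f32In) {
  return DenormalEnv{static_cast<std::uint8_t>(
      static_cast<unsigned>(defOut) | static_cast<unsigned>(defIn) << 2 |
      static_cast<unsigned>(f32Out) << 4 | static_cast<unsigned>(f32In) << 6)};
}

constexpr Mode field(DenormalEnv e, unsigned shift) {
  return static_cast<Mode>((e.bits >> shift) & 3u);
}

constexpr Mode defaultOutput(DenormalEnv e) { return field(e, 0); }
constexpr Mode defaultInput(DenormalEnv e) { return field(e, 2); }
constexpr Mode floatOutput(DenormalEnv e) { return field(e, 4); }
constexpr Mode floatInput(DenormalEnv e) { return field(e, 6); }
```

With such a packing, a query for the effective mode of either kind is a shift and a mask, with no string parsing.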
I wasn't sure about reusing the float type name instead of adding a
new keyword. It's parsed as a type but only accepts float. I'm also
debating switching the name to subnormal to match the current
preferred IEEE terminology (also used by nofpclass and other
contexts).
This has a behavior change when using the command flag debug
options to set the denormal mode. The behavior of the flag
ignored functions with an explicit attribute set, per
the default and f32 version. Now that these are one attribute,
the flag logic can't distinguish which of the two components
were explicitly set on the function. Only one test appeared to
rely on this behavior, so I just avoided using the flags in it.
This also does not perform all the code cleanups this enables.
In particular the attributor handling could be cleaned up.
I also guessed at how to support this in MLIR. I followed
MemoryEffects as a reference; it appears bitfields are expanded
into arguments to attributes, so the representation there is
a bit uglier with the 2 2-element fields flattened into 4 arguments.
When computing the viable cycles for scheduling an instruction,
`computeStart` used to include special-case logic to handle loop-carried
dependencies. This special handling was necessary because loop-carried
dependencies were represented by reversed forward-direction edges in the
DAG. Now that we have the DDG, which explicitly models loop-carried
dependencies, this special handling is no longer required. As a first
step towards completely removing `isLoopCarriedDep`, this patch
eliminates the special-case logic from `computeStart` and some related
functions.
Split off from https://github.com/llvm/llvm-project/pull/135148
As with loads and stores, instructions that may trigger floating‑point
exceptions must not be reordered across a barrier instruction. This
patch adds the missing loop‑carried dependencies between such
instructions and the barrier, preventing reordering that could
previously occur. As in #174391, the implementation is based on that
of `ScheduleDAGInstrs::buildSchedGraph`.
Split off from #135148
Loads and stores must not be reordered across barrier instructions.
However, in MachinePipeliner this could potentially happen, since
loop-carried dependencies from loads/stores to a barrier instruction
were not considered. The same problem exists for barrier-to-barrier
dependencies. This patch adds the handling for those cases. The
implementation is based on that of `ScheduleDAGInstrs::buildSchedGraph`.
Split off from https://github.com/llvm/llvm-project/pull/135148
This PR adds full support for atomicrmw in NVPTX. This includes:
- Memory order and syncscope support (changes in AtomicExpandPass.cpp,
NVPTXIntrinsics.td)
- Script-generated tests for integer and atomic operations
(atomicrmw.py, atomicrmw-sm*.ll in tests/CodeGen/NVPTX). Existing
atomics tests which are subsumed by these have been removed
(atomics-sm*.ll, atomics.ll, atomicrmw-expand.ll).
- ~~Changes shouldExpandAtomicRMWInIR to take a constant argument: This
is to allow some other TargetLowering constant-argument functions to
call it. This change touches several backends. An alternative solution
exists, but to me, this seems the "right" way.~~ Has been split out into
https://github.com/llvm/llvm-project/pull/176073. Rebased.
- NOTE: The initial load issued for atomicrmw emulation loops (and
cmpxchg emulation loops) must be a strong load. Currently,
AtomicExpandPass issues a weak load. Fixing this breaks several
backends. I'm planning to follow up with a separate PR.
Initially failed due to error: ptxas fatal : Value 'sm_60' is not
defined for option 'gpu-name'. Updated RUN lines in atomicrmw-sm*.py to
skip the ptxas-verify check if ptxas does not support that SM version.
The only caller of this function (`PeepholeOptimizer::optimizeSelect`)
did not use most of the parameters, was broadly equivalent to
`MI->isSelect()`, and the `optimizeSelect` hook can return `nullptr`
anyway.
Update `optimizeSelect` to return `nullptr` by default rather than
asserting when not implemented.
Targets without a `modf` libcall lower the intrinsic directly, matching
the existing `llvm.frexp` expansion. Targets with an existing libcall
are unchanged.
Fixes #173021
This patch fixes https://github.com/llvm/llvm-project/issues/150737.
The originally computed CSRCost is too small, so the optimization of
spilling instead of using a CSR is rarely triggered.
Also, the original cost model is too difficult to understand and too
hard for backend developers and users to tune.
So this patch changes the CSRCost to be
CSRCost = TRI->getCSRFirstUseCost() * EntryFreq * Scale
TRI->getCSRFirstUseCost() is the raw cost of saving/restoring a CSR.
Usually we don't need to tune this number.
EntryFreq is the BlockFrequency of the entry block.
Scale is used to scale down the CSRCost, because we usually prefer a CSR
register instead of spilling if we have similar CSRCost and spill cost,
so it should be less than 100%. We usually tune this number.
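As a worked instance of the formula, the sketch below computes the cost with the scale expressed as an integer percentage. The helper names, the integer units and the percentage representation are assumptions for illustration; they are not the types used in RAGreedy.

```cpp
#include <cstdint>

// CSRCost = FirstUseCost * EntryFreq * Scale, with Scale given in percent.
std::uint64_t csrCost(std::uint64_t firstUseCost, std::uint64_t entryFreq,
                      unsigned scalePercent) {
  return firstUseCost * entryFreq * scalePercent / 100;
}

// Spilling is preferred only when it is strictly cheaper than the scaled
// cost of the first use of a CSR.
bool preferSpill(std::uint64_t spillCost, std::uint64_t scaledCsrCost) {
  return spillCost < scaledCsrCost;
}
```

For example, with a first-use cost of 5, an entry frequency of 1000 and a scale of 80%, the CSR cost is 4000, so a spill costing 4500 would lose to the CSR even though the unscaled CSR cost (5000) is higher.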
Another problem is the original function RAGreedy::calcSpillCost()
actually computes a cost for block split, so this patch also implements
a correct RAGreedy::calcSpillCost() function.
This new behavior is not enabled by default. This optimization is used
by 3 targets (AArch64, AMDGPU, RISCV); I will change them one by one
in follow-up patches.
This patch changes the memset lowering to match the optimized memcpy lowering.
The memset lowering now queries TTI.getMemcpyLoopLoweringType for a preferred
memory access type. If that type is larger than a byte, the memset is lowered
into two loops: a main loop that stores a sufficiently wide vector splat of the
SetValue with the preferred memory access type and a residual loop that covers
the remaining bytes individually. If the memset size is statically known, the
residual loop is replaced by a sequence of stores.
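The shape of the two-loop lowering for a dynamically sized memset can be sketched in plain C++. The 16-byte width stands in for whatever type TTI.getMemcpyLoopLoweringType returns, and `memsetWide` is a hypothetical name; the actual lowering emits IR, not library calls.

```cpp
#include <cstddef>
#include <cstring>

void memsetWide(unsigned char *dst, unsigned char value, std::size_t n) {
  constexpr std::size_t Wide = 16; // assumed preferred access width in bytes

  // Build the splat of the set value once, outside the loops.
  unsigned char splat[Wide];
  for (std::size_t j = 0; j < Wide; ++j)
    splat[j] = value;

  // Main loop: one wide store per iteration.
  std::size_t i = 0;
  for (; i + Wide <= n; i += Wide)
    std::memcpy(dst + i, splat, Wide);

  // Residual loop: remaining bytes are stored individually.
  for (; i < n; ++i)
    dst[i] = value;
}
```

When `n` is a compile-time constant, the residual loop would instead become a fixed sequence of stores, as described above.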
This improves memset performance on gfx1030 (AMDGPU) in microbenchmarks by
around 7-20x.
I'm planning similar treatment for memset.pattern as a follow-up PR.
For SWDEV-543208.
`constrainSelectedInstRegOperands` always returns `true`, so it can be
safely changed to return `void` instead.
A follow-up patch should update `MachineInstrBuilder::constrainAllUses`.
This is part of the work to remove trivial VP intrinsics.
When widening an MLOAD we may use a VP_LOAD if it's supported. We use a
VP_SELECT to merge in the passthru, but we don't check if it's supported
by the target. This changes it to just emit a regular VSELECT instead to
prevent crashing in that case, and a VP_MERGE to keep the lanes past EVL
poison.
This is an attempt to merge https://reviews.llvm.org/D144006 with an LTO
fix.
The last merge attempt was
https://github.com/llvm/llvm-project/pull/75385.
The issue with it was investigated in
https://github.com/llvm/llvm-project/pull/75385#issuecomment-2386684121.
The problem happens when
1. Several modules are being linked.
2. There are several DISubprograms that initially belong to different
modules but represent the same source code function (for example, a
function included from the same source code file).
3. Some of such DISubprograms survive IR linking. It may happen if one
of them is inlined somewhere or if the functions that have these
DISubprograms attached have internal linkage.
4. Each of these DISubprograms has a local type that corresponds to the
same source code type. These types are initially from different modules,
but have the same ODR identifier.
If the same (in the sense of ODR identifier/ODR uniquing rules) local
type is present in two modules, and these modules are linked together,
the type gets uniqued. The DIType that happens to be loaded first
survives linking, and references to other types with the same ODR
identifier from modules loaded later are replaced with references to
the DIType loaded first. Since definition subprograms, in whose scope
these types are located, are not deduplicated, the linker output may
contain multiple DISubprograms having the same (uniqued) type in their
retainedNodes lists.
Further compilation of such modules causes crashes.
To tackle that,
* the previous solution for handling LTO linking with local types in
retainedNodes (the cloneLocalTypes() function) is removed,
* for each loaded distinct (definition) DISubprogram, its retainedNodes
list is scanned after loading, and DITypes with a scope of another
subprogram are removed. If something from a Function corresponding to
the DISubprogram references a uniqued type, we rely on cross-CU links.
Additionally:
* a check is added to the Verifier to report local types located in the
wrong retainedNodes list.
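The retainedNodes scan described in the list above can be sketched on simplified stand-in types. `Subprogram`, `LocalType` and `dropForeignLocalTypes` are hypothetical; real scopes may also be lexical blocks nested inside a subprogram, which this sketch ignores by treating the scope as the subprogram itself.

```cpp
#include <algorithm>
#include <vector>

struct Subprogram;

// Stand-in for a function-local DIType: only its owning scope is modelled.
struct LocalType {
  const Subprogram *scope;
};

// Stand-in for a distinct (definition) DISubprogram.
struct Subprogram {
  std::vector<const LocalType *> retainedNodes;
};

// After loading, drop every retained type whose scope is another subprogram.
void dropForeignLocalTypes(Subprogram &sp) {
  auto &nodes = sp.retainedNodes;
  nodes.erase(std::remove_if(nodes.begin(), nodes.end(),
                             [&sp](const LocalType *t) {
                               return t->scope != &sp;
                             }),
              nodes.end());
}
```

A verifier-style check over the same model would simply assert that no retained type has a foreign scope.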
Original commit message follows.
---------
RFC https://discourse.llvm.org/t/rfc-dwarfdebug-fix-and-improve-handling-imported-entities-types-and-static-local-in-subprogram-and-lexical-block-scopes/68544
Similar to imported declarations, the patch tracks function-local types in
DISubprogram's 'retainedNodes' field. DwarfDebug is adjusted to match
this metadata change and gains support for function-local types scoped
within a lexical block.
The patch assumes that DICompileUnit's 'enums' field no longer tracks local
types, and DwarfDebug would assert if any locally-scoped types get placed there.
Authored-by: Kristina Bessonova <kbessonova@accesssoftek.com>
Co-authored-by: Jeremy Morse <jeremy.morse@sony.com>