- Add `LiveIntervalsAnalysis`.
- Add `LiveIntervalsPrinterPass`.
- Use `LiveIntervalsWrapperPass` in legacy pass manager.
- Use `std::unique_ptr` instead of raw pointer for `LICalc`, so
destructor and default move constructor can handle it correctly.
This would be the last analysis required by `PHIElimination`.
Memcpy intrinsics with statically unknown loop sizes are lowered with
two load/store loops: one with access widths specified by the target,
and a residual loop that copies remaining bytes individually.
As the residual loop operates byte-wise, its accesses are only
1-aligned. However, we currently use the alignment that is optimal for
the first loop in both, which is unsound. With this patch, we use the
correct alignment in the residual loop.
The lowering of memcpy with a static size already handles alignments for
the residual correctly.
Previously we had the same instructions being generated for `ISD::CTLZ` and `ISD::CTLZ_ZERO_UNDEF` which did not take advantage of the fact that zero is an invalid input for `ISD::CTLZ_ZERO_UNDEF`. This commit separates codegen for the two cases to allow for the optimization for the latter case.
The details of the optimization are outlined in #82075Fixes#82075
Co-authored-by: Manish Kausik H <hmamishkausik@gmail.com>
This patch fixes a miscompilation when `N` gets CSEed to `Existing`:
```
Existing: t5: i32 = sub nuw Constant:i32<0>, t3
N: t30: i32 = sub Constant:i32<0>, t3
```
Fixes https://github.com/llvm/llvm-project/issues/96366.
So far, we ignored if a memmove intrinsic is volatile when lowering it
to loops in the IR. This change generates volatile loads and stores in
this case (similar to how memcpy is handled) and adds tests for volatile
memmoves and memcpys.
An atomic fadd instruction like this should return %x:
; value at %ptr is %x
%r = atomicrmw fadd ptr %ptr, float %y
After atomic optimization, if %y is uniform, the result is calculated
as %r = %x + * %y * +0.0. This has a couple of problems:
1. If %y is Inf or NaN, this will return NaN instead of %x.
2. If %x is -0.0 and %y is positive, this will return +0.0 instead of
-0.0.
Avoid these problems by disabling the "%y is uniform" path if there are
any uses of the result.
PHIElim deduplicates identical PHI nodes to reduce the number of copies
inserted. There are two cases:
1. Identical PHI nodes are in different blocks. That's the reason for
this optimization; this can't be avoided at SSA-level. A necessary
prerequisite for this is that the predecessors of all basic blocks
(where such a PHI node could occur) are the same. This implies that
all (>= 2) predecessors must have multiple successors, i.e. all edges
into the block are critical edges.
2. Identical PHI nodes are in the same block. CSE can remove these.
There are a few cases, however, where they still occur regardless:
- expand-large-div-rem creates PHI nodes with large integers, which
get lowered into one PHI per MVT. Later, some identical values
(zeroes) get folded, resulting in identical PHI nodes.
- peephole-opt occasionally inserts PHIs for the same value.
- Some pseudo instruction emitters create redundant PHI nodes (e.g.,
AVR's insertShift), merging the same values more than once.
In any case, this happens rarely and MachineCSE handles most cases
anyway, so that PHIElim only gets to see very few of such cases (see
changed test files).
Currently, all PHI nodes are inserted into a DenseMap that checks
equality not by pointer but by operands. This hash map is pretty
expensive (hashing itself and the hash map), but only really useful in
the first case.
Avoid this expensive hashing most of the time by restricting it to basic
blocks with only critical input edges. This improves performance for
code with many PHI nodes, especially at -O0. (Note that Clang often
doesn't generate PHI nodes and -O0 includes no mem2reg. Other
compilers always generate PHI nodes.)
It is documented that immarg is only valid on intrinsic declarations,
although the verifier also tolerates it on intrinsic calls.
This patch updates tests that are not specifically testing the
behavior of the IR parser or verifier.
In the SelectionDAG lowering of the memcpy intrinsic, this optimization
introduces additional chains between fixed-size groups of loads and the
corresponding stores. While initially introduced to ensure that wider
load/store-pair instructions are generated on AArch64, this optimization
also improves code generation for AMDGPU: Ganged loads are scheduled
into a clause; stores only await completion of their corresponding load.
The chosen value of 16 performed good in microbenchmarks, values of 8,
32, or 64 would perform similarly.
The testcase updates are autogenerated by
utils/update_llc_test_checks.py.
See also:
- PR introducing this optimization: https://reviews.llvm.org/D46477
Part of SWDEV-455845.
This was assuming the source address space was at least as large
as the destination of the cast. I'm not sure why this was casting
to begin with; the assumption seems to be the source
address space from the root addrspacecast matches the underlying
object so directly check that.
Fixes#97457
atomicrmw fmax/fmin perform the same operation as llvm.maxnum/minnum
which return the other operand if one operand is nan. This means that,
in the presence of nan arguments, +/- inf is not an identity for these
operations but nan is (at least if you don't care about nan payloads).
`SelectionDAG::getBitcastedAnyExtOrTrunc` assumes that there is always a
valid
integer type corresponding to another type, which is not always true
when it
comes to vector type. For example, `<3 x i8>` doesn't have a
corresponding
integer type.
Fix SWDEV-464698.
We're missing SimplifyDemandedBits styles of optimizations,
so one case differs from the DAG from not trimming the constant.
The other case is an optimization we get that the DAG doesn't do to
split the 64-bit shift.
https://reviews.llvm.org/D138082
Math library code has quite a few places with complex bit
logic that are ultimately fed into a copysign. This helps
avoid some regressions in a future patch.
This assumes the position in the float type, which should
at least be valid for IEEE types. Not sure if we need to guard
against ppc_fp128 or anything else weird.
There appears to be some value in simplifying the value operand
as well, but I'll address that separately.
For the v3f16.v3i32 case, the v3f16 would request widening
to v4f16, but the v3i32 does not require widening to be a legal
type, so GetWidenedVector would fail. We need to widen the exponent
vector to the same element count as the result.
Fixes: SWDEV-470951
Direct mapped dynamic LDS is not lowered in the LowerLDSModule pass.
Hence it is not marked with an absolute symbol. When the LowerLDS pass is
rerun in LTO, compilation fails with an assert "cannot mix abs and non-abs LDVs".
This patch adds an additional check for direct mapped dynLDS to skip the assert.
Fixes SWDEV-454281
We masked out the sign bit from one value, and the non-sign bits
from the other so there should be no common bits set.
No idea how to test this on the DAG path, other than scraping
the debug logs. A few targets hit this path with f16 values, but
the resulting i16 ors get anyext promoted and lose the disjoint
flag. In the fp128 case, PPC gets further and the or loses the flag
somewhere else later. Adding a haveNoCommonBits assert shows this
works though.
These are incremental changes over #89217 , with core logic being the
same. This patch along with #89217 and #91190 should get us ready to enable 64
bit optimizations in atomic optimizer.
This patch is intended to be the first of a series with end goal to
adapt atomic optimizer pass to support i64 and f64 operations (along
with removing all unnecessary bitcasts). This legalizes 64 bit readlane,
writelane and readfirstlane ops pre-ISel
---------
Co-authored-by: vikramRH <vikhegde@amd.com>
For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.
New PR of https://github.com/llvm/llvm-project/pull/79553.