This patch fixes a regression introduced by PR #175022, where
a freeze was introduced with the following transformation:
ext(freeze(load(x))) -> freeze(extload(x))
If a new extend is introduced afterwards we then have
ext(freeze(extload(x)))
which doesn't get picked up by existing DAG combines due to
the freeze getting in the way.
Previously, the DAG combiner did not optimize exact signed division by a
power-of-two constant divisor for integer types exceeding the size of
division supported by the target architecture (e.g., i128 on x86-64).
However, such an optimization was expected by the division expansion
logic, leading to unsupported division operations making it to
instruction selection.
This commit addresses this issue by making an exception to the existing
exclusion of signed division with the exact flag for the aforementioned
operations. That is, the DAG combiner will now optimize exact signed
division if the divisor is a power-of-two constant and the integer type
exceeds the size of division supported by the target architecture.
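A rough illustration of why this fold is safe, written as a Python model rather than the DAG combiner code (the helper name is hypothetical): when the division is exact, signed division by 2^k is a plain arithmetic shift right by k, with no rounding fixup needed.

```python
def exact_sdiv_pow2(x: int, k: int) -> int:
    """Model of `sdiv exact x, (1 << k)` rewritten as an arithmetic shift.

    Assumes x is a multiple of 2**k, which is what the exact flag promises.
    """
    assert x % (1 << k) == 0, "exact flag violated: remainder is non-zero"
    # Python's >> on ints is an arithmetic (sign-preserving) shift.
    return x >> k


# Works for values wider than 64 bits, as with i128 on x86-64.
big = -(1 << 100)
print(exact_sdiv_pow2(big, 4) == -(1 << 96))
```

Without the exact flag the shift alone would round towards negative infinity instead of towards zero, which is why the fold is tied to that flag.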
---------
Signed-off-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
There are target intrinsics that logically require two MMOs, such as
llvm.amdgcn.global.load.lds, which is a copy from global memory to LDS,
so there's both a load and a store to different addresses.
Add an overload of getTgtMemIntrinsic that produces intrinsic info in a
vector, and implement it in terms of the existing (now protected)
overload.
GlobalISel and SelectionDAG paths are updated to support multiple MMOs.
The main part of this change is supporting multiple MMOs in
MemIntrinsicNodes.
Converting the backends to use the new overload is a fairly mechanical step
that is done in a separate change, in the hope that this reduces merge
pain during review and for downstreams. A later change will then enable
using multiple MMOs in AMDGPU.
TLI.isBinOp recognises some opcodes that have multiple results,
including UADDO etc.
In most cases we currently just bail if a binop has multiple results,
but shuffle combining was missing the check and it's pretty trivial to
add handling in this case.
I've added add/sub-overflow opcodes to verifyNode to help catch these
cases in the future - IIRC there was a plan to autogen these, but there
isn't anything at the moment.
Fixes #179112
This is a reland of #172523.
The original patch caused an assertion failure on RISC-V because it
attempted to create a bitcast from an illegal type (i32 on RV64) during
the post-type-legalization DAGCombine stage.
Added a `TLI.isTypeLegal(Val.getValueType())` check to ensure we only
proceed with the bitcast STLF optimization when the source value's type
is legal for the target.
This patch introduces support for Store-to-Load Forwarding (STLF) in
`DAGCombiner::ForwardStoreValueToDirectLoad` when the store and load
have **different types but equal memory size** (e.g., storing an `i32`
then loading a `float` from the same location).
### What this patch does:
**Enables Optimization:** It allows for the safe forwarding of the
stored value as a Bitcast when the value is:
* A **Constant** (`ConstantSDNode`, `ConstantFPSDNode`,
`ConstantPoolSDNode`).
* **Undef**.
* And the memory sizes (`LdMemSize` == `StMemSize`) match.
### Scope and Next Steps:
This patch **only implements forwarding for constant and undef values
that have the same memory size** so far.
**I am submitting this initial patch to get early review feedback on the
core logic and fix the immediate crashes before tackling the more
complex scenarios.**
For the simple case:
```llvm
; Case Handled by this PR so far (e.g., zeroinitializer is a constant)
define float @test_stlf_integer(ptr %p, float %v) {
store i32 0, ptr %p, align 4
%f = load float, ptr %p, align 4
; ...
}
```
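To make the "forward the stored value as a bitcast" idea concrete, here is a small Python model (not LLVM code; both helper names are hypothetical). It shows that storing an `i32` and loading a `float` from the same 4 bytes gives the same result as reinterpreting the integer's bits directly, which is what the forwarded bitcast relies on when the memory sizes match.

```python
import struct


def bitcast_i32_to_f32(value: int) -> float:
    """Reinterpret the 32 bits of an integer as an IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<I", value & 0xFFFFFFFF))[0]


def store_then_load_as_float(value: int) -> float:
    """Store an i32 to a 4-byte buffer, then load a float from it."""
    memory = bytearray(4)
    struct.pack_into("<I", memory, 0, value & 0xFFFFFFFF)  # store i32
    return struct.unpack_from("<f", memory, 0)[0]          # load float


# The constant case from the example above: store i32 0, load float.
print(store_then_load_as_float(0) == bitcast_i32_to_f32(0))
```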
Fixes: #151683
We also get some i32->i64 promotion for CLMULH. The DAGCombiner
change is to prevent an infinite loop from that.
Test file was rewritten to cover all types and split between clmul
and clmulh.
I added a couple masked tests to show that VectorPeephole works.
The test outputs were already large so I didn't want to add more than a couple.
Following on from https://github.com/llvm/llvm-project/pull/172484 I
have added support to tryToFoldExtOfLoad for looking through freezes, in
order to catch more cases of extending loads. This type of code is
sometimes seen being generated by the loop vectoriser. For now I've
limited this to cases where the load is only used by the freeze, since
otherwise it leads to worse code in some X86 tests.
BinaryOpc_match is already wired up for this, but this patch allows us to
use m_BinOp/m_c_BinOp with the required flags directly.
Updated the foldShiftToAvg folds to make use of this.
Alive2 proof: https://alive2.llvm.org/ce/z/mcatXZ
I've raised #174718 as supposedly PPC has AVGCEIL instructions, but the
patterns in PPCInstrAltivec.td are either incorrect or the instructions
don't account for overflow.
Fixes #128377
This fixes a regression in #174693 caused by using ISD::UMIN to clamp
offset into a vector address.
For (umin x, y) if we know the minimum value of x is >= the maximum
value of y, then y will always be the smaller operand and we can fold to
y.
We can do similar folds for umax, smin and smax too.
In practice the only time we get a useful ConstantRange is with VScale
and a constant RHS, so this patch limits it to this case. I tried
generalizing it with computeKnownBits but it didn't have any effect on
existing tests.
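The range reasoning can be sketched in Python as follows. This is only a model of the idea, not the ConstantRange API: the `Range` tuple and `fold_umin` are hypothetical names, and ranges are plain inclusive unsigned bounds.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # inclusive unsigned (lo, hi) bounds


def fold_umin(x_range: Range, y_range: Range) -> Optional[str]:
    """Return which operand umin(x, y) always equals, or None if unknown."""
    if x_range[0] >= y_range[1]:
        return "y"  # min(x) >= max(y), so y is never the larger operand
    if y_range[0] >= x_range[1]:
        return "x"  # symmetric case
    return None


# Example: x = vscale * 4 with vscale assumed to be in [2, 16], y = constant 8.
print(fold_umin((8, 64), (8, 8)))
```

The umax, smin and smax folds follow the same shape with the comparison direction or signedness of the bounds changed.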
Extend the existing DAGCombine logic in visitIMINMAX so that signed and
unsigned MIN/MAX can be flipped not only when both operands are known
non-negative but also when both operands are known negative. This
replaces the old SignBitIsZero checks with computeKnownBits and explicit
tests for non-negative or negative operands while keeping all existing
legality and saturation gating in place. Add regression tests to cover
both the known-negative case and the known-non-negative case.
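The reason the flip is valid in both cases is that signed and unsigned ordering agree whenever the two operands have the same sign bit. A small exhaustive check over 8-bit values illustrates this (a Python model with hypothetical helper names, not the visitIMINMAX code):

```python
BITS = 8


def to_signed(v: int) -> int:
    """Interpret a BITS-wide unsigned value as two's complement."""
    return v - (1 << BITS) if v >> (BITS - 1) else v


def umin(a: int, b: int) -> int:
    return min(a, b)


def umax(a: int, b: int) -> int:
    return max(a, b)


def smin(a: int, b: int) -> int:
    return a if to_signed(a) <= to_signed(b) else b


def smax(a: int, b: int) -> int:
    return a if to_signed(a) >= to_signed(b) else b


def same_sign(a: int, b: int) -> bool:
    return (a >> (BITS - 1)) == (b >> (BITS - 1))


def signed_and_unsigned_agree_when_same_sign() -> bool:
    """Check every pair of 8-bit values that share a sign bit."""
    for a in range(1 << BITS):
        for b in range(1 << BITS):
            if same_sign(a, b):
                if smin(a, b) != umin(a, b) or smax(a, b) != umax(a, b):
                    return False
    return True


print(signed_and_unsigned_agree_when_same_sign())
```

With mixed signs the two orderings disagree, for example `smin(0x80, 0x01)` picks `0x80` while `umin` picks `0x01`, which is why both operands need a known sign.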
Fixes #174325
Libcall lowering decisions should come from the LibcallLoweringInfo
analysis. Query this through the DAG, so eventually the source
can be the analysis. For the moment this is just a wrapper around
the TargetLowering information.
(select (setcc ...) (sub a, b) (sub b, a))
When b is const, the `sub a, b` becomes `add a, -b` which we take care of in this patch with the m_SpecificNeg() matcher.
Introduce the llvm.clmul family of intrinsics, corresponding to
carry-less multiply operations, in line with a std proposal. This work
builds upon 727ee7e ([APInt] Introduce carry-less multiply primitives),
and follow-up patches will introduce custom-lowering on supported
targets, replacing target-specific clmul intrinsics.
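For readers unfamiliar with the operation, a carry-less multiply is a shift-and-XOR product (polynomial multiplication over GF(2)). The sketch below is a Python model with hypothetical function names. Treating the "high" variant as the upper half of the double-width product is an assumption made here for illustration.

```python
def clmul_wide(a: int, b: int, bits: int) -> int:
    """Full 2*bits-wide carry-less product of two bits-wide values."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    result = 0
    for i in range(bits):
        if (b >> i) & 1:
            result ^= a << i  # XOR instead of add: no carries propagate
    return result


def clmul(a: int, b: int, bits: int = 32) -> int:
    """Low `bits` bits of the carry-less product."""
    return clmul_wide(a, b, bits) & ((1 << bits) - 1)


def clmulh(a: int, b: int, bits: int = 32) -> int:
    """Upper `bits` bits of the carry-less product (assumed definition)."""
    return clmul_wide(a, b, bits) >> bits


# 0b11 clmul 0b11 = 0b11 ^ 0b110 = 0b101, unlike the ordinary product 9.
print(bin(clmul(0b11, 0b11, 8)))
```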
Testing is done on the RISC-V target, which should be sufficient to
prove that the intrinsics work, since no RISC-V specific lowering has
been added.
Ref: https://isocpp.org/files/papers/P3642R3.html
Co-authored-by: Craig Topper <craig.topper@sifive.com>
The RISC-V P extension adds an instruction equivalent to
__builtin_clrsb. AArch64 has a similar instruction that we currently fail to
select when using the builtin.
This patch adds a combine based on the canonical version of the pattern
emitted by clang for the builtin, (add (ctlz (xor x, (sra x, bw-1))),
-1). I'm starting the combine at the ctlz because the outer add can
easily be combined into other nodes obscuring the full pattern. So we
generate (add (ctls x), 1) and hope the add will be combined away.
I've also added a combine for the pattern AArch64 recognizes
(ctlz_zero_undef (or (shl (xor x, (sra x, bw-1)), 1), 1)).
I've only enabled the combines when the target has a Legal or Custom
action for the operation, taking into account type promotion. We
can relax this in the future by adding a default expansion to
LegalizeDAG and adding more type legalization rules.
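Both patterns can be checked against a direct "count redundant sign bits" definition. The following Python model does this exhaustively for 8-bit values. The helper names are hypothetical, and `ctlz` here is the well-defined variant that returns the bit width for zero.

```python
BITS = 8
MASK = (1 << BITS) - 1


def ctlz(v: int) -> int:
    """Count leading zeros of a BITS-wide value (BITS when v == 0)."""
    return BITS - v.bit_length()


def sra(v: int, amt: int) -> int:
    """Arithmetic shift right on a BITS-wide two's complement value."""
    signed = v - (1 << BITS) if v >> (BITS - 1) else v
    return (signed >> amt) & MASK


def clrsb_reference(v: int) -> int:
    """Number of bits below the sign bit that are copies of the sign bit."""
    sign = v >> (BITS - 1)
    count = 0
    for i in range(BITS - 2, -1, -1):
        if ((v >> i) & 1) != sign:
            break
        count += 1
    return count


def clrsb_via_ctlz(v: int) -> int:
    # (add (ctlz (xor x, (sra x, bw-1))), -1)
    return ctlz(v ^ sra(v, BITS - 1)) - 1


def clrsb_via_shl_or(v: int) -> int:
    # (ctlz_zero_undef (or (shl (xor x, (sra x, bw-1)), 1), 1))
    # The OR with 1 guarantees a non-zero input to the count.
    return ctlz((((v ^ sra(v, BITS - 1)) << 1) | 1) & MASK)


print(all(clrsb_via_ctlz(v) == clrsb_reference(v) for v in range(1 << BITS)))
```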
`NoSignedZerosFPMath` isn't a hard requirement and in some contexts we
can still apply the truncation without worrying. For example, in cases
where the users of this sequence are overwriting the sign-bit (fabs) or
simply ignoring it (fcmp).
I think the same logic can be applied elsewhere for other DAG
optimizations.
The use of nested m_Reassociatable matchers by #169644 can result in
high compile times as the inner m_Reassociatable call is being repeated
a lot while the outer call is trying to match. Place the inner
m_ReassociatableAnd at the beginning of the pattern so it is not
repeatedly matched in recursion.
SelectionDAG uses the DAGCombiner to fold a load followed by a sext into a
single sign-extending load instruction. For example, on x86 we will see that
```
%1 = load i32, ptr @GlobArr
#dbg_value(i32 %1, !43, !DIExpression(), !52)
%2 = sext i32 %1 to i64, !dbg !53
```
is converted to:
```
%0:gr64_nosp = MOVSX64rm32 $rip, 1, $noreg, @GlobArr, $noreg, debug-instr-number 1, debug-location !51
DBG_VALUE $noreg, $noreg, !"Idx", !DIExpression(), debug-location !52
```
The `DBG_VALUE` needs to be transferred correctly to the new combined
instruction, and it needs to be appended with a `DIExpression` which
contains a `DW_OP_LLVM_fragment`, describing that the lower bits of the
virtual register contain the value.
This patch fixes the above described problem.
Some floating-point optimizations don't trigger because they can produce
incorrect results around signed zeros, and rely on the existence of the
nsz flag which commonly appears when fast-math is enabled.
However, this flag is not a hard requirement when all of the users of
the combined value are either guaranteed to overwrite the sign-bit or
simply ignore it (comparisons, etc.).
The optimizations affected:
- fadd x, +0.0 -> x
- fsub x, -0.0 -> x
- fsub +0.0, x -> fneg x
- fdiv(x, sqrt(x)) -> sqrt(x)
- frem lowering with power-of-2 divisors
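As a small illustration of the first fold in that list, the Python snippet below (ordinary IEEE-754 doubles, hypothetical helper name) shows that `fadd x, +0.0 -> x` changes the sign of the result for `x = -0.0`, but that the difference disappears once the user takes the absolute value or only compares the result.

```python
import math


def sign_bit(v: float) -> bool:
    """True when the IEEE-754 sign bit of v is set."""
    return math.copysign(1.0, v) < 0


x = -0.0
folded = x          # result after the fold: fadd x, +0.0 -> x
unfolded = x + 0.0  # what the addition actually computes: +0.0

# The fold is observable on its own: the sign bits differ.
print(sign_bit(folded), sign_bit(unfolded))

# A user that overwrites the sign bit (fabs) hides the difference.
print(sign_bit(math.fabs(folded)) == sign_bit(math.fabs(unfolded)))

# A user that ignores the sign of zero (fcmp) hides it as well.
print((folded == 0.0) == (unfolded == 0.0))
```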
The existing code for generating umulh/smulh was checking that the
getTypeToTransformTo was a LegalOrCustom operation. This only takes a
single legalization step though, so if v4i32 was legal, a v8i32 would be
transformed but a v16i32 would not.
This patch introduces a getLegalTypeToTransformTo that performs
getTypeToTransformTo until a legal type is reached. The umulh/smulh code
can then use it to check if the final resultant type will be legal.
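The difference between one legalization step and repeating until a legal type is reached can be modelled as below. This is a Python sketch of the idea only: the type representation, the `is_legal` and `transform_once` callbacks, and the toy target where only `v4i32` is legal and wider vectors are split in half are all assumptions for illustration, not the TargetLowering interface.

```python
from typing import Callable, Tuple

VecType = Tuple[int, int]  # (number of elements, element bits)


def get_legal_type_to_transform_to(
    ty: VecType,
    is_legal: Callable[[VecType], bool],
    transform_once: Callable[[VecType], VecType],
) -> VecType:
    """Apply single legalization steps until a legal type is reached."""
    while not is_legal(ty):
        next_ty = transform_once(ty)
        if next_ty == ty:
            raise ValueError("legalization made no progress")
        ty = next_ty
    return ty


# Toy target: only v4i32 is legal; wider vectors are split in half.
def toy_is_legal(ty: VecType) -> bool:
    return ty == (4, 32)


def toy_split(ty: VecType) -> VecType:
    return (ty[0] // 2, ty[1])


# One step from v16i32 only reaches v8i32, which is still illegal here.
print(toy_split((16, 32)))
print(get_legal_type_to_transform_to((16, 32), toy_is_legal, toy_split))
```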
Type legalization can promote constant operands. The MULHU optimization
`mulhu x, (1 << c) -> x >> (bitwidth - c)` was failing when constants
were promoted because:
1. `isConstantOrConstantVector` check rejected promoted constants
2. `BuildLogBase2` -> `takeInexpensiveLog2` -> `matchUnaryPredicate`
rejected promoted constants
This fixes both by adding `AllowTruncation=true`, following the pattern
from the recent UDIV fix (#169491).
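The arithmetic behind the fold itself can be checked with a short Python model (hypothetical helper names, restricted to `0 < c < bitwidth` so the shift amount stays in range):

```python
def mulhu(x: int, y: int, bits: int) -> int:
    """Upper `bits` bits of the unsigned product of two bits-wide values."""
    mask = (1 << bits) - 1
    return ((x & mask) * (y & mask)) >> bits


def mulhu_pow2_as_shift(x: int, c: int, bits: int) -> int:
    """mulhu x, (1 << c) rewritten as x >> (bitwidth - c)."""
    mask = (1 << bits) - 1
    return (x & mask) >> (bits - c)


def fold_holds_for_all_i8() -> bool:
    """Exhaustively compare both forms for 8-bit x and 0 < c < 8."""
    return all(
        mulhu(x, 1 << c, 8) == mulhu_pow2_as_shift(x, c, 8)
        for x in range(256)
        for c in range(1, 8)
    )


print(fold_holds_for_all_i8())
```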
If we are force reconstructing a carry from a raw MVT::i1 type, make
sure we don't miss any cases while peeling through trunc/ext chains -
check for i1 types at the start of the while loop
Fixes #169691
This reverts commit 6d5f87fc4284c4c22512778afaf7f2ba9326ba7b.
Previously this failed due to treating the unknown MachineMemOperand
value as known uniform.