For a given element, I believe A is only 0 when the divisor is INT_MIN.
The only way for NeedToApplyOffset to be false after processing all
elements, is for all divisors to be INT_MIN. If all divisors are
INT_MIN, then all divisors are a power of 2 and we wouldn't do the
transform.
The evaluation order of function arguments is unspecified by the C++
standard. We had two getNode calls as function arguments which causes
the nodes to be created in a different order depending on the compiler
used. This patch moves them to their own variables to ensure they are
called in the same order on all compilers.
Possible fix for #190148.
If the divisor is INT_MIN, we can still treat it like any other power of
2. We'll fold i32 (seteq (srem X, INT_MIN)) to
(setule (rotr (add (mul X, 1), INT_MIN), 31), 1). Alive2 says this is
correct https://alive2.llvm.org/ce/z/vjzqKk.
The multiply is a NOP, the add toggles the sign bits. The rotate puts
the lowest 31 bits of into the upper 31 bits. The sign bit is now in the
LSB. The compare checks if the upper 31 bits are 0.
srem X, INT_MIN has a remainder of 0 if X is 0 or INT_MIN which is
equivalent to checking if the uppper 31 bits are 0 after the rotate.
I don't think we need to add any constant for power of 2 but toggling
the sign bit like we do now doesn't hurt.
Remove LowerFCanonicalize. Added fallback for cases when the scalar type also has its Custom lowering to avoid regressions on AMDGPU and SystemZ.
Fixes#143862
This is a second attempt at "[SelectionDAG] Expand
CTTZ_ELTS[_ZERO_POISON] and handle splitting" (#188220)
That PR had to be reverted in 7d39664a6ae8daaf186b65578492244d96a50bf2
because we had crashes on AMDGPU since we didn't have scalarization
support, and other crashes on PowerPC because we didn't handle the case
when a vector needed widened. Tests for these are added in
AMDGPU/cttz-elts.ll, RISCV/rvv/cttz-elts-scalarize.ll and
PowerPC/cttz-elts.ll.
The former crash has been fixed by adding
DAGTypeLegalizer::ScalarizeVecOp_CTTZ_ELTS.
The second crash has been fixed by reworking
TargetLowering::expandCttzElts. The expansion for CTTZ_ELTS is nearly
identical to VECTOR_FIND_LAST_ACTIVE, except it uses a reverse step
vector and subtracts the result from VF. The easiest way to fix these
crashes without introducing regressions is to reuse the
VECTOR_FIND_LAST_ACTIVE expansion which already handles the case where
the vector needs widened.
This means that the node now needs to take in a boolean vector argument
and uses VSELECT instead of an AND to zero out inactive lanes, so the op
promotion code has also been shared.
This patch adds an initial skeleton for `SelectionDAG::computeKnownFPClass`.
The initial version includes:
- DemandedElts wrapper and max depth early-out
- `ConstantFPSDNode` and `BUILD_VECTOR` handling
- `TargetLowering::computeKnownFPClassForTargetNode` virtual hook for backend extensions
Initial test coverage for constant scalars, BUILD_VECTOR, and max depth
early-out is added in `AArch64SelectionDAGTest.cpp`.
closes#175571
After #187378 these are no longer tested. I'm concerned that we can
create illegal scalar types after type legalization. I don't know how to
test this now so I'd like to remove support until it is needed and can
be tested.
If we we are going to legalize to a vector with the same element type
and mulh or mul_lohi are supported, allow the optimization before type
legalization.
RISC-V will widen vectors using vp.udiv/sdiv that doesn't support
division by constant optimization. In addition, type legalization will create
a build_vector with undef elements making it hard to match after type
legalization.
Other targets may need to widen by a combination of vector and scalar
divisions to avoid traps if we widen a vector with garbage.
I had to enable the MULHU->SRL DAG combine before type legalization to
prevent regressions. After type legalization, the multiply constant
build_vector will have undef elements and the combine won't trigger.
K is an unsigned, it will be zero extended to uint64_t for
the APInt constructor. If the ShSVT has more than 32 bits, we won't
create an all ones ConstantSDNode.
To fix this, explicitly push an all ones constant to KAmts. This
also fixes an APInt ImplicitTrunc.
This allows turnVectorIntoSplatVector to work for this case.
The -1 has 'int' type. The APInt assignment operator takes uint64_t.
Fortunately, due to C rules, the -1 will be converted to an all ones
uint64_t. Unfortunately, if the APInt has more than 64 bits, the upper
words will be zeroed. I don't think we have any testing of that today.
Use setAllBits to avoid the subtle cast and fix the bits > 64 issue.
K still has its own issue that needs to be fixed.
This reduces the number of generated instructions in some cases.
While I was there, I changed an Add to a Disjoint OR since DAGCombine
was going to do it anyway.
Currently a cttz.elts of e.g. nxv32i1 will get expanded to a reduction
of nxv32i64 or equivalent, but we can split it into two legal nxv16i1
cttz.elts once we have dedicated SelectionDAG nodes.
This implements the splitting for them the same way we implement type
splitting for vp.cttz.elts, i.e. check if the low result is VF, and if
so add it to the result of the high result. It also implements operand
type promotion for NEON which needs to promote i1 vectors to something
larger first.
We also need to move expansion into LegalizeVectorOps so it doesn't get
expanded before type legalization can do splitting. This uses
LegalizeVectorOps in case the scalar reduction type, which depends on
the minimum bitwidth needed to store the result, still needs type
promotion.
The TTI costs should be updated after this to reflect the more efficient
codegen, but that is deferred to another PR.
This patch change the type of the step vector lowered from
`expandVectorFindLastActive` from `e8` to the index type of the target
machine.
This can help the index out of bound issue when the VLEN is large.
Note that after this patch, there are still some issue in
expandVectorFindLastActive.
We don't need an AND on the last iteration. If we shifted the dividend
due to trailing zeros in the divisor, we don't need a chunk that only
contains shifted in zeros.
Make the (1 << HBitWidth) % Divisor == 1 path a special case within
the recently added chunk summing algorithm. This allows us to
share the trailing zero shifting code.
While there make some comment improvements and avoid creating
unnecessary nodes.
This replaces LegalVT with HiLoVT and LegalWidth with HBitWidth as
they are the same for all current uses.
Then we rewrite the shifts to operate on LL and LH.
There's a slight regression on RISC-V due to different node creation
order leading to different DAG combine order. I have other refactoring
I'd like to explore then I may try to fix that.
Check the type before we call getOperationAction. Give BuildUDIVPattern
only AllowWiden and a WideSVT.
Update variable names and comments to avoid spreading "64" to too many
places.
This patch improves the lowering of 128-bit unsigned division and
remainder by constants (UDIV/UREM) by avoiding a fallback to libcall
(__udivti3/uremti3) for specific divisors.
When a divisor D satisfies the condition (1 << ChunkWidth) % D == 1, the
128-bit value is split into fixed-width chunks (e.g., 30-bit) and summed
before applying a smaller UDIV/UREM. This transformation is based on the
"remainder by summing digits" trick described in Hacker’s Delight.
This fixes#137514 for some constants.
This moves it above the type legality check. The legality check we use
for the main division by constant algorithm is probably not right for
BuildExactSDIV and BuildExactSDIV. These checks are largely about the
legality of MUL_LOHI/MULH which are not used for the exact case.
This patch removes the legal type check for the exact case. If we do
need a check it's probably better to have a specific version in
BuildExactSDIV and BuildExactSDIV.
I'm hoping to do some refactoring of the legality checks in
BuildSDIV/BuildUDIV so separating them makes this easier.
Alternative approach to the same goals as #162407
This takes `TargetLoweringBase::getLoadExtAction`, renames it to
`TargetLoweringBase::getLoadAction`, merges `getAtomicLoadExtAction`
into it, and adds more inputs for relavent information (alignment,
address space).
The `isLoadExtLegal[OrCustom]` helpers are also modified in a matching
manner.
This is fully backwards compatible, with the existing `setLoadExtAction`
working as before. But this allows targets to override a new hook to
allow the query to make more use of the information. The hook
`getCustomLoadAction` is called with all the parameters whenever the
table lookup yields `LegalizeAction::Custom`, and can return any other
action it wants.
Reuse the ExpandIntRes_CLMUL identity to expand vector
CLMUL/CLMULR/CLMULH on wider element types (vXi16, vXi32, vXi64) by
decomposing into half-element-width operations that eventually reach a
legal CLMUL type.
Three generic strategies in expandCLMUL:
1. Halve: halve element width (e.g. v8i16 -> v8i8 on AArch64)
2. promote to double : zext to wider type if CLMUL is legal there (e.g.
x86)
3. Count widen: pad with undef to double element count (e.g. v4i16 ->
v8i16)
A helper canNarrowCLMULToLegal() guides strategy selection and prevents
circular expansion in the CLMULH bitreverse path.
Also add Custom BITREVERSE lowering for v4i16/v8i16 on AArch64 using
REV16+RBIT, which the CLMULH expansion relies on.
Fixes#183768
This PR optimizes 32-bit unsigned division by constants when the magic
constant is 33 bits (IsAdd=true case in UnsignedDivisionByConstantInfo)
on 64-bit targets.
## Overview
Compiler optimization for constant division of `uint32_t` variables
(such as `x / 7`) is based on the method
proposed by Granlund and Montgomery in 1994 (hereafter referred to as
the GM method).
However, the GM method for the IsAdd=true case was optimized for 32-bit
CPUs, not 64-bit CPUs.
This patch provides optimizations specifically for 64-bit CPUs (such as
x86_64 and Apple M-series).
A simple benchmark demonstrates over 60% speedup on both Intel Xeon and
Apple M4 processors.
## The GM Method
The GM method for `x / 7` can be expressed in C code as follows,
where the constants `c` and `a` are magic numbers determined by the
divisor:
```cpp
uint32_t udiv_original(uint32_t x) {
uint64_t v = x * c;
v >>= 32;
uint32_t t = uint32_t(x) - uint32_t(v);
t >>= 1;
t += uint32_t(v);
t >>= a - 33;
return t;
}
```
For example, division by 7 on x86_64 generates 7 instructions:
```asm
movl %edi, %eax
imulq $613566757, %rax, %rax
shrq $32, %rax
subl %eax, %edi
shrl %edi
addl %edi, %eax
shrl $2, %eax
```
## Proposed Solution
This patch generates the following optimized code:
```cpp
uint32_t udiv_optimized(uint32_t x) {
uint128_t v = uint128_t(x) * ((c + 0x100000000) << (64 - a));
return uint32_t(v >> 64);
}
```
Since a 64-bit right shift of a 128-bit variable extracts the upper 64
bits,
this code eliminates the need for shifts after multiplication.
The implementation pre-shifts the 33-bit magic constant `c = 2^32 +
Magic` left by `(64-a)` bits
and uses the high 64 bits of a 64 x 64 -> 128 bit multiplication
directly.
This eliminates the add/sub/shift sequence.
After optimization, division by 7 becomes 4 instructions (or 3 with
BMI2):
```asm
# Standard (4 instructions)
movl %edi, %eax
movabsq $2635249153617166336, %rcx
mulq %rcx
movq %rdx, %rax
# With BMI2 (3 instructions)
movl %edi, %edx
movabsq $2635249153617166336, %rax
mulxq %rax, %rax, %rax
```
For v8i8 on AArch64, `expandCLMUL` picked the zext path (ExtVT=v8i16) since ZERO_EXTEND/SRL were legal, but CLMUL on v8i16 is not, resulting in a bit-by-bit expansion (~42 insns). Prefer the bitreverse path when CLMUL is legal on VT but not ExtVT.
v8i8 CLMULR: 42 → 4 instructions.
Fixes#182780
Add DemandedElts handling to allow better vector support
To prevent RISCV falling back to a mul call in known-never-zero.ll I've
had to tweak the (mul step_vector(C0), C1) to (step_vector(C0 * C1))
fold to only occur if C0 is already non-power-of-2, C0 * C1 is a
power-of-2 or the target has good mul support.
RISC-V doesn't have a carry flag which makes the UADDO expansion
expensive to emulate.
I've disabled the code by checking if UADDO is not supported for the
type that will be legalized too. Unfortunatley, we have custom lowering
of UADDO on RV64 so this doesn't disable this code there.
fptoui.sat can currently use a minnum/maxnum based expansion, which
relies on NaNs not being propagated. Specifically, it relies on
minnum(maxnum(NaN, 0), MAX) to return 0. However, if the input is sNaN,
then maxnum(sNaN, 0) is allowed to return qNaN, in which case the final
result will be MAX rather than 0.
This PR does the following changes:
* Support the fold for minimumnum/maximumnum, which guarantees that NaN
is not propagated even for sNaN, so it can use the old lowering. Test
this using Hexagon which has legal minimumnum but illegal minnum.
* For the minnum/maxnum case, remove the special unsigned case and
instead always insert the explicit NaN check. In that case the NaN
propagation semantics don't matter.
* This also means that we can support this expansion for
minimum/maximum.
Given the test case:
struct CBase {
virtual void foo();
};
void bar(CBase *Base) {
Base->foo();
}
and using '-emit-call-site-info' with llc, the following DWARF
is produced for the indirect call 'Base->foo()':
1$: DW_TAG_structure_type "CBase"
...
2$: DW_TAG_subprogram "foo"
...
3$: DW_TAG_subprogram "bar"
...
4$: DW_TAG_call_site
...
We add DW_AT_LLVM_virtual_call_origin to existing call-site
information, linking indirect calls to the function-declaration
they correspond to.
4$: DW_TAG_call_site
...
DW_AT_LLVM_virtual_call_origin (2$ "_ZN5CBase3fooEv")
The new attribute DW_AT_LLVM_virtual_call_origin helps to
address the ambiguity to any consumer due to the usage of
DW_AT_call_origin.
The functionality is available to all supported debuggers and
it is generated only for DWARF version 5 or greater.
When we have a BITCAST and the source type is a vector with smaller
elements compared to the destination type, then we need to demand all
the source elements that make up the demanded elts for the result when
doing recursive calls to SimplifyDemandedBits,
SimplifyDemandedVectorElts and SimplifyMultipleUseDemandedBits. Problem
is that those simplifications are allowed to turn non-demanded elements
of a vector into POISON, so unless we demand all source elements that
make up the result there is a risk that the result would be more
poisonous (even for demanded elts) after the simplification.
The patch fixes some bugs in SimplifyMultipleUseDemandedBits and
SimplifyDemandedBits for situations when we did not consider the problem
described above. Now we make sure that we also demand vector elements
that "must not be turned into poison" even if those elements correspond
to bits that does not need to be defined according to the DemandedBits
mask.
Fixes#138513
Scalar multiply is not part of the most basic RISC-V ISA. Use a
and+setcc+select for these targets.
The and+setcc+select is also beneficial for targets with bit test
instructions. RISC-V may not get the full benefit here due to
not having a cmove-like instruction without Zicond.
Co-authored-by: fbrv <Fabio.Baravalle@gmail.com>
…n in DWARF. (#167666)"
This reverts commit 418ba6e8ae2cde7924388142b8ab90c636d2c21f.
The commit caused an ICE due to hitting unreachable in
llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:1307
Fixes#182337
If a FSHR node's DemandedBits mask and maximum shift amount doesn't
demand any bits from the X upper register, then simplify to a SRL node.
FSHL is less useful but we could add it as a future patch if there's
interest
Based off a discussion on #182021
Converts ARM scalar CLS intrinsics to use the unified ISD::CTLS node
instead of custom manual expansion. This addresses the issue
[#174337](https://github.com/llvm/llvm-project/issues/174337).
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Given the test case:
struct CBase {
virtual void foo();
};
void bar(CBase *Base) {
Base->foo();
}
and using '-emit-call-site-info' with llc, the following DWARF
is produced for the indirect call 'Base->foo()':
1$: DW_TAG_structure_type "CBase"
...
2$: DW_TAG_subprogram "foo"
...
3$: DW_TAG_subprogram "bar"
...
4$: DW_TAG_call_site
...
We add DW_AT_LLVM_virtual_call_origin to existing call-site
information, linking indirect calls to the function-declaration
they correspond to.
4$: DW_TAG_call_site
...
DW_AT_LLVM_virtual_call_origin (2$ "_ZN5CBase3fooEv")
The new attribute DW_AT_LLVM_virtual_call_origin helps to
address the ambiguity to any consumer due to the usage of
DW_AT_call_origin.
The functionality is available to all supported debuggers.