Some cores implement an optimization for a strided load with an x0
stride, which results in fewer memory operations being performed than
implied by VL since all addresses are the same. This appears to be true
only for a minority of available implementations. We know that
sifive-x280 does, but sifive-p670 and spacemit-x60 both do not.
(To be more precise, measurements on the x60 appear to indicate that a
stride of x0 has similar latency to a non-zero stride, and that both
are about twice the latency of a vleN.v. I'm taking this to mean the x0
case is not optimized.)
We had an existing flag by which a processor could opt out of this
assumption but no upstream users. Instead of adding this flag to the
p670 and x60, this patch reverses the default and adds the opt-in flag
only to the x280.
This avoids some cases where LSR produces results that lead to very poor
codegen. There's a chance we'll see minor degradations for some inputs
in the case that our metrics say the found solution is worse, but in
reality it's better than the starting point.
Per the review thread, at least one vendor has been enabling this by
default for some time and found that overall it's an improvement. As
such, we'll enable it by default and aim to fix any as-yet-unknown
regressions in-tree.
- Add `LiveIntervalsAnalysis`.
- Add `LiveIntervalsPrinterPass`.
- Use `LiveIntervalsWrapperPass` in legacy pass manager.
- Use `std::unique_ptr` instead of raw pointer for `LICalc`, so
destructor and default move constructor can handle it correctly.
This would be the last analysis required by `PHIElimination`.
So far the branch protection, sign return address, and guarded control
stack attributes are only emitted as module flags to indicate that
functions need to be generated with those features.
The problem is that in an LTO build the module flags are merged with the
`min` rule, which means that if one of the modules is not built with
sign return address then the features will be turned off for all
functions, because the functions take the branch-protection and
sign-return-address features from the module flags. Sign-return-address
is a function-level option, so functions from files that are compiled
with -mbranch-protection=pac-ret are expected to be protected.
The inliner might inline functions with different sets of flags, as it
doesn't consider the module flags.
This patch adds the attributes to all functions and drops the checking
of the module flags for code generation.
The module flag is still used for generating the ELF markers.
It also drops the "true"/"false" values from the
branch-protection-enforcement, branch-protection-pauth-lr, and
guarded-control-stack attributes, as presence of the attribute means it
is on and absence means off; there is no other option.
Reland with test fixes.
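The LTO problem described above can be illustrated with a toy model. This is only a sketch in Python, not LLVM code: `merge_min`, `protected_via_module_flag`, `protected_via_attribute`, and the dictionary layout are hypothetical names made up for illustration, and the attribute is reduced to simple presence or absence.

```python
# Toy model of LTO module-flag merging; names here are illustrative only.

def merge_min(flags_a, flags_b):
    """Merge two module-flag dicts that share the same keys using `min`."""
    return {key: min(flags_a[key], flags_b[key]) for key in flags_a}

# a.c is built with -mbranch-protection=pac-ret, b.c is not.
module_a = {"sign-return-address": 1}
module_b = {"sign-return-address": 0}
merged = merge_min(module_a, module_b)

def protected_via_module_flag(merged_flags):
    """Old scheme: every function consults the merged module flag."""
    return merged_flags["sign-return-address"] == 1

# New scheme: each function carries its own attribute set.
function_attrs = {
    "from_a": {"sign-return-address"},
    "from_b": set(),
}

def protected_via_attribute(name):
    """New scheme: only the function's own attributes matter."""
    return "sign-return-address" in function_attrs[name]
```

In this model the merged flag is 0, so under the module-flag scheme even the function from the protected file loses its protection, while the per-function attribute keeps it.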
Memcpy intrinsics with statically unknown loop sizes are lowered with
two load/store loops: one with access widths specified by the target,
and a residual loop that copies remaining bytes individually.
As the residual loop operates byte-wise, its accesses are only
1-aligned. However, we currently use the alignment that is optimal for
the first loop in both, which is unsound. With this patch, we use the
correct alignment in the residual loop.
The lowering of memcpy with a static size already handles alignments for
the residual correctly.
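To see why the residual loop can only assume 1-byte alignment, here is a small Python model of the two-loop shape. It is a sketch under assumptions: `lower_memcpy` and `max_alignment` are hypothetical helpers, the base pointers are assumed to be aligned to the main loop width, and real lowering operates on IR rather than byte arrays.

```python
def lower_memcpy(src, dst, n, loop_width):
    """Copy n bytes and return the list of (offset, width) accesses made."""
    accesses = []
    main_end = n - (n % loop_width)
    # Main loop: wide accesses with a width chosen by the target.
    for off in range(0, main_end, loop_width):
        dst[off:off + loop_width] = src[off:off + loop_width]
        accesses.append((off, loop_width))
    # Residual loop: copies the remaining bytes one at a time.
    for off in range(main_end, n):
        dst[off] = src[off]
        accesses.append((off, 1))
    return accesses


def max_alignment(offset, base_align):
    """Largest power of two <= base_align that divides offset.

    Assumes the base pointer itself is base_align-aligned.
    """
    align = base_align
    while offset % align != 0:
        align //= 2
    return align
```

For a 19-byte copy with an 8-byte main loop, the residual accesses land at offsets 16, 17, and 18. Offset 17 is only 1-aligned, so 1 is the only alignment that holds for every residual access.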
Make SLS Hardening pass handle BLRA* instructions the same way it
handles BLR. The thunk names have the form
__llvm_slsblr_thunk_xN for BLR thunks
__llvm_slsblr_thunk_(aaz|abz)_xN for BLRAAZ and BLRABZ thunks
__llvm_slsblr_thunk_(aa|ab)_xN_xM for BLRAA and BLRAB thunks
Now there are about 1800 possible thunk names, so parse the thunk
function's name instead of relying on a linear lookup by name.
This patch reapplies llvm/llvm-project#97605.
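Parsing a thunk name of the forms listed above might look roughly like the following Python sketch. It is not the pass's implementation: `THUNK_RE` and `parse_thunk_name` are hypothetical names, and the sketch does not check that the register numbers are ones the pass would actually use.

```python
import re

# One pattern covering the three name shapes given in the commit message.
THUNK_RE = re.compile(
    r"^__llvm_slsblr_thunk_"
    r"(?:(?P<key>aaz|abz|aa|ab)_)?"   # optional BLRA* variant
    r"x(?P<n>\d+)"                    # target register xN
    r"(?:_x(?P<m>\d+))?$"             # optional modifier register xM
)


def parse_thunk_name(name):
    """Return (opcode, n, m) for a thunk name, or None if it does not match."""
    match = THUNK_RE.match(name)
    if match is None:
        return None
    key = match.group("key")
    mod = match.group("m")
    # Only the aa/ab forms carry a second register; the others must not.
    if (key in ("aa", "ab")) != (mod is not None):
        return None
    opcode = "BLR" if key is None else "BLR" + key.upper()
    return (opcode, int(match.group("n")), None if mod is None else int(mod))
```

Decoding the name directly takes time proportional to the length of the name, independent of how many distinct thunks exist.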
Many parts of PAuth-related codegen are not MachO- or ELF-specific. Add
RUN lines against ELF targets to ensure that codegen works for ELF as
well as for MachO.
4e0bd3f improved early MachineLICM's capabilities to hoist COPY from
physical registers out of a loop. However, it accidentally broke one of
MachineSink's preconditions on sinking cheap instructions (in this case,
COPY), which considered those instructions profitable to sink only when
there are at least two of them in the same def-use chain in the same
basic block. So if early MachineLICM hoisted one of them out,
MachineSink no longer sinks the rest of the cheap instructions. This
results in redundant load immediate instructions in the motivating
example we've seen on RISC-V.
This patch fixes this by teaching MachineSink that if there is more than
one demand to sink a register into the same block from different
critical edges, it should be considered profitable as it increases the
CSE opportunities.
This change also improves two of the AArch64 cases.
This was reverted in f985a8826bfa4ca3d23e654185de35e30ea6dc79. Since then,
the default WMO lowering has moved to A67 compatible, the ABI attribute
emission has landed (off by default), and the LLD change to merge said
attributes has landed. Our Ztso lowering is believed to also be A67
compatible, and no known issues remain.
Original commit message:
Ztso 1.0 was ratified in January 2023.
Documentation:
https://github.com/riscv/riscv-isa-manual/blob/main/src/ztso-st-ext.adoc
Marking SPLAT_VECTOR as Custom enables generic DAGCombine to turn
BUILD_VECTOR into SPLAT_VECTOR. We need to custom type legalize BUILD_VECTOR
without Zfhmin since we don't have the scalar f16 type. If we allow
SPLAT_VECTOR to be formed, we'll need to custom type legalize it too.
Easiest fix is to only enable SPLAT_VECTOR with Zvfhmin+Zfhmin. There's
still an issue that we need to properly support BUILD_VECTOR with Zvfhmin+Zfhmin.
Should fix the new case reported in #97849.
I've also changed the predicates to Zfhmin instead of ZfhminOrZhinxmin
since Zhinx isn't compatible with Zvfhmin.
We currently only fold a vmerge into a masked true operand if the vmerge
has an all-ones mask, since we end up keeping the mask from the true
operand.
But if the masks are the same then we can still fold, because vmerge and
true have the same passthru. If an element was masked off in the
original vmerge, it will also be masked off in the resulting true, and
will have the same passthru value.
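The equivalence argued above can be checked element by element with a small model. This is a sketch only: vectors are Python lists, `masked_op` and `vmerge` are hypothetical stand-ins for the masked true operand and the vmerge, and the vmerge's false operand is assumed to be the shared passthru.

```python
def masked_op(op_result, mask, passthru):
    """Masked instruction: active lanes take the result, others the passthru."""
    return [r if m else p for r, m, p in zip(op_result, mask, passthru)]


def vmerge(mask, true_vals, false_vals):
    """vmerge: pick the true value where the mask is set, else the false value."""
    return [t if m else f for m, t, f in zip(mask, true_vals, false_vals)]


op_result = [10, 11, 12, 13]
passthru = [90, 91, 92, 93]
mask = [1, 0, 1, 0]

# Same mask on both: the vmerge adds nothing and can be folded away.
unfolded = vmerge(mask, masked_op(op_result, mask, passthru), passthru)
folded = masked_op(op_result, mask, passthru)

# Different mask on the vmerge: the fold would change lane 2.
other_mask = [1, 1, 0, 0]
mismatched = vmerge(other_mask, masked_op(op_result, mask, passthru), passthru)
```

With identical masks, `unfolded` and `folded` agree in every lane. With a different vmerge mask they diverge, which is why the fold is limited to the all-ones and same-mask cases.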
The motivation for this is to lower masked VP loads and stores with
passthrus to masked RVV instructions. Normally you can express a masked
RVV instruction with a mask undisturbed passthru via a combination of a
VP op with an all-ones mask and a vp.merge. But for loads and stores you
need the same mask on the VP op as well as the vp.merge.
In 03d4332, we extended build_vector lowering to pack elements into the
largest size which doesn't exceed either ELEN or XLEN. The zbkb
extension - ratified under scalar crypto, but otherwise not really
connected to crypto per se - adds the packh, packw, and pack
instructions. These instructions are designed for exactly this pairwise
packing.
I ended up choosing to directly lower to machine nodes. A combination of
the slightly non-uniform semantics of these instructions (packw *sign*
extends the result, whereas packh *zero* extends it), and our generic
dag canonicalization (which sinks shl through or nodes), make pattern
matching these tricky and not particularly robust. Another alternative
was to have an ISD node for them, but that didn't seem to add much in
practice.
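For reference, the non-uniform semantics mentioned above can be modelled in a few lines of Python. This sketch assumes RV64 (XLEN = 64) and follows the Zbkb descriptions of the instructions; the function names simply mirror the mnemonics and are not LLVM APIs.

```python
def packh(rs1, rs2):
    """Pack the low bytes of rs1 and rs2 into 16 bits, zero-extended."""
    return ((rs2 & 0xFF) << 8) | (rs1 & 0xFF)


def packw(rs1, rs2):
    """Pack the low halfwords into 32 bits, then sign-extend to 64 bits."""
    word = ((rs2 & 0xFFFF) << 16) | (rs1 & 0xFFFF)
    if word & 0x8000_0000:
        word |= 0xFFFF_FFFF_0000_0000
    return word


def pack(rs1, rs2):
    """Pack the low 32-bit halves of rs1 and rs2 into 64 bits (RV64)."""
    return ((rs2 & 0xFFFF_FFFF) << 32) | (rs1 & 0xFFFF_FFFF)
```

Packing eight bytes pairwise with `packh`, then `packw`, then `pack` yields one 64-bit value. The sign extension from `packw` does no harm in that chain, because `pack` only reads the low 32 bits of each input.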
With the tag merging in place, we can safely make
+seq-cst-trailing-fence the default, according to the recommendation
in
https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-atomic.adoc
This patch changes the default for the feature flag, and moves to more
consistent naming with respect to existing features.
This was reverted with https://github.com/llvm/llvm-project/pull/84597,
because ld.bfd would segfault with unknown riscv attributes. Now that
attributes emission is guarded with a backend flag,
`--riscv-abi-attributes`, this should be safe to reland, since it won't
introduce abi tags unless the user opts into them.
Add __hlt, which is an MSVC ARM64 intrinsic.
This intrinsic is just the HLT instruction. MSVC's version seems to
return something undefined; in this patch it will just return zero.
MSVC intrinsics are defined here
https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics.
I used unsigned int as the return type, because that is what the MSVC
intrin.h header uses, even though it conflicts with the documentation.
Our worst case build_vector lowering is a serial chain of vslide1down.vx
operations which creates a serial dependency chain through a relatively
high latency operation. We can instead pack together elements into ELEN
sized chunks, and move them from integer to vector in a single
operation.
This reduces the length of the serial chain on the vector side, and
costs at most three scalar instructions per element. This is a win for
all cores when the sum of the latencies of the scalar instructions is
less than the vslide1down.vx being replaced, and is particularly
profitable for out-of-order cores which can overlap the scalar
computation.
This patch is restricted to configurations with zba and zbb. Without
both, the zero extend might require two instructions which would bring
the total scalar instructions per element to 4. zba and zbb are both
present in the rva22u64 baseline which is looking to be quite common for
hardware in practice; we could extend this to systems without bitmanip
with a bit of extra effort.
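The packing step can be sketched in Python as follows. This is an illustration rather than the lowering itself: `pack_elements` is a hypothetical helper, element 0 is placed in the least significant bits of each chunk, and the mask, shift, and or per element stand in for the scalar instructions counted above.

```python
def pack_elements(elements, elt_bits, elen=64):
    """Pack fixed-width elements into ELEN-sized integer chunks."""
    per_chunk = elen // elt_bits
    mask = (1 << elt_bits) - 1
    chunks = []
    for start in range(0, len(elements), per_chunk):
        chunk = 0
        for i, elt in enumerate(elements[start:start + per_chunk]):
            # Zero-extend (mask), shift into position, and or into the chunk.
            chunk |= (elt & mask) << (i * elt_bits)
        chunks.append(chunk)
    return chunks
```

For sixteen 8-bit elements and ELEN = 64 this produces two chunks, so the serial chain on the vector side needs two insertions instead of sixteen.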
On Windows, without this fixup, __security_check_cookie gets called every time a function returns. Since this function is defined in the runtime library, it incurs the cost of a call into a DLL that simply does a comparison and returns most of the time. With the fixup, we selectively make the call into the DLL only if the comparison fails.
Previously we had the same instructions being generated for `ISD::CTLZ` and `ISD::CTLZ_ZERO_UNDEF` which did not take advantage of the fact that zero is an invalid input for `ISD::CTLZ_ZERO_UNDEF`. This commit separates codegen for the two cases to allow for the optimization for the latter case.
The details of the optimization are outlined in #82075.
Fixes #82075
Co-authored-by: Manish Kausik H <hmamishkausik@gmail.com>
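The semantic difference between the two nodes can be shown with a short Python sketch. This is not taken from #82075 and does not claim to match the generated instructions; `ctlz` and `ctlz_zero_undef` are illustrative models for a power-of-two bit width, showing one common way the "zero is invalid" guarantee removes the need for a zero check.

```python
def ctlz(x, bits=32):
    """Count leading zeros; defined for every input, including zero."""
    if x == 0:
        return bits
    return bits - x.bit_length()


def ctlz_zero_undef(x, bits=32):
    """Count leading zeros when the caller guarantees x != 0.

    No zero check is needed: the index of the highest set bit
    (as a bit-scan-reverse would return) is xor-ed with bits - 1.
    """
    highest_set_bit = x.bit_length() - 1
    return (bits - 1) ^ highest_set_bit
```

For every non-zero input the two functions agree; they differ only in whether zero has to be handled.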
#96414 + #97206 didn't ensure that we were extracting subvectors from a vector double the width of the destination.
We can relax this in a future patch, but fix the #97968 crash first.
Fixes #97968
This re-commits d1a4f0c9fb559eb4c2fb56112e56343bcd333edc after
an issue was fixed in f92bfca9fc217cad9026598ef6755e711c0be070
("[AArch64] All bits of an exact right shift are demanded (#97448)").
If a function's address is taken, which means it may be called via a function pointer,
we need the function descriptor for it.
Otherwise, the function descriptor can be omitted for external symbols.
If we don't have Zfhmin, we will call `SoftPromoteHalfOperand` on the
BUILD_VECTOR. This operation is not supported by the generic code.
Instead, custom lower to a vXi16 BUILD_VECTOR using bitcasts.
Fixes #97849.
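The idea of building the vector from integer bit patterns can be illustrated outside the compiler. The following Python sketch uses the standard `struct` module's half-precision format; `f16_bits`, `bits_to_f16`, and `build_vector_f16_as_i16` are hypothetical helpers, and the real lowering works on SelectionDAG nodes rather than Python values.

```python
import struct


def f16_bits(value):
    """Reinterpret a half-precision float as its 16-bit integer pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def bits_to_f16(bits):
    """Reinterpret a 16-bit integer pattern as a half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def build_vector_f16_as_i16(values):
    """Build the vector from i16 bit patterns instead of f16 scalars."""
    return [f16_bits(v) for v in values]
```

Since the bitcast is lossless in both directions, building a vXi16 vector and casting the result back gives the same elements as building the f16 vector directly.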