Annotation attributes may be attached to a function to mark it with
custom data that will be contained in the final Wasm file. The
annotation causes a custom section named
"func_attr.annotate.<name>.<arg0>.<arg1>..." to be created that will
contain each function's index value that was marked with the annotation.
A new patchable relocation type for function indexes had to be created so
the custom section could be updated during linking.
Reviewed By: sbc100
Differential Revision: https://reviews.llvm.org/D150803
Extended BPFCheckAndAdjustIR pass with sinkMinMax() transformation
that undoes LICM hoistMinMax pass.
The undo transformation converts the following patterns:
x < min(a, b) -> x < a && x < b
x > min(a, b) -> x > a || x > b
x < max(a, b) -> x < a || x < b
x > max(a, b) -> x > a && x > b
Where 'a' or 'b' is a constant.
Also supports `sext min(...) ...` and `zext min(...) ...`.
~~~
This was previously commited as 09feee559a29 and reverted in
0bf9bfeacc8c because of the testbot memory leak report:
https://lab.llvm.org/buildbot/#/builders/5/builds/34931
The memory leak issue was caused by incorrect instruction removal
sequence in skinMinMaxBB():
I->dropAllReferences(); --------> I->eraseFromParent();
I->removeFromParent(); fixed to
Differential Revision: https://reviews.llvm.org/D147990
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
This change continues with the line of work discussed in https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295.
This change handles most of the binary pseudos. I excluded pseudos which _TIED variants, and those that produce mask results. Both a bit different in functionality, and deserve their own change and review. As with previous changes in the series, we replace the existing TA and TU forms with a single unified pseudo with a passthru (which may be implicit_def) and a policy operand.
As before, we see codegen changes (some improvements and some regressions) due to scheduling differences caused by the extra implicit_def instructions.
Differential Revision: https://reviews.llvm.org/D154245
For inline WebAssembly, passing a numeric operand to global.get is
unsupported. This causes encodeInstruction to reach an llvm_unreachable
call, leading to undefined behaviors. This patch fixes the issue for
this invalid instruction encoding, making it report an error by adding
an MCContext field in class WebAssemblyMCCodeEmitter.
Reviewed By: sbc100, bryanpkc
Differential Revision: https://reviews.llvm.org/D154734
The combination was designed to combine a negative imaginary value
rather then a full negative complex value.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154213
Mark the tMOVi32imm pseudo instr as killing the flags register.
The pseudo instruction expands to a sequence of 7 movs/lsls/adds
instructions, which are all Thumb-1 flag setting instructions.
For a test case, take an existing arm test which checks for
"Don't CSE a cmp across a call that clobbers CPSR."
and retarget it at thumbv6m execute-only.
Reviewed By: stuij
Differential Revision: https://reviews.llvm.org/D154845
Change-Id: I8f8209fbc40a833f8875629937b9606c1e2c021d
Using X86::isConstantSplat instead of ISD::isConstantSplatVector allows us to detect constant masks after they've been lowered to constant pool loads.
Addresses regression from D154592
Currently in LowerConstantFP, when we compile for execute-only (XO) we don't
check what architecture we're compiling for (v6m=< or >v6m). We shouldn't get
here for v6m, so put in an assert.
Reviewed By: simonwallis2, dmgreen
Differential Revision: https://reviews.llvm.org/D154506
Extend coverage for lowering wide vector types during type legalization to allow us to use PACKSS/PACKUS patterns instead of dropping down to shuffle lowering.
First step towards avoiding premature folds of TRUNCATE to PACKSS/PACKUS nodes as described on Issue #63710 - which causes a large number of regressions on D152928 - we will next need to tweak the TRUNCATE widening in ReplaceNodeResults
Differential Revision: https://reviews.llvm.org/D154592
Only a few minor test changes needed because I removed the "helper" suffix from the combiner name, as it's not really a helper anymore but more like the implementation itself.
Depends on D153757
NOTE: This would land iff D153757 (RFC) lands too.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D153850
Fixes#63692.
In reference to volatile memory accesses, the langref says:
> the backend should never split or merge target-legal volatile load/store instructions.
Differential Revision: https://reviews.llvm.org/D154609
We try to fold RISCVISD::VMV_V_X_VL series node + scalar load -> vector load.
But if scalar load is indexed load (load update form), it's not profitable to fold because load update node can't be removed after fold.
Differential Revision: https://reviews.llvm.org/D152222
For `CSR_Interrupt`, we can generate the register list via a single
`sequence`.
For `CSR_XLEN_F32_Interrupt` and `CSR_XLEN_F64_Interrupt`, I don't
see the reason why we need to keep the order the same as how we used
to allocate registers (and we have changed the order in D146488), so
I fold them into one `sequence`.
There are some *.ll changes because of the order change.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D154837
The original MFS work D85368 shows good performance improvement with
Instrumented FDO. However, AutoFDO or Flow-Sensitive AutoFDO (FSAFDO)
does not show performance gain. This is mainly caused by a less
accurate profile compared to the iFDO profile.
For the past few months, we have been working to improve FSAFDO
quality, like in D145171. Taking advantage of this improvement, MFS
now shows performance improvements over FSAFDO profiles.
That being said, 2 minor changes need to be made, 1) An FS-AutoFDO
profile generation pass needs to be added right before MFS pass and an
FSAFDO profile load pass is needed when FS-AutoFDO is enabled and the
MFS flag is present. 2) MFS only applies to hot functions, because we
believe (and experiment also shows) FS-AutoFDO is more accurate about
functions that have plenty of samples than those with no or very few
samples.
With this improvement, we see a 1.2% performance improvement in clang
benchmark, 0.9% QPS improvement in our internal search benchmark, and
3%-5% improvement in internal storage benchmark.
This is #1 of the two patches that enables the improvement.
Reviewed By: wenlei, snehasish, xur
Differential Revision: https://reviews.llvm.org/D152399
Some instructions treat x0 as a special encoding rather than as a
value of 0. Since we don't parse the inline assembly to know what
the instruction is, chooser the safest option of never using x0.
Fixes#63747.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D154744
This patch extends support of the option `-frecord-command-line` to XCOFF. XCOFF doesn’t have custom sections like ELF, so the command line data is emitted to a .info section instead. A C_INFO symbol is generated with the .info section to preserve the command line data past the link step. Multiple command lines are separated by newlines and null bytes. The command line data can be retrieved on AIX with command `what file_name`.
Reviewed By: scott.linder
Differential Revision: https://reviews.llvm.org/D153600
Using ACLE intrinsics, it is possible to create a loop that the
deinterleaving pass incorrectly classified as a reduction loop.
For example, for fixed-width vectors the loop was like below:
vector.body:
%a = phi <4 x float> [ %init.a, %entry ], [ %updated.a, %vector.body ]
%b = phi <4 x float> [ %init.b, %entry ], [ %updated.b, %vector.body ]
...
; Does not depend on %a or %b:
%updated.a = ...
%updated.b = ...
Differential Revision: https://reviews.llvm.org/D154598
This check was unnecessary/incorrect, it was already being done by the target
hook default implementation, and the one in the matcher was checking for a
completely different thing. This change:
1) Removes the check and updates affected tests which now do some more reassociations.
2) Modifies the AMDGPU hooks which were stubbed with "return true" to also do the oneuse
check. Not sure why I didn't do this the first time.
This implements the v1.0-rc1 draft extension.
amocas.d on RV32 and amocas.q have the restriction that rd and rs2 must
be even registers. I've opted to implement this restriction in
RISCVAsmParser::validateInstruction even though for codegen we'll need a
new register class and can then remove this validation. This also
sidesteps, for now, the issue of amocas.d being different on rv32 vs
rv64.
See <https://github.com/riscv-non-isa/riscv-c-api-doc/issues/37> for the
issue of needing an agreed asm register constraint for register pairs.
Differential Revision: https://reviews.llvm.org/D149248
Previously we remove a pattern like:
%reg = and32ri %in_reg, 5
... // EFLAGS not changed.
%src_reg = subreg_to_reg 0, %reg, %subreg.sub_index
test64rr %src_reg, %src_reg, implicit-def $eflags
We can remove test64rr since it has same functionality as and subreg_to_reg avoid the opt in previous code, so we handle this case specially.
And this case is also can be opted for the same reason, like:
%reg = and32ri %in_reg, 5
... // EFLAGS not changed.
%src_reg = copy %reg.sub_16bit:gr32
test16rr %src_reg, %src_reg, implicit-def $eflags
The COPY from gr32 to gr16 prevent the opt in previous code too, just handle it specially as what we did for test64rr.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D154193
Handle (zero_extend (and (srl X, C1), C2)) patterns to allow foldMaskAndShiftToExtract to match h-register extractions from smaller types
Ideally matchAddressRecursively needs to be able to recurse through ZEXT/SEXT nodes generally but for now we should just handle specific cases when they occur
Addresses regressions in D146121
InstCombine should have taken care of this, but I think
this is more useful in the future when the expansion
tries to handle multiple cases at a time with fcmp.
x87 looks worse to me but the only thing I know about it is that
I aggressively do not care about it.
https://reviews.llvm.org/D143198
The commit originally caused a bootstrap failure on the big endian
PPC bot as the combine was interfering with the legalizer when
applied on illegal types. This update restricts the combine to
the only types for which it is actually needed. Tested on PPC BE
bootstrap locally.
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.
This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.
Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.
Differential Revision: https://reviews.llvm.org/D124196
So far, we haven't exposed the allocation of whole-wave
registers to regalloc. We hand-picked them for various
whole wave mode operations. With a future patch, we
want the allocator to efficiently allocate them rather
than using the custom pre-allocation pass.
Any liverange split of virtual registers involved in
whole-wave operations require the resulting COPY
introduced with the split to be performed for all
lanes. It isn't implemented in the compiler yet.
This patch would identify all such copies and
manipulate the exec mask around them to enable all
lanes without affecting the value of exec mask
elsewhere.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D143762
To reduce the register pressure during allocation,
when the allocator spills a virtual register that
corresponds to a whole wave mode operation, the
spill loads and restores should be activated for
all lanes by temporarily flipping all bits in exec
register to one just before the spills. It is not
implemented in the compiler as of today and this
patch enables the necessary support.
This is a pre-patch before the SGPR spill to virtual
VGPR lanes that would eventually causes the whole
wave register spills during allocation.
Reviewed By: arsenm, cdevadas
Differential Revision: https://reviews.llvm.org/D143759
Replacing D143754. Right now the LiveRangeSplitting during register allocation uses
TargetOpcode::COPY instruction for splitting. For AMDGPU target that creates a
problem as we have both vector and scalar copies. Vector copies perform a copy over
a vector register but only on the lanes(threads) that are active. This is mostly sufficient
however we do run into cases when we have to copy the entire vector register and
not just active lane data. One major place where we need that is live range splitting.
Allowing targets to use their own copy instructions(if defined) will provide a lot of
flexibility and ease to lower these pseudo instructions to correct MIR.
- Introduce getTargetCopyOpcode() virtual function and use if to generate copy in Live range
splitting.
- Replace necessary MI.isCopy() checks with TII.isCopyInstr() in register allocator pipeline.
Reviewed By: arsenm, cdevadas, kparzysz
Differential Revision: https://reviews.llvm.org/D150388
Count of input operands affect pipeline forwarding in scheduling model.
Previous Power10 model definition arranges some instructions into
incorrect groups, by counting the wrong number of input operands.
This patch updates the model, setting the input operands count correctly
by excluding irrelevant immediate operands and count memory operands of
load instructions correctly.
Reviewed By: shchenz
Differential Revision: https://reviews.llvm.org/D153842
As noted by @reames, we should be checking that the memory access is aligned to
the element size (or the unaligned vector memory access feature is enabled)
before lowering vlseg/vsseg intrinsics via the interleaved access pass.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D154536