On Powerpc, set instruction count as lsr first priority of lsr by default.
Add an option ppc-lsr-no-insns-cost to return back to default lsr cost model.
Reviewed By: steven.zhang, jsji
Differential Revision: https://reviews.llvm.org/D72683
Both of those functions only have a single caller starting
at LowerSETCC. Just handle floating point directly in LowerSETCC.
This removes the need to pass Chain and IsSignaling all the way
down.
Summary:
Many directives are unavailable, and support for others may be limited.
This first draft has preliminary support for:
- conditional directives (including errors),
- data allocation (unsigned types up to 8 bytes, and ALIGN),
- equates/variables (numeric and text),
- and procedure directives (without parameters),
as well as COMMENT, ECHO, INCLUDE, INCLUDELIB, PUBLIC, and EXTERN. Text variables (aka text macros) are expanded in-place wherever the identifier occurs.
We deliberately ignore all ml.exe processor directives.
Prominent features not yet supported:
- structs
- macros (both procedures and functions)
- procedures (with specified parameters)
- substitution & expansion operators
Conditional directives are complicated by the fact that "ifdef rax" is a valid way to check if a file is being assembled for a 64-bit x86 processor; we add support for "ifdef <register>" in general, which requires adding a tryParseRegister method to all MCTargetAsmParsers. (Some targets require backtracking in the non-register case.)
Reviewers: rnk, thakis
Reviewed By: thakis
Subscribers: kerbowa, merge_guards_bot, wuzish, arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, mgorny, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72680
The unseen logic diff occurs because MayFoldLoad() is defined like this:
static bool MayFoldLoad(SDValue Op) {
return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}
The test diffs here all seem ok to me on screen/paper, but it's hard to know
if that will lead to universally better perf for all targets. For example,
if a target implements broadcast from mem as multiple uops, we would have to
weigh the potential reduction of instructions and register pressure vs.
possible increase in number of uops. I don't know if we can make a truly
informed decision on this at compile-time.
The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the
larger example there without at least 1 other fix.
Differential Revision: https://reviews.llvm.org/D74088
This allows it to work properly with masked inc/dec for avx512. Those
would have a vselect as the root node so didn't get a chance to call
combineIncDecVector.
This also simplifies the logic because we don't have to manage
the topological ordering.
CallPreservedMask is used to describe the register liveness after a
function call. The function call in an interrupt handler should use the same
CallPreservedMask as normal functions. So that only callee save registers
can live through the function call.
The division expansions in AMDGPUCodeGenPrepare can't be relied on for
correctness, since they punt to later optimization and possibly
legalization in some cases. We still need a way to be able to write
tests for the legalizer versions of the expansion. This is mostly for
GlobalISel, since the expected optimzations is expecting aren't
implemented.
The interaction with the flag to expand 64-bit division in the IR is
pretty confusing, but these flags have different purposes.
I didn't realize we were already expanding 24/32-bit division here
already. Use the available IntegerDivision utilities. This uses loops,
so produces significantly smaller code than the inline DAG expansion.
This now requires width reductions of 64-bit divisions before
introducing the expanded loops.
This helps work around missing legalization in GlobalISel for
division, which are the only remaining core instructions that didn't
work at all.
I think this is plausibly a better implementation than exists in the
DAG, although turning it on by default misses out on the constant
value optimizations and also needs benchmarking.
This is more or less directly ported from the AMDGPU custom lowering
for FP_TO_FP16. I made a few minor fixups (using G_UNMERGE_VALUES
instead of creating shift/trunc to extract the two halves, and zexting
an inverted compare instead of select_cc).
This also does not include the fast math expansion the DAG which
converts to f32 and then to f16. I think that belongs in a
pre-legalize combine instead.
Like COPY instructions explained in D70616, we don't check the constraints
when combining G_UNMERGE_VALUES. Use the same logic used in D70616 to check
if registers can be replaced, or a COPY instruction needs to be built.
https://reviews.llvm.org/D70564
Assembler now permits pairs like 'v0:1', which are encoded
differently from the odd-first pairs like 'v1:0'.
The compiler will require more work to leverage these new register
pairs.
This patch added generation of SIMD bitwise insert BIT/BIF instructions.
In the absence of GCC-like functionality for optimal constraints satisfaction
during register allocation the bitwise insert and select patterns are matched
by pseudo bitwise select BSP instruction with not tied def.
It is expanded later after register allocation with def tied
to BSL/BIT/BIF depending on operands registers.
This allows to get rid of redundant moves.
Reviewers: t.p.northover, samparker, dmgreen
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D74147
Without PSHUFB we are better using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.
REAPPLIED: Original commit rG11c16e71598d was reverted at rGde1d90299b16 as it wasn't accounting for later lowering. This version emits ROTLI or the OR(VSHLI/VSRLI) directly to avoid the issue.
Summary: Support for PIC with tests for global variables and function calls.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D74536
Summary:
Also make return calls terminator instructions so epilogues are
inserted before them rather than after them. Together, these changes
make WebAssembly's tail call optimization more stack-safe.
Reviewers: aheejin, dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73943
If we widen the compare we might trigger a spurious exception from
the garbage data.
We have two choices here. Explicitly force the upper bits to zero.
Or use a legacy VEX vcmpps/pd instruction and convert the XMM/YMM
result to mask register.
I've chosen to go with the second option. I'm not sure which is
really best. In some cases we could get rid of the zeroing since
the producing instruction probably already zeroed it. But we lose
the ability to fold a load. So which is best is dependent on
surrounding code.
Differential Revision: https://reviews.llvm.org/D74522