According to the N2 Software Optimization Guide, arithmetic ops with LSL
≤ 4, non-flag-setting logical ops, and flag-setting logical ops with LSL = 0 have a
latency of 1 and use pipeline group I. However, most of these ops were
being modelled as having a latency of 2 and using pipeline M. The
affected instructions include the "unshifted" versions of ADD/SUB, among
others.
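As a sketch of the intended mapping (the resource, write and instruction
names below are placeholders rather than the actual defs touched by the
patch):

  // Placeholder write resources: 1 cycle on the I pipeline group versus
  // 2 cycles on the M pipeline group.
  def N2Write_1cyc_1I : SchedWriteRes<[N2UnitI]> { let Latency = 1; }
  def N2Write_2cyc_1M : SchedWriteRes<[N2UnitM]> { let Latency = 2; }

  // The affected ADD/SUB forms should map to the 1-cycle I write rather
  // than the 2-cycle M write (regex illustrative only).
  def : InstRW<[N2Write_1cyc_1I], (instregex "^(ADD|SUB)[WX]r[rs]$")>;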
Differential Revision: https://reviews.llvm.org/D145370
The previous Alderlake P-Core model preferred data from uops.info over the Intel doc.
Some latencies measured by uops.info are larger than the real latency, e.g. addpd
latency is 3 in uops.info but 2 in the Intel doc. This patch adjusts the priority
of those two data sources so that the Intel doc is preferred.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D144388
The instruction regex "^INSv" for the general-register-to-element insert was also
matching the element-to-element instruction, which has a latency of 2 rather than
5, so we were getting that wrong.
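Roughly, the fix is to split the regex so the two forms get separate write
resources - a sketch (the write names are placeholders, though AArch64 does
distinguish the two INS forms along the lines of INSvi32gpr vs INSvi32lane):

  // gpr-to-element insert: latency 5.
  def : InstRW<[Write_5cyc], (instregex "^INSvi(8|16|32|64)gpr$")>;
  // element-to-element insert: latency 2.
  def : InstRW<[Write_2cyc], (instregex "^INSvi(8|16|32|64)lane$")>;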
Differential Revision: https://reviews.llvm.org/D144508
The instruction property hasSideEffects relies on the presence of
tablegen isel patterns when constructing its value, unless
specifically overridden. Since adding SVE scheduling information
we've noticed this property flip-flop as isel patterns have been
updated. To make things consistent (and correct) this patch
explicitly sets the property for all SVE instructions.
This has resulted in the following notable changes:
* Normal load and store instructions no longer report having side
effects.
* All prefetch instructions correctly report having side effects.
* FFR related instructions continue to report having side effects.
This is likely overkill but I've chosen to remain cautious here.
* Almost all integer instructions no longer report having side effects.
* Almost all floating point instructions no longer report having side
effects, but do now report their potential for raising FP
exceptions. I do not know how to test the latter, so I've again
taken the cautious route of tagging all floating point instructions
except for DUPs.
* The conflict detection intrinsics now report they don't touch
memory.
NOTE: SVE isel makes significant use of pseudo instructions but
this patch makes no effort to update them.
NOTE: We'll need a similar patch for SME but without a scheduling
model it'll be harder to verify the results.
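In TableGen terms the change amounts to wrapping the defs in explicit
property blocks instead of relying on inference from isel patterns - a
sketch only; the grouping below is illustrative rather than the literal
structure of the patch:

  // Integer, load and store instructions: no unmodeled side effects.
  let hasSideEffects = 0 in {
    // ... integer / load / store defs ...
  }

  // Floating point instructions: no side effects, but they may raise FP
  // exceptions (DUPs excepted).
  let hasSideEffects = 0, mayRaiseFPException = 1 in {
    // ... floating point defs ...
  }

  // Prefetches and FFR related instructions keep reporting side effects.
  let hasSideEffects = 1 in {
    // ... PRF* / FFR defs ...
  }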
Differential Revision: https://reviews.llvm.org/D142122
The RMW instructions still need addressing, probably with a new 'WriteXCHGRMW' scheduler class.
Based off llvm-exegesis captures, confirmed with Agner + uops.info
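If/when that class gets added, the shape would presumably be the usual
one - declare the SchedWrite once and let each model map it onto its own
ports. A sketch with placeholder ports, latency and uop counts (only the
WriteXCHGRMW name comes from the note above):

  // X86Schedule.td: declare the class.
  def WriteXCHGRMW : SchedWrite;

  // Per-model mapping, e.g. (resources/latency/uops are placeholders,
  // not measurements):
  defm : X86WriteRes<WriteXCHGRMW, [SBPort015, SBPort23, SBPort4], 5, [2, 1, 1], 4>;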
The movprfx is a vector copy, so it doesn't access memory. Set
hasSideEffects to 0 so that hasUnmodeledSideEffects() doesn't return true,
which would otherwise block the machine scheduler from moving load/store
instructions across it.
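The TableGen change is essentially just an explicit property on the
MOVPRFX defs - sketched below with the def bodies elided:

  // MOVPRFX is a plain register-to-register copy: no memory access and
  // no unmodeled side effects, so loads/stores can be scheduled across it.
  let hasSideEffects = 0 in {
    // ... MOVPRFX_ZZ / MOVPRFX_ZPmZ_* / MOVPRFX_ZPzZ_* defs ...
  }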
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D140680
LOCK + CMPXCHG8/CMPXCHG16 variants still need overriding as they are not completely correct - already much better though
Based off llvm-exegesis captures, confirmed with Agner + uops.info
The set/reset/complement RMW variants use +1 uop compared to the BT read-only instructions
Based off llvm-exegesis captures, confirmed with Agner + uops.info
Now that exegesis produces meaningful snippets to measure throughput
for instructions with tied operands:
2ffe225d11
the measurements clearly show these instructions to have
better (more optimistic) throughput than currently modelled.
There's still some noise in the reports, especially around instructions
with memory operands. I'm not sure if we measure those correctly.
Fixes https://github.com/llvm/llvm-project/issues/59325
The znver2 override already matched the WriteBlendY class exactly, and the znver1 override wasn't accounting for ymm double-pumping.
Found with the help of D138359
znver1/znver2 show barely any difference in behaviour between the AVX1/AVX2 variants of these instructions - it looks like it was a copy+paste mistake that the AVX2 integer-domain instructions were missed from the overrides.
Having said that, the override numbers don't appear to match the numbers in the AMD 17h SoGs very well - for instance vperm2f128/vperm2i128 might be microcoded in the AMD sense of >3 uops, but it doesn't have a 100cy latency. These will need to be further addressed.
This appears to be a slowdown vs Skylake (which the model was copied from) - confirmed with uops.info / instlatx64
Noticed because D138359 was reporting that many of the PACKS overrides were redundant, when they were in fact incorrect
Reported by D138359 - they were being overridden as WriteMicrocoded despite already being declared WriteMicrocoded
It also fixes a rather funny instregex mismatch that was matching the movsldup shuffle by mistake
D138359 was reporting that this override was superfluous, but it had never been set up - I took the numbers from uops.info (I couldn't find an estimate in the Intel docs).
These were missed for some reason - only noticed this while investigating a FIXME in the SandyBridge model
Also sync the znver2/znver3 tests which had been missed when LOCK test coverage was added
D138359 was reporting that the EXTRACTPSrr override was unnecessary; however, the AMD SoG and Agner both confirm that both the rr and rm versions take 2 uops (matching znver1)
NOTE: For IceLakeServer we actually test TigerLake as that's the only target that supports it (we do something similar for F16C on IvyBridge in the SandyBridge tests).
AMD 15h SoG + Agner both indicate there's no difference between MPSADBWrri + VMPSADBWrri - I can't find any data on the folded variant so I've kept the existing numbers
Removes the last X86 override for WriteMPSAD/WritePSADBW classes - removing a further 3 entries from every sched class table
The znver1/znver2 schedules for CVTPD2PS were incorrectly double-pumping the xmm-load variant instead of the ymm variants (znver1 only)
Also, the xmm-load variant was incorrectly using FP03 instead of just FP3
Confirmed by the AMD SoG 17h tables, Agner + uops.info
Another step towards removing a lot of unnecessary overrides from all the x86 scheduler models - these should hopefully be convertible into regular WriteCvtPD2I classes soon.