Just commute with (V)BLENDPD/S like all other BLEND instructions
This is now handled more generally by the X86FixupInstTuningPass (OptSize fold occurs even without a scheduler model).
First step towards #142972
This is the x64 equivalent of #121516
Since import call optimization was originally [added to x64 Windows to
implement a more efficient retpoline
mitigation](https://techcommunity.microsoft.com/blog/windowsosplatform/mitigating-spectre-variant-2-with-retpoline-on-windows/295618),
the section and constant names relating to this feature all mention
"retpoline", and we need to mark indirect calls, control-flow guard
calls, and jumps for jump tables in the section alongside calls to
imported functions.
As with the AArch64 feature, this emits a new section into the object
file that the MSVC linker uses to generate the Dynamic Value Relocation
Table; the section itself does not appear in the final binary.
The Windows Loader requires that a specific sequence of instructions be
emitted when this feature is enabled:
* Indirect calls/jumps must have the function pointer to jump to in
`rax`.
* Calls to imported functions must use the `rex` prefix and be followed
by a 5-byte nop.
* Indirect calls must be followed by a 3-byte nop.
1. An ADD64rm_ND instruction may be emitted with a GOTPCREL relocation.
It is now handled in the "Suppress APX for relocation" pass and
transformed into ADD64rm, with its register operand in a non-rex2
register class. The relocation type R_X86_64_CODE_6_GOTPCRELX will be
added later, once APX is enabled for relocations.
2. The register class of an operand in an instruction with a relocation
is narrowed to a non-rex2 class in the "Suppress APX for relocation"
pass, but it may later be recomputed to a larger register class (e.g.
GR64_NOREX2RegClass to GR64RegClass). Fixed by not updating the register
class if it is a non-rex2 register class and APX support for relocations
is disabled (see the sketch after this list).
3. After "Suppress APX for relocation" pass, the instruction with
relocation may be folded with add NDD instruction to a add NDD
instruction with relocation. The later will be emitted to instruction
with APX relocation type which breaks backward compatibility. Fixed by
not folding instruction with GOTPCREL relocation with NDD instruction.
4. If the register in operand 0 of an instruction with a relocation is
used in a PHI instruction, it may be replaced with operand 0 of the PHI
(possibly an EGPR) after PHI elimination and the Machine Copy
Propagation pass. Fixed by suppressing EGPRs in operand 0 of PHI
instructions so that no APX relocation types are emitted.
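A minimal, self-contained model of the guard added for fix (2); the enum and all names are illustrative, not the actual LLVM code:

```
// Hypothetical model: the real code operates on TargetRegisterClass
// pointers, with GR64_NOREX2 excluding the APX extended GPRs (r16-r31).
enum RegClass { GR64_NOREX2, GR64 };

RegClass recomputeRegClass(RegClass Cur, bool APXRelocEnabled) {
  // Never widen a class that the "Suppress APX for relocation" pass
  // narrowed: keeping GR64_NOREX2 guarantees no EGPR is allocated for
  // the relocated operand, so no APX relocation type is emitted.
  if (!APXRelocEnabled && Cur == GR64_NOREX2)
    return Cur;
  return GR64; // otherwise the larger, recomputed class is fine
}
```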
Suppress EGPR/NDD instructions for relocations so that no APX relocation
types are emitted. This keeps backward compatibility with old linkers
that lack APX support. The use case is trying APX features with LLVM
plus the old built-in linker on RHEL 9, which is expected to be EOL in
2032.
If APX relocation types are present, old linkers raise an "unsupported
relocation type" error. Example:
```
$ llvm-mc -filetype=obj -o got.o -triple=x86_64-unknown-linux got.s
$ ld got.o -o got.exe
ld: got.o: unsupported relocation type 0x2b
...
$ cat got.s
...
movq foo@GOTPCREL(%rip), %r16
$ llvm-objdump -dr got.o
...
1: d5 48 8b 05 00 00 00 00 movq (%rip), %r16
0000000000000005: R_X86_64_CODE_4_GOTPCRELX foo-0x4
```
This reverts commit 7ae75851b2e1570662261c97c13cfc65357c283d.
There is a problem with the peephole optimization for the CCMP
instruction. See the example below:
C source code:
```
if (a > 2 || (b && (a == 2))) { … }
```
MIR before peephole optimization:
```
TEST8rr %21:gr8, %21:gr8, implicit-def $eflags // b
CCMP32ri %30:gr32, 2, 0, 5, implicit-def $eflags, implicit $eflags // a == 2
CCMP32ri %30:gr32, 3, 0, 5, implicit-def $eflags, implicit $eflags // a > 2 (transformed to a < 3)
JCC_1 %bb.6, 2, implicit $eflags
JMP_1 %bb.3
```
Inputs:
```
a = 1, b = 0.
```
With these inputs, the expected behavior is to jump to %bb.6.
After the TEST8rr instruction executes with b (%21) == 0, ZF is set to 1
in EFLAGS, so EFLAGS does not satisfy the SCC condition of the following
CCMP32ri (the a == 2 check), which therefore skips comparing a (%30)
with 2 and instead sets the flags from its payload to 0x202 (ZF = 0).
EFLAGS then satisfies the SCC condition of the second CCMP32ri, which
compares a (%30) with 3; it sets CF to 1 in EFLAGS and the JCC
instruction jumps to %bb.6.
But after adding CCMP support, the peephole optimization eliminates the
second CCMP32ri and updates the condition of the JCC instruction from
"B" to "BE". With the same inputs, the JCC instruction now falls through
to the next instruction. This is not expected; the second CCMP32ri
should not be eliminated:
```
TEST8rr %21:gr8, %21:gr8, implicit-def $eflags // b
CCMP32ri %30:gr32, 2, 0, 5, implicit-def $eflags, implicit $eflags // a == 2
JCC_1 %bb.6, 6, implicit $eflags
JMP_1 %bb.3
```
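The miscompile is visible at the source level; a minimal standalone check (the function name is illustrative):

```
#include <cassert>

// The original condition: with a = 1, b = 0 it must evaluate to false,
// matching the expected jump to %bb.6 above.
bool cond(int a, int b) { return a > 2 || (b && a == 2); }

int main() {
  assert(!cond(1, 0)); // fails if the second CCMP32ri is wrongly removed
  return 0;
}
```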
This change removes the uint64_t constructor on LocationSize,
preventing implicit conversion, and fixes up the APIs that use it to
adapt to the change. Note that I'm adding a couple of explicit
conversion points on routines where passing in a fixed offset as an
integer seems likely to have well-understood semantics.
We had an unfortunate case that arose if you tried to pass a TypeSize
value to a parameter of LocationSize type. We'd find the implicit
conversion path through TypeSize -> uint64_t -> LocationSize, which
works just fine for fixed values but loses information and fails
assertions if the TypeSize was scalable. This change breaks the first
link in that implicit conversion chain, since that seemed to be the
easier one.
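To make the failure mode concrete, here is a self-contained model of the old conversion chain (simplified stand-ins for the LLVM types, not the actual definitions):

```
#include <cassert>
#include <cstdint>

// Simplified stand-in for llvm::TypeSize.
struct TypeSize {
  uint64_t Quantity;
  bool Scalable;
  operator uint64_t() const { // implicit TypeSize -> uint64_t
    assert(!Scalable && "a scalable size has no fixed value");
    return Quantity;
  }
};

// Simplified stand-in for llvm::LocationSize.
struct LocationSize {
  uint64_t Value;
  LocationSize(uint64_t V) : Value(V) {} // the constructor being removed
};

void analyze(LocationSize) {}

int main() {
  TypeSize Fixed{16, false};
  uint64_t Raw = Fixed; // fine for fixed sizes
  analyze(Raw);         // implicit uint64_t -> LocationSize: a compile
                        // error once the constructor is removed
  TypeSize Scalable{16, true};
  uint64_t Lossy = Scalable; // asserts at run time: scalability is lost
  analyze(Lossy);
}
```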
This moves the x86 implementation into generic code, since it appears
to be suitable for any target. The heart of this transform is inside
foldMemoryOperand, so other targets won't actually kick in until they
implement that API. This removes one piece to implement in the process
of enabling foldMemoryOperand.
APX NDD instructions may be compressed when the result is also a
source. For 8/16-bit instructions, this may create partial-register
write hazards if a previous super-register def is within the partial
register update clearance, or incorrect code if the super-register is
not dead.
This change prevents compression when the super-register is marked as an
implicit define, which the virtual rewriter already adds in the case
where a subregister is defined but the super-register is not dead.
The BreakFalseDeps interface is also updated to add implicit
super-register defs for NDD instructions that would incur partial-write
stalls if compressed to legacy ops.
The motivation is supporting scalable spills and reloads, e.g. in
https://github.com/llvm/llvm-project/pull/120524.
Looking at this API, I'm suspicious that the access size should just be
coming from the memory operand on the load or store, but we don't appear
to be consistently setting that up. That's a larger change so I may or
may not bother pursuing that.
The last one may be an implicit use, e.g.,
`IDIV32r %4:gr32, implicit-def dead $eax, implicit-def $edx,
implicit-def dead $eflags, implicit $eax, implicit $edx`
https://godbolt.org/z/KPKzj5c8K
This extends `optimizeCompareInstr` to reuse a previous CCMP result if
the previous comparison was with an immediate that was 1 bigger or
smaller. Example:
```
CCMP x, 13, 2, 5
...
CCMP x, 12, 2, 5 ; can be removed if we change the SETg
SETg ...         ; x > 12 changed to SETge (x >= 13), removing the 2nd CCMP
```
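The rewrite relies on a simple integer identity; a trivial standalone check (not LLVM code):

```
#include <cassert>

int main() {
  // For integers, x > 12 is exactly x >= 13, so the SETg consuming the
  // second compare can become a SETge consuming the first one's flags.
  for (int x = -100; x <= 100; ++x)
    assert((x > 12) == (x >= 13));
  return 0;
}
```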
NVPTX, SPIRV, and WebAssembly pass virtual registers to this function
since they don't perform register allocation. We need to use Register to
avoid a virtual register being converted to MCRegister by the caller.
This avoids dozens of regressions in a future patch. These
primarily manifested as assertions where we had copies of 64-bit
registers to 32-bit registers.
This is testable in principle with hand-written MIR, but that's
a bit too much x86 for me.
Intel docs have been updated to be similar to AMD's and now describe
BSF/BSR as not changing the destination register if the input value was
zero, which allows us to support the CTTZ/CTLZ zero-input cases by
pre-setting the destination to produce a NumBits result (BSR is a bit
messy, as the destination has to be XOR'd to create a CTLZ result).
VIA/Zhaoxin x86_64 CPUs have also been confirmed to match this
behaviour.
This patch adjusts the X86ISD::BSF/BSR nodes to take a "pass through"
argument for zero-input cases, by default this is set to UNDEF to match
existing behaviour, but it can be set to a suitable value if supported.
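As a concrete model of the pass-through and the XOR trick, here is a small standalone sketch (the helper and its parameters are illustrative, not the actual lowering code):

```
#include <cstdint>

// Models BSR leaving the destination unchanged on zero input: the
// destination is pre-loaded with the pass-through value, pre-XOR'd by 31
// so the final XOR (which turns a BSR bit index into a CTLZ count)
// restores it.
uint32_t ctlzViaBsr(uint32_t X, uint32_t ZeroInputResult) {
  uint32_t Dst = ZeroInputResult ^ 31; // pass-through, pre-XOR'd
  if (X != 0)
    Dst = 31 - __builtin_clz(X); // BSR: index of the highest set bit
  return Dst ^ 31;               // idx ^ 31 == 31 - idx == CTLZ of X
}
```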
There are still some limits to this - it's only supported for
x86_64-capable processors (and I've only enabled it for x86_64 codegen),
and Intel CPUs sometimes zero the upper 32 bits of a pass-through
register when used for BSR32/BSF32 with a zero source value (i.e. the
whole 64 bits may not get passed through).
Fixes #122004
This patch is in preparation for enabling the MachineInstr::MIFlag
flags, i.e. FrameSetup/FrameDestroy, to be set on callee-saved register
spill/reload instructions in the prologue/epilogue. This eventually
helps in setting the prologue_end and epilogue_begin markers more
accurately.
The DWARF Spec, in "6.4 Call Frame Information", says:

> The code that allocates space on the call frame stack and performs the
> save operation is called the subroutine’s prologue, and the code that
> performs the restore operation and deallocates the frame is called its
> epilogue.
This means the callee-saved register spills and reloads are part of the
prologue (a.k.a. frame setup) and epilogue (a.k.a. frame destruction),
respectively. And, IIUC, the LLVM backend uses the
FrameSetup/FrameDestroy flags to identify instructions that are part of
call frame setup and destruction.
In trunk, while most targets consistently set FrameSetup/FrameDestroy
on the save/restore call frame information (CFI) instructions for
callee-saved registers, they do not consistently set those flags on the
actual callee-saved register spill/reload instructions.
I believe this patch provides a clean mechanism to set the
FrameSetup/FrameDestroy flags on the actual callee-saved register
spill/reload instructions as needed. And, because Flags has a default
argument of MachineInstr::NoFlags, this patch is an NFC.
With this patch, targets just have to pass the FrameSetup/FrameDestroy
flag to the storeRegToStackSlot/loadRegFromStackSlot calls in their
derived spillCalleeSavedRegisters and restoreCalleeSavedRegisters to set
those flags on callee-saved register spill/reload instructions, as
sketched below.
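For example, a target override might look roughly like this (a hedged sketch: "Foo" is a stand-in target and the boilerplate is abridged; the point is the trailing MachineInstr::FrameSetup argument):

```
#include "llvm/ADT/ArrayRef.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"

using namespace llvm;

bool FooFrameLowering::spillCalleeSavedRegisters(
    MachineBasicBlock &MBB, MachineBasicBlock::iterator MI,
    ArrayRef<CalleeSavedInfo> CSI, const TargetRegisterInfo *TRI) const {
  const TargetInstrInfo &TII = *MBB.getParent()->getSubtarget().getInstrInfo();
  for (const CalleeSavedInfo &CS : CSI) {
    Register Reg = CS.getReg();
    const TargetRegisterClass *RC = TRI->getMinimalPhysRegClass(Reg);
    // The new trailing argument marks the spill as part of the prologue.
    TII.storeRegToStackSlot(MBB, MI, Reg, /*isKill=*/true, CS.getFrameIdx(),
                            RC, TRI, Register(), MachineInstr::FrameSetup);
  }
  return true;
}
```

restoreCalleeSavedRegisters would do the same with loadRegFromStackSlot and MachineInstr::FrameDestroy.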
Also, this patch makes it very easy to set the source line information
on callee-saved register spill/reload instructions, which the
DwarfDebug.cpp implementation needs in order to set the prologue_end and
epilogue_begin markers more accurately.
As per the DwarfDebug.cpp implementation:

> prologue_end is the first known non-DBG_VALUE and non-FrameSetup
> location that marks the beginning of the function body.
>
> epilogue_begin is the first FrameDestroy location that has been seen
> in the epilogue basic block.
With this patch, targets just have to do the following to set the
source line information on callee-saved register spill/reload
instructions, without hampering LLVM's efforts to avoid attaching source
line information to artificial code generated by the compiler:
```
<Foo>InstrInfo::storeRegToStackSlot() {
  ...
  DebugLoc DL =
      Flags & MachineInstr::FrameSetup ? DebugLoc() : MBB.findDebugLoc(I);
  ...
}

<Foo>InstrInfo::loadRegFromStackSlot() {
  ...
  DebugLoc DL =
      Flags & MachineInstr::FrameDestroy ? MBB.findDebugLoc(I) : DebugLoc();
  ...
}
```
While I understand this patch would break out-of-tree backend builds, I
think it is a step in the right direction.
One immediate use case that benefits from this patch: fixing #120553
becomes simpler.
Use uppercase in the subvector description ("32x2" -> "32X2" etc.) - this matches what we already do in VBROADCAST??X?, and we try to use uppercase for all x86 instruction mnemonics anyway (with lowercase just for the arg description suffix).
FPCLASS is a unary instruction with an immediate operand - update the naming to match similar instructions (e.g. VPSHUFD) by using only the source reg/mem and immediate in the instruction name.
This code assumed only PUSHes would appear in call sequences. However,
if calls require frame-pointer/base-pointer spills, only the PUSH
operations inserted by spillFPBP would be recognized, and the
adjustments to frame object offsets in prologepilog would be incorrect.
This change correctly reports the SP adjustment for POPs and for ADD/SUB
to rsp, and adds an assertion for unrecognized instructions that modify
rsp.
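A sketch of the shape of that reporting, in the spirit of X86InstrInfo::getSPAdjust (the opcode set, operand indices, and surrounding code are illustrative, not the actual pass code):

```
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/Support/ErrorHandling.h"
// Also requires the X86 target's generated opcode definitions.

// Report how each instruction a call sequence may now contain moves rsp,
// and assert on anything unrecognized that modifies it.
static int64_t getCallSequenceSPDelta(const llvm::MachineInstr &MI) {
  switch (MI.getOpcode()) {
  case llvm::X86::PUSH64r: // push: rsp -= 8
    return -8;
  case llvm::X86::POP64r: // pop: rsp += 8
    return 8;
  case llvm::X86::ADD64ri32: // add $imm, %rsp
    return MI.getOperand(2).getImm();
  case llvm::X86::SUB64ri32: // sub $imm, %rsp
    return -MI.getOperand(2).getImm();
  default:
    llvm_unreachable("unrecognized instruction modifying rsp");
  }
}
```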
This patch makes the `VBROADCAST***X**` subvector broadcast instructions consistent - the `***X**` section represents the original subvector type/size, but we were not correctly using the AVX512 Z/Z256/Z128 suffix to consistently represent the destination width (or we missed it entirely).
This patch prepares the NFC groundwork for global outlining using
CGData, which will follow
https://github.com/llvm/llvm-project/pull/90074.
- The `MinRepeats` parameter is now explicitly passed to the
`getOutliningCandidateInfo` function rather than relying on a default
value of 2 (see the signature sketch after this list). For local
outlining, the minimum number of repetitions is typically 2, but for
global outlining (mentioned above) we will optimistically create a
single `Candidate` for each `OutlinedFunction` if stable hashes match a
specific code sequence. This parameter is adjusted accordingly in global
outlining scenarios.
- I have also adopted `std::unique_ptr` for `OutlinedFunction` to
ensure safe and efficient memory management within `FunctionList`,
avoiding unnecessary implicit copies.
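The updated hook then looks roughly like this (a sketch from the description above; the parameter list is abridged and names may differ from the actual declaration):

```
// MinRepeats is now an explicit parameter (previously an implicit minimum
// of 2), and the OutlinedFunction is owned via std::unique_ptr so entries
// in FunctionList are never implicitly copied.
virtual std::optional<std::unique_ptr<outliner::OutlinedFunction>>
getOutliningCandidateInfo(
    std::vector<outliner::Candidate> &RepeatedSequenceLocs,
    unsigned MinRepeats) const;
```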
This depends on https://github.com/llvm/llvm-project/pull/101461.
This is a patch for
https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753.
The renamable flag is useful during MachineCopyPropagation, but it is
dropped after lowerCopy in some cases. This patch introduces extra
arguments to pass the renamable flag through to copyPhysReg, as sketched
below.
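A sketch of the resulting hook (types and parameter names abridged; the two trailing booleans are the new arguments and default to false so existing call sites keep compiling):

```
virtual void copyPhysReg(MachineBasicBlock &MBB,
                         MachineBasicBlock::iterator MI, const DebugLoc &DL,
                         Register DestReg, Register SrcReg, bool KillSrc,
                         bool RenamableDest = false,
                         bool RenamableSrc = false) const;
```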