This reverts commit 122efef8ee9be57055d204d52c38700fe933c033.
- Patch fixed to not reuse definitions from predecessors in EH landing pads.
- Late review suggestions (by MaskRay) have been addressed.
- M68k/pipeline.ll test updated.
- Init captures added in processBlock() to avoid capturing structured bindings.
- RISCV has this disabled for now.
Original commit message:
A new pass MachineLateInstrsCleanup is added to be run after PEI.
This is a simple pass that removes redundant and identical instructions
whenever found by scanning the MF once while keeping track of register
definitions in a map. These instructions are typically immediate loads
resulting from rematerialization, and address loads emitted by target in
eliminateFrameInde().
This is enabled by default, but a target could easily disable it by means of
'disablePass(&MachineLateInstrsCleanupID);'.
This late cleanup is naturally not "optimal" in removing instructions as it
is done by looking at phys-regs, but still quite effective. It would be
desirable to improve other parts of CodeGen and avoid these redundant
instructions in the first place, but there are no ideas for this yet.
Differential Revision: https://reviews.llvm.org/D123394
Reviewed By: RKSimon, foad, craig.topper, arsenm, asb
Init captures added in processBlock() to avoid capturing structured bindings,
which caused the build problems (with clang).
RISCV has this disabled for now until problems relating to post RA pseudo
expansions are resolved.
A new pass MachineLateInstrsCleanup is added to be run after PEI.
This is a simple pass that removes redundant and identical instructions
whenever found by scanning the MF once while keeping track of register
definitions in a map. These instructions are typically immediate loads
resulting from rematerialization, and address loads emitted by target in
eliminateFrameInde().
This is enabled by default, but a target could easily disable it by means of
'disablePass(&MachineLateInstrsCleanupID);'.
This late cleanup is naturally not "optimal" in removing instructions as it
is done by looking at phys-regs, but still quite effective. It would be
desirable to improve other parts of CodeGen and avoid these redundant
instructions in the first place, but there are no ideas for this yet.
Differential Revision: https://reviews.llvm.org/D123394
Reviewed By: RKSimon, foad, craig.topper, arsenm, asb
We support AMX-FP16 isa in https://reviews.llvm.org/D135941 now.
The old intrinsic interface need to manually write tile registers.
So we support its new intrinsic interface to let it be able to do register allocation.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D138987
AMX shape should be defined before AMX intrinsics. However for below
case, the shape a.row is defined after tile load of b. If we transform
`load b` to `@llvm.x86.tileloadd64 intrinsic`, the shape dependency
doesn't meet.
```
void test_tile_dpbsud(__tile1024i a, __tile1024i b, __tile1024i c) {
__tile_dpbsud(&c, a, b);
}
```
This patch is to store the tile b to stack and reloaded it after the
def of b.row. It would cause redundant store/load, but it is simple
to avoid generating invalid IR.
The better way may hoist `def b.row` before tile load instruction,
but it seems more complicated to recursively hoist its operands.
Differential Revision: https://reviews.llvm.org/D137923
This stops reporting CostPerUse 1 for `R8`-`R15` and `XMM8`-`XMM31`.
This was previously done because instruction encoding require a REX
prefix when using them resulting in longer instruction encodings. I
found that this regresses the quality of the register allocation as the
costs impose an ordering on eviction candidates. I also feel that there
is a bit of an impedance mismatch as the actual costs occure when
encoding instructions using those registers, but the order of VReg
assignments is not primarily ordered by number of Defs+Uses.
I did extensive measurements with the llvm-test-suite wiht SPEC2006 +
SPEC2017 included, internal services showed similar patterns. Generally
there are a log of improvements but also a lot of regression. But on
average the allocation quality seems to improve at a small code size
regression.
Results for measuring static and dynamic instruction counts:
Dynamic Counts (scaled by execution frequency) / Optimization Remarks:
Spills+FoldedSpills -5.6%
Reloads+FoldedReloads -4.2%
Copies -0.1%
Static / LLVM Statistics:
regalloc.NumSpills mean -1.6%, geomean -2.8%
regalloc.NumReloads mean -1.7%, geomean -3.1%
size..text mean +0.4%, geomean +0.4%
Static / LLVM Statistics:
mean -2.2%, geomean -3.1%) regalloc.NumSpills
mean -2.6%, geomean -3.9%) regalloc.NumReloads
mean +0.6%, geomean +0.6%) size..text
Static / LLVM Statistics:
regalloc.NumSpills mean -3.0%
regalloc.NumReloads mean -3.3%
size..text mean +0.3%, geomean +0.3%
Differential Revision: https://reviews.llvm.org/D133902
When we fill the shape to tile configure memory, the shape is gotten
from AMX pseudo instruction. However the register for the shape may be
split or spilled by greedy RA. That cause we fill the shape to config
memory after ldtilecfg is executed, so that the shape configuration
would be wrong.
This patch is to split the tile register allocation from greedy register
allocation, so that after tile registers are allocated the shape
registers are still virtual register. The shape register only may be
redefined or multi-defined by phi elimination pass, two address pass.
That doesn't affect tile register configuration.
Differential Revision: https://reviews.llvm.org/D128584
There are some codegen differences here, because presence of
bitcasts affects AMX codegen in minor ways (the bitcasts are not
always in the input IR, but may be added by X86PreAMXConfig
for example).
Differential Revision: https://reviews.llvm.org/D128424
Use an IRBuilder to insert instructions in preWriteTileCfg().
While here, also remove some unnecessary bool return values.
There are some test changes because the IRBuilder folds
"trunc i16 8 to i8" to "i8 8", and that has knock-on effects on
instruction naming.
I ran into this when converting tests to opaque pointers and
noticed that this pass introduces unnecessary "bitcast ptr to ptr"
instructions.
This runs the test through -instnamer and generates test checks
using update_test_checks.py. (The previous comment indicated that
update_llc_test_checks.py was used, but I rather doubt that.)
This relies on the non-determinism fix from
fbb72530fe80a95678a7d643d7a3f5ee8d693c93,
the previous check lines have apparently been written to accomodate
that non-determinism.
Test updates were performed using:
https://gist.github.com/nikic/98357b71fd67756b0f064c9517b62a34
These are only the test updates where the test passed without
further modification (which is almost all of them, as the backend
is largely pointer-type agnostic).
There is intrinsic `@llvm.x86.ldtilecfg` which is lowered to LDTILECFG.
This intrinsic is open for user to configure tile registers by
themselves. There is a chance that `@llvm.x86.ldtilecfg` would be mixed
with the new AMX intrinsics which depend on compiler to configure tile
registers. Separate pusedo instruction PLDTILECFGV would avoid
unexpected behavious when `@llvm.x86.ldtilecfg` is mixed with new AMX
intrinsics. Though user should not mix the two programming model,
compiler should avoid crash or UB when they are mixed.
Differential Revision: https://reviews.llvm.org/D126519
The previous solution depends on variable name to record the shape
information. However it is not reliable, because in release build
compiler would not set the variable name. It can be accomplished with an
additional option `fno-discard-value-names`, but it is not acceptable
for users.
This patch is to preconfigure the tile register with machine
instruction. It follow the same way what sigle configure does. In the
future we can fall back to multiple configure when single configure
fails due to the shape dependency issue.
The algorithm to configure the tile register is simple in the patch. We
may improve it in the future. It configure tile register based on basic
block. Compiler would spill the tile register if it live out the basic
block. After the configure there should be no spill across tile
confgiure in the register alloction. Just like fast register allocation
the algorithm walk the instruction in reverse order. When the shape
dependency doesn't meet, it insert ldtilecfg after the last instruction
that define the shape.
In post configuration compiler also walk the basic block to collect the
physical tile register number and generate instruction to fill the stack
slot for the correponding shape information.
TODO: There is some following work in D125602. The risk is modifying the
fast RA may cause regression as fast RA is usded for different targets.
We may create an independent RA for tile register.
Differential Revision: https://reviews.llvm.org/D125075
To generate zero value, the PXOR instruction need 3 operands that is
tied to the same vreg. If is not good in SSA form and with undef value
two address instruction pass may convert
`%0:vr128 = PXORrr undef %0, undef %0`
to `%1:vr128 = PXORrr undef %1:vr128(tied-def 0), undef %0:vr128`.
It is not expected.
It can be simplified to SET0 instruction which only take 1 destination
operand. It should be more friendly to two address instruction pass and
register allocation pass.
`%0:vr128 = V_SET0`
Also add AVX1 code path so that it is consistant to other code.
Differential Revision: https://reviews.llvm.org/D124903
The `llvm.x86.cast.tile.to.vector` intrinsic is lowered to
`llvm.x86.tilestored64.internal` and `load <256 x i32>`. The
`llvm.x86.cast.vector.to.tile` is lowered to `store <256 x i32>` and
`llvm.x86.tileloadd64.internal`. When `llvm.x86.cast.tile.to.vector` is
used by `store <256 x i32>` or `load <256 x i32>` is used by
`llvm.x86.cast.vector.to.tile`, they can be combined by
`llvm.x86.tilestored64.internal` and `llvm.x86.tileloadd64.internal`.
Differential Revision: https://reviews.llvm.org/D124378
Instead of report fatal error, this patch emit error message and exit
when shapes are not pre-defined. This would cause the compiling fail but
not crash.
Differential Revision: https://reviews.llvm.org/D124342
When walk the user chain to get the shape of a phi node. If it is phi
node in the chain, we should walk to the user of this phi node instead
of the original phi node.
The AMX combiner would store undef or zero to stack and invoke tileload
to load the data to tile register. To avoid the store/load, we can
materialzie undef or zero value to tilezero.
Differential Revision: https://reviews.llvm.org/D122714
We should avoid mixing old AMX instrinsic with new AMX intrinsic. For
old AMX intrinsic, user is responsible for invoking tile release. This
patch is to check if there is any tile config generated by compiler. If
so it emit tilerelease instruction, otherwise it don't emit the
instruction.
Differential Revision: https://reviews.llvm.org/D114066
There is some discussion on the bitcast for vector and x86_amx at https://reviews.llvm.org/D99152. This patch is to introduce a x86 specific cast for vector and x86_amx, so that it can avoid some unnecessary optimization by middle-end. On the other way, we have to optimize the x86 specific cast by ourselves. This patch also optimize the cast operation to eliminate redundant code.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D107544
The motivation is that the update script has at least two deviations
(`<...>@GOT`/`<...>@PLT`/ and not hiding pointer arithmetics) from
what pretty much all the checklines were generated with,
and most of the tests are still not updated, so each time one of the
non-up-to-date tests is updated to see the effect of the code change,
there is a lot of noise. Instead of having to deal with that each
time, let's just deal with everything at once.
This has been done via:
```
cd llvm-project/llvm/test/CodeGen/X86
grep -rl "; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py" | xargs -L1 <...>/llvm-project/llvm/utils/update_llc_test_checks.py --llc-binary <...>/llvm-project/build/bin/llc
```
Not all tests were regenerated, however.
Register allocation may spill virtual registers to the stack, which can
increase alignment requirements of the stack frame. If the the function
did not require stack realignment before register allocation, the
registers required to do so may not be reserved/available. This results
in a stack frame that requires realignment but can not be realigned.
Instead, only increase the alignment of the stack if we are still able
to realign.
The register SpillAlignment will be ignored if we can't realign, and the
backend will be responsible for emitting the correct unaligned loads and
stores. This seems to be the assumed behaviour already, e.g.
ARMBaseInstrInfo::storeRegToStackSlot and X86InstrInfo::storeRegToStackSlot
are both `canRealignStack` aware.
Differential Revision: https://reviews.llvm.org/D103602
The previous code detect if a MBB is bottom block to determine if it is
a backedge of a loop. We should check latch block instead of bottom
block and we should check the header and the bottom block are in the
same loop.
Differential Revision: https://reviews.llvm.org/D103145
This reverts commit 3b8ec86fd576b9808dc63da620d9a4f7bbe04372.
Revert "[X86] Refine AMX fast register allocation"
This reverts commit c3f95e9197643b699b891ca416ce7d72cf89f5fc.
This pass breaks using LLVM in a multi-threaded environment by
introducing global state.
We request no intersections between AMX instructions and their shapes'
def when we insert ldtilecfg. However, this is not always ture resulting
from not only users don't follow AMX API model, but also optimizations.
This patch adds a mechanism that tries to hoist AMX shapes' def as well.
It only hoists shapes inside a BB, we can improve it for cases across
BBs in future. Currently, it only hoists shapes of which all sources' def
above the first AMX instruction. We can improve for the case that only
source that moves an immediate value to a register below AMX instruction.
Reviewed By: xiangzhangllvm
Differential Revision: https://reviews.llvm.org/D101067
We request no intersections between AMX instructions and their shapes'
def when we insert ldtilecfg. However, this is not always ture resulting
from not only users don't follow AMX API model, but also optimizations.
This patch adds a mechanism that tries to hoist AMX shapes' def as well.
It only hoists shapes inside a BB, we can improve it for cases across
BBs in future. Currently, it only hoists shapes of which all sources' def
above the first AMX instruction. We can improve for the case that only
source that moves an immediate value to a register below AMX instruction.
Differential Revision: https://reviews.llvm.org/D101067
This is a follow up of D99010. We didn't consider the live range of shape registers when hoist ldtilecfg. There maybe risks, e.g. we happen to insert it to an invalid range of some registers and get unexpected error.
This patch fixes this problem by storing the value to corresponding stack place of ldtilecfg after all its definition immediately.
This patch also fix a problem in previous code: If we don't have a ldtilecfg which dominates all AMX instructions, we cannot initialize shapes for other ldtilecfg.
There're still some optimization points left. E.g. eliminate unused mov instructions, break the def-use dependency before RA etc.
Reviewed By: LuoYuanke, xiangzhangllvm
Differential Revision: https://reviews.llvm.org/D99966
The previous code calculated the first ldtilecfg by dominating all AMX registers' def. This may result in the ldtilecfg being inserted into a loop.
This patch try to calculate the nearest point where all shapes of AMX registers are reachable.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D99010