Closes#137023
On RISC-V machines without a native multiply instruction (e.g., `rv32i`
base), multiplying a variable by a constant integer often compiles to a
call to a library routine like `__mul{s,d}i3`.
```assembly
.globl __mulxi3
.type __mulxi3, @function
__mulxi3:
mv a2, a0
mv a0, zero
.L1:
andi a3, a1, 1
beqz a3, .L2
add a0, a0, a2
.L2:
srli a1, a1, 1
slli a2, a2, 1
bnez a1, .L1
ret
```
This library function implements multiplication in software using a loop
of shifts and adds, processing the constant bit by bit. On rv32i, it
requires a minimum of 8 instructions (for multiply by `0`) and up to
about 200 instructions (by `0xffffffff`), involves heavy branching and
function call overhead.
When not optimizing for size, we could expand the constant
multiplication into a sequence of shift and add/sub instructions. For
now we use non-adjacent form for the shift and add/sub sequence, which
could save 1/2 - 2/3 instructions compared to a shl+add-only sequence.
This is a tiny change that can save up to 16 bytes of stack allocation,
which is more beneficial on RV32 than RV64.
cm.push allocates multiples of 16 bytes, but only uses a subset of those
bytes for pushing callee-saved registers. Up to 12 (rv32) or 8 (rv64)
bytes are left unused, depending on how many registers are pushed.
Before this change, we told LLVM that the entire allocation was used, by
creating a fixed stack object which covered the whole allocation.
This change instead gives an accurate extent to the fixed stack object,
to only cover the registers that have been pushed. This allows the
PrologEpilogInserter to use any unused bytes for spills. Potentially
this saves an extra move of the stack pointer after the push, because
the push can allocate up to 48 more bytes than it needs for registers.
We cannot do the same change for save/restore, because the restore
routines restore in batches of `stackalign/(xlen/8)` registers, and we
don't want to clobber the saved values of registers that we didn't tell
the compiler we were saving/restoring - for instance `__riscv_restore_0`
is used by the compiler when it only wants to save `ra`, but will end up
restoring `ra` and `s0`.
This an alternative to #84935 to fix the miscompile, but not be optimal.
The immediate for cm.push/pop must be a multiple of 16. For RVE, it
might not be. It's not easy to increase the stack size without messing
up cfa directives and maybe other things.
This patch rounds the stack size down to a multiple of 16 before
clamping it to 48. This causes an extra addi to be emitted to handle the
remainder.
Once this commited, I can commit #84989 to add verification for these
instructions being generated with valid offsets.
When we are using the Zcmp extension together with the E extension in
32-bit mode and we need to spill both callee-saved registers as well as
needing a couple of 32-bit stack slots, we emit a meaningless stack
adjustment with cm.push/cm.popret. Furthermore this leads to the stack
slot for the ra being clobbered so control returns to a random location.
This is just a pre-commit test so that the PR for the fix shows the
difference in code generation.