When using PAuthLR, the PAUTH_PROLOGUE expands into a sequence of
instructions which takes the address of one of those instructions, and
uses that address to compute the return address signature. If this is
duplicated, there will be two different addresses used in calculating
the signature, so the epilogue will only be correct for (at most) one of
them.
This change also restricts code generation when using v8.3-A return
address signing, without PAuthLR. This isn't strictly needed, as
duplicating the prologue there would be valid. We could fix this by
having two copies of PAUTH_PROLOGUE, with and without isNotDuplicable,
but I don't think it's worth adding the extra complexity to a security
feature for that.
(cherry picked from commit 36b3c43524c8ca86a5050496b8773f07c5ccddff)
This PR replaces the deleted ext with the promoted value in `AddrMode`.
Fixes#70938.
(cherry picked from commit 3c6aa04cf4dee65113e2a780b9f90b36bb4c4e04)
After we fall back from GlobalISel to SDAG, the verifier gets called,
which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs
which caches the result of UsesAGPRs. Because we have just fallen-back
the function is empty and it incorrectly gets cached to false. This
patch makes sure we don't try to run the verifier whilst the function is
empty.
(cherry picked from commit 66e0498dafbfa7f8fd7deaa88ae62bdf38a12113)
Whilst adding a cross-block test, I encountered an assertion failure in
the second pass where we check the instruction popped off the worklist
is a candidate.
The leaf instruction %c in this case will be added to the worklist when
its VL is VLMAX, but during the first pass it will have its VL reduced
to 1.
Then in the second pass when its processed via the worklist, isCandidate
will no longer be true due to its VL == 1.
This fixes it by moving the VL == 1 check to tryReduceVL, keeping it
alongside the other VL check for bailing out early as an optimisation.
Emil Tsalapatis from Meta reported such a case where 'may_goto 0' insn
is generated by clang compiler. But 'may_goto 0' insn is actually a
no-op so it makes sense to remove that in llvm. The patch is also able
to handle the following code pattern
```
...
may_goto 2
may_goto 1
may_goto 0
...
```
where three may_goto insns can all be removed.
---------
Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
This patch contains a number of changes relating to the above flag;
primarily it updates comment references to the old flag names,
"-fextend-lifetimes" and "-fextend-this-ptr" to refer to the new names,
"-fextend-variable-liveness[={all,this}]". These changes are all NFC.
This patch also removes the explicit -fextend-this-ptr-liveness flag
alias, and shortens the help-text for the main flag; these are both
changes that were meant to be applied in the initial PR (#110000), but
due to some user-error on my part they were not included in the merged
commit.
The DwarfDebug.cpp implementation expects the epilogue instructions to
have source location of last non-debug instruction after which the epilogue
instructions are inserted. The epilogue_begin is set on location of the first
FrameDestroy instruction with source line information that has been seen in
the epilogue basic block.
In the trunk, the risc-v backend sets the epilogue_begin after the epilogue has
actually begun i.e. after callee saved register reloads and the source line
information is not set on those reload instructions. This is leading to #120553
where, while debugging, breaking on or single stepping to the epilogue_begin
location will make accessing the variables from wrong place as the FP has been
restored to the parent frame's FP.
To fix that, this patch sets FrameSetup/FrameDestroy flags on the callee saved
register spill/reload instructions which is actually correct. Then the
RISCVInstrInfo::loadRegFromStackSlot uses FrameDestroy flag to identify a
reload of the callee saved register in the epilogue and copies the source
line information from insert position instruction to that reload instruction.
Requires PR #120622Fixes#120553
Since 3494ee95902cef62f767489802e469c58a13ea04 APInt stopped to
implicitly truncate values, therefore it asserts on a big signed value
converted to (implicitly) unsigned APInt.
The change explicitly marks offset as a signed value.
In the RemoveLoadsIntoFakeUses pass, we try to remove loads that are
only used by fake uses, as well as the fake use in question. There are
two existing errors with the pass however: it incorrectly examines every
operand of each FAKE_USE, when only the first is relevant (extra
operands will just be "killed" regs assigned by a previous pass), and it
ignores cases where the FAKE_USE register is not an exact match for the
loaded register, which is incorrect as regalloc may choose to load a
wider value than the FAKE_USE required pre-regalloc. This patch fixes
both of these cases.
This is a fix for:
https://github.com/llvm/llvm-project/issues/97290
Please let me know if that is the right way to address the issue. Thank
you!
---------
Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
If the operands to `INSERT_SUBVECTOR` can't be widened legally, just
replace the `INSERT_SUBVECTOR` with a series of `INSERT_VECTOR_ELT`.
Closes#124255 (and possibly #102016)
Patch was reverted due to test case (added) exposing an infinite loop in
combiner, where (shl C1, C2) create by performSHLCombine isn't
constant-folded:
Combining: t14: i64 = shl t12, Constant:i64<1>
Creating new node: t36: i64 = shl
OpaqueConstant:i64<-2401053089408754003>, Constant:i64<1>
Creating new node: t37: i64 = shl t6, Constant:i64<1>
Creating new node: t38: i64 = and t37, t36
... into: t38: i64 = and t37, t36
...
Combining: t38: i64 = and t37, t36
Creating new node: t39: i64 = and t6,
OpaqueConstant:i64<-2401053089408754003>
Creating new node: t40: i64 = shl t39, Constant:i64<1>
... into: t40: i64 = shl t39, Constant:i64<1>
and subsequently gets simplified by DAGCombiner::visitAND:
// Simplify: (and (op x...), (op y...)) -> (op (and x, y))
if (N0.getOpcode() == N1.getOpcode())
if (SDValue V = hoistLogicOpWithSameOpcodeHands(N))
return V;
before being folded by performSHLCombine once again and so on.
The combine in performSHLCombine should only be done if (shl C1, C2) can
be constant-folded, it may otherwise be unsafe and generally have a
worse end result. Thanks to Dave Sherwood for his insight on this one.
This reverts commit f719771f251d7c30eca448133fe85730f19a6bd1.
``` - add clang builtin to Builtins.td
- link builtin in hlsl_intrinsics
- add codegen for spirv intrinsic and two directx intrinsics to retain
signedness information of the operands in CGBuiltin.cpp
- add semantic analysis in SemaHLSL.cpp
- add lowering of spirv intrinsic to spirv backend in
SPIRVInstructionSelector.cpp
- add lowering of directx intrinsics to WaveActiveOp dxil op in
DXIL.td
- add test cases to illustrate passespendent pr merges.
```
Resolves#99170
We currently have an issue where bf16 patters can be used to match fp16
types, as GISel does not know about the difference between the two. This
patch explicitly disables them to make sure that they are never used.
The opposite can also happen too, where fp16 patterns are used for
operators that should be bf16. So this also changes any operations with
bf16 types to now cause a fallback to SDAG.
The pass setup for GISel has been slightly adjusted to make sure that a
verify pass does not get added between AMD-SDAG and SIFixSGPRCopiesPass,
which otherwise can cause verifier issues when falling back.
Spotted the auipc case while looking at code for P550. I'm not sure this
is the right long term fix. We're still missing rematerialization
opportunities for these pairs so a pseudo might be better. That would
interfere with folding auipc+add into load/store addressing though.
Fixes#76779.
This blocks rematerialization during scheduling if the instruction has a
non accepted PhysReg use.
Currently, there aren't any checks like this in place, and we may create
invalid code: https://godbolt.org/z/xjPjdcorf
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`SXTB`, `UXTB`, `SXTH`, `UXTH`, `SXTW`, and `UXTW` instructions.
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`RBIT`, `REVB`, `REVH`, `REVW`, and `REVD` instructions.
We currently check for passthrus in two places, on the instruction to
reduce in isCandidate, and on the users in checkUsers.
We cannot reduce the VL if an instruction has a user that's a passthru,
because the user will read elements past VL in the tail.
However it's fine to reduce an instruction if it itself contains a
non-undef passthru. Since the VL can only be reduced, not increased, the
previous tail will always remain the same.
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`URECPE`, `URSQRTE`, `SQABS` and `SQNEG` instructions.
On the CPUs listed below, we want to avoid LDAPUR for performance
reasons. Add a tuning feature to disable them when using:
-mcpu=neoverse-v2
-mcpu=neoverse-v3
-mcpu=cortex-x3
-mcpu=cortex-x4
-mcpu=cortex-x925
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`FLOGB` instructions.
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`FRINTx`, `FRECPX`, and `FSQRT` instructions.
If you try to create a stack frame of 4 GiB or larger with a 32-bit
stack pointer, we currently emit invalid instructions like `mov eax,
5000000000` (unless you specify `-fstack-clash-protection`, in which
case we emit a trap instead).
The trap seems nicer, so let's do that in all cases. This avoids
emitting invalid instructions, and also fixes the "can't have 32-bit
16GB stack frame" assertion in `X86FrameLowering::emitSPUpdate()` (which
used to be triggerable by user code, but is now correct).
This was originally part of #124041.
@phoebewang
If we have a (vselect c, a+b, a-b), we can combine this to a+(vselect c,
b, -b). That by itself isn't hugely profitable, but if we reverse the
select, we get a form which matches a masked vrsub.vi with zero. The
result is that we can use a masked vrsub *before* the add instead of a
masked add or sub. This doesn't change the critical path (since we
already had the pass through on the masked second op), but does reduce
register pressure since a, b, and (a+b) don't need to all be alive at
once.
In addition to the vselect form, we can also see the same pattern with a
vector_shuffle encoding the vselect. I explored canonicalizing these to
vselects instead, but that exposes several unrelated missing combines.
If we're passing an i128 value and we no longer have enough argument
registers (only r9 unallocated), the value gets passed via the stack.
However, r9 is still allocated as a shadow register, which means that a
following i64 argument will not use it. This doesn't match the x86-64
psABI.
Fix this by making i128 arguments as requiring consecutive registers,
and then adding a custom CC lowering that will allocate both parts of
the i128 at the same time, either to register or to stack, without
reserving a shadow register.
Fixes https://github.com/llvm/llvm-project/issues/123935.
Currently, the AMDGPU backend bumps the Stack Pointer
by fixed size offsets in the prolog of device functions, and
restores it by the same amount in the epilog.
Prolog:
sp += frameSize
Epilog:
sp -= frameSize
If a function has dynamic stack realignment,
Prolog:
sp += frameSize + max_alignment
Epilog:
sp -= frameSize + max_alignment
These calculations are not optimal in case of dynamic
stack realignment, and completely fail in case of
dynamic stack readjustment.
This patch uses the saved Frame Pointer to restore SP.
Prolog:
fp = sp
sp += frameSize
Epilog:
sp = fp
In case of dynamic stack realignment, SP is restored from
the saved Base Pointer.
Prolog:
fp = sp + (max_alignment - 1)
fp = fp & (-max_alignment)
bp = sp
sp += frameSize + max_alignment
Epilog:
sp = bp
(Note: The presence of BP has been enforced in case of any
dynamic stack realignment.)
---------
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
In function handleMFLOSlot, we may get a variable LastInstInFunction
with a value of true from function getNextMachineInstr and IInSlot may
be null which would trigger an assert.
So we need to skip this case.
Fix#118223.
Change existing code for G_PHI to match what LLVM-IR version is doing
via PHINode::hasConstantOrUndefValue. This is not safe for regular PHI
since it may appear with an undef operand and getVRegDef can fail.
Most notably this improves number of values that can be allocated
to sgpr in AMDGPURegBankSelect.
Common case here are phis that appear in structurize-cfg lowering
for cycles with multiple exits:
Undef incoming value is coming from block that reached cycle exit
condition, if other incoming is uniform keep the phi uniform despite
the fact it is joining values from pair of blocks that are entered
via divergent condition branch.
Add IDs for bit width that cover multiple LLTs: B32 B64 etc.
"Predicate" wrapper class for bool predicate functions used to
write pretty rules. Predicates can be combined using &&, || and !.
Lowering for splitting and widening loads.
Write rules for loads to not change existing mir tests from old
regbankselect.
Lower G_ instructions that can't be inst-selected with register bank
assignment from AMDGPURegBankSelect based on uniformity analysis.
- Lower instruction to perform it on assigned register bank
- Put uniform value in vgpr because SALU instruction is not available
- Execute divergent instruction in SALU - "waterfall loop"
Given LLTs on all operands after legalizer, some register bank
assignments require lowering while other do not.
Note: cases where all register bank assignments would require lowering
are lowered in legalizer.
AMDGPURegBankLegalize goals:
- Define Rules: when and how to perform lowering
- Goal of defining Rules it to provide high level table-like brief
overview of how to lower generic instructions based on available
target features and uniformity info (uniform vs divergent).
- Fast search of Rules, depends on how complicated Rule.Predicate is
- For some opcodes there would be too many Rules that are essentially
all the same just for different combinations of types and banks.
Write custom function that handles all cases.
- Rules are made from enum IDs that correspond to each operand.
Names of IDs are meant to give brief description what lowering does
for each operand or the whole instruction.
- AMDGPURegBankLegalizeHelper implements lowering algorithms
Since this is the first patch that actually enables -new-reg-bank-select
here is the summary of regression tests that were added earlier:
- if instruction is uniform always select SALU instruction if available
- eliminate back to back vgpr to sgpr to vgpr copies of uniform values
- fast rules: small differences for standard and vector instruction
- enabling Rule based on target feature - salu_float
- how to specify lowering algorithm - vgpr S64 AND to S32
- on G_TRUNC in reg, it is up to user to deal with truncated bits
G_TRUNC in reg is treated as no-op.
- dealing with truncated high bits - ABS S16 to S32
- sgpr S1 phi lowering
- new opcodes for vcc-to-scc and scc-to-vcc copies
- lowering for vgprS1-to-vcc copy (formally this is vgpr-to-vcc G_TRUNC)
- S1 zext and sext lowering to select
- uniform and divergent S1 AND(OR and XOR) lowering - inst-selected into
SALU instruction
- divergent phi with uniform inputs
- divergent instruction with temporal divergent use, source instruction
is defined as uniform(AMDGPURegBankSelect) - missing temporal
divergence lowering
- uniform phi, because of undef incoming, is assigned to vgpr. Will be
fixed in AMDGPURegBankSelect via another fix in machine uniformity
analysis.