Add the llvm.amdgcn.call.whole.wave intrinsic for calling whole wave
functions. Its first argument is the callee, which must use the
amdgpu_gfx_whole_wave calling convention; the remaining arguments are the
call parameters, which must match the callee's signature except for the
first argument (the i1 original EXEC mask, which doesn't need to be
passed in). Indirect calls are not allowed.
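The intended usage looks roughly like this (function names and types here
are illustrative, not taken from the actual tests):
```llvm
; The callee receives the original EXEC mask as its first argument; the
; intrinsic lowering provides it, so the caller does not pass it.
define amdgpu_gfx_whole_wave i32 @callee(i1 %active, i32 %x) {
  %y = select i1 %active, i32 %x, i32 0
  ret i32 %y
}

define amdgpu_gfx i32 @caller(i32 %x) {
  ; Only the arguments after the i1 are supplied at the call site.
  %ret = call i32(ptr, ...) @llvm.amdgcn.call.whole.wave(ptr @callee, i32 %x)
  ret i32 %ret
}
```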
Make direct calls to amdgpu_gfx_whole_wave functions a verifier error.
Unspeakable horrors happen around calls from whole wave functions; the
plan is to improve the handling of caller/callee-saved registers in
a future patch.
Tail calls will also be handled in a future patch.
This change adds support for folding a SETCC when one or both of the
operands is a TRUNCATE with the appropriate no-wrap flags. This pattern
can occur when promoting i8 operations in NVPTX, and we currently have
some ISel rules to try to handle it.
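At the IR level the equivalent pattern looks like this (the combine itself
operates on SETCC/TRUNCATE nodes; i16 is the promoted type here):
```llvm
define i1 @cmp_after_promotion(i16 %a, i16 %b) {
  %ta = trunc nuw i16 %a to i8
  %tb = trunc nuw i16 %b to i8
  ; With nuw on both truncates, this can be folded to: icmp eq i16 %a, %b
  %cmp = icmp eq i8 %ta, %tb
  ret i1 %cmp
}
```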
On X86-64, LLVM currently generates the same DWARF debug info for `call
rax` and `call [rax]`; in both cases, the generated DWARF claims that
the call goes to address RAX. This bug occurs because the X86 machine
instructions CALL64r and CALL64m both receive register operands, but
those register operands have different semantics.
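To illustrate (MachineIR sketch; the memory operand fields are abbreviated
and illustrative):
```
CALL64r $rax                        ; callee address is in RAX, so
                                    ; DW_AT_call_target = RAX is correct
CALL64m $rax, 1, $noreg, 0, $noreg  ; RAX is the base of a memory operand;
                                    ; the callee address is loaded from [RAX],
                                    ; so claiming the target is RAX is wrong
```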
To fix it, change DwarfDebug::constructCallSiteEntryDIEs() to validate
the callee operand's semantics (`OperandType`) and make sure it is not
semantically describing a memory location.
This fix results in fewer DW_TAG_call_site and DW_AT_call_target
entries being generated.
There is an existing test in dwarf-callsite-related-attrs.ll that
asserts the broken behavior; remove the broken check, and instead add a
new test dwarf-callsite-related-attrs-indirect.ll that checks behavior
for indirect calls.
The existing test xray-custom-log.ll validates something even more
broken: it checks the debug info generated by a PATCHABLE_EVENT_CALL.
`TII->getCalleeOperand()` assumes that the first argument of a call
instruction is always the destination, but the first argument of
PATCHABLE_EVENT_CALL is instead the event structure; and so we were
emitting debug info claiming the callee was stored in a register that
actually contains some kind of xray event descriptor, and the test
validates that this happens.
I am breaking and deleting this test.
I guess the intent there might have been to validate that we emit
debuginfo referencing the target of the direct call that LLVM emits
(which we don't do)? But I'm not sure.
Currently, when an instruction rematerialized by the register coalescer
defines more subregs of the destination register
than the original COPY instruction did, we only add dead defs for the
newly defined subregs if they were not defined anywhere
else. For example, consider something like this before
rematerialization:
```
%0:reg64 = CONSTANT 1
%1:reg128.sub_lo64_lo32 = COPY %0.lo32
%1:reg128.sub_lo64_hi32 = ...
...
```
that would look like this after rematerializing `%0`:
```
%0:reg64 = CONSTANT 1
%1:reg128.sub_lo64 = CONSTANT 1
%1:reg128.sub_lo64_hi32 = ...
...
```
A dead def would not be added for `%1.sub_lo64_hi32` at the second
instruction because its subrange wasn't empty beforehand.
Now that #146490 has removed the assertion in visitFreeze that the node
was still isGuaranteedNotToBeUndefOrPoison, we no longer need this
reduced-depth hack (which had to account for the difference in depth
between freeze(op()) and op(freeze())).
Helps with some of the minor regressions in #150017.
Compare sinking is selectable based on the result of
hasMultipleConditionRegisters. This function is too coarse grained
because it does not account for the differences between scalar and
vector compares. This PR extends the interface to take an EVT to allow
finer control.
The new interface is used by AArch64 to disable sinking of scalable
vector compares, but with isProfitableToSinkOperands updated to maintain
the cases that are specifically tested.
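A sketch of the kind of case this affects (illustrative IR): a
scalable-vector compare with uses in more than one block. With the
EVT-aware hook, AArch64 can keep the compare in place instead of sinking
a duplicate into each user block.
```llvm
define <vscale x 4 x i1> @multi_use_cmp(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i1 %c) {
entry:
  %cmp = icmp slt <vscale x 4 x i32> %a, %b
  br i1 %c, label %then, label %else
then:
  ret <vscale x 4 x i1> %cmp
else:
  %not = xor <vscale x 4 x i1> %cmp, splat (i1 true)
  ret <vscale x 4 x i1> %not
}
```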
Similar to InstCombinerImpl::freezeOtherUses, attempt to ensure that we
merge multiple frozen/unfrozen uses of an SDValue. This fixes a number of
hasOneUse() problems when trying to push FREEZE nodes through the DAG.
Remove the SimplifyMultipleUseDemandedBits handling of FREEZE nodes, as
we now want to keep the common node rather than bypassing it for some
nodes just because of DemandedElts.
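The idea, sketched at the IR level (the change itself works on SDValues):
```llvm
define i32 @merge_freeze_uses(i32 %x) {
  %fx = freeze i32 %x
  %a = add i32 %fx, 1
  ; This unfrozen use is rewritten to use %fx, so the frozen node no
  ; longer fails hasOneUse()-style checks when pushing FREEZE through.
  %b = mul i32 %x, 3
  %r = xor i32 %a, %b
  ret i32 %r
}
```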
Fixes #149799
If using srl does not produce a legal constant for the RHS of the
final compare, try to use sra instead.
Because the AND constant is negative, the sign bit participates in the
compare; using an arithmetic shift right duplicates it.
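A worked example (widths and constants are illustrative):
```llvm
define i1 @sign_bits_in_compare(i64 %x) {
  ; Checks that bits 63..6 are all ones. With srl the final compare would
  ; need the constant 0x03ffffffffffffff, which may not be legal; with sra
  ; the duplicated sign bit makes the RHS -1, which is.
  %and = and i64 %x, -64
  %cmp = icmp eq i64 %and, -64
  ret i1 %cmp
}
; -> equivalent of: icmp eq i64 (ashr i64 %x, 6), -1
```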
Add a new combine to replace
```
(store ch (vselect cond truevec (load ch ptr offset)) ptr offset)
```
with
```
(mstore ch truevec ptr offset cond)
```
This saves a blend operation on targets that support conditional stores.
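At the IR level the pattern corresponds to (types illustrative):
```llvm
define void @select_store(ptr %p, <4 x i32> %truevec, <4 x i1> %cond) {
  %old = load <4 x i32>, ptr %p, align 16
  %sel = select <4 x i1> %cond, <4 x i32> %truevec, <4 x i32> %old
  store <4 x i32> %sel, ptr %p, align 16
  ret void
}
; becomes the equivalent of:
;   call void @llvm.masked.store.v4i32.p0(<4 x i32> %truevec, ptr %p, i32 16, <4 x i1> %cond)
```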
The IRTranslator now sets the flags more consistently with
`SelectionDAGBuilder::visitGetElementPtr()`. This affects `nuw` and `nusw`, as
well as the recently introduced `inbounds` MIFlag (see PR #150900).
This PR also adds more tests to `AArch64/GlobalISel/irtranslator-gep-flags.ll`
to cover all points in `IRTranslator::translateGetElementPtr` that set flags.
For SWDEV-516125.
This slightly relaxes the invariant established in #149310, by also
allowing the lifetime argument to be poison. This is to support the
typical pattern of RAUWing with poison when removing an instruction.
It's worth noting that this does not require any conservative
assumptions: lifetimes with poison arguments can simply be skipped.
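The pattern in question (a sketch, assuming the current two-argument form
of the intrinsics):
```llvm
define void @f() {
  %a = alloca i32
  call void @llvm.lifetime.start.p0(i64 4, ptr %a)
  call void @llvm.lifetime.end.p0(i64 4, ptr %a)
  ret void
}
; Deleting %a and RAUWing its uses with poison turns the markers into e.g.
;   call void @llvm.lifetime.start.p0(i64 4, ptr poison)
; which is now accepted and simply skipped.
```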
Fixes https://github.com/llvm/llvm-project/issues/151119.
The object file format specific derived classes are used in contexts
where the type is statically known. We don't use isa/dyn_cast, and we
want to eliminate MCSymbol::Kind in the base class.
The assertion previously did not work correctly because the operand was
being truncated to an `int` prior to comparison.
Change the assertion into a reported error, as suggested in
https://github.com/llvm/llvm-project/pull/101840#issuecomment-2304992425
by @arsenm.
Finally, fix the lowering on 64-bit targets so that offsets larger than
32 bits are correctly addressed, and add tests for various reported
issues.
The optimization introduced by #125637 tried to avoid using stacks to
promote bitcast with a vector result type. However, it isn't correct
when the input type is also a vector. This patch limits that
optimization to scalar-to-vector bitcasts.
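Illustrative IR for the two cases:
```llvm
define <2 x i32> @scalar_source(i64 %x) {
  %v = bitcast i64 %x to <2 x i32>        ; still eligible for the optimization
  ret <2 x i32> %v
}

define <2 x i32> @vector_source(<4 x i16> %y) {
  %w = bitcast <4 x i16> %y to <2 x i32>  ; excluded by this patch
  ret <2 x i32> %w
}
```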
This patch adds two DAG combines:
1. vector_interleave(splat, splat, ...) -> {splat,splat,...}
2. concat_vectors(splat, splat, ...) -> wide_splat
where all the input splats are identical. Both of these together enable
us to fold
concat_vectors(vector_interleave(splat, splat, ...))
into a wide splat. Post-legalisation we must only do the concat_vectors
combine if the wider type and splat operation are legal.
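An IR-level view of the combined fold (types illustrative; the intrinsic
returns the concatenation of the interleaved halves):
```llvm
define <vscale x 8 x i16> @interleaved_splat(i16 %x) {
  %ins = insertelement <vscale x 4 x i16> poison, i16 %x, i64 0
  %splat = shufflevector <vscale x 4 x i16> %ins, <vscale x 4 x i16> poison, <vscale x 4 x i32> zeroinitializer
  ; Interleaving identical splats is itself a splat, so the whole result
  ; can be folded to a wide splat of %x.
  %il = call <vscale x 8 x i16> @llvm.vector.interleave2.nxv8i16(<vscale x 4 x i16> %splat, <vscale x 4 x i16> %splat)
  ret <vscale x 8 x i16> %il
}
```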
For fixed-width vectors the DAG combine only occurs for interleave
factors of 3 or more; however, it's not currently safe to test this for
AArch64 since there isn't any lowering support for fixed-width
interleaves. I've only added fixed-width tests for RISCV.
#128400 introduced a use-after-free bug in
`RegAllocBase::cleanupFailedVReg` when removing intervals from regunits.
The issue stems from the `InterferenceCache` in `RAGreedy`, which holds
`LiveRange*`. The current `InterferenceCache` APIs make it difficult to
update the cache, and there isn't a straightforward way to do so.
Since #128400 already notes that it is unclear whether removing
intervals from regunits is necessary, this PR avoids the issue by simply
skipping that step.
Fixes SWDEV-527146.
Collect the necessary information for constructing the call graph
section, and emit it to the .callgraph section of the binary.
The MD5 hash of the callee_type metadata string is used as the numeric
type id emitted.
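The input looks roughly like this (metadata shape illustrative): an
indirect call tagged with callee_type metadata, whose .callgraph entry
records the MD5 hash of the generalized type name as the type id.
```llvm
%call = call i32 %fp(i32 %arg), !callee_type !0

!0 = !{!1}
!1 = !{i64 0, !"_ZTSFiiE.generalized"}
```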
Reviewers: ilovepi
Reviewed By: ilovepi
Pull Request: https://github.com/llvm/llvm-project/pull/87576
https://github.com/llvm/llvm-project/pull/114990 allowed more aggressive
tail duplication for computed-gotos in both pre- and post-regalloc tail
duplication.
In some cases, performing tail-duplication too early can lead to worse
results, especially if we duplicate blocks with a number of phi nodes.
This is causing a ~3% performance regression in some workloads using
Python 3.12.
This patch updates TailDup to delay aggressive tail-duplication for
computed gotos to after register allocation.
This means we can keep the non-duplicated version for a bit longer
throughout the backend, which should reduce compile time as well as
allow a number of optimizations and simplifications to trigger before
drastically expanding the CFG.
For the case in https://github.com/llvm/llvm-project/issues/106846, I
get the same performance with and without this patch on Skylake.
PR: https://github.com/llvm/llvm-project/pull/150911
This reduces the amount of boilerplate required when adding a new
field to MIMetadata and reduces the chance of bugs like the
one I fixed in TargetInstrInfo::reassociateOps.
Reviewers: arsenm, nikic
Reviewed By: nikic
Pull Request: https://github.com/llvm/llvm-project/pull/133535
Currently terminatorIsComputedGoto returns true for blocks with an
indirect branch terminator and no successors. If there are no
successors, the terminator is likely not a computed goto; return false
in that case.
Note that this is currently NFC, as the only caller checks it only when
there are successors, but it will be needed in
https://github.com/llvm/llvm-project/pull/150911.
PR: https://github.com/llvm/llvm-project/pull/151342
This tries to reland #123632 (previously reverted by commit
6b1db79887df19bc8e8c946108966aa6021c8b87).
This PR aims to fix coalescing of SUBREG_TO_REG when sub-register
liveness tracking is enabled; this is now the umpteenth reincarnation
of this effort :)
This change is needed in order to enable subreg liveness tracking for
AArch64, because without the implicit-def, Machine Copy Propagation
would remove a 'redundant' copy because it doesn't realise that the
top 32-bits of the register are zeroed, which subsequent instructions
rely on.
Changes compared to previous PR:
* Rather than updating all instructions that define the source register
  (SrcReg) of the SUBREG_TO_REG, this new approach only updates
  instructions that define SrcReg when they dominate the SUBREG_TO_REG.
  The live-ranges are updated accordingly (see the sketch below).
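A sketch of the resulting rewrite (AArch64-flavoured, instruction and
register names illustrative):
```
; Before coalescing; the 32-bit def also zeroes the top 32 bits:
%0:gpr32 = MOVi32imm 1
%1:gpr64 = SUBREG_TO_REG 0, %0, %subreg.sub_32
; After coalescing, the dominating def of SrcReg gets an implicit-def of
; the full register, keeping the top-bit zeroing visible to Machine Copy
; Propagation:
undef %1.sub_32:gpr64 = MOVi32imm 1, implicit-def %1
```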
This flag applies to G_PTR_ADD instructions and indicates that the operation
implements an inbounds getelementptr operation, i.e., the pointer operand is in
bounds wrt. the allocated object it is based on, and the arithmetic does not
change that.
It is set when the IRTranslator lowers inbounds GEPs (currently only in some
cases, to be extended with a future PR), and in the
(build|materialize)ObjectPtrOffset functions.
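For illustration, a sketch of what the translation produces (the generic
MIR in the comments is abbreviated):
```llvm
define ptr @gep(ptr %p, i64 %i) {
  ; Translates roughly to:
  ;   %off:_(s64) = G_SHL %i, 2        ; scale by sizeof(i32)
  ;   %q:_(p0) = inbounds G_PTR_ADD %p, %off
  %q = getelementptr inbounds i32, ptr %p, i64 %i
  ret ptr %q
}
```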
Inbounds information is useful in ISel when we have instructions that perform
address computations whose intermediate steps must be in the same memory region
as the final result. A follow-up patch will start using it for AMDGPU's flat
memory instructions, where the immediate offset must not affect the memory
aperture of the address.
This is analogous to a concurrent effort in SDAG: #131862
(related: #140017, #141725).
For SWDEV-516125.
The "at construction" binop folds in SelectionDAG::getNode() has
different behaviour when compared to the equivalent LLVM IR. This PR
makes the behaviour consistent while also extending the coverage to
include signed/unsigned max/min operations.
Fold sequences where we extract a bunch of contiguous bits from a value,
merge them into the low bit and then check if the low bits are zero or
not.
Usually the ands would be at the leaves of the expression, but the DAG
canonicalizes them to a single `and` at the root of the expression.
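A worked example (widths and bit positions illustrative):
```llvm
define i1 @contiguous_bits_zero(i32 %x) {
  ; Bit 0 of the OR collects bits 4..6 of %x, so this is equivalent to
  ; icmp eq (and %x, 0x70), 0.
  %a = lshr i32 %x, 4
  %b = lshr i32 %x, 5
  %c = lshr i32 %x, 6
  %ab = or i32 %a, %b
  %abc = or i32 %ab, %c
  %low = and i32 %abc, 1
  %cmp = icmp eq i32 %low, 0
  ret i1 %cmp
}
```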
The reason I put this in DAGCombiner instead of the target combiner is
that this is a generic, valid transform that's also fairly niche, so I
think there isn't much risk of a combine loop.
See #136727