`RegisterClassInfo` was supposed to be kept alive between pass runs,
which wasn't being done leading to recomputations increasing the compile
time.
Now the Impl class is a member of the legacy and new passes so that it
is not reconstructed on every pass run.
---------
Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>
This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10 year old workaround
to this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.
The copy coalescing processing here is overly abstracted
from what's actually happening. Previously when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This means that the
first thing the new ValueTracker constructed would do
is getVRegDef to find the instruction we are currently
processing. This adds an unnecessary step, placing
a useless entry in the RewriteMap, and required skipping
the no-op case where getNewSource would return the original
source operand. This was a problem since in the case
of a subregister extract, shouldRewriteCopySource would always
say that it is useful to rewrite and the use-def chain walk
would abort, returning the original operand. Move the process
to start looking at the source operand to begin with.
This does not fix the confused handling in the uncoalescable
copy case which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this was implemented with isCopyLikeInstr, rather than guessing
at the operand structure as it does now.
There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subrgs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.
The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which uses a very long chain of insert_subregs.
We need improved subregister handling locally in PeepholeOptimizer,
and other pasess like MachineCSE to fix some of the other regressions.
We should handle subregister composes and folding more indexes
into insert_subreg and reg_sequence.
This is needed for architectures that actually use strict pointer
arithmetic instead of integers such as AArch64 with FEAT_CPA (see
https://github.com/llvm/llvm-project/pull/105669) or CHERI. Using an
index as the first operand of pointer arithmetic may result in an
invalid output.
While there are quite a few codegen changes here, these only change the
order of registers in add instructions. One MIPS combine had to be
updated to handle the new node order.
Reviewed By: topperc
Pull Request: https://github.com/llvm/llvm-project/pull/125279
An optimization was added that tries to move the uses of the mflr
instruction away from the instruction itself. However, this doesn't work
when we are using the hashst instruction because that instruction needs
to be run before the stack frame is obtained.
This patch disables moving instructions away from the mflr in the case
where ROP protection is being used.
---------
Co-authored-by: Lei Huang <lei@ca.ibm.com>
There's a regression with one of the bootstrap builds for x86.
I'll revert this while I investigate.
This reverts commit 4df6d3df24ae9cff07c70c96a1663cbba6e1dca5.
This PR aims to reland work done by @arsenm which was previously
reverted due to some tangentially related scheduler issues as discussed
on #76416.
This PR cherry-picks the original commit (0e46b49de433), and adds
another patch on top with the following changes:
* The code in `updateRegDefsUses` now updates subranges when
subreg-liveness-tracking is enabled.
* When adding an implicit-def operand for the super-register,
the code in `reMaterializeTrivialDef` which tries to remove
undefined subranges should now take into account that the lanes
from the super-reg are no longer undefined.
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
I am planning to add some optimization remarks to the
`PartiallyInlineLibCalls` pass. However, since this pass does not emit any
optimization remarks yet, I have to add the "infrastructure" for that first, which
is what this PR is about.
For shuffle vector splats with undef lanes in the mask,
this was introducing real values. Filter out build_vector
results based on the undef elements in the mask.
This avoids AMDGPU test regressions in a future change.
test/CodeGen/X86/urem-seteq-illegal-types.ll looks worse
but I didn't investigate.
Unsigned subtraction wrap-around occurs in `emitGlobalConstantImpl` on
an AIX-specific code path from 8e4423eb0888 when a structure type has
zero elements.
With assertions enabled, this manifests as:
```
TypeSize llvm::StructLayout::getElementOffset(unsigned int) const: Assertion `Idx < NumElements && "Invalid element idx!"' failed.
```
https://reviews.llvm.org/D111433 introduced PPCISD::CALL_RM for
-frounding-math. -msecure-plt -frounding-math {-fpic,-fPIC} codegen for
PPC32 became incorrect when a function contains function calls but no
global variable references (GlobalBaseReg).
As reported by @q66 , musl/src/dirent/closedir.c implements such a
function, which is miscompiled.
PPCISD::CALL has custom logic to set up the base register
(https://reviews.llvm.org/D42112). Add an extra case for CALL_RM.
While here, improve the test to
* actually test `case PPCISD::CALL`: we need a non-leaf function that
doesn't access global variables (global variables lead to
GlobalBaseReg, which call `getGlobalBaseReg()` as well).
* test `ExternalSymbolSDNode` with a memset.
Supersedes: #72758
Pull Request: https://github.com/llvm/llvm-project/pull/121281
AArch64 and PowerPC look like a improvements.
RISC-V is neutral.
X86 trades a dependency breaking xor before a seta for a movsx after a
sbbb. Depending on how the result is used, this movsx might go away.
Limits #117900 to only fold when scalar_to_vector doesn't perform implicit truncation, as the scaled shift calculation doesn't currently account for this - this can be addressed in a future update.
Fixes#121306
llvm-mc --assemble prints an initial `.text` from `initSections`.
This is weird for quick assembly tasks that do not specify `.text`.
Omit the .text by moving section directive printing from `changeSection`
to `switchSection`. switchSectionNoPrint now correctly calls the
`changeSection` hook (needed by MachO).
The initial directives of clang -S are now reordered. On ELF targets, we
get `.file "a.c"; .text` instead of `.text; .file "a.c"`.
If there is no function, `.text` will be omitted.
This adds a new helper `canFoldStoreIntoLibCallOutputPointers()` to
check that it is safe to fold a store into a node that will expand to a
library call that takes output pointers. This requires checking for two
(independent) properties:
1. The store is not within a CALLSEQ_START..CALLSEQ_END pair
* If it is, the expansion would lead to nested call sequences (which is
invalid)
2. The node does not appear as a predecessor to the store
* If it does, attempting to merge the store into the call would result
in a cycle in the DAG
These two properties are checked as part of the same traversal in
`canFoldStoreIntoLibCallOutputPointers()`
Improve the non-fatal cases to use DiagnosticInfo, which will now
provide a location. The allocators attempt to report different errors
if it happens to see inline assembly is involved (this detection is
quite unreliable) using srcloc instead of dbgloc. For now, leave this
behavior unchanged. I think reporting the full location and context
function would be more useful.
When arguments are passed in memory instead of registers we currently
load the entire pointer size even though the argument may be smaller.
For exmaple if the pointer size if i32 then we use a load word even if
the argument is only an i8. This patch zeros / extends the bits that are
not required to ensure that we are getting the correct value even if the
load is larger.
This pull request modifies the behavior of the
`@llvm.experimental.stackmap` intrinsic to require that its two first
operands (`id` and `numShadowBytes`) be **immediate values**. This
change ensures that variables cannot be passed as two first arguments to
this intrinsic.
Related Issue: https://github.com/llvm/llvm-project/issues/115733
### Testing
- Added new test cases to ensure errors are emitted for non-immediate
operands.
- Ran the full LLVM test suite to verify no regressions were introduced.
The patch fix https://github.com/llvm/llvm-project/issues/116061
The root cause of the assertion is that the FMA mutation pass does not
update the subranges of the live interval for the defined register of
the modified instruction .
it recalculate the live interval of the defined register of xvmaddmdp in
the VSX FMA mutation pass.
When extracting a smaller integer from a scalar_to_vector source, we were limited to only folding/truncating the lowest bits of the scalar source.
This patch extends the fold to handle extraction of any other element, by right shifting the source before truncation.
Fixes a regression from #117884
When GlobalMerge creates a MergedGlobal of statics all initialized to
zero, emitGlobalConstantImpl sees a ConstantAggregateZero. This results
in just emitting zeros followed by labels for the aliases. We need to
handle it more like how emitGlobalConstantStruct does by emitting each
global inside the aggregate.
---------
Co-authored-by: Hubert Tong <hubert.reinterpretcast@gmail.com>
Nodes created with `getMemIntrinsicNode` have memory operands. In order
for operands to be propagated to machine instructions, the nodes should
have `SDNPMemOperand` property.
Similar to 3c8c385a.
Commit 93589057830b2c3c35500ee8cac25c717a1e98f9 was reverted because it
caused a failure with test `lld :: ELF/ppc64-local-exec-tls.s`. This
relands the commit with a fix for the test.
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294
- `Builtins.td` - Add f16 support for libm atan2 builtin
- `CGBuiltin.cpp` - Emit constraint atan2 intrinsic for clang builtin
- `clang/test/CodeGenCXX/builtin-calling-conv.cpp` - Use erff instead of
atan2 for clang builtin to lib call calling convention check, now that
atan2 maps to an intrinsic.
- add atan2 cases to llvm.experimental.constrained tests for more
backends: ARM, PowerPC, RISCV, SystemZ.
- LangRef.rst: add llvm.experimental.constrained.atan2, revise
llvm.atan2 description.
Last part of Implement the atan2 HLSL Function. Fixes#70096.
When the chain is not the entry node there is a risk the stores are
within a (CALLSEQ_START, CALLSEQ_END), which when the node is expanded
will lead to nested call sequences.
It should be possible to check for this and allow more cases, but for
now, let's limit this to cases where it's definitely safe.
Fixes#115323
If an instruction doesn't support memory operands, but one is provided,
an error should be raised. And conversely, if an instruction requires a
memory operand, but none is given, an error should be raised.
This patch fixes the combines for vector_shuffles when either or both of
its left and right hand side inputs are scalar_to_vector nodes.
Previously, when both left and right side inputs are scalar_to_vector
nodes, the current combine could not handle this situation, as the shuffle
mask was updated incorrectly. To temporarily solve this solution, this combine
was simply disabled and not performed.
Now, not only does this patch aim to resolve the previous issue of the
incorrect shuffle mask adjustments respectively, but it also updates any test
cases that are affected by this change.
Patch migrated from https://reviews.llvm.org/D130487.
This reverts commit c31014322c0b5ae596da129cbb844fb2198b4ef4.
Based on the discussions in #112772, this pass is not needed after the
introduction of `llvm.threadlocal.address` intrinsic.
Fixes https://github.com/llvm/llvm-project/issues/112771.
This merges the logic for expanding both FFREXP and FSINCOS into one
method `DAG.expandMultipleResultFPLibCall()`. This reduces duplication
and also allows FFREXP to benefit from the stack slot elimination
implemented for FSINCOS. This method will also be used in future to
implement more multiple-result intrinsics (such as modf and sincospi).