I'm guessing based on tablegen definitions. I also don't
really understand how this could have been missing.
This defends against regressions in a future peephole-opt
patch.
Previously this combine would undo AMDGPU's new custom legalization of
wide vector shuffles into 2 element pieces. The comment also
states that this combine is only done before legalization,
but the case with a build_vector source was unconditional.
We probably don't want to do this if the multiple uses are full
scalarization of the vector, but this seems to work well enough.
Scalarizing extracts should have folded out pre-legalize.
Add a new AutoUpgrade function to convert some legacy nvvm.annotations
metadata to function level attributes. These attributes are quicker to
look-up so improve compile time and are more idiomatic than using
metadata which should not include required information that changes the
meaning of the program.
Currently supported annotations are:
- !"kernel" -> ptx_kernel calling convention
- !"align" -> alignstack parameter attributes (return not yet supported)
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.
Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
make it easier to use old IR files and somewhat reduce the test churn in
this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
attribute. The representation in the LLVM IR dialect should be updated
separately.
We track definitions of stack objects, the implementation is identical
to tracking of registers.
Also, added printing of all found reaching definitions for testing
purposes.
---------
Co-authored-by: Michael Maitland <michaeltmaitland@gmail.com>
After we fall back from GlobalISel to SDAG, the verifier gets called,
which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs
which caches the result of UsesAGPRs. Because we have just fallen-back
the function is empty and it incorrectly gets cached to false. This
patch makes sure we don't try to run the verifier whilst the function is
empty.
When using PAuthLR, the PAUTH_PROLOGUE expands into a sequence of
instructions which takes the address of one of those instructions, and
uses that address to compute the return address signature. If this is
duplicated, there will be two different addresses used in calculating
the signature, so the epilogue will only be correct for (at most) one of
them.
This change also restricts code generation when using v8.3-A return
address signing, without PAuthLR. This isn't strictly needed, as
duplicating the prologue there would be valid. We could fix this by
having two copies of PAUTH_PROLOGUE, with and without isNotDuplicable,
but I don't think it's worth adding the extra complexity to a security
feature for that.
https://github.com/llvm/llvm-project/pull/122183 adds a codegen pass to
infer machine jump table entry's hotness from the MBB hotness. This is a
follow-up PR to produce `.hot` and or `.unlikely` section prefix for
jump table's (read-only) data sections in the relocatable `.o` files.
When this patch is enabled, linker will see {`.rodata`, `.rodata.hot`,
`.rodata.unlikely`} in input sections. It can map `.rodata.hot` and
`.rodata` in the input sections to `.rodata.hot` in the executable, and
map `.rodata.unlikely` into `.rodata` with a pending extension to
`--keep-text-section-prefix` like
059e7cbb66,
or with a linker script.
1. To partition hot and jump tables, the AsmPrinter pass slices a function's jump table indices into two groups, one for hot and the other for cold jump tables. It then emits hot jump tables into a `.hot`-prefixed data section and cold ones into a `.unlikely`-prefixed data section, retaining the relative order of `LJT<N>` labels within each group.
2. [ELF only] To have data sections with _dynamic_ names (e.g., `.rodata.hot[.func]`), we implement
`TargetLoweringObjectFile::getSectionForJumpTable` method that accepts a `MachineJumpTableEntry` parameter, and update `selectELFSectionForGlobal` to generate `.hot` or `.unlikely` based on
MJTE's hotness.
- The dynamic JT section name doesn't depend on `-ffunction-section=true` or `-funique-section-names=true`, even though it leverages the similar underlying mechanism to have a MCSection with on-demand name as `-ffunction-section` does.
3. The new code path is off by default.
- Typically, `TargetOptions` conveys clang or LLVM tools' options to code generation passes. To follow the pattern, add option `EnableStaticDataPartitioning` bit in `TargetOptions` and make it
readable through `TargetMachine`.
- To enable the new code path in tools like `llc`, `partition-static-data-sections` option is introduced in
`CodeGen/CommandFlags.h/cpp`.
- A subsequent patch
([draft](8f36a13743)) will add a clang option to enable the new code path.
---------
Co-authored-by: Ellis Hoag <ellis.sparky.hoag@gmail.com>
This replaces the worklist by instead computing what VL is demanded by
each instruction's users first, which is done via checkUsers.
The demanded VLs are stored in a DenseMap, and then we can just do a
single forward pass of tryReduceVL where we check if a candidate's
demanded VL is less than its VLOp.
This means the pass should now be linear in complexity, and allows us to
relax the restriction on tied operands in more easily as in #124066.
Whilst adding a cross-block test, I encountered an assertion failure in
the second pass where we check the instruction popped off the worklist
is a candidate.
The leaf instruction %c in this case will be added to the worklist when
its VL is VLMAX, but during the first pass it will have its VL reduced
to 1.
Then in the second pass when its processed via the worklist, isCandidate
will no longer be true due to its VL == 1.
This fixes it by moving the VL == 1 check to tryReduceVL, keeping it
alongside the other VL check for bailing out early as an optimisation.
Emil Tsalapatis from Meta reported such a case where 'may_goto 0' insn
is generated by clang compiler. But 'may_goto 0' insn is actually a
no-op so it makes sense to remove that in llvm. The patch is also able
to handle the following code pattern
```
...
may_goto 2
may_goto 1
may_goto 0
...
```
where three may_goto insns can all be removed.
---------
Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
This patch contains a number of changes relating to the above flag;
primarily it updates comment references to the old flag names,
"-fextend-lifetimes" and "-fextend-this-ptr" to refer to the new names,
"-fextend-variable-liveness[={all,this}]". These changes are all NFC.
This patch also removes the explicit -fextend-this-ptr-liveness flag
alias, and shortens the help-text for the main flag; these are both
changes that were meant to be applied in the initial PR (#110000), but
due to some user-error on my part they were not included in the merged
commit.
The DwarfDebug.cpp implementation expects the epilogue instructions to
have source location of last non-debug instruction after which the epilogue
instructions are inserted. The epilogue_begin is set on location of the first
FrameDestroy instruction with source line information that has been seen in
the epilogue basic block.
In the trunk, the risc-v backend sets the epilogue_begin after the epilogue has
actually begun i.e. after callee saved register reloads and the source line
information is not set on those reload instructions. This is leading to #120553
where, while debugging, breaking on or single stepping to the epilogue_begin
location will make accessing the variables from wrong place as the FP has been
restored to the parent frame's FP.
To fix that, this patch sets FrameSetup/FrameDestroy flags on the callee saved
register spill/reload instructions which is actually correct. Then the
RISCVInstrInfo::loadRegFromStackSlot uses FrameDestroy flag to identify a
reload of the callee saved register in the epilogue and copies the source
line information from insert position instruction to that reload instruction.
Requires PR #120622Fixes#120553
Since 3494ee95902cef62f767489802e469c58a13ea04 APInt stopped to
implicitly truncate values, therefore it asserts on a big signed value
converted to (implicitly) unsigned APInt.
The change explicitly marks offset as a signed value.
In the RemoveLoadsIntoFakeUses pass, we try to remove loads that are
only used by fake uses, as well as the fake use in question. There are
two existing errors with the pass however: it incorrectly examines every
operand of each FAKE_USE, when only the first is relevant (extra
operands will just be "killed" regs assigned by a previous pass), and it
ignores cases where the FAKE_USE register is not an exact match for the
loaded register, which is incorrect as regalloc may choose to load a
wider value than the FAKE_USE required pre-regalloc. This patch fixes
both of these cases.
This is a fix for:
https://github.com/llvm/llvm-project/issues/97290
Please let me know if that is the right way to address the issue. Thank
you!
---------
Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
If the operands to `INSERT_SUBVECTOR` can't be widened legally, just
replace the `INSERT_SUBVECTOR` with a series of `INSERT_VECTOR_ELT`.
Closes#124255 (and possibly #102016)
Patch was reverted due to test case (added) exposing an infinite loop in
combiner, where (shl C1, C2) create by performSHLCombine isn't
constant-folded:
Combining: t14: i64 = shl t12, Constant:i64<1>
Creating new node: t36: i64 = shl
OpaqueConstant:i64<-2401053089408754003>, Constant:i64<1>
Creating new node: t37: i64 = shl t6, Constant:i64<1>
Creating new node: t38: i64 = and t37, t36
... into: t38: i64 = and t37, t36
...
Combining: t38: i64 = and t37, t36
Creating new node: t39: i64 = and t6,
OpaqueConstant:i64<-2401053089408754003>
Creating new node: t40: i64 = shl t39, Constant:i64<1>
... into: t40: i64 = shl t39, Constant:i64<1>
and subsequently gets simplified by DAGCombiner::visitAND:
// Simplify: (and (op x...), (op y...)) -> (op (and x, y))
if (N0.getOpcode() == N1.getOpcode())
if (SDValue V = hoistLogicOpWithSameOpcodeHands(N))
return V;
before being folded by performSHLCombine once again and so on.
The combine in performSHLCombine should only be done if (shl C1, C2) can
be constant-folded, it may otherwise be unsafe and generally have a
worse end result. Thanks to Dave Sherwood for his insight on this one.
This reverts commit f719771f251d7c30eca448133fe85730f19a6bd1.
``` - add clang builtin to Builtins.td
- link builtin in hlsl_intrinsics
- add codegen for spirv intrinsic and two directx intrinsics to retain
signedness information of the operands in CGBuiltin.cpp
- add semantic analysis in SemaHLSL.cpp
- add lowering of spirv intrinsic to spirv backend in
SPIRVInstructionSelector.cpp
- add lowering of directx intrinsics to WaveActiveOp dxil op in
DXIL.td
- add test cases to illustrate passespendent pr merges.
```
Resolves#99170
We currently have an issue where bf16 patters can be used to match fp16
types, as GISel does not know about the difference between the two. This
patch explicitly disables them to make sure that they are never used.
The opposite can also happen too, where fp16 patterns are used for
operators that should be bf16. So this also changes any operations with
bf16 types to now cause a fallback to SDAG.
The pass setup for GISel has been slightly adjusted to make sure that a
verify pass does not get added between AMD-SDAG and SIFixSGPRCopiesPass,
which otherwise can cause verifier issues when falling back.
Spotted the auipc case while looking at code for P550. I'm not sure this
is the right long term fix. We're still missing rematerialization
opportunities for these pairs so a pseudo might be better. That would
interfere with folding auipc+add into load/store addressing though.
Fixes#76779.
This blocks rematerialization during scheduling if the instruction has a
non accepted PhysReg use.
Currently, there aren't any checks like this in place, and we may create
invalid code: https://godbolt.org/z/xjPjdcorf
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`SXTB`, `UXTB`, `SXTH`, `UXTH`, `SXTW`, and `UXTW` instructions.
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`RBIT`, `REVB`, `REVH`, `REVW`, and `REVD` instructions.
We currently check for passthrus in two places, on the instruction to
reduce in isCandidate, and on the users in checkUsers.
We cannot reduce the VL if an instruction has a user that's a passthru,
because the user will read elements past VL in the tail.
However it's fine to reduce an instruction if it itself contains a
non-undef passthru. Since the VL can only be reduced, not increased, the
previous tail will always remain the same.
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`URECPE`, `URSQRTE`, `SQABS` and `SQNEG` instructions.
On the CPUs listed below, we want to avoid LDAPUR for performance
reasons. Add a tuning feature to disable them when using:
-mcpu=neoverse-v2
-mcpu=neoverse-v3
-mcpu=cortex-x3
-mcpu=cortex-x4
-mcpu=cortex-x925
SVE2.2 introduces instructions with predicated forms with zeroing of
the inactive lanes. This allows in some cases to save a `movprfx` or
a `mov` instruction when emitting code for `_x` or `_z` variants of
intrinsics.
This patch adds support for emitting the zeroing forms of certain
`FLOGB` instructions.