We only want to treat globals as potentially far away, not other things
like constants in the constant pool.
This matches the object file emission that only puts the large section
flag on globals.
Remove FIXME since the remaining differences are accesses to 0 sized
globals which are intentional.
Some critical code paths we have depend on efficient byte extraction
from data loaded as integers. By default LLVM tries to extract bytes by
storing/loading from stack, which is very inefficient on GPU.
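As a rough illustration of the two strategies (a sketch only, not code from this patch; both helper names are hypothetical), byte extraction can stay in registers with a shift and a mask, or take a round trip through memory. The memory variant assumes a little-endian host.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: extract byte `index` (0 = least significant) with a
// shift and a mask, so the value never has to leave a register.
inline uint8_t extractByteShift(uint32_t value, unsigned index) {
  assert(index < 4 && "a 32-bit integer has four bytes");
  return static_cast<uint8_t>((value >> (8 * index)) & 0xFFu);
}

// Hypothetical helper: the same result via a store to a local buffer and a
// byte load, modelling the stack round trip. Assumes a little-endian host.
inline uint8_t extractByteViaMemory(uint32_t value, unsigned index) {
  assert(index < 4 && "a 32-bit integer has four bytes");
  unsigned char bytes[sizeof(value)];
  std::memcpy(bytes, &value, sizeof(value));
  return bytes[index];
}
```

For `0xAABBCCDD`, index 0 selects `0xDD` and index 3 selects `0xAA` in the shift-based version.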
This patch separates PNR registers into their own register class instead
of sharing a register class with PPR registers. This primarily allows us
to return more accurate register classes when applying assembly
constraints, but also gives more protection against supplying an incorrect
predicate type to an invalid register operand.
Generate a few of the relevant tests with `update_llc_test_checks.py`
and pre-commit. Makes it easier to spot the differences in D152828.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D157116
These are marked to be "as cheap as a move".
According to publicly available Software Optimization Guides, they
have one cycle latency and maximum throughput only on some
microarchitectures, only for `LSL` and only for some shift amounts.
This patch uses the subtarget feature `FeatureALULSLFast` to determine
how cheap the instructions are.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D152827
Change-Id: I8f0d7e79bcf277ebf959719991c29a1bc7829486
- remove `FeatureCustomCheapAsMoveHandling`: when you have target
features affecting `isAsCheapAsAMove` that can be given on command
line or passed via attributes, then every sub-target effectively has
custom handling
- remove special handling of `FMOVD0`/etc: `FMOV` with an immediate
zero operand is never[1] more expensive than an `FMOV` with a
register operand.
- remove special handling of `COPY` - copy is trivially as cheap as
itself
- make the function default to the `MachineInstr` attribute
`isAsCheapAsAMove`
- remove special handling of `ANDWrr`/etc and of `ANDWri`/etc: the
fallback `MachineInstr` attribute is already non-zero.
- remove special handling of `ADDWri`/`SUBWri`/`ADDXri`/`SUBXri` -
these always[1] have one cycle latency with maximum (for the
micro-architecture) throughput
- check if `MOVi32Imm`/`MOVi64Imm` can be expanded into a "cheap"
sequence of instructions
There is a little twist with determining whether a
`MOVi32Imm`/`MOVi64Imm` is "as-cheap-as-a-move". Even if one of these
pseudo-instructions needs to be expanded to more than one MOVZ,
MOVN, or MOVK instruction, materialisation may be preferable to
allocating a register to hold the constant. For the moment a cutoff
at two instructions seems like a reasonable compromise.
[1] according to 19 software optimisation manuals
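The two-instruction cutoff can be sketched roughly as follows. This is a simplified model, not the code in the patch: the helper names are hypothetical, and it only counts MOVZ/MOVN/MOVK sequences, ignoring other encodings (such as bitmask immediates) that a real expansion may use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: estimate how many MOVZ/MOVN/MOVK instructions are
// needed to materialise `imm` in a 32- or 64-bit register.
// MOVZ writes one 16-bit chunk and zeroes the rest; MOVN writes one inverted
// chunk and fills the rest with ones; each MOVK patches one further chunk.
inline unsigned estimateMovSequenceLength(uint64_t imm, unsigned bits) {
  assert((bits == 32 || bits == 64) && "only 32- and 64-bit moves modelled");
  unsigned nonZeroChunks = 0;
  unsigned nonOnesChunks = 0;
  for (unsigned i = 0; i < bits / 16; ++i) {
    uint16_t chunk = static_cast<uint16_t>((imm >> (16 * i)) & 0xFFFFu);
    if (chunk != 0x0000u) ++nonZeroChunks;  // needs a write after MOVZ
    if (chunk != 0xFFFFu) ++nonOnesChunks;  // needs a write after MOVN
  }
  // At least one instruction is always required, even for 0 or all-ones.
  return std::max(1u, std::min(nonZeroChunks, nonOnesChunks));
}

// Hypothetical helper applying the cutoff of two instructions.
inline bool isCheapToMaterialise(uint64_t imm, unsigned bits) {
  return estimateMovSequenceLength(imm, bits) <= 2;
}
```

Under this model `0x12345678` as a 32-bit immediate needs two instructions and counts as cheap, while `0x1234567812345678` as a 64-bit immediate needs four and does not.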
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D154722
While simplifying some vector operators in DAG combine, we may need to
create new instructions for simplified vectors. At that time, we need to
make sure that all the flags of the new instruction are copied/modified
from the old instruction.
If "contract" is dropped from an instruction like FMUL, it may not
generate FMA instruction which would impact performance.
Here's an example where "contract" flag is dropped when FMUL is created.
Replacing.2 t42: v2f32 = fmul contract t41, t38
With: t48: v2f32 = fmul t38, t38
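The idea of carrying flags over to the replacement node can be sketched with a toy model. This is illustrative only: `ToyNode`, the flag constants, and both helpers are hypothetical and do not correspond to the SelectionDAG API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits standing in for fast-math flags.
constexpr uint8_t kFlagNone = 0;
constexpr uint8_t kFlagContract = 1u << 0;

// Hypothetical stand-in for a DAG node: an opcode name plus its flags.
struct ToyNode {
  const char *opcode;
  uint8_t flags;
};

// Build the simplified replacement node and copy the old node's flags across,
// so that "contract" is not silently dropped.
inline ToyNode rebuildPreservingFlags(const ToyNode &oldNode) {
  ToyNode newNode{oldNode.opcode, kFlagNone};
  newNode.flags = oldNode.flags;
  return newNode;
}

// In this toy model, a later FMA formation step only fires when the multiply
// still carries the contract flag.
inline bool mayFuseIntoFma(const ToyNode &node) {
  return (node.flags & kFlagContract) != 0;
}
```

If the flags were left at `kFlagNone` instead, `mayFuseIntoFma` would return false for the rebuilt node, which is the toy equivalent of the missed FMA described above.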
Co-authored-by: Sirish Pande <sirish.pande@amd.com>
Similar to #65598, if we're using a vslideup to insert a fixed length
vector into another vector, then we can work out the minimum number of
registers it will need to slide up across given the minimum VLEN, and
shrink the type operated on to reduce LMUL accordingly.
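The register-count reasoning can be sketched as plain arithmetic. This is a simplified model under stated assumptions, not the lowering code itself: the helper names are hypothetical, fractional LMUL is ignored, and the insert is assumed to touch elements `insertIdx` through `insertIdx + subvecLen - 1`.

```cpp
#include <cassert>

// Hypothetical helper: with at least `minVlenBits` bits per vector register,
// how many registers does a slide-up writing `subvecLen` elements starting at
// `insertIdx` need to cover?
inline unsigned minRegistersForSlideup(unsigned insertIdx, unsigned subvecLen,
                                       unsigned elemBits,
                                       unsigned minVlenBits) {
  assert(subvecLen > 0 && elemBits > 0 && minVlenBits >= elemBits);
  unsigned elemsPerReg = minVlenBits / elemBits;  // guaranteed per register
  unsigned lastIdx = insertIdx + subvecLen - 1;   // last element written
  return lastIdx / elemsPerReg + 1;               // registers 0..lastIdx/elemsPerReg
}

// Hypothetical helper: round a register count up to a power of two, since
// integer LMUL values are powers of two.
inline unsigned roundUpToLmul(unsigned regs) {
  unsigned lmul = 1;
  while (lmul < regs)
    lmul *= 2;
  return lmul;
}
```

For example, with a minimum VLEN of 128 and 32-bit elements, inserting four elements at index 4 touches elements 4 to 7, which fit in the first two registers, so the operation could run at LMUL 2 even if the destination type is wider.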
This is somewhat dependent on #66211 , since it introduces a subregister
copy that triggers a crash with -early-live-intervals in one of the
tests.
Stacked upon #66211
Noticed on D159533 and I've finally dealt with the x86 regressions - MatchingStackOffset wasn't peeking through AssertZext nodes while trying to find CopyFromReg/Load sources, it was only removing them if they were part of a (trunc (assertzext x)) pattern.
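A toy sketch of what "peeking through" means here (illustrative only; `SketchNode`, `SketchOpcode`, and the helper are hypothetical and not the SelectionDAG types): skip any chain of AssertZext wrappers before checking what the underlying source is.

```cpp
#include <cassert>

// Hypothetical opcode set for the sketch.
enum class SketchOpcode { AssertZext, CopyFromReg, Load, Other };

// Hypothetical single-operand node.
struct SketchNode {
  SketchOpcode opcode;
  const SketchNode *operand;  // nullptr for leaf nodes
};

// Walk past any number of AssertZext wrappers and return the first node that
// is not one, so the caller can test it for CopyFromReg/Load.
inline const SketchNode *peekThroughAssertZext(const SketchNode *node) {
  while (node && node->opcode == SketchOpcode::AssertZext)
    node = node->operand;
  return node;
}
```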
This transform has caused a few issues with operations that can naturally be
extended. This patch just adds a debug option for disabling the transform,
useful for testing cases where it might not be profitable.
One big issue with DirectXShaderCompiler was test coverage: DXIL and
SPIR-V backends had their own tests. When a bug was found in one, the
other wasn't always checked. This led to unequal support of HLSL for
both backends. We'd like to avoid those issues here, hence the
test-sharing.
By default, all the tests in this folder are marked as requiring
DirectX. But as SPIR-V support grows, each test drops this requirement
and checks the SPIR-V behavior.
I would have preferred to mark new tests as XFAIL for SPIR-V by default,
so we could differentiate real unsupported tests (as SPIR-V has no
equivalent), from newly added tests. But the way LIT is built, I don't
think this is possible.
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
Code Object V2 has been deprecated for more than a year now. We can
safely remove it from LLVM.
- [clang] Remove support for the `-mcode-object-version=2` option.
- [lld] Remove/refactor tests that were still using COV2
- [llvm] Update AMDGPUUsage.rst
- Code Object V2 docs are left for informational purposes because those
code objects may still be supported by the runtime/loaders for a while.
- [AMDGPU] Remove COV2 emission capabilities.
- [AMDGPU] Remove `MetadataStreamerYamlV2` which was only used by COV2
- [AMDGPU] Update all tests that were still using COV2 - They are either
deleted or ported directly to code object v4 (as v3 is also planned to
be removed soon).
This patch moves the MBB Profile Dump to ./llvm/test/CodeGen/Generic
from ./llvm/test/CodeGen/MlRegAlloc as the profile dump doesn't have
anything to do with the ML guided register allocation heuristic.
This reverts commit 96ea48ff5dcba46af350f5300eafd7f7394ba606.
The change may cause Verifier.cpp error
"musttail call must precede a ret with an optional bitcast"
Do so by extending `matchUnaryPredicate` to also work for
`ConstantFPSDNode` types then encapsulate the constant checks in a
lambda and pass it to `matchUnaryPredicate`.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154868
Note: This is moving D154678 which previously implemented this in
InstCombine. Concerns were brought up that this was de-canonicalizing
and really targeting a codegen improvement, so placing in DAGCombiner.
This implements:
```
(fmul C, (uitofp Pow2))
-> (bitcast_to_FP (add (bitcast_to_INT C), Log2(Pow2) << mantissa))
(fdiv C, (uitofp Pow2))
-> (bitcast_to_FP (sub (bitcast_to_INT C), Log2(Pow2) << mantissa))
```
The motivation is mostly fdiv where 2^(-p) is a fairly common
expression.
The patch is intentionally conservative about the transform, only
doing so if we:
1) have IEEE floats
2) C is normal
3) add/sub of max(Log2(Pow2)) stays in the min/max exponent
bounds.
Alive2 can't realistically prove this, but did test float16/float32
cases (within the bounds of the above rules) exhaustively.
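The transform and its guards can be sketched for `float` as follows. This is a scalar illustration under stated assumptions, not the DAGCombiner code: the helper names are hypothetical, `float` is assumed to be IEEE-754 binary32 (23 mantissa bits), and the bounds check is done per value rather than on the maximum possible `Log2(Pow2)`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "sketch assumes IEEE-754 binary32 floats");

constexpr unsigned kMantissaBits = 23;

inline uint32_t floatToBits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}

inline float bitsToFloat(uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Hypothetical guard mirroring the conditions above: C must be normal and the
// adjusted biased exponent must stay in the normal range [1, 254].
inline bool canAdjustExponent(float c, unsigned log2, bool isDiv) {
  if (!std::isnormal(c))
    return false;
  unsigned biasedExp = (floatToBits(c) >> kMantissaBits) & 0xFFu;
  return isDiv ? biasedExp > log2 : biasedExp + log2 <= 254u;
}

// c * 2^log2, computed as an integer add on the exponent field.
inline float mulByPow2(float c, unsigned log2) {
  assert(canAdjustExponent(c, log2, /*isDiv=*/false));
  return bitsToFloat(floatToBits(c) +
                     (static_cast<uint32_t>(log2) << kMantissaBits));
}

// c / 2^log2, computed as an integer subtract on the exponent field.
inline float divByPow2(float c, unsigned log2) {
  assert(canAdjustExponent(c, log2, /*isDiv=*/true));
  return bitsToFloat(floatToBits(c) -
                     (static_cast<uint32_t>(log2) << kMantissaBits));
}
```

With these helpers, `mulByPow2(3.0f, 4)` gives `48.0f` and `divByPow2(3.0f, 4)` gives `0.1875f`; zero, denormals, infinities, and NaNs are rejected by the guard.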
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154805
Print OpConstant floats as formatted decimal floating points, with
special case exceptions to print infinity and NaN as hexfloats.
This change follows from the fixes in
https://github.com/llvm/llvm-project/pull/66686 to correct how
constant values are printed generally.
Differential Revision: https://reviews.llvm.org/D159376
This change makes callees with the __arm_preserves_za
type attribute comply with the dormant state requirements
when its caller has the __arm_shared_za type attribute.
Several external SME functions also do not need to lazy
save.
5e67092434/aapcs64/aapcs64.rst (L1381)
Differential Revision: https://reviews.llvm.org/D159186
Previously, the SPIR-V instruction printer was always printing the first
operand of an `OpConstant`'s literal value as one of the fixed operands.
This is incorrect for 64-bit values, where the first operand is actually
the value's lower-order word and should be combined with the following
higher-order word before printing.
This change fixes that issue by waiting to print the last fixed operand
of `OpConstant` instructions until the variadic operands are ready to be
printed, then using `NumFixedOps - 1` as the starting operand index for
the literal value operands.
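The word combination described above amounts to the following. This is a minimal sketch with a hypothetical helper name, not the instruction printer's code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: rebuild a 64-bit literal from its two 32-bit operand
// words, where the low-order word comes first and the high-order word second.
inline uint64_t combineLiteralWords(uint32_t lowWord, uint32_t highWord) {
  return (static_cast<uint64_t>(highWord) << 32) | lowWord;
}
```

Printing only `lowWord` would show the wrong value for any 64-bit constant whose high-order word is non-zero.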
Depends on D156049
This adds tests for fixed and scalable vectors where we have a binary op
on two splats that could be scalarized. Normally this would be
scalarized in the middle-end by VectorCombine, but as noted in
https://reviews.llvm.org/D159190, this pattern can crop up during
CodeGen afterwards.
Note that a combine already exists for this, but on RISC-V currently it
only works on scalable vectors where the element type == XLEN. See
#65068 and #65072
If all the concatenated subvectors are target shuffle nodes, then call combineX86ShufflesRecursively to attempt to combine them.
Unlike the existing shuffle concatenation in collectConcatOps, this isn't limited to splat cases and won't attempt to concat the source nodes prior to creating the larger shuffle node, so will usually only combine to create cross-lane shuffles.
This exposed a hidden issue in matchBinaryShuffle that wasn't limiting v64i8/v32i16 UNPACK nodes to AVX512BW targets.
Sample test case:
%3 = V_FMAC_F32_e32 killed %0, %1, %2, implicit $mode, implicit $exec
With LiveVariables this is converted to three-address form just because
there is no "killed" flag on %2. To make it do the same thing with
LiveIntervals I added a later use of %2:
%3 = V_FMAC_F32_e32 killed %0, %1, %2, implicit $mode, implicit $exec
S_ENDPGM 0, implicit %2
This PR does two things. One is to fix an internal compiler error when we
need to spill callee-saved registers but none of them is a GPR; the other is
to fix the wrong register number when the pushed registers are {ra, s0-s11}.