When ldp-aligned-only/stp-aligned-only is specified, modified to cancel
ldp/stp transformation if MachineMemOperand is not present or the access
size is unknown.
In the previous implementation, the test passed when there was no
MachineMemOperand. Also, if the size was unknown, an incorrect value was
used or an assertion failed. (But actually, if there is no
MachineMemOperand, it will be excluded from the target by
isCandidateToMergeOrPair() before reaching the part.)
A statistic NumFailedAlignmentCheck is added. NumPairCreated is modified
so that it only counts if it is not canceled.
These test cases show (op (ext a), (ext b)) patterns where the dest EEW is
more than 2 * source EEW. These could be lowered into widening ops where we
still have extend the operands, but at a smaller EEW.
This restores commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.
Previously reverted in f010b1bef4dda2c7082cbb41dbabf1f149cce306.
LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:
1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
control intrinsics.
2. Model token values as untyped virtual registers in MIR.
The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.
The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
There were a number of issues that needed to be addressed:
- i64 to bf16 did not correctly round
- strict rounding needed to yield a chain
- fastisel did not have logic to bail on bf16
When reference-types feature is enabled, forcing mem2reg unconditionally
even in `-O0` has some problems described in #81575. This uses
RefTypeMem2Local pass added in #81965 instead. This also removes
`IsForced` parameter added in
890146b192
given that we don't need it anymore.
This may still hurt debug info related to reference type variables a
little during the backend transformation given that they are not stored
in memory anymore, but reference type variables are presumably rare and
it would be still a lot less damage than forcing mem2reg on the whole
program. Also this fixes the EH problem described in #81575.
Fixes#81575.
We were setting based on whether the FP value is positive/negative, but
we really want to know whether the resulting integer will be treated as
a signed or unsigned value. Since we use SINT_TO_FP to convert the
integer to FP, we should always used signed here.
Without this we convert +2147483648.0 to an integer 0x80000000 and
convert it using sint_to_fp which produces -2147483648.0.
Clang sets the nonlazybind attribute for certain ObjC features. The
AArch64 SelectionDAG implementation for non-intrinsic calls
(46e36f0953aabb5e5cd00ed8d296d60f9f71b424) is behind a cl option.
GCC implements -fno-plt for a few ELF targets. In Clang, -fno-plt also
sets the nonlazybind attribute. For SelectionDAG, make the cl option not
affect ELF so that non-intrinsic calls to a dso_preemptable function use
GOT. Adjust AArch64TargetLowering::LowerCall to handle intrinsic calls.
For FastISel, change `fastLowerCall` to bail out when a call is due to
-fno-plt.
For GlobalISel, handle non-intrinsic calls in CallLowering::lowerCall
and intrinsic calls in AArch64CallLowering::lowerCall (where the
target-independent CallLowering::lowerCall is not called).
The GlobalISel test in `call-rv-marker.ll` is therefore updated.
Note: the current -fno-plt -fpic implementation does not use GOT for a
preemptable function.
Link: #78275
Pull Request: https://github.com/llvm/llvm-project/pull/78890
This change implements: #70072
- `hlsl_intrinsics.h` - add the `exp` api
- `DXIL.td` - add the llvm intrinsic to DXIL opcode lowering mapping.
- This change reuses llvm's existing intrinsic
`__builtin_elementwise_exp` \ `int_exp` & `__builtin_elementwise_exp2` \
`int_exp2`
- This PR is part 1 of 2.
- Part 2 requires an intrinsic to instructions lowering.
Part2 will expand `int_exp` to
```
A = Builder.CreateFMul(log2eConst, val);
int_exp2(A)
```
just like we do in
[TranslateExp](https://github.com/microsoft/DirectXShaderCompiler/blob/main/lib/HLSL/HLOperationLower.cpp#L2220C1-L2236C2)
This change implements #83736
The dot product lowering needs a tertiary multipy add operation. DXIL
has three mad opcodes for `fmad`(46), `imad`(48), and `umad`(49). Dot
product in DXIL only uses `imad`\ `umad`, but for completeness and
because the hlsl `mad` intrinsic requires it `fmad` was also included.
Two new intrinsics were needed to be created to complete this change.
the `fmad` case already supported by llvm via `fmuladd` intrinsic.
- `hlsl_intrinsics.h` - exposed mad api call.
- `Builtins.td` - exposed a `mad` builtin.
- `Sema.h` - make `tertiary` calls check for float types optional.
- `CGBuiltin.cpp` - pick the intrinsic for singed\unsigned & float also
reuse `int_fmuladd`.
- `SemaChecking.cpp` - type checks for `__builtin_hlsl_mad`.
- `IntrinsicsDirectX.td` create the two new intrinsics for
`imad`\`umad`/
- `DXIL.td` - create the llvm intrinsic to `DXIL` opcode mapping.
---------
Co-authored-by: Farzon Lotfi <farzon@farzon.com>
When ldp-aligned-only/stp-aligned-only is specified, modified to cancel
ldp/stp transformation if MachineMemOperand is not present or the access
size is unknown.
In the previous implementation, the test passed when there was no
MachineMemOperand. Also, if the size was unknown, an incorrect value was
used or an assertion failed. (But actually, if there is no
MachineMemOperand, it will be excluded from the target by
isCandidateToMergeOrPair() before reaching the part.)
A statistic NumFailedAlignmentCheck is added. NumPairCreated is modified
so that it only counts if it is not cancelled.
… AAPCS frame chain fix (#82801)"
This reverts commit 00e4a4197137410129d4725ffb82bae9ce44bdde. This patch
was found to cause miscompilations and compilation failures.
The promote alloca to vector transformation assumes that the
vector index is a constant value. If it is not a constant, then
either an assert occurs or the tranformation generates an
incorrect index.
Previously SelectionDAGBuilder used ABI alignment for
compressstore/expandload. This patch allows SelectionDAGBuilder to use
parameter alignment like vp intrinsics. This does not follow the
original code to default use vector type alignment, since it is possible
implemented to unaligned vector alignment.
An EXTRACT_VECTOR_ELT can extend the element to the width of its result
type, leaving the high bits undefined. Previously if we attempted to
query the bytes in these high bits we would recurse and hit an
assertion. This fixes it by bailing if the index is outside of the
vector element size.
I think the assertion Index < ByteWidth may still be incorrect, since
ByteWidth is calculated from Op.getValueSizeInBits(). I believe this
should be Op.getScalarValueSizeInBits() whenever VectorIndex is set
since we're querying the element now, not the vector. But I couldn't
think of a test case to trigger it. It can be addressed in a follow-up
patch.
Fixes#83920
We handle 3 register classes for x87 - rfp32, rfp64 and rfp80.
1. X87 registers are assigned a pseudo register class. We need a new
register bank for these classes.
2. We add bank information and enums for these.
3. Legalizer is updated to allow 32b and 64b even in absence of SSE.
This is required because with SSE enabled, x86 doesn't use fp stack and
instead uses SSE registers.
4. Functions in X86RegisterBankInfo need to decide whether to use the
pseudo classes or SSE-enabled classes. I add MachineInstr as an argument
to static helper function getPartialMappingIdx to this end.
Add/Update tests.
The SelectionDAG scheduling preference now becomes source order
scheduling (machine scheduler generates better code -- even without
there being a machine model defined for LoongArch yet).
Most of the test changes are trivial instruction reorderings and
differing register allocations, without any obvious performance impact.
This is similar to commit: 3d0fbafd0bce43bb9106230a45d1130f7a40e5ec
To prevent tests from breaking, another fix had to be made: Now, we
check if the instruction after a waiting instruction is a call, and if
so, we insert the wait.
Add SPIR-V backend support for the HLSL SV_DispatchThreadID semantic
attribute, which is lowered to a @llvm.dx.thread.id intrinsic in LLVM
IR. In the SPIR-V backend, this is now correctly translated to a
`GlobalInvocationId` builtin variable.
Fixes#82534
Changes in Recommit:
1) Fix non-determanism by using `SmallMapVector` instead of
`SmallPtrSet`.
2) Fix bug in dependency pruning where we discounted the actual
`and/or` combining the two conditions. This lead to over pruning.
Closes#81689
This reverts commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.
Reason: Broke the sanitizer buildbots. See the comments at
https://github.com/llvm/llvm-project/pull/71785
for more information.
These builtins are already there in Clang, however current codegen may
produce suboptimal results due to their complex behavior. Implement them
as intrinsics to ensure expected instructions are emitted.
When code for M class architecture was compiled with AAPCS and PAC
enabled, the frame pointer, r11, was not pushed to the stack adjacent to
the link register. Due to PAC being enabled, r12 was placed between r11
and lr. This patch fixes this by adding an extra case to the already
existing code that splits the GPR push in two when R11 is the frame
pointer and certain paremeters are met. The differential revision for
this previous change can be found here:
https://reviews.llvm.org/D125649. This now ensures that r11 and lr are
pushed in a separate push instruction to the other GPRs when PAC and
AAPCS are enabled, meaning the frame pointer and link register are now
pushed onto the stack adjacent to each other.
This PR is to add explicit support for SPV_KHR_float_controls
(https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/KHR/SPV_KHR_float_controls.asciidoc).
This extension is included into SPIR-V after version 1.4, but in case of
lower versions it is to be included explicitly and OpExtension must be
present in the module with `OpExtension "SPV_KHR_float_controls"`.
This PR fixes this issue and fixes the test case
test/CodeGen/SPIRV/exec_mode_float_control_khr.ll to account for a
version lower than 1.4.
This PR is to fix a way how SPIR-V Backend describes legality of
OpBitcast instruction and how it is validated on a step of instruction
selection. Instead of checking a size of virtual registers (that makes
no sense due to lack of guarantee of direct relations between size of
virtual register and bit width associated with the type size), this PR
allows to legalize OpBitcast without size check and postpones validation
to the instruction selection step.
As an example, let's consider the next example that was copied as is
from a bigger test suite:
```
%355:id(s16) = G_BITCAST %301:id(s32)
%303:id(s16) = ASSIGN_TYPE %355:id(s16), %349:type(s32)
%644:fid(s32) = G_FMUL %645:fid, %646:fid
%301:id(s32) = ASSIGN_TYPE %644:fid(s32), %40:type(s32)
```
Without the PR this leads to a crash with complains to an illegal
bitcast, because %355 is s16 and %301 is s32. However, we must check not
virtual registers in this case, but types of %355 and %301, i.e.,
%349:type(s32) and %40:type(s32), which are perfectly well compatible in
a sense of OpBitcast in this case.
In a test case that is a part of this PR OpBitcast is legal, being
applied for `OpTypeInt 16` and `OpTypeFloat 16`, but would not be
legalized without this PR due to virtual registers defined as having
size 16 and 32.
This PR is to add vector reduction instructions according to
https://llvm.org/docs/GlobalISel/GenericOpcode.html#vector-reduction-operations
and widen in such a way a range of successful supported conversions,
covering new cases of vector reduction instructions which IRTranslator
is unable to resolve.
By legalizing vector reduction instructions we introduce a new
instruction patterns that should be addressed, including patterns that
are delegated to pre-legalize step. To address this problem, a new pass
is added that is to bring newly generated instructions after
legalization to an aspect required by instruction selection.
Expected overheads for existing cases is minimal, because a new pass is
working only with newly introduced instructions, otherwise it's just a
additional code traverse without any actions.
Original commit 79889734b940356ab3381423c93ae06f22e772c9.
Perviously reverted in commit a2afcd5721869d1d03c8146bae3885b3385ba15e.
LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:
1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
control intrinsics.
2. Model token values as untyped virtual registers in MIR.
The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.
The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
Previously we incorrectly removed the scalar load store pair here assuming it
was dead, when it actually aliased with the memset. This showed up as a
miscompile on SPEC CPU 2017 when compiling with -mrvv-vector-bits, and was only
triggered by the changes in #75531. This was fixed in #83017, but this patch
adds a test case for this specific miscompile.
For reference, the incorrect codegen was:
vsetvli a1, zero, e8, m4, ta, ma
vmv.v.i v8, 0
vs4r.v v8, (a0)
addi a1, a0, 80
vsetivli zero, 16, e8, m1, ta, ma
vmv.v.i v8, 0
vs1r.v v8, (a1)
addi a0, a0, 64
vs1r.v v8, (a0)