This looks like a rather weird change, so let me explain why this isn't
as unreasonable as it looks. Let's start with the problem it's solving.
```
define signext i32 @overlap_live_ranges(ptr %arg, i32 signext %arg1) {
bb:
  %i = icmp eq i32 %arg1, 1
  br i1 %i, label %bb2, label %bb5
bb2:                                   ; preds = %bb
  %i3 = getelementptr inbounds nuw i8, ptr %arg, i64 4
  %i4 = load i32, ptr %i3, align 4
  br label %bb5
bb5:                                   ; preds = %bb2, %bb
  %i6 = phi i32 [ %i4, %bb2 ], [ 13, %bb ]
  ret i32 %i6
}
```
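For reference, the IR above corresponds roughly to C++ along these lines (an approximation for illustration, not taken from the PR):
```
int overlap_live_ranges(int *p, int x) {
  // p[1] is only loaded on the x == 1 path; otherwise the constant 13 is
  // returned, matching the phi in %bb5.
  return x == 1 ? p[1] : 13;
}
```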
Right now, we codegen this as:
```
li a3, 1
li a2, 13
bne a1, a3, .LBB0_2
lw a2, 4(a0)
.LBB0_2:
mv a0, a2
ret
```
In this example, we have two values which must be assigned to a0 per the
ABI (%arg and the return value). SelectionDAG ensures that all values
used in a successor phi are defined before exiting the predecessor block.
This creates an ADDI to materialize the immediate in the entry block.
Currently, this ADDI is not sunk into the tail block because we'd have
to split a critical edge to do so. Note that if our immediate were
anything large enough to require two instructions we *would* split this
critical edge.
Looking at other targets, we notice that they don't seem to have this
problem. They perform the sinking and tail duplication that we don't.
Why? Well, it turns out for AArch64 that this is entirely an accident of
the existence of the gpr32all register class. The immediate is
materialized into the gpr32 class, and then copied into the gpr32all
register class. The existence of that copy puts us right back into the
two-instruction case noted above.
This change essentially bypasses that emergent aspect of the AArch64
behavior, and implements the same "always sink immediates" behavior for
RISC-V as well.
We are using `PostMachineScheduler` instead of `PostRAScheduler`
since #68696.
The hook `getPostRAMutations` is only used in `PostRAScheduler` so
it is actually dead code for RISC-V now.
Add patterns to select 16b imulzu with -mapx-feature=zu, including
folding of zero-extends of the result. IsDesirableToPromoteOp is changed
to leave 16b multiplies by constant un-promoted, as imulzu will not
cause partial-write stalls.
A special case in type legalization wasn't accounting for the different
operand numbering between FLDEXP and STRICT_FLDEXP.
AArch64 already asked for STRICT_FLDEXP to be promoted, but had no test
for it.
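The difference comes down to the chain operand: strict FP nodes carry their chain as operand 0, which shifts the value and exponent operands by one. A sketch of the idea (illustrative only, not the actual legalizer code):
```
#include <utility>
#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Pick out the value and exponent operands of an (STRICT_)FLDEXP node.
// The strict form has a chain at operand 0, shifting the others by one.
static std::pair<SDValue, SDValue> getFldexpOperands(SDNode *N) {
  bool IsStrict = N->getOpcode() == ISD::STRICT_FLDEXP;
  SDValue Val = N->getOperand(IsStrict ? 1 : 0);
  SDValue Exp = N->getOperand(IsStrict ? 2 : 1);
  return {Val, Exp};
}
```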
Introduces the `__builtin_hlsl_buffer_update_counter` clang builtin that is
used to implement the `IncrementCounter` and `DecrementCounter` methods
on `RWStructuredBuffer` and `RasterizerOrderedStructuredBuffer` (see
Note).
The builtin is translated to the LLVM intrinsic `llvm.dx.bufferUpdateCounter`
or `llvm.spv.bufferUpdateCounter`.
Introduces a `BuiltinTypeMethodBuilder` helper in `HLSLExternalSemaSource`
that enables adding methods to builtin types using a builder pattern like
this:
```
BuiltinTypeMethodBuilder(Sema, RecordBuilder, "MethodName", ReturnType)
.addParam("param_name", Type, InOutModifier)
.callBuiltin("builtin_name", { BuiltinParams })
.finalizeMethod();
```
Fixes #113513
[First version](llvm/llvm-project#114148) of this PR was reverted
because of a build break.
For
  add %reg1, name@GOTTPOFF(%rip), %reg2
  add name@GOTTPOFF(%rip), %reg1, %reg2
  {nf} add %reg1, name@GOTTPOFF(%rip), %reg2
  {nf} add name@GOTTPOFF(%rip), %reg1, %reg2
  {nf} add name@GOTTPOFF(%rip), %reg
add `R_X86_64_CODE_6_GOTTPOFF` = 50 if the instruction starts 6 bytes
before the relocation offset. It's similar to R_X86_64_GOTTPOFF.
The linker can treat `R_X86_64_CODE_6_GOTTPOFF` as `R_X86_64_GOTTPOFF`, or
convert the instructions above to
  add $name@tpoff, %reg1, %reg2
  add $name@tpoff, %reg1, %reg2
  {nf} add $name@tpoff, %reg1, %reg2
  {nf} add $name@tpoff, %reg1, %reg2
  {nf} add $name@tpoff, %reg
if the first byte of the instruction at the relocation `offset - 6` is
`0xd5` (namely, it is encoded with a REX2 prefix), when possible.
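A rough sketch of the check a linker could apply, with hypothetical names (not lld's actual code):
```
#include <cstddef>
#include <cstdint>

// The candidate instruction is assumed to start 6 bytes before the relocation
// offset and to begin with the REX2 prefix byte 0xd5; only then may the
// GOTTPOFF load be rewritten into the TPOFF immediate form shown above.
static bool canRelaxCode6GotTpOff(const uint8_t *sectionData,
                                  size_t relocOffset) {
  if (relocOffset < 6)
    return false;
  return sectionData[relocOffset - 6] == 0xd5;
}
```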
Binutils patch:
5bc71c2a6b
Binutils mail thread:
https://sourceware.org/pipermail/binutils/2024-February/132351.html
ABI discussion:
https://groups.google.com/g/x86-64-abi/c/FhEZjCtDLFw/m/VHDjN4orAgAJ
Blog: https://kanrobert.github.io/rfc/All-about-APX-relocation
Currently, the ShaderFlagsAnalysis pass represents various module-level
properties as well as function-level properties of a DXIL module using a
single mask. However, one mask per function is needed for accurate
computation of the shader flags mask, such as for entry function metadata
creation.
This change introduces a structure that wraps a sorted vector of
function/shader-flag-mask pairs representing function properties,
instead of a single shader flag mask that represents module properties
and the properties of all functions. The result type of the
ShaderFlagsAnalysis pass is changed to the newly defined structure type
instead of a single shader flags mask.
This allows accurate computation of the shader flags of an entry function
(and of all functions in a library shader) for use during its metadata
generation (DXILTranslateMetadata pass) and for its feature flags in DX
container globals construction (DXContainerGlobals pass), based on the
shader flags masks of the functions. Note, however, that propagating such
callee-based shader flags mask computation is planned for a follow-on PR.
Consequently, this PR changes the shader flag mask computation in the
DXILTranslateMetadata and DXContainerGlobals passes to simply be the
union of the module flags and the shader flags of all functions, thereby
retaining the existing effect of using a single shader flag mask.
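A minimal sketch of what such a result type could look like (simplified, hypothetical types rather than the actual DXIL analysis code):
```
#include <cstdint>
#include <vector>

// Stand-in for the per-function shader flags result: each function gets its
// own mask, and consumers that still want one module-wide mask union them all.
struct FunctionShaderFlags {
  const void *Function; // stand-in for the llvm::Function pointer
  uint64_t Flags;       // shader flags derived from this function's body
};

struct ModuleShaderFlagsInfo {
  uint64_t ModuleFlags = 0;                       // module-level properties
  std::vector<FunctionShaderFlags> FunctionFlags; // kept sorted by function

  // Union of module flags and all per-function flags, matching the existing
  // behavior of DXILTranslateMetadata / DXContainerGlobals for now.
  uint64_t getCombinedMask() const {
    uint64_t Mask = ModuleFlags;
    for (const FunctionShaderFlags &FF : FunctionFlags)
      Mask |= FF.Flags;
    return Mask;
  }
};
```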
OPSEL ASM syntax for v_cvt_scalef32_pk_{f|bf}16_fp4: opsel:[x,y,z],
where x & y, i.e. OPSEL[1:0], select which src_byte to read.
Note: the conventional Inst{13}, i.e. OPSEL[2], is ignored in the asm syntax.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
We can move the logic from adjustStackForRVV into adjustReg, which
results in the remaining logic being trivially inlined into the two
callers and allows a duplicate copy of the same logic in
eliminateFrameIndex to be pruned.
In the DXIL CreateHandle and CreateHandleFromBinding ops, resource
bindings are indexed from the beginning of the binding space, not from
the binding itself.
Translate from an index into the binding to an index from the beginning
of the space when lowering to these operations.
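A minimal sketch of that index translation, with hypothetical names (the real change operates on DXIL resource bindings during lowering):
```
#include <cstdint>

// CreateHandle/CreateHandleFromBinding expect an index measured from the
// start of the binding space, so an index relative to the binding itself is
// offset by the binding's lower bound.
static uint32_t toSpaceRelativeIndex(uint32_t bindingLowerBound,
                                     uint32_t indexIntoBinding) {
  return bindingLowerBound + indexIntoBinding;
}
```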
OPSEL ASM syntax for v_cvt_scalef32_pk_f32_fp4: opsel:[x,y,z],
where x & y, i.e. OPSEL[1:0], select which src_byte to read.
OPSEL ASM syntax for v_cvt_scalef32_pk_fp4_f32: opsel:[a,b,c,d],
where c & d, i.e. OPSEL[3:2], select which dst_byte to write.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
The case of a 2-pass XDL write to a VGPR that is read by a non-XDL
SGEMM/DGEMM was one wait state overly conservative. Previously, for
gfx940, the XDL and non-XDL cases happened to have the same number of
cycles in all cases. Now the XDL consumer case has an additional state
for 2-pass sources.
This is split off from #115274. There doesn't seem to be an easy way to
share this with getShuffleCost since that requires passing in a real
insert_element operand to get it to recognise it's a scalar splat.
We can't currently lower i1 vectors, so this returns an invalid cost for
them.
Co-authored-by: Shih-Po Hung <shihpo.hung@sifive.com>
For IR like this:
  %icmp = icmp ult <4 x i32> %a, splat (i32 5)
  %res = extractelement <4 x i1> %icmp, i32 1
where there is only one use of %icmp, we can take a similar approach
to what we already do for binary ops such as add, sub, etc. and convert
this into:
  %ext = extractelement <4 x i32> %a, i32 1
  %res = icmp ult i32 %ext, 5
For AArch64 targets at least, the scalar boolean result will almost
certainly need to be in a GPR anyway, since it will probably be
used by branches for control flow. I've tried to reuse the existing code
in scalarizeExtractedBinop to also work for setcc.
NOTE: The optimisations don't apply for tests such as
extract_icmp_v4i32_splat_rhs in the file
CodeGen/AArch64/extract-vector-cmp.ll
because scalarizeExtractedBinOp only works if one of the input
operands is a constant.
Create signed constants using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().
This also touches some generic legalization code, which apparently is
only exercised by AMDGPU tests.
This will avoid assertion failures once we disable implicit truncation
in getConstant().
Inside adjustSubwordCmp() I ended up suppressing the issue with an
explicit cast, because this code deals with a mix of unsigned and signed
immediates.
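A plain C++ illustration of why negative immediates need the signed path (this is not the SelectionDAG API, just the range check in spirit):
```
#include <cassert>
#include <cstdint>

int main() {
  // getConstant() takes a uint64_t, so -1 arrives as 0xFFFFFFFFFFFFFFFF and
  // only "fits" an i32 constant if it is implicitly truncated.
  uint64_t AsUnsigned = static_cast<uint64_t>(-1);
  assert(!(AsUnsigned <= 0xFFFFFFFFu)); // does not fit 32 bits unsigned

  // Treated as a signed value, -1 is a perfectly valid i32 immediate, which
  // is what getSignedConstant() expresses.
  int64_t AsSigned = -1;
  assert(AsSigned >= INT32_MIN && AsSigned <= INT32_MAX);
  return 0;
}
```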
This patch moves the `areInlineCompatible` implementation from multiple
subclasses (`AArch64TTIImpl`, `RISCVTTIImpl`, `WebAssemblyTTIImpl`) to
the base class `BasicTTIImpl`. The new implementation checks whether the
callee's target features are a subset of the caller's, enabling
consistent behavior across targets. Subclasses now simply delegate to
the base implementation, reducing code duplication and improving
maintainability.
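The subset check boils down to something like this sketch (plain C++ with hypothetical names, not the actual BasicTTIImpl code):
```
#include <set>
#include <string>

// Inlining is compatible when every target feature the callee was compiled
// with is also enabled in the caller; otherwise the callee might rely on
// instructions the caller's function-level features do not guarantee.
static bool featuresAreSubset(const std::set<std::string> &CallerFeatures,
                              const std::set<std::string> &CalleeFeatures) {
  for (const std::string &Feature : CalleeFeatures)
    if (CallerFeatures.count(Feature) == 0)
      return false;
  return true;
}
```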
Based on https://github.com/intel/perfmon/issues/149, the documentation
is incorrect and the pfm counter names are actually correct. This patch
adjusts the Alder Lake scheduling model to match the performance counter
naming, i.e. the correct naming that will soon be reflected in the
optimization manual.
This fixes part of #117360.
Based on intel/perfmon#149, the documentation is incorrect and the pfm
counter names are actually correct. This patch adjusts the
SapphireRapids scheduling model to match the performance counter naming,
i.e. the correct naming that will soon be reflected in the optimization
manual.
This fixes part of #117360.
Add a complete IvyBridge schedule (included in the SandyBridge model; IvyBridge was the first to support F16C) and split the rr/rm schedules, as they usually have very different port usage.
Haswell/Broadwell use Port1, not Port0.
Confirmed with a mixture of Agner + uops.info comparisons.
The folded load variants almost never require Port5 for length-changing conversions (just for the SNB ymm cases), and don't typically use an extra uop for the load.
Confirmed with a mixture of Agner + uops.info comparisons.