From #185382
Lower `vduph_lane_f16` and `vduph_laneq_f16` to `cir::VecExtractOp`
Tests moved from `v8.2a-neon-instrinsics-generic.c` to a new CIR-enabled
test file.
I followed the notes made in #185852 (BF16).
Previously, `computeProcResourceMasks()` would print resource masks in
debug mode from multiple call sites, creating noise in the debug output.
This patch fixes this and also prints more info about the
resources.
It splits the debug prints for resources into 2 types:
1. No simulation - mask only
2. Simulation - mask + other info
For 2, it does the printing in a single place, the `ResourceManager`
constructor, which should cover all the other simulation cases
indirectly:
1. `llvm/lib/MCA/HardwareUnits/ResourceManager` - covered
2. `llvm/lib/MCA/InstrBuilder.cpp` - should be covered indirectly - only
used by `llvm-mca` before simulation that constructs a `ResourceManager`
3. `llvm/tools/llvm-mca/Views/SummaryView.cpp` - after simulation that
constructs a `ResourceManager`
4. `llvm/tools/llvm-mca/Views/BottleneckAnalysis.cpp` - after simulation
that constructs a `ResourceManager`
It also adds `BufferSize` to the output, which should be useful to debug
scheduling model + MCA integration.
For 1, it inlines mask-only printing into 2 other callers:
1. `llvm/include/llvm/MCA/Stages/InstructionTables.h`
2. `llvm/tools/llvm-exegesis/lib/SchedClassResolution.cpp`
as they only use the masks there. I think this is a reasonable
duplication across distinguishably different users/tools.
Now every pair of callers, even across groups (1 and 2), effectively
prints in a mutually exclusive way.
The patch adds debug tests for the 3 new callers in the corresponding
root test directories, to encourage placing logically target-independent
tests that just require some target at the root. I think this convention
is more discoverable, and it is pretty widely used in the project.
So that it's preserved across all inline invocations rather than just
one inliner pass run.
This prevents cases where devirtualization in the simplification
pipeline uncovers inlining opportunities that should be discarded due to
inline history, but we dropped the inline history between inliner pass
runs, causing code size to blow up, sometimes exponentially.
For compile time reasons, we want to limit this to only call sites that
have the potential to inline through SCCs, potentially with the help of
devirtualization. This means that the callee is in a non-trivial
(Ref)SCC, or the call site was previously an indirect call, which can
potentially be devirtualized to call any function.
The CGSCCUpdater::InlinedInternalEdges logic still seems to be relevant
even with this change, as monster_scc.ll blows up if I remove that code.
http://llvm-compile-time-tracker.com/compare.php?from=e830d88e8ae5f44a97cc76136a0a4e83aa9157c0&to=ed535e732fc41b79ab8efda2417886cbd0812f7f&stat=instructions:u
Fixes #186926.
This is needed so that `allowsMemoryAccessForAlignment` checks for
unaligned vector memory support instead of unaligned scalar memory
support when called from `RISCVTargetLowering::expandUnalignedVPStore`.
While there, remove the incorrect setting of the truncating store flag
on the vector instruction, and restrict the transform to simple stores
since we don't have tests for volatile or atomic.
Fixes #189037
The RV32 macc*.h00 instructions take the lower half words from rs1 and
rs2, compute the full word product by extending the inputs, and
add to rd. The RV64 macc*.w00 is similar but operates on words
and produces a double word result.
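A minimal sketch of the RV32 half-word semantics described above, for a signed variant only. The function name `maccH00Signed` is made up for illustration, and the sign-extension choice is an assumption about one member of the macc*.h00 family, not a statement of the ISA encoding:

```cpp
#include <cstdint>

// Assumed semantics of a signed macc*.h00 variant on RV32: sign-extend the
// low 16 bits of rs1 and rs2, form the full 32-bit product, and add it to rd.
// The upper halves of rs1 and rs2 are ignored.
inline uint32_t maccH00Signed(uint32_t rd, uint32_t rs1, uint32_t rs2) {
  const int32_t a = static_cast<int16_t>(static_cast<uint16_t>(rs1));
  const int32_t b = static_cast<int16_t>(static_cast<uint16_t>(rs2));
  // The product of two 16-bit values always fits in 32 bits.
  return rd + static_cast<uint32_t>(a * b);
}
```

Because the upper halves are ignored, a plain `mul` + `add` can only be replaced when the inputs are known to be extended from 16 bits, which is why the pattern looks for sext_inreg/and or sufficient sign/zero bits.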
I've restricted this to the case where the multiply has a single use.
We don't have a general macc that multiplies the full xlen bits
of rs1 and rs2, so I'm allowing the input to be sext_inreg/and or
have sufficient sign/zero bits according to
ComputeNumSignBits/computeKnownBits.
We should also add mul*.h00/mul.*w00 patterns, but those we should
restrict to at least one input being sext_inreg/and and prefer
regular mul when there are no sext_inreg/and.
Add ThreeOp_v2i32_Pats pattern class to support v2i32 vector operations
for AND_OR_B32 and OR3_B32 instructions. The new patterns check for the
v2i32 and-or or or-or instruction sequence, extract the individual 32-bit
elements from the v2i32 operands, and apply the and_or or or3 vop3
operations.
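As a rough illustration of what the patterns compute, the sketch below models a v2i32 value as a two-element array and applies the three-operand operation per element. The helper names are hypothetical and this is plain C++, not TableGen or the actual selection code:

```cpp
#include <array>
#include <cstdint>

using V2I32 = std::array<uint32_t, 2>;

// Scalar three-operand forms corresponding to and_or and or3.
inline uint32_t andOrB32(uint32_t a, uint32_t b, uint32_t c) { return (a & b) | c; }
inline uint32_t or3B32(uint32_t a, uint32_t b, uint32_t c) { return a | b | c; }

// Extract each 32-bit element and apply the scalar operation to it.
inline V2I32 andOrV2I32(const V2I32 &a, const V2I32 &b, const V2I32 &c) {
  return V2I32{{andOrB32(a[0], b[0], c[0]), andOrB32(a[1], b[1], c[1])}};
}

inline V2I32 or3V2I32(const V2I32 &a, const V2I32 &b, const V2I32 &c) {
  return V2I32{{or3B32(a[0], b[0], c[0]), or3B32(a[1], b[1], c[1])}};
}
```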
This issue was discovered during some downstream work around Vulkan CTS
tests, specifically
`dEQP-VK.subgroups.arithmetic.compute.subgroupadd_float`
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
The MLIR LLVM dialect is missing support for several parameter
attributes that exist in LLVM IR: `writable`, `dead_on_unwind`,
`dead_on_return`, and `nofpclass`. This adds them to the kind-to-name
mapping in `AttrKindDetail.h` and the corresponding name accessors in
`LLVMDialect.td`.
The existing generic conversion infrastructure in `ModuleTranslation`
and `ModuleImport` picks them up automatically: `writable` and
`dead_on_unwind` round-trip as `UnitAttr`, while `dead_on_return` and
`nofpclass` round-trip as `IntegerAttr`.
CIR needs these to match classic codegen's ABI output (sret gets
`writable dead_on_unwind`, indirect args get `dead_on_return`, fast-math
FP args get `nofpclass`).
Add skip_tail_padding property to cir.copy to handle
potentially-overlapping subobject copies directly, instead of falling
back to cir.libc.memcpy. When set, the lowering uses the record's data
size (excluding tail padding) for the memcpy length. This keeps typed
semantics and promotability of cir.copy.
Also fix CXXABILowering to preserve op properties when recreating
operations, and expose RecordType::computeStructDataSize() for computing
data size of padded record types.
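To make the difference between a record's full size and its data size concrete, here is a small sketch in plain C++. The struct `Padded`, the constant `kPaddedDataSize`, and the helper `copySkippingTailPadding` are hypothetical illustrations, not CIR code, and the data size is approximated as the end of the last field:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Padded {
  int32_t x;
  char c;
  // Tail padding typically follows so that sizeof(Padded) is a multiple of 4.
};

// Assumed data size: the offset just past the last field, excluding tail padding.
constexpr std::size_t kPaddedDataSize = offsetof(Padded, c) + sizeof(char);

// Copies only the data bytes, leaving whatever occupies the destination's
// tail padding untouched.
inline void copySkippingTailPadding(unsigned char *dst, const Padded &src) {
  std::memcpy(dst, &src, kPaddedDataSize);
}
```

Leaving the tail bytes alone matters when another subobject may be laid out inside that padding, which is the potentially-overlapping case the property targets.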
Added vector intrinsics for
vshlq_n_s8
vshlq_n_s16
vshlq_n_s32
vshlq_n_s64
vshlq_n_u8
vshlq_n_u16
vshlq_n_u32
vshlq_n_u64
vshl_n_s8
vshl_n_s16
vshl_n_s32
vshl_n_s64
vshl_n_u8
vshl_n_u16
vshl_n_u32
vshl_n_u64
These cover all the vector intrinsics for constant shift.
The method followed:
1) The quad-word vectors have the forms `64x2`, `32x4`, `16x8`, and
`8x16`, and the shift amount is a constant. A vector shift left needs
both operands to be vectors, so the constant shift amount is converted
into a vector of the matching form (for `64x2`, the constant becomes a
`64x2` vector). This process is called a **splat**.
2) After the splat, the lhs and rhs have the same type, so the shift
left can be applied.
3) One remaining issue is that ops[0] does not have the right type: for
quad words it falls back to the default int8*16 in the function. It is
therefore bitcast to the required type; `8x16` and `64x2` have the same
total size, so the bitcast yields the vector in the right form.
Test cases were written for all the intrinsics listed above.
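The splat-then-shift steps above can be sketched in plain C++ with `std::array` standing in for the vector types. The helpers `splat`, `shl`, and `shlN` are hypothetical names for this illustration, and the shift amount is assumed to be smaller than the element width:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Step 1: turn a scalar into a vector with every lane set to that scalar.
template <typename T, std::size_t N>
std::array<T, N> splat(T value) {
  std::array<T, N> v{};
  v.fill(value);
  return v;
}

// Step 2: lane-wise shift left of two vectors with the same shape.
// The shift is done on the unsigned lane type to avoid signed-shift pitfalls.
template <typename T, std::size_t N>
std::array<T, N> shl(const std::array<T, N> &lhs, const std::array<T, N> &rhs) {
  using U = std::make_unsigned_t<T>;
  std::array<T, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = static_cast<T>(static_cast<U>(static_cast<U>(lhs[i]) << rhs[i]));
  return out;
}

// Shift by a constant: splat the constant, then shift vector by vector.
template <typename T, std::size_t N>
std::array<T, N> shlN(const std::array<T, N> &v, unsigned n) {
  return shl(v, splat<T, N>(static_cast<T>(n)));
}
```

The bitcast in step 3 has no counterpart here because the array already carries the right lane type.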
#185382
Some functions used `new`/`delete` to allocate/free arrays. To avoid
memory leaks, it would be better to avoid using raw pointers. This patch
replaces the use of them with `SmallVector`.
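A minimal sketch of the same kind of change, using `std::vector` as a stand-in for `SmallVector` so the example has no LLVM dependency. The function `sumOfSquares` is hypothetical and only shows the ownership pattern:

```cpp
#include <cstddef>
#include <vector>

// Before (sketch): long *squares = new long[n]; ... delete[] squares;
// Any early return between the two would leak the array.
// After: the container owns the storage and frees it on every exit path.
inline long sumOfSquares(std::size_t n) {
  std::vector<long> squares(n);
  for (std::size_t i = 0; i < n; ++i)
    squares[i] = static_cast<long>(i * i);
  long total = 0;
  for (long s : squares)
    total += s;
  return total;
}
```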
It seems the known bits handling added in
686987a540bc176bceaad43ffe530cb3e88796d5 is insufficient to perform many
range-based optimizations. For some reason computeConstantRange doesn't
fall back on KnownBits, and has a separate, less used form which tries
to use computeKnownBits.
No real challenge to these, it is effectively a copy/paste of the
classic codegen as it just requires we properly emit the holding
variable. The rest falls out of the rest of our handling of variables.
If the pointer for a reference is constexpr-unknown, use the pointer
itself instead of dereferencing it. Unfortunately, that means
constexpr-unknown pointers reach a lot more places than before.
64-bit AIX requires DWARF64 format, which was only introduced in DWARF
v3. DWARF v2 only supports 32-bit DWARF format, making it incompatible
with 64-bit AIX (the compiler throws a fatal error). These changes split
DWARF v2 tests into separate files that exclude 64-bit AIX targets while
still running on 32-bit AIX and other 64-bit platforms where DWARF v2 is
supported.
Successors outside of any loop do not contribute to the innermost loop;
skip them to avoid incorrect results due to
getSmallestCommonLoop(nullptr, X) returning nullptr.
Consider a pattern like `icmp (shl nsw X, L), (add nsw (shl nsw Y, L),
K)`. When the constant K is a multiple of 2^L, this can be simplified to
`icmp X, (add nsw Y, K >> L)`.
This patch extends canEvaluateShifted to support `Instruction::Add` and
updates its signature to accept `Instruction::BinaryOps` instead of a
boolean. This change allows the function to distinguish between LShr and
AShr requirements, ensuring that information is preserved according to
the signedness and overflow flags (nsw/nuw) of the operands.
The logic is integrated into `foldICmpCommutative` to enable peeling off
matching shifts from both sides of a comparison even when an offset is
present.
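The arithmetic behind the fold can be checked with a small brute-force sketch. It uses the signed less-than predicate as one example, models `shl` by multiplication, and relies on small inputs so that nothing overflows (standing in for the nsw flags). The function names are hypothetical:

```cpp
#include <cstdint>

// Compares the original and folded forms for one set of inputs.
// Assumes no intermediate value overflows.
inline bool foldAgrees(int64_t x, int64_t y, int64_t k, unsigned l) {
  const int64_t scale = int64_t{1} << l;
  const bool original = (x * scale) < (y * scale + k); // icmp slt (shl X, L), (add (shl Y, L), K)
  const bool folded = x < (y + k / scale);             // icmp slt X, (add Y, K >> L)
  return original == folded;
}

// Exhaustively checks a small range where K is always a multiple of 2^L.
inline bool foldAgreesOnSmallRange() {
  for (unsigned l = 0; l <= 3; ++l)
    for (int64_t x = -8; x <= 8; ++x)
      for (int64_t y = -8; y <= 8; ++y)
        for (int64_t m = -4; m <= 4; ++m)
          if (!foldAgrees(x, y, m * (int64_t{1} << l), l))
            return false;
  return true;
}
```

The multiple-of-2^L requirement is essential: with X = 1, Y = 0, K = 5, L = 2 the original comparison is `4 < 5` (true) while the folded form is `1 < 1` (false), so `foldAgrees(1, 0, 5, 2)` returns false.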
Fixes: #163110
computeBestVF iterates over all VPlans and picks the VF of the most
profitable VPlan. This VPlan is later needed for execution and
additional checks. Instead of retrieving it multiple times later, just
directly return it from computeBestVF.
This removes some redundant lookups.
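The shape of the change can be sketched as follows. `PlanSketch` and `computeBestVFSketch` are hypothetical stand-ins, not the real VPlan types or the real cost model; the sketch assumes a lower cost means a more profitable VF:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a VPlan: the VFs it covers and a cost per VF.
struct PlanSketch {
  std::vector<unsigned> VFs;
  std::vector<double> Costs; // Costs[i] is the assumed cost of VFs[i]
};

// Returns the cheapest VF together with the plan that contains it, so
// callers do not have to look the plan up again. Returns {0, nullptr}
// if there are no candidates.
inline std::pair<unsigned, const PlanSketch *>
computeBestVFSketch(const std::vector<PlanSketch> &Plans) {
  unsigned BestVF = 0;
  const PlanSketch *BestPlan = nullptr;
  double BestCost = 0.0;
  for (const PlanSketch &Plan : Plans) {
    for (std::size_t I = 0; I < Plan.VFs.size(); ++I) {
      if (!BestPlan || Plan.Costs[I] < BestCost) {
        BestVF = Plan.VFs[I];
        BestCost = Plan.Costs[I];
        BestPlan = &Plan;
      }
    }
  }
  return {BestVF, BestPlan};
}
```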
PR: https://github.com/llvm/llvm-project/pull/190385
This change adds the following NVVM Ops for new narrow FP conversions
introduced in PTX 9.1:
- `convert.{f32x2/bf16x2}.to.s2f6x2`
- `convert.s2f6x2.to.bf16x2`
- `convert.bf16x2.to.f8x2` (extended for `f8E4M3FN` and `f8E5M2` types)
- `convert.{f16x2/bf16x2}.to.f6x2`
- `convert.{f16x2/bf16x2}.to.f4x2`
PTX ISA Reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt
perf2bolt generates empty fdata files for small binaries. Right now
BOLT checks for this while parsing by calling `((!hasBranchData() &&
!hasMemData()))`. Instead, exit early with an error message as soon as
the buffer finishes reading the data file.
This fixes the `Host compiler does not support '-fuse-ld=lld'` error when
cross-building libclc for a GPU target. The CMake configure command is:
-DRUNTIMES_amdgcn-amd-amdhsa-llvm_LLVM_ENABLE_RUNTIMES=libclc \
-DLLVM_RUNTIME_TARGETS="amdgcn-amd-amdhsa-llvm"
libclc targets only support offload target cross-builds and can't link a
host executable, so the configuration error is a false positive for
offload.
This PR adds a baseline test that first checks whether the target can
link an executable. If it fails (typical for GPU/offload), we skip the
custom linker validation.
Additional test coverage for loops not yet supported, with sinkable
find-iv expressions (github.com/llvm/llvm-project/pull/183911) and uses
of the IV.
PR: https://github.com/llvm/llvm-project/pull/190548
Match the select operands directly against PhiR using m_Specific,
binding only the non-phi IV expression. This replaces the generic
TrueVal/FalseVal matching followed by an assert and conditional
extraction.
Split off from approved
https://github.com/llvm/llvm-project/pull/183911/ as suggested.
In the non-ARM case, the offset was left unset, so the symbol
synthesized for the entry point pointed to the start of the containing
section.
As a drive-by change, simplify offset adjustment in ARM case.
The LLVM Coding Standards [1] specify that:
> [T]o match error message styles commonly produced by other tools,
> start the first sentence with a lowercase letter, and finish the last
> sentence without a period, if it would end in one otherwise.
Historically, that hasn't been something we've enforced in LLDB, but in
the past year or so I've started to pay more attention to this in code
reviews. This PR brings more error messages into compliance, further
increasing consistency.
I also adopted `createStringErrorV` where it improved the code as a
drive-by for lines I was already touching.
[1] https://llvm.org/docs/CodingStandards.html#error-and-warning-messages
Assisted-by: Claude Code
Simplify the sentinel checking logic by using APSInt and checking for
both a signed and unsigned sentinel in a single call.
Removes the IsSigned argument.
Split off from approved
https://github.com/llvm/llvm-project/pull/183911/ as suggested.
…ns (NFC).
Use the more descriptive name FindLastSelect for the conditional select
that picks between the reduction phi and the IV value.
Split off from approved
https://github.com/llvm/llvm-project/pull/183911/ as suggested.