Adds the fwidth intrinsic for HLSL.
The DXIL path only requires modification to the hlsl headers.
The SPIRV path implements the OpFwidth builtin in Clang and instruction
selection for the OpFwidth instruction in LLVM.
Also adds shader stage tests to the ddx_coarse and ddy_coarse
instructions used by fwidth.
Closes#99120
---------
Co-authored-by: Alexander Johnston <alexander.johnston@amd.com>
While SVE support for exception safe floating point code generation is
bare bones we try to ensure inactive lanes remiain inert. I mistakenly
broke this rule when adding support for SVE-B16B16 by lowering some
bfloat operations of unpacked vectors to unpredicated instructions.
Enables constexpr evaluation for the following AVX512 Instrinsics:
```
_mm_movepi8_mask _mm256_movepi8_mask _mm512_movepi8_mask
_mm_movepi16_mask _mm256_movepi16_mask _mm512_movepi16_mask
_mm_movepi32_mask _mm256_movepi32_mask _mm512_movepi32_mask
_mm_movepi64_mask _mm256_movepi64_mask _mm512_movepi64_mask
```
Part of #162072
This allows DemandedBits to see that the SVE CNTP intrinsic will only
ever produce small positive integers. The maximum value you could get
here is 256, which is CNTP on a nxv16i1 on a machine with a 2048bit
vector size (the maximum for SVE).
Using this various redundant operations (zexts, sexts, ands, ors, etc)
can be eliminated.
Avoid some repeated feature blocks - we should have a single place in
each file that we can find most builtins for a particular ISA level.
Also, avoid some of the 80col wrapping that just makes it harder to find
anything at all.
There's a lot more we can do - but I don't want to completely refactor
this while we still have so much work to do for #30794
A pattern of the form reduce.add(ext(mul)) is valid for a partial
reduction as long as the mul and its operands fulfill the requirements
of a normal partial reduction. The mul's extend operands will be
optimised to the wider extend, and we already have oneUse checks in
place to make sure the mul and operands can be modified safely.
1. -> https://github.com/llvm/llvm-project/pull/165536
2. https://github.com/llvm/llvm-project/pull/165543
In functions that have been seriously deformed during optimisation,
there can be call instructions with line-zero immediately after frame
setup (see C reproducer in the test added). Our previous algorithms for
prologue_end ignored these, meaning someone entering a function at
prologue_end would break-in after a function call had completed. Prefer
instead to place prologue_end and the function scope-line on the line
zero call: this isn't false (it's the first meaningful instruction of the
function) and is approximately true. Given a less than ideal function,
this is an OK solution.
Interfaces can be optional: whether an op implements an interface or not
can depend on the state of the operation.
```
// An optional code block for adding additional "classof" logic. This can
// be used to better enable "optional" interfaces, where an entity only
// implements the interface if some dynamic characteristic holds.
// `$_attr`/`$_op`/`$_type` may be used to refer to an instance of the
// interface instance being checked.
code extraClassOf = "";
```
The current `Pass::canScheduleOn(RegisteredOperationName)` is
insufficient. This commit adds an additional overload to inspect
`Operation *`.
This commit fixes a crash when scheduling an `InterfacePass` for an
optional interface on an operation that does not actually implement the
interface.
Consider a newly added "malloc_span" attribute in the allocation token
instrumentation to ensure that allocation functions with the
"malloc_span" attribute are processed similarly to other memory
allocation functions.
Update the tests to demonstrate applicability to __size_returning_new.
See if we can create a vector load from the src elements in reverse and
then shuffle these back into place.
SLP will (usually) catch this in the middle-end, but there are a few
BUILD_VECTOR scalarizations etc. that appear during DAG legalization.
I did start looking at a more general permute fold, but I haven't found
any good test examples for this yet - happy to take another look if
somebody has examples.
Treat it in the same manner of zero_extend_vector_inreg and generate an
extend_low_u if possible. This is to try an prevent expensive shuffles
from being generated instead. computeKnownBitsForTargetNode has also
been updated to specify known zeros on extend_low_u.
This prevents it from being optimized out in non-asserts builds.
Update X86 test to remove REQUIRES: asserts and check for LLVM ERROR.
Add FileCheck to RISC-V test and remove UNSUPPORTED.
This is the more complete fix for #168772 and #168525.
Reland https://github.com/llvm/llvm-project/pull/165725, fix the Failed
test by removing successor operands before delete operations. Following
the deletion of cond.branch, its successor operands will subsequently be
removed.
When we add the module map describing the compiled module to the command
line, add it to the file dependencies as well.
Discovered while working on reproducers where a command line input was
missing in the captured files as it wasn't considered a dependency.
In RISCVVLOptimizer we first compute all the demanded VLs, then we walk
backwards through the function and try to reduce any VLs.
We don't actually need to walk backwards anymore since after #124530 the
order in which we modify the instructions doesn't matter.
This patch changes it to just iterate over the instructions with a
demanded VL computed, which means we don't iterate over scalar
instructions etc.
This also fixes#168665, where we triggered an assert on instructions
with a dead $vxsat implicit-def:
dead %x:vr = PseudoVSADDU_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */,
0 /* tu, mu */, implicit-def dead $vxsat
Because $vxsat is a reserved register, DeadMachineInstructionElim won't
remove it and the instruction makes it to RISCVVLOptimizer.
And because the def of %x is dead, we don't reach this instruction in
the dataflow analysis. This instruction returns true for isCandidate, so
we would try to lookup its demanded VL which doesn't exist and assert.
But with this patch we don't try to reduce instructions that aren't in
DemandedVLs, which fixes the crash.
Applied `[[nodiscard]]` where relevant to smart pointers and related
functions.
- [x] - `std::unique_ptr`
- [x] - `std::shared_ptr`
- [x] - `std::weak_ptr`
See guidelines:
-
https://libcxx.llvm.org/CodingGuidelines.html#apply-nodiscard-where-relevant
- `[[nodiscard]]` should be applied to functions where discarding the
return value is most likely a correctness issue. For example a locking
constructor in unique_lock.
---------
Co-authored-by: Hristo Hristov <zingam@outlook.com>
… names
Fix non-RDC mode HIP compilation for the new driver on Windows due to
invalid temporary file names when offload arch is a target ID containing
':', which is invalid in file names on Windows.
Refactor the existing handling of ':' in file names on Windows from
clang driver into a shared function sanitizeTargetIDInFileName in
clang/Basic/TargetID.h. This function replaces ':' with '@' on Windows
only, preserving the original behavior.
Update both clang/lib/Driver/Driver.cpp and
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp to use this
shared function, ensuring consistent handling across both tools.
The problem with the many def-use chain problems in SLP vectorizer are
related to the fact that some nodes reuse the same instruction as
insertion point. Insertion point is not the instruction, but the place
between instructions. To set it correctly, better to generate pseudo
instruction immediately after the last instruction, and use it as
insertion point. It resolves the issues in most cases.
Fixes#168512#168576
The clang side of the calling convention code for arm64 vs. arm64ec is
close enough that this isn't really noticeable in most cases, but the
rule for choosing whether to pass a struct directly or indirectly is
significantly different.
(Adapted from my old patch https://reviews.llvm.org/D125419 .)
Fixes#89615.
Attempt to only define used subregisters when creating IMPLICIT_DEF fix
ups for live interval subranges. This avoids the appearance at the MIR
level of entire (wide) registers becoming live rather than relying only
on transient LiveIntervals dead definitions for unused subregisters.
PreRARematStage builds region live-outs if GCN trackers are enabled. If
rematerialization leads to empty regions, this can cause a crash because
of dereference of an invalid iterator in getLastMIForRegion. The fix is
to skip calling getLastMIForRegion for empty regions.
This patch fixes another bug in the same code region. getLastMIForRegion
calls skipDebugInstructionsBackward which may immediately return the
RegionEnd if it is not the begin instruction and it is a non-debug
instruction. That would imply considering an instruction that is outside
the relevant region. The fix is to always pass the previous of RegionEnd
to skipDebugInstructionsBackward.
This bug was found while using GCN trackers on the existing LIT test
machine-scheduler-sink-trivial-remats.mir. Here's the assertion failure.
llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:168:
llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference
llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::operator*() const
[with OptionsT = llvm::ilist_detail::node_options<llvm::MachineInstr,
true, true, void, false, void>; bool IsReverse = false; bool IsConst =
false; llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference =
llvm::MachineInstr&]: Assertion `!NodePtr->isKnownSentinel()' failed.
This PR adds pattern for unrolling shape_cast given a targetShape. This
PR is a follow up of #164010 which was very general and was using
inserts and extracts on each element (which is also
LowerVectorShapeCast.cpp is doing).
After doing some more research on use cases, we (me and @Jianhui-Li )
realized that the previous version in #164010 is unnecessarily generic
and doesn't fit our performance needs.
Our use case requires that targetShape is contiguous in both source and
result vector.
This pattern only applies when contiguous slices can be extracted from
the source vector and inserted into the result vector such that each
slice remains in vector form with targetShape (and not decompose to
scalars). In these cases, the unrolling proceeds as:
vector.extract_strided_slice -> vector.shape_cast (on the slice
unrolled) -> vector.insert_strided_slice
This PR makes it possible to call `getBuffer()` on `DepScanFile` (a
`llvm::vfs::File`) repeatedly. Previously, this function would return a
moved-from `unique_ptr`. This doesn't fix any existing bugs, I
discovered this while experimenting with the VFSs in the scanner. Note
that the returned instances of `llvm::MemoryBuffer` are non-owning and
share the underlying buffer storage.