There is no need to set a big default stack size for PAL code object indirect
calls. The driver knows the max recursion depth, so it can compute a more
accurate value from the minimum scratch size.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D150609
Other such tests, of which there are many, are to be updated with
separate patches.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D152557
If we have a load/store with an illegal fixed length vector result type that
needs widened, e.g. `x:v6i32 = load p`
Instead of just widening it to: `x:v8i32 = load p`
We can widen it to the equivalent VP operation and set the EVL to the
exact number of elements needed: `x:v8i32 = vp_load a, b, mask=true, evl=6`
Provided that the target supports vp_load/vp_store on the widened type.
Scalable vectors are already widened this way where possible, so this
largely reuses the same logic.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D148713
This commit re-work the methods that dump traces with resource usage to take into account the StartAtCycle value added by https://reviews.llvm.org/D150310.
For each i, the values of the lists StartAtCycle and ReservedCycles is are printed with the interval [StartAtCycle[i], ReservedCycles[i])
```
... | StartAtCycle[i] | ... | ReservedCycles[i] - 1 | ReservedCycles[i] | ...
| xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | |
```
Reviewed By: andreadb
Differential Revision: https://reviews.llvm.org/D150311
Apply my post-commit comment on D81995. The negative name misguided commit
d8a8e5d6240a1db809cd95106910358e69bbf299 (`[clang][cli] Remove marshalling from
Opt{In,Out}FFlag`) to:
* accidentally flip the option to not emit the xray_fn_idx section.
* change -fno-xray-function-index (instead of -fxray-function-index) to emit xray_fn_idx
This patch renames XRayOmitFunctionIndex and makes -fxray-function-index emit
xray_fn_idx, but the default remains -fno-xray-function-index .
XRay instrumentation works for macOS running on Apple Silicon, but
codegen is untested there. I'm going to make changes affecting this
target, get the XRay tests running on AArch64.
Data sections are going to become slightly different on x86_64 soon.
I do want the tests to be specific about symbol names, so instead of
having test check the common step, bifurcate tests a bit and check
the full symbol names.
As for ARM, XRay is not really supported on iOS at the moment, though
ARM is also really used there with modern phones. Nevertheless, codegen
tests exist and the output is going to change a little, make it easier
to write the special case for iOS.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D145291
Something like:
`%r = mul %x, <33, 33, 33, ...>`
Is best lowered as:
`%tmp = %shl x, <5, 5, 5>; %r = add %tmp, %x`
As well, since vectors have non-destructive shifts, we can also do
cases where the multiply constant is `Pow2A +/- Pow2B` for arbitrary A
and B, unlike in the scalar case where the extra `mov` instructions
make it not worth it.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D150324
This is a quick fix for an assert when the source and dest have
different address spaces. The pointer compare needs to have matching
types, but we can't generically introduce addrspacecast and we don't
know if the address spaces alias.
Since ROLBRd needs an implicit R1 (on AVR) or an implicit R17 (on AVRTiny),
we split ROLBRd to ROLBRdR1 (on AVR) and ROLBRdR17 (on AVRTiny).
Reviewed By: aykevl, Patryk27
Differential Revision: https://reviews.llvm.org/D152248
The commit that added the run says it's to hoist uniform parts of
integer division expansion. That expansion is performed later, so this
didn't do anything in that case. Move this later so the original test
shows the improvement.
This also saves a run of "Canonicalize natural loops". Not sure why
this appears to be still getting a separate loop PM run. Also feels a
bit heavy to run this just for divide. Is there a way to specifically
hoist the divide sequence when it expands?
This should be true, but this is useless as is. The rematerialization
logic only permits rematerialize with constant physical register uses,
so non-constant physregs or virtual register uses (the case that
really matters) are not rematerialized. Add the tests which shows
nothing happens, but should in the future.
Also, all loads should really be rematerializable so in the future
this should apply to all the other kinds.
This was inserting an s_endpgm in the middle of the block when it has
to be a terminator. Split the block and insert a branch to a new block
with the trap if it's not in a terminator position.
Fixes verifier error on LDS in function with no trap support (and
other trap sources).
Expand large or unknown size memory intrinsics into loops in the
default lowering pipeline if the target doesn't have the corresponding
libfunc. Previously AMDGPU had a custom pass which existed to call the
expansion utilities.
With a default no-libcall option, we can remove the libfunc checks in
LoopIdiomRecognize for these, which never made any sense. This also
provides a path to lifting the immarg restriction on
llvm.memcpy.inline.
There seems to be a bug where TLI reports functions as available if
you use -march and not -mtriple.
hoisted constants.
The constant hoisting pass tries to hoist large constants into predecessors and also
generates remat instructions in terms of the hoisted constants. These aim to prevent
codegen from rematerializing expensive constants multiple times. So we can re-use
this optimization, we can preserve the no-op bitcasts that are used to anchor
constants to the predecessor blocks.
SelectionDAG achieves this by having the OpaqueConstant node, which is just a
normal constant with an opaque flag set. I've opted to avoid introducing a new
constant generic instruction here. Instead, we have a new G_CONSTANT_FOLD_BARRIER
operation that constitutes a folding barrier.
These are somewhat like the optimization hints, G_ASSERT_ZEXT in that they're
eliminated by the generic instruction selection code.
This change by itself has very minor improvements in -Os CTMark overall. What this
does allow is better optimizations when future combines are added that rely on having
expensive constants remain unfolded.
Differential Revision: https://reviews.llvm.org/D144336
Correctly handle single-element vectors to fix an assertion failure. Add tests
that were missing from the original commit.
Differential Revision: D151782
I don't think we can safely remove the second COPY as redundant in such cases.
The first COPY (which has undef src) may be lowered to a KILL instruction instead, resulting in no COPY being emitted at all.
Testcase is X86 so it's in the same place as other testcases for this function, but this was initially spotted on AMDGPU with the following:
```
renamable $vgpr24 = PRED_COPY undef renamable $vgpr25, implicit $exec
renamable $vgpr24 = PRED_COPY killed renamable $vgpr25, implicit $exec
```
The second COPY waas removed as redundant, and the first one was lowered to a KILL (= removed too), causing $vgpr24 to not have $vgpr25's value.
Fixes SWDEV-401507
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D152502
New metadata format contains full calculation of field contents for
ps_extra_lds_size (vs old format where the value in RSRC register is used by PAL
to calculate the value required).
Also stop updating float_mode and rely on front end settings for this field.
Differential Revision: https://reviews.llvm.org/D152247
This patch provides an alternative implementation to DPP for Scan Computations.
An alternative implementation iterates over all active lanes of Wavefront
using llvm.cttz and performs the following steps:
1. Read the value that needs to be atomically incremented using
llvm.amdgcn.readlane intrinsic
2. Accumulate the result.
3. Update the scan result using llvm.amdgcn.writelane intrinsic
if intermediate scan results are needed later in the kernel.
Reviewed By: arsenm, cdevadas
Differential Revision: https://reviews.llvm.org/D147408
- (op (op X, C1), C2) -> (op X, (op C1, C2))
- (op (op X, C1), Y) -> (op (op X, Y), C1)
Some code duplication with the G_PTR_ADD reassociations unfortunately but no
easy way to avoid it that I can see.
Differential Revision: https://reviews.llvm.org/D150230
Without explicitly checking and erroring out, an invalid personality
function, which is not `__gxx_wasm_personality_v0`, caused
a segmentation fault down the line because `WasmEHFuncInfo` was not
created. This explicitly checks the validity of personality functions in
functions with EH pads and errors out explicitly with a helpful error
message. This also adds some more assertions to ensure `WasmEHFuncInfo`
is correctly created and non-null.
Invalid personality functions wouldn't be generated by our Clang, but
can be present in handwritten ll files, and more often, in files
transformed by passes like `metarenamer`, which is often used with
`bugpoint` to simplify names in `bugpoint`-reduced files.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D152203
This reverts commit 8392bf6000ad039bd0e55383d40a05ddf7b4af13.
The commit missed some edge cases that led to crashes. Reverting to resolve
downstream breakage while a fix is pending.
Undo the canonicalize done in
0cfc6510323fbb5a56a5de23cbc65f7cc30fd34c. Restores some regressed
matching of integer mad. The selection patterns fo the actual mads
don't seem to be properly commuting, so some of the commuted cases are
still missed.
Fixes: SWDEV-363009
This isn't quite using register units, but it's getting close. The phi
generation is driven by register units, but each phi still contains a
reference to a register, potentially with a mask that amounts to a unit.
In cases of explicit register aliasing this may still create phis with
references that are aliased, whereas separate phis would ideally contain
disjoint references (this is all within a single basic block).
Previously phis used maximal registers, now they use minimal. This is a
step towards both, using register units directly, and a simpler liveness
calculation algorithm. The idea is that a phi cannot reach a reference
to anything smaller than the phi itself represents. Before there could
be a phi for R1_R0, now there will be two for this case (assuming R0 and
R1 have one unit each).
If VecVT is an LMUL=8 VT, we can't concatenate the vectors as that
would create an illegal type. Instead we need to split the vectors
and emit two VECTOR_INTERLEAVE operations that can each be lowered.
Reviewed By: fakepaper56
Differential Revision: https://reviews.llvm.org/D152411