While attempting to enable Windows x64 unwind v2, compilation failed
with the following error:
```
fatal error: error in backend: Windows x64 Unwind v2 is required, but LLVM has generated incompatible code in function '<redacted>': Cannot pop registers before the stack allocation has been deallocated
```
I traced this down to an optimization in `X86FrameLowering`:
<6961139ce9/llvm/lib/Target/X86/X86FrameLowering.cpp (L324-L340)>
Technically, using `push`/`pop` to adjust the stack is permitted under
unwind v2: the requirement for a "canonical" epilog is that the stack is
fully adjusted before the registers listed as pushed in the unwind table
are popped. So, as long as the `.seh_unwindv2start` pseudo is after the
pops that adjust the stack, then everything will work correctly.
One other side effect of this change is that the stack is now allowed to
be adjusted across multiple instructions, which would be needed for
extremely large stack frames.
This patch improves the GPU benchmarking in this way:
* Replace `rand`/`srand` with a deterministic per-thread RNG seeded by
`call_index`: reproducible, apples-to-apples libc vs vendor comparisons.
* Fix input generation: sample the unbiased exponent uniformly in
`[min_exp, max_exp]`, clamp bounds, and skip `Inf`, `NaN`, `-0.0`, and
`+0.0`.
* Fix standard deviation: use an explicit estimator from sums and
sums-of-squares (`sqrt(E[x^2] − E[x]^2)`) across samples.
* Fix throughput overhead: subtract a loop-only baseline inside
NVPTX/AMDGPU timing backends so `benchmark()` gets cycles-per-call
already corrected (no `overhead()` call).
* Adapt existing math benchmarks to the new RNG/timing plumbing (plumb
`call_index`, drop `rand/srand`, clean includes).
* Correct inter-thread aggregation: use iteration-weighted pooling to
compute the global mean/variance, ensuring statistically sound `Cycles
(Mean)` and `Stddev`.
* Remove `Time / Iteration` column from the results table: it reported
per-thread convergence time (not per-call latency) and was
redundant/misleading next to `Cycles (Mean)`.
* Remove unused `BenchmarkLogger` files: dead code that added
maintenance and cognitive overhead without providing functionality.
---
## TODO (before merge)
* [ ] Investigate compiler warnings and address their root causes.
* [x] Review how per-thread results are aggregated into the overall
result.
## Follow-ups (future PRs)
* Add support to run throughput benchmarks with uniform (linear) input
distributions, alongside the current log2-uniform scheme.
* Review/adjust the configuration and coverage of existing math
benchmarks.
* Add more math benchmarks (e.g., `exp`/`expf`, others).
This fixes a few bugs, effectively through a fallback to `p` when `po` fails.
The motivating bug this fixes is when an error within the compiler causes `po` to fail.
Previously when that happened, only its value (typically an object's address) was
printed – and problematically, no compiler diagnostics were shown. With this change,
compiler diagnostics are shown, _and_ the object is fully printed (ie `p`).
Another bug this fixes is when `po` is used on a type that doesn't provide an object
description (such as a struct). Again, the normal `ValueObject` printing is used.
Additionally, this also improves how lldb handles an object description method that
fails in some way. Now an error will be shown (it wasn't before), and the value will be
printed normally.
For each function with the AMDGPU_CS_Chain calling convention, with
dynamic VGPRs enabled, add a _dvgpr$ symbol, with the value of the
function symbol, plus an offset encoding one less than the number of
VGPR blocks used by the function (16 VGPRs per block, no more than 128)
in bits 5..3 of the symbol value. This is used by a front-end to have
functions that are chained rather than called, and a dispatcher that
dynamically resizes the VGPR count before dispatching to a function.
Having basic checks (like running buildifier) on the upstream bazel
files would be helpful for contributors maintaining the bazel build. Add
basic checks (currently just buildifier) to a workflow that runs
whenever the bazel build files change.
This updates the DIL code for handling array subscripting to more
closely match and handle all the cases from the original 'frame var'
implementation. Also updates the DIL array subscripting test. This
particularly fixes some issues with handling synthetic children, objc
pointers, and accessing specific bits within scalar data types.
Previously, we were trying to memset a pointer that wasn't being
initialized, and the test would randomly fail.
This PR replaces the pointers with actual objects.
Add a new AutomapToTargetData pass. This gathers the declare target
enter variables which have the AUTOMAP modifier. And adds
omp.declare_target_enter/exit mapping directives for fir.alloca and
fir.free oeprations on the AUTOMAP enabled variables.
Automap Ref: OpenMP 6.0 section 7.9.7.
Avoids strlen when constructing the returned StringRef. We were emitting
these in the libcall name lookup anyway, so split out the offsets for
general use.
Currently emitted as a separate table, not sure if it would be better
to change the string offset table to store pairs of offset and width
instead.
Currently, modifier is printed as address, so it is not readable and not
useful. This PR adds readable printing for it.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
We've 2 ops:
1. nvvm.griddepcontrol.wait
2. nvvm.griddepcontrol.launch_dependents
They are related to Grid Dependent Launch (or programmatic dependent
launch in CUDA) and same concept. This PR unifies both ops into a single
one.
In preperation of the firstprivate implementation, this separates out
some functions to make it easier to read.
Additionally, it cleans up the VarDecl->alloca relationship, which will
prevent issues if we have to re-use the same vardecl for a future
generated recipe (and causes concerns in firstprivate later).
The patch adds patterns to select the EXT_ZZI_CONSTRUCTIVE pseudo
instead of the EXT_ZZI destructive instruction for vector_splice. This
only works when the two inputs to vector_splice are identical.
Given that registers aren't tied anymore, this gives the register
allocator more freedom and a lot of MOVs get replaced with MOVPRFX.
In some cases however, we could have just chosen the same input and
output register, but regalloc preferred not to. This means we end up
with some test cases now having more instructions: there is now a
MOVPRFX while no MOV was previously needed.
This reverts commit 16314eb7312dab38d721c70f247f2117e9800704 as the test cases
are failing under EXPENSIVE_CHECKS. Scalar vecreduce.fadd are not valid in
GISel.
In current DebugLoc coverage builds, the output for any reasonably large
build can become very large if any missing DebugLocs are present; this
happens because single errors in LLVM may result in many errors being
reported in the output report. The main cause of this is that the empty
locations attached to instructions may be propagated to other
instructions in later passes, which will each be reported as new errors.
This patch prevents this by adding an "unknown" annotation to
instructions after reporting them once, ensuring that any other
DebugLocs copied or derived from the original empty location will not be
marked as new errors.
As a separate but related change, this patch updates the report
generation script to deduplicate results using the recorded stacktrace
if they are available, instead of the pass+instruction combination. This
reduces the size of the reduction, but makes the reduction highly
reliable, as the stacktrace allows us to very precisely identify when
two bugs have originated from the same place.
* Add `requiresArgsAndResultsAttr` to `LLVM_OneResultIntrOp`
* Add `args_attrs` to `llvm.intr.masked.{expandload,compressstore}`
The LLVM intrinsics
[`llvm.intr.masked.expandload`](https://llvm.org/docs/LangRef.html#llvm-masked-expandload-intrinsics)
and
[`llvm.intr.masked.compressstore`](https://llvm.org/docs/LangRef.html#llvm-masked-compressstore-intrinsics)
both allow an optional align parameter attribute to be set which
defaults to one.
Inlining the documentation below for [`llvm.intr.masked.expandload` 's
](https://llvm.org/docs/LangRef.html#id1522) and
[`llvm.intr.masked.compressstore`'s](https://llvm.org/docs/LangRef.html#id1522)
arguments respectively
> The `align` parameter attribute can be provided for the first
argument. The pointer alignment defaults to 1.
> The `align` parameter attribute can be provided for the second
argument. The pointer alignment defaults to 1.