Now that "vibe coding" is a thing, ignore the documentation artifacts
that coding assistants, like Claude and Gemini, use to retain coding
workflows and other metadata.
This PR reapplies https://github.com/llvm/llvm-project/pull/149461
In the original `combineVectorSizedSetCCEquality`, the result of the
setcc was negated by returning a setcc with the same cond code, leading
to wrong logic.
For example, with
```llvm
%cmp_16 = call i32 @memcmp(ptr %a, ptr %b, i32 16)
%res = icmp eq i32 %cmp_16, 0
```
the original PR produces all_true and then also compares the result
equal to 0 (reusing the same SETEQ in the returned setcc), meaning that
semantically it effectively implements icmp ne.
Instead, the PR should have used SETNE in the returned setcc; that way,
all_true returns 1, which is then compared ne 0, which is equivalent
to icmp eq.
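The inversion in scalar form (a minimal illustration; the real combine
builds SelectionDAG nodes, and `lowerMemcmpEqZero`/`allTrue` here are
just stand-ins):
```cpp
#include <cassert>

// allTrue is 1 iff every lane of the lanewise "icmp eq" was true, i.e.
// the two 16-byte blocks compare equal, so memcmp(...) == 0 holds.
bool lowerMemcmpEqZero(int allTrue) {
  // Original PR (SETEQ): true when the blocks differ -> behaves as icmp ne.
  bool withSetEq = (allTrue == 0);
  (void)withSetEq;
  // Fix (SETNE): true when the blocks are equal -> matches icmp eq.
  return allTrue != 0;
}

int main() {
  assert(lowerMemcmpEqZero(1));  // equal blocks: memcmp()==0 is true
  assert(!lowerMemcmpEqZero(0)); // differing blocks
}
```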
The purpose of this fence is to ensure that any `dataSubmit`s inserted
into a queue before a `dataFence` finish before any `dataSubmit`s
inserted after it begin.
This is a no-op for most queues, since they are in-order, and by design
any operations inserted into them occur in order.
But the interface is supposed to be functional for out-of-order queues.
The addition of the interface means that any operations that rely on
such ordering (like ATTACH map-type support in #149036) can invoke it,
without worrying about whether the underlying queue is in-order or
out-of-order.
Once a plugin supports out-of-order queues, the plugin can implement
this function, without requiring any change at the libomptarget level.
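A minimal sketch of what a plugin implementation could look like for an
out-of-order queue, assuming a CUDA-like event API (`recordEvent` and
`queueWaitEvent` are hypothetical names, not the actual plugin
interface):
```cpp
struct Queue {};
struct Event {};

// Hypothetical plugin hooks: the event completes once all work already in
// the queue has finished, and later work can be made to wait on the event.
Event *recordEvent(Queue &Q);
void queueWaitEvent(Queue &Q, Event *E);

// Fence: every dataSubmit inserted before this call finishes before any
// dataSubmit inserted after it begins, even if the queue is out-of-order.
void dataFence(Queue &Q) {
  Event *E = recordEvent(Q);
  queueWaitEvent(Q, E);
}
```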
---------
Co-authored-by: Alex Duran <alejandro.duran@intel.com>
If the copyable schedule data is created and the user is used several
times in the user node, there is no need to count the same data for the
same user several times; it should be included only once.
Fixes #153754
The 'firstprivate' clause requires that we do a 'copy' operation, so
this patch creates some AST nodes from which we can generate the copy
operation, including a 'temporary' and array init. For the most part
this is pretty similar to what 'private' does other than the fact that
the source is a copy (and not default init!), and that there is a
temporary from which to copy.
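For reference, the source-level semantics being modeled (a plain OpenACC
example, not the generated ClangIR):
```cpp
void f() {
  int x = 42;
  // 'firstprivate' gives each gang/worker its own x, initialized by copying
  // the original value; 'private' would instead default-initialize it.
#pragma acc parallel firstprivate(x)
  {
    x += 1; // operates on the copy; the original x is unchanged
  }
}
```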
---------
Co-authored-by: Andy Kaylor <akaylor@nvidia.com>
GFX1250 SPG says: S_GETREG_B32 does not wait for idle before executing.
The user must S_WAIT_ALU 0 before S_GETREG_B32 on:
STATUS, STATE_PRIV, EXCP_FLAG_PRIV, or EXCP_FLAG_USER.
Fixes #153012
As we tolerate unfoldable constant expressions in `scalarizeOpOrCmp`, we
may fold
```llvm
@val = external global i8

define void @bug(ptr %ptr1, ptr %ptr2, i64 %idx) {
entry:
%158 = insertelement <2 x i64> <i64 5, i64 ptrtoint (ptr @val to i64)>, i64 %idx, i32 0
%159 = or disjoint <2 x i64> splat (i64 2), %158
store <2 x i64> %159, ptr %ptr2
ret void
}
```
to
```llvm
@val = external global i8

define void @bug(ptr %ptr1, ptr %ptr2, i64 %idx) {
entry:
%.scalar = or disjoint i64 2, %idx
%0 = or <2 x i64> splat (i64 2), <i64 5, i64 ptrtoint (ptr @val to i64)>
%1 = insertelement <2 x i64> %0, i64 %.scalar, i64 0
store <2 x i64> %1, ptr %ptr2, align 16
ret void
}
```
And it would be folded back in `foldInsExtBinop`, resulting in an
infinite loop.
This patch performs the scalarization only if InstSimplify can fold the
constant expression.
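A sketch of the shape of the guard, assuming it uses `simplifyBinOp`
from InstSimplify (the helper name and exact placement in
`scalarizeOpOrCmp` are illustrative):
```cpp
#include "llvm/Analysis/InstructionSimplify.h"

using namespace llvm;

// Only scalarize when the constant halves of the binop actually fold;
// otherwise foldInsExtBinop rebuilds the insertelement form and the two
// transforms ping-pong forever.
static bool constantPartFolds(unsigned Opcode, Constant *C0, Constant *C1,
                              const SimplifyQuery &SQ) {
  return simplifyBinOp(Opcode, C0, C1, SQ) != nullptr;
}
```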
We've backported a lot more C features to previous C standards than we
were documenting. I took a pass over the c_status page for Clang and
pulled more entries to add to our documentation.
Fixes #139023.
This PR essentially removes unused global variables:
- Restores the `GlobalDCE` Legacy pass and adds it to the DirectX
backend after the finalize linkage pass
- Converts external global variables with no usage to internal linkage
in the finalize linkage pass (so they can be removed by `GlobalDCE`)
- Makes the `dxil-finalize-linkage` pass usable via the new pass manager
flag syntax (`-passes=dxil-finalize-linkage`)
- Adds tests to `finalize_linkage.ll` that make sure unused global
variables are removed
- Adds a use for variable `@CBV` in `opaque-value_as_metadata.ll` so it
isn't removed
- Changes the `scalar-data.ll` run command to avoid removing its global
variables
---------
Co-authored-by: Farzon Lotfi <farzonlotfi@microsoft.com>
We cannot actually retire an infinite number of uops per cycle. This
patch adds an RCU to the Skylake scheduling model to fix this. I'm
purposefully using a loose upper bound here. We're unlikely to actually
get four fused uops per cycle, but this is better than not setting
anything. Most realistic code I've put through uiCA will retire up to ~6
uops per cycle.
Information taken from
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client).
This requires modification of the two zero idiom tests because we do not
currently model the CPU frontend which would likely be the actual
bottleneck in that case.
Related to #153747.
Similar to `IntegerRelation::addLocalFloorDiv`, this adds a utility
`IntegerRelation::addLocalModulo` that adds a local variable equal to
the remainder of an affine expression of the variables modulo some
constant modulus. The function returns the absolute index of the new
variable in the relation.
This is computed by first introducing the floor division `q = exprs
floordiv modulus` and then computing the remainder `result = exprs - q *
modulus`.
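A worked example of that construction (values illustrative):
```cpp
#include <cassert>

int main() {
  long e = 7, m = 3; // an affine expression value and a constant modulus
  long q = (e % m >= 0) ? e / m : e / m - 1; // e floordiv m (addLocalFloorDiv)
  long r = e - q * m;                        // the new local: e mod m, in [0, m)
  assert(q == 2 && r == 1);
}
```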
Signed-off-by: Asra Ali <asraa@google.com>
When matching integers, `m_ConstantInt` is a convenient alternative to
`m_APInt` for matching unsigned 64-bit integers, allowing one to
simplify
```cpp
const APInt *IntC;
if (match(V, m_APInt(IntC))) {
  if (IntC->ule(UINT64_MAX)) {
    uint64_t Int = IntC->getZExtValue();
    // ...
  }
}
```
to
```cpp
uint64_t Int;
if (match(V, m_ConstantInt(Int))) {
  // ...
}
```
However, this simplification is only true if `V` is a scalar type.
Specifically, `m_APInt` also matches integer splats, but `m_ConstantInt`
does not.
This patch ensures that the matching behaviour of `m_ConstantInt`
parallels that of `m_APInt`, and also incorporates it in some obvious
places.
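For example (illustrative use; the helper name is made up):
```cpp
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// With this patch, a vector splat such as <4 x i32> splat (i32 7) now
// matches too, binding Int to 7, exactly as m_APInt would.
static bool matchSmallIntOrSplat(Value *V, uint64_t &Int) {
  return match(V, m_ConstantInt(Int));
}
```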
This modifies InjectAnonymousStructOrUnionMembers to inject an
IndirectFieldDecl and mark it invalid even if its name conflicts with
another name in the scope.
This resolves a crash in a subsequent diagnostic,
diag::err_multiple_mem_union_initialization, which relies (via
findDefaultInitializer) on these declarations being present.
Fixes #149985
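A hypothetical reproducer shaped like the crash described (the actual
test case is in #149985):
```cpp
struct S {
  int x;
  union {      // members are injected into S's scope; 'x' conflicts with S::x
    int x = 1; // the IndirectFieldDecl for this member is now injected
               // (marked invalid) instead of being dropped
    int y = 2; // a second default initializer in the same union triggers
               // err_multiple_mem_union_initialization, which needs the
               // injected declarations to exist
  };
};
```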
This MR adds a verifier for the `emitc.get_field` op.
- The verifier checks that the `emitc.get_field` operation is nested
inside an `emitc.class` op.
- Additionally, appropriate tests for erroneous cases were added for
class-related operations in `invalid_ops.mlir`.
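A sketch of the check, assuming the usual ODS-generated accessors (the
actual code may differ):
```cpp
// Verifier body: emitc.get_field must be nested inside an emitc.class.
mlir::LogicalResult GetFieldOp::verify() {
  if (!getOperation()->getParentOfType<emitc::ClassOp>())
    return emitOpError("must be nested within an emitc.class op");
  return mlir::success();
}
```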
If the mask of a (fixed-vector) deinterleaved load is assembled by the
`vector.interleaveN` intrinsic, any intrinsic arguments that are
all-zeros are regarded as gaps.
While attempting to enable Windows x64 unwind v2, compilation failed
with the following error:
```
fatal error: error in backend: Windows x64 Unwind v2 is required, but LLVM has generated incompatible code in function '<redacted>': Cannot pop registers before the stack allocation has been deallocated
```
I traced this down to an optimization in `X86FrameLowering`:
<6961139ce9/llvm/lib/Target/X86/X86FrameLowering.cpp (L324-L340)>
Technically, using `push`/`pop` to adjust the stack is permitted under
unwind v2: the requirement for a "canonical" epilog is that the stack is
fully adjusted before the registers listed as pushed in the unwind table
are popped. So, as long as the `.seh_unwindv2start` pseudo comes after
the pops that adjust the stack, everything will work correctly.
One other side effect of this change is that the stack is now allowed to
be adjusted across multiple instructions, which would be needed for
extremely large stack frames.
This patch improves the GPU benchmarking in the following ways:
* Replace `rand`/`srand` with a deterministic per-thread RNG seeded by
`call_index`: reproducible, apples-to-apples libc vs vendor comparisons.
* Fix input generation: sample the unbiased exponent uniformly in
`[min_exp, max_exp]`, clamp bounds, and skip `Inf`, `NaN`, `-0.0`, and
`+0.0`.
* Fix standard deviation: use an explicit estimator from sums and
sums-of-squares (`sqrt(E[x^2] − E[x]^2)`) across samples.
* Fix throughput overhead: subtract a loop-only baseline inside
NVPTX/AMDGPU timing backends so `benchmark()` gets cycles-per-call
already corrected (no `overhead()` call).
* Adapt existing math benchmarks to the new RNG/timing plumbing (plumb
`call_index`, drop `rand/srand`, clean includes).
* Correct inter-thread aggregation: use iteration-weighted pooling to
compute the global mean/variance (see the sketch after this list),
ensuring statistically sound `Cycles (Mean)` and `Stddev`.
* Remove `Time / Iteration` column from the results table: it reported
per-thread convergence time (not per-call latency) and was
redundant/misleading next to `Cycles (Mean)`.
* Remove unused `BenchmarkLogger` files: dead code that added
maintenance and cognitive overhead without providing functionality.
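A minimal sketch of the iteration-weighted pooling referenced above (a
standard parallel mean/variance merge; the names are illustrative, not
the benchmark's actual types):
```cpp
#include <cmath>
#include <cstdint>

struct ThreadStats {
  uint64_t n;  // iterations measured by this thread
  double mean; // mean cycles per call
  double m2;   // sum of squared deviations from the mean
};

// Merge two per-thread results, weighting each by its iteration count.
ThreadStats pool(const ThreadStats &a, const ThreadStats &b) {
  if (a.n == 0) return b;
  if (b.n == 0) return a;
  uint64_t n = a.n + b.n;
  double delta = b.mean - a.mean;
  double mean = a.mean + delta * (double)b.n / (double)n;
  double m2 = a.m2 + b.m2 +
              delta * delta * (double)a.n * (double)b.n / (double)n;
  return {n, mean, m2};
}

// Global stddev after folding all threads: sqrt(m2 / n).
double stddev(const ThreadStats &s) { return std::sqrt(s.m2 / (double)s.n); }
```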
---
## TODO (before merge)
* [ ] Investigate compiler warnings and address their root causes.
* [x] Review how per-thread results are aggregated into the overall
result.
## Follow-ups (future PRs)
* Add support to run throughput benchmarks with uniform (linear) input
distributions, alongside the current log2-uniform scheme.
* Review/adjust the configuration and coverage of existing math
benchmarks.
* Add more math benchmarks (e.g., `exp`/`expf`, others).
This fixes a few bugs, effectively through a fallback to `p` when `po` fails.
The motivating bug this fixes is when an error within the compiler causes `po` to fail.
Previously when that happened, only its value (typically an object's address) was
printed – and problematically, no compiler diagnostics were shown. With this change,
compiler diagnostics are shown, _and_ the object is fully printed (i.e., `p`).
Another bug this fixes is when `po` is used on a type that doesn't provide an object
description (such as a struct). Again, the normal `ValueObject` printing is used.
Additionally, this improves how lldb handles an object description method that
fails in some way. Now an error will be shown (it wasn't before), and the value will be
printed normally.
For each function with the AMDGPU_CS_Chain calling convention, with
dynamic VGPRs enabled, add a _dvgpr$ symbol, with the value of the
function symbol, plus an offset encoding one less than the number of
VGPR blocks used by the function (16 VGPRs per block, no more than 128)
in bits 5..3 of the symbol value. This is used by a front-end to have
functions that are chained rather than called, and a dispatcher that
dynamically resizes the VGPR count before dispatching to a function.
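The encoding, written out (a direct transcription of the description
above; the helper name is illustrative):
```cpp
#include <cstdint>

// 16 VGPRs per block and at most 128 VGPRs, so blocks - 1 is in [0, 7]
// and fits in bits 5..3 of the symbol value.
uint64_t dvgprSymbolValue(uint64_t funcSymbolValue, unsigned numVGPRs) {
  unsigned blocks = (numVGPRs + 15) / 16; // round up to whole VGPR blocks
  return funcSymbolValue + ((uint64_t)(blocks - 1) << 3);
}
```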
Having basic checks (like running buildifier) on the upstream bazel
files would be helpful for contributors maintaining the bazel build. Add
basic checks (currently just buildifier) to a workflow that runs
whenever the bazel build files change.
This updates the DIL code for handling array subscripting to more
closely match and handle all the cases from the original 'frame var'
implementation. Also updates the DIL array subscripting test. This
particularly fixes some issues with handling synthetic children,
Objective-C pointers, and accessing specific bits within scalar data
types.