Allow runtime source directories to live outside the top-level tree by
honoring LLVM_EXTERNAL_*_SOURCE_DIR and propagating the values via
RUNTIMES_CMAKE_ARGS.
Add prefixes to discriminate between -mattr=+mve and -mattr=+mve.fp to
add missing check coverage
Fixes update_llc_test_checks warnings and simplifies regeneration for an
upcoming patch
When applying BTI fixups to indirect branch targets, ignored functions
are considered as a special case:
- these hold no instructions,
- have no CFG,
- and are not emitted in the new text section.
The solution is to patch the entry points in the original location.
If such a situation occurs in a binary, recompilation using the
-fpatchable-function-entry flag is required. This will place a nop at
all function starts, which BOLT can use to patch the original section.
Without the extra nop, BOLT cannot safely patch the original .text
section.
An alternative solution could be to also ignore the function from which
the stub starts. This has not been tried, as the LongJmp pass, where most
stubs are inserted, is currently not equipped to ignore functions.
Testing: both the success and failure cases are covered with lit tests.
We have to be careful when attempting to decode OR() patterns as
shuffles - we can't forward demanded undef elements in both sources as
an undef result as it can lead to infinite loops during widening
(#49393).
But if we don't demand the element in the first place (based off
demanded elts masks during recursive shuffle combines), then it doesn't
matter what the elements contain and we can treat it as a
SM_SentinelUndef shuffle element.
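The rule above can be sketched in Python. This is an illustration only, not the code in the patch: `decode_or_as_shuffle` and its flag inputs are hypothetical, the sentinel values are assumed to mirror SM_SentinelUndef/SM_SentinelZero, and bailing out on a demanded undef element is one conservative choice assumed here.

```python
SM_SENTINEL_UNDEF = -1  # assumed to mirror SM_SentinelUndef
SM_SENTINEL_ZERO = -2   # assumed to mirror SM_SentinelZero


def decode_or_as_shuffle(zero0, zero1, undef0, undef1, demanded):
    """Hypothetical helper: build a shuffle mask for OR(op0, op1).

    zero0/zero1: per-element "known zero" flags for each operand.
    undef0/undef1: per-element "undef" flags for each operand.
    demanded: per-element flags saying whether the result element is used.
    Returns a mask, or None when the OR cannot be treated as a shuffle.
    """
    n = len(demanded)
    mask = []
    for i in range(n):
        if not demanded[i]:
            # Nobody reads this element, so its contents do not matter.
            mask.append(SM_SENTINEL_UNDEF)
        elif undef0[i] and undef1[i]:
            # Demanded undef in both sources: do not forward it as undef.
            return None
        elif zero0[i] and zero1[i]:
            mask.append(SM_SENTINEL_ZERO)
        elif zero1[i]:
            mask.append(i)        # element comes from op0
        elif zero0[i]:
            mask.append(i + n)    # element comes from op1
        else:
            return None           # both sides may contribute bits
    return mask
```

With two elements where both sources are undef in element 1, the sketch gives up when element 1 is demanded and returns `[0, SM_SENTINEL_UNDEF]` when it is not.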
Noticed while working on #137422
This patch removes all our manual adjustments to the access control
specifiers of Clang decls we create from DWARF.
This has led to occasional subtle bugs in the past (the latest being
https://github.com/llvm/llvm-project/issues/171913) and it's ultimately
redundant because Clang already has provisions for LLDB to bypass access
control for C++ and Objective-C. Access control doesn't affect name
lookup so really we're doing a lot of bookkeeping for not much benefit.
The only "feature" that relied on this was that `type lookup <foo>`
would print the access specifier in the output structure layout. I'm not
convinced that's worth keeping the infrastructure in place for (but
happy to be convinced otherwise).
I'd rather lean fully into the Clang access control bypass instead.
Note, I still kept the `AccessType` parameters to the various
`TypeSystemClang` APIs to reduce the size of the diff. A follow-up NFC
change will remove those parameters and adjust all the call-sites.
Commit 5efce7392f3f6cc added optimized AArch32 assembly versions of
mulsf3 and divsf3, with more thorough tests. The new tests included test
cases specific to Arm's particular NaN handling rules, which are
disabled on most platforms, but were intended to be enabled for Arm.
Unfortunately, they were not enabled under any circumstances, because I
made a mistake in `test/builtins/CMakeLists.txt`: the command-line `-D`
option that should have enabled them was added to the cflags list too
early, before the list was reinitialized from scratch. So it never ended
up on the command line.
Also, the test file mulsf3.S only even _tried_ to enable strict mode in
Thumb1, even though the Arm/Thumb2 implementation would also have met
its requirements.
Because the strict-mode tests weren't enabled, I didn't notice that they
would also have failed absolutely everything, because they checked the
results using the wrong sense of comparison! I used `==`, but that
comparison was supposed to be a drop-in replacement for
`compareResultF`, which returns zero for equality. Changed the tests to
use `!=`.
Finally, I've also added a macro to each test so that it records the
source line number of each failing test case. That way, when a test
fails, you can find it in the test source more easily, without having to
search for the hex numbers mentioned in the failure message.
When the memref element type (e.g., i8) is narrower than the SPIR-V
storage type (e.g., i32 on Vulkan), ori and andi can be lowered with a
single wide atomic instruction because OR-with-0 and AND-with-1 are
identity operations.
The revision follows `IntStoreOpPattern` to compute offsets/sizes via
`adjustAccessChainForBitwidth` method and `getOffsetForBitwidth` method.
Additionally, unlike `IntStoreOpPattern`, it handles the returned value
(which is the old value by definition). Other parts, e.g., the check of
`spirv::Capability::Kernel`, are the same.
07ebb18e07/mlir/lib/Conversion/MemRefToSPIRV/MemRefToSPIRV.cpp (L847-L867)
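As a rough illustration of why one wide atomic suffices, the Python sketch below builds the 32-bit operand for an OR or AND on a single i8 element and recovers the old i8 value from the old word. The helper names, the fixed 8/32-bit widths, and the little-endian placement of elements within the word are assumptions of this sketch, not taken from the patch.

```python
WIDE_BITS = 32    # assumed storage width (e.g. i32)
NARROW_BITS = 8   # assumed element width (e.g. i8)


def widen_operand(value, elem_index, is_and):
    """Return the 32-bit operand for an atomic OR/AND on one i8 element."""
    per_word = WIDE_BITS // NARROW_BITS
    shift = (elem_index % per_word) * NARROW_BITS
    narrow_mask = (1 << NARROW_BITS) - 1
    wide_mask = (1 << WIDE_BITS) - 1
    shifted = (value & narrow_mask) << shift
    if is_and:
        # Fill the other lanes with 1s: x & 1 == x leaves them untouched.
        return shifted | (wide_mask & ~(narrow_mask << shift))
    # Other lanes are already 0: x | 0 == x leaves them untouched.
    return shifted


def extract_old(old_word, elem_index):
    """Recover the old i8 value from the old 32-bit word."""
    per_word = WIDE_BITS // NARROW_BITS
    shift = (elem_index % per_word) * NARROW_BITS
    return (old_word >> shift) & ((1 << NARROW_BITS) - 1)
```

For example, with the word `0x11223344` and element index 1, OR-ing with `widen_operand(0x0C, 1, False)` changes only the `0x33` byte, and `extract_old` returns that `0x33` as the result value.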
There are refactoring opportunities, but they are not pursued in this
revision because the current implementation is already complicated. The
refactoring can happen in a follow-up patch, so that reviewing this
revision is easier.
Signed-off-by: hanhanW <hanhan0912@gmail.com>
Helps improve: https://github.com/llvm/llvm-project/issues/182625.
This does not fully solve the issues with using `ctpop`, as the vector
type chosen for the reduction is not ideal in all cases. This results in
extra extends, which can be seen in a few test cases.
When an equated symbol (e.g. `x=0`) is followed by `.section x`,
getOrCreateSectionSymbol reports an "invalid symbol redefinition"
error but continues to reuse the equated symbol as a section symbol.
This causes an assertion failure in MCObjectStreamer::changeSection
when `setFragment` is called on the equated symbol.
Fix this by clearing `Sym`.
I was reading through ObjectContainerBSDArchive and came across some
dead method decls and a less-than-completely-clear `shared_ptr` typedef
in `ObjectContainerBSDArchive::Archive` for a shared_ptr<Archive>, which
was a little unclear when reading a decl like `shared_ptr archive_sp;`
for a local variable.
This patch optimizes the creation of constant 64-bit vectors (e.g.,
v2i32, v4i16) by avoiding expensive loads from the constant pool. The
optimization works by packing the constant vector elements into a single
i64 immediate and bitcasting the result to the target vector type. This
replaces a memory access with more efficient immediate materialization.
To ensure this transformation is efficient, a check is performed to
verify that the immediate can be generated in two or fewer mov
instructions. If it requires more, the compiler falls back to using the
constant pool.
The optimization is disabled for big-endian targets for now.
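A minimal Python sketch of the packing and the cost check follows. The function names are hypothetical, and the cost model (one mov per nonzero 16-bit chunk) is an assumption made for illustration; the actual check in the patch may count instructions differently.

```python
def pack_vector_to_i64(elements, elem_bits):
    """Pack little-endian vector elements into one 64-bit integer."""
    assert len(elements) * elem_bits == 64
    mask = (1 << elem_bits) - 1
    imm = 0
    for i, elem in enumerate(elements):
        imm |= (elem & mask) << (i * elem_bits)
    return imm


def estimated_mov_count(imm):
    """Assumed cost model: one instruction per nonzero 16-bit chunk."""
    chunks = [(imm >> shift) & 0xFFFF for shift in range(0, 64, 16)]
    return max(1, sum(1 for c in chunks if c != 0))


def should_use_immediate(elements, elem_bits, max_movs=2):
    """True if the packed constant fits the assumed two-mov budget."""
    imm = pack_vector_to_i64(elements, elem_bits)
    return estimated_mov_count(imm) <= max_movs
```

Under this model a v2i32 constant `[1, 2]` packs to `0x0000000200000001` and stays within the budget, while a v4i16 constant `[1, 2, 3, 4]` has four nonzero chunks and would fall back to the constant pool.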
Make the interfaces part of lldbPluginScriptInterpreterPython instead of
putting them into their own static library. This avoids the need for an
extra static archive and more importantly a bunch of code duplication
between the two CMakeLists.txt.
The assert added in
[0ab1d23fbfa2ae0ba14315cb11678d2289510f66](0ab1d23fbf)
is incorrect: NewUnit is legitimately null for compile units that are
skipped during garbage collection (e.g. dwarf5-macro.test). Revert to
the original null check.
The classic DWARF linker avoids `DIEEntry` for `DW_FORM_ref_addr`
references, using raw `DIEInteger` values with manual offset computation
instead. A stale FIXME explains this was because "the implementation
calls back to DwarfDebug to find the unit offset", but this is no longer
true. `DIEEntry` resolves offsets via
`DIEUnit::getDebugSectionOffset()`, which has no `DwarfDebug`
dependency.
The real constraint is that forward references may point to
placeholder `DIEs` that never get adopted into a unit tree (due to ODR
pruning), so `DIEEntry` cannot resolve them (a test failed while
refactoring this). However, backward references are safe: the target DIE
is already cloned and parented in a unit tree.
The support of f32 packed instructions in #126337 revealed performance
regressions on certain kernels. In one case, the cause comes from
loading a v4f32 from shared memory but then accessing them as {r0, r2}
and {r1, r3} from the full load of {r0, r1, r2, r3}.
This access pattern guarantees that the registers require a coalescing
operation, which increases register pressure and degrades performance.
The fix here is to identify whether we can prove that a v2f32 operand
comes from non-contiguous vector extracts and, if so, scalarize the
operation so the coalescing operation is no longer needed.
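The decision can be pictured with a small Python sketch. Both helpers are hypothetical, and treating only aligned adjacent pairs such as (0, 1) and (2, 3) as contiguous is an assumption of this sketch rather than a statement about the patch.

```python
def is_contiguous_pair(lo_index, hi_index):
    """True if (lo, hi) is an aligned adjacent pair such as (0, 1) or (2, 3)."""
    return lo_index % 2 == 0 and hi_index == lo_index + 1


def should_scalarize(operand_pairs):
    """operand_pairs: (lo, hi) extract indices for each v2f32 operand."""
    return any(not is_contiguous_pair(lo, hi) for lo, hi in operand_pairs)
```

For the access pattern described above, `should_scalarize([(0, 2), (1, 3)])` is true, while operands built from `(0, 1)` and `(2, 3)` are left packed.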
I've found that ptxas can see through the extra unpacks/repacks of
contiguous registers this causes in MIR. However, in the full test case,
the packing of the final scalar->vector results does generate additional
costs, especially since the only users unpack them. An additional MIR
pass is possible to catch the case.
Assisted-by: Cursor / claude-4.6-opus-high
---------
Co-authored-by: Princeton Ferro <princetonferro@gmail.com>
CHR (Control Height Reduction) merges multiple biased branches into a
single speculative check, cloning the region into hot/cold paths. On
GPU targets, the merged branch may be divergent (evaluated per-thread),
splitting the wavefront: some threads take the hot path, others the
cold path.
A convergent call like ds_bpermute (a cross-lane operation on AMDGPU)
requires a specific set of threads to be active — when thread X reads
from thread Y, thread Y must be active and participating in the same
call. After CHR cloning, thread Y may have gone to the cold path while
thread X is on the hot path, so the hot-path ds_bpermute reads a stale
register value from thread Y instead of the intended value.
This caused a miscompilation in rocPRIM's lookback scan: CHR duplicated
a region containing ds_bpermute, and the hot-path copy executed with a
different set of active threads, reading incorrect cross-lane data and
causing a memory access fault.
The fix skips any region containing convergent or noduplicate calls,
following the same pattern as SimplifyCFG's block-duplication guard.
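The guard amounts to a simple scan over the calls in a candidate region. The Python sketch below uses a hypothetical `Call` record with a set of attribute names; it models the idea only and is not the data structure used by the pass.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    callee: str
    attrs: frozenset = field(default_factory=frozenset)


def region_is_cloneable(calls):
    """Reject regions whose calls must not be duplicated or regrouped."""
    blocked = {"convergent", "noduplicate"}
    return not any(call.attrs & blocked for call in calls)
```

A region containing `Call("ds_bpermute", frozenset({"convergent"}))` is rejected, so the transformation leaves it alone.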
In this PR, we added basic support of type definitions in Python-defined
dialects, including:
- IRDL codegen for type definitions
- Type builders like `MyType.get(..)` and type parameter accessors (e.g.
`my_type.param1`)
- Use Python-defined types in Python-defined operations
```python
class TestType(Dialect, name="ext_type"):
    pass

class Array(TestType.Type, name="array"):
    elem_type: IntegerType[32] | IntegerType[64]
    length: IntegerAttr

class MakeArrayOp(TestType.Operation, name="make_array"):
    arr: Result[Array]

class MakeArray3Op(TestType.Operation, name="make_array3"):
    arr: Result[Array[IntegerType[32], IntegerAttr[IntegerType[32], 3]]]
```
First, disallow R_X86_64_PC64 - generally only absolute relocations are
allowed in getDynRel. glibc and musl don't support R_X86_64_PC64 as
dynamic relocations.
Second, support R_X86_64_32 as dynamic relocation for the ILP32 ABI
(x32). GNU ld's behavior looks like:
- R_X86_64_32 => R_X86_64_RELATIVE
- R_X86_64_64 with addend 0 => R_X86_64_RELATIVE
- R_X86_64_64 with non-zero addend => R_X86_64_RELATIVE64 (unsupported
by musl; compilers do not generate such constructs to the best of my
knowledge)
For now we require R_X86_64_64 to be resolved at link-time for x32.
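The GNU ld behavior listed above can be written as a small lookup. This Python sketch only restates that list; the function name is hypothetical and it does not describe what lld does.

```python
def gnu_ld_x32_dyn_rel(rel_type, addend):
    """Dynamic relocation GNU ld is described as emitting for x32."""
    if rel_type == "R_X86_64_32":
        return "R_X86_64_RELATIVE"
    if rel_type == "R_X86_64_64":
        # A non-zero addend needs the 64-bit relative form.
        return "R_X86_64_RELATIVE" if addend == 0 else "R_X86_64_RELATIVE64"
    raise ValueError(f"not covered by this sketch: {rel_type}")
```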
Fix #140465
This patch makes AsmPrinter work with the NewPM. We essentially create
three new passes that wrap different parts of AsmPrinter so that we can
separate out doInitialization/doFinalization without needing to
materialize all MachineFunctions at the same time. This has three main
drawbacks for now:
1. We do not transfer any state between the three new AsmPrinter passes.
This means that debuginfo/CFI currently does not work. This will be
fixed in future passes by moving this state to MachineModuleInfo.
2. We probably incur some overhead by needing to set up analysis
callbacks for every MF rather than just per module. This should not
be large, and can be optimized in the future on top of this if
needed.
3. This solution is not really clean. However, a lot of cleanup is going
to be difficult to do while supporting two pass managers. Once we
remove LegacyPM support, we can make the code much cleaner and better
enforce invariants like a lack of state between
doInitialization/runOnMachineFunction/doFinalization.
Reviewers: arsenm, aeubanks, paperchalice
Pull Request: https://github.com/llvm/llvm-project/pull/182797
This allows for overriding these callbacks when using the NewPM, which
has different methods for obtaining analysis results.
Reviewers: RKSimon, arsenm, phoebewang, mingmingl-llvm, aeubanks
Pull Request: https://github.com/llvm/llvm-project/pull/182796
AsmPrinter needs to be split into three passes (begin, per MF, end) to
avoid the need to materialize all machine functions at the same time.
Update the CodeGenPassBuilder hooks for this.
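The benefit of the begin / per-MF / end split can be modeled with a Python generator that consumes functions lazily. The names here are made up for illustration and do not correspond to the real pass names.

```python
def print_module(header, functions, footer):
    """Yield output in three phases: begin, per function, end."""
    yield f"begin {header}"      # doInitialization-style work
    for fn in functions:         # `functions` may be a lazy iterator
        yield f"emit {fn}"       # only one function is needed at a time
    yield f"end {footer}"        # doFinalization-style work
```

Because the middle phase handles one function per step, nothing requires the whole list of functions to exist at once.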
Reviewers: aeubanks, paperchalice, arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/182795
Otherwise we cannot create an MCStreamer without getting MMI, which we
cannot do until we have started running AsmPrinter without also plumbing
MMI through CodeGenPassBuilder.
Reviewers: arsenm, paperchalice, aeubanks
Pull Request: https://github.com/llvm/llvm-project/pull/182794
As part of making AsmPrinter work with the new pass manager, we need to
be able to override how we get analyses. This patch does that by
refactoring getting all analyses/other related functionality to
callbacks that are set by default but can be overridden later (like by a
NewPM wrapper pass).
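The pattern of default callbacks that a wrapper can replace looks roughly like the Python sketch below. `Printer`, `get_loop_info`, and `wrap_for_new_pm` are hypothetical names chosen for this sketch, not the AsmPrinter API.

```python
class Printer:
    def __init__(self):
        # Default callback: look the analysis up in a legacy-style registry.
        self.legacy_analyses = {"loops": "legacy loop info"}
        self.get_loop_info = lambda fn: self.legacy_analyses["loops"]

    def emit(self, fn):
        return f"{fn}: {self.get_loop_info(fn)}"


def wrap_for_new_pm(printer, analysis_manager):
    # A wrapper replaces the default with its own way of getting results.
    printer.get_loop_info = lambda fn: analysis_manager[fn]
    return printer
```

The printing code itself never changes; only the callback that supplies the analysis result is swapped.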
Reviewers: aeubanks
Pull Request: https://github.com/llvm/llvm-project/pull/182793
Summary:
This patch matches CUDA, moving the HIP compilation jobs to the new
driver by default. The old behavior will return with
`--no-offload-new-driver`. The main difference is that objects compiled
with the old driver are no longer compatible and will need to be
recompiled or the old driver used.
Fixes #177712
The MatrixElt and VectorElt cases of `EmitLoadOfLValue` did not convert
the scalar value from its load/store type into its primary IR type like
the other cases do. This caused issues with HLSL in particular, which
requires bools to be converted between their load/store type (i32) and
their primary IR type (i1).
This PR fixes the issue by applying `EmitFromMemory` to the loaded
scalar.