49 Commits

Author SHA1 Message Date
James Y Knight
c7f3437507
NFC: Clean up of IntrusiveRefCntPtr construction from raw pointers. (#151545)
Handles clang::DiagnosticsEngine and clang::DiagnosticIDs.

For DiagnosticIDs, this mostly migrates from `new DiagnosticIDs` to the
convenience method `DiagnosticIDs::create()`.

Part of cleanup https://github.com/llvm/llvm-project/issues/151026
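
For illustration, a minimal before/after sketch of the migration described above, assuming `DiagnosticIDs::create()` returns an `llvm::IntrusiveRefCntPtr<DiagnosticIDs>` (local variable names are arbitrary):

```c++
#include "clang/Basic/DiagnosticIDs.h"
#include "llvm/ADT/IntrusiveRefCntPtr.h"

void before() {
  // Old pattern: wrap a raw `new` in the intrusive ref-count pointer by hand.
  llvm::IntrusiveRefCntPtr<clang::DiagnosticIDs> DiagID(new clang::DiagnosticIDs());
}

void after() {
  // New pattern: the factory method hands back an already ref-counted object.
  llvm::IntrusiveRefCntPtr<clang::DiagnosticIDs> DiagID = clang::DiagnosticIDs::create();
}
```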
2025-07-31 15:07:35 -04:00
Kazu Hirata
096cad59df [Serialization] Fix a warning
This patch fixes:

  clang/unittests/Serialization/SourceLocationEncodingTest.cpp:39:34:
  error: unused variable 'Biggest' [-Werror,-Wunused-const-variable]
2025-06-25 07:53:24 -07:00
Haojian Wu
e90ab0e342
[Serialization] Remove delta encoding optimization (#145670)
See the discussion in https://github.com/llvm/llvm-project/pull/145529.

This will slightly increase the PCM size (~5%). Some data points (in-memory
preamble size in clangd):

- SemaExpr.cpp: 77MB  -> 80MB
- FindTarget.cpp: 71MB -> 75MB
2025-06-25 15:50:09 +02:00
Jan Svoboda
13e1a2cb22 Reapply "[clang] Remove intrusive reference count from DiagnosticOptions (#139584)"
This reverts commit e2a885537f11f8d9ced1c80c2c90069ab5adeb1d. Build failures were fixed right away and reverting the original commit without the fixes breaks the build again.
2025-05-22 12:52:03 -07:00
Kazu Hirata
e2a885537f Revert "[clang] Remove intrusive reference count from DiagnosticOptions (#139584)"
This reverts commit 9e306ad4600c4d3392c194a8be88919ee758425c.

Multiple buildbot failures have been reported:
https://github.com/llvm/llvm-project/pull/139584
2025-05-22 12:44:20 -07:00
Jan Svoboda
9e306ad460
[clang] Remove intrusive reference count from DiagnosticOptions (#139584)
The `DiagnosticOptions` class is currently intrusively
reference-counted, which makes reasoning about its lifetime very
difficult in some cases. For example, `CompilerInvocation` owns the
`DiagnosticOptions` instance (wrapped in `llvm::IntrusiveRefCntPtr`) and
only exposes an accessor returning `DiagnosticOptions &`. One would
think this gives `CompilerInvocation` exclusive ownership of the object,
but that's not the case:

```c++
void shareOwnership(CompilerInvocation &CI) {
  llvm::IntrusiveRefCntPtr<DiagnosticOptions> CoOwner = &CI.getDiagnosticOptions();
  // ...
}
```

This is a perfectly valid pattern that is actually used in the
codebase.

I would like to ensure the ownership of `DiagnosticOptions` by
`CompilerInvocation` is guaranteed to be exclusive. This can be
leveraged for a copy-on-write optimization later on. This PR changes
usages of `DiagnosticOptions` across `clang`, `clang-tools-extra` and
`lldb` to not be intrusively reference-counted.
2025-05-22 12:33:52 -07:00
Jan Svoboda
b69dcb8734
[clang][frontend] Require invocation to construct CompilerInstance (#137668)
This PR makes it so that `CompilerInvocation` needs to be provided to
`CompilerInstance` on construction. There are a couple of benefits in my
view:
* Making it impossible to misuse some `CompilerInstance` APIs. For
example, there were cases where `createDiagnostics()` was called before
`setInvocation()`, causing the `DiagnosticsEngine` to use the
default-constructed `DiagnosticOptions` instead of the intended ones.
* This shrinks `CompilerInstance`'s state space.
* This makes it possible to access **the** invocation in
`CompilerInstance`'s constructor (to be used in a follow-up).
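
A minimal sketch of the new construction pattern, assuming the constructor takes the invocation as a `std::shared_ptr<CompilerInvocation>` (the exact parameter list may differ):

```c++
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/CompilerInvocation.h"
#include <memory>

void buildInstance() {
  // The invocation is supplied up front, so APIs that read it (such as
  // diagnostics creation) can no longer observe a default-constructed one.
  auto Invocation = std::make_shared<clang::CompilerInvocation>();
  // ... populate Invocation (language options, diagnostic options, ...) ...
  clang::CompilerInstance Clang(std::move(Invocation));
  (void)Clang;
}
```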
2025-05-01 07:31:30 -07:00
Reid Kleckner
e3c0565b74
Reapply "[cmake] Refactor clang unittest cmake" (#134195)
This reapplies 5ffd9bdb50b57 (#133545) with fixes.

The BUILD_SHARED_LIBS=ON build was fixed by adding missing LLVM
dependencies to the InterpTests binary in
unittests/AST/ByteCode/CMakeLists.txt .
2025-04-02 21:07:30 -07:00
dpalermo
03a791f703
Revert "[cmake] Refactor clang unittest cmake" (#134022)
Reverts llvm/llvm-project#133545

This change is breaking several buildbots as well as developers' builds.
Reverting to allow people to make progress.
2025-04-01 22:19:27 -05:00
Reid Kleckner
5ffd9bdb50
[cmake] Refactor clang unittest cmake (#133545)
Pass all the dependencies into add_clang_unittest. This is consistent
with how it is done for LLDB. I borrowed the same named argument list
structure from add_lldb_unittest. This is a necessary step towards
consolidating unit tests into fewer binaries, but seems like a good
refactoring in its own right.
2025-04-01 14:12:44 -07:00
Ilya Biryukov
7f43120152 [Serialization] Free memory in LoadSpecLazilyTest
Default Clang invocations set DisableFree = true, which causes ASAN to
complain. Override it in tests that are not supposed to leak.
2024-12-12 11:19:11 +01:00
Chuanqi Xu
20e9049509
[Serialization] Support loading template specializations lazily (#119333)
Reland https://github.com/llvm/llvm-project/pull/83237

---

(Original comments)

Currently, all the specializations of a template (including
instantiations, explicit specializations, and partial specializations)
are loaded at once whenever we want to instantiate another instance of
the template, find an existing instantiation, or just complete the
redecl chain.

This means we essentially need to load every specialization of the
template once the template declaration gets loaded. This is bad: when we
load a specialization, we need to load all of its template arguments,
and then we end up deserializing a lot of unnecessary declarations.

For example,

```
// M.cppm
export module M;
export template <class T>
class A {};

export class ShouldNotBeLoaded {};

export class Temp {
   A<ShouldNotBeLoaded> AS;
};

// use.cpp
import M;
A<int> a;
```

We have a specialization `A<ShouldNotBeLoaded>` in `M.cppm` and we
instantiate the template `A` in `use.cpp`. Then, surprisingly, we
deserialize `ShouldNotBeLoaded` when compiling `use.cpp`. This patch
tries to avoid that.

Given how heavily templates are used in C++, this is a performance pain
point.

This patch adds a MultiOnDiskHashTable for specializations in the
ASTReader, so we only deserialize the specializations with matching
template arguments. We achieve this by using the ODRHash of the template
arguments as the key of the hash table.

To review this patch, I think `ASTReaderDecl::AddLazySpecializations`
may be a good entry point.
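
As a self-contained sketch of the lookup idea (not the actual ASTReader data structures), specializations are bucketed under a hash of their template arguments, and only the matching bucket is deserialized:

```c++
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real serialization machinery.
using DeclID = std::uint64_t;

struct LazySpecializationTable {
  // Key: a stable hash of the template arguments (ODRHash in the real patch).
  // Value: the declaration IDs of specializations with those arguments.
  std::unordered_map<std::uint64_t, std::vector<DeclID>> Buckets;

  // Only the bucket for the requested arguments is touched; specializations
  // for unrelated arguments (e.g. A<ShouldNotBeLoaded>) stay unloaded.
  const std::vector<DeclID> *lookup(std::uint64_t ArgsHash) const {
    auto It = Buckets.find(ArgsHash);
    return It == Buckets.end() ? nullptr : &It->second;
  }
};
```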
2024-12-11 09:40:47 +08:00
Haowei Wu
12bdeba76e Revert "[Serialization] Support load lazy specialization lazily"
This reverts commit b5bd19211118c6d43bc525a4e3fb65d2c750d61e.
It broke multiple LLVM bots, including clang-x64-windows-msvc
2024-12-06 10:33:57 -08:00
Chuanqi Xu
b5bd192111 [Serialization] Support load lazy specialization lazily
Currently, all the specializations of a template (including
instantiations, explicit specializations, and partial specializations)
are loaded at once whenever we want to instantiate another instance of
the template, find an existing instantiation, or just complete the
redecl chain.

This means we essentially need to load every specialization of the
template once the template declaration gets loaded. This is bad: when we
load a specialization, we need to load all of its template arguments,
and then we end up deserializing a lot of unnecessary declarations.

For example,

```
// M.cppm
export module M;
export template <class T>
class A {};

export class ShouldNotBeLoaded {};

export class Temp {
   A<ShouldNotBeLoaded> AS;
};

// use.cpp
import M;
A<int> a;
```

We have a specialization `A<ShouldNotBeLoaded>` in `M.cppm` and we
instantiate the template `A` in `use.cpp`. Then, surprisingly, we
deserialize `ShouldNotBeLoaded` when compiling `use.cpp`. This patch
tries to avoid that.

Given how heavily templates are used in C++, this is a performance pain
point.

This patch adds a MultiOnDiskHashTable for specializations in the
ASTReader, so we only deserialize the specializations with matching
template arguments. We achieve this by using the ODRHash of the template
arguments as the key of the hash table.

To review this patch, I think `ASTReaderDecl::AddLazySpecializations`
may be a good entry point.

The patch was reviewed in
https://github.com/llvm/llvm-project/pull/83237, but that PR is a stacked
PR, and I feel the intention of the stacked PRs gets lost during the
review process. So it seems better to squash the commits into a single
commit instead of merging them from the PR page; that also makes it
easier to cherry-pick and revert.
2024-12-06 10:52:35 +08:00
Kadir Cetinkaya
df9a14d7bb
Reapply "[NFC] Explicitly pass a VFS when creating DiagnosticsEngine (#115852)"
This reverts commit a1153cd6fedd4c906a9840987934ca4712e34cb2 with fixes
to lldb breakages.

Fixes https://github.com/llvm/llvm-project/issues/117145.
2024-11-21 14:55:30 +01:00
Sylvestre Ledru
a1153cd6fe Revert "[NFC] Explicitly pass a VFS when creating DiagnosticsEngine (#115852)"
Reverted for causing:
https://github.com/llvm/llvm-project/issues/117145

This reverts commit bdd10d9d249bd1c2a45e3de56a5accd97e953458.
2024-11-21 13:04:30 +01:00
kadir çetinkaya
bdd10d9d24
[NFC] Explicitly pass a VFS when creating DiagnosticsEngine (#115852)
Starting with 41e3919ded78d8870f7c95e9181c7f7e29aa3cc4, DiagnosticsEngine
creation might perform IO. It was implicitly defaulting to
`getRealFileSystem`. This patch makes that explicit by pushing the
decision to callers.

Callers use the ambient VFS if one is available, and keep using
`getRealFileSystem` if there isn't one.
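
As a small illustration of the caller-side decision described above (the helper name here is hypothetical; only `llvm::vfs::getRealFileSystem()` is real API):

```c++
#include "llvm/ADT/IntrusiveRefCntPtr.h"
#include "llvm/Support/VirtualFileSystem.h"

// Hypothetical helper mirroring the policy this patch pushes to callers:
// prefer an ambient VFS if one exists, otherwise fall back to the real FS.
llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>
pickDiagnosticsVFS(llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> Ambient) {
  if (Ambient)
    return Ambient;
  return llvm::vfs::getRealFileSystem();
}
```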
2024-11-21 12:11:41 +01:00
Chuanqi Xu
947b062823 Reland "[Modules] No transitive source location change (#86912)"
This relands 6c31104.

The patch was reverted due to an incorrectly introduced alignment issue
and was re-committed after fixing it.

Following is the original message:

This is part of the "no transitive change" patch series: "no transitive
source location change". I discussed this with @Bigcheese at the Tokyo
WG21 meeting.

The idea comes from @jyknight's post on LLVM Discourse. That is, for:

```
// A.cppm
export module A;
...

// B.cppm
export module B;
import A;
...

//--- C.cppm
export module C;
import B;
```

Almost every time `A.cppm` changes, we need to recompile `B`, because we
treat source locations as significant to the semantics. But it would be
good if we could avoid recompiling `C` when a change in `A` doesn't
change the BMI of `B`.

This patch only cares about source locations, so let's focus on a source
location example. We can see the full example in the attached test.

```
//--- A.cppm
export module A;
export template <class T>
struct C {
    T func() {
        return T(43);
    }
};
export int funcA() {
    return 43;
}

//--- A.v1.cppm
export module A;

export template <class T>
struct C {
    T func() {
        return T(43);
    }
};
export int funcA() {
    return 43;
}

//--- B.cppm
export module B;
import A;

export int funcB() {
    return funcA();
}

//--- C.cppm
export module C;
import A;
export void testD() {
    C<int> c;
    c.func();
}
```

Here the only difference between `A.cppm` and `A.v1.cppm` is that
`A.v1.cppm` has an additional blank line. The test then shows that the
two BMIs of `B.cppm`, one built with `-fmodule-file=A=A.pcm` and the
other with `-fmodule-file=A=A.v1.pcm`, should have bit-wise identical
contents.

However, it is a different story for `C`: since `C` instantiates
templates from `A`, and the instantiation records source information from
module `A`, which differs between `A` and `A.v1`, the BMIs `C.pcm` and
`C.v1.pcm` can and should differ.

To fully understand the patch, we need to understand how we encode source
locations and how we serialize and deserialize them.

For source locations, we encoded them as:

```
|
|
| _____ base offset of an imported module
|
|
|
|_____ base offset of another imported module
|
|
|
|
| ___ 0
```

As the diagram shows, we encode local (unloaded) source locations from 0
upwards, and we allocate space for source locations from loaded modules
from the high end downwards. Source locations from loaded modules are
then mapped into our source location space according to the allocated
offsets.

For example, for,

```
// a.cppm
export module a;
...

// b.cppm
export module b;
import a;
...
```

Assume the offset of a source location (call it `S`) in `a.cppm` is 45;
we record the value `45` into the BMI `a.pcm`. Then in `b.cppm`, when we
import `a`, the source manager allocates a space for module `a`
(according to the recorded number of source locations) as the base offset
of module `a` in the current source location space. Let's assume the
allocated base offset is 90 in this example. Then, to get the location of
`S` in the current source location space, we simply add `45` to `90`,
giving `135`. So the source location of `S` in module `b` is `135`.

When we write module `b`, we also write the source location of `S` as
`135` directly into the BMI. And to record that the location `S` comes
from module `a`, we also record the base offset of module `a` (90) in the
BMI of `b`.

Then the problem comes. The base offset of module `a` is computed from
the number of source locations in module `a`, so the base offset of
module `a` recorded in module `b` changes every time the number of source
locations in module `a` increases or decreases. In other words, the
contents of the BMI of `b` change every time the number of locations in
module `a` changes. This is pretty sensitive; almost every change will
change the number of locations. This is the problem this patch wants to
solve.

Let's continue with the existing design to understand what's going on.
Another interesting case is:

```
// c.cppm
export module c;
import whatever;
import a;
import b;
...
```

In `c.cppm`, when we import `a`, we still need to allocate a base
location offset for it; say the value happens to be `200`. Then when we
reach the location `S` recorded in module `b`, we need to translate it
into the current source location space. The solution is quite simple: we
get it by `135 + (200 - 90) = 245`. In other words, the offset of a
source location in the current module can be computed as `Recorded Offset
+ Base Offset of its module file - Recorded Base Offset`.

That covers how we handle the offsets of source locations in the
serializers.

At an abstract level, what we want to do is remove the hardcoded base
offset of imported modules while retaining the ability to calculate the
source location in a new module unit. To achieve this, we need to be able
to find the module file owning a source location from the encoding of the
source location.

So in this patch, for each source location, we store the local offset of
the location and the module file index. For the above example, in
`b.pcm`, the source location of `S` used to be recorded as `135`
directly. In the new design, the source location of `S` is recorded as
`<1, 45>`: here `1` is the module file index of `a` in module `b`, and
`45` is the offset of `S` from the base offset of module `a`.

So the trade-off is that, to make the BMI more independent, we need to
record more abstract information, and I feel that is worthwhile. The
recompilation problem of modules is really annoying and there are still
people complaining about it. But if we can make this work (including
stopping other changes from propagating transitively), I think this may
be a killer feature for modules. And per @Bigcheese, this should be
helpful for clang explicit modules too.

On the benchmarking side, I tested this patch against
https://github.com/alibaba/async_simple/tree/CXX20Modules. There was no
significant change in compilation time. The size of the .pcm files grows
from 200M to 204M. I think the trade-off is pretty fair.

I didn't use another slot to record the module file index; instead I use
the higher 32 bits of the existing source location encoding to store that
information. This should be safe: we use `unsigned` to store source
locations but `uint64_t` in serialization, and `unsigned` is 32 bits wide
on most platforms, so the bits now used to store the module file index
were not used before. The new encoding is:

```
   |-----------------------|-----------------------|
   |           A           |         B         | C |

  * A: 32 bit. The index of the module file in the module manager + 1. The +1
          here is necessary since we wish 0 stands for the current module file.
  * B: 31 bit. The offset of the source location to the module file containing it.
  * C: The macro bit. We rotate it to the lowest bit so that we can save some
          space in case the index of the module file is 0.
```

(B and C are the existing raw encoding for source locations.)
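
A minimal sketch of that bit layout (illustrative only; the real encoding and decoding live in the serialization code):

```c++
#include <cstdint>

// Illustrative encoding following the layout above:
//   bits 63..32 (A): module file index + 1, where 0 means the current module file
//   bits 31..1  (B): offset of the location within that module file
//   bit  0      (C): the macro bit, rotated down to the lowest bit
constexpr std::uint64_t encodeLoc(std::uint32_t BiasedFileIndex,
                                  std::uint32_t Offset, bool IsMacro) {
  return (std::uint64_t(BiasedFileIndex) << 32) |
         (std::uint64_t(Offset) << 1) | (IsMacro ? 1u : 0u);
}

constexpr std::uint32_t fileIndexOf(std::uint64_t Encoded) {
  return std::uint32_t(Encoded >> 32);
}

constexpr std::uint32_t offsetOf(std::uint64_t Encoded) {
  return std::uint32_t(Encoded) >> 1;
}

// A location in the current module file has a zero high half, so the
// VBR-encoded value stays as small as before.
```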

Another reason to reuse the same slot for the source location is to
reduce the impact of the patch. There are a lot of places assuming we can
store and read a source location from a single slot, and if I tried to
add another slot, a lot of code would break. I don't feel that is
worthwhile.

Another impact of this decision is that existing small optimizations for
encoding source locations may be invalidated. The key to those
optimizations is turning large values into small values so that the VBR6
format can reduce the size; if we put the module file index into the
higher bits, that may simply stop working. An example is the
`SourceLocationSequence` optimization.

This only affects the size of on-disk .pcm files. I don't expect it to
impact the speed or memory use of compilations, and given my small
experiments above, I feel this trade-off is worthwhile.

The mental model for handling source location offsets is not very
complex, and I believe we can solve the problem by adding a module file
index to each stored source location.

On the practical side, since source locations are pretty sensitive and
the patch passes all the in-tree tests and some small-scale projects, I
feel it should be correct.

I'll continue to work on "no transitive decl change" and "no transitive
identifier change" (if it matters) to achieve the goal of stopping the
propagation of unnecessary changes. All of this depends on this patch,
since source locations are clearly the most sensitive part.

---

The release notes and documentation will be added separately.
2024-05-06 13:35:16 +08:00
Chuanqi Xu
d333a0de68 Revert "[Modules] No transitive source location change (#86912)"
This reverts commit 6c3110464bac3600685af9650269b0b2b8669d34.

Required by the post commit comments: https://github.com/llvm/llvm-project/pull/86912
2024-04-30 22:32:02 +08:00
Chuanqi Xu
6c3110464b
[Modules] No transitive source location change (#86912)
This is part of the "no transitive change" patch series: "no transitive
source location change". I discussed this with @Bigcheese at the Tokyo
WG21 meeting.

The idea comes from @jyknight's post on LLVM Discourse. That is, for:

```
// A.cppm
export module A;
...

// B.cppm
export module B;
import A;
...

//--- C.cppm
export module C;
import B;
```

Almost every time `A.cppm` changes, we need to recompile `B`, because we
treat source locations as significant to the semantics. But it would be
good if we could avoid recompiling `C` when a change in `A` doesn't
change the BMI of `B`.

# Motivation Example

This patch only cares about source locations, so let's focus on a source
location example. We can see the full example in the attached test.

```
//--- A.cppm
export module A;
export template <class T>
struct C {
    T func() {
        return T(43);
    }
};
export int funcA() {
    return 43;
}

//--- A.v1.cppm
export module A;

export template <class T>
struct C {
    T func() {
        return T(43);
    }
};
export int funcA() {
    return 43;
}

//--- B.cppm
export module B;
import A;

export int funcB() {
    return funcA();
}

//--- C.cppm
export module C;
import A;
export void testD() {
    C<int> c;
    c.func();
}
```

Here the only difference between `A.cppm` and `A.v1.cppm` is that
`A.v1.cppm` has an additional blank line. The test then shows that the
two BMIs of `B.cppm`, one built with `-fmodule-file=A=A.pcm` and the
other with `-fmodule-file=A=A.v1.pcm`, should have bit-wise identical
contents.

However, it is a different story for `C`: since `C` instantiates
templates from `A`, and the instantiation records source information from
module `A`, which differs between `A` and `A.v1`, the BMIs `C.pcm` and
`C.v1.pcm` can and should differ.

# Internal perspective of status quo

To fully understand the patch, we need to understand how we encode source
locations and how we serialize and deserialize them.

For source locations, we encoded them as:

```
|
|
| _____ base offset of an imported module
|
|
|
|_____ base offset of another imported module
|
|
|
|
| ___ 0
```

As the diagram shows, we encode local (unloaded) source locations from 0
upwards, and we allocate space for source locations from loaded modules
from the high end downwards. Source locations from loaded modules are
then mapped into our source location space according to the allocated
offsets.

For example, for,

```
// a.cppm
export module a;
...

// b.cppm
export module b;
import a;
...
```

Assume the offset of a source location (call it `S`) in `a.cppm` is 45;
we record the value `45` into the BMI `a.pcm`. Then in `b.cppm`, when we
import `a`, the source manager allocates a space for module `a`
(according to the recorded number of source locations) as the base offset
of module `a` in the current source location space. Let's assume the
allocated base offset is 90 in this example. Then, to get the location of
`S` in the current source location space, we simply add `45` to `90`,
giving `135`. So the source location of `S` in module `b` is `135`.

When we write module `b`, we also write the source location of `S` as
`135` directly into the BMI. And to record that the location `S` comes
from module `a`, we also record the base offset of module `a` (90) in the
BMI of `b`.

Then the problem comes. The base offset of module `a` is computed from
the number of source locations in module `a`, so the base offset of
module `a` recorded in module `b` changes every time the number of source
locations in module `a` increases or decreases. In other words, the
contents of the BMI of `b` change every time the number of locations in
module `a` changes. This is pretty sensitive; almost every change will
change the number of locations. This is the problem this patch wants to
solve.

Let's continue with the existing design to understand what's going on.
Another interesting case is:

```
// c.cppm
export module c;
import whatever;
import a;
import b;
...
```

In `c.cppm`, when we import `a`, we still need to allocate a base
location offset for it; say the value happens to be `200`. Then when we
reach the location `S` recorded in module `b`, we need to translate it
into the current source location space. The solution is quite simple: we
get it by `135 + (200 - 90) = 245`. In other words, the offset of a
source location in the current module can be computed as `Recorded Offset
+ Base Offset of its module file - Recorded Base Offset`.

That covers how we handle the offsets of source locations in the
serializers.

# The high level design of current patch

At an abstract level, what we want to do is remove the hardcoded base
offset of imported modules while retaining the ability to calculate the
source location in a new module unit. To achieve this, we need to be able
to find the module file owning a source location from the encoding of the
source location.

So in this patch, for each source location, we store the local offset of
the location and the module file index. For the above example, in
`b.pcm`, the source location of `S` used to be recorded as `135`
directly. In the new design, the source location of `S` is recorded as
`<1, 45>`: here `1` is the module file index of `a` in module `b`, and
`45` is the offset of `S` from the base offset of module `a`.

So the trade-off is that, to make the BMI more independent, we need to
record more abstract information, and I feel that is worthwhile. The
recompilation problem of modules is really annoying and there are still
people complaining about it. But if we can make this work (including
stopping other changes from propagating transitively), I think this may
be a killer feature for modules. And per @Bigcheese, this should be
helpful for clang explicit modules too.

On the benchmarking side, I tested this patch against
https://github.com/alibaba/async_simple/tree/CXX20Modules. There was no
significant change in compilation time. The size of the .pcm files grows
from 200M to 204M. I think the trade-off is pretty fair.

# Some low level details

I didn't use another slot to record the module file index; instead I use
the higher 32 bits of the existing source location encoding to store that
information. This should be safe: we use `unsigned` to store source
locations but `uint64_t` in serialization, and `unsigned` is 32 bits wide
on most platforms, so the bits now used to store the module file index
were not used before. The new encoding is:

```
   |-----------------------|-----------------------|
   |           A           |         B         | C |

  * A: 32 bit. The index of the module file in the module manager + 1. The +1
          here is necessary since we wish 0 stands for the current module file.
  * B: 31 bit. The offset of the source location to the module file containing it.
  * C: The macro bit. We rotate it to the lowest bit so that we can save some 
          space in case the index of the module file is 0.
```

(B and C are the existing raw encoding for source locations.)

Another reason to reuse the same slot for the source location is to
reduce the impact of the patch. There are a lot of places assuming we can
store and read a source location from a single slot, and if I tried to
add another slot, a lot of code would break. I don't feel that is
worthwhile.

Another impact of this decision is that existing small optimizations for
encoding source locations may be invalidated. The key to those
optimizations is turning large values into small values so that the VBR6
format can reduce the size; if we put the module file index into the
higher bits, that may simply stop working. An example is the
`SourceLocationSequence` optimization.

This only affects the size of on-disk .pcm files. I don't expect it to
impact the speed or memory use of compilations, and given my small
experiments above, I feel this trade-off is worthwhile.

# Correctness

The mental model for handling source location offsets is not very
complex, and I believe we can solve the problem by adding a module file
index to each stored source location.

On the practical side, since source locations are pretty sensitive and
the patch passes all the in-tree tests and some small-scale projects, I
feel it should be correct.

# Future Plans

I'll continue to work on "no transitive decl change" and "no transitive
identifier change" (if it matters) to achieve the goal of stopping the
propagation of unnecessary changes. All of this depends on this patch,
since source locations are clearly the most sensitive part.

---

The release notes and documentation will be added separately.
2024-04-30 15:57:58 +08:00
Chuanqi Xu
da00c60dae
[C++20] [Modules] Introduce reduced BMI (#75894)
Close https://github.com/llvm/llvm-project/issues/71034

See

https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755

This patch introduces the reduced BMI, which doesn't contain the
definitions of functions and variables if those definitions won't
contribute to the ABI.

Testing is a big part of the patch. We want to make sure the reduced BMI
preserves the same behavior as the existing and relatively stable fat
BMI. This is pretty helpful for further reduction.

The user interface part is left to following patches to ease reviewing.
2024-03-08 10:12:51 +08:00
Krasimir Georgiev
b2ebd8b897
clang serialization unittests: fix some leaks (#82773)
No functional changes intended.

Fixes some leaks found by running under asan with `--gtest_repeat=2`.
2024-02-26 16:58:39 +01:00
Chuanqi Xu
49775b1dc0
[Serialization] Record whether the ODR is skipped (#82302)
Close https://github.com/llvm/llvm-project/issues/80570.

In

a0b6747804,
we skipped ODR checks for decls in the GMF. It is then natural to skip
storing the ODR values in the BMI as well.

Generally this is fine as long as the writer and the reader stay
consistent.

However, the use of the preamble in clangd exposes a tricky case.

For,

```
// test.cpp
module;

// any one of these is enough to crash clangd
// #include <iostream>
// #include <string_view>
// #include <cmath>
// #include <system_error>
// #include <new>
// #include <bit>
// probably many more

// only ok with libc++, not the system provided libstdc++ 13.2.1

// these are ok

export module test;
```

clangd stores the headers as a preamble to speed up parsing, and the
preamble reuses the serialization machinery. (We'd generally call the
preamble a PCH, though that's not strictly accurate; I've verified that
an actual PCH wouldn't be problematic.) The tricky part is that the
preamble is not a module: it literally serializes and deserializes
things. So before clangd parses the above test module, it serializes the
headers into the preamble. At that point there is no notion of a GMF, so
the ODR bits are stored. However, when clangd then parses the file for
real, the decls from the preamble are treated as being in the GMF, so the
ODR bits are skipped, and a mismatch occurs.

To solve the problem, this patch adds another bit to decls to record
whether or not the ODR bits are skipped.
2024-02-20 13:31:28 +08:00
Chuanqi Xu
d76b56fd28 [NFC] Eliminate warnings in SourceLocationEncodingTest.cpp
There are two dangling-else warnings and a warning about parentheses around bit operations.
This patch eliminates them.
2023-11-02 14:01:17 +08:00
Benjamin Kramer
263fc4c79a Turn off memory leaks in unit test 2023-09-14 13:16:22 +02:00
Chuanqi Xu
45c160510f [C++20] [Modules] [Serialization] Check input file contents with option ForceCheckCXX20ModulesInputFiles enabled
With overridden input files, e.g. when the compiler gets a file from an
in-memory buffer, the compiler can't get correct modification-time
information to tell whether the input files have changed. That breaks the
semantics of ForceCheckCXX20ModulesInputFiles.

With this patch, if both ForceCheckCXX20ModulesInputFiles and
ValidateASTInputFilesContent are enabled, the compiler will still check the
hash value of the contents even if the modification time is the same.
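
A self-contained sketch of the validation decision described above (hypothetical names; the real logic lives in the ASTReader's input-file validation):

```c++
#include <cstdint>

struct InputFileInfo {
  std::int64_t StoredMTime, ActualMTime;
  std::uint64_t StoredContentHash, ActualContentHash;
};

// Hypothetical predicate: with both flags enabled, a matching mtime is no
// longer enough on its own; the content hash is compared as well.
bool isOutOfDate(const InputFileInfo &File,
                 bool ForceCheckCXX20ModulesInputFiles,
                 bool ValidateASTInputFilesContent) {
  if (File.StoredMTime != File.ActualMTime)
    return true;
  if (ForceCheckCXX20ModulesInputFiles && ValidateASTInputFilesContent)
    return File.StoredContentHash != File.ActualContentHash;
  return false;
}
```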
2023-09-14 11:27:50 +08:00
Mikhail Goncharov
4faa6c6699 [clang][test] Use a physical copy of FS
(missing piece from https://reviews.llvm.org/D152265)
476e7c49ecb762df1d68273696b06c36feb0fd96
2023-06-06 17:36:35 +02:00
Kadir Cetinkaya
476e7c49ec
[clang][test] Use a physical copy of FS
Use a physical copy rather than the real FS in unittests that change the
working directory, to get rid of the side effect of changing the cwd for the
whole process, which was triggering crashes depending on test order.

Differential Revision: https://reviews.llvm.org/D152265
2023-06-06 15:45:55 +02:00
Chuanqi Xu
c336c983bc [C++20] [Modules] [Serialization] Don't write comments to BMI for C++20 Named Modules
This patch stops writing comments to BMIs for C++20 Named Modules.
Originally I thought this was helpful for language services like clangd,
but I found that clangd doesn't actually want the BMI to contain
comments. So it is meaningless for C++20 Named Modules to keep such
comments in their BMI.

It is simple to re-enable this if we find we actually want it someday.
2023-06-06 13:05:17 +08:00
Michael Liao
2daf91dae3 Fix shared library build again from 1c9a800. NFC 2023-05-24 14:09:38 -04:00
Amy Kwan
61262f9ef4 Fix shared library build from 1c9a800.
Fix the shared library build failure on clang-ppc64le-rhel from 1c9a800 as seen
in: https://lab.llvm.org/buildbot/#/builders/57/builds/27080/steps/6/logs/stdio
2023-05-24 12:29:23 -05:00
Chuanqi Xu
1c9a8004ed Recommit [C++20] [Modules] Serialize the evaluated constant values for VarDecl
Close https://github.com/llvm/llvm-project/issues/62796.

Previously, we didn't serialize the evaluated result for VarDecl. This
caused the compilation of template metaprogramming to become slower than
expected. This patch fixes the issue.

This is a recommit tested with asan built clang.
2023-05-24 15:45:16 +08:00
Chuanqi Xu
651b40e8ff Revert "[C++20] [Modules] Serialize the evaluated constant values for VarDecl"
This reverts commit c0d6f85e3ae8bcfdb7217d165314f01c1a4af9ae. The asan
bot detected a memory leak after this patch. Revert it for now.
2023-05-24 13:56:09 +08:00
Chuanqi Xu
597dd1f91d [NFC] Fix the warning for dangling pointer for c0d6f85e3ae8bc
The bot reports a warning (converted to an error) about a dangling pointer.
This patch fixes that.
2023-05-24 11:42:55 +08:00
Chuanqi Xu
c0d6f85e3a [C++20] [Modules] Serialize the evaluated constant values for VarDecl
Close https://github.com/llvm/llvm-project/issues/62796.

Previously, we didn't serialize the evaluated result for VarDecl. This
caused the compilation of template metaprogramming to become slower than
expected. This patch fixes the issue.
2023-05-24 10:17:33 +08:00
Kazu Hirata
6ad0788c33 [clang] Use std::optional instead of llvm::Optional (NFC)
This patch replaces (llvm::|)Optional< with std::optional<.  I'll post
a separate patch to remove #include "llvm/ADT/Optional.h".

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2023-01-14 12:31:01 -08:00
Kazu Hirata
a1580d7b59 [clang] Add #include <optional> (NFC)
This patch adds #include <optional> to those files containing
llvm::Optional<...> or Optional<...>.

I'll post a separate patch to actually replace llvm::Optional with
std::optional.

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2023-01-14 11:07:21 -08:00
Kazu Hirata
a41fbb1fc2 [clang/unittests] Use std::nullopt instead of None (NFC)
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated.  The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-12-03 12:14:19 -08:00
Sam McCall
481691572d [Serialization] Add missing includes for CHAR_BIT 2022-05-19 10:04:25 +02:00
Sam McCall
4df795bff7 [Serialization] Delta-encode consecutive SourceLocations in TypeLoc
Much of the size of PCH/PCM files comes from stored SourceLocations.
These are encoded using (almost) their raw value, VBR-encoded. Absolute
SourceLocations can be relatively large numbers, so this commonly takes
20-30 bits per location.

We can reduce this by exploiting redundancy: many "nearby" SourceLocations are
stored differing only slightly and can be delta-encoded.
Random-access loading of AST nodes constrains how long these sequences
can be, but we can do it at least within a node that always gets
deserialized as an atomic unit.

TypeLoc is implemented in this patch as it's a relatively small change
that shows most of the API.
This saves ~3.5% of PCH size; I have local changes applying this technique
further that save another 3%, and I think it's possible to get to 10% in total.
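
A self-contained sketch of the general delta-encoding idea (illustrative; the real implementation works on SourceLocation raw encodings inside the ASTWriter/ASTReader):

```c++
#include <cstdint>
#include <vector>

// Encode a run of nearby values as a base plus small signed deltas, which a
// variable-bit-rate scheme such as VBR6 can then store compactly.
std::vector<std::int64_t> deltaEncode(const std::vector<std::uint64_t> &Raw) {
  std::vector<std::int64_t> Deltas;
  std::uint64_t Prev = 0;
  for (std::uint64_t V : Raw) {
    Deltas.push_back(std::int64_t(V) - std::int64_t(Prev));
    Prev = V;
  }
  return Deltas;
}

std::vector<std::uint64_t> deltaDecode(const std::vector<std::int64_t> &Deltas) {
  std::vector<std::uint64_t> Raw;
  std::int64_t Acc = 0;
  for (std::int64_t D : Deltas) {
    Acc += D;
    Raw.push_back(std::uint64_t(Acc));
  }
  return Raw;
}
```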

Differential Revision: https://reviews.llvm.org/D125403
2022-05-19 09:40:44 +02:00
Sam McCall
499d0b96cb [clang] createInvocationFromCommandLine -> createInvocation, delete former. NFC
(Followup from 40c13720a4b977d4347bbde53c52a4d0703823c2)

Differential Revision: https://reviews.llvm.org/D125012
2022-05-06 16:21:48 +02:00
Ben Barham
766a08df12 [Frontend] Only compile modules if not already finalized
It was possible to re-add a module to a shared in-memory module cache
when search paths are changed. This can eventually cause a crash if the
original module is referenced after this occurs.
  1. Module A depends on B
  2. B exists in two paths C and D
  3. First run only has C on the search path, finds A and B and loads
     them
  4. Second run adds D to the front of the search path. A is loaded and
     contains a reference to the already compiled module from C. But
     searching finds the module from D instead, causing a mismatch
  5. B and the modules that depend on it are considered out of date and
     thus rebuilt
  6. The recompiled module A is added to the in-memory cache, freeing
     the previously inserted one

This can never occur from a regular clang process, but is very easy to
do through the API - whether through the use of a shared cache or just
running multiple compilations from a single `CompilerInstance`. Update
the compilation to return early if a module is already finalized so that
the pre-condition in the in-memory module cache holds.

Resolves rdar://78180255

Differential Revision: https://reviews.llvm.org/D105328
2021-07-15 18:27:08 -07:00
Rumeet Dhindsa
57a2eaf3c1 Revert "[modules] Do not cache invalid state for modules that we attempted to load."
As per the comment on https://reviews.llvm.org/D72860, it is suggested to
revert this change in the meantime, since it has introduced a regression.

This reverts commit 83f4c3af021cd5322ea10fd1c4e839874c1dae49.
2020-03-10 10:59:26 -07:00
Volodymyr Sapsai
83f4c3af02 [modules] Do not cache invalid state for modules that we attempted to load.
Partially reverts 0a2be46cfdb698fefcc860a56b47dde0884d5335 as it turned
out to cause redundant module rebuilds in multi-process incremental builds.
When a module was getting out of date, all compilation processes started at the
same time were marking it as `ToBuild`. So each process was building the same
module instead of checking if it was built by someone else and using that
result. In addition to the work duplication, contention on the same .pcm file
wasn't making builds faster.

Note that for a single-process build this change would cause redundant module
reads and validations. But reading a module is faster than building it and
multi-process builds are more common than single-process. So I'm willing to
make such a trade-off.

rdar://problem/54395127

Reviewed By: dexonsmith

Differential Revision: https://reviews.llvm.org/D72860
2020-01-16 17:12:41 -08:00
Tom Stellard
2e97d2aa1b cmake: Add CLANG_LINK_CLANG_DYLIB option
Summary:
Setting CLANG_LINK_CLANG_DYLIB=ON causes clang tools to link against
libclang_shared.so instead of the individual component libraries.

Reviewers: mgorny, beanz, smeenai, phosek, sylvestre.ledru

Subscribers: arphaman, cfe-commits, llvm-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D63503

llvm-svn: 365092
2019-07-03 22:45:55 +00:00
Francis Visoiu Mistrih
e0308279cb [Bitcode] Move Bitstream to a separate library
This moves Bitcode/Bitstream*, Bitcode/BitCodes.h to Bitstream/.

This is needed to avoid a circular dependency when using the bitstream
code for parsing optimization remarks.

Since Bitcode uses Core for the IR part:

libLLVMRemarks -> Bitcode -> Core

and Core uses libLLVMRemarks to generate remarks (see
IR/RemarkStreamer.cpp):

Core -> libLLVMRemarks

we need to separate the Bitstream and Bitcode part.

For clang-doc, it seems that it doesn't need the whole bitcode layer, so
I updated the CMake to only use the bitstream part.

Differential Revision: https://reviews.llvm.org/D63899

llvm-svn: 365091
2019-07-03 22:40:07 +00:00
Duncan P. N. Exon Smith
b7db2e9f82 Stop relying on allocator behaviour in modules unit test
Another fixup for r355778 for Windows bots, this time to stop
accidentally relying on allocator behaviour for the test to pass.

llvm-svn: 355780
2019-03-09 20:15:01 +00:00
Duncan P. N. Exon Smith
0a2be46cfd Modules: Invalidate out-of-date PCMs as they're discovered
Leverage the InMemoryModuleCache to invalidate a module the first time
it fails to import (and to lock a module as soon as it's built or
imported successfully).  For implicit module builds, this optimizes
importing deep graphs where the leaf module is out-of-date; see example
near the end of the commit message.

Previously the cache finalized ("locked in") all modules imported so far
when starting a new module build.  This was sufficient to prevent
loading two versions of the same module, but was somewhat arbitrary and
hard to reason about.

Now the cache explicitly tracks module state, where each module must be
one of:

- Unknown: module not in the cache (yet).
- Tentative: module in the cache, but not yet fully imported.
- ToBuild: module found on disk could not be imported; need to build.
- Final: module in the cache has been successfully built or imported.
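
A self-contained sketch of that state machine (hypothetical types; the real
tracking lives in the InMemoryModuleCache):

```c++
#include <map>
#include <string>

// Hypothetical sketch of the per-PCM state tracking described above.
enum class ModuleState { Unknown, Tentative, ToBuild, Final };

class ModuleStateMap {
  std::map<std::string, ModuleState> States;

public:
  ModuleState get(const std::string &Filename) const {
    auto It = States.find(Filename);
    return It == States.end() ? ModuleState::Unknown : It->second;
  }
  // The first failed import invalidates the module, so later lookups rebuild
  // it instead of repeatedly trying (and failing) to import the on-disk PCM.
  void markFailedImport(const std::string &Filename) {
    States[Filename] = ModuleState::ToBuild;
  }
  // A successful build or import locks the module in.
  void finalize(const std::string &Filename) {
    States[Filename] = ModuleState::Final;
  }
};
```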

Preventing repeated failed imports avoids variation in builds based on
shifting filesystem state.  Now it's guaranteed that a module is loaded
from disk exactly once.  It now seems safe to remove
FileManager::invalidateCache, but I'm leaving that for a later commit.

The new, precise logic uncovered a pre-existing problem in the cache:
the map key is the module filename, and different contexts use different
filenames for the same PCM file.  (In particular, the test
Modules/relative-import-path.c does not build without this commit.
r223577 started using a relative path to describe a module's base
directory when importing it within another module.  As a result, the
module cache sees an absolute path when (a) building the module or
importing it at the top-level, and a relative path when (b) importing
the module underneath another one.)

The "obvious" fix is to resolve paths using FileManager::getVirtualFile
and change the map key for the cache to a FileEntry, but some contexts
(particularly related to ASTUnit) have a shorter lifetime for their
FileManager than the InMemoryModuleCache.  This is worth pursuing
further in a later commit; perhaps by tying together the FileManager and
InMemoryModuleCache lifetime, or moving the in-memory PCM storage into a
VFS layer.

For now, use the PCM's base directory as-written for constructing the
filename to check the ModuleCache.

Example
=======

To understand the build optimization, first consider the build of a
module graph TU -> A -> B -> C -> D with an empty cache:

    TU builds A'
       A' builds B'
          B' builds C'
             C' builds D'
                imports D'
          B' imports C'
             imports D'
       A' imports B'
          imports C'
          imports D'
    TU imports A'
       imports B'
       imports C'
       imports D'

If we build TU again, where A, B, C, and D are in the cache and D is
out-of-date, we would previously get this build:

    TU imports A
       imports B
       imports C
       imports D (out-of-date)
    TU builds A'
       A' imports B
          imports C
          imports D (out-of-date)
          builds B'
          B' imports C
             imports D (out-of-date)
             builds C'
             C' imports D (out-of-date)
                builds D'
                imports D'
          B' imports C'
             imports D'
       A' imports B'
          imports C'
          imports D'
     TU imports A'
        imports B'
        imports C'
        imports D'

After this commit, we'll immediately invalidate A, B, C, and D when we
first observe that D is out-of-date, giving this build:

    TU imports A
       imports B
       imports C
       imports D (out-of-date)
    TU builds A' // The same graph as an empty cache.
       A' builds B'
          B' builds C'
             C' builds D'
                imports D'
          B' imports C'
             imports D'
       A' imports B'
          imports C'
          imports D'
    TU imports A'
       imports B'
       imports C'
       imports D'

The new build matches what we'd naively expect, pretty closely matching
the original build with the empty cache.

rdar://problem/48545366

llvm-svn: 355778
2019-03-09 17:44:01 +00:00
Duncan P. N. Exon Smith
8bef5cd49a Modules: Rename MemoryBufferCache to InMemoryModuleCache
Change MemoryBufferCache to InMemoryModuleCache, moving it from Basic to
Serialization.  Another patch will start using it to manage module build
more explicitly, but this is split out because it's mostly mechanical.

Because of the move to Serialization we can no longer abuse the
Preprocessor to forward it to the ASTReader.  Besides the rename and
file move, that means Preprocessor::Preprocessor has one fewer parameter
and ASTReader::ASTReader has one more.

llvm-svn: 355777
2019-03-09 17:33:56 +00:00