42 Commits

Author SHA1 Message Date
serge-sans-paille
984b800a03
Move from llvm::makeArrayRef to ArrayRef deduction guides - last part
This is a follow-up to https://reviews.llvm.org/D140896, split into
several parts as it touches a lot of files.

Differential Revision: https://reviews.llvm.org/D141298
2023-01-10 11:47:43 +01:00
Kazu Hirata
f71ffd3b73 [clang-tools-extra] Use std::optional instead of llvm::Optional (NFC)
This patch replaces (llvm::|)Optional< with std::optional<.  I'll post
a separate patch to clean up the "using" declarations, #include
"llvm/ADT/Optional.h", etc.

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2023-01-07 20:19:42 -08:00
Kazu Hirata
71f557355d [clang-tools-extra] Add #include <optional> (NFC)
This patch adds #include <optional> to those files containing
llvm::Optional<...> or Optional<...>.

I'll post a separate patch to actually replace llvm::Optional with
std::optional.

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2023-01-07 20:02:20 -08:00
Haojian Wu
983cb53845 [pseudo] NFC, Remove an extra blank line. 2022-09-22 11:07:25 +02:00
Sam McCall
56c54cf66b [pseudo] Placeholder disambiguation strategy: always choose second
Mostly mechanics here. Interesting decisions:
 - apply disambiguation in-place instead of copying the forest
   debatable, but even the final tree size is significant
 - split decide/apply into different functions - this allows the hard part
   (decide) to be tested non-destructively and combined with HTML forest easily
 - add non-const accessors to forest to enable apply
 - unit tests but no lit tests: my plan is to test actual C++ disambiguation
   heuristics with lit, generic disambiguation mechanics without the C++ grammar

Differential Revision: https://reviews.llvm.org/D132487
2022-08-26 13:16:09 +02:00
Haojian Wu
edb8fb2659 [pseudo] Fix HeadsPartition is not initialized correctly.
The bug was that if we recover from the token 0, we will make the
Heads empty (Line646), which results in no recovery being applied.

Reviewed By: sammccall

Differential Revision: https://reviews.llvm.org/D132388
2022-08-23 15:08:33 +02:00
Sam McCall
bd5cc6575b [pseudo] Start rules are _ := start-symbol EOF, improve recovery.
Previously we were calling glrRecover() ad-hoc at the end of input.
Two main problems with this:
 - glrRecover() on two separate code paths is inelegant
 - We may have to recover several times in succession (e.g. to exit from
   nested scopes), so we need a loop at end-of-file
Having an actual shift action for an EOF terminal allows us to handle
both concerns in the main shift/recover/reduce loop.

This revealed a recovery design bug where recovery could enter a loop by
repeatedly choosing the same parent to identically recover from.
Addressed this by allowing each node to be used as a recovery base once.

Differential Revision: https://reviews.llvm.org/D130550
2022-08-19 16:49:37 +02:00
Sam McCall
605035bf45 [pseudo] Changes omitted from previous commit 2022-08-19 15:15:37 +02:00
Sam McCall
2cc7463c85 [pseudo] Perform unconstrained reduction prior to recovery.
Our GLR uses lookahead: only perform reductions that might be consumed by the
shift immediately following. However when shift fails and so reduce is followed
by recovery instead, this restriction is incorrect and leads to missing heads.

In turn this means certain recovery strategies can't be made to work. e.g.
```
ns := NAMESPACE { namespace-body } [recover=Skip]
ns-body := namespace_opt
```
When `namespace { namespace {` is parsed, we can recover the inner `ns` (using
the `Skip` strategy to ignore the missing `}`). However this `namespace` will
not be reduced to a `namespace-body` as EOF is not in the follow-set, and so we
are unable to recover the outer `ns`.

This patch fixes this by tracking which heads were produced by constrained
reduce, and discarding and rebuilding them before performing recovery.

This is a prerequisite for the `Skip` strategy mentioned above, though there are
some other limitations we need to address too.

Reviewed By: hokein

Differential Revision: https://reviews.llvm.org/D130523
2022-08-19 15:07:36 +02:00
Sam McCall
07b7ff9838 [pseudo] Allow opaque nodes to represent terminals
This allows incomplete code such as `namespace foo {` to be modeled as a
normal sequence with the missing } represented by an empty opaque node.

Differential Revision: https://reviews.llvm.org/D130551
2022-07-26 13:56:26 +02:00
Haojian Wu
2a88fb2ecb [pseudo] Eliminate the dangling-else syntax ambiguity.
- the grammar ambiguity is eliminated by a guard;
- modify the guard function signatures: all parameters are now folded into a
  single object to avoid a long parameter list (as we will add more
  parameters in the near future);

Reviewed By: sammccall

Differential Revision: https://reviews.llvm.org/D130160
2022-07-22 09:13:09 +02:00
Sam McCall
3132e9cd7c [pseudo] Key guards by RuleID, add guards to literals (and 0).
After this, NUMERIC_CONSTANT and strings should parse only one way.

There are 8 types of literals, and 24 valid (literal, TokenKind) pairs.
This means adding 8 new named guards (or 24, if we want to assert the token).

It seems fairly clear to me at this point that the guard names are unnecessary
indirection: the guards are in fact coupled to the rule signature.

(Also add the zero guard I forgot in the previous patch.)

Differential Revision: https://reviews.llvm.org/D130066
2022-07-21 22:42:31 +02:00
Kazu Hirata
53daa177f8 [clang, clang-tools-extra] Use has_value instead of hasValue (NFC) 2022-07-12 22:47:41 -07:00
Sam McCall
7d8e2742d9 [pseudo] Define recovery strategy as grammar extension.
Differential Revision: https://reviews.llvm.org/D129158
2022-07-06 15:03:38 +02:00
Sam McCall
3121167488 [pseudo] Add error-recovery framework & brace-based recovery
The idea is:

- a parse failure is detected when all heads die when trying to shift the next token
- we can recover by choosing a nonterminal we're partway through parsing, and
  determining where it ends through nonlocal means (e.g. matching brackets)
- we can find candidates by walking up the stack from the (ex-)heads
- the token range is defined using heuristics attached to grammar rules
- the unparsed region is represented in the forest by an Opaque node

This patch has the core GLR functionality.
It does not allow recovery heuristics to be attached as extensions to
the grammar, but rather infers a brace-based heuristic.
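
As a rough illustration of determining where a construct ends by "matching brackets", the sketch below scans forward from an opening brace and tracks nesting depth. Tokens are modeled as plain strings, and `findMatchingBrace` is a hypothetical helper, not a clang-pseudo API.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper, not part of clang-pseudo: given the index of a "{"
// token, return the index of the matching "}", or nullopt if it is missing.
std::optional<std::size_t>
findMatchingBrace(const std::vector<std::string> &Tokens, std::size_t Open) {
  if (Open >= Tokens.size() || Tokens[Open] != "{")
    return std::nullopt;
  std::size_t Depth = 0;
  for (std::size_t I = Open; I < Tokens.size(); ++I) {
    if (Tokens[I] == "{")
      ++Depth; // Entering a nested scope (or the initial one).
    else if (Tokens[I] == "}" && --Depth == 0)
      return I; // Closed the scope opened at Tokens[Open].
  }
  return std::nullopt; // Unbalanced input: no closing brace was found.
}
```

In a recovery setting, the tokens between the two braces would be the region that gets represented by an Opaque node.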

Expected followups:

- make recovery heuristics grammar extensions (depends on D127448)
- add recovery to our grammar for bracketed constructs and sequence nodes
- change the structure of our augmented `_ := start` rules to eliminate some
  special-cases in glrParse.
- (if I can work out how): avoid some spurious recovery cases described in comments

(Previously mistakenly committed as a0f4c10ae227a62c2a63611e64eba83f0ff0f577)

Differential Revision: https://reviews.llvm.org/D128486
2022-07-05 20:49:41 +02:00
Haojian Wu
9ab67cc8bf [pseudo] Implement guard extension.
- Extend the GLR parser to allow conditional reduction based on the
  guard functions;
- Implement two simple guards (contextual-override/final) for cxx.bnf;
- layering: clangPseudoCXX depends on clangPseudo (as the guard functions need
  to access the TokenStream);

Differential Revision: https://reviews.llvm.org/D127448
2022-07-05 15:55:15 +02:00
Sam McCall
b37dafd5dc [pseudo] Store shift and goto actions in a compact structure with faster lookup.
The actions table is very compact but the binary search to find the
correct action is relatively expensive.
A hashtable is faster but pretty large (64 bits per value, plus empty
slots, and lookup is constant time but not trivial due to collisions).

The structure in this patch uses 1.25 bits per entry (whether present or absent)
plus the size of the values, and lookup is trivial.

The Shift table is 119KB = 27KB values + 92KB keys.
The Goto table is 86KB = 30KB values + 57KB keys.
(Goto has a smaller keyspace as #nonterminals < #terminals, and more entries).

This patch improves glrParse speed by 28%: 4.69 => 5.99 MB/s
Overall the table grows by 60%: 142 => 228KB.

By comparison, DenseMap<unsigned, StateID> is "only" 16% faster (5.43 MB/s),
and results in a 285% larger table (547 KB) vs the baseline.
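
One way to get roughly 1.25 bits of overhead per key is a presence bitmap plus a per-word rank checkpoint: a 64-bit word and a 16-bit counter cost 80 bits per 64 keys. The sketch below illustrates that idea; `SparseTable` and its members are hypothetical names, and whether the patch uses exactly this layout is an assumption.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical sketch, not the LRTable code: a sparse map from small integer
// keys to values. Present keys are marked in a bitmap; values are packed
// densely in key order.
class SparseTable {
  std::vector<uint64_t> HasValue;    // One bit per possible key.
  std::vector<uint16_t> Checkpoints; // Number of set bits before each word.
  std::vector<uint16_t> Values;      // Values for present keys, in key order.

public:
  // Entries must be sorted by key, with unique keys below KeySpace.
  SparseTable(unsigned KeySpace,
              const std::vector<std::pair<unsigned, uint16_t>> &Entries)
      : HasValue((KeySpace + 63) / 64, 0), Checkpoints(HasValue.size(), 0) {
    for (const auto &E : Entries) {
      HasValue[E.first / 64] |= uint64_t(1) << (E.first % 64);
      Values.push_back(E.second);
    }
    std::size_t Count = 0;
    for (std::size_t W = 0; W < HasValue.size(); ++W) {
      Checkpoints[W] = static_cast<uint16_t>(Count);
      Count += std::bitset<64>(HasValue[W]).count();
    }
  }

  std::optional<uint16_t> find(unsigned Key) const {
    if (Key / 64 >= HasValue.size())
      return std::nullopt;
    uint64_t Word = HasValue[Key / 64];
    uint64_t Bit = uint64_t(1) << (Key % 64);
    if (!(Word & Bit))
      return std::nullopt; // Key is absent.
    // Rank = set bits in earlier words + set bits below ours in this word.
    std::size_t Rank =
        Checkpoints[Key / 64] + std::bitset<64>(Word & (Bit - 1)).count();
    return Values[Rank];
  }
};
```

Lookup needs no searching or probing: one bit test, one population count and two array reads.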

Differential Revision: https://reviews.llvm.org/D128485
2022-07-04 19:40:04 +02:00
Sam McCall
743971faaf Revert "[pseudo] Add error-recovery framework & brace-based recovery"
This reverts commit a0f4c10ae227a62c2a63611e64eba83f0ff0f577.
This commit hadn't been reviewed yet, and was unintentionally included
on another branch.
2022-06-28 21:11:09 +02:00
Sam McCall
a0f4c10ae2 [pseudo] Add error-recovery framework & brace-based recovery
The idea is:
 - a parse failure is detected when all heads die when trying to shift
   the next token
 - we can recover by choosing a nonterminal we're partway through parsing,
   and determining where it ends through nonlocal means (e.g. matching brackets)
 - we can find candidates by walking up the stack from the (ex-)heads
 - the token range is defined using heuristics attached to grammar rules
 - the unparsed region is represented in the forest by an Opaque node

This patch has the core GLR functionality.
It does not allow recovery heuristics to be attached as extensions to
the grammar, but rather infers a brace-based heuristic.

Expected followups:
 - make recovery heuristics grammar extensions (depends on D127448)
 - add recover to our grammar for bracketed constructs and sequence nodes
 - change the structure of our augmented `_ := start` rules to eliminate
   some special-cases in glrParse.
 - (if I can work out how): avoid some spurious recovery cases described
   in comments
 - grammar changes to eliminate the hard distinction between init-list
   and designated-init-list shown in the recovery-init-list.cpp testcase

Differential Revision: https://reviews.llvm.org/D128486
2022-06-28 21:08:43 +02:00
Sam McCall
85eaecbe8e [pseudo] Check follow-sets instead of tying reduce actions to lookahead tokens.
Previously, the action table stores a reduce action for each lookahead
token it should allow. These tokens are the followSet(action.rule.target).

In practice, the follow sets are large, so we spend a bunch of time binary
searching around all these essentially-duplicates to check whether our lookahead
token is there.
However the number of reduces for a given state is very small, so we're
much better off linear scanning over them and performing a fast check for each.

D128318 was an attempt at this, storing a bitmap for each reduce.
However it's even more compact just to use the follow sets directly, as
there are fewer nonterminals than (state, rule) pairs. It's also faster.

This specialized approach means unbundling Reduce from other actions in
LRTable, so it's no longer useful to support it in Action. I suspect
Action will soon go away, as we store each kind of action separately.

This improves glrParse speed by 42% (3.30 -> 4.69 MB/s).
It also reduces LR table size by 59% (343 -> 142kB).
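
The lookup described above can be sketched as a linear scan over the few rules reducible in a state, with a bit test against the follow set of each rule's target. All names and sizes below (`Table`, `TokenSet`, `NumTerminals`, and so on) are hypothetical and only illustrate the shape of the check; they are not the LRTable interface.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// All names and sizes here are hypothetical, chosen only for illustration.
constexpr std::size_t NumTerminals = 8;
using TokenSet = std::bitset<NumTerminals>;

struct Rule {
  unsigned Target; // Nonterminal produced by reducing this rule.
};

struct Table {
  std::vector<Rule> Rules;
  std::vector<TokenSet> FollowSets;              // Indexed by nonterminal.
  std::vector<std::vector<unsigned>> StateRules; // Reducible rules per state.

  // Linear scan over the (few) rules of a state, with a constant-time
  // bit test on the follow set of each rule's target.
  std::vector<unsigned> reductions(unsigned State, unsigned Lookahead) const {
    std::vector<unsigned> Result;
    for (unsigned RuleID : StateRules[State])
      if (FollowSets[Rules[RuleID].Target].test(Lookahead))
        Result.push_back(RuleID);
    return Result;
  }
};
```

Because follow sets are stored once per nonterminal rather than once per (state, rule) pair, many states share the same bitsets.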

Differential Revision: https://reviews.llvm.org/D128472
2022-06-28 00:36:16 +02:00
Kazu Hirata
94460f5136 Don't use Optional::hasValue (NFC)
This patch replaces x.hasValue() with x where x is contextually
convertible to bool.
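
A minimal sketch of the pattern, written against `std::optional` so that it stands alone; `parsePort` and `portOrDefault` are hypothetical functions, not code from the patch.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical function for illustration only: parses up to five digits.
std::optional<int> parsePort(const std::string &Text) {
  if (Text.empty() || Text.size() > 5)
    return std::nullopt;
  int Value = 0;
  for (char C : Text) {
    if (C < '0' || C > '9')
      return std::nullopt;
    Value = Value * 10 + (C - '0');
  }
  return Value;
}

int portOrDefault(const std::string &Text) {
  // Before: if (Port.hasValue()) ...   (llvm::Optional spelling)
  // After:  if (Port) ...              (contextual conversion to bool)
  if (std::optional<int> Port = parsePort(Text))
    return *Port;
  return 80;
}
```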
2022-06-26 19:54:41 -07:00
Kazu Hirata
3b7c3a654c Revert "Don't use Optional::hasValue (NFC)"
This reverts commit aa8feeefd3ac6c78ee8f67bf033976fc7d68bc6d.
2022-06-25 11:56:50 -07:00
Kazu Hirata
aa8feeefd3 Don't use Optional::hasValue (NFC) 2022-06-25 11:55:57 -07:00
Sam McCall
768216cac0 [pseudo] Handle no-reductions-available on the fastpath. NFC
This is a ~2% speedup.
2022-06-23 20:34:11 +02:00
Sam McCall
466eae6aa3 [pseudo] Store last node popped in the queue, not its parent(s). NFC
We have to walk up to the last node to find the start token, but no need
to go even one node further.

This is one node fewer to store, but more importantly if the last node
happens to have multiple parents we avoid storing the sequence multiple times.

This saves ~5% on glrParse.
Based on a comment by hokein@ on https://reviews.llvm.org/D128307
2022-06-23 20:10:20 +02:00
Sam McCall
7aff663b2a [pseudo] Store reduction sequences by pointer in heaps, instead of by value.
Copying sequences around as the heap resized is significantly expensive.

This speeds up glrParse by ~35% (2.4 => 3.25 MB/s)

Differential Revision: https://reviews.llvm.org/D128307
2022-06-23 19:41:11 +02:00
Sam McCall
3e610f2cdc [pseudo] Turn glrReduce into a class, reuse storage across calls.
This is a ~5% speedup, we no longer have to allocate the priority queues and
other collections for each reduction step where we use them.

It's also IMO easier to understand the structure of a class with methods vs a
function with nested lambdas.

Differential Revision: https://reviews.llvm.org/D128301
2022-06-23 19:27:47 +02:00
Sam McCall
f9710d1908 [pseudo] Add a fast-path to GLR reduce when both pop and push are trivial
In general we split a reduce into pop/push, so concurrently-available reductions
can run in the correct order. The data structures for this are expensive.

When only one reduction is possible at a time, we need not do this: we can pop
and immediately push instead.
Strictly this is correct whenever we yield one concurrent PushSpec.

This patch recognizes a trivial but common subset of these cases:
 - there must be no pending pushes and only one head available to pop
 - the head must have only one reduction rule
 - the reduction path must be a straight line (no multiple parents)

On my machine this speeds up by 2.12 -> 2.30 MB/s = 8%

Differential Revision: https://reviews.llvm.org/D128299
2022-06-23 18:21:59 +02:00
Sam McCall
b70ee9d984 Reland "[pseudo] Track heads as GSS nodes, rather than as "pending actions"."
This reverts commit 2c80b5319870b57fbdbb6c9cef9c86c26c65371d.

Fixes LRTable::buildForTest to create states that are referenced but
have no actions.
2022-06-23 18:21:44 +02:00
Sam McCall
2c80b53198 Revert "[pseudo] Track heads as GSS nodes, rather than as "pending actions"."
This reverts commit e3ec054dfdf48f19cb6726cb3f4965b9ab320ed9.

Tests fail in asserts mode: https://lab.llvm.org/buildbot/#/builders/109/builds/41217
2022-06-23 18:16:38 +02:00
Sam McCall
e3ec054dfd [pseudo] Track heads as GSS nodes, rather than as "pending actions".
IMO this model is simpler to understand (borrowed from the LR0 patch D127357).
It also makes error recovery easier to implement, as we have a simple list of
head nodes lying around to recover from when needed.
(It's not quite as nice as LR0 in this respect though).

It's slightly slower (2.24 -> 2.12 MB/S on my machine = 5%) but nothing close
to as bad as LR0.
However
 - I think we'd have to eat a little performance loss otherwise to implement
   error recovery.
 - this frees up some complexity budget for optimizations like fastpath push/pop
   (this + fastpath is already faster than head)
 - I haven't changed the data structure here and it's now pretty dumb, we can
   make it faster

Differential Revision: https://reviews.llvm.org/D128297
2022-06-23 17:26:42 +02:00
Haojian Wu
c70aeaad2b [pseudo] Move grammar-related headers to a separate dir, NFC.
We did that for .cpp, but forgot the headers.

Differential Revision: https://reviews.llvm.org/D127388
2022-06-09 14:58:05 +02:00
Haojian Wu
74e4d5f256 [pseudo] Simplify the glrReduce implementation.
glrReduce maintains two priority queues (one for bases, and the other
for Sequence), these queues are in parallel with each other, corresponding to a
single family. They can be folded into one.

This patch removes the bases priority queue, which improves the glrParse by
10%.

ASTReader.cpp: 2.03M/s (before) vs 2.26M/s (after)

Differential Revision: https://reviews.llvm.org/D127283
2022-06-09 11:28:52 +02:00
Haojian Wu
7a05942dd0 [pseudo] Remove the explicit Accept actions.
As pointed out in the previous review section, having a dedicated accept
action doesn't seem to be necessary. This patch implements the same behavior
without an accept action, which will save some code complexity.

Reviewed By: sammccall

Differential Revision: https://reviews.llvm.org/D125677
2022-06-09 11:19:07 +02:00
Sam McCall
94b2ca18c1 [pseudo] GC GSS nodes, reuse them with a freelist
Most GSS nodes have short effective lifetimes, keeping them around until the
end of the parse is wasteful. Mark and sweep them every 20 tokens.

When parsing clangd/AST.cpp, this reduces the GSS memory from 1MB to 20kB.
We pay ~5% performance for this according to the glrParse benchmark.
(Parsing more tokens between GCs doesn't seem to improve this further).

Compared to the refcounting approach in https://reviews.llvm.org/D126337, this
is simpler (at least the complexity is better isolated) and has >2x less
overhead. It doesn't provide death handlers (for error-handling) but we have
an alternative solution in mind.
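
The mark-and-sweep plus freelist scheme can be sketched as follows. `Node` and `NodeArena` are simplified, hypothetical stand-ins (the real GSS node carries more data), and the sketch does not reproduce the patch's actual data layout.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical simplified node: the real GSS node carries more data.
struct Node {
  std::vector<Node *> Parents;
  bool Marked = false;
};

class NodeArena {
  std::deque<Node> Storage;     // Stable addresses; never shrinks.
  std::vector<Node *> Alive;    // Nodes currently handed out.
  std::vector<Node *> FreeList; // Dead nodes available for reuse.

public:
  Node *create(std::vector<Node *> Parents) {
    Node *N;
    if (!FreeList.empty()) { // Prefer recycling a dead node.
      N = FreeList.back();
      FreeList.pop_back();
    } else {
      Storage.emplace_back();
      N = &Storage.back();
    }
    N->Parents = std::move(Parents);
    N->Marked = false;
    Alive.push_back(N);
    return N;
  }

  // Mark everything reachable from Roots, then move the rest to the freelist.
  // Returns the number of nodes reclaimed.
  std::size_t gc(std::vector<Node *> Roots) {
    while (!Roots.empty()) { // Mark phase, using Roots as a worklist.
      Node *N = Roots.back();
      Roots.pop_back();
      if (N->Marked)
        continue;
      N->Marked = true;
      Roots.insert(Roots.end(), N->Parents.begin(), N->Parents.end());
    }
    std::size_t Freed = 0;
    std::vector<Node *> Survivors;
    for (Node *N : Alive) { // Sweep phase.
      if (N->Marked) {
        N->Marked = false; // Reset for the next collection.
        Survivors.push_back(N);
      } else {
        N->Parents.clear();
        FreeList.push_back(N);
        ++Freed;
      }
    }
    Alive = std::move(Survivors);
    return Freed;
  }

  std::size_t liveCount() const { return Alive.size(); }
};
```

A parser loop would call `gc` periodically (the message above mentions every 20 tokens), passing the current heads as roots.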

Differential Revision: https://reviews.llvm.org/D126723
2022-06-08 23:39:59 +02:00
Dmitri Gribenko
9c6a2f2966 Fix an unused variable warning in no-asserts build mode 2022-05-17 15:27:44 +02:00
Haojian Wu
1a65c491be [pseudo] Support parsing variant target symbols.
With this patch, we're able to parse smaller chunks of C++ code (statement,
declaration), rather than translation-unit.

The start symbol is listed in the grammar in a form of `_ :=
statement`, each start symbol has a dedicated state (`_ := • statement`).
We create and track all these separate states in the LRTable. When we
start parsing, we lookup the corresponding state to start the parser.

LR parsing table changes with this patch:
- number of states: 1467 -> 1471
- number of actions: 82891 -> 83578
- size of the table (bytes): 334248 -> 336996

Differential Revision: https://reviews.llvm.org/D125006
2022-05-16 10:38:16 +02:00
Weverything
0e86cddf98 [pseudo] Fix for unused warning by moving code into debug macro. 2022-05-03 16:07:59 -07:00
Haojian Wu
ed1b32791d [pseudo] Print the GSS::Node details when the unittest fails, NFC. 2022-05-03 22:06:10 +02:00
Haojian Wu
9f38da258e [pseudo] Implement the GLR parsing algorithm.
This patch implements a standard GLR parsing algorithm, the
core piece of the pseudoparser.

- it parses preprocessed C++ code; currently it supports correct code
  only and parses it as a translation-unit;
- it produces a forest which stores all possible trees in an efficient
  manner (only a single node is built per (SymbolID, Token Range));
  no disambiguation yet;

Reland with a fix for g++'s -fpermissive error on previous declaration `GSS& GSS;`.

Differential Revision: https://reviews.llvm.org/D121150
2022-05-03 20:25:23 +02:00
Haojian Wu
860eabb395 Revert "[pseudo] Implement the GLR parsing algorithm."
This breaks some buildbots (on the declaration GSS& GSS), will fix it
later.

This reverts commit eac22d0754f70df10ea0eb6f59cbd1ef012ab5a4.
2022-05-03 15:54:10 +02:00
Sam McCall
eac22d0754 [pseudo] Implement the GLR parsing algorithm.
This patch implements a standard GLR parsing algorithm, the
core piece of the pseudoparser.

- it parses preprocessed C++ code; currently it supports correct code
  only and parses it as a translation-unit;
- it produces a forest which stores all possible trees in an efficient
  manner (only a single node is built per (SymbolID, Token Range));
  no disambiguation yet;

Differential Revision: https://reviews.llvm.org/D121150
2022-05-03 15:42:07 +02:00