There is some code to make sure that C++ keywords that are identifiers
in the other languages are not treated as keywords. Right now, the kind
is set to identifier, and the identifier info is cleared. The latter is
probably so that the code for identifying C++ structures does not
recognize those structures by mistake when formatting a language that
does not have those structures. But we did not find an instance where
all of the following hold: the language can have the sequence of tokens,
the code tries to parse the structure as if it is C++ using the
identifier info instead of the token kind, and it does so without
checking the language setting. However, there are places where the code
checks whether the identifier info field is null. They are places where
an identifier and a keyword are treated the same way, for example, the
name of a function in JavaScript. This patch removes the lines that
clear the identifier info. This way, a C++ keyword gets treated in the
same way as an identifier in those places.
JavaScript
New
```JavaScript
async function
union(
myparamnameiswaytooloooong) {
}
```
Old
```JavaScript
async function
union(
myparamnameiswaytooloooong) {
}
```
Java
New
```Java
enum union { ABC, CDE }
```
Old
```Java
enum
union { ABC, CDE }
```
A regular expression was used in the lexing process. It made the program
take more than linear time with regard to the length of the input. It
looked like the entire buffer could be scanned for every token lexed.
Now the regular expression is replaced with code. Previously it took 20
minutes for the program to format 125 000 lines of code on my computer.
Now it takes 315 milliseconds.
This reverts commit b92d6dd704d789240685a336ad8b25a9f381b4cc. See
github.com/llvm/llvm-project/commit/b92d6dd704d7#commitcomment-139992444
We should use a tool like Visual Studio to clean up the headers.
This implements the annotation of the values in TableGen.
The main changes are:
- parseTableGenValue(), the simplified parser method for the syntax of
values.
- modified consumeToken() to call parseTableGenValue() in 'if', 'assert' and
after '='.
- modified parseParens() to call parseTableGenValue() inside.
- modified parseSquare() to call parseTableGenValue() inside, skipping
separator tokens.
- modified parseAngle() to call parseTableGenValue() inside, skipping
separator tokens.
Adds support for tokens that have forms like unary operators:
- bang operators: `!name`
- cond operator: `!cond`
- numeric literals: `+1`, `-1`
The cond operator is one of the bang operators but is distinguished because it has a very specific syntax.
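The three token forms listed above could be told apart roughly as follows. This is only a minimal sketch: `TableGenTokenKind` and `classifyTableGenToken` are hypothetical names invented for illustration and are not the identifiers used in clang-format.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical categories for this sketch; not clang-format's token types.
enum class TableGenTokenKind { BangOperator, CondOperator, NumericLiteral, Other };

// Classifies a token text that begins with what looks like a unary operator.
TableGenTokenKind classifyTableGenToken(const std::string &Text) {
  assert(!Text.empty() && "token text must not be empty");
  if (Text.size() < 2)
    return TableGenTokenKind::Other;
  // Cast before calling <cctype> functions to keep the argument in range.
  const unsigned char Next = static_cast<unsigned char>(Text[1]);
  if (Text[0] == '!') {
    // !cond is reported separately because its syntax differs from the
    // other bang operators.
    if (Text == "!cond")
      return TableGenTokenKind::CondOperator;
    if (std::isalpha(Next) || Next == '_')
      return TableGenTokenKind::BangOperator;
    return TableGenTokenKind::Other;
  }
  // A sign immediately followed by a digit is treated as one literal.
  if ((Text[0] == '+' || Text[0] == '-') && std::isdigit(Next))
    return TableGenTokenKind::NumericLiteral;
  return TableGenTokenKind::Other;
}
```

In this sketch `!add` is a bang operator, `!cond` is the cond operator, `-1` is a numeric literal, and `!=` falls through to `Other`.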
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
The Verilog implication operator `->` is a binary operator meaning
either the left hand side is false or the right hand side is true.
Previously it was treated as the C++ struct member operator.
I didn't even know it existed when I added the operator formatting part.
And I didn't check all the tests for all the operators I added. That is
how the bad test got in.
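For reference, the meaning of the implication operator described above can be written down in C++. This is a two-valued sketch only: the name `verilogImplies` is made up for illustration, and real Verilog operands can also be x or z, which is ignored here.

```cpp
// Two-valued model of the Verilog expression `Lhs -> Rhs`:
// true when the left-hand side is false or the right-hand side is true.
constexpr bool verilogImplies(bool Lhs, bool Rhs) { return !Lhs || Rhs; }

// Full truth table for the two-valued case.
static_assert(verilogImplies(false, false), "false -> false holds");
static_assert(verilogImplies(false, true), "false -> true holds");
static_assert(!verilogImplies(true, false), "true -> false does not hold");
static_assert(verilogImplies(true, true), "true -> true holds");
```

Since it takes two operands like any other binary operator, it should be formatted with spaces on both sides rather than like a C++ member access.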
If a non-keyword identifier is found in TypeNames, then a *, &, or && that
follows it is annotated as TT_PointerOrReference.
Differential Revision: https://reviews.llvm.org/D155273
The token annotator doesn't annotate the template opener and closer
as such if they enclose an overloaded operator. This causes the
space between the operator and the closer to be removed, resulting
in invalid C++ code.
Fixes #58602.
Differential Revision: https://reviews.llvm.org/D143755
Different loop termination conditions resulted in confusion about whether
*Offset was intended to be inside or outside the token.
This ultimately led to constructing an out-of-range SourceLocation.
Fix by making Offset consistently point *after* the token.
Differential Revision: https://reviews.llvm.org/D135356
This change fixes a clang-format unit test failure introduced by [D124748](https://reviews.llvm.org/D124748). The `countLeadingWhitespace` function was calling `isspace` with values that could fall outside the valid input range. `isspace` accepts only values representable as `unsigned char` (0-255 on typical platforms) or `EOF`; anything else is undefined behavior, which on Windows manifests as an assertion raised in the debug runtime libraries. `countLeadingWhitespace` was passing a signed `char`, which is negative when the underlying byte's value is 128 or above, as can happen in non-ASCII encodings. The fix is to use `StringRef`'s `bytes_begin` and `bytes_end` iterators to read the values as unsigned chars instead.
This bug can be reproduced by building the `check-clang-unit` target with a DEBUG configuration under Windows. This change is already covered by existing unit tests.
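The same idea can be shown without LLVM types. The sketch below is not the real `countLeadingWhitespace`: it uses `std::string` instead of `StringRef`, the name `countLeadingWhitespaceSketch` is hypothetical, and it only counts plain whitespace bytes. It converts each byte with an explicit cast where the actual fix reads the bytes as unsigned chars through the `bytes_begin`/`bytes_end` iterators.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Counts the whitespace bytes at the start of Text.
std::size_t countLeadingWhitespaceSketch(const std::string &Text) {
  std::size_t Count = 0;
  for (char C : Text) {
    // Convert to unsigned char first. Where char is signed, a byte >= 128
    // is negative, and passing that to std::isspace is undefined behavior.
    if (!std::isspace(static_cast<unsigned char>(C)))
      break;
    ++Count;
  }
  // Sanity check: the count never exceeds the input length.
  assert(Count <= Text.size());
  return Count;
}
```

With this version, input that starts with a non-ASCII byte such as `"\xC3\xA9"` yields 0 in the default "C" locale instead of tripping the debug runtime assertion.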
Reviewed By: MyDeveloperDay
Differential Revision: https://reviews.llvm.org/D128786
The setLength function checks the token kind, which could be
uninitialized in the previous version.
The problem was introduced in 2e32ff106e.
Reviewed By: MyDeveloperDay, owenpan
Differential Revision: https://reviews.llvm.org/D128607
Verilog uses the backtick instead of the hash. In this revision
backticks are lexed manually and then get labeled as hashes so the logic
for handling C preprocessor stuff doesn't have to change. Hashes get
labeled as identifiers for Verilog-specific stuff like delays.
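The relabeling can be pictured as a small mapping from the leading character to a token kind. This is a sketch only: the `SketchLanguage` and `SketchKind` enums and the `kindForLeadingChar` function are hypothetical and stand in for the real lexer state.

```cpp
// Hypothetical language and token-kind enums for this sketch.
enum class SketchLanguage { Cpp, Verilog };
enum class SketchKind { Hash, Identifier, Other };

// Maps the first character of a token to the kind the later stages see.
// In Verilog the backtick takes the role of the hash, and the hash itself
// is demoted to an identifier so it can be used for things like delays.
constexpr SketchKind kindForLeadingChar(SketchLanguage Lang, char C) {
  return Lang == SketchLanguage::Verilog
             ? (C == '`'   ? SketchKind::Hash
                : C == '#' ? SketchKind::Identifier
                           : SketchKind::Other)
             : (C == '#' ? SketchKind::Hash : SketchKind::Other);
}
```

Because the backtick arrives downstream already labeled as a hash, code that looks for the start of a preprocessor directive needs no Verilog-specific branch.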
Reviewed By: HazardyKnusperkeks
Differential Revision: https://reviews.llvm.org/D124749
The current way of counting whitespace would count backticks as
whitespace. For Verilog stuff we need backticks to be handled
correctly. For JavaScript the current way is to compare the entire
token text to see if it's a backtick. However, when the backtick is the
first token following an escaped newline, the escaped newline will be
part of the tok::unknown token. Verilog has macros and escaped newlines
unlike JavaScript. So we can't regard an entire tok::unknown token as
whitespace. Previously, the start of every token would be checked for
newlines. Now, it is checked for all whitespace instead of just newlines.
The column counting problem has already been fixed for JavaScript in
e71b4cbdd140f059667f84464bd0ac0ebc348387 by counting columns elsewhere.
Reviewed By: HazardyKnusperkeks
Differential Revision: https://reviews.llvm.org/D124748