19 Commits

Author SHA1 Message Date
Corentin Jabot
31f4859c3e [Clang] Allow additional mathematical symbols in identifiers.
Implement the proposed UAX Profile
"Mathematical notation profile for default identifiers".

This implements a not-yet approved Unicode for a vetted
UAX31 identifier profile
https://www.unicode.org/L2/L2022/22230-math-profile.pdf

This change mitigates the reported disruption caused
by the implementation of UAX31 in C++ and C2x,
as these mathematical symbols are commonly used in the
scientific community.

Fixes #54732

Reviewed By: tahonermann, #clang-language-wg

Differential Revision: https://reviews.llvm.org/D137051
2022-12-16 10:20:49 +01:00
Corentin Jabot
dbfe446ef3 [Clang] Implement CWG2640 Allow more characters in an n-char sequence
Reviewed By: #clang-language-wg, aaron.ballman, tahonermann

Differential Revision: https://reviews.llvm.org/D138861
2022-12-13 09:02:52 +01:00
Corentin Jabot
9e2dc984ba [Clang] improve grammar in warn_utf8_symbol_homoglyph diagnostic 2022-12-09 10:52:13 +01:00
Corentin Jabot
c932cef32a Update Unicode to 15.0
Unicode 15.0 adds 4,489 characters, for a total of 149,186 characters.
These additions include 2 new scripts along with 20 new emoji characters,
and 4,193 CJK ideographs.

This changes modify most existing tables including
 - XID_Start/XID_Continue in Clang
 - The character name database (used by \N{} in Clang)
 - The list of formattable/printable codepoints
 - The case folding algorithm (which we had not updated since Unicode 9)
 - The list of nonspacing/enclosing marks used by the column width
   computation algorithm. The rest of the column width algorithm
   is not updated.

Reviewed By: tahonermann

Differential Revision: https://reviews.llvm.org/D133807
2022-09-22 05:03:01 +02:00
Corentin Jabot
aee76cb59c [Clang] Add support for Unicode identifiers (UAX31) in C2x mode.
This implements
N2836 Identifier Syntax using Unicode Standard Annex 31.

The feature was already implemented for C++,
and the semantics are the same.

Unlike C++ there was, afaict, no decision to
backport the feature in older languages mode,
so C17 and earlier are not modified and the
code point tables for these language modes are conserved.

Reviewed By: aaron.ballman

Differential Revision: https://reviews.llvm.org/D130416
2022-07-23 14:08:08 +02:00
Corentin Jabot
c92056d038 [Clang][C++23] P2071 Named universal character escapes
Implements [[ https://wg21.link/p2071r1  | P2071 Named Universal Character Escapes ]] - as an extension in all language mode, the patch  not warn in c++23 mode will be done later once this paper is plenary approved (in July).

We add

 * A code generator that transforms `UnicodeData.txt` and `NameAliases.txt` to a space efficient data structure that can be queried in `O(NameLength)`
 * A set of functions in `Unicode.h` to query that data, including

   * A function to find an exact match of a given Unicode character name
   * A function to perform a loose (ignoring case, space, underscore, medial hyphen) matching
   * A function returning the best matching codepoint for a given string per edit distance

 * Support of `\N{}` escape sequences in String and character Literals, with loose and typos diagnostics/fixits
 * Support of `\N{}` as UCN with loose matching diagnostics/fixits.

Loose matching is considered an error to match closely the semantics of P2071.

The generated data contributes to 280kB of data to the binaries.

`UnicodeData.txt` and `NameAliases.txt`  are not committed to the repository in this patch, and regenerating the data is a manual process.

Reviewed By: tahonermann

Differential Revision: https://reviews.llvm.org/D123064
2022-06-25 19:03:33 +02:00
Sam McCall
817550919e [Lex] Don't assert when decoding invalid UCNs.
Currently if a lexically-valid UCN encodes an invalid codepoint, then we
diagnose that, and then hit an assertion while trying to decode it.

Since there isn't anything preventing us reaching this state, remove the
assertion. expandUCNs("X\UAAAAAAAAY") will produce "XY".

Differential Revision: https://reviews.llvm.org/D125059
2022-05-06 08:51:42 +02:00
Aaron Ballman
7de7161304 Use functions with prototypes when appropriate; NFC
A significant number of our tests in C accidentally use functions
without prototypes. This patch converts the function signatures to have
a prototype for the situations where the test is not specific to K&R C
declarations. e.g.,

  void func();

becomes

  void func(void);

This is the sixth batch of tests being updated (there are a significant
number of other tests left to be updated).
2022-02-09 17:16:10 -05:00
Corentin Jabot
afb6223bc5 Support Unicode 14 identifiers
This update the UAX tables to support new Unicode 14 identifiers.
2021-09-16 13:21:27 -04:00
Aaron Ballman
9f27364377 Use a more general test here.
The interesting bit about that triple isn't the architecture, it's the
fact that ps4 implies C99 as the standard rather than a newer C mode.
Specify the language standard rather than the triple so the test is a
bit more general.
2021-08-18 09:32:05 -04:00
Corentin Jabot
2715c4da50 Do not emit diagnostics for invalid unicode characters in preprocessing mode
This amends 4e80636db71a1b6123d15ed1f9eda3979b4292de with a fix for
https://lab.llvm.org/buildbot/#/builders/139/builds/8943
2021-08-18 09:12:36 -04:00
Corentin Jabot
4e80636db7 Implement P1949
This adds the Unicode 13 data for XID_Start and XID_Continue.
The definition of valid identifier is changed in all C++ modes
as P1949 (https://wg21.link/p1949) was accepted by WG21 as a defect
report.
2021-08-18 07:33:14 -04:00
Richard Smith
4e966e8135 Don't emit "will be treated as an identifier character" warning for
UTF-8 characters that aren't identifier characters in the current
language mode.

llvm-svn: 343040
2018-09-25 22:34:45 +00:00
Richard Smith
8ed7776bc4 PR38870: Add warning for zero-width unicode characters appearing in
identifiers.

llvm-svn: 341700
2018-09-07 19:25:39 +00:00
Richard Smith
77091b167f Warn if we find a Unicode homoglyph for a symbol in an identifier.
Specifically, warn if:
 * we find a character that the language standard says we must treat as an
   identifier, and
 * that character is not reasonably an identifier character (it's a punctuation
   character or similar), and 
 * it renders identically to a valid non-identifier character in common
   fixed-width fonts.

Some tools "helpfully" substitute the surprising characters for the expected
characters, and replacing semicolons with Greek question marks is a common
"prank".

llvm-svn: 320697
2017-12-14 13:15:08 +00:00
Richard Smith
664798c034 Add test that we correctly allow some non-letter unicode characters in
identifiers, and extend existing test to also cover C++.

llvm-svn: 248079
2015-09-19 02:14:12 +00:00
Jordan Rose
cc538345be Lexer: Don't warn about Unicode in preprocessor directives.
This allows people to use Unicode in their #pragma mark and in macros
that exist only to be string-ized.

<rdar://problem/13107323&13121362>

llvm-svn: 174081
2013-01-31 19:48:48 +00:00
Jordan Rose
17441589c3 Don't warn about Unicode characters in -E mode.
People use the C preprocessor for things other than C files. Some of them
have Unicode characters. We shouldn't warn about Unicode characters
appearing outside of identifiers in this case.

There's not currently a way for the preprocessor to tell if it's in -E mode,
so I added a new flag, derived from the PreprocessorOutputOptions. This is
only used by the Unicode warnings for now, but could conceivably be used by
other warnings or even behavioral differences later.

<rdar://problem/13107323>

llvm-svn: 173881
2013-01-30 01:52:57 +00:00
Jordan Rose
4246ae0089 As an extension, treat Unicode whitespace characters as whitespace.
llvm-svn: 173370
2013-01-24 20:50:50 +00:00