llvm-project

Author	SHA1	Message	Date
Corentin Jabot	31f4859c3e	[Clang] Allow additional mathematical symbols in identifiers. Implement the proposed UAX Profile "Mathematical notation profile for default identifiers". This implements a vetted but not-yet-approved Unicode UAX31 identifier profile (https://www.unicode.org/L2/L2022/22230-math-profile.pdf). This change mitigates the reported disruption caused by the implementation of UAX31 in C++ and C2x, as these mathematical symbols are commonly used in the scientific community. Fixes #54732 Reviewed By: tahonermann, #clang-language-wg Differential Revision: https://reviews.llvm.org/D137051	2022-12-16 10:20:49 +01:00
Corentin Jabot	dbfe446ef3	[Clang] Implement CWG2640: Allow more characters in an n-char sequence. Reviewed By: #clang-language-wg, aaron.ballman, tahonermann Differential Revision: https://reviews.llvm.org/D138861	2022-12-13 09:02:52 +01:00
Corentin Jabot	9e2dc984ba	[Clang] improve grammar in warn_utf8_symbol_homoglyph diagnostic	2022-12-09 10:52:13 +01:00
Corentin Jabot	c932cef32a	Update Unicode to 15.0. Unicode 15.0 adds 4,489 characters, for a total of 149,186 characters. These additions include 2 new scripts along with 20 new emoji characters, and 4,193 CJK ideographs. This change modifies most existing tables, including: XID_Start/XID_Continue in Clang; the character name database (used by \N{} in Clang); the list of formattable/printable codepoints; the case folding algorithm (which we had not updated since Unicode 9); and the list of nonspacing/enclosing marks used by the column width computation algorithm. The rest of the column width algorithm is not updated. Reviewed By: tahonermann Differential Revision: https://reviews.llvm.org/D133807	2022-09-22 05:03:01 +02:00
Corentin Jabot	aee76cb59c	[Clang] Add support for Unicode identifiers (UAX31) in C2x mode. This implements N2836 Identifier Syntax using Unicode Standard Annex 31. The feature was already implemented for C++, and the semantics are the same. Unlike C++ there was, afaict, no decision to backport the feature to older language modes, so C17 and earlier are not modified and the code point tables for these language modes are preserved. Reviewed By: aaron.ballman Differential Revision: https://reviews.llvm.org/D130416	2022-07-23 14:08:08 +02:00
Corentin Jabot	c92056d038	[Clang][C++23] P2071 Named universal character escapes. Implements P2071 Named Universal Character Escapes (https://wg21.link/p2071r1) as an extension in all language modes; the patch to not warn in C++23 mode will be done later, once this paper is plenary approved (in July). We add: a code generator that transforms `UnicodeData.txt` and `NameAliases.txt` into a space-efficient data structure that can be queried in `O(NameLength)`; a set of functions in `Unicode.h` to query that data, including a function to find an exact match of a given Unicode character name, a function to perform loose matching (ignoring case, space, underscore, medial hyphen), and a function returning the best-matching codepoint for a given string per edit distance; support for `\N{}` escape sequences in string and character literals, with loose-match and typo diagnostics/fixits; and support for `\N{}` as a UCN with loose-matching diagnostics/fixits. Loose matching is considered an error to match closely the semantics of P2071. The generated data contributes 280kB to the binaries. `UnicodeData.txt` and `NameAliases.txt` are not committed to the repository in this patch, and regenerating the data is a manual process. Reviewed By: tahonermann Differential Revision: https://reviews.llvm.org/D123064	2022-06-25 19:03:33 +02:00
Sam McCall	817550919e	[Lex] Don't assert when decoding invalid UCNs. Currently if a lexically-valid UCN encodes an invalid codepoint, then we diagnose that, and then hit an assertion while trying to decode it. Since there isn't anything preventing us reaching this state, remove the assertion. expandUCNs("X\UAAAAAAAAY") will produce "XY". Differential Revision: https://reviews.llvm.org/D125059	2022-05-06 08:51:42 +02:00
Aaron Ballman	7de7161304	Use functions with prototypes when appropriate; NFC. A significant number of our tests in C accidentally use functions without prototypes. This patch converts the function signatures to have a prototype for the situations where the test is not specific to K&R C declarations. e.g., `void func();` becomes `void func(void);` This is the sixth batch of tests being updated (there are a significant number of other tests left to be updated).	2022-02-09 17:16:10 -05:00
Corentin Jabot	afb6223bc5	Support Unicode 14 identifiers. This updates the UAX tables to support new Unicode 14 identifiers.	2021-09-16 13:21:27 -04:00
Aaron Ballman	9f27364377	Use a more general test here. The interesting bit about that triple isn't the architecture, it's the fact that ps4 implies C99 as the standard rather than a newer C mode. Specify the language standard rather than the triple so the test is a bit more general.	2021-08-18 09:32:05 -04:00
Corentin Jabot	2715c4da50	Do not emit diagnostics for invalid Unicode characters in preprocessing mode. This amends 4e80636db71a1b6123d15ed1f9eda3979b4292de with a fix for https://lab.llvm.org/buildbot/#/builders/139/builds/8943	2021-08-18 09:12:36 -04:00
Corentin Jabot	4e80636db7	Implement P1949. This adds the Unicode 13 data for XID_Start and XID_Continue. The definition of a valid identifier is changed in all C++ modes, as P1949 (https://wg21.link/p1949) was accepted by WG21 as a defect report.	2021-08-18 07:33:14 -04:00
Richard Smith	4e966e8135	Don't emit "will be treated as an identifier character" warning for UTF-8 characters that aren't identifier characters in the current language mode. llvm-svn: 343040	2018-09-25 22:34:45 +00:00
Richard Smith	8ed7776bc4	PR38870: Add warning for zero-width unicode characters appearing in identifiers. llvm-svn: 341700	2018-09-07 19:25:39 +00:00
Richard Smith	77091b167f	Warn if we find a Unicode homoglyph for a symbol in an identifier. Specifically, warn if: * we find a character that the language standard says we must treat as an identifier, and * that character is not reasonably an identifier character (it's a punctuation character or similar), and * it renders identically to a valid non-identifier character in common fixed-width fonts. Some tools "helpfully" substitute the surprising characters for the expected characters, and replacing semicolons with Greek question marks is a common "prank". llvm-svn: 320697	2017-12-14 13:15:08 +00:00
Richard Smith	664798c034	Add test that we correctly allow some non-letter unicode characters in identifiers, and extend existing test to also cover C++. llvm-svn: 248079	2015-09-19 02:14:12 +00:00
Jordan Rose	cc538345be	Lexer: Don't warn about Unicode in preprocessor directives. This allows people to use Unicode in their #pragma mark and in macros that exist only to be string-ized. <rdar://problem/13107323&13121362> llvm-svn: 174081	2013-01-31 19:48:48 +00:00
Jordan Rose	17441589c3	Don't warn about Unicode characters in -E mode. People use the C preprocessor for things other than C files. Some of them have Unicode characters. We shouldn't warn about Unicode characters appearing outside of identifiers in this case. There's not currently a way for the preprocessor to tell if it's in -E mode, so I added a new flag, derived from the PreprocessorOutputOptions. This is only used by the Unicode warnings for now, but could conceivably be used by other warnings or even behavioral differences later. <rdar://problem/13107323> llvm-svn: 173881	2013-01-30 01:52:57 +00:00
Jordan Rose	4246ae0089	As an extension, treat Unicode whitespace characters as whitespace. llvm-svn: 173370	2013-01-24 20:50:50 +00:00

19 Commits