
Incorporate proposals #142 and #143 into the design

Add design text based on the contents of two proposals:

#142 Unicode source files
#143 Numeric literals
Richard Smith 5 years ago
parent
commit
ee7a108da4

+ 16 - 8
docs/design/README.md

@@ -105,7 +105,8 @@ cleaned up during evolution.
 
 ### Code and comments
 
-> References: [Lexical conventions](lexical_conventions.md)
+> References: [Source files](code_and_name_organization/source_files.md) and
+> [lexical conventions](lexical_conventions)
 >
 > **TODO:** References need to be evolved.
 
@@ -127,9 +128,16 @@ cleaned up during evolution.
       live code
     ```
 
+-   Decimal, hexadecimal, and binary integer literals and decimal and
+    hexadecimal floating-point literals are supported, with `_` as a digit
+    separator. For example, `42`, `0b1011_1101` and `0x1.EEFp+5`. Numeric
+    literals are case-sensitive: `0x`, `0b`, `e+`, and `p+` must be lowercase,
+    whereas hexadecimal digits must be uppercase. A digit is required on both
+    sides of a period.
+
 ### Packages, libraries, and namespaces
 
-> References: [Code and name organization](code_and_name_organization.md)
+> References: [Code and name organization](code_and_name_organization)
 
 -   **Files** are grouped into libraries, which are in turn grouped into
     packages.
@@ -161,16 +169,16 @@ fn Foo(var Geometry.Shapes.Flat.Circle: circle) { ... }
 
 ### Names and scopes
 
-> References: [Lexical conventions](lexical_conventions.md)
+> References: [Lexical conventions](lexical_conventions)
 >
 > **TODO:** References need to be evolved.
 
 Various constructs introduce a named entity in Carbon. These can be functions,
 types, variables, or other kinds of entities that we'll cover. A name in Carbon
-is always formed out of an "identifier", or a sequence of letters, numbers, and
-underscores which starts with a letter. As a regular expression, this would be
-`/[a-zA-Z][a-zA-Z0-9_]*/`. Eventually we may add support for more unicode
-characters as well.
+is formed from a word, which is a sequence of letters, numbers, and underscores,
+and which starts with a letter. We intend to follow Unicode's Annex 31 in
+selecting valid identifier characters, but a concrete set of valid characters
+has not been selected yet.
 
 #### Naming conventions
 
@@ -240,7 +248,7 @@ file, including `Int` and `Bool`. These will likely be defined in a special
 
 ### Expressions
 
-> References: [Lexical conventions](lexical_conventions.md) and
+> References: [Lexical conventions](lexical_conventions) and
 > [operators](operators.md)
 >
 > **TODO:** References need to be evolved.

+ 11 - 11
docs/design/code_and_name_organization.md → docs/design/code_and_name_organization/README.md

@@ -112,8 +112,8 @@ Important Carbon goals for code and name organization are:
 
 ## Overview
 
-Carbon files have a `.carbon` extension, such as `geometry.carbon`. These files
-are the basic unit of compilation.
+Carbon [source files](source_files.md) have a `.carbon` extension, such as
+`geometry.carbon`. These files are the basic unit of compilation.
 
 Each file begins with a declaration of which
 _package_<sup><small>[[define](/docs/guides/glossary.md#package)]</small></sup>
@@ -228,9 +228,9 @@ Every source file will consist of, in order:
 3. Source file body, with other code.
 
 Comments and blank lines may be intermingled with these sections.
-[Metaprogramming](metaprogramming.md) code may also be intermingled, so long as
-the outputted code is consistent with the enforced ordering. Other types of code
-must be in the source file body.
+[Metaprogramming](/docs/design/metaprogramming.md) code may also be
+intermingled, so long as the outputted code is consistent with the enforced
+ordering. Other types of code must be in the source file body.
 
 ### Name paths
 
@@ -241,7 +241,7 @@ separated by dots. This syntax may be loosely expressed as a regular expression:
 IDENTIFIER(\.IDENTIFIER)*
 ```
 
-Name conflicts are addressed by [name lookup](name_lookup.md).
+Name conflicts are addressed by [name lookup](/docs/design/name_lookup.md).
 
 #### `package` syntax
 
@@ -467,7 +467,7 @@ An import declares a package entity named after the imported package, and makes
 `api`-tagged entities from the imported library through it. The full name path
 is a concatenation of the names of the package entity, any namespace entities
 applied, and the final entity addressed. Child namespaces or entities may be
-[aliased](aliases.md) if desired.
+[aliased](/docs/design/aliases.md) if desired.
 
 For example, given a library:
 
@@ -574,8 +574,8 @@ struct Shapes.Square { ... };
 
 #### Aliasing
 
-Carbon's [alias keyword](aliases.md) will support aliasing namespaces. For
-example, this would be valid code:
+Carbon's [alias keyword](/docs/design/aliases.md) will support aliasing
+namespaces. For example, this would be valid code:
 
 ```carbon
 namespace Timezones.Internal;
@@ -606,7 +606,7 @@ import, and that the `api` is infeasible to rename due to existing callers.
 Alternately, the `api` entity may be using an idiomatic name that it would
 contradict naming conventions to rename. In either case, this conflict may exist
 in a single file without otherwise affecting users of the API. This will be
-addressed by [name lookup](name_lookup.md).
+addressed by [name lookup](/docs/design/name_lookup.md).
 
 ### Potential refactorings
 
@@ -904,7 +904,7 @@ Advantages:
 Disadvantages:
 
 -   We are likely to want a more fine-grained, file-level approach proposed by
-    [name lookup](name_lookup.md).
+    [name lookup](/docs/design/name_lookup.md).
 -   Allows package owners to name their packages things that they rarely type,
     but that importers end up typing frequently.
     -   The existence of a short `package` keyword shifts the balance for long

+ 244 - 0
docs/design/code_and_name_organization/source_files.md

@@ -0,0 +1,244 @@
+# Source files
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+## Table of contents
+
+<!-- toc -->
+
+-   [Overview](#overview)
+-   [Encoding](#encoding)
+-   [References](#references)
+-   [Alternatives](#alternatives)
+    -   [Character encoding](#character-encoding)
+    -   [Byte order marks](#byte-order-marks)
+    -   [Normalization forms](#normalization-forms)
+
+<!-- tocstop -->
+
+## Overview
+
+A Carbon _source file_ is a sequence of Unicode code points in Unicode
+Normalization Form C ("NFC"), and represents a portion of the complete text of a
+program.
+
+Program text can come from a variety of sources, such as an interactive
+programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a
+database, a memory buffer of an IDE, or a command-line argument.
+
+The canonical representation for Carbon programs is in files stored as a
+sequence of bytes in a file system on disk. Such files have a `.carbon`
+extension.
+
+## Encoding
+
+The on-disk representation of a Carbon source file is encoded in UTF-8. Such
+files may begin with an optional UTF-8 BOM, that is, the byte sequence
+EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if present, is
+ignored.
+
+No Unicode normalization is performed when reading an on-disk representation of
+a Carbon source file, so the byte representation is required to be normalized in
+Normalization Form C. The Carbon source formatting tool will convert source
+files to NFC as necessary.
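
The encoding rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Carbon toolchain's implementation: `read_carbon_source` is a hypothetical helper name, and Python's `unicodedata` module follows whatever Unicode version the interpreter ships with, which may differ from the Unicode 13.0 baseline named in this document.

```python
import codecs
import unicodedata


def read_carbon_source(data: bytes) -> str:
    """Decode on-disk bytes as UTF-8, skip an optional BOM, and require NFC.

    Hypothetical helper for illustration only.
    """
    # The optional UTF-8 BOM is the byte sequence EF BB BF; if present, ignore it.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = data.decode("utf-8")
    # No normalization is performed on read: non-NFC input is rejected.
    if not unicodedata.is_normalized("NFC", text):
        raise ValueError("source file is not in Normalization Form C")
    return text
```

Rejecting rather than converting mirrors the rule that the byte representation itself must already be normalized; conversion is left to the formatting tool.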
+
+## References
+
+-   [Unicode](https://www.unicode.org/versions/latest/) is a universal character
+    encoding, maintained by the
+    [Unicode Consortium](https://home.unicode.org/basic-info/overview/). It is
+    the canonical encoding used for textual information interchange across all
+    modern technology.
+
+    Carbon is based on Unicode 13.0, which is currently the latest version of
+    the Unicode standard. Newer versions will be considered for adoption as they
+    are released.
+
+-   [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html)
+
+-   [Wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)
+
+## Alternatives
+
+The choice to require NFC is really four choices:
+
+1. Equivalence classes: we use a canonical normalization form rather than a
+   compatibility normalization form or no normalization form at all.
+
+    - If we use no normalization, invisibly-different ways of representing the
+      same glyph, such as with pre-combined diacritics versus with diacritics
+      expressed as separate combining characters, or with combining characters
+      in a different order, would be considered different characters.
+    - If we use a canonical normalization form, all ways of encoding diacritics
+      are considered to form the same character, but ligatures such as `ffi` are
+      considered distinct from the character sequence that they decompose into.
+    - If we use a compatibility normalization form, ligatures are considered
+      equivalent to the character sequence that they decompose into.
+
+    For a fixed-width font, a canonical normalization form is most likely to
+    consider characters to be the same if they look the same. Unicode annexes
+    [UAX#15](https://www.unicode.org/reports/tr15/tr15-18.html#Programming%20Language%20Identifiers)
+    and
+    [UAX#31](https://www.unicode.org/reports/tr31/tr31-33.html#normalization_and_case)
+    both recommend the use of Normalization Form C for case-sensitive
+    identifiers in programming languages.
+
+2. Composition: we use a composed normalization form rather than a decomposed
+   normalization form. For example, `ō` is encoded as U+014D (LATIN SMALL
+   LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
+   O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results
+   in smaller representations whenever the two differ, but the decomposed form
+   is a little easier for algorithmic processing (for example, typo correction
+   and homoglyph detection).
+
+3. We require source files to be in our chosen form, rather than converting to
+   that form as necessary.
+
+4. We require that the entire contents of the file be normalized, rather than
+   restricting our attention to only identifiers, or only identifiers and string
+   literals.
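
The first two choices in the list above can be observed with Python's standard `unicodedata` module. This is a small illustrative sketch; the variable names are arbitrary and the results depend on the Unicode data bundled with the interpreter.

```python
import unicodedata

composed = "\u014d"     # U+014D LATIN SMALL LETTER O WITH MACRON
decomposed = "o\u0304"  # U+006F followed by U+0304 COMBINING MACRON

# Composition: both spellings map to one representation per form.
as_nfc = unicodedata.normalize("NFC", decomposed)
as_nfd = unicodedata.normalize("NFD", composed)

# UTF-8 sizes of the composed and decomposed forms.
nfc_size = len(as_nfc.encode("utf-8"))
nfd_size = len(as_nfd.encode("utf-8"))

# Equivalence classes: a canonical form keeps the ligature distinct,
# while a compatibility form replaces it with its decomposition.
ligature = "\ufb03"  # U+FB03 LATIN SMALL LIGATURE FFI
canonical = unicodedata.normalize("NFC", ligature)
compatibility = unicodedata.normalize("NFKC", ligature)
```

Here `as_nfc` equals the composed spelling and takes 2 bytes in UTF-8, while `as_nfd` takes 3; `canonical` is still the single ligature code point, whereas `compatibility` is the three-letter string `ffi`.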
+
+### Character encoding
+
+**We could restrict programs to ASCII.**
+
+Advantages:
+
+-   Reduced implementation complexity.
+-   Avoids all problems relating to normalization, homoglyphs, text
+    directionality, and so on.
+-   We have no intention of using non-ASCII characters in the language syntax or
+    in any library name.
+-   Provides assurance that all names in libraries can reliably be typed by all
+    developers -- we already require that keywords, and thus all ASCII letters,
+    can be typed.
+
+Disadvantages:
+
+-   An overarching goal of the Carbon project is to provide a language that is
+    inclusive and welcoming. A language that does not permit names and comments
+    in programs to be expressed in the developer's native language will not meet
+    that goal for at least some of our developers.
+-   Quoted strings will be substantially less readable if non-ASCII printable
+    characters are required to be written as escape sequences.
+
+### Byte order marks
+
+**We could disallow byte order marks.**
+
+Advantages:
+
+-   Marginal implementation simplicity.
+
+Disadvantages:
+
+-   Several major editors, particularly on the Windows platform, insert UTF-8
+    BOMs and use them to identify file encoding.
+
+### Normalization forms
+
+**We could require a different normalization form.**
+
+Advantages:
+
+-   Some environments might more naturally produce a different normalization
+    form.
+-   Normalization Form D is more uniform, in that characters are always
+    maximally decomposed into combining characters; in NFC, characters may or
+    may not be decomposed depending on whether a composed form is available.
+    -   NFD may be more suitable for certain uses such as typo correction,
+        homoglyph detection, or code completion.
+
+Disadvantages:
+
+-   The C++ standard and community are moving towards using NFC:
+
+    -   WG21 is in the process of adopting an NFC requirement for C++
+        identifiers.
+    -   GCC warns on C++ identifiers that aren't in NFC.
+
+    As a consequence, we should expect that the tooling and development
+    environments that C++ developers are using will provide good support for
+    authoring NFC-encoded source files.
+
+-   The W3C recommends using NFC for all content, so code samples distributed on
+    webpages may be canonicalized into NFC by some web authoring tools.
+
+-   NFC produces smaller encodings than NFD in all cases where they differ.
+
+**We could require no normalization form and compare identifiers by code point
+sequence.**
+
+Advantages:
+
+-   This is the rule in use in C++20 and before.
+
+Disadvantages:
+
+-   This is not the rule planned for the near future of C++.
+-   Different representations of the same character may result in different
+    identifiers, in a way that is likely to be invisible in most programming
+    environments.
+
+**We could require no normalization form, and normalize the source code
+ourselves.**
+
+Advantages:
+
+-   We would treat source text identically regardless of the normalization form.
+-   Developers would not be responsible for ensuring that their editing
+    environment produces and preserves the proper normalization form.
+
+Disadvantages:
+
+-   There is substantially more implementation cost involved in normalizing
+    identifiers than in detecting whether they are in normal form. While this
+    proposal would require the implementation complexity of converting into NFC
+    in the formatting tool, it would not require the conversion cost to be paid
+    during compilation.
+
+    A high-quality implementation may choose to accept this cost anyway, in
+    order to better recover from errors. Moreover, it is possible to
+    [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization)
+    and do the conversion only when necessary. However, if non-canonical source
+    is formally valid, there are more stringent performance constraints on such
+    conversion than if it is only done for error recovery.
+
+-   Tools such as `grep` do not perform normalization themselves, and so would
+    be unreliable when applied to a codebase with inconsistent normalization.
+-   GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
+    process of adopting an
+    [NFC requirement for C++ identifiers](http://wg21.link/P1949R6), so
+    development environments should be expected to increasingly accommodate
+    production of text in NFC.
+-   The byte representation of a source file may be unstable if different
+    editing environments make different normalization choices, creating problems
+    for revision control systems, patch files, and the like.
+-   Normalizing the contents of string literals, rather than using their
+    contents unaltered, will introduce a risk of user surprise.
+
+**We could require only identifiers, or only identifiers and comments, to be
+normalized, rather than the entire input file.**
+
+Advantages:
+
+-   This would provide more freedom in comments to use arbitrary text.
+-   String literals could contain intentionally non-normalized text in order to
+    represent non-normalized strings.
+
+Disadvantages:
+
+-   Within string literals, this would result in invisible semantic differences:
+    strings that render identically can have different meanings.
+-   The semantics of the program could vary if its sources are normalized, which
+    an editing environment might do invisibly and automatically.
+-   If an editing environment were to automatically normalize text, it would
+    introduce spurious diffs into changes.
+-   We would need to be careful to ensure that no string or comment delimiter
+    ends with a code point sequence that is a prefix of a decomposition of
+    another code point, otherwise different normalizations of the same source
+    file could tokenize differently.

+ 0 - 25
docs/design/lexical_conventions.md

@@ -1,25 +0,0 @@
-# Lexical conventions
-
-<!--
-Part of the Carbon Language project, under the Apache License v2.0 with LLVM
-Exceptions. See /LICENSE for license information.
-SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
--->
-
-## Table of contents
-
-<!-- toc -->
-
--   [TODO](#todo)
-
-<!-- tocstop -->
-
-## TODO
-
-This is a skeletal design, added to support [the overview](README.md). It should
-not be treated as accepted by the core team; rather, it is a placeholder until
-we have more time to examine this detail. Please feel welcome to rewrite and
-update as appropriate.
-
-See [PR 17](https://github.com/carbon-language/carbon-lang/pull/17) for context
--- that proposal may replace this.

+ 44 - 0
docs/design/lexical_conventions/README.md

@@ -0,0 +1,44 @@
+# Lexical conventions
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+## Table of contents
+
+<!-- toc -->
+
+-   [TODO](#todo)
+-   [Lexical elements](#lexical-elements)
+
+<!-- tocstop -->
+
+## TODO
+
+This is a skeletal design, added to support
+[the overview](/docs/design/README.md). It should not be treated as accepted by
+the core team; rather, it is a placeholder until we have more time to examine
+this detail. Please feel welcome to rewrite and update as appropriate.
+
+See [PR 17](https://github.com/carbon-language/carbon-lang/pull/17) for context
+-- that proposal may replace this.
+
+## Lexical elements
+
+The first stage of processing a
+[source file](/docs/design/code_and_name_organization/source_files.md) is the
+division of the source file into lexical elements.
+
+A _lexical element_ is one of the following:
+
+-   a maximal sequence of [whitespace](whitespace.md) characters
+-   a [word](words.md)
+-   a literal:
+    -   a [numeric literal](numeric_literals.md)
+    -   TODO: string literals
+-   TODO: operators, comments, ...
+
+The sequence of lexical elements is formed by repeatedly removing the longest
+initial sequence of characters that forms a valid lexical element.
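
The "longest initial sequence" rule is often called maximal munch. A minimal Python sketch follows, assuming three deliberately simplified and hypothetical patterns; the real definitions of whitespace, words, and numeric literals live in the linked documents, and operators, comments, and string literals are not covered.

```python
import re

# Hypothetical, simplified patterns used only to illustrate the rule.
PATTERNS = [
    ("whitespace", re.compile(r"[ \t\n]+")),
    ("word", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("numeric_literal", re.compile(r"[0-9][A-Za-z0-9_.]*")),
]


def lex(source):
    """Split `source` by repeatedly removing the longest matching prefix."""
    elements = []
    pos = 0
    while pos < len(source):
        best = None  # (kind, end offset) of the longest match found so far
        for kind, pattern in PATTERNS:
            m = pattern.match(source, pos)
            if m and (best is None or m.end() > best[1]):
                best = (kind, m.end())
        if best is None:
            raise ValueError(f"no lexical element at offset {pos}")
        kind, end = best
        elements.append((kind, source[pos:end]))
        pos = end
    return elements
```

With these assumed patterns, `lex("x1 42")` yields a word, a whitespace run, and a numeric literal, in that order.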

+ 483 - 0
docs/design/lexical_conventions/numeric_literals.md

@@ -0,0 +1,483 @@
+# Numeric literals
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+## Table of contents
+
+<!-- toc -->
+
+-   [Overview](#overview)
+-   [Details](#details)
+    -   [Integer literals](#integer-literals)
+    -   [Real number literals](#real-number-literals)
+        -   [Ties](#ties)
+    -   [Digit separators](#digit-separators)
+-   [Alternatives](#alternatives)
+    -   [Integer bases](#integer-bases)
+        -   [Octal literals](#octal-literals)
+        -   [Decimal literals](#decimal-literals)
+        -   [Case sensitivity](#case-sensitivity)
+    -   [Real number syntax](#real-number-syntax)
+    -   [Digit separator syntax](#digit-separator-syntax)
+    -   [Digit separator positioning](#digit-separator-positioning)
+
+<!-- tocstop -->
+
+## Overview
+
+The following syntaxes are supported:
+
+-   Integer literals
+    -   `12345` (decimal)
+    -   `0x1FE` (hexadecimal)
+    -   `0b1010` (binary)
+-   Real number literals
+    -   `123.456` (digits on both sides of the `.`)
+    -   `123.456e789` (optional `+` or `-` after the `e`)
+    -   `0x1.2p123` (optional `+` or `-` after the `p`)
+-   Digit separators (`_`) may be used, but only in conventional locations
+
+Note that real number literals always contain a `.` with digits on both sides,
+and integer literals never contain a `.`.
+
+Literals are case-sensitive. Unlike in C++, literals do not have a suffix to
+indicate their type.
+
+## Details
+
+### Integer literals
+
+Decimal integers are written as a non-zero decimal digit followed by zero or
+more additional decimal digits, or as a single `0`.
+
+Integers in other bases are written as a `0` followed by a base specifier
+character, followed by a sequence of digits in the corresponding base. The
+available base specifiers and corresponding bases are:
+
+| Base specifier | Base | Digits                   |
+| -------------- | ---- | ------------------------ |
+| `b`            | 2    | `0` and `1`              |
+| `x`            | 16   | `0` ... `9`, `A` ... `F` |
+
+The above table is case-sensitive. For example, `0b1` and `0x1A` are valid, and
+`0B1`, `0X1A`, and `0x1a` are invalid.
+
+A zero at the start of a literal can never be followed by another digit: either
+the literal is `0`, the `0` begins a base specifier, or the next character is a
+decimal point (see below). No support is provided for octal literals, and any C
+or C++ octal literal (other than `0`) is invalid in Carbon.
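
As an illustration of the integer literal rules above, here is a Python sketch that accepts exactly the three described forms and computes their values. The regular expression and the `integer_value` name are assumptions made for this example, and digit separators are intentionally left out.

```python
import re

# A single `0`, a decimal integer without leading zeros, a binary literal,
# or a hexadecimal literal with uppercase digits. Prefixes are lowercase only.
INTEGER_LITERAL = re.compile(r"0|[1-9][0-9]*|0b[01]+|0x[0-9A-F]+")


def integer_value(literal):
    """Return the value of a valid integer literal, or None if it is invalid."""
    if not INTEGER_LITERAL.fullmatch(literal):
        return None
    if literal.startswith("0b"):
        return int(literal[2:], 2)
    if literal.startswith("0x"):
        return int(literal[2:], 16)
    return int(literal)
```

Under this sketch `0b1` and `0x1A` are accepted, while `0B1`, `0X1A`, `0x1a`, and the C-style octal `0755` are all rejected.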
+
+### Real number literals
+
+Real numbers are written as a decimal or hexadecimal integer followed by a
+period (`.`) followed by a sequence of one or more decimal or hexadecimal
+digits, respectively. A digit is required on each side of the period. `0.` and
+`.3` are both invalid.
+
+A real number can be followed by an exponent character, an optional `+` or `-`
+(defaulting to `+` if absent), and a character sequence matching the grammar of
+a decimal integer with some value _N_. For a decimal real number, the exponent
+character is `e`, and the effect is to multiply the given value by
+10<sup>&plusmn;_N_</sup>. For a hexadecimal real number, the exponent character
+is `p`, and the effect is to multiply the given value by
+2<sup>&plusmn;_N_</sup>. The exponent suffix is optional for both decimal and
+hexadecimal real numbers.
+
+Note that a decimal integer followed by `e` is not a real number literal. For
+example, `3e10` is not a valid literal.
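
The grammar and exponent rules above can be sketched as follows, computing the exact mathematical value of a literal as a `Fraction`. The pattern names and `real_value` are hypothetical, and digit separators are ignored here.

```python
import re
from fractions import Fraction

DEC_INT = r"(?:0|[1-9][0-9]*)"  # grammar of a decimal integer
DECIMAL_REAL = re.compile(rf"({DEC_INT})\.([0-9]+)(?:e([+-]?)({DEC_INT}))?")
HEX_REAL = re.compile(rf"0x([0-9A-F]+)\.([0-9A-F]+)(?:p([+-]?)({DEC_INT}))?")


def real_value(literal):
    """Return the exact value of a valid real number literal, else None."""
    for pattern, digit_base, exponent_base in (
        (DECIMAL_REAL, 10, 10),  # `e` scales by a power of 10
        (HEX_REAL, 16, 2),       # `p` scales by a power of 2
    ):
        m = pattern.fullmatch(literal)
        if m:
            whole, frac, sign, exponent = m.groups()
            mantissa = Fraction(int(whole + frac, digit_base),
                                digit_base ** len(frac))
            n = int(exponent) if exponent else 0
            if sign == "-":
                n = -n
            return mantissa * Fraction(exponent_base) ** n
    return None
```

For example, `real_value("0x1.2p3")` is 9, because the hexadecimal mantissa `1.2` is 9/8 and the exponent multiplies it by 2<sup>3</sup>. Inputs such as `3e10`, `0.`, and `.3` return `None`.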
+
+When a real number literal is interpreted as a value of a real number type, its
+value is the representable real number closest to the value of the literal. In
+the case of a [tie](#ties), the conversion to the real number type is invalid.
+
+The decimal real number syntax allows for any decimal fraction to be expressed
+-- that is, any number of the form _a_ x 10<sup>-_b_</sup>, where _a_ is an
+integer and _b_ is a non-negative integer. Because the decimal fractions are
+dense in the reals and the set of values of the real number type is assumed to
+be discrete, every value of the real number type can be expressed as a real
+number literal. However, for certain applications, directly expressing the
+intended real number representation may be more convenient than producing a
+decimal equivalent that is known to convert to the intended value. Hexadecimal
+real number literals are provided in order to permit values of binary floating
+or fixed point real number types to be expressed directly.
+
+#### Ties
+
+As described above, a real number literal that lies exactly between two
+representable values for its target type is invalid. Such ties are extremely
+unlikely to occur by accident: for example, when interpreting a literal as
+`Float64`, `1.` would need to be followed by exactly 53 decimal digits (followed
+by zero or more `0`s) to land exactly half-way between two representable values,
+and the probability of `1.` followed by a random 53-digit sequence resulting in
+such a tie is one in 5<sup>53</sup>, or about
+0.000000000000000000000000000000000009%. For `Float32`, it's about
+0.000000000000001%, and even for a typical `Float16` implementation with 10
+fractional bits, it's around 0.00001%.
+
+Ties are much easier to express as hexadecimal floating-point literals: for
+example, `0x1.0000_0000_0000_08p+0` is exactly half way between `1.0` and the
+smallest `Float64` value greater than `1.0`, which is `0x1.0000_0000_0000_1p+0`.
+
+Whether written in decimal or hexadecimal, a tie provides very strong evidence
+that the developer intended to express a precise floating-point value, and
+provided one bit too much precision (or one bit too little, depending on whether
+they expected some rounding to occur), so rejecting the literal is preferred
+over making an arbitrary choice between the two possible values.
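
A tie can be checked exactly with rational arithmetic. The sketch below assumes Python 3.9 or later (for `math.nextafter`), treats Python's `float` as a stand-in for `Float64`, and only handles finite values well inside the representable range; `is_float64_tie` is a hypothetical name.

```python
import math
from fractions import Fraction


def is_float64_tie(value: Fraction) -> bool:
    """True if `value` is exactly half-way between two adjacent doubles."""
    nearest = float(value)  # Fraction-to-float conversion rounds to nearest
    if Fraction(nearest) == value:
        return False  # exactly representable, so no rounding is needed
    # The other candidate is the adjacent double on the far side of `value`.
    direction = math.inf if value > nearest else -math.inf
    neighbor = math.nextafter(nearest, direction)
    return 2 * value == Fraction(nearest) + Fraction(neighbor)
```

The hexadecimal example above corresponds to `Fraction(1) + Fraction(1, 2**53)`, for which this check returns `True`; an ordinary value such as `Fraction(1, 10)` is not a tie.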
+
+### Digit separators
+
+If digit separators (`_`) are included in literals, they must meet the
+respective condition:
+
+-   For decimal integers, the digit separators shall occur every three digits
+    starting from the right. For example, `2_147_483_648`.
+-   For hexadecimal integers, the digit separators shall occur every four digits
+    starting from the right. For example, `0x7FFF_FFFF`.
+-   For real number literals, digit separators can appear in the decimal and
+    hexadecimal integer portions (prior to the period and after the optional `e`
+    or `p`) as described in the previous bullets. For example,
+    `2_147.483648e12_345` or `0x1_00CA.FEF00Dp+24`.
+-   For binary literals, digit separators can appear between any two digits. For
+    example, `0b1_000_101_11`.
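
The positioning rules for integer literals can be written as regular expressions. This Python sketch is one possible encoding under the assumption that a literal either uses separators at every conventional position or uses none; the names are hypothetical, and real number literals are not handled.

```python
import re

# Decimal: no separators, or groups of exactly three digits from the right.
DECIMAL = re.compile(r"0|[1-9][0-9]*|[1-9][0-9]{0,2}(?:_[0-9]{3})+")
# Hexadecimal: no separators, or groups of exactly four digits from the right.
HEXADECIMAL = re.compile(r"0x(?:[0-9A-F]+|[0-9A-F]{1,4}(?:_[0-9A-F]{4})+)")
# Binary: a separator may appear between any two digits.
BINARY = re.compile(r"0b[01]+(?:_[01]+)*")


def is_valid_integer_literal(literal: str) -> bool:
    """Check an integer literal, including digit separator placement."""
    return any(p.fullmatch(literal) for p in (DECIMAL, HEXADECIMAL, BINARY))
```

With this sketch, `2_147_483_648`, `0x7FFF_FFFF`, and `0b1_000_101_11` are accepted, while `21_47` and `0x7_FFF` are rejected.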
+
+## Alternatives
+
+### Integer bases
+
+#### Octal literals
+
+No support is proposed for octal literals. In practice, their appearance in C
+and C++ code in a sample corpus consisted of (in decreasing order of commonality
+and excluding `0` literals):
+
+-   file permissions,
+-   cases where decimal was clearly intended (`CivilDay(2020, 04, 01)`), and
+-   (in _distant_ third place) anything else.
+
+The number of intentional uses of octal literals, other than in file
+permissions, was negligible. We considered the following alternatives:
+
+**Alternative 1:** Follow C and C++, and use `0` as the base prefix for octal.
+
+Advantages:
+
+-   More similar to C++ and other languages.
+
+Disadvantages:
+
+-   Subtle and error-prone rule: for example, left-padding with zeroes for
+    alignment changes the meaning of literals.
+
+**Alternative 2:** Use `0o` as the base prefix for octal.
+
+Advantages:
+
+-   Unlikely to be misinterpreted as decimal.
+-   Follows several other languages (for example, Python).
+
+Disadvantages:
+
+-   Additional language complexity.
+
+If we decide we want to introduce octal literals at a later date, use of
+alternative 2 is suggested.
+
+#### Decimal literals
+
+**We could permit leading `0`s in decimal integers (and in floating-point
+numbers).**
+
+Advantages:
+
+-   We would allow leading `0`s to be used to align columns of numbers.
+
+Disadvantages:
+
+-   The same literal could be valid but have a different value in C++ and
+    Carbon.
+
+**We could add an (optional) base specifier `0d` for decimal integers.**
+
+Advantages:
+
+-   Uniform treatment of all bases. Left-padding with `0` could be achieved by
+    using `0d000123`.
+
+Disadvantages:
+
+-   No evidence of need for this functionality.
+
+**We could permit an `e` in decimal literals to express large powers of 10.**
+
+Advantages:
+
+-   Many uses of (for example) `1e6` in our sample C++ corpus intend to form an integer
+    literal instead of a floating-point literal.
+
+Disadvantages:
+
+-   Would violate the expectations of many C++ programmers used to `e`
+    indicating a floating-point constant.
+
+#### Case sensitivity
+
+**We could make base specifiers case-insensitive.**
+
+Advantages:
+
+-   More similar to C++.
+
+Disadvantages:
+
+-   `0B1` is easily mistaken for `081`
+-   `0B1` can be confused with `0xB1`
+-   `0O17` is easily mistaken for `0017`
+-   Allowing more than one way to write literals will lead to style divergence.
+
+**We could make the digit sequence in hexadecimal integers case-insensitive.**
+
+Advantages:
+
+-   More similar to C++.
+-   Some developers will be more comfortable writing hexadecimal digits in
+    lowercase. Some tools, such as `md5`, will print lowercase.
+
+Disadvantages:
+
+-   Allowing more than one way to write literals will lead to style divergence.
+-   Lowercase hexadecimal digits are less visually distinct from the `x` base
+    specifier (for example, the digit sequence is more visually distinct in
+    `0xAC` than in `0xac`).
+
+**We could require the digit sequence in hexadecimal integers to be written
+using lowercase letters `a`..`f`.**
+
+Advantages:
+
+-   Some developers will be more comfortable writing hexadecimal digits in
+    lowercase. Some tools, such as `md5`, will print lowercase.
+-   `B` and `D` are more likely to be confused with `8` and `0` than `b` and `d`
+    are.
+
+Disadvantages:
+
+-   Some developers will be more comfortable writing hexadecimal digits in
+    uppercase. Some tools will print uppercase.
+-   Lowercase hexadecimal digits are less visually distinct from the `x` base
+    specifier (for example, the digit sequence is more visually distinct in
+    `0xAC` than in `0xac`).
+
+### Real number syntax
+
+**We could allow real numbers with no digits on one side of the period (`3.` or
+`.5`).**
+
+Advantages:
+
+-   More similar to C++.
+-   Allows numbers to be expressed more tersely.
+
+Disadvantages:
+
+-   Takes away `tup.0` syntax that may be useful for indexing tuples, because
+    `.0` would lex as a real number literal.
+-   Takes away `0.ToString()` syntax that may be useful for performing member
+    access on literals, because `0.` would lex as a real number literal.
+-   May harm readability by making the difference between an integer literal and
+    a real number literal less significant.
+-   Allowing more than one way to write literals will lead to style divergence.
+
+See also the section on
+[floating-point literals](https://google.github.io/styleguide/cppguide.html#Floating_Literals)
+in the Google style guide, which argues for the same rule.
+
+**We could allow a real number with an `e` or `p` to omit a period (`1e100`).**
+
+Advantages:
+
+-   More similar to C++.
+-   Allows numbers to be expressed more tersely.
+
+Disadvantages:
+
+-   Assuming that such numbers are integers rather than real numbers is a common
+    error in C++.
+
+**We could allow the `e` or `p` to be written in uppercase.**
+
+Advantages:
+
+-   More similar to C++.
+-   Most calculators use `E`, to avoid confusion with the constant `e`.
+
+Disadvantages:
+
+-   Allowing more than one way to write literals will lead to style divergence.
+-   `E` may be confused with a hexadecimal digit.
+
+**We could require a `p` in a hexadecimal real number literal.**
+
+Advantages:
+
+-   More similar to C++.
+-   When explicitly writing a bit-pattern for a floating-point type, it's
+    reasonable to always include the exponent value.
+
+Disadvantages:
+
+-   Less consistent.
+-   Makes hexadecimal floating-point values even more expert-only.
+
+**We could arbitrarily pick one of the two values when a real number is exactly
+half-way between two representable values.**
+
+Advantages:
+
+-   More similar to C++.
+-   Would accept more cases, and it's likely that either of the two possible
+    values would be acceptable in practice.
+
+Disadvantages:
+
+-   Would either need to specify which option is chosen or, following C++,
+    accept that programs using such literals have non-portable semantics.
+-   Numbers specified to the exact level of precision required to form a tie are
+    a strong signal that the programmer intended to specify a particular value.
+
+### Digit separator syntax
+
+We considered the following characters as digit separators:
+
+**Status quo:** `_` as a digit separator.
+
+Advantages:
+
+-   Follows convention of C#, Java, JavaScript, Python, D, Ruby, Rust, Swift,
+    ...
+-   Culturally agnostic, because it doesn't match any common human writing
+    convention.
+
+Disadvantages:
+
+-   Underscore is not used as a digit grouping separator in any common human
+    writing convention.
+
+**Alternative 1:** `'` as a digit separator.
+
+Advantages:
+
+-   Follows C++ syntax.
+-   Used in several (mostly European) writing conventions.
+
+Disadvantages:
+
+-   `'` is also likely to be used to introduce character literals.
+
+**Alternative 2:** `,` as a digit separator.
+
+Advantages:
+
+-   More similar to how numbers are written in English text and many other
+    cultures.
+
+Disadvantages:
+
+-   Commas are expected to be widely used in Carbon programs for other purposes,
+    where there may be digits on both sides of the comma. For example, there
+    could be readability problems if `f(1, 234)` called `f` with two arguments
+    but `f(1,234)` called `f` with a single argument.
+-   Comma is interpreted as a decimal point in the conventions of many cultures.
+-   Unprecedented in common programming languages.
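The `f(1, 234)` versus `f(1,234)` problem above can be sketched with a toy token pattern. This is an illustration under assumed rules, not Carbon's lexer: `NUMBER_WITH_COMMA` and `numeric_tokens` are hypothetical names, and the pattern simply lets a `,` appear between digits.

```python
import re

# Hypothetical pattern for Alternative 2: digits, optionally continued by
# `,` followed immediately by more digits.
NUMBER_WITH_COMMA = re.compile(r"\d+(?:,\d+)*")


def numeric_tokens(source: str) -> list:
    """Return the numeric literal tokens found in `source`."""
    return NUMBER_WITH_COMMA.findall(source)


two_args = numeric_tokens("f(1, 234)")   # the space ends the first literal
one_arg = numeric_tokens("f(1,234)")     # the comma is absorbed into one literal
```

Here `two_args` holds two literals and `one_arg` holds one, so a single space changes the number of arguments.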
+
+**Alternative 3:** whitespace as a digit separator.
+
+Advantages:
+
+-   Used and understood by many cultures.
+-   Never interpreted as a decimal point instead of a grouping separator.
+-   Also usable to the right of a decimal point.
+
+Disadvantages:
+
+-   Omitted separators in lists of numbers may result in distinct numbers being
+    spliced together. For example, `f(1, 23, 4 567)` may be interpreted as three
+    separate numerical arguments instead of four arguments with a missing comma.
+-   Unprecedented in other programming languages.
+
+**Alternative 4:** `.` as a digit separator, `,` as a decimal point.
+
+Advantages:
+
+-   More familiar to cultures that write numbers this way.
+
+Disadvantages:
+
+-   As with `,` as a digit separator, `,` as a decimal point is problematic.
+-   This usage is unfamiliar and would be surprising to programmers; programmers
+    from cultures where `,` is the decimal point in regular writing are likely
+    already accustomed to using `.` as the decimal point in programming
+    environments, and the converse is not true.
+
+**Alternative 5:** No digit separator syntax.
+
+Advantages:
+
+-   Simpler language rules.
+-   More consistent source syntax, as there is no choice as to whether to use
+    digit separators or not.
+
+Disadvantages:
+
+-   Harms the readability of long literals.
+
+### Digit separator positioning
+
+**Alternative 1:** allow any digit groupings (for example, `123_4567_89`).
+
+Advantages:
+
+-   Simpler, more flexible rule that may allow some groupings that are
+    conventional in a specific domain. For example, `var Date: d = 01_12_1983;`,
+    or `var Int64: time_in_microseconds = 123456_000000;`.
+-   Culturally agnostic. For example, the Indian convention for digit separators
+    would group the last three digits, and then every two digits before that
+    (1,23,45,678 could be written `1_23_45_678`).
+
+Disadvantages:
+
+-   Less self-checking that numeric literals are interpreted the way that the
+    author intends.
+
+**Alternative 2:** as above, but additionally require binary digits to be
+grouped in 4s.
+
+Advantages:
+
+-   More enforcement that digit grouping is conventional.
+
+Disadvantages:
+
+-   No clear, established rule for how to group binary digits. In some cases,
+    8-digit groups may be more conventional.
+-   When used to express literals involving bit-fields, arbitrary grouping may
+    be desirable. For example:
+
+    ```carbon
+    var Float32: flt_max =
+      BitCast(Float32, 0b0_11111110_11111111111111111111111);
+    ```
+
+**Alternative 3:** allow any regular grouping.
+
+Advantages:
+
+-   Can be applied uniformly to all bases.
+
+Disadvantages:
+
+-   Provides no assistance for decimal numbers with a single digit separator.
+-   Does not allow binary literals to express an intent to initialize irregular
+    bit-fields.
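A rough sketch of what a "regular grouping" check for Alternative 3 might look like. The reading of "regular" used here is an assumption: every group after the first has the same length, and the first group is no longer than the rest. `is_regular_grouping` is a hypothetical name, and the input is the digit sequence without any base prefix.

```python
def is_regular_grouping(digits: str) -> bool:
    """True if the `_`-separated groups in `digits` form a regular pattern."""
    groups = digits.split("_")
    if any(not group for group in groups):
        return False  # leading, trailing, or doubled separator
    if len(groups) == 1:
        return True   # no separators at all
    size = len(groups[1])
    same_size = all(len(group) == size for group in groups[1:])
    return same_size and len(groups[0]) <= size
```

Under this reading, `1_000_000` is accepted, while `123_4567_89`, `1_23_45_678`, and the bit-field grouping `0_11111110_11111111111111111111111` are rejected. A literal with a single separator such as `1_0000` is also accepted, which matches the first disadvantage listed above.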

+ 42 - 0
docs/design/lexical_conventions/whitespace.md

@@ -0,0 +1,42 @@
+# Whitespace
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+## Table of contents
+
+<!-- toc -->
+
+-   [Overview](#overview)
+
+<!-- tocstop -->
+
+## Overview
+
+The exact lexical form of Carbon whitespace has not yet been settled. However,
+Carbon will follow lexical conventions for whitespace based on
+[Unicode Annex #31](https://unicode.org/reports/tr31/). TODO: Update this once
+the precise rules are decided; see the
+[Unicode source files](/proposals/p0142.md#characters-in-identifiers) proposal.
+
+Unicode Annex #31 suggests selecting whitespace characters based on the
+characters with Unicode property `Pattern_White_Space`, which is currently these
+11 characters:
+
+-   U+0009 CHARACTER TABULATION (horizontal tab)
+-   U+000A LINE FEED (traditional newline)
+-   U+000B LINE TABULATION (vertical tab)
+-   U+000C FORM FEED (page break)
+-   U+000D CARRIAGE RETURN
+-   U+0020 SPACE
+-   U+0085 NEXT LINE (Unicode newline)
+-   U+200E LEFT-TO-RIGHT MARK
+-   U+200F RIGHT-TO-LEFT MARK
+-   U+2028 LINE SEPARATOR
+-   U+2029 PARAGRAPH SEPARATOR
+
+The quantity and kind of whitespace separating tokens is ignored except where
+otherwise specified.
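As an illustration, the 11 characters listed above can be written down as a set for a membership test. This is only a sketch: the precise Carbon rules are not yet decided, and `PATTERN_WHITE_SPACE` and `is_pattern_white_space` are hypothetical names.

```python
# The 11 code points listed above (Unicode property Pattern_White_Space).
PATTERN_WHITE_SPACE = frozenset(
    "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u200E\u200F\u2028\u2029"
)


def is_pattern_white_space(ch: str) -> bool:
    """True if the single character `ch` is one of the listed characters."""
    return ch in PATTERN_WHITE_SPACE
```

This set is not the same as Python's `str.isspace`, which follows a different definition: for example, U+00A0 NO-BREAK SPACE is not in the list above, while U+200E LEFT-TO-RIGHT MARK is.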

+ 49 - 0
docs/design/lexical_conventions/words.md

@@ -0,0 +1,49 @@
+# Words
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+## Table of contents
+
+<!-- toc -->
+
+-   [Overview](#overview)
+-   [Alternatives](#alternatives)
+
+<!-- tocstop -->
+
+## Overview
+
+A _word_ is a lexical element formed from a sequence of letters or letter-like
+characters, such as `fn` or `Foo` or `Int`.
+
+The exact lexical form of words has not yet been settled. However, Carbon will
+follow lexical conventions for identifiers based on
+[Unicode Annex #31](https://unicode.org/reports/tr31/). TODO: Update this once
+the precise rules are decided; see the
+[Unicode source files](/proposals/p0142.md#characters-in-identifiers) proposal.
+
+## Alternatives
+
+**We could restrict words to ASCII.**
+
+Advantages:
+
+-   Reduced implementation complexity.
+-   Avoids all problems relating to normalization, homoglyphs, text
+    directionality, and so on.
+-   We have no intention of using non-ASCII characters in the language syntax or
+    in any library name.
+-   Provides assurance that all names in libraries can reliably be typed by all
+    developers -- we already require that keywords, and thus all ASCII letters,
+    can be typed.
+
+Disadvantages:
+
+-   An overarching goal of the Carbon project is to provide a language that is
+    inclusive and welcoming. A language that does not permit names in programs
+    to be expressed in the developer's native language will not meet that goal
+    for at least some of our developers.