5 years ago · ee7a108da4
--- a/docs/design/README.md
+++ b/docs/design/README.md
@@ -105,7 +105,8 @@ cleaned up during evolution.
 
				 
			
 
				 ### Code and comments
			
 
				 
			
 
				-> References: [Lexical conventions](lexical_conventions.md)
			
 
				+> References: [Source files](code_and_name_organization/source_files.md) and
			
 
				+> [lexical conventions](lexical_conventions)
			
 
				 >
			
 
				 > **TODO:** References need to be evolved.
			
 
				 
			
@@ -127,9 +128,16 @@ cleaned up during evolution.
 
				       live code
			
 
				     ```
			
 
				 
			
 
				+-   Decimal, hexadecimal, and binary integer literals and decimal and
			
 
				+    hexadecimal floating-point literals are supported, with `_` as a digit
			
 
				+    separator. For example, `42`, `0b1011_1101` and `0x1.EEFp+5`. Numeric
			
 
				+    literals are case-sensitive: `0x`, `0b`, `e+`, and `p+` must be lowercase,
			
 
				+    whereas hexadecimal digits must be uppercase. A digit is required on both
			
 
				+    sides of a period.
			
 
				+
			
 
				 ### Packages, libraries, and namespaces
			
 
				 
			
 
				-> References: [Code and name organization](code_and_name_organization.md)
			
 
				+> References: [Code and name organization](code_and_name_organization)
			
 
				 
			
 
				 -   **Files** are grouped into libraries, which are in turn grouped into
			
 
				     packages.
			
@@ -161,16 +169,16 @@ fn Foo(var Geometry.Shapes.Flat.Circle: circle) { ... }
 
				 
			
 
				 ### Names and scopes
			
 
				 
			
 
				-> References: [Lexical conventions](lexical_conventions.md)
			
 
				+> References: [Lexical conventions](lexical_conventions)
			
 
				 >
			
 
				 > **TODO:** References need to be evolved.
			
 
				 
			
 
				 Various constructs introduce a named entity in Carbon. These can be functions,
			
 
				 types, variables, or other kinds of entities that we'll cover. A name in Carbon
			
 
				-is always formed out of an "identifier", or a sequence of letters, numbers, and
			
 
				-underscores which starts with a letter. As a regular expression, this would be
			
 
				-`/[a-zA-Z][a-zA-Z0-9_]*/`. Eventually we may add support for more unicode
			
 
				-characters as well.
			
 
				+is formed from a word, which is a sequence of letters, numbers, and underscores,
			
 
				+and which starts with a letter. We intend to follow Unicode's Annex 31 in
			
 
				+selecting valid identifier characters, but a concrete set of valid characters
			
 
				+has not been selected yet.
			
 
				 
			
 
				 #### Naming conventions
			
 
				 
			
@@ -240,7 +248,7 @@ file, including `Int` and `Bool`. These will likely be defined in a special
 
				 
			
 
				 ### Expressions
			
 
				 
			
 
				-> References: [Lexical conventions](lexical_conventions.md) and
			
 
				+> References: [Lexical conventions](lexical_conventions) and
			
 
				 > [operators](operators.md)
			
 
				 >
			
 
				 > **TODO:** References need to be evolved.
			
--- a/docs/design/code_and_name_organization/README.md
+++ b/docs/design/code_and_name_organization/README.md
@@ -112,8 +112,8 @@ Important Carbon goals for code and name organization are:
 
				 
			
 
				 ## Overview
			
 
				 
			
 
				-Carbon files have a `.carbon` extension, such as `geometry.carbon`. These files
			
 
				-are the basic unit of compilation.
			
 
				+Carbon [source files](source_files.md) have a `.carbon` extension, such as
			
 
				+`geometry.carbon`. These files are the basic unit of compilation.
			
 
				 
			
 
				 Each file begins with a declaration of which
			
 
				 _package_<sup><small>[[define](/docs/guides/glossary.md#package)]</small></sup>
			
@@ -228,9 +228,9 @@ Every source file will consist of, in order:
 
				 3. Source file body, with other code.
			
 
				 
			
 
				 Comments and blank lines may be intermingled with these sections.
			
 
				-[Metaprogramming](metaprogramming.md) code may also be intermingled, so long as
			
 
				-the outputted code is consistent with the enforced ordering. Other types of code
			
 
				-must be in the source file body.
			
 
				+[Metaprogramming](/docs/design/metaprogramming.md) code may also be
			
 
				+intermingled, so long as the outputted code is consistent with the enforced
			
 
				+ordering. Other types of code must be in the source file body.
			
 
				 
			
 
				 ### Name paths
			
 
				 
			
@@ -241,7 +241,7 @@ separated by dots. This syntax may be loosely expressed as a regular expression:
 
				 IDENTIFIER(\.IDENTIFIER)*
			
 
				 ```
			
 
				 
			
 
				-Name conflicts are addressed by [name lookup](name_lookup.md).
			
 
				+Name conflicts are addressed by [name lookup](/docs/design/name_lookup.md).
			
 
				 
			
 
				 #### `package` syntax
			
 
				 
			
@@ -467,7 +467,7 @@ An import declares a package entity named after the imported package, and makes
 
				 `api`-tagged entities from the imported library through it. The full name path
			
 
				 is a concatenation of the names of the package entity, any namespace entities
			
 
				 applied, and the final entity addressed. Child namespaces or entities may be
			
 
				-[aliased](aliases.md) if desired.
			
 
				+[aliased](/docs/design/aliases.md) if desired.
			
 
				 
			
 
				 For example, given a library:
			
 
				 
			
@@ -574,8 +574,8 @@ struct Shapes.Square { ... };
 
				 
			
 
				 #### Aliasing
			
 
				 
			
 
				-Carbon's [alias keyword](aliases.md) will support aliasing namespaces. For
			
 
				-example, this would be valid code:
			
 
				+Carbon's [alias keyword](/docs/design/aliases.md) will support aliasing
			
 
				+namespaces. For example, this would be valid code:
			
 
				 
			
 
				 ```carbon
			
 
				 namespace Timezones.Internal;
			
@@ -606,7 +606,7 @@ import, and that the `api` is infeasible to rename due to existing callers.
 
				 Alternately, the `api` entity may be using an idiomatic name that it would
			
 
				 contradict naming conventions to rename. In either case, this conflict may exist
			
 
				 in a single file without otherwise affecting users of the API. This will be
			
 
				-addressed by [name lookup](name_lookup.md).
			
 
				+addressed by [name lookup](/docs/design/name_lookup.md).
			
 
				 
			
 
				 ### Potential refactorings
			
 
				 
			
@@ -904,7 +904,7 @@ Advantages:
 
				 Disadvantages:
			
 
				 
			
 
				 -   We are likely to want a more fine-grained, file-level approach proposed by
			
 
				-    [name lookup](name_lookup.md).
			
 
				+    [name lookup](/docs/design/name_lookup.md).
			
 
				 -   Allows package owners to name their packages things that they rarely type,
			
 
				     but that importers end up typing frequently.
			
 
				     -   The existence of a short `package` keyword shifts the balance for long
			
--- a/docs/design/code_and_name_organization/source_files.md
+++ b/docs/design/code_and_name_organization/source_files.md
@@ -0,0 +1,244 @@
 
				+# Source files
			
 
				+
			
 
				+<!--
			
 
				+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				+Exceptions. See /LICENSE for license information.
			
 
				+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				+-->
			
 
				+
			
 
				+## Table of contents
			
 
				+
			
 
				+<!-- toc -->
			
 
				+
			
 
				+-   [Overview](#overview)
			
 
				+-   [Encoding](#encoding)
			
 
				+-   [References](#references)
			
 
				+-   [Alternatives](#alternatives)
			
 
				+    -   [Character encoding](#character-encoding)
			
 
				+    -   [Byte order marks](#byte-order-marks)
			
 
				+    -   [Normalization forms](#normalization-forms)
			
 
				+
			
 
				+<!-- tocstop -->
			
 
				+
			
 
				+## Overview
			
 
				+
			
 
				+A Carbon _source file_ is a sequence of Unicode code points in Unicode
			
 
				+Normalization Form C ("NFC"), and represents a portion of the complete text of a
			
 
				+program.
			
 
				+
			
 
				+Program text can come from a variety of sources, such as an interactive
			
 
				+programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a
			
 
				+database, a memory buffer of an IDE, or a command-line argument.
			
 
				+
			
 
				+The canonical representation for Carbon programs is in files stored as a
			
 
				+sequence of bytes in a file system on disk. Such files have a `.carbon`
			
 
				+extension.
			
 
				+
			
 
				+## Encoding
			
 
				+
			
 
				+The on-disk representation of a Carbon source file is encoded in UTF-8. Such
			
 
				+files may begin with an optional UTF-8 BOM, that is, the byte sequence
			
 
				+EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if present, is
			
 
				+ignored.
			
 
				+
			
 
				+No Unicode normalization is performed when reading an on-disk representation of
			
 
				+a Carbon source file, so the byte representation is required to be normalized in
			
 
				+Normalization Form C. The Carbon source formatting tool will convert source
			
 
				+files to NFC as necessary.
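The reading rules above can be sketched in Python. This is an illustration only, not the Carbon toolchain's implementation: the function name `decode_source` is hypothetical, and Python's `unicodedata` module follows whichever Unicode version ships with the interpreter, which may differ from the version Carbon is based on.

```python
import codecs
import unicodedata


def decode_source(raw):
    """Decode on-disk bytes into source text, or raise ValueError.

    Hypothetical helper for illustration; not part of any Carbon tool.
    """
    # An optional UTF-8 BOM (EF BB BF) at the start is ignored.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError,
    # which is a subclass of ValueError.
    text = raw.decode("utf-8")
    # No normalization is performed on input; text that is not
    # already in NFC is rejected rather than silently converted.
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("source file is not in Normalization Form C")
    return text
```

A formatting tool, by contrast, would write back `unicodedata.normalize("NFC", text)` instead of raising.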
			
 
				+
			
 
				+## References
			
 
				+
			
 
				+-   [Unicode](https://www.unicode.org/versions/latest/) is a universal character
			
 
				+    encoding, maintained by the
			
 
				+    [Unicode Consortium](https://home.unicode.org/basic-info/overview/). It is
			
 
				+    the canonical encoding used for textual information interchange across all
			
 
				+    modern technology.
			
 
				+
			
 
				+    Carbon is based on Unicode 13.0, which is currently the latest version of
			
 
				+    the Unicode standard. Newer versions will be considered for adoption as they
			
 
				+    are released.
			
 
				+
			
 
				+-   [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html)
			
 
				+
			
 
				+-   [Wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)
			
 
				+
			
 
				+## Alternatives
			
 
				+
			
 
				+The choice to require NFC is really four choices:
			
 
				+
			
 
				+1. Equivalence classes: we use a canonical normalization form rather than a
			
 
				+   compatibility normalization form or no normalization form at all.
			
 
				+
			
 
				+    - If we use no normalization, invisibly-different ways of representing the
			
 
				+      same glyph, such as with pre-combined diacritics versus with diacritics
			
 
				+      expressed as separate combining characters, or with combining characters
			
 
				+      in a different order, would be considered different characters.
			
 
				+    - If we use a canonical normalization form, all ways of encoding diacritics
			
 
				+      are considered to form the same character, but ligatures such as `ﬃ` are
			
 
				+      considered distinct from the character sequence that they decompose into.
			
 
				+    - If we use a compatibility normalization form, ligatures are considered
			
 
				+      equivalent to the character sequence that they decompose into.
			
 
				+
			
 
				+    For a fixed-width font, a canonical normalization form is most likely to
			
 
				+    consider characters to be the same if they look the same. Unicode annexes
			
 
				+    [UAX#15](https://www.unicode.org/reports/tr15/tr15-18.html#Programming%20Language%20Identifiers)
			
 
				+    and
			
 
				+    [UAX#31](https://www.unicode.org/reports/tr31/tr31-33.html#normalization_and_case)
			
 
				+    both recommend the use of Normalization Form C for case-sensitive
			
 
				+    identifiers in programming languages.
			
 
				+
			
 
				+2. Composition: we use a composed normalization form rather than a decomposed
			
 
				+   normalization form. For example, `ō` is encoded as U+014D (LATIN SMALL
			
 
				+   LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
			
 
				+   O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results
			
 
				+   in smaller representations whenever the two differ, but the decomposed form
			
 
				+   is a little easier for algorithmic processing (for example, typo correction
			
 
				+   and homoglyph detection).
			
 
				+
			
 
				+3. We require source files to be in our chosen form, rather than converting to
			
 
				+   that form as necessary.
			
 
				+
			
 
				+4. We require that the entire contents of the file be normalized, rather than
			
 
				+   restricting our attention to only identifiers, or only identifiers and string
			
 
				+   literals.
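The first two choices can be observed with Python's standard `unicodedata` module. This is a small sketch for illustration; the variable names are arbitrary, and the results depend on the Unicode data bundled with the Python interpreter.

```python
import unicodedata

# Composition: the same character in composed and decomposed form.
composed = unicodedata.normalize("NFC", "o\u0304")
decomposed = unicodedata.normalize("NFD", "\u014d")
composed_codepoints = [f"U+{ord(c):04X}" for c in composed]
decomposed_codepoints = [f"U+{ord(c):04X}" for c in decomposed]

# The composed form has the smaller UTF-8 encoding for this character.
composed_size = len(composed.encode("utf-8"))
decomposed_size = len(decomposed.encode("utf-8"))

# Equivalence classes: a canonical form keeps the ligature U+FB03
# distinct, while a compatibility form decomposes it into letters.
ligature = "\ufb03"
canonical = unicodedata.normalize("NFC", ligature)
compatibility = unicodedata.normalize("NFKC", ligature)
```

Here `composed_codepoints` is `["U+014D"]` and `decomposed_codepoints` is `["U+006F", "U+0304"]`; the encodings take 2 and 3 bytes respectively, and only the NFKC result differs from the original ligature.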
			
 
				+
			
 
				+### Character encoding
			
 
				+
			
 
				+**We could restrict programs to ASCII.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Reduced implementation complexity.
			
 
				+-   Avoids all problems relating to normalization, homoglyphs, text
			
 
				+    directionality, and so on.
			
 
				+-   We have no intention of using non-ASCII characters in the language syntax or
			
 
				+    in any library name.
			
 
				+-   Provides assurance that all names in libraries can reliably be typed by all
			
 
				+    developers -- we already require that keywords, and thus all ASCII letters,
			
 
				+    can be typed.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   An overarching goal of the Carbon project is to provide a language that is
			
 
				+    inclusive and welcoming. A language that does not permit names and comments
			
 
				+    in programs to be expressed in the developer's native language will not meet
			
 
				+    that goal for at least some of our developers.
			
 
				+-   Quoted strings will be substantially less readable if non-ASCII printable
			
 
				+    characters are required to be written as escape sequences.
			
 
				+
			
 
				+### Byte order marks
			
 
				+
			
 
				+**We could disallow byte order marks.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Marginal implementation simplicity.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Several major editors, particularly on the Windows platform, insert UTF-8
			
 
				+    BOMs and use them to identify file encoding.
			
 
				+
			
 
				+### Normalization forms
			
 
				+
			
 
				+**We could require a different normalization form.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Some environments might more naturally produce a different normalization
			
 
				+    form.
			
 
				+-   Normalization Form D is more uniform, in that characters are always
			
 
				+    maximally decomposed into combining characters; in NFC, characters may or
			
 
				+    may not be decomposed depending on whether a composed form is available.
			
 
				+    -   NFD may be more suitable for certain uses such as typo correction,
			
 
				+        homoglyph detection, or code completion.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   The C++ standard and community are moving towards using NFC:
			
 
				+
			
 
				+    -   WG21 is in the process of adopting an NFC requirement for C++
			
 
				+        identifiers.
			
 
				+    -   GCC warns on C++ identifiers that aren't in NFC.
			
 
				+
			
 
				+    As a consequence, we should expect that the tooling and development
			
 
				+    environments that C++ developers are using will provide good support for
			
 
				+    authoring NFC-encoded source files.
			
 
				+
			
 
				+-   The W3C recommends using NFC for all content, so code samples distributed on
			
 
				+    webpages may be canonicalized into NFC by some web authoring tools.
			
 
				+
			
 
				+-   NFC produces smaller encodings than NFD in all cases where they differ.
			
 
				+
			
 
				+**We could require no normalization form and compare identifiers by code point
			
 
				+sequence.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   This is the rule in use in C++20 and before.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   This is not the rule planned for the near future of C++.
			
 
				+-   Different representations of the same character may result in different
			
 
				+    identifiers, in a way that is likely to be invisible in most programming
			
 
				+    environments.
			
 
				+
			
 
				+**We could require no normalization form, and normalize the source code
			
 
				+ourselves.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   We would treat source text identically regardless of the normalization form.
			
 
				+-   Developers would not be responsible for ensuring that their editing
			
 
				+    environment produces and preserves the proper normalization form.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   There is substantially more implementation cost involved in normalizing
			
 
				+    identifiers than in detecting whether they are in normal form. While this
			
 
				+    proposal would require the implementation complexity of converting into NFC
			
 
				+    in the formatting tool, it would not require the conversion cost to be paid
			
 
				+    during compilation.
			
 
				+
			
 
				+    A high-quality implementation may choose to accept this cost anyway, in
			
 
				+    order to better recover from errors. Moreover, it is possible to
			
 
				+    [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization)
			
 
				+    and do the conversion only when necessary. However, if non-canonical source
			
 
				+    is formally valid, there are more stringent performance constraints on such
			
 
				+    conversion than if it is only done for error recovery.
			
 
				+
			
 
				+-   Tools such as `grep` do not perform normalization themselves, and so would
			
 
				+    be unreliable when applied to a codebase with inconsistent normalization.
			
 
				+-   GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
			
 
				+    process of adopting an
			
 
				+    [NFC requirement for C++ identifiers](http://wg21.link/P1949R6), so
			
 
				+    development environments should be expected to increasingly accommodate
			
 
				+    production of text in NFC.
			
 
				+-   The byte representation of a source file may be unstable if different
			
 
				+    editing environments make different normalization choices, creating problems
			
 
				+    for revision control systems, patch files, and the like.
			
 
				+-   Normalizing the contents of string literals, rather than using their
			
 
				+    contents unaltered, will introduce a risk of user surprise.
			
 
				+
			
 
				+**We could require only identifiers, or only identifiers and comments, to be
			
 
				+normalized, rather than the entire input file.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   This would provide more freedom in comments to use arbitrary text.
			
 
				+-   String literals could contain intentionally non-normalized text in order to
			
 
				+    represent non-normalized strings.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Within string literals, this would result in invisible semantic differences:
			
 
				+    strings that render identically can have different meanings.
			
 
				+-   The semantics of the program could vary if its sources are normalized, which
			
 
				+    an editing environment might do invisibly and automatically.
			
 
				+-   If an editing environment were to automatically normalize text, it would
			
 
				+    introduce spurious diffs into changes.
			
 
				+-   We would need to be careful to ensure that no string or comment delimiter
			
 
				+    ends with a code point sequence that is a prefix of a decomposition of
			
 
				+    another code point, otherwise different normalizations of the same source
			
 
				+    file could tokenize differently.
			
--- a/docs/design/lexical_conventions.md
+++ b/docs/design/lexical_conventions.md
@@ -1,25 +0,0 @@
 
				-# Lexical conventions
			
 
				-
			
 
				-<!--
			
 
				-Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				-Exceptions. See /LICENSE for license information.
			
 
				-SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				--->
			
 
				-
			
 
				-## Table of contents
			
 
				-
			
 
				-<!-- toc -->
			
 
				-
			
 
				--   [TODO](#todo)
			
 
				-
			
 
				-<!-- tocstop -->
			
 
				-
			
 
				-## TODO
			
 
				-
			
 
				-This is a skeletal design, added to support [the overview](README.md). It should
			
 
				-not be treated as accepted by the core team; rather, it is a placeholder until
			
 
				-we have more time to examine this detail. Please feel welcome to rewrite and
			
 
				-update as appropriate.
			
 
				-
			
 
				-See [PR 17](https://github.com/carbon-language/carbon-lang/pull/17) for context
			
 
				--- that proposal may replace this.
			
--- a/docs/design/lexical_conventions/README.md
+++ b/docs/design/lexical_conventions/README.md
@@ -0,0 +1,44 @@
 
				+# Lexical conventions
			
 
				+
			
 
				+<!--
			
 
				+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				+Exceptions. See /LICENSE for license information.
			
 
				+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				+-->
			
 
				+
			
 
				+## Table of contents
			
 
				+
			
 
				+<!-- toc -->
			
 
				+
			
 
				+-   [TODO](#todo)
			
 
				+-   [Lexical elements](#lexical-elements)
			
 
				+
			
 
				+<!-- tocstop -->
			
 
				+
			
 
				+## TODO
			
 
				+
			
 
				+This is a skeletal design, added to support
			
 
				+[the overview](/docs/design/README.md). It should not be treated as accepted by
			
 
				+the core team; rather, it is a placeholder until we have more time to examine
			
 
				+this detail. Please feel welcome to rewrite and update as appropriate.
			
 
				+
			
 
				+See [PR 17](https://github.com/carbon-language/carbon-lang/pull/17) for context
			
 
				+-- that proposal may replace this.
			
 
				+
			
 
				+## Lexical elements
			
 
				+
			
 
				+The first stage of processing a
			
 
				+[source file](/docs/design/code_and_name_organization/source_files.md) is the
			
 
				+division of the source file into lexical elements.
			
 
				+
			
 
				+A _lexical element_ is one of the following:
			
 
				+
			
 
				+-   a maximal sequence of [whitespace](whitespace.md) characters
			
 
				+-   a [word](words.md)
			
 
				+-   a literal:
			
 
				+    -   a [numeric literal](numeric_literals.md)
			
 
				+    -   TODO: string literals
			
 
				+-   TODO: operators, comments, ...
			
 
				+
			
 
				+The sequence of lexical elements is formed by repeatedly removing the longest
			
 
				+initial sequence of characters that forms a valid lexical element.
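The longest-match rule can be sketched as follows. This is a simplified illustration rather than the actual lexer: the patterns in `ELEMENT_PATTERNS` are assumptions that cover only ASCII whitespace, ASCII words, and integer literals without digit separators, and the function name is hypothetical.

```python
import re

# Hypothetical, simplified patterns; the real element grammars are
# defined in the linked documents.
ELEMENT_PATTERNS = [
    ("whitespace", re.compile(r"[ \t\n]+")),
    ("word", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("numeric_literal", re.compile(r"0x[0-9A-F]+|0b[01]+|0|[1-9][0-9]*")),
]


def split_lexical_elements(text):
    """Repeatedly remove the longest prefix that forms a lexical element."""
    elements = []
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in ELEMENT_PATTERNS:
            match = pattern.match(text, pos)
            # Keep whichever element kind consumes the most characters.
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise ValueError(f"no lexical element at offset {pos}")
        elements.append((best_kind, text[pos:best_end]))
        pos = best_end
    return elements
```

For example, `split_lexical_elements("foo 0x1F")` yields a word, a whitespace run, and a numeric literal, in that order.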
			
--- a/docs/design/lexical_conventions/numeric_literals.md
+++ b/docs/design/lexical_conventions/numeric_literals.md
@@ -0,0 +1,483 @@
 
				+# Numeric literals
			
 
				+
			
 
				+<!--
			
 
				+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				+Exceptions. See /LICENSE for license information.
			
 
				+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				+-->
			
 
				+
			
 
				+## Table of contents
			
 
				+
			
 
				+<!-- toc -->
			
 
				+
			
 
				+-   [Overview](#overview)
			
 
				+-   [Details](#details)
			
 
				+    -   [Integer literals](#integer-literals)
			
 
				+    -   [Real number literals](#real-number-literals)
			
 
				+        -   [Ties](#ties)
			
 
				+    -   [Digit separators](#digit-separators)
			
 
				+-   [Alternatives](#alternatives)
			
 
				+    -   [Integer bases](#integer-bases)
			
 
				+        -   [Octal literals](#octal-literals)
			
 
				+        -   [Decimal literals](#decimal-literals)
			
 
				+        -   [Case sensitivity](#case-sensitivity)
			
 
				+    -   [Real number syntax](#real-number-syntax)
			
 
				+    -   [Digit separator syntax](#digit-separator-syntax)
			
 
				+    -   [Digit separator positioning](#digit-separator-positioning)
			
 
				+
			
 
				+<!-- tocstop -->
			
 
				+
			
 
				+## Overview
			
 
				+
			
 
				+The following syntaxes are supported:
			
 
				+
			
 
				+-   Integer literals
			
 
				+    -   `12345` (decimal)
			
 
				+    -   `0x1FE` (hexadecimal)
			
 
				+    -   `0b1010` (binary)
			
 
				+-   Real number literals
			
 
				+    -   `123.456` (digits on both sides of the `.`)
			
 
				+    -   `123.456e789` (optional `+` or `-` after the `e`)
			
 
				+    -   `0x1.2p123` (optional `+` or `-` after the `p`)
			
 
				+-   Digit separators (`_`) may be used, but only in conventional locations
			
 
				+
			
 
				+Note that real number literals always contain a `.` with digits on both sides,
			
 
				+and integer literals never contain a `.`.
			
 
				+
			
 
				+Literals are case-sensitive. Unlike in C++, literals do not have a suffix to
			
 
				+indicate their type.
			
 
				+
			
 
				+## Details
			
 
				+
			
 
				+### Integer literals
			
 
				+
			
 
				+Decimal integers are written as a non-zero decimal digit followed by zero or
			
 
				+more additional decimal digits, or as a single `0`.
			
 
				+
			
 
				+Integers in other bases are written as a `0` followed by a base specifier
			
 
				+character, followed by a sequence of digits in the corresponding base. The
			
 
				+available base specifiers and corresponding bases are:
			
 
				+
			
 
				+| Base specifier | Base | Digits                   |
			
 
				+| -------------- | ---- | ------------------------ |
			
 
				+| `b`            | 2    | `0` and `1`              |
			
 
				+| `x`            | 16   | `0` ... `9`, `A` ... `F` |
			
 
				+
			
 
				+The above table is case-sensitive. For example, `0b1` and `0x1A` are valid, and
			
 
				+`0B1`, `0X1A`, and `0x1a` are invalid.
			
 
				+
			
 
				+A zero at the start of a literal can never be followed by another digit: either
			
 
				+the literal is `0`, the `0` begins a base specifier, or the next character is a
			
 
				+decimal point (see below). No support is provided for octal literals, and any C
			
 
				+or C++ octal literal (other than `0`) is invalid in Carbon.
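As an informal check of these rules, the integer literal syntax (ignoring digit separators, which are covered separately) can be written as a regular expression. The sketch below uses Python; the names `INTEGER_LITERAL` and `is_integer_literal` are illustrative and not part of any Carbon tool.

```python
import re

# Decimal: a single 0, or a non-zero digit followed by more digits.
# Binary: 0b followed by binary digits.
# Hexadecimal: 0x followed by digits 0-9 and uppercase A-F.
INTEGER_LITERAL = re.compile(r"0|[1-9][0-9]*|0b[01]+|0x[0-9A-F]+")


def is_integer_literal(text):
    """Return True if text is a separator-free integer literal."""
    return INTEGER_LITERAL.fullmatch(text) is not None
```

With this pattern, `0b1` and `0x1A` are accepted, while `0B1`, `0X1A`, `0x1a`, and the C-style octal `0123` are rejected.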
			
 
				+
			
 
				+### Real number literals
			
 
				+
			
 
				+Real numbers are written as a decimal or hexadecimal integer followed by a
			
 
				+period (`.`) followed by a sequence of one or more decimal or hexadecimal
			
 
				+digits, respectively. A digit is required on each side of the period. `0.` and
			
 
				+`.3` are both invalid.
			
 
				+
			
 
				+A real number can be followed by an exponent character, an optional `+` or `-`
			
 
				+(defaulting to `+` if absent), and a character sequence matching the grammar of
			
 
				+a decimal integer with some value _N_. For a decimal real number, the exponent
			
 
				+character is `e`, and the effect is to multiply the given value by
			
 
				+10<sup>&plusmn;_N_</sup>. For a hexadecimal real number, the exponent character
			
 
				+is `p`, and the effect is to multiply the given value by
			
 
				+2<sup>&plusmn;_N_</sup>. The exponent suffix is optional for both decimal and
			
 
				+hexadecimal real numbers.
			
 
				+
			
 
				+Note that a decimal integer followed by `e` is not a real number literal. For
			
 
				+example, `3e10` is not a valid literal.
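The syntax described so far can be approximated with a regular expression. This Python sketch is illustrative only: it ignores digit separators, and the names `REAL_LITERAL` and `is_real_literal` are hypothetical.

```python
import re

# A decimal integer: a single 0, or a non-zero digit and more digits.
_DEC_INT = r"(?:0|[1-9][0-9]*)"

REAL_LITERAL = re.compile(
    # Decimal: integer, period, digits, optional e exponent.
    rf"{_DEC_INT}\.[0-9]+(?:e[+-]?{_DEC_INT})?"
    # Hexadecimal: 0x integer, period, hex digits, optional p exponent.
    rf"|0x[0-9A-F]+\.[0-9A-F]+(?:p[+-]?{_DEC_INT})?"
)


def is_real_literal(text):
    """Return True if text is a separator-free real number literal."""
    return REAL_LITERAL.fullmatch(text) is not None
```

The pattern accepts `123.456e789` and `0x1.EEFp+5`, and rejects `0.`, `.3`, and `3e10` because each lacks a digit on one side of the period or lacks the period entirely.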
			
 
				+
			
 
				+When a real number literal is interpreted as a value of a real number type, its
			
 
				+value is the representable real number closest to the value of the literal. In
			
 
				+the case of a [tie](#ties), the conversion to the real number type is invalid.
			
 
				+
			
 
				+The decimal real number syntax allows for any decimal fraction to be expressed
			
 
				+-- that is, any number of the form _a_ &times; 10<sup>-_b_</sup>, where _a_ is an
			
 
				+integer and _b_ is a non-negative integer. Because the decimal fractions are
			
 
				+dense in the reals and the set of values of the real number type is assumed to
			
 
				+be discrete, every value of the real number type can be expressed as a real
			
 
				+number literal. However, for certain applications, directly expressing the
			
 
				+intended real number representation may be more convenient than producing a
			
 
				+decimal equivalent that is known to convert to the intended value. Hexadecimal
			
 
				+real number literals are provided in order to permit values of binary floating
			
 
				+or fixed point real number types to be expressed directly.
			
 
				+
			
 
				+#### Ties
			
 
				+
			
 
				+As described above, a real number literal that lies exactly between two
			
 
				+representable values for its target type is invalid. Such ties are extremely
			
 
				+unlikely to occur by accident: for example, when interpreting a literal as
			
 
				+`Float64`, `1.` would need to be followed by exactly 53 decimal digits (followed
			
 
				+by zero or more `0`s) to land exactly half-way between two representable values,
			
 
				+and the probability of `1.` followed by a random 53-digit sequence resulting in
			
 
				+such a tie is one in 5<sup>53</sup>, or about
			
 
				+0.000000000000000000000000000000000009%. For `Float32`, it's about
			
 
				+0.000000000000001%, and even for a typical `Float16` implementation with 10
			
 
				+fractional bits, it's around 0.00001%.
			
 
				+
			
 
				+Ties are much easier to express as hexadecimal floating-point literals: for
			
 
				+example, `0x1.0000_0000_0000_08p+0` is exactly half way between `1.0` and the
			
 
				+smallest `Float64` value greater than `1.0`, which is `0x1.0000_0000_0000_1p+0`.
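The hexadecimal example can be verified with exact rational arithmetic. The Python sketch below assumes that Python's `float` is an IEEE 754 binary64 value, which stands in for `Float64` here; digit separators are omitted because `float.fromhex` does not accept them.

```python
from fractions import Fraction

# The two neighboring representable values, converted exactly.
below = Fraction(1.0)
above = Fraction(float.fromhex("0x1.0000000000001p+0"))

# The literal 0x1.00000000000008p+0, computed exactly: one plus
# the digits 08 ending at the 14th hexadecimal place.
literal = 1 + Fraction(0x08, 16**14)

# The neighbors are one unit in the last place apart (2^-52), and
# the literal is equidistant from both, so it is a tie.
gap = above - below
is_tie = (literal - below) == (above - literal)
```

Here `gap` equals 2<sup>-52</sup> and `is_tie` is true, so under the rule above this literal could not be converted to `Float64`.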
			
 
				+
			
 
				+Whether written in decimal or hexadecimal, a tie provides very strong evidence
			
 
				+that the developer intended to express a precise floating-point value, and
			
 
				+provided one bit too much precision (or one bit too little, depending on whether
			
 
				+they expected some rounding to occur), so rejecting the literal is preferred
			
 
				+over making an arbitrary choice between the two possible values.
			
 
				+
			
 
				+### Digit separators
			
 
				+
			
 
				+If digit separators (`_`) are included in literals, they must meet the
			
 
				+respective condition:
			
 
				+
			
 
				+-   For decimal integers, the digit separators shall occur every three digits
			
 
				+    starting from the right. For example, `2_147_483_648`.
			
 
				+-   For hexadecimal integers, the digit separators shall occur every four digits
			
 
				+    starting from the right. For example, `0x7FFF_FFFF`.
			
 
				+-   For real number literals, digit separators can appear in the decimal and
			
 
				+    hexadecimal integer portions (prior to the period and after the optional `e`
			
 
				+    or mandatory `p`) as described in the previous bullets. For example,
			
 
				+    `2_147.483648e12_345` or `0x1_00CA.FEF00Dp+24`.
			
 
				+-   For binary literals, digit separators can appear between any two digits. For
			
 
				+    example, `0b1_000_101_11`.
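For integer literals, the placement rules above can be expressed as regular expressions. This Python sketch is one possible encoding under the reading that separators are optional but, once used, must appear at every conventional position; the pattern and function names are hypothetical.

```python
import re

# Decimal: no separators, or groups of three digits from the right.
DECIMAL = re.compile(r"0|[1-9][0-9]*|[1-9][0-9]{0,2}(?:_[0-9]{3})+")
# Hexadecimal: no separators, or groups of four digits from the right.
HEXADECIMAL = re.compile(r"0x(?:[0-9A-F]+|[0-9A-F]{1,4}(?:_[0-9A-F]{4})+)")
# Binary: a single separator may appear between any two digits.
BINARY = re.compile(r"0b[01](?:_?[01])*")


def is_separated_integer_literal(text):
    """Return True if text is an integer literal with valid separators."""
    patterns = (DECIMAL, HEXADECIMAL, BINARY)
    return any(p.fullmatch(text) is not None for p in patterns)
```

Under these patterns `2_147_483_648`, `0x7FFF_FFFF`, and `0b1_000_101_11` are accepted, while `21_47` and `0x7FF_FF` are rejected.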
			
 
				+
			
 
				+## Alternatives
			
 
				+
			
 
				+### Integer bases
			
 
				+
			
 
				+#### Octal literals
			
 
				+
			
 
				+No support is proposed for octal literals. In practice, their appearance in C
			
 
				+and C++ code in a sample corpus consisted of (in decreasing order of commonality
			
 
				+and excluding `0` literals):
			
 
				+
			
 
				+-   file permissions,
			
 
				+-   cases where decimal was clearly intended (`CivilDay(2020, 04, 01)`), and
			
 
				+-   (in _distant_ third place) anything else.
			
 
				+
			
 
				+The number of intentional uses of octal literals, other than in file
			
 
				+permissions, was negligible. We considered the following alternatives:
			
 
				+
			
 
				+**Alternative 1:** Follow C and C++, and use `0` as the base prefix for octal.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to C++ and other languages.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Subtle and error-prone rule: for example, left-padding with zeroes for
			
 
				+    alignment changes the meaning of literals.
			
 
				+
			
 
				+**Alternative 2:** Use `0o` as the base prefix for octal.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Unlikely to be misinterpreted as decimal.
			
 
				+-   Follows several other languages (for example, Python).
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Additional language complexity.
			
 
				+
			
 
				+If we decide we want to introduce octal literals at a later date, use of
			
 
				+alternative 2 is suggested.
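
Python, cited above as a language that uses `0o`, gives a concrete picture of how alternative 2 avoids the padding hazard of alternative 1. The sketch below relies only on Python's built-in `int()`; the helper names are hypothetical and nothing here describes Carbon's behavior.

```python
def parse_source_style(text: str) -> int:
    """Parse text with Python's own integer-literal rules (base 0)."""
    return int(text, 0)


def is_accepted(text: str) -> bool:
    """Report whether Python's literal rules accept the text at all."""
    try:
        parse_source_style(text)
    except ValueError:
        return False
    return True


explicit_octal = parse_source_style("0o17")  # explicit prefix, read as octal
padded_accepted = is_accepted("017")         # bare leading zero on a nonzero value
c_style_value = int("017", 8)                # the value C and C++ would assign
```

Here the explicit `0o17` is read as octal, the zero-padded `017` is rejected outright instead of silently changing value, and the last line shows the C-style reading for comparison.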
			
 
				+
			
 
				+#### Decimal literals
			
 
				+
			
 
				+**We could permit leading `0`s in decimal integers (and in floating-point
			
 
				+numbers).**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   We would allow leading `0`s to be used to align columns of numbers.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   The same literal could be valid but have a different value in C++ and
			
 
				+    Carbon.
			
 
				+
			
 
				+**We could add an (optional) base specifier `0d` for decimal integers.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Uniform treatment of all bases. Left-padding with `0` could be achieved by
			
 
				+    using `0d000123`.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   No evidence of need for this functionality.
			
 
				+
			
 
				+**We could permit an `e` in decimal literals to express large powers of 10.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Many uses of literals such as `1e6` in our sample C++ corpus intend to form an integer
			
 
				+    literal instead of a floating-point literal.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Would violate the expectations of many C++ programmers used to `e`
			
 
				+    indicating a floating-point constant.
			
 
				+
			
 
				+#### Case sensitivity
			
 
				+
			
 
				+**We could make base specifiers case-insensitive.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to C++.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   `0B1` is easily mistaken for `081`.
			
 
				+-   `0B1` can be confused with `0xB1`.
			
 
				+-   `0O17` is easily mistaken for `0017`.
			
 
				+-   Allowing more than one way to write literals will lead to style divergence.
			
 
				+
			
 
				+**We could make the digit sequence in hexadecimal integers case-insensitive.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to C++.
			
 
				+-   Some developers will be more comfortable writing hexadecimal digits in
			
 
				+    lowercase. Some tools, such as `md5`, will print lowercase.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Allowing more than one way to write literals will lead to style divergence.
			
 
				+-   Lowercase hexadecimal digits are less visually distinct from the `x` base
			
 
				+    specifier (for example, the digit sequence is more visually distinct in
			
 
				+    `0xAC` than in `0xac`).
			
 
				+
			
 
				+**We could require the digit sequence in hexadecimal integers to be written
			
 
				+using lowercase letters `a`..`f`.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Some developers will be more comfortable writing hexadecimal digits in
			
 
				+    lowercase. Some tools, such as `md5`, will print lowercase.
			
 
				+-   `B` and `D` are more likely to be confused with `8` and `0` than `b` and `d`
			
 
				+    are.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Some developers will be more comfortable writing hexadecimal digits in
			
 
				+    uppercase. Some tools will print uppercase.
			
 
				+-   Lowercase hexadecimal digits are less visually distinct from the `x` base
			
 
				+    specifier (for example, the digit sequence is more visually distinct in
			
 
				+    `0xAC` than in `0xac`).
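
To make the case rules discussed in this section concrete, the following sketch accepts only a lowercase `0x` base specifier followed by uppercase hexadecimal digits. It is a simplified illustration that ignores separator placement; the names `HEX_INTEGER` and `is_hex_integer` are hypothetical.

```python
import re

# Lowercase base specifier, uppercase digits, optional `_` between digits.
HEX_INTEGER = re.compile(r"0x[0-9A-F]+(?:_[0-9A-F]+)*")


def is_hex_integer(text: str) -> bool:
    """Return True for a hexadecimal integer written in the required case."""
    return HEX_INTEGER.fullmatch(text) is not None
```

Under these assumptions `0xAC` is accepted, while `0xac` and `0XAC` are both rejected, so there is a single spelling for each value.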
			
 
				+
			
 
				+### Real number syntax
			
 
				+
			
 
				+**We could allow real numbers with no digits on one side of the period (`3.` or
			
 
				+`.5`).**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to C++.
			
 
				+-   Allows numbers to be expressed more tersely.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Claims the `tup.0` syntax for real literals (`.0`), although it may be more useful for indexing tuples.
			
 
				+-   Claims the `0.ToString()` syntax for real literals (`0.`), although it may be more useful for performing
			
 
				+    member access on literals.
			
 
				+-   May harm readability by making the difference between an integer literal and
			
 
				+    a real number literal less significant.
			
 
				+-   Allowing more than one way to write literals will lead to style divergence.
			
 
				+
			
 
				+See also the section on
			
 
				+[floating-point literals](https://google.github.io/styleguide/cppguide.html#Floating_Literals)
			
 
				+in the Google style guide, which argues for the same rule.
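
The rule that a digit is required on both sides of the period is easy to express as a pattern. This is a rough sketch for decimal real literals only, assuming an optional lowercase `e` exponent with an optional sign and ignoring digit separators; `DECIMAL_REAL` and `is_decimal_real` are hypothetical names.

```python
import re

# One or more digits, a period, one or more digits, then an optional exponent.
DECIMAL_REAL = re.compile(r"[0-9]+\.[0-9]+(?:e[+-]?[0-9]+)?")


def is_decimal_real(text: str) -> bool:
    """Return True if text has digits on both sides of the period."""
    return DECIMAL_REAL.fullmatch(text) is not None
```

With this pattern `3.0` and `0.5e-3` match, while `3.`, `.5`, and `1e100` do not.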
			
 
				+
			
 
				+**We could allow a real number with an `e` or `p` exponent to omit the period (`1e100`).**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to C++.
			
 
				+-   Allows numbers to be expressed more tersely.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Assuming that such numbers are integers rather than real numbers is a common
			
 
				+    error in C++.
			
 
				+
			
 
				+**We could allow the `e` or `p` to be written in uppercase.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to C++.
			
 
				+-   Most calculators use `E`, to avoid confusion with the constant `e`.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Allowing more than one way to write literals will lead to style divergence.
			
 
				+-   `E` may be confused with a hexadecimal digit.
			
 
				+
			
 
				+**We could require a `p` in a hexadecimal real number literal.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to C++.
			
 
				+-   When explicitly writing a bit-pattern for a floating-point type, it's
			
 
				+    reasonable to always include the exponent value.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Less consistent.
			
 
				+-   Makes hexadecimal floating-point values even more expert-only.
			
 
				+
			
 
				+**We could arbitrarily pick one of the two values when a real number is exactly
			
 
				+half-way between two representable values.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to C++.
			
 
				+-   Would accept more cases, and it's likely that either of the two possible
			
 
				+    values would be acceptable in practice.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Would either need to specify which option is chosen or, following C++,
			
 
				+    accept that programs using such literals have non-portable semantics.
			
 
				+-   Numbers specified to the exact level of precision required to form a tie are
			
 
				+    a strong signal that the programmer intended to specify a particular value.
			
 
				+
			
 
				+### Digit separator syntax
			
 
				+
			
 
				+We considered the following characters as digit separators:
			
 
				+
			
 
				+**Status quo:** `_` as a digit separator.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Follows the convention of C#, Java, JavaScript, Python, D, Ruby, Rust, Swift,
			
 
				+    ...
			
 
				+-   Culturally agnostic, because it doesn't match any common human writing
			
 
				+    convention.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Underscore is not used as a digit grouping separator in any common human
			
 
				+    writing convention.
			
 
				+
			
 
				+**Alternative 1:** `'` as a digit separator.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Follows C++ syntax.
			
 
				+-   Used in several (mostly European) writing conventions.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   `'` is also likely to be used to introduce character literals.
			
 
				+
			
 
				+**Alternative 2:** `,` as a digit separator.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More similar to how numbers are written in English text and many other
			
 
				+    cultures.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Commas are expected to be widely used in Carbon programs for other purposes,
			
 
				+    where there may be digits on both sides of the comma. For example, there
			
 
				+    could be readability problems if `f(1, 234)` called `f` with two arguments
			
 
				+    but `f(1,234)` called `f` with a single argument.
			
 
				+-   Comma is interpreted as a decimal point in the conventions of many cultures.
			
 
				+-   Unprecedented in common programming languages.
			
 
				+
			
 
				+**Alternative 3:** whitespace as a digit separator.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Used and understood by many cultures.
			
 
				+-   Never interpreted as a decimal point instead of a grouping separator.
			
 
				+-   Also usable to the right of a decimal point.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Omitted separators in lists of numbers may result in distinct numbers being
			
 
				+    spliced together. For example, `f(1, 23, 4 567)` may be interpreted as three
			
 
				+    separate numerical arguments instead of four arguments with a missing comma.
			
 
				+-   Unprecedented in other programming languages.
			
 
				+
			
 
				+**Alternative 4:** `.` as a digit separator, `,` as decimal point.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More familiar to cultures that write numbers this way.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   As with `,` as a digit separator, `,` as a decimal point is problematic.
			
 
				+-   This usage is unfamiliar and would be surprising to programmers; programmers
			
 
				+    from cultures where `,` is the decimal point in regular writing are likely
			
 
				+    already accustomed to using `.` as the decimal point in programming
			
 
				+    environments, and the converse is not true.
			
 
				+
			
 
				+**Alternative 5:** No digit separator syntax.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Simpler language rules.
			
 
				+-   More consistent source syntax, as there is no choice as to whether to use
			
 
				+    digit separators or not.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Harms the readability of long literals.
			
 
				+
			
 
				+### Digit separator positioning
			
 
				+
			
 
				+**Alternative 1:** allow any digit groupings (for example, `123_4567_89`).
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Simpler, more flexible rule that may allow some groupings that are
			
 
				+    conventional in a specific domain. For example, `var Date: d = 01_12_1983;`,
			
 
				+    or `var Int64: time_in_microseconds = 123456_000000;`.
			
 
				+-   Culturally agnostic. For example, the Indian convention for digit separators
			
 
				+    would group the last three digits, and then every two digits before that
			
 
				+    (1,23,45,678 could be written `1_23_45_678`).
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Less self-checking that numeric literals are interpreted the way that the
			
 
				+    author intends.
			
 
				+
			
 
				+**Alternative 2:** as above, but additionally require binary digits to be
			
 
				+grouped in 4s.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   More enforcement that digit grouping is conventional.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   No clear, established rule for how to group binary digits. In some cases, 8
			
 
				+    digit groups may be more conventional.
			
 
				+-   When used to express literals involving bit-fields, arbitrary grouping may
			
 
				+    be desirable. For example:
			
 
				+
			
 
				+    ```carbon
			
 
				+    var Float32: flt_max =
			
 
				+      BitCast(Float32, 0b0_11111110_11111111111111111111111);
			
 
				+    ```
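
The grouping in the bit-field example above follows the sign, exponent, and mantissa fields of an IEEE 754 single-precision value. The Python sketch below rebuilds the same bit pattern from its three fields and reinterprets it as a float; it is only an illustration of why irregular grouping is readable, and all names are hypothetical.

```python
import struct

# Field widths of an IEEE 754 binary32 value: 1 sign, 8 exponent, 23 mantissa.
SIGN = 0b0
EXPONENT = 0b11111110
MANTISSA = (1 << 23) - 1  # 23 one bits

bits = (SIGN << 31) | (EXPONENT << 23) | MANTISSA

# The same value written as one literal, grouped by field with `_`.
literal_text = "0b0_11111110_" + "1" * 23
same_bits = int(literal_text, 0)

# Reinterpret the 32-bit pattern as a float, as BitCast does in the example.
flt_max = struct.unpack(">f", struct.pack(">I", bits))[0]
```

Grouping by field makes it possible to read the sign, exponent, and mantissa directly from the literal, which a fixed group size of 4 or 8 would obscure.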
			
 
				+
			
 
				+**Alternative 3:** allow any regular grouping.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Can be applied uniformly to all bases.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Provides no assistance for decimal numbers with a single digit separator.
			
 
				+-   Does not allow binary literals to express an intent to initialize irregular
			
 
				+    bit-fields.
			
--- a/docs/design/lexical_conventions/whitespace.md
+++ b/docs/design/lexical_conventions/whitespace.md
@@ -0,0 +1,42 @@
 
				+# Whitespace
			
 
				+
			
 
				+<!--
			
 
				+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				+Exceptions. See /LICENSE for license information.
			
 
				+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				+-->
			
 
				+
			
 
				+## Table of contents
			
 
				+
			
 
				+<!-- toc -->
			
 
				+
			
 
				+-   [Overview](#overview)
			
 
				+
			
 
				+<!-- tocstop -->
			
 
				+
			
 
				+## Overview
			
 
				+
			
 
				+The exact lexical form of Carbon whitespace has not yet been settled. However,
			
 
				+Carbon will follow lexical conventions for whitespace based on
			
 
				+[Unicode Annex #31](https://unicode.org/reports/tr31/). TODO: Update this once
			
 
				+the precise rules are decided; see the
			
 
				+[Unicode source files](/proposals/p0142.md#characters-in-identifiers) proposal.
			
 
				+
			
 
				+Unicode Annex #31 suggests selecting whitespace characters based on the
			
 
				+characters with Unicode property `Pattern_White_Space`, which is currently these
			
 
				+11 characters:
			
 
				+
			
 
				+-   U+0009 CHARACTER TABULATION (horizontal tab)
			
 
				+-   U+000A LINE FEED (traditional newline)
			
 
				+-   U+000B LINE TABULATION (vertical tab)
			
 
				+-   U+000C FORM FEED (page break)
			
 
				+-   U+000D CARRIAGE RETURN
			
 
				+-   U+0020 SPACE
			
 
				+-   U+0085 NEXT LINE (Unicode newline)
			
 
				+-   U+200E LEFT-TO-RIGHT MARK
			
 
				+-   U+200F RIGHT-TO-LEFT MARK
			
 
				+-   U+2028 LINE SEPARATOR
			
 
				+-   U+2029 PARAGRAPH SEPARATOR
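
For reference, the list above translates directly into a lookup set. This is a sketch under the assumption that a lexer classifies whitespace by code point; `PATTERN_WHITE_SPACE` and `is_pattern_white_space` are hypothetical names, and Carbon's final rules may differ.

```python
# The 11 code points listed above with the Pattern_White_Space property.
PATTERN_WHITE_SPACE = frozenset({
    0x0009,  # CHARACTER TABULATION
    0x000A,  # LINE FEED
    0x000B,  # LINE TABULATION
    0x000C,  # FORM FEED
    0x000D,  # CARRIAGE RETURN
    0x0020,  # SPACE
    0x0085,  # NEXT LINE
    0x200E,  # LEFT-TO-RIGHT MARK
    0x200F,  # RIGHT-TO-LEFT MARK
    0x2028,  # LINE SEPARATOR
    0x2029,  # PARAGRAPH SEPARATOR
})


def is_pattern_white_space(ch: str) -> bool:
    """Return True if the single character ch is in the set above."""
    return ord(ch) in PATTERN_WHITE_SPACE
```

Note that this set is narrower than what many programmers think of as whitespace: for example, U+00A0 NO-BREAK SPACE is not a member.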
			
 
				+
			
 
				+The quantity and kind of whitespace separating tokens are ignored except where
			
 
				+otherwise specified.
			
--- a/docs/design/lexical_conventions/words.md
+++ b/docs/design/lexical_conventions/words.md
@@ -0,0 +1,49 @@
 
				+# Words
			
 
				+
			
 
				+<!--
			
 
				+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				+Exceptions. See /LICENSE for license information.
			
 
				+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				+-->
			
 
				+
			
 
				+## Table of contents
			
 
				+
			
 
				+<!-- toc -->
			
 
				+
			
 
				+-   [Overview](#overview)
			
 
				+-   [Alternatives](#alternatives)
			
 
				+
			
 
				+<!-- tocstop -->
			
 
				+
			
 
				+## Overview
			
 
				+
			
 
				+A _word_ is a lexical element formed from a sequence of letters or letter-like
			
 
				+characters, such as `fn` or `Foo` or `Int`.
			
 
				+
			
 
				+The exact lexical form of words has not yet been settled. However, Carbon will
			
 
				+follow lexical conventions for identifiers based on
			
 
				+[Unicode Annex #31](https://unicode.org/reports/tr31/). TODO: Update this once
			
 
				+the precise rules are decided; see the
			
 
				+[Unicode source files](/proposals/p0142.md#characters-in-identifiers) proposal.
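
Since Carbon's rules are not yet settled, the following sketch uses Python's identifier rules, which are themselves based on Unicode Annex #31, purely as a stand-in to show the difference between an ASCII-only and a Unicode-aware notion of a word. The function names are hypothetical and the results describe Python, not Carbon.

```python
def is_unicode_word(text: str) -> bool:
    """Accept anything Python treats as an identifier (UAX #31 based)."""
    return text.isidentifier()


def is_ascii_word(text: str) -> bool:
    """Accept only identifiers made entirely of ASCII characters."""
    return text.isascii() and text.isidentifier()
```

Under these stand-in rules, `Foo` is a word either way, while a name such as `Größe` is accepted only by the Unicode-aware check.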
			
 
				+
			
 
				+## Alternatives
			
 
				+
			
 
				+**We could restrict words to ASCII.**
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Reduced implementation complexity.
			
 
				+-   Avoids all problems relating to normalization, homoglyphs, text
			
 
				+    directionality, and so on.
			
 
				+-   We have no intention of using non-ASCII characters in the language syntax or
			
 
				+    in any library name.
			
 
				+-   Provides assurance that all names in libraries can reliably be typed by all
			
 
				+    developers -- we already require that keywords, and thus all ASCII letters,
			
 
				+    can be typed.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   An overarching goal of the Carbon project is to provide a language that is
			
 
				+    inclusive and welcoming. A language that does not permit names in programs
			
 
				+    to be expressed in the developer's native language will not meet that goal
			
 
				+    for at least some of our developers.