Sfoglia il codice sorgente

Do more precise table dispatch for symbols. (#3287)

When originally switching to the table dispatch approach we discussed
that it'd be nice to disentangle the monolithic symbol lexing routine
with this as we'll typically have fairly precise dispatch. This is
especially true for grouping symbols, which in Carbon are all
constructively one-character (at this point).

I think this provides a substantial improvement to the clarity of the
code by disentangling the different paths. It also allowed a bunch of
simplifications / clarifications to exactly what the behavior with
closing invalid groups actually involves currently.

This was initially motivated by code organization improvements, and any
performance wins were speculative. However, when benchmarking it
surfaced a problem that hadn't been clear -- we're generating too many
distinct functions here, and the table-based dispatch slows down in the
face of that.

So this PR also includes a fix for that, removing the template-generated
fan-out of dispatch functions for distinct symbols. Instead, we have
a dedicated table to translate one character into the token kinds. This
seems to work quite well, avoiding the huge branch-y structure and just
do fairly cheap table translation & dispatch for all one-character
symbols. Building the table requires the token kinds to be default
constructable, so this also enables that and arranges for the zero-value
kind to be the error kind.

Combined, this is a modest speedup for *non* grouping symbols (3-4%,
a bit noisy). And in some cases it is a huge speedup for grouping
symbols (>10%).

Raw benchmark data with 20 runs before/after -- despite the # of runs,
the grouping symbols benchmarks were frustratingly noisy in non-uniform
ways that couldn't fully be accounted for here. Still, this seems like
an overall improvement.

```
BM_RandomSource               7.98ms ± 2%  7.73ms ± 3%   -3.12%  (p=0.000 n=18+19)
BM_GroupingSymbols/1/0/0      5.90ms ± 2%  5.82ms ± 4%   -1.38%  (p=0.001 n=20+20)
BM_GroupingSymbols/2/0/0      5.21ms ± 2%  5.15ms ± 2%   -1.14%  (p=0.002 n=20+18)
BM_GroupingSymbols/3/0/0      4.42ms ± 2%  4.34ms ± 2%   -1.87%  (p=0.000 n=19+18)
BM_GroupingSymbols/4/0/0      4.29ms ± 2%  4.38ms ± 5%     ~     (p=0.297 n=17+20)
BM_GroupingSymbols/8/0/0      5.09ms ±10%  5.10ms ± 7%     ~     (p=0.919 n=18+20)
BM_GroupingSymbols/16/0/0     6.35ms ± 8%  6.29ms ± 6%     ~     (p=0.201 n=20+20)
BM_GroupingSymbols/32/0/0     9.88ms ± 2%  9.83ms ± 1%     ~     (p=0.167 n=18+20)
BM_GroupingSymbols/0/1/0      5.12ms ± 2%  5.01ms ± 2%   -2.14%  (p=0.000 n=20+19)
BM_GroupingSymbols/0/2/0      4.01ms ± 2%  3.93ms ± 4%   -2.03%  (p=0.000 n=20+19)
BM_GroupingSymbols/0/3/0      2.92ms ± 3%  2.81ms ± 2%   -3.87%  (p=0.000 n=20+19)
BM_GroupingSymbols/0/4/0      2.61ms ± 3%  2.47ms ± 2%   -5.30%  (p=0.000 n=20+18)
BM_GroupingSymbols/0/8/0      1.77ms ± 3%  1.61ms ± 2%   -8.91%  (p=0.000 n=18+19)
BM_GroupingSymbols/0/16/0     1.41ms ± 3%  1.16ms ± 4%  -17.66%  (p=0.000 n=20+20)
BM_GroupingSymbols/0/32/0     1.10ms ± 2%  0.92ms ± 3%  -16.36%  (p=0.000 n=20+17)
BM_GroupingSymbols/0/0/1      5.09ms ± 2%  5.03ms ± 3%   -1.11%  (p=0.001 n=20+18)
BM_GroupingSymbols/0/0/2      4.01ms ± 2%  3.91ms ± 2%   -2.67%  (p=0.000 n=20+18)
BM_GroupingSymbols/0/0/3      2.93ms ± 3%  2.81ms ± 2%   -4.23%  (p=0.000 n=20+19)
BM_GroupingSymbols/0/0/4      2.59ms ± 2%  2.48ms ± 3%   -4.48%  (p=0.000 n=20+19)
BM_GroupingSymbols/0/0/8      1.75ms ± 1%  1.62ms ± 3%   -7.65%  (p=0.000 n=17+19)
BM_GroupingSymbols/0/0/16     1.40ms ± 2%  1.15ms ± 3%  -17.67%  (p=0.000 n=19+20)
BM_GroupingSymbols/0/0/32     1.10ms ± 2%  0.92ms ± 3%  -15.91%  (p=0.000 n=20+19)
BM_GroupingSymbols/32/1/0     9.62ms ± 2%  9.65ms ± 2%     ~     (p=0.654 n=18+20)
BM_GroupingSymbols/32/2/0     9.41ms ± 2%  9.37ms ± 2%     ~     (p=0.095 n=20+19)
BM_GroupingSymbols/32/3/0     9.13ms ± 2%  9.13ms ± 3%     ~     (p=0.687 n=19+20)
BM_GroupingSymbols/32/4/0     8.93ms ± 1%  8.87ms ± 2%   -0.69%  (p=0.010 n=20+18)
BM_GroupingSymbols/32/8/0     8.15ms ± 2%  8.14ms ± 3%     ~     (p=0.729 n=19+19)
BM_GroupingSymbols/32/16/0    7.04ms ± 3%  6.92ms ± 1%   -1.71%  (p=0.000 n=20+18)
BM_GroupingSymbols/32/32/0    5.48ms ± 2%  5.38ms ± 3%   -1.81%  (p=0.000 n=20+20)
BM_GroupingSymbols/32/32/1    5.39ms ± 2%  5.29ms ± 2%   -1.87%  (p=0.000 n=19+19)
BM_GroupingSymbols/32/32/2    5.34ms ± 2%  5.21ms ± 1%   -2.45%  (p=0.000 n=20+18)
BM_GroupingSymbols/32/32/3    5.27ms ± 3%  5.16ms ± 2%   -2.18%  (p=0.000 n=20+19)
BM_GroupingSymbols/32/32/4    5.21ms ± 2%  5.10ms ± 3%   -2.11%  (p=0.000 n=19+20)
BM_GroupingSymbols/32/32/8    4.98ms ± 2%  4.83ms ± 2%   -2.85%  (p=0.000 n=19+19)
BM_GroupingSymbols/32/32/16   4.55ms ± 2%  4.45ms ± 2%   -2.25%  (p=0.000 n=18+20)
BM_GroupingSymbols/32/32/32   3.95ms ± 2%  3.84ms ± 2%   -2.98%  (p=0.000 n=19+20)
```
Chandler Carruth 2 anni fa
parent
commit
a79ea4b28d
3 ha cambiato i file con 125 aggiunte e 86 eliminazioni
  1. 4 1
      toolchain/lex/token_kind.def
  2. 2 0
      toolchain/lex/token_kind.h
  3. 119 85
      toolchain/lex/tokenized_buffer.cpp

+ 4 - 1
toolchain/lex/token_kind.def

@@ -32,6 +32,10 @@
 #define CARBON_TOKEN(Name)
 #endif
 
+// The error token comes first because we want it to get the zero value, which
+// will also be used in default initialization.
+CARBON_TOKEN(Error)
+
 #ifndef CARBON_SYMBOL_TOKEN
 #define CARBON_SYMBOL_TOKEN(Name, Spelling) CARBON_TOKEN(Name)
 #endif
@@ -209,7 +213,6 @@ CARBON_TOKEN(StringLiteral)
 CARBON_TOKEN(IntegerTypeLiteral)
 CARBON_TOKEN(UnsignedIntegerTypeLiteral)
 CARBON_TOKEN(FloatingPointTypeLiteral)
-CARBON_TOKEN(Error)
 CARBON_TOKEN(StartOfFile)
 CARBON_TOKEN(EndOfFile)
 

+ 2 - 0
toolchain/lex/token_kind.h

@@ -28,6 +28,8 @@ class TokenKind : public CARBON_ENUM_BASE(TokenKind) {
   // An array of all the keyword tokens.
   static const llvm::ArrayRef<TokenKind> KeywordTokens;
 
+  using EnumBase::EnumBase;
+
   // Test whether this kind of token is a simple symbol sequence (punctuation,
   // not letters) that appears directly in the source text and can be
   // unambiguously lexed with `starts_with` logic. While these may appear

+ 119 - 85
toolchain/lex/tokenized_buffer.cpp

@@ -464,77 +464,106 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
     }
   }
 
-  auto LexSymbolToken(llvm::StringRef& source_text,
-                      TokenKind kind = TokenKind::Error) -> LexResult {
-    auto compute_symbol_kind = [](llvm::StringRef source_text) {
-      return llvm::StringSwitch<TokenKind>(source_text)
-#define CARBON_SYMBOL_TOKEN(Name, Spelling) \
-  .StartsWith(Spelling, TokenKind::Name)
-#include "toolchain/lex/token_kind.def"
-          .Default(TokenKind::Error);
-    };
-
-    // We use the `error` token as a place-holder for cases where one character
-    // isn't enough to pick a definitive symbol token. Recompute the kind using
-    // the full symbol set.
-    if (LLVM_UNLIKELY(kind == TokenKind::Error)) {
-      kind = compute_symbol_kind(source_text);
-      if (kind == TokenKind::Error) {
-        return LexError(source_text);
-      }
-    } else {
-      // Verify in a debug build that the incoming token kind is correct.
-      CARBON_DCHECK(kind == compute_symbol_kind(source_text))
-          << "Incoming token kind '" << kind
-          << "' does not match computed kind '"
-          << compute_symbol_kind(source_text) << "'!";
-    }
+  auto LexOneCharSymbolToken(llvm::StringRef& source_text, TokenKind kind)
+      -> Token {
+    // Verify in a debug build that the incoming token kind is correct.
+    CARBON_DCHECK(kind != TokenKind::Error);
+    CARBON_DCHECK(kind.fixed_spelling().size() == 1);
+    CARBON_DCHECK(source_text.front() == kind.fixed_spelling().front())
+        << "Source text starts with '" << source_text.front()
+        << "' instead of the spelling '" << kind.fixed_spelling()
+        << "' of the incoming token kind '" << kind << "'";
 
     if (!set_indent_) {
       current_line_info_->indent = current_column_;
       set_indent_ = true;
     }
 
-    CloseInvalidOpenGroups(kind);
-
-    const char* location = source_text.begin();
     Token token = buffer_->AddToken(
         {.kind = kind, .token_line = current_line_, .column = current_column_});
-    current_column_ += kind.fixed_spelling().size();
-    source_text = source_text.drop_front(kind.fixed_spelling().size());
+    ++current_column_;
+    source_text = source_text.drop_front();
+    return token;
+  }
+
+  auto LexOpeningSymbolToken(llvm::StringRef& source_text, TokenKind kind)
+      -> LexResult {
+    Token token = LexOneCharSymbolToken(source_text, kind);
+    open_groups_.push_back(token);
+    return token;
+  }
 
-    // Opening symbols just need to be pushed onto our queue of opening groups.
-    if (kind.is_opening_symbol()) {
-      open_groups_.push_back(token);
+  auto LexClosingSymbolToken(llvm::StringRef& source_text, TokenKind kind)
+      -> LexResult {
+    auto unmatched_error = [&] {
+      CARBON_DIAGNOSTIC(
+          UnmatchedClosing, Error,
+          "Closing symbol without a corresponding opening symbol.");
+      emitter_.Emit(source_text.begin(), UnmatchedClosing);
+      Token token = buffer_->AddToken({.kind = TokenKind::Error,
+                                       .token_line = current_line_,
+                                       .column = current_column_,
+                                       .error_length = 1});
+      ++current_column_;
+      source_text = source_text.drop_front();
       return token;
+    };
+
+    // If we have no open groups, this is an error.
+    if (LLVM_UNLIKELY(open_groups_.empty())) {
+      return unmatched_error();
     }
 
-    // Only closing symbols need further special handling.
-    if (!kind.is_closing_symbol()) {
-      return token;
+    Token opening_token = open_groups_.back();
+    // Close any invalid open groups first.
+    if (LLVM_UNLIKELY(buffer_->GetTokenInfo(opening_token).kind !=
+                      kind.opening_symbol())) {
+      CloseInvalidOpenGroups(kind);
+      // This may exhaust the open groups so re-check and re-error if needed.
+      if (open_groups_.empty()) {
+        return unmatched_error();
+      }
+      opening_token = open_groups_.back();
+      CARBON_DCHECK(buffer_->GetTokenInfo(opening_token).kind ==
+                    kind.opening_symbol());
     }
+    open_groups_.pop_back();
 
-    TokenInfo& closing_token_info = buffer_->GetTokenInfo(token);
+    // Now that the groups are all matched up, lex the actual token.
+    Token token = LexOneCharSymbolToken(source_text, kind);
 
-    // Check that there is a matching opening symbol before we consume this as
-    // a closing symbol.
-    if (open_groups_.empty()) {
-      closing_token_info.kind = TokenKind::Error;
-      closing_token_info.error_length = kind.fixed_spelling().size();
+    // Note that it is important to get fresh token infos here as lexing the
+    // open token would invalidate any pointers.
+    buffer_->GetTokenInfo(opening_token).closing_token = token;
+    buffer_->GetTokenInfo(token).opening_token = opening_token;
 
-      CARBON_DIAGNOSTIC(
-          UnmatchedClosing, Error,
-          "Closing symbol without a corresponding opening symbol.");
-      emitter_.Emit(location, UnmatchedClosing);
-      // Note that this still returns true as we do consume a symbol.
-      return token;
+    return token;
+  }
+
+  auto LexSymbolToken(llvm::StringRef& source_text) -> LexResult {
+    // One character symbols and grouping symbols are handled with dedicated
+    // dispatch. We only lex the multi-character tokens here.
+    TokenKind kind = llvm::StringSwitch<TokenKind>(source_text)
+#define CARBON_SYMBOL_TOKEN(Name, Spelling) \
+  .StartsWith(Spelling, TokenKind::Name)
+#define CARBON_ONE_CHAR_SYMBOL_TOKEN(TokenName, Spelling)
+#define CARBON_OPENING_GROUP_SYMBOL_TOKEN(TokenName, Spelling, ClosingName)
+#define CARBON_CLOSING_GROUP_SYMBOL_TOKEN(TokenName, Spelling, OpeningName)
+#include "toolchain/lex/token_kind.def"
+                         .Default(TokenKind::Error);
+    if (kind == TokenKind::Error) {
+      return LexError(source_text);
     }
 
-    // Finally can handle a normal closing symbol.
-    Token opening_token = open_groups_.pop_back_val();
-    TokenInfo& opening_token_info = buffer_->GetTokenInfo(opening_token);
-    opening_token_info.closing_token = token;
-    closing_token_info.opening_token = opening_token;
+    if (!set_indent_) {
+      current_line_info_->indent = current_column_;
+      set_indent_ = true;
+    }
+
+    Token token = buffer_->AddToken(
+        {.kind = kind, .token_line = current_line_, .column = current_column_});
+    current_column_ += kind.fixed_spelling().size();
+    source_text = source_text.drop_front(kind.fixed_spelling().size());
     return token;
   }
 
@@ -587,30 +616,9 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
     return token;
   }
 
-  // Closes all open groups that cannot remain open across the symbol `K`.
+  // Closes all open groups that cannot remain open across a closing symbol.
   // Users may pass `Error` to close all open groups.
-  auto CloseInvalidOpenGroups(TokenKind kind) -> void {
-    // There are two common cases that result in nothing to close. Short circuit
-    // those here.
-    if ((!kind.is_closing_symbol() && kind != TokenKind::Error) ||
-        open_groups_.empty()) {
-      return;
-    }
-
-    // Also check the first open group token to see if it matches this closing
-    // token, in which case there is nothing to do. This is redundant with the
-    // work inside the main loop, but we peel it out to allow inlining.
-    Token opening_token = open_groups_.back();
-    TokenKind opening_kind = buffer_->GetTokenInfo(opening_token).kind;
-    if (kind == opening_kind.closing_symbol()) {
-      return;
-    }
-
-    // Otherwise, delegate to a separate function to help with inlining.
-    CloseInvalidOpenGroupsSlow(kind);
-  }
-
-  [[gnu::noinline]] auto CloseInvalidOpenGroupsSlow(TokenKind kind) -> void {
+  [[gnu::noinline]] auto CloseInvalidOpenGroups(TokenKind kind) -> void {
     CARBON_CHECK(kind.is_closing_symbol() || kind == TokenKind::Error);
     CARBON_CHECK(!open_groups_.empty());
 
@@ -770,7 +778,9 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
 
     // Close any open groups. We do this after marking whitespace, it will
     // preserve that.
-    CloseInvalidOpenGroups(TokenKind::Error);
+    if (!open_groups_.empty()) {
+      CloseInvalidOpenGroups(TokenKind::Error);
+    }
 
     buffer_->AddToken({.kind = TokenKind::EndOfFile,
                        .token_line = current_line_,
@@ -818,14 +828,19 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
   CARBON_DISPATCH_LEX_TOKEN(LexNumericLiteral)
   CARBON_DISPATCH_LEX_TOKEN(LexStringLiteral)
 
-  // A custom dispatch function that pre-selects a symbol token to lex.
-  template <const TokenKind& Token>
-  static auto DispatchLexOneCharSymbol(Lexer& lexer,
-                                       llvm::StringRef& source_text) -> void {
-    LexResult result = lexer.LexSymbolToken(source_text, Token);
-    CARBON_CHECK(result) << "Failed to form a token!";
-    [[clang::musttail]] return DispatchNext(lexer, source_text);
+  // A custom dispatch functions that pre-select the symbol token to lex.
+#define CARBON_DISPATCH_LEX_SYMBOL_TOKEN(LexMethod)                          \
+  static auto Dispatch##LexMethod##SymbolToken(Lexer& lexer,                 \
+                                               llvm::StringRef& source_text) \
+      ->void {                                                               \
+    LexResult result = lexer.LexMethod##SymbolToken(                         \
+        source_text, OneCharTokenKindTable[source_text.front()]);            \
+    CARBON_CHECK(result) << "Failed to form a token!";                       \
+    [[clang::musttail]] return DispatchNext(lexer, source_text);             \
   }
+  CARBON_DISPATCH_LEX_SYMBOL_TOKEN(LexOneChar)
+  CARBON_DISPATCH_LEX_SYMBOL_TOKEN(LexOpening)
+  CARBON_DISPATCH_LEX_SYMBOL_TOKEN(LexClosing)
 
   // Define a set of non-token dispatch functions that handle things like
   // whitespace and comments.
@@ -915,7 +930,11 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
     // needs to override some of the generic handling above, and provide a
     // custom token.
 #define CARBON_ONE_CHAR_SYMBOL_TOKEN(TokenName, Spelling) \
-  table[(Spelling)[0]] = &DispatchLexOneCharSymbol<TokenKind::TokenName>;
+  table[(Spelling)[0]] = &DispatchLexOneCharSymbolToken;
+#define CARBON_OPENING_GROUP_SYMBOL_TOKEN(TokenName, Spelling, ClosingName) \
+  table[(Spelling)[0]] = &DispatchLexOpeningSymbolToken;
+#define CARBON_CLOSING_GROUP_SYMBOL_TOKEN(TokenName, Spelling, OpeningName) \
+  table[(Spelling)[0]] = &DispatchLexClosingSymbolToken;
 #include "toolchain/lex/token_kind.def"
 
     // Override the handling for `/` to consider comments as well as a `/`
@@ -956,6 +975,8 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
 
   static const DispatchTableT DispatchTable;
 
+  static const std::array<TokenKind, 256> OneCharTokenKindTable;
+
   TokenizedBuffer* buffer_;
 
   SourceBufferLocationTranslator translator_;
@@ -976,6 +997,19 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
 constexpr TokenizedBuffer::Lexer::DispatchTableT
     TokenizedBuffer::Lexer::DispatchTable = MakeDispatchTable();
 
+constexpr std::array<TokenKind, 256>
+    TokenizedBuffer::Lexer::OneCharTokenKindTable = [] {
+      std::array<TokenKind, 256> table = {};
+#define CARBON_ONE_CHAR_SYMBOL_TOKEN(TokenName, Spelling) \
+  table[(Spelling)[0]] = TokenKind::TokenName;
+#define CARBON_OPENING_GROUP_SYMBOL_TOKEN(TokenName, Spelling, ClosingName) \
+  table[(Spelling)[0]] = TokenKind::TokenName;
+#define CARBON_CLOSING_GROUP_SYMBOL_TOKEN(TokenName, Spelling, OpeningName) \
+  table[(Spelling)[0]] = TokenKind::TokenName;
+#include "toolchain/lex/token_kind.def"
+      return table;
+    }();
+
 auto TokenizedBuffer::Lex(SourceBuffer& source, DiagnosticConsumer& consumer)
     -> TokenizedBuffer {
   TokenizedBuffer buffer(source);