
Predetermine all the line splits in the lexer. (#3278)

## Summary ##

Restructures the lexer to first scan the entire source text for newlines
and create all the line structures needed. Doing this up-front makes it
easy to produce an optimized version with minimal complexity. Currently,
it leverages the system `memchr`, but even when expanded to handle more
complex cases like CR+LF line endings, being isolated in this way will
result in a significantly simpler implementation. This change also
improves the lexing of comment lines substantially by skipping their
contents immediately. The overhead of the pre-scan is unmeasurable in
all realistic benchmarks, and 10-30% in benchmarks consisting almost
entirely of blank lines or comments. Comment lexing with average-length
comment lines mixed with code improves by 20% and up. Trading the
regression on blank lines and empty comments for the improvement on
non-empty comments seems like the right tradeoff (by far).

## Background and details ##

One weak point in the lexer implementation was large runs of comments.
While those aren't terribly common, they shouldn't present a hazard to
the lexer's performance.

A bit more common is a pattern of comments like the following:

```carbon
  // Some method comment here.
  fn SomeMethodName(...) -> ...;

  // Some other method comment here.
  fn SomeOtherMethodName(...) -> ...;
```

Here, the lexer spends an inordinate amount of time getting from the
`\n` after the first semicolon to the `fn` token. It has to skip a blank
line, scan a line, find the `//` comment start, then scan to find the
next `\n`, and then scan horizontal whitespace, etc.
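
Concretely, each of those steps is a byte-at-a-time loop through the
dispatch machinery. The comment skip, for example, is essentially the
loop the diff below removes:

```cpp
// Old approach: consume the comment one byte at a time until the newline,
// updating the column as we go. Every comment byte pays the loop overhead.
while (!source_text.empty() && source_text.front() != '\n') {
  ++current_column_;
  source_text = source_text.drop_front();
}
```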

It is tempting to build a scanner *exactly* for this. In fact, I built
one, and I can publish it in a PR if folks are interested in what it
looks like. For x86-64, the PSHUFB trick used for scanning identifiers
technically works. But it is *complicated*. Amazingly so: 150 lines of
very subtle code with performance pitfalls at every turn. I felt very
uncomfortable submitting it, but we can always go back to it.
Nothing I've come up with quite matches it for sheer speed.

However, most of the complexity and time is spent walking from a `//` to
the end of the line. And *that* is something we can do very simply. In
fact, there is a tuned function for that in libc: `memchr`. Using this
we can build a very fast and much simpler scanner to split lines
up-front. This PR uses that and a carefully crafted fast loop to first
build up all the line info we need. Getting this to be as fast as
possible required some other subtle changes, for example always creating
a line structure that goes from the last `\n` to the end of the file.
We then back up the EOF token to avoid surfacing this line to users. The
nice thing is that the EOF token isn't part of any hot loop, so this
removes branches everywhere else at a modest complexity cost.
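
As a minimal, self-contained sketch of the up-front splitting (the names
`LineRange` and `SplitLines` are illustrative only; the real
implementation is in the diff below):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct LineRange {
  size_t start;
  size_t length;
};

// Split the source into lines up-front with `memchr`. The final line always
// runs from the last `\n` to the end of the file, even when the file ends
// with a newline; the lexer later backs the EOF token up onto the prior line
// so that synthetic empty line is never surfaced.
auto SplitLines(const char* text, size_t size) -> std::vector<LineRange> {
  std::vector<LineRange> lines;
  size_t start = 0;
  while (const void* nl = std::memchr(text + start, '\n', size - start)) {
    size_t nl_index = static_cast<const char*>(nl) - text;
    lines.push_back({start, nl_index - start});
    start = nl_index + 1;
  }
  lines.push_back({start, size - start});
  return lines;
}
```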

Once we have that, the rest of the lexer just needs to keep track of its
current line in order to record column offsets. I've taken some care to
optimize the lexer's usage of the line structures, but I suspect there
are more opportunities here.

Combined, this gets much, but not all, of the performance of a huge SIMD
scanner for newline-through-to-next-token. For extreme cases (100s of
blank lines or empty comment lines between tokens) the holistic scanner
is of course still much faster, but those cases don't seem nearly worth
the cost.

I was initially worried about the overhead of taking two passes over the
source text, but in practice I've not been able to measure any
appreciable cost to this with realistic source files. In some cases,
benchmarks with no newlines even get *faster*, because as a happenstance
we use a much more efficient approach to fetching the source text into
cache, and that in turn makes the byte-wise dispatched loop run faster
as it stalls less.

I'm particularly happy with this approach because it seems very clear
how to extend this to support CR+LF, bare CR, and even complex mixtures
without any significant speed cost. That wasn't at all true for the
other approaches explored.
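
For instance, here is a hedged sketch, not part of this PR, of how CR+LF
might fold into the same loop while staying on the `memchr` fast path
(reusing `LineRange` from the sketch above):

```cpp
// Sketch: still find lines via memchr for '\n', but trim a preceding '\r'
// from the recorded line length so "\r\n" counts as a single terminator.
// Bare '\r' endings would still need separate detection and diagnostics.
auto SplitLinesCrLf(const char* text, size_t size) -> std::vector<LineRange> {
  std::vector<LineRange> lines;
  size_t start = 0;
  while (const void* nl = std::memchr(text + start, '\n', size - start)) {
    size_t nl_index = static_cast<const char*>(nl) - text;
    size_t length = nl_index - start;
    if (length > 0 && text[nl_index - 1] == '\r') {
      --length;  // Fold the '\r' into the "\r\n" terminator.
    }
    lines.push_back({start, length});
    start = nl_index + 1;
  }
  lines.push_back({start, size - start});
  return lines;
}
```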

I may try some further PRs to smooth out the last bits of slowness here,
but this is already working excellently for me in practice. My 10mloc
test case is down to 2.3s to lex.

## Raw benchmark data ##

Using a tool that runs the benchmarks before and after this change and
analyzes the results, the following summarizes the CPU-time impact; each
benchmark lexes 100k tokens:

```
BM_ValidKeywords                               2.57ms ± 1%  2.58ms ± 0%     ~     (p=0.190 n=5+4)
BM_ValidIdentifiers<1, 64, false>              9.24ms ± 4%  9.31ms ± 4%     ~     (p=0.421 n=5+5)
BM_ValidIdentifiers<1, 1, true>                3.05ms ± 4%  3.11ms ± 4%     ~     (p=0.222 n=5+5)
BM_ValidIdentifiers<3, 5, true>                10.9ms ± 0%  11.1ms ± 1%   +1.76%  (p=0.016 n=4+5)
BM_ValidIdentifiers<3, 16, true>               11.1ms ± 7%  11.0ms ± 1%     ~     (p=0.310 n=5+5)
BM_ValidIdentifiers<12, 64, true>              12.2ms ± 1%  12.3ms ± 2%     ~     (p=0.111 n=4+5)
BM_HorizontalWhitespace/1                      11.2ms ± 6%  11.1ms ± 2%     ~     (p=0.841 n=5+5)
BM_HorizontalWhitespace/4                      12.0ms ± 3%  12.0ms ± 2%     ~     (p=0.548 n=5+5)
BM_HorizontalWhitespace/16                     16.2ms ± 6%  15.9ms ± 8%     ~     (p=0.690 n=5+5)
BM_HorizontalWhitespace/64                     27.7ms ± 3%  28.4ms ± 3%     ~     (p=0.151 n=5+5)
BM_HorizontalWhitespace/128                    44.3ms ± 1%  45.6ms ± 6%   +3.15%  (p=0.032 n=5+5)
BM_RandomSource                                7.75ms ± 2%  7.72ms ± 1%     ~     (p=1.000 n=5+5)
BM_BlankLines/1                                11.7ms ± 1%  12.1ms ± 1%   +3.46%  (p=0.008 n=5+5)
BM_BlankLines/4                                14.0ms ± 2%  15.2ms ± 3%   +8.12%  (p=0.008 n=5+5)
BM_BlankLines/16                               23.5ms ± 2%  31.1ms ± 4%  +32.26%  (p=0.008 n=5+5)
BM_BlankLines/64                               75.3ms ± 1%  81.2ms ± 3%   +7.83%  (p=0.008 n=5+5)
BM_BlankLines/128                               133ms ± 3%   150ms ± 2%  +12.74%  (p=0.008 n=5+5)
BM_CommentLines/1/0/0                          13.1ms ± 0%  13.7ms ± 1%   +5.11%  (p=0.008 n=5+5)
BM_CommentLines/4/0/0                          16.6ms ± 1%  18.2ms ± 4%   +9.56%  (p=0.008 n=5+5)
BM_CommentLines/128/0/0                         169ms ± 4%   182ms ± 1%   +7.24%  (p=0.008 n=5+5)
BM_CommentLines/1/30/0                         18.7ms ± 5%  14.1ms ± 0%  -24.84%  (p=0.008 n=5+5)
BM_CommentLines/4/30/0                         36.5ms ± 6%  20.6ms ± 3%  -43.59%  (p=0.008 n=5+5)
BM_CommentLines/128/30/0                        525ms ± 4%   198ms ± 1%  -62.38%  (p=0.008 n=5+5)
BM_CommentLines/1/70/0                         23.4ms ± 6%  14.7ms ± 2%  -37.15%  (p=0.008 n=5+5)
BM_CommentLines/4/70/0                         53.3ms ± 7%  22.4ms ± 4%  -57.99%  (p=0.008 n=5+5)
BM_CommentLines/128/70/0                        1.05s ± 4%   0.21s ± 2%  -80.31%  (p=0.008 n=5+5)
BM_CommentLines/1/0/2                          14.1ms ± 6%  14.3ms ± 1%     ~     (p=0.151 n=5+5)
BM_CommentLines/4/0/2                          19.4ms ± 5%  20.1ms ± 1%     ~     (p=0.151 n=5+5)
BM_CommentLines/128/0/2                         238ms ± 8%   229ms ± 0%     ~     (p=0.151 n=5+5)
BM_CommentLines/1/30/2                         19.2ms ± 7%  14.6ms ± 1%  -23.87%  (p=0.008 n=5+5)
BM_CommentLines/4/30/2                         40.3ms ±13%  22.3ms ± 4%  -44.63%  (p=0.008 n=5+5)
BM_CommentLines/128/30/2                        568ms ± 7%   254ms ± 3%  -55.28%  (p=0.008 n=5+5)
BM_CommentLines/1/70/2                         23.3ms ± 1%  15.0ms ± 3%  -35.61%  (p=0.016 n=4+5)
BM_CommentLines/4/70/2                         57.2ms ± 9%  24.1ms ± 2%  -57.81%  (p=0.008 n=5+5)
BM_CommentLines/128/70/2                        1.07s ± 0%   0.26s ± 2%  -75.51%  (p=0.016 n=4+5)
BM_CommentLines/1/0/8                          15.9ms ± 7%  16.0ms ± 1%     ~     (p=0.151 n=5+5)
BM_CommentLines/4/0/8                          24.2ms ± 6%  27.9ms ± 2%  +15.36%  (p=0.008 n=5+5)
BM_CommentLines/128/0/8                         386ms ± 4%   445ms ± 1%  +15.28%  (p=0.008 n=5+5)
BM_CommentLines/1/30/8                         20.6ms ± 5%  16.3ms ± 1%  -20.95%  (p=0.008 n=5+5)
BM_CommentLines/4/30/8                         45.3ms ± 6%  30.3ms ± 3%  -32.98%  (p=0.008 n=5+5)
BM_CommentLines/128/30/8                        699ms ± 3%   477ms ± 3%  -31.83%  (p=0.008 n=5+5)
BM_CommentLines/1/70/8                         25.7ms ± 5%  16.8ms ± 2%  -34.67%  (p=0.008 n=5+5)
BM_CommentLines/4/70/8                         62.0ms ± 4%  31.6ms ± 2%  -49.10%  (p=0.008 n=5+5)
BM_CommentLines/128/70/8                        1.20s ± 2%   0.48s ± 4%  -59.60%  (p=0.008 n=5+5)
```

The horizontal whitespace benchmarks (and all of the non-line-oriented
ones) are noisier than they appear here, but do (surprisingly) show some
improvements. My guess is that this has a lot to do with system load: we
now scan the text with a vectorized loop first and only then run the
byte-dispatched loop, so when the cache is a bit slower to populate, the
vectorized version starts to pull ahead.

---------

Co-authored-by: Richard Smith <richard@metafoo.co.uk>
Co-authored-by: josh11b <josh11b@users.noreply.github.com>
Chandler Carruth, 2 years ago
commit 6ba8712fbd
2 files changed, with 90 additions and 25 deletions:

1. toolchain/lex/tokenized_buffer.cpp (+77, -25)
2. toolchain/lex/tokenized_buffer.h (+13, -0)

toolchain/lex/tokenized_buffer.cpp (+77, -25)

@@ -247,18 +247,48 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
         translator_(&buffer),
         emitter_(translator_, consumer),
         token_translator_(&buffer),
-        token_emitter_(token_translator_, consumer),
-        current_line_(buffer.AddLine(LineInfo(0))),
-        current_line_info_(&buffer.GetLineInfo(current_line_)) {}
+        token_emitter_(token_translator_, consumer) {}
+
+  // Find all line endings and create the line data structures. Explicitly kept
+  // out-of-line because this is a significant loop that is useful to have in
+  // the profile, and it doesn't simplify when inlined; without this attribute
+  // the compiler would flatten it into its caller anyway.
+  [[gnu::noinline]] auto CreateLines(llvm::StringRef source_text) -> void {
+    // We currently use `memchr` here, which is typically well optimized to
+    // use SIMD or other techniques significantly faster than byte-wise
+    // scanning. We also use carefully selected variables and the `ssize_t`
+    // type to improve the performance and code size of this hot loop.
+    //
+    // TODO: Eventually, we'll likely need to roll our own SIMD-optimized
+    // routine here in order to handle CR+LF line endings, as we'll want those
+    // to stay on the fast path. We'll also need to detect and diagnose Unicode
+    // vertical whitespace. Starting with `memchr` should give us a strong
+    // baseline performance target when adding those features.
+    const char* const text = source_text.data();
+    const ssize_t size = source_text.size();
+    ssize_t start = 0;
+    while (const char* nl = reinterpret_cast<const char*>(
+               memchr(&text[start], '\n', size - start))) {
+      ssize_t nl_index = nl - text;
+      buffer_->AddLine(LineInfo(start, nl_index - start));
+      start = nl_index + 1;
+    }
+    // The last line ends at the end of the file.
+    buffer_->AddLine(LineInfo(start, size - start));
+
+    // Now that all the infos are allocated, get a fresh pointer to the first
+    // info for use while lexing.
+    current_line_ = Line(0);
+    current_line_info_ = &buffer_->GetLineInfo(current_line_);
+  }
 
   // Perform the necessary bookkeeping to step past a newline at the current
   // line and column.
   auto HandleNewline() -> void {
-    current_line_info_->length = current_column_;
-
-    current_line_ = buffer_->AddLine(
-        LineInfo(current_line_info_->start + current_column_ + 1));
+    int next_start = current_line_info_->start + current_column_ + 1;
+    current_line_ = buffer_->GetNextLine(current_line_);
     current_line_info_ = &buffer_->GetLineInfo(current_line_);
+    CARBON_DCHECK(next_start == current_line_info_->start);
     current_column_ = 0;
     set_indent_ = false;
   }
@@ -278,15 +308,6 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
     CARBON_DCHECK(source_text.front() == '\n');
     NoteWhitespace();
     source_text = source_text.drop_front();
-
-    // If this is the last character in the source, directly return here
-    // to avoid creating an empty line.
-    if (LLVM_UNLIKELY(source_text.empty())) {
-      current_line_info_->length = current_column_;
-      return;
-    }
-
-    // Otherwise, add a line and set up to continue lexing.
     HandleNewline();
   }
 
@@ -325,14 +346,18 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
                     NoWhitespaceAfterCommentIntroducer);
     }
 
-    // Now just consume the text until a newline.
-    while (!source_text.empty() && source_text.front() != '\n') {
-      ++current_column_;
-      source_text = source_text.drop_front();
+    // Use the current line info to jump to the end of the line.
+    source_text =
+        source_text.drop_front(current_line_info_->length - current_column_);
+    // This may be the end of the file, in which case we immediately return.
+    if (source_text.empty()) {
+      // Finished lexing.
+      return;
     }
 
-    // We don't handle the newline, just fall back to the lex loop to handle it
-    // generically.
+    // Otherwise, lex the newline.
+    current_column_ = current_line_info_->length;
+    LexVerticalWhitespace(source_text);
   }
 
   auto LexNumericLiteral(llvm::StringRef& source_text) -> LexResult {
@@ -725,10 +750,23 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
   auto LexEndOfFile(llvm::StringRef& source_text) -> void {
     CARBON_DCHECK(source_text.empty());
 
+    // Check if the last line is empty and isn't also the first (and only)
+    // line. If so, re-pin the last line to be the prior one so that
+    // diagnostics and editors can treat newlines as terminators, even though
+    // we internally handle them as separators in case the last line is
+    // missing a newline. We do this here rather than when we see the newline
+    // to avoid extra conditions along that fast path.
+    if (current_column_ == 0 && buffer_->GetLineNumber(current_line_) != 1) {
+      current_line_ = buffer_->GetPrevLine(current_line_);
+      current_line_info_ = &buffer_->GetLineInfo(current_line_);
+      current_column_ = current_line_info_->length;
+    } else {
+      // Update the line length as this is also the end of a line.
+      current_line_info_->length = current_column_;
+    }
+
     // The end-of-file token is always considered to be whitespace.
     NoteWhitespace();
-    // Update the line length as this is also the end of a line.
-    current_line_info_->length = current_column_;
 
     // Close any open groups. We do this after marking whitespace, it will
     // preserve that.
@@ -804,6 +842,9 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
   // The main entry point for dispatching through the lexer's table. This method
   // should always fully consume the source text.
   auto Dispatch(llvm::StringRef& source_text) -> void {
+    // First build up our line data structures.
+    CreateLines(source_text);
+
     LexStartOfFile(source_text);
 
     // Manually enter the dispatch loop. This call will tail-recurse through the
@@ -923,7 +964,7 @@ class [[clang::internal_linkage]] TokenizedBuffer::Lexer {
   TokenLocationTranslator token_translator_;
   TokenDiagnosticEmitter token_emitter_;
 
-  Line current_line_;
+  Line current_line_ = Line::Invalid;
   LineInfo* current_line_info_;
 
   int current_column_ = 0;
@@ -1098,6 +1139,17 @@ auto TokenizedBuffer::GetLineNumber(Line line) const -> int {
   return line.index + 1;
 }
 
+auto TokenizedBuffer::GetNextLine(Line line) const -> Line {
+  Line next(line.index + 1);
+  CARBON_DCHECK(static_cast<size_t>(next.index) < line_infos_.size());
+  return next;
+}
+
+auto TokenizedBuffer::GetPrevLine(Line line) const -> Line {
+  CARBON_CHECK(line.index > 0);
+  return Line(line.index - 1);
+}
+
 auto TokenizedBuffer::GetIndentColumnNumber(Line line) const -> int {
   return GetLineInfo(line).indent + 1;
 }

toolchain/lex/tokenized_buffer.h (+13, -0)

@@ -55,8 +55,12 @@ struct Token : public ComparableIndexBase {
 // All other APIs to query a `Line` are on the `TokenizedBuffer`.
 struct Line : public ComparableIndexBase {
   using ComparableIndexBase::ComparableIndexBase;
+
+  static const Line Invalid;
 };
 
+constexpr Line Line::Invalid(Line::InvalidIndex);
+
 // A lightweight handle to a lexed identifier in a `TokenizedBuffer`.
 //
 // `Identifier` objects are designed to be passed by value, not reference or
@@ -229,6 +233,12 @@ class TokenizedBuffer : public Printable<TokenizedBuffer> {
   // Returns the 1-based indentation column number.
   [[nodiscard]] auto GetIndentColumnNumber(Line line) const -> int;
 
+  // Returns the next line handle.
+  [[nodiscard]] auto GetNextLine(Line line) const -> Line;
+
+  // Returns the previous line handle.
+  [[nodiscard]] auto GetPrevLine(Line line) const -> Line;
+
   // Returns the text for an identifier.
   [[nodiscard]] auto GetIdentifierText(Identifier id) const -> llvm::StringRef;
 
@@ -347,6 +357,9 @@ class TokenizedBuffer : public Printable<TokenizedBuffer> {
           length(static_cast<int32_t>(llvm::StringRef::npos)),
           indent(0) {}
 
+    explicit LineInfo(int64_t start, int32_t length)
+        : start(start), length(length), indent(0) {}
+
     // Zero-based byte offset of the start of the line within the source buffer
     // provided.
     int64_t start;