Remove LLVM from the repo, and clean up history. (#1344)

This proposal establishes a plan for moving away from the embedded copy
of LLVM and instead downloading it with Bazel.

The goal is that after this lands, we will do a history rewrite to
clean up the repository. There are instructions on how folks can move any
in-flight work over to the newly tidied repo.

Co-authored-by: Richard Smith <richard@metafoo.co.uk>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: josh11b <josh11b@users.noreply.github.com>
Chandler Carruth, 3 years ago
commit df345b5ec7

+ 1 - 0
.pre-commit-config.yaml

@@ -168,6 +168,7 @@ repos:
               .bazelversion|
               compile_flags.txt|
               third_party/.*|
+              bazel/llvm-patches/.*\.patch|
               .*\.def|
               .*\.svg|
               .*/fuzzer_corpus/.*|

+ 10 - 2
WORKSPACE

@@ -84,10 +84,18 @@ http_archive(
 # LLVM libraries
 ###############################################################################
 
-new_local_repository(
+# We pin to specific upstream commits and try to track top-of-tree reasonably
+# closely rather than pinning to a specific release.
+llvm_version = "e2f627e5e3855309f3a7421f6786b401efb6b7c7"
+
+http_archive(
     name = "llvm-raw",
     build_file_content = "# empty",
-    path = "third_party/llvm-project",
+    patch_args = ["-p1"],
+    patches = ["@//:bazel/llvm-patches/0001-Patch-for-mallinfo2-when-using-Bazel-build-system.patch"],
+    sha256 = "228c37eecf8a8027ab32ac466b988712136191a0076d80750c646a3a9b1dc5d2",
+    strip_prefix = "llvm-project-%s" % llvm_version,
+    urls = ["https://github.com/llvm/llvm-project/archive/%s.tar.gz" % llvm_version],
 )
 
 load("@llvm-raw//utils/bazel:configure.bzl", "llvm_configure")
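
The pinned values in the `http_archive` rule above all follow from one commit hash. The sketch below is not part of this commit; it shows how `urls` and `strip_prefix` derive from `llvm_version` and how the `sha256` of a downloaded tarball could be recomputed when the pin is bumped. The function names are hypothetical and the download step itself is left to any tool you prefer.

```python
import hashlib

# Commit hash copied from the WORKSPACE hunk above.
LLVM_VERSION = "e2f627e5e3855309f3a7421f6786b401efb6b7c7"


def archive_url(commit: str) -> str:
    """Mirror the `urls` entry: a GitHub source archive for one commit."""
    return "https://github.com/llvm/llvm-project/archive/%s.tar.gz" % commit


def strip_prefix(commit: str) -> str:
    """Mirror the `strip_prefix` entry: the archive's top-level directory."""
    return "llvm-project-%s" % commit


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large tarball is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as archive:
        for chunk in iter(lambda: archive.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(archive_url(LLVM_VERSION))
    print(strip_prefix(LLVM_VERSION))
```

After downloading the archive from the printed URL, comparing `sha256_of_file(<downloaded file>)` against the `sha256` attribute is one way to confirm the pin before committing an update.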

+ 32 - 0
bazel/llvm-patches/0001-Patch-for-mallinfo2-when-using-Bazel-build-system.patch

@@ -0,0 +1,32 @@
+From 80f5475adfe179739a45d42850f8c06a630bc3a0 Mon Sep 17 00:00:00 2001
+From: Chandler Carruth <chandlerc@gmail.com>
+Date: Fri, 17 Jun 2022 09:10:41 +0000
+Subject: [PATCH] Patch for mallinfo2 when using Bazel build system.
+
+This detects and defines the `HAVE_MALLINFO2` macro based on the glibc
+version to allow easy use of the Bazel build on systems with modern
+glibc installs.
+---
+ llvm/lib/Support/Unix/Process.inc | 7 +++++++
+ 1 file changed, 7 insertions(+)
+
+diff --git a/llvm/lib/Support/Unix/Process.inc b/llvm/lib/Support/Unix/Process.inc
+index d3d9fb7d7187..da3e721146f6 100644
+--- a/llvm/lib/Support/Unix/Process.inc
++++ b/llvm/lib/Support/Unix/Process.inc
+@@ -31,6 +31,13 @@
+ #if HAVE_SIGNAL_H
+ #include <signal.h>
+ #endif
++// When glibc is in use, detect mallinfo2 to address mallinfo deprecation
++// warnings.
++#if !defined(HAVE_MALLINFO2) && defined(__GLIBC_PREREQ)
++#if __GLIBC_PREREQ(2, 33)
++#define HAVE_MALLINFO2
++#endif  // __GLIBC_PREREQ(2, 33)
++#endif  // !defined(HAVE_MALLINFO2) && defined(__GLIBC_PREREQ)
+ #if defined(HAVE_MALLINFO) || defined(HAVE_MALLINFO2)
+ #include <malloc.h>
+ #endif
+--
+2.36.1
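
The version gate in the patch above can be mirrored outside the preprocessor to check whether a given host would take the `HAVE_MALLINFO2` branch. This is a sketch only: it assumes glibc's `__GLIBC_PREREQ(maj, min)` compares `(major << 16) + minor` values, the function names are hypothetical, and the patch's `!defined(HAVE_MALLINFO2)` guard is not modeled.

```python
import platform
from typing import Tuple


def glibc_prereq(current: Tuple[int, int], required: Tuple[int, int]) -> bool:
    """Same comparison as __GLIBC_PREREQ: major in the high bits, minor low."""
    return (current[0] << 16) + current[1] >= (required[0] << 16) + required[1]


def would_define_have_mallinfo2(libc_name: str, libc_version: str) -> bool:
    """True when the patch's glibc >= 2.33 condition would hold."""
    if libc_name != "glibc":
        # Without glibc, __GLIBC_PREREQ is undefined and the block is skipped.
        return False
    major, minor = (int(part) for part in libc_version.split(".")[:2])
    return glibc_prereq((major, minor), (2, 33))


if __name__ == "__main__":
    # platform.libc_ver() reports the libc the Python interpreter links against.
    name, version = platform.libc_ver()
    print(name or "unknown libc", version, would_define_have_mallinfo2(name, version))
```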

+ 3 - 4
docs/project/contribution_tools.md

@@ -127,10 +127,9 @@ brew install bazelisk
 ### Clang and LLVM
 
 [Clang](https://clang.llvm.org/) and [LLVM](https://llvm.org/) are used to
-compile and link Carbon as part of its build. Their source code are also
-provided in a [third_party subtree](/third_party/llvm-project) for incorporation
-into Carbon or Carbon tools as libraries. While the subtree tracks upstream
-LLVM, the project expects the LLVM 12 release (or newer) to be installed with
+compile and link Carbon as part of its build. Bazel will also download and build
+against a specific upstream LLVM commit. While Bazel uses upstream LLVM
+sources, the project expects the LLVM 12 release (or newer) to be installed with
 Clang and other tools in your `PATH` for use in building Carbon itself.
 
 Our recommended way of installing is:

+ 370 - 0
proposals/p1344.md

@@ -0,0 +1,370 @@
+# Remove LLVM from the repository, and clean up history.
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/1344)
+
+<!-- toc -->
+
+## Table of contents
+
+-   [Problem](#problem)
+-   [Proposal](#proposal)
+-   [Details](#details)
+    -   [Implementation](#implementation)
+    -   [Patching LLVM](#patching-llvm)
+    -   [Updating forks and clones](#updating-forks-and-clones)
+        -   [Archive your existing fork and clones](#archive-your-existing-fork-and-clones)
+        -   [Create a fresh fork and clone](#create-a-fresh-fork-and-clone)
+        -   [Porting in-progress branches from your archived clone](#porting-in-progress-branches-from-your-archived-clone)
+        -   [Updating pending PRs](#updating-pending-prs)
+    -   [Review comments may be disrupted](#review-comments-may-be-disrupted)
+-   [Rationale](#rationale)
+-   [Alternatives considered](#alternatives-considered)
+    -   [Do nothing](#do-nothing)
+    -   [Don't rewrite the repository history.](#dont-rewrite-the-repository-history)
+    -   [Go back to submodules](#go-back-to-submodules)
+    -   [Rename the repository, and create a new one](#rename-the-repository-and-create-a-new-one)
+    -   [Manually extract and archive some review comments](#manually-extract-and-archive-some-review-comments)
+
+<!-- tocstop -->
+
+## Problem
+
+We have tried both Git submodules and subtrees to manage accessing the LLVM
+source code as part of Carbon. We have had significant issues with both of these
+over time.
+
+Git submodules are well supported in general, but make the repository
+significantly less user-friendly. There are a number of common operations, not
+least the initial clone of the repository, that are made frustratingly more
+complex in the presence of a submodule. In some cases (switching branches), they
+can even cause hard-to-diagnose errors for users. They also don't directly help
+with carrying local patches. For all of these reasons, we switched to subtrees.
+
+Git subtrees in some ways work much better once set up as they are simply a
+normal directory for most users. They also make it especially easy to make and
+carry local patches. However, setting up the subtree and updating the subtree
+are bug-prone and interact very poorly with GitHub's pull request workflow. The
+result is that updating LLVM is currently both very difficult and error-prone.
+
+Another fundamental problem subtrees expose is that the LLVM repository is
+_huge_. It includes large and complex projects that Carbon is unlikely to ever
+use, and it doesn't make sense for us to start our repository off with all of
+that history and space if we can avoid it. Including LLVM causes things like
+`git status` or cloning the repository to be dramatically more expensive without
+significant benefit.
+
+## Proposal
+
+We should switch to using Bazel to download a snapshot of LLVM during the build,
+and do a destructive history rewrite of the Carbon repository to completely
+remove LLVM from it. Building against LLVM will continue to work
+seamlessly. However, the repository will become tiny and fast to work with, and
+the snapshot download is significantly cheaper than cloning or working with the
+full history of LLVM.
+
+The only way to capture the benefit here is to do a destructive update to the
+repository's history. This is unfortunate, but essential to do as soon as
+possible and before we shift to be public. Despite the cost of making this
+change, the cost of _not_ making it will grow without bound for the life of the
+project.
+
+**Important:** This _will_ require a manual update of some kind for every fork
+and clone. The steps to update them are provided below.
+
+## Details
+
+### Implementation
+
+<!-- google-doc-style-ignore -->
+
+This will be accomplished using the
+[`git-filter-repo` tool](https://github.com/newren/git-filter-repo). It makes
+it very easy to prune any trees from the history, replaying the remaining
+history to produce a clean result.
+
+<!-- google-doc-style-resume -->
+
+We will remove three other trees while here that are no longer used:
+
+-   `src/jekyll/`
+-   `src/firebase/`
+-   `website/`
+
+Since we're taking a history break anyway, we should capture what value we can
+and keep a minimal history of the repository we are actually using. The largest
+of these trees is `src/jekyll` (988kb even packed), but it seems cleanest to
+remove the three together.
+
+### Patching LLVM
+
+We still need to support patching LLVM. Initially, we will start with a simple
+approach of applying a collection of patch files from the Carbon repository to
+the LLVM snapshot. This is especially convenient while Carbon's repositories are
+not public. Eventually, we can switch to having a fork of LLVM that we snapshot
+and developing any needed patches there. Having a more robust long-term solution
+is expected to be important if larger scale development is needed on Clang when
+building interoperability support.
+
+<!-- google-doc-style-ignore -->
+
+Developing patches is also very easy when using Bazel. You can pass
+`--override_repository=llvm-project=<directory>`
+([docs](https://bazel.build/reference/command-line-reference#flag--override_repository))
+to have Bazel use a local checkout of LLVM (with any patches you are testing)
+rather than downloading it. If doing significant development, this can be added
+to your `user.bazelrc` file to consistently use your development area.
+
+<!-- google-doc-style-resume -->
+
+### Updating forks and clones
+
+If you have a fork and any clones of the repository with work that you want to
+save, we suggest the following process:
+
+#### Archive your existing fork and clones
+
+1.  Go to the archived repository:
+    https://github.com/carbon-language/archived-carbon-lang
+
+2.  Fork this archived repository just like you normally would (into your
+    _personal_ space). It is important to fork the `carbon-language` hosted
+    archived repository so you pick up the ACLs. You should **not** create your
+    own personal repository.
+
+3.  Update any clones to use these archived remotes to avoid accidentally trying
+    to interact with the new repository. The example commands below assume the
+    remote `origin` points to your fork and `upstream` points to the main
+    repository. You may also need to adjust from the SSH-style URLs to HTTPS
+    ones to match:
+
+    ```
+    git remote set-url upstream git@github.com:carbon-language/archived-carbon-lang.git git@github.com:carbon-language/carbon-lang.git
+    git remote set-url origin git@github.com:$USER/archived-carbon-lang.git git@github.com:$USER/carbon-lang.git
+    ```
+
+4.  For each clone, mirror everything into the new `origin` (your fork):
+
+    ```
+    git push --mirror origin
+    ```
+
+    If you have multiple clones with different but overlapping branches, this
+    may require some extra steps to avoid collisions when doing the mirror push.
+
+At this point, your clones should all point to an archival fork and should
+continue to function normally. The goal here is just archival and making
+sure you don't lose anything.
+
+#### Create a fresh fork and clone
+
+Delete your existing fork of `carbon-lang` (_not_ `archived-carbon-lang`):
+
+1.  Go to your fork's settings page:
+    `https://github.com/$USER/carbon-lang/settings`. Be certain this is your
+    _fork_ and not the `carbon-language` organization.
+2.  Scroll to the bottom, and click `Delete this repository`.
+
+Once deleted, create a fresh fork from the now-filtered main repository:
+https://github.com/carbon-language/carbon-lang/fork
+
+Clone this fork and you're ready to go with the new repository structure.
+
+#### Porting in-progress branches from your archived clone
+
+For any branch in your archive clones that you want to import, there are two
+approaches.
+
+The simplest way is to extract the branch as a patch file and apply it in the
+new repository:
+
+```
+# If your old clone is in the `archived-carbon-lang` directory and your new
+# clone is in `carbon-lang`, both of them on the `trunk` branch:
+cd archived-carbon-lang
+git switch $BRANCH
+git diff upstream/trunk... > ../$BRANCH.patch
+cd ../carbon-lang
+git switch -c $BRANCH
+patch -p1 < ../$BRANCH.patch
+git commit
+```
+
+This will just flatten the branch into a single commit in the new clone.
+
+If you want to preserve the commit history, you can do that using
+`git format-patch`. Some points to remember here:
+
+-   Make sure you've rebased the branch into a clean series of commits on top of
+    the latest trunk in the archived repository.
+-   Use `git format-patch` to get a series of patches in a directory. See its
+    help for detailed usage.
+-   Use `git am` to apply the series of patches in your new clone under some
+    branch. Again, see its help for detailed usage.
+
+#### Updating pending PRs
+
+The pending PRs should end up closed automatically when you delete your fork
+above. If not, the simplest approach is to close them and just create a new PR.
+In-flight discussion comment threads will be awkward, but there isn't much else
+to do. Rename the proposal document if needed.
+
+We will explore whether we can re-open PRs with a fresh forced push from the
+fresh fork, but aren't confident in this working. It also isn't necessary, and
+we only have 12 PRs open at the moment.
+
+### Review comments may be disrupted
+
+When we edit the repository, some PR comments (especially pending PRs that are
+updated to follow a freshly created fork) may lose their association with the
+line of code they were made against, and it is possible we may be unable to find
+some comments. This will at most apply to inline comments within the code of
+PRs, but that is the common case for PRs. We expect most of these to still be
+somewhat visible in the conversation view of the pull request, but they may be
+hard to find due to no longer being attached to a line.
+
+## Rationale
+
+-   [Community and culture](/docs/project/goals.md#community-and-culture)
+    -   This will make it even easier and smoother for folks to clone, build,
+        and even contribute.
+    -   It will make clones significantly cheaper, especially those not building
+        code but just working on documentation.
+-   [Language tools and ecosystem](/docs/project/goals.md#language-tools-and-ecosystem)
+    -   Preserving the ability to build with LLVM and develop patches will be
+        important as we expand the set of language tools and ecosystem being
+        developed.
+
+## Alternatives considered
+
+### Do nothing
+
+We could simply continue using the moderately broken Git subtree approach.
+
+Advantages:
+
+-   No need to change our approach.
+-   No destructive operation on the repository requiring everyone to update.
+
+Disadvantages:
+
+-   Every update of LLVM remains difficult, manual, and error prone.
+    -   We're currently carrying an LLVM patch, and it's easy to lose (and has
+        been lost) in updates.
+-   Will continually hit bugs in Git subtree as it seems both brittle and not a
+    priority. We will have to struggle to understand and fix or work around
+    them.
+-   The repository remains massive, slow, and contains significant LLVM code
+    that we will never use.
+-   If we ever have users that wish to import the Carbon repository into some
+    other environment, they will have to pay the cost of LLVM or remove it
+    somehow. It may even cause them to have two copies of LLVM.
+
+We think this problem is worth solving.
+
+### Don't rewrite the repository history.
+
+We could fix this without rewriting history. If we choose not to rewrite history
+now, it should be noted that the cost of rewriting history only grows and so we
+should expect to _never_ rewrite history.
+
+Advantages:
+
+-   No need for manual steps to update forks, clones, in-flight patches, etc.
+-   Commit hashes remain stable.
+-   All code review comment information associated with commit hashes will be
+    retained.
+
+Disadvantages:
+
+-   We pay the cost of having imported LLVM forever. Even if this cost is
+    incrementally small (some amount of repository space when cloning with
+    history), it is a cost we will never stop paying.
+
+While the immediate costs are high, the unbounded time frame over which we will
+pay for leaving LLVM in the history means that cost will eventually dominate.
+We should just rewrite history, and the sooner the better.
+
+### Go back to submodules
+
+Rather than switching to use Bazel downloaded snapshots, we could go back to
+using Git submodules. The original motivation to move away from submodules was
+needing to carry a local patch. Submodules and the proposed direction have the
+same options there, and our approach is discussed above.
+
+Advantages:
+
+-   Somewhat more of a Git-native approach.
+-   Would avoid needing to invent another approach if we add a non-Bazel build
+    system.
+
+Disadvantages:
+
+-   Much of the cloning cost and other Git command costs we see with subtrees
+    would still be present.
+-   It makes working with the Git repository even more tricky to get right.
+-   While less esoteric than subtrees, it still exposes a less polished surface
+    of Git that we will have to cope with.
+-   Most notably, this will re-introduce the stumbling block of users first
+    encountering Carbon not having a seamless experience.
+
+We think we should try something simpler here, even if a bit less of a
+native-Git solution. It is also especially important for us to optimize the
+initial new contributor flow, and the Bazel approach is expected to be more
+seamless.
+
+### Rename the repository, and create a new one
+
+Rather than editing the repository in place, this would move it aside and create
+a new one with the edited history.
+
+Advantages:
+
+-   Some disruptive aspects of the in-place history edit would be avoided,
+    specifically parts of PRs that are associated with commit hashes would
+    likely continue working due to the commit hashes being preserved.
+
+Disadvantages:
+
+-   For most cases, this would be equally disruptive -- forks and clones would
+    have the same update needed as they reference the repository by name.
+-   We would either need to build a system to carefully re-create the exact
+    issue numbering and PR numbering, which may not even be possible, or we
+    would need to allow all issue numbers and PR numbers to churn.
+    -   The PR numbers churning is especially disruptive as the commit log
+        references them to connect commits to the code review that led to the
+        commit. They are also the basis of the proposal numbers.
+    -   There isn't likely a way to move the PRs back to the main repository, so
+        browsing the historical code reviews would become problematic.
+
+This seems to largely trade off preservation of some of the comment history
+within PRs for preserving the location, numbers, and links of all PRs and
+issues. That doesn't seem like the right tradeoff.
+
+### Manually extract and archive some review comments
+
+We could attempt to extract and archive review comments in case they are lost or
+made hard to find by the change.
+
+Advantages:
+
+-   Defense in depth against any information loss here.
+
+Disadvantages:
+
+-   Would be a decent amount of work to get this right.
+-   We may not lose much information here. We think the comments will still be
+    in the conversation view, but they may be hard to find.
+-   Unclear whether any of these will actually have value, especially extracted
+    and out of context from the review where they were made.
+-   Serious concerns about arbitrarily scraping and moving comments other folks
+    authored outside of the system (GitHub) that they authored them within.
+
+Overall, while there is some risk here, we don't think it is too high and the
+cost of trying to mitigate this seems unreasonable given the relatively small
+total amount of data at issue here.
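
The "Implementation" section of the proposal above names `git-filter-repo` and the trees to prune but does not give the command line. The sketch below, which is not part of this commit, assembles one plausible invocation using the tool's `--invert-paths` and `--path` options. The exact command the maintainers ran is not stated in the source, so treat the helper name and the resulting command as assumptions; the LLVM path is taken from the `path` attribute removed in the WORKSPACE hunk.

```python
import shlex
from typing import List, Sequence

# Trees named in the proposal, plus the embedded LLVM subtree.
PRUNED_TREES = (
    "third_party/llvm-project",
    "src/jekyll/",
    "src/firebase/",
    "website/",
)


def filter_repo_command(paths: Sequence[str]) -> List[str]:
    """Build an argument list that drops `paths` from all of history.

    `--invert-paths` turns the `--path` selections into an exclusion list.
    """
    command = ["git", "filter-repo", "--invert-paths"]
    for path in paths:
        # Trailing slashes are stripped so directories are passed uniformly.
        command += ["--path", path.rstrip("/")]
    return command


if __name__ == "__main__":
    # Print rather than execute: the rewrite is destructive by design.
    print(shlex.join(filter_repo_command(PRUNED_TREES)))
```

Printing instead of running keeps the sketch safe to try. A rewrite like this is best rehearsed on a fresh clone that can be thrown away.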