The Microsoft Study That Exposed Why Your Large PRs Fail (And GitHub's 2024 Fix)
TL;DR: A Microsoft Research study of over 700,000 code reviews found that reviewers comment on just 2–3% of methods in large changesets versus 7% in small ones, and that PRs over 1,000 lines are abandoned at sharply higher rates. GitHub’s native stacked PRs, currently in private preview, productize the workflow Meta, Google, and Uber have used internally for over a decade: breaking big changes into ordered, dependent layers, with server-side stack tracking, one-click cascading rebases, and branch protection evaluated against the final base.
Code review quality collapses as PR size grows. That’s not opinion; it’s what happens when humans try to hold too much context at once. For more than a decade, the largest engineering organizations have sidestepped the problem by treating changes as ordered stacks rather than as isolated commits or massive diffs. Now GitHub has taken that battle-tested internal practice and built it directly into the platform. The implications go beyond convenience: native stacks change how teams refactor, how CI behaves, and whether small, reviewable units become the default even when features depend on each other.
From Phabricator to ghstack: How Stacked Diffs Escaped Big Tech
The pattern started inside monorepos, where true independence between changes was often impossible. Google’s Mondrian and Facebook’s Phabricator treated dependent changes as first-class citizens as early as the mid-2000s [1]. By 2017 Facebook had open-sourced ghstack, which rewrote diffs onto temporary branches so engineers could maintain logical stacks on GitHub even though the platform had no native concept of them [2]. Tools like eBay’s spr and Graphite followed, each adding CLI-driven stack management and dashboards. The Microsoft study quantified what these teams already knew: review effectiveness falls dramatically with size, and merge probability drops non-linearly past roughly 400 lines [3]. GitHub’s version differs because the platform itself now understands the chain, whereas previous tools had to fake it with branch-naming tricks or external state.
What Changes When GitHub Tracks the Stack Server-Side
The technical leap is that branch protection, merge queues, and status checks now apply against the logical bottom of the stack rather than just the immediate base branch. When you merge the lowest PR, GitHub automatically rebases the remaining layers so the next one targets the updated base, with no manual retargeting [1]. The CLI (gh stack) handles local creation, rebases the whole chain with one command, and opens the entire chain as separate PRs. Reviewers see a stack map in the UI that lets them jump between layers without losing context, and each PR shows a focused diff for its own change. Compared with the old “one giant PR” approach, this preserves the ability to land everything together while letting reviewers reason about one conceptual piece at a time. The tradeoff is rebase cascade risk: if the bottom layer has conflicts, every PR above it may need manual resolution, though the one-click rebase button reduces the friction.
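To make the retargeting and cascade behavior concrete, here is a minimal Python sketch of a stack as an ordered list of PRs. It is a toy model only: the `PullRequest` and `Stack` classes, their fields, and the branch names are all hypothetical, and nothing here reflects GitHub's actual data model, API, or the `gh stack` CLI.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    number: int
    head: str  # branch carrying this layer's commits (hypothetical name)
    base: str  # branch this PR currently targets
    merged: bool = False


@dataclass
class Stack:
    trunk: str
    layers: list[PullRequest] = field(default_factory=list)  # bottom first

    def push(self, number: int, head: str) -> PullRequest:
        """Add a layer on top; it targets the layer below, or trunk if first."""
        base = self.layers[-1].head if self.layers else self.trunk
        pr = PullRequest(number=number, head=head, base=base)
        self.layers.append(pr)
        return pr

    def merge_bottom(self) -> PullRequest:
        """Merge the lowest layer and retarget the next layer onto trunk."""
        if not self.layers:
            raise ValueError("stack is empty")
        bottom = self.layers.pop(0)
        bottom.merged = True
        if self.layers:
            self.layers[0].base = self.trunk
        return bottom

    def needs_rebase_after_change(self, number: int) -> list[int]:
        """PR numbers stacked above `number`; all must rebase if it changes."""
        numbers = [pr.number for pr in self.layers]
        return numbers[numbers.index(number) + 1:]


if __name__ == "__main__":
    stack = Stack(trunk="main")
    stack.push(101, "feature/schema")
    stack.push(102, "feature/api")
    stack.push(103, "feature/ui")

    # A change to the bottom layer ripples to every layer above it.
    print(stack.needs_rebase_after_change(101))  # [102, 103]

    # Merging the bottom layer retargets the next one onto trunk.
    stack.merge_bottom()
    print([(pr.number, pr.base) for pr in stack.layers])
    # [(102, 'main'), (103, 'feature/api')]
```

The model shows why cascade risk grows with stack depth: a change at layer k touches every layer above it, so a conflict at the bottom of a tall stack is the most expensive kind.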
Who Should Use Stacks Versus Who Should Just Refactor Harder
The workflow shines in large monorepos with heavy refactoring layers, which is exactly why Meta and Google standardized on it. Graphite was already serving thousands of engineering teams before native support arrived [4]. Yet trunk-based development purists argue that needing a stack is often a symptom of insufficient modularity; they prefer feature flags and waiting for lower layers to merge. Complaints from early Graphite and ghstack users include notification overload when a 12-PR stack floods the PR list, and the pain of complex merge conflicts rippling upward. GitHub’s implementation removes the biggest previous friction, third-party tools drifting from the native UI, but it is still maturing. Teams evaluating it should ask whether their typical changes can be decoupled, or whether the stack primitive genuinely matches their dependency patterns.
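One rough way to run that evaluation is to estimate how many layers a planned change would produce under a size budget. The sketch below greedily groups an ordered list of commits into layers of at most 400 changed lines, borrowing the rough threshold mentioned earlier. The `Commit` type, the `plan_layers` function, the sample data, and the use of 400 as a hard budget are all illustrative assumptions, not part of any tool discussed here.

```python
from typing import NamedTuple


class Commit(NamedTuple):
    sha: str            # placeholder identifier, not a real commit hash
    lines_changed: int


def plan_layers(commits: list[Commit], budget: int = 400) -> list[list[Commit]]:
    """Greedily group ordered commits into layers of at most `budget` lines.

    Commit order is preserved, so each layer depends only on earlier layers.
    A single commit larger than the budget ends up in a layer of its own.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    layers: list[list[Commit]] = []
    current: list[Commit] = []
    current_size = 0
    for commit in commits:
        # Close the current layer if adding this commit would exceed the budget.
        if current and current_size + commit.lines_changed > budget:
            layers.append(current)
            current, current_size = [], 0
        current.append(commit)
        current_size += commit.lines_changed
    if current:
        layers.append(current)
    return layers


if __name__ == "__main__":
    commits = [
        Commit("a1", 150),
        Commit("b2", 200),
        Commit("c3", 120),
        Commit("d4", 900),  # oversized: a candidate for splitting by hand
        Commit("e5", 40),
    ]
    for i, layer in enumerate(plan_layers(commits), start=1):
        size = sum(c.lines_changed for c in layer)
        print(f"layer {i}: {[c.sha for c in layer]} ({size} lines)")
```

If the plan comes back with a dozen layers, or with several oversized single-commit layers, that is a hint to decouple the change first rather than stack it.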
The rules of the game are changing because the platform finally understands a workflow that used to require workarounds. The open question is whether stacked PRs will become the comfortable default for mid-sized teams, or whether the extra discipline of keeping every change truly independent will win out in the long run. Your next large feature is the perfect place to run the experiment.
References
[1] GitHub Stacked PRs - https://github.github.com/gh-stack/
[2] facebookincubator/ghstack - https://github.com/facebookincubator/ghstack
[3] Kononenko et al., Code Review Quality: How Developers See It, ICSE 2016
[4] Graphite.dev Technical Documentation - https://graphite.dev
[5] GitHub Docs: Working with stacked pull requests - https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/working-with-stacked-pull-requests