Originally published on the MacArthur Lab blog.
It is an absolute pleasure to announce the official release of the gnomAD manuscript package. In a set of seven papers, published in Nature, Nature Medicine, and Nature Communications, we describe a wide variety of different approaches to exploring and understanding the patterns of genetic variation revealed by exome and genome sequences from 141,456 humans.
Publication announcements always feel a little strange in this new era of open science. In our case, the underlying gnomAD data set has been fully and publicly available for browsing and downloading since October 2016, and we’ve had the preprints available online since early 2019. However, it’s undeniable that there is something deeply gratifying about seeing these pieces of science revealed in their final, concrete form.
For me this package has a particular significance – it represents the culmination of seven and a half years of work with a phenomenal team at the Broad Institute, and marks my transition to a new role in Australia, and the handover of the gnomAD project to new leadership. So I wanted to spend some time in this post reflecting on the history of the project that became gnomAD, the people who’ve made it possible, and where things will go from here.
The idea of building a new database of genetic variation dates back to 2012, when I was a brand new faculty member in the newly established Analytic and Translational Genetics Unit at Massachusetts General Hospital and the Broad Institute. One of my first projects in that role was working with colleagues in Australia to analyze the exomes of a set of families affected by severe genetic muscle diseases. It rapidly became clear that one of the major limitations on analysis of these families was the quality and size of the available reference databases of normal genetic variation: existing resources like 1000 Genomes and ESP were still small, and had been generated using older data and analytic methods that made it hard to compare them to new high-quality patient exomes. That meant it was often hard to be sure whether a novel genetic variant seen in a patient was actually novel, or had just been missed in these databases.
At the same time, it became clear that we had access to an astonishing, and rapidly growing, resource of high-quality genetic data at the Broad Institute. Many projects focused on understanding the genetic basis of common, complex, adult-onset disorders like type 2 diabetes, heart disease, and schizophrenia had generated tens of thousands of exomes from cases and controls. That meant if we could find a way to perform variant-calling across all of these samples simultaneously, we could create an unprecedented resource of human protein-coding variation – and it was clear that this would be transformative for interpreting variation in rare disease patients.
In retrospect, deciding to take on a project of this scale as a junior faculty member was an act of sheer foolhardiness. But I got incredibly lucky on multiple levels. Firstly, I had steadfast support from senior investigators at the Broad who had generated the majority of the underlying exome data – particularly David Altshuler, Mark Daly, and Sekar Kathiresan – which made it feasible to create a consortium around this idea. Secondly, we had new methods being developed at the Broad that would ultimately enable high-quality variant-calling at enormous scale, and a team led initially by Mark DePristo, and then by Eric Banks and Tim Fennell, that was willing to continuously run and iterate on these methods until we got them right – a process that ended up taking 18 agonizing months. And finally, I was unbelievably fortunate to have Monkol Lek in my lab, who led the painstaking and repeated process of quality-checking and analyzing each new attempt until we finally had a call set that we trusted.
That call set became the Exome Aggregation Consortium (ExAC) release, which spanned just over 60,000 exomes. Following a down-to-the-wire push on the development of a new browser led by Konrad Karczewski, we were able to announce the public release of this dataset at ASHG in October 2014 – whereupon it was almost immediately adopted as the default reference data set for clinical genetics. And then, over the next 18 months, we worked on exploring the data set in a wide variety of ways, and thanks to the effort of a remarkable young team of analysts – particularly Kaitlin Samocha, Konrad Karczewski, and Eric Minikel – we were able to discover a bunch of cool stuff that we reported in a preprint that a year later became a published paper.
We all learned a tremendous amount from the ExAC experience. Firstly, it became clear that the data aggregation consortium model could work – investigators from all over the world had been willing to share their data from disease-specific studies to create a unified resource that benefited everyone. Secondly, we had learned many painful but important lessons about how to build and quality-control a data set of tens of thousands of sequenced humans. And finally, we had been justified beyond our wildest dreams in our view that rapid, open release of the underlying data was worthwhile: the ExAC browser had already garnered millions of page views and impacted the diagnosis of thousands of rare disease patients.
And of course we had also benefited from that open process. Every time someone looked at their favorite gene in ExAC, it was an opportunity for them to spot a mistake, and for us to try to fix it. There is no better approach to quality control than having thousands of very smart pairs of eyes crawl over your data looking for things that don’t make sense.
Even at the time that we released ExAC it was clear that we could go bigger – the methods that Eric Banks’ team had developed looked like they could scale to vastly larger numbers. It also seemed clear that going bigger would have enormous scientific value, because many of the analyses we were most interested in (like studying depletion of loss-of-function variants in individual genes) were underpowered even with the 60,000 exomes in ExAC. So the team almost immediately started planning for the next release.
This time around it was clear that in addition to substantially increasing the number of exomes in the call set, we’d have the opportunity to include some whole genomes as well – which would be great for fleshing out our understanding of variation in the non-coding sections of the genome, as well as for capturing structural variation. In early 2016 we set ourselves the ambitious target of creating this new call set in time for an announcement at ASHG in October that year.
Remarkably, we did it. Eric Banks personally oversaw the generation of massive new call sets of exomes and whole genomes, with contributions from an enormous number of players in the Broad’s Data Sciences Platform [1]. The initial quality control of the data was led by Konrad Karczewski, Laurent Francioli, and Kristen Larricchia, while the development of a new data browser was led by Matt Solomonson and Ben Weisburd. The intense work of assembling the corresponding metadata and securing appropriate permissions for all the samples, as well as a huge amount of overall coordination effort, fell on Jessica Alföldi. And finally, we benefited hugely from the creation of a new platform for rapid analysis and quality control of genomic data, Hail, built by Cotton Seed’s team at the Broad, with particularly helpful contributions to our work by Tim Poterba.
The entire process climaxed in a truly frantic week leading up to the ASHG announcement – a week that involved multiple computing cluster outages, last-minute paperwork challenges, sleepless nights, and unquestionably more code generated and deployed [2] per hour than at any other time in my lab’s history. And at the end, we were able to successfully announce the release of a new call set that, while not perfect [3], represented an enormous advance in terms of size and coverage over its ExAC precursor. Because the callset now contained whole genomes, we needed a new name: and thus the Genome Aggregation Database (gnomAD) v2 dataset [4] was born.
This desperate sprint to the public release at ASHG was incredibly stressful, but I think I can fairly say that it was also the proudest period of my academic career to date. Watching an amazing team – many of whom had only recently started in the lab – come together and work like hell to build a massive and complicated resource and push it out into the world for others to use was intense, and inspiring. This is not a process I would care to repeat often, but I’m glad I got to experience it alongside such a fantastic team.
With the callset out in the world and already being heavily used, we were very keen to start doing science on it. However, before we could get there we needed to do a more comprehensive and thorough round of quality control over the data set, beyond the very basic QC we’d completed prior to the first public release. Because of the scale and complexity of the data set, this ended up taking a loooong time, in iterative cycles of quality control and analysis, led heroically by Laurent Francioli, Konrad Karczewski, and Grace Tiao, culminating in the release of gnomAD v2.1 in October 2018, two years (!) after that initial release. That was a marathon effort, which I’ll note would have been vastly longer without the expertise of the analysis team and the amazing computational speed-ups provided by the Hail platform.
During this time our group was also busily getting to know the challenges and power of the new data set, teasing at scientific threads that were slowly assembled into the gnomAD manuscripts released today. For instance, Konrad developed approaches for analyzing predicted loss-of-function (pLoF) variants, and built on Kaitlin Samocha’s prior work in ExAC to refine methods for determining which genes appear to be missing pLoF variants in the population and are thus likely to be particularly important – work that became the backbone of the gnomAD flagship paper, which describes the dissection of human genes into classes based on their intolerance towards genetic disruption.
Our team had already spent considerable time thinking about whether the pLoF variants Konrad had identified might be valuable for identifying and validating candidate targets for therapeutic drugs, leading ultimately to a paper led by Eric Minikel describing general principles for pLoF-guided drug target validation. We also investigated whether it might be possible to obtain clinical information for carriers of rare gene-disrupting variants in gnomAD to explore specific therapeutic hypotheses. Irina Armean from my team worked with Jessica Alföldi to pull together such information for carriers of pLoF variants in the LRRK2 gene across six gnomAD cohorts; and the baton was then passed to Nicky Whiffin, who drove this project home, bringing together data sets from 23andMe (thanks to Aaron Kleinman and Paul Cannon) and UK Biobank to show clearly that such variants are not associated with adverse clinical phenotypes – good news for multiple companies currently working on LRRK2 inhibitors for Parkinson’s disease.
In complementary work to our analysis of small variants (single nucleotide changes and small insertions/deletions), Mike Talkowski’s group at Broad/MGH, particularly Ryan Collins and Harrison Brand, had developed and applied a new set of tools for calling structural variants – large deletions, duplications, insertions and rearrangements of DNA. After a truly staggering amount of work validating the accuracy of their method, they were able to apply it to nearly 15,000 whole genomes overlapping with the gnomAD set to create what is easily the highest-resolution map to date of human structural variation.
The gnomAD data set proved to be fertile ground for testing new approaches to variant interpretation. Beryl Cummings had a long-standing interest in integrating variant and transcript-level expression data, which led to a new method for assessing whether individual variants in a gene are highly expressed across tissues using data from the GTEx project. We also explored particular subsets of genetic variation in the dataset: Qingbo Wang led a comprehensive analysis of the mutational origins and functional effects of multinucleotide variants (clusters of nearby variants that have been inherited together, and can greatly confuse annotation tools); and Nicky Whiffin from Imperial College London spent time in the lab using the gnomAD dataset to pursue her favorite class of variants, those that create or destroy upstream open reading frames in protein-coding transcripts, potentially leading to complete loss of function of the affected genes. (Nicky has a separate blog post about her thoughts on the gnomAD package here.)
This period also saw major improvements to the gnomAD browser under the leadership of Nick Watts. This included key additions like a separate view for structural variation, the addition of new tracks for base-level RNA expression and ClinVar pathogenic variants, and major overhauls of the back end and overall functionality of the browser.
I’m not going to say a huge amount more about the gnomAD manuscripts here – they’re all open access, so you can read them yourself, at the links above – but I did want to say a little about the process of assembling the package.
It’s fair to say that our original plan for these papers was substantially less ambitious. However, as we began to assemble the materials for a gnomAD scientific release it became clear that we had vastly more content than we could fit into one or two papers, and thus probably justified a coordinated package of manuscripts. Over time, the package idea began to accumulate unexpected bits of science: a great example was Eric Minikel’s paper on the use of LoF variants to explore drug targets, which began as a blog post in late 2018, which Eric then substantially expanded and refined with new analyses into a standalone manuscript. (Eric has more thoughts on the background of his paper in his own blog post.)
We benefited enormously from many conversations with Orli Bahcall at Nature, who agreed that a package seemed worthwhile, initially received and edited the papers, handled all of the initial peer review, and helped to shape the manuscripts into a coherent structure. The final sprint to publication, and the assembly of the final package, was then handled by Michelle Trenkman and Kate Gao. Peer review of the papers was tough, but generally fair – if you want to get a sense of how this went, you can read the reviews and responses yourself, since we agreed to serve as test cases for Nature’s first pilot of open peer review.
The overall publication process has been long – the first of the papers was initially submitted (to Nature, and simultaneously as a preprint to bioRxiv) on January 25th, 2019 – although given the scale and complexity of the papers it is difficult to imagine a much shorter time-frame. It is certainly a powerful reminder of the value of preprint servers like bioRxiv, which has allowed continuously updated versions of all of these papers to be publicly and freely available for the ~16 months between submission and publication.
I’m hugely proud of the final product, which represents the work of an incredibly talented team over many years, and is a vastly more successful outcome than a naïve young faculty member considering genomic data aggregation in 2012 could ever have reasonably hoped for. We were fortunate to be in the right place, with the right data, at the right time to make genomic data aggregation possible at scale, but more importantly I was ridiculously lucky to have been able to work with so many of the right people.
Without the principal investigators who contributed their hard-earned and high-quality data; without the software engineers and computational biologists from the Broad’s Data Sciences Platform and Hail team who built and ran the tools necessary to generate and quality-control data on this scale; without the team of sharp and impossibly hard-working postdocs and scientists who reviewed the data, and came up with clever scientific questions to ask and ways to answer them; without the project managers, and product owners, and associate directors who made sure that everything actually worked; without all these people, gnomAD wouldn’t have existed.
The team has not been idle since the release of the gnomAD v2 dataset. In 2019 we were able to release a new gnomAD whole-genome release, gnomAD v3, which included over 70,000 genomes remapped to the newest human genome reference, GRCh38, and was expertly guided to release by Laurent Francioli. There is much more data to come: the next planned gnomAD release will be a gargantuan GRCh38 release of exomes, hopefully spanning more than 400,000 samples. These two data sets in combination will be the foundation for a truly staggering amount of future research exploring the patterns of variation and constraint across both coding and non-coding regions.
As many of you know, 2019 was also a big year for me in other ways: I decided to move back home to my native Australia, to take up a role as director of the newly established Centre for Population Genomics, a joint initiative between the Garvan Institute of Medical Research in Sydney and Murdoch Children’s Research Institute in Melbourne. This is a chance to take the lessons we learned from seven years of ExAC and gnomAD in a variety of new directions, including developing more comprehensive maps of variation in a variety of understudied human populations, and exploring how to incorporate population-scale genomics into a functional healthcare system. (If you’re interested in working in sunny climes on large-scale population genomics – we’re hiring.)
My decision to leave Boston after nearly eight years wasn’t an easy one, but it was made much easier by my ability to hand the leadership of gnomAD over into very safe hands: new gnomAD consortium co-directors Heidi Rehm and Mark Daly. Heidi and Mark are both awesome, they bring highly complementary skills in clinical variant interpretation and large-scale statistical genetics, and they both share the fundamental gnomAD philosophy around the importance of open data sharing. With Heidi and Mark in charge, together with a council of scientific advisors spanning a wide range of scientific expertise, the project is well-placed to continue to generate new data, and exciting new science, for many years to come.
1. Special DSP shout-outs to Kathleen Tibbetts, Charlotte Tolonen, Yossi Farjoun, Laura Gauthier, Dave Shiga, Jose Soto, George Grant, Sam Novod, Valentin Ruano-Rubio, and Bradley Taylor.
2. A non-trivial amount of this code was written by Konrad in the back of taxis, and one particularly memorable coding session involved Laurent working while on a whale-watching expedition boat.
3. Thanks to an inadvertent use of range(1,22), the call set was missing all variants on chromosome 22 for the first few hours of its public release. As far as we can tell, no-one noticed.
4. The rationale behind naming this first gnomAD release “gnomAD v2” is now rather murky, but I recall it made sense at the time. ¯\_(ツ)_/¯
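
For readers curious about the range(1,22) slip mentioned in the notes above: assuming the code in question was Python, the end point of range is exclusive, so range(1, 22) yields 1 through 21 and silently skips 22. The sketch below is only an illustration of that off-by-one; the function names are hypothetical and this is not the actual gnomAD pipeline code.

```python
def autosomes_buggy():
    # Hypothetical helper: range(1, 22) stops at 21, so chromosome 22 is skipped.
    return [str(n) for n in range(1, 22)]


def autosomes_fixed():
    # The end point is exclusive, so 23 is needed to include chromosome 22.
    return [str(n) for n in range(1, 23)]


# Compare the two lists to see which chromosome the buggy version drops.
missing = sorted(set(autosomes_fixed()) - set(autosomes_buggy()))
print(missing)  # ['22']
```

Because the loop runs without any error, a mistake like this only surfaces when someone checks the output for the missing chromosome.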