Posts by gnomAD Production Team

Using the gnomAD ancestry principal components analysis loadings and random forest classifier on your dataset

By popular request, we are now releasing the ancestry principal components analysis (PCA) variant loadings and accompanying random forest (RF) model used for global ancestry inference in gnomAD v2 and v3. This post discusses how those files were generated and how they can be used on another dataset. These resources are not appropriate for every dataset, so we also discuss the caveats of using the loadings and the RF model.
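To make the idea of reusing PCA loadings concrete, here is a minimal sketch of projecting one new sample onto precomputed principal components. It assumes the common convention of normalizing alternate-allele counts by the reference allele frequency under Hardy-Weinberg equilibrium; that normalization, the function name `project_sample`, and the input layout are illustrative assumptions, not a description of the released files or of the exact gnomAD pipeline.

```python
import math


def project_sample(alt_counts, loadings, allele_freqs):
    """Project one sample onto precomputed PCs (illustrative sketch).

    alt_counts   -- per-variant alternate allele count (0, 1, 2) or None if missing
    loadings     -- per-variant list of k loadings from the reference PCA
    allele_freqs -- per-variant alternate allele frequency in the reference data
    """
    n_variants = len(loadings)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for gt, loading, af in zip(alt_counts, loadings, allele_freqs):
        # Skip missing genotypes and monomorphic sites (assumed handling).
        if gt is None or af <= 0.0 or af >= 1.0:
            continue
        # Center by the expected count 2*af and scale by the HWE variance.
        normalized = (gt - 2.0 * af) / math.sqrt(n_variants * 2.0 * af * (1.0 - af))
        for j in range(n_pcs):
            scores[j] += loading[j] * normalized
    return scores


# Toy input: two variants, two PCs, reference allele frequency 0.5 at both sites.
toy_scores = project_sample([2, 0], [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

The resulting scores would then be passed to a classifier trained on the same PCs. Variants present in the loadings but absent from the new dataset contribute nothing here, which is one reason projected scores can be biased when the variant overlap is poor.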

gnomAD v3.1

Today, the gnomAD Production Team is proud to announce the release of gnomAD v3.1, an update to our previous genome release. The v3.1 dataset adds 4,454 genomes, bringing the total to 76,156 whole genomes mapped to the GRCh38 reference sequence. (Our most recent exome release is available in gnomAD v2.1.)

Despite the minor numbering of this release, we bring you an update filled with firsts.

For the first time, we:

  • Provide individual genotypes in addition to variant calls for a subset of gnomAD. This highly diverse subset includes new data from >60 distinct populations from Africa, Europe, the Middle East, South and Central Asia, East Asia, Oceania, and the Americas
  • Provide and display data from samples of Middle Eastern ancestry
  • Display read data visualizations for non-coding variants—an effort that required the generation of visualizations for over 2.5 billion genotypes observed in this release
  • Display manual curations for predicted loss-of-function variants on the gnomAD browser
  • Generate the dataset by incrementally adding new samples onto an already-existing callset, eliminating the time and cost typically required to re-call existing samples
  • Make all gnomAD data—for this release as well as previous releases—freely available for download or export on three cloud providers: Amazon Web Services, Microsoft Azure, and Google Cloud

We’re also putting the final touches on our first-ever mitochondrial variant release on v3.1, which is coming very soon.

Requester-Pays Notice to Users

Last month the gnomAD project was billed thousands of dollars in cloud egress charges—above and beyond our normal expected costs—for users who were accessing Hail-formatted public gnomAD data. The vast majority of this excess cost was due to users spinning up machines in international regions and reading data from our US-region storage bucket.

As a result, we have decided to move gnomAD Hail tables and matrix tables to a requester-pays bucket, while keeping the VCFs and smaller public files free to download as usual. We decided to do this for the following reasons:

  • From our beginnings as a project, we have been committed to making gnomAD data as free and accessible to the world as humanly possible. We pay for each VCF download of our data, and we have resisted proposals to add gating mechanisms (such as click-through agreements) to our data. We want to reaffirm our commitment to our users by continuing to make VCFs free to download to our growing user base.

  • However, to maintain gnomAD, we must keep costs as low as possible and fund aspects of gnomAD that benefit the widest user base. Providing free access to the Hail-formatted versions of the data is very costly and benefits only a small proportion of our user base—those running cloud pipelines on the data. Therefore, we have decided to require users to supply Google Cloud billing information when they access Hail versions of gnomAD.
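For users running Hail on Google Cloud, supplying billing information typically means telling the Cloud Storage connector which project to charge. The sketch below builds the Spark settings for that. The helper name `requester_pays_spark_conf` and the project ID `my-billing-project` are hypothetical, and the exact configuration recommended for gnomAD may differ, so treat this as an assumption to check against the current Hail and gnomAD documentation.

```python
def requester_pays_spark_conf(billing_project, mode="AUTO"):
    """Build Spark settings that enable requester-pays reads from Cloud Storage.

    billing_project -- ID of the Google Cloud project to bill (hypothetical here)
    mode            -- requester-pays mode understood by the Cloud Storage connector
    """
    allowed_modes = {"AUTO", "CUSTOM", "ENABLED", "DISABLED"}
    if mode not in allowed_modes:
        raise ValueError(f"mode must be one of {sorted(allowed_modes)}")
    if not billing_project:
        raise ValueError("a billing project ID is required")
    return {
        "spark.hadoop.fs.gs.requester.pays.mode": mode,
        "spark.hadoop.fs.gs.requester.pays.project.id": billing_project,
    }


conf = requester_pays_spark_conf("my-billing-project")

# Assumed usage with Hail (not executed here):
#   import hail as hl
#   hl.init(spark_conf=conf)
```

Running the cluster in the same region as the storage bucket avoids the cross-region egress charges described above, whichever project ends up paying for the reads.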