This minor release, gnomAD v3.1.2, includes two sets of improvements over v3.1: a fix to the homozygous alternate allele depletion adjustment that was introduced in v3.1, and several updates to the gnomAD v3.1 Human Genome Diversity Project (HGDP) and 1000 Genomes (1KG) subset release.

Fix to homozygous alternate allele depletion adjustment 

For the v3.1 release, we applied a genotype adjustment to correct a technical artifact found in the v3 release that resulted in a depletion of homozygous alternate alleles (see here for more details). In brief, this adjustment converted heterozygous genotypes with highly skewed allele balances (AB > 0.9) to homozygous alternate genotypes at common sites (AF > 0.01). We later discovered that this adjustment inadvertently caused some samples with heterozygous non-reference genotypes to have both a heterozygous and a homozygous alternate genotype call at the same locus, which implies that the sample was triploid at that site. This happened because we did not initially confirm that the heterozygous genotype contained only one alternate allele. For the 2,783,250 (0.37%) adjusted variants with a sample that falls into this category, we have corrected the frequencies by retaining the original, unadjusted heterozygous non-reference genotype. These corrections are now reflected in the browser, in the v3.1.2 sites Hail Tables and VCFs, and in the frequencies and genotypes of the v3.1.2 HGDP + 1KG subset Hail MatrixTables and VCFs.
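The corrected rule can be illustrated with a minimal sketch in plain Python. This is not the gnomAD production code: the function name, the tuple representation of a genotype, and the assumption that the check runs on the original unsplit genotype are all illustrative. Only the AB > 0.9 and AF > 0.01 cutoffs come from the description above.

```python
from typing import Tuple

# Hypothetical representation: a diploid genotype as two allele indices,
# where 0 is the reference allele and 1, 2, ... are alternate alleles.
Genotype = Tuple[int, int]


def adjust_hom_alt_depletion(
    gt: Genotype,
    allele_balance: float,
    allele_freq: float,
    ab_cutoff: float = 0.9,
    af_cutoff: float = 0.01,
) -> Genotype:
    """Convert a skewed het call to hom alt, but only for ref/alt hets."""
    low, high = sorted(gt)
    # The fix: require exactly one alternate allele (a ref/alt het).
    # A het non-ref call such as (1, 2) is returned unchanged.
    is_ref_alt_het = low == 0 and high != 0
    if is_ref_alt_het and allele_balance > ab_cutoff and allele_freq > af_cutoff:
        return (high, high)
    return gt
```

For example, `adjust_hom_alt_depletion((0, 1), 0.95, 0.2)` yields a homozygous alternate call, while a heterozygous non-reference genotype such as `(1, 2)` is left untouched regardless of its allele balance.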

Improvements to the HGDP + 1KG subset release

After our initial release of the gnomAD v3.1 HGDP + 1KG subset, we realized there were some changes we could make to improve the usability of this dataset. The v3.1.2 release makes the changes described in the following sections.

Restructuring of the HGDP + 1KG release files

We initially released the gnomAD v3.1 HGDP + 1KG subset as a dense Hail MatrixTable (MT), with corresponding VCFs. One of the major use cases for releasing the full genotype data for this subset is enabling our users to easily combine it with their own callsets for joint analysis. However, our previous release of the dense MT, in which multi-allelic positions were already split, made this difficult.

In order to enable users to combine this subset with another dataset using Hail’s VCF combiner on two sparse MTs, we have added the release of a raw Hail sparse MT without split multi-allelic variant calls. This sparse MT does not include sample QC and variant QC annotations; only annotations derived from the original gVCFs are present. There are also no adjustments to genotypes in the sparse MT. The two genotype adjustments that we make prior to frequency calculations, and for the dense MT release, are the fix for the homozygous alternate allele depletion described above as well as converting XY samples to haploid on non-PAR regions of chromosomes X/Y and XX samples to missing on chromosome Y. A final difference between the dense MT and the new sparse MT is that we did not filter any variants from the sparse MT (more discussion below on variants that we recommend filtering). Removal of variants from a sparse MT can lead to the removal of reference block END information that is needed for proper densification of the dataset. For users who do not want to combine another dataset with this subset and would prefer to use a dense MT, we are still releasing a dense MT with all sample and variant annotations included, and with the same updates to the dataset as described below.
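As a rough illustration of the sex chromosome ploidy adjustment mentioned above, the sketch below uses plain Python rather than Hail. The function name, the karyotype labels, and the `in_par` flag are hypothetical. Setting heterozygous calls in XY samples to missing on non-PAR regions is an assumption made for this sketch and is not stated in the text above.

```python
from typing import Optional, Tuple

# Hypothetical representation: a call is a tuple of allele indices
# (length 2 for diploid, length 1 for haploid), or None when missing.
Call = Optional[Tuple[int, ...]]


def adjust_sex_ploidy(gt: Call, sex_karyotype: str, contig: str, in_par: bool) -> Call:
    """Adjust a genotype call on chrX/chrY according to the sample's karyotype."""
    if gt is None or contig not in ("chrX", "chrY"):
        return gt
    # XX samples are set to missing on chromosome Y.
    if sex_karyotype == "XX" and contig == "chrY":
        return None
    # XY samples become haploid on non-PAR regions of chrX and chrY.
    if sex_karyotype == "XY" and not in_par and len(gt) == 2:
        if gt[0] != gt[1]:
            # Assumption for this sketch: a het call cannot be haploid,
            # so it is treated as missing.
            return None
        return (gt[0],)
    return gt
```

Passing `in_par` as an argument keeps the sketch independent of any particular set of PAR coordinates.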

Alongside the combinable sparse MT, we are also releasing a separate sample annotation Hail Table (HT) and variant annotation HT. The sample annotation HT includes the previously released gnomAD sample QC metrics, as well as additional sample metadata and refined sample QC metrics that account for subcontinental structure. Note that the variant annotation HT splits multi-allelic variants, so users who would like to annotate the sparse, unsplit MT with this HT will need to split the sparse MT first. There are also two variant annotations (described here) in the variant annotations HT that are not present on the dense MT, as we exclude these variants from the dense MT (these variants are also excluded from the full gnomAD sites HT). We remove variants within a telomere or centromere region and variants that fall below a low quality threshold (LowQual). We highly recommend that users also apply these filters (after densification if using a sparse MT). The LowQual annotation is similar to the GATK LowQual filter. The difference is that GATK computes this annotation at the site level, which uses the least stringent prior for mixed sites, but the LowQual annotation that we include is allele-specific.
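To make the note about splitting multi-allelic variants more concrete, here is a simplified plain-Python sketch of what splitting does to a single genotype. It is not Hail's implementation: the function name and the row dictionaries are illustrative, and details such as reducing alleles to their minimal representation are omitted.

```python
from typing import Dict, List, Optional, Tuple


def split_multiallelic(
    ref: str, alts: List[str], gt: Optional[Tuple[int, int]]
) -> List[Dict]:
    """Produce one biallelic row per alternate allele of a multi-allelic site."""
    rows = []
    for a_index, alt in enumerate(alts, start=1):
        if gt is None:
            split_gt = None
        else:
            # Downcode: the allele for this row becomes 1, all others become 0.
            split_gt = tuple(sorted(1 if allele == a_index else 0 for allele in gt))
        rows.append({"alleles": (ref, alt), "a_index": a_index, "GT": split_gt})
    return rows
```

With this downcoding, a heterozygous non-reference genotype `(1, 2)` at a site with two alternate alleles becomes `(0, 1)` on both of the resulting biallelic rows.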

Inclusion of all samples in the HGDP + 1KG subset release

The initial v3.1 dense MT release only included the HGDP + 1KG samples that passed gnomAD’s full callset sample QC filters. The sample QC methods used for all of gnomAD were designed to work on a dataset with global diversity and to be feasible on a large dataset. However, we discovered that our approach for sample QC metric outlier filtering unintentionally excluded whole populations within the HGDP + 1KG subset that were the most genetically unique and had small sample sizes compared to other populations within the gnomAD v3.1 callset (specifically: San, Mbuti Pygmy, Biaka Pygmy, Bougainville, and Papuan). This occurred because the sample QC metric differences of these populations were not adequately captured by regressing out the first 8 global ancestry assignment PCs, causing these samples to appear as sample QC metric outliers.

To address this sample QC filtering issue and provide the most comprehensive dataset for our users, we have decided not to remove any samples from the HGDP + 1KG subset release files, allowing users to filter samples at their own discretion. We also include a new subcontinental-aware sample QC filter (‘high_quality’) that we recommend using prior to downstream analyses. This filter still removes samples that fail gnomAD’s hard filters, but keeps samples previously identified as sample QC metric outliers. It therefore excludes samples that are clearly of poor quality while ensuring the inclusion of these genetically unique populations. Additionally, the ‘high_quality’ filter excludes samples that are identified as outliers in subcontinental principal component analysis (PCA). The subcontinental PCA was performed by first splitting the full HGDP + 1KG subset by genetic region based on continental ancestry. Then, for each region, the sample set was divided into related and unrelated samples, and PCA was run on the unrelated samples, followed by projecting the related samples onto the PCA space defined by the unrelated samples. We determined outliers from the first 8 subcontinental PCs by visually identifying specific individuals that defined entire PCs (e.g. a 1KG sample labeled as African American (ASW) with no African ancestry), excluding only 22 individuals across nine different HGDP/1KG population labels. In addition to the new recommended filter, we still provide sample QC information from the full gnomAD v3.1 callset in case users would like to filter the HGDP + 1KG subset to only samples included in the full gnomAD callset.
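The logic of the ‘high_quality’ filter can be summarized in a small plain-Python sketch. The dictionary keys `hard_filtered`, `qc_metric_outlier`, and `subcontinental_pca_outlier` are hypothetical names chosen for illustration and are not the annotation names in the release files.

```python
from typing import Dict


def is_high_quality(sample: Dict[str, bool]) -> bool:
    """Return True if a sample passes the subcontinental-aware filter.

    A sample passes when it was not removed by the hard filters and is not
    a subcontinental PCA outlier. Its gnomAD-wide sample QC metric outlier
    status is deliberately ignored.
    """
    return not sample["hard_filtered"] and not sample["subcontinental_pca_outlier"]


# A sample flagged only as a gnomAD-wide QC metric outlier is still kept.
example = {
    "hard_filtered": False,
    "qc_metric_outlier": True,
    "subcontinental_pca_outlier": False,
}
print(is_high_quality(example))
```

The key design point is that the gnomAD-wide outlier flag plays no part in the decision, which is what keeps the genetically unique populations in the dataset.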

Along with all HGDP and 1KG samples, we also included a synthetic-diploid (syndip) sample (a mixture of DNA from two haploid CHM cell lines) in the subset. In gnomAD, we use this sample to benchmark variant QC by comparing the genotypes in our dataset to the truth data derived from PacBio assemblies of this sample, and we hope that inclusion in this subset will also be useful for our users.

Modification of the HGDP + 1KG subset allele frequency

The HGDP + 1KG v3.1 subset release allele frequency annotation (allele count, allele number, allele frequency, and homozygote count) included all samples released in the dense MT. It was therefore restricted to samples that passed the gnomAD-wide sample QC, and it included related samples within the HGDP + 1KG subset. In the v3.1.2 release, the subset frequency calculations were performed only on samples that pass our current recommended sample QC filter (described above). Additionally, we now exclude samples that are inferred to be closely related to other samples within the HGDP + 1KG subset to improve the accuracy of population frequency estimates.
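As a simplified illustration of how the four frequency annotations depend on the set of included samples, consider the plain-Python sketch below for a single biallelic variant. The function name, the input layout, and the output keys are hypothetical, and the `keep` set stands in for the samples that pass the recommended filter and are not excluded for relatedness.

```python
from typing import Dict, Optional, Set, Tuple


def call_stats(
    genotypes: Dict[str, Optional[Tuple[int, ...]]], keep: Set[str]
) -> Dict[str, Optional[float]]:
    """Compute AC, AN, AF, and homozygote count over the kept samples only."""
    ac = an = hom = 0
    for sample, gt in genotypes.items():
        # Skip excluded samples and missing calls.
        if sample not in keep or gt is None:
            continue
        n_alt = sum(1 for allele in gt if allele == 1)
        an += len(gt)  # haploid calls contribute one allele, diploid two
        ac += n_alt
        if len(gt) == 2 and n_alt == 2:
            hom += 1
    af = ac / an if an else None
    return {"AC": ac, "AN": an, "AF": af, "homozygote_count": hom}
```

Changing the contents of `keep` changes all four values, which is why the v3.1 and v3.1.2 subset frequencies can differ at the same variant.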

To exclude related individuals from the frequency calculations, we obtained kinship estimates for all pairs of samples within the subset. We used a conservative kinship estimate threshold of >0.05 to consider sample pairs as related. Related individuals were then pruned from the dataset using Hail’s maximal_independent_set method to retain the maximal number of individuals in the dataset.
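The pruning step can be pictured with a small greedy sketch in plain Python. This is an illustrative stand-in and not Hail's implementation: the function name and input layout are hypothetical, and a greedy approach like this always returns a set of mutually unrelated samples but is not guaranteed to return the largest possible one. Only the >0.05 kinship threshold comes from the text above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def prune_related(
    samples: Iterable[str],
    kinship: Dict[Tuple[str, str], float],
    threshold: float = 0.05,
) -> Set[str]:
    """Greedily drop samples until no related pair remains."""
    # Build an undirected graph with an edge for each related pair.
    neighbors = defaultdict(set)
    for (a, b), kin in kinship.items():
        if kin > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    kept = set(samples)
    while kept:
        # Pick the kept sample with the most related kept samples
        # (ties broken by sample name so the result is deterministic).
        worst = max(kept, key=lambda s: (len(neighbors[s] & kept), s))
        if not neighbors[worst] & kept:
            break  # no related pairs are left
        kept.remove(worst)
    return kept
```

For a chain of relatives a-b-c, removing the middle sample b keeps both a and c, whereas removing a first would force a second removal.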

Addition of a help page describing annotations on the release files

There is now a full description of all sample and variant annotations found on the HGDP + 1KG release HTs and dense MT. These descriptions are also added as global annotations on each HT/MT for easy lookup while working with them. Within the global annotations, we also provide specific cutoffs used for QC filters, each with an associated description.

The new release includes some additional sample metadata that is specific to the HGDP + 1KG subset. We now have annotations for the global population labels (those defined by each individual study as well as an annotation that is harmonized across both studies), the approximate latitude and longitude of the geographical place of origin of the population, and technical considerations for HGDP detailed by Bergström et al. As discussed in the section on frequency calculations, we have inferred relatedness within the subset, so we provide a sample annotation indicating all samples in the subset with inferred relatedness (kinship >0.05), their associated relationship metrics provided by Hail’s pc_relate method, and whether each sample is excluded from variant frequency calculations due to relatedness. HGDP + 1KG subset-specific global and subcontinental PCs were added to complement the gnomAD-wide global ancestry PCs.