Using the gnomAD genetic ancestry principal components analysis loadings and random forest classifier on your dataset

By popular request, we are now releasing the genetic ancestry principal components analysis (PCA) variant loadings and accompanying random forest (RF) model used for genetic ancestry group inference in gnomAD v2 and v3. This post discusses how those files were generated and how they can be used on another dataset. However, the use of these resources will not be appropriate for all datasets, and therefore we are including a discussion of the caveats associated with using these loadings and the RF model.

How the PCA loadings and RF model were generated

Details on genetic ancestry group assignment are available in the blog posts for gnomAD v2 and v3. We first run PCA on all unrelated (less than second-degree inferred relatedness) samples using the hwe_normalized_pca Hail function on a set of high-quality, autosomal, bi-allelic, and LD-pruned single nucleotide variants, followed by projection of the related samples onto these principal components (PCs). Then we train a RF classifier using genetic ancestry relevant PCs as features on samples with “known” (see caveats below) ancestry and infer a genetic ancestry group for all samples using this RF.

How to use the loadings and RF model on your own dataset

The loadings and RF model can be used for genetic ancestry group assignment on another dataset by first projecting the dataset’s genotypes onto pre-computed PCs, which can be done using Hail’s pc_project module with the gnomAD loadings. Following this projection, genetic ancestry group assignments can be made using the RF model and the projected PC scores — for example, by supplying the gnomAD Hail utilities assign_population_pcs function with the gnomAD RF model as the fit parameter. An example of this workflow is provided in our gnomAD QC GitHub repository.

Important caveats to consider

The variants used for genetic ancestry group inference in gnomAD were chosen based on our specific dataset (these were the variant choices for v2 and v3). This does not mean that these variants are always the ideal variants for determining genetic ancestry, and caution should be taken before deciding to use these loadings and genetic ancestry inference RF model on your own dataset.
The RF model is trained using gnomAD samples that have a “known” ancestry label associated with them within gnomAD’s available metadata. This “known” ancestry label is compiled by many different gnomAD Principal Investigators for various projects, so there is no guarantee that they were collected in the same way (e.g. some of these are self-reported labels and others are provider-reported labels). Therefore, if you have a dataset with a subset of reliable and consistent ancestry labels, it may be best for you to use those ancestry labels to build a RF model from your dataset as we did for gnomAD. It might also be useful to run the projection onto gnomAD PC space and compare how the two RF models perform on a withheld subset of samples with “known” ancestry labels.
If your dataset is missing large numbers of the variants with gnomAD PCA loadings, it will produce unreliable results. When projecting your dataset into the gnomAD genetic ancestry PC space, this can cause the projected samples to have scores that shrink toward 0. This shrinkage can result in problems when those projected PCs are then used in the RF model for genetic ancestry inference.
When using the gnomAD loadings and RF model on a dataset that includes genetic ancestries that are not well represented (or not represented at all) in gnomAD, the RF model will likely mis-classify samples from those ancestries that are underrepresented or missing from gnomAD. This is likely to particularly affect samples from regions that are very poorly sampled in gnomAD, such as Oceania, South-East Asia, the Middle East, and Africa, but will also affect other groups.
The workflow described above will always return genetic ancestry inference results, but it is difficult to determine if the gnomAD RF model is actually doing an adequate job on genetic ancestry inference in your specific dataset. When we run this RF model for the gnomAD genetic ancestry group assignment, we can explicitly measure performance by holding back a certain percentage of samples with known ancestry labels from the training set and use them to evaluate the RF model classification.

Use of the HGDP + 1KG subset to build a custom RF

As an alternate option to directly using the gnomAD loadings and RF model, a custom RF model can be designed by combining a dataset with the HGDP + 1KG subset. After combining the two datasets, either new variants can be chosen for a genetic ancestry PCA (limiting to high call rate within the reference panel and one’s own dataset, thereby avoiding the shrinkage problem), or the PC projection described above can be used, but with the same caveats associated with using the gnomAD loadings. An RF model can then be trained using the HGDP + 1KG known ancestry labels, and applied to the samples to obtain genetic ancestry group assignments using gnomAD Hail utilities assign_population_pcs with no fit parameter and the HGDP + 1KG known ancestry supplied as the known_col parameter. When training a new RF model on the known labels, you can measure performance by holding back labeled samples in the training and assessing accuracy on the set that is withheld (using the prop_train parameter in assign_population_pcs). Note that this method will still only work for classifying ancestries that are well represented by the HGDP + 1KG subset, and this subset is also only available for the GRCh38 reference genome.

Updated on August 1, 2023. Exported RF models in the ONNX format which provides more maintainability and security than pickle. Changed uses of ancestry to genetic ancestry and global ancestry assignment to genetic ancestry group assignment.