Truth Challenge V2: Calling Variants from Short and Long Reads in Difficult-to-Map Regions

This challenge calls on the public to assess variant calling pipeline performance on a common frame of reference, with a focus on benchmarking in difficult-to-map regions, segmental duplications, and the Major Histocompatibility Complex (MHC).

  • Starts
    2020-05-01 21:00:00 UTC
  • Ends
    2020-06-16 03:00:59 UTC

PrecisionFDA partnered with the Genome in a Bottle (GIAB) consortium, led by the National Institute of Standards and Technology (NIST), to launch the Truth Challenge V2: Calling Variants from Short and Long Reads in Difficult-to-Map Regions. The challenge focused on difficult-to-map regions, segmental duplications, and the Major Histocompatibility Complex (MHC).

Period: May 1 to June 15, 2020.

Total submissions: 64 from 20 teams using data from 3 short-read and/or long-read sequencing technologies.

Introductory Remarks

The Genome in a Bottle (GIAB) consortium recently used linked and long reads to develop an expanded benchmark for the reference sample HG002, the son of an Ashkenazi trio from the Personal Genome Project (described in the preprints here and here). Similar benchmarks were also developed for the parents of HG002 (HG003 and HG004). This presented a unique opportunity to launch a challenge focused on variant calling in challenging regions of GRCh38 before the v4 HG003 and HG004 truth data were released. By providing datasets for HG002, HG003, and HG004, together with the GA4GH Benchmarking framework, the challenge gave participants a common frame of reference for measuring pipeline performance in “difficult-to-map” regions, such as segmental duplications, and in the medically important, highly polymorphic Major Histocompatibility Complex (MHC). For each of the three genomes, participants were provided three input datasets: whole genome sequencing from Illumina, PacBio, and Oxford Nanopore (ONT). Participants were asked to process one or more of the three datasets through their variant calling pipelines and create VCF files.

The GIAB benchmark sets for HG003 and HG004 (truth data) were withheld from participants until the submission period closed, while the HG002 truth data were available throughout the challenge. After the challenge, the evaluation team compared participants’ HG003 and HG004 VCF files against the GIAB benchmarks to evaluate performance in three categories: (1) the MHC, (2) “difficult-to-map” regions and segmental duplications, and (3) all benchmark regions (Table 1).

Table 1. Training and test inputs and outputs for the challenge

Datasets | Input (FASTQ)                           | Output (VCF)
Training | HG002 ONT, HG002 PacBio, HG002 Illumina | HG002 ONT, HG002 PacBio, HG002 Illumina
Testing  | HG003 ONT, HG003 PacBio, HG003 Illumina | HG003 ONT, HG003 PacBio, HG003 Illumina
Testing  | HG004 ONT, HG004 PacBio, HG004 Illumina | HG004 ONT, HG004 PacBio, HG004 Illumina

Overview of Results

The challenge ran for 6 weeks, from May 1 to June 15, 2020. A total of 64 submissions were received from 20 participants (or teams). Participants primarily used Illumina data (24 submissions) and PacBio data (17 submissions), with 20 additional submissions using multiple datasets. Submissions were benchmarked following best practices from the Global Alliance for Genomics and Health (GA4GH), using the new V4.2 HG003 and HG004 benchmark sets and the V2.0 GIAB genome stratifications.


In addition to benchmarking submissions against all benchmark variants and regions, we stratified results for the MHC region alone and for difficult-to-map and segmental duplication regions alone. Specifically, we benchmarked submissions against the entire MHC region defined by the Genome Reference Consortium (including both conserved “easy” regions and highly variable regions), defined by the bed file here and described in the README file here. For the “difficult-to-map” regions, we used the union of segmental duplications and “low mappability” regions, where 100 bp read pairs differ from another region of the genome by at most 2 mismatches and at most 1 indel, defined by the bed file here and described in the README file here. We calculated the F1 score (harmonic mean of precision and recall) for SNVs and indels together, and averaged the results for HG003 and HG004 to provide the results below.
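The scoring described above can be sketched in a few lines of Python. This is an illustration only, not the evaluation team's code: the `Counts` class and the functions `combined_f1` and `challenge_score` are hypothetical names, and the sketch assumes that "SNVs and indels together" means summing true positive, false positive, and false negative counts across the two variant types before computing precision and recall. It also uses a single true positive count per variant type, which simplifies what a benchmarking tool may report.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Benchmark counts for one variant type (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def combined_f1(snv: Counts, indel: Counts) -> float:
    """F1 for SNVs and indels together, assuming counts are pooled first."""
    tp = snv.tp + indel.tp
    fp = snv.fp + indel.fp
    fn = snv.fn + indel.fn
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


def challenge_score(per_sample: dict) -> float:
    """Average the combined F1 over samples (e.g. HG003 and HG004).

    per_sample maps a sample name to a (snv_counts, indel_counts) pair.
    """
    scores = [combined_f1(snv, indel) for snv, indel in per_sample.values()]
    return sum(scores) / len(scores)
```

With made-up counts, a call such as `challenge_score({"HG003": (Counts(90, 10, 10), Counts(0, 0, 0)), "HG004": (Counts(80, 20, 20), Counts(0, 0, 0))})` averages a per-sample F1 of 0.9 and 0.8. Pooling counts weights SNVs and indels by how often they occur; averaging per-type F1 scores instead would weight them equally and give a different number.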


Performance Metrics

Figure 1 summarizes overall performance (A) and submission rank (B) by technology and stratification. Generally, submissions that used multiple technologies (MULTI) outperformed single-technology submissions in all three genomic context categories. Panel A shows a histogram of F1 % (higher is better) for the three genomic stratifications evaluated. Light grey bars indicate submission counts across all technologies, and colored bars indicate counts for individual technologies. Panel B shows individual submission performance. Data points represent each submission's performance in the three stratifications (difficult-to-map regions, all benchmark regions, MHC), and lines connect the points belonging to the same submission. Category top performers are indicated by diamonds marked with a “W”.

Figure 1. Overall performance (A) and submission rank (B) varied by technology and stratification (log scale)

A public table with more detailed performance metrics for all submissions is available here. This table includes the combined SNV and indel F1 metrics used for the awards, as well as other metrics such as precision and recall, and metrics stratified by SNV and indel. A few participants chose not to be identified and are listed under a unique 5-letter identifier. A description of the table values is available here.

Top Performers

For each technology (Illumina, PacBio HiFi, ONT, or Multi-technology), we have selected the top performers for all benchmark regions, difficult-to-map regions/segmental duplications, and MHC (Figure 2).

Figure 2. Challenge top performers in all categories
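The selection of one top performer per technology and stratification can be illustrated with a small sketch. It assumes the public metrics table has been loaded as a list of dictionaries with the hypothetical keys `submission`, `technology`, `stratification`, and `f1`; the actual column names in the table may differ, and how ties were handled in the challenge is not stated here.

```python
def top_performers(rows):
    """Return the highest-F1 row for each (technology, stratification) pair.

    rows: iterable of dicts with the hypothetical keys
    "submission", "technology", "stratification", and "f1".
    On a tie, the row seen first is kept (an arbitrary choice for this sketch).
    """
    best = {}
    for row in rows:
        key = (row["technology"], row["stratification"])
        if key not in best or row["f1"] > best[key]["f1"]:
            best[key] = row
    return best
```

Calling `top_performers(rows)` on such a table yields one entry per category, for example `best[("ONT", "MHC")]`, which corresponds to one cell of the top-performer grid.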


While GIAB has found the new benchmarks to reliably identify false positives and false negatives across a wide variety of input callsets, the benchmarking results have several limitations listed below. GIAB looks forward to ongoing work with the community to improve the benchmarks. Performance of these methods will likely be different for future benchmarks that cover more challenging regions. Limitations of V4.2 benchmarks include:

  1. These benchmarks include more challenging regions than those assessed in the first Truth Challenge, but they still exclude the most challenging regions of the genome, including highly similar segmental duplications, satellite DNA such as the centromeres, many mid-sized indels >15 bp, structural variants, and copy number variants.
  2. Although we tried to exclude duplications in HG002 relative to GRCh38, we found that some questionable benchmark variants and regions remain where duplications or other complex structural variants may exist around segmental duplications. For the best submissions, a significant fraction of false positives may lie in these challenging regions.
  3. ONT data were not used to form the V4.2 benchmarks, so these benchmarks do not assess performance in regions that are only accessible to the longer ONT reads.


  1. It is exciting to see new advances in variant calling methods and innovative method combinations.
  2. Combining the strengths of different technologies can give better results for both easy and difficult regions of the genome.
  3. In many cases, there were different top performers in difficult-to-map regions, all benchmark regions, and the MHC, demonstrating that stratification is useful for revealing which bioinformatics methods are strongest in which regions.