CDRH Biothreat Challenge
Provide challenge data sets and reference standards for performance comparison of bioinformatics tools used in the biothreat and infectious disease NGS diagnostics community. The focus of this challenge is to enable tool developers to test their algorithms on blinded mock-clinical and in silico metagenomics samples using provided regulatory-grade reference genomes from the FDA-ARGOS database. This will enable the community to look at bioinformatics pipeline performance using a fixed reference genome data standard. The challenge will help familiarize precisionFDA users with the agency’s innovative FDA-ARGOS database resource (www.fda.gov/argos).
2018-08-04 00:00:00 UTC
2018-10-19 03:00:00 UTC
Many infectious diseases have similar signs and symptoms, making it challenging for healthcare providers to identify the disease-causing agent. Clinical samples are often tested by multiple test methods to help reveal the microbe that is causing the infectious disease. The results of these test methods can help healthcare professionals determine the best treatment for patients. Today, High-Throughput Sequencing (HTS) or Next Generation Sequencing (NGS) technology has the capability, as a single test, to accomplish what might have required several different tests in the past.
NGS technology may allow the diagnosis of infections without prior knowledge of the cause of disease. NGS technology can potentially reveal the presence of all microorganisms in a patient sample. Using infectious disease NGS (ID-NGS) technology, each microbial pathogen may be identified by its unique genomic fingerprint. The vision of ID-NGS technology is to further improve patient care by delivering diagnostics that can help identify the microbial makeup of patient samples quickly and accurately.
A set of 15 mock clinical and 6 in silico metagenomics samples will be provided. Each challenge dataset is a mixture of background (mock matrix) short reads and target (microbe) short reads in a certain proportion. The test algorithm’s performance will be judged based on its estimation of the composition of the target short reads (microbial composition). The background reads ensure that the samples resemble clinically relevant samples.
21 metagenomics samples (15 mock, 6 in silico)
15 mock clinical samples
- Background at mock clinical relevant level
- Biothreat agent
- Near neighbor
- Lab contaminant
- No template control (NTC)
- Positive control
6 in silico samples
- Near neighbor
- Lab contaminant
*Mix blinded, may not contain all constituents
- Example submission link
Data part1 part2
- Hello World data set
- Reads from 21 blinded metagenomics sequence files (fasta/fastq) part1 part2
- 100,000 subsampled reads from each of the 21 metagenomics sequence files for the optional per read taxonomical origin challenge
FDA-ARGOS database link
- Blinded regulatory-grade microbial genomes
- Run on precisionFDA using a wrapper
- Download data and post results using template
- FDA-ARGOS genome species identification normalized confidence score [0,1]
- FDA-ARGOS genome species identification normalized quantity percentage [0,1]
- Per read FDA-ARGOS genome species identification normalized confidence score [0,1] (optional)
Participants are asked to submit the normalized confidence score (between 0 and 1) for the presence of each FDA-ARGOS Reference Genome within the 21 metagenomics samples, along with their method for calculating the confidence score. A value of 1 indicates strong confidence in the presence of an FDA-ARGOS Reference Genome within a sample, a value of 0.5 indicates neutral confidence, and a value of 0 indicates strong confidence in the absence of the genome from the sample.
|FDA-ARGOS Genome Species ID||Sample 1||Sample ...||Sample 21|
|ARGOS Reference Genome 1||1||0.8||0.8|
|ARGOS Reference Genome N||0||0.3||0.6|
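A quick format check before submitting can catch out-of-range or missing scores. The sketch below assumes the table has been loaded into a plain dict keyed by genome ID, with one score per sample. The function name `validate_confidence_table` and this in-memory layout are illustrative assumptions and are not part of the challenge template.

```python
def validate_confidence_table(table, n_samples=21):
    """Return a list of problems found in a genome -> per-sample score table.

    `table` maps a genome ID to a list of confidence scores, one per sample.
    This layout is a hypothetical in-memory form of the submission table.
    """
    problems = []
    for genome, scores in table.items():
        # Every genome row needs exactly one score per sample.
        if len(scores) != n_samples:
            problems.append(
                f"{genome}: expected {n_samples} scores, got {len(scores)}"
            )
        # Each score must lie in the normalized range [0, 1].
        for i, score in enumerate(scores, start=1):
            if not 0.0 <= score <= 1.0:
                problems.append(
                    f"{genome}, sample {i}: score {score} outside [0, 1]"
                )
    return problems
```

An empty list means the table passed both checks; otherwise each entry names the offending genome and sample.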
Participants are asked to determine the quantity of genetic material originating from each reference genome within the provided FDA-ARGOS reference database and to submit the per-genome normalized quantity percentage (between 0 and 1) within each sample. For each sample, the per-genome quantity percentages (that is, each sample column of the table) should sum to at most 1.
|FDA-ARGOS Genome Species ID||Sample 1||Sample 2||Sample 3|
|ARGOS Reference Genome 1||0.5||0.2||0|
|ARGOS Reference Genome N||0.3||0.2||0|
* Only a subset of the short reads may be taxonomically classified. The final quantifications should therefore be reported as the number of reads mapped to each genome divided by the total number of reads. The portion of unclassified reads can be assigned to the Unclassified row.
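The quantity calculation described above can be sketched as follows. This is a minimal illustration, assuming read assignments are available as a dict from read name to genome ID that contains classified reads only. The function name `quantity_fractions` and the input layout are hypothetical.

```python
from collections import Counter


def quantity_fractions(assignments, total_reads):
    """Compute per-genome read fractions plus an Unclassified row.

    `assignments` maps read name -> genome ID for classified reads only
    (an assumed layout). `total_reads` counts every read in the sample,
    classified or not, so the fractions sum to 1 including Unclassified.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if len(assignments) > total_reads:
        raise ValueError("more classified reads than total reads")

    # Count classified reads per genome, then divide by ALL reads,
    # not just the classified ones.
    counts = Counter(assignments.values())
    fractions = {genome: n / total_reads for genome, n in counts.items()}

    # Reads with no assignment go to the Unclassified row.
    fractions["Unclassified"] = (total_reads - len(assignments)) / total_reads
    return fractions
```

Dividing by the total read count rather than the classified read count keeps each sample column at or below 1 once the Unclassified row is excluded.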
[OPTIONAL] Participants are asked to submit the read-based normalized confidence score (between 0 and 1) for the presence of each FDA-ARGOS species within designated subsamples. 100,000 reads were subsampled from each of the 21 metagenomics samples and provided to participants. If a score is not provided for a genome-read pair, the score is assumed to be 0.
|Read Name||FDA-ARGOS Genome Species ID||Score|
|Read 1||ARGOS Reference Genome 1||0.9|
|Read N||ARGOS Reference Genome N||0.6|
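The rule that an unreported genome-read pair scores 0 maps naturally onto a lookup with a default. The sketch below assumes the rows have already been parsed into (read name, genome ID, score) tuples. The helper names `build_score_lookup` and `score_for` are hypothetical.

```python
def build_score_lookup(rows):
    """Index per-read scores by (read name, genome ID).

    `rows` is an iterable of (read_name, genome_id, score) tuples,
    an assumed parsed form of the per-read submission table.
    """
    scores = {}
    for read_name, genome_id, score in rows:
        value = float(score)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score {value} for {read_name} outside [0, 1]")
        scores[(read_name, genome_id)] = value
    return scores


def score_for(scores, read_name, genome_id):
    """Return the submitted score, or 0.0 when the pair was not reported."""
    return scores.get((read_name, genome_id), 0.0)
```

This means a submission only needs to list the pairs with a nonzero score.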
Due to the high volume of questions, the challenge deadline has been extended to October 18.
- PrecisionFDA: Elaine Johanson, Ruth Bandler
- PrecisionFDA CDRH: Adam Berger, Zivana Tezak
- Booz Allen: Zeke Maier
- DNAnexus: Singer Ma, John Didion
- CDRH: Heike Sichtig, Yi Yan
- USAMRIID: Timothy Minogue, Chris Stefan
Example Submission Walkthrough
An example submission was provided using a training dataset. Below we explain the files and data included therein. The goal of this training dataset is to help users validate submission formatting.
Hello_World_R1.fq and Hello_World_R2.fq contain 3000 paired fastq sequences.
- Example.1-1000/1 are reads mapped to GCF_000833235.1 (Francisella tularensis).
- Example.1001-2000/1 are reads mapped to GCF_003073775.1 (Staphylococcus aureus).
- Example.2001-3000/1 are reads mapped to GCF_002366285.1 (Zika virus).
Example Output (see result table)
Hello_World_R1.fq was used to run blastn with the FDA-ARGOS database as the reference database.
- Reference Genome ID was determined by having at least 1 occurrence in the blast result.
- Reference Genome Quantity percentage was calculated by dividing the number of reads assigned to a Reference Genome ID by the total number of reads analyzed. For example, 1000 reads were assigned to CR_471 within Hello_World_R1.fq and there are 3000 reads total. Therefore, an estimated 33.3% of reads (0.333 on the normalized [0,1] scale) were assigned to CR_471.
- The read-based confidence score was calculated by normalizing the blast score for each read. Each read was assigned the reference genome ID corresponding to its highest-scoring blast result.
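The per-read step in the walkthrough can be sketched as below. The walkthrough does not say how the blast score was normalized, so this sketch assumes division by the largest best-hit score in the run; that choice is one possible normalization and not necessarily the one used for the example submission. The hit tuples (read name, genome ID, bit score) and both function names are hypothetical.

```python
def best_hit_per_read(hits):
    """Keep the highest-scoring hit for each read.

    `hits` is an iterable of (read_name, genome_id, bit_score) tuples,
    an assumed parsed form of tabular blast output.
    """
    best = {}
    for read_name, genome_id, bit_score in hits:
        current = best.get(read_name)
        if current is None or bit_score > current[1]:
            best[read_name] = (genome_id, bit_score)
    return best


def normalize_scores(best):
    """Scale best-hit scores into [0, 1] by the maximum best-hit score.

    ASSUMPTION: max-scaling is an illustrative normalization only.
    """
    if not best:
        return {}
    top = max(score for _, score in best.values())
    if top <= 0:
        return {read: (genome, 0.0) for read, (genome, _) in best.items()}
    return {read: (genome, score / top) for read, (genome, score) in best.items()}
```

Reads with no blast hit never enter `best`, which matches the rule that an unreported genome-read pair is scored as 0.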