PrecisionFDA
CDRH Biothreat Challenge


This challenge provides data sets and reference standards for comparing the performance of bioinformatics tools used in the biothreat and infectious disease NGS diagnostics community. Its focus is to let tool developers test their algorithms on blinded mock-clinical and in silico metagenomics samples using the regulatory-grade reference genomes provided in the FDA-ARGOS database. This lets the community compare bioinformatics pipeline performance against a fixed reference genome data standard. The challenge will also help familiarize precisionFDA users with the agency’s FDA-ARGOS database resource (www.fda.gov/argos).


  • Starts
    2018-08-04 00:00:00 UTC
  • Ends
    2018-10-05 03:00:00 UTC



Background

Many infectious diseases have similar signs and symptoms, making it challenging for healthcare providers to identify the disease-causing agent. Clinical samples are often tested by multiple test methods to help reveal the microbe that is causing the infectious disease. The results of these test methods can help healthcare professionals determine the best treatment for patients. Today, High-Throughput Sequencing (HTS) or Next Generation Sequencing (NGS) technology has the capability, as a single test, to accomplish what might have required several different tests in the past.

NGS technology may allow infections to be diagnosed without prior knowledge of the cause of the disease. It can potentially reveal the presence of all microorganisms in a patient sample. Using infectious disease NGS (ID-NGS) technology, each microbial pathogen may be identified by its unique genomic fingerprint. The vision of ID-NGS technology is to further improve patient care by delivering diagnostics that identify the microbial makeup of patient samples quickly and accurately.

Challenge Data

A set of 15 mock clinical and 6 in silico metagenomics samples will be provided. Each challenge dataset is a mixture of background (mock matrix) short reads and target (microbe) short reads in a certain proportion. The test algorithm’s performance will be judged on its estimate of the composition of the target short reads (microbial composition). The background reads ensure that the samples resemble clinically relevant samples.

21 metagenomics samples (15 mock, 6 in silico)

  • 15 mock clinical samples
    • Background at mock clinical relevant level
    • Biothreat agent
    • Near neighbor
    • Coinfection
    • Lab contaminant
    • No template control (NTC)
    • Positive control
  • 6 in silico samples
    • Background
    • Biothreat
    • Near neighbor
    • Coinfection
    • Lab contaminant

*The mix is blinded, and a given sample may not contain all constituents.

Submission Format

Input

  • Example submission link
  • Data (part1, part2)
    • Hello World data set
    • Reads from 21 blinded metagenomics sequence files (fasta/fastq; part1, part2)
    • 100,000 subsampled reads from each of the 21 metagenomics sequence files for the optional per-read taxonomic origin challenge
  • FDA-ARGOS database link
    • Blinded regulatory-grade microbial genomes

Your pipeline

  • Run on precisionFDA using a wrapper
  • Download data and post results using template

Output

  1. FDA-ARGOS genome species identification normalized confidence score [0,1]
  2. FDA-ARGOS genome species identification normalized quantity percentage [0,1]
  3. Per read FDA-ARGOS genome species identification normalized confidence score [0,1] (optional)

Evaluation

Participants are asked to submit a normalized confidence score (between 0 and 1) for the presence of each FDA-ARGOS Reference Genome within each of the 21 metagenomics samples, together with their method for calculating the confidence score. A value of 1 indicates strong confidence that an FDA-ARGOS Reference Genome is present in a sample, a value of 0.5 indicates neutral confidence, and a value of 0 indicates strong confidence that the genome is absent from the sample.

FDA-ARGOS Genome Species ID    Sample 1    Sample ...    Sample 21
ARGOS Reference Genome 1       1           0.8           0.8
ARGOS Reference Genome N       0           0.3           0.6

Participants are asked to determine the quantity of genetic material originating from each reference genome within the provided FDA-ARGOS reference database and to submit the per genome normalized quantity percentage (between 0 and 1) within each sample.

FDA-ARGOS Genome Species ID    Sample 1    Sample 2    Sample 3
ARGOS Reference Genome 1       0.5         0.2         0
ARGOS Reference Genome N       0.5         0.8         0

* Because a pipeline may taxonomically classify only a subset of the short reads, the final quantifications will be evaluated by their root mean square deviation (RMSD) from the known quantities.
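
As an illustration of the RMSD metric named above, the following Python sketch applies the standard definition to one sample. The function name and the example values are hypothetical, and the challenge text does not state whether RMSD is computed per sample or across all samples, so this is a sketch under that assumption rather than the official scoring code.

```python
import math


def rmsd(predicted, known):
    """Root mean square deviation between two equal-length lists of fractions."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known must have the same length")
    if not predicted:
        raise ValueError("at least one value is required")
    squared_errors = [(p - k) ** 2 for p, k in zip(predicted, known)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical quantities for one sample with two reference genomes.
predicted = [0.5, 0.5]  # submitted normalized quantity percentages
known = [0.4, 0.6]      # true quantities (made up for this example)
print(round(rmsd(predicted, known), 3))
```

A lower value means the submitted quantities are closer to the known composition; a perfect estimate gives 0.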

[OPTIONAL] Participants may also submit a read-based normalized confidence score (between 0 and 1) for the presence of each FDA-ARGOS species within designated subsamples. 100,000 reads were subsampled from each of the 21 metagenomics samples and provided to participants. If no score is provided for a genome-read pair, the score is assumed to be 0.

Read Name    FDA-ARGOS Genome Species ID    Score
Read 1       ARGOS Reference Genome 1       0.9
Read N       ARGOS Reference Genome N       0.6
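
The rule that a missing genome-read pair counts as 0 can be sketched in Python as follows. The function names and the in-memory row layout are assumptions made for illustration; the actual submission template defines the file format.

```python
def load_scores(rows):
    """Build a (read name, genome ID) -> score lookup from submitted rows."""
    scores = {}
    for read_name, genome_id, score in rows:
        # Scores are normalized, so anything outside [0, 1] is rejected.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {read_name}: {score}")
        scores[(read_name, genome_id)] = score
    return scores


def score_for(scores, read_name, genome_id):
    """Return the submitted score, or 0.0 when the pair was not submitted."""
    return scores.get((read_name, genome_id), 0.0)


# Rows mirroring the example table above.
submitted = load_scores([
    ("Read 1", "ARGOS Reference Genome 1", 0.9),
    ("Read N", "ARGOS Reference Genome N", 0.6),
])
print(score_for(submitted, "Read 1", "ARGOS Reference Genome 1"))
print(score_for(submitted, "Read 1", "ARGOS Reference Genome N"))
```

The second lookup has no submitted row, so it falls back to 0.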

Team

  • PrecisionFDA: Elaine Johanson, Ruth Bandler
  • PrecisionFDA CDRH: Adam Berger, Zivana Tezak
  • Booz Allen: Zeke Maier
  • DNAnexus: Singer Ma, John Didion
  • CDRH: Heike Sichtig, Yi Yan
  • USAMRIID: Timothy Minogue, Chris Stefan

Example Submission Walkthrough

An example submission is provided using a training dataset. Below we explain the files and data included therein. The goal of this training dataset is to help users validate their submission formatting.

Hello_World_R1.fq and Hello_World_R2.fq contain 3000 paired fastq sequences.

  • Example.1-1000/1 are reads mapped to GCF_000833235.1 (Francisella tularensis).
  • Example.1001-2000/1 are reads mapped to GCF_003073775.1 (Staphylococcus aureus).
  • Example.2001-3000/1 are reads mapped to GCF_002366285.1 (Zika virus).
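
Because the origin of every Hello World read is known, a pipeline's per-read assignments can be checked against it. The Python sketch below encodes the three ranges listed above. It assumes read names follow the pattern `Example.<index>/<mate>` (for example `Example.1500/1`), which is inferred from the range notation and should be confirmed against the actual fastq headers; the function name is hypothetical.

```python
import re

# Read-index ranges and reference genomes as listed above.
EXPECTED = [
    (1, 1000, "GCF_000833235.1"),     # Francisella tularensis
    (1001, 2000, "GCF_003073775.1"),  # Staphylococcus aureus
    (2001, 3000, "GCF_002366285.1"),  # Zika virus
]

# Assumed read-name pattern: "Example.<index>/<mate>".
READ_NAME = re.compile(r"^Example\.(\d+)/[12]$")


def expected_genome(read_name):
    """Return the genome a Hello World read was drawn from."""
    match = READ_NAME.match(read_name)
    if match is None:
        raise ValueError(f"unrecognized read name: {read_name!r}")
    index = int(match.group(1))
    for first, last, genome in EXPECTED:
        if first <= index <= last:
            return genome
    raise ValueError(f"read index out of range: {index}")


print(expected_genome("Example.1500/1"))
```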

Example Output (see result table)

Hello_World_R1.fq was used to run blastn with the FDA-ARGOS database as the reference database.

  • Reference Genome ID was determined by having at least 1 occurrence in the blast result.
  • Reference Genome Quantity percentage was calculated by dividing the number of reads assigned to a Reference Genome ID by the total number of reads analyzed. For example, 1000 reads were assigned to CR_471 within Hello_World_R1.fq, and there are 3000 reads in total. Therefore, an estimated 33.3% of the reads were assigned to CR_471.
  • The read-based confidence score was calculated by normalizing the blast score for each read. Each read was assigned the reference genome ID of its highest-scoring blast result.
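
The three steps above can be sketched in Python as follows. The input is assumed to be a list of (read ID, genome ID, score) tuples already parsed from the blast results. The walkthrough does not say how the blast score was normalized, so dividing by the highest best-hit score is an assumption here, and the function name, read IDs, and genome IDs are hypothetical.

```python
from collections import Counter


def summarize_hits(hits, total_reads):
    """Summarize hits given as (read_id, genome_id, score) tuples.

    Returns (quantity, confidence):
      quantity:   genome_id -> fraction of all analyzed reads assigned to it
      confidence: read_id -> (genome_id, normalized score in [0, 1])
    """
    # Keep only the highest-scoring hit for each read.
    best = {}
    for read_id, genome_id, score in hits:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (genome_id, score)

    # A genome is reported if at least one read is assigned to it.
    counts = Counter(genome_id for genome_id, _ in best.values())
    quantity = {g: n / total_reads for g, n in counts.items()}

    # Assumption: normalize by the highest best-hit score observed.
    top = max((score for _, score in best.values()), default=0.0)
    confidence = {
        read_id: (genome_id, score / top if top else 0.0)
        for read_id, (genome_id, score) in best.items()
    }
    return quantity, confidence


# Made-up hits: three reads analyzed, two of them with blast hits.
hits = [
    ("r1", "genome_A", 180.0),
    ("r1", "genome_B", 90.0),
    ("r2", "genome_A", 200.0),
]
quantity, confidence = summarize_hits(hits, total_reads=3)
print(quantity)
print(confidence)
```

Dividing by the total number of reads analyzed, rather than by the number of reads with hits, matches the 1000 / 3000 example above and means the reported quantities need not sum to 1 when some reads go unclassified.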