PrecisionFDA
CDRH Biothreat Challenge


This challenge provides data sets and reference standards for comparing the performance of bioinformatics tools used in the biothreat and infectious disease NGS diagnostics community. Its focus is to let tool developers test their algorithms on blinded mock-clinical and in silico metagenomics samples, using regulatory-grade reference genomes provided from the FDA-ARGOS database. This lets the community assess bioinformatics pipeline performance against a fixed reference genome data standard. The challenge will also help familiarize precisionFDA users with the agency’s innovative FDA-ARGOS database resource (www.fda.gov/argos).


  • Starts
    2018-08-04 00:00:00 UTC
  • Ends
    2018-10-19 03:00:00 UTC

Background

Many infectious diseases have similar signs and symptoms, making it challenging for healthcare providers to identify the disease-causing agent. Clinical samples are often tested with multiple methods to help reveal the microbe that is causing the infectious disease. The results of these tests can help healthcare professionals determine the best treatment for patients. Today, High-Throughput Sequencing (HTS), also called Next Generation Sequencing (NGS), can accomplish in a single test what might have required several different tests in the past.

NGS technology may allow infections to be diagnosed without prior knowledge of the cause of disease. It can potentially reveal the presence of all microorganisms in a patient sample. Using infectious disease NGS (ID-NGS) technology, each microbial pathogen may be identified by its unique genomic fingerprint. The vision of ID-NGS technology is to further improve patient care by delivering diagnostics that can help identify the microbial makeup of patient samples quickly and accurately.

Challenge Data

A set of 15 mock clinical and 6 in silico metagenomics samples will be provided. Each challenge dataset is a mixture of background (mock matrix) short reads and target (microbe) short reads in a certain proportion. The test algorithm’s performance will be judged on its estimate of the composition of the target short reads (microbial composition). The background reads ensure that the samples resemble clinically relevant samples.

21 metagenomics samples (15 mock, 6 in silico)

  • 15 mock clinical samples
    • Background at mock clinical relevant level
    • Biothreat agent
    • Near neighbor
    • Coinfection
    • Lab contaminant
    • No template control (NTC)
    • Positive control
  • 6 in silico samples
    • Background
    • Biothreat
    • Near neighbor
    • Coinfection
    • Lab contaminant

* Mixes are blinded and may not contain all constituents.

Submission Format

Input

  • Hello World data set
  • Reads from 21 blinded metagenomics sequence files (FASTA/FASTQ)
  • 100,000 subsampled reads from each of the 21 metagenomics sequence files for the optional per-read taxonomic origin challenge
  • FDA-ARGOS database
    • Blinded regulatory-grade microbial genomes

Your pipeline

  • Run on precisionFDA using a wrapper
  • Download data and post results using template

Output

  1. FDA-ARGOS genome species identification normalized confidence score [0,1]
  2. FDA-ARGOS genome species identification normalized quantity percentage [0,1]
  3. Per-read FDA-ARGOS genome species identification normalized confidence score [0,1] (optional)

Evaluation

Participants are asked to submit a normalized confidence score (between 0 and 1) for the presence of each FDA-ARGOS species in each of the 21 metagenomics samples, together with their method for calculating the confidence score.

FDA-ARGOS Genome Species ID    Sample 1    Sample 2    ...    Sample 21
Species 1                      1           0.8         ...    0.8
Species N                      0           0.3         ...    0.6
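
Before submitting, a table like the one above can be checked for basic consistency. The following is a minimal sketch, not part of the official template: the function name `validate_scores` and the dictionary layout (species ID mapped to a list of per-sample scores) are assumptions made for illustration. It only checks the two constraints stated in the challenge text, namely 21 samples and scores between 0 and 1.

```python
EXPECTED_SAMPLES = 21  # the challenge provides 21 metagenomics samples


def validate_scores(table, n_samples=EXPECTED_SAMPLES):
    """Return a list of problems found in a species -> per-sample score mapping.

    `table` is a hypothetical in-memory layout, not the official template.
    """
    problems = []
    for species, scores in table.items():
        # Each species needs exactly one score per sample.
        if len(scores) != n_samples:
            problems.append(
                f"{species}: expected {n_samples} scores, got {len(scores)}"
            )
        # Every score must be a normalized value in [0, 1].
        for i, score in enumerate(scores, start=1):
            if not 0.0 <= score <= 1.0:
                problems.append(
                    f"{species}, sample {i}: score {score} outside [0, 1]"
                )
    return problems


# Illustrative values only, loosely following the example table.
table = {
    "Species 1": [1.0] + [0.8] * 20,
    "Species N": [0.0, 0.3] + [0.6] * 19,
}
print(validate_scores(table))
```

An empty list means no problems were found; any other result lists the offending species and sample.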

Participants are asked to determine the quantity of genetic material originating from each species in the provided FDA-ARGOS reference database and to submit each species' normalized quantity percentage (expressed as a value between 0 and 1) within each sample.

FDA-ARGOS Genome Species ID    Sample 1    Sample 2    ...    Sample 21
Species 1                      0.5         0.2         ...    0
Species N                      0.5         0.8         ...    0

* Because a pipeline may taxonomically classify only a subset of the short reads, the final quantifications will be evaluated by their root mean square deviation (RMSD) from the known quantities.
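
To make the RMSD criterion concrete, here is a minimal sketch of how classified read counts could be turned into normalized quantities and compared with known values. Several details are assumptions, since the challenge text does not specify them: the quantities are normalized over classified reads only, the RMSD is computed per sample across species, and the names `normalize_counts` and `rmsd` are hypothetical. The official scoring may differ.

```python
import math


def normalize_counts(read_counts):
    """Convert per-species read counts into fractions that sum to 1.

    Assumption: fractions are taken over classified reads only.
    """
    total = sum(read_counts.values())
    if total == 0:
        # No reads classified (for example, a no-template control).
        return {species: 0.0 for species in read_counts}
    return {species: n / total for species, n in read_counts.items()}


def rmsd(submitted, known):
    """Root mean square deviation between two species -> quantity mappings.

    Species missing from either mapping are treated as quantity 0.
    """
    species = sorted(set(submitted) | set(known))
    if not species:
        raise ValueError("no species to compare")
    squared = [
        (submitted.get(s, 0.0) - known.get(s, 0.0)) ** 2 for s in species
    ]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative numbers only; real known quantities are blinded.
estimated = normalize_counts({"Species 1": 300, "Species N": 700})
known = {"Species 1": 0.5, "Species N": 0.5}
print(round(rmsd(estimated, known), 3))
```

A lower RMSD means the estimated composition is closer to the known one; a value of 0 would indicate an exact match.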

(OPTIONAL) Participants are asked to submit a read-based normalized confidence score (between 0 and 1) for the presence of each FDA-ARGOS species within designated subsamples. 100,000 reads were subsampled from each of the 21 metagenomics samples and provided to participants.

Read Name    FDA-ARGOS Genome Species ID    Score
Read 1       Species 1                      0.9
Read N       Species N                      0.6
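
The per-read table above could be produced along the following lines. This is a sketch under assumptions: the tab-separated layout and the function name `per_read_table` are illustrative and are not taken from the official submission template, which should be consulted for the actual format.

```python
import csv
import io


def per_read_table(assignments):
    """Render (read name, species ID, score) rows as tab-separated text.

    The tab-separated layout is an assumption, not the official format.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Read Name", "FDA-ARGOS Genome Species ID", "Score"])
    for read_name, species_id, score in assignments:
        # Scores are normalized confidence values and must lie in [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError(
                f"score for {read_name} is outside [0, 1]: {score}"
            )
        writer.writerow([read_name, species_id, score])
    return out.getvalue()


# Illustrative rows mirroring the example table.
rows = [("Read 1", "Species 1", 0.9), ("Read N", "Species N", 0.6)]
print(per_read_table(rows), end="")
```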

Team

  • PrecisionFDA: Elaine Johanson, Ruth Bandler
  • PrecisionFDA CDRH: Adam Berger, Zivana Tezak
  • Booz Allen Hamilton: Zeke Maier
  • DNAnexus: Singer Ma, John Didion
  • FDA CDRH: Heike Sichtig, Yi Yan
  • USAMRIID: Timothy Minogue, Chris Stefan
  • NIST: Scott Jackson, Jason Kralj