CFSAN Pathogen Detection Challenge
Identify the types and distribution of Salmonella strains in metagenomics samples
2018-02-16 01:00:00 UTC
2018-04-26 23:59:59 UTC
The Food and Drug Administration (FDA) calls on the food safety and infectious diseases communities to help improve bioinformatics pipelines for detecting pathogens in samples sequenced using metagenomics by launching the precisionFDA CFSAN Pathogen Detection Challenge.
Challenge Time Period
February 15, 2018 through April 26, 2018
At a Glance
In the last few years, the application of next-generation sequencing (NGS) technology for whole genome sequencing (WGS) of foodborne pathogens has revolutionized food pathogen outbreak surveillance. WGS of foodborne pathogens enables high-resolution identification of pathogens isolated from food or environmental samples. These pathogens can then be compared to clinical isolates sequenced from patients. If the pathogens found in the food or food production environment match the pathogens from the sick patients, a reliable link between the two can be made with a clear source of the outbreak. This type of testing has traditionally been done using much lower resolution methods (e.g., pulsed-field gel electrophoresis) where a link between source exposure and patients was not always as clear. WGS has the power to differentiate between even closely related strains of the same species, reducing the detection of false clusters and requiring fewer clinical cases to identify an outbreak.
The U.S. Food and Drug Administration (FDA), Center for Food Safety and Applied Nutrition (CFSAN) has pioneered the use of WGS for outbreak detection via the GenomeTrakr network. The GenomeTrakr network comprises FDA and external laboratories that sequence foodborne pathogens and contribute the genomes to a data repository hosted by the National Center for Biotechnology Information (NCBI). The NCBI also offers a pathogen detection system for processing and delivering decision-ready outbreak detection information. A key function of this pipeline is detecting small genetic differences, called single nucleotide polymorphisms (SNPs), between newly discovered pathogens and their closest known genetic relatives, such that outbreaks in several sites can be quickly traced back to their genetic ancestor at a single source. Already, the GenomeTrakr network has led to a significant reduction in average outbreak response time, from 52 days to 12 days.
Although WGS has already greatly improved outbreak detection and traceback, current approaches rely on culturing a pathogen before sequencing. Metagenomics, defined here as the study of genetic material collected directly from environmental samples, is the next evolution of the GenomeTrakr foodborne pathogen initiative because metagenomics is culture-independent.
As the food safety community moves to metagenomic sequencing, bioinformatics algorithms must be developed to detect pathogens amongst a mix of genetic material sequenced directly from a sample. Within bioinformatics and genomics, crowdsourcing has a history of improving expertly developed algorithms. Thus, we have designed a challenge as a step towards this goal. In this challenge, participants are asked to develop and use bioinformatics pipelines to identify the types and specific Salmonella strains in each of several metagenomics samples. This type of technology will expedite determining the source of foodborne illness.
The challenge begins with 24 precisionFDA-provided input datasets, corresponding to metagenomic sequencing of produce samples. These samples have been analyzed using classical microbiological laboratory techniques and a subset was found to be contaminated with Salmonella. Your mission is to identify Salmonella serotype, MLST type (7-gene), and strain(s) in the positive samples. An MLST schema and a list of publicly available strains are provided to aid you in this mission. In addition to the 24 metagenomic sequencing samples that comprise the challenge dataset, 8 metagenomic sequencing samples have been provided with the Salmonella serotype, MLST type (7-gene), and strain(s) in each sample as training data. For the challenge data, you can generate the MLST type and Salmonella strain identification results in your own environment and upload them to precisionFDA, or you can reconstruct your pipeline on precisionFDA and run it on the cloud.
The identification of serotype, MLST types, and Salmonella strains in the Salmonella positive produce sequencing samples constitutes your submission to the challenge. Note that a fraction of the 24 samples may not contain Salmonella. Selected participants and top performers* will be recognized on the precisionFDA website. Therefore, we hope you are willing to share your experience with others to further enhance food safety through metagenomic pathogen detection.
The challenge runs until April 26, 2018.
Getting on the precisionFDA website
If you do not yet have a contributor account on precisionFDA, file an access request with your complete information and indicate that you are entering the challenge. The FDA acts as a steward by providing the precisionFDA service to the community and ensuring proper use of the resources, so your request will be initially pending. In the meantime, you will receive an email with a link to access the precisionFDA website in browse (guest) mode. Once approved, you will receive another email with your contributor account information.
With your contributor account, you can use the features required to participate in the challenge (such as transfer files or run comparisons). Everything you do on precisionFDA is initially private to you (not accessible to the FDA or the rest of the community) until you choose to publicize it. In other words, you can immediately start working on the challenge in private, and whenever you are ready you can officially publish your results as your challenge entry.
Locating and understanding the files
|Challenge Samples||CFSAN Challenge||This folder is the starting point for this challenge. It contains 24 food safety samples, corresponding to metagenomic sequencing of produce samples collected by the FDA. Forward and reverse *fastq.gz files are provided for each sample. Some of the samples, when cultured, did not have a strain of Salmonella detected. Other samples were created synthetically by adding Salmonella reads to culture-negative samples. Also included are culture-positive samples where a specific strain of Salmonella was detected.|
|MLST Type Database||sal7geneProfiles04022018.tab, sal7gene04022018.fasta||Multilocus sequence typing (MLST) is an unambiguous procedure for characterizing isolates of bacterial species using the DNA sequences of internal fragments of multiple genes. Approximately 450-500 bp internal fragments of each gene are used. This MLST type resource should be used to identify the MLST type(s) of the Salmonella in each sequencing sample. sal7geneProfiles04022018.tab: Tab-delimited file containing MLST schema definitions. The "ST" column gives the sequence type and the gene columns (aroC, dnaN, etc.) provide the allelic profile. More information about the schema is available at https://enterobase.warwick.ac.uk/species/index/senterica. sal7gene04022018.fasta: FASTA-formatted file of alleles used in the MLST schema. The FASTA header shows the gene name and allele variant (e.g. aroC_1). Alleles for the 7 genes used in this schema have been concatenated into a single file.|
|Salmonella Strain Genomes||SalmonellaStrains.txt||Tab-delimited file containing a large list of publicly available Salmonella genomes at NIH/NCBI. These Salmonella strain genomes should be used to identify the Salmonella strains in each sequencing sample.|
|Training Samples||CFSAN Training||Metagenomic sequencing of 8 produce samples that have been spiked with either high or low levels of Salmonella Newport.|
|Challenge Submission Format||ChallengeSubmission.txt||Sample spreadsheet results file in tab separated text format. Challenge submissions should follow the format of this file. This file also contains the Salmonella serotype, MLST type (7-gene), and strain(s) in each training sample.|
|README||README.txt||Contains descriptions of all of the provided files|
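As a rough illustration of how the MLST profile table listed above might be used, the sketch below loads a tab-delimited profile table and looks up the sequence type for a set of allele calls. It assumes only what the file description states: an "ST" column plus one column per gene. The function names, the two-gene toy table, and its allele numbers are made up for the example; with the real file you would pass all 7 gene column names exactly as they appear in its header.

```python
import csv
import io


def load_profiles(handle, genes):
    """Map each allelic profile (allele numbers in `genes` order) to its ST."""
    reader = csv.DictReader(handle, delimiter="\t")
    profiles = {}
    for row in reader:
        key = tuple(row[gene] for gene in genes)
        profiles[key] = row["ST"]
    return profiles


def lookup_st(profiles, calls, genes):
    """Return the ST whose profile matches the allele calls, or None."""
    key = tuple(str(calls[gene]) for gene in genes)
    return profiles.get(key)


# Toy stand-in for the profile table: made-up values, two genes only.
TOY_TABLE = "ST\taroC\tdnaN\n1\t1\t1\n2\t1\t2\n"
TOY_GENES = ["aroC", "dnaN"]

toy_profiles = load_profiles(io.StringIO(TOY_TABLE), TOY_GENES)
print(lookup_st(toy_profiles, {"aroC": 1, "dnaN": 2}, TOY_GENES))
```

To run this against the provided file, open sal7geneProfiles04022018.tab in text mode and pass the file handle in place of the `io.StringIO` object. A lookup that returns None means the combination of alleles is not in the table, which may indicate a wrong allele call or a mixture of strains.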
Running your pipeline
After familiarizing yourself with the files, you will need to process the challenge samples through your Salmonella serotype, MLST type, and strain identification pipeline. Each invocation of your pipeline should take as input a sequencing sample and output the Salmonella serotype, MLST type, and strain(s) present, if any.
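One possible way to organize the per-sample invocations is sketched below: group the forward and reverse read files by sample, then call your own identification function once per pair. The `_R1`/`_R2` file naming pattern is a hypothetical convention, not taken from the challenge data, so adjust it to the actual file names. The `identify` argument stands in for your own pipeline.

```python
import re
from collections import defaultdict

# Hypothetical naming convention: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq\.gz$")


def pair_reads(filenames):
    """Group read files by sample; keep only samples with both mates."""
    mates = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            mates[match["sample"]][match["mate"]] = name
    return {
        sample: (found["1"], found["2"])
        for sample, found in sorted(mates.items())
        if "1" in found and "2" in found
    }


def run_pipeline(pairs, identify):
    """Call identify(forward, reverse) once per sample and collect the results."""
    return {sample: identify(fwd, rev) for sample, (fwd, rev) in pairs.items()}


def placeholder_identify(forward, reverse):
    """Stand-in for a real pipeline: always reports no Salmonella."""
    return {"serotype": "", "mlst_type": "", "strain": ""}
```

Keeping the identification step behind a single function makes it easy to swap in a real implementation later, whether it runs locally or as a precisionFDA app.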
(Optional) Reconstructing your pipeline on precisionFDA
You have the option of reconstructing your pipeline on precisionFDA and running it there. To do that, you must create one or more apps on precisionFDA that encapsulate the actions performed in your pipeline. To create an app, you can provide Linux executables and an accompanying shell script to be run inside an Ubuntu VM on the cloud. The precisionFDA website contains extensive documentation on how to create apps, and you can also click the Fork button on an existing app (such as bwa_mem_bamsormadup) to use it as a starting point for developing your own.
Constructing your pipeline on precisionFDA has an important advantage: you can, at your discretion, share it with the community so that others can take a look at it and reproduce your results – and perhaps build upon it and further improve it.
Submitting your entry
Submissions should be formatted as a tab separated text file in the form of a matrix, with one row per sequencing sample and columns for the sample name, sample description, Salmonella serotype, MLST type, and strain. Please match the challenge submission example format for your challenge submission(s).
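The following is a minimal sketch of writing results as a tab separated matrix with one row per sample. The column header names and the placeholder values are assumptions made for illustration; the authoritative headers and conventions (for example, how to report a Salmonella-negative sample) are those in ChallengeSubmission.txt.

```python
import csv
import io

# Hypothetical header names; copy the real ones from ChallengeSubmission.txt.
COLUMNS = ["sample_name", "sample_description", "serotype", "mlst_type", "strain"]


def format_submission(rows):
    """Render a list of per-sample dicts as tab separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return buffer.getvalue()


example_rows = [
    {
        "sample_name": "sample_01",
        "sample_description": "example row",
        "serotype": "SEROTYPE_PLACEHOLDER",
        "mlst_type": "ST_PLACEHOLDER",
        "strain": "STRAIN_PLACEHOLDER",
    }
]
print(format_submission(example_rows))
```

Write the returned text to a file before uploading it to precisionFDA.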
To begin your submission, click "Submit Challenge Entry" at the top of this page. The submission screen will ask for a title, a description, and a tab separated text file containing Salmonella serotype, MLST type (7-gene), and strain(s) for your entry.
Start by providing a short title for your submission entry, then fill in the description. In your description, please identify whether you are participating as an individual or as part of a team (and, if it is a team effort, please identify all members of your team). As an additional option, you can explain the pipeline used to obtain the results. The description entry field supports Markdown syntax. Don't worry if you don't get it perfect right away; you can always go back and edit this description later.
If you opted to construct your pipeline on precisionFDA and your Salmonella identifications were generated by running one or more precisionFDA apps, the system will automatically prompt you to share the details of these executions when submitting your entry (see below). In that case, your optional pipeline description only needs to provide a high-level summary, because the exact software invocations will be available to the community via the system’s sharing mechanism.
In the submission input data section, click "Select file..." and choose the tab separated text file you'd like to submit. (In the popup, click "Files", then tick "Only mine", then select your file.) If you ran your pipeline in your own environment rather than on precisionFDA, you must first upload your tab separated text file to precisionFDA in order to select it in the submission input data section.
Once you have entered a title, a description, and chosen a tab separated text file, the "Submit" button on the upper right corner will become active. Click the button to invoke the publishing wizard, which will prompt you to publish the tab separated text file so that others can see it (this is a requirement for participating in the challenge). If your tab separated text file was generated by running apps on precisionFDA (instead of being uploaded externally), the system will ask if you also want to publish the related job, app, and app assets.
After completing the publishing wizard, your tab separated text file will become public, and your entry will be officially submitted. The system will conduct an initial verification of your entry, by running software for validating and scoring the tab separated text file. During that time, your entry will appear as "pending verification" under the "my entries" tab, but will not yet show up under the public "challenge submissions" tab. This step takes several minutes, and you can monitor it by clicking "View Log".
If your entry fails this step, it will be marked as "failed". You can click "View Log" to see some diagnostic information related to the execution of the verification software. Failed entries will not show up under the "challenge submissions" tab and will not be counted towards the challenge. (However, your tab separated text file will already have been made public, although you are welcome to delete it.)
If verification completes successfully, your entry will appear under "challenge submissions".
Successful entries cannot be revoked, but you can always submit new ones. You can also go back and edit the title and description of any entry.
(Optional) Methods Description
You are encouraged to submit a 1-2 page write-up describing the methods you used for the challenge. As described above, please submit your optional methods description in the description box when submitting your entry.
Determining Top Performers
PrecisionFDA will select a number of entries which will receive an acknowledgment for participation in the challenge. Each challenge entry will be assessed on the following criteria, and top performers will be recognized on the precisionFDA website.
- Number of culture-negative samples that were correctly identified as Salmonella negative
- Number of synthetic positive samples in which Salmonella was identified as being present
- Number of synthetic positive samples in which the Salmonella serotype was correctly identified
- Number of synthetic positive samples in which the MLST type was correctly identified
- Number of synthetic positive samples in which the Salmonella strain was correctly identified
- Number of culture-positive samples in which Salmonella was identified as being present
- Number of culture-positive samples in which the Salmonella serotype was correctly identified
- Number of culture-positive samples in which the MLST type was correctly identified
- Number of culture-positive samples in which the Salmonella strain was correctly identified
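To check your own results against the training data, the criteria above could be tallied along the lines of the sketch below. This is not the official scoring software. The data layout (a `category` key per true sample, a `present` flag per prediction) and the rule that a sample without a prediction counts as not called are assumptions made for illustration.

```python
from collections import Counter

FIELDS = ("serotype", "mlst_type", "strain")


def tally(truth, predictions):
    """Count, per sample category, how many calls match the known answers."""
    counts = Counter()
    for sample, expected in truth.items():
        predicted = predictions.get(sample, {})
        category = expected["category"]
        called_present = bool(predicted.get("present", False))
        if category == "culture-negative":
            if not called_present:
                counts["culture-negative: correctly negative"] += 1
            continue
        if called_present:
            counts[f"{category}: Salmonella present"] += 1
        for field in FIELDS:
            value = predicted.get(field)
            if value and value == expected[field]:
                counts[f"{category}: {field} correct"] += 1
    return counts
```

A `Counter` returns 0 for any criterion that was never met, so every count can be read back without checking whether the key exists.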
* Top performer recognition on a precisionFDA challenge is an acknowledgment by the precisionFDA community and does not imply FDA endorsement of any organization, tool, software, etc.