PrecisionFDA
NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge - Subchallenge 1


Sample mislabeling (accidental swapping of patient samples) or data mislabeling (accidental swapping of patient omics data) is known to be one of the obstacles in basic and translational research because this accidental swapping contributes to irreproducible results and invalid conclusions. The objective of this challenge is to encourage development and evaluation of computational algorithms that can accurately detect and correct mislabeled samples using rich multi-omics datasets.


  • Starts
    2018-09-24 19:00:00 UTC
  • Ends
    2018-11-05 04:59:59 UTC

The Food and Drug Administration (FDA) and National Cancer Institute (NCI) call on the scientific community to develop and evaluate computational algorithms that can accurately detect and correct mislabeled samples using rich multi-omics datasets.

Challenge Time Period
Subchallenge 1: September 24, 2018 through November 4, 2018
Subchallenge 2: November 5, 2018 through December 18, 2018

AT A GLANCE

In biomedical research, sample mislabeling (accidental swapping of patient samples) or data mislabeling (accidental swapping of patient omics data) has been a long-standing problem that contributes to irreproducible results and invalid conclusions. These problems are particularly prevalent in large scale multi-omics studies, in which multiple different omics experiments are carried out at different time periods and/or in different labs. Human errors could arise during sample transferring, sample tracking, large-scale data generation, and data sharing/management. Thus, there is a pressing need to identify and correct sample and data mislabeling events to ensure the right data for the right patient. Simultaneous use of multiple types of omics platforms to characterize a large set of biological samples, as utilized in The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) projects, has been demonstrated as a powerful approach to understanding the molecular basis of diseases and speeding the translation of new discoveries to patient care. Comprehensive multi-omics data obtained on the same patient sample can also add value in pinpointing and correcting mislabeling problems that can be encountered in the process. The FDA and NCI-CPTAC have joined forces to launch this challenge to encourage the development and evaluation of computational algorithms that can accurately detect and correct mislabeled samples using rich multi-omics datasets (Boja et al. 2018).

The challenge comprises two subchallenges to be conducted sequentially. In Subchallenge 1, participants will be presented with clinical and proteomics data for the same set of samples and asked to detect samples with unmatched clinical and proteomics data. In Subchallenge 2, participants will be further presented with RNA-Seq data from the same samples as in Subchallenge 1 and asked to identify mislabeled samples, specify the data type with mislabeling, and suggest the correct sample labels.

CHALLENGE DETAILS

Getting on the precisionFDA website

If you do not yet have a contributor account on precisionFDA, file an access request with your complete information and indicate that you are entering the challenge. The FDA acts as a steward by providing the precisionFDA service to the community and ensuring proper use of the resources, so your request will be initially pending. In the meantime, you will receive an email with a link to access the precisionFDA website in browse (guest) mode. Once approved, you will receive another email with your contributor account information.

With your contributor account, you can use the features required to participate in the challenge (such as transferring files or running comparisons). Everything you do on precisionFDA is initially private to you (not accessible to the FDA or the rest of the community) until you choose to publicize it. In other words, you can immediately start working on the challenge in private, and whenever you are ready you can officially publish your results as your challenge entry.

Locating and understanding the files

| File type | Files | Description |
|---|---|---|
| Training Clinical Data | train_cli.tsv | Contains clinical information (gender and microsatellite instability (MSI) status) for the 80 training samples. |
| Training Proteomic Data | train_pro.tsv | Proteomics data from the 80 training samples. Each row represents a protein and each column represents a training sample. |
| Training RNA-Seq Data | Coming November 5 | RNA-Seq data from the 80 training samples. Each row represents a gene and each column represents a training sample. |
| Test Clinical Data | test_cli.tsv | Contains clinical information (gender and MSI status) for the 80 test samples. |
| Test Proteomic Data | test_pro.tsv | Proteomics data from the 80 test samples. Each row represents a protein and each column represents a test sample. |
| Test RNA-Seq Data | Coming November 5 | RNA-Seq data from the 80 test samples. Each row represents a gene and each column represents a test sample. |
| Subchallenge 1 Training Key | sum_tab_1.csv | Mislabeling information for the training samples. A value of 1 indicates that the clinical and proteomics data are not from the same tumor sample (i.e., a mismatch). A value of 0 indicates that the clinical and proteomics data are from the same tumor sample. |
| Subchallenge 2 Training Key | Coming November 5 | Real sample labels for the 80 training samples. Each row represents the real labels of the three data sources for the corresponding training sample. |
| README | README.txt | Contains descriptions of all of the provided files. |

Understanding the data

Original data description

Paired proteomics and RNA-Seq data were generated for each of the 162 tumor samples. Protein quantification was based on spectral counting, and mRNA quantification was based on Fragments Per Kilobase of transcript per Million mapped reads (FPKM). For both proteomics and RNA-Seq data, genes with more than 50% missing values were removed, except for genes located on the X or Y chromosomes, which were retained even if they were missing in more than 50% of the samples. The proteomics data was then normalized using quantile normalization followed by batch correction using ComBat, whereas the RNA-Seq data was normalized using the trimmed mean of M-values (TMM) normalization method followed by batch correction using ComBat.
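The missing-value filter and the quantile normalization step described above can be sketched as follows. This is a minimal illustration in plain Python, not the organizers' pipeline: the function names, the use of None for missing values, and the toy gene names are assumptions made for the example. Ties are ignored, the normalization assumes a complete matrix, and ComBat batch correction is omitted.

```python
def filter_missing(rows, sex_chrom_genes, max_missing=0.5):
    """Drop genes with more than max_missing missing values (None),
    but always keep genes listed in sex_chrom_genes (chrX / chrY)."""
    kept = {}
    for gene, values in rows.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction <= max_missing or gene in sex_chrom_genes:
            kept[gene] = values
    return kept


def quantile_normalize(columns):
    """Quantile-normalize a list of samples (each a list of values with
    no missing entries) so that all samples share one distribution."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values that occupy each rank across all samples.
    rank_means = [sum(sc[i] for sc in sorted_cols) / len(columns) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy input: three genes measured in four samples.
toy_rows = {
    "GENE_A": [1.0, None, None, None],     # 75% missing, removed
    "GENE_B": [1.0, 2.0, None, None],      # 50% missing, kept
    "CHRX_GENE": [None, None, None, 1.0],  # 75% missing, kept (sex chromosome)
}
kept = filter_missing(toy_rows, sex_chrom_genes={"CHRX_GENE"})
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After normalization both toy samples contain the same set of values (1.5, 3.5, 5.5), arranged according to each sample's original ranking.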

Challenge dataset generation

From a total of 162 samples, we randomly selected 80 samples for training and another 80 for testing. Then we introduced labeling errors to proteomics, RNA-Seq and clinical information data matrices in both the training and test sets.

Specifically, based on the observed patterns and rates of sample labeling errors in various TCGA data sets, we introduced similar percentages of errors. Detailed rules are provided below:

  1. We introduced labeling errors to around 10% of the samples for the proteomics data and RNA-Seq data, respectively. In addition, we introduced labeling errors to around 5% of the samples in the clinical information table. Sample labeling errors were not shared across different types of data (i.e., for each sample, a mislabeling error occurs in at most one type of data), so that all three data types can be used to identify the sources of the error.
  2. For proteomics and RNA-Seq data, we introduced three error types: sample duplication (B to A’, where A’ is a duplicate of A), sample swapping (A to B and B to A), and sample shifting (A to B, B to C, and C to D). Duplicated proteomic samples came from technical replicates (outputs from independent proteomics experiments of the same biological samples), whereas duplicated RNA-Seq samples were simulated by adding a perturbation. The swapped samples were required to have different gender or MSI status.
  3. For clinical data, we introduced only swapping (A to B and B to A), and only between samples of different gender.
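The swapping rule for proteomics and RNA-Seq data can be illustrated with a small simulation. The sketch below is hypothetical: the sample names, the clinical value encodings ("F"/"M", "high"/"low"), and all function names are assumptions made for the example and are not taken from the challenge files. It covers swapping only, not duplication or shifting.

```python
def can_swap(clin_a, clin_b):
    """A swap is allowed only if the two samples differ in gender or MSI status."""
    return clin_a["gender"] != clin_b["gender"] or clin_a["msi"] != clin_b["msi"]


def swap_labels(assignment, a, b, clinical):
    """Return a copy of assignment (label -> true source sample) in which
    the data under labels a and b have been exchanged."""
    if not can_swap(clinical[a], clinical[b]):
        raise ValueError(f"{a} and {b} share gender and MSI status")
    swapped = dict(assignment)
    swapped[a], swapped[b] = assignment[b], assignment[a]
    return swapped


def mismatch_key(assignment):
    """1 if the data under a label comes from a different sample, else 0."""
    return {label: int(source != label) for label, source in assignment.items()}


# Hypothetical clinical table for three samples.
clinical = {
    "S1": {"gender": "F", "msi": "low"},
    "S2": {"gender": "M", "msi": "low"},
    "S3": {"gender": "F", "msi": "low"},
}
identity = {sample: sample for sample in clinical}
swapped = swap_labels(identity, "S1", "S2", clinical)
key = mismatch_key(swapped)  # S1 and S2 are flagged, S3 is not
```

A swap between S1 and S3 would be rejected here because the two samples agree in both gender and MSI status, so the error could not be detected from the clinical attributes.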

Developing and running your algorithm

Performing Subchallenge 1 analysis

In the first subchallenge, participants are presented with a training data set and a test data set, both consisting of clinical and protein profiling data. The participants will develop computational algorithms to model the relationship between clinical attributes and protein profiles using the training data set, where clinical and protein profiling data are perfectly matched for the majority of the samples. The model will be used to identify samples in the test data set that have unmatched clinical and protein profiling data, and the results will be submitted.
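One possible way to approach this, shown purely as an illustration, is to learn a classifier that predicts a clinical attribute from the protein profile and then flag test samples whose prediction disagrees with the annotated value. The sketch below uses a nearest-centroid classifier on gender in plain Python. The sample names, feature values, and function names are hypothetical, and a real entry would also need feature selection (for example, proteins encoded on the X or Y chromosomes may be informative for gender), a second model for MSI status, and some handling of the mislabeled training samples.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fit_centroids(profiles, labels):
    """Compute one centroid per clinical label from the training profiles."""
    groups = {}
    for sample, vec in profiles.items():
        groups.setdefault(labels[sample], []).append(vec)
    return {label: centroid(vecs) for label, vecs in groups.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(c):
        return sum((x - y) ** 2 for x, y in zip(vec, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def flag_mismatches(centroids, profiles, labels):
    """1 if the predicted label disagrees with the annotated label, else 0."""
    return {s: int(predict(centroids, v) != labels[s]) for s, v in profiles.items()}


# Hypothetical two-protein toy data.
train_pro = {
    "Training_1": [9.0, 0.1], "Training_2": [8.5, 0.0],
    "Training_3": [0.2, 7.9], "Training_4": [0.0, 8.3],
}
train_gender = {
    "Training_1": "Female", "Training_2": "Female",
    "Training_3": "Male", "Training_4": "Male",
}
test_pro = {"Testing_1": [8.8, 0.2], "Testing_2": [0.1, 8.0]}
test_gender = {"Testing_1": "Female", "Testing_2": "Female"}

model = fit_centroids(train_pro, train_gender)
flags = flag_mismatches(model, test_pro, test_gender)
```

In this toy case Testing_2 is flagged because its protein profile resembles the Male centroid while its clinical annotation says Female.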

Performing Subchallenge 2 analysis

In the second subchallenge, participants will be presented with additional RNA profiling data for both the training and testing samples. Again, clinical, protein profiling, and mRNA profiling data are perfectly matched for the majority of the samples in the training cohort. For unmatched samples, however, at most one data type may be mislabeled. The test data set has similar mislabeling patterns. As in the first subchallenge, participants will develop a computational algorithm to model the relationships among the three data types in the training data set, and then apply the model to identify and correct mislabeled data. Correction is possible because only one data type among the three is mislabeled.
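As a rough illustration of how two data types can be matched against each other, the sketch below pairs each proteomics profile with the RNA-Seq profile it correlates with most strongly. It is a simplified assumption-laden example, not a recommended method: the sample names, values, and function names are hypothetical, the profiles are assumed to cover the same genes in the same order, and no per-gene standardization is applied. Deciding which of the two data types carries the wrong label additionally requires the third data type.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def best_matches(pro, rna):
    """For each proteomics label, the RNA-Seq label with the highest correlation."""
    return {p: max(rna, key=lambda r: pearson(pro[p], rna[r])) for p in pro}


# Hypothetical profiles over four shared genes.
rna = {"S1": [1, 2, 3, 4], "S2": [4, 3, 2, 1], "S3": [1, 3, 2, 4]}
pro = {
    "S1": [1.1, 2.0, 2.9, 4.2],
    "S2": [1.0, 3.1, 2.1, 3.9],  # resembles the RNA-Seq profile labeled S3
    "S3": [4.1, 2.9, 2.2, 0.8],  # resembles the RNA-Seq profile labeled S2
}
matches = best_matches(pro, rna)
suspects = sorted(label for label, match in matches.items() if match != label)
```

Here S2 and S3 each match the other's RNA-Seq profile, which is the signature of a swap between the two labels in one of the data types.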

(Optional) Reconstructing your pipeline on precisionFDA

You have the option of reconstructing your pipeline on precisionFDA and running it there. To do that, you must create one or more apps on precisionFDA that encapsulate the actions performed in your pipeline. To create an app, you can provide Linux executables and an accompanying shell script to be run inside an Ubuntu VM on the cloud. The precisionFDA website contains extensive documentation on how to create apps, and you can also click the Fork button on an existing app (such as bwa_mem_bamsormadup) to use it as a starting point for developing your own.

Constructing your pipeline on precisionFDA has an important advantage: you can, at your discretion, share it with the community so that others can take a look at it and reproduce your results – and perhaps build upon it and further improve it.

Submission format

For subchallenge 1, participants are required to submit a comma separated text file (CSV) named “subchallenge_1.csv”. A sample file is shown below:

sample,mismatch
Testing_1,0
Testing_2,1
Testing_3,0

Testing_78,1
Testing_79,0
Testing_80,0

Each row represents the mismatch prediction for a sample, with the first column indicating the sample name and the second column being either 0 or 1. A value of 1 indicates a mismatch between the clinical and proteomics profile data; a value of 0 indicates no mismatch.
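A file in this format can be produced with Python's standard csv module. The following is a minimal sketch: the function name and the choice to order rows by the numeric suffix of the sample name are assumptions for illustration, and the text it returns would still need to be saved as subchallenge_1.csv.

```python
import csv
import io


def subchallenge_1_csv(predictions):
    """Render {sample_name: 0 or 1} as CSV text with a header row."""
    for name, flag in predictions.items():
        if flag not in (0, 1):
            raise ValueError(f"{name}: mismatch must be 0 or 1, got {flag!r}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "mismatch"])
    # Order rows by the number after the last underscore (Testing_2 before Testing_10).
    for name in sorted(predictions, key=lambda s: int(s.rsplit("_", 1)[1])):
        writer.writerow([name, int(predictions[name])])
    return buffer.getvalue()


text = subchallenge_1_csv({"Testing_2": 1, "Testing_1": 0, "Testing_10": 0})
```

Validating the values before writing helps catch entries that the verification step would otherwise reject.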

For subchallenge 2, participants are asked to submit another comma separated text file named “subchallenge_2.csv”. A sample file is shown below:

sample,clinical,rnaseq,proteomics
Testing_1,1,1,1
Testing_2,1,2,2
Testing_3,3,3,79

Testing_78,-1,78,78
Testing_79,79,79,3
Testing_80,80,80,80

Each row contains the label predictions for a sample, with the first column indicating the sample name and the remaining columns giving the predicted sample ID for each data type. If you predict that a field is mislabeled but cannot identify the correct label, specify ‘-1’ in the corresponding field.
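To make the format concrete, the sketch below reads a subchallenge 2 file and lists the fields whose predicted ID differs from the row's own sample. It assumes, based only on the sample file above, that a sample's own ID is the number after the underscore in its name (Testing_3 has ID 3); the function name is hypothetical.

```python
import csv
import io

DATA_TYPES = ["clinical", "rnaseq", "proteomics"]


def flagged_fields(csv_text):
    """Return {sample: {data_type: predicted_id}} for every field whose
    predicted ID differs from the sample's own ID (-1 means 'unknown')."""
    flagged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        own_id = int(row["sample"].rsplit("_", 1)[1])
        diffs = {t: int(row[t]) for t in DATA_TYPES if int(row[t]) != own_id}
        if diffs:
            flagged[row["sample"]] = diffs
    return flagged


# Three rows taken from the sample file above.
sample_text = (
    "sample,clinical,rnaseq,proteomics\n"
    "Testing_1,1,1,1\n"
    "Testing_3,3,3,79\n"
    "Testing_78,-1,78,78\n"
)
flagged = flagged_fields(sample_text)
```

Read this way, the Testing_3 row predicts that its proteomics data actually belongs to sample 79, and the Testing_78 row predicts a clinical mislabel whose correct label could not be determined.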

Submitting your entry

To begin your submission, click "Submit Challenge Entry" at the top of this page. The submission screen will ask for a title, a description, and a comma separated text file containing your submission for either subchallenge 1 or subchallenge 2.

Start by providing a short title for your submission entry, then fill in the description. In your description, please state whether you are participating as an individual or as part of a team (and, if it is a team effort, identify all members of your team), and describe the method you used. The description entry field supports Markdown syntax. Don't worry if you don't get it perfect right away; you can always go back and edit this description later.

In the submission input data section, click "Select file..." and choose the comma separated text file you'd like to submit. (In the popup click "Files", then check "Only mine", then select your file). If you ran your pipeline in your own environment, rather than on precisionFDA, you must first upload your comma separated text file to precisionFDA in order to select it in the submission input data section.

Once you have entered a title, a description, and chosen a comma separated text file, the "Submit" button on the upper right corner will become active. Click the button to invoke the publishing wizard, which will prompt you to publish the comma separated text file so that others can see it (this is a requirement for participating in the challenge). If your comma separated text file was generated by running apps on precisionFDA (instead of being uploaded externally), the system will ask if you also want to publish the related job, app, and app assets.

After completing the publishing wizard, your comma separated text file will become public, and your entry will be officially submitted. The system will conduct an initial verification of your entry, by running software for validating and scoring the submission. During that time, your entry will appear as "pending verification" under the "my entries" tab, but will not yet show up under the public "challenge submissions" tab. This step takes several minutes, and you can monitor it by clicking "View Log".

If your entry fails this step, it will be marked as "failed". You can click "View Log" to see some diagnostic information related to the execution of the verification software. Failed entries will not show up under the "challenge submissions" tab and will not be counted towards the challenge. Note that your comma separated text file will already have been made public at this point; you are welcome to delete it.

If verification completes successfully, your entry will appear under "challenge submissions."

Successful entries cannot be revoked, but you can always submit new ones. You can also go back and edit the title and description of any entry.

Methods Description

You are required to submit a short 1-2 page write-up describing the methods you used for the challenge. As described above, please submit your methods description in the description box when submitting your entry.

Determining Top Performers

We will evaluate the two subchallenges separately and pick the top 3 scoring teams as the winners of each subchallenge. For subchallenge 1, the winners are determined by the highest F1 scores of the submitted models. For subchallenge 2, we evaluate the model performance at 3 different levels. First, we assess how well the model predicts mislabeling at the sample level. If any of the predicted labels of the three data types does not match the original sample label, it is considered a mislabel identification at the sample level. Second, we evaluate the model at the sample-data level, where the label prediction for each individual data type of each sample is compared with the original labels. A prediction that correctly identifies a mislabeled data type of a sample, even if it does not correctly rectify it, is considered a true positive at this level. Finally, we assess performance at the correction level by checking how effectively the model corrects sample mislabeling. At this level, a prediction is considered a true positive only when the corrected label matches the true sample label. We will average the F1 scores at these three levels to obtain the final rank for subchallenge 2. Bootstrap-based strategies will be used to declare ties when performances are close.
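For reference, the F1 score is the harmonic mean of precision and recall, with a mismatch (value 1) treated as the positive class. The sketch below shows the standard calculation for subchallenge 1 and the averaging step for subchallenge 2. It is an illustration only: the function names and toy data are hypothetical, and the organizers' scoring software may differ in details such as how missing samples are handled.

```python
def f1_score(truth, predicted):
    """F1 for binary labels given as {sample: 0 or 1}; 1 is the positive class."""
    tp = sum(1 for s in truth if truth[s] == 1 and predicted.get(s) == 1)
    fp = sum(1 for s in truth if truth[s] == 0 and predicted.get(s) == 1)
    fn = sum(1 for s in truth if truth[s] == 1 and predicted.get(s) != 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def subchallenge_2_score(f1_sample, f1_sample_data, f1_correction):
    """Average of the F1 scores at the three evaluation levels."""
    return (f1_sample + f1_sample_data + f1_correction) / 3


# Toy example: 2 true positives, 1 false positive, 1 false negative.
truth = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 1}
predicted = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 1}
score = f1_score(truth, predicted)  # precision = recall = 2/3, so F1 = 2/3
```

Because F1 ignores true negatives, a submission that predicts 0 for every sample scores 0 regardless of how many samples are correctly left unflagged.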

Opportunity to Publish the Challenge Results

We are pleased to announce that Nature Medicine supports the submission of an overview paper describing the precisionFDA NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge and broadly applicable insights that emerge from it. Publication in Nature Medicine will be contingent on a standard evaluation process including editorial assessment and peer review.

Challenge Discussion

Please use the NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge Discussion on the precisionFDA Discussions forum to discuss the challenge.

Frequently Asked Questions (FAQ)

  • Q: Do I need to participate in both subchallenges?
    • A: No. Participants can choose to participate in subchallenge 1, subchallenge 2, or both subchallenges.
  • Q: Is there a requirement on the modeling approach?
    • A: No. Participants can choose to use supervised, unsupervised, semi-supervised, or any other modeling approaches as appropriate.

Challenge Team

  • PrecisionFDA: Elaine Johanson, Ruth Bandler
  • FDA CDRH: Zivana Tezak, Adam Berger, Majda Haznadar
  • NCI-CPTAC: Henry Rodriguez, Emily Boja, Alexis Carter
  • Pacific Northwest National Laboratory: Samuel Payne
  • Baylor College of Medicine: Bing Zhang, Bo Wen, Zhiao Shi
  • Mount Sinai School of Medicine: Pei Wang, Seungyeul Yoo
  • IBM: Gustavo Stolovitzky
  • DNAnexus: Singer Ma, John Didion
  • Booz Allen: Zeke Maier, Sarah Prezek

References

Boja, E, et al. Right data for right patient-a precisionFDA NCI-CPTAC Multi-omics Mislabeling Challenge. Nat Med. 2018;24(9):1301-1302.