PrecisionFDA
NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge - Subchallenge 2


Sample mislabeling (accidental swapping of patient samples) and data mislabeling (accidental swapping of patient omics data) are known obstacles in basic and translational research because such accidental swaps contribute to irreproducible results and invalid conclusions. The objective of this challenge is to encourage the development and evaluation of computational algorithms that can accurately detect and correct mislabeled samples using rich multi-omics datasets.


  • Starts
    2018-11-06 00:00:00 UTC
  • Ends
    2018-12-19 07:59:59 UTC

The precisionFDA NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge – Subchallenge 2 ran from November 5, 2018 to December 18, 2018. This subchallenge asked participants to develop computational models for identifying and correcting samples with mismatched clinical, protein profiling, and mRNA profiling data. The challenge received 82 valid entries from 30 participants.

This NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge – Subchallenge 2 results page displays the summarized results in the tables below. As with previous challenges, due to novelties related to the truth data and the comparison methodology, these results offer a first glance at our understanding. We welcome the community to further explore these results and provide insight for the future.

Introductory Remarks

At the start of this challenge, participants were provided with paired clinical, proteomics, and mRNA profiling data for each of 160 tumor samples. The 160 tumor samples contained labeling errors and were divided into training and test sets. Participants were asked to develop computational algorithms that model the relationship between clinical attributes, protein profiles, and mRNA profiles using the training data set, then apply the model to identify and correct samples in the test data set in which one of the three data types was mislabeled. Sample mislabeling patterns and rates were introduced based on observations in the TCGA data sets.

Overview of Results

For each sample in the test data set, participants submitted a prediction indicating whether there is a mismatch between the clinical, proteomics profiling, and mRNA profiling data. Predictions were compared to the known mislabeled samples. For this subchallenge, precision, recall, and F-score were computed by comparing the binary mismatch predictions to the known mismatched samples. These three evaluation metrics are defined in the table below.

Metric Definition
Precision True Positives / (True Positives + False Positives)
Recall True Positives / (True Positives + False Negatives)
F-score Harmonic mean of Precision and Recall
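The three metrics above can be computed directly from binary mismatch predictions. The following sketch is illustrative only; the function name and the 0/1 encoding (1 = mislabeled) are assumptions, not part of the challenge's scoring code.

```python
def precision_recall_fscore(y_true, y_pred):
    """Compute precision, recall, and F-score for binary mismatch predictions.

    y_true, y_pred: sequences of 0/1, where 1 marks a mislabeled sample.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
    return precision, recall, fscore

# Example: three truly mislabeled samples, two detected, no false alarms.
p, r, f = precision_recall_fscore([1, 0, 1, 1], [1, 0, 0, 1])
# p = 1.0, r = 2/3, f = 0.8
```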

For subchallenge 2, we evaluated the model performance at three different levels:

  1. Sample level – measures performance per sample. If the predicted label of any of the three data types does not match the original sample label, the sample is counted as mislabeled at this level.
  2. Sample-data level – measures performance at the level of each individual data type of each sample. A prediction that correctly identifies a mislabeled data type of a sample is considered a true positive at this level, even if the mislabeling is not corrected.
  3. Correction level – measures performance at correcting sample mislabeling. At this level, a prediction is considered a true positive only when the corrected label matches the true sample label.

The final ranking was computed by averaging the F-scores at the three levels.
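The three levels can be illustrated with the following sketch. The data representation is hypothetical: each (sample, data type) pair maps to the true sample label its data belongs to and to the label the model assigns (the sample's own ID when no mislabeling is predicted). Function names and the exact false-positive bookkeeping at the correction level are assumptions, not the organizers' scoring code.

```python
def _f(tp, fp, fn):
    """F-score from raw counts (harmonic mean of precision and recall)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def three_level_fscores(samples, datatypes, true_label, pred_label):
    """Return (sample-level, sample-data-level, correction-level) F-scores.

    true_label[(s, d)] – true sample label of data type d filed under sample s
    pred_label[(s, d)] – label the model assigns (its proposed correction)
    A data type is truly mislabeled when true_label != s, and predicted
    mislabeled when pred_label != s.
    """
    # Sample level: a sample counts as mislabeled if any data type is.
    tp = fp = fn = 0
    for s in samples:
        truly = any(true_label[(s, d)] != s for d in datatypes)
        pred = any(pred_label[(s, d)] != s for d in datatypes)
        tp += truly and pred
        fp += (not truly) and pred
        fn += truly and not pred
    f_sample = _f(tp, fp, fn)

    # Sample-data level: one binary decision per (sample, data type).
    tp = fp = fn = 0
    for s in samples:
        for d in datatypes:
            truly = true_label[(s, d)] != s
            pred = pred_label[(s, d)] != s
            tp += truly and pred
            fp += (not truly) and pred
            fn += truly and not pred
    f_sample_data = _f(tp, fp, fn)

    # Correction level: a true positive requires the corrected label to
    # match the true label exactly; a flagged-but-miscorrected data type
    # counts against both precision and recall (an assumption here).
    tp = fp = fn = 0
    for s in samples:
        for d in datatypes:
            truly = true_label[(s, d)] != s
            pred = pred_label[(s, d)] != s
            correct = pred and pred_label[(s, d)] == true_label[(s, d)]
            tp += truly and correct
            fp += pred and not correct
            fn += truly and not correct
    f_correction = _f(tp, fp, fn)

    return f_sample, f_sample_data, f_correction
```

Under this sketch, the final ranking score would be the mean of the three returned F-scores.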

Finally, to determine significant performance differences between submissions, a bootstrapping approach was used to compute the confidence interval of the F-score of each submission. Rankings were generated based on: (1) method performance, by treating each submission as unique, and (2) submitter performance, by taking the median F-score of each participant’s submissions.
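A percentile bootstrap of this kind can be sketched as below. The resample count, seed handling, and interval indexing are illustrative assumptions; the challenge's figure states 100 iterations, which is used as the default here.

```python
import random
import statistics

def fscore(y_true, y_pred):
    """F-score of binary 0/1 predictions (1 = mislabeled)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def bootstrap_ci(y_true, y_pred, score_fn, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap for a score: resample (truth, prediction)
    pairs with replacement and return (mean, sd, lower, upper)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(score_fn([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int(alpha / 2 * n_boot)]
    upper = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return statistics.mean(scores), statistics.stdev(scores), lower, upper
```

A perfect submission, for example, yields an F-score of 1 in every resample, so its mean is 1 with zero spread, matching the top rows of the results tables below.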

Method Performance Results

The table below shows the top 3 highest performing submissions/methods based on F-scores.

Name Submission Precision Recall F-score Mean(F-score) SD(F-score) 95%_CI_lower 95%_CI_upper
Anders Carlsson thep_bionamic_sub2_B 1 1 1 1 0 1 1
Renke Pan subchallenge_2_modelA_1 1 1 1 1 0 1 1
Soon Jye Kho subchallenge_2_sub1 1 1 1 1 0 1 1

The anonymized complete results table can be downloaded here. To protect the identity of participants, each participant’s performance will be emailed.

Submitter Performance Results

The table below shows the top 3 highest performing participants based on the median F-scores of their submissions.

Name F-score Mean(F-score) SD(F-score) 95%_CI_lower 95%_CI_upper
Anders Carlsson 0.967 0.965 0.009 0.963 0.967
Soon Jye Kho 0.967 0.965 0.009 0.963 0.967
Renke Pan 0.933 0.924 0.015 0.921 0.927

The performance of all submitters is shown in the figure below:

Figure: Median F-scores and the corresponding 95% confidence intervals. The confidence intervals were derived through a bootstrap procedure with 100 iterations.

The anonymized complete results table can be downloaded here. To protect the identity of participants, each participant’s performance will be emailed.

Scientific Manuscript

The NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge team plans to prepare a scientific manuscript describing the challenge and its results. All challenge participants who submit a one-page description of their methods will be included as challenge participant consortium authors. In addition, the challenge team will select some participants, based on performance and/or unique methodology, to participate in the manuscript development.

Challenge Key

The subchallenge 2 key, including mislabeling information for the test samples, is available here.