NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge - Subchallenge 1

Sample mislabeling (accidental swapping of patient samples) or data mislabeling (accidental swapping of patient omics data) is known to be one of the obstacles in basic and translational research because this accidental swapping contributes to irreproducible results and invalid conclusions. The objective of this challenge is to encourage development and evaluation of computational algorithms that can accurately detect and correct mislabeled samples using rich multi-omics datasets.

  • Starts
    2018-09-24 19:00:00 UTC
  • Ends
    2018-11-05 04:59:59 UTC


The U.S. Food and Drug Administration (FDA) regulates a wide range of products that represent approximately 25% of the U.S. economy. These products include drugs, vaccines, medical devices, foods, cosmetics, dietary supplements, tobacco products, and animal drugs and devices. The FDA utilizes post-market surveillance to identify potential public health impacts associated with these products. Passive surveillance is conducted via collection of voluntary reports of adverse reactions associated with products submitted by the public, including patients, patient guardians, health care providers, and product manufacturers. Analysis of the adverse event reports enables detection of possible safety issues in order to protect public health.

The FDA utilizes several systems to collect adverse event reports. The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports, and product quality complaints submitted following exposure to drugs and biological products. In total, there are more than 18.5 million reports in FAERS as of September 30, 2019, and more than 1.5 million reports have been submitted annually to FAERS since 2015. The Vaccine Adverse Event Reporting System (VAERS), co-sponsored by the FDA and Centers for Disease Control and Prevention (CDC), collects adverse events associated with vaccines. The Manufacturer and User Facility Device Experience (MAUDE) database stores medical device reports (MDRs) of adverse events involving medical devices. Several hundred thousand MDRs are stored in MAUDE annually. Also available are adverse event repositories for foods, dietary supplements, and cosmetics in the CFSAN Adverse Event Reporting System (CAERS), and veterinary medicine-related adverse events in the Animal Drug Adverse Events (ADAE). This challenge will focus on detection of adverse event anomalies from FAERS, VAERS, and MAUDE.

FDA makes a significant amount of data, including adverse event data, publicly accessible via the openFDA platform. This resource provides application programming interfaces (APIs) that enable the public to access and explore FDA data, enabling the data to become more accessible, discoverable, and usable. Since its launch in 2014, openFDA has had over 205 million API calls. The objective of openFDA is to facilitate access and use of FDA public datasets by developers, researchers, and the public through harmonization of data across disparate datasets. Improved access to FDA data via openFDA also increases public awareness of the FDA, its mission, and the products it regulates (Kass-Hout et al., 2015). Ultimately, the availability of open data can increase interaction and engagement, and support creativity and innovation.
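
As a rough illustration of API access, the sketch below builds a query URL that asks openFDA for the most frequently reported reactions for one drug. The endpoint path and the field names (`patient.drug.medicinalproduct`, `patient.reaction.reactionmeddrapt`) are assumptions taken from the public openFDA drug adverse event API rather than from this challenge description, so check them against the current openFDA documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for drug adverse event reports on openFDA.
BASE_URL = "https://api.fda.gov/drug/event.json"


def build_count_query(drug_name, limit=10):
    """Return a URL that counts reported reaction terms for one drug name."""
    params = {
        # Assumed field names; verify against the openFDA field reference.
        "search": f'patient.drug.medicinalproduct:"{drug_name}"',
        "count": "patient.reaction.reactionmeddrapt.exact",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_reaction_counts(drug_name, limit=10):
    """Call the API (network access required) and return (term, count) pairs."""
    with urlopen(build_count_query(drug_name, limit), timeout=30) as response:
        payload = json.load(response)
    return [(row["term"], row["count"]) for row in payload.get("results", [])]


if __name__ == "__main__":
    # Only prints the URL; call fetch_reaction_counts() to hit the network.
    print(build_count_query("aspirin"))
```

Keeping URL construction separate from the network call makes the query easy to inspect and test offline.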

Based upon the recent Technology Modernization Action Plan (TMAP), which specified strategies for data, technology, and collaboration, the FDA announced its desire to develop an Agency-wide strategic approach for data. In support of this desire, the Modernizing FDA’s Data Strategy public meeting will focus on Agency level approaches to data quality, data stewardship, data exchange, and data analytics. The public FDA data, available via openFDA and other repositories, enables the public to explore the complexity of Agency data, including data quality and analytics challenges associated with adverse event reports.

FDA regulators use a variety of data mining methods and tools to analyze the volumes of adverse event reports and identify possible safety signals. Disproportionality methods, which identify unexpectedly high statistical associations between products and adverse events, serve as a primary method for identifying safety signals. Change-point analysis identifies changes in longitudinal adverse event patterns for products. Natural Language Processing (NLP) may be used to speed up the process of extracting and structuring key features, including symptoms, diagnosis, treatments, and dates, in the text narrative portion of adverse event reports. Finally, the biological plausibility of adverse event reports can also be assessed by leveraging knowledge of biological pathways (Duggirala et al., 2015).
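
One widely used disproportionality measure is the proportional reporting ratio (PRR). The sketch below computes it from a 2x2 table of report counts. It is offered only as an example of the family of methods described above, not as the specific method FDA applies, and the counts in the usage example are invented.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Compute the PRR from a 2x2 table of report counts.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with all other drugs and the event of interest
    d: reports with all other drugs and any other event
    """
    if a + b == 0 or c == 0:
        raise ValueError("PRR is undefined when a + b == 0 or c == 0")
    # Equivalent to (a / (a + b)) / (c / (c + d)), rearranged so the
    # counts are multiplied before the single division.
    return (a * (c + d)) / (c * (a + b))


if __name__ == "__main__":
    # Invented counts: the event makes up 20% of reports for the drug
    # but only 1% of reports for all other drugs.
    print(proportional_reporting_ratio(a=20, b=80, c=100, d=9900))
```

A PRR well above 1 means the event is reported disproportionately often for the drug; in practice it is usually read alongside a minimum case count and a significance test.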

While data mining and hands-on case reviews performed by medical officers have been the cornerstone of adverse event safety signal detection, other approaches, including machine learning and artificial intelligence, may provide novel insights into FDA adverse event data. Of particular interest is the automatic detection of anomalies in FDA adverse event data. Anomalies can take many forms, including:

  • Drug-specific adverse event patterns
    • E.g., common adverse reactions that were not identified in the clinical trials for a drug
    • E.g., common adverse reactions that are not identified on a drug label
  • Multi-drug adverse event patterns
    • E.g., common adverse reactions for a manufacturer
    • E.g., adverse reaction patterns shared by multiple drugs
    • E.g., adverse reaction patterns associated with drug interactions
  • Adverse event patterns associated with drug-patient interactions
    • E.g., drug adverse reaction patterns specific to sex
  • Time-dependent adverse event patterns

Automated adverse event anomaly detection would augment the traditional data mining and case review approach by enabling the unsupervised identification of novel potential safety signals. This FDA Open Data Challenge encourages the development of computational algorithms for automatic detection of drug adverse event anomalies. Because no set of known anomalies is available for training, participants will develop unsupervised algorithms.


Getting on the precisionFDA website

If you do not yet have a contributor account on precisionFDA, file an access request with your complete information and indicate your intent to participate in the challenge. The FDA acts as a steward by providing the precisionFDA service to the community and ensuring proper use of the resources, so your request will be initially pending. In the meantime, you will receive an email with a link to access the precisionFDA website in browse (guest) mode. Once approved, you will receive another email with your contributor account information.

With your contributor account, you can use the features required to participate in the challenge (such as transferring files or running comparisons). Everything you do on precisionFDA is initially private to you (not accessible to the FDA or the rest of the community) until you choose to publish it. In other words, you can immediately start working on the challenge in private, and when ready, you can officially publish your results as a challenge entry.

Locating and Understanding the Data

Site | Source | Data Type | Product Type | Link
openFDA | Drug Product Labels | Structured Product Labels | Drug | openFDA Drug Product Labels
openFDA | FAERS | Adverse Events | Drug | openFDA Drug Adverse Events
openFDA | MAUDE | Adverse Events | Device | openFDA Device Adverse Events
 | VAERS | Adverse Events | Vaccines | VAERS Data Download
 | Global Substance Registration System (GSRS) | Substances in Regulated Products | | openFDA Substance Data
 | | Databases of Clinical Studies | | Data Download

The formats of the data files include XML, JSON, and CSV. Participants are expected to parse the data files and can use any programming language. Below is an example of how to parse each file format using Python with the packages pandas, json, and xmltodict.
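
As a dependency-free counterpart, the sketch below parses a small in-memory sample of each format using only the Python standard library; comments note the roughly equivalent pandas and xmltodict calls. The sample records and field names (`report_id`, `product`, `reaction`) are invented for illustration and do not reflect the actual openFDA or VAERS schemas.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Invented sample data; real files use much richer schemas.
CSV_TEXT = "report_id,product,reaction\n1,Drug A,Headache\n2,Drug B,Nausea\n"
JSON_TEXT = '{"results": [{"report_id": "1", "reactions": ["Headache"]}]}'
XML_TEXT = (
    "<reports><report id='1'><reaction>Headache</reaction></report></reports>"
)


def parse_csv(text):
    """Return one dict per CSV row, keyed by the header line."""
    # With pandas: pd.read_csv(io.StringIO(text)) returns a DataFrame instead.
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text):
    """Return the decoded JSON document as nested dicts and lists."""
    return json.loads(text)


def parse_xml(text):
    """Return one dict per <report> element with its reaction terms."""
    # With xmltodict: xmltodict.parse(text) returns nested dicts for the whole tree.
    root = ET.fromstring(text)
    return [
        {"id": report.get("id"),
         "reactions": [el.text for el in report.findall("reaction")]}
        for report in root.findall("report")
    ]


if __name__ == "__main__":
    print(parse_csv(CSV_TEXT))
    print(parse_json(JSON_TEXT))
    print(parse_xml(XML_TEXT))
```

For real files, replace the in-memory strings with `open(path, encoding="utf-8")` and stream large downloads rather than loading them whole.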

Below is a diagram showing an example of how the datasets can be linked using fields like “active_ingredient” and “intervention_name”. There may be other ways to link the datasets besides the provided example. Participants are encouraged to explore the data to identify other ways to link the datasets to maximize the sample size.
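
One possible way to link records on these two fields is sketched below. It assumes each record has already been reduced to a flat dict with a single string value for “active_ingredient” or “intervention_name”; the real datasets may store these as lists or nested fields. The sample records and the extra keys (`label_id`, `trial_id`) are hypothetical.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so trivially different names match."""
    return " ".join(name.lower().split())


def link_records(labels, trials):
    """Pair each trial with every label whose active ingredient matches."""
    by_ingredient = defaultdict(list)
    for label in labels:
        by_ingredient[normalize(label["active_ingredient"])].append(label)

    linked = []
    for trial in trials:
        key = normalize(trial["intervention_name"])
        for label in by_ingredient.get(key, []):
            linked.append((label, trial))
    return linked


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    labels = [{"label_id": "L1", "active_ingredient": "Ingredient  X"}]
    trials = [
        {"trial_id": "T1", "intervention_name": "ingredient x"},
        {"trial_id": "T2", "intervention_name": "ingredient y"},
    ]
    for label, trial in link_records(labels, trials):
        print(label["label_id"], trial["trial_id"])
```

Exact matching after normalization is deliberately conservative; synonyms, salts, and brand names would need a mapping such as the substance data in GSRS to increase the number of linked samples.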

Developing and running your algorithm

Participants will develop computational algorithms to identify adverse event anomalies in the open FDA adverse event data available on openFDA. To augment this data, drug substance information is available in GSRS, and clinical trials information is available in the database of clinical studies listed in the table above. Algorithms should detect anomalies automatically, without training data labeled with known anomalies.
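
To make the idea of label-free detection concrete, here is a minimal sketch that flags unusual periods in a series of report counts using the modified z-score (based on the median and the median absolute deviation). It is one simple unsupervised technique among many, not a recommended or required approach, and the counts and the 3.5 threshold are illustrative assumptions.

```python
from statistics import median


def modified_z_scores(counts):
    """Return the modified z-score of each count (median/MAD based)."""
    med = median(counts)
    mad = median(abs(x - med) for x in counts)
    if mad == 0:
        # Assumption for this sketch: treat a series with zero MAD as
        # having no scorable outliers instead of dividing by zero.
        return [0.0 for _ in counts]
    return [0.6745 * (x - med) / mad for x in counts]


def flag_anomalous_periods(counts, threshold=3.5):
    """Return indices of periods whose |modified z-score| exceeds threshold."""
    scores = modified_z_scores(counts)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


if __name__ == "__main__":
    # Invented monthly report counts for one drug-event pair.
    monthly_counts = [10, 12, 11, 9, 10, 48, 11, 10]
    print(flag_anomalous_periods(monthly_counts))
```

Median-based scores are less distorted by the anomaly itself than mean-based ones, which matters when no clean reference period is known in advance.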

(Optional) Reconstructing your pipeline on precisionFDA

You have the option of reconstructing your pipeline on precisionFDA and running it there. To do that, you must create an app on precisionFDA that encapsulates the actions performed in your pipeline. To create an app, you can provide Linux executables and an accompanying shell script to be run inside an Ubuntu VM in the cloud. The precisionFDA website contains extensive documentation on how to create apps, and you can also click the Fork button on an existing app (such as bwa_mem_bamsormadup) to use it as a starting point for developing your own.


A complete challenge submission consists of three items, described below:

  • Detected Anomalies: A text file (.txt extension) that lists all of the detected anomalies. Each detected anomaly should be described using one or more statements. Each distinct anomaly should be separated by a newline (‘\n’) in the text file. Please upload your detected anomalies text file when submitting your entry. An example detected anomalies submission file follows.
  • Methods Description: A short 1-2 page write-up describing the methods you used for the challenge. Please include a description of the data, any feature selection techniques, and the anomaly detection algorithm. Please paste your methods description into the description box when submitting your entry.
  • Anomaly Detection Code: The code developed to detect anomalies. Your code should reside on precisionFDA as an app or be accessible via a public GitHub repository. Please add a link to your code when submitting your entry. Please ensure your code is properly documented so that it can be properly evaluated, which includes being able to run and reproduce results.
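
The detected anomalies file format described above (one anomaly per line in a .txt file) could be produced with a helper like the following sketch. The function name, file name, and the placeholder anomaly statements are all hypothetical; they only illustrate the one-line-per-anomaly layout.

```python
from pathlib import Path


def write_anomalies(anomalies, path):
    """Write one anomaly description per line and return the line count.

    Each description may contain several statements; internal line breaks
    are collapsed so that a newline only ever separates distinct anomalies.
    """
    lines = []
    for description in anomalies:
        flat = " ".join(description.split())
        if flat:  # skip empty descriptions
            lines.append(flat)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    # Placeholder statements, not real findings.
    detected = [
        "Drug A shows an unusually high share of reports for event X.\n"
        "The pattern is concentrated in a single reporting quarter.",
        "Drugs B and C share a reaction pattern not seen for similar drugs.",
    ]
    print(write_anomalies(detected, "detected_anomalies.txt"))
```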