Gaining New Insights by Detecting Adverse Event Anomalies Using FDA Open Data

During the life-cycle of FDA-regulated products, FDA collects data from diverse sources, including voluntary reports from healthcare providers and patients. While cause and effect are not always conclusive or relevant in these reports, valuable insights into the impact of regulated products on public health have been found both in individual reports and in evaluation of the reported data as a whole. This challenge engages data scientists to use evolving data science techniques to identify anomalies that may lead to valuable public health information.

  • Starts
    2020-01-17 19:20:00 UTC
  • Ends
    2020-02-29 04:59:00 UTC


The Food and Drug Administration (FDA) calls on the public to develop computational algorithms for automatic detection of adverse event anomalies using publicly available data.


The U.S. Food and Drug Administration (FDA) regulates a wide range of products that represent approximately 25% of the U.S. economy. These products include drugs, vaccines, medical devices, foods, cosmetics, dietary supplements, tobacco products, and animal drugs and devices. The FDA utilizes post-market surveillance to identify potential public health impacts associated with these products. Passive surveillance is conducted through collection of voluntary reports of adverse reactions associated with products submitted by the public, including patients, patient guardians, health care providers, and product manufacturers. Analysis of adverse event reports enables detection of possible safety issues in order to protect public health.

The FDA utilizes several systems to collect adverse event reports. The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports, and product quality complaints submitted following exposure to drugs and biological products. In total, there are more than 18.5 million reports in FAERS as of September 30, 2019, and more than 1.5 million reports have been submitted annually to FAERS since 2015. The Vaccine Adverse Event Reporting System (VAERS), co-sponsored by the FDA and Centers for Disease Control and Prevention (CDC), collects adverse events associated with vaccines. This challenge will focus on detection of adverse event anomalies from FAERS and VAERS.

FDA makes a significant amount of data, including adverse event data, publicly accessible via the openFDA platform. This resource provides application programming interfaces (APIs) that enable the public to access and explore FDA data, enabling the data to become more accessible, discoverable, and usable.
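As a rough illustration of what an openFDA API request can look like, the sketch below only builds a query URL and does not contact the network. The endpoint (`https://api.fda.gov/drug/event.json`) and the field names in `search` and `count` reflect the openFDA drug adverse event documentation as understood here and should be checked against the current documentation; the helper name `build_reaction_count_url` is made up for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint for drug adverse event reports; verify against openFDA docs.
OPENFDA_DRUG_EVENT = "https://api.fda.gov/drug/event.json"


def build_reaction_count_url(generic_name, limit=10):
    """Build a URL that counts the most frequent reaction terms for one drug.

    The field names below are assumptions taken from the openFDA
    documentation, not from the challenge text.
    """
    params = {
        "search": f'patient.drug.openfda.generic_name:"{generic_name}"',
        "count": "patient.reaction.reactionmeddrapt.exact",
        "limit": limit,
    }
    return f"{OPENFDA_DRUG_EVENT}?{urlencode(params)}"


url = build_reaction_count_url("ibuprofen")
print(url)
# Fetching is left to the reader, for example with urllib.request.urlopen(url)
# followed by json.load() on the response.
```

Building the URL separately from fetching it keeps the query easy to inspect and to log alongside any anomaly it later supports.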

Based upon the recent Technology Modernization Action Plan (TMAP), which specified strategies for data, technology, and collaboration, the FDA announced its desire to develop an Agency-wide strategic approach for data. In support of this desire, the Modernizing FDA’s Data Strategy public meeting will focus on Agency level approaches to data quality, data stewardship, data exchange, and data analytics. The public FDA data, available via openFDA and other repositories, enables the public to explore the complexity of Agency data, including data quality and analytics challenges associated with adverse event reports.

FDA regulators use a variety of data mining methods and tools to analyze the large volume of adverse event reports and identify possible safety signals. Disproportionality methods, which identify unexpectedly high statistical associations between products and adverse events, serve as a primary method for identifying safety signals. Change-point analysis identifies changes in longitudinal adverse event patterns for products. Natural Language Processing (NLP) may be used to speed up the process of extracting and structuring key features, including symptoms, diagnoses, treatments, and dates, in the text narrative portion of adverse event reports. Finally, the biological plausibility of adverse event reports can also be assessed by leveraging knowledge of biological pathways (Duggirala et al., 2015).
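To make the idea of disproportionality concrete, here is a minimal sketch of one widely used measure, the proportional reporting ratio (PRR), with an approximate 95% confidence interval computed on the log scale. The text above does not say which measure FDA applies, so this is an illustration only, and the counts in the usage lines are invented.

```python
import math


def proportional_reporting_ratio(a, b, c, d):
    """Return (prr, lower95, upper95) for a 2x2 report table.

    a: reports with the drug and the event
    b: reports with the drug and any other event
    c: reports with other drugs and the event
    d: reports with other drugs and any other event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this sketch")
    prr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(PRR), the usual large-sample approximation.
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - 1.96 * se)
    upper = math.exp(math.log(prr) + 1.96 * se)
    return prr, lower, upper


# Invented counts, for illustration only.
prr, lower, upper = proportional_reporting_ratio(a=40, b=160, c=200, d=9600)
print(f"PRR={prr:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

A lower bound above 1 is a common screening rule of thumb, but any threshold is a choice the analyst has to justify.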

While data mining and hands-on case reviews performed by medical officers have been the cornerstone of adverse event safety signal detection, other approaches, including machine learning and artificial intelligence, may provide novel insights into FDA adverse event data. Of particular interest is the automatic detection of anomalies in FDA adverse event data. Adverse event (AE) anomalies can take many forms, including:

1. Disproportionality anomalies

Example anomalies may be of the form:

  • Drug X is of Drug class Y
  • Other drugs in Drug class Y have 30% of their AEs as AE term "Z"
  • Drug X has 70% of its AEs as AE term "Z"
  • Therefore, Drug X may have an anomalous number of AE "Z" reports
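The comparison in the example above (70% for Drug X against 30% for the rest of its class) can be screened with a two-proportion z-test. The sketch below is one possible check, not a prescribed method; the report counts are invented so that they reproduce the 70% and 30% shares.

```python
import math


def two_proportion_z(hits_x, total_x, hits_class, total_class):
    """Two-sided z-test comparing Drug X's AE share with the rest of its class."""
    p_x = hits_x / total_x
    p_class = hits_class / total_class
    pooled = (hits_x + hits_class) / (total_x + total_class)
    if pooled in (0.0, 1.0):
        raise ValueError("pooled proportion is degenerate; test is undefined")
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_x + 1 / total_class))
    z = (p_x - p_class) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Invented counts: 140 of 200 reports (70%) vs 900 of 3000 reports (30%).
z, p_value = two_proportion_z(140, 200, 900, 3000)
print(f"z={z:.2f}, p={p_value:.3g}")
```

With many drug and AE term pairs screened at once, some correction for multiple comparisons would normally be needed before treating a small p-value as meaningful.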

2. Demographic Disproportionality anomalies

Example anomalies may be of the form:

  • Drug X has 90% of its AEs from women (1023 from women, 101 from men)
  • Drug X is indicated for condition Y
  • Other drugs indicated for condition Y have ~1:1 male-to-female ratios for AEs
  • Therefore, Drug X may have an anomalous number of reports for women
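Using the counts from the example above (1023 reports from women, 101 from men), a quick way to quantify the imbalance is a normal-approximation test of the observed share against the share expected from comparable drugs. The expected share of 0.5 encodes the "~1:1" ratio in the example; the function name is hypothetical.

```python
import math


def share_z_score(count_group, count_other, expected_share=0.5):
    """Return (observed_share, z) for one group's share of reports."""
    n = count_group + count_other
    observed = count_group / n
    # Standard error under the expected share (binomial, normal approximation).
    se = math.sqrt(expected_share * (1 - expected_share) / n)
    return observed, (observed - expected_share) / se


observed, z = share_z_score(1023, 101)
print(f"share from women={observed:.3f}, z={z:.1f}")
```

A large z-score here only says the reporting ratio is unusual; differences in who is prescribed the drug or who tends to report could explain it, which is why the comparison to drugs with the same indication matters.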

3. Variability Between Data Source anomalies:

These may include:

  • Product label information being significantly inconsistent with observed AEs
  • Product label information being significantly inconsistent with reported clinical trial AEs
  • Therapeutically equivalent products showing significantly incompatible AE profiles in labelling, clinical trials, or AE reports
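One simple way to look for the first kind of inconsistency is to list AE terms that are frequent in reports but absent from the product label. The sketch below assumes the label terms and observed term counts have already been extracted and mapped to a shared vocabulary, which is the hard part in practice; the terms, counts, and threshold are invented.

```python
def unlabeled_frequent_events(label_terms, observed_counts, min_share=0.05):
    """Return (term, share) pairs for observed AE terms missing from the label.

    label_terms: iterable of AE terms listed on the product label
    observed_counts: dict mapping AE term -> number of mentions in reports
    min_share: minimum share of all mentions for a term to be flagged
    """
    total = sum(observed_counts.values())
    if total == 0:
        return []
    labeled = {term.strip().lower() for term in label_terms}
    flagged = [
        (term, count / total)
        for term, count in observed_counts.items()
        if term.strip().lower() not in labeled and count / total >= min_share
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Invented example data.
label = ["Nausea", "Headache"]
observed = {"nausea": 50, "headache": 30, "rash": 15, "dizziness": 5}
flagged = unlabeled_frequent_events(label, observed, min_share=0.10)
print(flagged)
```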

4. Logical anomalies:

These may include:

  • Numeric variables which are expected to be whole number counts having impossible values (e.g. “0.135 patients”, “-25 adverse events”, etc.)
  • Logical contradiction implied by provided information
  • Two typically incompatible co-occurring AEs appearing in the same report (e.g. hypertension and hypotension, etc.)
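Checks of this kind are rule-based and easy to automate. The sketch below validates count fields and looks for incompatible co-occurring reactions; the field names (`patient_count`, `event_count`, `reactions`) and the single incompatible pair are hypothetical and would need to be replaced with the real schema and a curated list.

```python
# Illustrative only: a real list of incompatible pairs would need clinical review.
INCOMPATIBLE_PAIRS = [frozenset({"hypertension", "hypotension"})]


def is_valid_count(value):
    """True for non-negative whole numbers (booleans are rejected)."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value >= 0
    if isinstance(value, float):
        return value.is_integer() and value >= 0
    return False


def logical_issues(report):
    """Return a list of human-readable problems found in one report dict."""
    issues = []
    for field in ("patient_count", "event_count"):
        if field in report and not is_valid_count(report[field]):
            issues.append(f"{field} is not a non-negative whole number: {report[field]!r}")
    reactions = {r.lower() for r in report.get("reactions", [])}
    for pair in INCOMPATIBLE_PAIRS:
        if pair <= reactions:
            issues.append("incompatible reactions co-occur: " + ", ".join(sorted(pair)))
    return issues


bad = {"patient_count": 0.135, "event_count": -25,
       "reactions": ["Hypertension", "Hypotension"]}
for issue in logical_issues(bad):
    print(issue)
```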

5. Chronological anomalies:

These may include:

  • A drug receives AE reports well before it was legally marketed or in clinical trials (also a logical anomaly)
  • A strong calendar correlation where none would be expected (e.g. all drugs of Drug Class A have shown a greatly increased headache-to-nausea ratio specifically in the month of March every year since 1990)
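The first chronological check can be sketched as a date comparison between each report and the earliest date on which exposure to the drug was possible (start of clinical trials or of legal marketing, whichever is earlier). The record layout, identifiers, and dates below are invented for illustration.

```python
from datetime import date


def reports_before_exposure(reports, earliest_exposure):
    """Flag reports received before a drug could plausibly have been taken.

    reports: iterable of (report_id, drug, received_date)
    earliest_exposure: dict mapping drug -> earliest trial or marketing date
    Returns a list of (report_id, drug, days_early).
    """
    flagged = []
    for report_id, drug, received in reports:
        start = earliest_exposure.get(drug)
        # Drugs with no known start date are skipped rather than guessed at.
        if start is not None and received < start:
            flagged.append((report_id, drug, (start - received).days))
    return flagged


# Invented example data.
earliest = {"drug_x": date(2015, 6, 1)}
sample_reports = [
    ("r1", "drug_x", date(2012, 3, 14)),
    ("r2", "drug_x", date(2016, 1, 5)),
    ("r3", "drug_y", date(2010, 1, 1)),
]
early = reports_before_exposure(sample_reports, earliest)
print(early)
```

Requiring a minimum number of days early would help separate the "well before" cases from small date-entry discrepancies.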

Automated adverse event anomaly detection would augment the traditional data mining and case review approach by enabling the unsupervised identification of novel potential safety signals. This FDA Open Data Challenge encourages the development of computational algorithms for automatic detection of drug adverse event anomalies. Participants will develop unsupervised algorithms due to the lack of known anomalies.

While reading over the background and challenge details you may be asking yourself if your background and skill-sets are a good fit for this challenge. We welcome any and all individuals with an AI, data science and/or pharmacovigilance background. In addition, we encourage teams that include diverse backgrounds that combine strengths. This is NOT a typical challenge that seeks answers, but rather is looking for methods to inspire interesting and relevant questions.


Getting on the precisionFDA website

If you do not yet have a contributor account on precisionFDA, file an access request with your complete information and indicate your intent to participate in the challenge. The FDA acts as a steward by providing the precisionFDA service to the community and ensuring proper use of the resources, so your request will be initially pending. In the meantime, you will receive an email with a link to access the precisionFDA website in browse (guest) mode. Once approved, you will receive another email with your contributor account information.

With your contributor account, you can use the features required to participate in the challenge (such as transferring files or running comparisons). Everything you do on precisionFDA is initially private to you (not accessible to the FDA or the rest of the community) until you choose to publicize it. In other words, you can immediately start working on the challenge privately, and when ready, you can officially publish your results as a challenge entry.

Locating and Understanding the Data

Site | Source | Data Type | Product Type | Link
openFDA | Drug Product Labels | Structured Product Labels | Drug | openFDA Drug Product Labels
openFDA | FAERS | Adverse Events | Drug | openFDA Drug Adverse Events
VAERS | Adverse Events | Vaccines | VAERS Data Download
Global Substance Registration System (GSRS) | Substances in Regulated Products | openFDA Substance Data
Databases of Clinical Studies | Data Download

The formats of the data files include XML, JSON, and CSV. Participants are expected to parse the data files and can use any programming language. Below is an example of how to parse each file format using Python with the packages pandas, json, and xmltodict.
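The sketch below parses one small in-memory sample per format. To stay runnable without third-party installs it uses the standard-library `csv` and `xml.etree.ElementTree` modules in the places where `pandas.read_csv` and `xmltodict.parse` would otherwise be used; the sample field names and values are invented and do not reflect the real file schemas.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Invented sample data, one snippet per file format.
CSV_TEXT = "report_id,drug,reaction\n1,drug_x,headache\n2,drug_y,nausea\n"
JSON_TEXT = '{"results": [{"report_id": "1", "reactions": ["headache"]}]}'
XML_TEXT = (
    "<reports><report id='1'><drug>drug_x</drug>"
    "<reaction>headache</reaction></report></reports>"
)

# CSV: one dict per row (pandas.read_csv(io.StringIO(CSV_TEXT)) is the
# pandas equivalent and returns a DataFrame instead).
csv_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# JSON: nested dicts and lists.
json_doc = json.loads(JSON_TEXT)

# XML: walk the element tree (xmltodict.parse(XML_TEXT) would return
# nested dicts instead).
xml_root = ET.fromstring(XML_TEXT)
xml_rows = [
    {
        "report_id": report.get("id"),
        "drug": report.findtext("drug"),
        "reaction": report.findtext("reaction"),
    }
    for report in xml_root.iter("report")
]

print(csv_rows[1]["reaction"], json_doc["results"][0]["reactions"], xml_rows)
```

For real files, replace the in-memory strings with open file handles; large downloads may need streaming or chunked reads rather than loading everything at once.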

Below is a diagram showing an example of how the datasets can be linked using fields like “active_ingredient” and “intervention_name”. There may be other ways to link the datasets besides the provided example. Participants are encouraged to explore the data to identify other ways to link the datasets to maximize the sample size.
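A minimal sketch of such a link, assuming each label record carries a single string in “active_ingredient” and each clinical study record a single string in “intervention_name”. Real records may hold lists or free text that need more cleaning, and the sample records and other keys here are invented.

```python
def normalize(name):
    """Lower-case and collapse whitespace so trivially different names match."""
    return " ".join(name.lower().split())


def link_by_ingredient(labels, trials):
    """Pair label records with trial records that name the same substance.

    labels: list of dicts with an "active_ingredient" string
    trials: list of dicts with an "intervention_name" string
    Returns a list of (label, trial) tuples.
    """
    by_name = {}
    for trial in trials:
        by_name.setdefault(normalize(trial["intervention_name"]), []).append(trial)
    return [
        (label, trial)
        for label in labels
        for trial in by_name.get(normalize(label["active_ingredient"]), [])
    ]


# Invented example records.
labels = [
    {"product": "Label A", "active_ingredient": "Ibuprofen"},
    {"product": "Label B", "active_ingredient": "Cisplatin"},
]
trials = [
    {"trial": "T1", "intervention_name": "ibuprofen "},
    {"trial": "T2", "intervention_name": "IBUPROFEN"},
    {"trial": "T3", "intervention_name": "placebo"},
]
links = link_by_ingredient(labels, trials)
print([(label["product"], trial["trial"]) for label, trial in links])
```

Exact matching after normalization is deliberately strict; mapping names through a substance registry such as GSRS is one way to recover matches that differ by synonym or salt form.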

In addition to improving analytics for adverse event data, we would like to use this challenge as a platform for assessing user experience with the datasets. We encourage participants to explore the data and provide feedback regarding accessibility issues and obstacles they encountered when connecting and analyzing data sources.

Developing and running your algorithm

Participants will develop computational algorithms to identify adverse event anomalies in the open FDA adverse event data available on openFDA. To augment this data, drug substance information is available in GSRS, and clinical trials information is available in the clinical studies database listed under Locating and Understanding the Data. Algorithms should detect anomalies automatically, without training data labeled with known anomalies.
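As one example of a label-free approach, the sketch below flags drugs whose share of a given AE term is far from the rest of a group, using the median and the median absolute deviation (MAD) instead of any known-anomaly labels. This is a simple baseline chosen for illustration, not a recommended or required method; the 3.5 threshold is a common convention and the shares are invented.

```python
import statistics


def robust_outliers(values, threshold=3.5):
    """Return {key: modified z-score} for entries far from the group median.

    values: dict mapping an item (e.g. a drug) to a numeric feature
    threshold: absolute modified z-score above which an item is flagged
    """
    median = statistics.median(values.values())
    mad = statistics.median(abs(v - median) for v in values.values())
    if mad == 0:
        return {}  # no spread to compare against; nothing can be flagged
    # 0.6745 scales the MAD so scores are comparable to ordinary z-scores.
    scores = {k: 0.6745 * (v - median) / mad for k, v in values.items()}
    return {k: s for k, s in scores.items() if abs(s) > threshold}


# Invented shares of one AE term among each drug's reports.
shares = {"drug_a": 0.28, "drug_b": 0.31, "drug_c": 0.30,
          "drug_d": 0.33, "drug_x": 0.70}
outliers = robust_outliers(shares)
print(outliers)
```

Median-based statistics are used because the anomalies themselves would distort a mean and standard deviation computed on the same data.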

(Optional) Reconstructing your pipeline on precisionFDA

You have the option of reconstructing your pipeline on precisionFDA and running it there. To do that, you must create an app on precisionFDA that encapsulates the actions performed in your pipeline. To create an app, you can provide Linux executables and an accompanying shell script to be run inside an Ubuntu VM on the cloud. The precisionFDA website contains extensive documentation on how to create apps, and you can also click the Fork button on an existing app (such as bwa_mem_bamsormadup) to use it as a starting point for developing your own.


A complete challenge submission consists of two items, described below.

  • Detected Anomalies: A text file (.txt extension) that lists all of the detected anomalies. Each detected anomaly should be described using one or more statements. Each distinct anomaly should be separated by a newline (‘\n’) in the text file. Please upload your detected anomalies text file when submitting your entry. An example of a detected anomalies submission file follows.
  • Methods Description: A short 1-2 page write-up describing the methods you used for the challenge. Please include a description of the data, any feature selection techniques, and the anomaly detection algorithm. Please paste your methods description into the description box when submitting your entry.
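A small helper can enforce the file format for the detected anomalies: one anomaly per line, with the statements describing it joined on that line. The function name, output location, and the two placeholder anomalies (modeled on the generic "Drug X" examples earlier on this page) are illustrative and are not real findings.

```python
import tempfile
from pathlib import Path


def write_anomalies(anomalies, path):
    """Write one anomaly per line; each anomaly is a list of statements."""
    lines = []
    for statements in anomalies:
        # Collapse stray whitespace and newlines so each anomaly stays on one line.
        lines.append(" ".join(" ".join(s.split()) for s in statements))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Placeholder anomalies, not real findings.
out = Path(tempfile.gettempdir()) / "detected_anomalies.txt"
count = write_anomalies(
    [
        ["Drug X is of Drug class Y.",
         "Drug X has 70% of its AEs as AE term Z versus 30% for the class."],
        ["Drug X has 90% of its AEs from women.",
         "Other drugs for condition Y have roughly 1:1 ratios."],
    ],
    out,
)
print(count, out)
```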

UPDATE: Submitting code is no longer a required component of a valid challenge submission. We still welcome code submission as an optional addition to the submission of detected anomalies and a methods description.

  • OPTIONAL: Anomaly Detection Code: The code developed to detect anomalies. Your code should reside on precisionFDA as an app or be accessible via a private GitHub repository. Please add a link to your code when submitting your entry. Please ensure your code is properly documented so that it can be properly evaluated, which includes being able to run and reproduce results.

Note to participants about optional submission of code: Your code will NOT be publicly available and will not be shared without your express permission. Any and all code you include as part of your submission will be viewed only by a small challenge team at the FDA in order to evaluate your submission.


A panel consisting of FDA staff experts will review each submission. Submissions will be judged based on the following criteria:

1. Is the finding out of the ordinary: is the finding actually atypical?

  • It is not out of the ordinary that any NSAID has higher rates of ulcers
  • It is somewhat out of the ordinary if one specific NSAID has low rates of ulcers

2. Is the finding worth noticing: impactful to FDA processes or patient health?

  • It is out of the ordinary that the drug cisplatin contains a platinum atom; however, this is not necessarily worth noticing by itself and does not lead to follow-up questions
  • It might be worth noticing that among platinum-containing drugs, cisplatin has a comparatively low rate of a specific AE. That finding may lead to follow-up questions

3. Non-obviousness: does the finding require several data points/logical steps to explain, or is it simply reporting a readily apparent event/state?

  • It is out of the ordinary if a specific drug suddenly has a high rate of fatal AEs, but that case would also be *obvious* and likely detected by experts utilizing existing tools when reviewing the data
  • On the other hand, if the headache-to-fatality AE ratio, for example, is extremely consistent across all drugs that have a cleavable ester, but one particular cleavable ester drug has a much lower ratio, the finding would not be *obvious*, since it requires several connected elements to uncover

4. Is the finding novel: has this anomaly been discovered previously?

  • The finding may be rare, worth-noticing, and non-obvious but also published extensively in 2015. Such a case may still be useful but not novel

5. Strength of significance: does it have sufficient statistical strength to warrant further exploration?

  • The finding may be rare, meaningful, non-obvious, and novel but still has so little statistical power it is likely a product of noise
  • Lack of statistical strength does not equate to a non-useful finding but findings with statistical support receive priority

Note: These criteria are not strictly orthogonal and would typically correlate with one another, but they also have distinct meanings that help guide evaluation. An anomaly does not need high scores on all 5 criteria to be a “good” anomaly. Reviewers may decide that especially strong showings in one or two criteria make up for lower scores in other criteria.


Through this challenge, participants will contribute insights for public health. Selected participants may be recognized, including involvement in future results manuscript(s), and could have the unique opportunity to sit side-by-side with FDA regulators on a panel at the Modernizing FDA’s Data Strategy public meeting.


Challenge Discussion

Please use the Gaining New Insights by Detecting Adverse Event Anomalies Using FDA Open Data Challenge Discussion on the precisionFDA Discussions forum to discuss the challenge.


Challenge Team

  • Office of the Principal Deputy Commissioner: Joseph Franklin, Allison Hoffman
  • Center for Drug Evaluation and Research (CDER)
  • Center for Biologics Evaluation and Research (CBER)
  • PrecisionFDA: Elaine Johanson
  • OpenFDA: Lonnie Smith
  • GSRS: Tyler Peryea
  • DNAnexus: Omar Serang, John Didion, Sam Westreich
  • Conceptant: Josh Phipps
  • Booz Allen: Holly Stephens, Sean Watford, Zeke Maier


Kass-Hout, T. A., Xu, Z., Mohebbi, M., Nelsen, H., Baker, A., Levine, J., ... & Bright, R. A. (2015). OpenFDA: an innovative platform providing access to a wealth of FDA’s publicly available data. Journal of the American Medical Informatics Association, 23(3), 596-600.

Duggirala, H. J., Tonning, J. M., Smith, E., Bright, R. A., Baker, J. D., Ball, R., ... & Boyer, M. (2015). Use of data mining at the Food and Drug Administration. Journal of the American Medical Informatics Association, 23(2), 428-434.