Gaining New Insights by Detecting Adverse Event Anomalies Using FDA Open Data

During the life-cycle of FDA regulated products, FDA collects data from a diversity of sources including voluntary reports from healthcare providers and patients. While cause and effect are not always conclusive or relevant in these reports, valuable insights into the impact of regulated products on public health have been found in individual reports and in evaluation of reported data. This challenge engages data scientists to use evolving data science techniques to identify anomalies that may lead to valuable public health information.

  • Starts
    2020-01-17 19:20:00 UTC
  • Ends
    2020-05-19 03:59:00 UTC


Latest challenge news: the challenge submission period has been extended to May 18th, 2020! For those of you who have already submitted, thank you very much; we are excited about your submissions. For those of you who have not yet submitted, you will have an additional 2 months to work on the challenge.


The Food and Drug Administration (FDA) calls on the public to develop computational algorithms for automatic detection of adverse event anomalies using publicly available data


The U.S. Food and Drug Administration (FDA) regulates a wide range of products that represent approximately 25% of the U.S. economy. These products include drugs, vaccines, medical devices, foods, cosmetics, dietary supplements, tobacco products, and animal drugs and devices. The FDA utilizes post-market surveillance to identify potential public health impacts associated with these products. Passive surveillance is conducted through collection of voluntary reports of adverse reactions associated with products submitted by the public, including patients, patient guardians, health care providers, and product manufacturers. Analysis of adverse event reports enables detection of possible safety issues in order to protect public health.

The FDA utilizes several systems to collect adverse event reports. The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports, and product quality complaints submitted following exposure to drugs and biological products. In total, there are more than 18.5 million reports in FAERS as of September 30, 2019, and more than 1.5 million reports have been submitted annually to FAERS since 2015. The Vaccine Adverse Event Reporting System (VAERS), co-sponsored by the FDA and Centers for Disease Control and Prevention (CDC), collects adverse events associated with vaccines. This challenge will focus on detection of adverse event anomalies from FAERS and VAERS.

FDA makes a significant amount of data, including adverse event data, publicly accessible via the openFDA platform. This resource provides application programming interfaces (APIs) that enable the public to access and explore FDA data, enabling the data to become more accessible, discoverable, and usable.
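As a sketch of how those APIs are typically queried (the endpoint and field names below follow openFDA's drug adverse event API, but verify them against the current openFDA documentation), a query URL that counts the most frequent reaction terms for one drug can be built like this:

```python
from urllib.parse import urlencode

# Base endpoint for the openFDA drug adverse event API.
BASE_URL = "https://api.fda.gov/drug/event.json"

def build_event_count_url(generic_name: str, limit: int = 10) -> str:
    """Build a query URL that counts the most frequently reported
    reaction terms for a given generic drug name."""
    params = {
        "search": f'patient.drug.openfda.generic_name:"{generic_name}"',
        "count": "patient.reaction.reactionmeddrapt.exact",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_event_count_url("ibuprofen")
```

Fetching the URL (e.g. with `urllib.request.urlopen`) returns a JSON document whose "results" array holds term/count pairs.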

Based upon the recent Technology Modernization Action Plan (TMAP), which specified strategies for data, technology, and collaboration, the FDA announced its desire to develop an Agency-wide strategic approach for data. In support of this desire, the Modernizing FDA’s Data Strategy public meeting will focus on Agency level approaches to data quality, data stewardship, data exchange, and data analytics. The public FDA data, available via openFDA and other repositories, enables the public to explore the complexity of Agency data, including data quality and analytics challenges associated with adverse event reports.

FDA regulators use a variety of data mining methods and tools to analyze the volumes of adverse event reports and identify possible safety signals. Disproportionality methods, which identify unexpectedly high statistical associations between products and adverse events, serve as a primary method for identifying safety signals. Change-point analysis identifies changes in longitudinal adverse event patterns for products. Natural Language Processing (NLP) may be used to speed up the process of extracting and structuring key features, including symptoms, diagnosis, treatments, and dates, in the text narrative portion of adverse event reports. Finally, the biological plausibility of adverse event reports can also be assessed by leveraging knowledge of biological pathways (Duggirala et al., 2015).
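As a concrete illustration of the disproportionality idea, here is a minimal sketch of the proportional reporting ratio (PRR), a standard pharmacovigilance screening statistic; all counts below are invented:

```python
def prr(drug_event: int, drug_other: int, all_event: int, all_other: int) -> float:
    """Proportional reporting ratio: the proportion of a drug's reports
    that mention a given event, divided by the same proportion for all
    other drugs in the database.

    drug_event: reports mentioning both the drug and the event
    drug_other: reports mentioning the drug but not the event
    all_event:  reports mentioning the event but not the drug
    all_other:  reports mentioning neither
    """
    drug_rate = drug_event / (drug_event + drug_other)
    background_rate = all_event / (all_event + all_other)
    return drug_rate / background_rate

# Hypothetical counts: 40 of 200 reports for Drug X mention event Z,
# versus 500 of 100,000 reports for all other drugs.
signal = prr(40, 160, 500, 99_500)  # (0.2 / 0.005) = 40.0
```

One widely cited screening heuristic treats a PRR of at least 2, with at least three cases, as a candidate signal worth human review.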

While data mining and hands-on case reviews performed by medical officers have been the cornerstone of adverse event safety signal detection, other approaches, including machine learning and artificial intelligence, may provide novel insights into FDA adverse event data. Of particular interest is the automatic detection of anomalies in FDA adverse event data. Adverse event (AE) anomalies can take many forms, including:

1. Disproportionality anomalies

Example anomalies may be of the form:

  • Drug X is of Drug class Y
  • Other drugs in Drug Y have 30% of their AEs as AE term "Z"
  • Drug X has 70% of its AEs as AE term "Z"
  • Therefore, Drug X may have an anomalous number of AE "Z" reports
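The reasoning above can be sketched as a simple class-baseline comparison; the drug names, counts, and the 2× flagging threshold are all hypothetical:

```python
# Hypothetical AE counts per drug in class Y:
# "Z" = reports with AE term Z, "other" = all other reports.
class_y_counts = {
    "drug_a": {"Z": 30, "other": 70},
    "drug_b": {"Z": 28, "other": 72},
    "drug_x": {"Z": 70, "other": 30},
}

def z_share(counts: dict) -> float:
    """Fraction of a drug's reports that carry AE term Z."""
    return counts["Z"] / (counts["Z"] + counts["other"])

def flag_outliers(class_counts: dict, ratio_threshold: float = 2.0) -> list:
    """Flag drugs whose share of AE term Z is at least `ratio_threshold`
    times the average share seen in the rest of the class."""
    flagged = []
    for drug, counts in class_counts.items():
        others = [z_share(c) for d, c in class_counts.items() if d != drug]
        baseline = sum(others) / len(others)
        if baseline > 0 and z_share(counts) / baseline >= ratio_threshold:
            flagged.append(drug)
    return flagged

flagged = flag_outliers(class_y_counts)  # drug_x: 70% vs. a ~29% baseline
```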

2. Demographic Disproportionality anomalies

Example anomalies may be of the form:

  • Drug X has 90% of its AEs from women (1023 from women, 101 from men)
  • Drug X is indicated for condition Y
  • Other drugs indicated for condition Y have ~1:1 male-to-female ratios for AEs
  • Therefore, Drug X may have an anomalous number of reports for women

3. Variability Between Data Source anomalies

These may include:

  • Product label information being significantly inconsistent with observed AEs
  • Product label information being significantly inconsistent with reported clinical trial AEs
  • Therapeutically equivalent products showing significantly incompatible AE profiles in labelling, clinical trials, or AE reports

4. Logical anomalies

These may include:

  • Numeric variables which are expected to be whole number counts having impossible values (e.g. “0.135 patients”, “-25 adverse events”, etc.)
  • Logical contradiction implied by provided information
  • Two typically incompatible co-occurring AEs appearing in the same report (e.g. hypertension and hypotension, etc.)
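The first and third bullets lend themselves to simple rule-based checks; in this sketch the report field names and the incompatible-pair list are hypothetical, not actual FAERS fields:

```python
# Illustrative only; a real list would be clinically curated.
INCOMPATIBLE_PAIRS = {frozenset({"hypertension", "hypotension"})}

def logical_anomalies(report: dict) -> list:
    """Return a list of logical-anomaly descriptions for one report."""
    problems = []
    # Counts must be non-negative whole numbers.
    n = report.get("patient_count")
    if n is not None and (n < 0 or n != int(n)):
        problems.append(f"impossible count: {n}")
    # Typically incompatible AEs should not co-occur in one report.
    terms = {t.lower() for t in report.get("ae_terms", [])}
    for pair in INCOMPATIBLE_PAIRS:
        if pair <= terms:
            problems.append("incompatible co-occurring AEs: " + ", ".join(sorted(pair)))
    return problems

report = {"patient_count": 0.135, "ae_terms": ["Hypertension", "Hypotension"]}
problems = logical_anomalies(report)  # both rules fire for this report
```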

5. Chronological anomalies

These may include:

  • A drug receives AE reports well before it was legally marketed or in clinical trials (also a logical anomaly)
  • A potent calendar correlation where none would be expected (e.g. all drugs of Drug Class A have received a greatly increased headache-to-nausea ratio specifically in the month of March every year since 1990)
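The first bullet reduces to a date comparison; drug names and dates in this sketch are invented:

```python
from datetime import date

# Hypothetical approval dates, e.g. derived from product label data.
approval_dates = {"drug_x": date(2010, 6, 1)}

def predates_market(drug: str, report_date: date, approvals: dict) -> bool:
    """True if an AE report is dated before the drug's approval date
    (a chronological, and also logical, anomaly)."""
    approved = approvals.get(drug)
    return approved is not None and report_date < approved

early = predates_market("drug_x", date(2008, 3, 15), approval_dates)
on_time = predates_market("drug_x", date(2012, 1, 5), approval_dates)
```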

Automated adverse event anomaly detection would augment the traditional data mining and case review approach by enabling the unsupervised identification of novel potential safety signals. This FDA Open Data Challenge encourages the development of computational algorithms for automatic detection of drug adverse event anomalies. Participants will develop unsupervised algorithms due to the lack of known anomalies.

While reading over the background and challenge details, you may be asking yourself whether your background and skill set are a good fit for this challenge. We welcome any and all individuals with an AI, data science, and/or pharmacovigilance background. In addition, we encourage teams that combine strengths from diverse backgrounds. This is NOT a typical challenge that seeks answers; rather, it is looking for methods to inspire interesting and relevant questions.


Getting on the precisionFDA website

If you do not yet have a contributor account on precisionFDA, file an access request with your complete information and indicate your intent to participate in the challenge. The FDA acts as a steward by providing the precisionFDA service to the community and ensuring proper use of the resources, so your request will be initially pending. In the meantime, you will receive an email with a link to access the precisionFDA website in browse (guest) mode. Once approved, you will receive another email with your contributor account information.

With your contributor account, you can use the features required to participate in the challenge (such as transfer files or run comparisons). Everything you do on precisionFDA is initially private to you (not accessible to the FDA or the rest of the community) until you choose to publicize it. In other words, you can immediately start working on the challenge privately, and when ready, you can officially publish your results as a challenge entry.

Locating and understanding the data

Site | Source | Data Type | Product Type | Link
openFDA | Drug Product Labels | Structured Product Labels | Drug | openFDA Drug Product Labels
openFDA | FAERS | Adverse Events | Drug | openFDA Drug Adverse Events
Global Substance Registration System (GSRS) | Substances in Regulated Products | — | — | openFDA Substance Data
openFDA (alternate link with JSON file containing links to downloadable files) | VAERS | Adverse Events | Vaccines | VAERS Data Download
ClinicalTrials.gov | Databases of Clinical Studies | — | — | Data Download

The formats of the data files include XML, JSON, and CSV. Participants are expected to parse the data files and can use any programming language; for example, Python with the packages pandas, json, and xmltodict can handle all three formats.
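As a dependency-free sketch (using only the Python standard library's json, csv, and xml.etree.ElementTree in place of pandas and xmltodict, with inline sample strings standing in for the downloaded files), parsing each format might look like:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# JSON (e.g. openFDA API responses and download files)
json_text = '{"results": [{"safetyreportid": "12345"}]}'
records = json.loads(json_text)["results"]

# CSV (e.g. VAERS download files)
csv_text = "VAERS_ID,SYMPTOM\n100,Headache\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# XML (e.g. Structured Product Label documents)
xml_text = "<label><activeIngredient>ibuprofen</activeIngredient></label>"
ingredient = ET.fromstring(xml_text).findtext("activeIngredient")
```

In practice you would pass open file handles (or pandas readers) instead of the inline strings.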

The datasets can be linked using fields like “active_ingredient” and “intervention_name”, though there may be other ways to link them. Participants are encouraged to explore the data to identify other ways to link the datasets and maximize the sample size.
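A minimal sketch of this kind of linkage, joining on a normalized ingredient name; the records, field names, and the all-zeros trial ID below are hypothetical placeholders:

```python
def normalize(name: str) -> str:
    """Normalize an ingredient/intervention name for joining."""
    return name.strip().lower()

# Hypothetical records from two of the data sources.
label_records = [{"active_ingredient": "Ibuprofen", "brand": "BrandA"}]
trial_records = [{"intervention_name": "ibuprofen ", "nct_id": "NCT00000000"}]

# Index one side by normalized name, then join the other against it.
labels_by_ingredient = {normalize(r["active_ingredient"]): r for r in label_records}

linked = [
    {**trial, **labels_by_ingredient[normalize(trial["intervention_name"])]}
    for trial in trial_records
    if normalize(trial["intervention_name"]) in labels_by_ingredient
]
```

Normalizing before joining matters because the same substance is spelled inconsistently across sources; fuzzier matching (or UNII codes from GSRS) can recover more links.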

In addition to improving analytics for adverse event data, we would like to use this challenge as a platform for assessing user experience with the datasets. We encourage participants to explore the data and provide feedback regarding accessibility issues and the obstacles they encountered when connecting and analyzing data sources.

Developing and running your algorithm

Participants will develop computational algorithms to identify adverse event anomalies in the adverse event data available on openFDA. To augment this data, drug substance information is available in GSRS, and clinical trials information is available on ClinicalTrials.gov. Algorithms should detect anomalies automatically and without the use of training data labeled with known anomalies.

(Optional) Reconstructing your pipeline on precisionFDA

You have the option of reconstructing your pipeline on precisionFDA and running it there. To do that, you must create an app on precisionFDA that encapsulates the actions performed in your pipeline. To create an app, you can provide Linux executables and an accompanying shell script to be run inside an Ubuntu VM on the cloud. The precisionFDA website contains extensive documentation on how to create apps, and you can also click the Fork button on an existing app (such as bwa_mem_bamsormadup) to use it as a starting point for developing your own.


A complete challenge submission consists of two items, described below.

  • Detected Anomalies: A text file (.txt extension) that lists all of the detected anomalies. Each detected anomaly should be described using one or more statements, and each distinct anomaly should be separated by a newline (‘\n’) in the text file. Please upload your detected anomalies text file when submitting your entry. You will need to submit at least 5 anomalies, and they must be sorted so that your "best" 5 are on top, since only the top 5 anomalies will be evaluated.
  • Methods Description: A short 1-2 page write-up (also a .txt file) describing the methods you used for the challenge. Please include a description of the data, any feature selection techniques, and the anomaly detection algorithm. Please paste your methods description into the description box when submitting your entry.
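As a hypothetical illustration of the detected-anomalies file format (one anomaly per line, best first; all drug names and figures are invented, and a real submission needs at least 5 lines):

```text
Drug A has 70% of its AE reports coded as "rash" while other drugs in its class average 30%.
Drug B has 90% of its AE reports from women, while other drugs indicated for the same condition show roughly 1:1 sex ratios.
Drug C received 14 AE reports dated before its market approval.
```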

UPDATE: Submitting code is no longer a required component of a valid challenge submission. We still welcome code submission as an optional addition to the submission of detected anomalies and a methods description.

  • OPTIONAL: Anomaly Detection Code: The code developed to detect anomalies. Your code should reside on precisionFDA as an app or be accessible via a private GitHub repository. Please add a link to your code when submitting your entry. Please ensure your code is well documented so that it can be properly evaluated, including being able to run it and reproduce your results.

Note to participants about optional submission of code: Your code will NOT be publicly available and will not be shared without your express permission. Any and all code you include as part of your submission will be viewed only by a small challenge team at the FDA in order to evaluate your submission.

You may submit rich text and graphics in PDF format to support your submission (e.g. describe how you assessed the statistical strength of your analysis).


A panel consisting of FDA staff experts will review each submission. Submissions will be judged based on the following criteria:

1. Is the finding out of the ordinary: is the finding actually atypical?

  • It is not out of the ordinary that any NSAID has higher rates of ulcers
  • It is somewhat out of the ordinary if one specific NSAID has low rates of ulcers

2. Is the finding worth noticing: impactful to FDA processes or patient health?

  • It is out of the ordinary that the drug cisplatin contains a platinum atom; however, this is not necessarily worth noticing by itself and does not lead to follow-up questions
  • It might be worth noticing that among platinum-containing drugs, cisplatin has a comparatively low rate of a specific AE; that finding may lead to follow-up questions

3. Non-obviousness: does the finding require several data points/logical steps to explain, or is it simply reporting a readily apparent event/state?

  • It is out of the ordinary if a specific drug suddenly has a high rate of fatal AEs, but that case would also be *obvious* and likely detected by experts utilizing existing tools when reviewing the data
  • On the other hand, if the headache-to-fatality AE ratio, for example, is extremely consistent across all drugs that have a cleavable ester, but one particular cleavable-ester drug has a much lower headache-to-fatality AE ratio, that would not be *obvious*, since it requires several connected elements to uncover

4. Is the finding novel: has this anomaly been discovered previously?

  • The finding may be rare, worth noticing, and non-obvious but also published extensively in 2015. Such a case may still be useful but not novel

5. Strength of significance: does it have sufficient statistical strength to warrant further exploration?

  • The finding may be rare, meaningful, non-obvious, and novel but still have so little statistical support that it is likely a product of noise
  • Lack of statistical strength does not equate to a non-useful finding, but findings with statistical support receive priority

Note: These criteria are not strictly orthogonal and will typically correlate with one another, but they also have distinct meanings that help guide evaluation. It’s not the case that you must receive high scores on all 5 criteria to be a “good” anomaly. Reviewers may decide that especially strong showings in one or two criteria make up for lower scores in the others.


Through this challenge, participants will contribute insights for public health. Selected participants may be recognized, including involvement in future results manuscript(s), and could have the unique opportunity to sit side-by-side with FDA regulators on a panel at the Modernizing FDA’s Data Strategy public meeting.


Challenge Discussion

Please use the Gaining New Insights by Detecting Adverse Event Anomalies Using FDA Open Data Challenge Discussion on the precisionFDA Discussions forum to discuss the challenge.

Frequently Asked Questions

Recently, we had a participant ask a question we feel would be valuable to share with all potential participants of this challenge.

Question: We have started running some ideas through the FAERS data. The next step is to augment the basic reports in FAERS with additional datasets, as indicated on the competition page and as we discussed. We have looked into joining the NDC data, Drug Labels data, and GSRS data. I do not see any obvious one-to-one match for an ID term. Could you point me to the best fields to use as keys for joining the above datasets?

Answer: Unfortunately these things are not 1-to-1 and deep field/key constraints aren’t always enforced across these data sets. Some of this is unavoidable, and some of this is due to how this data was produced / the audience for which it’s compiled. There is a lot to explore here, and a lot to explain, but I’ll give a few basics.

  1. Adverse Events Data (FAERS) [importantly links to (A)NDAs and to the same names as used in GSRS]
  2. Drug product/label data (SPL) [importantly links to (A)NDAs, NDCs, the same names as used in GSRS, and the UNII codes in GSRS]
  3. Drug ingredient data (GSRS) [importantly has LOTS of names, and the UNII codes for ingredients, as well as substance categories/definitions]
  4. Clinical Trial data (ClinicalTrials.gov)

The first 3 are controlled or produced by FDA directly, and made available in some fairly raw formats. ClinicalTrials.gov is not controlled or produced directly by the FDA, but can be obtained in a fairly raw format. Almost all other drug resources we’ve been talking about (many openFDA sources, the NDC directory, etc.) are somehow originally sourced from these 3 raw sources and just simplified/enhanced/harmonized. OpenFDA is really an attempt to take these kinds of opaque raw resources and make them more manageable for computer consumption. So openFDA serves up GSRS/SPL/FAERS data with descriptions of fields, a JSON standard, etc.

There is a basic overview of the fields used for harmonization in the openFDA documentation.

In general, resources in openFDA often have a little extra field called “openfda” which attempts to expose some useful harmonization/linking fields. For example, an openFDA NDC directory JSON response includes such an “openfda” object alongside the product fields.
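As an illustrative sketch of that structure (the field names follow openFDA's documented harmonization fields, but all values here are placeholders, not real data):

```json
{
  "openfda": {
    "manufacturer_name": ["Example Pharma Inc."],
    "rxcui": ["000000"],
    "spl_set_id": ["00000000-0000-0000-0000-000000000000"],
    "unii": ["0000000000"]
  }
}
```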

In my experience, many of the AEs in openFDA are “enriched” by having the mappings to NDCs, NDAs and UNIIs already done. This isn’t as true with some of the other resources on openFDA, so mileage may vary. Here’s my simple list of hints/tricks:

  1. If you’ve got an ingredient, look for a UNII. If you’ve only got a name, you can still map to the GSRS and get a UNII in most cases. OpenFDA has probably done that for you a lot of the time, but sometimes it loses the meaning/granularity.
  2. If you’ve got a product reference, try to get an NDC or set id, with which you can map to the SPL records directly. You can map more broadly using NDA/ANDA as well. You can map very broadly based on active ingredient/UNII.

In an ideal world every AE report would come with the NDC that it was mapped to and things would get really interesting. Then we could ask really neat questions like “is there an enrichment for AEs based on the inactive ingredients found from SPL submissions”. Right now that’s a little hard to do. It’s partially hard because AE submissions don’t typically tell the FDA what the NDC was to begin with (though they can), and it’s partly hard because that kind of information isn’t completely preserved through the data pipeline.


Challenge Team

  • Office of the Principal Deputy Commissioner: Joseph Franklin, Allison Hoffman
  • Center for Drugs Evaluation and Research (CDER)
  • Center for Biologics Evaluation and Research (CBER)
  • PrecisionFDA: Elaine Johanson
  • OpenFDA: Lonnie Smith
  • GSRS: Tyler Peryea
  • DNAnexus: Omar Serang, John Didion, Sam Westreich
  • Conceptant: Josh Phipps
  • Booz Allen: Holly Stephens, Sean Watford, Zeke Maier


Kass-Hout, T. A., Xu, Z., Mohebbi, M., Nelsen, H., Baker, A., Levine, J., ... & Bright, R. A. (2015). OpenFDA: an innovative platform providing access to a wealth of FDA’s publicly available data. Journal of the American Medical Informatics Association, 23(3), 596-600.

Duggirala, H. J., Tonning, J. M., Smith, E., Bright, R. A., Baker, J. D., Ball, R., ... & Boyer, M. (2015). Use of data mining at the Food and Drug Administration. Journal of the American Medical Informatics Association, 23(2), 428-434.