PrecisionFDA
VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge


The Veterans Health Administration (VHA) Innovation Ecosystem and Food and Drug Administration (FDA) call on the scientific and analytics community to develop and evaluate computational models to predict COVID-19 related health outcomes in Veterans.


  • Starts
    2020-06-02 13:53:44 UTC
  • Ends
    2020-07-03 13:53:53 UTC

News: VHA COVID-19 Challenge Participants,

Thank you so much for your interest and participation in our challenge! We are working hard to evaluate submissions and identify top performers. As with all our challenges, we are drafting a results site with detailed results, summary statistics, an announcement of top performers, and conclusions. We plan to post the results site in mid-August. Please do not hesitate to reach out with any questions you may have. Thanks again, the precisionFDA Challenge Team


Challenge Time Period

June 2, 2020 – July 3, 2020

AT A GLANCE

The novel coronavirus disease 2019 (COVID-19) is a respiratory disease caused by a new type of coronavirus, known as “severe acute respiratory syndrome coronavirus 2,” or SARS-CoV-2. On March 11, 2020, the World Health Organization (WHO) declared the outbreak a global pandemic. As of Monday, June 1, the Johns Hopkins University COVID-19 dashboard reports over 6.21 million total confirmed cases worldwide, including over 1.79 million cases in the United States. Although most people have mild to moderate symptoms, the disease can cause severe medical complications leading to death in some people.

The Centers for Disease Control and Prevention (CDC) has identified several groups at elevated risk for severe illness, including people 65 years and older, individuals living in nursing homes or long-term care facilities, immunocompromised individuals, and those with serious underlying medical conditions, such as severe obesity, diabetes, chronic lung disease or moderate to severe asthma, and chronic kidney or liver disease. Identifying risk and protective factors for severe COVID-19 illness is crucial to better protect, triage, and treat at-risk individuals.

In this regard, the U.S. Department of Veterans Affairs (VA) has implemented several measures in response to the pandemic to protect and care for Veterans, including developing a COVID-19 response plan, administering over 165,000 COVID-19 tests, implementing outreach, screening, and protective procedures to prevent transmission, and supporting non-VA health care facilities. These steps are crucial to protect the Veteran population that has a higher prevalence of several of the known risk factors for severe COVID-19 illness, such as advanced age, heart disease, and diabetes.

To better understand the risk and protective factors in the Veteran population, the VHA Innovation Ecosystem and precisionFDA are calling upon the public to develop machine learning and artificial intelligence models to predict COVID-19 related health outcomes, including COVID-19 status, length of hospitalization, and mortality, using synthetic Veteran health records. Through this challenge, additional risk and protective factors will be investigated, including therapeutics prescribed for preexisting comorbidities, and treatment interactions.

CHALLENGE DETAILS

Getting on precisionFDA

If you do not yet have a contributor account on precisionFDA, file an access request with your complete information and indicate your intent to participate in the challenge. The FDA acts as a steward by providing the precisionFDA service to the community and ensuring proper use of the resources, so your request will initially be pending. In the meantime, you will receive an email with a link to access the precisionFDA website in browse (guest) mode. It is NOT necessary to email precisionFDA at this point, because your request is already under review. Once approved (approval typically takes 1-2 business days), you will receive another email with your contributor account information.

With your contributor account, you can use the features required to participate in the challenge (such as transferring files or running comparisons). All work performed on precisionFDA is private (not accessible to the FDA or the rest of the community) until you choose to publicize it. Once published, your work will be available for review by the FDA and the precisionFDA community.

Locating and Understanding the Files

Training and test data are provided as zip archives described in the table below.

Archive | Type | Archive Description
Training Synthetic Health Records | Training data | Zip archive containing synthetic training patient records in 16 total comma-separated values (CSV) files. Full longitudinal health information, including COVID-19 status and related health outcomes, is included for each patient in this training data set.
Test Synthetic Health Records | Test data | Zip archive containing synthetic test patient records in 16 total comma-separated values (CSV) files. Pre-2020 health information is included for each patient in this test data set.

The training and test data archives contain the 16 files described in the table below. To protect patient identity, synthetic health record data were generated using the Synthea synthetic patient generator. A total of 147,451 synthetic patients were generated, and these patients were split 80% into the training data set and 20% into the test data set. Data dictionaries for each of the 16 files are provided on the Synthea GitHub Wiki and are linked below.

File | Description
allergies.csv | Patient allergy data.
careplans.csv | Patient care plan data, including goals.
conditions.csv | Patient conditions or diagnoses.
devices.csv | Patient-affixed permanent and semi-permanent devices.
encounters.csv | Patient encounter data.
imaging_studies.csv | Patient imaging metadata.
immunizations.csv | Patient immunization data.
medications.csv | Patient medication data.
observations.csv | Patient observations including vital signs and lab reports.
organizations.csv | Provider organizations including hospitals.
patients.csv | Patient demographic data.
payer_transitions.csv | Payer transition data (i.e. changes in health insurance).
payers.csv | Payer organization data.
procedures.csv | Patient procedure data including surgeries.
providers.csv | Clinicians that provide patient care.
supplies.csv | Supplies used in the provision of care.

Understanding the Patient Records

Example Python code for navigating the data set is provided below.

Health Outcomes

  • COVID-19 status is defined as the patient’s test result from the SARS-CoV-2 test. A negative SARS-CoV-2 test is identified by an observation with “CODE” 94531-1 and “VALUE” of “Not detected (qualifier value)”. The “VALUE” for a positive result is “Detected (qualifier value)”.
  • Alive or deceased status is defined as the “DEATHDATE” for a patient. A patient without a “DEATHDATE” is considered alive.
  • Hospitalizations from COVID-19 are identified from encounters with “REASONCODE” 840539006 and “CODE” 1505002. The number of days hospitalized can be obtained from the “START” and “STOP” dates of the encounter.
  • ICU admissions are identified from encounters with “CODE” 305351004. A COVID-19 ICU admission can be assumed if the patient also has been diagnosed with COVID-19.
  • Controlled ventilation of a patient is identified from procedures with “CODE” 26763009. Controlled ventilation from COVID-19 complications is assumed if the patient has also been diagnosed with COVID-19.
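
As an illustration of the outcome definitions above, the following is a minimal sketch that derives COVID-19 status, days hospitalized, and alive or deceased status using only the Python standard library. The codes and the "CODE", "VALUE", "REASONCODE", "START", "STOP", and "DEATHDATE" columns come from the definitions above; the "PATIENT" and "Id" column names, the timestamp layout, and the rule that any positive test makes a patient positive are assumptions that should be checked against the Synthea data dictionaries. The function names are hypothetical, and the in-memory rows stand in for the real CSV files.

```python
import csv
import io
from datetime import datetime

# Assumed timestamp layout for encounter START/STOP values.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def covid_status(observation_rows):
    """Map patient id -> True (detected) or False (not detected)."""
    status = {}
    for row in observation_rows:
        if row["CODE"] != "94531-1":
            continue  # not a SARS-CoV-2 test result
        positive = row["VALUE"] == "Detected (qualifier value)"
        # Assumption: one positive test is enough to call the patient positive.
        status[row["PATIENT"]] = status.get(row["PATIENT"], False) or positive
    return status


def days_hospitalized(encounter_rows):
    """Map patient id -> total fractional days in COVID-19 hospitalizations."""
    days = {}
    for row in encounter_rows:
        if row["REASONCODE"] != "840539006" or row["CODE"] != "1505002":
            continue  # not a COVID-19 hospitalization
        start = datetime.strptime(row["START"], TIMESTAMP_FORMAT)
        stop = datetime.strptime(row["STOP"], TIMESTAMP_FORMAT)
        elapsed = (stop - start).total_seconds() / 86400.0
        days[row["PATIENT"]] = days.get(row["PATIENT"], 0.0) + elapsed
    return days


def alive_status(patient_rows):
    """Map patient id -> True when no DEATHDATE is recorded."""
    return {row["Id"]: row["DEATHDATE"] == "" for row in patient_rows}


# Tiny in-memory stand-ins for observations.csv, encounters.csv, patients.csv.
observations = csv.DictReader(io.StringIO(
    "PATIENT,CODE,VALUE\n"
    "p1,94531-1,Detected (qualifier value)\n"
    "p2,94531-1,Not detected (qualifier value)\n"
))
encounters = csv.DictReader(io.StringIO(
    "PATIENT,START,STOP,CODE,REASONCODE\n"
    "p1,2020-03-01T00:00:00Z,2020-03-03T12:00:00Z,1505002,840539006\n"
))
patients = csv.DictReader(io.StringIO(
    "Id,DEATHDATE\n"
    "p1,2020-03-20\n"
    "p2,\n"
))

status = covid_status(observations)
stay = days_hospitalized(encounters)
alive = alive_status(patients)
```

ICU admissions and controlled ventilation could be handled with the same pattern by filtering on their respective codes and then restricting to patients whose status is positive.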

Developing and Running your Algorithm

In this challenge, we present participants with a training data set and a test data set consisting of synthetic Veteran patient health records. Participants will develop computational algorithms to model the risk of SARS-CoV-2 infection and severe outcomes of COVID-19 illness in the Veteran population. The model will be used to predict COVID-19 status, days hospitalized, days in the ICU, controlled ventilation status, and mortality for each synthetic Veteran in the test data set. We encourage participants to use demographic data and the presence of comorbidities when developing their model to help precisionFDA and the VHA Innovation Ecosystem better understand how race, ethnicity, age, and comorbidities can affect the progression of COVID-19.
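
One possible way to turn demographic data and comorbidities into model inputs is sketched below. It is only an illustration: the "Id", "BIRTHDATE", "PATIENT", and "DESCRIPTION" column names, the 2020-01-01 reference date, and the keyword list are assumptions rather than part of the challenge specification, and `build_features` is a hypothetical helper.

```python
from datetime import date

REFERENCE_DATE = date(2020, 1, 1)  # assumed cut-off for computing age
COMORBIDITY_KEYWORDS = ("diabetes", "obesity", "kidney")  # hypothetical list


def age_in_years(birthdate_text, on=REFERENCE_DATE):
    """Whole years between an ISO birth date (YYYY-MM-DD) and a reference date."""
    born = date.fromisoformat(birthdate_text)
    had_birthday = (on.month, on.day) >= (born.month, born.day)
    return on.year - born.year - (0 if had_birthday else 1)


def build_features(patient_rows, condition_rows):
    """Map patient id -> {"age": ..., "<keyword>": 0 or 1, ...}."""
    features = {}
    for row in patient_rows:
        record = {"age": age_in_years(row["BIRTHDATE"])}
        record.update({keyword: 0 for keyword in COMORBIDITY_KEYWORDS})
        features[row["Id"]] = record
    for row in condition_rows:
        record = features.get(row["PATIENT"])
        if record is None:
            continue  # condition for a patient we have no demographics for
        description = row["DESCRIPTION"].lower()
        for keyword in COMORBIDITY_KEYWORDS:
            if keyword in description:
                record[keyword] = 1
    return features


example = build_features(
    [{"Id": "p1", "BIRTHDATE": "1950-06-15"}],
    [{"PATIENT": "p1", "DESCRIPTION": "Diabetes"}],
)
```

Matching on coded values instead of free-text descriptions would be more robust; keyword matching is used here only to keep the sketch short.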

(Optional) Reconstructing your Pipeline on precisionFDA

You have the option of reconstructing your pipeline on precisionFDA and running it there. To do so, you must create one or more apps on precisionFDA that encapsulate the actions performed in your pipeline. To create an app, you can provide Linux executables and an accompanying shell script to run inside an Ubuntu VM on the cloud. The precisionFDA website contains extensive documentation on how to create apps, and you can use existing apps (such as bwa_mem_bamsormadup) as a starting point for developing your own. Constructing your pipeline on precisionFDA has a significant advantage: you can keep it private at your discretion, or you can share it with the community, allowing others to review your work, reproduce your results, and collaborate to build on or refine your pipeline.

Submission Format

For this challenge, participants must submit five comma-separated values (CSV) text files, corresponding to predictions made for COVID-19 status, days hospitalized, days in the ICU, controlled ventilation status, and alive or deceased status for each synthetic Veteran in the test data set. These text files should be in Unix format (i.e., lines should end with newline characters (\n) rather than carriage returns (\r)). The format of these files is shown below:

COVID-19 Status

Participants are asked to submit a normalized confidence score, between 0 and 1, for each synthetic Veteran’s COVID-19 status. A value of 1 indicates confidence that the synthetic Veteran had a positive SARS-CoV-2 infection test, and a value of 0 indicates confidence that the synthetic Veteran is negative for SARS-CoV-2 infection.

f3cezeef-6a9b-48dd-a668-5f76a9fzd098,0.8
8d6z843c-680c-42ac-9bf4-64c9zbd5c152,0.2
9z11767c-6de4-4ad1-a80b-cz1e1299fd45,0.7

27bz5e49-698d-4aac-9121-833cfzd19997,0.1

Days Hospitalized

Participants are asked to submit the number of days hospitalized, including fractional days, for each synthetic Veteran with COVID-19. Veterans without COVID-19 should be given a value of 0.

f3cezeef-6a9b-48dd-a668-5f76a9fzd098,12.6
8d6z843c-680c-42ac-9bf4-64c9zbd5c152,0.2
9z11767c-6de4-4ad1-a80b-cz1e1299fd45,8.1

27bz5e49-698d-4aac-9121-833cfzd19997,0

Days in ICU

Participants are asked to submit the number of days in the ICU, including fractional days, for each synthetic Veteran with COVID-19. Veterans without COVID-19 should be given a value of 0.

f3cezeef-6a9b-48dd-a668-5f76a9fzd098,8.2
8d6z843c-680c-42ac-9bf4-64c9zbd5c152,0.1
9z11767c-6de4-4ad1-a80b-cz1e1299fd45,4.9

27bz5e49-698d-4aac-9121-833cfzd19997,0

Controlled Ventilation Status

Participants are asked to submit a normalized confidence score, between 0 and 1, for each synthetic Veteran’s controlled ventilation status. A value of 1 indicates confidence that the synthetic Veteran was ventilated as a result of COVID-19 illness, while a value of 0 indicates confidence that the synthetic Veteran was not ventilated as a result of COVID-19 illness. Veterans without COVID-19 should be given a value of 0.

f3cezeef-6a9b-48dd-a668-5f76a9fzd098,0.3
8d6z843c-680c-42ac-9bf4-64c9zbd5c152,0.95
9z11767c-6de4-4ad1-a80b-cz1e1299fd45,0.45

27bz5e49-698d-4aac-9121-833cfzd19997,1

Alive or Deceased Status

Participants are asked to submit a normalized confidence score, between 0 and 1, for each synthetic Veteran’s alive or deceased status. A value of 1 indicates confidence that the synthetic Veteran is alive, while a value of 0 indicates confidence that the synthetic Veteran died as a result of COVID-19 illness. Veterans without COVID-19 should be given a value of 1.

f3cezeef-6a9b-48dd-a668-5f76a9fzd098,0.3
8d6z843c-680c-42ac-9bf4-64c9zbd5c152,0.95
9z11767c-6de4-4ad1-a80b-cz1e1299fd45,0.45

27bz5e49-698d-4aac-9121-833cfzd19997,1
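
Files in the format shown above could be produced with Python's standard `csv` module, as in the minimal sketch below. The `write_predictions` helper is hypothetical; the two-column, header-free layout and the Unix line endings follow the format described above.

```python
import csv
import io


def write_predictions(handle, predictions):
    """Write (patient id, prediction) pairs as header-free CSV with \\n endings."""
    writer = csv.writer(handle, lineterminator="\n")
    for patient_id, value in predictions:
        writer.writerow([patient_id, value])


# An in-memory buffer stands in for a real file here. For a real file, open it
# with open(path, "w", newline="") so that Python does not translate "\n"
# into "\r\n" on Windows.
buffer = io.StringIO()
write_predictions(
    buffer,
    [
        ("f3cezeef-6a9b-48dd-a668-5f76a9fzd098", 0.8),
        ("8d6z843c-680c-42ac-9bf4-64c9zbd5c152", 0.2),
    ],
)
submission_text = buffer.getvalue()
```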

Submitting your Entry

To begin your submission, click "Submit Challenge Entry" at the top of this page. The submission screen will ask for a title, a description, and five (5) CSV text files containing your predictions for COVID-19 status, days hospitalized, days in ICU, controlled ventilation status, and mortality for each synthetic Veteran in the test data set.

Start by providing a short title for your submission entry, then fill in the description. In your description, please identify whether you are participating as an individual or as part of a team (if it is a team effort, please identify all members of your team), as well as providing your methods. The description entry field supports Markdown syntax. If you don't get it perfect right away, you can always go back and edit this description later. Details on the description of the methods are provided below.

Upload each of your five CSV text prediction files by clicking "Select file..." and choosing the file you'd like to submit (in the popup click "Files", then check "Only mine", then select your file). If you ran your pipeline in your own environment, rather than on precisionFDA, you must first upload each of your CSV text files to precisionFDA so that you can select them in the submission input data section.

Once you have entered a title, a description, and CSV text prediction files, the "Submit" button on the upper right corner will become active. Click the button to submit your entry. If your CSV text file was generated by running apps on precisionFDA (instead of being uploaded externally), the system will ask if you also want to publish the related job, app, and app assets.

After completing the submission wizard, your CSV text files will become available to the challenge organizers in a private space. The system will conduct an initial verification of your entry by running software to validate and score the submission. During that time, your entry will appear as "pending verification" under the "my entries" tab, but will not yet show up under the public "challenge submissions" tab. This step takes several minutes, and you can monitor it by clicking "View Log."

If your entry fails this step, it will be marked as "failed." You can click "View Log" to see some diagnostic information related to the verification software’s execution. Failed entries will not show up under the "challenge submissions" tab and will not be counted towards the challenge. Your CSV text will have been made public, but you are welcome to delete it.

If verification completes successfully, your entry will appear under "challenge submissions."

Participants are limited to a total of three (3) distinct submissions. Each submission will be evaluated separately. If submitting multiple entries, participants must include a description in their methods of how their models are substantially different.

Method Description

You are required to submit a short 1-2 page write-up describing the methods you used for the challenge. In the methods, please include a description of any known or novel risk and protective factors used as features by your model(s). As detailed above, please submit the description of your method(s) in the “description box” when submitting your entry. A template for method descriptions is provided below:

  • Provide a detailed description of your model including the following metrics: methods used, model settings and parameters, description(s) of datasets used for training, cross validation, and testing (number of samples, number of features, etc.)
  • Model features used
    • Risk Factors
    • Protective Factors
    • Demographic Data
    • Pre-Existing Conditions
    • Other Features
  • Link to complete code in a public GitHub repository or precisionFDA app(s)

Determining Top Performance

We will evaluate model performance separately for each of the five predicted outcomes. COVID-19 status and alive or deceased status predictions will be evaluated using the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC). Predictions of days hospitalized and days in the ICU will be evaluated using root-mean-square error (RMSE) and median time-to-event. These metrics are defined in the table below.

Metric | Meaning / Formula
Precision | (true positives) / (true positives + false positives)
Recall (Sensitivity, or True Positive Rate (TPR)) | (true positives) / (true positives + false negatives)
False Positive Rate (FPR) | (false positives) / (false positives + true negatives)
Area Under the Precision-Recall Curve (AUPRC) | Area under the precision-recall curve generated by plotting precision and recall at all classification thresholds
Area Under the Receiver Operating Characteristic (AUROC) | Area under the ROC curve generated by plotting TPR and FPR at all classification thresholds
Root-Mean-Square Error (RMSE) | Square root of the average of squared differences between the predictions and the observations
Median Time-To-Event | 50th percentile of the Kaplan-Meier curve starting from the date of entry into the hospital or ICU
* Time origin: hospital or ICU admission; the event is release from the hospital or ICU
* Censored observations: patients who died before being released from the hospital or ICU
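
To make the metric definitions above concrete, here is a minimal pure-Python sketch of each one. It is not the organizers' scoring code. AUROC is computed through its pairwise-ranking interpretation, AUPRC is approximated by the step-wise "average precision" sum, and the Kaplan-Meier median treats release as the event and death before release as censoring. All function names are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Square root of the mean squared difference."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total_positives = sum(labels)
    true_pos = false_pos = 0
    area = previous_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this threshold before scoring it.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / total_positives
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def km_median(durations, released):
    """Earliest time at which the Kaplan-Meier curve drops to 0.5 or below.

    durations: days from admission to release or censoring.
    released: True if the patient was released, False if censored.
    Returns None when the curve never reaches 0.5.
    """
    at_risk = len(durations)
    survival = 1.0
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, released) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if events:
            survival *= 1.0 - events / at_risk
            if survival <= 0.5:
                return t
        at_risk -= leaving
    return None


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auroc = roc_auc(labels, scores)
auprc = average_precision(labels, scores)
error = rmse([3.0, 0.0], [1.0, 0.0])
median_stay = km_median([1, 3, 5, 7, 9], [True, False, True, True, True])
```

The pairwise AUROC loop is quadratic in the number of patients, so a rank-based or library implementation would be preferable on the full test set.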

An overall ranking of submissions will be generated by combining the ranks of model performance across all five predicted outcomes. The challenge team will work with selected teams to further validate top-performing models on de-identified Veteran health records.
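
The exact rule for combining ranks is not specified here. As one plausible illustration only, the sketch below ranks submissions per outcome and orders them by the sum of their ranks; the choice of sum-of-ranks, the tie handling, and the function names are all assumptions.

```python
def rank_submissions(scores, higher_is_better=True):
    """Map submission name -> rank (1 = best); tied scores share the best rank."""
    ordered = sorted(scores.values(), reverse=higher_is_better)
    return {name: ordered.index(value) + 1 for name, value in scores.items()}


def combine_ranks(per_outcome_ranks):
    """Order submissions by the sum of their per-outcome ranks (lowest first)."""
    names = list(per_outcome_ranks[0])
    totals = {name: sum(ranks[name] for ranks in per_outcome_ranks) for name in names}
    # Names break ties so that the ordering is deterministic.
    return sorted(totals, key=lambda name: (totals[name], name))


# Hypothetical scores for three submissions on two outcomes.
auroc_ranks = rank_submissions({"A": 0.9, "B": 0.8, "C": 0.7})
rmse_ranks = rank_submissions({"A": 3.0, "B": 1.0, "C": 2.0}, higher_is_better=False)
overall = combine_ranks([auroc_ranks, rmse_ranks])
```

Note that the direction differs by metric: higher is better for AUROC and AUPRC, while lower is better for RMSE.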

Opportunities for Top Performers

Selected participants will be publicly recognized and invited to contribute to a scientific manuscript describing the challenge and results. Selected participants may also have opportunities to present at a conference and continue solution development with the VHA Innovation Ecosystem.

ADDITIONAL INFORMATION

Challenge Discussion

Please use the VHA Innovation Ecosystem COVID-19 Risk Factor Modeling Challenge on the precisionFDA Discussions forum to discuss the challenge.

Frequently Asked Questions

  • Question: How long does it take to convert my guest account to a contributor account?
    • Answer: Account approval typically takes 1-2 business days. PrecisionFDA administrators will provision your contributor account automatically upon review. Therefore, it is unnecessary to email precisionFDA after you receive an initial email about your guest account.
  • Question: Am I allowed to submit multiple entries to this challenge?
    • Answer: Yes. We allow up to three (3) entries per team. Each submission will be evaluated separately. If submitting multiple entries, participants must include within their methods a description of how their models are substantially different.
  • Question: Do I include headers in my submission files?
    • Answer: You should not include headers in your submission files. Additionally, your submission files should only include 2 comma-separated columns where the first column is the patient identifier and the second is the prediction for a health outcome. Reference the “Submission Format” section for more information and examples for each health outcome.
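
Before uploading, a quick local check along the following lines may catch header rows, extra columns, and Windows line endings. This is a sketch of the requirements stated in the answers above, not the challenge's verification software; `validate_submission` and its `bounded` flag are hypothetical.

```python
def validate_submission(text, bounded=False):
    """Return a list of problems found in the text of one submission file.

    bounded=True additionally requires predictions to lie between 0 and 1,
    as expected for the confidence-score files.
    """
    problems = []
    if "\r" in text:
        problems.append("carriage return found; use Unix (\\n) line endings")
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split(",")
        if len(fields) != 2:
            problems.append(f"line {number}: expected 2 columns, found {len(fields)}")
            continue
        try:
            value = float(fields[1])
        except ValueError:
            # A header row such as "patient,prediction" ends up here.
            problems.append(f"line {number}: prediction {fields[1]!r} is not a number")
            continue
        if bounded and not 0.0 <= value <= 1.0:
            problems.append(f"line {number}: score {value} is outside [0, 1]")
    return problems


clean = validate_submission("abc,0.5\ndef,0\n")
with_header = validate_submission("patient,prediction\nabc,0.5\n")
```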

Challenge Team

  • PrecisionFDA: Elaine Johanson, Emily Boja, Samir Lababidi
  • FDA OC: Christine Lee
  • FDA CBER: Ravi Goud
  • VHA Innovation Ecosystem: Amanda Purnell, Josh Patterson
  • Booz Allen Hamilton: Doug Deer, Anjali Kastorf, Zeke Maier, Holly Stephens, Sean Watford
  • DNAnexus: Sam Westreich