PrecisionFDA
Brain Cancer Predictive Modeling and Biomarker Discovery Challenge


An estimated 86,970 new cases of primary brain and other central nervous system tumors are expected to be diagnosed in the US in 2019. Brain tumors comprise a particularly deadly subset of all cancers due to limited treatment options and the high cost of care. Clinical investigators at Georgetown University are seeking to advance precision medicine techniques for the prognosis and treatment of brain tumors through the identification of novel multi-omics biomarkers. In support of this goal, precisionFDA and the Georgetown Lombardi Comprehensive Cancer Center and The Innovation Center for Biomedical Informatics at Georgetown University Medical Center (Georgetown-ICBI) are launching the Brain Cancer Predictive Modeling and Biomarker Discovery Challenge! This challenge will ask participants to develop machine learning and/or artificial intelligence models to identify biomarkers and predict patient outcomes using gene expression, DNA copy number, and clinical data.


  • Starts
    2019-11-01 16:00:00 UTC
  • Ends
    2020-02-02 04:59:59 UTC



The Food and Drug Administration (FDA), together with the Georgetown Lombardi Comprehensive Cancer Center and the Innovation Center for Biomedical Informatics at Georgetown University Medical Center, challenges the scientific community to develop and evaluate computational algorithms for brain tumor biomarker identification and patient outcome prediction using gene expression, DNA copy number, and clinical data.

Challenge Time Period
November 1, 2019 through February 1, 2020

BACKGROUND

An estimated 86,970 new cases of primary brain and other central nervous system tumors are expected to be diagnosed in the US in 2019. Brain tumors comprise a particularly deadly subset of all cancers due to limited treatment options and the high cost of care. Only a few prognostic and predictive markers have been successfully implemented in the clinic so far for gliomas, the most common malignant brain tumor type. These markers include MGMT promoter methylation in high-grade astrocytomas, co-deletion of 1p/19q in oligodendrogliomas, and mutations in IDH1 or IDH2 genes (Staedtke et al. 2016). There remains significant potential for identifying new clinical biomarkers in gliomas.

Clinical investigators at Georgetown University are seeking to advance precision medicine techniques for the prognosis and treatment of brain tumors through the identification of novel multi-omics biomarkers. In support of this goal, precisionFDA and the Georgetown Lombardi Comprehensive Cancer Center and The Innovation Center for Biomedical Informatics at Georgetown University Medical Center are launching the Brain Cancer Predictive Modeling and Biomarker Discovery Challenge! This challenge asks participants to develop machine learning and/or artificial intelligence models to identify biomarkers and predict patient outcomes using gene expression, DNA copy number, and clinical data.

CHALLENGE OVERVIEW

The Brain Cancer Predictive Modeling and Biomarker Discovery Challenge will launch on November 1, 2019 and end on February 1, 2020. Participants are asked to take part in three sub-challenges, which will be scored individually and combined for an overall challenge score. Participants will be provided DNA copy number data, gene expression profiles, clinical phenotypes, and outcomes for a cohort of patients in two phases. The Phase 1 data is the “provided” data that will be used to develop the models. Phase 2 data is the test data that will be used to score model performance. Phase 2 data will be released a week before the final deadline.

Participants are encouraged to utilize the phenotype data in addition to DNA copy number data and gene expression profiles for each sub-challenge. Participants are also encouraged to employ feature selection to minimize the features used while maintaining performance. A complete challenge submission will include predictions from each sub-challenge on the Phase 2 data, comprehensive documentation of methods employed and datasets used, and code developed for the challenge.
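
As an illustration of the feature selection step, the sketch below uses a simple univariate filter in Python (pandas and scikit-learn). The file and column names are placeholders only; the actual Phase 1 file layout is described under CHALLENGE DATA below, and participants may use any feature selection approach.

# Hypothetical sketch: univariate feature selection on a Phase 1 gene
# expression matrix (rows = samples, columns = genes, log2 normalized).
# File and column names are placeholders, not the official Phase 1 names.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

X = pd.read_csv("sc1_Phase1_GE_FeatureMatrix.tsv", sep="\t", index_col=0)
pheno = pd.read_csv("sc1_Phase1_GE_Phenotype.tsv", sep="\t", index_col=0)
y = pheno.loc[X.index, "SURVIVAL_STATUS"]  # align outcomes to the sample order

# Keep the 100 genes most associated with survival status (ANOVA F-test);
# the number of features and the scoring function are modeling choices.
selector = SelectKBest(score_func=f_classif, k=100)
X_selected = selector.fit_transform(X, y)
shortlisted_genes = X.columns[selector.get_support()]
print(shortlisted_genes.tolist())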

CHALLENGE DATA

Descriptions

  • DNA copy number
    • Rows are samples
    • Columns are genomic cytobands
    • Values are chromosome instability (CIN) index values at the cytoband level (Song et al. 2017); 0 indicates no instability
    • Source: SNP array
  • Gene expression profiles
    • Rows are samples
    • Columns are gene names
    • Values are log2 normalized gene expression values
    • Source: microarray
  • Survival status outcome
    • A value of 0 means the patient was alive or censored at last follow-up
    • A value of 1 means the patient died before the last scheduled follow-up
  • Additional clinical phenotype data
    • Sex
    • Race
    • Cancer type*
    • WHO grading**

*Brain cancer types include:

  • Astrocytoma - Originates in astrocytes, a particular kind of glial cell: star-shaped brain cells in the cerebrum. This type of tumor does not usually spread outside the brain and spinal cord and usually does not affect other organs.
  • GBM (glioblastoma) - The most aggressive cancer that begins within the brain. Glioblastomas represent 15% of brain tumors. They can either start from normal glial cells or develop from an existing low-grade astrocytoma.
  • Oligodendroglioma - A tumor that can occur in the brain or spinal cord.  An oligodendroglioma tumor forms from oligodendrocytes — glial cells in the brain and spinal cord that produce myelin, a substance that protects nerve cells.
  • Mixed - A tumor that has a mixed etiology.

**World Health Organization (WHO) grading is a measure of the “progressiveness” of central nervous system tumors:

  • II - A group of abnormal, slow-growing cells that are found only in the place where they first formed in the body, are somewhat infiltrative, and may recur as a higher grade
  • III - Malignant and infiltrative cells that tend to recur as a higher grade
  • IV - The most malignant cells that experience rapid and aggressive growth and are widely infiltrative with rapid recurrence
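
As an illustration only, the clinical phenotype columns above can be combined with a feature matrix by encoding the categorical variables numerically. The sketch below (Python, pandas) assumes placeholder file and column names; the released phenotype files may use different labels.

# Hypothetical sketch: merge clinical phenotypes into the feature matrix.
# Column names (SEX, RACE, CANCER_TYPE, WHO_GRADING) and file names are
# placeholders and may differ in the released files.
import pandas as pd

features = pd.read_csv("sc1_Phase1_GE_FeatureMatrix.tsv", sep="\t", index_col=0)
pheno = pd.read_csv("sc1_Phase1_GE_Phenotype.tsv", sep="\t", index_col=0)

# One-hot encode sex, race, and cancer type; map the WHO grade to an ordinal value.
clinical = pd.get_dummies(pheno[["SEX", "RACE", "CANCER_TYPE"]], drop_first=True)
clinical["WHO_GRADE"] = pheno["WHO_GRADING"].map({"II": 2, "III": 3, "IV": 4})

# Join on the sample identifiers shared by both files.
X = features.join(clinical, how="inner")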

Phase 1 Data Files (released November 1, 2019) -- files are linked below

Phase 2 Data Files (released January 24, 2020)

The Phase 2 data will be released during the second stage of the challenge. Note that the outcome (survival status) will be withheld from the Phase 2 files; participants will predict this value. The features selected and models obtained using the Phase 1 data in sub-challenges 1, 2, and 3 will be applied to the Phase 2 data to predict patient outcomes (survival status). Participants are asked to run their models on the Phase 2 data only once, so as to avoid overfitting; a minimal sketch of this step follows the file list below. The predicted values will be used to score the sub-challenges.

  • Sub-challenge 1
    • Clinical phenotypes = sc1_Phase2_GE_Phenotype.tsv
    • Gene expression = sc1_Phase2_GE_FeatureMatrix.tsv
  • Sub-challenge 2
    • Clinical phenotypes = sc2_Phase2_CN_Phenotype.tsv
    • DNA copy number = sc2_Phase2_CN_FeatureMatrix.tsv
  • Sub-challenge 3
    • Clinical phenotypes = sc3_Phase2_CN_GE_Phenotype.tsv
    • Gene expression and DNA copy number = sc3_Phase2_CN_GE_FeatureMatrix.tsv
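
For sub-challenge 1, this step can look like the sketch below (Python); trained_model and shortlisted_genes stand for a model and feature short list already fitted on Phase 1 data, and the same pattern applies to the sub-challenge 2 and 3 files. The predictions are then written in the submission format described under SUBMISSION DETAILS.

# Hypothetical sketch: apply the frozen Phase 1 model to the Phase 2 data
# exactly once. `trained_model` and `shortlisted_genes` are placeholders for
# objects produced during Phase 1 model building.
import pandas as pd

X_phase2 = pd.read_csv("sc1_Phase2_GE_FeatureMatrix.tsv", sep="\t", index_col=0)
X_phase2 = X_phase2[shortlisted_genes]         # restrict to the Phase 1 features
predictions = trained_model.predict(X_phase2)  # 0 = alive/censored, 1 = died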

SUBMISSION DETAILS

Challenge timeline

  • Phase 1
    • November 1, 2019: Phase 1 data released
    • January 22, 2020: Deadline for Phase 1 submissions, which must include:
      • Phase 1 model summary
      • Phase 1 data used
  • Phase 2
    • January 24, 2020: Phase 2 data released
      • Participants are expected to apply the models they have built using Phase 1 data to the Phase 2 data using gene expression, copy number, and clinical phenotype data (the latter is optional). This is expected to be done only once to avoid overfitting.
    • February 1, 2020: Deadline for Phase 2 submissions, which must include:
      • Phase 2 model summary
      • Phase 2 sub-challenge 1 patient outcome predictions
      • Phase 2 sub-challenge 2 patient outcome predictions
      • Phase 2 sub-challenge 3 patient outcome predictions

Submission file format description

Phase 1 model summary
This document is meant to be a summary of the models used on the Phase 1 data. Provide the following details for EACH sub-challenge within “Summary-Phase1.txt”. Please use the provided template.
a) Provide a description of model settings and parameters, and details of model building including dataset(s) description(s) used for training, cross validation and testing (number of samples, number of features, etc.) 
b) Short listed features selected by the model
c) Link to complete code in a public GitHub repository
d) Confusion matrix indicating number of predictions, true positives, false positives, true negatives, and false negatives
e) Overall accuracy
f) Specificity
g) Sensitivity 
h) Area under the curve (AUC)
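
For items (d) through (h), the quantities can be computed from held-out Phase 1 predictions as in the sketch below (Python, scikit-learn); y_true, y_pred, and y_score are placeholders for the held-out labels, the predicted classes, and the predicted probabilities for class 1.

# Hypothetical sketch: metrics for the Phase 1 model summary, computed on a
# held-out split of the Phase 1 data.
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)
auc = roc_auc_score(y_true, y_score)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"accuracy={accuracy:.3f} specificity={specificity:.3f} "
      f"sensitivity={sensitivity:.3f} AUC={auc:.3f}")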

Phase 1 data used
A compressed file that includes the data files from each sub-challenge used for analysis, so that step (a) above can be reproduced. These should be the final files used for analysis, after any data transformation or other processing has been applied.

Phase 2 model summary
This document is meant to provide a description of the settings used on the Phase 2 data to confirm the same model was utilized for Phase 1 and Phase 2.  Provide the model description for EACH sub-challenge within “Summary-Phase2.txt”. Please use the provided template.

Phase 2 sub-challenge 1 patient outcome predictions
For sub-challenge 1, participants are expected to apply the models they have built using Phase 1 data to the Phase 2 sub-challenge 1 test data and predict the survival status outcome. Participants are required to submit a tab separated text file (TSV) named “subchallenge_1.tsv”. Files are expected to follow Unix format. In this file, each row represents the predicted survival status for one patient in the Phase 2 dataset. The first column is the patient name, and the second column is either 0 or 1 indicating the outcome status (0 means the patient was alive or censored at last follow-up; 1 means the patient died before the last scheduled follow-up). An example is shown below:
PATIENTID   SURVIVAL_STATUS
Patient_A   0
Patient_B   1
Patient_C   1
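
A minimal way to produce this file, assuming patient_ids and predictions hold the Phase 2 sample names and 0/1 calls (both placeholders), is shown below; opening the file with newline="\n" keeps the Unix line endings regardless of platform.

# Hypothetical sketch: write the sub-challenge 1 predictions in the required
# two-column, tab-separated, Unix-line-ending format.
with open("subchallenge_1.tsv", "w", newline="\n") as out:
    out.write("PATIENTID\tSURVIVAL_STATUS\n")
    for pid, status in zip(patient_ids, predictions):
        out.write(f"{pid}\t{int(status)}\n")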

Phase 2 sub-challenge 2 patient outcome predictions
For sub-challenge 2, participants are expected to apply the models they have built using Phase 1 to the Phase 2 sub-challenge 2 test data. The survival status outcome is to be predicted. Participants are required to submit a tab separated text file (TSV) named “subchallenge_2.tsv”. The format of the file is the same as the Phase 2 sub-challenge 1 patient outcome predictions.

Phase 2 sub-challenge 3 patient outcome predictions
For sub-challenge 3, participants are expected to apply the models they have built using Phase 1 to the Phase 2 sub-challenge 3 test data. The survival status outcome is to be predicted. Participants are required to submit a tab separated text file (TSV) named “subchallenge_3.tsv”. The format of the file is the same as the Phase 2 sub-challenge 1 patient outcome predictions.

EVALUATION CRITERIA

  • The ground truth consists of the actual outcomes (survival status) for the patients in the Phase 2 data. These outcomes will be compared with the predicted values submitted in Phase 2 for each sub-challenge. Models will be evaluated using metrics such as sensitivity, specificity, positive predictive value (precision), F1-score, and diagnostic odds ratio.
  • Submissions will also be evaluated on their use of a small set of features while maintaining model performance. The use of the phenotype (clinical) data is encouraged but optional. Participants are asked to describe any feature selection techniques they used.
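
For reference, the sketch below shows how the listed metrics relate to a 2x2 confusion matrix (Python, scikit-learn); y_true stands for the actual Phase 2 survival statuses held by the organizers and y_pred for the submitted predictions. The organizers' exact scoring implementation is not specified here.

# Hypothetical sketch of the listed evaluation metrics.
from sklearn.metrics import confusion_matrix, f1_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # positive predictive value
f1 = f1_score(y_true, y_pred)
# Diagnostic odds ratio: (TP/FN) / (FP/TN); undefined when any cell is zero.
dor = (tp * tn) / (fp * fn)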

CHALLENGE DISCUSSION

Please use The Brain Cancer Predictive Modeling and Biomarker Discovery Challenge Discussion thread on the precisionFDA Discussions forum to discuss the challenge and ask questions.

FREQUENTLY ASKED QUESTIONS

  • Q: Can I participate in only one sub-challenge?
    • A: Yes. Participants can choose to participate in only one sub-challenge, but they are encouraged to participate in all three, as the overall challenge score combines the scores from all three sub-challenges.
  • Q: Can we participate and submit as a team?
    • A: Participants may submit as a team; however, they will need to appoint one member to request login access and to submit under that account.
  • Q: Are participants allowed to use external datasets at any step of the challenge (model training and/or validation)?
    • A: No. Model training and validation are restricted to the datasets provided for the challenge.

CHALLENGE TEAM

  • PrecisionFDA: Elaine Johanson, Ruth Bandler
  • Georgetown University: Subha Madhavan, Yuriy Gusev, Adil Alaoui, Krithika Bhuvaneshwar
  • Booz Allen: Holly Stephens, Sean Watford, Zeke Maier

REFERENCES

Staedtke, V., Dzaye, O. D. a., & Holdhoff, M. (2016). Actionable molecular biomarkers in primary brain tumors. Trends in Cancer, 2(7), 338-349.

Song, L., Bhuvaneshwar, K., Wang, Y., Feng, Y., Shih, I.-M., Madhavan, S., & Gusev, Y. (2017). CINdex: A Bioconductor package for analysis of chromosome instability in DNA copy number data. Cancer Informatics, 16, 1176935117746637 (PMID: 29343938).