Brain Cancer Predictive Modeling and Biomarker Discovery Challenge
An estimated 86,970 new cases of primary brain and other central nervous system tumors are expected to be diagnosed in the US in 2019. Brain tumors comprise a particularly deadly subset of all cancers due to limited treatment options and the high cost of care. Clinical investigators at Georgetown University are seeking to advance precision medicine techniques for the prognosis and treatment of brain tumors through the identification of novel multi-omics biomarkers. In support of this goal, precisionFDA and the Georgetown Lombardi Comprehensive Cancer Center and The Innovation Center for Biomedical Informatics at Georgetown University Medical Center (Georgetown-ICBI) are launching the Brain Cancer Predictive Modeling and Biomarker Discovery Challenge! This challenge will ask participants to develop machine learning and/or artificial intelligence models to identify biomarkers and predict patient outcomes using gene expression, DNA copy number, and clinical data.
2019-11-01 16:00:00 UTC
2020-02-15 04:59:59 UTC
The Challenge Phase 1 Deadline Has Been Extended to February 5th!
If you have not yet submitted, you have two additional weeks! For those who have already submitted, we encourage you to resubmit and improve upon your model if you wish.
Phase 2 data will now be released February 7th and the challenge will close February 14th.
In addition to the challenge extension, we’ve added an incentive for the top three performing teams. These teams or individuals will be awarded a podium presentation and a poster at the 9th Annual Health Informatics and Data Science Symposium and will have their registration fees waived. This conference is a great opportunity to meet and network with thought leaders in the fields of precision and molecular medicine, health data analytics, and bioinformatics.
In addition, when applicable, publication authorship byline(s) will be awarded to the top-performing Challenge Team(s) in good standing (i.e., those that submit and make publicly available their final Entry(ies) and any write-up, code, or other requirement of a particular Challenge). With respect to the top-performing Team(s), only the names of the registered Team Members and any affiliated Lab Head or Principal Investigator will be listed as co-authors in the byline of the Challenge publication. Challenge organizers reserve the right to organize the publication authorship and contribution list.
Note: If you choose to use this extension to improve upon your model, please include a note in your resubmission.
Challenge Time Period
November 1, 2019 through February 14, 2020
An estimated 86,970 new cases of primary brain and other central nervous system tumors are expected to be diagnosed in the US in 2019. Brain tumors comprise a particularly deadly subset of all cancers due to limited treatment options and the high cost of care. Only a few prognostic and predictive markers have been successfully implemented in the clinic so far for gliomas, the most common malignant brain tumor type. These markers include MGMT promoter methylation in high-grade astrocytomas, co-deletion of 1p/19q in oligodendrogliomas, and mutations in IDH1 or IDH2 genes (Staedtke et al. 2016). There remains significant potential for identifying new clinical biomarkers in gliomas.
Clinical investigators at Georgetown University are seeking to advance precision medicine techniques for the prognosis and treatment of brain tumors through the identification of novel multi-omics biomarkers. In support of this goal, precisionFDA and the Georgetown Lombardi Comprehensive Cancer Center and The Innovation Center for Biomedical Informatics at Georgetown University Medical Center are launching the Brain Cancer Predictive Modeling and Biomarker Discovery Challenge! This challenge asks participants to develop machine learning and/or artificial intelligence models to identify biomarkers and predict patient outcomes using gene expression, DNA copy number, and clinical data.
The Brain Cancer Predictive Modeling and Biomarker Discovery Challenge will launch on November 1, 2019 and end on February 14, 2020. Participants are asked to take part in three sub-challenges, which will be scored individually and combined for an overall challenge score. Participants will be provided DNA copy number data, gene expression profiles, clinical phenotypes, and outcomes for a cohort of patients in two phases. The Phase 1 data is the “provided” data that will be used to develop the models. Phase 2 data is the test data that will be used to score model performance. Phase 2 data will be released a week before the final deadline.
Participants are encouraged to utilize the phenotype data in addition to DNA copy number data and gene expression profiles for each sub-challenge. Participants are also encouraged to employ feature selection to minimize the features used while maintaining performance. A complete challenge submission will include predictions from each sub-challenge on the Phase 2 data, comprehensive documentation of methods employed and datasets used, and code developed for the challenge.
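As an illustration of the feature selection encouraged above, the following is a minimal Python sketch of one simple filter approach: ranking features by how well they separate the two outcome classes. The scoring rule, the function name `rank_features`, and the sample data are assumptions made for this example; the challenge does not prescribe any particular selection method.

```python
from statistics import mean, pstdev


def rank_features(samples, labels):
    """Rank features by |difference in class means| / (sum of class std devs).

    samples: {sample_id: {feature_name: value}}
    labels:  {sample_id: 0 or 1}
    Returns feature names, most discriminative first.
    """
    features = sorted(next(iter(samples.values())))
    scores = {}
    for feature in features:
        group0 = [samples[s][feature] for s in samples if labels[s] == 0]
        group1 = [samples[s][feature] for s in samples if labels[s] == 1]
        gap = abs(mean(group1) - mean(group0))
        spread = pstdev(group0) + pstdev(group1)
        if spread > 0:
            scores[feature] = gap / spread
        else:
            # Both classes are constant: perfect separation if the means differ.
            scores[feature] = float("inf") if gap > 0 else 0.0
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: sample IDs, feature names, and values are made up.
samples = {
    "P001": {"GENE_A": 1.0, "GENE_B": 5.0},
    "P002": {"GENE_A": 1.2, "GENE_B": 5.1},
    "P003": {"GENE_A": 3.0, "GENE_B": 5.0},
    "P004": {"GENE_A": 3.4, "GENE_B": 4.9},
}
labels = {"P001": 0, "P002": 0, "P003": 1, "P004": 1}

ranked = rank_features(samples, labels)
print(ranked[:1])  # keep only the top-ranked feature
```

To avoid optimistic performance estimates, a filter like this should be fitted inside each cross-validation fold of the Phase 1 data rather than on the full dataset.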
DNA copy number
- Rows are samples
- Columns are genomic cytobands
- Values are chromosome instability index (CIN index) values (Song et al. 2017) at the cytoband level; 0 indicates no instability
- Source: SNP array
Gene expression profiles
- Rows are samples
- Columns are gene names
- Values are log2 normalized gene expression values
- Source: microarray
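Both data types described above are samples-by-features matrices. The following minimal Python sketch shows one way to parse such a table. It assumes a tab-delimited layout with a header row and the sample ID in the first column; the actual file layout is not specified here, and the function name `read_feature_table`, the sample IDs, and the values are hypothetical.

```python
import csv
import io


def read_feature_table(handle):
    """Parse a tab-delimited table into {sample_id: {feature_name: float}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)        # first cell labels the sample-ID column
    feature_names = header[1:]   # gene names or cytobands
    table = {}
    for row in reader:
        if not row:              # skip blank lines
            continue
        sample_id, values = row[0], row[1:]
        table[sample_id] = {
            name: float(value) for name, value in zip(feature_names, values)
        }
    return table


# Hypothetical two-sample excerpt; IDs, feature names, and values are made up.
example = "sample\tGENE_A\tGENE_B\nP001\t7.25\t9.5\nP002\t6.0\t11.75\n"
expression = read_feature_table(io.StringIO(example))
print(expression["P002"]["GENE_B"])
```

The same function could be applied to a copy number table, where the columns would be cytobands instead of gene names.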
Survival status outcome
- A value of 0 means the patient was alive or censored at last follow-up
- A value of 1 means the patient died before the last scheduled follow-up
Additional clinical phenotype data
- Cancer type*
- WHO grading**
*Brain cancer types include:
- Astrocytoma - Originates in astrocytes, a particular kind of star-shaped glial cell in the cerebrum. This type of tumor does not usually spread outside the brain and spinal cord, and it usually does not affect other organs.
- GBM (glioblastoma) - The most aggressive cancer that begins within the brain. Glioblastomas represent 15% of brain tumors. They can either start from normal glial cells or develop from an existing low-grade astrocytoma.
- Oligodendroglioma - A tumor that can occur in the brain or spinal cord. An oligodendroglioma tumor forms from oligodendrocytes — glial cells in the brain and spinal cord that produce myelin, a substance that protects nerve cells.
- Mixed - A tumor that has a mixed etiology.
**World Health Organization (WHO) grading is a measure of the “progressiveness” of central nervous system tumors:
- II - A group of abnormal, slow-growing cells that are found only in the place where they first formed in the body, are somewhat infiltrative, and may recur as a higher grade
- III - Malignant and infiltrative cells that tend to recur as a higher grade
- IV - The most malignant cells that experience rapid and aggressive growth and are widely infiltrative with rapid recurrence
Phase 1 Data Files (released November 1, 2019)
Sub-challenge 1
- Outcome (survival status)
- Clinical phenotypes
- Gene expression
Sub-challenge 2
- Outcome (survival status)
- Clinical phenotypes
- DNA copy number
Sub-challenge 3
- Outcome (survival status)
- Clinical phenotypes
- Gene expression and DNA copy number
Phase 2 Data Files (released February 7, 2020)
The Phase 2 data will be released during the second stage of the challenge. Note that Outcome (survival status) will be withheld from the Phase 2 files, and participants will predict this value. The features selected and the models obtained using the Phase 1 data in sub-challenges 1, 2, and 3 will be applied to the Phase 2 data to predict patient outcomes (survival status). Participants are asked to run their models on the Phase 2 data only once, so as to avoid overfitting. These predicted values will be used to score the sub-challenges.
- Sub-challenge 1
- Sub-challenge 2
- Sub-challenge 3
November 1, 2019: Phase 1 data released
February 5, 2020: Deadline for Phase 1 submissions, which must include:
- Phase 1 model summary
- Phase 1 data used
February 7, 2020: Phase 2 data released
- Participants are expected to apply the models they have built using Phase 1 data to the Phase 2 data using gene expression, copy number, and clinical phenotype data (the latter is optional). This is expected to be done only once to avoid overfitting.
February 14, 2020: Deadline for Phase 2 submissions, which must include:
- Phase 2 model summary
- Phase 2 sub-challenge 1 patient outcome predictions
- Phase 2 sub-challenge 2 patient outcome predictions
- Phase 2 sub-challenge 3 patient outcome predictions
Submission file format description
Phase 1 model summary
This document is meant to be a summary of the models used on the Phase 1 data. Provide the following details for EACH sub-challenge within “Summary-Phase1.txt”. Please use the provided template.
a) Provide a description of model settings and parameters, and details of model building including dataset(s) description(s) used for training, cross validation and testing (number of samples, number of features, etc.)
b) Shortlisted features selected by the model
c) Link to complete code in a public GitHub repository
d) Confusion matrix indicating number of predictions, true positives, false positives, true negatives, and false negatives
e) Overall accuracy
h) Area under the curve (AUC)
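The confusion matrix, overall accuracy, and AUC requested in items (d), (e), and (h) can be computed from held-out predictions on the Phase 1 data. The following is a minimal Python sketch using only the standard library. The function names, the 0.5 decision threshold, and the example labels and scores are assumptions made for illustration, not part of the challenge template.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = died)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def accuracy(counts):
    """Fraction of all predictions that were correct."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and model scores (made up for illustration).
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts), auc(y_true, scores))
```

The pairwise AUC computation is quadratic in the number of samples, which is acceptable for cohorts of this size; a rank-based formula would scale better.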
Phase 1 data used
A compressed file that includes the data files from each sub-challenge used for analysis, so that step (a) above can be reproduced. These should be the final files used for analysis, after any data transformation or other processing has been applied.
Phase 2 model summary
This document is meant to provide a description of the settings used on the Phase 2 data to confirm the same model was utilized for Phase 1 and Phase 2. Provide the model description for EACH sub-challenge within “Summary-Phase2.txt”. Please use the provided template.
Phase 2 sub-challenge 1 patient outcome predictions
For sub-challenge 1, participants are expected to apply the models they have built using Phase 1 data to the Phase 2 sub-challenge 1 test data to predict the survival status outcome. Participants are required to submit a tab-separated text file (TSV) named “subchallenge_1.tsv”. Files are expected to follow Unix format. In this file, each row represents the survival status prediction for one patient in the Phase 2 dataset. The first column is the patient name, and the second column is either 0 or 1, indicating the outcome status (0 means the patient was alive or censored at last follow-up, and 1 means the patient died before the last scheduled follow-up). An example is shown below:
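The example from the original page is not reproduced in this text. As a stand-in, the following minimal Python sketch writes a file in the layout described above: two tab-separated columns with Unix (`\n`) line endings. The patient names are hypothetical, the function name `write_predictions` is made up for this example, and the absence of a header row is an assumption, since the description does not say whether one is expected.

```python
import csv
import io


def write_predictions(predictions, handle):
    """Write (patient_name, status) rows as TSV with Unix line endings."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for patient, status in predictions:
        if status not in (0, 1):
            raise ValueError(f"status for {patient} must be 0 or 1, got {status!r}")
        writer.writerow([patient, status])


# Hypothetical predictions; the patient names are made up.
predictions = [("PATIENT_001", 0), ("PATIENT_002", 1)]

buffer = io.StringIO()
write_predictions(predictions, buffer)
print(buffer.getvalue(), end="")

# To write the actual submission file, open it with newline="" so that
# Python does not translate "\n" into platform-specific line endings:
# with open("subchallenge_1.tsv", "w", newline="") as handle:
#     write_predictions(predictions, handle)
```

The same function could be reused for “subchallenge_2.tsv” and “subchallenge_3.tsv”, since all three files share one format.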
Phase 2 sub-challenge 2 patient outcome predictions
For sub-challenge 2, participants are expected to apply the models they have built using Phase 1 to the Phase 2 sub-challenge 2 test data. The survival status outcome is to be predicted. Participants are required to submit a tab separated text file (TSV) named “subchallenge_2.tsv”. The format of the file is the same as the Phase 2 sub-challenge 1 patient outcome predictions.
Phase 2 sub-challenge 3 patient outcome predictions
For sub-challenge 3, participants are expected to apply the models they have built using Phase 1 to the Phase 2 sub-challenge 3 test data. The survival status outcome is to be predicted. Participants are required to submit a tab separated text file (TSV) named “subchallenge_3.tsv”. The format of the file is the same as the Phase 2 sub-challenge 1 patient outcome predictions.
- The truth data are the actual outcomes (survival status) for the patients in the Phase 2 data. These outcomes will be compared with the predicted values in Phase 2 for each sub-challenge submission. Models will be evaluated using metrics such as sensitivity, specificity, positive predictive value (precision), F1-score, and diagnostic odds ratio.
- The use of a small set of features while maintaining model performance. The use of the phenotype (clinical) data is encouraged but optional. Participants are asked to use feature selection techniques and to describe the techniques used.
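The metrics named in the first criterion above all follow from the four confusion matrix counts. The following minimal Python sketch uses their standard definitions. The function name, the example counts, and the choice to return NaN for undefined ratios are assumptions made for illustration; how the organizers handle such edge cases is not stated.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),       # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),       # true negative rate
        "precision": ratio(tp, tp + fp),         # positive predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),   # harmonic mean of precision and recall
        "diagnostic_odds_ratio": ratio(tp * tn, fp * fn),
    }


# Hypothetical counts, made up for illustration.
metrics = evaluation_metrics(tp=40, fp=10, tn=35, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the diagnostic odds ratio is undefined whenever there are no false positives or no false negatives, which is why the sketch guards every division.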
Please use The Brain Cancer Predictive Modeling and Biomarker Discovery Challenge Discussion thread on the precisionFDA Discussions forum to discuss the challenge and ask questions.
FREQUENTLY ASKED QUESTIONS
Q: Can I participate in only one sub-challenge?
- A: Yes. Participants can choose to participate in only one sub-challenge, but they are encouraged to participate in all three, as the total score is the sum of the scores from all three sub-challenges.
Q: Can we participate and submit as a team?
- A: Participants may submit as a team; however, they will need to appoint one member to request login access and to submit under that account.
Q: Are participants allowed to use external datasets at any step of the challenge (model training and/or validation)?
- A: Model training and validation are restricted to the provided datasets only.
- PrecisionFDA: Elaine Johanson, Ruth Bandler
- Georgetown University: Subha Madhavan, Yuriy Gusev, Adil Alaoui, Krithika Bhuvaneshwar
- Booz Allen: Holly Stephens, Sean Watford, Zeke Maier
Staedtke, V., a Dzaye, O. D., & Holdhoff, M. (2016). Actionable molecular biomarkers in primary brain tumors. Trends in Cancer, 2(7): 338-349.
Song, L., Bhuvaneshwar, K., Wang, Y., Feng, Y., Shih, I-M., Madhavan, S., Gusev, Y. (2017). CINdex: A Bioconductor Package for Analysis of Chromosome Instability in DNA Copy Number Data. Cancer Inform., 16:1176935117746637 (PMID: 29343938).