NCI Data Jamboree (Project Abstract Submission): Submission #4
Submission information
Submission Number: 4
Submission ID: 183381
Submission UUID: c7b95736-281c-4f60-a00f-4fa2fc8d8e21
Submission URI: /nci/datajamboree/abstractsubmission
Submission Update: /nci/datajamboree/abstractsubmission?token=BGhIGfQT0Mzd6_wfBqaKnJ41moVoCYuR6nKY7leFOsA
Created: Thu, 06/11/2026 - 19:11
Completed: Thu, 06/11/2026 - 19:11
Changed: Thu, 06/11/2026 - 19:11
Remote IP address: 10.208.28.22
Submitted by: Anonymous
Language: English
Is draft: No
Webform: NCI Data Jamboree (Abstracts)
Submitted to: NCI Data Jamboree (Project Abstract Submission)
Presenter Information
Sabira
Dabeer
M.B.B.S.; MD (Clinical Biochemistry); MS (Biological Data Science)
Clinical Biochemist and Data Scientist
ARIZONA STATE UNIVERSITY
Phoenix
Additional Authors
Abstract Information
Employing statistical, computational, and informatics tools, algorithms, and methods to integrate or analyze data
Integrating Dietary, Clinical, and Molecular Data to Identify Risk Signatures for Early-Onset Colorectal Cancer Using Explainable Machine Learning
Early-onset colorectal cancer (EOCRC), defined as colorectal cancer diagnosed before age 50 years, has increased substantially over the past two decades despite declining incidence among older adults. Although lifestyle and dietary changes have been proposed as contributing factors, the biological mechanisms linking these exposures to colorectal cancer development remain incompletely understood. This project aims to investigate whether dietary patterns and lifestyle factors are associated with molecular signatures and biological pathways implicated in EOCRC and whether these features can be integrated into explainable machine learning models for risk prediction.
The proposed work will integrate epidemiologic, clinical, and molecular data from publicly available resources, including the All of Us Research Program; the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial; the Surveillance, Epidemiology, and End Results (SEER) program; and molecular datasets available through the Genomic Data Commons (GDC), including The Cancer Genome Atlas (TCGA). Variables of interest may include dietary exposures, body mass index, physical activity, smoking status, alcohol use, demographic factors, colorectal cancer outcomes, and molecular features associated with key colorectal cancer pathways, including WNT signaling, TP53, KRAS, DNA mismatch repair, inflammatory, and metabolic pathways.
Machine learning approaches, including logistic regression, random forest, and gradient boosting methods, will be evaluated to identify factors associated with EOCRC. Explainable AI techniques, including SHAP-based feature attribution, will be used to characterize the relative contribution of dietary, clinical, and molecular variables to model predictions. The project will also explore the feasibility of integrating heterogeneous data sources to generate interpretable risk signatures that may improve understanding of EOCRC development.
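As a minimal illustration of SHAP-style feature attribution, the sketch below fits a logistic regression by gradient descent on simulated data and computes per-feature contributions in log-odds space. It is not the project's pipeline: the feature names, effect sizes, and data are hypothetical and synthetic, and it uses only the Python standard library. It relies on the closed-form result that, for a linear model with features treated as independent, the SHAP value of feature j is w_j * (x_j - mean(x_j)); tree-based models such as random forest or gradient boosting would need a dedicated explainer instead.

```python
import math
import random

# Hypothetical feature names, chosen only for illustration.
FEATURES = ["red_meat_servings", "bmi", "physical_activity"]
# Hypothetical log-odds effects, used only to simulate labels.
TRUE_WEIGHTS = [0.8, 0.5, -0.6]
TRUE_BIAS = -1.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(w, b, row):
    """Model output in log-odds space."""
    return b + sum(wi * xi for wi, xi in zip(w, row))


def simulate(n, seed=0):
    """Draw standardized synthetic features and binary outcomes."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in FEATURES]
        p = sigmoid(logit(TRUE_WEIGHTS, TRUE_BIAS, row))
        X.append(row)
        y.append(1 if rng.random() < p else 0)
    return X, y


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(logit(w, b, row)) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def linear_shap(w, background, row):
    """Closed-form attributions for a linear model in log-odds space,
    assuming independent features: phi_j = w_j * (x_j - mean_j)."""
    n = len(background)
    means = [sum(r[j] for r in background) / n for j in range(len(w))]
    return [w[j] * (row[j] - means[j]) for j in range(len(w))]


X, y = simulate(500)
w, b = fit_logistic(X, y)
phi = linear_shap(w, X, X[0])
baseline = sum(logit(w, b, r) for r in X) / len(X)

for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:+.3f}")
print(f"baseline log-odds: {baseline:+.3f}")
```

The attributions sum to the difference between the individual's predicted log-odds and the average predicted log-odds over the background data, which is the additivity property that makes SHAP values readable as per-feature contributions to a single prediction.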
Results may help identify candidate risk factors, generate new biological hypotheses, and establish a framework for future integrative analyses of cancer epidemiology and molecular data.