NCI Data Jamboree (Project Abstract Submission): Submission #4

Submission information

Submission Number: 4

Submission ID: 183381

Submission UUID: c7b95736-281c-4f60-a00f-4fa2fc8d8e21

Submission URI: /nci/datajamboree/abstractsubmission

Created: Thu, 06/11/2026 - 19:11

Completed: Thu, 06/11/2026 - 19:11

Changed: Thu, 06/11/2026 - 19:11

Submitted by: Anonymous

Language: English

Is draft: No

Webform: NCI Data Jamboree (Abstracts)

Submitted to: NCI Data Jamboree (Project Abstract Submission)

Presenter Information

First Name: Sabira

Middle Initial: {Empty}

Last Name: Dabeer

Degree(s): M.B.B.S.; MD (Clinical Biochemistry); MS (Biological Data Science)

Position/Title/Career Status: Clinical Biochemist and Data Scientist

Organization: Arizona State University

Organization Address: Phoenix

Email: dabeersabira@gmail.com

Additional Authors

List of Additional Authors

First Name: Hatim
Last Name: Palitanawala
Affiliation: Mumbai University

Abstract Information

Abstract Category: Employing statistical, computational, and informatics tools, algorithms, and methods to integrate or analyze data

Abstract Keywords: {Empty}

Abstract Title: Integrating Dietary, Clinical, and Molecular Data to Identify Risk Signatures for Early-Onset Colorectal Cancer Using Explainable Machine Learning

Abstract: The incidence of early-onset colorectal cancer (EOCRC), defined as colorectal cancer diagnosed before age 50, has increased substantially over the past two decades even as incidence among older adults has declined. Although lifestyle and dietary changes have been proposed as contributing factors, the biological mechanisms linking these exposures to colorectal cancer development remain incompletely understood. This project will investigate whether dietary patterns and lifestyle factors are associated with molecular signatures and biological pathways implicated in EOCRC, and whether these features can be integrated into explainable machine learning models for risk prediction.

The proposed work will integrate epidemiologic, clinical, and molecular data from publicly available resources, including the All of Us Research Program; the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial; the Surveillance, Epidemiology, and End Results (SEER) Program; and molecular datasets available through the Genomic Data Commons (GDC), including The Cancer Genome Atlas (TCGA). Variables of interest may include dietary exposures, body mass index, physical activity, smoking status, alcohol use, demographic factors, colorectal cancer outcomes, and molecular features associated with key colorectal cancer genes and pathways, including WNT signaling, TP53, KRAS, DNA mismatch repair, and inflammatory and metabolic pathways.
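As a rough illustration of how such variables might be harmonized into a single feature row, the following Python sketch derives an EOCRC label from age at diagnosis and collapses gene-level mutation calls into pathway-level indicators. The field names, the example record, and the short gene lists are hypothetical and non-exhaustive; they are assumptions made for illustration, not the project's actual schema or pathway definitions.

```python
# Illustrative sketch only: field names, gene sets, and the example record
# are assumptions, not the project's actual schema.

EOCRC_AGE_CUTOFF = 50  # EOCRC is defined in the abstract as diagnosis before age 50

# Hypothetical, non-exhaustive gene sets for some of the pathways named above.
PATHWAY_GENES = {
    "wnt_signaling": {"APC", "CTNNB1"},
    "tp53": {"TP53"},
    "kras": {"KRAS"},
    "mismatch_repair": {"MLH1", "MSH2", "MSH6", "PMS2"},
}


def is_early_onset(age_at_diagnosis):
    """Return True when the diagnosis age falls below the EOCRC cutoff."""
    return age_at_diagnosis < EOCRC_AGE_CUTOFF


def pathway_features(mutated_genes):
    """Map a list of mutated genes to 0/1 pathway-level indicators."""
    mutated = set(mutated_genes)
    return {
        pathway: int(bool(genes & mutated))
        for pathway, genes in PATHWAY_GENES.items()
    }


def build_row(record):
    """Combine clinical, lifestyle, and molecular fields into one flat row."""
    row = {
        "early_onset": int(is_early_onset(record["age_at_diagnosis"])),
        "bmi": record["bmi"],
        "current_smoker": int(record["smoking_status"] == "current"),
    }
    row.update(pathway_features(record["mutated_genes"]))
    return row


# Made-up record used only to exercise the functions.
example_record = {
    "age_at_diagnosis": 44,
    "bmi": 31.2,
    "smoking_status": "never",
    "mutated_genes": ["APC", "KRAS"],
}
row = build_row(example_record)
print(row)
```

Pathway-level indicators are one possible encoding; mutation counts or expression-based pathway scores would be reasonable alternatives depending on the data available in each resource.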

Machine learning approaches, including logistic regression, random forest, and gradient boosting methods, will be evaluated to identify factors associated with EOCRC. Explainable AI techniques, including SHAP-based feature attribution, will be used to characterize the relative contribution of dietary, clinical, and molecular variables to model predictions. The project will also explore the feasibility of integrating heterogeneous data sources to generate interpretable risk signatures that may improve understanding of EOCRC development. Results may help identify candidate risk factors, generate new biological hypotheses, and establish a framework for future integrative analyses of cancer epidemiology and molecular data.
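The modeling and attribution step could look roughly like the following minimal Python sketch. It covers only the logistic regression case, uses synthetic data, and uses hypothetical feature names. Instead of calling the `shap` package, it computes the closed-form SHAP values that hold for a linear model when features are treated as independent: each feature's contribution, in log-odds units, is its weight times the feature's deviation from the background mean. Tree-based models such as random forest or gradient boosting would need a different explainer.

```python
import math
import random

# Hypothetical feature names; the data below are synthetic, not study data.
FEATURES = ["processed_meat_servings", "bmi_zscore", "kras_mutated"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_data(n, seed=0):
    """Simulate standardized exposures and a binary outcome (assumed effects)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1), float(rng.random() < 0.4)]
        logit = 1.5 * x[0] + 0.5 * x[1] - 0.5  # third feature has no true effect
        X.append(x)
        y.append(1 if rng.random() < sigmoid(logit) else 0)
    return X, y


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def linear_shap(w, b, X_background, x):
    """SHAP values of a linear model in log-odds space, assuming independent
    features: phi_j = w_j * (x_j - mean_j), base = model output at the means."""
    means = [sum(col) / len(col) for col in zip(*X_background)]
    base = sum(wj * mj for wj, mj in zip(w, means)) + b
    phi = [wj * (xj - mj) for wj, xj, mj in zip(w, x, means)]
    return base, phi


X, y = make_synthetic_data(500)
w, b = fit_logistic(X, y)
base, phi = linear_shap(w, b, X, X[0])
for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:+.3f}")
```

The base value plus the per-feature contributions reproduces the model's log-odds output for that individual, which is the additivity property that makes SHAP-style attributions straightforward to read as a risk signature.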