NCI Data Jamboree (Project Abstract Submission): Submission #4

Submission information
Submission Number: 4
Submission ID: 183381
Submission UUID: c7b95736-281c-4f60-a00f-4fa2fc8d8e21

Created: Thu, 06/11/2026 - 19:11
Completed: Thu, 06/11/2026 - 19:11
Changed: Thu, 06/11/2026 - 19:11

Remote IP address: 10.208.28.22
Submitted by: Anonymous
Language: English

Is draft: No

Presenter Information
---------------------
First Name: Sabira
Middle Initial: {Empty}
Last Name: Dabeer
Degree(s): M.B.B.S.; MD (Clinical Biochemistry); MS (Biological Data Science)
Position/Title/Career Status: Clinical Biochemist and Data Scientist
Organization: ARIZONA STATE UNIVERSITY
Organization Address:
Phoenix

Email: dabeersabira@gmail.com

Additional Authors
------------------
List of Additional Authors:
- First Name: Hatim
  Last Name: Palitanawala
  Affiliation: Mumbai University


Abstract Information
--------------------
Abstract Category: Employing statistical, computational, and informatics tools, algorithms, and methods to integrate or analyze data
Abstract Keywords: {Empty}
Abstract Title: Integrating Dietary, Clinical, and Molecular Data to Identify Risk Signatures for Early-Onset Colorectal Cancer Using Explainable Machine Learning
Abstract:
The incidence of early-onset colorectal cancer (EOCRC), defined as colorectal cancer diagnosed before age 50 years, has increased substantially over the past two decades, even as incidence among older adults has declined. Although lifestyle and dietary changes have been proposed as contributing factors, the biological mechanisms linking these exposures to colorectal cancer development remain incompletely understood. This project aims to investigate whether dietary patterns and lifestyle factors are associated with molecular signatures and biological pathways implicated in EOCRC, and whether these features can be integrated into explainable machine learning models for risk prediction.

The proposed work will integrate epidemiologic, clinical, and molecular data from publicly available resources, including the All of Us Research Program; the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial; the Surveillance, Epidemiology, and End Results (SEER) program; and molecular datasets available through the Genomic Data Commons (GDC), including The Cancer Genome Atlas (TCGA). Variables of interest may include dietary exposures, body mass index, physical activity, smoking status, alcohol use, demographic factors, colorectal cancer outcomes, and molecular features associated with key colorectal cancer pathways, including the WNT signaling, TP53, KRAS, DNA mismatch repair, inflammatory, and metabolic pathways.

Machine learning approaches, including logistic regression, random forest, and gradient boosting methods, will be evaluated to identify factors associated with EOCRC. Explainable AI techniques, including SHAP-based feature attribution, will be used to characterize the relative contribution of dietary, clinical, and molecular variables to model predictions. The project will also explore the feasibility of integrating heterogeneous data sources to generate interpretable risk signatures that may improve understanding of EOCRC development.

Results may help identify candidate risk factors, generate new biological hypotheses, and establish a framework for future integrative analyses of cancer epidemiology and molecular data.
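
As a rough illustration of the SHAP-based feature attribution mentioned above, the following minimal Python sketch fits a logistic regression to synthetic data and computes per-feature attributions for one individual. It is not the project's pipeline: the feature names, the toy cohort, and the true weights are hypothetical and invented for the demo, and instead of the SHAP library it uses the closed-form result for linear models under an assumption of independent features, where the attribution of feature j on the log-odds scale is w_j * (x_j - mean_j). Tree-based models such as random forest or gradient boosting would need a dedicated explainer.

```python
import math
import random

# Hypothetical feature names, for illustration only; not variables from any named dataset.
FEATURES = ["processed_meat_servings", "bmi", "physical_activity_hours"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_cohort(n, seed=0):
    """Generate toy standardized features and binary labels (synthetic, not real data)."""
    rng = random.Random(seed)
    true_weights = [0.8, 0.5, -0.6]  # arbitrary values chosen for the demo
    rows, labels = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in FEATURES]
        p = sigmoid(-1.0 + sum(w * v for w, v in zip(true_weights, x)))
        rows.append(x)
        labels.append(1 if rng.random() < p else 0)
    return rows, labels


def fit_logistic_regression(rows, labels, lr=0.1, epochs=300):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def linear_shap(weights, bias, rows, x):
    """Attributions on the log-odds scale for a linear model, assuming independent features.

    Returns (base_value, attributions), where attribution_j = w_j * (x_j - mean_j)
    and base_value is the model's log-odds at the feature means.
    """
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(len(weights))]
    base_value = bias + sum(w * m for w, m in zip(weights, means))
    attributions = [w * (v - m) for w, v, m in zip(weights, x, means)]
    return base_value, attributions


rows, labels = make_synthetic_cohort(500)
weights, bias = fit_logistic_regression(rows, labels)
base_value, attributions = linear_shap(weights, bias, rows, rows[0])
for name, phi in zip(FEATURES, attributions):
    print(f"{name}: {phi:+.3f}")
```

The attributions sum, together with the base value, to the model's log-odds for that individual, which is the additivity property that makes this style of explanation convenient for comparing dietary, clinical, and molecular variables on a common scale.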