NCI Data Jamboree (Project Abstract Submission): Submission #4

Submission information
Submission Number: 4
Submission ID: 183381
Submission UUID: c7b95736-281c-4f60-a00f-4fa2fc8d8e21

Created: Thu, 06/11/2026 - 19:11
Completed: Thu, 06/11/2026 - 19:11
Changed: Thu, 06/11/2026 - 19:11

Remote IP address: 10.208.28.22
Submitted by: Anonymous
Language: English

Is draft: No
Presenter Information
Sabira
Dabeer
M.B.B.S.; MD (Clinical Biochemistry); MS (Biological Data Science)
Clinical Biochemist and Data Scientist
Arizona State University
Phoenix
Additional Authors
  • First Name: Hatim
    Last Name: Palitanawala
    Affiliation: Mumbai University
Abstract Information
Employing statistical, computational, and informatics tools, algorithms, and methods to integrate or analyze data
Integrating Dietary, Clinical, and Molecular Data to Identify Risk Signatures for Early-Onset Colorectal Cancer Using Explainable Machine Learning
The incidence of early-onset colorectal cancer (EOCRC), defined as colorectal cancer diagnosed before age 50 years, has increased substantially over the past two decades, even as incidence among older adults has declined. Although lifestyle and dietary changes have been proposed as contributing factors, the biological mechanisms linking these exposures to colorectal cancer development remain incompletely understood. This project aims to investigate whether dietary patterns and lifestyle factors are associated with molecular signatures and biological pathways implicated in EOCRC, and whether these features can be integrated into explainable machine learning models for risk prediction.

The proposed work will integrate epidemiologic, clinical, and molecular data from publicly available resources, including the All of Us Research Program; the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial; the Surveillance, Epidemiology, and End Results (SEER) program; and molecular datasets available through the Genomic Data Commons (GDC), including The Cancer Genome Atlas (TCGA). Variables of interest may include dietary exposures, body mass index, physical activity, smoking status, alcohol use, demographic factors, colorectal cancer outcomes, and molecular features associated with key colorectal cancer pathways, including WNT signaling, TP53, KRAS, DNA mismatch repair, inflammatory, and metabolic pathways.

Machine learning approaches, including logistic regression, random forest, and gradient boosting methods, will be evaluated to identify factors associated with EOCRC. Explainable AI techniques, including SHAP-based feature attribution, will be used to characterize the relative contribution of dietary, clinical, and molecular variables to model predictions. The project will also explore the feasibility of integrating heterogeneous data sources to generate interpretable risk signatures that may improve understanding of EOCRC development.

Results may help identify candidate risk factors, generate new biological hypotheses, and establish a framework for future integrative analyses of cancer epidemiology and molecular data.
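
As an illustration of the feature-attribution step described above, the following minimal sketch computes exact Shapley values for a toy logistic risk model. It is not the project's pipeline and does not use the shap library: it enumerates every feature coalition directly, which is only practical for a handful of features, and it fills in "absent" features from a small background sample. The feature names, coefficients, intercept, and data rows are all hypothetical placeholders chosen for the example, not estimates from any of the datasets named in this abstract.

```python
from itertools import combinations
from math import exp, factorial

# Hypothetical toy model: a logistic regression with made-up coefficients.
FEATURES = ["red_meat_servings_per_week", "bmi", "activity_hours_per_week"]
WEIGHTS = [0.30, 0.08, -0.25]
INTERCEPT = -3.0


def log_odds(x):
    """Linear predictor of the toy logistic model."""
    return INTERCEPT + sum(w * v for w, v in zip(WEIGHTS, x))


def predict_risk(x):
    """Predicted probability from the toy logistic model."""
    return 1.0 / (1.0 + exp(-log_odds(x)))


def coalition_value(x, background, subset, model):
    """Mean model output when features in `subset` take their values from x
    and every other feature takes its value from a background row."""
    total = 0.0
    for row in background:
        mixed = [x[i] if i in subset else row[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)


def shapley_values(x, background, model):
    """Exact Shapley attribution of model(x) relative to the background mean."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                with_i = coalition_value(x, background, s | {i}, model)
                without_i = coalition_value(x, background, s, model)
                phi[i] += weight * (with_i - without_i)
    return phi


# Hypothetical background sample and one hypothetical participant.
background_rows = [[1.0, 22.0, 5.0], [3.0, 27.0, 2.0], [5.0, 31.0, 1.0]]
participant = [6.0, 33.0, 0.5]

for name, value in zip(FEATURES, shapley_values(participant, background_rows, predict_risk)):
    print(f"{name}: {value:+.4f}")
```

The attributions sum to the difference between the participant's predicted risk and the mean predicted risk over the background rows, which is the property that lets dietary, clinical, and molecular contributions be compared on one scale. For tree ensembles such as random forests or gradient boosting, a dedicated implementation would be needed in practice, since exhaustive enumeration grows exponentially with the number of features.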