NCI Data Jamboree (Project Abstract Submission)
4 submissions
| # | Created | User | IP address | First Name | Middle Initial | Last Name | Degree(s) | Position/Title/Career Status | Organization | Organization Address | Email | List of Additional Authors | Abstract Category | Abstract Keywords | Abstract Title | Abstract |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Thu, 06/11/2026 - 19:11 | Anonymous | 10.208.28.22 | Sabira | | Dabeer | M.B.B.S.; MD (Clinical Biochemistry); MS (Biological Data Science) | Clinical Biochemist and Data Scientist | ARIZONA STATE UNIVERSITY | Phoenix | dabeersabira@gmail.com | | Employing statistical, computational, and informatics tools, algorithms, and methods to integrate or analyze data | | Integrating Dietary, Clinical, and Molecular Data to Identify Risk Signatures for Early-Onset Colorectal Cancer Using Explainable Machine Learning | Early-onset colorectal cancer (EOCRC), defined as colorectal cancer diagnosed before age 50 years, has increased substantially over the past two decades despite declining incidence among older adults. Although lifestyle and dietary changes have been proposed as contributing factors, the biological mechanisms linking these exposures to colorectal cancer development remain incompletely understood. This project aims to investigate whether dietary patterns and lifestyle factors are associated with molecular signatures and biological pathways implicated in EOCRC and whether these features can be integrated into explainable machine learning models for risk prediction. The proposed work will integrate epidemiologic, clinical, and molecular data from publicly available resources including the All of Us Research Program, Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial, Surveillance, Epidemiology, and End Results (SEER) program, and molecular datasets available through the Genomic Data Commons (GDC), including The Cancer Genome Atlas (TCGA). Variables of interest may include dietary exposures, body mass index, physical activity, smoking status, alcohol use, demographic factors, colorectal cancer outcomes, and molecular features associated with key colorectal cancer pathways, including WNT signaling, TP53, KRAS, DNA mismatch repair, inflammatory, and metabolic pathways. Machine learning approaches, including logistic regression, random forest, and gradient boosting methods, will be evaluated to identify factors associated with EOCRC. Explainable AI techniques, including SHAP-based feature attribution, will be used to characterize the relative contribution of dietary, clinical, and molecular variables to model predictions. The project will also explore the feasibility of integrating heterogeneous data sources to generate interpretable risk signatures that may improve understanding of EOCRC development. Results may help identify candidate risk factors, generate new biological hypotheses, and establish a framework for future integrative analyses of cancer epidemiology and molecular data. |
| 3 | Wed, 06/10/2026 - 14:01 | Anonymous | 10.208.24.28 | Jaclyn | N | Taroni | Ph.D. | Director of the Childhood Cancer Data Lab | Alex's Lemonade Stand Foundation | Wynnewood, PA | j.taroni@alexslemonade.org | | Evaluating data quality for reproducibility and AI-readiness | | Project Seeker: Jaclyn N. Taroni | Experience and expertise: I am Director of the Childhood Cancer Data Lab at Alex’s Lemonade Stand Foundation (https://www.ccdatalab.org/), where I lead a multidisciplinary team of data scientists, software engineers, and UX professionals. At the Data Lab, we build tools to make pediatric cancer data accessible, such as the Single-cell Pediatric Cancer Atlas (scpca.alexslemonade.org), and organize open science projects, such as the Open Pediatric Brain Tumor Atlas. I am a computational biologist by training. Historically, my research focus has been on leveraging large collections of transcriptomic data and machine learning in rare disease settings. Why I want to participate: My work depends on the broader cancer data ecosystem being interoperable, well-documented, and reusable, and I want to engage directly with NCI repositories and a cross-disciplinary group on the practical barriers to finding, accessing, and integrating these resources. I am looking to contribute and learn from investigators in the broader cancer space in a hands-on collaboration setting. What I hope to achieve: I hope to build collaborations across the cancer data community that extend the reach of open-science tooling and to help produce publicly shareable artifacts that others can build on. I am especially interested in joining a team where reproducibility and data quality are central to the problem. |
| 2 | Mon, 06/08/2026 - 20:30 | Anonymous | 10.208.24.28 | Megha | B. | Srivastava | B.S./M.S. in Computer Science | PhD Student in Computer Science | Stanford University | Stanford | megha@cs.stanford.edu | | Employing statistical, computational, and informatics tools, algorithms, and methods to integrate or analyze data | machine learning, AI-readiness, distribution shift, causal inference, confounding variables, language models, LLMs | Project Seeker | I am a PhD student in Computer Science, with significant experience in machine learning, large language modeling, and human-AI interaction. I have recently been transitioning my research toward applications of AI in medicine, healthcare, and drug discovery, and I hope to understand what challenges exist at the dataset level and which datasets would be ideal for pushing different problems forward. One research area I am particularly interested in is the challenge of distribution shift (e.g., a mismatch between the training dataset and test-time inference) and how to tackle it. I am particularly curious about methods for identifying potential confounding variables that are unmeasured in the current dataset. My hope is to join a project that can help improve the quality and availability of oncology datasets for machine learning research. |
| 1 | Fri, 05/29/2026 - 09:31 | Anonymous | 10.208.28.199 | Ying | | Huang | MD | HSA | NCI | Rockville | ying.huang@nih.gov | | Developing tutorials, workbooks, infographics, or creative use of data for educational and engagement purposes | Keywords: Findability; Accessibility; Governance Interoperability; Data Reuse Workflow Observability; NCI Repositories | A Pilot Discovery Friction Framework for Quantifying Research Initiation Burden Across Federated Oncology Data Ecosystems | The rapid expansion of federated oncology ecosystems has increased the number of controlled-access biomedical datasets, but translational investigators frequently encounter fragmented discovery pathways and complex governance workflows. While essential for participant privacy, the operational burden of these systems remains poorly characterized. This project will develop and pilot the Discovery Friction Framework, a human-centered observability framework designed to quantify "research initiation burden" across federated ecosystems. The framework employs structured workflow instrumentation, including screen recording, event logging, and rubric-based telemetry extraction, to capture objective operational metrics across multiple domains (discovery burden, authentication complexity, workflow instability, governance complexity, and temporal burden). Metrics include portal transitions, unresolved discovery paths, authentication redirection chains, manual workarounds, Data Access Request (DAR) revision cycles, and time-to-access intervals. To test the framework against realistic discovery pathways, the project will identify high-value datasets across multiple modalities from published secondary data analyses utilizing dbGaP (genomics), NCCR (clinical), and CRDC IDC (imaging). By characterizing translational workflow complexity through real-world investigator interactions, this pilot will generate foundational telemetry primitives, workflow observability methods, and evidence-based insights that may support future scalable observability strategies across biomedical data ecosystems. The project will provide actionable guidance for improving pathway transparency, governance intelligibility, and translational data access coordination across the NCI data ecosystem. Implementation Requirements: The project requires a standard web-based testing environment (no high-performance computing required). Software tools include open-source user-session logging, screen-capture instrumentation, and text-mining packages (Python/R) to structure qualitative rubrics into quantitative dataframes. The project relies on a cross-disciplinary team featuring: Human-Centered Design/UX Researchers to build telemetry rubrics; Data Governance Specialists and Data Access Committee (DAC) members familiar with dbGaP and NIH data access mechanisms to map workflow pathways; and Front-End/Data Engineers to develop the underlying schema for the friction telemetry database. |