Data and Tools
A variety of cancer research data types and datasets are available for access, integration, and analysis, including but not limited to genomics, transcriptomics, imaging, spatial omics, proteomics, metabolomics, clinical data, real-world data (RWD), cancer registry data, and population and epidemiology data. Interested participants can use open access, managed access, and controlled access data for their projects.
Data Discovery:
Participants who have been funded to generate/collect datasets to be used for the jamboree projects will have direct access rights.
Secondary users can search for and request access to data through several entry points:
- Directly querying datasets of interest from data catalogs, databases, or data repositories, e.g., the Index of NCI Studies, the NIH Database of Genotypes and Phenotypes (dbGaP), Gene Expression Omnibus (GEO), National Library of Medicine (NLM) Dataset Catalog, Cancer Research Data Commons (CRDC) domain-specific data commons (e.g., Genomic Data Commons (GDC), Proteomic Data Commons (PDC), Clinical and Translational Data Commons, Imaging Data Commons), or ImmPort.
- Browsing program-specific websites and data portals. Examples of datasets include, but are not limited to:
- The Cancer Genome Atlas (TCGA) and GDC: WXS, WGS, RNA-seq, mRNA-seq, miRNA-seq, bisulfite sequencing, copy number arrays, SNP arrays, RPPA, clinical, imaging, and specimen.
- Human Tumor Atlas Network (HTAN): Specimen, clinical, bulk DNA-seq and RNA-seq, sc-RNA-seq, sc-ATAC-seq, Hi-C-seq, 10x Genomics Visium spatial transcriptomics, NanoString CosMx and GeoMx, RPPA, mass spectrometry (MS)-based proteomics, metabolomics and lipidomics, electron microscopy, and others.
- Childhood Cancer Data Initiative (CCDI) and CCDI Hub: Molecular Characterization Initiative data (CLIA-grade enhanced exome-seq, Archer Fusion, EPIC methylation arrays); clinical data and enrollment statistics; research-grade WGS and total RNA-seq, with proteomics, sc-RNA-seq, 10x Genomics Visium HD, and cell surface proteomics data forthcoming. Check the CCDI data hub for new releases.
- Clinical Proteomic Tumor Analysis Consortium (CPTAC) and PDC: WXS, WGS, RNA-seq, methylation arrays, proteomics, post-translational modifications (e.g., phosphoproteome, acetylome, ubiquitylome, glycoproteome), metabolomics, lipidomics, imaging, clinical, and specimen.
- Clinical Trials Reporting Program (CTRP): Clinical trial data from all national trials (e.g., NCTN, ETCTN, NCORP); trials conducted at an NCI-Designated Cancer Center (with a P30 center core grant), including all industrial trials; and all trials conducted under any contract, grant, or cooperative agreement supported by the NCI (e.g., R01, N01, SPORE, P01, U01, U10). Access to NCTN Archive clinical trial data is transitioning to dbGaP. Summary-level information for all clinical trials is registered at https://clinicaltrials.gov.
- Immuno-Oncology Translational Network (IOTN) and Acquired Resistance to Therapy Network (ARTNet): Clinical, sc-RNA-seq, bulk RNA-seq, sc-ATAC-seq, ChIP-seq, microarrays, WES, CRISPR-seq, CUT&Tag, CUT&RUN, HiChIP-seq, specimen, TCR-seq, and others.
- Visit https://cancercontrol.cancer.gov/publications-data/dccps-public-data-sets-analyses for Cancer Control, Epidemiology and Population Science Data and Resources:
- NCI Cohort Consortium, Cancer Epidemiology Descriptive Cohort Database (CEDCD): Descriptive data on cohort studies that follow groups of persons over time for cancer incidence, mortality, and other health outcomes, including general study information (e.g., eligibility criteria and size), the type of data collected at baseline, cancer sites, number of participants diagnosed with cancer, and biospecimen information. All data included in this database are aggregated for each cohort without individual-level data.
- Surveillance, Epidemiology, and End Results (SEER) and National Childhood Cancer Registry (NCCR): Cancer statistics from state registries, including survival, incidence, outcome, and mortality; registry data including demographics, cancer diagnosis, stage and prognostic factors, treatment, follow-up, recurrence, death, and pathology data matched to linkage projects from various sources, such as administrative claims, extracts from hospital clinical information systems (Electronic Health Records, Radiation Oncology Information System), and special study data.
- Genomic Summary Results (GSR): Summary genomic data generated from primary analyses of genomic research across many individuals.
- Health Information National Trends Survey (HINTS): National health survey data.
- Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial Data: PLCO is a randomized, controlled trial to determine whether certain screening exams reduce mortality from prostate, lung, colorectal and ovarian cancer. Approximately 155,000 participants were enrolled between November 1993 and July 2001. PLCO has the following five ClinicalTrials.gov registration numbers: NCT00002540 (Prostate), NCT01696968 (Lung), NCT01696981 (Colorectal), NCT01696994 (Ovarian), and NCT00339495 (EEMS).
This jamboree allows access to and use of datasets through different mechanisms, including open, registered, or controlled access, depending on the data types at the raw and derived (processed) levels.
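As one illustration of how the access level can guide dataset discovery, the sketch below builds (but does not send) a query against the public GDC API files endpoint, restricted to open access TCGA RNA-seq files. The helper name `build_gdc_file_query` is hypothetical, and the filter field names (`cases.project.program.name`, `experimental_strategy`, `access`) are assumptions that should be checked against the current GDC API documentation before use.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_file_query(program, strategy, access="open", size=10):
    """Return query parameters for the GDC files endpoint (no request is sent).

    Hypothetical helper: the field names below are assumptions to verify
    against the GDC API documentation.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.program.name", "value": [program]}},
            {"op": "in", "content": {"field": "experimental_strategy", "value": [strategy]}},
            {"op": "in", "content": {"field": "access", "value": [access]}},
        ],
    }
    return {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type,access",
        "format": "JSON",
        "size": str(size),
    }


# Build a URL for open access TCGA RNA-seq file metadata.
params = build_gdc_file_query("TCGA", "RNA-Seq", access="open")
query_url = GDC_FILES_ENDPOINT + "?" + urlencode(params)
print(query_url)
```

Changing `access` to `"controlled"` would list controlled access files instead; downloading those files still requires an approved data access request.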
To submit Data Access Requests (DARs) for controlled access data through dbGaP, refer to the dbGaP Authorized Access System instructions. NCI's Office of Data Sharing (ODS) has created several Data Collections, each comprising many studies, to streamline the access process and reduce the burden on investigators. Investigators can submit a DAR for each Collection listed below. Once approved for a Collection, they can access all datasets in that Collection without submitting additional DARs for individual datasets.
- NCI's Collection of Datasets for General Research Use (phs003014)
- NCI's Collection of Datasets for Health, Medical, and Biomedical Research (phs003044)
- NCI's Collection of Datasets for General Cancer Research (phs003967)
- NCI's Collection of Datasets for Pediatric Cancer Research (phs003964)
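A team planning its requests might track which Collections it is already approved for. The sketch below is purely illustrative: the accession numbers come from the list above, while the short labels and the helper `dars_needed` are hypothetical and not part of any dbGaP tooling.

```python
# Collection accessions as listed above; the short labels are illustrative.
NCI_COLLECTIONS = {
    "phs003014": "General Research Use",
    "phs003044": "Health, Medical, and Biomedical",
    "phs003967": "General Cancer Research",
    "phs003964": "Pediatric Cancer Research",
}


def dars_needed(approved, wanted):
    """Return the Collection accessions in `wanted` that still need a DAR, sorted."""
    unknown = set(wanted) - set(NCI_COLLECTIONS)
    if unknown:
        raise ValueError(f"Not an NCI Collection accession: {sorted(unknown)}")
    return sorted(set(wanted) - set(approved))


# Already approved for General Research Use; also want the pediatric Collection.
pending = dars_needed({"phs003014"}, ["phs003014", "phs003964"])
print(pending)  # ['phs003964']
```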
Computational Tools and Resources:
Participants may use resources from NCI programs, such as the Cancer Research Data Commons (CRDC) Seven Bridges Cancer Genomics Cloud; tools from CCDI, HTAN, Informatics Technology for Cancer Research (ITCR), CPTAC, and IDC; and other NCI program resources, or their own institutional computing environment and tools, to download data for on-premises analysis or to bring data into a cloud environment before the event.