pan-cancer-dataset-sources's Introduction

Pan-Cancer Datasets

This collection of pan-cancer datasets spans several modalities: medical and clinical records, radiology (CT, MRI, PET), pathology (H&E and IHC), and omics data (genomics and proteomics). It is non-exhaustive and is updated periodically. The purpose of this compilation is to give the cancer research community a unified view of the resources available for studying various cancer sites, organs, and modalities. We aim to use these resources in our ongoing research and in the fight against cancer.

The list is compiled primarily from data portals run under the NIH National Cancer Institute (NCI): The Cancer Imaging Archive (TCIA), the Genomic Data Commons (GDC) portal of The Cancer Genome Atlas (TCGA), and the Proteomic Data Commons (PDC) portal of the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Below is a summary of the datasets available at these portals.

  • The Cancer Genome Atlas (TCGA) molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
  • Joint effort between NCI and the National Human Genome Research Institute.
  • Over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data.
  • Publicly available for research use.
  • Genomics data are available at the Genomic Data Commons (GDC) portal, open access.
  • Imaging data are available at The Cancer Imaging Archive (TCIA), open access. The radiology and histopathology data of TCIA can be accessed and downloaded through its portals.
  • Proteomics data are available through the Proteomic Data Commons (PDC) under the Clinical Proteomic Tumor Analysis Consortium (CPTAC) program.
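As a rough illustration of how the GDC portal listed above can be queried programmatically, the sketch below builds a request URL for the public GDC REST API `cases` endpoint. The helper name `build_cases_query` and the default field list are assumptions made for this example. The function only constructs the URL; it does not contact the server.

```python
import json
from urllib.parse import urlencode

# Public GDC REST API endpoint for case-level metadata.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_cases_query(project_id, fields=("submitter_id", "primary_site"), size=10):
    """Return a GET URL listing cases of one project (hypothetical helper)."""
    # GDC filters are JSON objects of the form {"op": ..., "content": {...}}.
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_CASES_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Project identifiers such as TCGA-LUAD appear in the dataset table.
    print(build_cases_query("TCGA-LUAD", size=5))
```

The resulting URL can be opened in a browser or passed to any HTTP client. Open-access metadata should not need a token, while controlled-access files do.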

Below, we first present the National Cancer Institute (NCI) data modalities, followed by the 34 cancer sites with their corresponding datasets, primary publications, number of cases, and modalities. The list is organized by cancer type and then by data modality. The data modalities include clinical, copy number, DNA, imaging, and miRNA, mRNA, and protein expression. The second table presents the non-NCI dataset resources available for public access. Lastly, we list the imaging modality abbreviations used in this compilation.

Data Modalities

  • Clinical
    • Clinical data
      • Available for all cancer types
      • May include demographic information, treatment information, survival data, etc.
      • XML (per patient), tab-delimited TXT.
      • Additional information in the Clinical Data Elements (CDE) Browser.
    • Biospecimen data
      • Available for all cancer types
      • Information on how samples were processed by the Biospecimen Core Resource Center
      • XML (per patient), tab-delimited TXT.
      • Additional information in the Clinical Data Elements (CDE) Browser.
    • Pathology Reports
      • Available for all cancer types
      • Pathology reports (for select cases)
      • PDF format
  • Copy Number
    • SNP microarray
    • Copy number microarray
      • Available for GBM, OV, LUSC
      • Tab-delimited TXT (raw signals per probe), tab-delimited TSV (normalized values per aggregated region), MAT.
      • Probe information contained in array design files for each platform
    • DNA Sequencing
      • Available for some tumor types
      • Low pass, whole genome sequencing of tumor and normal matched samples and analysis of differences in read counts between tumor and normal
      • tab-delimited TSV (normal vs. tumor cells)
  • DNA
    • Whole exome
      • Available for all cancer types
      • Whole exome sequencing of tumor and normal matched samples
      • VCF, MAF (mutation calls)
    • Whole genome
      • Available for all cancer types
      • Whole genome sequencing for tumor and normal matched samples (for select cases)
      • VCF, MAF (mutation calls).
    • SNP microarray
      • Available for all cancer types
      • tab-delimited TXT (genotypes per SNP)
  • Imaging
    • Diagnostic image
      • Available for all cancer types
      • Whole slide images of tissue used to diagnose participant
      • SVS
      • Available at the GDC, open access
    • Tissue image
      • Available for all cancer types
      • Whole slide images of tissue samples from each participant that were used for TCGA analyses
      • SVS
      • Available at the GDC, open access
    • Radiological image
      • Available for some cancer types
      • Pre-surgical radiological imaging (e.g., MRI, CT, PET) (for select cases)
      • DCM or DICOM format.
      • Available at The Cancer Imaging Archive, open access
  • miRNA, mRNA, and Protein Expression
    • miRNA Sequencing
      • Available for all cancer types except GBM
      • miRNA sequencing of tumor samples
      • tab-delimited TXT (normalized expression values per miRNA or isoform)
    • Array-based
      • Available for GBM, OV cancer types
      • TXT (raw signals per probe, normalized expression values per probe, gene, or exons)
      • Probe information contained in array design files for each platform
    • mRNA Sequencing
      • Available for all cancer types
      • mRNA sequencing of tumor samples using a poly(A) enrichment RNA preparation
      • TXT (normalized expression values per gene, isoform, exon, or splice junction)
      • labeled as RNASeqV1 and RNASeqV2
    • Total RNA Sequencing
      • Available for some cancer types
      • mRNA sequencing of tumor samples using a ribosomal depletion RNA preparation
      • TXT (normalized expression values per gene, isoform, exon, or splice junction)
      • labeled as TotalRNASeqV2
    • Microarray
      • Available for BRCA, COAD, GBM, KIRC, KIRP, LAML, LGG, LUAD, LUSC, OV, READ, UCEC cancer types
      • TXT (raw signals per probe, normalized expression values per probe, gene, or exons)
      • Probe information contained in array design files for each platform
    • Reverse-Phase Protein Array
      • Available for all cancer types
      • High resolution images of protein array slides (up to 1000 participant tumor samples per slide) and raw signals per slide
      • TIFF, tab-delimited TXT (signal values, dilution curves, normalized expression values)
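The file formats named in the list above can serve as a first-pass hint about what a downloaded file contains. The following is a minimal sketch, not an official mapping: the category labels and the helper name `categorize` are assumptions made for illustration, and tab-delimited TXT/TSV files are left generic because the list uses them for several modalities.

```python
from pathlib import Path

# Illustrative mapping from file extension to the data types listed above.
FORMAT_TO_CATEGORY = {
    ".xml": "clinical or biospecimen record",
    ".pdf": "pathology report",
    ".vcf": "DNA mutation calls",
    ".maf": "DNA mutation calls",
    ".svs": "whole slide image",
    ".dcm": "radiological image",
    ".tif": "protein array slide image",
    ".tiff": "protein array slide image",
    ".mat": "copy number array data",
    ".txt": "tab-delimited table (modality depends on content)",
    ".tsv": "tab-delimited table (modality depends on content)",
}


def categorize(filename):
    """Guess a data category from the file extension (hypothetical helper)."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Ignore a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        return "unknown"
    return FORMAT_TO_CATEGORY.get(suffixes[-1], "unknown")


if __name__ == "__main__":
    for name in ["slide_01.svs", "calls.vcf.gz", "scan_0001.dcm", "notes"]:
        print(name, "->", categorize(name))
```

An extension is only a hint. A real pipeline would confirm the modality from portal metadata or the file contents.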

NIH / NCI-hosted Datasets

| Ser | Cancer Site | #Cases | Primary Publication (#Cases Studied) | Clinical | Genomics | Proteomics | Pathology | Radiology |
|---|---|---|---|---|---|---|---|---|
| 1 | Acute Myeloid Leukemia (TCGA-LAML, CPTAC-AML) | 200 | NEJM 2013 (200) | 135 Cases (TCGA-LAML) | 135 Cases (TCGA-LAML) | 41 Cases | 120 svs | |
| 2 | Adrenocortical Carcinoma (TCGA-ACC) | 92 | Cancer Cell 2016 (91) | 92 Cases (TCGA-ACC) | 92 Cases (TCGA-ACC) | | 323 svs | |
| 3 | Bladder Urothelial Carcinoma | 412 | Nature 2014, Cell 2017 (123) | 408 Cases (TCGA-BLCA) | 408 Cases (TCGA-BLCA) | | 926 svs | TCGA-BLCA: 111,781 imgs (CT,CR,MR,PT,DX), 58GB size |
| 4 | Breast Ductal Carcinoma | 778 | Nature 2012 (430) | 1036 Cases (TCGA-BRCA) | 1036 Cases (TCGA-BRCA) | | 3,111 svs | TCGA-BRCA: 230,167 imgs (MR,MG,CT), 88GB size |
| 5 | Breast Lobular Carcinoma | 201 | Cell 2015 (127) | 1036 Cases (TCGA-BRCA) | 1036 Cases (TCGA-BRCA) | | 3,111 svs | TCGA-BRCA: 230,167 imgs (MR,MG,CT), 88GB size |
| 6 | Cervical Carcinoma | 307 | Nature 2017 (228) | 305 Cases (TCGA-CESC) | 305 Cases (TCGA-CESC) | | 604 svs | TCGA-CESC: 19,135 imgs (MR), 9.5GB size |
| 7 | Cholangiocarcinoma (TCGA-CHOL) | 51 | Cell Reports 2017 (38) | 355 Cases (TCGA-CHOL) | 355 Cases (TCGA-CHOL) | | 110 svs | |
| 8 | Colorectal Adenocarcinoma | 633 | Nature 2012 (276) | 458 Cases (TCGA-COAD) | 458 Cases (TCGA-COAD) | | 1,442 svs | TCGA-COAD: 8,387 imgs (CT), 4.5GB size |
| 9 | Esophageal Carcinoma | 185 | Nature 2017 (164) | 183 Cases (TCGA-ESCA) | 183 Cases (TCGA-ESCA) | | 396 svs | TCGA-ESCA: 20,593 imgs (CT), 11GB size |
| 10 | Stomach/Gastric Adenocarcinoma | 443 | Nature 2014 (295) | 437 Cases (TCGA-STAD) | 437 Cases (TCGA-STAD) | | 1,197 svs | TCGA-STAD: 43,908 imgs (CT), 23.3GB size |
| 11 | Glioblastoma Multiforme | 617 | Nature 2008, Cell 2013 (206) | 523 Cases (TCGA-GBM) | 523 Cases (TCGA-GBM) | 100 Cases | 2,053 svs | TCGA-GBM: 481,158 imgs (CT,MR,DX), 73.5GB size |
| 12 | Head and Neck Squamous Cell Carcinoma | 528 | Nature 2015 (279) | 523 Cases (TCGA-HNSC) | 523 Cases (TCGA-HNSC) | | 1,263 svs | TCGA-HNSC: 270,376 imgs (CT,MR,PET,RTDOSE,RTPLAN,RTSTRUCT), 130GB size |
| 13 | Liver Hepatocellular Carcinoma | 377 | Cell 2017 (363) | 375 Cases (TCGA-LIHC) | 375 Cases (TCGA-LIHC) | | 870 svs | TCGA-LIHC: 125,397 imgs (CT,MR,PT), 52.5GB size |
| 14 | Kidney Chromophobe Carcinoma | 113 | Cancer Cell 2014 (66) | 66 Cases (TCGA-KICH) | 66 Cases (TCGA-KICH) | | 326 svs | TCGA-KICH: 9,221 imgs (CT,MR), 4.2GB size |
| 15 | Kidney Clear Cell Carcinoma | 537 | Nature 2013 (446) | 523 Cases (TCGA-KIRC) | 523 Cases (TCGA-KIRC) | | 2,173 svs | TCGA-KIRC: 192,581 imgs (CT,MR), 91.6GB size |
| 16 | Kidney Papillary Cell Carcinoma | 291 | NEJM 2016 (161) | 289 Cases (TCGA-KIRP) | 289 Cases (TCGA-KIRP) | | 773 svs | TCGA-KIRP: 26,667 imgs (CT,MR,PT), 9.6GB size |
| 17 | Low Grade Glioma | 516 | NEJM 2015 (293) | 509 Cases (TCGA-LGG) | 509 Cases (TCGA-LGG) | | 1,572 svs | TCGA-LGG: 241,183 imgs (CT,MR), 42.8GB size |
| 18 | Lung Adenocarcinoma | 585 | Nature 2014, Nature Genetics 2016 (230) | 563 Cases (TCGA-LUAD) | 563 Cases (TCGA-LUAD) | 111 Cases | 1,608 svs | TCGA-LUAD: 48,931 imgs (CT,PT,NM), 18.3GB size |
| 19 | Lung Squamous Cell Carcinoma | 504 | Nature 2012, Nature Genetics 2016 (178) | 501 Cases (TCGA-LUSC) | 501 Cases (TCGA-LUSC) | 118 Cases | 1,612 svs | TCGA-LUSC: 36,518 imgs (CT,PET,NM), 14GB size |
| 20 | Mesothelioma (TCGA-MESO) | 74 | Cancer Discovery 2018 (87) | 85 Cases (TCGA-MESO) | 85 Cases (TCGA-MESO) | | 175 svs | |
| 21 | Ovarian Serous Adenocarcinoma | 608 | Nature 2011 (489) | 570 Cases (TCGA-OV) | 570 Cases (TCGA-OV) | | 1,481 svs | TCGA-OV: 53,662 imgs (CT), 28.3GB size |
| 22 | Pancreatic Ductal Adenocarcinoma (TCGA-PAAD, CPTAC-PDA) | 185 | Cancer Cell 2017 (150) | 173 Cases (TCGA-PAAD) | 173 Cases (TCGA-PAAD) | 166 Cases | 466 svs, 557 svs | From CPTAC-PDA: 105,546 imgs (CR,CT,MR,PT,RF,US,XA), 50.8GB size |
| 23 | Paraganglioma & Pheochromocytoma (TCGA-PCPG) | 179 | Cancer Cell 2017 (173) | 169 Cases (TCGA-PCPG) | 169 Cases (TCGA-PCPG) | | 385 svs | |
| 24 | Prostate Adenocarcinoma | 500 | Cell 2015 (333) | 469 Cases (TCGA-PRAD) | 469 Cases (TCGA-PRAD) | | 1,172 svs | TCGA-PRAD: 16,790 imgs (CT,PT,MR), 3.74GB size |
| 25 | Sarcoma | 261 | Cell 2017 (206) | 255 Cases (TCGA-SARC) | 255 Cases (TCGA-SARC) | | 890 svs | TCGA-SARC: 5,653 imgs (CT,MR), 2.8GB size |
| 26 | Skin Cutaneous Melanoma (TCGA-SKCM, CPTAC-CM) | 470 | Cell 2015 (331) | 469 Cases (TCGA-SKCM) | 469 Cases (TCGA-SKCM) | | 950 svs, 404 svs | From CPTAC-CM: 32,103 imgs (CT,MR,CR,PT), 14GB size |
| 27 | Testicular Germ Cell Cancer (TCGA-TGCT) | 150 | Cell Reports 2018 (137) | 150 Cases (TCGA-TGCT) | 150 Cases (TCGA-TGCT) | | 413 svs | |
| 28 | Thymoma (TCGA-THYM) | 124 | Cancer Cell 2018 (117) | 97 Cases (TCGA-THYM) | 97 Cases (TCGA-THYM) | | 318 svs | |
| 29 | Thyroid Papillary Carcinoma | 507 | Cell 2014 (496) | 473 Cases (TCGA-THCA) | 473 Cases (TCGA-THCA) | | 1,158 svs | TCGA-THCA: 2,780 imgs (CT,PET), 1.16GB size |
| 30 | Uterine Carcinosarcoma (TCGA-UCS) | 57 | Cancer Cell 2017 (57) | 57 Cases (TCGA-UCS) | 57 Cases (TCGA-UCS) | | 154 svs | |
| 31 | Uterine Corpus Endometrioid Carcinoma | 560 | Nature 2013 (373) | 542 Cases (TCGA-UCEC) | 542 Cases (TCGA-UCEC) | 104 Cases | 1,371 svs | TCGA-UCEC: 75,829 imgs (CT,CR,MR,PT), 36.1GB size |
| 32 | Uveal Melanoma (TCGA-UVM) | 80 | Cancer Cell 2017 (80) | 80 Cases (TCGA-UVM) | 80 Cases (TCGA-UVM) | | 150 svs | |
| 33 | Rectum Adenocarcinoma | | | 170 Cases (TCGA-READ) | 170 Cases (TCGA-READ) | | 530 svs | TCGA-READ: 1,796 imgs (CT,MR), 279MB size |
| 34 | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | | | | | | 103 svs, 246 svs | |
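The radiology entries in the table above follow a recurring pattern: collection name, image count, modality codes in parentheses, and total size. To compare collections programmatically, a small parser can turn each entry into a structured record. This is a sketch under that assumed pattern; `RadiologySummary` and `parse_radiology_cell` are names invented for the example.

```python
import re
from dataclasses import dataclass

# Matches entries such as "TCGA-BLCA: 111,781 imgs (CT,CR,MR,PT,DX), 58GB size",
# optionally prefixed with "From " and tolerating a doubled colon.
_PATTERN = re.compile(
    r"^(?:From\s+)?(?P<collection>[A-Za-z0-9-]+):+\s*"
    r"(?P<images>[\d,]+)\s+imgs\s+"
    r"\((?P<modalities>[^)]*)\),\s*"
    r"(?P<size>\d+(?:\.\d+)?)\s*(?P<unit>MB|GB)\s+size$"
)


@dataclass
class RadiologySummary:
    collection: str
    image_count: int
    modalities: list
    size_value: float
    size_unit: str


def parse_radiology_cell(cell):
    """Parse one radiology entry; return None if it does not fit the pattern."""
    match = _PATTERN.match(cell.strip())
    if match is None:
        return None
    return RadiologySummary(
        collection=match["collection"],
        image_count=int(match["images"].replace(",", "")),
        modalities=[m.strip() for m in match["modalities"].split(",") if m.strip()],
        size_value=float(match["size"]),
        size_unit=match["unit"],
    )


if __name__ == "__main__":
    print(parse_radiology_cell("TCGA-READ: 1,796 imgs (CT,MR), 279MB size"))
```

The size is kept as a value and unit pair and is not converted, since the table does not say whether sizes are decimal or binary.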

Other Sources of Data

| Organ | Disease | Name | Access | Images | Reference |
|---|---|---|---|---|---|
| Multiple | Multi | UKBiobank | RC | MRI, DXA | https://www.ukbiobank.ac.uk/ |
| Multiple | Multi | Grand-Challenges | OA | Multi-domain | https://grand-challenge.org |
| Multiple | Multi | Kaggle | OA | Multi-domain | https://www.kaggle.com |
| Multiple | Multi | VISCERAL: Visual Concept Extraction Challenge in Radiology | RC | Multi-domain | http://www.visceral.eu/benchmarks |
| Multiple | Multi | Medical Segmentation Decathlon | OA/RC | CT, MRI | http://medicaldecathlon.com |
| Brain | Multi | OpenNeuro | OA/RC | Multi-domain | https://openneuro.org |
| Brain | Multi | Image and Data Archive (IDA) | OA/RC | s/f/dMRI, CT/PET/SPECT | https://ida.loni.usc.edu |
| Brain | Normal, dementia, Alzheimer’s | OASIS Brains Dataset | OA | MRI | https://www.oasis-brains.org |
| Brain | Multi | NITRC: NeuroImaging Tools and Resources Collaboratory | OA | s/fMRI | https://nitrc.org |
| Brain | TBI | The Federal Interagency TBI Research (FITBIR) | RC | MRI, PET, Contrast | https://fitbir.nih.gov |
| Brain | TBI, Stroke | CQ500 | OA/RC | CT | http://headctstudy.qure.ai/dataset |
| Brain | Multi | NDA | RC | MRI | https://nda.nih.gov |
| Brain | Multi | Connectome | RC | sMRI, fMRI | https://www.humanconnectome.org |
| Breast | Cancer screening | MIAS mini-database | OA | MG, US | http://peipa.essex.ac.uk/info/mias.html |
| Breast | Cancer screening | BCDR | RC | MG, US | https://bcdr.eu |
| Breast | Cancer | DDSM | OA | MG | http://www.eng.usf.edu/cvprg/Mammography/Database.html |
| Breast | Cancer | OMI-DB | RC | MG | https://medphys.royalsurrey.nhs.uk/omidb |
| Breast | Cancer | INbreast | OA/RC | MG | http://medicalresearch.inescporto.pt/breastresearch/index.php/Get_INbreast_Database |
| Cardiac | Clinical routine care | EchoNet-Dynamic | OA/RC | Echocardiogram videos | https://echonet.github.io/dynamic |
| Cardiac | Multi-abnormal | CAMUS project | OA/RC | Echocardiogram | https://www.creatis.insa-lyon.fr/Challenge/camus |
| Cardiac | Multi | EuCanShare | RC | MRI | http://www.eucanshare.eu |
| Cardiac | Multi | Cardiac Atlas Project | OA/RC | MRI | http://www.cardiacatlas.org |
| Full body | Healthy, unknown | Visible Human Project (VHP) | OA | CT, MRI | https://www.nlm.nih.gov/research/visible |
| Lung | Thorax | NHS Chest X-ray NIHC | OA | X-ray | https://nihcc.app.box.com/v/ChestXray-NIHCC |
| Lung | Multi | Cornell Engineering: Vision and Image Analysis lab | OA | CT | http://www.via.cornell.edu/databases |
| Lung | COVID19 | MosMedData | OA | CT | https://mosmed.ai/en |
| Lung | COVID19 | COVID-19 CT segmentation | OA | CT | http://medicalsegmentation.com/covid19 |
| Lung | COVID19 | BIMCV COVID-19 | OA | CT, CXR | https://github.com/BIMCV-CSUSP/BIMCV-COVID-19 |
| Lung | COVID19 | COVID-19 Image Data Collection | OA | CT, CXR | https://github.com/ieee8023/covid-chestxray-dataset, https://josephpcohen.com/w/public-covid19-dataset/ |
| Lung | COVID19 | COVID-19 Chest X-ray Dataset Initiative | OA | CXR | https://github.com/agchung/Figure1-COVID-chestxray-dataset |
| Retina | Multi | STARE: Structured Analysis of the Retina | OA | Retinal fundus | http://cecas.clemson.edu/~ahoover/stare |
| Retina | Diabetes | CHASE_DB1 | OA | Retinal fundus | https://blogs.kingston.ac.uk/retinal/chasedb1 |
| Retina | Diabetes | High-Resolution Fundus (HRF) Image Database | OA | Retinal fundus | https://www5.cs.fau.de/research/data/fundus-images |
| Skin | Lesion | International Skin Imaging Collaboration (ISIC) | OA | Digital images | https://www.isic-archive.com |

Abbreviations

| Ser | Abbreviation | Long |
|---|---|---|
| 1 | NM | Nuclear medicine |
| 2 | CT | Computerized Tomography |
| 3 | CR | Computed Radiography |
| 4 | PET, PT | Positron Emission Tomography |
| 5 | MR | Magnetic Resonance |
| 6 | MG | Mammography |
| 7 | DX | Digital Radiography |
| 8 | RF | Radio Fluoroscopy |
| 9 | US | Ultrasound |
| 10 | XA | X-Ray Angiography |
| 11 | RTDOSE | Radiotherapy Dose |
| 12 | RTSTRUCT | Radiotherapy Structure Set |
| 13 | RTPLAN | Radiotherapy Plan |
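The abbreviation table above can be used as a lookup to expand the comma-separated modality codes that appear in the radiology entries. The sketch below copies the table into a dictionary; `expand_modalities` is a hypothetical helper, and codes not in the table are reported as unknown and never guessed.

```python
# Abbreviations copied from the table above; PET and PT share one long name.
MODALITY_NAMES = {
    "NM": "Nuclear medicine",
    "CT": "Computerized Tomography",
    "CR": "Computed Radiography",
    "PET": "Positron Emission Tomography",
    "PT": "Positron Emission Tomography",
    "MR": "Magnetic Resonance",
    "MG": "Mammography",
    "DX": "Digital Radiography",
    "RF": "Radio Fluoroscopy",
    "US": "Ultrasound",
    "XA": "X-Ray Angiography",
    "RTDOSE": "Radiotherapy Dose",
    "RTSTRUCT": "Radiotherapy Structure Set",
    "RTPLAN": "Radiotherapy Plan",
}


def expand_modalities(codes):
    """Expand a comma-separated code string such as "CT,MR,PT" into long names."""
    names = []
    for raw in codes.split(","):
        code = raw.strip()
        if not code:
            continue  # skip empty items, e.g. from a trailing comma
        names.append(MODALITY_NAMES.get(code.upper(), f"unknown ({code})"))
    return names


if __name__ == "__main__":
    print(expand_modalities("CT,MR,PT"))
```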
