

OpenPBTA-analysis


OpenPBTA is now published:

Shapiro, J.A., Gaonkar, K.S., Spielman, S.J., Savonen, C.L., Bethell, C.J., Jin, R., Rathi, K.S., Zhu, Y., Egolf, L.E., Farrow, B.K., et al. (2023). OpenPBTA: The Open Pediatric Brain Tumor Atlas. Cell Genom., 100340. doi:10.1016/j.xgen.2023.100340.

We will no longer be accepting contributions to this repository.

Please check out d3b-center/OpenPedCan-analysis for updates to PBTA data as well as other data sources.


Pediatric brain tumors are the most common solid tumors and the leading cause of cancer-related death in children. Our ability to understand and successfully treat these diseases is hindered by small sample sizes due to the overall rarity of unique molecular subtypes and tainted grouped analyses resulting from misclassification. In September of 2018, the Children's Brain Tumor Network (CBTN) released the Pediatric Brain Tumor Atlas (PBTA), a genomic dataset (whole genome sequencing, whole exome sequencing, RNA sequencing, proteomic, and clinical data) for nearly 1,000 tumors, available from the Gabriella Miller Kids First Portal.

The Open Pediatric Brain Tumor Atlas (OpenPBTA) Project is a global open science initiative to comprehensively define the molecular landscape of tumors of 943 patients from the CBTN and the PNOC003 DIPG clinical trial from the Pacific Pediatric Neuro-Oncology Consortium through real-time, collaborative analyses and collaborative manuscript writing on GitHub.

The OpenPBTA operates on a pull request model to accept contributions from community participants. The maintainers have set up continuous integration software to confirm the reproducibility of analyses within the project’s Docker container. The collaborative manuscript is authored using Manubot software to provide an up-to-date public version of the manuscript. The project maintainers include scientists from Alex's Lemonade Stand Foundation's Childhood Cancer Data Lab and the Center for Data-Driven Discovery in Biomedicine at the Children's Hospital of Philadelphia. We invite researchers to join OpenPBTA to help rigorously characterize the genomic landscape of these diseases to enable more rapid discovery of additional mechanisms contributing to the pathogenesis of pediatric brain and spinal cord tumors and overall accelerate clinical translation on behalf of patients.

New to the project? Please be sure to read the following documentation before contributing:

  1. Learn about the fundamental data used for this project in doc/data-formats.md and doc/data-files-description.md
  2. See what analyses are being performed in analyses/README.md
  3. Read the remainder of this README document in full.
  4. Read our contributing guidelines in CONTRIBUTING.md in full.


Data Description

The OpenPBTA dataset includes gene expression, fusion, somatic mutation, copy number, and structural variant results in combined TSV or matrix format.

Below is a summary of biospecimens by sequencing strategy:

Experimental Strategy    Normal   Tumor
Targeted DNA Panel            1       1
RNA-Seq                       0    1036
WGS                         801     940
WXS                          31      31

All sequencing was performed on nucleic acids extracted from fresh-frozen tissues using paired-end strategies. The manuscript methods section has additional details.

Below is a detailed table of broad histologies for the 1036 RNA-Seq biospecimens:

Broad Histology N
Benign tumor 34
Choroid plexus tumor 11
Diffuse astrocytic and oligodendroglial tumor 191
Embryonal tumor 184
Ependymal tumor 93
Germ cell tumor 13
Histiocytic tumor 6
Low-grade astrocytic tumor 303
Lymphoma 1
Melanocytic tumor 1
Meningioma 29
Mesenchymal non-meningothelial tumor 25
Metastatic secondary tumors 5
Neuronal and mixed neuronal-glial tumor 34
Non-CNS tumor 1
Non-tumor 3
Other astrocytic tumor 3
Other tumor 1
Pre-cancerous lesion 14
Tumor of cranial and paraspinal nerves 44
Tumor of pineal region 5
Tumors of sellar region 35

Below is a table of number of tumor biospecimens by phase of therapy (DNA and RNA):

Phase of Therapy N
Initial CNS Tumor 1520
Progressive 302
Progressive Disease Post-Mortem 13
Recurrence 136
Second Malignancy 35
Unavailable 2

How to Obtain OpenPBTA Data

We are releasing this dataset on both CAVATICA and AWS S3. Users performing analyses should always refer to the symlinks in the data/ directory and not files within the release folder, as an updated release may be produced before a publication is prepared.

The data formats and caveats are described in more detail in doc/data-formats.md. For brief descriptions of the data files, see the data-files-description.md file included in the download.

Use the data issue template to file issues if you have questions about or identify issues with OpenPBTA data.

Data Access via Download Script

We have created a shell script that will download the latest release from AWS S3. macOS users must install md5sum before running the download script the first time. This can be installed with homebrew via the command brew install coreutils or conda/miniconda via the command conda install -c conda-forge coreutils. Note: the download-data.sh script now has the ability to skip downloads of unchanged files, but if you previously installed md5sum via brew you'll need to run brew unlink md5sha1sum && brew install coreutils first to take advantage of this new feature.
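As a minimal illustration of the md5sum-based integrity check the download script performs (the file names here are illustrative, not the actual release files):

```shell
# Create a file, record its checksum, then verify it -- the same
# md5sum -c pattern the download script relies on.
echo "hello" > example.txt
md5sum example.txt > md5sums.txt
md5sum -c md5sums.txt   # prints "example.txt: OK" on success
```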

Once this has been done, run bash download-data.sh to acquire the latest release. This will create symlinks in data/ to the latest files. It's safe to re-run bash download-data.sh to check that you have the most recent release of the data. We will update the default release number whenever we produce a new release.
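The symlinking behavior can be sketched as follows; the release folder and file names below are hypothetical placeholders, not actual release contents:

```shell
# A versioned copy lives under data/release-*; data/ holds a symlink to it,
# so analysis code that reads data/<file> keeps working across release updates.
mkdir -p data/release-v00
echo "sample data" > data/release-v00/example-file.tsv
ln -sfn release-v00/example-file.tsv data/example-file.tsv
cat data/example-file.tsv   # reads through the symlink: "sample data"
```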

Data Access via CAVATICA

For any user registered on CAVATICA, the OpenPBTA data can be accessed from the CAVATICA public project below:

The release folder structure in CAVATICA mirrors that on AWS. Users downloading via CAVATICA should place the data files within the data/release* folder and then create symlinks to those files within data/.

How to Participate

Join the Cancer Data Science Slack

Have general questions or need help getting started using GitHub? You can join the Cancer Data Science Slack to connect with OpenPBTA organizers, other project participants, and the broader cancer data science community. Sign up and join the #open-pbta channel to get started!

Planned Analyses

There are certain analyses that we have planned or that others have proposed, but which nobody is currently in charge of completing. Check the existing issues to identify these. If you would like to take on a planned analysis, please comment on the issue noting your interest in tackling the issue in question. Ask clarifying questions to understand the current scope and goals. Then propose a potential solution. If the solution aligns with the goals, we will ask you to go ahead and start to implement the solution. You should provide updates to your progress in the issue. When you file a pull request with your solution, you should note that it closes the issue in question.

Proposing a New Analysis

In addition to the planned analyses, we welcome contributors who wish to propose their own analyses of this dataset as part of the OpenPBTA project. Check the existing issues before proposing an analysis to see if something similar is already planned. If there is not a similar planned analysis, create a new issue. The ideal issue will describe the scientific goals of the analysis, the planned methods to address the scientific goals, the input data that is required for the planned methods, and a proposed timeline for the analysis. Project maintainers will interact on the issue to clarify any questions or raise any potential concerns.

Implementing an Analysis

This section describes the general workflow for implementing analytical code; more details are described below. The first step is to identify an existing analysis or propose a new analysis, engage with the project maintainers to clarify the goals of the analysis, and then get the go-ahead to move forward with the analysis.

Analytical Code and Output

You can perform your analyses via a script (R or Python) or via a notebook (R Markdown or Jupyter). Your analyses should produce one or more artifacts. Artifacts include vector or high-resolution figures sufficient for inclusion in a manuscript, as well as new summarizations of the data (tables, etc.) intended either for use in subsequent analyses or for distribution with the manuscript.

Software Dependencies

Analyses should be performed within the project's Docker container. We use a single monolithic container in these analyses for ease of use. If you need software that is not included, please edit the Dockerfile to install the relevant software or file a new issue on this repository requesting assistance.

Pull Request Model

Analyses are added to this repository via Pull Requests. Please read the Pull Request section of the contribution guidelines carefully. We are using continuous integration software applied to the supplied test datasets to confirm that the analysis can be carried out successfully within the Docker container.

How to Add an Analysis

Users performing analyses should always refer to the symlinks in the data/ directory and not files within the release folder, as an updated release may be produced before a publication is prepared.

Folder Structure

Our folder structure is designed to separate each analysis into its own set of notebooks that are independent of other analyses. Within the analyses directory, create a folder for your analysis. Choose a name that is unique from other analyses and somewhat detailed. For example, instead of gene-expression, choose gene-expression-clustering if you are clustering samples by their gene expression values.

You should assume that any data files are in the ../../data directory and that their file names match what the download-data.sh script produces. These files should be read in at their relative path, so that we can re-run analyses if the underlying data change.

Files that are primarily graphic should be placed in a plots subdirectory and should adhere to the color palette guide. Files that are primarily tabular results files should be placed in a results subdirectory. Intermediate files that are useful within the processing steps but that do not represent final results should be placed in ../../scratch/. It is safe to assume that files placed in ../../scratch will be available to all analyses within the same folder. It is not safe to assume that files placed in ../../scratch will be available from analyses in a different folder.

An example highlighting a new-analysis directory is shown below. The directory is placed alongside existing analyses within the analyses directory. In this case, the author of the analysis has run their workflows in R Markdown notebooks. This is denoted with the .Rmd suffix. However, the author could have used Jupyter notebooks, R scripts, or another scriptable solution. The author has produced their output figures as .pdf files. We have a preference for vector graphics as PDF files, though other forms of vector graphics are also appropriate. The results folder contains a tabular summary as a comma-separated values file. We expect that the file suffix (.csv, .tsv) accurately denotes the format of the added files. The author has also included a README.md (see Documenting Your Analysis).

OpenPBTA-analysis
├── CONTRIBUTING.md
├── README.md
├── analyses
│   ├── existing-analysis-1
│   └── new-analysis
│       ├── 01-preprocess-data.Rmd
│       ├── 02-run-analyses.Rmd
│       ├── 03-make-figures.Rmd
│       ├── README.md
│       ├── plots
│       │   ├── figure1.pdf
│       │   └── figure2.pdf
│       ├── results
│       │   └── tabular_summary.csv
│       └── run-new-analysis.sh
├── data
└── scratch

Documenting Your Analysis

A goal of the OpenPBTA project is to create a collection of workflows that are commonly used for atlas papers. As such, documenting your analytical code via comments and including information summarizing the purpose of your analysis is important.

When you file the first pull request creating a new analysis module, add your module to the Modules At A Glance table. This table contains fields for the directory name, what input files are required, a short description, and any files that you expect other analyses will rely on. As your analysis develops and input or output files change, please check that this table remains up to date. This step is included in the pull request reproducibility checklist.

When an analysis module contains multiple steps or is nearing completion, add a README.md file to the folder that summarizes the purpose of the module, notes any known limitations or required updates, and includes examples of how to run the analyses.

Analysis Script Numbering

As shown above, analysis scripts within a folder should be numbered from 01 and are intended to be run in order. If the script produces any intermediate files, these files should be placed in ../../scratch, which is used as described above. A shell script that runs all analytical code in the intended order should be added to the analysis directory (e.g. run-new-analysis.sh above). See the continuous integration instructions for adding analyses with multiple steps for more information.

Output Expectations

The CI system that we use will generate, as artifacts, the contents of the analyses directory applied over a small test dataset. Our goal is to capture all of the outputs that will be used for the OpenPBTA-manuscript as artifacts. Files that are primarily graphic should be placed in a plots subdirectory of the analysis's folder. Plots should use the specified color palettes for this project. See more specific instructions on how to use the color palette here. Files that are primarily tabular results files should be placed in a results subdirectory of the analysis's folder. Files that are intermediate, which means that they are useful within an analysis but do not provide outputs intended for tables, figures, or supplementary tables or figures of the OpenPBTA-manuscript, should be placed in ../../scratch.

Docker Image

We build our project Docker image from a versioned tidyverse image from the Rocker Project (v3.6.0).

To add dependencies that are required for your analysis to the project Docker image, you must alter the project Dockerfile.

  • R packages installed on this image will be installed from an MRAN snapshot corresponding to the last day that R 3.6.0 was the most recent release (ref).
    • Installing most packages, from CRAN or Bioconductor, should be done with our install_bioc.R script, which will ensure that the proper MRAN snapshot is used. BiocManager::install() should not be used, as it will not install from MRAN.
    • R packages that are not available in the MRAN snapshot can be installed via GitHub using our install_github.R script, with the commit specified by the --ref argument.
      • To avoid rate limits by GitHub when installing these packages, we use an access token which is passed to the build environment via Docker secrets. To use this token, your installation step should start with RUN --mount=type=secret,id=gh_pat and you should pass the argument --pat_file /run/secrets/gh_pat to the install_github.R script.
  • Python packages should be installed with pip3 install with version numbers for all packages and dependencies specified.
    • As a secondary check, we maintain a requirements.txt file to check versions of all python packages and dependencies.
    • When adding a new package, make sure that all dependencies are also added; every package should appear with a specified version both in the Dockerfile and requirements.txt.
  • Other software can be installed with apt-get, but this should never be used for R packages.
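As a sketch, pinned installations in the Dockerfile might look like the following; all package names and versions here are illustrative, not taken from the actual Dockerfile:

```dockerfile
# Hypothetical examples only -- names and versions are illustrative.
# Python packages: pin every package and its dependencies.
RUN pip3 install "numpy==1.17.3" "scipy==1.3.2"
# Other (non-R) software via apt-get:
RUN apt-get update && apt-get install -y --no-install-recommends jq
```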

If you need assistance adding a dependency to the Dockerfile, file a new issue on this repository to request help.

Development in the Project Docker Container

The most recent version of the project Docker image, which is pushed to Docker Hub after a pull request gets merged into the master branch, can be obtained via the command line with:

docker pull ccdlopenpbta/open-pbta:latest

If you are a Mac or Windows user, the default limit for memory available to Docker is 2 GB. You will likely need to increase this limit for local development. [Mac documentation, Windows documentation]

RStudio

Using rocker/tidyverse:3.6.0 as our base image allows for development via RStudio in the project Docker container. If you'd like to develop in this manner, you may do so by running the following at the command line, changing <password> to a password of your choosing:

docker run -e PASSWORD=<password> -p 8787:8787 ccdlopenpbta/open-pbta:latest

You can change the volume that the Docker container points to either via the Kitematic GUI or the --volume flag to docker run.

Once you've set the volume, you can navigate to localhost:8787 in your browser if you are a Linux or Mac OS X user. The username for login will be rstudio and the password will be whatever password you set with the docker run command above.

If you are a new user, you may find these instructions for setting up a different Docker container or this guide from Andrew Heiss helpful.

Local Development

While we encourage development within the Docker container, it is also possible to conduct analysis without Docker if that is desired. In this case, it is important to ensure that local or personal settings such as file paths or installed packages and libraries are not assumed in the analysis.

RStudio

We have supplied an RStudio project (OpenPBTA-analysis.Rproj) file at the root of the project to aid in organization and encourage reproducible defaults for analysis. In particular, we do not source .Rprofile files in new sessions or save/restore workspaces.
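The .Rproj fields that enforce these defaults look roughly like this (a partial sketch, not the full project file):

```
RestoreWorkspace: No
SaveWorkspace: No
DisableExecuteRprofile: Yes
```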

Continuous Integration (CI)

We use continuous integration (CI) to ensure that the project Docker image will build if there are any changes introduced to the Dockerfile and that all analysis code will execute.

We have put together data files specifically for the purpose of CI that contain all of the features of the full data files for only a small subset of samples. You can see how this was done by viewing this module. We use the subset files to cut down on the computational resources and time required for testing.

Provided that your analytical code points to the symlinks in the data/ directory per the instructions above, adding the analysis to the CI (see below) will run your analysis on this subset of the data. Do not hardcode sample names in your analytical code: there is no guarantee that those samples will be present in the subset files.
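For instance, rather than hardcoding biospecimen IDs, derive them from the metadata at run time. The file below is a hypothetical minimal stand-in for the real histologies file:

```shell
# Build a tiny stand-in metadata file, then extract the sample IDs from it
# instead of writing them into the analysis code.
printf 'Kids_First_Biospecimen_ID\tbroad_histology\nBS_0001\tEpendymal tumor\n' > histologies.tsv
cut -f1 histologies.tsv | tail -n +2   # prints whichever IDs this dataset contains
```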

Working with the subset files used in CI locally

If you would like to work with the files used in CI locally, e.g., for debugging, you can obtain them from AWS by running the following in the root directory of the project:

bash scripts/download-ci-files.sh

Running this will change the symlinks in data to point to the files in data/testing.

Adding Analyses to CI

For an analysis to be run in CI, it must be added to the Circle CI configuration file, .circleci/config.yml. A new analysis should be added as the last step of the run_analyses section.

Here is an example analysis that simply lists the contents of the data directory that contains the files for the test:

      - run:
          name: List Data Directory Contents
          command: ./scripts/run_in_ci.sh ls data/testing

Using ./scripts/run_in_ci.sh allows you to run your analyses in the project Docker container.

If you wanted to add running an Rscript called cluster-samples.R that was in an analysis folder called gene-expression-clustering, you would add this script to continuous integration with:

      - run:
          name: Cluster Samples
          command: ./scripts/run_in_ci.sh Rscript analyses/gene-expression-clustering/cluster-samples.R

This would run the cluster-samples.R on the subset files that are specifically designed to be used for CI.

Adding Analyses with Multiple Steps

There is a different procedure for adding an analysis comprised of multiple scripts or notebooks to CI. Per the contribution guidelines, each script or notebook should be added via a separate pull request. For each of these pull requests, the individual script or notebook should be added as its own run in the .circleci/config.yml file. This validates that the code being added can be executed at the time of review.

Once all code for an analysis has been reviewed and merged, a final pull request for the analysis that is comprised of the following changes should be filed:

  • A shell script that will run all script and/or notebooks in the analysis module.
  • The multiple runs from the module that are in the config.yml file are replaced with a single run that runs the shell script.

If the gene-expression-clustering analysis above instead required two scripts run sequentially (01-filter-samples.R and 02-cluster-heatmap.R), we would follow the procedure below.

1. File and merge a pull request for adding 01-filter-samples.R to the repository.

In this pull request, we would add the following change to .circleci/config.yml.

      - run:
          name: Filter Samples
          command: ./scripts/run_in_ci.sh Rscript analyses/gene-expression-clustering/01-filter-samples.R
2. File and merge a pull request for adding 02-cluster-heatmap.R to the repository.

In this pull request, we would add the following change to .circleci/config.yml. This would be added below the Filter Samples run.

      - run:
          name: Cluster Samples and Plot Heatmap
          command: ./scripts/run_in_ci.sh Rscript analyses/gene-expression-clustering/02-cluster-heatmap.R
3. File and merge a pull request for the shell script that runs the entirety of gene-expression-clustering.

In this pull request, we would add a shell script that runs 01-filter-samples.R and 02-cluster-heatmap.R. Let's call this shell script run-gene-expression-clustering.sh and place it in the analysis directory analyses/gene-expression-clustering.

The contents of analyses/gene-expression-clustering/run-gene-expression-clustering.sh may look like:

#!/bin/bash
# This script runs the gene-expression-clustering analysis
# Author's Name 2019

set -e
set -o pipefail

Rscript --vanilla analyses/gene-expression-clustering/01-filter-samples.R
Rscript --vanilla analyses/gene-expression-clustering/02-cluster-heatmap.R

We would remove the runs Filter Samples and Cluster Samples and Plot Heatmap from .circleci/config.yml and instead replace them with a single run:

      - run:
          name: Cluster Samples and Plot Heatmap
          command: ./scripts/run_in_ci.sh bash analyses/gene-expression-clustering/run-gene-expression-clustering.sh

Passing variables only in CI

The analyses run in CI use only a small portion of the data so that tests can be run quickly. For some analyses, there will not be enough samples to fully test code without altering certain parameters passed to methods. The preferred way to handle these is to run these analyses through a shell script that specifies default parameters using environment variables. The default parameters should be the ones that are most appropriate for the full set of data. In CI, these will be replaced.

We might decide that it makes the most sense to run an analysis using a more permissive statistical significance threshold in CI so that some "significant" pathways still appear and subsequent code that examines them can be tested. We'd first write code capable of taking command line parameters. In R, we could use optparse to specify these in a script; imagine it's called pathway_sig.R and it contains an option list:

option_list <- list(
  optparse::make_option(
    c("-a", "--alpha"),
    type = "double",
    help = "pathway significance threshold"
  )
)

# Parse the supplied command line arguments into a named list
opt <- optparse::parse_args(optparse::OptionParser(option_list = option_list))

Then we would create a shell script (perhaps run_pathway_sig.sh) that uses a default environment variable. If OPENPBTA_PATHSIG is defined, it will be used. Otherwise, a value of 0.05 is used. Note: the - before 0.05 below is part of the :- default-parameter syntax, not a negative sign.

PATHSIG=${OPENPBTA_PATHSIG:-0.05}

Rscript analyses/my-path/pathway_sig.R --alpha $PATHSIG

We can override this by passing environment variables in .circleci/config.yml. For testing, we might want to use an alpha level of 0.75 so that at least some "significant" pathways appear, which allows testing subsequent code that depends on them. The run command in the .circleci/config.yml is used to specify these parameters.

- run:
    name: run pathway significance tests
    command: OPENPBTA_PATHSIG=0.75 ./scripts/run_in_ci.sh bash analyses/my-path/run_pathway_sig.sh

In this example, OPENPBTA_PATHSIG=0.75 specifies an environment variable OPENPBTA_PATHSIG that is set to 0.75. Any environment variables prefixed with OPENPBTA_ are passed to the specified shell script. Environment variables without this prefix are not passed.
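The ${VAR:-default} substitution used above behaves as follows (the variable name mirrors the hypothetical example):

```shell
# With the variable unset, the default after ':-' is used.
unset OPENPBTA_PATHSIG
echo "${OPENPBTA_PATHSIG:-0.05}"   # prints 0.05

# With the variable set (as CI does), the override wins.
OPENPBTA_PATHSIG=0.75
echo "${OPENPBTA_PATHSIG:-0.05}"   # prints 0.75
```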

Data release preparation

Some scripts in this analysis repository are required for preparing a data release. To learn more, please see these docs.

Contributors

alexslemonade-docs-bot, baileyckelly, bill-amadio, cansavvy, cbethell, cgreene, dmiller15, e-t-k, fingerfen, hbeale, jaclyn-taroni, jashapiro, jharenza, kgaonkar6, komalsrathi, lauraegolf, mkoptyra, nnoureen, pichairaman, runjin326, sjspielman, tkoganti, yangyangclover, yuankunzhu


openpbta-analysis's Issues

Proposed Analysis: poly-A vs stranded

Hi @cbethell! I think this is a great start. One thing we have been struggling with is that some of the RNA-Seq was a poly-A prep while the majority was stranded. I am not sure anyone has yet figured out how to normalize across these two batches (e.g., I think the UCSC Treehouse folks are also very interested in this). In some of our discussions, it has been suggested to subset the dataset for only those genes which will be captured in a poly-A prep and see if that helps, but you will lose a lot of data that way.

Originally posted by @jharenza in #83 (comment)

Proposed Analysis: poly-A vs stranded/PNOC vs. CBTTC

Scientific goals

As previously noted (e.g., in #83 (comment)), the RNA-seq data was compiled from experiments that used two different techniques: poly-A prep vs. stranded. This is confounded with the source of the data: PNOC vs. CBTTC. If we are going to be able to compare these two sources, we will want to investigate whether there is a way to identify comparable parts of the data set.

Proposed methods

Filtering of expression data to identify gene sets suitable for cross-method comparison. It may be that no such comparison can be made at this point, in which case all analyses between the two data sets should be performed separately, in parallel.

Required input data

RNA-seq data (kallisto and RSEM outputs)

Proposed timeline

What is the timeline for the analysis?

Relevant literature

If there is relevant scientific literature, put links to those items here.

Proposed Additional PR checklist

Issue

I made a PR checklist inspired by the OpenPBTA project guidelines. This has been helpful for me personally, and I therefore believe that it can be helpful to other contributors. It would serve as an additional and optional resource.

The checklist is in markdown format and can be found below.

OpenPBTA-analysis

PR checklist

  • Fork the repo.
    You only need to do this step once, continue to the third step of this list if you have already done this.
  • Make the original repo, AlexsLemonade/OpenPBTA-analysis, an upstream repository.
  • Checkout a new branch to make changes on.
    WAIT Is your master branch up to date with the upstream repo?
    Try git status to check.
    If it's not, go here before creating your new branch.
    Remember to git push when you are done.
  • Did you make numerous changes, too many for one pull request?
    Then let's create stacked pull requests.
  • Now that you have committed changes, be sure to write a detailed summary of the changes by following this template.

Now check to ensure you have done the following before filing the PR (remember to push any additional changes):

  • Run a linter (using styler for example)
  • Set the seed (if applicable)
  • Structure of analysis directory meets the Project Structure guidelines
  • Comments and/or documentation up to date
  • Double checked paths
  • Spell check any Rmd file or md file (using mdspell for example)
  • Restart R and run all notebooks fresh and save (be sure to run in the Docker container)
  • Connect the pertinent issues on ZenHub
  • Updated Dockerfile and built Docker image successfully
  • Added analysis to continuous integration

If you have completed all of the applicable steps above, you can request a suitable reviewer and file the PR/draft PR.

Note: Upon approval of all Pull Requests for an analysis, create a final Pull Request for a shell script that runs the entirety of scripts in your analysis directory.

Question

Is it necessary to add this checklist to the repo?

Planned Analysis: Co-Occurence / Mutual Exclusivity

Determining genetic lesions (Mutation, CNV, Fusion) and/or pathways which co-occur or are mutually exclusive across the PBTA. This could help associate lesions with pathways or define potential synthetic lethality relationship.

Contributor friendliness: make project approachable for newcomers

In service of making the project more approachable for new contributors, I'm creating this issue to discuss the use of our issue labels.

From the Planned Analysis section of the README:

We have tagged a subset of these with the label "good first issue". These "good first issues" are ones that have a limited scope, that can be done with the software already present on the Docker container, and that we expect to have few dependencies with other issues.

We do not currently have a 'fleshed-out' Docker image. As a result, I've untagged all issues that were labeled as good first issue for the moment. When we have the container building and ready for use, we should focus on making limited-in-scope issues and tagging those.

I've also recently added the blocked label, which is meant to indicate that an issue is blocked by something that is external to or upstream of this project. We should apply this label to existing issues where appropriate.

Another consideration: should we use issues to indicate priority?

How we add completed analysis modules to CI

Context is #76 (comment); to quote it here:

  • As individual notebooks or scripts are coming in for an analysis module, they should be added to CI as multiple runs. This validates that they can be executed at the time of review.
  • Once an analysis has been reviewed and validated, the last PR for an analysis module should be a shell script used to run everything in the module. That PR also replaces the multiple runs in the config file with a single run that runs the shell script.

I propose that we try this model out with @cbethell's addition of #5 (relevant PRs: #54, #55). If successful, we can update the instructions in the README.

LUMPY file missing data

File(s)

What data file(s) does this issue pertain to?
pbta-sv-lumpy.tsv.gz

Release

What release are you using?
V2/V3

Link to OpenPBTA-manuscript

Put a link to the relevant section of the OpenPBTA manuscript here.

Question/issue

Put your question or report your issue here.
The LUMPY file only contains 171 participants and should include ~700 - we are remerging this file and will include it with the V4 data release. cc: @yuankunzhu @kgaonkar6 @jaclyn-taroni

symlinks instructions in README

File(s)

What data file(s) does this issue pertain to?

Release

What release are you using?

Link to OpenPBTA-manuscript

Put a link to the relevant section of the OpenPBTA manuscript here.

Question/issue

Put your question or report your issue here.
We had some confusion about how to use symlinks with R: the R.utils package installed with the Docker image should have been able to read symlinks but was unable to (see #92), so file.path had to be used instead. It may be good to add some detail to the README about using symlinks with specific types of files in R.

Proposed Analysis: RNA Expression-based Prediction of Sex

Scientific goals

In #73 it's noted that the reported sex of the participants in the study didn't align with information from germline sequencing in 11 cases. It could be interesting to understand how accurately this can be called from gene expression data. For some studies, gene expression data is all that is available. If it can also be accurately called directly from gene expression data, these studies will be better positioned to evaluate their metadata.

Proposed methods

  • Construct an elastic-net logistic regression classifier using gene expression values as features and reported sex as the labels. I suggest elastic net because the signal is expected to be relatively linear: most genes are expected to play little role, but a few sets of genes are likely to be predictive and highly correlated, and it makes sense to spread weights across them to produce a robust predictor.
  • Evaluate using cross-validation with reported labels across the full set.
  • Evaluate using cross-validation with germline-based labels across the full set.
  • Evaluate prediction accuracy using germline sequencing within each histology.
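To make the proposed classifier concrete, here is a toy elastic-net-penalized logistic regression fit by proximal gradient descent (soft-thresholding handles the L1 term). It is a from-scratch Python sketch with fabricated expression values, not the glmnet-style fit one would actually use on the PBTA matrices:

```python
import math

def train_elastic_net_logistic(X, y, alpha=0.1, l1_ratio=0.5,
                               lr=0.1, epochs=500):
    """Tiny elastic-net logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        # gradient of the mean log-loss
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            gb += err / n
            for j in range(p):
                gw[j] += err * xi[j] / n
        b -= lr * gb
        for j in range(p):
            # gradient step including the smooth L2 penalty ...
            wj = w[j] - lr * (gw[j] + alpha * (1 - l1_ratio) * w[j])
            # ... then soft-threshold for the L1 penalty
            t = lr * alpha * l1_ratio
            w[j] = math.copysign(max(abs(wj) - t, 0.0), wj)
    return w, b

def predict(w, b, xi):
    return 1 if b + sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0

# Fabricated data: feature 0 is informative (think XIST), feature 1 is noise
X = [[5.0, 0.2], [4.5, 0.9], [5.2, 0.4], [0.1, 0.8], [0.3, 0.1], [0.2, 0.6]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_elastic_net_logistic(X, y)
preds = [predict(w, b, xi) for xi in X]
```

The real analysis would also standardize features, tune alpha/l1_ratio by cross-validation, and evaluate on held-out folds rather than training data.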

Required input data

For classifier construction:

  • The reported sex in the PBTA histologies file.
  • Gene expression estimates from kallisto, RSEM, or both.

For evaluation:

  • The germline-based sequencing calls.
  • Histologies, so that performance can be broken out by histology as well.

Proposed timeline

I am proposing this analysis but don't have time to do it, so I would leave this as an estimate for someone who decides to take this on.

Relevant literature

There are a number of reports of this being readily discoverable, even with unsupervised methods. These are two from our group where we noticed this, but there are also others from other groups:

Data: Fusion TSV files are not tab-delimited

While working on #33, I read the Arriba file into R like so (removing the full path info for simplicity):

arriba <- readr::read_tsv("pbta-fusion-arriba.tsv.gz")

And the output in the console was:

Parsed with column specification:
cols(
  `gene1 gene2 strand1.gene.fusion. strand2.gene.fusion. breakpoint1 breakpoint2 site1 site2 type direction1 direction2 split_reads1 split_reads2 discordant_mates coverage1 coverage2 confidence closest_genomic_breakpoint1 closest_genomic_breakpoint2 filters fusion_transcript reading_frame peptide_sequence read_identifiers tumor_id` = col_character()
)

Running

arriba <- readr::read_delim("pbta-fusion-arriba.tsv.gz", delim = " ")

yields

Parsed with column specification:
cols(
  .default = col_character(),
  split_reads1 = col_double(),
  split_reads2 = col_double(),
  discordant_mates = col_double(),
  coverage1 = col_double(),
  coverage2 = col_double()
)
See spec(...) for full column specifications.
1 parsing failure.
  row col   expected     actual                                   file
37269  -- 25 columns 24 columns '../../data/pbta-fusion-arriba.tsv.gz'

and

> dim(arriba)
[1] 54279    25

Also checked by gunzipping the file and using less -U. I get a similar pattern with pbta-fusion-starfusion.tsv.gz.


I would expect files with a tsv.gz extension to be tab-delimited. Is it possible to change these files to be TSV files in the interest of smoothing the way for folks using a TSV parser on them?

EDIT: If this is a limitation of the upstream tools, I would suggest changing the extension to txt.gz and noting the format in https://github.com/AlexsLemonade/OpenPBTA-analysis#data-formats + possibly adding something to docs/format.

Someone noted in person that changing the files with something like sed 's/ /\t/g' might be preferable for downstream analysts over making the documentation change.
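The substitution could also be scripted without shelling out to sed, e.g. in Python; the file names below are only for the demonstration (real data would be the pbta-fusion-*.tsv.gz files):

```python
import gzip

def space_to_tab_gz(src, dst):
    """Rewrite a gzipped space-delimited text file as tab-delimited.
    Only safe when no field itself contains a space, as appears to be
    the case for the fusion caller output discussed above."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            fout.write(line.replace(" ", "\t"))

# Tiny demonstration with a hypothetical two-line file
with gzip.open("demo-space.tsv.gz", "wt") as f:
    f.write("gene1 gene2 split_reads1\nKIAA1549 BRAF 12\n")
space_to_tab_gz("demo-space.tsv.gz", "demo-tab.tsv.gz")
with gzip.open("demo-tab.tsv.gz", "rt") as f:
    converted = f.read()
```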

Planned Analysis: Analysis of recurrent fusions

This should probably be partitioned by cancer type and/or molecular subtype. This could be a stacked bar chart (one bar per fusion) for instance with colors representing different cancer types.

Planned Analysis: Druggable & Actionable targets

Summary of druggable/actionable genes via mutation, over-expression, fusion, and synthetic lethality, plus actionable features (e.g., TMB, mutational signatures, PD-L1 expression) across the PBTA. Analysis of what's currently used and what may be some potential opportunities.

duplicate RNA-Seq BS_ID in clinical file

File(s)

What data file(s) does this issue pertain to?

Release

What release are you using?
V2/V3

Link to OpenPBTA-manuscript

Put a link to the relevant section of the OpenPBTA manuscript here.

Question/issue

Put your question or report your issue here.
BS_6DCSD5Y6 is duplicated in the RNA-Seq entries of the clinical file, and this is due to two different ages at diagnosis. We will look into which is correct - we may have a discrepancy between the KF DRC and the PedcBio portal. This will also be fixed with V4. cc: @yuankunzhu

Low number of matched participant identifiers

File(s)

All files

Release

release-v3-20190829

Question/issue

I was attempting to increase the number of matched participants in the subset file to 100 over on #91. I map the biospecimen IDs in each file to the participant IDs:

I find 91 participants in all files

participants_in_all <- base::Reduce(intersect, participant_id_list)

Here are the number of mapped participant ids for each file:

> lapply(participant_id_list, length)
$cnvkit
[1] 939

$controlfreec
[1] 941

$arriba
[1] 1029

$starfusion
[1] 720

$kallisto
[1] 1029

$rsem
[1] 944

$mutect2
[1] 973

$strelka2
[1] 973

$manta
[1] 940

$lumpy
[1] 147

I feel like the number of participants present in all files was higher in release v2, but I don't remember off the top of my head. Alternatively, this strategy for finding matched participants could be changed.

Missing README files in v4

File(s)

README.md
release-notes.md

Release

release-v4-20190909

Question/issue

These files do not seem to be present in the S3 bucket at the location specified in download-data.sh. Note that the script refers to release-notes.md, but the release notes suggest it may be named README.md; neither name seems to be present, however.

vardict and lancet variant calls on deck

File(s)

What data file(s) does this issue pertain to?

Release

What release are you using?

Link to OpenPBTA-manuscript

Put a link to the relevant section of the OpenPBTA manuscript here.

Question/issue

Put your question or report your issue here.
We are currently generating Lancet and VarDict calls and will release them in V5.

Process for CAVATICA access/analysis

We should flesh out this process - giving people access to CAVATICA to analyze data, the ability to use virtual notebooks to do analyses, and then submission of their code/figures/notebooks to GitHub.

Script to generate a subset of the data

It would be really handy if there were a script that could be run over the input data files to produce files containing data for only a small, specified number of matched participants. I'm imagining that we would use this to produce data files with information for 10 or so participants. We would then use this code via CircleCI to test the analytical code that people contribute without having to run over the full set of input data.
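A sketch of what the core of such a script might look like; the file names and participant IDs below are purely hypothetical:

```python
import random

def pick_matched_participants(participants_by_file, n, seed=2019):
    """Choose up to n participants present in every data file.
    The seed keeps the subset reproducible across CI runs."""
    matched = set.intersection(*participants_by_file.values())
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(matched), min(n, len(matched))))

# Hypothetical mapping of data file -> participant IDs it covers
participants_by_file = {
    "pbta-snv-strelka2": {"PT_01", "PT_02", "PT_03", "PT_04"},
    "pbta-fusion-arriba": {"PT_02", "PT_03", "PT_04", "PT_05"},
    "pbta-sv-manta": {"PT_01", "PT_02", "PT_03"},
}
subset = pick_matched_participants(participants_by_file, n=2)
```

The remaining work would be mapping biospecimen IDs to participant IDs for each file format and writing out the filtered rows.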

Planned Analysis: Molecularly subtype all tumors

Without methylation data, the goal of this analysis is to best molecularly subtype all brain tumors based on defining features.

  • Medulloblastomas (MBs) into SHH, WNT, Group 3, and Group 4 - @PichaiRaman has created a classifier here and classified the samples with RNA-Seq.
  • ATRTs into SHH, MYC, TYR. See: 1-s2.0-S1535610816300356-main.pdf
  • Ependymomas into ST-RELA, ST-YAP1, PF-A, and PF-B. Will require some genome-wide copy number examination. See: 0D556E30-D832-4D6F-9F2F-49554F68D56F.pdf
  • High-grade gliomas, according to brain region (if known, eg: midline vs hemispheric) and defining lesion (eg H3F3A K27/28M, G34/5R/V, HIST1H3B K27/28M, IDH-mutant, ACVR1-mutant, PDGFRA, and mesenchymal subtypes). See: WHO 2016 classifications
  • Non-MB embryonal tumors (old PNET classified tumors) into ETMRs and CNS embryonal NOS
  • Oligodendrogliomas as IDH-mutant and 1p/19q codeleted
  • Chordomas - subtype into poorly differentiated (INI1/SMARCB1 loss)

Data analysis/format questions for downstream analysts

CNV

  1. We have ControlFreeC and CNVkit. CNVkit output is VCF – do users want raw output for these or should we convert ControlFreeC output to VCF, then annotate all with AnnotSV (which is what we are using for SV annotation) and then provide annotated tab files?

SV
2. Do we want novoBreak as a 3rd algorithm? We have Manta and will have LUMPY.

Annotated/Merge Files
3. We ran Strelka2 and Mutect2 for somatic calls. Do we want to use VEP for SNVs (the MAFs are currently annotated with it) or SnpEff? We have both, but VEP was provided.
4. Do we want to provide a limited merged tab file in MAF format that includes both Mutect2 and Strelka2 variant calls with a column for “Algorithm”? If so, we need to define the fields - is this better for users?
5. Should we do the same for CNVs and SVs – provide one merge file of all data? Note: these will be large – (22GB +) and won’t be able to be processed locally
6. Should we set up a CAVATICA project for people to work on interactive analyses (RStudio/Jupyter notebooks), or do we think they will do that on their own? We can probably provide some cloud credits if they want to do it on their own - maybe we need instructions?
7. The current plan is for CBTTC and PNOC to be separate merge files, but do we want these to remain separate or be merged? If separate, we need instructions for downstream users to merge all files they download, so I prefer merging all from the start.

SNV
8. Do we want only tumor/normal paired samples, or should we also include tumor-only samples? If yes to tumor-only, what should we use as a PON - normals from cross-brain tumors, normals within disease type (if the N is high enough), or non-cancer pediatric normals?

@gonzolgarcia @samesense @Yiran-Guo @gwaygenomics @jpfeil @kgaonkar6 @LauraEgolf @apexamodi @nathankendsersky @afarrel @jaclyn-taroni @PichaiRaman please weigh in. cc: @yuankunzhu

Planned data release: V5

Currently planned for 24-Sept-2019

Planned addition + changes from @jharenza :

  • Vardict - VEP annotated MAF (related: #103)
  • Lancet – VEP annotated MAF (related: #103)
  • Updated arriba fusion file (related: #92 (comment))
  • RSEM gene level count matrix
  • RSEM transcript level count matrix (related: #14 (comment))
  • Updated lumpy file - to fix T/N columns (related: #27 (comment))

.gitignore skips data file

Somehow the .gitignore file keeps ignoring changes in the data/ path. I saw that you, @cgreene, only skip the .DS_Store file, so I'm not sure what happened. I ended up using git add -f to force-add my commits.

Planned Analysis: Survival Analysis across PBTA

Comparison of genomic/transcriptomic features and derivatives such as TMB, pathway activity, etc. to survival. Determine markers that could add prognostic value to different PBTA cancer types. KM plots for significant and interesting features.
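For the KM plots, the underlying estimator is simple enough to sketch. A from-scratch Python version with a toy three-sample cohort (real analyses would presumably use survival::survfit in R):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.
    times: follow-up time per sample; events: 1 = death observed, 0 = censored.
    Returns (event_time, survival_probability) pairs."""
    at_risk = len(times)
    order = sorted(zip(times, events))
    surv, curve = 1.0, []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = n_at_t = 0
        # group all samples sharing this follow-up time
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            n_at_t += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t
    return curve

# Toy cohort: deaths at t=1 and t=3, one sample censored at t=2
curve = kaplan_meier([1, 2, 3], [1, 0, 1])
```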

Planned Analysis: MultiPLIER LVs + Survival & Variant Analysis

We plan to produce a MultiPLIER LV matrix. We then plan to use that analysis to identify LVs associated with survival as well as those associated with samples that contain alterations in selected genes. This issue is for more fully fleshing out the analysis.

RSEM file mismatched to clinical

File(s)

What data file(s) does this issue pertain to?
pbta-gene-expression-rsem.fpkm.RDS

Release

What release are you using?
V2/V3

Link to OpenPBTA-manuscript

Put a link to the relevant section of the OpenPBTA manuscript here.

Question/issue

Put your question or report your issue here.
I noticed that there are 1123 BS_IDs in this file, but our data has 1028 unique RNA BS_IDs. Additionally, BS_ID BS_XM1AHBDJ is missing from the RSEM file, but present in the kallisto file. We will have to update this in V4.

Proposed Analysis: Sex prediction on PBTA cohort

Scientific goals

What are the scientific goals of the analysis?
Accurately predict sex of PBTA samples for downstream analyses

Proposed methods

What methods do you plan to use to accomplish the scientific goals?
Ratio of Y:X+Y chromosomes
XIST expression in RNA
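A toy illustration of how the two rules could be combined; the cutoffs here are placeholders for illustration, not values derived from PBTA data:

```python
def predict_sex(chrY_frac, xist_fpkm, y_cutoff=0.002, xist_cutoff=1.0):
    """Two-rule sex caller (cutoffs are illustrative placeholders).
    chrY_frac: fraction of normal-DNA reads mapping to chrY (Y / (X + Y)).
    xist_fpkm: XIST expression from RNA-Seq."""
    dna_call = "Male" if chrY_frac > y_cutoff else "Female"
    rna_call = "Female" if xist_fpkm > xist_cutoff else "Male"
    # require the two data types to agree; flag discordant samples for review
    return dna_call if dna_call == rna_call else "Discordant"

calls = [predict_sex(0.02, 0.1),     # male by both rules
         predict_sex(0.0005, 35.0),  # female by both rules
         predict_sex(0.02, 35.0)]    # rules disagree
```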

Required input data

What input data will you use for this analysis?
normal DNA BAMS
RNA-Seq FPKM

Proposed timeline

What is the timeline for the analysis?
One week

Relevant literature

If there is relevant scientific literature, put links to those items here.

Planned Analysis: Filter and Annotate Fusions

Here, we will filter potential artifacts, filter fusions observed in normal tissue, retain high-confidence calls, and annotate with several databases to create a final list of putative driver fusions.

Planned Analysis: Identification of potential causal/driver VUS

Scientific goals

Identify potential variants of unknown significance that may play a role in tumorigenesis.

Proposed methods

Look at frequently recurrent VUS, determine if there is a functional effect on a gene (or it is in a regulatory region), look at VAF and focus on samples without other relevant driver mutations.

Required input data

VCF or MAF files

Proposed timeline

2 weeks

Relevant literature

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6359859/

Data Download Script

We should put together a script to download the data into a defined folder. We should require folks to not modify that folder. We should include dummy germline data in the download if we are not able to distribute it without restrictions so that at least the CI works.

How were "Panel" experimental strategy samples processed?

File(s)

pbta-histologies.tsv and to a lesser extent, pbta-snv-strelka2.vep.maf.gz

Release

release-v2-20190809

Link to OpenPBTA-manuscript

The methods section about data generation is where I would have expected to see information about the "Panel" samples.

Question/issue

What does it mean in pbta-histologies.tsv when an experimental strategy is "Panel"?

For the entire set of metadata, there appear to be only two samples that are called "Panel". Here is their information:

Kids_First_Biospecimen_ID Kids_First_Participant_ID experimental_strategy sample_type composition tumor_descriptor
BS_7KR13R3P PT_V1HNAC2Q Panel Tumor Solid Tissue Diagnosis
BS_WHZT48VG PT_V1HNAC2Q Panel Normal Peripheral Whole Blood NA

More specifically, there is only one tumor sample in the Strelka2 data that is noted as using this strategy: BS_7KR13R3P. The rest of the samples in the Strelka2 data are noted as "WGS" and "WES" which is what I would have expected for all the samples.

What does "Panel" mean regarding these samples? Is it incorrectly labeled? If the experimental strategy for these two samples is actually different, should it be discarded from the data in general?

Data: MAF files have samples that do not have metadata

Problem

13 samples that have data in the MAF files (pbta-snv-mutect2.vep.maf.gz and pbta-snv-strelka2.vep.maf.gz) are missing from the metadata. I am using data release-v2-20190809 and downloaded it via the bash script this morning.

Samples that are missing metadata but have data reported in the MAF files.

"BS_0BA5TZND" 
"BS_0HYD1VHH" 
"BS_0PQGSCJA" 
"BS_3GSGMV4T" 
"BS_3RJ3CDE6" 
"BS_AH51BZ8D" 
"BS_BF95AEXC" 
"BS_D0T6V861" 
"BS_KVPJVJR7" 
"BS_TC8R5HY4" 
"BS_TKF4H7CZ" 
"BS_V1DPR5TD"
"BS_X2G3JMM1"

Here are the steps I took to obtain this list of samples that were missing from the metadata. Let me know if there is something I am missing or misunderstanding.

# The %>% pipe below requires magrittr (or dplyr) to be attached
library(magrittr)

# Read in the metadata
metadata <- readr::read_tsv(file.path("..", "..", "data", "pbta-histologies.tsv"))

# Read in Mutect2 data
mutect2 <- maftools::read.maf(file.path("..", "..", "data", "pbta-snv-mutect2.vep.maf.gz"))

# Read in Strelka2 data
strelka <- maftools::read.maf(file.path("..", "..", "data", "pbta-snv-strelka2.vep.maf.gz"))

# Find out what samples don't have data in the metadata
missing_mutect2_samples <- mutect2@data %>% 
  dplyr::filter(!(Tumor_Sample_Barcode %in% metadata$Kids_First_Biospecimen_ID)) %>%
  dplyr::distinct(Tumor_Sample_Barcode, .keep_all = TRUE) %>%
  dplyr::pull(Tumor_Sample_Barcode)

missing_strelka2_samples <- strelka@data %>% 
  dplyr::filter(!(Tumor_Sample_Barcode %in% metadata$Kids_First_Biospecimen_ID)) %>%
  dplyr::distinct(Tumor_Sample_Barcode, .keep_all = TRUE) %>%
  dplyr::pull(Tumor_Sample_Barcode)

# These are the same 13 samples in both datasets
sort(intersect(missing_mutect2_samples, missing_strelka2_samples))

Next steps:

Were these samples supposed to be removed from the MAF files? (And if so, why?)
OR
Where is the metadata for these 13 samples and how do we get it?
