
genomicsnotebook's Introduction

Genomics Data Analysis with Jupyter Notebooks on Azure


Jupyter Notebook is a great tool for data scientists working on genomics data analysis. In this repo, we demonstrate the use of Azure Notebooks for genomics data analysis with GATK, Picard, Bioconductor and Python libraries.

For more information about Codespaces, please visit the product page.

Here is the list of sample notebooks on this repo:

  1. genomics.ipynb: Analysis from 'uBAM' to a 'structured data table'.
  2. genomicsML.ipynb: Training machine learning models with genomics + clinical data.
  3. genomics-platinum-genomes.ipynb: Accessing Illumina Platinum Genomes data from Azure Open Datasets* and performing initial data analysis.
  4. genomics-reference-genomes.ipynb: Accessing reference genomes from Azure Open Datasets*.
  5. genomics-clinvar.ipynb: Accessing ClinVar data from Azure Open Datasets*.
  6. genomics-giab.ipynb: Accessing Genome in a Bottle data from Azure Open Datasets*.
  7. SnpEff.ipynb: Accessing SnpEff databases from Azure Open Datasets*.
  8. 1000 Genomes.ipynb: Accessing the 1000 Genomes dataset from Azure Open Datasets*.
  9. GATKResourceBundle.ipynb: Accessing the GATK resource bundle from Azure Open Datasets*.
  10. ENCODE.ipynb: Accessing the ENCODE dataset from Azure Open Datasets*.
  11. genomics-OpenCRAVAT.ipynb: Accessing the OpenCRAVAT dataset from Azure Open Datasets* and deploying a built-in Azure Data Science VM for OpenCRAVAT.
  12. Bioconductor.ipynb: Pulling the Bioconductor Docker image from Microsoft Container Registry.
  13. simtotable.ipynb: Simulating NGS data, using Cromwell on Azure OR Microsoft Genomics service for secondary analysis, and converting the gVCF data to a structured data table.
  14. igv_jupyter_extension_sample.ipynb: Downloading a sample VCF file from Azure Open Datasets and using the igv-jupyter extension in a JupyterLab environment.
  15. radiogenomics.ipynb: Combining DICOM, VCF and gene expression data for patient segmentation analysis.
  16. fhir+PacBio.ipynb: Converting synthetic FHIR and PacBio VCF data to Parquet and exploring it with Azure Synapse Analytics.
  17. fhir-vcf-clustering.ipynb: Converting synthetic FHIR and PacBio VCF data to Parquet and exploring it with Azure Synapse Analytics.

*Technical note: Explore Azure Genomics Data Lake with Azure Storage Explorer
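
Most of the Open Datasets notebooks above start by pulling blobs from the public Azure Genomics Data Lake storage accounts. As a rough illustration of that pattern, the minimal sketch below uses azure-storage-blob; the storage account, container and blob prefix here are illustrative assumptions, and the exact values for each dataset are listed in the corresponding notebook and in the Azure Open Datasets catalog.

from azure.storage.blob import ContainerClient

# Illustrative account/container/prefix; check the notebook or the
# Azure Open Datasets catalog for the exact values of each dataset.
container = ContainerClient(
    account_url="https://datasetreferencegenomes.blob.core.windows.net",
    container_name="dataset",
    credential=None,  # the Genomics Data Lake containers allow anonymous read access
)

# List blobs under an assumed prefix and download the first one locally.
blobs = container.list_blobs(name_starts_with="vertebrate_mammalian/Homo_sapiens")
first = next(iter(blobs))
print("Downloading", first.name)

with open(first.name.split("/")[-1], "wb") as out_file:
    out_file.write(container.download_blob(first.name).readall())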

1. Prerequisites

Create and manage Azure Machine Learning workspaces in the Azure portal


For further details on the creation of an Azure ML workspace, please visit this page.

Run the notebook in your workspace

This section uses the cloud notebook server in your workspace for an install-free, pre-configured experience. Use your own environment if you prefer full control over your packages and dependencies.
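
If you go with the cloud notebook server, the workspace connection is already wired up. As a minimal sketch (assuming the v1 azureml-core SDK that the compute instance images ship with), a notebook cell can attach to the workspace like this:

from azureml.core import Workspace

# On a workspace compute instance a config.json is already present,
# so from_config() resolves the workspace without explicit IDs.
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location)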

Follow along with this video or use the detailed steps below to clone and run the tutorial from your workspace.

Watch the video

2. Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

3. References

  1. Jupyter Notebook on Azure
  2. Introduction to Azure Notebooks
  3. GATK
  4. Picard
  5. Azure Machine Learning
  6. Azure Open Datasets
  7. Cromwell on Azure
  8. Bioconductor

genomicsnotebook's People

Contributors

clittlejohn366, erdalcosgun, microsoftopensource, olesya13


genomicsnotebook's Issues

Description of the simulated clinical and phenotypic datasets referred to in `/sample-notebooks/genomicsML.ipynb`

Thank you very much for sharing an informative set of Jupyter Notebooks.

I've been reviewing the Train Machine Learning Models with Genomics + Clinical Data notebook, which uses simulated clinical and phenotypic datasets. However, I couldn't find details on how these datasets are generated.

Could you provide insight into how this data is generated, or direct me to any resources or documentation on this matter?

Thank you very much in advance

Feedback on FHIR > 1_export_data.ipynb

  1. The mnt path is kind of confusing. Where it says USERNAME, it is not my AD username but rather the name of the compute environment, I think. Either way, this path took some digging to figure out and could be clarified in the notebook.

import subprocess

# Generate 10 synthetic patients with Synthea and export the FHIR bundles
# under the compute instance's code mount.
subprocess.run([
    "./run_synthea",
    "-s", "42",
    "-cs", "99",
    "-p", "10",
    "--exporter.baseDirectory=/mnt/batch/tasks/shared/LS_root/mounts/clusters/USERNAME/code",
])
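
One way to avoid guessing the USERNAME/compute-name segment is to print the working directory from a notebook cell on the same compute instance; a minimal sketch:

import os

# On an Azure ML compute instance, notebooks cloned into the workspace live under
# /mnt/batch/tasks/shared/LS_root/mounts/clusters/<compute-instance-name>/code/...,
# so the current working directory shows the exact value to use for
# --exporter.baseDirectory above.
print(os.getcwd())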

  2. Step 2.1 below should come after step 2.4, because at that point the FHIR server has not yet been created. The user first needs to go to Azure API for FHIR and create the FHIR server; then they can do the rest. Creating the server is not described in the instructions.

2.1) Create an "Azure API for FHIR"[3] instance, named <fhir_server>

  • Navigate to https://<fhir_server>.azurehealthcareapis.com/metadata and verify a "Capability Statement" is retrieved.
    That means the FHIR server[3] is running.
  • Set fhir_server in Section 3.1
  • Use RBAC[6]: <fhir_server> left pane "Identity" -> "On" -> "Save"
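
For reference, that capability check can also be scripted; a minimal sketch with requests (the service name is a hypothetical placeholder):

import requests

fhir_server = "my-fhir-server"  # hypothetical name; replace with your own instance
resp = requests.get(f"https://{fhir_server}.azurehealthcareapis.com/metadata")
resp.raise_for_status()
print(resp.json().get("resourceType"))  # expect "CapabilityStatement"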

  3. This line: `for filename in glob(f"/home/azureuser/cloudfiles/data/datastore/synthea/fhir/*.json"):` did not work for me; I had to use the mnt path from the top of the notebook.

  4. Section "4. Set up the FHIR->Synapse Sync Agent":

This notebook section follows the "FHIR to Synapse Sync Agent" tutorial provided in Microsoft's "FHIR Analytics Pipelines" GitHub repository[13].

4.1) Deploy the custom Azure template provided by the "FHIR to Synapse Sync Agent" tutorial[13].

  • Navigate to the GitHub repo by clicking this link.

The GitHub link is no longer valid. I went to that repo, but it's not clear which doc to use for the deployment.

  5. Step "5.3) Convert all PacBio VCFs to TSV":

This step assumes you already have VCF files in a storage account container. You could download the VCFs directly onto the VM and then copy them to the container, or leave them on the VM. Either way, the notebook should not assume the user already has the data.
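
For anyone hitting the same gap, a short upload loop with azure-storage-blob is one way to stage locally downloaded VCFs into the container the notebook expects; the storage account, container name and SAS token below are placeholders, not values from the notebook:

from glob import glob
from pathlib import Path
from azure.storage.blob import ContainerClient

# Placeholder values; point these at your own storage account and container.
container = ContainerClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    container_name="pacbio-vcf",
    credential="<sas-token>",
)

for vcf in glob("downloads/*.vcf.gz"):
    with open(vcf, "rb") as data:
        container.upload_blob(name=Path(vcf).name, data=data, overwrite=True)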

Username misleading in mnt path

In the first FHIR notebook, the mnt path is kind of confusing. Where it says USERNAME, it is not my AD username but rather the name of the compute environment, I think. Either way, this path took some digging to figure out and could be clarified in the notebook.

import subprocess

subprocess.run([
    "./run_synthea",
    "-s", "42",
    "-cs", "99",
    "-p", "10",
    "--exporter.baseDirectory=/mnt/batch/tasks/shared/LS_root/mounts/clusters//code",
])

Creating FHIR Server

In this notebook (1-data-export), step 2.1 below should come after step 2.4, because at that point the FHIR server has not yet been created. The user first needs to go to Azure API for FHIR and create the FHIR server; then they can do the rest. Creating the server is not described in the instructions.

2.1) Create an "Azure API for FHIR"[3] instance, named <fhir_server>

  • Navigate to https://<fhir_server>.azurehealthcareapis.com/metadata and verify a "Capability Statement" is retrieved.
    That means the FHIR server[3] is running.
  • Set fhir_server in Section 3.1
  • Use RBAC[6]: <fhir_server> left pane "Identity" -> "On" -> "Save"

No link to Download PacBio VCF Files

In the FHIR notebook 1_export_data.ipynb:

5.3) Convert all PacBio VCFs to TSV

This step assumes you already have VCF files in a storage account container. You could download the VCFs directly onto the VM and then copy them to the container, or leave them on the VM. Either way, the notebook should not assume the user already has the data.
