Introduction to bulk RNA-seq

Home Page: https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon-flipped/


intro-to-rnaseq-hpc-salmon-flipped's Introduction

Introduction to RNA-seq using high-performance computing (HPC)

Audience	Computational skills required	Duration
Biologists	None	3-session online workshop (~7.5 hours of trainer-led time)

Description

This repository has teaching materials for a 2-day Introduction to RNA-sequencing data analysis workshop. This workshop focuses on teaching basic computational skills to enable the effective use of a high-performance computing environment to implement an RNA-seq data analysis workflow. It includes an introduction to shell (bash) and shell scripting. In addition to running the RNA-seq workflow from FASTQ files to count data using Salmon, the workshop covers best practice guidelines for RNA-seq experimental design and data organization/management.

Note for Trainers: Please note that the schedule linked below assumes that learners will spend 3-4 hours between classes reading through selected lessons and completing their exercises. The online component of the workshop focuses on more exercises and discussion/Q&A.

These materials were developed for a trainer-led workshop, but are also amenable to self-guided learning.

Learning Objectives

Understand the necessity for, and use of, the command line interface (bash) and HPC for analyzing high-throughput sequencing data.
Understand best practices for designing an RNA-seq experiment and analyzing the resulting data.

Lessons

Installation Requirements

All:

FileZilla Client (make sure you get 'FileZilla Client')

Mac users:

Plain text editor like Sublime Text or similar

Windows users:

Git Bash
Plain text editor like Notepad++ or similar

Citation

To cite material from this course in your publications, please use:

Mary E. Piper, Meeta Mistry, Jihe Liu, William J. Gammerdinger, & Radhika S. Khetani. (2022, January 10). hbctraining/Intro-to-rnaseq-hpc-salmon-flipped: Introduction to RNA-seq using Salmon Lessons from HCBC (first release). Zenodo. https://doi.org/10.5281/zenodo.5833880

A lot of time and effort went into the preparation of these materials. Citations help us understand the needs of the community, gain recognition for our work, and attract further funding to support our teaching activities. Thank you for citing this material if it helped you in your data analysis.

These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Some materials used in these lessons were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

intro-to-rnaseq-hpc-salmon-flipped's Issues

Ratio calculator for 5’-3’ bias: how do we calculate in bcbio?

redundancy with qualimap+multiQC and troubleshooting

We can combine the Qualimap and MultiQC review and Meeta's slide deck; there were some redundancies.

too much copy and paste in automation lesson

Make all the code in one chunk for them to copy and paste? Will will fix!

automation lesson slurm array link

Current link is broken. Should be https://hbctraining.github.io/Training-modules/Accelerate_with_automation/lessons/arrays_in_slurm.html

move details on data lifecycle to self-learning

Add some exercise questions including making the README

FileZilla with 2F-authentication

I was having lots of trouble connecting to O2 through FileZilla. Then I thought it was strange that it was bypassing my 2-factor authentication, so decided to check Duo. I had approvals waiting there and once I gave it approval, I got it to work. It seems like we should edit the course materials to discuss the use of 2-factor authentication when connecting with FileZilla.

add arrays instead of for loop for parallelizing?

Should we link to this lesson, or discuss it? Or swap out content?
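For discussion, the job-array alternative could look roughly like the sketch below. The file names, the `--array=1-3` range, and the time and memory values are hypothetical placeholders, not taken from the lesson. `SLURM_ARRAY_TASK_ID` is the variable Slurm sets for each array task; it defaults to 1 here so the script also runs outside Slurm.

```bash
#!/bin/bash
#SBATCH --array=1-3        # one array task per input file (range is an assumption)
#SBATCH -t 0-1:00          # hypothetical time limit
#SBATCH --mem 1G           # hypothetical memory request

# Return the file assigned to a given 1-based array task ID.
# The three file names are hypothetical; in practice the list might
# come from a manifest file read with the same sed call.
file_for_task() {
    printf '%s\n' sample_1.fq sample_2.fq sample_3.fq | sed -n "${1}p"
}

# Slurm sets SLURM_ARRAY_TASK_ID per task; fall back to 1 when run by hand.
task_id=${SLURM_ARRAY_TASK_ID:-1}
fq=$(file_for_task "$task_id")
echo "Task ${task_id} would process ${fq}"
```

Compared with a for loop that submits one job per file, the array keeps everything in a single submission, and each task picks its own input from its task ID.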

Add note about knockout genes continuing to be expressed

Experimental design considerations reminder

Spend less time going over experimental considerations that aren't raised as questions, because there are a lot of questions. Ask whether there are questions about the whole lesson before going through it.

Update reservation

Provide information about updating the reservation (--reservation=HBC) in the FastQC lesson.
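As a starting point for that note, the sketch below shows where the flag would sit in an interactive-session request. It only prints the command rather than running it. The partition, time limit, and memory values are placeholders and may not match the lesson; only `--reservation=HBC` comes from this issue.

```bash
# Build (but do not run) an srun request that uses a reservation.
# Partition, time, and memory are placeholder values, not workshop settings.
build_srun_cmd() {
    echo "srun --pty -p interactive -t 0-3:00 --mem 1G --reservation=${1} /bin/bash"
}

build_srun_cmd "HBC"
```

For batch jobs, the same option can be given as an `#SBATCH --reservation=HBC` directive at the top of the script.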

update the shell review answer key

Once the assignment is finalized, update the answers.

Add GC bias information

look up Mike Love’s rationale for including the GC bias information. Contact Rob or Mike about nuances in biases. Maybe look up alpine too.

zoom poll questions?

missing bias value for Mov10_oe_3

In the MultiQC report there is a missing value for 5'-3' bias in the table.

Troubleshoot this, especially since the value is computed by Qualimap

add an example paired-end script for automation

A question that comes up often is how to modify the script to work with PE data. Instructors who have handled this in office hours: please put together a skeleton script that we can link out to.
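A possible starting skeleton is sketched below. It is not the workshop's script: it assumes mates are named `<sample>_R1.fq` and `<sample>_R2.fq`, uses a placeholder index path, and prints the Salmon commands instead of running them so that the pairing logic can be checked first.

```bash
#!/bin/bash
# Paired-end skeleton: prints one Salmon command per sample.
# Assumptions: mate files end in _R1.fq / _R2.fq; the index path is a placeholder.

index="path/to/salmon_index"   # hypothetical location

# Derive the mate-2 file name from a mate-1 file name.
mate2_for() {
    echo "${1%_R1.fq}_R2.fq"
}

# Print the Salmon command for one sample, given its mate-1 file.
salmon_cmd_for() {
    r1=$1
    r2=$(mate2_for "$r1")
    sample=$(basename "$r1" _R1.fq)
    echo "salmon quant -i ${index} -l A -1 ${r1} -2 ${r2} -o ${sample}.salmon"
}

# Pass the mate-1 files as arguments, e.g. ./script.sh data/*_R1.fq
for r1 in "$@"; do
    salmon_cmd_for "$r1"
done
```

Looping over only the mate-1 files and deriving mate 2 from each name avoids processing every sample twice. Once the printed commands look right, the `echo` in `salmon_cmd_for` can be dropped to run them.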

Data to follow along

Hello all,

I am trying to follow along with the analysis in this workshop. However, when I click to download the data, I am not sure I got the correct files because the file names are different from the tutorial. Could you confirm which files you used for this analysis? Is it the files in the folder raw_fastq? Thank you so much!

data link for self-learners broken

On this page: links-to-lessons.md

The link for Non-Harvard folks is broken (it also points to unix_lesson, which I don't think is what they need)

Missing link

In https://github.com/hbctraining/Intro-to-rnaseq-hpc-salmon-flipped/blob/main/lessons/04b_data_organization.md, section titled "Implementing data management best practices", there is a hyperlink "In a previous lesson" which is broken.

Add positional bias information (discuss that it’s experimental, and that the 5’-3’ bias can be deduced using Qualimap (referencing Salmon))

Reducing alignment slide deck

Radhika then Meeta

HPC lesson updates and clarifications from Kathleen

data storage & memory on O2 are in units using base 2, not units using base 10 (e.g. tebibytes not terabytes. TiB = 1024 GiB, TB = 1000 GB). HMS IT had been using the units that people colloquially use- such as terabyte, gigabyte, etc. - but have been technically incorrect with these units. The amount of storage that folks have been using/have access to has not changed with our change in terminology. The distinction is important for the billing aspect, as we charge for compute usage (with a RAM charge for GiB/hour, among other factors) and for storage usage (TiB/year). More details on billing rates here
for the sentence “There are several compute nodes on O2 available for performing your analysis/work”, do you mean several types of compute nodes? That is true, or you could also say “There are several hundred compute nodes…” which is also true. The sentence as is sounds like it is missing a word.
Memory request would be in gibibytes, not gigabytes for --mem 1G
This won’t be relevant for the workshop itself, but if folks are submitting jobs and are in multiple Slurm accounts (e.g. labs/groups), they’ll need to specify an account for an srun or sbatch job to count under with the -A parameter. You can check if you’re in multiple Slurm accounts by running sshare -Uu $USER. More details on -A and Slurm accounts/unix accounts here
The wiki link for -t is broken, missing a dash, use this: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Time-limits
Same thing for -c: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#How-many-cores?
And --mem: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Memory-requirements
And O2 wiki sbatch reference link: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#sbatch-options-quick-reference
sbatch job submission is using 400MiB
module load can modify environment variables other than $PATH; the specifics are probably not relevant to this workshop, though
We’re starting to move away from gcc/6.2.0 and are building new tools with gcc/9.2.0, but the majority of modules have been built with gcc/6.2.0
For the filesystems part, it’d be helpful to link to here, as it has links for requesting group directories (under the Active Compute section). Also, a caveat that off-quad folks will have to pay for their group directories. Home and scratch directories are free for everyone. Also, /n/cluster/bin/scratch3_create.sh needs to be run from a login node. The script will give you an error message to this effect if you run it from a compute node, but sometimes folks don’t read it.
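To make the base-2 versus base-10 distinction concrete for the lesson, a small arithmetic sketch follows. The helper names are made up for illustration, and the account name in the commented directives is a hypothetical placeholder.

```bash
# Directives shown as they might appear in a job script:
#SBATCH --mem 1G            # per the note above, interpreted as 1 GiB
#SBATCH -A my_lab_account   # hypothetical account; only needed with multiple Slurm accounts

# Bytes in n GiB (base 2) versus n GB (base 10).
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }
gb_to_bytes()  { echo $(( $1 * 1000 * 1000 * 1000 )); }

# How many more bytes n GiB holds than n GB.
gib_gb_difference() { echo $(( $(gib_to_bytes "$1") - $(gb_to_bytes "$1") )); }

gib_to_bytes 1        # 1073741824
gb_to_bytes 1         # 1000000000
gib_gb_difference 1   # 73741824
```

The gap is about 7% at the GiB/GB scale and grows at the TiB/TB scale, which is why the distinction matters for billing even though the storage people actually have has not changed.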
