z0on / tag-based_rnaseq Goto Github PK

View Code? Open in Web Editor NEW

15.0 4.0 9.0 4.33 MB

Cost-efficient genome-wide gene expression profiling

Perl 29.89% R 17.01% Python 53.10%

tag-based_rnaseq's Introduction

Genome-wide gene expression profiling with tag-based RNA-seq (TagSeq)

Mikhail Matz, [email protected]

NOW TESTED WITH NOVASEQ!

Current library preparation protocols (feel free to comment!):

"micro" version (1-10 ng total RNA): https://docs.google.com/document/d/1ySHaW-m53q13H7mIfBl8Rpye_Ji2zBZ0zlWq_PTYmEg/edit?usp=sharing

simplified version (100 ng of total RNA): https://docs.google.com/document/d/1wg-MGPTRjDu0KJ_6hJcT8FUZ3650nzfOpwBK8bzxL2w/edit?usp=sharing

Tag-based RNA-seq is a method of measuring expression of eukaryotic protein-coding genes on a whole-genome scale. Compared to standard RNA-seq it is very cost-efficient (on the order of $50/sample, all inclusive), allowing extensive experimental designs. The method, however, requires a reference (transcriptome or genome) to map reads to.

The method has been described in Meyer, Aglyamova and Matz, Mol Ecol 2011 ( http://bit.ly/1Zy8Ki7 ). Since then it has been adapted for Illumina sequencing, lab procedures have been further simplified, and removal of PCR duplicates has been implemented.

Lohman et al extensively benchmarked TagSeq against standard RNA-seq (NEBNext) and found that tag-seq quantifies transcript abndances more accurately, for about 10% of the cost: http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12529/abstract

This project provides the up-to-date wet lab protocol, scripts and walkthoughs for initial sequence data processing (from reads to gene counts), including:

concatenating raw sequence files according to the sampling design;
adaptor trimming, quality filtering and removal of PCR duplicates;
mapping against reference transcriptome;
deriving gene counts.

tag-based_rnaseq's People

Contributors

Stargazers

Watchers

Forkers

dayedepps juadiegaitan homereamk ckenkel mstudiva kmeaton seabove7 paigeduffin

tag-based_rnaseq's Issues

Read-counts-per-gene step error- likely stemming from seq2gene.tab file?

Good evening all,

I want to apologize in advance for the extremely long thread, but I am so new and uncertain with all of this that I wanted to make sure everything I've done is crystal clear in case I've made any errors along the way. My project seeks to understand differential gene expression in diseased Pisaster ochraceus sea stars. I am attempting to get a pipeline up and running for 90+ tagseq samples with a smaller subset of 3 samples. As I'm not sure where I've gone wrong, I am going to describe every step from the raw reads to where I've hit a roadblock that has completely stopped me in my tracks.

For reference, I am working on the Sapelo2 Linux cluster at the University of Georgia and mainly following along with your tutorial tagSeq_processing_README.txt.

Step 1. Prepare workspace and concatenate raw reads

Step 1A. Copy raw reads into working directory. Raw read pairs have been renamed to reflect the following format: sample_*_L1.fastq.gz and sample_*_L2.fastq.gz.
Step 1B. Unzip raw read files.
$ gunzip sample_*.fastq.gz
Step 1C. Make sure ngs_concat.pl script is in working directory.
Step 1D. Load perl module, run script to concatenate paired reads
$ ml Perl/5.32.1-GCCcore-10.3.0
$ perl ngs_concat.pl 'sample' 'sample_(.+)_L'

Step 2. Adaptor trimming

Note: At the time I did not understand "screen" + piping to the file named clean, so I wrote each adaptor trimming line manually. (I figured out how to do this for the maps step, so I will do it the "right way" when I scale up to the whole genome.

Step 2A. Make sure tagseq_clipper.pl script is in working directory.

Step 2B. Run code to trim reads. The three samples I'm working with are named 10.8.fq, 1.15.fq, and 11.4.fq at this stage.
$ ml cutadapt/3.4-GCCcore-8.3.0-Python-3.7.4
$ perl tagseq_clipper.pl 10.8.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 10.8.trim
$ perl tagseq_clipper.pl 1.15.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 1.15.trim
$ perl tagseq_clipper.pl 11.4.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 11.4.trim

Step 3. Create transcriptome from reference genome.

I am fortunate that I'm working with a species for which there is a published reference genome. After a lot of research on the internet, this is what I came up with to create my in silico transcriptome. I do not know for sure whether or not I've done this correctly- as you'll see, I ran into a few snags.

Step 3A. Download reference genome into working directory.
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/010/994/315/GCA_010994315.2_ASM1099431v2/GCA_010994315.2_ASM1099431v2_genomic.fna.gz
$ gunzip GCA_010994315.2_ASM1099431v2_genomic.fna.gz
$ mv GCA_010994315.2_ASM1099431v2_genomic.fna.gz pisaster_ref_genome.fa # rename file

Step 3B. Get annotation file into working directory.
Download annotation file from colleague's Dryad: https://datadryad.org/stash/dataset/doi:10.6071/M3ND50
Upload annotation file to Sapelo2 in working directory, rename genome_annotation.gff3.

Step 3C. Troubleshoot and troubleshoot some more.
I was getting the same error when I tried many different variations of code that used gffread to create transcriptome from the genome and annotation files. Got involved in a Github issue in the repository for gffread that matched the error I was getting… and it led to me getting my answer: there are spaces in my ref genome sequence names and there should not be.

An example of what the first and last sequence names in my file look like:

>CM021546.1 Pisaster ochraceus isolate M0D055189C chromosome XXI, whole genome shotgun sequence
>JAAFGO010000999.1 Pisaster ochraceus isolate M0D055189C Sc28pcJ_999;HRSCAF=1079, whole genome shotgun sequence

Spent an extraordinarily long time trying to figure out some code that could efficiently fix this for me and failed. Shamefully, I instead landed on creating a shell script that renamed each contig individually (eg sed -i 's/CM021526.1/Sc28pcJ_649/g' pisaster_ref_genome_header.mod.fa for contig 1). But hey, it looks like it worked.

Then, I reran my transcriptome building step (gffread -F -w pisaster_transcriptome.fa -g pisaster_ref_genome_header.mod.fa ann_simple.gff) and got a new error: "GffObj::getSpliced() error: improper genomic coordinate 17535992 on Sc28pcJ_1844 for g8980.t1". I was frustrated so I just deleted that line, and 5 or 6 other lines that gave me similar issues. After that, it finally worked! As an example, the first two entries in my new pisaster_transcriptome.fa look like this:

>g1.t1 CDS=1-277 geneID=g1 atgctctctaccaataggaatagcgaaactgtctggggaGGAGGCAAGAAGTCCCCAGAGAACAATGTTT TGAACAAGGTGTTGGAGGCGCTGGCACCAGTAGCAAAGGATGGCATCTTTTCCCAAACTGATGGAGAATC ATACCCAATCCCACAAGCCGGGCTGAGTGATGATGTGGAAGTCATGGACTTGATAGCAAATGTAATAAGT GAAGGGGATAAGAGTGCAGTGGATGCCCAATCTGATTCTGCATCCATGGAGGACATTTCAGAGGCAG
>g2.t1 CDS=1-97 geneID=g2 ATGGCAGGCTGCGTTACAGGCGGGCTTATAGCATACAGAGTTGTGTCACTCACCGAAAGCATTTCTAGTA GCAAGTTATCATCATGGAGCAGCCTGG

Step 4. Back to the tag-based_RNAseq tutorial for mapping: preparing reference transcriptome

Create bowtie2 index for transcriptome.
$ ml Bowtie2/2.4.5-GCC-10.2.0
$ bowtie2-build pisaster_transcriptome.fa pisaster_transcriptome.fa
Here is the text output I was given when I executed the above code:
bowtie2_build_output_june.18.2022.txt

Step 5. Map reads to transcriptome.

Move relevant perl script to working dir, create maps file as in tutorial (this is where I figured out how to use screen).
$ ml Perl/5.32.1-GCCcore-10.3.0
$ perl tagseq_bowtie2map.pl "trim$" pisaster_transcriptome.fa > maps
$ screen
$ ml Bowtie2/2.4.5-GCC-10.2.0
$ . maps
Here is the text output I was given when I executed the above code:
maps_screenOutputFile.txt

Step 6. (attempt to) generate read counts per gene.

This is where my big "roadblock" is. At first, I thought I needed to make a true transcriptome_seq2gene.tab table, and was really confused by the instructions, because I didn't use Trinity. So I relied on another Github repo that is actually derived from this one to at least generate a populated seq2iso.tab file without any error messages. The version of that code that I attempted to modify to fit my needs was:
$ grep '^>' pisaster_transcriptome.fa |\ perl -pe 's/>(\S+)\s.*gene:(\S+).*/\1\t\2/g' >\ pisaster_transcriptome_seq2iso.tab

The first two lines of this .tab file are:

>g1.t1 CDS=1-277 geneID=g1
>g2.t1 CDS=1-97 geneID=g2

As I'm sure you can guess, this did not work. After running the following:
$ perl ./samcount_launch_bt2.pl '\.sam' pisaster_transcriptome_seq2iso.tab > sc
$ screen
$ ml Bowtie2/2.4.5-GCC-10.2.0
$ ml Perl/5.32.1-GCCcore-10.3.0
$ . sc
...I received a massive chain of error messages (eg g19414.t1 has no isogroup designation) and the .sam.counts files were empty.

Finally, today, it dawned on me that this line in the instructions is (I believe) relevant to me:

if your transcriptome is made in silico from annotated genome data, just use a dummy table in the form:
# gene gene

I followed these instructions very literally and created a dummy .tab file named "dummy_seq2iso.tab" with one line of text "gene [tab] gene". However, when I reran the above code with this new .tab file, I received this error x3 (one for each sample, I presume):

disregarding reads mapping to multiple isogroups cannot open dummy_seq2iso.tab

I made sure restrictive file permissions were not the cause of this error and did a lot of sleuthing around on the internet, but I am at a loss on what to do next.

Any insight on this last step specifically, or if you believe the error is in one of these previous steps I've described, would be IMMENSELY appreciated.

Thank you so much in advance.

Paige Duffin

multiple template switching oligos?

Thanks for this - I am interested in trying it out! Could you clarify the reason for the multiple template switching oligos in this version of the protocol? It looks like the published versions have only one? Are all three now recommended?

Thanks

tagseq processing not recognizing fastq headers

I'm following your tagSeq_processing_README.txt protocol to trim and filter reads, generated from QuantSeq libraries run on an Illumina NovaSeq platform this month. The output indicates that a very large portion of my reads do not have headers:

Upon inspection, the fastq files don't appear to lack headers, but I'm wondering if the tagseq_clipper.pl script is looking for a different header format? My headers are in the following format:

Here are abbreviated versions of an untrimmed file, and the trimmed file showing reads that passed the tagseq_clipper.pl script:
example_files.zip

I admittedly am unfamiliar with perl scripts, so any help would be great.

TagSeq pipeline trouble - Generating read-counts-per-gene

Hi all,

I'm trying to assay differential gene expression from 22 RNA samples, two of them sequenced with RNAseq and 20 of them sequenced with TagSeq. I used the RNAseq sequences to generate a reference transcriptome de novo via Trinity to compare the TagSeq sequences to.

I've been following the pipeline listed here:
https://github.com/z0on/tag-based_RNAseq/blob/master/tagSeq_processing_README.txt
Everything went smoothly until the "# generating read-counts-per gene" section. Here's a rundown of what I did and where things appeared wrong (Skip to Step 4 if you want to go straight to that).

STEP 1
I used the command:
grep ">" Trinity.fasta | perl -pe 's/^>(\S+)|(\S+)(i\d+)\s.+/$1|$2$3\t$1$2/' >transcriptome_seq2iso.tab
in order to create a two-column tab-delimited table to give correspondence between entries in the transcriptome fasta file and genes (our transcriptome fasta file was named "Trinity.fasta", not "trinity.fasta" as per the readme file). The file "transcriptome_seq2iso.tab" was created.

STEP 2
The pipeline afterwards lists some steps to take if you don't have any assembler-derived genes (cd-hit-est). As we had plenty of assembler-derived-genes, I planned on skipping it, but isogroup_namer.pl (in the next step) seems to depend on having ch-hit-est. Thus, I ran cd-hit-est using the following code:
cdhit-est -i Trinity.fasta -o transcriptome_clust.fasta -c 0.99 -G 0 -aL 0.3 -aS 0.3
The following two files were created 1) transcriptome_clust.fasta and 2) transcriptome_clust.fasta.clustr

STEP 3
To try to add cluster designations to the fasta headers, I used the code:
/tag-based_RNAseq/isogroup_namer.pl transcriptome_clust.fasta transcriptome_clust.fasta.clstr >transcriptome_seq2gene.tab
Note: The "/tag-based_RNAseq" portion of the code was to tell Ubuntu where isogroup_namer.pl was.
The following three files were created: 1) transcriptome_clust_iso.fasta 2) transcriptome_clust_iso.seq2iso.tab 3) transcriptome_seq2gene.tab
However, when I examined these files with nano, all of the files were empty.

STEP 4 (THIS IS WHERE THINGS HAVE APPARENTLY GONE WRONG)
To try to count the hits per isogroup, I ran the following code:
/home/genomics/tag-based_RNAseq/samcount_launch_bt2.pl '.sam' /Data/devin/JA19193/transcriptome_seq2iso.tab > sc
Note: "/home/genomics/tag-based_RNAseq" pointed Ubuntu to samcount_launch_bt2.pl, and "/Data/devin?JA19193" pointed Ubuntu towards transcriptome_seq2.iso.tab.
After doing this, I used nano to edit sc to add the full path to samcount before each sample so Ubuntu could find samcount. For example:
samcount.pl 10_merged.trim.sam /home/genomics/Data/devin/JA19193/transcriptome_seq2iso.tab aligner=bowtie2 >10_merged.trim.sam.counts
became
~/tag-based_RNAseq/samcount.pl 10_merged.trim.sam /home/genomics/Data/devin/JA19193/transcriptome_seq2iso.tab aligner=bowtie2 >10_merged.trim.sam.counts
Note: "10_merged" is the name of a TagSeq sample.

I then executed sc using the following code:
bash sc

samcount ran, but in every line of output while it ran, the message, "[gene name] has no isogroup designation” appeared.
I tried creating the sc file using the file "transcriptome_seq2gene.tab" (the 3rd file in Step 3 above) and running samcount with this modified sc, but received the same result. Thus, it appears that samcount is treating the only meaningful product file from isogroup_namer as an empty file.
Can anyone give me some help? I'm running Ubuntu version 16.04.

Thanks,
-Devin

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.