hbctraining / dge_workshop_salmon
Home Page: https://hbctraining.github.io/DGE_workshop_salmon/
For the code "samples <- list.files(path = "./data", full.names = T, pattern="\.salmon$")", it works when using pattern="salmon$". What is the purpose of using '\.' before it?
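A minimal sketch that may help answer this, assuming the `pattern` argument is treated as a regular expression (which is how `list.files()` documents it). In a regular expression a bare `.` matches any character, so escaping it restricts the match to a literal dot. The file names below are hypothetical, and note that inside an R string the backslash itself must be doubled (`"\\."`).

```r
# Hypothetical file names, for illustration only
files <- c("sample1.salmon", "sample2.salmon", "sample3_nosalmon", "salmon")

# Without the escaped dot: anything that merely ends in "salmon" matches
loose <- grepl("salmon$", files)
loose   # TRUE TRUE TRUE TRUE

# With the escaped dot: the name must end in a literal ".salmon"
strict <- grepl("\\.salmon$", files)
strict  # TRUE TRUE FALSE FALSE
```

Both patterns return the same files when every entry in `./data` ends in `.salmon`, which is why the shorter pattern appears to work; the escaped version only matters once other names ending in "salmon" are present.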
The problem we encountered when trying to change to AnnotationHub is the one-to-many mapping of Ensembl IDs to Entrez IDs, and the fact that the Entrez IDs are stored as a list. Here is some code that will work if we choose to change. If we do change, it would be worth exploring the differences between these annotations and Ensv86 using AnnotationDbi.
library(AnnotationHub)
library(ensembldb)
library(dplyr)  # provides %>%, select() and filter()
library(purrr)  # provides map()
# Connect to AnnotationHub
ah <- AnnotationHub()
# Query AnnotationHub
human_ens <- query(ah, c("Homo sapiens", "EnsDb"))
# Extract annotations of interest
human_ens <- human_ens[["AH64923"]]
# Extract gene-level information
genes(human_ens, return.type = "data.frame") %>% View()
# Create a gene-level dataframe (FOR LESSON)
annotations_ahb <- genes(human_ens, return.type = "data.frame") %>%
  dplyr::select(gene_id, symbol, entrezid, gene_biotype) %>%
  dplyr::filter(gene_id %in% res_tableOE_tb$gene)
# Wait a second, we don't have one-to-one mappings!
class(annotations_ahb$entrezid)
which(map(annotations_ahb$entrezid, length) > 1)
# So which one is right? And why do we have this problem?
# Okay let's just keep the first entrezID in the case that there are two mappings
annotations_ahb$entrezid <- map(annotations_ahb$entrezid, 1) %>% unlist()
# Determine the indices for the non-duplicated genes
non_duplicates_idx <- which(duplicated(annotations_ahb$symbol) == FALSE)
# Return only the non-duplicated genes using indices
annotations_ahb <- annotations_ahb[non_duplicates_idx, ]
Use this as a template and update the links to the lessons - https://github.com/hbctraining/Intro-to-ChIPseq/blob/master/README.md
It is okay if there is not an actual schedule, as long as each lesson has a time associated with it.
Taking some guidance from what we have in the training modules: https://github.com/hbctraining/Training-modules/blob/master/planning_successful_rnaseq/lessons/count_modeling.md
biocLite --> BiocManager
For the mean/variance count plot, would it be clearer to set the x and y axes to the same range, so that the red line sits at 45 degrees?
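One way this could look, as a rough sketch rather than the lesson's actual code: the counts below are simulated (hypothetical), and the column names `mean_counts` and `variance_counts` are assumptions. Shared limits on both log axes plus `coord_fixed()` draw the `y = x` reference line at 45 degrees.

```r
library(ggplot2)

# Simulated counts for 500 hypothetical genes across 3 replicates
set.seed(1)
mu <- 10^runif(500, 0, 4)
counts <- sapply(1:3, function(i) rnbinom(500, mu = mu, size = 5))

df <- data.frame(mean_counts = rowMeans(counts),
                 variance_counts = apply(counts, 1, var))

# Log scales cannot display zeros, so drop them before computing limits
df <- df[df$mean_counts > 0 & df$variance_counts > 0, ]

# One shared range for both axes
lims <- range(c(df$mean_counts, df$variance_counts))

p <- ggplot(df) +
  geom_point(aes(x = mean_counts, y = variance_counts)) +
  geom_abline(intercept = 0, slope = 1, color = "red") +  # y = x on the log scales
  scale_x_log10(limits = lims) +
  scale_y_log10(limits = lims) +
  coord_fixed()
```

Printing `p` renders the plot. Points above the red line have variance greater than the mean.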
WGCNA package link in functional analysis lesson.
Problems arise when using gene() function
See table in slide 4 here - https://hbctraining.github.io/Training-modules/planning_successful_rnaseq/slides/functional_analysis_mp.pdf
Consider reducing GO content and making the intro to FA more about databases.
Explain what will happen if answering yes/no to "AnnotationHub does not exist, create directory?"
Might be nice to have them save an object - e.g. save the ego object.
Discuss statistical model ->
Discuss the output from the model being the log2 fold changes with standard error estimates ->
Wald test ->
P values / multiple test correction ->
Log2 shrinkage
In workflow summarization there is a broken link. We need to update with an actual RMarkdown report
Code below to use log-transformed values
`# Load the packages used below
library(sleuth)
library(ggplot2)

# Extract normalized estimated counts from the sleuth object
norm_counts <- sleuth_to_matrix(de, "obs_norm", "est_counts")

# Apply the transformation stored in the sleuth object
log_norm_counts <- de$transform_fun_counts(norm_counts)

# Run PCA with samples as rows
pc <- prcomp(t(log_norm_counts))
plot_pca <- data.frame(pc$x, summarydata)

# Plot the first two principal components, colored by genotype
ggplot(plot_pca, aes(PC1, PC2)) +
  theme_bw() +
  geom_point(aes(color = genotype)) +
  xlab('PC1') +
  ylab('PC2') +
  scale_x_continuous(expand = c(0.3, 0.3)) +
  #geom_text_repel(aes(x=PC1, y=PC2), label=name) +
  theme(plot.title = element_text(size = rel(1.5)),
        axis.title = element_text(size = rel(1.5)),
        axis.text = element_text(size = rel(1.25)))`
The orgDb reduces the number of genes by almost half, to ~25K, whereas EnsDb (albeit an older release) retains 50K. Add an explanation and look into the differences.
Code was modified to use the AnnotationHub dataframe from the annotations lesson. Need someone to run through it and make sure it works.
I have done so once and put figures here: Dropbox (Harvard University)/HBC Team Folder (1)/Teaching/Courses/DGE_salmon/
"Plot a boxplot of the mean expression of Myc for the KO and WT samples using theme_minimal() and give the plot new axes names and a centered title."
It's confusing because we don't need them to compute a mean
Currently an .RData object is linked for the annotation df. It contains many more objects than we need, so we should replace it with a CSV file of the annotations.
For any computation that uses permutations or random sampling, we should call set.seed() (or at least demo it) so that we get the same results each time.
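A small demonstration of the idea, as a sketch: the seed value `1234` is arbitrary and `sample()` stands in for whatever permutation-based step a lesson uses.

```r
# Setting the same seed before each call reproduces the same draw
set.seed(1234)
first_draw <- sample(1:100, 5)

set.seed(1234)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the next draw will generally differ
third_draw <- sample(1:100, 5)
```

The seed needs to be set immediately before the random step; setting it once at the top of a script only gives reproducible results if every line in between is run in the same order.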
Change instances of apply() to purrr::map() for consistency in the 01 lesson.
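A possible shape for that conversion, sketched on a hypothetical toy count table (not the lesson's data). Row-wise `apply()` has no direct `map()` equivalent, so this sketch assumes `purrr::pmap_dbl()`, which iterates over the rows of a data frame, is an acceptable substitute.

```r
library(purrr)

# Hypothetical toy counts: 3 genes (rows) by 3 samples (columns)
counts <- data.frame(s1 = c(10, 0, 5),
                     s2 = c(12, 1, 7),
                     s3 = c(8, 2, 6))

# Current style: row-wise mean with apply()
mean_apply <- apply(counts, 1, mean)

# purrr style: pmap_dbl() passes one row at a time as arguments
mean_map <- pmap_dbl(counts, function(...) mean(c(...)))

all.equal(unname(mean_apply), unname(mean_map))  # TRUE
```

Both versions return 10, 1 and 6 for the three rows. For plain row means, `rowMeans(counts)` is simpler than either.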
Good resource found by Jihe:
https://slideplayer.com/slide/12369265/73/images/25/Likelihood+Ratio+Test+vs.+Wald+Test.jpg
https://f1000research.com/articles/7-952/v1
Need to look into this!
so red line goes through the middle - thanks @jihe-liu
Will modify the logFC section: remove the equation, remove the notes about older DESeq2, and remove the NOTE about LRT. We are teaching it in the lessons.
Convert all uses of apply() to map().
I have only noticed it here so far: https://github.com/hbctraining/DGE_workshop_salmon/blob/master/lessons/01_DGE_setup_and_overview.md#how-do-i-know-if-my-data-should-be-modeled-using-the-poisson-distribution-or-negative-binomial-distribution
Remove the top20 plot and clean up code(?).
question from the students, should we add this as a note?
https://support.bioconductor.org/p/98833/#98837
See this post about using one or the other.
Also see the vignette for more info.
In "Details regarding PCA are given below (based on materials from StatQuest, and if you would like a more thorough description, we encourage you to explore StatQuest's video and our longer lesson", the last link for "our longer lesson" is not working.
Seems a bit wordy, can trim where there is redundancy with the table
In the section about viewing data in the setup md - https://github.com/hbctraining/DGE_workshop_salmon/blob/master/lessons/01_DGE_setup_and_overview.md#viewing-data
it should be
# Write the counts to an object
data <- txi$counts %>%
....
instead of
# Write the counts to file
data <- txi$counts %>%
....
Change to ont = "ALL" instead of just "BP" when running clusterProfiler and creating the ego object.