lpantano / isomirs

analyze isomiRs from seqbuster tool

Home Page: http://lpantano.github.io/isomiRs/

License: MIT License

R 96.10% TeX 3.90%

bioconductor isomirs analyze-isomirs r mirna

isomirs's Introduction

isomiRs

Analyze isomiRs from seqbuster tool or any BAM file after using seqbuster miraligner

Installation

This is an R package.

Bioconductor stable version

# Install BiocManager only if it is not already available
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("isomiRs")

Bioconductor latest version

devtools::install_git("https://git@git.bioconductor.org/packages/isomiRs")

devtools development version

install.packages("devtools")
devtools::install_github("lpantano/isomiRs")

isomirs's People

Forkers

mshadbolt drhogart matt-int

isomirs's Issues

isomiRs annotation

Hi,

Thank you for such a nice package! However, I don't understand how isomiRs are annotated. In the vignette example below, two rows have the same 5' modification (marked with **), but one is named ref and the other iso. Why is a molecule with a modification named ref? And why don't the two share the same name when the modification is the same?

dds = isoDE(ids, formula=~condition, ref=TRUE, iso5=TRUE)
head(results(dds, tidy=TRUE))

   row                         baseMean   log2FoldChange lfcSE     stat        pvalue
1  hsa-let-7a-3p.iso.t5:0      18.174659  -0.0147846     0.5485291 -0.02695318 0.9784971
2  hsa-let-7a-3p.iso.t5:c      3.48228     0.7353251     0.8043144  0.91422589 0.3605982
3  **hsa-let-7a-3p.iso.t5:ct   4.148260    0.4996872     0.7783157  0.64201102 0.5208660
4  hsa-let-7a-3p.ref.t5:0      7.052740   -0.2545893     0.6761882 -0.37650660 0.7065403
5  hsa-let-7a-3p.ref.t5:c      4.231561    0.2260381     0.7555882  0.29915509 0.7648217
6  **hsa-let-7a-3p.ref.t5:ct   1.192198   -0.3630919     0.8902994 -0.40783118 0.6833976
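
For readers trying to follow the naming in the table above, here is a minimal sketch that splits each row ID into its parts. It assumes the `<miRNA>.<ref|iso>.t5:<value>` layout seen in this output; `parse_row_id` is a hypothetical helper, not part of isomiRs, and it does not by itself explain how the package decides between `ref` and `iso`.

```r
# Row IDs copied from the results table above (without the ** highlight).
row_ids <- c(
  "hsa-let-7a-3p.iso.t5:0", "hsa-let-7a-3p.iso.t5:c", "hsa-let-7a-3p.iso.t5:ct",
  "hsa-let-7a-3p.ref.t5:0", "hsa-let-7a-3p.ref.t5:c", "hsa-let-7a-3p.ref.t5:ct"
)

# Hypothetical helper: split "<miRNA>.<ref|iso>.t5:<value>" into three columns.
parse_row_id <- function(ids) {
  hits <- regmatches(ids, regexec("^(.*)\\.(ref|iso)\\.t5:(.*)$", ids))
  data.frame(
    mir  = vapply(hits, function(h) h[2], character(1)),
    type = vapply(hits, function(h) h[3], character(1)),
    t5   = vapply(hits, function(h) h[4], character(1)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_row_id(row_ids)

# Cross-tabulate the ref/iso tag against the 5' field: every 5' value
# appears once under each tag, so the tag is what distinguishes the rows.
print(table(parsed$type, parsed$t5))
```

Seen this way, the two highlighted rows differ only in the ref/iso tag, which is the part of the annotation the question is about.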

input error with mirtop export file

Hi,

I wanted to analyze isomiR output using your isomiRs package. We already have a miRNA pipeline set up with sRNAbench, so I tried to convert the sRNAbench output to isomiRs input using mirtop.
The sRNAbench output was imported into mirtop, and the resulting gff file was then exported to isomir format. But the isomiRs package shows an error when I load the files in R.
Here is the error message:

Error in [.data.frame(table, , c("seq", "freq", "mir", "mism", "add", :
undefined columns selected

I also tried to export to seqbuster format in mirtop, but then mirtop itself fails with the error below. The miRNA id in the final "KeyError" line differs from sample to sample.

Traceback (most recent call last):
  File "/Users/kicheol/miniconda2/envs/mirna/bin/mirtop", line 10, in <module>
    sys.exit(main())
  File "/Users/kicheol/miniconda2/envs/mirna/lib/python3.7/site-packages/mirtop/command_line.py", line 45, in main
    export(kwargs["args"])
  File "/Users/kicheol/miniconda2/envs/mirna/lib/python3.7/site-packages/mirtop/exporter/__init__.py", line 9, in export
    seqbuster.convert(args)
  File "/Users/kicheol/miniconda2/envs/mirna/lib/python3.7/site-packages/mirtop/exporter/seqbuster.py", line 29, in convert
    _read_file(fn, precursors, matures, args.out)
  File "/Users/kicheol/miniconda2/envs/mirna/lib/python3.7/site-packages/mirtop/exporter/seqbuster.py", line 49, in _read_file
    matures[attr["Parent"]][attr["Name"]],
KeyError: 'hsa-miR-548av-5p'

Here is my files for your information.

gff file for single sample & multiple samples
converted rawData file for single sample & multiple samples

Thanks,
Kicheol

Add quantification modeling

Integrate in a R function paper about modeling absolute miRNA expression

feature: add polar coord plot with nt changes frequency

nt changes in seed, middle, end as three different category. Numbers show % in abundance and unique sequences (two plots next to each other). Each sample 1 line, color by group.

Discrepancy in miRNA counts

Hi,

I would like to ask about some miRNA count discrepancies I noticed when switching between isomiRs versions. In the past I used isomiRs 1.10, and I have now switched to isomiRs 1.16.2. I am fully aware that there were many versions in between, and certainly many improvements that would affect the final miRNA counts, but I am seeing huge differences in the quantified miRNA counts, which bothers me. For example, here are the counts for the first 10 miRNAs produced by isomiRs 1.10 (output of the function IsomirDataSeqFromFiles()):

                  METSEQ-T04 METSEQ-T05 METSEQ-T06 METSEQ-T07 METSEQ-T08 METSEQ-T09
"hsa-let-7a-2-3p"         16          3         22          1         18          1
"hsa-let-7a-3p"          444        436        807        474        564        835
"hsa-let-7a-5p"       250567     211944     536064     342337     309013     339820
"hsa-let-7b-3p"           94        164        531        272        104        217
"hsa-let-7b-5p"        59497      70568     198727     161083      47276     114909
"hsa-let-7c-3p"           17         39         83         47         22         14
"hsa-let-7c-5p"         8253      24841      91330      30526       7284      12840
"hsa-let-7d-3p"          276        279        937        452        483        805
"hsa-let-7d-5p"         8848       6314      17324      13667      16358      21746

and here by isomiRs 1.16.2:

                  METSEQ-T04 METSEQ-T05 METSEQ-T06 METSEQ-T07 METSEQ-T08 METSEQ-T09
"hsa-let-7a-2-3p"          8          2         10          0         15          1
"hsa-let-7a-3p"          346        353        601        349        476        636
"hsa-let-7a-5p"       223211     186839     468075     293262     275729     287835
"hsa-let-7b-3p"           59         88        345        144         67        123
"hsa-let-7b-5p"        44572      50948     147743     115957      33111      82647
"hsa-let-7c-3p"           14         35         64         38         18         12
"hsa-let-7c-5p"         6589      19765      73938      23852       5792      10195
"hsa-let-7d-3p"          224        220        749        276        366        623
"hsa-let-7d-5p"         7069       4855      13515      10135      12969      15318

The difference is really big: summed per sample, it reaches up to 1M reads, so I would like to know what is causing it. In issue #17, @svattathil mentioned that some noise filtering happens in isomiRs, but I did not manage to find any information about it.
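
To put numbers on the gap, the comparison can be sketched in base R. This is only an illustration: it uses three of the rows shown in the two tables above, not the full output, and the object names are made up for the example.

```r
samples <- c("METSEQ-T04", "METSEQ-T05", "METSEQ-T06",
             "METSEQ-T07", "METSEQ-T08", "METSEQ-T09")
mirs <- c("hsa-let-7a-3p", "hsa-let-7a-5p", "hsa-let-7b-5p")

# Three rows copied from the isomiRs 1.10 table.
old_counts <- matrix(
  c(444, 436, 807, 474, 564, 835,
    250567, 211944, 536064, 342337, 309013, 339820,
    59497, 70568, 198727, 161083, 47276, 114909),
  nrow = 3, byrow = TRUE, dimnames = list(mirs, samples)
)

# The same rows from the isomiRs 1.16.2 table.
new_counts <- matrix(
  c(346, 353, 601, 349, 476, 636,
    223211, 186839, 468075, 293262, 275729, 287835,
    44572, 50948, 147743, 115957, 33111, 82647),
  nrow = 3, byrow = TRUE, dimnames = list(mirs, samples)
)

lost <- old_counts - new_counts                      # reads missing in 1.16.2
kept_fraction <- round(new_counts / old_counts, 3)   # share of reads retained

print(lost)
print(kept_fraction)
print(colSums(lost))  # per-sample loss over these three miRNAs only
```

Looking at the retained fraction per miRNA, rather than only the absolute difference, helps show whether the loss is roughly uniform or concentrated in particular miRNAs.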

The problem might be related to the underlying issue of the non-functioning conda isomiRs packages (latest versions) that I already posted here. Specifically, I suspect the changed syntax of the dplyr package, because when running IsomirDataSeqFromFiles() from isomiRs 1.16.2 I get the following warnings (one for each input line, although the output table is still produced):

1: Problem with `mutate()` input `prop`.
ℹ Chi-squared approximation may be incorrect
ℹ Input `prop` is `list(tidy(prop.test(value, total, pctco, alternative = "greater")))`.
ℹ The error occurred in row 6399.

I am using your great package a lot now and otherwise it does exactly what I need, so thank you for your work on this software so far :) I will very much appreciate it if you can look into this issue.

All the input/output files are here.

Thank you!
Karolina

IsomirDataSeqFromFiles rank error

I ran into an issue when trying to read ".mirna" files using IsomirDataSeqFromFiles.

I have 10 .mirna files (isomir.zip) from two groups, which I got from the SeqBuster tool.
I followed the example for the IsomirDataSeqFromFiles in the documentation.
- Defining the path to my .mirna directory
- List of the .mirna files
- data frame for naming my samples

So that I have the following:

> fn_list
  [1] "...path...to...dir/SRR7819263_1.mirna"
  [2] "...path...to...dir/SRR7819264_1.mirna"
  [3] "...path...to...dir/SRR7819265_1.mirna"
  [4] "...path...to...dir/SRR7819266_1.mirna"
  [5] "...path...to...dir/SRR7819267_1.mirna"
  [6] "...path...to...dir/SRR7819268_1.mirna"
  [7] "...path...to...dir/SRR7819269_1.mirna"
  [8] "...path...to...dir/SRR7819270_1.mirna"
  [9] "...path...to...dir/SRR7819271_1.mirna"
 [10] "...path...to...dir/SRR7819272_1.mirna"
> de
   condition
f1    control
f2    control
f3    control
f4    control
f5    control
f6        pre
f7        pre
f8        pre
f9        pre
f10       pre

But I still get the following error in the last step.

> ids <- IsomirDataSeqFromFiles(fn_list, coldata=de)

Error in (function (cond)  :                                                                                       
  error in evaluating the argument 'x' in selecting a method for function 'unique': Problem with `mutate()` column `rank`.
ℹ `rank = 1:n()`.
ℹ `rank` must be size 0 or 1, not 2.
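
One plausible reading of this message, offered as an assumption rather than a diagnosis: in R, `1:n` counts down when `n` is 0, so `1:n()` evaluated on an empty group yields two values (1 and 0) where dplyr expects none. The base R sketch below shows that behaviour without dplyr; `seq_len()` is the usual safe replacement.

```r
n_rows <- 0L  # size of a hypothetical empty group

colon_rank <- 1:n_rows         # counts down from 1 to 0, giving two values
safe_rank  <- seq_len(n_rows)  # zero-length integer vector

print(length(colon_rank))
print(length(safe_rank))
```

If this is what happens here, it would point to an empty table somewhere during import rather than to a problem with the sample sheet itself.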

Session info:

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=sl_SI.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=sl_SI.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=sl_SI.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=sl_SI.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] isomiRs_1.22.0              SummarizedExperiment_1.24.0
 [3] Biobase_2.54.0              GenomicRanges_1.46.1       
 [5] GenomeInfoDb_1.30.1         IRanges_2.28.0             
 [7] S4Vectors_0.32.3            BiocGenerics_0.40.0        
 [9] MatrixGenerics_1.6.0        matrixStats_0.61.0         
[11] DiscriMiner_0.1-29         

loaded via a namespace (and not attached):
  [1] assertive.base_0.0-9        colorspace_2.0-2           
  [3] rjson_0.2.21                ellipsis_0.3.2             
  [5] circlize_0.4.13             XVector_0.34.0             
  [7] GlobalOptions_0.1.2         ggdendro_0.1.22            
  [9] clue_0.3-60                 assertive.sets_0.0-3       
 [11] ggrepel_0.9.1               bit64_4.0.5                
 [13] AnnotationDbi_1.56.2        fansi_1.0.2                
 [15] codetools_0.2-18            splines_4.1.2              
 [17] logging_0.10-108            mnormt_2.0.2               
 [19] doParallel_1.0.16           cachem_1.0.6               
 [21] geneplotter_1.72.0          knitr_1.37                 
 [23] Nozzle.R1_1.1-1             broom_0.7.12               
 [25] annotate_1.72.0             cluster_2.1.2              
 [27] png_0.1-7                   readr_2.1.2                
 [29] compiler_4.1.2              httr_1.4.2                 
 [31] backports_1.4.1             Matrix_1.4-0               
 [33] fastmap_1.1.0               limma_3.50.0               
 [35] cli_3.1.1                   lasso2_1.2-22              
 [37] tools_4.1.2                 gtable_0.3.0               
 [39] glue_1.6.1                  GenomeInfoDbData_1.2.7     
 [41] dplyr_1.0.7                 Rcpp_1.0.8                 
 [43] vctrs_0.3.8                 Biostrings_2.62.0          
 [45] nlme_3.1-155                iterators_1.0.13           
 [47] psych_2.1.9                 xfun_0.29                  
 [49] stringr_1.4.0               DEGreport_1.30.0           
 [51] lifecycle_1.0.1             gtools_3.9.2               
 [53] XML_3.99-0.8                edgeR_3.36.0               
 [55] zlibbioc_1.40.0             MASS_7.3-55                
 [57] scales_1.1.1                vroom_1.5.7                
 [59] hms_1.1.1                   parallel_4.1.2             
 [61] RColorBrewer_1.1-2          ComplexHeatmap_2.10.0      
 [63] memoise_2.0.1               gridExtra_2.3              
 [65] ggplot2_3.3.5               reshape_0.8.8              
 [67] stringi_1.7.6               RSQLite_2.2.9              
 [69] genefilter_1.76.0           foreach_1.5.1              
 [71] caTools_1.18.2              BiocParallel_1.28.3        
 [73] shape_1.4.6                 rlang_1.0.0                
 [75] pkgconfig_2.0.3             bitops_1.0-7               
 [77] lattice_0.20-45             purrr_0.3.4                
 [79] cowplot_1.1.1               bit_4.0.4                  
 [81] tidyselect_1.1.1            GGally_2.1.2               
 [83] plyr_1.8.6                  magrittr_2.0.2             
 [85] DESeq2_1.34.0               R6_2.5.1                   
 [87] gplots_3.1.1                generics_0.1.2             
 [89] DelayedArray_0.20.0         DBI_1.1.2                  
 [91] withr_2.4.3                 pillar_1.7.0               
 [93] survival_3.2-13             KEGGREST_1.34.0            
 [95] RCurl_1.98-1.5              tibble_3.1.6               
 [97] crayon_1.4.2                KernSmooth_2.23-20         
 [99] utf8_1.2.2                  tmvnsim_1.0-2              
[101] tzdb_0.2.0                  GetoptLong_1.0.5           
[103] locfit_1.5-9.4              grid_4.1.2                 
[105] blob_1.2.2                  ConsensusClusterPlus_1.58.0
[107] digest_0.6.29               xtable_1.8-4               
[109] tidyr_1.2.0                 munsell_0.5.0

Any idea why I am getting this error?

Thank you,
Bine

3 Prime untemplated versus Mismatches

Hi, it's me again

I have a question about how the isomiRs/seqbuster pipeline is annotating isomiRs. For example I have these two isomiRs that have been categorised as having untemplated additions:

hsa-miR-22-3p.iso.t5:0.t3:tgt.ad:GGT.mm:0 
hsa-miR-25-3p.iso.t5:0.t3:tga.ad:GGA.mm:0

But I realised they could equally be categorised as having a mismatch at the third base in from the 3' end. Is there a particular reason for favouring one annotation over the other?

Also, if I had changed the argument canonicalAdd to the default TRUE when importing files with IsomirDataSeqFromFiles, would it instead find a mismatch at that position, or would it not be separated out? Or perhaps it would depend on the allele frequency of the mismatch? Or are mismatches effectively not called in the last three positions of the read?
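
To make the two competing readings easier to compare, the name can be split into its labelled fields. The sketch below assumes the dot-separated `key:value` layout visible in the two names above and a miRNA name that contains no dots; `parse_isomir_name` is a hypothetical helper, not an isomiRs function.

```r
# Hypothetical helper: split an isomiR name into miRNA, type and key:value fields.
parse_isomir_name <- function(name) {
  parts <- strsplit(name, ".", fixed = TRUE)[[1]]
  # Everything containing ":" is treated as a labelled field (t5, t3, ad, mm).
  fields <- parts[grepl(":", parts, fixed = TRUE)]
  kv <- strsplit(fields, ":", fixed = TRUE)
  values <- vapply(kv, function(p) p[2], character(1))
  names(values) <- vapply(kv, function(p) p[1], character(1))
  list(mir = parts[1], type = parts[2], fields = values)
}

x <- parse_isomir_name("hsa-miR-22-3p.iso.t5:0.t3:tgt.ad:GGT.mm:0")
print(x$mir)
print(x$fields)
```

Here the `t3` and `ad` fields are both three nucleotides long and `mm` is 0, which is the combination the question above proposes could alternatively be described as a single mismatch.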

Thanks!

IsomirDataSeqFromFiles error

> obj <- IsomirDataSeqFromFiles(files = files[rownames(design)], design = design , header = T)
Error in initialize(value, ...) : 
  cannot use object of class “SummarizedExperiment0” in new():  class “IsomirDataSeq” does not extend that class

Hi lpantano,

Could you give me any suggestions about this error?

Check that the number of files matches the number of samples in colData
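
A minimal sketch of what such a check could look like, in base R. The function name `check_files_vs_coldata`, its arguments, and the example file names are all hypothetical; this is not existing isomiRs code.

```r
# Hypothetical helper: stop early when the file list and the sample table
# disagree in size.
check_files_vs_coldata <- function(files, coldata) {
  if (length(files) != nrow(coldata)) {
    stop(sprintf("%d files but %d rows in colData",
                 length(files), nrow(coldata)))
  }
  invisible(TRUE)
}

# Made-up example: three files, but only two samples described.
files <- c("s1.mirna", "s2.mirna", "s3.mirna")
coldata <- data.frame(condition = c("control", "pre"),
                      row.names = c("s1", "s2"))

msg <- tryCatch(check_files_vs_coldata(files, coldata),
                error = function(e) conditionMessage(e))
print(msg)
```

Failing with an explicit message at this point would be easier to interpret than an error raised later, deep inside the import code.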

Function questions

Hi and thanks for the great isomiRs package, it is super handy.

I have a few questions/suggestions about various functions to ensure that I'm interpreting things correctly

IsomirDataSeqFromFiles()

Would you be able to explain more what you mean by the uniqueMism parameter? Does it mean to only keep mismatches if they are found in one isomiR type? or maybe you could provide an example?
It would be great to have an option to remove untemplated additions that contain 'N's

isoPlot()

When you set type to all you get a side-by-side plot: the left side is labelled 'freq' and the right side 'unique'. I believe that in this plot each line represents a sample, and the position along the plane for each isomiR type indicates the percentage of that type in that sample, i.e. if you added up the points on each plane for a given sample they would total 100%. Let me know if this interpretation is correct.
I am not so clear on what the 'unique' graph is representing, does that mean when a particular isomiR is seen only in a single sample?
For the other isoPlot() types, the y-axis is labelled '# of unique sequences', but the values are actually decimal fractions or proportions, whereas '#' implies a count. For example, when I set type = add, my samples sit around 0.5-0.6 for a 1-bp addition. Is this the proportion of all isomiRs for a particular sample, or the proportion of all 'add' type isomiRs?
For a simplified example, say I have a sample with 10 total isomiRs of various types, a value of 0.6 at 1 bp means 6 of these have a 1 bp addition OR if I have 10 total isomiRs of various types, and 5 of these have an addition of between 1 and 3 bp, and I see a point around 0.6 for 1bp addition that means 0.6*5 so there are 3 isomiRs with a 1bp addition. And does the same logic then follow for the other single plots?
I'm also not overly sure what you mean by 'unique sequences'.
When you say the size of the points is proportional to the total counts, is that the count across all isomiRs?

I guess I am mostly confused by the way you use the term 'unique' and what it means in each context.
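
The two readings of the proportion described above differ only in the denominator, which a toy calculation makes concrete. The data below are made up to mirror the 10-isomiR example in the question; the sketch does not claim to show which denominator isoPlot() actually uses.

```r
# Made-up toy sample: untemplated addition length for 10 isomiRs (0 = none).
add_len <- c(1, 1, 1, 2, 3, 0, 0, 0, 0, 0)

has_one_bp <- add_len == 1

# Reading 1: denominator is every isomiR in the sample.
prop_of_all <- sum(has_one_bp) / length(add_len)

# Reading 2: denominator is only the isomiRs that carry an addition.
prop_of_add <- sum(has_one_bp) / sum(add_len > 0)

print(prop_of_all)
print(prop_of_add)
```

With these numbers the same three isomiRs give 0.3 under the first reading and 0.6 under the second, so the axis label alone cannot distinguish them.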

Thanks again for the cool package

isoNetwork depending on external packages

[ ] Add the best package to get miRNA / target information.
[ ] Use another package for GO enrichment, such as EGSEA: https://bioconductor.org/packages/release/bioc/vignettes/EGSEA/inst/doc/EGSEA.pdf

isoAnnotate error

Thank you so much for this useful package.

When running isoAnnotate I encountered the following error:

Error in `[[<-.data.frame`(`*tmp*`, "edit_mature_position", value = c("NA",  : 
  replacement has 3762 rows, data has 3504

The input was generated using mirtop export, then loaded into isomiRs using IsomirDataSeqFromMirtop. Other isomiRs functions seemed to work correctly. Not sure if this error originated from mirtop or isomiRs.
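
For context, this kind of message is what base R raises whenever a column of the wrong length is assigned into a data frame. The sketch below reproduces it on a made-up three-row table; it says nothing about where the extra rows come from in isoAnnotate.

```r
df <- data.frame(id = 1:3)

# Assigning four values into a three-row data frame triggers the error.
msg <- tryCatch({
  df[["edit"]] <- c("a", "b", "c", "d")
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```

In the reported case the two numbers (3762 and 3504) would then be the length of the computed annotation and the number of rows in the table it is being attached to.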

isomiRs version 1.16.2 was installed using Bioconductor on R 4.0.2.

Thanks in advance for looking into this.

error with IsomirDataSeqFromFiles

Hi,

I am coming up with an error when I try to import the data from files. I was wondering if you had any information on why this particular error might be coming up?

obj <- IsomirDataSeqFromFiles(filelist, design=design, header=T, quiet=FALSE)

Error in if (tab.fil$ratio[tab.fil$subs == row["subs"] & tab.fil$mir ==  : 
  argument is of length zero
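
This error is what R reports when the condition inside `if ()` is empty, for example because a subset matched no rows. The sketch below reproduces it on a made-up table and shows one defensive pattern; the column names follow the error message, but the data and the 0.5 threshold are assumptions for illustration.

```r
# Made-up table with the column names that appear in the error message.
tab <- data.frame(mir = c("a", "b"), ratio = c(0.2, 0.9),
                  stringsAsFactors = FALSE)

hit <- tab$ratio[tab$mir == "c"]  # no row matches, so this is numeric(0)

# An empty condition makes if() fail.
msg <- tryCatch(if (hit > 0.5) "keep" else "drop",
                error = function(e) conditionMessage(e))
print(msg)

# Guarded version: test the length before comparing.
decision <- if (length(hit) == 1 && hit > 0.5) "keep" else "drop"
print(decision)
```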

Thanks!

Gabriel

About isomiRs counts

Hi Lorena,
Thanks once again for this useful package. I have a question which is more biological, I guess. After getting the counts file, if I look at the mature miRNAs (ref) and the isomiRs separately (not merged), I notice that almost all isomiRs (about 95%) are more abundant than the mature miRNAs (ref): they have more raw read counts than the mature ones. Biologically, it seems that this should not be the case, and my team and I are a little concerned about it. Can you share your thoughts on this?

input only 2~3 isomiRs / missing mismatch for 3'trim variation

Hi,
I found two possible bugs while using the package. I think the first one is an isomiRs package issue, but the second one may be related to miRTop.

input only 2~3 isomiRs

I was able to load my data successfully, but I realized the counts did not match the original input. For example, for one miRNA the original input (rawData from miRTop) has over 100 different isomiRs, but I found only 3 of them in the loaded data. It is not a count cutoff, since I still see isomiRs with very low counts; it seems that only the top 2 or 3 expressed isomiRs per miRNA were loaded into the R dataset. Could you check this error? Let me know if you need any information or my data.
Data I tested for input is here, and I used this for input dataset:

de <- data.frame(row.names=c("S400001603","S400001616","S400001617"), condition = c("cc","bb","aa"))
mirData <- IsomirDataSeqFromRawData(read_tsv("mirtop_rawData.tsv"), de)

I saw many warnings during import. Is it possible that the problem is related to them?

Warning messages:
1: In prop.test(value, total, pctco, alternative = "greater") :
  Chi-squared approximation may be incorrect
2: In prop.test(value, total, pctco, alternative = "greater") :
  Chi-squared approximation may be incorrect
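
As background on the warning itself: base R's `prop.test()` emits it whenever an expected count under the null proportion is below 5. The sketch below triggers it with made-up numbers (1 success out of 10, null proportion 0.1) and shows `binom.test()` as an exact alternative. How isomiRs chooses `value`, `total` and `pctco` is not shown here, and whether the warning is connected to the missing isomiRs is an open question.

```r
# Expected successes under the null are 10 * 0.1 = 1, which is below 5,
# so the chi-squared approximation is flagged.
warn <- tryCatch({
  prop.test(1, 10, p = 0.1, alternative = "greater")
  "no warning"
}, warning = function(w) conditionMessage(w))
print(warn)

# The exact binomial test has no such restriction.
exact <- binom.test(1, 10, p = 0.1, alternative = "greater")
print(exact$p.value)
```

The warning therefore mainly signals low counts in the tested row, which would be expected for rarely observed sequences.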

missing mismatch for 3'trim variation

I also found an error in the rawData file exported from miRTop. If a sequence has a 5' or 3' variation, a mismatch (nucleotide variation) in the same sequence is not recorded alongside it. Could you also check this?
Here is an example... Those sequences have different mismatches and 3' trimming together, but mism is 0 in the table.

seq	mir	t3	S400001603	S400001616	S400001617
TCAGGTAGTAGGTTGTAT	hsa-let-7a-5p	agtt	2	1	1
TGAGGTAGTAGGTTGTAA	hsa-let-7a-5p	agtt	5	6	11
TGAGGTAGTAGGTTGTAT	hsa-let-7a-5p	agtt	197	166	351
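
The mismatches in these three rows can be located with a short base R comparison. The sketch assumes that the third sequence is the unmodified hsa-let-7a-5p sequence after the 3' trimming, so it is used as the reference; `mismatch_positions` is a hypothetical helper for equal-length sequences only.

```r
# Assumption: the third row is the trimmed reference sequence.
reference <- "TGAGGTAGTAGGTTGTAT"
reads <- c("TCAGGTAGTAGGTTGTAT", "TGAGGTAGTAGGTTGTAA", "TGAGGTAGTAGGTTGTAT")

# Hypothetical helper: 1-based positions where a read differs from the reference.
mismatch_positions <- function(read, ref) {
  stopifnot(nchar(read) == nchar(ref))
  r <- strsplit(read, "", fixed = TRUE)[[1]]
  f <- strsplit(ref, "", fixed = TRUE)[[1]]
  which(r != f)
}

positions <- lapply(reads, mismatch_positions, ref = reference)
names(positions) <- reads
print(positions)
```

Under that assumption the first read differs at position 2 and the second at position 18, while the third matches exactly, which is the information expected in the mismatch column.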

I used the development version for both miRTop and isomiR package which I installed last week.

Thanks!!
Kicheol

PLS-DA plot interpretation

Hi,
This might sound like a very basic question, but I am a little confused by my PLS-DA analysis in isomiRs.
Could you please help me interpret this plot? I am aware that it shows the separation of the data, but I am wondering why the upper right corner is blank. I am trying to see the difference between two age groups.

About isoNetwork

Hi Lorena,
I wanted to know about this function 'isoNetwork'.

The function needs 'mirna_rse' and 'gene_rse', but I could not get a clear idea of what they are from the details given in the documentation. The document does mention how to get these two objects, but the text is cut off at the end of the PDF and not fully printed. See the image (highlighted).