simple GRep ADapter Analyser

License: MIT License

GRADA - R-Package

simple GRep ADapter Analyser

R script utilizing Unix shell tools for adapter (sequence) analysis in a read file.

Many programs like fastp, FastQC, and PRINSEQ are great for analyzing and preprocessing NGS read files. However, they are relatively complex programs, and I wanted to see, at an easy-to-understand level, the contamination of a specific sequence (e.g. the adapter) in a read (fastq) file. This is possible with the grep / agrep / wc commands. This script gives an overview of the contamination of a sequence in your read files.
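As a rough illustration of the grep-based idea, the following sketch counts the lines of a FASTQ file that contain a given sequence. The helper `count_hits` and the tiny example file are made up for this illustration and are not part of GRADA; the call assumes that `grep` is available on the system.

```r
# Hypothetical helper (not a GRADA function): count lines containing `sequence`.
# -F treats the sequence as a fixed string, -c prints the number of matching lines.
count_hits <- function(fastq, sequence) {
  out <- system2("grep", c("-F", "-c", shQuote(sequence), shQuote(fastq)),
                 stdout = TRUE)
  as.integer(out)
}

# Self-made FASTQ file with two reads; only the first contains the adapter.
fq <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGTAGATCGGAAGAGCTT", "+", "IIIIIIIIIIIIIIIIIII",
             "@read2", "ACGTACGTACGTACGTACG", "+", "IIIIIIIIIIIIIIIIIII"), fq)

n_hits <- count_hits(fq, "AGATCGGAAGAGC")
print(n_hits)
```

Exact matching like this does not allow mismatches; that is where agrep comes in.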

For a complete introduction, see the vignettes.

System requirements:

UNIX System (developed on Linux Mint 20.1 Cinnamon)
R-Studio
Suggested R packages:
- DT
- parallel
- knitr
- rmarkdown

Installation:

To install the master branch (latest stable release), just run: devtools::install_github("LucasFVoges/GRADA", build_vignettes = TRUE)
or, for the latest development version:
devtools::install_github("LucasFVoges/GRADA", branch = "dev", build_vignettes = TRUE)

You can also install a local version: download the latest version or one of the releases, set your working directory to the GRADA folder, and then run:
devtools::install()

BEFORE USING THE SCRIPT

Please note that GRADA will create a temp/ folder in your working directory. It saves the results there, but also the .txt files that contain the corresponding reads.

These files can be very big and can be deleted afterwards!
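A minimal sketch for checking how much space such a folder occupies and for removing it once the results are no longer needed. Both helpers are hypothetical and not part of GRADA. Note that deleting the whole folder also deletes the result files stored in it, so do this only after rendering.

```r
# Hypothetical helpers (not GRADA functions).

# Total size of all files below `dir`, in MB.
temp_dir_size_mb <- function(dir = "temp") {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  sum(file.size(files)) / 1024^2
}

# Delete the folder including everything in it; TRUE if it is gone afterwards.
remove_temp_dir <- function(dir = "temp") {
  unlink(dir, recursive = TRUE)
  !dir.exists(dir)
}

# Demo on a throwaway folder instead of a real temp/ folder.
demo_dir <- file.path(tempdir(), "grada_demo_temp")
dir.create(demo_dir)
writeLines(rep("ACGT", 1000), file.path(demo_dir, "reads.txt"))

size_before <- temp_dir_size_mb(demo_dir)
removed <- remove_temp_dir(demo_dir)
```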

Usage:

library(GRADA)

# recommended at the moment:
library(parallel) 
library(DT)

You can load some example data:

read1 <- system.file("extdata", "grada_R1.fastq", package = "GRADA")
read2 <- system.file("extdata", "grada_R2.fastq", package = "GRADA")
seq <- system.file("extdata", "adapter_list.txt", package = "GRADA")

Then you can call the analyze functions. The table and plot functions render the results. For rendering, only "grada_table.txt" and "adapter_positions.Rdata" are needed.

grada_analyze(PE = TRUE, seq = seq, read1 = read1, read2 = read2)
grada_analyze_positions(PE = TRUE, readlength = 150)
grada_table()
grada_plot()

There are additional options to these functions.

Table

For the table, there are these options:

# For a kable-table (grada_table() is the same function):
grada_table_simple()
# For a rmarkdown-table (requires "rmarkdown" package):
grada_table_md()
# For a DT interactive table (requires "DT" package):
grada_table_DT()

You can also use your own table script. Load the data with: load("temp/Adapter_Positions.Rdata")
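The names of the objects stored in that file are not listed here, so one cautious way to start a custom table script is to load the file into a separate environment and inspect what it contains. The helper `load_into_env` is hypothetical, and the demo uses a stand-in file with a made-up object rather than real GRADA output.

```r
# Hypothetical helper (not a GRADA function): load an .Rdata file into its own
# environment so that nothing in the global workspace gets overwritten.
load_into_env <- function(path) {
  env <- new.env()
  load(path, envir = env)
  env
}

# Stand-in file for the demo; the object name `positions` is made up.
demo_file <- tempfile(fileext = ".Rdata")
positions <- matrix(0L, nrow = 2, ncol = 5)
save(positions, file = demo_file)

env <- load_into_env(demo_file)
loaded_names <- ls(env)   # names of the objects found in the file
print(loaded_names)
```

With the real file, the same call would be load_into_env("temp/Adapter_Positions.Rdata"), followed by ls() on the result.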

Plot

For the plots, there is this option:

# For a standard barplot (grada_plot() is the same function):
grada_plot_bar()

Example:

GRADA comes with an example vignette and example data (very basic).

See the vignettes:

browseVignettes("GRADA")

or:

library(GRADA)   
vignette("example")

grada's Issues

temp data delete option

The temp data can occupy a lot of space.
An optional "keepData" argument would be nice, to delete the data on the fly if wanted (analyze_positions needs them, though!).

update Test function output

it is a little better now in grada_analyze!

Please document appropriately

AIs connected to #33

Alternative Barplot

Interactive plots (maybe ggplot or shiny) would look nice, and one could see the exact numbers.

Better Mismatches: Make a mismatch per x bases option

This could be nice.

For example, 1 mismatch every 10 bases, to have a more comparable value across sequences.
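One possible reading of this idea, as a sketch: derive the allowed number of mismatches from the sequence length. The function name and the `per_bases` parameter are assumptions for illustration, not an existing GRADA option.

```r
# Hypothetical helper: allow one mismatch per `per_bases` bases of the sequence,
# rounded down (a 13-base adapter with per_bases = 10 gets 1 mismatch).
mismatches_per_length <- function(sequences, per_bases = 10) {
  floor(nchar(sequences) / per_bases)
}

adapters <- c("AGATCGGAAGAGC", "CTGTCTCTTATACACATCT", strrep("A", 25))
allowed <- mismatches_per_length(adapters)
print(data.frame(length = nchar(adapters), allowed = allowed))
```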

Better Mismatches: Include N as a non-mismatch character?

If an N (or another non-standard base) is found, should it be a mismatch?
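To make the two options concrete, here is a small sketch that counts mismatches between an adapter and an equally long stretch of a read, with a switch that decides whether N counts as a mismatch. This is a plain position-by-position comparison for illustration only; it is not how agrep works, and the function is hypothetical.

```r
# Hypothetical helper: position-wise mismatch count between two strings of
# equal length. With n_is_mismatch = FALSE, an N in the read never counts.
count_mismatches <- function(read_part, adapter, n_is_mismatch = TRUE) {
  stopifnot(nchar(read_part) == nchar(adapter))
  r <- strsplit(read_part, "")[[1]]
  a <- strsplit(adapter, "")[[1]]
  differs <- r != a
  if (!n_is_mismatch) {
    differs <- differs & r != "N"
  }
  sum(differs)
}

strict  <- count_mismatches("ANNA", "ACGT")                         # N counts
lenient <- count_mismatches("ANNA", "ACGT", n_is_mismatch = FALSE)  # N ignored
```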

Adapter List Duplicate will be a bug

If a sequence appears twice in the adapter_list (seq), it will not work, because the sequence is used as the name!

Because there is no need to search for the same sequence twice, this only needs a check!

Solution:
Add a check at the beginning to see whether a sequence is duplicated!
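Such a check could look roughly like the following sketch. The function name is hypothetical; it assumes the adapter list has already been read into a character vector.

```r
# Hypothetical helper: stop early if the adapter list contains duplicates.
check_adapter_list <- function(sequences) {
  dup <- unique(sequences[duplicated(sequences)])
  if (length(dup) > 0) {
    stop("Duplicated sequences in adapter list: ", paste(dup, collapse = ", "))
  }
  invisible(sequences)
}

ok <- check_adapter_list(c("AGATCGGAAGAGC", "CTGTCTCTTATACACATCT"))

# A duplicated entry raises an error, captured here for demonstration.
res <- tryCatch(check_adapter_list(c("ACGT", "TTTT", "ACGT")),
                error = function(e) conditionMessage(e))
print(res)
```

An alternative would be to drop duplicates silently with unique(), at the cost of hiding a possible mistake in the input file.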

add Test functions

The analyze and plot functions perform basic tests, for example whether m_min is smaller than m_max.

This could be done in an extra function!
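A sketch of what such a separate function might look like. The name `check_mismatch_range` and the exact rules (non-negative values, m_min not larger than m_max) are assumptions for illustration.

```r
# Hypothetical validation helper that analyze and plot functions could share.
check_mismatch_range <- function(m_min, m_max) {
  if (!is.numeric(m_min) || !is.numeric(m_max)) {
    stop("m_min and m_max must be numeric")
  }
  if (m_min < 0 || m_max < 0) {
    stop("m_min and m_max must not be negative")
  }
  if (m_min > m_max) {
    stop("m_min must not be larger than m_max")
  }
  invisible(TRUE)
}

valid <- check_mismatch_range(0, 2)
invalid <- tryCatch(check_mismatch_range(3, 1),
                    error = function(e) conditionMessage(e))
print(invalid)
```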

split functions into calc and display to reduce recomputing

This is a good idea.

barplot bug if no or one plot will be available

There is a small bug when plotting only one sequence plot, because the rownames are deleted as soon as only one row is left...
see comment in function grada_plot()
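This looks like the usual R behaviour where subsetting a matrix down to a single row drops the dimensions. Whether this is the actual cause inside grada_plot() is an assumption; the sketch below only demonstrates the general effect and the `drop = FALSE` remedy on a made-up matrix.

```r
# Made-up count matrix with one row per adapter.
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("adapterA", "adapterB", "adapterC"),
                                 c("R1", "R2")))

# Selecting a single row returns a plain vector: the row name is lost.
one_dropped <- counts["adapterA", ]

# drop = FALSE keeps the result a 1-row matrix, including its row name.
one_kept <- counts["adapterA", , drop = FALSE]

print(rownames(one_dropped))
print(rownames(one_kept))
```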

Skip plots rework please

I have changed the code for skip = TRUE in plot_bar().

But now it no longer considers R1 and R2 together, which I found very good when the plots are printed in an arrangement. (So a skip would only happen if both R1 and R2 are empty.)

Can we implement this again?

Better Mismatches: max mismatches rule

It seems that agrep allows a maximum of 8 mismatches. Maybe check this on input.

example vignette is broken

why?

FASTA support

It would be nice to also search FASTA files.

agrep has the option: -d '$'
Searching line-wise is the default, but we could change that to search the complete sequence, and with -c return the number of results.

multiple finding in one read?

OK, I need to test this, but at the moment only one finding per read is possible.

Is this not true for the position findings?!

Check for Mismatch greater than shortest adapter?

This can be a mistake (rarely a real use case).

If the adapter sequence is 2 bases long and mismatch is set to 3, it will break agrep and it will break the position detection. But will the position detection give an error?

After length(adapter) this can be checked very easily, and then this could be skipped with stop("message")?
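A minimal sketch of such an input check. The function name is hypothetical, and the rule used here (stop when the mismatch setting exceeds the length of any sequence) is just one reading of the issue.

```r
# Hypothetical helper: stop if the mismatch setting is larger than the
# length of at least one adapter sequence.
check_mismatch_vs_length <- function(sequences, max_mismatch) {
  too_short <- sequences[nchar(sequences) < max_mismatch]
  if (length(too_short) > 0) {
    stop("Mismatch setting (", max_mismatch, ") exceeds the length of: ",
         paste(too_short, collapse = ", "))
  }
  invisible(TRUE)
}

fine <- check_mismatch_vs_length(c("AGATCGGAAGAGC", "ACGTACGT"), max_mismatch = 3)
broken <- tryCatch(check_mismatch_vs_length(c("AGATCGGAAGAGC", "AC"), max_mismatch = 3),
                   error = function(e) conditionMessage(e))
print(broken)
```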

Statistical knowledge of sequence distribution

The knowledge of

adapter length
position
(mismatches)
input data

can be used to see whether a contamination occurs randomly or not. Then the noise could be filtered out and only the true contamination would be visible.
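As a very simple baseline for "random", one can estimate how often a sequence of a given length would occur by chance. The sketch below assumes uniformly distributed, independent bases and exact matches only, which real read data will not satisfy exactly; the function name is hypothetical.

```r
# Hypothetical helper: expected number of chance occurrences of one specific
# sequence of length `adapter_length` in `n_reads` reads of length `read_length`.
# Each of the (read_length - adapter_length + 1) start positions matches with
# probability 1 / 4^adapter_length under the uniform-base assumption.
expected_random_hits <- function(adapter_length, read_length, n_reads) {
  positions <- pmax(read_length - adapter_length + 1, 0)
  n_reads * positions / 4^adapter_length
}

# Short sequences occur by chance quite often, long ones practically never.
short_hits <- expected_random_hits(adapter_length = 6,  read_length = 150, n_reads = 1e6)
long_hits  <- expected_random_hits(adapter_length = 20, read_length = 150, n_reads = 1e6)
print(c(short = short_hits, long = long_hits))
```

Observed counts far above such a baseline would point to a true contamination rather than noise.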

@SEQ_ID will be searched!

At the moment, all sequences are searched in the complete fastq file, so the @SEQ_ID lines are searched as well.

Files with an index or #NNNNNN in the header will easily give you wrong results!

Solution:

Skip the lines with "@" at the beginning. Is this possible with agrep?
Or delete these lines beforehand! Removing blank lines may also be a good idea.

For the time being, check whether this can be a problem in your data.

quick solution:

sed '/^@/d' file
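One caveat with the sed approach: quality lines can also start with "@" (it is a valid quality character in the common Phred+33 encoding), so they would be removed too. An alternative sketch in R selects the sequence lines by position instead. It assumes the usual 4-line FASTQ records (no wrapped sequences); the helper name is hypothetical.

```r
# Hypothetical helper: return only the sequence lines of a FASTQ file,
# i.e. line 2 of every 4-line record.
sequence_lines <- function(fastq) {
  lines <- readLines(fastq)
  if (length(lines) < 2) {
    return(character(0))
  }
  lines[seq(2, length(lines), by = 4)]
}

# Demo file: the second record's quality line starts with "@".
fq <- tempfile(fileext = ".fastq")
writeLines(c("@read1#ACGTAC", "TTTTGGGG", "+", "IIIIIIII",
             "@read2#ACGTAC", "ACGTACCC", "+", "@IIIIIII"), fq)

seqs <- sequence_lines(fq)

# The sequences could then be written to a file and searched with grep / agrep.
seq_file <- tempfile(fileext = ".txt")
writeLines(seqs, seq_file)
```

Note that readLines() loads the whole file into memory, which may not be suitable for very large read files.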

pdf-output table

see example:

Warning message:
In get_engine(options$engine) :
  Unknown language engine 'txt' (must be registered via knit_engines$set()).

Better Mismatches: Include Quality Score?

Another variable for accepting a mismatch or not.

Actually, a very bad quality score is comparable to an N, so a good question is whether it is a mismatch at all.

readlength is a problematic variable

In the analyze_positions() function, the readlength is used to generate the matrix.

If the reads are longer than that, what will the matrix do?

In the end, it would be nice to solve this by determining the maximum read length during grada_analyze()!
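A sketch of how the maximum read length could be determined from a FASTQ file. It assumes 4-line records and reads the whole file into memory, so it is meant as an illustration rather than something suited to very large files; the helper name is hypothetical.

```r
# Hypothetical helper: longest sequence line in a FASTQ file
# (line 2 of every 4-line record).
max_read_length <- function(fastq) {
  lines <- readLines(fastq)
  if (length(lines) < 2) {
    return(0L)
  }
  max(nchar(lines[seq(2, length(lines), by = 4)]))
}

# Demo file with reads of length 8 and 12.
fq <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGTACGT", "+", "IIIIIIII",
             "@read2", "ACGTACGTACGT", "+", "IIIIIIIIIIII"), fq)

readlength <- max_read_length(fq)
print(readlength)
```

The result could then be passed on as the readlength argument instead of a manually chosen value.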

position finding R - skip for too big files

Big files are still a problem when loaded into R. It eventually crashes.

I would suggest that some adapters are skipped (small ones, for example).
This needs to be done in such a way that the file still exists, but its values are just set to 0!

position finding - input correction of reads with lower mismatches

The position finding currently works by searching the files of reads for a specific adapter and mismatch setting.

The 2-mismatch file, for example, contains reads with 2, 1, and 0 mismatches.
To reduce duplication and the problem of shifting positions (see wiki-mismatches), one could delete all M0 and M1 reads from the 2M file. The 2-mismatch file would then only contain reads with exactly 2 mismatches!

An option for this could be considered. This could also be applied to the table!
But it needs to be directly visible which mode is used (<=2M | =2M).
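The set logic behind this idea can be sketched as follows. It assumes that the reads found for each mismatch level are available as vectors of read identifiers and that the levels are nested (every read found with 1 mismatch allowed is also found with 2 allowed). The function and the example identifiers are made up.

```r
# Hypothetical helper: turn cumulative hit lists ("<= k mismatches") into
# exact ones ("== k mismatches"). Element 1 is M0, element 2 is <=M1, and so on.
exact_mismatch_reads <- function(reads_by_mismatch) {
  out <- reads_by_mismatch
  for (k in seq_along(out)[-1]) {
    # Remove everything that was already found with one mismatch less.
    out[[k]] <- setdiff(reads_by_mismatch[[k]], reads_by_mismatch[[k - 1]])
  }
  out
}

cumulative <- list(M0 = c("read1"),
                   M1 = c("read1", "read2"),
                   M2 = c("read1", "read2", "read3", "read4"))
exact <- exact_mismatch_reads(cumulative)
print(exact)
```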

make parallel an optional thing

At the moment, GRADA always uses the mclapply() function. This can easily be switched to lapply() without big changes.

Solution:
if (numCores == 1){ use lapply() }
else { use mclapply() }
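In R, this switch could be wrapped in a small helper like the sketch below. The helper name is hypothetical; `numCores` follows the name used in the pseudocode above. Because mclapply() is only reached when more than one core is requested, the parallel package is not needed for the single-core case.

```r
# Hypothetical helper: use lapply() for one core, parallel::mclapply() otherwise.
apply_maybe_parallel <- function(X, FUN, ..., numCores = 1) {
  if (numCores == 1) {
    lapply(X, FUN, ...)
  } else {
    # mclapply() relies on forking, which is not available on Windows.
    parallel::mclapply(X, FUN, ..., mc.cores = numCores)
  }
}

squares <- apply_maybe_parallel(1:3, function(x) x^2)
print(squares)
```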

Mismatch analysis in Histogram

The histogram does not support mismatch analysis.
Displaying a barplot with different colors depending on the mismatch level of the adapter would be an interesting addition!