Git Product home page Git Product logo

grada's Introduction

GRADA - R-Package

simple GRep ADapter Analyser

R-Script utilazing the unix bash powers for adapter (sequence) analysis in a read file.

Many programms like fastp, fastQC, PRINSEQ are great for analyzing and preprocessing NGS read files. Though they are relativ complex programms and I wanted to see on a easy to understand level the contamination of a specific sequence (eg the adapter) in a read (fastq) file. This is possible by grep / agrep / wc commands. This scipt allows to get an overview of the contamination of a sequence in your read files.

A complete introduction: see in Vignettes

System requirements:

  • UNIX System (developed on Linux Mint 20.1 Cinnamon)
  • R-Studio
  • R-packages are suggested:
    • DT
    • parallel
    • knitr
    • rmarkdown

Installation:

To install the master brach (latest stable release) just run: devtools::install_github("LucasFVoges/GRADA", build_vignettes = TRUE)
or for latest development version:
devtools::install_github("LucasFVoges/GRADA", branch = "dev", build_vignettes = TRUE)

You can install also local version of this by simply download the latest or one of the releases, put your workspace inside the GRADA folder. and then:
devtools::install()

BEFORE USING THE SCRIPT

Please note, that GRADA will create an temp/ folder in your working directory. It will save the results here but also the .txt files wich will have the corresponding reads inside.

These files can be very big and can be deleted afterwards!

Usage:

library(GRADA)

# recommended at the moment:
library(parallel) 
library(DT)

You can load some example data:

read1 <- system.file("extdata", "grada_R1.fastq", package = "GRADA")
read2 <- system.file("extdata", "grada_R2.fastq", package = "GRADA")
seq <- system.file("extdata", "adapter_list.txt", package = "GRADA")

Then you can call the analyze functions. The table and plot function will render the results. For rendering only the "grada_table.txt" and "adapter_positions.Rdata" are needed.

grada_analyze(PE = TRUE, seq = seq, read1 = read1, read2 = read2)
grada_analyze_positions(PE = TRUE, readlength = 150)
grada_table()
grada_plot()

There are additional options to these functions.

Table

For the table there is:

# For a kable-table:
grada_table_simple() = grada_table()
# For a rmarkdown-table (requires "rmarkdown" package):
grada_table_md()
# For a DT interactive table (requires "DT" package):
grada_table_DT() 

But you could use your own table-script. you can load the data with: load("temp/Adapter_Positions.Rdata")

Plot

For the plots there is:

# For a standard barplot:
grada_plot_bar() = grada_plot()

Example:

GRADA comes with an example vignette and example data (very basic).

See in Vignettes

browseVignettes("GRADA")

or:

library(GRADA)   
vignette("example")

grada's People

Contributors

lucasfvoges avatar

Stargazers

 avatar

Watchers

 avatar

grada's Issues

temp data delete option

The temp data can occupy a lot of space.
a optional "keepData" would be nice to delete the data on the fly if wanted (analyze_positions needs them though!)

Alternative Barplot

interactive plots (maybe ggplot or shiny) would look nice and one could see the exact numbers.

Adapter List Duplicate will be a bug

if a sequence in the adapter_list (seq) is doubled, it will not work, beacause the sequence is taken as the name!

Beacause there is no need to have a doubled sequence search this is only to check!

Solution:
Add check to begin, if sequence is doubled!

add Test functions

analyze and plot functions make basic tests for example, if a m_min is smaller then m_max.

This could be done in an extra function!

Skip plots rework please

I have changed the code for skip=True in plot_bar()

But now it will not count for R1/R2 together, what I find very good if printed in an arrangement. (So skip will only happen if R1 and R2 is empty.)

Can we implemenmt this again?

FASTA support

It would be nice to search also fasta files

Agrep hast the option: -d '$'
to search line wise is standard. But we could change that to search the complete sequence. and with -c give the number of results back.

multiple finding in one read?

Ok, I need to test this, but at the moment per read 1 finding is possible.

This is not true for the position findings?!

Check for Mismmatch greater than shortest adapter?

This can be a mistake (rarely a use case scenario)

if adapter seq is 2 bases and mismatch is set to 3, it will break the agrep, it will break the position detection. But in the position detection it will give an error?

after length(adapter) this can be checked very easily. and then this could be skipped with stop("massage")?

Statistical knowlege of sequence distribution

The knowlege of

  • adapter length
  • position
  • (mismatches)
  • input data

can be used to see if a contamination is randomly accouring or not. Then the noise could be filtered out and only the true contamination will be visible.

@SEQ_ID will be searched!

At this moment all sequences will be searched in the complete fastq file, therefore also the @SEQ_ID will be part of it.

  • files with Index or #NNNNNN will be easely getting you wrong results!

Solution:

  • well, skip the lines with "@" in the beginning. Is this possible with agrep?
  • Or delete these lines before! Maybe also blank lines is a good idea.

For the time beeing, check if this can be a problem in your data..

quick solution:

sed '/^@/d' file

pdf-output table

see example:

Warnmeldung:
In get_engine(options$engine) :
  Unknown language engine 'txt' (must be registered via knit_engines$set()).

readlength is a problematic variable

In the analyze_positions() function the readlength will genereate the matrix.

If the length of the reads is longer, the matrix will do what?

In the end it would nice to solve this by getting the maximum length during grada_analyze() !!!

position finding R - skip for too big files

big files are still the problem if loaded into R. It eventually crashes.

I would suggest, that adapters are partially skipped (small ones for example.)
This needs to be in this way that the file will be there but they will just set to 0!

position finding - input correction of reads with lower mismatches

The position finding is working right now in the way that the files of reads with a specific adapter and mismatch is searched.

In 2 Mismatches for examples are reads with 2,1,0 mismatches.
To reduce duplication and the problem of shifting positions (see wiki-mismatches) one could delete all M0,M1 reads in 2M file. Therefore 2 Mismatch will only contain reads with exact 2 mismatches!

an option for this coul'd be considered. This can be also applied for the table!
But it needs to be visible directly (<=2M | =2M)

make parallel an optional thing

At the moment GRADA will use always the mclapply() function. This can be easely set to lapply() without big changes.

Solution:
if (numCores == 1){ use lapply() }
else { use mclapply() }

Mismatch analysis in Histogramm

The histogramm does not support mismatch analysis.
To display a barplot with different colors depending on the mismatch of the adapter will be interesting to add!

Alternative Table

To loose DT dependency a simpler table could be used as an alternative function

adapter_file handling

The file handling for the adapters is a little unflexible!

if not used ">" at the beginning, everything will be messed up.

maybe a little more robust and flexible can handle exceptions in this file better.

? ideas

Make Options dependent on input

Maybe for later, if the output is complete.

The PE=FLASE is clear after a the table is created. no need to have this included in the other functions.

AGREP Maximum?

Manpage of agrep: "Records are limited to 48K, and may be truncated if they are larger than that. The limit of record length can be changed by modifying the parameter Max_record in agrep.h."

This should be tested! And why does it say maybe truncated?

R (>= 3.5.0) dependency?

There is a warning durign devtools::install()
It seems only beacause of temp .Rdata ...

But maybe just include it in Depends: description

better example data

The example data is not very good.

a slightly bigger data set and better adapter contamination would be great.

grada_plot_bar(_full) ERROR in KILLS

There is an error if all adapters are found and skipped = TRUE in the plot function

it will ask to delete NULL from a table - a check if NULL is neccessary!

Pleas FIX!

NumCores can become bleow 1

If only one core is available, the "numCores()/2" will fail.

Can we fix this to get it to 1?

Beacause this is the standard setting and will therefore be used e.g. on AWS.

Input whitespace problem

In the function read = "reads.fastq.gz " the whitespace makes a problem.

I have corrected this in the compressed check with trimws.

But it seems good to have this correction done in the beginning maybe!

Read File whitespaces

It seems that whitespaces in file name will break it.

  • Error in the agrep command.

this is compleetly logic.

Solution:

  • Add whitespace chars in string for system command.
  • Or just do not allow whitespaces!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.