GSEAdv

Project Status: Abandoned. Initial development has started, but there has not yet been a stable, usable release; the project has been abandoned and the author(s) do not intend on continuing development.

The goal of GSEAdv is to provide methods to work with gene set collections.

This package is abandoned because a new package is being developed to work with gene sets under Bioconductor. My own efforts are now in BaseSet.

GSEAdv is based on the relationship between genes and gene sets under this schema:

Figure: Schema of gene sets

It provides methods to understand the relationships between each property of the schema and the schema as a whole.

Installation

GSEAdv is an R package. You will be able to install it from the Bioconductor project with:

## install.packages("BiocManager") 
BiocManager::install("GSEAdv")

You can install this version of GSEAdv with:

## install.packages("devtools")
devtools::install_github("llrs/GSEAdv")

How does it work?

It is simple: load the package and learn from your data!

# Load some data
library("GSEAdv")
fl <- system.file("extdata", "Broad.xml", package = "GSEABase")
gss <- getBroadSets(fl)
gss
## GeneSetCollection
##   names: chr5q23, chr16q24 (2 total)
##   unique identifiers: ZNF474, CCDC100, ..., TRAPPC2L (215 total)
##   types in collection:
##     geneIdType: SymbolIdentifier (1 total)
##     collectionType: BroadCollection (1 total)
summary(gss)
## Genes: 215
##  Gene in more pathways: 1 pathways
##  h-index: 0 genes with at least 0 pathways.
## Pathways: 2
##  Biggest pathway: 129 genes
##  h-index: 1 pathways with at least 1 genes.
## All genes in a single gene set.

This tells us that each gene in the GeneSetCollection belongs to only one gene set.

We can try with a bigger dataset, one derived from human gene pathways in KEGG:

summary(genesKegg)
## Genes: 5869
##  Gene in more pathways: 51 pathways
##  h-index: 0 genes with at least 0 pathways.
## Pathways: 228
##  Biggest pathway: 1130 genes
##  h-index: 15 pathways with at least 15 genes.
## IC(genesPerPathway): 6.65 ( 0.96 of the maximum)
## IC(pathwaysPerGene): 2.47 ( 0.48 of the maximum)

Knowing that it has so many pathways and genes, we can learn how they relate. The number of genes per pathway in the collection is:

gpp <- genesPerPathway(genesKegg)
plot(table(gpp))

Figure: Distribution of the number of genes per gene set.

We can see that most gene sets have a low number of genes, but one gene set (01100) has 1130 genes. The genes might also be associated with many gene sets. Is that just as extreme? Let's see:

ppg <- pathwaysPerGene(genesKegg)
plot(table(ppg))

Figure: Distribution of the number of gene sets per gene.

Not so extreme: one gene (5594) appears in 51 gene sets.
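For intuition, both counts can be reproduced in base R on a plain named list of gene identifiers. This is only a sketch under stated assumptions: the toy `sets` list and the helper names `genes_per_set` and `sets_per_gene` are hypothetical and are not part of GSEAdv, whose `genesPerPathway()` and `pathwaysPerGene()` work on a GeneSetCollection.

```r
# Toy collection: a named list with one character vector of gene IDs per gene set.
# (Hypothetical data, not taken from KEGG.)
sets <- list(
  setA = c("g1", "g2", "g3"),
  setB = c("g2", "g3"),
  setC = c("g3", "g4")
)

# Number of genes in each gene set.
genes_per_set <- function(sets) lengths(sets)

# Number of gene sets each gene appears in.
# unique() guards against a gene listed twice in the same set.
sets_per_gene <- function(sets) {
  table(unlist(lapply(sets, unique), use.names = FALSE))
}

gps <- genes_per_set(sets)
spg <- sets_per_gene(sets)

# The same kind of plots as above, on the toy data:
# plot(table(gps)); plot(table(spg))
```

Here `g3` belongs to all three toy sets, which is the small-scale analogue of gene 5594 appearing in 51 KEGG gene sets.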

To see which gene sets are included in other gene sets we can use nested:

nested(genesKegg)[1:10, 80:90]
##       00970 00980 00982 00983 01040 01100 02010 03008 03010 03013 03015
## 00010     0     0     0     0     0     0     0     0     0     0     0
## 00020     0     0     0     0     0     1     0     0     0     0     0
## 00030     0     0     0     0     0     0     0     0     0     0     0
## 00040     0     0     0     0     0     0     0     0     0     0     0
## 00051     0     0     0     0     0     0     0     0     0     0     0
## 00052     0     0     0     0     0     0     0     0     0     0     0
## 00053     0     0     0     0     0     0     0     0     0     0     0
## 00061     0     0     0     0     0     1     0     0     0     0     0
## 00062     0     0     0     0     0     1     0     0     0     0     0
## 00071     0     0     0     0     0     0     0     0     0     0     0

As expected, the pathway with more than 1100 genes has other pathways inside it.
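The idea behind such a matrix can be sketched in base R. This is not the implementation of `nested()`; it assumes, based only on the output above, that a 1 in row i and column j means every gene of set i is also in set j. The function name `nested_sketch`, the toy data, and the choice to put 0 on the diagonal are all hypothetical.

```r
# Toy collection (hypothetical): "small" is fully contained in "big".
sets <- list(
  small = c("g1", "g2"),
  big   = c("g1", "g2", "g3"),
  other = c("g4")
)

# Build a square 0/1 matrix: entry [i, j] is 1 when every gene of set i
# is also present in set j (i != j), and 0 otherwise.
nested_sketch <- function(sets) {
  n <- length(sets)
  m <- matrix(0L, nrow = n, ncol = n,
              dimnames = list(names(sets), names(sets)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i != j && all(sets[[i]] %in% sets[[j]])) {
        m[i, j] <- 1L
      }
    }
  }
  m
}

m <- nested_sketch(sets)
```

In this sketch the row is the contained set and the column is the containing set, so `m["small", "big"]` is 1 while `m["big", "small"]` is 0.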

You can see the vignettes for more examples.

Who will use this repo or project?

It is intended for bioinformaticians, both people interested in comparing databases and people developing analyses using the information provided by GSEAdv.

What is the goal of this project?

The goal of this project is to be able to understand the gene set collections available.

What can GSEAdv be used for?

  • Measure properties of the genes and pathways given a GeneSetCollection: the number of genes per pathway, the probability of having x genes in more than y pathways...
  • Compare pathway databases: by looking at the differences between them.
  • Select the gene set collection of interest: by testing their properties.
  • Create GeneSetCollections with certain properties: create a GeneSetCollection where the collections follow certain distributions.

Contributing

Please read how to contribute for details on the code of conduct, and the process for submitting pull requests.

You can also look at the tests and add more tests to increase the quality of the package.

Acknowledgments

The ideas of this package were developed after a colleague asked a question at a poster presentation of my other package, BioCor. For the whole story, you can read this blog post.

gseadv's People

Contributors

llrs

Forkers

yonicd

gseadv's Issues

Use dynamic programming in the functions `from*`

Many functions that simulate a GSC are quite slow (more than 300 iterations take about 1 min), if they reach a solution in a timely manner at all.

Proposed change: use dynamic programming, creating the full amount of data up front and removing elements as they are picked. It might improve run times.

Example

This software can be implemented using the methods in this package.

Antithesis of add

Add the possibility to drop a relationship without completely removing the genes or GeneSets involved. It would be the opposite of add (which adds a relationship and, if they are not present, also adds the Gene or the GeneSet).
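A minimal sketch of the requested behaviour, on a plain named list rather than a GeneSetCollection. The function name `drop_relation` and the toy data are hypothetical; nothing with this name is known to exist in GSEAdv.

```r
# Drop a single gene-to-set relationship from a named list of gene sets.
# The gene set itself is kept (possibly empty), and the gene stays in any
# other set it belongs to.
drop_relation <- function(sets, gene, set) {
  stopifnot(set %in% names(sets))
  # setdiff() also de-duplicates the remaining genes of that set.
  sets[[set]] <- setdiff(sets[[set]], gene)
  sets
}

# Hypothetical toy data.
sets <- list(setA = c("g1", "g2"), setB = c("g2"))
out <- drop_relation(sets, gene = "g2", set = "setB")
```

After the call, `setB` is still present but empty, and `g2` remains in `setA`. In this list representation a gene that loses its last relationship is no longer recorded anywhere, which a real implementation would have to handle separately.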

Improve README

Add the original object stats with the simulations in Figure 3 of the README.
Also do the simulation with the number of genes.

Compare same terms in different GeneSetCollections

A user requested a method to compare GeneSets with the same name in different GeneSetCollections (GSC), by counting the number of genes in each.

It could go along the lines of compare(GSC1, GSC2): look for the same pathway/GeneSet names and compare the number of genes in each case. It could return the total number of shared names and the differences between them in a data.frame.

Would it be expandable to more than two GSC? Yes, if the column names of the data.frame are set to the names of the GSCs being compared, and NA is returned when the term is not shared between two GSCs.
Like: Terms GSC1_GSC2 GSC1_GSC3 GSC2_GSC3
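One way this proposal could look, sketched in base R on named lists of gene identifiers instead of GeneSetCollections. The function name `compare_sketch`, the toy data, and the choice to report the difference in the number of genes per pair are assumptions made for illustration; this is not an existing GSEAdv function.

```r
# Compare same-named gene sets across two or more collections.
# Each collection is a named list of character vectors and must be passed
# as a named argument. Returns a data.frame with one row per term and one
# column per pair of collections holding the difference in gene counts,
# or NA when the term is missing from either collection of the pair.
compare_sketch <- function(...) {
  gscs <- list(...)
  stopifnot(length(gscs) >= 2, !is.null(names(gscs)), all(names(gscs) != ""))
  terms <- sort(unique(unlist(lapply(gscs, names), use.names = FALSE)))
  # Matrix of set sizes: rows are terms, columns are collections.
  # Indexing by a missing name yields NA.
  sizes <- do.call(cbind, lapply(gscs, function(gsc) unname(lengths(gsc)[terms])))
  out <- data.frame(Terms = terms, stringsAsFactors = FALSE)
  for (p in combn(names(gscs), 2, simplify = FALSE)) {
    out[[paste(p, collapse = "_")]] <- sizes[, p[1]] - sizes[, p[2]]
  }
  out
}

# Hypothetical toy collections.
gsc1 <- list(p1 = c("a", "b", "c"), p2 = c("a"))
gsc2 <- list(p1 = c("a", "b"), p3 = c("d"))
gsc3 <- list(p1 = c("a"), p2 = c("a", "b"))
res <- compare_sketch(GSC1 = gsc1, GSC2 = gsc2, GSC3 = gsc3)
```

The number of names shared by a pair can then be read off as the count of non-NA entries in its column, for example `sum(!is.na(res$GSC1_GSC2))`.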

Appveyor

Add installation from Bioconductor (from), perhaps installing Bioconductor first and then the other packages.

Make it more general

At the moment it is only designed for GeneSetCollections, but I expect that there will be a need to cluster cell lines. As such, it would be interesting to have a general set class in Bioconductor with "GeneSet" as a subclass, and an equivalent "SetCollection" with "GeneSetCollection" as its subclass.

This would require a feature request in https://github.com/bioconductor/GSEABase
