jorisvansteenbrugge / tbasco

License: GNU Affero General Public License v3.0

R 91.63% Python 8.19% Dockerfile 0.18%

tbasco's Issues

Add association rules

An enhancement where the support and confidence parameters are set automatically would be a nice addition.

As exemplified by the pyruvate oxidation pathways, which have many alternative sub-modules that may result in the same function, care needs to be taken when determining whether a module is complete. However, there is unlikely to be a single solution, as there are 4 different types of modules:

1) pathway modules, 2) structural complexes, 3) functional sets, 4) signature modules.

http://www.kegg.jp/kegg/module.html

Currently there is a hard filter for 75% 'complete' for a module. Completeness is determined by the percentage of total genes contained within a module. However, not all genes in a module are necessary for the function, and therefore this filter is too strict.

For pathway modules, one simple possibility is to check the number of reactions in the pathway that are filled using the Reaction Modules database, and then filter based on the presence of 75% of reactions.

For other module types, such as the structural complexes, perhaps we can continue using the 75% cut-off that exists at the moment. We can revisit this issue in the future.
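The reaction-based filter proposed above could be sketched roughly as follows. This is only an illustration under assumptions: `module_reactions`, `genome_kos`, `reaction_completeness` and the placeholder KO names are hypothetical and are not part of TbasCO or KEGG. Each pathway module is represented as a list of reaction steps, and each step lists the alternative KOs that can fill it.

```r
# Hypothetical inputs: none of these names come from TbasCO itself.
# Each module is a list of reaction steps; each step is a character
# vector of alternative KOs, any one of which fills that step.
module_reactions <- list(
  M_example = list(
    step1 = c("KO_a", "KO_b"),  # either KO fills this reaction
    step2 = c("KO_c"),
    step3 = c("KO_d", "KO_e"),
    step4 = c("KO_f")
  )
)

# KOs annotated in one genome (made-up values)
genome_kos <- c("KO_b", "KO_c", "KO_f")

# Fraction of reaction steps with at least one KO present in the genome
reaction_completeness <- function(steps, kos) {
  filled <- vapply(steps,
                   function(alternatives) any(alternatives %in% kos),
                   logical(1))
  mean(filled)
}

completeness <- vapply(module_reactions, reaction_completeness,
                       numeric(1), kos = genome_kos)

# Apply the 75% cut-off to reactions instead of genes
passes_filter <- completeness >= 0.75
```

In this made-up case the genome carries 3 of the 6 listed genes, so a gene-based filter would reject the module, while 3 of the 4 reaction steps are filled and the reaction-based filter keeps it.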

typos in Utility.R

Lines 256-269 are missing closing parentheses:

  random.genes.hexb <- hexbin(bkgd.individual.Zscores$zscores$`Random Genes`$PC,
    bkgd.individual.Zscores$zscores$`Random Genes`$NRED,
    ybnds = c(min(all_scores_y), max(all_scores_y)),
    xbnds = c(min(all_scores_x), max(all_scores_x)))

  random.annotated.genes.hexb <- hexbin(bkgd.individual.Zscores$zscores$`Random Annotated Genes`$PC,
    bkgd.individual.Zscores$zscores$`Random Annotated Genes`$NRED,
    ybnds = c(min(all_scores_y), max(all_scores_y)),
    xbnds = c(min(all_scores_x), max(all_scores_x)))

  random.identical.annotated.genes.hexb <- hexbin(bkgd.individual.Zscores$zscores$`Genes with the same annotation`$PC,
    bkgd.individual.Zscores$zscores$`Genes with the same annotation`$NRED,
    ybnds = c(min(all_scores_y), max(all_scores_y)),
    xbnds = c(min(all_scores_x), max(all_scores_x)))

TbasCO installs correctly with these fixes, but with the following warnings:

Note: possible error in 'Model_Module(RNAseq.data, ': unused arguments (Yrange, bkgd.traits) 
Note: possible error in 'Model_Module(RNAseq.data, ': unused arguments (Yrange, bkgd.traits) 
Note: possible error in 'Model_Module(RNAseq.data, ': unused arguments (Yrange, bkgd.traits) 
Note: possible error in 'Model_Module(RNAseq.data, ': unused arguments (Yrange, bkgd.traits) 
Note: possible error in 'Model_Module(RNAseq.data, ': unused arguments (Yrange, bkgd.traits) 
Note: possible error in 'Model_Module(RNAseq.data, ': unused arguments (Yrange, bkgd.traits) 
Note: possible error in 'Model_Module(RNAseq.data, ': unused arguments (Yrange, bkgd.traits) 
Note: break used in wrong context: no loop is visible

Go-fish pattern matching

distance?

Add a way to view gene/module function

Create Plotting Function to select specific genomes

Modify the Plot_Trait_Attribute_Expression function so that it accepts a string of "genomes" and plots only those, rather than all genomes that have the trait attribute. For example, I modified the existing function to look at only the two Accumulibacter genomes by hard-coding those genome names; it would be nice to just provide a string.

Plot_Accumulibacter_Attribute_Expression <- function(trait.attribute = "M00793_1",
                                            trait.attributes.pruned,
                                            RNAseq.data) {
  t.data <- trait.attributes.pruned[[trait.attribute]]
  annotations <- RNAseq.data$features$annotation.db$module.dict[[trait.attribute]]
  n.att <- length(t.data)
  n.annotations <- length(annotations)

  plots <- list()

  for(i in 1:length(t.data)){
    plot.df <- matrix(ncol = 4, nrow = 0)
    trait.genomes <- c("3300026302-bin.3", "3300026286-bin.31")          #### specific genomes instead of all

    for(genome in trait.genomes){
      for(annotation in annotations){
        expr_values <- RNAseq.data$table[which(RNAseq.data$table$Annotation == annotation),]
        expr_values <- expr_values[ which(expr_values$Bin == genome), RNAseq.data$features$sample.columns]

        if(nrow(expr_values) >= 2) expr_values <- expr_values[1,]
        else if(nrow(expr_values) == 0){
          print(paste(genome, "lacks", annotation))
          next
        }

        if(! NA %in% expr_values){
          expr_values <- log2(expr_values)
          expr_values <- expr_values-min(expr_values)
        }

        for (timepoint in 1:length(expr_values)){
          tp_expr <- expr_values[timepoint] %>% as.numeric
          row <- c(genome,  annotation, paste0("TP",timepoint) , tp_expr)
          print(row)
          plot.df <- rbind(plot.df, row )
        }
      }
    }

    plot.df <- data.frame(plot.df, stringsAsFactors = FALSE)
    colnames(plot.df) <- c("Genome", "KID", "TimePoint", "Expr")

    plot.df$Expr %<>% as.numeric
    plots[[i]] <- ggplot(plot.df) + geom_line(aes(x = TimePoint, y = Expr, group = Genome, color = Genome)) + facet_wrap(~KID, nrow = 1)
  }
  gridExtra::grid.arrange(grobs = plots, nrow = 2)
}
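One way the hard-coded genome names could be replaced by an argument is sketched below. `select_genomes` and its `genomes` argument are hypothetical and not part of TbasCO; the idea is that the plotting function would take an optional `genomes` vector and fall back to all genomes with the trait attribute when it is not given.

```r
# Hypothetical helper, not a TbasCO function: returns the genomes to plot.
# `available` holds the genomes that have the trait attribute;
# `genomes` is the optional user-supplied subset.
select_genomes <- function(available, genomes = NULL) {
  if (is.null(genomes)) return(available)  # default: plot all genomes
  missing <- setdiff(genomes, available)
  if (length(missing) > 0) {
    warning("Genomes without this trait attribute: ",
            paste(missing, collapse = ", "))
  }
  intersect(genomes, available)
}

available <- c("3300026302-bin.3", "3300026286-bin.31", "3300009517-bin.1")
trait.genomes <- select_genomes(available,
                                genomes = c("3300026302-bin.3",
                                            "3300026286-bin.31"))
```

Inside the plotting function, the result would take the place of the hard-coded `trait.genomes` vector, so the rest of the loop stays unchanged.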

Expanded Options Don't Exist

For these commands:

trait.attributes        <- Identify_Trait_Attributes(RNAseq.data = RNAseq.data, 
                                                     pairwise.distances = pairwise.distances,
                                                     annotation.db = expanded_annotation.db,
                                                     threads = 2)
trait.attributes.pruned <- Prune_Trait_Attributes(trait.attributes, bkgd.traits, 
                                                  RNAseq.data,
                                                  p.threshold = 0.05,
                                                  pairwise.distances = pairwise.distances,
                                                  annotation.db = expanded_annotation.db,
                                                  trait_presence_absence = trait_pa_expanded)

The variables expanded_annotation.db and trait_pa_expanded do not exist, and I can't find where they might be created. I instead do this:

trait.attributes <- Identify_Trait_Attributes(RNAseq.data = RNAseq.data,
                                              pairwise.distances = pairwise.distances,
                                              annotation.db = RNAseq.data$features$annotation.db,
                                              threads = 2)

trait.attributes.pruned <- Prune_Trait_Attributes(trait.attributes,
                                                  bkgd.traits,
                                                  RNAseq.data,
                                                  p.threshold = 0.05,
                                                  pairwise.distances = pairwise.distances,
                                                  bkgd.individual.Zscores = bkgd.individual.Zscores,
                                                  annotation.db = RNAseq.data$features$annotation.db,
                                                  trait_presence_absence = RNAseq.data$features$trait_presence_absence)

This seems to create the trait attributes/pruned versions successfully, but then there might be downstream problems caused by this.

Plot_Trait_Attribute_Expression Gives Empty Plots

After loading the Plot_Trait_Attribute_Expression function successfully, and running this command for example: Plot_Trait_Attribute_Expression(trait.attribute='M00009_756', trait.attributes$trait.attribute, RNAseq.data), it draws the plot boxes for each KO in the module, but the plots are empty.

Filter for module completeness

If a genome only expresses a very small fraction of the full module, it may be marked as 'not expressing' that module.

e.g. from CTR:

  module_completion_table          <- Calc_module_completion(module_attribute_table,
                                                             matrix_features,
                                                             modules_to_KO_list)

  average_attribute_completeness   <- Calc_Ave_Attribute_Completion(module_attribute_table,
                                                                    module_completion_table)

Visualize association rules in cytoscape automatically

Probably by reusing the ctr version since that is quite a hassle.

goFish function

A function that allows a user to identify gene(s)/KOs with a particular pattern of expression/ranks.

(Feature Request) command line implementation outside of R

Are there any plans to develop a command line version outside of R? I've gone through your tutorial and have pretty much all the necessary files from my github.com/jolespin/veba pipeline. If this software can take generalized input (eg, a table of genome Id, protein Id, kegg Ids) and counts tables, I can work on a walkthrough (potentially even include in a module) to use your tool downstream of the results that are processed by my tool or similar tools.

A command line option would also make it easier to run at scale on distributed systems.

Create some pretty plots

Visualize a trait attribute
Background of Individual genes
Background of modules

Empty Trait Redundancy Dataframe

After loading the function Calc_TnA_redundancy and running TnA_redundancy <- Calc_TnA_redundancy(), I get the following warning:

Warning message:
In rev(as.numeric(Most_redundant_order)) : NAs introduced by coercion

And TnA_redundancy looks like this:

      all_bins            sum_traits sum_attributes
 [1,] "3300009517-bin.1"  "0"        "0"           
 [2,] "3300009517-bin.12" "0"        "0"           
 [3,] "3300009517-bin.13" "0"        "0"           
 [4,] "3300009517-bin.3"  "0"        "0"           
 [5,] "3300009517-bin.30" "0"        "0"           
 [6,] "3300009517-bin.31" "0"        "0"           
 [7,] "3300009517-bin.42" "0"        "0"           
 [8,] "3300009517-bin.47" "0"        "0"           
 [9,] "3300009517-bin.6"  "0"        "0"           
[10,] "3300009517-bin.7"  "0"        "0"           
[11,] "3300026282-bin.4"  "0"        "0"           
[12,] "3300026283-bin.28" "0"        "0"           
[13,] "3300026284-bin.9"  "0"        "0"           
[14,] "3300026288-bin.32" "0"        "0"           
[15,] "3300026288-bin.43" "0"        "0"           
[16,] "3300026289-bin.24" "0"        "0"           
[17,] "3300026299-bin.22" "0"        "0"           
[18,] "3300026302-bin.10" "0"        "0"           
[19,] "3300026302-bin.20" "0"        "0"           
[20,] "3300026302-bin.32" "0"        "0"           
[21,] "3300026302-bin.46" "0"        "0"           
[22,] "3300026302-bin.47" "0"        "0"           
[23,] "3300026302-bin.62" "0"        "0"           
[24,] "3300026303-bin.42" "0"        "0"           
[25,] "3300026303-bin.46" "0"        "0"

Therefore leading to an empty plot when plotting traits/attributes.

ClusterPrune p-value

Include a line to adjust for multiple testing.

see:
p.adjust(p, method = p.adjust.methods, n = length(p))

whereby n would be the total number of module attributes
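A minimal sketch of that correction, assuming one p-value per module attribute. The p-values and attribute names are made up, and the choice of the Benjamini-Hochberg method is an assumption; the issue does not say which method to use.

```r
# Made-up p-values, one per module attribute
p_values <- c(attr_1 = 0.001, attr_2 = 0.012, attr_3 = 0.030, attr_4 = 0.200)

# Benjamini-Hochberg correction; n is the total number of module attributes
p_adjusted <- p.adjust(p_values, method = "BH", n = length(p_values))

# Keep attributes that still pass the 0.05 threshold after correction
significant <- names(p_adjusted)[p_adjusted <= 0.05]
```

If pruning were run on only a subset of the attributes, `n` would still need to be the total number of attributes tested, which is why it is passed explicitly here.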

sbs.trait.attributes <- Traitattributes_To_Sbsmatrix(trait.attributes.pruned, RNAseq.data$features$bins)
Error in trait[[i]] : subscript out of bounds

GO-Fish parameters

Let the user decide whether to use rank or correlation. Let the user set a margin.
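A rough sketch of how these two parameters might look. `go_fish`, `use_ranks` and `margin` are hypothetical names, not TbasCO functions or arguments. The sketch also assumes that "rank" means Spearman correlation, "correlation" means Pearson correlation, and the margin is a minimum correlation with the query pattern; the issue does not define any of these.

```r
# Hypothetical sketch; go_fish and its arguments are not TbasCO functions.
# expr:    numeric matrix, one row per gene/KO, one column per time point
# pattern: numeric vector with the expression pattern to search for
# margin:  minimum correlation with the pattern (assumed interpretation)
go_fish <- function(expr, pattern, use_ranks = FALSE, margin = 0.9) {
  method <- if (use_ranks) "spearman" else "pearson"
  scores <- apply(expr, 1, function(x) cor(x, pattern, method = method))
  # Drop undefined scores (e.g. constant rows) and rows below the margin
  scores[!is.na(scores) & scores >= margin]
}

# Made-up expression values for three genes over four time points
expr <- rbind(
  gene_up    = c(1, 2, 4, 8),
  gene_down  = c(8, 4, 2, 1),
  gene_mixed = c(3, 1, 4, 2)
)
pattern <- c(1, 2, 3, 4)

hits <- go_fish(expr, pattern, use_ranks = TRUE)
```

With these values only `gene_up` follows the increasing pattern closely enough to be returned.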

Filtering

The filtering module should be modular.
Possibly:

Give a few options
Make it user adaptable
