biocore / gneiss
compositional data analysis toolbox
Home Page: https://biocore.github.io/gneiss/
License: BSD 3-Clause "New" or "Revised" License
There are some issues with using pickle to represent Model objects.
https://bugs.python.org/issue24658
We'll want to think about some alternative file formats to store all of the necessary information.
Looks like pycogent has support for coloring trees
https://github.com/pycogent/pycogent/blob/master/cogent/draw/dendrogram.py
Combine this with bokeh, and we can have interactive trees
http://chuckpr.github.io/blog/trees2.html
This was brought up in #4. We need a better way to be able to test tree layouts.
There are a few things that I think would improve this object
Come to think of it, naming it RegressionResults is a bit of a misnomer: it's really a model rather than a results object.
The methods that it encapsulates have different functionalities. For instance, MixedLM has no prediction functionality, so it may not be appropriate to have everything within the same class. However, there are shared functionalities between the classes.
On top of that, there are some functionalities that we are not exposing from the statsmodels RegressionResults objects, such as fit(), that would allow the models to be trained with different parameters. Exposing this sort of API would also bring this more in line with the scikit-learn API.
Here's what I'm thinking
Rename RegressionResults to something like Model or RegressionModel and make it an abstract base class. This naming will require a little careful thought, considering that we will be expanding this functionality to include classification as well in the near future.
Have separate Model classes for each of the methods, such as OLSModel or LMEModel.
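To make the proposal a bit more concrete, here is a minimal sketch of what the class layout could look like. Everything in it is an assumption for illustration: the constructor arguments, the dict-based inputs, and the hand-rolled simple linear regression (standing in for the statsmodels call) are not the existing gneiss API.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Hypothetical base class; names and signatures are placeholders."""

    def __init__(self, balances, metadata):
        self.balances = balances    # {balance name: [value per sample]}
        self.metadata = metadata    # {covariate name: [value per sample]}
        self.results = None

    @abstractmethod
    def fit(self, **kwargs):
        """Train the model and return self, scikit-learn style."""


class OLSModel(Model):
    """Illustrative subclass: one simple linear regression per balance."""

    def fit(self, covariate):
        x = self.metadata[covariate]
        n = len(x)
        x_mean = sum(x) / n
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        self.results = {}
        for name, y in self.balances.items():
            y_mean = sum(y) / n
            sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
            slope = sxy / sxx
            # Store (intercept, slope) for each balance.
            self.results[name] = (y_mean - slope * x_mean, slope)
        return self

    def predict(self, name, value):
        intercept, slope = self.results[name]
        return intercept + slope * value


model = OLSModel({"y0": [1.0, 3.0, 5.0]}, {"ph": [0.0, 1.0, 2.0]}).fit("ph")
```

An LMEModel would subclass Model the same way and simply not define predict, which keeps the shared pieces in one place without forcing every method onto every model.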
On top of that, it would be great if this object could support some querying functionality. Specifically
Allow for the querying of subtrees. If I query, say, internal node y7, I could retrieve all of the tips within that subtree.
Allow for intuitive interpretation of left/right balances. This ties into #69. It would be great to have functionality that could state right off the bat which of the two subtrees is more abundant.
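A rough sketch of the two queries, using a tiny stand-in Node class rather than the real tree object. The helper names (subtree_tips, more_abundant_side) are hypothetical, and the sign convention (positive balance means the first child is more abundant) is an assumption that would need to match however the balances are actually computed.

```python
class Node:
    """Minimal stand-in for a tree node; not the real tree class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        """Depth-first search for the node with the given name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def tips(self):
        """Names of all tips below (or at) this node, left to right."""
        if not self.children:
            return [self.name]
        return [tip for child in self.children for tip in child.tips()]


def subtree_tips(tree, name):
    """Hypothetical query: all tips under the internal node `name`."""
    node = tree.find(name)
    if node is None:
        raise KeyError("No node named %r in the tree" % name)
    return node.tips()


def more_abundant_side(tree, name, balance_value):
    """Assumes a positive balance means the first (left) child dominates."""
    left, right = tree.find(name).children
    if balance_value == 0:
        return []
    return left.tips() if balance_value > 0 else right.tips()


tree = Node("y0", [Node("y7", [Node("a"), Node("b")]), Node("c")])
```

With this, subtree_tips(tree, "y7") gives the tips under y7, and more_abundant_side(tree, "y0", -1.3) reports the tips of the right-hand subtree.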
Need to clean up balanceplot to allow for multiple attributes to be plotted simultaneously
Would be nice if the labels were kept if this function was run on a pd.Series
This happens when applying this algorithm to the EMP dataset
It would be amazing if we could take advantage of the formula interface in statsmodels to run statistical tests on the individual balances.
It would be nice to be able to save the RegressionResults to a file, to avoid rerunning analyses.
I'm thinking about having the following formats
The examples will need to be updated with the new api
Also the convert_biom_to_pandas will need to be updated to handle some edge cases
We can switch to BSD as soon as ETE becomes dual licensed.
It would be nice to have an overarching function that sorts tree tips.
I tested a dataset of 374 samples and 3682 OTUs. The analysis itself was swift, but it took a long time (~5-10 min) to generate the heat map in PDF format. The resulting file is 26 MB in size.
I wanted to install gneiss as described in the Readme on barnacle. That is what happened. Is the readme still up to date?
barnacle x86_64 ~/>conda create -n gneiss_env python=3
barnacle x86_64 ~/>source activate gneiss_env
(gneiss_env) barnacle x86_64 ~/>conda install pyqt=4.11.4
Fetching package metadata .........
Solving package specifications: ....
UnsatisfiableError: The following specifications were found to be in conflict:
It would be nice if there could be a Docker container for this.
Relevant to #70
balance_basis.
mean_niche_estimator renders properly.
Summarize subtree taxonomies of a given balance to aid interpretation
There needs to be an easy way to remove nodes with single children.
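One way this could work, sketched on a bare-bones Node class rather than the real tree object: walk the tree bottom-up and splice out any node that has exactly one child. The function name is made up, and branch lengths are ignored here; a real version would presumably add the removed node's length to its child.

```python
class Node:
    """Minimal stand-in for a tree node; not the real tree class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def collapse_single_children(node):
    """Return the tree with every one-child node spliced out."""
    # Clean up the children first so chains of single-child nodes collapse.
    node.children = [collapse_single_children(c) for c in node.children]
    if len(node.children) == 1:
        # Replace this node with its only child.
        return node.children[0]
    return node


def to_newick(node):
    """Serialize to Newick (names only) for easy inspection."""
    def render(n):
        if not n.children:
            return n.name
        return "(" + ",".join(render(c) for c in n.children) + ")" + n.name
    return render(node) + ";"


tree = Node("root", [Node("x", [Node("y", [Node("a")])]), Node("b")])
cleaned = collapse_single_children(tree)
```

Note that if the root itself has a single child, the child is returned as the new root, so callers should use the return value rather than the original object.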
If none of the samples between the metadata and the table match, an error needs to be thrown.
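Something along these lines would do it. This is only a sketch on plain lists of sample IDs; the function name and the error message are placeholders, and the real check would sit wherever the table and metadata get aligned.

```python
def match_samples(table_ids, metadata_ids):
    """Return the shared sample IDs (in table order) or fail loudly."""
    metadata_set = set(metadata_ids)
    shared = [sample for sample in table_ids if sample in metadata_set]
    if not shared:
        raise ValueError(
            "No sample IDs are shared between the table and the metadata."
        )
    return shared


shared = match_samples(["s1", "s2", "s3"], ["s3", "s2", "s9"])
```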
balance_basis (#8)
ladderize
order_tips
Need to port over the balances trees from this repo here
One thing that I'm thinking would be appropriate for documentation is IPython notebooks / markdown.
This plugin can be found here. It would be great to knock this out by the beta release.
cc @gregcaporaso to give an idea about the latest possible deadline to have a q2 plugin for the beta version release.
Note that this will extend the Model superclass
I personally find it really frustrating when I drop a couple of hours to build a regression model, only to realize that it was initialized with the wrong types.
For example, consider a variable like Age that is actually a numerical value, but is represented as an object (i.e. string).
It would be nice to have some sort of type checking on the fly, maybe even taking advantage of PY3 types and/or the categorical / numerical types in pandas.
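As one possible starting point, a cheap pre-flight check could flag columns that are stored as strings but parse cleanly as numbers. The sketch below works on a plain dict of columns so it stays dependency-free; the function name is hypothetical, and with a pandas DataFrame the same idea could be built around pd.to_numeric.

```python
def _is_number(text):
    """True if the string parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def find_numeric_strings(metadata):
    """Return columns whose values are all strings that look numeric.

    metadata: {column name: list of values}
    """
    suspicious = []
    for column, values in metadata.items():
        all_strings = all(isinstance(v, str) for v in values)
        if values and all_strings and all(_is_number(v) for v in values):
            suspicious.append(column)
    return suspicious


metadata = {
    "Age": ["23", "41", "35"],   # numeric, but stored as strings
    "Sex": ["F", "M", "F"],      # genuinely categorical
    "pH": [6.5, 7.0, 7.2],       # already numeric
}
flagged = find_numeric_strings(metadata)
```

Whether flagged columns should trigger a warning, an error, or an automatic conversion is still an open question.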
Not sure about what the best approach here is ...
The ols and mixedlm calculations can take upwards of 30 min on 1000 OTUs.
These calculations are also embarrassingly parallel, so I'm thinking about adding joblib as an optional dependency to enable this.
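Roughly what I have in mind for the optional dependency: try to import joblib, and fall back to a plain loop if it isn't installed. The fit_one function below is just a stand-in for the per-balance regression, and fit_all and its n_jobs argument are hypothetical names.

```python
try:
    from joblib import Parallel, delayed
except ImportError:          # joblib is optional
    Parallel = None


def fit_one(values):
    """Stand-in for fitting a single balance (here: just the mean)."""
    return sum(values) / len(values)


def fit_all(balances, n_jobs=1):
    """Fit every balance independently, in parallel when possible."""
    names = list(balances)
    if Parallel is None or n_jobs == 1:
        # Serial fallback: no joblib installed or parallelism not requested.
        fitted = [fit_one(balances[name]) for name in names]
    else:
        fitted = Parallel(n_jobs=n_jobs)(
            delayed(fit_one)(balances[name]) for name in names
        )
    return dict(zip(names, fitted))


results = fit_all({"y0": [1.0, 3.0], "y1": [2.0, 2.0]})
```

Since each balance is fit independently, the results come back in the same order either way and the caller shouldn't notice the difference apart from the runtime.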
fit is called multiple times
fit should return the updated RegressionModel object as output
_stale parameter should be enabled so that coefficients, predict and residuals won't return anything until fit has been called.
Need to be able to pass in file handles for read_pickle and write_pickle.
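A sketch of how these pieces might fit together. The class body is a placeholder (the "model" is just a mean), raising an error while stale is my own choice rather than something decided here, and pickling the plain state dict instead of the object itself is likewise only one option.

```python
import io
import pickle


class RegressionModel:
    """Toy model showing fit() chaining, a stale flag, and file handles."""

    def __init__(self, data):
        self.data = data
        self._stale = True           # no usable results until fit() runs
        self._coefficients = None

    def fit(self):
        self._coefficients = sum(self.data) / len(self.data)  # placeholder
        self._stale = False
        return self                  # allows model = RegressionModel(...).fit()

    def coefficients(self):
        if self._stale:
            raise RuntimeError("Call fit() before requesting coefficients.")
        return self._coefficients

    def write_pickle(self, filename_or_handle):
        """Accept either a path or an open binary file handle."""
        if hasattr(filename_or_handle, "write"):
            pickle.dump(self.__dict__, filename_or_handle)
        else:
            with open(filename_or_handle, "wb") as handle:
                pickle.dump(self.__dict__, handle)

    @classmethod
    def read_pickle(cls, filename_or_handle):
        """Accept either a path or an open binary file handle."""
        if hasattr(filename_or_handle, "read"):
            state = pickle.load(filename_or_handle)
        else:
            with open(filename_or_handle, "rb") as handle:
                state = pickle.load(handle)
        model = cls(state["data"])
        model.__dict__.update(state)
        return model


model = RegressionModel([1.0, 2.0, 3.0]).fit()
buffer = io.BytesIO()
model.write_pickle(buffer)
buffer.seek(0)
restored = RegressionModel.read_pickle(buffer)
```

Calling fit() a second time simply overwrites the stored coefficients, so repeated calls are harmless in this version.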
It would be nice if we could perform coordinate conversions between 2 trees.
Note that this will extend the Model superclass
http://congress.cimne.com/codawork11/Admin/Files/FilePaper/p55.pdf
http://www.elib.bsu.by/bitstream/123456789/51958/1/173-176.pdf
Rename niche_sort and mean_niche_estimator to band_sort and mean_band_estimator
Also need to think about how to split up the table sorting algorithms, and the tree sorting algorithms.
It would be nice if the RegressionResults object just had the tree and balances built inside of it.
There are a few filtering steps done within ols and mixedlm. It would be easier to get the resulting tree and balances directly from this object rather than guessing it.
It may be advantageous to break the ilr transform out of the regression module.
Specifically, this would change all of the regression modules so that they take in a balance table rather than an OTU table. That way, if multiple analyses were to be run on the same balance table, the balance table doesn't need to be recomputed every iteration.
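To illustrate the split, here is a small sketch where the balances are computed once and then handed to whatever analyses follow. It uses the standard ilr balance formula, sqrt(rs/(r+s)) * ln(g(left)/g(right)), on plain dicts; the function names and the table layout are assumptions, and zeros would need a pseudocount before the log.

```python
import math


def balance(sample, left, right):
    """ilr balance between two groups of OTUs for a single sample."""
    r, s = len(left), len(right)
    # Log of the geometric mean of each side.
    log_left = sum(math.log(sample[otu]) for otu in left) / r
    log_right = sum(math.log(sample[otu]) for otu in right) / s
    return math.sqrt(r * s / (r + s)) * (log_left - log_right)


def balance_table(table, partitions):
    """Compute every balance for every sample, once.

    table: {sample id: {otu id: abundance}}, abundances must be > 0
    partitions: {balance name: (left otus, right otus)}
    """
    return {
        sample_id: {
            name: balance(counts, left, right)
            for name, (left, right) in partitions.items()
        }
        for sample_id, counts in table.items()
    }


table = {
    "s1": {"a": 1.0, "b": 1.0, "c": 1.0},
    "s2": {"a": math.e, "b": 1.0, "c": 1.0},
}
partitions = {"y0": (["a"], ["b", "c"]), "y1": (["b"], ["c"])}
balances = balance_table(table, partitions)
```

The regression functions would then take balances as input, so running ols and mixedlm back to back reuses the same table instead of redoing the transform.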
This is also relevant to #79
Need to encode a transformation that allows for biplots to be plotted with balances.
CC @ElDeveloper
There was some error handling that has been corrected in the dev version of scikit-bio here. Will need to upgrade to depend on that version.
To make Gneiss great again
Having log files available for some of the pipelines, especially the regression functions, would be very nifty for debugging.
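The standard library logging module is probably enough for this. A minimal sketch, where the logger name "gneiss.regression", the helper name, and the message format are all placeholders:

```python
import logging


def get_logger(log_path, name="gneiss.regression"):
    """Return a logger that appends DEBUG-and-above messages to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Attach the file handler only once per logger name.
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def log_filtering(logger, n_before, n_after):
    """Example of the kind of message a pipeline step could record."""
    logger.info("Filtered samples: %d -> %d", n_before, n_after)
```

Note that in this version a second call with the same logger name and a different path keeps writing to the first file, which may or may not be what we want.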
What is a good measure of effect size?