UIUCSinhaLab / GEMSTAT
Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression
Some things in ExprPredictor.h use design patterns, and others don't...
Fix the factory method for pars to create that. Include one.
The problem is that the line at
https://github.com/UIUCSinhaLab/GEMSTAT/blob/master/src/seq2expr.cpp#L331
( Update: now https://github.com/UIUCSinhaLab/GEMSTAT/blob/master/src/seq2expr.cpp#L340 )
needs its "false" changed to "true".
This needs to be fixed on both main branches.
See the article by Stroustrup: http://www.artima.com/cppsource/rvalue.html
Basically, if you are returning into a reference provided as an argument, you should use && (rvalue) references.
ex:
void myfunction(int&& return_into){...}
rather than void myfunction(int& return_into){...}
This enables move semantics, which can avoid unnecessary copies.
Once a model is created, later predictions using that model do not create output curves at the same scale.
There is a lot of IO inlined in the main function that could be moved out to an IO.cpp / IO.h file.
The various parsers implemented should be DRY, and should also be absolutely pedantic. (Excess data afterward in a file should cause an error!)
The correlation coefficient objective function returns 'nan', probably because of a division by zero.
Bryan came up with these changes quickly:
Tools.cpp - line 994 : double corr_xy = cov_xy / sqrt( x_var * y_var + 0.000001 );
ObjFunc.cpp - line 36 : totalSim += abs( corr( prediction[i], ground_truth[i] ) );
If we want to instantiate many predictors with the same settings, such as ExprPredictor::modelOption, consider using a Factory design pattern.
Modify the ExprPar::load function to just get the cooperativity graph from the par file itself.
There is no reason for the Sequence objects to be so bare-bones.
Make the Sequence objects also know their own name and appropriate accessibility to that.
Eliminate everywhere that a separate seqNames array/vector is used.
Currently, true expression data is required even when only making a prediction.
Users should vote on the order in which these get resolved.
Currently, the factor thresholds will be input if they are provided in the .par file, but the other input locations are, in effect, ignored.
So we need to decide which order to resolve these in.
My preference is :
-ft (the factor threshold file) overwrites the input from
-et (the overall default factor threshold), which overwrites the input from
the .par file.
The reason this is effectively ignored is that neither of the first two sources makes its way into the parameters fed into the ExprPredictor (they do get used for the initial annotation).
Unfortunately, ExprPredictor re-annotates every sequence each time it is called on to make a prediction.
(That's fine if you want to learn the annotation threshold, but because it never received the values that were specified earlier, it starts at the default.)
This will let us see how easy the new API is to use, while having a baseline against which to compare.
The current infty_transform seems to think that SINE has domain [0,+1], but it is really [-1, +1].
Because the inverse_infty_transform is the correct inverse of the bad infty_transform, it somehow still worked.
Unfortunately, people are still using old versions of GEMSTAT, so a tool to convert between formats would be useful.
Modify the ExprPar::load function to be more defensive. Currently it silently corrupts things on bad input.
We probably need to define a new file format altogether, really.
We need to select a testing framework (even just a bash script) and have some automated tests.
There is a big memory leak in ExprPredictor::evalObjective( const ExprPar& par ), but it is easily fixed.
gradient_minimize will partially roll back the optimization if some kind of exception occurs, even if the previous parameter vector was acceptable and good.
When the exception happens, it does not save the objective function value that corresponds to the parameter vector it saves (whatever that is); it saves the value from the parameter vector it rejected.
This issue is multifold :
I guess those values should be truncated?
Perhaps truncation should be an option, with the error behavior as the default. Throwing an explicit error might help users catch mistakes that would otherwise be silently ignored and corrupt their science.
It seems the coop parser can't handle a half-directional entry written as "... - *"; it only works with "... * +/-".
The Sequences (which don't change during execution!) get copied several times per iteration. I wonder how large an efficiency hit that is.
That way virtual function dispatch and such will work.
(re-include beta optimization stuff.)
Using the direct model without a factorinfo file causes it to assume that all TFs are neither activators nor repressors. Is that the desired behavior?
Please document this.
Investigate, fix.
SNOT was not throwing an exception when parsing a snot file starting with a back-quote/grave accent (`).
Test this and add more error handling.
This is separate from issue #19 because this seems more like a parsing problem.
In ExprFunc::predictExpr(...) (ExprPredictor.cpp) we see at least one place where the code assumes that 'cic' will be factor number 2 (0-indexed, so the third in the factor_expression file).
This silently and secretly requires the user to list their transcription factors in the same order as in the paper.
There should be a facility in the ExprPar and ParFactory to hold additional parameters more easily. Perhaps a first step would be making the functions that convert between a std::vector<double> and an ExprPar virtual, so that it is easier to override and extend them.
My suggestion is that each TF be allowed to have any number of parameters, and instead of storing them in vectors per parameter (column oriented), store them with the description of the TF (row oriented). When an ExprFunc is created, it can reformat them for its own use as it likes anyway.
Likewise, each promoter/enhancer should have its own parameters, with some facility for storing global parameters.
Similarly, in the ExprModel, we should have a more general facility for storing indicators, rather than the current ExprModel::actIndicators and ExprModel::repIndicators.
It can silently load an empty cooperativity graph if the coop.txt file is in the wrong format, instead of failing and warning the user.
This is in the "bryanInd" branch.
Currently, the ExprPredictor::optimize_beta method that I introduced does not enforce constraints, nor does it choose which betas should be free or fixed from the free_fix indicator array as it should.
Change the order of indicator_bool to match the order of parameters in the .par files.
Copy out the code that happens all over the place to create free and fixed variable arrays based on indicator bool into its own function. (DRY)
GEMSTAT should fail with an error if the listed cooperativities don't match those in the par file.
The pi parameter is not being used in the code; as a temporary solution it needs to be fixed inside the code.
In the long term, we would like to remove pi.