UIUCSinhaLab / GEMSTAT
Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression
Some things in ExprPredictor.h use design patterns, and others don't...
Fix the factory method for pars to create that. Include one.
The problem is that the line at
https://github.com/UIUCSinhaLab/GEMSTAT/blob/master/src/seq2expr.cpp#L331
( Update: now https://github.com/UIUCSinhaLab/GEMSTAT/blob/master/src/seq2expr.cpp#L340 )
needs its "false" changed to "true".
This needs to be fixed on both main branches.
See the article by Stroustrup: http://www.artima.com/cppsource/rvalue.html
Basically, if you are returning into a reference provided as an argument, you should use && (rvalue) references.
ex:
void myfunction(int&& return_into){...}
rather than void myfunction(int& return_into){...}
This enables move semantics, which can avoid unnecessary copies.
Once a model is created, later predictions using that model do not create output curves at the same scale.
There is a lot of IO inlined in the main function that could be moved out to an IO.cpp / IO.h file.
The various parsers implemented should be DRY, and should also be absolutely pedantic. (Excess data afterward in a file should cause an error!)
The correlation coefficient objective function returns 'nan', probably because of a division by zero.
Bryan came up with these changes quickly:
Tools.cpp - line 994 : double corr_xy = cov_xy / sqrt( x_var * y_var + 0.000001 );
ObjFunc.cpp - line 36 : totalSim += abs( corr( prediction[i], ground_truth[i] ) );
If we want to instantiate many predictors with the same settings, such as ExprPredictor::modelOption, consider using a Factory design pattern.
Modify the ExprPar::load function to just get the cooperativity graph from the par file itself.
There is no reason for the Sequence objects to be so bare-bones.
Make the Sequence objects also know their own name and appropriate accessibility to that.
Eliminate everywhere that a separate seqNames array/vector is used.
Currently, true expression data is required even when only making a prediction.
Users should vote on the order in which these get resolved.
Currently, the factor thresholds will be input if they are provided in the .par file, but the other input locations are, in effect, ignored.
So we need to decide which order to resolve these in.
My preference is :
-ft (the factor threshold file) overwrites the input from
-et (the overall default factor threshold), which overwrites the input from
the .par file.
The reason this is effectively ignored is that neither of the first two sources makes its way into the parameters fed into the ExprPredictor (they do get used for the initial annotation).
Unfortunately, ExprPredictor re-annotates every sequence each time it is called on to make a prediction.
(That's fine if you want to learn the annotation threshold, but because it never received the values that were specified earlier, it starts at the default.)
This will let us see how easy the new API is to use, while having a baseline against which to compare.
The current infty_transform seems to think that SINE has domain [0,+1], but it is really [-1, +1].
Because the inverse_infty_transform is the correct inverse of the bad infty_transform, it somehow still worked.
Unfortunately, people are still using old versions of GEMSTAT, so a tool to convert between formats would be useful.
Modify the ExprPar::load function to be more defensive. Currently it silently corrupts things on bad input.
We probably need to define a new file format altogether, really.
We need to select a testing framework (even just a bash script) and have some automated tests.
There is a big memory leak in ExprPredictor::evalObjective( const ExprPar& par ), but it is easily fixed.
gradient_minimize will partially roll back the optimization if some kind of exception occurs, even if the previous parameter vector was acceptable and good.
When the exception happens, it does not save the objective function value that corresponds to the parameter vector it saves (whatever that is); it saves the value from the parameter vector it rejected.
This issue is multifold :
I guess those values should be truncated?
Perhaps truncation should be an option, with the error behavior as the default. Throwing an explicit error might help users catch mistakes that would otherwise be silently ignored and corrupt their science.
It seems the coop parser can't handle a half-directional entry written as "... - *"; it only works with "... * +/-".
The Sequences (which don't change during execution!) get copied several times per iteration. I wonder how large an efficiency hit that is.
That way virtual function dispatch and such will work.
(re-include beta optimization stuff.)
Using the direct model without a factorinfo file causes it to assume that all TFs are neither activators nor repressors. Is that the desired behavior?
Please document this.
Investigate, fix.
SNOT was not throwing an exception when parsing a snot file starting with a back-quote/grave accent (`).
Test this and add more error handling.
This is separate from issue #19 because this seems more like a parsing problem.
In ExprFunc::predictExpr(...) (ExprPredictor.cpp) we see at least one place where the code assumes that 'cic' will be factor number 2 (0-indexed, so the third in the factor_expression file).
This silently and secretly requires the user to list their transcription factors in the same order as in the paper.
There should be a facility in the ExprPar and ParFactory to hold additional parameters more easily. Perhaps a first step would be making the functions that convert between a std::vector<double> and an ExprPar virtual, so that it is easier to override and extend them.
My suggestion is that each TF be allowed to have any number of parameters, and instead of storing them in vectors per parameter (column oriented), store them with the description of the TF (row oriented). When an ExprFunc is created, it can reformat them for its own use as it likes anyway.
Likewise, each promoter/enhancer should have its own parameters, with some facility for storing global parameters.
Similarly, in the ExprModel, we should have a more general facility for storing indicators, rather than the current ExprModel::actIndicators and ExprModel::repIndicators.
It can silently load an empty cooperativity graph if the coop.txt file is in the wrong format, instead of failing and warning the user.
This is in the "bryanInd" branch.
Currently, the ExprPredictor::optimize_beta method that I introduced does not enforce constraints, nor does it choose which betas should be free or fixed from the free_fix indicator array as it should.
Change the order of indicator_bool to match the order of parameters in the .par files.
Copy out the code that happens all over the place to create free and fixed variable arrays based on indicator bool into its own function. (DRY)
GEMSTAT should fail with an error if the listed cooperativities don't match those in the par file.
The pi parameter is not being used in the code; as a temporary solution it needs to be fixed inside the code.
In the long term, we would like to remove pi.