soedinglab / BaMM_webserver
Webserver for motif discovery with higher-order Bayesian Markov Models (BaMMs)
Home Page: https://bammmotif.mpibpc.mpg.de
License: GNU Affero General Public License v3.0
In our paper, we claimed that the maximum FASTA file size is 50 MB, while the .env_template file says:
MAX_UPLOAD_FILE_SIZE = 20MB
Unfortunately a new one 👎
We have two higher-order databases for mouse and human. They should be integrated before submission.
There are two bugs here:
Urgh :-(
Error in evaluateMotif(pvalues, filename = filename, rerank = FALSE, data_eta0 = data_eta0) :
Error: input p-values must all be in the range 0 to 1!
Execution halted
@WanwanGe noted that the current peng-de-novo workflow does not allow the user to do a pure motif refinement. That is, after restructuring there is currently no way to upload your own PWM or MEME file and refine the model.
It should be straightforward to implement a BaMM refinement workflow that takes a user-supplied PWM or MEME file and then runs the BaMM pipeline.
Currently the webserver has no functionality to visualize motif hits on sequences. This would be a very nice feature that we can queue for some time after the submission.
Currently the webserver relies on the user copying the job ID in order to be able to access the results after some time. The job ID, however, is only displayed right after submission, not on the result page.
We should add it to the result pages, so that users can access their results later.
BaMM_webserver/bammmotif/tasks.py
Line 225 in 41927ec
FDR analysis in shootpeng crashes when at least one sequence is shorter than peng_motif's reported motifs.
I think there's a change in the command line syntax in BaMM tools after updating the submodules.
BaMMScan /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output /code/media/216aea2f-8645-44c8-b9f6-06a0fb9ec2f3/Input/59474_seqs.fa --BaMMFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_seqs_motif_1.ihbcp --bgModelFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_seqs.hbcp --order 2 --Order 2 --extend 0 0 --maxPWM 1 --pvalCutoff 0.01 --basename 59474_seqs_motif_1
Ignoring unknown option --basename
Ignoring unknown option 59474_seqs_motif_1
and
FDR /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output /code/media/216aea2f-8645-44c8-b9f6-06a0fb9ec2f3/Input/59474_seqs.fa --BaMMFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_seqs_motif_1.ihbcp --bgModelFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_seqs.hbcp --order 2 --Order 2 --extend 0 0 --maxPWM 1 --cvFold 1 --mFold 5 --sOrder 2 --basename 59474_seqs_motif_1
Ignoring unknown option --basename
Ignoring unknown option 59474_seqs_motif_1
Currently it is not possible to run several BaMM optimization runs from one PEnG seeding, because the selected motifs are copied to the peng folder. A second refinement round copies to the same folder, so BaMM runs with all motifs ever selected.
A way around the problem:
Instead of copying the selected motifs to the peng folder and then transferring all of them to the BaMM folder, copy them directly to the BaMM folder.
The example fasta file for the server contains ~30k sequences, which is too much for a simple check.
We should replace it with a smaller file and then flush the examples on the server.
cc @WanwanGe
All paths to scripts and binaries are currently hardcoded. There is no good reason for that: it makes deploying harder and the code unnecessarily fragile. Instead, we should set the PATH variable properly in the Dockerfile and make all executables accessible through the PATH.
Here is the list of problems that should be fixed by Friday:
During development we will accumulate a number of celery tasks that share the logic of converting the UI command-line input of a tool (e.g. PEnG-motif) to a command line call of the binary.
To avoid a messy code base, we should modularize our tasks by creating objects/methods that do this conversion, avoiding massive code duplication.
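One way such a shared conversion helper could look; this is a sketch only, and the class name, tool name and flags are invented for illustration:

```python
class ToolCommand:
    """Hypothetical shared builder: each celery task only declares its
    binary and flags; argument serialization lives here exactly once."""

    def __init__(self, binary):
        self.binary = binary
        self._args = []

    def option(self, flag, value=None):
        # Append "--flag value" (or a bare flag) and allow chaining.
        self._args.append(flag)
        if value is not None:
            self._args.append(str(value))
        return self

    def argv(self):
        # Argument list suitable for subprocess.run(..., shell=False).
        return [self.binary] + self._args

# A PEnG-motif task would then reduce to pure configuration:
cmd = ToolCommand("peng_motif").option("--pattern-length", 8).option("--both-strands").argv()
```

Each tool's task then only supplies its own flags, while quoting, logging and error handling can be implemented once in the shared class.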
File handles in Python should always be accessed in a with statement such as
with open('/path/to/file', 'w') as out_handle:
    print("Hello world!", file=out_handle)
The current code fails to close files properly in many cases. This should be fixed - the number of open file handles is limited by the kernel, and exhausting it can crash the server.
:-(
Currently we are still using an older, slower PEnG version; upgrading to the latest PEnG is realistic before the submission.
A Python script BaMMmatch.py has been created in the folder BaMMmotif/py/.
It can be a replacement for MMcompare_PWM.R.
Differences:
It can run on multiple cores and is therefore faster.
Note: The comparison is still limited to the 0th-order model. Higher-order comparison has not been implemented.
Things to be changed:
Here are all the points I recall that came up for improving the performance plots
Additionally:
cc @WanwanGe
models.yaml lists one less model than the models folder.
Maxi's current project is to replace the current BaMM workflow with a two-step process:
Maxi has uploaded his work recently (thanks a lot!). Here's my non-comprehensive list of things we should do before we can run the new version on the official BaMM webserver:
temp
simultaneously ${MEDIA_DB}/jobs/uuid, move log file into this directory.
To avoid more nasty problems with the git history, here are the guidelines for how to contribute code:
new_peng_workflow remote branch in your fork
new_peng_workflow in the soedinglab repository
Here are my suggestions of things we should tackle before we let the webserver out in the wild.
bamm-suite) and install them into the webserver docker image
bamm-suite.
👎
Since the tomtom tool is an important part of Roya's lab rotation, I have two questions/remarks here:
https://github.com/soedinglab/BaMM_webserver/blob/master/bammmotif/static/scripts/tomtomtool.py#L101
In this line you calculate a couple of entropies of the form sum(p_i * math.log(p_i)); this will crash and burn when p_i = 0.
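A sketch of how the zero case is usually handled, using the convention that p * log(p) tends to 0 as p tends to 0 (the function name is illustrative, not the one in tomtomtool.py):

```python
import math

def entropy(probs):
    """Shannon entropy in nats. Terms with p == 0 are skipped, following
    the convention p * log(p) -> 0 as p -> 0, so zero-probability
    columns no longer raise a math domain error."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A PWM column containing zeros is now handled gracefully:
h = entropy([0.5, 0.5, 0.0, 0.0])
```

Alternatively, adding a small pseudocount to all probabilities before taking logs achieves the same robustness while keeping every term in the sum.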
Also, have you already benchmarked the code against bigger databases? Are searches on the order of 1000x5000 realistic?
The hardcoded file paths here
BaMM_webserver/bammmotif/management/commands/populate_example.py
Lines 39 to 45 in 41927ec
are not in sync with the current git master.
needs investigation.
Just watched this video from 2014, in which it is said that Redis does not play very well with celery and one should rather use RabbitMQ instead.
https://www.youtube.com/watch?v=3cyq5DHjymw&t=24m24s
@meiermark, @AnjaSophieKiesel - do you know whether this is still a thing? Should we better switch?
Idea from Johannes:
BaMMScan could also output a zeroth-order logo derived from the reported motif hits on the sequences. The user could use it to check whether the BaMMs found his motif, or a similar one.
Strategy to speed up the seeding for huge datasets
Add --heuristic-eval-n-seqs <number-of-sequences> to FDR. On the server, I would set it to 10000. This way the speed should scale only with the number of motifs, but not with the sequence set size.
This should be fine, because we only need a rough estimate of the motif performance and q for optimization. We do not have to be very accurate.
The gitignore file doesn't contain a section for Python. You committed a couple of temporary files (*.pyc) that should not go into the repository.
You can find a .gitignore template for python-related files here:
https://github.com/github/gitignore/blob/master/Python.gitignore
Once the webserver is made public, certain passwords will leak via the history of the project and have to be invalidated.
As discussed in the last group meeting, the current FDR curves could provide more information for users on how well the models would perform in their settings.
We discussed:
EDIT: Wrong branch error.
Following the instructions, git submodule update --init --recursive yields
fatal: No url found for submodule path 'bammmotif/static/scripts/PEnG-motif' in .gitmodules
git ls-files --stage | grep 160000
shows that there are two submodules:
but neither are in .gitmodules (as a matter of fact .gitmodules doesn't exist at all).
To skip this error, both directories were removed with git rm.
git submodule update --init --recursive
then executes without errors, but also gives no further output (because I guess there is nothing to do?).
Then cp .env_template .env also doesn't work, because there is no .env_template.
I then copied the suggested variables into .env (I have no idea whether any more default variables need to be set).
Now I'm stuck at docker-compose build, which also fails. But this might be an OS-related problem, so I'm trying something different here first.
Any suggestions for the repository related problems?
😖
Hi all,
I spent a little time thinking how we could design the databases to make it easy to update and integrate motif databases.
The idea is the following:
All databases are in a special folder and have their own config file - which contains the metadata, a database name and a version number.
Whenever the server starts
Important things to note somewhere else
1.) Nginx (on marvin):
BaMM_webserver/bammmotif/models.py
Line 293 in 4ab827f
When I run docker-compose up, it gives a TypeError. Any idea how this happens?
Currently, input files plus output files are stored indefinitely. In case the webserver is heavily used, this will sooner or later lead to storage problems on marvin.
We can handle this by implementing a celery cronjob that
BaMM_webserver/webserver/settings.py hardcodes usernames, connection info and passwords. This is a security issue and very inconvenient when porting to other systems.
These settings should instead be read from environment variables that can be passed to the docker files.
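A minimal sketch of reading settings from the environment instead of hardcoding them; the helper and the variable names in the comment are illustrative, not the actual settings.py contents:

```python
import os

def env_setting(name, default=None):
    """Fetch a setting from the environment so credentials never appear
    in settings.py; fail loudly at startup if a required value is
    missing rather than at the first database access."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError("missing required environment variable: " + name)
    return value

# In settings.py this could replace the hardcoded credentials, e.g.:
#   "NAME": env_setting("MYSQL_DATABASE"),
#   "PASSWORD": env_setting("MYSQL_PASSWORD"),
#   "HOST": env_setting("DB_HOST", "db"),
```

The values can then be passed into the containers via docker-compose's env_file mechanism, which also keeps secrets out of the git history.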
bammmotif.utils.misc.CommandFailureException: 'plotBaMMLogo.R' '/code/media/19530c7c-9d7b-4040-a24c-29e743d51b24/Output/' 'PausingSequences.Rep1' '0' '--web' '1'
With the advent of multi motif database support in #31, which I am about to finish, I think we can start thinking about the final looks of the BaMM output page.
This is what I think should still be done:
When not a single motif is found, the peng workflow errors out instead of showing a message to the user.
filterPWM.py currently expects at least one motif.
Traceback (most recent call last):
File "/ext/filterPWMs/filterPWM.py", line 167, in <module>
main()
File "/ext/filterPWMs/filterPWM.py", line 57, in main
new_models = filter_pwms(models, min_overlap)
File "/ext/filterPWMs/utils.py", line 179, in filter_pwms
af = AffinityPropagation(affinity='precomputed').fit(matrix_sim)
File "/usr/local/lib/python3.5/site-packages/sklearn/cluster/affinity_propagation_.py", line 294, in fit
X = check_array(X, accept_sparse='csr')
File "/usr/local/lib/python3.5/site-packages/sklearn/utils/validation.py", line 462, in check_array
context))
ValueError: Found array with 0 sample(s) (shape=(0, 0)) while a minimum of 1 is required.
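A minimal guard would return early on an empty model list before the clustering call ever runs; here cluster_fn stands in for the real AffinityPropagation-based routine, which is not shown:

```python
def filter_pwms_safe(models, min_overlap, cluster_fn):
    """Return early on an empty model list so the clustering step, which
    requires at least one sample, is never reached. The caller can then
    show a 'no motifs found' message instead of a traceback."""
    if not models:
        return []
    return cluster_fn(models, min_overlap)
```

The workflow code calling filterPWM.py would treat the empty result as a normal outcome and render a user-facing message.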
Currently the text search against the database is exact, meaning that the user has to get the special characters right or she will not find the factor.
It would be helpful if the search were fuzzy or ignored special characters, to improve the user experience.
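One simple way to make the search tolerant of special characters is to normalize both the query and the factor names before matching; this is a sketch only, and a real implementation could additionally use trigram or edit-distance matching:

```python
import re

def normalize_name(name):
    """Lowercase and strip everything but letters and digits, so that
    'NF-kB', 'NFKB' and 'nf kb' all map to the same search key."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def search_factors(factor_names, query):
    # Substring match on the normalized forms.
    key = normalize_name(query)
    return [name for name in factor_names if key in normalize_name(name)]
```

This keeps the database untouched; only the comparison key changes, so hyphens, parentheses and case differences no longer hide results.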
cc @WanwanGe