max. size of input file

In our paper, we state that the maximum size of a FASTA file is 50 MB, but the .env_template file sets:
MAX_UPLOAD_FILE_SIZE = 20MB
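To make the mismatch between the two limits easy to check, here is a minimal sketch of how a size string such as the one above could be turned into bytes and compared against an upload. The helper names (parse_size, upload_allowed) and the choice of binary units (1 MB = 1024 * 1024 bytes) are assumptions for illustration; how the server actually interprets the value is not stated in this issue.

```python
import re

# Assumption: binary units; the webserver may interpret "MB" differently.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a string such as '20MB' or '50 MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", text.upper())
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def upload_allowed(n_bytes, limit="20MB"):
    """Return True if an upload of n_bytes fits within the configured limit."""
    return n_bytes <= parse_size(limit)
```

With these helpers, a 30 MB upload is rejected under the 20MB setting but accepted under the 50 MB limit mentioned in the paper.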

testing and debugging new databases

We have two higher-order databases for mouse and human. They should be integrated before submission.

[fa31e1f2] extra newlines in meme generated for refinement

There are two bugs here:

the script generating the meme file from the selected motifs adds two extra newlines between the bg frequencies and the first model in the meme file
BaMMmotif makes assumptions about the number of newlines, although this is not specified in the meme file format
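One possible workaround on the generating side is to normalize blank lines before the file is written. The following is only a sketch; collapse_blank_lines is a hypothetical helper, and it assumes that a single blank line between sections is what BaMMmotif expects, which this issue does not confirm.

```python
def collapse_blank_lines(text):
    """Collapse every run of blank lines into a single blank line."""
    out = []
    previous_blank = False
    for line in text.splitlines():
        blank = not line.strip()
        if blank and previous_blank:
            # Skip the second, third, ... blank line of a run.
            continue
        out.append(line)
        previous_blank = blank
    return "\n".join(out) + "\n"
```

The more robust fix is still on the reader side: the parser should skip any number of blank lines between sections.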

[9dd888d1] error in plotPvalStats.R

Urgh :-(

Error in evaluateMotif(pvalues, filename = filename, rerank = FALSE, data_eta0 = data_eta0) :
  Error: input p-values must all be in the range 0 to 1!
Execution halted

Add motif refinement with user uploaded initialization file

@WanwanGe noted that the current peng-de-novo workflow does not allow the user to do a pure motif refinement. That is, after the restructuring there is currently no way to upload your own PWM or MEME file and refine the model.

It should be straightforward to implement the BaMM Refinement workflow that

takes a fasta file
takes an initialization file

and then runs the BaMM pipeline.

UI can be salvaged from Anja's old de-novo workflow
celery modules can be salvaged from the second part of the new peng-bamm pipeline.

visualize motif occurrences in bammscan

Currently the webserver has no functionality to visualize motif hits on sequences. This would be a very nice feature that we can queue for some time after the submission.

Estimating the q from the data

Todo

write the q value into the meme file in shoot-peng (Christian)
change the q-value handling in BaMM so that it reads q from the meme file and, if it is not defined there, falls back to the command line argument.

add full job_id to result pages

Currently the webserver relies on the user copying the job id in order to be able to access the results after some time. However, the job id is only displayed right after submission, not on the result page.

We should add it to the result pages, so that users can access it later.

typo: 'Predicition' -> 'Prediction'

BaMM_webserver/bammmotif/tasks.py

Line 225 in 41927ec

if job.mode == 'Predicition':

[ffc51207] Problem with very short sequences

FDR analysis in shootpeng crashes when at least one sequence is shorter than peng_motif's reported motifs.
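Until the crash is fixed in the tool itself, one conceivable workaround is to separate out sequences that are too short before the FDR step. This is a sketch only: split_by_length is a hypothetical helper, and whether dropping short sequences is acceptable for the analysis is an assumption that needs to be checked.

```python
def split_by_length(records, min_length):
    """Split (name, sequence) pairs into those long enough and those too short.

    min_length would be the length of the longest reported motif.
    """
    kept, dropped = [], []
    for name, seq in records:
        if len(seq) >= min_length:
            kept.append((name, seq))
        else:
            dropped.append((name, seq))
    return kept, dropped
```

Returning the dropped records as well makes it possible to tell the user how many sequences were ignored.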

BaMM workflow: Ignoring unknown option --basename

I think there's a change in the command line syntax in BaMM tools after updating the submodules.

BaMMScan /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output /code/media/216aea2f-8645-44c8-b9f6-0
6a0fb9ec2f3/Input/59474_seqs.fa --BaMMFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/594
74_seqs_motif_1.ihbcp --bgModelFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_seqs
.hbcp --order 2 --Order 2 --extend 0 0 --maxPWM 1 --pvalCutoff 0.01 --basename 59474_seqs_motif_1
Ignoring unknown option --basename
Ignoring unknown option 59474_seqs_motif_1

and

FDR /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output /code/media/216aea2f-8645-44c8-b9f6-06a0fb
9ec2f3/Input/59474_seqs.fa --BaMMFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_se
qs_motif_1.ihbcp --bgModelFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_seqs.hbcp
 --order 2 --Order 2 --extend 0 0 --maxPWM 1 --cvFold 1 --mFold 5 --sOrder 2 --basename 59474_seqs_motif_1
Ignoring unknown option --basename
Ignoring unknown option 59474_seqs_motif_1

seeding workflow: selecting optimization fails when optimizing several times in a row.

Currently it is not possible to run several bamm optimization runs from one peng seeding, due to the selected motifs being copied to the peng folder. A second refinement round will copy to the same folder, running BaMM with all motifs ever selected.

This is a way around the problem:

Instead of copying the selected motifs to the peng folder and then transferring all of them to the BaMM folder, copy them directly to the BaMM folder.

Smaller example data

The example fasta file for the server contains ~30k sequences, which is too many for a simple check.

We should replace it with a smaller file then flush the examples on the server.

cc @WanwanGe

put scripts into PATH instead of hardcoding the script paths

All paths to scripts and binaries are currently hardcoded. There is no good reason for that. It makes deployment harder and the code unnecessarily fragile. Instead, we should set the PATH variable properly in the Dockerfile and make all scripts and binaries accessible through the PATH.
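On the Python side, tools can then be resolved by name instead of by absolute path. A minimal sketch, where resolve_tool is a hypothetical helper and shutil.which is the standard library lookup:

```python
import shutil


def resolve_tool(name):
    """Look up an executable on PATH and fail early with a clear message."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name!r} not found on PATH")
    return path
```

Calling this once at startup for every required tool would surface a broken deployment immediately, rather than in the middle of a job.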

Small post-submission bug-fixes

Small improvements

run button on "Discovered Seeds" example page does not use the right css button class
MMcompare should use an e-value cutoff, not a p-value cutoff. The UI should be updated accordingly.
The "browse all models" button in the "Motif Database" section should redirect to a static url: .../database/browse/<db_id>
Input files for all tools should be validated synchronously and an error message shown if file format is invalid
use sessions to list all jobs of the user in the "find my job" list.
implement maximum file size restriction for uploads and put it in the config file
rename the PWM init format to MEME and use the name consistently in all tools.
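For the synchronous input validation mentioned in the list above, a simple FASTA check could look like the sketch below. validate_fasta is a hypothetical helper, and the allowed alphabet (ACGTN, upper or lower case) is an assumption; the tools may accept other characters.

```python
def validate_fasta(text, alphabet="ACGTN"):
    """Return None if text looks like DNA FASTA, otherwise an error message."""
    allowed = set(alphabet + alphabet.lower())
    seen_header = False
    residues_in_record = 0
    n_records = 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header and residues_in_record == 0:
                return f"line {lineno}: previous record has no sequence"
            seen_header = True
            residues_in_record = 0
            n_records += 1
        else:
            if not seen_header:
                return f"line {lineno}: sequence data before first '>' header"
            bad = set(line) - allowed
            if bad:
                return f"line {lineno}: unexpected characters {sorted(bad)}"
            residues_in_record += len(line)
    if n_records == 0:
        return "no FASTA records found"
    if residues_in_record == 0:
        return "last record has no sequence"
    return None
```

The returned message can be shown directly in the upload form, so the user gets feedback before a job is queued.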

Remaining webserver issues

Here is the list of problems that should be fixed by Friday:

Peng result page

Move results up and add missing result name
remove 'select motifs...'
decide on decimal points for ausfc
3 decimal points for log
Make the download button nice again. (Might be a little bit more tricky, because it's below a form node, that we need for the checkboxes. Doing it later when most of the other things are done.)

Bamm result page

Show predicted motifs
Fix download button for detailed motifs
Redesign settings
move "back to peng" down and rename
Download button for peng results also in bamm results
Strip path from input sequence
decide on decimal points for ausfc

Find my job

Repair find my job

Examples

check if examples are in database
load examples (upload and run through peng pipeline, bamm and bammscan)
assign uuids starting with 000....0 and then 000....1

Submission

use Anja's submission text

Main page

all buttons should be the same size and not be black.

Modularize UI->commandline interface

During the development we will accumulate a number of celery tasks that share the logic of converting the UI commandline input of a tool (e.g. PEnGmotif) to a command line call of the binary.
To avoid a messy code base, we should modularize our tasks by creating objects/methods that do this conversion to avoid massive code duplication.
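One way this could look is a small object that holds the executable, the positional arguments and the options, and knows how to turn itself into an argument list. The class below is a sketch with a hypothetical name and interface; the flags in the usage example (--order, --extend, --pvalCutoff) are taken from BaMMScan calls seen in the server logs.

```python
import shlex


class CommandLine:
    """Collects the arguments of one tool call and renders them as argv."""

    def __init__(self, executable, positionals=(), options=None):
        self.executable = executable
        self.positionals = list(positionals)
        self.options = dict(options or {})

    def to_argv(self):
        argv = [self.executable, *map(str, self.positionals)]
        for flag, value in self.options.items():
            if value is None or value is False:
                continue  # option not set
            argv.append(f"--{flag}")
            if value is True:
                continue  # boolean switch without a value
            if isinstance(value, (list, tuple)):
                argv.extend(str(v) for v in value)
            else:
                argv.append(str(value))
        return argv

    def __str__(self):
        # Shell-quoted form, useful for log messages.
        return " ".join(shlex.quote(arg) for arg in self.to_argv())


cmd = CommandLine(
    "BaMMScan",
    ["out", "seqs.fa"],
    {"order": 2, "extend": (0, 0), "pvalCutoff": 0.01},
)
```

A celery task would then only fill in the options from the job object and pass cmd.to_argv() to subprocess, so the conversion logic lives in one place.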

most file handles are not closed properly

File handles in python should always be accessed in a with statement such as

with open('/path/to/file', 'w') as out_handle:
    print("Hello world!", file=out_handle)

The current code fails to close files properly in many cases. This should be fixed: the number of open file handles is limited by the kernel, so leaking them can crash the server.

[42c5d4b7] crash in MMcompare script

:-(

upgrade to the latest peng version

Currently we are still using an older, slower peng version. Upgrading to the latest peng is realistic before the submission.

Add BaMMmatch.py

A Python script, BaMMmatch.py, has been created in the folder BaMMmotif/py/.

It can be a replacement for MMcompare_PWM.R.

Differences:

the input can be a MEME-format or a BaMM-format file.
the input options have changed; a path and name for the output file are now required.
the output has changed: each column has a title, the second column is removed, and the PWMs from a MEME file are named by their consensus instead of input filename + count.

The script can run on multiple cores and is therefore faster.

Note: The comparison is still limited to 0th-order model. Higher-order comparison has not been implemented.

reworking the motif database result page

Things to be changed:

remove fields lab, grant, data source
species field should be model specific
details should have a mechanism to forward to an external URL if the models are not trained BaMMs (e.g. Hocomoco)
cell types field should be stored as a list, as GTRD sometimes provides a mixture

update performance plots on webserver

Here are all the points I recall that came up for improving the performance plots

unify axis labels (TPR = sensitivity = recall)
use same definition for TP everywhere
use png instead of jpeg images

Additionally:

check whether rerank AUSFC x occ ranking is consistent to benchmark AUSFC
decide which AUSFC we would like to show in the webserver

cc @WanwanGe

Inconsistent databases

Hocomoco databases are currently inconsistent on the server. models.yaml lists one model fewer than the models folder contains.
GTRD yeast defines more models than available in models
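A consistency check of this kind can be automated. The sketch below only compares two collections of model identifiers; how the identifiers are extracted from models.yaml and from the models folder is left open, because the layout of both is not described here. compare_model_sets is a hypothetical helper.

```python
def compare_model_sets(listed, on_disk):
    """Report models that are listed but missing, and present but unlisted."""
    listed, on_disk = set(listed), set(on_disk)
    return {
        "missing_on_disk": sorted(listed - on_disk),
        "not_listed": sorted(on_disk - listed),
    }
```

Running such a check when a database is loaded would catch both cases described above.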

PEnG seeding - BaMM refinement workflow

Maxi's current project is to replace the current BaMM workflow with a two step process:

Seeding phase: runs PEnG-motif to find seeds for BaMM motif
Refinement phase: the user selects the peng seeds to be optimized by bamm

Maxi has uploaded his work recently (Thanks a lot!). Here's my non-comprehensive list of things we should do before we can run the new version on the official BaMM webserver:

Planned Changes

Critical Changes

Important Changes

Nice-to-have Changes

put original meme file in “Download all” archive on peng result page
add MMcompare annotation to peng result page
cleaning up unused classes/functions
put job directories in ${MEDIA_DB}/jobs/uuid, move log file into this directory.
implement bamm tools as command line modules

Changes after everything else is done

cleanup directory and file name mess

How to work with the existing code

To avoid more nasty problems with the git history, here are the guidelines for contributing code:

fork the webserver repository, do not work directly on the soedinglab repository
derive a feature branch from the new_peng_workflow remote branch in your fork
commit your changes to your derived branch
when you are done, open a pull request of your feature branch against new_peng_workflow in the soedinglab repository

testing the server workflows for revision 1

denovo

bammscan

bammscan meme model
bammscan bamm model
bammscan database model
bammscan example

mmcompare

mmcompare meme model
mmcompare bamm model
mmcompare example

Things we might want to tackle before the webserver goes live.

Here are my suggestions of things we should tackle before we let the webserver out in the wild.

General

move all utility scripts that are not specific to the webserver to a separate repository (bamm-suite) and install them into the webserver docker image

Modularization

modularize calls to commandline scripts and reuse them in celery tasks
separate css code from html templates into reusable style sheets

Maintenance

make sure we are logging all server activity (e.g. rotate daily logfiles)
implement cleanup logic (e.g. removal of old files) as celery tasks
setup backup/restore logic

Documentation

Move the documentation to the general BaMM documentation in bamm-suite.
Explain how to start the webserver in production, how it can be backed up, and how it can be restored in case of data loss.
How to add new models to an existing database

"check all" javascript for selecting motifs in manual denovo workflow is broken

👎

Roadmap to first revision

Roadmap

tomtom tool

Since the tomtom tool is an important part of Roya's lab rotation, I have two questions/remarks here:

https://github.com/soedinglab/BaMM_webserver/blob/master/bammmotif/static/scripts/tomtomtool.py#L101

In this line you calculate a couple of entropies of the form sum_i p_i * math.log(p_i). This will crash and burn when p_i = 0, because math.log(0) raises a ValueError.

Also have you already benchmarked the code against bigger databases? Are searches in the order of 1000x5000 realistic?
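Regarding the p_i = 0 case: the usual convention is that terms with p_i = 0 contribute 0 to the entropy, since p * log(p) tends to 0 as p tends to 0. A minimal sketch of a safe version follows; the function name and the sign convention (negative sum, natural log) are assumptions and may differ from what tomtomtool.py needs.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats; terms with p == 0 are skipped (limit of p * log p is 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

An alternative is to add pseudocounts to the PWM columns before taking logs, which also avoids zeros but changes the values slightly.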

populate_example.py: Filename mismatch

The hardcoded file paths here

BaMM_webserver/bammmotif/management/commands/populate_example.py

Lines 39 to 45 in 41927ec

 filename= '/code/example_data/Hepg2JunD.fasta' 

 f = open(str(filename)) 

 new_entry.Input_Sequences.save(filename, File(f)) 

 filename= '/code/example_data/Hepg2JunD.peng' 

 f = open(str(filename)) 

 new_entry.Motif_InitFile.save(filename, File(f))

are not in sync with the current git master.

[c05f7883] error raised in plotPvalStats.R

needs investigation.

Celery: RabbitMQ vs. Redis

I just watched this video from 2014, which says that Celery does not play very well with Redis and that one should use RabbitMQ instead.

https://www.youtube.com/watch?v=3cyq5DHjymw&t=24m24s

@meiermark, @AnjaSophieKiesel - do you know whether this is still a thing? Should we switch?

bammscan: additionally show zeroth order of motif hits

Idea from Johannes:

BaMMScan could also output a zeroth order logo derived from the reported motif hits on the sequences. The user can use it to check whether the bamms find his motif, or a similar one.

fast FDR seeding

Strategy to speed up the seeding for huge datasets

add extra parameter --heuristic-eval-n-seqs <number-of-sequences> to FDR

changes

FDR uses at most <number-of-sequences> sequences from the positive set. If the set contains more, the positive sequences are subsampled randomly
FDR uses exactly <number-of-sequences> negative sequences

On the server, I would set <number-of-sequences> to 10000. This way the speed should scale only with the number of motifs, but not with the sequence set size.

This should be fine, because we only need a rough estimation of the motif performance and q for optimization. We do not have to be very accurate.
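The subsampling of the positive set can be sketched in a few lines. FDR itself is not written in Python, so this only illustrates the intended behaviour; the function name and the seed parameter are assumptions.

```python
import random


def subsample(sequences, max_n, seed=None):
    """Return at most max_n sequences, drawn randomly without replacement."""
    sequences = list(sequences)
    if len(sequences) <= max_n:
        return sequences
    return random.Random(seed).sample(sequences, max_n)
```

A fixed seed keeps repeated runs on the same input reproducible, which helps when debugging differences between jobs.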

*.pyc files in repository

The gitignore file doesn't contain a section for python. You committed a couple of temporary files (*.pyc) that should not go in the repository.

You can find a .gitignore template for python-related files here:

https://github.com/github/gitignore/blob/master/Python.gitignore

making the webserver public

When the webserver repository is made public, certain passwords will leak through the history of the project and have to be invalidated.

passwords that have to be changed

for bamm account on marvin
for mysql database (less problematic)
bammadmin gwdg account

code that has to be modified

password for bammadmin should be removed from settings.py

miscellaneous

webserver should be licensed under AGPL, and the readme should be adapted accordingly.
bamm-suite needs license text

making github repos public

bamm-suite
bamm-server

FDR evaluation: give scores for important fdr values

As discussed in the last group meeting, the current FDR curves could give users more information about how well the models would perform in their settings.

We discussed:

in addition to our 1:10 pos:neg ratio in the FDR-SEN curve, add the 1:1 and 1:100 curve
give the user BaMMscores for important fdr values.
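If the true positive rate and false positive rate at a score threshold are known, the FDR for any pos:neg ratio follows directly, since TP scales with the number of positives and FP with the number of negatives. The sketch below shows that arithmetic; whether the BaMM FDR tool computes its curves this way is an assumption, and fdr_at_ratio is a hypothetical helper.

```python
def fdr_at_ratio(tpr, fpr, neg_per_pos):
    """FDR = FP / (TP + FP) for a data set with neg_per_pos negatives per positive.

    Per positive sequence: TP = tpr and FP = fpr * neg_per_pos.
    """
    fp = fpr * neg_per_pos
    if tpr + fp == 0:
        return 0.0  # nothing predicted at this threshold
    return fp / (tpr + fp)
```

For example, a threshold with TPR 0.5 and FPR 0.05 gives an FDR of about 0.09 at 1:1, 0.5 at 1:10 and about 0.91 at 1:100, which shows how strongly the curve depends on the assumed ratio.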

make_startup_work errors

EDIT: Wrong branch error.

Following the instructions git submodule update --init --recursive yields
fatal: No url found for submodule path 'bammmotif/static/scripts/PEnG-motif' in .gitmodules

git ls-files --stage | grep 160000 shows that there are two submodules:

bammmotif/static/scripts/PEnG-motif
bammmotif/static/scripts/bamm-private

but neither are in .gitmodules (as a matter of fact .gitmodules doesn't exist at all).

To skip this error both directories were removed with git rm.

git submodule update --init --recursive then executes without errors, but also gives no further output (because I guess there is nothing to do?).

Then cp .env_template .env also isn't working because there is no .env_template.

I then copied the suggested variables into .env (I have no idea whether any more default variables need to be set).

Now I'm stuck at docker-compose build, which also fails. But this might be an OS-related problem, so I am trying something different here first.

Any suggestions for the repository related problems?

[1a38acd0] segfault in BaMM fdr

😖

Suggestion how to read in databases

Hi all,

I spent a little time thinking how we could design the databases to make it easy to update and integrate motif databases.

The idea is the following:
All databases are in a special folder and have their own config file - which contains the metadata, a database name and a version number.

Whenever the server starts

it first scans the database folder and reads the config file of each database found there
it uses the config files to
- find missing databases and add them
- update databases that are outdated
- delete databases from the mysql database that are not present in the database folder.
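The three steps can be expressed as a comparison of two name-to-version mappings, one built from the config files and one from the mysql database. A minimal sketch, assuming that "outdated" simply means the versions differ; plan_sync and the version strings in the example are hypothetical.

```python
def plan_sync(on_disk, in_db):
    """Compare {name: version} mappings and return (to_add, to_update, to_delete)."""
    to_add = sorted(name for name in on_disk if name not in in_db)
    to_update = sorted(
        name for name in on_disk
        if name in in_db and on_disk[name] != in_db[name]
    )
    to_delete = sorted(name for name in in_db if name not in on_disk)
    return to_add, to_update, to_delete


# Hypothetical example state: folder contents vs. mysql contents.
plan = plan_sync(
    {"GTRD": "1.1", "Hocomoco": "1.0", "new_db": "0.1"},
    {"GTRD": "1.0", "Hocomoco": "1.0", "old_db": "2.0"},
)
```

Keeping the planning step free of side effects makes it easy to log what the server is about to change before it touches the database.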

Web server To-Dos List

Important things to note somewhere else
1.) Nginx (on marvin):

The nginx instance on marvin has a config file located here:
/etc/nginx/sites-enabled/bammmotif.mpibpc.mpg.de
Settings such as max_user_upload are defined in this file.
After changing this file, run 'sudo service nginx reload' to apply the changes.

TypeError: __init__() missing 1 required positional argument: 'on_delete'

BaMM_webserver/bammmotif/models.py

Line 293 in 4ab827f

motif = models.ForeignKey('Motifs', on_delete=models.CASCADE)

When I run docker-compose up, it gives a TypeError. Any idea how this happens?

cleanup logic for jobs

Currently input files and output files are stored indefinitely. If the webserver is heavily used, this will sooner or later lead to storage problems on marvin.

We can handle this by implementing a celery cronjob that

detects all expired jobs by their creation date
deletes jobs in database plus job directories on the HD.
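The detection step boils down to comparing creation dates against a cutoff. The sketch below is independent of Django and celery; the function name and the 30-day retention period are assumptions, not decisions made in this issue.

```python
from datetime import datetime, timedelta


def expired_job_ids(jobs, now, max_age_days=30):
    """Return ids of jobs created more than max_age_days before now.

    jobs maps job id -> creation datetime.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [job_id for job_id, created in jobs.items() if created < cutoff]
```

The periodic task would then delete the database rows and the job directories for the returned ids, skipping the example jobs.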

read sensitive settings from environment variables

BaMM_webserver/webserver/settings.py hardcodes usernames, connection info and passwords. This is a security issue and very inconvenient when porting to other systems.

These settings should instead be read from environment variables that can be passed to the docker files.
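A minimal sketch of how settings.py could read such values follows. The helper name env_setting and the variable names in the usage note are hypothetical; the actual names should match whatever is defined in the .env file.

```python
import os


def env_setting(name, default=None, environ=os.environ):
    """Read a setting from the environment, failing loudly if it is required."""
    value = environ.get(name, default)
    if value is None:
        raise RuntimeError(f"environment variable {name} is not set")
    return value
```

In settings.py this would be used as, for example, env_setting("MYSQL_PASSWORD") for required secrets and env_setting("MYSQL_HOST", "localhost") where a default is acceptable.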

[19530c7c] Error in plotBaMMLogo.R code

bammmotif.utils.misc.CommandFailureException: 'plotBaMMLogo.R' '/code/media/19530c7c-9d7b-4040-a24c-29e743d51b24/Output/' 'PausingSequences.Rep1' '0' '--web' '1'

TODO list: design of the BaMM output page

With the advent of multi motif database support in #31, which I am about to finish, I think we can start thinking about the final looks of the BaMM output page.

This is what I think should still be done:

below the obligatory header: a table of BaMM refined models
add MMcompare database hits from the DbMatch table
include the detailed information section with higher order plots, etc from Anja's BaMM pipeline
add a link to the peng results page, so that the user can jump back easily.

seeding workflow errors when no enriched motif is found

When not a single motif is found, the peng workflow errors instead of showing a message to the user.

filterPWM.py currently expects at least one motif.

Traceback (most recent call last):
  File "/ext/filterPWMs/filterPWM.py", line 167, in <module>
    main()
  File "/ext/filterPWMs/filterPWM.py", line 57, in main
    new_models = filter_pwms(models, min_overlap)
  File "/ext/filterPWMs/utils.py", line 179, in filter_pwms
    af = AffinityPropagation(affinity='precomputed').fit(matrix_sim)
  File "/usr/local/lib/python3.5/site-packages/sklearn/cluster/affinity_propagation_.py", line 294, in fit
    X = check_array(X, accept_sparse='csr')
  File "/usr/local/lib/python3.5/site-packages/sklearn/utils/validation.py", line 462, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 0)) while a minimum of 1 is required.
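The traceback shows the clustering step being called with an empty similarity matrix. A guard before that call would avoid the crash. The sketch below uses a stand-in for the real clustering code: filter_pwms_safely and its cluster parameter are hypothetical and only illustrate where the check belongs.

```python
def filter_pwms_safely(models, min_overlap, cluster):
    """Skip clustering when there is nothing to cluster.

    cluster stands in for the existing filtering logic (hypothetical signature).
    """
    if not models:
        return []  # no motifs found: nothing to filter
    return cluster(models, min_overlap)
```

The webserver still needs to handle the empty result and show a "no enriched motifs found" message instead of an error page.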

improve database text search

Currently the text search against the database is exact, meaning that the user has to get the special characters right or she will not find the factor.

It would be helpful if the search was fuzzy or ignores special characters to improve the user experience.

cc @WanwanGe
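A simple way to make the search ignore special characters is to normalize both the query and the stored names before comparing. The sketch below works on plain Python lists; the function names are hypothetical, the factor names are only examples, and a real implementation would probably store a normalized column in the database so the filtering can happen in SQL.

```python
import re


def normalize(term):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def search(query, names):
    """Return all names whose normalized form contains the normalized query."""
    needle = normalize(query)
    if not needle:
        return []
    return [name for name in names if needle in normalize(name)]
```

With this, a query such as "nfkb" finds "NF-kB", and "jun" finds both "c-Jun" and "JunD".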
