bamm_webserver's People

Contributors

anjakiesel, mwess, wge11

Forkers

croth1 wge11

bamm_webserver's Issues

max. size of input file

In our paper, we claimed that the maximum size of a FASTA file is 50 MB, while the .env_template file sets:
MAX_UPLOAD_FILE_SIZE = 20MB
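
To compare the two limits in code, the size string first has to be converted to bytes. The following is a minimal sketch, not the server's actual parsing logic: the helper names parse_size and upload_allowed are hypothetical, and it assumes that 1 MB means 1024 * 1024 bytes.

```python
import re

# Assumption: binary units (1 MB = 1024 * 1024 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a string such as '20MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"cannot parse file size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.upper()]


def upload_allowed(n_bytes: int, limit: str) -> bool:
    """Return True if an upload of n_bytes fits within the configured limit."""
    return n_bytes <= parse_size(limit)
```

With this sketch, a 50 MB FASTA file is rejected under a 20MB limit, which is the mismatch described above.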

[fa31e1f2] extra newlines in meme generated for refinement

There are two bugs here:

  1. the script generating the meme file from the selected motifs adds two extra newlines between bg frequencies and the first model in the meme file
  2. BaMMmotif makes assumptions about the number of newlines, although this is not specified in the meme file format

[9dd888d1] error in plotPvalStats.R

Urgh :-(

Error in evaluateMotif(pvalues, filename = filename, rerank = FALSE, data_eta0 = data_eta0) :
  Error: input p-values must all be in the range 0 to 1!
Execution halted

Add motif refinement with user uploaded initialization file

@WanwanGe noted that the current peng-de-novo workflow does not allow the user to do a pure motif refinement. That is, after the restructuring there is currently no way to upload your own pwm or meme file and refine the model.

It should be straightforward to implement the BaMM Refinement workflow that

  • takes a fasta file
  • takes an initialization file

and then runs the BaMM pipeline.

  • UI can be salvaged from Anja's old de-novo workflow
  • celery modules can be salvaged from the second part of the new peng-bamm pipeline.

visualize motif occurrences in bammscan

Currently the webserver has no functionality to visualize motif hits on sequences. This would be a very nice feature that we can queue for some time after the submission.

Estimating the q from the data

Todo

  • write q value into meme file in shoot-peng (Christian)
  • change q-value in BaMM such that it reads it from meme, if not defined, use it from the command line argument.

add full job_id to result pages

Currently the webserver relies on the user copying the job id in order to be able to access the results after some time. However, the job id is only displayed right after submission, not on the result page.

We should add it to the result pages, so that users can access it later.

BaMM workflow: Ignoring unknown option --basename

I think there's a change in the command line syntax in BaMM tools after updating the submodules.

BaMMScan /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output /code/media/216aea2f-8645-44c8-b9f6-0
6a0fb9ec2f3/Input/59474_seqs.fa --BaMMFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/594
74_seqs_motif_1.ihbcp --bgModelFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_seqs
.hbcp --order 2 --Order 2 --extend 0 0 --maxPWM 1 --pvalCutoff 0.01 --basename 59474_seqs_motif_1
Ignoring unknown option --basename
Ignoring unknown option 59474_seqs_motif_1

and

FDR /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output /code/media/216aea2f-8645-44c8-b9f6-06a0fb
9ec2f3/Input/59474_seqs.fa --BaMMFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_se
qs_motif_1.ihbcp --bgModelFile /code/media/jobs/07c164ec-f7ed-462e-ab55-c606e8014928/Output/59474_seqs.hbcp
 --order 2 --Order 2 --extend 0 0 --maxPWM 1 --cvFold 1 --mFold 5 --sOrder 2 --basename 59474_seqs_motif_1
Ignoring unknown option --basename
Ignoring unknown option 59474_seqs_motif_1

seeding workflow: selecting optimization fails when optimizing several times in a row.

Currently it is not possible to run several bamm optimization runs from one peng seeding, due to the selected motifs being copied to the peng folder. A second refinement round will copy to the same folder, running BaMM with all motifs ever selected.

This is a way around the problem:

Instead of copying the selected motifs to the peng folder and then transferring all of them to the BaMM folder, copy them directly to the BaMM folder.
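
A minimal sketch of this workaround, assuming each refinement run has its own BaMM input directory. The function name stage_selected_motifs and the directory layout are hypothetical and not taken from the webserver code.

```python
import shutil
from pathlib import Path


def stage_selected_motifs(selected_files, bamm_input_dir):
    """Copy only the motifs selected for this run into the run's own folder.

    Nothing is written to the shared peng folder, so a second refinement
    run cannot pick up motifs that were selected in an earlier run.
    """
    target = Path(bamm_input_dir)
    target.mkdir(parents=True, exist_ok=True)
    staged = []
    for motif_file in selected_files:
        # shutil.copy returns the path of the newly created copy
        staged.append(Path(shutil.copy(motif_file, target)))
    return staged
```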

Smaller example data

The example fasta file for the server contains ~30k sequences which is too much for a simple check.

We should replace it with a smaller file then flush the examples on the server.

cc @WanwanGe

put scripts into PATH instead of hardcoding the script paths

All paths to scripts and binaries are currently hardcoded. There is no good reason for that. It makes deploying harder and the code unnecessarily fragile. Instead, we should set the PATH variable properly in the Dockerfile and make all scripts and binaries accessible through the PATH.
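
On the Python side, tools could then be looked up by name instead of by absolute path. This is a small sketch using the standard library's shutil.which; the helper name resolve_tool is hypothetical.

```python
import shutil


def resolve_tool(name, path=None):
    """Return the full path of an executable found on PATH.

    `path` optionally overrides the search path (useful for testing);
    by default the PATH environment variable is used.
    """
    found = shutil.which(name, path=path)
    if found is None:
        raise FileNotFoundError(f"{name} not found on PATH")
    return found
```

Failing early with a clear message makes a missing binary visible at task start rather than deep inside a pipeline run.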

Small post-submission bug-fixes

Small improvements

  • run button on "Discovered Seeds" example page does not use the right css button class
  • MMcompare should use an e-value cutoff, not a p-value cutoff. UI should be updated accordingly.
  • The "browse all models" button in the "Motif Database" section should redirect to a static url: .../database/browse/<db_id>
  • Input files for all tools should be validated synchronously and an error message shown if file format is invalid
  • use sessions to list all jobs of the user in the "find my job" list.
  • implement maximum file size restriction for uploads and put it in the config file
  • rename PWM init format to MEME and use it consistently in all tools.

Remaining webserver issues

Here is the list of problems that should be fixed by Friday:

Peng result page

  • Move results up and add missing result name
  • remove 'select motifs...'
  • decide on decimal points for ausfc
  • 3 decimal points for log
  • Make the download button nice again. (Might be a little bit more tricky, because it's below a form node, that we need for the checkboxes. Doing it later when most of the other things are done.)

Bamm result page

  • Show predicted motifs
  • Fix download button for detailed motifs
  • Redesign settings
  • move "back to peng" down and rename
  • Download button for peng results also in bamm results
  • Strip path from input sequence
  • decide on decimal points for ausfc

Find my job

  • Repair find my job

Examples

  • check if examples are in database
  • load examples (upload and run through peng pipeline, bamm and bammscan)
  • assign uuids starting with 000....0 and then 000....1
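
One way to obtain such predictable ids is to build the UUIDs from small integers. This is a sketch only; the function name example_job_id is hypothetical.

```python
import uuid


def example_job_id(index: int) -> uuid.UUID:
    """Return a UUID whose integer value is `index` (0, 1, 2, ...)."""
    return uuid.UUID(int=index)


# The first two example jobs would get:
# 00000000-0000-0000-0000-000000000000
# 00000000-0000-0000-0000-000000000001
```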

Submission

  • use Anja's submission text

Main page

  • all buttons should be the same size and not be black.

Modularize UI->commandline interface

During development we will accumulate a number of celery tasks that share the logic of converting the UI input of a tool (e.g. PEnGmotif) to a command line call of the binary.
To keep the code base clean and avoid massive code duplication, we should modularize our tasks by creating objects/methods that do this conversion.
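
As an illustration of what such an object could look like, here is a sketch of a parameter class for one tool. The class BaMMScanParams and its field names are hypothetical; the flag names are copied from the BaMMScan call logged in the "Ignoring unknown option --basename" issue and may not match the current tool version.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BaMMScanParams:
    """Holds validated UI input and knows how to turn it into argv."""
    output_dir: str
    fasta_file: str
    bamm_file: str
    bg_model_file: str
    model_order: int = 2
    pval_cutoff: float = 0.01

    def to_argv(self) -> List[str]:
        # Build the argument list once, so every celery task reuses it.
        return [
            "BaMMScan", self.output_dir, self.fasta_file,
            "--BaMMFile", self.bamm_file,
            "--bgModelFile", self.bg_model_file,
            "--order", str(self.model_order),
            "--pvalCutoff", str(self.pval_cutoff),
        ]
```

A task would then only need to construct the parameter object and pass to_argv() to subprocess.run, which keeps the conversion logic in one place.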

most file handles are not closed properly

File handles in python should always be accessed in a with statement such as

with open('/path/to/file', 'w') as out_handle:
    print("Hello world!", file=out_handle)

The current code fails to close files properly in many cases. This should be fixed: the number of open file handles is limited by the kernel, so leaked handles can crash the server.

Add BaMMmatch.py

A python script BaMMmatch.py has been created in the folder BaMMmotif/py/.

It can be a replacement for MMcompare_PWM.R.

Differences:

  1. the input can be MEME-format or BaMM-format file.
  2. the input options are changed; it is required to give a path+name for output file.
  3. the output is changed: each column has a title, the second column is removed, and the PWMs from a MEME file are named by their consensus instead of input filename+count.

The script can run on multiple cores and is therefore faster.

Note: The comparison is still limited to 0th-order model. Higher-order comparison has not been implemented.

reworking the motif database result page

Things to be changed:

  • remove fields lab, grant, data source
  • species field should be model specific
  • details should have a mechanism to forward to a URL if the models are not trained BaMMs (Hocomoco)
  • cell types field should be stored as a list, as GTRD sometimes provides a mixture

update performance plots on webserver

Here are all the points I recall that came up for improving the performance plots

  • unify axis labels (TPR = sensitivity = recall)
  • use same definition for TP everywhere
  • use png instead of jpeg images

Additionally:

  • check whether rerank AUSFC x occ ranking is consistent with benchmark AUSFC
  • decide which AUSFC we would like to show in the webserver

cc @WanwanGe

Inconsistent databases

  • Hocomoco databases are currently inconsistent on the server. models.yaml lists one fewer model than the models folder contains.
  • GTRD yeast defines more models than available in models

PEnG seeding - BaMM refinement workflow

Maxi's current project is to replace the current BaMM workflow with a two step process:

  1. Seeding phase: runs PEnG-motif to find seeds for BaMM motif
  2. Refinement phase: the user selects the peng seeds to be optimized by bamm

Maxi has uploaded his work recently (Thanks a lot!). Here's my non-comprehensive list of things we should do before we can run the new version on the official BaMM webserver:

Planned Changes

Critical Changes

  • use JobMeta table for BaMM jobs to avoid JobID clashes
  • rebase upon Anja’s latest version
  • make PEnG stage runnable with example data
  • update PEnG to latest version (Christian)
  • prevent the user from starting bamm refinement without selected motifs.
  • download buttons for motifs on peng result page need fixing
  • BaMM workflow: ValueError: invalid literal for int() with base 10: 'motif' - because of newly introduced header?
  • replace former de-novo workflow on the main page with the new seed-refine workflow
  • rework populate_example.py for new database design
  • PengToBamm page should not show PEnG job_id as job_name - user should be able to choose the job_name
  • fix plots in BaMM result page
  • peng module problem: temp_dir seems to be unset - therefore several instances write on the same directory named temp simultaneously
  • Peng->Bamm job shows wrong job id in "Submission Successful" view

Important Changes

  • add multiple database support - (Christian)
  • add logging configuration - (Christian)
  • add header to peng result page: “Select Motifs for Higher Order Refinement”
  • replace MEME plotter with our own plotting script for consistency on peng result page
  • PEnG settings is blank on peng result page
  • peng results table should not have # sites (not very accurate), but instead AUSFC scoring
  • BaMM results coming from PEnG should have a link to the peng results
  • PengToBamm submit: do not give the user the option to choose reverse complement, hide the sequence set; use a foreign key to PEnGJoB in BaMMJob table to access this data from the PEnG job
  • validate input of MMCompare
  • Rework the design of the html pages, specifically the results pages.
  • Add error handling in case the task chain fails.

Nice-to-have Changes

  • put original meme file in “Download all” archive on peng result page
  • add MMcompare annotation to peng result page
  • cleaning up unused classes/functions
  • put job directories in ${MEDIA_DB}/jobs/uuid, move log file into this directory.
  • implement bamm tools as command line modules

Changes after everything else is done

  • cleanup directory and file name mess

How to work with the existing code

To avoid more nasty problems with the git history, here are the guidelines for contributing code:

  • fork the webserver repository, do not work directly on the soedinglab repository
  • derive a feature branch from the new_peng_workflow remote branch in your fork
  • commit your changes to your derived branch
  • when you are done, open a pull request of your feature branch against new_peng_workflow in the soedinglab repository

testing the server workflows for revision 1

denovo

  • denovo one-step
  • denovo one-step example
  • denovo seeding
  • denovo seeding example
  • denovo refinement
  • denovo with sequence set without overrepresented pattern

bammscan

  • bammscan meme model
  • bammscan bamm model
  • bammscan database model
  • bammscan example

mmcompare

  • mmcompare meme model
  • mmcompare bamm model
  • mmcompare example

Things we might want to tackle before the webserver goes live.

Here are my suggestions of things we should tackle before we let the webserver out in the wild.

General

  • move all utility scripts that are not specific to the webserver to a separate repository (bamm-suite) and install them into the webserver docker image

Modularization

  • modularize calls to commandline scripts and reuse them in celery tasks
  • separate css code from html templates into reusable style sheets

Maintenance

  • make sure we are logging all server activity (e.g. rotate daily logfiles)
  • implement cleanup logic (e.g. removal of old files) as celery tasks
  • setup backup/restore logic
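
For the daily log rotation mentioned above, Python's standard library already provides a suitable handler. The following is a sketch under assumptions: the logger name, the log format, and the 30-day retention are illustrative choices, not project settings.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_server_logger(log_path, keep_days=30):
    """Create a logger that starts a new logfile at midnight.

    Rotated files older than `keep_days` days are deleted automatically.
    """
    logger = logging.getLogger("bamm_webserver")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = TimedRotatingFileHandler(
            log_path, when="midnight", backupCount=keep_days
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```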

Documentation

  • Move the documentation to the general BaMM documentation in bamm-suite.
  • Explain how to start the webserver in production, how it can be backed up, and how it can be restored in case of data loss.
  • How to add new models to an existing database

Roadmap to first revision

Roadmap

  • update to new version of BaMM & AURRC scripts (#56)
  • implement and validate initializing q from the data (#65)
  • decide on number of cores per job & check parallelization
  • add motif occurrences plot for unequal sequence lengths & test for correctness
  • redesign motif search output page (#68)
  • making the webserver public (#61)
  • create new databases and upload to our gwdg server
  • update the documentation to the latest server version and include important details (how to read our higher-order logos, performance plots, ...)
  • flip the logo plots on the result page
  • check impressum
  • change motif scanning e-value to p-value
  • update motif refinement UI page.

fast FDR seeding

Strategy to speed up the seeding for huge datasets

  • add extra parameter --heuristic-eval-n-seqs <number-of-sequences> to FDR

changes

  • FDR uses at most <number-of-sequences> sequences from the positive set. If the set contains more, the positive sequences are subsampled randomly
  • FDR uses exactly <number-of-sequences> negative sequences

On the server, I would set <number-of-sequences> to 10000. This way the speed should scale only with the number of motifs, but not with the sequence set size.

This should be fine, because we only need a rough estimation of the motif performance and q for optimization. We do not have to be very accurate.
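
The random subsampling of the positive set can be sketched as follows. This only illustrates the idea in Python; it is not the FDR tool's implementation, and the function name subsample_sequences is hypothetical. Its parameter n_seqs plays the role of the --heuristic-eval-n-seqs value.

```python
import random


def subsample_sequences(sequences, n_seqs, seed=None):
    """Return at most n_seqs sequences, chosen uniformly without replacement."""
    if len(sequences) <= n_seqs:
        # Small sets are used in full.
        return list(sequences)
    rng = random.Random(seed)  # a fixed seed makes the choice reproducible
    return rng.sample(sequences, n_seqs)
```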

making the webserver public

When the webserver is made public, certain passwords will leak through the history of the project and have to be invalidated.

passwords that have to be changed

  • for bamm account on marvin
  • for mysql database (less problematic)
  • bammadmin gwdg account

code that has to be modified

  • password for bammadmin should be removed from settings.py

miscellaneous

  • webserver should be licenced under AGPL, readme should be adapted accordingly.
  • bamm-suite needs license text

making github repos public

  • bamm-suite
  • bamm-server

FDR evaluation: give scores for important fdr values

As discussed in the last group meeting, the current FDR curves could provide more information for users on how well the models would perform in their settings.

We discussed:

  • in addition to our 1:10 pos:neg ratio in the FDR-SEN curve, add the 1:1 and 1:100 curve
  • give the user BaMMscores for important fdr values.

make_startup_work errors

EDIT: Wrong branch error.

Following the instructions git submodule update --init --recursive yields
fatal: No url found for submodule path 'bammmotif/static/scripts/PEnG-motif' in .gitmodules

git ls-files --stage | grep 160000 shows that there are two submodules:

  • bammmotif/static/scripts/PEnG-motif
  • bammmotif/static/scripts/bamm-private

but neither is in .gitmodules (as a matter of fact, .gitmodules doesn't exist at all).

To skip this error both directories were removed with git rm.

git submodule update --init --recursive then executes without errors, but also gives no further output (because I guess there is nothing to do?).

Then cp .env_template .env also isn't working because there is no .env_template.

I then copied the suggested variables into .env (I have no idea if any more default variables need to be set).

Now I'm stuck at docker-compose build, which also fails. But this might be an OS-related problem, so I am trying something different here first.

Any suggestions for the repository related problems?

Suggestion how to read in databases

Hi all,

I spent a little time thinking how we could design the databases to make it easy to update and integrate motif databases.

The idea is the following:
All databases are in a special folder and have their own config file - which contains the metadata, a database name and a version number.

Whenever the server starts

  • it first scans the database folder and reads the config file of each database found there
  • it uses the config file to
    • find missing databases and add them
    • update databases that are outdated
    • delete databases from the mysql database that are not present in the database folder.
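
The three cases can be expressed as a comparison between what is found on disk and what is stored in mysql. The sketch below assumes that each config file yields a database name and an integer version number; the function name plan_sync and the example database names in the usage are hypothetical.

```python
def plan_sync(on_disk, in_db):
    """Compare two {database_name: version} mappings.

    Returns three sorted lists: databases to add, to update, and to delete.
    """
    to_add = sorted(on_disk.keys() - in_db.keys())
    to_delete = sorted(in_db.keys() - on_disk.keys())
    # Assumption: a database is outdated when the folder has a higher version.
    to_update = sorted(
        name for name in on_disk.keys() & in_db.keys()
        if on_disk[name] > in_db[name]
    )
    return to_add, to_update, to_delete
```

Keeping this planning step free of database and file system calls would make it easy to test separately from the code that applies the changes.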

Web server To-Dos List

Important things to note somewhere else
1.) Nginx (on marvin):

  • The NginX on marvin has a config file which is located here:
    /etc/nginx/sites-enabled/bammmotif.mpibpc.mpg.de
  • Settings such as max_user_upload are defined in this file.
  • After changing this file, run sudo service nginx reload to apply the changes.

cleanup logic for jobs

Currently, input files and output files are stored indefinitely. If the webserver is heavily used, this will sooner or later lead to storage problems on marvin.

We can handle this by implementing a celery cronjob that

  • detects all expired jobs by their creation date
  • deletes jobs in database plus job directories on the HD.
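
The core of such a cleanup job could look like the sketch below. It deliberately leaves out the celery scheduling and the deletion of database rows, and it assumes that job creation dates are available as a mapping from job id to datetime and that each job has a directory named after its id. All names are hypothetical.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path


def expired_job_ids(created_at, now, max_age):
    """Return the ids of all jobs that are older than max_age."""
    return [job_id for job_id, created in created_at.items()
            if now - created > max_age]


def remove_job_dirs(job_ids, jobs_root):
    """Delete the directory of each given job below jobs_root."""
    for job_id in job_ids:
        # ignore_errors: a directory that is already gone is not a failure
        shutil.rmtree(Path(jobs_root) / job_id, ignore_errors=True)
```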

read sensitive settings from environment variables

BaMM_webserver/webserver/settings.py hardcodes usernames, connection info and passwords. This is a security issue and very inconvenient when porting to other systems.

These settings should instead be read from environment variables that can be passed to the docker files.
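
A sketch of how settings.py could assemble the database settings from the environment. The variable names DB_NAME, DB_USER, DB_PASSWORD, DB_HOST and DB_PORT are assumptions made for this example, as are the defaults for host and port.

```python
import os


def database_settings(env=None):
    """Build the database settings dict from environment variables."""
    env = os.environ if env is None else env
    return {
        # Required values: a missing variable raises KeyError at startup.
        "NAME": env["DB_NAME"],
        "USER": env["DB_USER"],
        "PASSWORD": env["DB_PASSWORD"],
        # Optional values with assumed defaults.
        "HOST": env.get("DB_HOST", "localhost"),
        "PORT": int(env.get("DB_PORT", "3306")),
    }
```

Failing at startup when a required variable is missing is usually preferable to silently falling back to a hardcoded password.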

[19530c7c] Error in plotBaMMLogo.R code

bammmotif.utils.misc.CommandFailureException: 'plotBaMMLogo.R' '/code/media/19530c7c-9d7b-4040-a24c-29e743d51b24/Output/' 'PausingSequences.Rep1' '0' '--web' '1'

TODO list: design of the BaMM output page

With the advent of multi motif database support in #31, which I am about to finish, I think we can start thinking about the final looks of the BaMM output page.

This is what I think should still be done:

  • below the obligatory header: a table of BaMM refined models
  • add MMcompare database hits from the DbMatch table
  • include the detailed information section with higher order plots, etc from Anja's BaMM pipeline
  • add a link to the peng results page, so that the user can jump back easily.

seeding workflow errors, when not at least one enriched motif found

When not a single motif is found, the peng workflow errors instead of showing a message to the user.

filterPWM.py currently expects at least one motif.

Traceback (most recent call last):
  File "/ext/filterPWMs/filterPWM.py", line 167, in <module>
    main()
  File "/ext/filterPWMs/filterPWM.py", line 57, in main
    new_models = filter_pwms(models, min_overlap)
  File "/ext/filterPWMs/utils.py", line 179, in filter_pwms
    af = AffinityPropagation(affinity='precomputed').fit(matrix_sim)
  File "/usr/local/lib/python3.5/site-packages/sklearn/cluster/affinity_propagation_.py", line 294, in fit
    X = check_array(X, accept_sparse='csr')
  File "/usr/local/lib/python3.5/site-packages/sklearn/utils/validation.py", line 462, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 0)) while a minimum of 1 is required.
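
The traceback suggests that the clustering step is reached with an empty model list. One possible guard is sketched below; the wrapper name filter_or_empty is hypothetical, and filter_func stands in for the real filtering function so that the sketch stays independent of filterPWM.py.

```python
def filter_or_empty(models, filter_func):
    """Run filter_func only if there is at least one model.

    Returns an empty list when no motif was found, so the caller can
    show a message to the user instead of crashing.
    """
    if not models:
        return []
    return filter_func(models)
```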

improve database text search

Currently the text search against the database is exact, meaning that the user has to get the special characters right or she will not find the factor.

It would be helpful if the search were fuzzy or ignored special characters to improve the user experience.

cc @WanwanGe
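
A sketch of a search that ignores case and special characters and tolerates small typos, using difflib from the standard library. The function names and the similarity cutoff of 0.8 are assumptions; on the server this would more likely be implemented as a database query.

```python
import difflib
import re


def normalize(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def fuzzy_search(query, names, cutoff=0.8):
    """Return all names whose normalized form is close to the query."""
    by_key = {}
    for name in names:
        by_key.setdefault(normalize(name), []).append(name)
    hits = difflib.get_close_matches(
        normalize(query), list(by_key), n=10, cutoff=cutoff
    )
    return [name for key in hits for name in by_key[key]]
```

With this normalization, a query such as "sox2" would match a factor stored as "SOX-2" without the user having to type the hyphen.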
