
basepairmodels's People

Contributors

amtseng, juanelenter, zahoorz


basepairmodels's Issues

Interpretation failing due to old deepLIFT version

This is the error:
    File "/home/users/surag/oak/miniconda3/envs/basepairmodels/lib/python3.6/site-packages/basepairmodels/cli/interpret.py", line 192, in data_func
        return [dinuc_shuffle(model_inputs[0], args.num_shuffles, rng)] + \
    TypeError: dinuc_shuffle() takes from 1 to 2 positional arguments but 3 were given

I followed the installation instructions in the docs with a fresh conda env. They install deepLIFT version 0.6.10.0, and that version's dinuc_shuffle() takes only two arguments.

https://github.com/kundajelab/deeplift/blob/d0d5f3606b96ea6803b9c8bcb6c3b8e1f4b99300/deeplift/dinuc_shuffle.py#L41

Resolved by pip install --upgrade deeplift. Perhaps specify deepLIFT 0.6.13 or above in requirements.txt?
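
A quick way to check which dinuc_shuffle() an environment actually picked up (a diagnostic sketch; the exact signature depends on the installed deeplift release):

    import inspect
    import deeplift
    from deeplift.dinuc_shuffle import dinuc_shuffle

    # Print the installed version and the signature that interpret.py calls into.
    # With 0.6.10.0 the function accepts fewer positional arguments than
    # interpret.py passes; 0.6.13+ accepts the extra shuffle-count argument.
    print(getattr(deeplift, "__version__", "unknown"))
    print(inspect.signature(dinuc_shuffle))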

interpret saves h5 files extremely slowly

For a run of about 12k regions, each 1 kb wide, running the interpret script takes about as long to run through the regions and generate scores as it does to save those scores (about 1 GB) to disk. Since this is a single call to DeepDish, I wonder if this could be optimized in some way to write those files much faster.
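
One thing that might be worth trying (a sketch only, assuming the scores go through deepdish.io.save and that its compression option is available) is changing or disabling compression, since the compressor rather than the disk is often the bottleneck for writes of this size:

    import numpy as np
    import deepdish as dd

    # Stand-in for the scores dictionary that interpret writes out; the key
    # name and array shape here are illustrative, not the script's actual layout.
    scores = {"hyp_scores": np.random.rand(12000, 1000, 4).astype(np.float32)}

    # Compare write times with a fast compressor vs. no compression at all.
    dd.io.save("scores_blosc.h5", scores, compression=("blosc", 5))
    dd.io.save("scores_raw.h5", scores, compression=None)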

[suggestion] Allow the user to give a counts-loss-weight value in input_data.json

When training on two data sets with wildly different coverages, I'd like to be able to give each one its own counts loss weight. Otherwise, the data sets end up with different relative weightings between the profile and counts losses.

As I see it, there are two ways to do this, and both involve removing the flag from train and adding the information to input_data.json.

  1. Include a counts_loss_weight term on each task in input_data.json. Weight the counts-vs-profile losses for each track accordingly.
  2. Include a counts_loss_alpha term on each task in input_data.json. Run counts_loss_weight on the specified track with the given --alpha value. This way, tracks with different coverages can still be trained at the same alpha. (Or different alphas, if the user wants to do that for some reason.)
    Of course, you could implement both of these. In that case, you'd have to supply exactly one of these for each track.

This could be accomplished by weighting the input data before training, but for ease of use, it'd be nifty to just specify the alpha for each task and have the code deal with it automatically.
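
A minimal sketch of how the per-task lookup could work; both counts_loss_weight and counts_loss_alpha are the proposed json fields (not existing ones), and get_loss_weight() is only a stand-in for whatever the counts_loss_weight tool computes from coverage:

    import json

    def get_loss_weight(signal_files, alpha):
        # Placeholder for the coverage-based computation the existing
        # counts_loss_weight tool performs; not implemented here.
        raise NotImplementedError

    def counts_weight_for_task(task, default=1.0):
        if "counts_loss_weight" in task:        # option 1: explicit weight
            return task["counts_loss_weight"]
        if "counts_loss_alpha" in task:         # option 2: derive from coverage
            return get_loss_weight(task["signal"], alpha=task["counts_loss_alpha"])
        return default

    with open("input_data.json") as f:
        tasks = json.load(f)

    weights = {name: counts_weight_for_task(t) for name, t in tasks.items()}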

Second idea: allow the user to weight the different tracks relative to each other. For example, a track with high coverage will have larger loss values, which can dominate the training loss of a lower-coverage track. So include a parameter (and a corresponding feature in the counts_loss_weight program) that weights the total loss of each track either (1.) in absolute terms, so total_loss = loss[task_1] * 0.1 + loss[task_2] * 0.9, or (2.) with an alpha-like parameter that accounts for the expected difference in profile loss values based on coverage, so total_loss = loss[task_1] * get_loss_weight(task_1_coverage, alpha_task_1) + loss[task_2] * get_loss_weight(task_2_coverage, alpha_task_2).
Again, the user could scale the input data before training, but this would simplify things for biologists who don't want to manipulate their files by multiplying by magic numbers.

Your thoughts?

Option to set number of dilated layers (and sanity check for architecture)

I don't see an option to set the number of dilated convolutions in the train command. It's not a huge deal since I'm defining my own architecture, but it's an easy thing to miss that should be tweakable.
On that note, it'd be nice to have the commands do some sanity checking. For example, there's a layer prof_out_crop2match_output that handles the case where the architecture doesn't reduce down to the desired output width. In my case, if that layer is doing any work, it means I mis-defined my network, so it'd be nice if the program emitted a warning (and if it could silence the TensorFlow warnings during startup, that warning would be easier to see). Similarly, when I was making predictions and getting all zeros because I forgot --predict-peaks, it'd be nice for the program to tell me that instead of appearing to work but producing the wrong output.
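
A rough sketch of the kind of check I mean, assuming the pre-crop profile layer can be looked up by name ("profile_out" is an assumed name here; prof_out_crop2match_output is the layer mentioned above):

    import warnings

    def check_output_width(model, output_len):
        """Warn if prof_out_crop2match_output would actually trim anything."""
        # "profile_out" stands in for the layer that feeds the crop layer.
        pre_crop_width = model.get_layer("profile_out").output_shape[1]
        if pre_crop_width != output_len:
            warnings.warn(
                f"Profile head produces width {pre_crop_width} but {output_len} "
                f"was requested; prof_out_crop2match_output will silently crop it."
            )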

Integrate new batch generator that combines foreground and background loci

----- for the generator

  1. The input json will have a new field for background loci
  2. The user will specify the number of samples for foreground and background (-1 for all)
  3. The background_only option will dictate whether the bed file in the background_loci field of the input json gets a weight of zero or one for the MNLL loss (e.g., when training a background model, the weight for background_loci will be 1); see the sketch below

----- for integration

  1. Add command line argument to pass background_only flag to the batch generator
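
A minimal sketch of the weighting rule from point 3 above; the function name and the boolean-array convention are assumptions, not the actual generator code:

    import numpy as np

    def profile_loss_weights(is_background, background_only=False):
        """Per-example weights for the profile (MNLL) loss.

        Background loci normally get weight 0; when training a background
        model (background_only=True) they get weight 1, like foreground loci.
        """
        is_background = np.asarray(is_background, dtype=bool)
        if background_only:
            return np.ones(len(is_background), dtype=np.float32)
        return np.where(is_background, 0.0, 1.0).astype(np.float32)

    # Example batch: three foreground loci followed by two background loci.
    print(profile_loss_weights([False, False, False, True, True]))  # [1 1 1 0 0]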

shap scores for multi-task

Thank you for developing this fantastic package and the very recent update. I have not tried it yet, but I guess the predict step will be much faster.

I have some questions regarding multi-task shap score generation.

  1. For profile shap scores: in line 286 of shap_scores.py, the option stranded=True is hard-coded. I think this works fine for a single task (whether it is stranded or not), but for multiple unstranded tasks, will it pick up the wrong outputs? For example, with 2 unstranded tasks, task 0 would use outputs [0:2] (0 and 1) and task 1 would use outputs [2:4] (2 and 3, which would be out of bounds since there are only 2 outputs). See the indexing sketch below.

  2. For counts shap scores: in line 280 of shap_scores.py, it seems to use all count outputs. Again, this is fine for a single task, but for multiple tasks, won't it give every task the same scores?
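
For concreteness, here is the indexing behavior I mean (an illustration, not the actual shap_scores.py code):

    def profile_output_indices(task_id, stranded):
        """Which profile output heads a task's shap scores would be computed over."""
        if stranded:
            return [2 * task_id, 2 * task_id + 1]
        return [task_id]

    # Two unstranded tasks, so the model only has profile outputs 0 and 1:
    print(profile_output_indices(0, stranded=True))   # [0, 1] -> happens to work
    print(profile_output_indices(1, stranded=True))   # [2, 3] -> out of bounds
    print(profile_output_indices(1, stranded=False))  # [1]    -> intended behavior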

Thank you for your time. Let me know if you have any questions or comments.

[suggestion] Remove --stranded and --has-control flags; infer these from input_data.json

This is just a suggestion, but I wanted to log it some place where we have a formal channel to discuss.
Currently, in the input_data.json file, the user must provide control tracks and must also indicate the strandedness of the data. Then, she must also pass --stranded and --has-control to a whole slew of scripts. I propose we rely on input_data.json alone to infer the strandedness and controlledness of the data.
This would also allow for a clean way to have datasets of mixed strandednesses and controllednesses. For example,
    {
      "task_nanog_plus"  : {"strand": 0, "task_id": 0,
                            "signal": [...], "peaks": [...], "control": [...]},
      "task_nanog_minus" : {"strand": 1, "task_id": 0,
                            [signal, peaks, control]},
      "task_mnase"       : {"strand": 0, "task_id": 1,
                            [signal, peaks]}            // control omitted
    }

would be a valid input.
In the case of mixed strandednesses, it would construct a model with the appropriate number of outputs (2*n_stranded_inputs + n_unstranded_inputs), and then either (1.) the model would only expect the number of control tracks listed in the input, or (2.) the model would expect a control track for every output, but the code would supply bias tracks full of zeroes where the user has not provided one.
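
A sketch of the inference step, using the field names from the example above (none of this is existing basepairmodels code):

    import json
    from collections import defaultdict

    with open("input_data.json") as f:
        entries = json.load(f)

    # Group entries by task_id, collecting which strands and controls exist.
    tasks = defaultdict(lambda: {"strands": set(), "has_control": False})
    for entry in entries.values():
        task = tasks[entry["task_id"]]
        task["strands"].add(entry["strand"])
        task["has_control"] |= "control" in entry

    n_stranded = sum(1 for t in tasks.values() if len(t["strands"]) == 2)
    n_unstranded = len(tasks) - n_stranded
    n_outputs = 2 * n_stranded + n_unstranded

    print(n_outputs, {tid: t["has_control"] for tid, t in tasks.items()})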

I need this sort of functionality because I'm mixing and matching all sorts of data types, some stranded, some controlled, and some both stranded and controlled.

Your thoughts on this proposal?


Installation fails due to tensorflow requirement

I tried to install the package using the command pip install --no-cache-dir --ignore-installed git+https://github.com/kundajelab/basepairmodels.git from the README.

I get the following error:

    Collecting git+https://github.com/kundajelab/basepairmodels.git
      Cloning https://github.com/kundajelab/basepairmodels.git to /tmp/pip-req-build-__cioa9b
      Running command git clone -q https://github.com/kundajelab/basepairmodels.git /tmp/pip-req-build-__cioa9b
    WARNING: Keyring is skipped due to an exception: Failed to create the collection: Prompt dismissed..
    WARNING: Keyring is skipped due to an exception: Failed to create the collection: Prompt dismissed..
    ERROR: Could not find a version that satisfies the requirement tensorflow-gpu==1.14 (from basepairmodels==0.1.0) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0)
    ERROR: No matching distribution found for tensorflow-gpu==1.14 (from basepairmodels==0.1.0)
    WARNING: Keyring is skipped due to an exception: Failed to create the collection: Prompt dismissed..
    WARNING: Keyring is skipped due to an exception: Failed to create the collection: Prompt dismissed..

using tensorflow 2

Thank you for developing basepairmodels. We are interested in using it on some of our ChIP-seq and CUT&Tag data.

I have a question about the tensorflow version. We recently built a GPU server with A100 GPU cards. The minimum CUDA version for the A100 is 11. If I install the package with pip install git+https://github.com/kundajelab/basepairmodels.git@v0.1.4 as in the README, I get tensorflow 1.14 (it works with CUDA 10 on the A100, but is not optimal).

Is there a way to use tensorflow 2 with CUDA 11? For example, if I directly install the dev version with pip install git+https://github.com/kundajelab/basepairmodels.git (remove the @v0.1.4), tensorflow is bumped to 2.4.1, which can be paired with CUDA 11. I am wondering whether this works.

Thank you so much for your help. Let me know if you have any questions.

Ning

parameter tweaking for DNase/ATAC-seq data

Dear developers,

I wonder if there is documentation or an example for training on DNase/ATAC-seq input? I would highly appreciate suggestions on parameter tweaking!

It seems ATAC/DNase is a supported model architecture, judging from these lines in train_and_validate:

    if model_arch_name == "BPNet_ATAC_DNase":
        model = get_model(
            tasks, bias_tasks, model_arch_params, bias_model_arch_params, 
            name_prefix="main")
    else:        
        model = get_model(tasks, model_arch_params, name_prefix="main")

Thanks!!

[Suggestion] Read in custom architecture and loss functions from command line

As I start complicating my interaction with BPNet more, I'd like to add a feature where the user can easily supply a network architecture and loss function to the program. Here's a rough sketch of how I was thinking this could work, and I'd love to get your input. I'm very much not a software engineer, so these may be terrible anti-patterns with a much better way of addressing them.

Here's the issue I'm bumping into. When I want to create a custom model architecture, I have to dive deep into the source of basepairmodels and modify model_archs.py. This is problematic for three reasons. First, the changes are global, and I cannot, for example, keep my model architecture file in the same directory as my data without symlinking from basepairmodels into my experiment directory (and even that doesn't help if I'm working on multiple models in different directories). Second, if I download the latest version of basepairmodels, it will overwrite my changes to model_archs.py, or at least break the symlink to my modified version. Third, it is very difficult to provide additional customization options on the command line, as might be necessary when doing a grid search over hyperparameters.

  1. Model definitions.
    1.1 Add a flag --model-src to train; this flag would accept a string naming a python file. Let's call it customModelDef.py.
    1.2. Add a flag --model-args to train; this flag would be an arbitrary string. Let's call the string modelArgs for the moment.
    1.3. The python file named by --model-src shall contain a function called model(). It will take a single string as an argument: the string given on the command line to --model-args. The function model() shall return a network, just as the functions in model_archs.py do currently.
    1.3.1. The function model() could optionally be required to accept other arguments that are germane to other parts of the program. For example, since the sequence generator needs to know the input length, and the number of bias profiles is required to provide the correct input to the model, then model() could require arguments of the form model(input_seq_len, output_len, num_bias_profiles, modelArgs).
    1.3.2. The function could also be designed to accept keyword arguments from the main program, so that model()'s signature would be model(modelArgs, **kwargs). Then the main program would provide additional information that model() could use or discard. For example, the main program could call customModelDef.model(modelArgs, input_seq_len=NNNN, output_len=NNNN, num_bias_profiles=NNNN, ...) and so on. This way, for an architecture where num_bias_profiles is irrelevant, model() could simply ignore that keyword argument.
    1.4. The function model() may do with its argument string what it pleases. For example, the modelArgs string could be something like "num_profiles=5:kernel_size=6:allow_opt=false:add_insult_layer=0.825", in which case model() would return the corresponding network. Or, more likely, modelArgs would be something like "/projects/Sebastian/training/config.json", in which case model() would probably open up that file and read in configuration from it. In any event, the main train command would be completely agnostic to how modelArgs is processed or its meaning.

  2. Add a custom loss function.
    2.1. Add a flag --loss-src to any cli components that need the loss function. (Since most of the tools that work with the network have to create multinomial_nll before they can load the model, this would probably be most of the cli tools.) This flag would take a string naming a python file; call it customLossDef.py.
    2.2. customLossDef.py should contain a function that returns a loss function, or similar. I'm not familiar enough with how the loss is created to know what the precise architecture of this function should be.
    2.2.1. One option would be to have a function getLossFunction(lossArgs), accepting a string like model() would. This function would then return a loss function based on the string lossArgs. But there could be other, better ways of implementing this.
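
To make the loading part of this concrete, here is a minimal sketch of what train could do with the proposed flags (the option and function names come from the outline above, plus a --loss-args string by analogy with 2.2.1's lossArgs; none of it is existing basepairmodels code, and the length values are placeholders):

    import argparse
    import importlib.util

    def load_module(path):
        """Import a user-supplied python file by path (e.g. customModelDef.py)."""
        spec = importlib.util.spec_from_file_location("user_module", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module

    parser = argparse.ArgumentParser()
    parser.add_argument("--model-src")   # 1.1: path to customModelDef.py
    parser.add_argument("--model-args")  # 1.2: opaque string handed to model()
    parser.add_argument("--loss-src")    # 2.1: path to customLossDef.py
    parser.add_argument("--loss-args")   # by analogy with 2.2.1's lossArgs
    args = parser.parse_args()

    # 1.3 / 1.3.2: call model() with the opaque string plus keyword context.
    model_def = load_module(args.model_src)
    model = model_def.model(args.model_args,
                            input_seq_len=2114, output_len=1000,
                            num_bias_profiles=2)  # placeholder values

    # 2.2.1: obtain the loss function the same way.
    loss_def = load_module(args.loss_src)
    loss_fn = loss_def.getLossFunction(args.loss_args)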

Your thoughts?

bedGraphToBigWig error in README example

Howdy

I was working through the example, and when I got to the following code snippet

# get coverage of 5’ positions of the plus strand
bedtools genomecov -5 -bg -strand + \
        -g hg38.chrom.sizes -ibam merged.bam \
        | sort -k1,1 -k2,2n > plus.bedGraph

# get coverage of 5’ positions of the minus strand
bedtools genomecov -5 -bg -strand - \
        -g hg38.chrom.sizes -ibam merged.bam \
        | sort -k1,1 -k2,2n > minus.bedGraph

# Convert bedGraph files to bigWig files
bedGraphToBigWig plus.bedGraph hg38.chrom.sizes plus.bw
bedGraphToBigWig minus.bedGraph hg38.chrom.sizes minus.bw

I encountered the following error when running bedGraphToBigWig using hg38.chrom.sizes from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes:

    chrEBV is not found in chromosome sizes file

shap_scores aren't compatible with non-bias models

Is there support in shap_scores.py for models that don't contain a bias track? When setting up a model with no bias, there is only one sequence input, but the shap_scores script assumes that there are three model inputs. This means the CLI breaks contribution-score generation after such a model is trained.
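
A sketch of the kind of guard shap_scores.py could use instead of assuming three inputs; the variable names and the make-up of the extra inputs are illustrative, not the script's actual ones:

    # Build the explainer's input list based on how many inputs the model has.
    n_inputs = len(model.inputs)
    if n_inputs == 1:
        explainer_inputs = [shuffled_sequences]                  # sequence-only model
    elif n_inputs == 3:
        explainer_inputs = [shuffled_sequences, bias_counts, bias_profile]
    else:
        raise ValueError(f"Unexpected number of model inputs: {n_inputs}")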

Add an option to predict allowing for a bed file of regions to predict

The predict program currently uses input_data.json both to gather task information and to get the list of peaks to predict on. If I want to train and predict on different regions, this means I have to create a second input_data.json with different bed files. It'd be nice if there were a way to hand the predict program a bed file of regions to predict on, so that I don't have to jump through this hoop.
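
A minimal sketch of what the option could look like on the implementation side; the flag name --regions-bed is made up for illustration:

    import pandas as pd

    def load_regions(bed_path):
        """Read a BED file of prediction regions (chrom, start, end only)."""
        return pd.read_csv(bed_path, sep="\t", header=None,
                           usecols=[0, 1, 2], names=["chrom", "start", "end"])

    # predict would iterate over these intervals instead of the peak files
    # listed in input_data.json, e.g.:
    # regions = load_regions(args.regions_bed)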
