databio / bedshift Goto Github PK

View Code? Open in Web Editor NEW

16.0 16.0 2.0 745 KB

Bedfile perturbation tool

Home Page: http://bedshift.databio.org

License: BSD 2-Clause "Simplified" License

Python 92.87% Shell 3.93% R 3.19%

genomic-intervals region-of-interest genomic-regions

bedshift's Introduction

Bedshift

Docs: bedshift.databio.org

Install from PyPi: pip install bedshift

Install from local repository: pip install .

Command line

Run with:

bedshift -l tests/hg38.chrom.sizes -b tests/test.bed

See bedshift -h for parameters.

Python

import bedshift

bedshifter = bedshift.Bedshift('tests/test.bed', 'hg38.chrom.sizes')
bedshifter.all_perturbations(addrate=0.3, addmean=320.0, addstdev=20.0,
							 shiftrate=0.3, shiftmean=-10.0, shiftstdev=120.0,
							 cutrate=0.1,
							 mergerate=0.11,
							 droprate=0.03)
# can also run single operations: shift, add, cut, merge, drop

bedshifter.to_bed('test_output.bed')

Development

Run tests (from this directory):

python -m unittest

Double check the output files to see if the regions make sense.

build docs:

bedshift's People

Contributors

Stargazers

Watchers

Forkers

berkuva baharehh

bedshift's Issues

displayed outputfiles are incorrect with repetitions

When I run something with repetitions, it says:

                                                                                                                                        
REGION COUNT | original: 47331  new: 52064      changed: 4733                                                                           
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733

First, I think it seems redundant to output this 100 times, since the message is identical.
Second, I think it would be better if this message fit on 1 line, if we're really going to display it for every single replicate
Finally, I think the output file is wrong -- it's not outputting it to that file at all, it's actually this path:

ll /home/nsheff/bedshifted//0/rep*_add1_0.bed                         
-rw-rw-r-- 1 nsheff nsheff 1249750 Mar 23 18:09 /home/nsheff/bedshifted//0/rep100_add1_0.bed                                            
-rw-rw-r-- 1 nsheff nsheff 1348751 Mar 25 11:47 /home/nsheff/bedshifted//0/rep10_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348824 Mar 25 11:47 /home/nsheff/bedshifted//0/rep11_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348789 Mar 25 11:47 /home/nsheff/bedshifted//0/rep12_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348663 Mar 25 11:47 /home/nsheff/bedshifted//0/rep13_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348680 Mar 25 11:47 /home/nsheff/bedshifted//0/rep14_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348836 Mar 25 11:47 /home/nsheff/bedshifted//0/rep15_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348873 Mar 25 11:47 /home/nsheff/bedshifted//0/rep16_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348768 Mar 25 11:47 /home/nsheff/bedshifted//0/rep17_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348825 Mar 25 11:47 /home/nsheff/bedshifted//0/rep18_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348704 Mar 25 11:47 /home/nsheff/bedshifted//0/rep19_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348711 Mar 25 11:46 /home/nsheff/bedshifted//0/rep1_add1_0.bed                                              
-rw-rw-r-- 1 nsheff nsheff 1348810 Mar 25 11:47 /home/nsheff/bedshifted//0/rep20_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348876 Mar 25 11:47 /home/nsheff/bedshifted//0/rep21_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348851 Mar 25 11:47 /home/nsheff/bedshifted//0/rep22_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348888 Mar 25 11:47 /home/nsheff/bedshifted//0/rep23_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348727 Mar 25 11:47 /home/nsheff/bedshifted//0/rep24_add1_0.bed

This just makes the log files too long when running a whole PEP with lots of replicates.
What do you think of just spitting out 1 line like:

"Generating X replicates for ... in output location ..."
"Finished generating X replicates"

That will be more understandable. You could output a line every time 10% is complete or something if you want more verbosity.

make required folders

if a folder for an output doesn't exist, it gives:

## [1 of 216] sample: add1_0; pipeline: bedshift_run
'pipestat' not found in 'looper' section of the project configuration file. Using defaults.
Could not determine pipestat namespace. Caught exception: AttributeError('pipestat_config')
Calling pre-submit function: refgenconf.looper_refgenie_plugin
Writing script to /home/nsheff/code/bedshift_paper/fig/code/pep_project_5trials/looper_output/submission/bedshift_run_add1_0.sub
Job script (n=1; 0.00Gb): looper_output/submission/bedshift_run_add1_0.sub
Compute node: zither
Start time: 2021-03-23 17:47:32
Welcome to bedshift version 1.1.0-dev
Shifting file: '/home/nsheff/encode_tfbs/regions//wgEncodeAwgTfbsUwHek293CtcfUniPk.narrowPeak'
Params:
  chrom.sizes file: /home/nsheff/code/refgenie_sandbox/alias/hg19/fasta/default/hg19.chrom.sizes
  shift:
    shift rate: 0.0
    shift mean distance: 0.0
    shift stdev: 150.0
    shift regions from file: None
  add:
    rate: 0.1
    add mean length: 320.0
    add stdev: 30.0
    add regions from file: /home/nsheff/code/bedshift_paper/fig/code/ALL_1000overlap_300files_segments.bed
    valid regions: None
  cut rate: 0.0
  drop rate: 0.0
    drop regions from file: None
  merge rate: 0.0
  outputfile: /home/nsheff/bedshifted//0/add1_0.bed
  repeat: 100
  yaml_config: None

        4733 regions changed in total.

Traceback (most recent call last):
  File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 571, in main
    bedshifter.to_bed(modified_outfile)
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 465, in to_bed
    self.bed.to_csv(outfile_name, sep='\t', header=False, index=False, float_format='%.0f')
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 3170, in to_csv
    formatter.save()
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 185, in save
    f, handles = get_handle(
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/io/common.py", line 493, in get_handle
    f = open(path_or_buf, mode, encoding=encoding, errors=errors, newline="")
FileNotFoundError: [Errno 2] No such file or directory: '/home/nsheff/bedshifted//0/rep1_add1_0.bed'

should we just auto-create said folder? otherwise I have to create all these manually...

Performance improvement

Python Pandas is slow for the large number of single-row operations in bedshift. It may be faster to read bedfiles into a native object like a list or dictionary and conduct operations on it.

Add from valid regions

A file can be provided similar to "add from file", except instead of exact regions that can be added, it contains large ranges where regions can be added.

bedshift is using an outdated pandas API

The requirements file doesn't specify a specific pandas version. Therefore, it will grab pandas>2.0. In pandas 2.0, the df.append method is no longer used.

This is occurring in a few places, and I believe that there are other alternatives.

Systematize change notification

This notice looks a bit strange:

The output bedfile located in /home/nsheff/bedshifted//0/rep99_add1_0.bed has 52064 regions. The original bedfile had 47331 regions.
        4733 regions changed in total.

Can we restructure to:

remove newlines
use tabs/spacing consistently
be more concise

For example: reduce from using 3 lines to 1 like this:

Region count. old: 47331; new: 52064; changed: 4733. Output file: /home/nsheff/bedshifted//0/rep98_add1_0.bed

Or something ...

Addfile bug

Add from file causes ordering issue in the new regions added:


chrX	154377250	154377510	0			
chrX	154445076	154445336	0			
chrX	154880965	154881225	0			
chrX	154892108	154892219	0			
chrY	21154626	21154886	0			
			3	chr18	60255170	60253366
			3	chr8	75347741	75347525
			3	chr12	9596105	9595865
			3	chr21	37189188	37188872
			3	chr17	55185623	55183885
			3	chr12	122209596	122209272
			3	chr12	43190572	43190272
			3	chr7	133941061	133940568
			3	chr20	900370	900126

show default output file when not specified

If I put no output file, it says "outputfile: None".

Instead, show the default output file. basically, show wherever you will put it.

Reported regions changed are inaccurate if regions are outside chrom sizes

bedshift --verbosity 5 -b tests/simple_1.bed  -s .9 -l tests/chrom_sizes_2 --shiftstdev 50
...
The output bedfile located in bedshifted_simple_1.bed has 7 regions. The original bedfile had 7 regions.
6 regions changed

But, no regions changed:

diff tests/simple_1.bed bedshifted_simple_1.bed
1,7c1,7
< 1	50	70
< 1	150	220
< 1	270	320
< 1	420	590
< 1	610	630
< 1	780	940
< 1	940	990
---
> 1	50	70	0
> 1	150	220	0
> 1	270	320	0
> 1	420	590	0
> 1	610	630	0
> 1	780	940	0
> 1	940	990	0

Because they are outside chrom_sizes.

Should we report if included regions are outside given chrom sizes?
at very least, the number of reported changed regions should be accurate, even if there's no sanity check on chrom sizes.

size of test file

There's a test bed file with almost 10000 regions and takes up about 1 meg of space. That's pretty big for a unit test.

Can this be pared down? or if not, can it be zipped at least?

Write doc guides

guide ideas:

how to make a test set from one file for similarity metric
how to alter multiple files: python and bash way
how to create a randomly generated bed file

Move chrom_lengths

there are lots of functions that use a global-ish variable for chrom_lens which makes it confusing

bedshift/bedshift/bedshift.py

Line 18 in cf720d8

chrom_lens = {}

bedshift/bedshift/bedshift.py

Line 270 in cf720d8

if start + theshift < 0 or end + theshift > chrom_lens[str(chrom)]:

can this be moved in to be an object property or something? is there a reason it's outside the object?

Shift is too slow

The performance of shift is really slow. I think it can be improved if regions are not modified in place, but are added as new regions and old regions are removed.

Right now, replicate files are named repX_filename.bed. This means with multiple runs, the replicates will sort by replicate number (rep1 of 5 different perturbations all sort together), which is meaningless.

I think it's more intuitive to name them filename_repX.bed. This way, the replicates of the same type sort together.

Shift from file and Drop from file logic

The logic in selecting regions to shift in shift_from_file is flawed.

https://github.com/databio/bedshift/blob/dev/bedshift/bedshift.py#L258

It should instead be using the intersected rows as indexes back into the original BED file.

This also occurs in drop_from_file

https://github.com/databio/bedshift/blob/dev/bedshift/bedshift.py#L258

Passing list-likes to .loc or [] with any missing labels is no longer supported.

Traceback (most recent call last):
  File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 555, in main
    n = bedshifter.all_perturbations(args.addrate, args.addmean, args.addstdev,
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 443, in all_perturbations
    n += self.add_from_file(addfile, addrate)
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 170, in add_from_file
    add_df = df.loc[add_rows].reset_index(drop=True)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 879, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1315, in _validate_read_indexer
    raise KeyError(
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([0], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

version info:
Python 3.8.5 (default, Jan 27 2021, 15:41:15) 

In [2]: pandas.__version__
Out[2]: '1.1.3'

bedshift can't handle numeric chroms

using simple.bed:

with chrom_sizes:

1\t100

bedshift --verbosity 5 -b simple.bed  -s .9 -l chrom_sizes
DEBU 11:56:24 | logmuse:bedshift:268 > Chrom lengths: {'chr1': 100} 
DEBU 11:56:24 | logmuse:bedshift:269 > chrom: 1 
Traceback (most recent call last):
  File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 518, in main
    n = bedshifter.all_perturbations(args.addrate, args.addmean, args.addstdev,
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 396, in all_perturbations
    n += self.shift(shiftrate, shiftmean, shiftstdev)
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 253, in shift
    drop_row, new_region = self._shift(row, shiftmean, shiftstdev) # shifted rows display a 1
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 270, in _shift
    if start + theshift < 0 or end + theshift > chrom_lens[chrom]:
KeyError: 1

Because it's indexing into a dict with that key directly, and it's treated as an integer index instead of a string index.

Add does not add proportionally to each chromosome

Add needs to consider the length of each chromosome instead of selecting a random chromosome with the same chance of each chromosome being selected.

Bedshift seems to produce invalid regions

I noticed using bedshift recently that the algorithm seems to produce regions that are invalid. Specifically, it shifted my bedfile to create a region with a start that occurred after the end (i.e. chr1 1400 900). bedtools was throwing errors because of this.

I saw this happen after I did repeated rounds of shifting on a single bedfile so I'm not sure if this is because of that.

To create the shifted bed files, this is what I ran (I can provide filesm GitHub doesn't let you upload bedfiles):

# %%
import os
from tqdm import tqdm
from bedshift import Bedshift

# %%
# params
ADD_RATE = 0.01
ADD_MEAN = 320.0
ADD_STDEV = 20.0
SHIFT_RATE = 0.01
SHIFT_MEAN = -10.0
SHIFT_STDEV = 120.0
CUT_RATE = 0.01
MERGE_RATE = 0.01
DROP_RATE = 0.03
N_SHIFTS = 100
OUT_DIR = "shifted"

# %%
# create bedshifter object on original bed file
bedshifter = Bedshift("pbmcs.bed", "hg38.chrom.sizes")

for shift_num in tqdm(range(N_SHIFTS), total=N_SHIFTS):
    file_name = f"shifted_{shift_num}.bed"
    file_path = os.path.join(OUT_DIR, file_name)

    bedshifter.all_perturbations(
        addrate=ADD_RATE,  # the rate (as a proportion of the total number of regions) to add regions
        addmean=ADD_MEAN,  # the mean length of added regions
        addstdev=ADD_STDEV,  # the standard deviation of the length of added regions
        shiftrate=SHIFT_RATE,  #  the rate to shift regions (both the start and end are shifted by the same amount)
        shiftmean=SHIFT_MEAN,  #  the mean shift distance
        shiftstdev=SHIFT_STDEV,  #  the standard deviation of the shift distance
        cutrate=CUT_RATE,  #  the rate to cut regions into two separate regions
        mergerate=MERGE_RATE,  #  the rate to merge two regions into one
        droprate=DROP_RATE,  # the rate to drop/remove regions
        seed=42,
    )

    bedshifter.to_bed(file_path)

    # start at the file we just created
    bedshifter = Bedshift(file_path, "hg38.chrom.sizes")

Then I analyzed overlaps with bedtools with (one directory up):

#!/bin/bash

mkdir -p results/
rm results/intersect.csv
touch results/intersect.csv

original_file="data/pbmcs.bed"

# add baseline to results
total=$(wc -l < "$original_file")
n_overlaps=$(bedtools intersect -wa -a "$original_file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)

echo "pbmcs.bed,$percentage" >> results/intersect.csv

for file in data/shifted_clean/*.bed; do
    filename=$(basename "$file")

    # compute num intersections and store
    total=$(wc -l < "$file")
    n_overlaps=$(bedtools intersect -u -a "$file" -b "$original_file" | wc -l)
    percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
    
    echo "Processing $file"
    echo "$filename,$percentage" >> results/intersect.csv

done

Refgenie integration

Currently the add and shift operations do not consider the reference assembly when doing perturbations, so regions can be outside of the length of the chromosome. Refgenie should have some functions to fetch reference assembly data that can solve this issues.

Will have to add a parameter to specify the reference assembly, as well as change add and shift code.

Random seed

Add the feature of random seed for perturbations so that results are reproducible

bug in calculation of regions changed?

I'm confused as the "X regions changed" notification at the end of every run appears incorrect. For example, I have 1 region and I run bedshift, and it says: 5 regions changed"

bedshift -b test.bed -l `refgenie seek hg38_primary/fasta.chrom_sizes` -s 1 
welcome to bedshift
Shifting file: 'test.bed'
Params:
    chrom.sizes file: /home/nsheff/code/refgenie_sandbox/hg38_primary/fasta/default/hg38_primary.chrom.sizes
    shift rate: 1.0
        shift mean distance: 0.0
        shift stdev: 150.0
    add rate: 0.0
        add mean length: 320.0
        add stdev: 30.0
    add regions from file: None
    cut rate: 0.0
    drop rate: 0.0
    merge rate: 0.0
    outputfile: None
    repeat: 1
            
The output bedfile located in bedshifted_test.bed has 1 regions. The original bedfile had 1 regions.
5 regions changed
nsheff@zither:~/garage/bedshift_test$ cat test.bed
chrX	1	1000

actually, no regions were changed, look:

cat bedshifted_test.bed 
chrX	1	1000	0