Git Product home page Git Product logo

bedshift's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

berkuva baharehh

bedshift's Issues

bedshift can't handle numeric chroms

using simple.bed:

1	5	7
1	15	22
1	27	32
1	42	59
1	61	63
1	78	94
1	94	99

with chrom_sizes:

1\t100
bedshift --verbosity 5 -b simple.bed  -s .9 -l chrom_sizes
DEBU 11:56:24 | logmuse:bedshift:268 > Chrom lengths: {'chr1': 100} 
DEBU 11:56:24 | logmuse:bedshift:269 > chrom: 1 
Traceback (most recent call last):
  File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 518, in main
    n = bedshifter.all_perturbations(args.addrate, args.addmean, args.addstdev,
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 396, in all_perturbations
    n += self.shift(shiftrate, shiftmean, shiftstdev)
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 253, in shift
    drop_row, new_region = self._shift(row, shiftmean, shiftstdev) # shifted rows display a 1
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 270, in _shift
    if start + theshift < 0 or end + theshift > chrom_lens[chrom]:
KeyError: 1

Because it's indexing into a dict with that key directly, and it's treated as an integer index instead of a string index.

Write doc guides

guide ideas:

  • how to make a test set from one file for similarity metric
  • how to alter multiple files: python and bash way
  • how to create a randomly generated bed file

Add from valid regions

A file can be provided similar to "add from file", except instead of exact regions that can be added, it contains large ranges where regions can be added.

size of test file

There's a test bed file with almost 10000 regions and takes up about 1 meg of space. That's pretty big for a unit test.

Can this be pared down? or if not, can it be zipped at least?

Performance improvement

Python Pandas is slow for the large number of single-row operations in bedshift. It may be faster to read bedfiles into a native object like a list or dictionary and conduct operations on it.

Bedshift seems to produce invalid regions

I noticed using bedshift recently that the algorithm seems to produce regions that are invalid. Specifically, it shifted my bedfile to create a region with a start that occurred after the end (i.e. chr1 1400 900). bedtools was throwing errors because of this.

I saw this happen after I did repeated rounds of shifting on a single bedfile so I'm not sure if this is because of that.

To create the shifted bed files, this is what I ran (I can provide filesm GitHub doesn't let you upload bedfiles):

# %%
import os
from tqdm import tqdm
from bedshift import Bedshift

# %%
# params
ADD_RATE = 0.01
ADD_MEAN = 320.0
ADD_STDEV = 20.0
SHIFT_RATE = 0.01
SHIFT_MEAN = -10.0
SHIFT_STDEV = 120.0
CUT_RATE = 0.01
MERGE_RATE = 0.01
DROP_RATE = 0.03
N_SHIFTS = 100
OUT_DIR = "shifted"

# %%
# create bedshifter object on original bed file
bedshifter = Bedshift("pbmcs.bed", "hg38.chrom.sizes")

for shift_num in tqdm(range(N_SHIFTS), total=N_SHIFTS):
    file_name = f"shifted_{shift_num}.bed"
    file_path = os.path.join(OUT_DIR, file_name)

    bedshifter.all_perturbations(
        addrate=ADD_RATE,  # the rate (as a proportion of the total number of regions) to add regions
        addmean=ADD_MEAN,  # the mean length of added regions
        addstdev=ADD_STDEV,  # the standard deviation of the length of added regions
        shiftrate=SHIFT_RATE,  #  the rate to shift regions (both the start and end are shifted by the same amount)
        shiftmean=SHIFT_MEAN,  #  the mean shift distance
        shiftstdev=SHIFT_STDEV,  #  the standard deviation of the shift distance
        cutrate=CUT_RATE,  #  the rate to cut regions into two separate regions
        mergerate=MERGE_RATE,  #  the rate to merge two regions into one
        droprate=DROP_RATE,  # the rate to drop/remove regions
        seed=42,
    )

    bedshifter.to_bed(file_path)

    # start at the file we just created
    bedshifter = Bedshift(file_path, "hg38.chrom.sizes")

Then I analyzed overlaps with bedtools with (one directory up):

#!/bin/bash

mkdir -p results/
rm results/intersect.csv
touch results/intersect.csv

original_file="data/pbmcs.bed"

# add baseline to results
total=$(wc -l < "$original_file")
n_overlaps=$(bedtools intersect -wa -a "$original_file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)

echo "pbmcs.bed,$percentage" >> results/intersect.csv

for file in data/shifted_clean/*.bed; do
    filename=$(basename "$file")

    # compute num intersections and store
    total=$(wc -l < "$file")
    n_overlaps=$(bedtools intersect -u -a "$file" -b "$original_file" | wc -l)
    percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
    
    echo "Processing $file"
    echo "$filename,$percentage" >> results/intersect.csv

done

Addfile bug

Add from file causes ordering issue in the new regions added:


chrX	154377250	154377510	0			
chrX	154445076	154445336	0			
chrX	154880965	154881225	0			
chrX	154892108	154892219	0			
chrY	21154626	21154886	0			
			3	chr18	60255170	60253366
			3	chr8	75347741	75347525
			3	chr12	9596105	9595865
			3	chr21	37189188	37188872
			3	chr17	55185623	55183885
			3	chr12	122209596	122209272
			3	chr12	43190572	43190272
			3	chr7	133941061	133940568
			3	chr20	900370	900126

bug in calculation of regions changed?

I'm confused as the "X regions changed" notification at the end of every run appears incorrect. For example, I have 1 region and I run bedshift, and it says: 5 regions changed"

bedshift -b test.bed -l `refgenie seek hg38_primary/fasta.chrom_sizes` -s 1 
welcome to bedshift
Shifting file: 'test.bed'
Params:
    chrom.sizes file: /home/nsheff/code/refgenie_sandbox/hg38_primary/fasta/default/hg38_primary.chrom.sizes
    shift rate: 1.0
        shift mean distance: 0.0
        shift stdev: 150.0
    add rate: 0.0
        add mean length: 320.0
        add stdev: 30.0
    add regions from file: None
    cut rate: 0.0
    drop rate: 0.0
    merge rate: 0.0
    outputfile: None
    repeat: 1
            
The output bedfile located in bedshifted_test.bed has 1 regions. The original bedfile had 1 regions.
5 regions changed
nsheff@zither:~/garage/bedshift_test$ cat test.bed
chrX	1	1000

actually, no regions were changed, look:

cat bedshifted_test.bed 
chrX	1	1000	0

Random seed

Add the feature of random seed for perturbations so that results are reproducible

Passing list-likes to .loc or [] with any missing labels is no longer supported.

Traceback (most recent call last):
  File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 555, in main
    n = bedshifter.all_perturbations(args.addrate, args.addmean, args.addstdev,
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 443, in all_perturbations
    n += self.add_from_file(addfile, addrate)
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 170, in add_from_file
    add_df = df.loc[add_rows].reset_index(drop=True)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 879, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1315, in _validate_read_indexer
    raise KeyError(
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([0], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
version info:
Python 3.8.5 (default, Jan 27 2021, 15:41:15) 

In [2]: pandas.__version__
Out[2]: '1.1.3'

displayed outputfiles are incorrect with repetitions

When I run something with repetitions, it says:

                                                                                                                                        
REGION COUNT | original: 47331  new: 52064      changed: 4733                                                                           
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331  new: 52064      changed: 4733

  1. First, I think it seems redundant to output this 100 times, since the message is identical.
  2. Second, I think it would be better if this message fit on 1 line, if we're really going to display it for every single replicate
  3. Finally, I think the output file is wrong -- it's not outputting it to that file at all, it's actually this path:
ll /home/nsheff/bedshifted//0/rep*_add1_0.bed                         
-rw-rw-r-- 1 nsheff nsheff 1249750 Mar 23 18:09 /home/nsheff/bedshifted//0/rep100_add1_0.bed                                            
-rw-rw-r-- 1 nsheff nsheff 1348751 Mar 25 11:47 /home/nsheff/bedshifted//0/rep10_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348824 Mar 25 11:47 /home/nsheff/bedshifted//0/rep11_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348789 Mar 25 11:47 /home/nsheff/bedshifted//0/rep12_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348663 Mar 25 11:47 /home/nsheff/bedshifted//0/rep13_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348680 Mar 25 11:47 /home/nsheff/bedshifted//0/rep14_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348836 Mar 25 11:47 /home/nsheff/bedshifted//0/rep15_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348873 Mar 25 11:47 /home/nsheff/bedshifted//0/rep16_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348768 Mar 25 11:47 /home/nsheff/bedshifted//0/rep17_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348825 Mar 25 11:47 /home/nsheff/bedshifted//0/rep18_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348704 Mar 25 11:47 /home/nsheff/bedshifted//0/rep19_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348711 Mar 25 11:46 /home/nsheff/bedshifted//0/rep1_add1_0.bed                                              
-rw-rw-r-- 1 nsheff nsheff 1348810 Mar 25 11:47 /home/nsheff/bedshifted//0/rep20_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348876 Mar 25 11:47 /home/nsheff/bedshifted//0/rep21_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348851 Mar 25 11:47 /home/nsheff/bedshifted//0/rep22_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348888 Mar 25 11:47 /home/nsheff/bedshifted//0/rep23_add1_0.bed                                             
-rw-rw-r-- 1 nsheff nsheff 1348727 Mar 25 11:47 /home/nsheff/bedshifted//0/rep24_add1_0.bed

This just makes the log files too long when running a whole PEP with lots of replicates.
What do you think of just spitting out 1 line like:

"Generating X replicates for ... in output location ..."
"Finished generating X replicates"

That will be more understandable. You could output a line every time 10% is complete or something if you want more verbosity.

Reported regions changed are inaccurate if regions are outside chrom sizes

bedshift --verbosity 5 -b tests/simple_1.bed  -s .9 -l tests/chrom_sizes_2 --shiftstdev 50
...
The output bedfile located in bedshifted_simple_1.bed has 7 regions. The original bedfile had 7 regions.
6 regions changed

But, no regions changed:

diff tests/simple_1.bed bedshifted_simple_1.bed
1,7c1,7
< 1	50	70
< 1	150	220
< 1	270	320
< 1	420	590
< 1	610	630
< 1	780	940
< 1	940	990
---
> 1	50	70	0
> 1	150	220	0
> 1	270	320	0
> 1	420	590	0
> 1	610	630	0
> 1	780	940	0
> 1	940	990	0

Because they are outside chrom_sizes.

  1. Should we report if included regions are outside given chrom sizes?
  2. at very least, the number of reported changed regions should be accurate, even if there's no sanity check on chrom sizes.

make required folders

if a folder for an output doesn't exist, it gives:

## [1 of 216] sample: add1_0; pipeline: bedshift_run
'pipestat' not found in 'looper' section of the project configuration file. Using defaults.
Could not determine pipestat namespace. Caught exception: AttributeError('pipestat_config')
Calling pre-submit function: refgenconf.looper_refgenie_plugin
Writing script to /home/nsheff/code/bedshift_paper/fig/code/pep_project_5trials/looper_output/submission/bedshift_run_add1_0.sub
Job script (n=1; 0.00Gb): looper_output/submission/bedshift_run_add1_0.sub
Compute node: zither
Start time: 2021-03-23 17:47:32
Welcome to bedshift version 1.1.0-dev
Shifting file: '/home/nsheff/encode_tfbs/regions//wgEncodeAwgTfbsUwHek293CtcfUniPk.narrowPeak'
Params:
  chrom.sizes file: /home/nsheff/code/refgenie_sandbox/alias/hg19/fasta/default/hg19.chrom.sizes
  shift:
    shift rate: 0.0
    shift mean distance: 0.0
    shift stdev: 150.0
    shift regions from file: None
  add:
    rate: 0.1
    add mean length: 320.0
    add stdev: 30.0
    add regions from file: /home/nsheff/code/bedshift_paper/fig/code/ALL_1000overlap_300files_segments.bed
    valid regions: None
  cut rate: 0.0
  drop rate: 0.0
    drop regions from file: None
  merge rate: 0.0
  outputfile: /home/nsheff/bedshifted//0/add1_0.bed
  repeat: 100
  yaml_config: None

        4733 regions changed in total.

Traceback (most recent call last):
  File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 571, in main
    bedshifter.to_bed(modified_outfile)
  File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 465, in to_bed
    self.bed.to_csv(outfile_name, sep='\t', header=False, index=False, float_format='%.0f')
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 3170, in to_csv
    formatter.save()
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 185, in save
    f, handles = get_handle(
  File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/io/common.py", line 493, in get_handle
    f = open(path_or_buf, mode, encoding=encoding, errors=errors, newline="")
FileNotFoundError: [Errno 2] No such file or directory: '/home/nsheff/bedshifted//0/rep1_add1_0.bed'

should we just auto-create said folder? otherwise I have to create all these manually...

Refgenie integration

Currently the add and shift operations do not consider the reference assembly when doing perturbations, so regions can be outside of the length of the chromosome. Refgenie should have some functions to fetch reference assembly data that can solve this issues.

Will have to add a parameter to specify the reference assembly, as well as change add and shift code.

Name of replicate files

Right now, replicate files are named repX_filename.bed. This means with multiple runs, the replicates will sort by replicate number (rep1 of 5 different perturbations all sort together), which is meaningless.

I think it's more intuitive to name them filename_repX.bed. This way, the replicates of the same type sort together.

Systematize change notification

This notice looks a bit strange:

The output bedfile located in /home/nsheff/bedshifted//0/rep99_add1_0.bed has 52064 regions. The original bedfile had 47331 regions.
        4733 regions changed in total.

Can we restructure to:

  • remove newlines
  • use tabs/spacing consistently
  • be more concise

For example: reduce from using 3 lines to 1 like this:

Region count. old: 47331; new: 52064; changed: 4733. Output file: /home/nsheff/bedshifted//0/rep98_add1_0.bed

Or something ...

Shift is too slow

The performance of shift is really slow. I think it can be improved if regions are not modified in place, but are added as new regions and old regions are removed.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.