databio / bedshift Goto Github PK
View Code? Open in Web Editor NEWBedfile perturbation tool
Home Page: http://bedshift.databio.org
License: BSD 2-Clause "Simplified" License
Bedfile perturbation tool
Home Page: http://bedshift.databio.org
License: BSD 2-Clause "Simplified" License
using simple.bed
:
1 5 7
1 15 22
1 27 32
1 42 59
1 61 63
1 78 94
1 94 99
with chrom_sizes:
1\t100
bedshift --verbosity 5 -b simple.bed -s .9 -l chrom_sizes
DEBU 11:56:24 | logmuse:bedshift:268 > Chrom lengths: {'chr1': 100}
DEBU 11:56:24 | logmuse:bedshift:269 > chrom: 1
Traceback (most recent call last):
File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
sys.exit(main())
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 518, in main
n = bedshifter.all_perturbations(args.addrate, args.addmean, args.addstdev,
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 396, in all_perturbations
n += self.shift(shiftrate, shiftmean, shiftstdev)
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 253, in shift
drop_row, new_region = self._shift(row, shiftmean, shiftstdev) # shifted rows display a 1
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 270, in _shift
if start + theshift < 0 or end + theshift > chrom_lens[chrom]:
KeyError: 1
Because it's indexing into a dict with that key directly, and it's treated as an integer index instead of a string index.
guide ideas:
A file can be provided similar to "add from file", except instead of exact regions that can be added, it contains large ranges where regions can be added.
There's a test bed file with almost 10000 regions and takes up about 1 meg of space. That's pretty big for a unit test.
Can this be pared down? or if not, can it be zipped at least?
The requirements file doesn't specify a specific pandas version. Therefore, it will grab pandas>2.0
. In pandas 2.0, the df.append
method is no longer used.
This is occurring in a few places, and I believe that there are other alternatives.
Add needs to consider the length of each chromosome instead of selecting a random chromosome with the same chance of each chromosome being selected.
The logic in selecting regions to shift in shift_from_file
is flawed.
https://github.com/databio/bedshift/blob/dev/bedshift/bedshift.py#L258
It should instead be using the intersected rows as indexes back into the original BED file.
This also occurs in drop_from_file
https://github.com/databio/bedshift/blob/dev/bedshift/bedshift.py#L258
Python Pandas is slow for the large number of single-row operations in bedshift. It may be faster to read bedfiles into a native object like a list or dictionary and conduct operations on it.
I noticed using bedshift
recently that the algorithm seems to produce regions that are invalid. Specifically, it shifted my bedfile to create a region with a start that occurred after the end (i.e. chr1 1400 900
). bedtools
was throwing errors because of this.
I saw this happen after I did repeated rounds of shifting on a single bedfile so I'm not sure if this is because of that.
To create the shifted bed files, this is what I ran (I can provide filesm GitHub doesn't let you upload bedfiles):
# %%
import os
from tqdm import tqdm
from bedshift import Bedshift
# %%
# params
ADD_RATE = 0.01
ADD_MEAN = 320.0
ADD_STDEV = 20.0
SHIFT_RATE = 0.01
SHIFT_MEAN = -10.0
SHIFT_STDEV = 120.0
CUT_RATE = 0.01
MERGE_RATE = 0.01
DROP_RATE = 0.03
N_SHIFTS = 100
OUT_DIR = "shifted"
# %%
# create bedshifter object on original bed file
bedshifter = Bedshift("pbmcs.bed", "hg38.chrom.sizes")
for shift_num in tqdm(range(N_SHIFTS), total=N_SHIFTS):
file_name = f"shifted_{shift_num}.bed"
file_path = os.path.join(OUT_DIR, file_name)
bedshifter.all_perturbations(
addrate=ADD_RATE, # the rate (as a proportion of the total number of regions) to add regions
addmean=ADD_MEAN, # the mean length of added regions
addstdev=ADD_STDEV, # the standard deviation of the length of added regions
shiftrate=SHIFT_RATE, # the rate to shift regions (both the start and end are shifted by the same amount)
shiftmean=SHIFT_MEAN, # the mean shift distance
shiftstdev=SHIFT_STDEV, # the standard deviation of the shift distance
cutrate=CUT_RATE, # the rate to cut regions into two separate regions
mergerate=MERGE_RATE, # the rate to merge two regions into one
droprate=DROP_RATE, # the rate to drop/remove regions
seed=42,
)
bedshifter.to_bed(file_path)
# start at the file we just created
bedshifter = Bedshift(file_path, "hg38.chrom.sizes")
Then I analyzed overlaps with bedtools
with (one directory up):
#!/bin/bash
mkdir -p results/
rm results/intersect.csv
touch results/intersect.csv
original_file="data/pbmcs.bed"
# add baseline to results
total=$(wc -l < "$original_file")
n_overlaps=$(bedtools intersect -wa -a "$original_file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
echo "pbmcs.bed,$percentage" >> results/intersect.csv
for file in data/shifted_clean/*.bed; do
filename=$(basename "$file")
# compute num intersections and store
total=$(wc -l < "$file")
n_overlaps=$(bedtools intersect -u -a "$file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
echo "Processing $file"
echo "$filename,$percentage" >> results/intersect.csv
done
Add from file causes ordering issue in the new regions added:
chrX 154377250 154377510 0
chrX 154445076 154445336 0
chrX 154880965 154881225 0
chrX 154892108 154892219 0
chrY 21154626 21154886 0
3 chr18 60255170 60253366
3 chr8 75347741 75347525
3 chr12 9596105 9595865
3 chr21 37189188 37188872
3 chr17 55185623 55183885
3 chr12 122209596 122209272
3 chr12 43190572 43190272
3 chr7 133941061 133940568
3 chr20 900370 900126
I'm confused as the "X regions changed" notification at the end of every run appears incorrect. For example, I have 1 region and I run bedshift, and it says: 5 regions changed"
bedshift -b test.bed -l `refgenie seek hg38_primary/fasta.chrom_sizes` -s 1
welcome to bedshift
Shifting file: 'test.bed'
Params:
chrom.sizes file: /home/nsheff/code/refgenie_sandbox/hg38_primary/fasta/default/hg38_primary.chrom.sizes
shift rate: 1.0
shift mean distance: 0.0
shift stdev: 150.0
add rate: 0.0
add mean length: 320.0
add stdev: 30.0
add regions from file: None
cut rate: 0.0
drop rate: 0.0
merge rate: 0.0
outputfile: None
repeat: 1
The output bedfile located in bedshifted_test.bed has 1 regions. The original bedfile had 1 regions.
5 regions changed
nsheff@zither:~/garage/bedshift_test$ cat test.bed
chrX 1 1000
actually, no regions were changed, look:
cat bedshifted_test.bed
chrX 1 1000 0
Add the feature of random seed for perturbations so that results are reproducible
Traceback (most recent call last):
File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
sys.exit(main())
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 555, in main
n = bedshifter.all_perturbations(args.addrate, args.addmean, args.addstdev,
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 443, in all_perturbations
n += self.add_from_file(addfile, addrate)
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 170, in add_from_file
add_df = df.loc[add_rows].reset_index(drop=True)
File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1315, in _validate_read_indexer
raise KeyError(
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([0], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
version info:
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
In [2]: pandas.__version__
Out[2]: '1.1.3'
When I run something with repetitions, it says:
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
output file: /home/nsheff/bedshifted//0/add1_0.bed
REGION COUNT | original: 47331 new: 52064 changed: 4733
ll /home/nsheff/bedshifted//0/rep*_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1249750 Mar 23 18:09 /home/nsheff/bedshifted//0/rep100_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348751 Mar 25 11:47 /home/nsheff/bedshifted//0/rep10_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348824 Mar 25 11:47 /home/nsheff/bedshifted//0/rep11_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348789 Mar 25 11:47 /home/nsheff/bedshifted//0/rep12_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348663 Mar 25 11:47 /home/nsheff/bedshifted//0/rep13_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348680 Mar 25 11:47 /home/nsheff/bedshifted//0/rep14_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348836 Mar 25 11:47 /home/nsheff/bedshifted//0/rep15_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348873 Mar 25 11:47 /home/nsheff/bedshifted//0/rep16_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348768 Mar 25 11:47 /home/nsheff/bedshifted//0/rep17_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348825 Mar 25 11:47 /home/nsheff/bedshifted//0/rep18_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348704 Mar 25 11:47 /home/nsheff/bedshifted//0/rep19_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348711 Mar 25 11:46 /home/nsheff/bedshifted//0/rep1_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348810 Mar 25 11:47 /home/nsheff/bedshifted//0/rep20_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348876 Mar 25 11:47 /home/nsheff/bedshifted//0/rep21_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348851 Mar 25 11:47 /home/nsheff/bedshifted//0/rep22_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348888 Mar 25 11:47 /home/nsheff/bedshifted//0/rep23_add1_0.bed
-rw-rw-r-- 1 nsheff nsheff 1348727 Mar 25 11:47 /home/nsheff/bedshifted//0/rep24_add1_0.bed
This just makes the log files too long when running a whole PEP with lots of replicates.
What do you think of just spitting out 1 line like:
"Generating X replicates for ... in output location ..."
"Finished generating X replicates"
That will be more understandable. You could output a line every time 10% is complete or something if you want more verbosity.
bedshift --verbosity 5 -b tests/simple_1.bed -s .9 -l tests/chrom_sizes_2 --shiftstdev 50
...
The output bedfile located in bedshifted_simple_1.bed has 7 regions. The original bedfile had 7 regions.
6 regions changed
But, no regions changed:
diff tests/simple_1.bed bedshifted_simple_1.bed
1,7c1,7
< 1 50 70
< 1 150 220
< 1 270 320
< 1 420 590
< 1 610 630
< 1 780 940
< 1 940 990
---
> 1 50 70 0
> 1 150 220 0
> 1 270 320 0
> 1 420 590 0
> 1 610 630 0
> 1 780 940 0
> 1 940 990 0
Because they are outside chrom_sizes.
if a folder for an output doesn't exist, it gives:
## [1 of 216] sample: add1_0; pipeline: bedshift_run
'pipestat' not found in 'looper' section of the project configuration file. Using defaults.
Could not determine pipestat namespace. Caught exception: AttributeError('pipestat_config')
Calling pre-submit function: refgenconf.looper_refgenie_plugin
Writing script to /home/nsheff/code/bedshift_paper/fig/code/pep_project_5trials/looper_output/submission/bedshift_run_add1_0.sub
Job script (n=1; 0.00Gb): looper_output/submission/bedshift_run_add1_0.sub
Compute node: zither
Start time: 2021-03-23 17:47:32
Welcome to bedshift version 1.1.0-dev
Shifting file: '/home/nsheff/encode_tfbs/regions//wgEncodeAwgTfbsUwHek293CtcfUniPk.narrowPeak'
Params:
chrom.sizes file: /home/nsheff/code/refgenie_sandbox/alias/hg19/fasta/default/hg19.chrom.sizes
shift:
shift rate: 0.0
shift mean distance: 0.0
shift stdev: 150.0
shift regions from file: None
add:
rate: 0.1
add mean length: 320.0
add stdev: 30.0
add regions from file: /home/nsheff/code/bedshift_paper/fig/code/ALL_1000overlap_300files_segments.bed
valid regions: None
cut rate: 0.0
drop rate: 0.0
drop regions from file: None
merge rate: 0.0
outputfile: /home/nsheff/bedshifted//0/add1_0.bed
repeat: 100
yaml_config: None
4733 regions changed in total.
Traceback (most recent call last):
File "/home/nsheff/.local/bin/bedshift", line 8, in <module>
sys.exit(main())
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 571, in main
bedshifter.to_bed(modified_outfile)
File "/home/nsheff/.local/lib/python3.8/site-packages/bedshift/bedshift.py", line 465, in to_bed
self.bed.to_csv(outfile_name, sep='\t', header=False, index=False, float_format='%.0f')
File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 3170, in to_csv
formatter.save()
File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 185, in save
f, handles = get_handle(
File "/home/nsheff/.local/lib/python3.8/site-packages/pandas/io/common.py", line 493, in get_handle
f = open(path_or_buf, mode, encoding=encoding, errors=errors, newline="")
FileNotFoundError: [Errno 2] No such file or directory: '/home/nsheff/bedshifted//0/rep1_add1_0.bed'
should we just auto-create said folder? otherwise I have to create all these manually...
Currently the add and shift operations do not consider the reference assembly when doing perturbations, so regions can be outside of the length of the chromosome. Refgenie should have some functions to fetch reference assembly data that can solve this issues.
Will have to add a parameter to specify the reference assembly, as well as change add and shift code.
Right now, replicate files are named repX_filename.bed
. This means with multiple runs, the replicates will sort by replicate number (rep1 of 5 different perturbations all sort together), which is meaningless.
I think it's more intuitive to name them filename_repX.bed
. This way, the replicates of the same type sort together.
This notice looks a bit strange:
The output bedfile located in /home/nsheff/bedshifted//0/rep99_add1_0.bed has 52064 regions. The original bedfile had 47331 regions.
4733 regions changed in total.
Can we restructure to:
For example: reduce from using 3 lines to 1 like this:
Region count. old: 47331; new: 52064; changed: 4733. Output file: /home/nsheff/bedshifted//0/rep98_add1_0.bed
Or something ...
The performance of shift is really slow. I think it can be improved if regions are not modified in place, but are added as new regions and old regions are removed.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.