pyranges / pyranges
Performant Pythonic GenomicRanges
Home Page: https://pyranges.readthedocs.io
License: MIT License
Hi @endrebak. Another snag I ran into. We have a large number of ranges we are processing, and it would be great to be able to do this in parallel. Some of our processing is fast, but for other situations it could take up to an hour or more to get all of the records processed. A use case to consider is iterating a VCF file and for each record, determining transcript and exon overlaps from PyRanges objects which contain this information. I'd like to be able to parallelize sending the overlap calls for each record in the VCF. However, the PyRanges objects can't be pickled in their current state.
Here's a trivial example, with some code to show the parallel syntax working.
# %%
import pyranges as pr
gr = pr.data.chipseq()
# %%
from joblib import Parallel, delayed
# %%
def overlap(source, target):
    return target.overlap(source)

def test(t):
    return t

samples = [pr.PyRanges(gr.df.sample()) for i in range(10)]
# %%
with Parallel(n_jobs=8) as par:
    results = par(delayed(test)(i) for i in range(10))
results
# %%
with Parallel(n_jobs=8) as par:
    results = par(delayed(overlap)(gr, s) for s in samples)
results
I will concede that there may be a better way to do this. Some info on the underlying data:
Stranded PyRanges object has 790,511 rows and 19 columns from 25 chromosomes.
Stranded PyRanges object has 77,825 rows and 19 columns from 25 chromosomes.
When searching exons, I do filter the ranges first by transcript to speed things up, and doing this on a record-by-record basis in parallel would be better.
Would there be any way to make PyRanges objects pickleable?
Thanks very much!
Chris
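Until PyRanges objects themselves pickle, one workaround (a sketch, not the library's API) is to ship the underlying DataFrame, which does pickle, to each worker and rebuild the PyRanges inside it. The `worker` function and toy data here are hypothetical:

```python
import pickle

import pandas as pd

# The PyRanges wrapper can't be pickled, but its underlying DataFrame can,
# so send the frame to each worker and rebuild the PyRanges there.
df = pd.DataFrame({"Chromosome": ["chr1", "chr1"],
                   "Start": [0, 100],
                   "End": [50, 150]})

def worker(frame_bytes):
    frame = pickle.loads(frame_bytes)  # DataFrames round-trip through pickle
    # gr = pr.PyRanges(frame)          # hypothetical: rebuild PyRanges in the child
    return len(frame)

payload = pickle.dumps(df)  # this is what joblib/multiprocessing would send
n_rows = worker(payload)    # → 2
```

The cost is re-running the PyRanges constructor in every child, which is exactly the overhead a picklable PyRanges would avoid.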
Being able to do gr.to_csv("bla").cluster()
is convenient for storing the intermediate results in a call chain to view them later, but it is annoying to have the pyranges printed when just doing
gr.to_gff3("bla")
in the REPL.
Solutions?
I could add a chain-flag to the to_*
methods. I guess this is the most Pythonic way, as it is explicit.
gr.to_gff3("bla", chain=False)
is kinda verbose though.
There are ways to know whether you are in an interactive interpreter or not, but I do not see how to use them to turn off printing in the specific case of writing a pyrange to a file.
Would it be possible to add columns from the PyRanges which was used for overlapping? What I mean is: say one uses a subset of a GTF file as regions of interest, but after overlapping also wants to annotate with other columns from the GTF file, like gene name.
It could be an option like this:
gr.overlap(gr_gtf, addcols = True)
The output would have all rows from gr that overlap with the GTF, but in addition also all columns from the GTF object for each overlap.
my_data contains the reads I want to map to genes
my_annotation contains the gene information
When I do this:
my_annotation.join(my_data)
It is OK.
But when I do this:
my_data.join(my_annotation)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.local/lib/python3.6/site-packages/pyranges/pyranges.py", line 230, in join
dfs = pyrange_apply(_write_both, self, other, **kwargs)
File "/.local/lib/python3.6/site-packages/pyranges/multithreaded.py", line 271, in pyrange_apply
result = call_f(function, nparams, df, odf, kwargs)
File "/.local/lib/python3.6/site-packages/pyranges/multithreaded.py", line 22, in call_f
return f.remote(df, odf, kwargs)
File "/.local/lib/python3.6/site-packages/pyranges/methods/join.py", line 45, in _write_both
scdf, ocdf = _both_dfs(scdf, ocdf, how=how)
File "/.local/lib/python3.6/site-packages/pyranges/methods/join.py", line 18, in _both_dfs
starts, ends, indexes)
File "ncls/src/ncls32.pyx", line 76, in ncls.src.ncls32.NCLS32.all_overlaps_both
ValueError: Buffer dtype mismatch, expected 'const long' but got Python object
Here is what my_data looks like
+--------------+-----------+-----------+-----------+
| Chromosome | Start | End | Index |
| (category) | (int32) | (int32) | (int64) |
|--------------+-----------+-----------+-----------|
| 1 | 3003842 | 3014343 | 0 |
| 1 | 3025329 | 3035830 | 1 |
| 1 | 3030643 | 3041144 | 2 |
| 1 | 3057723 | 3068243 | 3 |
Here is what my_annotation looks like
+--------------+------------+-------------+-----------+-----------+--------------+-------+
| Chromosome | source | gene_type | Start | End | Strand | ... |
| (category) | (object) | (object) | (int32) | (int32) | (category) | ... |
|--------------+------------+-------------+-----------+-----------+--------------+-------|
| 1 | havana | gene | 3073253 | 3074322 | + | ... |
| 1 | havana | transcript | 3073253 | 3074322 | + | ... |
| 1 | havana | exon | 3073253 | 3074322 | + | ... |
| 1 | ensembl | gene | 3102016 | 3102125 | + | ... |
Do you have any idea about this error?
Prior to 0.0.57, the files I am processing produced much larger PyRanges objects. After upgrading to 0.0.57, the ranges are much smaller and cause our pipeline to crash. I'm happy to share files offline, but can't post them here.
For example, in 0.0.56 (and 0.0.55), I log this:
2019-10-23 14:00:52,481|processor.processor|process|769|INFO|Processing 8577 exon overlaps from 16 segments in 964 transcripts of 355 genes
With the same files and code, but after upgrading to 0.0.57, this is logged.
2019-10-23 14:05:06,403|processor.processor|process|769|INFO|Processing 17 exon overlaps from 3 segments in 11 transcripts of 3 genes
The method in question is here:
def get_exon_join(self, other: PyRanges, flip: bool = True) -> PyRanges:
    """
    Join ranges, other, with exon ranges. If flip is false, the join
    is done with other as the target, rather than the source.

    :param other: the ranges to join
    :param flip: if true, other is first; if false, other is second
    :return: a PyRanges object
    """
    other.Chromosome = other.df.Chromosome.apply(self.get_target_contig_name)
    if flip:
        return other.join(self._exon_ranges, suffix="_exon")
    return self._exon_ranges.join(other, suffix="_other")
Has anything changed with respect to join in this regard? I've tried messing with new_pos
with no success.
Thanks so much for taking a look!
Dear @endrebak
I have been looking into the updated pyrle (0.0.28) and have some challenges. The code example that you presented works without issue:
import pyranges as pr
gr = pr.random(int(3e6))
r = gr.to_rle()
t = gr.tile(100)
print(r[t])
I have two objects that are giving me some issues
(1) - object rle
contains the RLE information from a BAM file corresponding to a target enrichment study - the data looks just like this
1 +
+--------+---------+-----+-----+------+------+---------+------+-------+-------+-------+-----+
| Runs | 10001 | 2 | 4 | 25 | 66 | ... | 18 | 155 | 573 | 125 | 1 |
|--------+---------+-----+-----+------+------+---------+------+-------+-------+-------+-----|
| Values | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 | ... | 2.0 | 3.0 | 4.0 | 3.0 | 1.0 |
+--------+---------+-----+-----+------+------+---------+------+-------+-------+-------+-----+
Rle of length 248946422 containing 15316 elements
...
Y -
+--------+-----------+-------+--------+---------+----------+---------+--------+-------+--------+--------+--------+
| Runs | 3328476 | 368 | 1201 | 13975 | 134481 | ... | 2861 | 326 | 9465 | 3329 | 3993 |
|--------+-----------+-------+--------+---------+----------+---------+--------+-------+--------+--------+--------|
| Values | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | ... | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 |
+--------+-----------+-------+--------+---------+----------+---------+--------+-------+--------+--------+--------+
Rle of length 56887898 containing 1110 elements
PyRles object with 50 chromosomes/strand pairs.
(2) -- ranges
is a pyranges object that looks like this
+--------------+-----------+-----------+--------------+
| Chromosome | Start | End | Strand |
| (category) | (int32) | (int32) | (category) |
|--------------+-----------+-----------+--------------|
| 1 | 0 | 1000 | + |
| 1 | 1000 | 2000 | + |
| 1 | 2000 | 3000 | + |
| 1 | 3000 | 4000 | + |
| ... | ... | ... | ... |
| Y | 57224000 | 57225000 | - |
| Y | 57225000 | 57226000 | - |
| Y | 57226000 | 57227000 | - |
| Y | 57227000 | 57227415 | - |
+--------------+-----------+-----------+--------------+
Stranded PyRanges object has 6,176,562 rows and 4 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
print(rle[ranges])
I am met with an error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-44-3caf6f3aa493> in <module>
1 import mkl
2 mkl.set_num_threads(1)
----> 3 print(rle[ranges])
~/miniconda3/envs/megrim/lib/python3.7/site-packages/pyrle/rledict.py in __getitem__(self, key)
227
228 id_values, id_runs = find_runs(ids)
--> 229 df = pd.DataFrame({"Start": np.repeat(v.Start.values, id_runs),
230 "End": np.repeat(v.End.values, id_runs),
231 "RunID": ids,
<__array_function__ internals> in repeat(*args, **kwargs)
~/miniconda3/envs/megrim/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
480
481 """
--> 482 return _wrapfunc(a, 'repeat', repeats, axis=axis)
483
484
~/miniconda3/envs/megrim/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
59
60 try:
---> 61 return bound(*args, **kwds)
62 except TypeError:
63 # A TypeError occurs if the object does have such a method in its
ValueError: operands could not be broadcast together with shape (248957,) (248947,)
I have no idea what I have done wrong here - the data looks very similar to your simulated data. A brief comment would be very much welcomed - thanks!
Hi,
First of all, thanks a lot for the great package. This is exactly what the python ecosystem required.
It would be great if the QNAME field of BAM files were kept in the pyranges object when using the read_bam function. E.g. when working with paired-end data, this is important to recover read pairs.
Thanks a lot
Hi, this looks like a promising library that could replace pybedtools in my workflow. Are there any plans to add it as a package in the bioconda channel?
Just following the example for subsetting from the documentation but I get the following exception:
>>> print(cpg[cpg.CpG > 50])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyranges/pyranges.py", line 101, in __getitem__
return _getitem(self, val)
File "pyranges/methods/getitem.py", line 20, in _getitem
raise Exception("Not valid subsetter: {}".format(str(val)))
Exception: Not valid subsetter: 0 True
1 True
2 True
3 True
4 False
...
1072 True
1073 False
1074 False
1075 False
1076 False
Name: CpG, Length: 1077, dtype: bool
This is a great package, thanks - pyranges is easing my transition from R to Python.
Previously using GenomicRanges I have used the binnedAverage R function to convert RLE data to mean depth of coverage across target bins.
I much appreciate the pyranges coverage function and the associated NumberOverlaps and FractionOverlaps and would now love to discover how to move beyond these metrics and include e.g. mean coverage and stddev of coverage.
Would welcome any thoughts and suggestions
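A binnedAverage-style helper isn't part of the requests above, but the mean and stddev per fixed-size bin can be computed directly from a per-base depth array with NumPy. A minimal sketch (the function name and the NaN-padding choice for the final partial bin are mine):

```python
import numpy as np

def binned_stats(depth, bin_size):
    """Mean and stddev of per-base depth over fixed-size bins.

    The last bin is padded with NaN so a partial bin still averages
    only over the bases it actually covers.
    """
    depth = np.asarray(depth, dtype=float)
    n_bins = -(-len(depth) // bin_size)  # ceiling division
    padded = np.pad(depth, (0, n_bins * bin_size - len(depth)),
                    constant_values=np.nan)
    bins = padded.reshape(n_bins, bin_size)
    return np.nanmean(bins, axis=1), np.nanstd(bins, axis=1)

means, stds = binned_stats([1, 1, 1, 1, 3, 3, 3, 3, 5], bin_size=4)
# means → [1.0, 3.0, 5.0]
```

The depth array itself could come from decoding the Rle runs/values into per-base values for a chromosome.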
Is there a function to calculate the coverage or fraction of intersection of two or more PyRanges?
Similar to https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html
The current behavior of the PyRanges constructor when initialising from a DataFrame is to change the dtypes of the Chromosome, Start and End columns of the input dataframe in place.
For me this behavior has been problematic. Would you consider changing it? The culprit is the set_dtypes call at the beginning of PyRanges.__init__.
Example:
import pyranges as pr
import pandas as pd
df = pd.DataFrame({
"Chromosome": list("1122"),
"Start": [1, 2, 3, 4],
"End": [10, 11, 12, 13]}
)
df.dtypes
# object, i8, i8
gr = pr.PyRanges(df)
df.dtypes
# category, i4, i4
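Until the constructor stops mutating its input, a defensive copy on the caller's side sidesteps the problem. A sketch (the pr.PyRanges call is commented out since it is the library call under discussion):

```python
import pandas as pd

df = pd.DataFrame({"Chromosome": list("1122"),
                   "Start": [1, 2, 3, 4],
                   "End": [10, 11, 12, 13]})

# Pass a copy so set_dtypes mutates the copy, not the caller's frame:
# gr = pr.PyRanges(df.copy())

# The original keeps its dtypes (object, int64, int64):
print(df.dtypes)
```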
Btw: thanks again for your great work!
I was happy to see this package, thank you.
I am getting failed joins when using long ranges (i.e. ranges that cover the genome rather than a single chromosome):
import pyranges as pr
import pandas as pd
a = pr.PyRanges(pd.DataFrame(data = {'Chromosome': [0], 'Start':[0], 'End': 5e10}), int64=True)
b = pr.PyRanges(pd.DataFrame(data = {'Chromosome': [0, 0], 'Start':[0, 5], 'End': [5, 5e10]}), int64=True)
a.join(b)
produces:
Empty PyRanges
I expect it to produce the same as below
a = pr.PyRanges(pd.DataFrame(data = {'Chromosome': [0], 'Start':[0], 'End': 5e8}), int64=True)
b = pr.PyRanges(pd.DataFrame(data = {'Chromosome': [0, 0], 'Start':[0, 5], 'End': [5, 5e8]}), int64=True)
a.join(b)
which produces
+--------------+-----------+-----------+-----------+-----------+
| Chromosome | Start | End | Start_b | End_b |
| (category) | (int64) | (int64) | (int64) | (int64) |
|--------------+-----------+-----------+-----------+-----------|
| 0 | 0 | 500000000 | 0 | 5 |
| 0 | 0 | 500000000 | 5 | 500000000 |
+--------------+-----------+-----------+-----------+-----------+
Unstranded PyRanges object has 2 rows and 5 columns from 1 chromosomes.
Python version 3.7.4
pyranges version 0.0.54
I have been testing the overlap and merge functions. As far as I can tell, overlap treats the end coordinate as exclusive, while merge treats it as inclusive (bookended intervals get merged)?
I have this example below:
import pyranges as pr
from pyranges import PyRanges
import pandas as pd
from io import StringIO
f1 = """Chromosome Start End Score Strand
chr1 4 7 23.8 +
chr1 6 11 0.13 -
chr2 0 14 42.42 +"""
df1 = pd.read_csv(StringIO(f1), sep="\s+")
gr1 = PyRanges(df1)
f2 = """Chromosome Start End Score Strand
chr2 14 18 15.3 +
"""
df2 = pd.read_csv(StringIO(f2), sep="\s+")
gr2 = PyRanges(df2)
gr1.overlap(gr2)
f3 = """Chromosome Start End Score Strand
chr1 4 7 23.8 +
chr1 6 11 0.13 -
chr2 0 14 42.42 +
chr2 14 18 15.3 +"""
df3 = pd.read_csv(StringIO(f3), sep="\s+")
gr3 = PyRanges(df3)
gr3.merge(count=True)
overlap doesn't report any overlaps, while merge merges intervals where one ends at the coordinate where the next starts. Personally I would have expected the two to behave the same way. Is there a way to get the same results from merge as from overlap?
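The overlap behavior is consistent with half-open [Start, End) intervals, under which two intervals that merely touch share no base. A minimal predicate illustrating this (the function name is mine):

```python
def half_open_overlap(a_start, a_end, b_start, b_end):
    """True if [a_start, a_end) and [b_start, b_end) share at least one base."""
    return a_start < b_end and b_start < a_end

half_open_overlap(0, 14, 14, 18)  # → False: the intervals only touch at 14
half_open_overlap(0, 14, 13, 18)  # → True: they share base 13
```

merge, by contrast, apparently treats such bookended intervals as mergeable, which would explain the asymmetry observed above.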
Hi,
I ran into this problem when finding what ranges from one set overlap ranges in several other sets. Here is a trivial test case:
pr1 = pr.PyRanges(pd.DataFrame({
'Chromosome': ['1', '1'],
'Start': [243, 645],
'End': [267, 689],
}))
pr2 = pr.PyRanges(pd.DataFrame({
'Chromosome': ['2', '2'],
'Start': [243, 645],
'End': [267, 689],
}))
pr3 = pr.PyRanges(pd.DataFrame({
'Chromosome': ['3', '3'],
'Start': [243, 645],
'End': [267, 689],
}))
# overlap of pr1 and pr2; should be empty
s1 = pr1
s2 = pr2
res1 = s1.overlap(s2, strandedness=None, how='first')
# overlap of first comparison with pr3; should also be empty
s1 = res1
s2 = pr2
res2 = s1.overlap(s2, strandedness=None, how='first')
Which generates this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-158-047be8f5ac7d> in <module>
17 s1 = res1
18 s2 = pr2
---> 19 res2 = s1.overlap(s2, strandedness=None, how='first')
20 len(res2.df)
~/jupyter-venv/lib/python3.7/site-packages/pyranges/pyranges.py in overlap(self, other, **kwargs)
423 kwargs = fill_kwargs(kwargs)
424
--> 425 dfs = pyrange_apply(_overlap, self, other, **kwargs)
426
427 return PyRanges(dfs)
~/jupyter-venv/lib/python3.7/site-packages/pyranges/multithreaded.py in pyrange_apply(function, self, other, **kwargs)
215 else:
216
--> 217 if self.stranded and not other.stranded:
218
219 for (c, s), df in items:
~/jupyter-venv/lib/python3.7/site-packages/pyranges/pyranges.py in stranded(self)
686 def stranded(self):
687 # print(self.keys())
--> 688 return len(list(self.keys())[0]) == 2
689
690 @property
IndexError: list index out of range
The immediate problem is that since res1 is empty, there is nothing with which to populate the dataframe dictionary dfs. So when checking whether either set is stranded, it fails because self.keys() is empty:
@property
def stranded(self):
return len(list(self.keys())[0]) == 2
Not sure if this is the best fix, but stranded should probably first check that len(self.keys()) > 0 or something. Alternatively, maybe just save a lot of computational work by checking in overlap whether either PyRanges is empty and, if so, returning an empty PyRanges object.
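To illustrate the suggested guard on a toy stand-in (PyRanges behaves like a dict keyed by chromosome or (chromosome, strand) tuples; the class here is hypothetical, not the library's code):

```python
class ToyRanges(dict):
    """Minimal stand-in for PyRanges' dict-of-DataFrames container."""

    @property
    def stranded(self):
        keys = list(self.keys())
        # Guard against empty objects before touching the first key.
        return bool(keys) and len(keys[0]) == 2

ToyRanges().stranded                     # → False instead of IndexError
ToyRanges({("chr1", "+"): None}).stranded  # → True
```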
When performing a left join, i.e.:
bed1.join(bed2, strandedness=False, how='left')
if bed1 has more chromosomes than bed2, the result of the join will not report all the regions in bed1.
I need to test the whole codebase with values larger than 2,147,483,647 to see that it works as expected.
Known places that are buggy: pyranges.random, pyranges.cluster and possibly more :/
What is the best way to add multiple metadata columns to a PyRanges object simultaneously (or to create a PyRanges object from a dataframe that has many metadata columns)?
The documentation describes adding one metadata column at a time, but it doesn't provide the syntax for multiple columns.
I am a user of PyRanges and found your contact information on GitHub. I am wondering why the PyRanges "Insert" function was removed; it let one overlap PyRanges and add a column of data simultaneously, and I used it quite often. Is there another way to accomplish the same thing? I cannot find anything similar in the documentation. If you have the chance to respond with any suggestion, I would appreciate the guidance.
I'm trying to work through some of the examples in the documentation and run into an error when trying to print pyranges objects:
gr = pr.PyRanges(chromosomes="chr1", strands="+", starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
Traceback (most recent call last):
File "", line 1, in
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pyranges/pyranges.py", line 107, in str
return tostring(self)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pyranges/tostring2.py", line 276, in tostring
df = _get_df(self, n, sort)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pyranges/tostring2.py", line 72, in _get_df
df = self.df.astype(object)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/util/_decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3299, in astype
**kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3224, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3091, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 471, in astype
**kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 2197, in _astype
return self.make_block(values)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 206, in make_block
return make_block(values, placement=placement, ndim=ndim, **kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 2719, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 1844, in init
placement=placement, **kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 106, in init
raise ValueError('Wrong number of dimensions')
ValueError: Wrong number of dimensions
pyranges version: 0.0.54
First, thank you for your work. Alternatives for genomic arithmetic are always welcome!
cs = pyranges.data.chipseq()
cs
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome | Start | End | Name | Score | Strand |
| (category) | (int32) | (int32) | (object) | (int64) | (category) |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1 | 212609534 | 212609559 | U0 | 0 | + |
| chr1 | 169887529 | 169887554 | U0 | 0 | + |
| chr1 | 216711011 | 216711036 | U0 | 0 | + |
| chr1 | 144227079 | 144227104 | U0 | 0 | + |
| ... | ... | ... | ... | ... | ... |
| chrY | 15224235 | 15224260 | U0 | 0 | - |
| chrY | 13517892 | 13517917 | U0 | 0 | - |
| chrY | 8010951 | 8010976 | U0 | 0 | - |
| chrY | 7405376 | 7405401 | U0 | 0 | - |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
def too_long(x):
    return (x.End - x.Start) < 120
cs.subset(too_long)
+--------------+-----------+-----------+------------+-----------+
| Chromosome | Start | End | Name | Score |
| (category) | (int32) | (int32) | (object) | (int64) |
|--------------+-----------+-----------+------------+-----------|
| chr1 | 212609534 | 212609559 | U0 | 0 |
| chr1 | 169887529 | 169887554 | U0 | 0 |
| chr1 | 216711011 | 216711036 | U0 | 0 |
| chr1 | 144227079 | 144227104 | U0 | 0 |
| ... | ... | ... | ... | ... |
| chrY | 15224235 | 15224260 | U0 | 0 |
| chrY | 13517892 | 13517917 | U0 | 0 |
| chrY | 8010951 | 8010976 | U0 | 0 |
| chrY | 7405376 | 7405401 | U0 | 0 |
+--------------+-----------+-----------+------------+-----------+
Unstranded PyRanges object has 10,000 rows and 5 columns from 24 chromosomes.
Why do we miss the Strand column with the subset function? Is it a bug or intended behaviour?
Can you provide support for VCF files, as you do for GTF and BAM files?
I ran across this bug trying to use the features.tss()
method. After doing some research the problem seems to be that this code was written expecting GFF column names which are different if the source is GTF.
In genomicfeatures.py the tss()
definition includes this line:
df = df.drop(["ExonNumber", "ExonID"], 1)
But if my features are read from a gtf file (either by myself or using pyranges_db from gencode) the corresponding columns are actually called ['exon_number', 'exon_id']
. This is in contrast to what the columns are called when parsed from GFF files.
It looks like this is also the case with tes()
Additionally, in looking at the code, the drop_duplicates argument to tss and tes doesn't actually have any effect if you set it to False. It's true you would disable the drop-duplicates statement at this level, but two calls deeper, _tss() and _tes() take another drop_duplicates argument which defaults to True, and the outer argument does not get passed down through tss_or_tes() to prevent duplicate dropping at that level.
As stated, using the __init__.py file from this GitHub repo fixes the problem. But the file distributed on PyPI has an empty __init__.py.
Hi again,
as usual I am super-thrilled at how this is coming along.
I found that I often want to pass a PyRanges instance to a function which I want to use in parallel (using multiprocessing). I can of course pass the underlying dataframe, and then reconstruct the PyRanges within the child processes. However, i) PyRanges construction for large sets of regions takes some time, and ii) if I use the same function outside of the context of multiprocessing, I do want to pass a PyRanges instance directly. So it would be just a little bit more convenient if the PyRanges class were serializable.
Do you have any plans / thoughts on that? In the simplest case, __setstate__ and __getstate__ could automate the 'serialize the underlying DF, then reconstruct the PyRanges' part. I am not Cython-savvy enough to know whether there is a more performant alternative...
Thanks!
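A sketch of what the __getstate__/__setstate__ route could look like, on a toy class standing in for PyRanges (everything here is hypothetical, not the library's code):

```python
import pickle

class RangesLike:
    """Toy container: a dict of per-chromosome interval lists."""

    def __init__(self, dfs):
        self.dfs = dfs

    def __getstate__(self):
        # Serialize only the plain underlying data...
        return self.dfs

    def __setstate__(self, state):
        # ...and rebuild the wrapper from it on unpickle.
        self.dfs = state

obj = RangesLike({"chr1": [(0, 10), (5, 20)]})
restored = pickle.loads(pickle.dumps(obj))
# restored.dfs == obj.dfs
```

For the real class, __getstate__ would return the dict of DataFrames and __setstate__ would reassign it, skipping any unpicklable Cython state.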
Thanks for making a really useful package! I was wondering if it's possible to get the information about which pairs of regions form overlaps, similar to the Hits object from the GenomicRanges library in R?
I want to merge/join two PyRanges objects, say obj1 and obj2, and still be able to track back to the indices of obj1 and obj2. I tried join, but the function returns a completely new object from which I can no longer trace the original indices. I imagine something similar to the merge function in R. Any suggestions? Or should the join function be improved?
Hi,
If you are like me then you probably don't like writing documentation, or worse, keeping that documentation up to date. But your document is very helpful to get started with pyranges. It looks like I can clean up some nasty data-format-to-pandas parsers I wrote and replace them by pyranges code. Perhaps if I get more experienced with your module I can contribute (to the docs), but I'm not comfortable doing that yet.
After going through the documentation I have some questions/suggestions though:
The documentation uses .slack() and slack=2 without explaining what exactly this does. It seems to extend intervals according to chapter 12 (https://biocore-ntnu.github.io/pyranges/methods-for-manipulating-single-pyranges.html), but perhaps that should be specified earlier; it is first mentioned in a code example in chapter 7.
I could probably hack something together, but perhaps you can suggest an efficient solution for the following:
I have two data files, containing per genomic position some information. In addition to those, I have a bed file with intervals of interest.
Now for every interval in the bed file, I would like to slice the two data files, sum some numbers and statistically compare file A vs file B.
And then print some output, move on to the next interval.
Would it be most efficient to just iterate over the bed pyranges object and slice pyranges object of the data files? Or should I first annotate every record in the data files with the name of the interval in the bed file it belongs to?
Thanks,
Wouter
Thanks for the great addition to the Python ecosystem! We are starting to incorporate this into some of our systems, which operate on models built from large amounts of transcript and exon range data. We build these models now with the really nice interlap
library and drop them to pickles for reuse.
Saving our PyRanges objects to csv files is an excellent replacement for this, but they are kind of large when written (~70 MB). It would be great to be able to pass the compression kwarg of pandas, like you're currently doing with sep and header, to reduce that size.
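For reference, pandas' to_csv already accepts the compression kwarg (and infers it from the file extension), so forwarding it would be enough. A minimal illustration with a plain DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Chromosome": ["chr1"], "Start": [0], "End": [10]})

# pandas compresses transparently; "infer" (the default) picks gzip
# from the .gz extension, or it can be set explicitly:
df.to_csv("ranges.csv.gz", sep="\t", index=False, compression="gzip")

# Reading back also infers the compression from the extension:
back = pd.read_csv("ranges.csv.gz", sep="\t")
```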
In the 0-based convention (used by the BED format) the range that spans the first 10 bases of a chromosome is represented by setting start-end to 0-10 (the end position is excluded). In the 1-based convention (used by the GTF and GFF3 formats and in Bioconductor) this range is represented by setting start-end to 1-10.
I couldn't find anything in PyRanges documentation about which convention is used.
Note that the following result:
>>> import pyranges as pr
>>> x = pr.PyRanges(chromosomes='chr1', starts=[10], ends=[20])
>>> y = pr.PyRanges(chromosomes='chr1', starts=[20], ends=[30])
>>> x.intersect(y)
Empty PyRanges
suggests that the 0-based convention is used. With the 1-based convention, this intersection wouldn't be empty (it would contain exactly one base).
But OTOH the following result:
>>> print(x.to_gff3())
chr1 . . 10 20 . . .
suggests the opposite!
Also note that having a PyRanges method that returns the number of bases in each range (width in Bioconductor terminology) would be very useful. Right now, given that it's not clear which convention is used, one has no way to know whether the width should be computed with x.End - x.Start
(0-based convention) or with x.End - x.Start + 1
(1-based convention).
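If the 0-based half-open convention does hold, the width computation is the no-plus-one variant. A small illustration with a plain DataFrame (column names follow the PyRanges examples above):

```python
import pandas as pd

df = pd.DataFrame({"Start": [0, 10], "End": [10, 20]})

# 0-based half-open: End is excluded, so width is End - Start (no +1).
width = df["End"] - df["Start"]
# width → [10, 10]: each range spans exactly ten bases
```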
Thanks!
H.
PS: FWIW I started a small Bioconductor package to facilitate access to PyRanges objects from Bioconductor: https://github.com/hpages/BiocPyRanges
Hi - thanks for the great work and continuing to enhance this package. We're getting great use out of it. Would it be possible to incorporate git tagging/releases to match with what's in pypi? That would make tracking changes, especially breaking ones, much more straightforward since the tag would capture the whole repo, not just individual commit messages.
The multiline headers in BAM/VCF files are often significant. How should they be preserved?
Possibility: add a meta-dict to PyRanges where the header is stored after reading a file. Or just in a variable gr.header
.
With the current behavior, built-in functions such as hasattr do not work, because the wrong exception type is raised.
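The underlying rule is that hasattr treats only AttributeError as "attribute missing"; any other exception raised from __getattr__ propagates to the caller. A minimal reproduction with toy classes:

```python
class Wrong:
    def __getattr__(self, name):
        raise KeyError(name)  # wrong exception type for a missing attribute

class Right:
    def __getattr__(self, name):
        raise AttributeError(name)

hasattr(Right(), "missing")  # → False, as expected
# hasattr(Wrong(), "missing") raises KeyError instead of returning False
```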
One thing I have been wondering about is what to do in case the range is too wide to display properly. I guess truncating the columns Chromosome, Strand, Start and End to a single Position column in the display, like
Position
chr1:100-200:+
without changing the underlying data, would be an improvement.
But this is something that one should be able to change in the settings.
In case it is still too wide, I guess I will have to remove a few columns and just display ...
instead.
I'd love an API to try to make a PyRanges automatically from a dataframe, without the user having to rename any columns.
gr = pr.from_dataframe(df)
But then we would need a method to guess which columns contain chromosomes, starts, ends and optionally strand.
Starts and Ends: the two first int columns where all End >= Start.
Strand: any column containing only "+", "-" or ".".
Chromosome: harder....
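A sketch of the guessing heuristics described above, operating on a plain DataFrame (the function name and details are mine, not a proposed final API):

```python
import pandas as pd

def guess_range_columns(df):
    """Guess (start, end, strand) column names from a DataFrame."""
    int_cols = [c for c in df.columns
                if pd.api.types.is_integer_dtype(df[c])]
    start = end = strand = None
    # Starts/Ends: the first pair of integer columns with all End >= Start.
    for a, b in zip(int_cols, int_cols[1:]):
        if (df[b] >= df[a]).all():
            start, end = a, b
            break
    # Strand: any column containing only "+", "-" or ".".
    for c in df.columns:
        if df[c].astype(str).isin(["+", "-", "."]).all():
            strand = c
            break
    return start, end, strand

cols = guess_range_columns(pd.DataFrame({
    "chrom": ["1", "2"], "s": [1, 5], "e": [3, 9], "str": ["+", "-"]}))
# cols → ("s", "e", "str")
```

Guessing the chromosome column is indeed the hard part; one would probably fall back to the first non-numeric, non-strand column.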
Also, it would be great to have a method like
df = pr.to_dataframe(gr)
which remembers the original chromosome, start, end and strand names.
What license is this software released under? What are the terms for using it in my own project?
Or is the license embedded in another file?
Convention is to have a LICENSE
file
Hello,
Thanks very much for bringing GRanges to python. Very much appreciated!
I tried to install pyranges through pip install pyranges. The installation reported no problems. However, when I tried to import pyranges, it complained about No module named raymock.
I did a quick fix by changing import pyranges.ramock as ray to import ramock as ray in pyranges.py. But then it similarly complained about other packages. I am wondering, is there a way I can just fix them all automatically? Thanks for any help.
Simo
When trying to install from pip, I get the following error:
'ModuleNotFoundError: No module named 'Cython'
If I install cython manually, the pip installation for pyranges then works. Is there a reason cython is not included as a requirement for pip?
I was reading the docs on joining ranges, but I can't find any information on how to join multiple ranges across files. This is, for instance, accomplished through bedtools multiintersectbed. Is this possible here too?
Thanks in advance
Hey!
Would be lovely to have a full-fledged granges clone in python!
How do you feel about implementing subsetting of ranges objects by indexing into them, and/or with a bool array? Right now the workaround I found is to go via pandas, i.e.
gtf_df = gtf.as_df()
gtf_df_genes = gtf_df[gtf_df.Feature=='gene']
gtf_genes = pyranges.PyRanges(gtf_df_genes)
Would be nice to do this directly on the PyRanges
object, no? :-)
Thx,
J.
Correct me if I'm wrong, but I don't believe there is an equivalent of R's tileGenome function.
tileGenome returns a set of genomic regions that form a partitioning of the specified genome. Each region is called a "tile".
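For what it's worth, a tileGenome-style partitioning is straightforward to build from a dict of chromosome lengths. A sketch (the function name is mine; the resulting frame could then be wrapped in pr.PyRanges):

```python
import pandas as pd

def tile_genome(chrom_sizes, tile_size):
    """Partition each chromosome into consecutive fixed-size tiles;
    the last tile per chromosome is truncated at the chromosome end."""
    rows = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, tile_size):
            rows.append((chrom, start, min(start + tile_size, size)))
    return pd.DataFrame(rows, columns=["Chromosome", "Start", "End"])

tiles = tile_genome({"chr1": 2500, "chr2": 1000}, tile_size=1000)
# chr1 → (0, 1000), (1000, 2000), (2000, 2500); chr2 → (0, 1000)
```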
Hello,
A useful feature would be to be able to ensure that the slack function does not extend genomic ranges beyond the ends of chromosomes. This is implemented in bedtools: https://bedtools.readthedocs.io/en/latest/content/tools/slop.html
It seems there is some logic to prevent slack from creating negative positions (but maybe not for stranded slack), but there is no check for the other end of the chromosome.
I'm thinking this could be implemented as an optional argument to slack. If the optional 'bounds' object is passed, then the output of the slack function is guaranteed to be within the bounds.
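A sketch of the bounds check on a plain DataFrame, where chrom_sizes plays the role of the proposed 'bounds' object (names are mine):

```python
import pandas as pd

def clip_to_bounds(df, chrom_sizes):
    """Clamp Start to >= 0 and End to each row's chromosome length."""
    out = df.copy()
    upper = out["Chromosome"].map(chrom_sizes)
    out["Start"] = out["Start"].clip(lower=0)
    out["End"] = out["End"].clip(upper=upper)  # per-row upper bound
    return out

extended = pd.DataFrame({"Chromosome": ["chr1", "chr1"],
                         "Start": [-50, 900],
                         "End": [100, 1100]})
clipped = clip_to_bounds(extended, {"chr1": 1000})
# Start → [0, 900], End → [100, 1000]
```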
First of all, thanks so much for all the work you already put into this. I have been tentatively working with your package and think this is very promising. I am sure many people would be very happy to have a Ranges package of this quality in their python toolbox. I have two questions:
Thanks!
The slack parameter seems to work for merge but not join:
gr1 = pr.PyRanges(chromosomes="chr1", starts=[0], ends=[10], strands = "+")
gr2 = pr.PyRanges(chromosomes="chr1", starts=[15], ends=[20], strands = "+")
gr1.merge(gr2, slack = 10)
+--------------+-----------+-----------+--------------+
| Chromosome | Start | End | Strand |
| (category) | (int32) | (int32) | (category) |
|--------------+-----------+-----------+--------------|
| chr1 | 0 | 10 | + |
+--------------+-----------+-----------+--------------+
gr1.join(gr2, slack = 10)
Empty PyRanges
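For reference, the semantics I'd expect from join with slack: two intervals should pair up when the gap between them is at most slack, i.e. overlap after symmetric extension. A minimal predicate sketch on half-open intervals (coordinates from the example above; the exact boundary convention is an assumption):

```python
def joins_with_slack(a, b, slack):
    """True if half-open intervals a and b overlap once either one
    is extended by slack (i.e. the gap between them is < slack... or
    they already overlap)."""
    s1, e1 = a
    s2, e2 = b
    return s1 < e2 + slack and s2 < e1 + slack

print(joins_with_slack((0, 10), (15, 20), 10))  # gap of 5: should join
print(joins_with_slack((0, 10), (25, 30), 10))  # gap of 15: should not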
The current read_bed is very strict. I'd like a method which reads any table where the first three columns are Chromosome, Start and End, and where the optional Strand column can be anywhere else. This would lessen the need for
gr = pr.PyRanges(pd.read_table("bla"))
# pr.read_bedlike("bla") much more convenient
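A tolerant reader could be little more than pandas plus a column rename; read_bedlike is the hypothetical name from this issue, and the sketch assumes the file has a header row:

```python
import io
import pandas as pd

def read_bedlike(path_or_buf):
    """Read any tab-separated table whose first three columns are the
    chromosome, start and end, whatever they are called; remaining
    columns (e.g. Strand, scores) are kept under their own names."""
    df = pd.read_table(path_or_buf)
    df = df.rename(columns=dict(zip(df.columns[:3],
                                    ["Chromosome", "Start", "End"])))
    return df

# Demo with an in-memory "file" instead of a real path.
buf = io.StringIO("chrom\tstart\tend\tscore\nchr1\t0\t10\t3\n")
df = read_bedlike(buf)
print(df.columns.tolist())
```

The returned DataFrame could then be handed to pr.PyRanges directly.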
I'm running the pyranges example code for nearest.py on Windows (Python 3.7.3 on Anaconda). I installed pyranges with pip. I'm getting this error:
gr = pr.data.chipseq()
gr2 = pr.data.chipseq_background()
print(gr.nearest(gr2, suffix="_Input"))
ValueError Traceback (most recent call last)
1 gr = pr.data.chipseq()
2 gr2 = pr.data.chipseq_background()
----> 3 print(gr.nearest(gr2, suffix="_Input"))
~\Anaconda3\lib\site-packages\pyranges\pyranges.py in nearest(self, other, **kwargs)
199 assert other.stranded, "If doing upstream or downstream nearest, other pyranges must be stranded"
200
--> 201 dfs = pyrange_apply(_nearest, self, other, **kwargs)
202
203 return PyRanges(dfs)
~\Anaconda3\lib\site-packages\pyranges\multithreaded.py in pyrange_apply(function, self, other, **kwargs)
290 # dbg(odf)
291
--> 292 result = call_f(function, nparams, df, odf, kwargs)
293 results.append(result)
294
~\Anaconda3\lib\site-packages\pyranges\multithreaded.py in call_f(f, nparams, df, odf, kwargs)
20
21 if nparams == 3:
---> 22 return f.remote(df, odf, kwargs)
23 else:
24 return f.remote(df, odf)
~\Anaconda3\lib\site-packages\pyranges\methods\nearest.py in _nearest(scdf, ocdf, kwargs)
120 else:
121 previous_r_idx, previous_dist = _previous_nonoverlapping(
--> 122 df_to_find_nearest_in.Start, ocdf.End)
123
124 next_r_idx, next_dist = _next_nonoverlapping(
~\Anaconda3\lib\site-packages\pyranges\methods\nearest.py in _previous_nonoverlapping(left_starts, right_ends)
73 right_ends = right_ends.sort_values()
74 r_idx, dist = nearest_previous_nonoverlapping(
---> 75 left_starts.values, right_ends.values - 1, right_ends.index.values)
76
77 r_idx = pd.Series(r_idx, index=left_starts.index).sort_index().values
~\Anaconda3\lib\site-packages\sorted_nearest\src\sorted_nearest.pyx in sorted_nearest.src.sorted_nearest.nearest_previous_nonoverlapping()
ValueError: Buffer dtype mismatch, expected 'const long' but got 'long long'
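A likely cause (hedged, not confirmed against the pyranges source): on Windows, C long is 32-bit, so a Cython buffer typed as long rejects NumPy's default 64-bit integer arrays ("long long"). A minimal illustration of the dtype distinction, and the explicit cast that usually works around such mismatches:

```python
import numpy as np

# NumPy's default integer dtype is platform-dependent:
# int64 on Linux/macOS, int32 on Windows (where C long is 32-bit).
starts = np.array([0, 10, 20])

# Casting explicitly pins the dtype regardless of platform, which is
# the usual workaround when a compiled extension expects one width.
starts32 = starts.astype(np.int32)
print(starts.dtype, starts32.dtype)
```

If this is the cause, the fix on the library side would be to type the Cython buffers with fixed-width integers (e.g. int32_t/int64_t) rather than long.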
Given a PyRanges object:
+--------------+-----------+-----------+------------------------+
| Chromosome | Start | End | Tumor_Sample_Barcode |
| (category) | (int32) | (int32) | (object) |
|--------------+-----------+-----------+------------------------|
| X | 44938423 | 44938423 | DO52686 |
| X | 49595038 | 49595038 | DO52686 |
| X | 49957631 | 49957631 | DO52686 |
| X | 51488226 | 51488226 | DO52686 |
| ... | ... | ... | ... |
| X | 137709431 | 137709431 | DO52686 |
| X | 140993315 | 140993315 | DO52686 |
| X | 144909350 | 144909350 | DO52686 |
| X | 144909500 | 144909500 | DO52686 |
+--------------+-----------+-----------+------------------------+
Unstranded PyRanges object has 18 rows and 4 columns from 1 chromosomes.
The corresponding PyRles object from calling coverage() is:
+--------+-------------+
| Runs | 144909500 |
|--------+-------------|
| Values | 0.0 |
+--------+-------------+
Rle of length 144909500 containing 1 elements
However, on R's GenomicRanges, the RleList with the same data is:
integer-Rle of length 155270560 with 37 runs
Lengths: 44938422 1 4656614 1 362592 1 1530594 ... 1 3916034 1 149 1 10361060
Values : 0 1 0 1 0 1 0 ... 1 0 1 0 1 0
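The two lengths differ because GenomicRanges knows the declared chromosome length (seqlengths for chrX here is 155,270,560), while the PyRles object ends at the last interval (144,909,500). Padding the run-length encoding out to a declared length recovers the R behavior; a sketch with the numbers from this issue:

```python
def pad_runs(runs, values, chrom_length):
    """Extend a run-length encoding with a trailing zero-coverage run
    so its total length equals the declared chromosome length."""
    covered = sum(runs)
    if covered < chrom_length:
        runs = runs + [chrom_length - covered]
        values = values + [0.0]
    return runs, values

# Lengths taken from the issue: PyRles stops at 144909500,
# GenomicRanges pads chrX out to 155270560.
runs, values = pad_runs([144909500], [0.0], 155270560)
print(runs, values)
```

The trailing run of 10,361,060 matches the final zero-run in the GenomicRanges RleList above, so the discrepancy is purely about whether chromosome lengths are supplied.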
Hey!
Again, I appreciate the work you're doing on this :-)
It would be nice if r1.overlap(r2) told me which rows of r1 overlap which rows of r2. Then I would be able to use it to e.g. annotate my ranges. This is à la findOverlaps in GenomicRanges: https://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges_Usage.html#findoverlaps.
If you do this, there'll be one less reason for me to open R ;-)
Thx!
J.
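The requested output is a list of (query index, subject index) pairs, one per overlapping row pair. A quadratic pure-Python sketch of that shape on half-open intervals (coordinates invented; a real implementation would use a sorted sweep or interval tree per chromosome):

```python
# Toy ranges standing in for the rows of r1 and r2.
r1 = [(0, 10), (20, 30)]
r2 = [(5, 15), (25, 40), (100, 110)]

# findOverlaps-style hits: (index into r1, index into r2) for every
# pair of rows whose half-open intervals intersect.
hits = [(i, j)
        for i, (s1, e1) in enumerate(r1)
        for j, (s2, e2) in enumerate(r2)
        if s1 < e2 and s2 < e1]
print(hits)
```

Those index pairs are exactly what makes annotation easy: they can be used to pull metadata columns from r2 onto the matching rows of r1.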
Come see me at the Genome Informatics meeting :)