Comments (7)
@gawbul Thank you so much for getting to the bottom of this issue. Amazing to see the detective work in action!
So originally, my intention was to have Dask handle the read_table(), since reading a large GTF file with Pandas' read_table() can have a huge memory footprint. Just as you've pointed out, it returns the AttributeError: '_io.TextIOWrapper' error because Dask can't read_table from the in-memory text stream produced after uncompressing the gzip at https://github.com/BioMeCIS-Lab/OpenOmics/blob/2e891028d9df0af6ab38b65b05dbdcd7b906cfdd/openomics/database/base.py#L77-L79.
I've just tried dask.dataframe.read_table() with compression="gzip" on the compressed gzip file and it worked beautifully. Thanks for the suggestion! I'll apply the fix and refactor the code, and close this issue once it's done.
from openomics.
I added this line openomics/database/base.py#L74, which now only lets Dask handle uncompression when dealing with GTF files. Running GENCODE(npartitions=5) now works.
I also added some functionality to parse attributes from GTF files into Dask dataframes at openomics/utils/read_gtf.py. Creating Dask dataframes from GTF files now works, albeit with no time improvement over creating a Pandas dataframe. In the future, I'll look into optimizing GTF attribute parsing by using ddf.map_partitions(func) or ddf["attributes"].apply(func).
I initially opened an issue with fsspec here (fsspec/filesystem_spec#529), but realised it was due to gzip.open here (https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/database/base.py#L89-L90) returning an io.TextIOWrapper object.
fsspec is unable to get a path (see https://github.com/intake/filesystem_spec/blob/fb406453b6418052f98b64d405bd4e6a4be1def1/fsspec/utils.py#L304-L308) from the object, so it just returns it in its original state, which, of course, doesn't have a startswith method.
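The failure mode is easy to reproduce with the standard library alone (the file name is hypothetical): gzip.open in text mode yields an io.TextIOWrapper, which is not a path and has no startswith method for fsspec's protocol sniffing to use:

```python
import gzip
import io

# Create a small gzip file (hypothetical sample data).
with gzip.open("sample.txt.gz", "wt") as f:
    f.write("hello\n")

# Opening in text mode ("rt") returns an io.TextIOWrapper, not a path.
stream = gzip.open("sample.txt.gz", "rt")
print(type(stream))  # <class '_io.TextIOWrapper'>

# A str path has startswith (used by fsspec to detect protocols
# like "s3://"); the stream object does not.
print(hasattr("sample.txt.gz", "startswith"))  # True
print(hasattr(stream, "startswith"))           # False
stream.close()
```
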
Found this too https://stackoverflow.com/questions/65998183/python-dask-module-error-attributeerror-io-textiowrapper-object-has-no-at from a few days ago. Added a comment asking if they managed to figure it out.
Update: I fixed their issue https://stackoverflow.com/a/66110233/393634.
So, I've found the issue.
The code here (https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/utils/read_gtf.py#L178-L179) is the problem:
    logging.info(filepath_or_buffer)
    chunk_iterator = dd.read_table(
        filepath_or_buffer,
        sep="\t",
        comment="#",
        names=REQUIRED_COLUMNS,
        skipinitialspace=True,
        skip_blank_lines=True,
        error_bad_lines=True,
        warn_bad_lines=True,
        # chunksize=chunksize,
        engine="c",
        dtype={
            "start": np.int64,
            "end": np.int64,
            "score": np.float32,
            "seqname": str,
        },
        na_values=".",
        converters={"frame": parse_frame})
Specifically, the problem is passing filepath_or_buffer to dd.read_table. dd is an alias for dask.dataframe, as per the import dask.dataframe as dd statement.
However, the read_table function doesn't take a stream object; it takes a file path or a list of file paths, as per the documentation here: https://docs.dask.org/en/latest/dataframe-api.html?highlight=read_table#dask.dataframe.read_table.
The parameter list defines the following:

urlpath : string or list
    Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
This might help? https://stackoverflow.com/q/39924518/393634 🤔
Specifically this answer https://stackoverflow.com/a/46428853/393634.
I don't think we need to do the decompression here https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/database/base.py#L80-L106. I think, perhaps, we can just pass the file path through as-is and let dask.dataframe's compression='infer' parameter handle it?
Awesome 🥳 Glad you managed to get this sorted 🙂