Comments (7)

JonnyTran commented on May 27, 2024

@gawbul Thank you so much for getting to the bottom of this issue. Amazing to see the detective work in action!

So originally, my intention was to have Dask handle the read_table(), since reading a large GTF file with Pandas' read_table() can have a huge memory footprint. Just as you've pointed out, it raises the AttributeError: '_io.TextIOWrapper' error because Dask can't read_table from a StringIO stream produced after decompressing the gzip at https://github.com/BioMeCIS-Lab/OpenOmics/blob/2e891028d9df0af6ab38b65b05dbdcd7b906cfdd/openomics/database/base.py#L77-L79.

I've just tried dask.dataframe.read_table() with compression="gzip" on the compressed gzip file and it worked beautifully. Thanks for the suggestion!
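
For reference, the call that worked looks roughly like this (the file name is a placeholder):

    import dask.dataframe as dd

    # Hand Dask the gzipped GTF path directly; gzip isn't splittable,
    # so blocksize=None reads the whole file as a single partition.
    ddf = dd.read_table(
        "gencode.annotation.gtf.gz",  # placeholder path
        compression="gzip",
        blocksize=None,
        comment="#",
    )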

I'll apply the fix and refactor the code, and close this issue when it's done.

JonnyTran commented on May 27, 2024

I added a line at openomics/database/base.py#L74 which now lets Dask handle decompression only when dealing with GTF files. Running GENCODE(npartitions=5) now works.

I also added some functionality to parse attributes from GTF files into Dask dataframes at openomics/utils/read_gtf.py. Creating Dask dataframes from GTF files now works, albeit with no speedup compared to creating a pandas dataframe. In the future, I'll look into optimizing GTF attribute parsing by using ddf.map_partitions(func) or ddf["attributes"].apply(func).
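
Roughly, the map_partitions approach could look like this sketch, where parse_gene_id is a simplified stand-in for the real attribute parser in read_gtf.py and the file name is a placeholder:

    import dask.dataframe as dd

    GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
                   "score", "strand", "frame", "attribute"]

    def parse_gene_id(attribute):
        # Simplified stand-in for the real parser: pull gene_id out of
        # a GTF 'key "value"; key "value"; ...' attribute string.
        for field in attribute.split(";"):
            field = field.strip()
            if field.startswith("gene_id"):
                return field.split('"')[1]
        return None

    # "annotations.gtf" is a placeholder path.
    ddf = dd.read_table("annotations.gtf", comment="#", names=GTF_COLUMNS)

    # map_partitions runs the parser once per pandas partition instead of
    # building a per-element task in the Dask graph.
    gene_ids = ddf["attribute"].map_partitions(
        lambda part: part.map(parse_gene_id), meta=("gene_id", "object"))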

gawbul commented on May 27, 2024

I initially opened an issue with fsspec here (fsspec/filesystem_spec#529) but realised it was due to gzip.open here (https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/database/base.py#L89-L90) returning an io.TextIOWrapper object.

fsspec is unable to get a path (see https://github.com/intake/filesystem_spec/blob/fb406453b6418052f98b64d405bd4e6a4be1def1/fsspec/utils.py#L304-L308) from the object and so it just returns it in its original state, which, of course, doesn't have a startswith method.
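
A quick way to reproduce the problem (the file name is just an example):

    import gzip

    # gzip.open in text mode ("rt") returns an io.TextIOWrapper rather
    # than a path string, so fsspec can't stringify it and hands it back
    # unchanged -- and a TextIOWrapper has no startswith method.
    handle = gzip.open("annotations.gtf.gz", "rt")
    print(type(handle))                   # <class '_io.TextIOWrapper'>
    print(hasattr(handle, "startswith"))  # False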

gawbul commented on May 27, 2024

I also found this from a few days ago: https://stackoverflow.com/questions/65998183/python-dask-module-error-attributeerror-io-textiowrapper-object-has-no-at. I added a comment asking if they managed to figure it out.

Update: I fixed their issue: https://stackoverflow.com/a/66110233/393634.

gawbul commented on May 27, 2024

So, I've found the issue.

The code here (https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/utils/read_gtf.py#L178-L179) is the problem:

        logging.info(filepath_or_buffer)
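        # NB: at this point filepath_or_buffer is an io.TextIOWrapper
        # (the result of gzip.open), not a path string, which is what
        # dd.read_table below chokes on.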
        chunk_iterator = dd.read_table(
            filepath_or_buffer,
            sep="\t",
            comment="#",
            names=REQUIRED_COLUMNS,
            skipinitialspace=True,
            skip_blank_lines=True,
            error_bad_lines=True,
            warn_bad_lines=True,
            # chunksize=chunksize,
            engine="c",
            dtype={
                "start": np.int64,
                "end": np.int64,
                "score": np.float32,
                "seqname": str,
            },
            na_values=".",
            converters={"frame": parse_frame})

Specifically, it passes filepath_or_buffer, which is a stream object rather than a path, to dd.read_table.

dd is an alias for dask.dataframe, as per the import dask.dataframe as dd statement.

However, the read_table function doesn't take a stream object; it takes a file path or a list of file paths, as per the documentation here: https://docs.dask.org/en/latest/dataframe-api.html?highlight=read_table#dask.dataframe.read_table.

The parameter list defines the following:

urlpath: string or list
    Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
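
So any of these first arguments would be accepted (the paths are placeholders), but an open stream is not:

    import dask.dataframe as dd

    dd.read_table("annotations.gtf")              # single local path
    dd.read_table("data/*.gtf")                   # globstring
    dd.read_table(["a.gtf", "b.gtf"])             # list of paths
    dd.read_table("s3://bucket/annotations.gtf")  # remote protocol (needs s3fs)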

gawbul commented on May 27, 2024

This might help? https://stackoverflow.com/q/39924518/393634 🤔

Specifically, this answer: https://stackoverflow.com/a/46428853/393634.

I don't think we need to do the decompression here: https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/database/base.py#L80-L106. Perhaps we can just pass the file path through as is and let dask.dataframe's compression='infer' parameter handle it?
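
Something along the lines of this sketch, with the path as a placeholder:

    import dask.dataframe as dd

    # Skip the manual gzip.open/StringIO step: pass the path straight
    # through and let compression="infer" pick gzip up from the .gz suffix.
    ddf = dd.read_table(
        "gencode.annotation.gtf.gz",  # placeholder path
        sep="\t",
        comment="#",
        compression="infer",
        blocksize=None,  # compressed files can't be split by byte range
    )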

gawbul commented on May 27, 2024

Awesome 🥳 Glad you managed to get this sorted 🙂
