guma44 / GEOparse
Python library to access Gene Expression Omnibus Database (GEO)
License: BSD 3-Clause "New" or "Revised" License
Now that you've got #45 sorted, it'd be great if you could publish an updated package version!
When I try
gse = GEOparse.get_GEO(geo="GSE69263", destdir="./")
gse.columns
I get:
gse.columns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'GSE' object has no attribute 'columns'
while the docs mention that columns is a standard GSE property.
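For what it's worth, `columns` appears to live on the individual GSM (and GPL) objects rather than on the GSE container. A minimal sketch of collecting per-sample column descriptions, with SimpleNamespace objects standing in for a series parsed by get_GEO (the sample names are made up):

```python
from types import SimpleNamespace

# Stand-ins for a parsed GSE: in GEOparse, each GSM/GPL carries its own
# `columns`; the GSE object only holds the gsms/gpls dictionaries.
gse = SimpleNamespace(gsms={
    "GSM1": SimpleNamespace(columns=["ID_REF", "VALUE"]),
    "GSM2": SimpleNamespace(columns=["ID_REF", "VALUE"]),
})

# Gather column descriptions sample by sample instead of from the GSE.
per_sample_columns = {name: gsm.columns for name, gsm in gse.gsms.items()}
```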
The following code is generating a warning for me:
import GEOparse
gpl = GEOparse.get_GEO('GPL17481')
The output is:
>>> import GEOparse
>>> gpl = GEOparse.get_GEO('GPL17481')
17-May-2021 13:32:21 DEBUG utils - Directory ./ already exists. Skipping.
17-May-2021 13:32:21 INFO GEOparse - Downloading http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL17481&form=text&view=full to ./GPL17481.txt
17-May-2021 13:32:23 DEBUG downloader - Total size: 0
17-May-2021 13:32:23 DEBUG downloader - md5: None
1.72MB [00:00,1.63MB/s]
10.3MB [00:01, 7.26MB/s]
17-May-2021 13:32:24 DEBUG downloader - Moving /tmp/tmp2lblbvso to /home/dbolser/Geromics/Dogome/Geromics/GPL17481.txt
17-May-2021 13:32:24 DEBUG downloader - Successfully downloaded http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL17481&form=text&view=full
17-May-2021 13:32:24 INFO GEOparse - Parsing ./GPL17481.txt:
17-May-2021 13:32:24 DEBUG GEOparse - PLATFORM: GPL17481
/usr/bin/bpython3:1: DtypeWarning: Columns (7) have mixed types.Specify dtype option on import or set low_memory=False.
#!/usr/bin/python3
>>>
I get that this error is coming from pandas, but I'm not sure how to fix it.
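For reference, the usual pandas-side fix is exactly what the warning suggests: force a dtype (or pass low_memory=False) in read_csv. GEOparse would need to expose such an option to apply it internally; this sketch only demonstrates the pandas mechanism on a made-up table:

```python
import io
import pandas as pd

# The DtypeWarning comes from pandas inferring types chunk by chunk.
# Forcing a single dtype (or low_memory=False) avoids the mixed-type guess.
# Column names below are invented for illustration.
csv = io.StringIO("ID\tVALUE\nA\t1\nB\tx\n")
table = pd.read_csv(csv, sep="\t", dtype=str, low_memory=False)
```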
I got an IndexError for some GSM SOFT txt files, for instance for GSM32878 (string index out of range) with geo = GEOparse.get_GEO('GSM32878'):
Traceback (most recent call last):
File "indexReportUpdate.py", line 825, in <module>
createIndices("GSM", outputDoc[2], outputEdg[0], outputDoc[6], outputEdg[2])
File "indexReportUpdate.py", line 171, in createIndices
raise e
File "indexReportUpdate.py", line 163, in createIndices
geo = GEOparse.get_GEO(filepath=fpath, silent=True)
File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 82, in get_GEO
return parse_GSM(filepath)
File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 374, in parse_GSM
table_data = parse_table_data(soft)
File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 329, in parse_table_data
data = "\n".join([i.rstrip() for i in lines if i[0] not in ("^", "!", "#")])
File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 329, in <listcomp>
data = "\n".join([i.rstrip() for i in lines if i[0] not in ("^", "!", "#")])
IndexError: string index out of range
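A guard against empty lines would avoid this: the list comprehension indexes i[0] on every line, so a blank line raises IndexError. A minimal reproduction with the fix applied (the sample lines are made up):

```python
# parse_table_data filters SOFT lines by their first character; an empty
# line has no i[0]. Adding a truthiness check skips blank lines safely.
lines = ["^SAMPLE = GSM32878", "!Sample_title = example", "ID_REF\tVALUE", "", "A\t1.0"]
data = "\n".join(i.rstrip() for i in lines if i and i[0] not in ("^", "!", "#"))
```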
Is this package supported for Python 2.7 only? It might be a good idea to mention that in the README.
Thank you for the great project. It is very convenient for me.
The dataset I'm interested in is a GSE with multiple GPLs. I want to find the samples belonging to a specific GPL, but I cannot find where the relations are stored.
In this case,
>>> gse = GEOparse.get_GEO("GSE6532", destdir='data/',
annotate_gpl=True, include_data=True, silent=True)
>>> print(gse.gpls)
{'GPL570': <PLATFORM: GPL570>,
'GPL96': <PLATFORM: GPL96>,
'GPL97': <PLATFORM: GPL97>}
>>> print(len(gse.gsms))
741
How can I filter GPL570's samples?
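One way to relate samples to platforms, assuming each GSM's metadata carries a platform_id entry (as GEO SOFT files do). SimpleNamespace objects stand in for the parsed gse.gsms here, and the sample names are made up:

```python
from types import SimpleNamespace

# Each GSM's metadata maps keys to lists of values, so platform membership
# is a list-containment check on the 'platform_id' entry.
gsms = {
    "GSM_A": SimpleNamespace(metadata={"platform_id": ["GPL570"]}),
    "GSM_B": SimpleNamespace(metadata={"platform_id": ["GPL96"]}),
}
gpl570_samples = {
    name: gsm for name, gsm in gsms.items()
    if "GPL570" in gsm.metadata.get("platform_id", [])
}
```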
It looks like this library uses the Bio.Entrez module from Biopython, but I do not see Biopython in the requirements file.
I'm trying to read in the XML file for the metadata and stumbled across this package. Do you recommend a good way to do this?
GEOparse hangs for me quite a lot, particularly on slow connections. I think this is because there are no timeout values set.
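Until timeouts are configurable in the library itself, one blunt workaround is a process-wide default socket timeout, which any connection GEOparse opens would inherit (the 30-second value is an arbitrary choice):

```python
import socket

# Any socket created afterwards without an explicit timeout inherits this
# default, so a stalled GEO connection fails instead of hanging forever.
socket.setdefaulttimeout(30)
```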
Hi!
Thanks for this awesome project, a much-needed tool in the Python ecosystem.
I wonder if you could consider adding a separate data structure to each GSM object that would store the list of all SRX and SRR entries associated with that GSM. The motivation is that, with a list like that, users could use your library just to scrape the GSE/GSM/SRX/SRR data. This would allow using your library in other workflows, where users may need to manage data downloads themselves.
Thank you!
Anton.
I run fastq-dump with the following parameters:
/opt/sratoolkit/fastq-dump --skip-technical --gzip --readids --read-filter pass --dumpbase --split-files --clip ${file}
(at https://edwards.sdsu.edu/research/fastq-dump/ there are good explanations of why some of them are needed), while GEOparse's default is
cmd = "fastq-dump --split-files --gzip %s --outdir %s %s"
That creates some problems. For instance, if I do not pass --readids and use a paired SRA, I get two files whose read IDs are identical, which creates problems for downstream analysis. If I do not provide --skip-technical, I get technical Illumina reads that have nothing to do with biology (like Application Read Forward -> Technical Read Forward <- Application Read Reverse - Technical Read Reverse). --read-filter pass gets rid of reads full of Ns.
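A sketch of how the hard-coded template could be made configurable (the helper name and default flag list are hypothetical; the flag names follow sra-tools' fastq-dump):

```python
import shlex

# Hypothetical helper: callers override the flag list instead of relying
# on a fixed "--split-files --gzip" template.
DEFAULT_FLAGS = [
    "--skip-technical", "--gzip", "--readids", "--read-filter", "pass",
    "--dumpbase", "--split-files", "--clip",
]

def fastq_dump_cmd(accession, outdir, flags=None):
    parts = ["fastq-dump"] + (DEFAULT_FLAGS if flags is None else list(flags))
    parts += ["--outdir", outdir, accession]
    # shlex.quote keeps paths with spaces safe when the string is run via a shell
    return " ".join(shlex.quote(p) for p in parts)
```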
Hi,
I want to annotate all samples. The pivot_and_annotate function works well for single platforms:
gse = geo.get_GEO(geo='GSE17907', how='full', destdir=download_dir)
gse.pivot_and_annotate('VALUE', gse.gpls[list(gse.gpls)[0]], 'Gene Symbol')
However, some datasets have multiple platforms. For instance, GSE17907.
So I use the merge_and_average function because of its platform filter feature. It is good because I am able to get samples for each platform separately. But unfortunately, merge_and_average does not annotate the samples.
gse.merge_and_average(d='GPL570', expression_column='VALUE', gsm_on='ID_REF', gpl_on='ID', group_by_column='ID_REF')
Is there any feature to annotate these multiple platforms? Maybe I missed something, so I just wanted to ask.
By the way, I currently annotate samples manually like this:
soft = gse.gpls[list(gse.gpls)[0]].table
if soft.columns[0] == 'ID' and 'Gene Symbol' in list(soft.columns) and 'ID_REF' == eset.index.name:
    soft = soft[['ID', 'Gene Symbol']]
    pd.merge(left=soft, right=eset, left_on='ID', right_on='ID_REF').drop(['ID'], axis=1)
When NCBI/GEO is down, I'd expect a custom exception or some kind of graceful handling, instead you get:
File "/home/user/data_refinery_foreman/surveyor/geo.py", line 222, in create_experiment_and_samples_from_api
gse = GEOparse.get_GEO(experiment_accession_code, destdir=self.get_temp_path(), how="brief", silent=True)
File "/usr/local/lib/python3.5/dist-packages/GEOparse/GEOparse.py", line 84, in get_GEO
return parse_GSE(filepath)
File "/usr/local/lib/python3.5/dist-packages/GEOparse/GEOparse.py", line 502, in parse_GSE
with utils.smart_open(filepath) as soft:
File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.5/dist-packages/GEOparse/utils.py", line 156, in smart_open
fh = fopen(filepath, mode, errors="ignore")
File "/usr/lib/python3.5/gzip.py", line 53, in open
binary_file = GzipFile(filename, gz_mode, compresslevel)
File "/usr/lib/python3.5/gzip.py", line 163, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/1/GSE11915_family.soft.gz'
which looks like a local disk error, but it isn't: it is a GEO-is-down error.
gse = GEOparse.get_GEO(filepath=DIR_PATH)
ValueError: Unknown GEO type: E:\. Available types: GSM, GSE, GPL and GDS
This error arises from the way Windows directory paths work, i.e. "\" as opposed to Linux's "/".
In GEOparse.py, line 77 is the culprit:
else:
    if geotype is None:
        geotype = filepath.split("/")[-1][:3]  # <-- this is line 77
    logger.info("Parsing %s: " % filepath)
    if geotype.upper() == "GSM":
        return parse_GSM(filepath)
    elif geotype.upper() == "GSE":
        return parse_GSE(filepath)
    elif geotype.upper() == "GPL":
        return parse_GPL(filepath)
    elif geotype.upper() == "GDS":
        return parse_GDS(filepath)
    else:
        raise ValueError(("Unknown GEO type: %s. Available types: GSM, GSE, "
                          "GPL and GDS.") % geotype.upper())
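A portable fix would lean on os.path instead of splitting on "/" by hand. A sketch (the function name is hypothetical):

```python
import os

# Normalise either separator style, then take the basename, so a Windows
# path like "E:\\data\\GSE1563.soft.gz" resolves the same as a POSIX one.
def geotype_from_path(filepath):
    normalised = filepath.replace("\\", "/")
    return os.path.basename(normalised)[:3].upper()
```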
When I download GSM supplementary files by:
gsm = cast(GSM, GEOparse.get_GEO("GSM1944823", destdir="/tmp"))
files = gsm.download_supplementary_files("/tmp", False, "[email protected]")
I get the following error
13-Feb-2018 18:02:51 DEBUG utils - Directory /tmp/Supp_GSM1944823_MG_UKJ_30_190214_1HS_brain already exists. Skipping.
13-Feb-2018 18:02:51 INFO utils - Downloading NONE to /tmp/Supp_GSM1944823_MG_UKJ_30_190214_1HS_brain/NONE
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/GEOparse/GEOTypes.py", line 443, in download_supplementary_files
utils.download_from_url(metavalue[0], download_path)
File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/GEOparse/utils.py", line 114, in download_from_url
destination_path))
File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/wgetter.py", line 272, in download
url = opener.open(link)
File "/usr/lib/python3.6/urllib/request.py", line 511, in open
req = Request(fullurl, data)
File "/usr/lib/python3.6/urllib/request.py", line 329, in __init__
self.full_url = url
File "/usr/lib/python3.6/urllib/request.py", line 355, in full_url
self._parse()
File "/usr/lib/python3.6/urllib/request.py", line 384, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'NONE'
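GEO marks absent supplementary files with the literal string NONE, which is not a URL, so a guard before the download loop would avoid this crash. A minimal sketch with made-up metadata (the keys and URL are invented):

```python
# metavalue entries come from the supplementary_file metadata keys; "NONE"
# is GEO's placeholder for "no file attached", not a downloadable URL.
supplementary = {
    "supplementary_file_1": ["ftp://example.org/file_1.bw"],
    "supplementary_file_2": ["NONE"],
}
urls = [
    value
    for values in supplementary.values()
    for value in values
    if value.strip().upper() != "NONE"
]
```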
GSE14755
File already exist: using local version.
Parsing ../data/geo/GSE14755_family.soft.gz:
UnicodeDecodeError Traceback (most recent call last)
in ()
5 print(id_)
6
----> 7 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
506 elif entry_type == "PLATFORM":
507 is_data, data_group = next(groupper)
--> 508 gpls[entry_name] = parse_GPL(data_group, entry_name)
509 elif entry_type == "DATABASE":
510 is_data, data_group = next(groupper)
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GPL(filepath, entry_name, silent)
383 gpl_soft.append(line)
384 else:
--> 385 for line in filepath:
386 if "_table_begin" in line or (line[0] not in ("^", "!", "#")):
387 has_table = True
/home/k/Jumis/tools/anaconda/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 5280: invalid continuation byte
GSE5336
File already exist: using local version.
Parsing ../data/geo/GSE5336_family.soft.gz:
UnicodeDecodeError Traceback (most recent call last)
in ()
5 print(id_)
6
----> 7 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
503 elif entry_type == "SAMPLE":
504 is_data, data_group = next(groupper)
--> 505 gsms[entry_name] = parse_GSM(data_group, entry_name)
506 elif entry_type == "PLATFORM":
507 is_data, data_group = next(groupper)
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSM(filepath, entry_name)
303 soft = []
304 has_table = False
--> 305 for line in filepath:
306 if "_table_begin" in line or (line[0] not in ("^", "!", "#")):
307 has_table = True
/home/k/Jumis/tools/anaconda/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 2897: invalid start byte
Thanks
GSE52666
File already exist: using local version.
Parsing ../data/geo/GSE52666_family.soft.gz:
AssertionError Traceback (most recent call last)
in ()
6
7 # Download and/or load GEO dataset
----> 8 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)
9
10 print('\tannotation.head(): {}'.format(gse.phenotype_data))
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
518 gpls=gpls,
519 gsms=gsms,
--> 520 database=database)
521 return gse
522
/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOTypes.py in __init__(self, name, metadata, gpls, gsms, database)
590
591 for gsm_name, gsm in iteritems(gsms):
--> 592 assert isinstance(gsm, GSM), "All GSMs should be of type GSM"
593 for gpl_name, gpl in iteritems(gpls):
594 assert isinstance(gpl, GPL), "All GPLs should be of type GPL"
AssertionError: All GSMs should be of type GSM
Could you take a look at this and let me know what the issue is?
Thanks,
Hi,
I sometimes want to parse large GPL files (e.g., GPL570), but my PC runs out of memory. So I'd like to be able to parse a GPL file partially by specifying which GSM samples to parse from it. If you agree with the idea, I will make a pull request for this feature.
Thanks
Is there a scientific paper describing this library?
I would like to cite GEOparse, but there is no recommended way in the README.
I got the following error while parsing GSM2795971 SOFT file:
File "pandas/_libs/parsers.pyx", line 565, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Great package, thanks!
I lost a bit of time tracking down a fastq-dump error with the download_SRA() function. Granted,
"09-Apr-2019 12:48:02 ERROR sra_downloader - fastq-dump command not found"
is pretty good, but maybe it would be nice to see a proper exception raised before the 15 GB file download?
Again, awesome pkg, thanks!
Best,
John
The subset_types I encounter vary from parse to parse. Although I have not come across the ones encoded as set(['individual', 'disease_state']), I have had to make use of set(['dose', 'agent', 'time', 'gender']). So, in parse_GDS_columns, I modified the code to start out with an empty subset_ids and collect everything on the fly. This turned out nicely: each subset_type was accounted for in each sample in the end, so no rows were dropped from GDS.columns during GDS.__init__().
fastq-dump is very slow. Maybe using https://github.com/rvalieris/parallel-fastq-dump could speed things up.
GEOparse doesn't seem to provide a way to clean up after itself. I'd like to be able to delete all of the local data that has been downloaded and created once I'm finished.
I cannot download the majority of GEO metadata files! I think that NCBI has changed the structure of their URLs again :(
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963773/soft/GDS301963773.soft.gz
GDS301963934
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963934/soft/GDS301963934.soft.gz to XXX/GDS301963934.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963934/soft/GDS301963934.soft.gz
GDS301385886
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301385nnn/GDS301385886/soft/GDS301385886.soft.gz to XXX/GDS301385886.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301385nnn/GDS301385886/soft/GDS301385886.soft.gz
GDS302278020
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302278nnn/GDS302278020/soft/GDS302278020.soft.gz to XXX/GDS302278020.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302278nnn/GDS302278020/soft/GDS302278020.soft.gz
GDS302478025
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302478nnn/GDS302478025/soft/GDS302478025.soft.gz to XXX/GDS302478025.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302478nnn/GDS302478025/soft/GDS302478025.soft.gz
GDS301172854
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301172nnn/GDS301172854/soft/GDS301172854.soft.gz to XXX/GDS301172854.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301172nnn/GDS301172854/soft/GDS301172854.soft.gz
GDS301192685
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301192nnn/GDS301192685/soft/GDS301192685.soft.gz to XXX/GDS301192685.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301192nnn/GDS301192685/soft/GDS301192685.soft.gz
GDS302483410
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302483nnn/GDS302483410/soft/GDS302483410.soft.gz to XXX/GDS302483410.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302483nnn/GDS302483410/soft/GDS302483410.soft.gz
GDS302048642
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302048nnn/GDS302048642/soft/GDS302048642.soft.gz to XXX/GDS302048642.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302048nnn/GDS302048642/soft/GDS302048642.so
This library prefers SOFT; I would love the option to download MINiML.
Hi,
I call the get_GEO function twice in my script: the first time to get only the sample names (how="brief"), the second time for the full download.
However, the second call does not work as expected because it reuses the existing file.
Is there any way to do that?
Thanks.
I had an FTP error when trying to get_GEO for "GSE122295",
and then I realized it is because the series is still private...
Is there a way to know if a GSE is private?
In gsm.download_supplementary_files the email field looks optional (as email=None by default), but in reality it crashes with "Exception: You have to provide valid e-mail", which means this field is in fact mandatory. I suggest either making email mandatory or making it truly optional and allowing SRA downloads without an email.
Currently, in order to generate the phenotype data like the pData from GEOquery, one has to do the following:
pheno_data = {}
for gsm_name, gsm in gse.gsms.items():
    print(gsm_name, gsm)
    pheno_data[gsm_name] = {key: value[0] for key, value in gsm.metadata.items()}
pheno_data = pd.DataFrame(pheno_data).T
This should be a function.
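A sketch of what such a helper could look like (the function name is hypothetical, and the pandas DataFrame step is left to the caller so the example stays dependency-free; SimpleNamespace objects stand in for gse.gsms):

```python
from types import SimpleNamespace

# Flatten each GSM's metadata (key -> list of values) into key -> first
# value, one entry per sample; the result can be fed to pd.DataFrame(...).T.
def phenotype_data(gsms):
    return {
        name: {key: values[0] for key, values in gsm.metadata.items()}
        for name, gsm in gsms.items()
    }

# Stand-in for a parsed series with one sample.
gsms = {"GSM1": SimpleNamespace(metadata={"title": ["sample 1"], "platform_id": ["GPL570"]})}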
The title says it all. Thank you.
Great library! Just wondering if it is possible to allow bulk downloading of the GEO dataset from the outset rather than when it is queried. I want to speed up development times, and having to download the files as they are needed takes up 90% of the analysis time. It would be great if there was a way to just dump all the GSE files into one folder. I understand this is quite large, but if I have the space -- can this be added? I looked at ftp://ftp.ncbi.nlm.nih.gov/geo/series/ but I just want the _family.soft.gz files, as they are used by GEOparse, in a single folder.
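For scripted bulk fetching outside GEOparse, the family SOFT path can be composed directly, since GEO's FTP layout buckets series by dropping the last three digits of the accession. A sketch (the function name is hypothetical):

```python
# GEO groups series into "nnn" buckets, e.g. GSE1563 lives under
# series/GSE1nnn/. The bucket is the accession with its last three digits
# replaced by "nnn".
def family_soft_url(accession):
    stem = accession[3:]  # numeric part of the accession
    bucket = accession[:3] + (stem[:-3] if len(stem) > 3 else "") + "nnn"
    return (
        "ftp://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/soft/%s_family.soft.gz"
        % (bucket, accession, accession)
    )
```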
Hi,
The following code from the first section of the tutorial is broken on my machine. I'm running Anaconda Python 3.6 on Windows 10.
import GEOparse
gse = GEOparse.get_GEO(filepath="./GSE1563.soft.gz")
Produces the following error
12-Nov-2018 15:40:26 INFO GEOparse - Parsing ./GSE1563.soft.gz:
Traceback (most recent call last):
File "C:/Users/Ciaran/Box Sync/MesiSTRAT/PublicDataSetSearch/ReFormatShittyNCBIOutput.py", line 95, in <module>
gse = GEOparse.get_GEO(filepath="./GSE1563.soft.gz")
File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\GEOparse.py", line 84, in get_GEO
return parse_GSE(filepath)
File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\GEOparse.py", line 502, in parse_GSE
with utils.smart_open(filepath) as soft:
File "C:\ProgramData\Anaconda2\lib\contextlib.py", line 17, in __enter__
return self.gen.next()
File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\utils.py", line 154, in smart_open
fh = fopen(filepath, mode)
File "C:\ProgramData\Anaconda2\lib\gzip.py", line 34, in open
return GzipFile(filename, mode, compresslevel)
File "C:\ProgramData\Anaconda2\lib\gzip.py", line 94, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
ValueError: Invalid mode ('rtb')
Process finished with exit code 1
I'm assuming this was developed primarily by UNIX users, and thus some substitutions for characters that are illegal in Windows filenames are missing. Specifically, in the GEOTypes.py file on line 403,
I modified the substitution to this for my own purposes: re.sub(r'[\s\*\?\(\),\.\:\%\|\"\<\>]
Whenever I try to use it, it downloads everything but in the end tells me:
Converting to /home/antonkulaga/rna-seq/containers/geoparse/GSM1696283/Supp_GSM1696283_Transgenic_Control_L4_A/SRR2040662_*.fasta.gz
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.5/site-packages/GEOparse/GEOTypes.py", line 352, in download_supplementary_files
self.download_SRA(email, filetype=sra_filetype, directory=directory)
File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.5/site-packages/GEOparse/GEOTypes.py", line 463, in download_SRA
if "command not found" in perr:
TypeError: a bytes-like object is required, not 'str'
It probably tries to find the SRA tools in the PATH. So it would be better to simply say that the SRA tools are not in the PATH.
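The TypeError itself comes from comparing a str against the bytes that Popen returns on its stderr pipe; decoding first (or opening the pipe in text mode) fixes the check. A reproduction with made-up stderr content:

```python
# Popen.communicate() returns bytes by default, and `"..." in b"..."`
# raises TypeError. Decode before the substring check.
perr = b"/bin/sh: fastq-dump: command not found"
missing = "command not found" in perr.decode("utf-8", errors="replace")
```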
When I call get_GEO() with silent=True I expect no output at all, but the result is identical to that obtained with silent=False. Even if I redirect sys.stdout and sys.stderr to files, I still see the same output. Is it possible to really silence the output of get_GEO?
Python 3.5.2
GEOparse 0.1.10
macOS 10.12.4
For bioinformatics pipelines it is useful to get name -> file pairs for all downloaded supplementary files. My suggestion is to return a dictionary of name -> path pairs from gsm.download_supplementary_files instead of the current None.
I've enabled silent=True when calling GEOparse.get_GEO, but I still get the messages:
Parsing downloads/GSE72400_family.soft.gz:
- DATABASE : GeoMiame
- SERIES : GSE72400
- PLATFORM : GPL18573
- SAMPLE : GSM1861834
- SAMPLE : GSM1861835
- SAMPLE : GSM1861836
- SAMPLE : GSM1861837
- SAMPLE : GSM1861838
- SAMPLE : GSM1861839
- SAMPLE : GSM1861840
- SAMPLE : GSM1861841
- SAMPLE : GSM1861842
- SAMPLE : GSM1861843
Thank you very much for the effort made in providing this useful package.
I would like to request the following feature: that the encoding can be specified when calling gzip.open() or open() in smart_open().
I am currently using GEOparse 2.0.1 with Python 3.8.3 on Windows 10. I have successfully downloaded GSE files from GEO (e.g. GSE134809_family.soft) and have also used GEOparse to read the .soft (or .soft.gz) files stored locally on my computer.
I have discovered that some special characters in the .soft files are not being interpreted correctly, because gzip.open() or open() uses Python's default encoder ('cp1252' on my computer) instead of 'utf-8', even though the .soft files use 'utf-8' encoding. Because smart_open() ignores errors when reading the file with fh = fopen(filepath, mode, errors="ignore"), the special characters do not prevent the file from being read, but they are not interpreted correctly.
The types of characters that I've found to be problematic are letters with accents and some punctuation marks, e.g. Naïve, 4°C, 3’ prime, “union” (those single and double quotation marks are not the standard ones even though they look similar).
This could be solved by allowing the encoding argument to be passed to gzip.open() or open() when calling smart_open():
@contextmanager
def smart_open(filepath, encoding):
    """Open file intelligently depending on the source and python version.

    Args:
        filepath (:obj:`str`): Path to the file.
        encoding (:obj:`str`): Encoding to use when reading the file.

    Yields:
        Context manager for file handle.
    """
    if filepath[-2:] == "gz":
        mode = "rt"
        fopen = gzip.open
    else:
        mode = "r"
        fopen = open
    if sys.version_info[0] < 3:
        fh = fopen(filepath, mode)
    else:
        fh = fopen(filepath, mode, encoding=encoding)
    try:
        yield fh
    except IOError:
        fh.close()
    finally:
        fh.close()
Alternatively, **kwargs could be passed through smart_open() and into gzip.open() and open().
Additionally, it would be beneficial if the errors were not ignored when reading the files, so that the user can be aware of them. This could be done by using a try/except block to attempt to open the file, and if errors are raised, display them to the user and then try to read the file again but this time ignoring errors. This would mean that the file would still be read but the user would be aware that there was a problem.
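A sketch of that behaviour (the function name is hypothetical): try a strict decode first, and on failure fall back to errors="ignore" while returning the exception so the caller can warn about it:

```python
import gzip

# Strict first pass raises UnicodeDecodeError on bad bytes; the fallback
# still reads the file but hands the error back for the caller to report.
def open_text(path, encoding="utf-8"):
    fopen = gzip.open if path.endswith("gz") else open
    try:
        with fopen(path, "rt", encoding=encoding) as handle:
            return handle.read(), None
    except UnicodeDecodeError as exc:
        with fopen(path, "rt", encoding=encoding, errors="ignore") as handle:
            return handle.read(), exc
```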
Sorry for being dumb, but I wondered if you could give me some feedback on my code?
https://gist.github.com/CholoTook/2eaed8009e48e65bc1b1b65111320a59
I always worry that I'm not using the tool 'canonically' or that I've overlooked some simple features.
I'd really appreciate you giving my code a once-over and letting me know what I've done wrong.
Huge thanks!
Dan.
For instance, I want to get the GSE19826 sample names:
GEOparse.get_GEO(geo='GSE19826', how='quick')
I have changed the how argument to 'quick', but it still downloads the full dataset files, which takes time for large datasets. Is there any way to download only the sample names and descriptions?
parse_GSE will return a GSE object with the wrong name.
from_csv is deprecated, so it creates a lot of deprecation warnings in the code. Additionally, read_csv is 46x to 490x faster than from_csv. There are small changes to the interface, described here.
When I do the following:
gse = GEOparse.get_GEO(filepath="GPL17021_family.soft.gz")
print(type(gse))
it prints out a long list of DEBUG messages, like:
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189087
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189088
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189089
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189090
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189091
I guess it didn't read my soft file correctly. Or maybe it is because I don't know how to use it yet.
If we are downloading with keep_sra=False and sra_format set to fastq, it makes sense to check for fastq files and not download the SRAs if the fastq files are already available and forcerewrite=False.
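A sketch of that check (the file-name pattern and helper name are assumptions about how fastq-dump names its split outputs):

```python
import os

# Skip the .sra download when the expected fastq.gz outputs already exist
# and a forced rewrite was not requested.
def fastqs_present(directory, srr, paired=True):
    names = (
        ["%s_1.fastq.gz" % srr, "%s_2.fastq.gz" % srr]
        if paired else ["%s.fastq.gz" % srr]
    )
    return all(os.path.exists(os.path.join(directory, name)) for name in names)
```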
I am trying to download the following RNA-Seq dataset https://www.ncbi.nlm.nih.gov/sra/SRX313696[accn] with GEOparse.
However, it tells me that the SRR filetype is not known.
Hi, just a minor mistake: write() is omitted.
https://github.com/guma44/GEOparse/blob/master/GEOparse/GEOparse.py#L244
The docs mention there is a
GEOparse.logger.set_verbosity('ERROR')
however, this causes:
AttributeError: 'Logger' object has no attribute 'set_verbosity'
This can be side-stepped with:
import logging
GEOparse.logger.setLevel(logging.getLevelName("ERROR"))
Thank you for the package.
I am quite new to GEOparse and have been trying to figure out the basics of the package. I tried to implement the initial example provided in the documentation and get an UnboundLocalError. A screenshot of the error is as follows:
Python Version: 3.8.5
GEOparse Version: 2.0.2
Any leads about how to overcome this problem would be really helpful. Thanks!