Comments (11)
I will look into it further in a couple of days. In the meantime, you can help me with two things. First, trim the file down to the smallest file that gives the same error. You will need to work with the XML, not the zipped file. Second, try downloading the file in various ways, including a manual download, and see whether the file changes depending on how it was downloaded.
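A natural first step toward the "smallest failing file" suggestion is to locate where parsing first breaks. This is a minimal sketch using the stdlib parser as a stand-in for lxml; the filename in the comment is hypothetical.

```python
# Sketch: report where XML parsing first fails, as a starting point for
# trimming the file down. Uses the stdlib parser as a stand-in for lxml.
import gzip
import xml.etree.ElementTree as ET

def first_parse_error(path):
    """Return None if the file parses cleanly, else (line, column) of the error."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        try:
            ET.parse(fh)
            return None
        except ET.ParseError as err:
            return err.position  # (line, column) of the first bad token

# e.g. first_parse_error("pubmed22n0001.xml.gz")  # hypothetical filename
```

Everything before the reported position parses cleanly, so the failing region is where trimming should focus.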
from pubmed_parser.
Thank you so much @raypereda-gr for helping out!
Yes, it seems like the file that you're putting in is not parsable by lxml.
Thanks for taking time to answer.
So, are you saying the parser won't work for files from the year 2022? Or is there some issue other than the file's date?
Because it works just fine for a 2017 file (downloaded from the exact same source) with the same .xml.gz extension.
If the year is the only issue, do you have any idea up to which year/date the parser will work?
Oh, if it works up to 2017, it might be a problem with the file format. I don't have much time to check the format, but there might be an issue there!
In the last year, I have used parse_medline_xml() on all of the PubMed XML files without error. In general, I use the xml.gz file format but I have tested the .xml file too. I recommend stepping through the code while parsing that file in a debugger and isolating the error.
Thanks @raypereda-gr.
Yes, I'm also using it with a .xml.gz file. It's a 2022 file.
I tried debugging - However, I'm unable to figure out the error. Can you please help?
> c:\users\hp\anaconda3\envs\test\lib\tokenize.py(335)find_cookie()
333 if filename is not None:
334 msg = '{} for {!r}'.format(msg, filename)
--> 335 raise SyntaxError(msg)
336
337 match = cookie_re.match(line_string)
ERROR! Session/line number was not unique in database. History logging moved to new session 157
ipdb> w
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\async_helpers.py(78)_pseudo_sync_runner()
76 """
77 try:
---> 78 coro.send(None)
79 except StopIteration as exc:
80 return exc.value
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(3185)run_cell_async()
3183 interactivity = 'async'
3184
-> 3185 has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
3186 interactivity=interactivity, compiler=compiler, result=result)
3187
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(3396)run_ast_nodes()
3394 if result:
3395 result.error_before_exec = sys.exc_info()[1]
-> 3396 self.showtraceback()
3397 return True
3398
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(2063)showtraceback()
2061 # Though this won't be called by syntax errors in the input
2062 # line, there may be SyntaxError cases with imported code.
-> 2063 self.showsyntaxerror(filename, running_compiled_code)
2064 elif etype is UsageError:
2065 self.show_usage_error(value)
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(2129)showsyntaxerror()
2127 # If the error occurred when executing compiled code, we should provide full stacktrace.
2128 elist = traceback.extract_tb(last_traceback) if running_compiled_code else []
-> 2129 stb = self.SyntaxTB.structured_traceback(etype, value, elist)
2130 self._showtraceback(etype, value, stb)
2131
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1403)structured_traceback()
1401 and isinstance(value.lineno, int):
1402 linecache.checkcache(value.filename)
-> 1403 newtext = linecache.getline(value.filename, value.lineno)
1404 if newtext:
1405 value.text = newtext
c:\users\hp\anaconda3\envs\test\lib\linecache.py(30)getline()
28 Update the cache if it doesn't contain an entry for this file already."""
29
---> 30 lines = getlines(filename, module_globals)
31 if 1 <= lineno <= len(lines):
32 return lines[lineno - 1]
c:\users\hp\anaconda3\envs\test\lib\linecache.py(46)getlines()
44
45 try:
---> 46 return updatecache(filename, module_globals)
47 except MemoryError:
48 clearcache()
c:\users\hp\anaconda3\envs\test\lib\linecache.py(136)updatecache()
134 return []
135 try:
--> 136 with tokenize.open(fullname) as fp:
137 lines = fp.readlines()
138 except OSError:
c:\users\hp\anaconda3\envs\test\lib\tokenize.py(394)open()
392 buffer = _builtin_open(filename, 'rb')
393 try:
--> 394 encoding, lines = detect_encoding(buffer.readline)
395 buffer.seek(0)
396 text = TextIOWrapper(buffer, encoding, line_buffering=True)
c:\users\hp\anaconda3\envs\test\lib\tokenize.py(371)detect_encoding()
369 return default, []
370
--> 371 encoding = find_cookie(first)
372 if encoding:
373 return encoding, [first]
c:\users\hp\anaconda3\envs\test\lib\tokenize.py(335)find_cookie()
333 if filename is not None:
334 msg = '{} for {!r}'.format(msg, filename)
--> 335 raise SyntaxError(msg)
336
337 match = cookie_re.match(line_string)
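Note that every frame in the trace above belongs to IPython's own traceback display (showtraceback, linecache, tokenize), so the parser's actual exception is hidden. One way to surface it is to run the call outside IPython and format the exception yourself; a minimal, generic sketch (the pubmed_parser call in the comment is the intended use, with a hypothetical filename):

```python
# All the frames above are IPython display machinery, not the parser.
# Capturing the exception directly surfaces the real error.
import traceback

def capture_real_error(fn, *args, **kwargs):
    """Run fn; return its formatted traceback if it raises, else None."""
    try:
        fn(*args, **kwargs)
        return None
    except Exception:
        return traceback.format_exc()

# Intended use (file name is hypothetical):
# import pubmed_parser as pp
# print(capture_real_error(pp.parse_medline_xml, "pubmed22n1115.xml.gz"))
```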
@raypereda-gr can you also please let me know the source and the code you use to download the files?
Here is my code; I'm afraid incorrect files may be getting downloaded on my end, causing the errors.
import sys
from subprocess import Popen, PIPE, STDOUT

save_loc = 'Desktop/scratch/'

def download_ftp_files(link, save_loc, verbose=True):
    """Download all FTP files from the supplied link."""
    process = Popen(['wget', link + "*"],
                    stdout=PIPE, stderr=STDOUT,  # wget logs progress to stderr
                    cwd=save_loc, text=True)     # text mode so readline yields str
    if verbose:
        for line in iter(process.stdout.readline, ''):
            sys.stdout.write(line)

download_ftp_files('ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/', save_loc=save_loc + 'baseline/')
@raypereda-gr Thanks very much for offering to help.
As you asked me to work with the .xml rather than the .xml.gz (zipped) file: is that required for trimming the file down, or for parsing it? I ask because I was able to parse an .xml.gz file using parse_medline_xml().
To download manually as you suggested, I navigated to the exact same online directory (webpage) the files were being downloaded from. They are the same as the ones downloaded by the code I attached in a previous comment.
Further, as I wrote in the original issue, this is a file that parse_medline_xml() works perfectly with.
I strongly suspect the file I'm trying to parse now isn't in the format the parser expects. However, I might be wrong here. Sorry for bothering you so much; I'm very new to this, hence the naivety.
As you asked to work with the .xml and not the .xml.gz (zipped) file, is it required for trimming the file down or to parse it? Asking because I was able to parse a .xml.gz file using parse_medline_xml().
That is the same function that I use:
list_of_dictionary = pp.parse_medline_xml(pubmed_xml_filename, year_info_only=False)
That function will accept a .xml or a .xml.gz file. You don't need to worry about unzipping explicitly; the function will handle that if needed.
Since you have been able to parse the .xml.gz file, we can be confident that the problem is with the .xml file. How exactly did you unzip it? Here's the ls output of the unzipped file that I created on a Mac using the pre-installed unzip tool. I also counted the number of lines.
$ ls -l medline17n0116.xml
-rw-r--r--@ 1 raypereda staff 188634668 Mar 19 16:41 medline17n0116.xml
$ wc *.xml
4572705 10113718 188634668 medline17n0116.xml
To download manually as you suggested, I tried to navigate to the exact same directory (webpage) online where the files were getting downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.
Good. That means we can be confident that the problem is not with the download. I suspect something is off with the unzipping.
Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with.
I strongly feel the one I'm trying to parse now isn't in the format the parser is for. However, I might be wrong here.
Ok, why not just parse the .xml.gz file? I would suggest not worrying about unzipping the files.
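One quick way to test the unzipping suspicion: every real gzip file starts with the magic bytes 0x1f 0x8b, so a truncated download or an HTML error page saved with a .xml.gz name fails the check below. This is a sketch; the filename in the comment is hypothetical.

```python
# Sanity-check a downloaded .xml.gz: real gzip data starts with 0x1f 0x8b.
# A corrupt download or a mislabeled HTML error page will fail this test.
def looks_like_gzip(path):
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"

# e.g. looks_like_gzip("pubmed22n1115.xml.gz")  # hypothetical filename
```

If the check fails, the problem is the download rather than the parser.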
Thanks @raypereda-gr !
Yes, I was working with the zipped (.xml.gz) file only; it still wasn't working.
I made a small change: calling the function with the keyword argument path, like so -
pp.parse_medline_xml(path=pubmed_xml_filepath)
instead of positionally -
pp.parse_medline_xml(pubmed_xml_filepath)
and then it worked. Anyway, thanks a lot for helping patiently, @titipata @raypereda-gr.
Best,
Srishti