Comments (11)

raypereda-gr avatar raypereda-gr commented on June 3, 2024 1

I will look into it further in a couple of days. In the meantime, you can help me with two things. First, trim the file down to the smallest file that still gives the same error. You will need to work with the XML, not the zipped file. Second, try downloading the file in various ways, including a manual download, and see if the file changes depending on how it was downloaded.
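A minimal sketch of one way to trim a MEDLINE file down to a small reproducer, assuming the file parses with the stdlib xml.etree at all and that the root is <PubmedArticleSet> with <PubmedArticle> children (an illustration, not pubmed_parser code; the function name and paths are hypothetical):

```python
import gzip
import xml.etree.ElementTree as ET

def trim_articles(src_path, dst_path, keep=1):
    """Keep only the first `keep` <PubmedArticle> elements of a MEDLINE file."""
    opener = gzip.open if src_path.endswith('.gz') else open
    with opener(src_path, 'rb') as f:
        tree = ET.parse(f)
    root = tree.getroot()  # expected root: <PubmedArticleSet>
    for article in root.findall('PubmedArticle')[keep:]:
        root.remove(article)
    tree.write(dst_path, encoding='utf-8', xml_declaration=True)
```

Bisecting with different values of `keep` can then narrow the failure down to a single article.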

from pubmed_parser.

titipata avatar titipata commented on June 3, 2024 1

Thank you so much @raypereda-gr for helping out!

titipata avatar titipata commented on June 3, 2024

Yes, it seems like the file that you're putting in is not parsable by lxml.

srishti-git1110 avatar srishti-git1110 commented on June 3, 2024

Thanks for taking the time to answer.
So you are saying the parser won't work for files from the year 2022? Or is there some other issue apart from the file's date? It works just fine for a 2017 file (downloaded from the exact same source) with the same .xml.gz extension.

If the year is the only issue, do you have any idea up to which year the parser will work?

titipata avatar titipata commented on June 3, 2024

Oh, if it works up to 2017, it might be a problem with the file format. I don't have much time to check the format, but there might be an issue there!

raypereda-gr avatar raypereda-gr commented on June 3, 2024

In the last year, I have used parse_medline_xml() on all of the PubMed XML files without error. In general, I use the xml.gz file format but I have tested the .xml file too. I recommend stepping through the code while parsing that file in a debugger and isolating the error.

srishti-git1110 avatar srishti-git1110 commented on June 3, 2024

Thanks @raypereda-gr.
Yes, I'm also using it with a .xml.gz file. It's a 2022 file.

I tried debugging; however, I'm unable to figure out the error. Can you please help?

> c:\users\hp\anaconda3\envs\test\lib\tokenize.py(335)find_cookie()
333 if filename is not None:
334 msg = '{} for {!r}'.format(msg, filename)
--> 335 raise SyntaxError(msg)
336
337 match = cookie_re.match(line_string)

ERROR! Session/line number was not unique in database. History logging moved to new session 157
ipdb> w
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\async_helpers.py(78)_pseudo_sync_runner()
76 """
77 try:
---> 78 coro.send(None)
79 except StopIteration as exc:
80 return exc.value

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(3185)run_cell_async()
3183 interactivity = 'async'
3184
-> 3185 has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
3186 interactivity=interactivity, compiler=compiler, result=result)
3187

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(3396)run_ast_nodes()
3394 if result:
3395 result.error_before_exec = sys.exc_info()[1]
-> 3396 self.showtraceback()
3397 return True
3398

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(2063)showtraceback()
2061 # Though this won't be called by syntax errors in the input
2062 # line, there may be SyntaxError cases with imported code.
-> 2063 self.showsyntaxerror(filename, running_compiled_code)
2064 elif etype is UsageError:
2065 self.show_usage_error(value)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(2129)showsyntaxerror()
2127 # If the error occurred when executing compiled code, we should provide full stacktrace.
2128 elist = traceback.extract_tb(last_traceback) if running_compiled_code else []
-> 2129 stb = self.SyntaxTB.structured_traceback(etype, value, elist)
2130 self._showtraceback(etype, value, stb)
2131

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1403)structured_traceback()
1401 and isinstance(value.lineno, int):
1402 linecache.checkcache(value.filename)
-> 1403 newtext = linecache.getline(value.filename, value.lineno)
1404 if newtext:
1405 value.text = newtext

c:\users\hp\anaconda3\envs\test\lib\linecache.py(30)getline()
28 Update the cache if it doesn't contain an entry for this file already."""
29
---> 30 lines = getlines(filename, module_globals)
31 if 1 <= lineno <= len(lines):
32 return lines[lineno - 1]

c:\users\hp\anaconda3\envs\test\lib\linecache.py(46)getlines()
44
45 try:
---> 46 return updatecache(filename, module_globals)
47 except MemoryError:
48 clearcache()

c:\users\hp\anaconda3\envs\test\lib\linecache.py(136)updatecache()
134 return []
135 try:
--> 136 with tokenize.open(fullname) as fp:
137 lines = fp.readlines()
138 except OSError:

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(394)open()
392 buffer = _builtin_open(filename, 'rb')
393 try:
--> 394 encoding, lines = detect_encoding(buffer.readline)
395 buffer.seek(0)
396 text = TextIOWrapper(buffer, encoding, line_buffering=True)

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(371)detect_encoding()
369 return default, []
370
--> 371 encoding = find_cookie(first)
372 if encoding:
373 return encoding, [first]

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(335)find_cookie()
333 if filename is not None:
334 msg = '{} for {!r}'.format(msg, filename)
--> 335 raise SyntaxError(msg)
336
337 match = cookie_re.match(line_string)

srishti-git1110 avatar srishti-git1110 commented on June 3, 2024

@raypereda-gr can you also please let me know the source and code you are downloading the files from?

Here is my code. I'm afraid incorrect files may be getting downloaded on my end, which would explain the errors.

import sys
from subprocess import Popen, PIPE

save_loc = 'Desktop/scratch/'

def download_ftp_files(link, save_loc, verbose=True):
    """Downloads all ftp files from the supplied link."""
    process = Popen(['wget', link + "*"],
                    stdout=PIPE, cwd=save_loc)

    if verbose:
        # stdout is a bytes stream, so the sentinel must be b'' and each line decoded
        for line in iter(process.stdout.readline, b''):
            sys.stdout.write(line.decode())

download_ftp_files('ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/', save_loc=save_loc + 'baseline/')
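As a sanity check on the download step, one way to confirm nothing is missing or misnamed is to compare the local directory against the server's listing (a sketch using the stdlib ftplib; the live calls are commented out because they need network access):

```python
from ftplib import FTP

def xml_gz_names(listing):
    """Filter a directory listing down to the .xml.gz archives."""
    return sorted(name for name in listing if name.endswith('.xml.gz'))

# Against the live server:
# ftp = FTP('ftp.ncbi.nlm.nih.gov')
# ftp.login()
# ftp.cwd('pubmed/baseline')
# expected = xml_gz_names(ftp.nlst())
# then compare `expected` with os.listdir(save_loc + 'baseline/')
```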

srishti-git1110 avatar srishti-git1110 commented on June 3, 2024

@raypereda-gr Thanks very much for offering to help.

As you asked to work with the .xml and not the .xml.gz (zipped) file, is that required for trimming the file down, or for parsing it? I ask because I was able to parse a .xml.gz file using parse_medline_xml().

To download manually as you suggested, I navigated to the exact same directory (webpage) online that the files were being downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.

Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with. I strongly suspect the one I'm trying to parse now isn't in the format the parser expects. However, I might be wrong here. Sorry for bothering you so much; I'm very new to this, hence the naivety.

raypereda-gr avatar raypereda-gr commented on June 3, 2024

As you asked to work with the .xml and not the .xml.gz (zipped) file, is that required for trimming the file down, or for parsing it? I ask because I was able to parse a .xml.gz file using parse_medline_xml().

That is the same function that I use:

list_of_dictionary = pp.parse_medline_xml(pubmed_xml_filename, year_info_only=False)

That function will accept either a .xml or a .xml.gz file. You don't need to worry about unzipping explicitly; the function will handle that if needed.
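For intuition, transparent support for both extensions is usually just a dispatch on the filename suffix; a minimal sketch of the general technique (an illustration, not pubmed_parser's actual implementation):

```python
import gzip

def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing on the fly for .gz files."""
    if path.endswith('.gz'):
        return gzip.open(path, 'rb')
    return open(path, 'rb')
```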

Since you have been able to parse the .xml.gz file, we can be confident that the problem is with the .xml file. How exactly did you unzip it? Here's the ls output for the unzipped file that I created on a Mac using the pre-installed unzip tool. I also counted the number of lines.
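To rule out a bad unzip without relying on a platform tool, one option is to count lines by streaming through Python's stdlib gzip and compare with wc on the unzipped copy (a sketch; the filename matches the listing below):

```python
import gzip

def count_lines_gz(path):
    """Count the lines of a .gz file by streaming through gzip."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return sum(1 for _ in f)

# count_lines_gz('medline17n0116.xml.gz') should match `wc -l medline17n0116.xml`
```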

$ ls -l medline17n0116.xml
-rw-r--r--@ 1 raypereda  staff  188634668 Mar 19 16:41 medline17n0116.xml

$ wc *.xml
 4572705 10113718 188634668 medline17n0116.xml

To download manually as you suggested, I tried to navigate to the exact same directory (webpage) online where the files were getting downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.

Good. That means we can be confident that the problem is not with the download. I suspect something is off with the unzipping.

Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with. I strongly suspect the one I'm trying to parse now isn't in the format the parser expects. However, I might be wrong here.

OK, then why not just parse the .xml.gz file directly? I would suggest not worrying about unzipping the files.

srishti-git1110 avatar srishti-git1110 commented on June 3, 2024

Thanks @raypereda-gr !

Yes, I was working with the zipped (.xml.gz) file only; it still wasn't working.

I made a small change by passing the keyword argument path when calling the function, like so:
pp.parse_medline_xml(path=pubmed_xml_filepath)

instead of calling it positionally:
pp.parse_medline_xml(pubmed_xml_filepath)

and it worked. Anyway, thanks a lot for helping patiently, @titipata @raypereda-gr.
Best,
Srishti
