
Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset


Pubmed Parser is a Python library for parsing the PubMed Open-Access (OA) subset, MEDLINE XML repositories, and Entrez Programming Utilities (E-utils). It uses the lxml library to parse this information into Python dictionaries that can easily be used for research, such as in text mining and natural language processing pipelines.

For available APIs and details about the datasets, please see our wiki page or documentation page. Below, we list some of the core functionalities and code examples.

Available Parsers

  • The path provided to a function can point to either a compressed or an uncompressed XML file. We provide example files in the data folder.
  • For website parsing, scrape with a pause between requests. Please see the copyright notice: your IP can be blocked if you try to download in bulk.

Below, we list available parsers from pubmed_parser.

Parse PubMed OA XML information

We created a simple parser for the PubMed Open Access subset: give an XML path or string to the function parse_pubmed_xml and it will return a dictionary with the following information:

  • full_title : article's title
  • abstract : abstract
  • journal : Journal name
  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • doi : DOI of the article
  • publisher_id : publisher ID
  • author_list : list of authors with affiliation keys in the following format
 [['last_name_1', 'first_name_1', 'aff_key_1'],
  ['last_name_1', 'first_name_1', 'aff_key_2'],
  ['last_name_2', 'first_name_2', 'aff_key_1'], ...]
  • affiliation_list : list of affiliation keys and affiliation strings in the following format
 [['aff_key_1', 'affiliation_1'],
  ['aff_key_2', 'affiliation_2'], ...]
  • publication_year : publication year
  • subjects : list of subjects listed in the article, separated by semicolons. Sometimes it only contains the type of the article, such as a research article, review, proceedings, etc.
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
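
Since author_list and affiliation_list share the same affiliation keys, they can be joined to attach a full affiliation string to each author. A minimal sketch (the join logic here is ours, not part of the library):

import pubmed_parser as pp

dict_out = pp.parse_pubmed_xml(path)
affiliations = dict(dict_out['affiliation_list'])  # {'aff_key': 'affiliation string'}
for last_name, first_name, aff_key in dict_out['author_list']:
    print(first_name, last_name, '->', affiliations.get(aff_key, ''))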

Parse PubMed OA citation references

The function parse_pubmed_references will process a PubMed Open Access XML file and return a list of dictionaries describing the references it cites. Each dictionary has the following keys:

  • pmid : PubMed ID of the article
  • pmc : PubMed Central ID of the article
  • article_title : title of cited article
  • journal : journal name
  • journal_type : type of journal
  • pmid_cited : PubMed ID of article that article cites
  • doi_cited : DOI of article that article cites
  • year : publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)
dicts_out = pp.parse_pubmed_references(path) # return list of dictionary
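
For example, the output can be reduced to a citation edge list (citing PMID, cited PMID). A sketch assuming only the keys listed above:

import pubmed_parser as pp

dicts_out = pp.parse_pubmed_references(path)
# keep only references that resolved to a PMID
edges = [(d['pmid'], d['pmid_cited']) for d in dicts_out if d['pmid_cited']]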

Parse PubMed OA images and captions

The function parse_pubmed_caption parses image captions from a given path to an XML file, returning reference indices that you can use to refer back to the actual images. The function returns a list of dictionaries, each with the following keys:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • fig_caption : string of caption
  • fig_id : reference id for figure (use to refer in XML article)
  • fig_label : label of the figure
  • graphic_ref : reference to the image file name provided in the PubMed OA package
dicts_out = pp.parse_pubmed_caption(path) # return list of dictionary
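
If you have the extracted OA package on disk, graphic_ref can be joined with the image files. A sketch; the image directory and the file extension are assumptions about your local layout, not something the library guarantees:

import os
import pubmed_parser as pp

image_dir = 'path/to/extracted/package'  # hypothetical location of the article images
dicts_out = pp.parse_pubmed_caption(path)
for d in dicts_out:
    image_path = os.path.join(image_dir, d['graphic_ref'] + '.jpg')  # extension is an assumption
    print(d['fig_label'], '->', image_path)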

Parse PubMed OA Paragraph

If you are interested in parsing the text surrounding a citation, the library also provides that functionality. You can use parse_pubmed_paragraph to parse the text and referenced PMIDs. This function returns a list of dictionaries, where each entry has the following keys:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • text : full text of the paragraph
  • reference_ids : list of reference codes within that paragraph. These IDs can be merged with the output from parse_pubmed_references.
  • section : section of the paragraph (e.g. Background, Discussion, Appendix, etc.)
dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
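
To see which articles a given paragraph cites, reference_ids can be matched against the reference dictionaries. A sketch; the 'ref_id' key is an assumption about the reference output, so check the keys your version actually returns:

import pubmed_parser as pp

paragraphs = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=True)
references = pp.parse_pubmed_references('data/6605965a.nxml')
ref_lookup = {r.get('ref_id'): r for r in references}  # 'ref_id' is an assumption
for p in paragraphs:
    cited = [ref_lookup[rid] for rid in p['reference_ids'] if rid in ref_lookup]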

Parse PubMed OA Table [WIP]

You can use parse_pubmed_table to parse tables from an XML file. This function returns a list of dictionaries, each with the following keys:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • caption : caption of the table
  • label : label of the table
  • table_columns : list of column names
  • table_values : list of values inside the table
  • table_xml : raw XML text of the table (returned if return_xml=True)
dicts_out = pp.parse_pubmed_table('data/medline16n0902.xml.gz', return_xml=False)
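
A sketch for loading one parsed table into pandas, assuming the rows in table_values line up with table_columns (multilayer headers are not handled; see the WIP note above):

import pandas as pd
import pubmed_parser as pp

dicts_out = pp.parse_pubmed_table('data/medline16n0902.xml.gz', return_xml=False)
if dicts_out:
    table = dicts_out[0]
    df = pd.DataFrame(table['table_values'], columns=table['table_columns'])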

Parse MEDLINE XML

MEDLINE XML has a different format than PubMed Open Access XML. The structure of the XML files can be found in the MEDLINE/PubMed DTD here. You can use the function parse_medline_xml to parse that format. This function will return a list of dictionaries, where each element contains:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • doi : DOI
  • other_id : Other IDs found, each separated by ;
  • title : title of the article
  • abstract : abstract of the article
  • authors : authors, each separated by ;
  • mesh_terms : list of MeSH terms with corresponding MeSH IDs, each separated by ; e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...'
  • publication_types : list of publication types, each separated by ; e.g. 'D016428:Journal Article'
  • keywords : list of keywords, each separated by ;
  • chemical_list : list of chemical terms, each separated by ;
  • pubdate : Publication date. Defaults to year information only.
  • journal : journal of the given paper
  • medline_ta : abbreviation of the journal name
  • nlm_unique_id : NLM unique identifier
  • issn_linking : linking ISSN, typically used to link with the Web of Science dataset
  • country : Country extracted from journal information field
  • reference : string of PMIDs, each separated by ;, or a list of the article's references
  • delete : boolean; if False, the paper got updated, so you might have two XMLs for the same paper. You can delete the record of the deleted paper because it got updated.
  • languages : list of languages, each separated by ;
  • vernacular_title : vernacular title. Defaults to an empty string when not available.

dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False,
                                 reference_list=False) # return list of dictionary

To extract month and day information from PubDate, set year_info_only=False. We also support parsing structured abstracts, and you can control how each section label is displayed via the nlm_category argument.
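
Because each element is a flat dictionary, the output converts directly into a pandas DataFrame, for example:

import pandas as pd
import pubmed_parser as pp

dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz')
medline_df = pd.DataFrame(dicts_out)  # one row per article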

Parse MEDLINE Grant ID

Use parse_grant_id to parse MEDLINE grant IDs from an XML file. This will return a list of dictionaries, each containing:

  • pmid : PubMed ID
  • grant_id : Grant ID
  • grant_acronym : acronym of the grant
  • country : country the grant funding comes from
  • agency : Grant agency

If no grant ID is found, it will return None.
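
A minimal call, mirroring the MEDLINE example above (the same example file is assumed):

import pubmed_parser as pp

grant_dicts = pp.parse_grant_id('data/medline16n0902.xml.gz')  # list of dictionaries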

Parse MEDLINE XML from eutils website

You can use PubMed Parser to parse XML files from E-utilities using parse_xml_web. For this function, you can provide a single PMID as input and get a dictionary with the following keys:

  • title : title
  • abstract : abstract
  • journal : journal
  • affiliation : affiliation of first author
  • authors : string of authors, separated by ;
  • year : Publication year
  • keywords : keywords or MeSH terms of the article
dict_out = pp.parse_xml_web(pmid, save_xml=False)

Parse MEDLINE XML citations from website

The function parse_citation_web allows you to get the citations to a given PubMed ID or PubMed Central ID. This will return a dictionary containing the following keys:

  • pmc : PubMed Central ID
  • pmid : PubMed ID
  • doi : DOI of the article
  • n_citations : number of citations for the given article
  • pmc_cited : list of PMCs that cite the given PMC
dict_out = pp.parse_citation_web(doc_id, id_type='PMC')

Parse Outgoing XML citations from website

The function parse_outgoing_citation_web allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This will return a dictionary containing the following keys:

  • n_citations : number of cited articles
  • doc_id : the document identifier given
  • id_type : the type of identifier given. Either 'PMID' or 'PMC'
  • pmid_cited : list of PMIDs cited by the article
dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')

Identifiers should be passed as strings. PubMed Central IDs are the default and should be passed as strings without the 'PMC' prefix. If no citations are found, or if no article matching doc_id is found in the indicated database, it will return None.
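
For example, guarding against the None return (the PMID below is a placeholder for illustration):

import pubmed_parser as pp

result = pp.parse_outgoing_citation_web('12345678', id_type='PMID')  # placeholder PMID
if result is None:
    print('no citations found, or no matching article for doc_id')
else:
    print(result['n_citations'], 'articles cited')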

Installation

You can install the most up-to-date version of the package directly from the repository:

pip install git+https://github.com/titipata/pubmed_parser.git

or install the latest release from PyPI using

pip install pubmed-parser

or clone the repository and install using pip

git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser

You can test your installation by running the following in the root of the repository:

pytest --cov=pubmed_parser tests/ --verbose

Example snippet to parse PubMed OA dataset

An example usage is shown below:

import pubmed_parser as pp
path_xml = pp.list_xml_path('data') # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
print(pubmed_dict)

{'abstract': u"Background Despite identical genotypes and ...",
 'affiliation_list':
  [['I1', 'Department of Biological Sciences, ...'],
   ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
 'author_list':
  [['Dennehy', 'John J', 'I1'],
   ['Dennehy', 'John J', 'I2'],
   ['Wang', 'Ing-Nang', 'I1']],
 'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
 'journal': 'BMC Microbiology',
 'pmc': '3166277',
 'pmid': '21810267',
 'publication_year': '2011',
 'publisher_id': '1471-2180-11-174',
 'subjects': 'Research Article'}

Example Usage with PySpark

This is a snippet to parse the entire PubMed Open Access subset using PySpark 2.1:

import os
import pubmed_parser as pp
from pyspark.sql import Row

path_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)  # `spark` is an existing SparkSession, e.g. from the pyspark shell
parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),
                                               **pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
                                 'file_name', 'pmc', 'pmid',
                                 'publication_year', 'publisher_id',
                                 'journal', 'subjects']] # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe

See the scripts folder for more information.

Citation

If you use Pubmed Parser, please cite it from JOSS as follows

Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979

or using BibTex

@article{Achakulvisut2020,
  doi = {10.21105/joss.01979},
  url = {https://doi.org/10.21105/joss.01979},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {46},
  pages = {1979},
  author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},
  title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset},
  journal = {Journal of Open Source Software}
}

Contributions

We welcome contributions from anyone who would like to improve Pubmed Parser. You can create GitHub issues to discuss questions or problems relating to the repository. We suggest you read our Contributing Guidelines before creating issues, reporting bugs, or making a contribution to the repository.

Acknowledgement

This package was developed in Konrad Kording's Lab at the University of Pennsylvania. We would like to thank the reviewers and the editor from JOSS, including tleonardi, timClicks, and majensen. They made our repository much better!

License

MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna

Contributors

bluenex, daniel-acuna, daniel-mietchen, davidbrandfonbrener, gitter-badger, grivaz, h-plus-time, jimzijun, jtourille, kjhenner, kthyng, majensen, michael-e-rose, nils-herrmann, njford, patrusso2, raypereda-gr, seandavi, simonwoerpel, tanganyao, tariqahassan, tcyb, thomascpan, tiansuyu, titipata, tleonardi, vbatts, yak1r


pubmed_parser's Issues

`stringify_affiliation` function returns excess whitespace

In the function stringify_affiliation, we return ' '.join(filter(None, parts)). This handles the case of multiple child nodes, but in some cases it returns excess whitespace, for example 'University , Turku , Finland'. We should return a more reliably formatted string.
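
A possible post-processing step, collapsing whitespace runs and removing the space before punctuation (a sketch of a fix, not what the library currently does):

import re

def clean_affiliation(text):
    text = ' '.join(text.split())  # collapse runs of whitespace
    return re.sub(r'\s+([,.])', r'\1', text)  # drop the space before , and .

clean_affiliation('University , Turku , Finland')  # -> 'University, Turku, Finland'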

ImportError: libicui18n.so.56: cannot open shared object file: No such file or directory

This started happening a few days ago. I tried reinstalling a few libs, but to no avail. Do you have a suggestion?

Traceback (most recent call last):
  File "main.py", line 11, in <module>
    from util import *
  File "/home/sevajuri/projects/zsl/util.py", line 8, in <module>
    import pubmed_parser as pp
  File "/home/sevajuri/anaconda2/envs/tf/lib/python2.7/site-packages/pubmed_parser/__init__.py", line 7, in <module>
    from .pubmed_oa_parser import list_xml_path, \
  File "/home/sevajuri/anaconda2/envs/tf/lib/python2.7/site-packages/pubmed_parser/pubmed_oa_parser.py", line 2, in <module>
    from lxml import etree
ImportError: libicui18n.so.56: cannot open shared object file: No such file or directory

Return a new format of authors and affiliation

For author_list, it should be the following format instead, which I think is easier to post-process (i.e., linking authors to affiliations):

[[first_name_1, last_name_1, aff_1], [first_name_1, last_name_1, aff_2], [first_name_2, last_name_2, aff_1]]

For affiliation_list, it should be a list instead of a dictionary format. Then we can add a function that returns the link between both lists.

Return table in format that can convert to multilayer dataframe

We have a function to parse tables from the PubMed OA subset. However, right now it returns only tables that have one layer (a single list of values per column).

However, tables from PubMed OA can be multilayer, and we currently discard the extra rows of column names. I would like a function that parses those tables in a format that pandas can convert right away (e.g. JSON, or a list that converts easily).

Problem with parsing MEDLINE baseline 2017 xml files

Hi, I found the code works well with the MEDLINE 2016 baseline. However, when I tried to apply it to the MEDLINE 2017 baseline, only PMIDs were parsed; the other fields were blank. Is anyone aware of this? Thanks!

Chengkun

Possible error while parsing structured abstracts.

Hi, first of all, big thanks for this life-saver of a package.

I think there is a problem with parsing XML for structured abstracts. Consider the following example:

         <Abstract>
          <AbstractText Label="" NlmCategory="UNASSIGNED">
            <b>Patient: Female, 16</b>
            <b>Final Diagnosis: Pelvic mass</b>
            <b>Symptoms: None</b>
            <b>Medication: None</b>
            <b>Clinical Procedure: CT • MRI</b>
            <b>Specialty: Diagnostic radiology • pediatrics.</b>
          </AbstractText>
          <AbstractText Label="OBJECTIVE" NlmCategory="OBJECTIVE">
            <b>Unusual presentation of unknown etiology, Rare disease, Mistake in diagnosis.</b>
          </AbstractText>
          <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Müllerian anomalies encompass a wide variety of malformations in the female genital tract, usually associated with renal and anorectal malformations. Of these anomalies, approximately 11% are uterus didelphys, which occurs when midline fusion of the müllerian ducts is arrested to a variable extent.</AbstractText>
          <AbstractText Label="CASE REPORT" NlmCategory="METHODS">We report the case of a 16-year-old female with uterine didelphys, jejunal malrotation, hematometra, hematosalpinx, and bilateral subcentimeter homogenous circular cystic-like renal lesions, who initially presented with left lower quadrant abdominal pain, non-bloody vomiting, and a history of irregular menstrual periods. Initial CT was confusing for an adnexal cystic mass, but further imaging disclosed the above müllerian anomalies.</AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Müllerian anomalies may mimic other, more common, adnexal lesions; thus, adequate evaluation of suspicious cystic adnexal masses with multiple and advanced imaging modalities such as MRI is essential for adequate diagnosis and management.</AbstractText>
        </Abstract>

The parse returned by medline_parser is as follows:

'Patient: Female, 16\n            Final Diagnosis: Pelvic mass\n            Symptoms: None\n            Medication: None\n            Clinical Procedure: CT \u2022 MRI\n            Specialty: Diagnostic radiology \u2022 pediatrics.'

As you can see, it completely misses a major portion of the text. I wonder whether this is the case for all structured abstracts or only some. As additional info, the file I'm using is ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0763.xml.gz and the PMID of the abstract is 23826455.

Thanks!

Extract GrantID and related basic information

  • For the PubMed OA dataset, we want to extract information from the acknowledgement section text.
  • For the MEDLINE dataset, there is a GrantID section that we can write a parser to extract these numbers from.

Note that, for now, I couldn't find an example of an XML file that has a Grant or GrantID section. Feel free to let me know if anyone has an example XML file from which we can parse the GrantID section.

cc. @daniel-acuna

Unresolved references to 'join' in pubmed_oa_parser

I believe (I could be wrong) there are three small mistakes on lines 192, 246 and 312 of pubmed_oa_parser.py in the parse_references, parse_paragraph and parse_pubmed_caption functions, respectively.

The code references a function join, but I have been unable to locate it. Perhaps this should be corrected to a reference to the string method join (i.e., " ".join())?

Problem in list path function: list_xml_path()

path_list = [folder for folder in fullpath if os.path.splitext(folder)[-1] == ('.nxml' or '.xml')]
Is this code correct? ('.nxml' or '.xml') evaluates to just '.nxml', so this can never match .xml files from the path.
Did you mean path_list = [folder for folder in fullpath if os.path.splitext(folder)[-1] in ('.nxml', '.xml')]?

Create function to parse reference list

I attach code to do that here, but I still haven't cleaned it up. @davidbrandfonbrener, can you take a look?

import pubmed_parser as pp
from lxml import etree

def join(l):
    return ' '.join(l)

path_xml = pp.list_xml_path('data/')
#tree = etree.parse('data/pntd.0002065.nxml')
tree = etree.parse(path_xml[0])
references = tree.xpath('//ref-list/ref[@id]')
dict_refs = list()
for r in references:
    ref_id = r.attrib['id']
    for rc in r:
        if 'publication-type' in rc.attrib.keys():
            # dict.values() is not subscriptable in Python 3; guard against empty attrib
            attrib_values = list(rc.attrib.values())
            journal_type = attrib_values[0] if attrib_values else ''
            names = list()
            for n in rc.findall('name'):
                name = join([t.text for t in n.getchildren()][::-1])
                names.append(name)
            try:
                article_title = rc.findall('article-title')[0].text
            except IndexError:
                article_title = ''
            try:
                journal = rc.findall('source')[0].text
            except IndexError:
                journal = ''
            try:
                pmid = rc.findall('pub-id[@pub-id-type="pmid"]')[0].text
            except IndexError:
                pmid = ''
            dict_ref = {'ref_id': ref_id, 'name': names, 'article_title': article_title, 
                        'journal': journal, 'journal_type': journal_type, 'pmid': pmid}
            dict_refs.append(dict_ref)

AttributeError: 'NoneType' object has no attribute 'text' while using parse_xml_web

In [8]: dict_out = pp.parse_xml_web('26849437', save_xml=False)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-5194f10e1609> in <module>()
----> 1 dict_out = pp.parse_xml_web(pmid, save_xml=False)

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser-0.1.dev0-py2.7.egg/pubmed_parser/pubmed_web_parser.pyc in parse_xml_web(pmid, sleep, save_xml)
     86     """
     87     tree = load_xml(pmid, sleep=sleep)
---> 88     dict_out = parse_pubmed_web_tree(tree)
     89     dict_out['pmid'] = str(pmid)
     90     if save_xml:

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser-0.1.dev0-py2.7.egg/pubmed_parser/pubmed_web_parser.pyc in parse_pubmed_web_tree(tree)
     63     if authors_tree is not None:
     64         for a in authors_tree:
---> 65             firstname = a.find('forename').text
     66             lastname = a.find('lastname').text
     67             fullname = firstname + ' ' + lastname

AttributeError: 'NoneType' object has no attribute 'text'

Parse more information from MEDLINE XML file

These include the following:

  • MedlineTA
  • NlmUniqueID
  • ISSNLinking
  • Country

These data can later be used as a linkage to the Web of Science (WoS) dataset, for people who own the WoS dataset.

Abstract partially extracted

For a (small?) subset of documents, only part of the abstract is extracted (e.g. PMID 24653627, 23357879, 27983391, 26762307, 28005260, 22351618, 23456555,18006916,25371446)

PMID to PMC API from MEDLINE cannot convert all provided PMIDs

The API here cannot convert all PMID inputs. I was trying to parse citations from a given set of PMIDs, but it only returned a subset of the PMIDs I provided. One possibility is to host pairs of PMIDs/PMCs somewhere on the cloud and provide a similar API, or a source file that users can use to convert PMIDs to PMCs.

Pubdate not returning correct year

There is a problem with some of the pubdate fields in the output. It is not pulling the correct year; instead, it splits the text on " " and grabs the first chunk, so you can end up with pubdate values like ["Summer", "Winter"]. Some example PMIDs this happens for are [28599031, 28599032, 28599033, etc.]. Could you please update it to match on a regular expression like "\d{4}" instead of splitting on the whitespace and grabbing the first chunk?
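
A sketch of the suggested fix, pulling the first four-digit run instead of the first whitespace-separated chunk:

import re

def extract_year(pubdate_text):
    match = re.search(r'\d{4}', pubdate_text)
    return match.group() if match else ''

extract_year('Summer 2017')  # returns '2017', not 'Summer'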

Output type in Pubmed OA is lxml.etree._ElementUnicodeResult

@daniel-acuna caught this error when we ran output = pp.parse_pubmed_xml('data/1472-6831-8-11.nxml') and then checked the output types, i.e. list(map(type, output.values()))

The output types are

[str,
 list,
 lxml.etree._ElementUnicodeResult,
 str,
 list,
 str,
 str,
 str,
 lxml.etree._ElementUnicodeResult,
 str]

We have to turn these types into proper strings.

Error while reading Medline gz file from path

In [234]: pp.parse_pubmed_xml('/home/docClass/files/pubmed/medline17n0330.xml.gz')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-234-bfaa6482d6d1> in <module>()
----> 1 pp.parse_pubmed_xml('/home/docClass/files/pubmed/medline17n0330.xml.gz')

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser-0.1.dev0-py2.7.egg/pubmed_parser/pubmed_oa_parser.pyc in parse_pubmed_xml(path, include_path)
    108         journal = ''
    109
--> 110     dict_article_meta = parse_article_meta(tree)
    111     pub_year_node = tree.find('//pub-date/year')
    112     pub_year = pub_year_node.text if pub_year_node is not None else ''

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser-0.1.dev0-py2.7.egg/pubmed_parser/pubmed_oa_parser.pyc in parse_article_meta(tree)
     56     """
     57     article_meta = tree.find('//article-meta')
---> 58     pmid_node = article_meta.find('article-id[@pub-id-type="pmid"]')
     59     pmc_node = article_meta.find('article-id[@pub-id-type="pmc"]')
     60     pub_id_node = article_meta.find('article-id[@pub-id-type="publisher-id"]')

AttributeError: 'NoneType' object has no attribute 'find'

python-2.7 import error: No module named 'request'

In my environment, Python 2.7 is not able to find the module urllib.request required by pubmed_web_parser.py.

If I replace the urllib import statement as suggested in this discussion, then Python 2 is able to find the request module:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

urlopen is used by the function parse_outgoing_citation_web in the same file. All HTTP requests in this file are made with the requests library, except the one made from parse_outgoing_citation_web. Would it be possible to use the requests library there as well, to avoid this import?

--mahmut

Parsers cannot read the xml file.

Below I've copied my Python session. I'm trying to parse MEDLINE data. I've done this with your PubMed and MEDLINE parsers on the listed machine, as well as on an Ubuntu server, with the same error. I've also generated a file using the R programming language; if you are familiar with that, the package I used is called easyPubMed and I used the batch_pubmed_download() function.

Anyway, I'd really like to use your code, especially as it links the authors with their affiliated institutions.
I'm new to XML parsing, so I have no idea what I'm doing in that respect.

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32

>>>import pubmed_parser as pp
>>>pp.parse_pubmed_xml('C:\\Users\\Work\\Downloads\\medline16n0902.xml')

Error: it was not able to read a path, a file-like object, or a string as an XML
Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\utils.py", line 14, in read_xml
    tree = etree.parse(path)
  File "src\lxml\lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81101)
  File "src\lxml\parser.pxi", line 1811, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:117832)
  File "src\lxml\parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:118179)
  File "src\lxml\parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:117091)
  File "src\lxml\parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:111637)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105612)
OSError: Error reading file 'medline16n0902.xml': failed to load external entity "medline16n0902.xml"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\medline_parser.py", line 354, in parse_medline_xml
    tree = read_xml(path)
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\utils.py", line 17, in read_xml
    tree = etree.fromstring(path)
  File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:78994)
  File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118325)
  File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:116883)
  File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:110870)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 635, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105655)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Remove extra space that appears when parsing abstract text

In both PubMed Open Access and MEDLINE, the parsed abstract text often includes extra blank spaces and newlines (\n). We can preprocess this later using the nltk whitespace tokenizer, but it would be nice to return the string without extra spaces right away.
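
A minimal normalization that could be applied before returning the abstract (a sketch, not the library's current behavior):

def normalize_whitespace(text):
    # str.split() with no argument splits on any whitespace run, including '\n'
    return ' '.join(text.split())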

Affiliation contains number

We select all strings under affiliation, and they come with numbers. There should be a way to remove them in the parser itself, or we might use a regular expression to remove them, as sketched below.
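
A sketch of the regular-expression approach; stripping every digit is a blunt heuristic and may also remove legitimate numbers, such as street numbers:

import re

def strip_affiliation_numbers(affiliation):
    # remove digit runs, e.g. superscript markers captured as plain text
    return re.sub(r'\d+', '', affiliation).strip()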

Parsing abstract section

There can be one or more sections in a single abstract. For example, a single abstract can have BACKGROUND, METHOD, and RESULT sections, or just Abstract and RESULT, etc. We want to add a tagger when we parse these abstracts, in case people want to analyze only a few parts of an abstract.

I'm still not sure which format we should return in this case.

Problem during parsing medline17m1086 file

Hey guys,
I'm getting the following error when I try to use your code.
What's the problem?

Best regards

AttributeError                            Traceback (most recent call last)
<ipython-input-2-9968ea229488> in <module>()
      1 ## settings
      2 file = "medline17n1086.xml";
----> 3 pubmed_dict = pp.parse_pubmed_xml(file) # dictionary output

/usr/local/lib/python3.5/dist-packages/pubmed_parser-0.1-py3.5.egg/pubmed_parser/pubmed_oa_parser.py in parse_pubmed_xml(path, include_path)
    108         journal = ''
    109 
--> 110     dict_article_meta = parse_article_meta(tree)
    111     pub_year_node = tree.find('//pub-date/year')
    112     pub_year = pub_year_node.text if pub_year_node is not None else ''

/usr/local/lib/python3.5/dist-packages/pubmed_parser-0.1-py3.5.egg/pubmed_parser/pubmed_oa_parser.py in parse_article_meta(tree)
     56     """
     57     article_meta = tree.find('//article-meta')
---> 58     pmid_node = article_meta.find('article-id[@pub-id-type="pmid"]')
     59     pmc_node = article_meta.find('article-id[@pub-id-type="pmc"]')
     60     pub_id_node = article_meta.find('article-id[@pub-id-type="publisher-id"]')

AttributeError: 'NoneType' object has no attribute 'find'

feature request: MeSH IDs

It'd be useful to be able to get MeSH IDs as well as heading names. I've used the following quick-and-dirty change to medline_parser.py, but there's probably a more elegant way.

mesh_terms_list = [m.find('DescriptorName').attrib.get('UI','') + ":" + m.find('DescriptorName').text for m in mesh.getchildren()]

XMLSyntaxError

In [41]: pp.parse_medline_xml('/home/docClass/files/pubmed/pubmed18n1040.xml.gz')
Error: it was not able to read a path, a file-like object, or a string as an XML
File "", line 1
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Source: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed18n1040.xml.gz

problem with parse paragraph function

I tried to run the following test. It seems like it should have worked. Any suggestions welcome. Thanks!!

from pubmed_oa_parser import parse_pubmed_paragraph
from os import system

if __name__=="__main__":
 
    system("wget 'https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:156895&metadataPrefix=pmc' -O ./data/PMC156895.nxml") 

    dicts_out = parse_pubmed_paragraph('data/PMC156895.nxml', all_paragraph=False)

Output

--2017-06-14 22:23:06--  https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:156895&metadataPrefix=pmc
Resolving www.ncbi.nlm.nih.gov... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to www.ncbi.nlm.nih.gov|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘./data/PMC156895.nxml’

./data/PMC156895.nxml                        [    <=>                                                                            ]  66.07K  67.0KB/s    in 1.0s    

2017-06-14 22:23:10 (67.0 KB/s) - ‘./data/PMC156895.nxml’ saved [67658]

<lxml.etree._ElementTree object at 0x1068596c8>
Traceback (most recent call last):
  File "scrape.py", line 9, in <module>
    dicts_out = parse_pubmed_paragraph('data/PMC156895.nxml', all_paragraph=False)
  File "/Users/fenwick/Library/Mobile Documents/com~apple~CloudDocs/other/web_scrape/pubmed_oa_parser.py", line 253, in parse_pubmed_paragraph
    dict_article_meta = parse_article_meta(tree)
  File "/Users/fenwick/Library/Mobile Documents/com~apple~CloudDocs/other/web_scrape/pubmed_oa_parser.py", line 57, in parse_article_meta
    pmid_node = article_meta.find('article-id[@pub-id-type="pmid"]')
AttributeError: 'NoneType' object has no attribute 'find'

Processing deleted Medline "citations" in NLM XML records

How should we process the deleted citations?

Sometimes the update XML comes with "deleted" citations (like this example), and it would be good to know which PMIDs were deleted.

For example, the stats for the update file medline16n0906.xml, available at ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz/medline16n0906_stats.html, say that there are 8809 citations and 353 delete citations. If we process the XML with pubmed_parser, we correctly get 8809 - 353 = 8456 records. Use the code below to test this:

# adapted from http://stackoverflow.com/questions/18772703/read-a-file-in-buffer-from-ftp-python
from ftplib import FTP
import gzip
import StringIO

def open_ftp_data(server, path, binary=True):
    ftp = FTP(server)
    ftp.login()

    data_io = StringIO.StringIO()
    def handle_data(more_data):
        data_io.write(more_data)
    if binary:
        resp = ftp.retrbinary("RETR " + path, callback=handle_data)
    else:
        resp = ftp.retrlines("RETR " + path, callback=handle_data)
    data_io.seek(0) # Go back to the start
    ftp.close()
    return data_io

import pubmed_parser as pp
binary_file = open_ftp_data('ftp.nlm.nih.gov', 'nlmdata/.medlease/gz/medline16n0906.xml.gz')
zippy = gzip.GzipFile(fileobj=binary_file)
medline_xml = zippy.read()
dict_records = pp.parse_medline_xml(medline_xml)
print("pubmed_parser records processed: {}".format(len(dict_records)))

Output

pubmed_parser records processed: 8456

We can find the deleted citations simply with

from lxml import etree
root = etree.fromstring(medline_xml)
print("Delete citations {}".format(len(root.xpath('//DeleteCitation/PMID'))))

Output

Delete citations 353

"can only parse strings" while reading PMC nxml

For several PMC files, I get an error while reading in the content; the list is below:

PMC4569614.nxml 
PMC5362956.nxml 
PMC4162892.nxml 
PMC4569628.nxml 
PMC5348996.nxml 
PMC5362810.nxml 
PMC5352161.nxml 
PMC4522714.nxml 
PMC5352154.nxml 
PMC5363022.nxml 
PMC4522719.nxml 
PMC5346358.nxml 

P.S.
I'll post the traceback when I run it again.
