
Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset


Pubmed Parser is a Python library for parsing the PubMed Open-Access (OA) subset, MEDLINE XML repositories, and Entrez Programming Utilities (E-utils). It uses the lxml library to parse this information into Python dictionaries that can easily be used for research, such as in text mining and natural language processing pipelines.

For available APIs and details about the datasets, please see our wiki page or documentation page. Below, we list some of the core functionalities and code examples.

Available Parsers

  • The path provided to a function can point to either a compressed or an uncompressed XML file. We provide example files in the data folder.
  • For website parsing, scrape with a pause between requests. Please see the copyright notice: your IP can be blocked if you try to download in bulk.

Below, we list available parsers from pubmed_parser.

Parse PubMed OA XML information

We created a simple parser for the PubMed Open Access subset: give an XML path or string to the function parse_pubmed_xml and it will return a dictionary with the following information:

  • full_title : article's title
  • abstract : abstract
  • journal : Journal name
  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • doi : DOI of the article
  • publisher_id : publisher ID
  • author_list : list of authors with affiliation keys in the following format
 [['last_name_1', 'first_name_1', 'aff_key_1'],
  ['last_name_1', 'first_name_1', 'aff_key_2'],
  ['last_name_2', 'first_name_2', 'aff_key_1'], ...]
  • affiliation_list : list of affiliation keys and affiliation strings in the following format
 [['aff_key_1', 'affiliation_1'],
  ['aff_key_2', 'affiliation_2'], ...]
  • publication_year : publication year
  • subjects : list of subjects listed in the article, separated by semicolons. Sometimes it only contains the type of the article, such as a research article, review, proceedings, etc.
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
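
Since author_list and affiliation_list share the same affiliation keys, they can be joined to attach a full affiliation string to each author. A minimal sketch (the join logic here is ours, not part of the library):

import pubmed_parser as pp

dict_out = pp.parse_pubmed_xml(path)
affiliations = dict(dict_out['affiliation_list'])  # {'aff_key': 'affiliation string'}
for last_name, first_name, aff_key in dict_out['author_list']:
    print(first_name, last_name, '->', affiliations.get(aff_key, ''))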

Parse PubMed OA citation references

The function parse_pubmed_references will process a PubMed Open Access XML file and return a list of dictionaries describing the references it cites. Each dictionary has the following keys:

  • pmid : PubMed ID of the article
  • pmc : PubMed Central ID of the article
  • article_title : title of cited article
  • journal : journal name
  • journal_type : type of journal
  • pmid_cited : PubMed ID of article that article cites
  • doi_cited : DOI of article that article cites
  • year : publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)
dicts_out = pp.parse_pubmed_references(path) # return list of dictionary
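
For example, the output can be reduced to a citation edge list (citing PMID, cited PMID). A sketch assuming only the keys listed above:

import pubmed_parser as pp

dicts_out = pp.parse_pubmed_references(path)
# keep only references that resolved to a PMID
edges = [(d['pmid'], d['pmid_cited']) for d in dicts_out if d['pmid_cited']]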

Parse PubMed OA images and captions

The function parse_pubmed_caption parses image captions from a given path to an XML file, returning reference indices that you can use to refer back to the actual images. The function returns a list of dictionaries, each with the following keys:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • fig_caption : string of caption
  • fig_id : reference id for figure (use to refer in XML article)
  • fig_label : label of the figure
  • graphic_ref : reference to the image file name provided in the PubMed OA package
dicts_out = pp.parse_pubmed_caption(path) # return list of dictionary
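
If you have the extracted OA package on disk, graphic_ref can be joined with the image files. A sketch; the image directory and the file extension are assumptions about your local layout, not something the library guarantees:

import os
import pubmed_parser as pp

image_dir = 'path/to/extracted/package'  # hypothetical location of the article images
dicts_out = pp.parse_pubmed_caption(path)
for d in dicts_out:
    image_path = os.path.join(image_dir, d['graphic_ref'] + '.jpg')  # extension is an assumption
    print(d['fig_label'], '->', image_path)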

Parse PubMed OA Paragraph

If you are interested in parsing the text surrounding a citation, the library also provides that functionality. You can use parse_pubmed_paragraph to parse the text and referenced PMIDs. This function returns a list of dictionaries, where each entry has the following keys:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • text : full text of the paragraph
  • reference_ids : list of reference codes within that paragraph. These IDs can be merged with the output from parse_pubmed_references.
  • section : section of the paragraph (e.g. Background, Discussion, Appendix, etc.)
dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
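
To see which articles a given paragraph cites, reference_ids can be matched against the reference dictionaries. A sketch; the 'ref_id' key is an assumption about the reference output, so check the keys your version actually returns:

import pubmed_parser as pp

paragraphs = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=True)
references = pp.parse_pubmed_references('data/6605965a.nxml')
ref_lookup = {r.get('ref_id'): r for r in references}  # 'ref_id' is an assumption
for p in paragraphs:
    cited = [ref_lookup[rid] for rid in p['reference_ids'] if rid in ref_lookup]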

Parse PubMed OA Table [WIP]

You can use parse_pubmed_table to parse tables from an XML file. This function returns a list of dictionaries, each with the following keys:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • caption : caption of the table
  • label : label of the table
  • table_columns : list of column names
  • table_values : list of values inside the table
  • table_xml : raw XML text of the table (returned if return_xml=True)
dicts_out = pp.parse_pubmed_table('data/medline16n0902.xml.gz', return_xml=False)
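
A sketch for loading one parsed table into pandas, assuming the rows in table_values line up with table_columns (multilayer headers are not handled; see the WIP note above):

import pandas as pd
import pubmed_parser as pp

dicts_out = pp.parse_pubmed_table('data/medline16n0902.xml.gz', return_xml=False)
if dicts_out:
    table = dicts_out[0]
    df = pd.DataFrame(table['table_values'], columns=table['table_columns'])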

Parse MEDLINE XML

MEDLINE XML has a different format than PubMed Open Access XML. The structure of the XML files can be found in the MEDLINE/PubMed DTD here. You can use the function parse_medline_xml to parse that format. This function will return a list of dictionaries, where each element contains:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • doi : DOI
  • other_id : Other IDs found, each separated by ;
  • title : title of the article
  • abstract : abstract of the article
  • authors : authors, each separated by ;
  • mesh_terms : list of MeSH terms with corresponding MeSH IDs, each separated by ; e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...'
  • publication_types : list of publication types, each separated by ; e.g. 'D016428:Journal Article'
  • keywords : list of keywords, each separated by ;
  • chemical_list : list of chemical terms, each separated by ;
  • pubdate : Publication date. Defaults to year information only.
  • journal : journal of the given paper
  • medline_ta : abbreviation of the journal name
  • nlm_unique_id : NLM unique identifier
  • issn_linking : linking ISSN, typically used to link with the Web of Science dataset
  • country : Country extracted from journal information field
  • reference : string of PMIDs, each separated by ;, or a list of the article's references
  • delete : boolean; if False, the paper got updated, so you might have two XMLs for the same paper. You can delete the record of the deleted paper because it got updated.
  • languages : list of languages, each separated by ;
  • vernacular_title : vernacular title. Defaults to an empty string when not available.

dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False,
                                 reference_list=False) # return list of dictionary

To extract month and day information from PubDate, set year_info_only=False. We also support parsing structured abstracts, and you can control how each section label is displayed via the nlm_category argument.
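
Because each element is a flat dictionary, the output converts directly into a pandas DataFrame, for example:

import pandas as pd
import pubmed_parser as pp

dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz')
medline_df = pd.DataFrame(dicts_out)  # one row per article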

Parse MEDLINE Grant ID

Use parse_grant_id to parse MEDLINE grant IDs from an XML file. This will return a list of dictionaries, each containing:

  • pmid : PubMed ID
  • grant_id : Grant ID
  • grant_acronym : acronym of the grant
  • country : country the grant funding comes from
  • agency : Grant agency

If no grant ID is found, it will return None.
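
A minimal call, mirroring the MEDLINE example above (the same example file is assumed):

import pubmed_parser as pp

grant_dicts = pp.parse_grant_id('data/medline16n0902.xml.gz')  # list of dictionaries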

Parse MEDLINE XML from eutils website

You can use PubMed Parser to parse XML files from E-utilities using parse_xml_web. For this function, you can provide a single PMID as input and get a dictionary with the following keys:

  • title : title
  • abstract : abstract
  • journal : journal
  • affiliation : affiliation of first author
  • authors : string of authors, separated by ;
  • year : Publication year
  • keywords : keywords or MeSH terms of the article
dict_out = pp.parse_xml_web(pmid, save_xml=False)

Parse MEDLINE XML citations from website

The function parse_citation_web allows you to get the citations to a given PubMed ID or PubMed Central ID. This will return a dictionary containing the following keys:

  • pmc : PubMed Central ID
  • pmid : PubMed ID
  • doi : DOI of the article
  • n_citations : number of citations for the given article
  • pmc_cited : list of PMCs that cite the given PMC
dict_out = pp.parse_citation_web(doc_id, id_type='PMC')

Parse Outgoing XML citations from website

The function parse_outgoing_citation_web allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This will return a dictionary containing the following keys:

  • n_citations : number of cited articles
  • doc_id : the document identifier given
  • id_type : the type of identifier given. Either 'PMID' or 'PMC'
  • pmid_cited : list of PMIDs cited by the article
dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')

Identifiers should be passed as strings. PubMed Central IDs are the default and should be passed as strings without the 'PMC' prefix. If no citations are found, or if no article matching doc_id is found in the indicated database, it will return None.
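
For example, guarding against the None return (the PMID below is a placeholder for illustration):

import pubmed_parser as pp

result = pp.parse_outgoing_citation_web('12345678', id_type='PMID')  # placeholder PMID
if result is None:
    print('no citations found, or no matching article for doc_id')
else:
    print(result['n_citations'], 'articles cited')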

Installation

You can install the most up-to-date version of the package directly from the repository:

pip install git+https://github.com/titipata/pubmed_parser.git

or install the latest release from PyPI using

pip install pubmed-parser

or clone the repository and install using pip

git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser

You can test your installation by running the following in the root of the repository:

pytest --cov=pubmed_parser tests/ --verbose

Example snippet to parse PubMed OA dataset

An example usage is shown below:

import pubmed_parser as pp
path_xml = pp.list_xml_path('data') # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
print(pubmed_dict)

{'abstract': u"Background Despite identical genotypes and ...",
 'affiliation_list':
  [['I1', 'Department of Biological Sciences, ...'],
   ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
 'author_list':
  [['Dennehy', 'John J', 'I1'],
   ['Dennehy', 'John J', 'I2'],
   ['Wang', 'Ing-Nang', 'I1']],
 'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
 'journal': 'BMC Microbiology',
 'pmc': '3166277',
 'pmid': '21810267',
 'publication_year': '2011',
 'publisher_id': '1471-2180-11-174',
 'subjects': 'Research Article'}

Example Usage with PySpark

This is a snippet to parse the entire PubMed Open Access subset using PySpark 2.1:

import os
import pubmed_parser as pp
from pyspark.sql import Row

path_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)  # `spark` is an existing SparkSession, e.g. from the pyspark shell
parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),
                                               **pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
                                 'file_name', 'pmc', 'pmid',
                                 'publication_year', 'publisher_id',
                                 'journal', 'subjects']] # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe

See the scripts folder for more information.

Citation

If you use Pubmed Parser, please cite it from JOSS as follows

Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979

or using BibTex

@article{Achakulvisut2020,
  doi = {10.21105/joss.01979},
  url = {https://doi.org/10.21105/joss.01979},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {46},
  pages = {1979},
  author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},
  title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset},
  journal = {Journal of Open Source Software}
}

Contributions

We welcome contributions from anyone who would like to improve Pubmed Parser. You can create GitHub issues to discuss questions or problems relating to the repository. We suggest you read our Contributing Guidelines before creating issues, reporting bugs, or making a contribution to the repository.

Acknowledgement

This package was developed in Konrad Kording's Lab at the University of Pennsylvania. We would like to thank the reviewers and the editor from JOSS, including tleonardi, timClicks, and majensen. They made our repository much better!

License

MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna

Contributors

bluenex, daniel-acuna, daniel-mietchen, davidbrandfonbrener, gitter-badger, grivaz, h-plus-time, jimzijun, jtourille, kjhenner, kthyng, majensen, michael-e-rose, nils-herrmann, njford, patrusso2, raypereda-gr, seandavi, simonwoerpel, tanganyao, tariqahassan, tcyb, thomascpan, tiansuyu, titipata, tleonardi, vbatts, yak1r


pubmed_parser's Issues

`stringify_affiliation` function returns excess whitespace

In the function stringify_affiliation, we return ' '.join(filter(None, parts)). This handles the case of multiple child nodes, but in some cases it returns excess whitespace, for example 'University , Turku , Finland'. We should return a more reliably formatted string.
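
A possible post-processing step, collapsing whitespace runs and removing the space before punctuation (a sketch of a fix, not what the library currently does):

import re

def clean_affiliation(text):
    text = ' '.join(text.split())  # collapse runs of whitespace
    return re.sub(r'\s+([,.])', r'\1', text)  # drop the space before , and .

clean_affiliation('University , Turku , Finland')  # -> 'University, Turku, Finland'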

ImportError: libicui18n.so.56: cannot open shared object file: No such file or directory

This started happening a few days ago. I tried reinstalling a few libs, but to no avail. Do you have a suggestion?

Traceback (most recent call last):
  File "main.py", line 11, in <module>
    from util import *
  File "/home/sevajuri/projects/zsl/util.py", line 8, in <module>
    import pubmed_parser as pp
  File "/home/sevajuri/anaconda2/envs/tf/lib/python2.7/site-packages/pubmed_parser/__init__.py", line 7, in <module>
    from .pubmed_oa_parser import list_xml_path, \
  File "/home/sevajuri/anaconda2/envs/tf/lib/python2.7/site-packages/pubmed_parser/pubmed_oa_parser.py", line 2, in <module>
    from lxml import etree
ImportError: libicui18n.so.56: cannot open shared object file: No such file or directory

Return a new format of authors and affiliation

For author_list, it should be the following format instead, which I think is easier to post-process (i.e., linking authors to affiliations):

[[first_name_1, last_name_1, aff_1], [first_name_1, last_name_1, aff_2], [first_name_2, last_name_2, aff_1]]

For affiliation_list, it should be a list instead of a dictionary format. Then we can add a function that returns the link between both lists.

Return table in format that can convert to multilayer dataframe

We have a function to parse tables from the PubMed OA subset. However, right now it returns only tables that have one layer (a single list of values per column).

However, tables from PubMed OA can be multilayer, and we currently discard the extra rows of column names. I would like a function that parses those tables in a format that pandas can convert right away (e.g. JSON, or a list that converts easily).

Problem with parsing MEDLINE baseline 2017 xml files

Hi, I found the code works well with the MEDLINE 2016 baseline. However, when I tried to apply it to the MEDLINE 2017 baseline, only PMIDs were parsed; the other fields were blank. Is anyone aware of this? Thanks!

Chengkun

Possible error while parsing structured abstracts.

Hi, first of all, big thanks for this life-saver of a package.

I think there is a problem with parsing XML for structured abstracts. Consider the following example:

         <Abstract>
          <AbstractText Label="" NlmCategory="UNASSIGNED">
            <b>Patient: Female, 16</b>
            <b>Final Diagnosis: Pelvic mass</b>
            <b>Symptoms: None</b>
            <b>Medication: None</b>
            <b>Clinical Procedure: CT • MRI</b>
            <b>Specialty: Diagnostic radiology • pediatrics.</b>
          </AbstractText>
          <AbstractText Label="OBJECTIVE" NlmCategory="OBJECTIVE">
            <b>Unusual presentation of unknown etiology, Rare disease, Mistake in diagnosis.</b>
          </AbstractText>
          <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Müllerian anomalies encompass a wide variety of malformations in the female genital tract, usually associated with renal and anorectal malformations. Of these anomalies, approximately 11% are uterus didelphys, which occurs when midline fusion of the müllerian ducts is arrested to a variable extent.</AbstractText>
          <AbstractText Label="CASE REPORT" NlmCategory="METHODS">We report the case of a 16-year-old female with uterine didelphys, jejunal malrotation, hematometra, hematosalpinx, and bilateral subcentimeter homogenous circular cystic-like renal lesions, who initially presented with left lower quadrant abdominal pain, non-bloody vomiting, and a history of irregular menstrual periods. Initial CT was confusing for an adnexal cystic mass, but further imaging disclosed the above müllerian anomalies.</AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Müllerian anomalies may mimic other, more common, adnexal lesions; thus, adequate evaluation of suspicious cystic adnexal masses with multiple and advanced imaging modalities such as MRI is essential for adequate diagnosis and management.</AbstractText>
        </Abstract>

The parse returned by medline_parser is as follows:

'Patient: Female, 16\n            Final Diagnosis: Pelvic mass\n            Symptoms: None\n            Medication: None\n            Clinical Procedure: CT \u2022 MRI\n            Specialty: Diagnostic radiology \u2022 pediatrics.'

As you can see, it completely misses a major portion of the text. I wonder whether this is the case for all structured abstracts or only some. As additional info, the file I'm using is ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0763.xml.gz and the PMID of the abstract is 23826455.

Thanks!

Extract GrantID and related basic information

  • For the PubMed OA dataset, we want to extract information from the acknowledgement section text.
  • For the MEDLINE dataset, there is a GrantID section that we can write a parser to extract these numbers from.

Note that, for now, I couldn't find an example of an XML file that has a Grant or GrantID section. Feel free to let me know if anyone has an example XML file from which we can parse the GrantID section.

cc. @daniel-acuna

Unresolved references to 'join' in pubmed_oa_parser

I believe (I could be wrong) there are three small mistakes on lines 192, 246 and 312 of pubmed_oa_parser.py in the parse_references, parse_paragraph and parse_pubmed_caption functions, respectively.

The code references a function join, but I have been unable to locate it. Perhaps this should be corrected to a reference to the string method join (i.e., " ".join())?

Problem in list path function: list_xml_path()

path_list = [folder for folder in fullpath if os.path.splitext(folder)[-1] == ('.nxml' or '.xml')]
Is this code correct? ('.nxml' or '.xml') evaluates to just '.nxml', so this can never match .xml files from the path.
Did you mean path_list = [folder for folder in fullpath if os.path.splitext(folder)[-1] in ('.nxml', '.xml')]?

Create function to parse reference list

I attach code to do that here, but I still haven't cleaned it up. @davidbrandfonbrener, can you take a look?

import pubmed_parser as pp
from lxml import etree

def join(l):
    return ' '.join(l)

path_xml = pp.list_xml_path('data/')
#tree = etree.parse('data/pntd.0002065.nxml')
tree = etree.parse(path_xml[0])
references = tree.xpath('//ref-list/ref[@id]')
dict_refs = list()
for r in references:
    ref_id = r.attrib['id']
    for rc in r:
        if 'publication-type' in rc.attrib.keys():
            # dict.values() is not subscriptable in Python 3; guard against empty attrib
            attrib_values = list(rc.attrib.values())
            journal_type = attrib_values[0] if attrib_values else ''
            names = list()
            for n in rc.findall('name'):
                name = join([t.text for t in n.getchildren()][::-1])
                names.append(name)
            try:
                article_title = rc.findall('article-title')[0].text
            except IndexError:
                article_title = ''
            try:
                journal = rc.findall('source')[0].text
            except IndexError:
                journal = ''
            try:
                pmid = rc.findall('pub-id[@pub-id-type="pmid"]')[0].text
            except IndexError:
                pmid = ''
            dict_ref = {'ref_id': ref_id, 'name': names, 'article_title': article_title, 
                        'journal': journal, 'journal_type': journal_type, 'pmid': pmid}
            dict_refs.append(dict_ref)

AttributeError: 'NoneType' object has no attribute 'text' while using parse_xml_web

In [8]: dict_out = pp.parse_xml_web('26849437', save_xml=False)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-5194f10e1609> in <module>()
----> 1 dict_out = pp.parse_xml_web(pmid, save_xml=False)

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser-0.1.dev0-py2.7.egg/pubmed_parser/pubmed_web_parser.pyc in parse_xml_web(pmid, sleep, save_xml)
     86     """
     87     tree = load_xml(pmid, sleep=sleep)
---> 88     dict_out = parse_pubmed_web_tree(tree)
     89     dict_out['pmid'] = str(pmid)
     90     if save_xml:

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser-0.1.dev0-py2.7.egg/pubmed_parser/pubmed_web_parser.pyc in parse_pubmed_web_tree(tree)
     63     if authors_tree is not None:
     64         for a in authors_tree:
---> 65             firstname = a.find('forename').text
     66             lastname = a.find('lastname').text
     67             fullname = firstname + ' ' + lastname

AttributeError: 'NoneType' object has no attribute 'text'

Parse more information from MEDLINE XML file

These include the following:

  • MedlineTA
  • NlmUniqueID
  • ISSNLinking
  • Country

These data can later be used as a linkage to the Web of Science (WoS) dataset, for people who own the WoS dataset.

Abstract partially extracted

For a (small?) subset of documents, only part of the abstract is extracted (e.g. PMID 24653627, 23357879, 27983391, 26762307, 28005260, 22351618, 23456555,18006916,25371446)

PMID to PMC API from MEDLINE cannot convert all provided PMIDs

The API here cannot convert all PMID inputs. I was trying to parse citations from a given set of PMIDs, but it only returned a subset of the PMIDs I provided. One possibility is to host pairs of PMIDs/PMCs somewhere on the cloud and provide a similar API, or a source file that users can use to convert PMIDs to PMCs.

Pubdate not returning correct year

There is a problem with some of the pubdate fields in the output. It is not pulling the correct year; instead, it splits the text on " " and grabs the first chunk, so you can end up with pubdate values like ["Summer", "Winter"]. Some example PMIDs this happens for are [28599031, 28599032, 28599033, etc.]. Could you please update it to match on a regular expression like "\d{4}" instead of splitting on the whitespace and grabbing the first chunk?
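
A sketch of the suggested fix, pulling the first four-digit run instead of the first whitespace-separated chunk:

import re

def extract_year(pubdate_text):
    match = re.search(r'\d{4}', pubdate_text)
    return match.group() if match else ''

extract_year('Summer 2017')  # returns '2017', not 'Summer'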

Output type in Pubmed OA is lxml.etree._ElementUnicodeResult

@daniel-acuna caught this error when we ran output = pp.parse_pubmed_xml('data/1472-6831-8-11.nxml') and then checked the output types, i.e. list(map(type, output.values()))

The output types are

[str,
 list,
 lxml.etree._ElementUnicodeResult,
 str,
 list,
 str,
 str,
 str,
 lxml.etree._ElementUnicodeResult,
 str]

We have to turn these types into proper strings.

Error while reading Medline gz file from path

In [234]: pp.parse_pubmed_xml('/home/docClass/files/pubmed/medline17n0330.xml.gz')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-234-bfaa6482d6d1> in <module>()
----> 1 pp.parse_pubmed_xml('/home/docClass/files/pubmed/medline17n0330.xml.gz')

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser-0.1.dev0-py2.7.egg/pubmed_parser/pubmed_oa_parser.pyc in parse_pubmed_xml(path, include_path)
    108         journal = ''
    109
--> 110     dict_article_meta = parse_article_meta(tree)
    111     pub_year_node = tree.find('//pub-date/year')
    112     pub_year = pub_year_node.text if pub_year_node is not None else ''

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser-0.1.dev0-py2.7.egg/pubmed_parser/pubmed_oa_parser.pyc in parse_article_meta(tree)
     56     """
     57     article_meta = tree.find('//article-meta')
---> 58     pmid_node = article_meta.find('article-id[@pub-id-type="pmid"]')
     59     pmc_node = article_meta.find('article-id[@pub-id-type="pmc"]')
     60     pub_id_node = article_meta.find('article-id[@pub-id-type="publisher-id"]')

AttributeError: 'NoneType' object has no attribute 'find'

python-2.7 import error: No module named 'request'

In my environment, Python 2.7 is not able to find the module urllib.request required by pubmed_web_parser.py.

If I replace the urllib import statement as suggested in this discussion, then Python 2 is able to find the request module:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

urlopen is used by the function parse_outgoing_citation_web in the same file. All HTTP requests in this file are made with the requests library, except the one made from parse_outgoing_citation_web. Would it be possible to use the requests library there as well, to avoid this import?

--mahmut

Parsers cannot read the xml file.

Below I've copied my Python session. I'm trying to parse MEDLINE data. I've done this with your PubMed and MEDLINE parsers on the listed machine, as well as on an Ubuntu server, with the same error. I've also generated a file using the R programming language; if you are familiar with that, the package I used is called easyPubMed and I used the batch_pubmed_download() function.

Anyway, I'd really like to use your code, especially as it links the authors with their affiliated institutions.
I'm new to XML parsing, so I have no idea what I'm doing in that respect.

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32

>>>import pubmed_parser as pp
>>>pp.parse_pubmed_xml('C:\\Users\\Work\\Downloads\\medline16n0902.xml')

Error: it was not able to read a path, a file-like object, or a string as an XML
Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\utils.py", line 14, in read_xml
    tree = etree.parse(path)
  File "src\lxml\lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81101)
  File "src\lxml\parser.pxi", line 1811, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:117832)
  File "src\lxml\parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:118179)
  File "src\lxml\parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:117091)
  File "src\lxml\parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:111637)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105612)
OSError: Error reading file 'medline16n0902.xml': failed to load external entity "medline16n0902.xml"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\medline_parser.py", line 354, in parse_medline_xml
    tree = read_xml(path)
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\utils.py", line 17, in read_xml
    tree = etree.fromstring(path)
  File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:78994)
  File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118325)
  File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:116883)
  File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:110870)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 635, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105655)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Remove extra space that appears when parsing abstract text

In both PubMed Open Access and MEDLINE, the parsed abstract text often includes extra blank spaces and newlines (\n). We can preprocess this later using the nltk whitespace tokenizer, but it would be nice to return the string without extra spaces right away.
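
A minimal normalization that could be applied before returning the abstract (a sketch, not the library's current behavior):

def normalize_whitespace(text):
    # str.split() with no argument splits on any whitespace run, including '\n'
    return ' '.join(text.split())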

Affiliation contains number

We select all strings under affiliation, and they come with numbers. There should be a way to remove them in the parser itself, or we might use a regular expression to remove them, as sketched below.
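
A sketch of the regular-expression approach; stripping every digit is a blunt heuristic and may also remove legitimate numbers, such as street numbers:

import re

def strip_affiliation_numbers(affiliation):
    # remove digit runs, e.g. superscript markers captured as plain text
    return re.sub(r'\d+', '', affiliation).strip()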

Parsing abstract section

There can be one or more sections in a single abstract. For example, a single abstract can have BACKGROUND, METHOD, and RESULT sections, or just Abstract and RESULT, etc. We want to add a tagger when we parse these abstracts, in case people want to analyze only a few parts of an abstract.

I'm still not sure which format we should return in this case.

Problem during parsing medline17m1086 file

Hey guys,
I'm getting the following error when I try to use your code.
What's the problem?

Best regards

AttributeError                            Traceback (most recent call last)
<ipython-input-2-9968ea229488> in <module>()
      1 ## settings
      2 file = "medline17n1086.xml";
----> 3 pubmed_dict = pp.parse_pubmed_xml(file) # dictionary output

/usr/local/lib/python3.5/dist-packages/pubmed_parser-0.1-py3.5.egg/pubmed_parser/pubmed_oa_parser.py in parse_pubmed_xml(path, include_path)
    108         journal = ''
    109 
--> 110     dict_article_meta = parse_article_meta(tree)
    111     pub_year_node = tree.find('//pub-date/year')
    112     pub_year = pub_year_node.text if pub_year_node is not None else ''

/usr/local/lib/python3.5/dist-packages/pubmed_parser-0.1-py3.5.egg/pubmed_parser/pubmed_oa_parser.py in parse_article_meta(tree)
     56     """
     57     article_meta = tree.find('//article-meta')
---> 58     pmid_node = article_meta.find('article-id[@pub-id-type="pmid"]')
     59     pmc_node = article_meta.find('article-id[@pub-id-type="pmc"]')
     60     pub_id_node = article_meta.find('article-id[@pub-id-type="publisher-id"]')

AttributeError: 'NoneType' object has no attribute 'find'

feature request: MeSH IDs

It'd be useful to be able to get MeSH IDs as well as heading names. I've used the following quick-and-dirty change to medline_parser.py, but there's probably a more elegant way.

mesh_terms_list = [m.find('DescriptorName').attrib.get('UI','') + ":" + m.find('DescriptorName').text for m in mesh.getchildren()]

XMLSyntaxError

In [41]: pp.parse_medline_xml('/home/docClass/files/pubmed/pubmed18n1040.xml.gz')
Error: it was not able to read a path, a file-like object, or a string as an XML
File "", line 1
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Source: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed18n1040.xml.gz

problem with parse paragraph function

I tried to run the following test. It seems like it should have worked. Any suggestions welcome. Thanks!!

from pubmed_oa_parser import parse_pubmed_paragraph
from os import system

if __name__=="__main__":
 
    system("wget 'https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:156895&metadataPrefix=pmc' -O ./data/PMC156895.nxml") 

    dicts_out = parse_pubmed_paragraph('data/PMC156895.nxml', all_paragraph=False)

Output

--2017-06-14 22:23:06--  https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:156895&metadataPrefix=pmc
Resolving www.ncbi.nlm.nih.gov... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to www.ncbi.nlm.nih.gov|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘./data/PMC156895.nxml’

./data/PMC156895.nxml                        [    <=>                                                                            ]  66.07K  67.0KB/s    in 1.0s    

2017-06-14 22:23:10 (67.0 KB/s) - ‘./data/PMC156895.nxml’ saved [67658]

<lxml.etree._ElementTree object at 0x1068596c8>
Traceback (most recent call last):
  File "scrape.py", line 9, in <module>
    dicts_out = parse_pubmed_paragraph('data/PMC156895.nxml', all_paragraph=False)
  File "/Users/fenwick/Library/Mobile Documents/com~apple~CloudDocs/other/web_scrape/pubmed_oa_parser.py", line 253, in parse_pubmed_paragraph
    dict_article_meta = parse_article_meta(tree)
  File "/Users/fenwick/Library/Mobile Documents/com~apple~CloudDocs/other/web_scrape/pubmed_oa_parser.py", line 57, in parse_article_meta
    pmid_node = article_meta.find('article-id[@pub-id-type="pmid"]')
AttributeError: 'NoneType' object has no attribute 'find'

Processing deleted Medline "citations" in NLM XML records

How should we process the deleted citations?

Sometimes the update XML comes with "deleted" citations (like this example), and it would be good to know which PMIDs were deleted.

For example, the stats for the update file medline16n0906.xml, available at ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz/medline16n0906_stats.html, say that there are 8809 citations and 353 delete citations. If we process the XML with pubmed_parser, we correctly get 8809 - 353 = 8456 records. Use the code below to test this:

# adapted from http://stackoverflow.com/questions/18772703/read-a-file-in-buffer-from-ftp-python
from ftplib import FTP
import gzip
import StringIO

def open_ftp_data(server, path, binary=True):
    ftp = FTP(server)
    ftp.login()

    data_io = StringIO.StringIO()
    def handle_data(more_data):
        data_io.write(more_data)
    if binary:
        resp = ftp.retrbinary("RETR " + path, callback=handle_data)
    else:
        resp = ftp.retrlines("RETR " + path, callback=handle_data)
    data_io.seek(0) # Go back to the start
    ftp.close()
    return data_io

import pubmed_parser as pp
binary_file = open_ftp_data('ftp.nlm.nih.gov', 'nlmdata/.medlease/gz/medline16n0906.xml.gz')
zippy = gzip.GzipFile(fileobj=binary_file)
medline_xml = zippy.read()
dict_records = pp.parse_medline_xml(medline_xml)
print("pubmed_parser records processed: {}".format(len(dict_records)))

Output

pubmed_parser records processed: 8456

We can find the deleted citations simply with

from lxml import etree
root = etree.fromstring(medline_xml)
print("Delete citations {}".format(len(root.xpath('//DeleteCitation/PMID'))))

Output

Delete citations 353

"can only parse strings" while reading PMC nxml

For several PMC files, I get an error while reading in the content; the list is below:

PMC4569614.nxml 
PMC5362956.nxml 
PMC4162892.nxml 
PMC4569628.nxml 
PMC5348996.nxml 
PMC5362810.nxml 
PMC5352161.nxml 
PMC4522714.nxml 
PMC5352154.nxml 
PMC5363022.nxml 
PMC4522719.nxml 
PMC5346358.nxml 

P.S.
I'll post the traceback when I run it again.
