grobid

Python library for serializing GROBID TEI XML to dataclasses

Installation

Use pip to install:

$ pip install grobid
$ pip install grobid[json] # for JSON serializable dataclass objects

You can also download the .whl file from the release section:

$ pip install *.whl

Usage

Client

To convert an academic PDF to a TEI XML file, we use GROBID's REST services, specifically the processFulltextDocument endpoint.

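from pathlib import Path

# Assumed module path: the original snippet uses Client and
# GrobidClientError without showing where they are imported from.
from grobid.client import Client, GrobidClientError
from grobid.models.form import Form, File
from grobid.models.response import Response

pdf_file = Path("<your-academic-article>.pdf")
with open(pdf_file, "rb") as file:
    # Wrap the raw PDF bytes in the form expected by the endpoint
    form = Form(
        file=File(
            payload=file.read(),
            file_name=pdf_file.name,
            mime_type="application/pdf",
        )
    )
    c = Client(base_url="<base-url>", form=form)
    try:
        xml_content = c.sync_request().content  # TEI XML file in bytes
    except GrobidClientError as e:
        print(e)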

where base-url is the URL of the GROBID REST service.

You can use https://cloud.science-miner.com/grobid/ for testing.

The Form class supports most of the optional parameters of the processFulltextDocument endpoint.
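For orientation, the sketch below shows roughly what a request to the processFulltextDocument endpoint looks like when built by hand with the standard library. It is not this library's implementation. The helper names build_multipart and build_request are hypothetical, the /api/processFulltextDocument path and the "input" file field follow GROBID's REST documentation, and whether this library's base_url should already include /api is not stated in this README, so the URL handling here is an assumption.

```python
import urllib.request
import uuid


def build_multipart(fields, file_name, payload, mime_type="application/pdf"):
    """Encode text fields plus one file part named "input" as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # One plain-text part per optional parameter
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    # The PDF itself goes in the "input" field
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="input"; filename="{file_name}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode()
    parts.append(header + payload + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(base_url, file_name, payload, **options):
    """Build (but do not send) a POST request for processFulltextDocument."""
    body, content_type = build_multipart(options, file_name, payload)
    url = base_url.rstrip("/") + "/api/processFulltextDocument"  # assumed URL layout
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )


# consolidateHeader and includeRawCitations are optional GROBID parameters
req = build_request(
    "http://localhost:8070",
    "article.pdf",
    b"%PDF-1.4 placeholder bytes",
    consolidateHeader=1,
    includeRawCitations=1,
)
# With a running server, urllib.request.urlopen(req).read() would return TEI XML bytes.
```

In practice the Client and Form classes cover this for you; the sketch only illustrates how the optional parameters travel as form fields next to the file.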

Parser

To serialize the XML content, use the Parser class to create dataclass objects.

Not all of the GROBID annotation guidelines are met yet, but compliance is a goal. See #1.

from grobid.tei import Parser

xml_content: bytes  # TEI XML bytes, e.g. the result of the Client request
parser = Parser(xml_content)
article = parser.parse()
article.to_json()  # raises RuntimeError if the 'json' extra is not installed

where xml_content is the same as in the Client section.
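To give a feel for what turning TEI XML into dataclasses involves, here is a minimal standalone sketch using only the standard library. It is not how Parser is implemented, and HeaderSketch and parse_header are hypothetical names; the only things taken as given are the TEI namespace and the usual layout of a GROBID header (titleStmt for the title, author/persName under sourceDesc).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


@dataclass
class HeaderSketch:  # hypothetical stand-in, not one of the library's models
    title: str
    authors: list[str] = field(default_factory=list)


def parse_header(xml_content: bytes) -> HeaderSketch:
    """Extract the title and author names from a TEI header."""
    root = ET.fromstring(xml_content)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = (title_el.text or "").strip() if title_el is not None else ""
    authors = []
    for pers in root.findall(".//tei:sourceDesc//tei:author/tei:persName", TEI_NS):
        # Join forename(s) and surname in document order
        names = [el.text.strip() for el in pers if el.text and el.text.strip()]
        authors.append(" ".join(names))
    return HeaderSketch(title=title, authors=authors)


sample = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title level="a" type="main">A Sample Title</title></titleStmt>
    <sourceDesc><biblStruct><analytic>
      <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
    </analytic></biblStruct></sourceDesc>
  </fileDesc></teiHeader>
</TEI>"""

header = parse_header(sample)
print(header)
```

The real Parser covers far more of the document (body sections, citations and so on); the sketch handles the header only.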

Alternately, you can load the XML from a file:

from grobid.tei import Parser

with open("<your-academic-article>.xml", "rb") as xml_file:
    xml_content = xml_file.read()
    parser = Parser(xml_content)
    article = parser.parse()
    article.to_json()  # raises RuntimeError if the 'json' extra is not installed

The to_json method uses orjson to serialize the dataclasses into JSON. orjson isn't installed by default; to get it, use pip install grobid[json].
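If installing the extra is not an option, dataclass objects can usually be serialized with the standard library instead. The sketch below assumes the parsed objects are plain dataclasses whose fields are JSON-compatible, which this README does not guarantee; AuthorSketch, ArticleSketch and to_json_stdlib are hypothetical names used only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuthorSketch:  # hypothetical stand-in for a parsed author
    name: str


@dataclass
class ArticleSketch:  # hypothetical stand-in for a parsed article
    title: str
    authors: list[AuthorSketch] = field(default_factory=list)


def to_json_stdlib(obj) -> str:
    """Serialize a dataclass instance (including nested dataclasses) to a JSON string."""
    return json.dumps(asdict(obj), ensure_ascii=False)


article = ArticleSketch("A Sample Title", [AuthorSketch("Ada Lovelace")])
json_text = to_json_stdlib(article)
print(json_text)
```

Note that json.dumps returns a str, whereas orjson.dumps returns bytes, so the two approaches are not drop-in replacements for each other.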

License

MIT

Contributing

You are welcome to add missing features by submitting a PR. However, I won't be accepting any requests other than those for GROBID annotation compliance.

Disclaimer

This module was originally part of a group university project. However, all the code and tests were authored by me.

Contributors

ram02z