pynlp


A pythonic wrapper for Stanford CoreNLP.

Description

This library provides a Python interface to Stanford CoreNLP built over corenlp_protobuf.

Installation

  1. Download Stanford CoreNLP from the official download page.
  2. Unzip the file and set your CORE_NLP environment variable to point to the directory.
  3. Install pynlp from pip
pip3 install pynlp

Quick Start

Launch the server

Launch the StanfordCoreNLPServer using the instructions given here. Alternatively, simply run the module.

python3 -m pynlp

By default, this launches the server on localhost using port 9000 and 4 GB of RAM for the JVM. Use the --help option for instructions on custom configuration.

Example

Let's start off with an excerpt from a CNN article.

text = ('GOP Sen. Rand Paul was assaulted in his home in Bowling Green, Kentucky, on Friday, '
        'according to Kentucky State Police. State troopers responded to a call to the senator\'s '
        'residence at 3:21 p.m. Friday. Police arrested a man named Rene Albert Boucher, who they '
        'allege "intentionally assaulted" Paul, causing him "minor injury". Boucher, 59, of Bowling '
        'Green was charged with one count of fourth-degree assault. As of Saturday afternoon, he '
        'was being held in the Warren County Regional Jail on a $5,000 bond.')

Instantiate annotator

Here we demonstrate the following annotators:

  • Annotators: tokenize, ssplit, pos, lemma, ner, entitymentions, coref, sentiment, quote, openie
  • Options: openie.resolve_coref
from pynlp import StanfordCoreNLP

annotators = 'tokenize, ssplit, pos, lemma, ner, entitymentions, coref, sentiment, quote, openie'
options = {'openie.resolve_coref': True}

nlp = StanfordCoreNLP(annotators=annotators, options=options)

Annotate text

The nlp instance is callable. Use it to annotate the text and return a Document object.

document = nlp(text)

print(document) # prints 'text'

Sentence splitting

Let's test the ssplit annotator. A Document object iterates over its Sentence objects.

for index, sentence in enumerate(document):
    print(index, sentence, sep=' )')

Output:

0) GOP Sen. Rand Paul was assaulted in his home in Bowling Green, Kentucky, on Friday, according to Kentucky State Police.
1) State troopers responded to a call to the senator's residence at 3:21 p.m. Friday.
2) Police arrested a man named Rene Albert Boucher, who they allege "intentionally assaulted" Paul, causing him "minor injury".
3) Boucher, 59, of Bowling Green was charged with one count of fourth-degree assault.
4) As of Saturday afternoon, he was being held in the Warren County Regional Jail on a $5,000 bond.

Named entity recognition

How about finding all the people mentioned in the document?

[str(entity) for entity in document.entities if entity.type == 'PERSON']

Output:

Out[2]: ['Rand Paul', 'Rene Albert Boucher', 'Paul', 'Boucher']

Named entities are also available at the sentence level.

first_sentence = document[0]
for entity in first_sentence.entities:
    print(entity, '({})'.format(entity.type))

Output:

GOP (ORGANIZATION)
Rand Paul (PERSON)
Bowling Green (LOCATION)
Kentucky (LOCATION)
Friday (DATE)
Kentucky State Police (ORGANIZATION)

Part-of-speech tagging

Let's find all the 'VB' tags in the first sentence. A Sentence object iterates over Token objects.

for token in first_sentence:
    if 'VB' in token.pos:
        print(token, token.pos)

Output:

was VBD
assaulted VBN
according VBG

Lemmatization

Using the same words, let's look at their lemmas.

for token in first_sentence:
    if 'VB' in token.pos:
       print(token, '->', token.lemma)

Output:

was -> be
assaulted -> assault
according -> accord

Coreference resolution

Let's use pynlp to find the first CorefChain in the text.

chain = document.coref_chains[0]
print(chain)

Output:

((GOP Sen. Rand Paul))-[id=4] was assaulted in (his)-[id=5] home in Bowling Green, Kentucky, on Friday, according to Kentucky State Police.
State troopers responded to a call to (the senator's)-[id=10] residence at 3:21 p.m. Friday.
Police arrested a man named Rene Albert Boucher, who they allege "(intentionally assaulted" Paul)-[id=16], causing him "minor injury.

In the string representation, coreferences are marked with parentheses and the referent with double parentheses. Each is also labelled with a coref_id. Let's take a closer look at the referent.

ref = chain.referent
print('Coreference: {}\n'.format(ref))

for attr in 'type', 'number', 'animacy', 'gender':
    print(attr,  getattr(ref, attr), sep=': ')

# Note that we can also index coreferences by id
assert chain[4].is_referent

Output:

Coreference: Police

type: PROPER
number: SINGULAR
animacy: ANIMATE
gender: UNKNOWN

Quotes

Extracting quotes from the text is simple.

print(document.quotes)

Output:

[<Quote: "intentionally assaulted">, <Quote: "minor injury">]

TODO (annotation wrappers):

  • ssplit
  • ner
  • pos
  • lemma
  • coref
  • quote
  • quote.attribution
  • parse
  • depparse
  • entitymentions
  • openie
  • sentiment
  • relation
  • kbp
  • entitylink
  • 'options' examples, e.g. openie.resolve_coref

Saving annotations

Write

A pynlp document can be saved as a byte string.

with open('annotation.dat', 'wb') as file:
    file.write(document.to_bytes())

Read

To load a pynlp document, instantiate a Document with the from_bytes class method.

from pynlp import Document

with open('annotation.dat', 'rb') as file:
    document = Document.from_bytes(file.read())

pynlp's People

Contributors

sina-al

pynlp's Issues

ImportError: cannot import name 'client'

I have installed this package with pip3.
There seems to be a circular dependency between the modules.
I get the following exception:

Traceback (most recent call last):
  File "SemEval-2013.py", line 12, in <module>
    from pynlp import StanfordCoreNLP
  File "/inf/pynlp/pynlp/__init__.py", line 1, in <module>
    from .client import stanford_core_nlp
  File "/inf/pynlp/pynlp/client.py", line 5, in <module>
    from pynlp.wrapper import Document
  File "/inf/pynlp/pynlp/wrapper.py", line 2, in <module>
    from pynlp import client
ImportError: cannot import name 'client'

Using regexner

Is it possible to use regexner with pynlp?
Thank you!

Sentiment score?

Is it possible to return the (integer) sentiment score, rather than the label in Sentence.sentiment?

Enhancement

Hi
Any plans for this class RelationExtractorAnnotator?

Thanks

Less precise

Hi
Can you check on some other piece of text? After updating the module I get far fewer entities, and they are less precise.
Thanks for the effort

typo error on StanfordCoreNLP __init__

Traceback (most recent call last):
  File "main_core.py", line 3, in <module>
    from pynlp import StanfordCoreNLP
  File "/Users/avelino/.virtualenvs/nuveo.nlp/lib/python2.7/site-packages/pynlp/__init__.py", line 1, in <module>
    from .client import StanfordCoreNLP
  File "/Users/avelino/.virtualenvs/nuveo.nlp/lib/python2.7/site-packages/pynlp/client.py", line 66
    def __init__(self, properties: Properties):
                                 ^
SyntaxError: invalid syntax

def __init__(self, properties: Properties):

OpenIE support

Hi,
In the examples the "openie" annotator was used, but the outputs still do not include the openIE results. When are you planning to add openIE support to the outputs?

In your source code I think it will be developed under the relations function defined in the Sentence class.

SUTime functionality

According to Stanford's website, SUTime is provided automatically in corenlp. Is it included in this wrapper as well? If so, is there any documentation or can anyone provide an example as to how to use it (specifically to go from tagged entities to storing/printing a TIMEX3 object)?

DecodeError: Tag had invalid wire type

When running the analysis on a long list of strings, I always get this error after successfully processing a number of strings:

google.protobuf.message.DecodeError: Tag had invalid wire type.

I'm crawling random webpages, so it doesn't seem to matter what the actual contents of the string are. I'm using BeautifulSoup to extract just the text, and it's coerced into a string to ensure it's unicode.

From what I've read about this error, it seems it occurs when trying to write over an existing file. I think it would be ideal if I could reset the CoreNLP server after each iteration.

My current workflow is

## start corenlp server from command line
$ python3 -m pynlp

In python:

from pynlp import StanfordCoreNLP
annotators = 'tokenize, ssplit, pos, lemma, ner, entitymentions, coref, sentiment'
nlp = StanfordCoreNLP(annotators=annotators)
document = nlp(str(line['text'])) ## line['text'] is a line of unicode text

The traceback is:


Traceback (most recent call last):
  File "/Users/adamg/Dropbox/Northwestern/Classes/Text_Analytics/homework/ta-hw4/extract_debates.py", line 188, in <module>
    debate_sentiment_dct = analyze_utterances(analysis.get_lines())
  File "/Users/adamg/Dropbox/Northwestern/Classes/Text_Analytics/homework/ta-hw4/sentiment.py", line 14, in analyze_utterances
    document = nlp(str(line['text']))
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/pynlp/client.py", line 65, in __call__
    return self.annotate(text)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/pynlp/client.py", line 72, in annotate
    return Document(_annotate(text, self._annotators, self._options, self._port))
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/pynlp/client.py", line 34, in _annotate
    return from_bytes(_annotate_binary(text, annotators, options, port))
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/pynlp/client.py", line 39, in from_bytes
    core.parseFromDelimitedString(doc, protobuf)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/corenlp_protobuf/__init__.py", line 18, in parseFromDelimitedString
    obj.ParseFromString(buf[offset+pos:offset+pos+size])
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/message.py", line 185, in ParseFromString
    self.MergeFromString(serialized)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/internal/python_message.py", line 1069, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/internal/python_message.py", line 1095, in InternalParse
    new_pos = local_SkipField(buffer, new_pos, end, tag_bytes)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/internal/decoder.py", line 850, in SkipField
    return WIRETYPE_TO_SKIPPER[wire_type](buffer, pos, end)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/internal/decoder.py", line 820, in _RaiseInvalidWireType
    raise _DecodeError('Tag had invalid wire type.')
google.protobuf.message.DecodeError: Tag had invalid wire type.

On the command line, the CoreNLP server raises the error:


java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer$CoreNLPHandler.handle(StanfordCoreNLPServer.java:662)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
    at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
    at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
    at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Is there an obvious cause for this error? Alternatively, is there a way to restart the CoreNLP server after each loop within python?
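One generic way to get a fresh server per iteration from within Python is to manage the server process with subprocess, terminating and restarting it around each unit of work. The sketch below is not part of pynlp's API; server_cmd is a placeholder for whatever command launches the CoreNLP server in your environment, and in practice you would also wait for the server port to accept connections before annotating.

```python
import subprocess


def run_with_fresh_process(server_cmd, items, work):
    """Apply `work` to each item while a freshly started background
    process is running, terminating it after every iteration."""
    results = []
    for item in items:
        proc = subprocess.Popen(server_cmd)  # e.g. the server launch command
        try:
            results.append(work(item))
        finally:
            proc.terminate()
            proc.wait()
    return results
```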

ModuleNotFoundError: No module named 'pynlp'

Hello,

I have followed the instructions in the README and installed the library via pip3 install pynlp.

When I go to the prompt and execute from pynlp import StanfordCoreNLP I get the following error:

>>> from pynlp import StanfordCoreNLP
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pynlp'

Is there something I am doing wrong?

Thank you for your assistance,

Output JSON files

Hi,
any plan to write the result into a JSON file with the same format as the JSON file outputFormat in the
CoreNLP?

Does pynlp keep the original tag type "O" which is the non-entity part?

Hello,

Does pynlp keep the original tag type "O" which is the non-entity part?

For example,
sentence = "Nora Jani, a single person, Matt Jani and Susan Jani, husband and wife"

Expecting result:
[('Nora Jani', 'PERSON'), ('a single person', 'O'), ('Matt Jani', 'PERSON'), ('and', 'O'), ('Susan Jani', 'PERSON'), ('husband and wife', 'O')]

Thanks.
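pynlp's wrappers expose entity mentions rather than per-span 'O' tags, but the gaps between mentions can be reconstructed from the sentence text. A minimal sketch (the helper below is hypothetical, not part of pynlp) that produces the expected result above from the sentence and its (text, type) entity pairs:

```python
def with_o_spans(sentence, entities):
    """Interleave (text, 'O') spans between known entity mentions.

    `entities` is a list of (mention_text, entity_type) pairs in
    sentence order, e.g. as collected from sentence.entities.
    """
    result, cursor = [], 0
    for text, etype in entities:
        start = sentence.find(text, cursor)
        gap = sentence[cursor:start].strip(' ,')
        if gap:
            result.append((gap, 'O'))  # non-entity span between mentions
        result.append((text, etype))
        cursor = start + len(text)
    tail = sentence[cursor:].strip(' ,')
    if tail:
        result.append((tail, 'O'))  # trailing non-entity span
    return result
```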

Error running python3 -m pynlp

When I run the command above, I get the error:

adamg:~ adamg$ python3 -m pynlp
/usr/local/opt/python3/bin/python3.5: Error while finding spec for 'pynlp.__main__' (<class 'ImportError'>: No module named 'corenlp_protobuf'); 'pynlp' is a package and cannot be directly executed

I have set my CORE_NLP variable, and started a new Terminal session.

UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c'

I'm trying to use pynlp to process a bunch of text files, but I'm having trouble with one of them (crashpynlp.txt). Using the following script

from pynlp import StanfordCoreNLP

with open("crashpynlp.txt", 'r') as file:
    text = file.read()
    nlp = StanfordCoreNLP(annotators="tokenize, ssplit, pos, lemma, ner")
    doc = nlp(text)

I'm getting the following traceback

  File "testPynlp.py", line 6, in <module>
    doc = nlp(text)
  File "/home/fernio/.local/lib/python3.6/site-packages/pynlp/client.py", line 132, in __call__
    return self.annotate_one(texts)
  File "/home/fernio/.local/lib/python3.6/site-packages/pynlp/client.py", line 138, in annotate_one
    return Document(self._annotate(text))
  File "/home/fernio/.local/lib/python3.6/site-packages/pynlp/client.py", line 135, in _annotate
    return self._client.post(url=self._address, data=text, params=(('properties', str(self._properties)),))
  File "/home/fernio/.local/lib/python3.6/site-packages/requests/sessions.py", line 559, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/fernio/.local/lib/python3.6/site-packages/pynlp/client.py", line 81, in request
    response = super(CoreNLPClient, self).request(*args, **kwargs)
  File "/home/fernio/.local/lib/python3.6/site-packages/requests/sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/fernio/.local/lib/python3.6/site-packages/requests/sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "/home/fernio/.local/lib/python3.6/site-packages/requests/adapters.py", line 445, in send
    timeout=timeout
  File "/home/fernio/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/home/fernio/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1284, in _send_request
    body = _encode(body, 'body')
  File "/usr/lib/python3.6/http/client.py", line 161, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 39: Body ('“') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.
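The error message itself suggests the workaround: encode the text to UTF-8 bytes before it is handed to the HTTP client, since requests encodes str bodies as Latin-1 by default. A minimal illustration of the encoding step (applying it inside pynlp would require patching the client, which is beyond this sketch):

```python
# A curly quote (U+201C) that Latin-1 cannot represent:
text = 'He said \u201chello\u201d'

# Encoding to UTF-8 produces a bytes body that needs no further
# encoding, sidestepping the Latin-1 default for str bodies.
body = text.encode('utf-8')
```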

TypeError on running `python -m pynlp`

I'm getting the following error. It looks like the protobuf package is out of date?

$ python3 -m pynlp
Traceback (most recent call last):
  File "/Users/sooheon/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/Users/sooheon/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/Users/sooheon/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/Users/sooheon/.pyenv/versions/nlp/lib/python3.6/site-packages/pynlp/__init__.py", line 1, in <module>
    from .client import StanfordCoreNLP
  File "/Users/sooheon/.pyenv/versions/nlp/lib/python3.6/site-packages/pynlp/client.py", line 3, in <module>
    from .wrapper import Document
  File "/Users/sooheon/.pyenv/versions/nlp/lib/python3.6/site-packages/pynlp/wrapper.py", line 1, in <module>
    from pynlp.protobuf import from_bytes, to_bytes
  File "/Users/sooheon/.pyenv/versions/nlp/lib/python3.6/site-packages/pynlp/protobuf/__init__.py", line 5, in <module>
    from .CoreNLP_pb2 import Document
  File "/Users/sooheon/.pyenv/versions/nlp/lib/python3.6/site-packages/pynlp/protobuf/CoreNLP_pb2.py", line 203, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __init__() got an unexpected keyword argument 'file'
