pyannotation's Introduction

Python Linguistic Annotation Library

PyAnnotation is a Python library to access and manipulate linguistically annotated corpus files. Currently the only supported file format is Elan XML; support for Kura XML and Toolbox files is planned for future releases. A Corpus Reader API is provided to support statistical analysis within the Natural Language Toolkit (NLTK). The software is licensed under the GNU General Public License.

REQUIREMENTS

You need to install the following packages:

INSTALLATION

To install PyAnnotation on Windows, run the .exe file you downloaded and follow the instructions in the setup process. To install PyAnnotation on Linux, Unix and other platforms, unpack the archive and run "setup.py" on the command line. Change to the directory into which you downloaded the package and unpack it:

$ tar xzf pyannotation-x.y.z.tar.gz
$ cd pyannotation-x.y.z

Then, to install the package into your local Python installation (you may need root privileges):

$ python setup.py install

The installation process will give you feedback and should finish without errors.

BASIC USAGE

Here are a few examples of what you can do with PyAnnotation. All the examples process Elan files stored in one directory, here "example_data", which is part of the package you downloaded. The package also contains a sample script, "example1.py", that runs all the commands presented here, so you can call "python example1.py" to see all the results on your own computer at once. To begin, start a Python interpreter:

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

First, import the corpus reader module and the data module, which defines the file type constants:

>>> import pyannotation.corpusreader
>>> import pyannotation.data

Then create a corpus reader and load a file into your corpus. The second argument to the addFile method is the file type (here EAF, for .eaf files):

>>> cr = pyannotation.corpusreader.GlossCorpusReader()
>>> cr.addFile("example_data/turkish.eaf", pyannotation.data.EAF)

To get all tagged sentences that contain the gloss "ANOM" (here, the tag of a word is the list of its morphemes, each paired with its glosses, stored in a kind of tree):

>>> result = [s for s in cr.tagged_sents()
...           if any('ANOM' in gloss
...                  for (word, tag) in s
...                  for (morpheme, gloss) in tag)]
>>> print result
[[('eve', [('ev', ['home']), ('e', ['DIR'])]), ('geldi\xc4\x9fimde', ...
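
The nested shape of a tagged sentence may be easier to see on a tiny hand-built example. The following is a minimal sketch that does not need PyAnnotation installed. The sample data and the helper name sents_with_gloss are made up for illustration; they only mirror the structure of the output shown above.

```python
# Hypothetical sample data in the shape printed above:
# sentence -> [(word, [(morpheme, [gloss, ...]), ...]), ...]
sample_sents = [
    [("eve", [("ev", ["home"]), ("e", ["DIR"])]),
     ("geldim", [("gel", ["come"]), ("di", ["PAST"]), ("m", ["1SG"])])],
    [("ev", [("ev", ["home"])])],
]


def sents_with_gloss(tagged_sents, wanted):
    """Return each sentence that has at least one morpheme glossed `wanted`."""
    return [s for s in tagged_sents
            if any(wanted in glosses
                   for (_word, morphemes) in s
                   for (_morpheme, glosses) in morphemes)]


matches = sents_with_gloss(sample_sents, "DIR")
print(len(matches))
```

Because the filter is an any() test over the whole sentence, a sentence is returned once no matter how many of its morphemes carry the gloss.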

Only the sentences of the result:

>>> sents = [[w for (w, t) in s] for s in result]
>>> print sents
[['eve', 'geldi\xc4\x9fimde', 'ya\xc4\x9fmur', ...

A word list from the result:

>>> tagged_words = [(w,t) for s in result for (w, t) in s]
>>> print tagged_words
[('eve', [('ev', ['home']), ('e', ['DIR'])]), ('geldi\xc4\x9fimde', ...

A list of morphemes and their tags from the result:

>>> tagged_morphemes = [(m,g) for s in result for (w,t) in s for (m,g) in t]
>>> print tagged_morphemes
[('ev', ['home']), ('e', ['DIR']), ('gel', ['come']), ('di\xc4\x9f', ...
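
A list of (morpheme, glosses) pairs like this one lends itself to simple statistics. As a minimal sketch, assuming data in the shape shown above, the glosses can be counted with collections.Counter from the standard library (available from Python 2.7 on, so not in the 2.6 session used here). The sample pairs below are hypothetical and stand in for the real result.

```python
from collections import Counter

# Hypothetical (morpheme, glosses) pairs in the shape printed above.
sample_morphemes = [
    ("ev", ["home"]), ("e", ["DIR"]),
    ("gel", ["come"]), ("ev", ["home"]),
]

# Count how often each gloss occurs across all morphemes.
gloss_counts = Counter(g for (_morpheme, glosses) in sample_morphemes
                       for g in glosses)

print(gloss_counts.most_common(1))
```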

Another query: find all sentences that contain a certain word (here: "home") in their translation:

>>> import re
>>> result2 = [(s, translations)
...            for (s, translations) in cr.tagged_sents_with_translations()
...            for t in translations if re.search(r"\bhome\b", t)]
>>> print result2
[([('d\xc3\xbcn', [('d\xc3\xbcn', ['yesterday'])]), ('ak\xc5\x9fam', ...
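
Note that the comprehension above adds a sentence to the result once for every translation that matches. If each sentence should appear only once, one option is to test the translations with any(). The following is a minimal sketch on hypothetical hand-built data that mirrors the (sentence, translations) pairs shown above; it does not call PyAnnotation.

```python
import re

# Hypothetical (tagged_sentence, translations) pairs, as in the output above.
sample = [
    ([("eve", [("ev", ["home"]), ("e", ["DIR"])])],
     ["to the home", "towards home"]),
    ([("evsiz", [("ev", ["home"]), ("siz", ["without"])])],
     ["homeless"]),
]

home = re.compile(r"\bhome\b")

# any() yields each matching sentence once, even when several of its
# translations match; \b keeps "homeless" from matching.
hits = [(s, translations) for (s, translations) in sample
        if any(home.search(t) for t in translations)]

print(len(hits))
```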

Last but not least, you can use your Elan corpus with NLTK. For example, to get the concordance for the word "bir" (Turkish for "one"):

>>> import nltk.text
>>> text = nltk.text.Text(cr.words())
>>> text.concordance('bir') # find concordance for Turkish "bir"
Building index...
Displaying 2 of 2 matches:
 daha rahat ederdim çünkü içimden bir ses yeter artık çalışma derken bi
ir ses yeter artık çalışma derken bir diğer ses de çalışmam gerektiğin
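
To illustrate what a concordance is, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation; the function name kwic and its width parameter are made up for this example, and the word list is copied from the output above.

```python
# -*- coding: utf-8 -*-


def kwic(words, target, width=3):
    """Return (left, word, right) context triples for each hit of `target`."""
    hits = []
    for i, w in enumerate(words):
        if w.lower() == target.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits


# Word list taken from the concordance output above.
words = "içimden bir ses yeter artık çalışma derken bir diğer ses".split()
for left, w, right in kwic(words, "bir"):
    print("%s [%s] %s" % (left, w, right))
```

The sketch counts context in words, whereas the NLTK output above is cut to a fixed character width.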

Try out for yourself what you can do with the data. PyAnnotation's corpus reader for .eaf files provides the following data access methods:

corpus.morphemes()
corpus.words()
corpus.sents()
corpus.sents_with_translations()

corpus.tagged_morphemes()
corpus.tagged_words()
corpus.tagged_sents()
corpus.tagged_sents_with_translations()

More documentation is available at:

http://www.cidles.eu/doc/pyannotation/index.html

SITE

The website of this project is:

http://www.cidles.eu/ltll/poio-pyannotation

pyannotation's People

Contributors

arlopes

Forkers

kristiank

pyannotation's Issues

use of deprecated module regex

File "/usr/local/lib/python2.6/dist-packages/pyannotation/data.py", line 12

The package imports both module 're' and module 'regex', and the code later uses regex.

FIX: remove the import of module 'regex' and replace all occurrences of regex with re.
