
cidles / poio-api


Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation Framework” (GrAF), allow unified access to linguistic data from a wide range of sources.

Home Page: http://media.cidles.eu/poio/poio-api

License: Apache License 2.0

Python 98.56% TeX 1.44%

poio-api's Introduction

Poio API


For documentation, please visit http://media.cidles.eu/poio/poio-api/

License

Poio API source code is distributed under the Apache 2.0 License.

Poio API documentation is distributed under the Creative Commons Attribution 3.0 Unported.

poio-api's People

Contributors

arlopes, arne-cl, pbouda, pmanha, ricafett, togg1


poio-api's Issues

Refine the mapping handling

Refine the handling of the case where we have multiple destination tags.
This happens in the mandinka to typecraft conversion, for instance.

Failing tests

There are currently one failure and one error when I run the tests with Python 3:

.......................................................F......E.............................
======================================================================
ERROR: poioapi.tests.test_annotationgraph.TestAnnotationGraph.test_for_node_duplicates
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/pbouda/Projects/git-github/poio-api/src/poioapi/tests/test_annotationgraph.py", line 105, in test_for_node_duplicates
    trimmed = set(original)
TypeError: unhashable type: 'Node'

======================================================================
FAIL: poioapi.tests.io.test_typecraft.TestWriter.test_conversion
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/pbouda/Projects/git-github/poio-api/src/poioapi/tests/io/test_typecraft.py", line 116, in test_conversion
    assert os.path.getsize(outputfile) == os.path.getsize(originalfile)
AssertionError

----------------------------------------------------------------------
Ran 92 tests in 6.221s

FAILED (errors=1, failures=1)

Please fix the tests.
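
The TypeError in the first traceback is consistent with a Python 3 rule: a class that defines __eq__ without also defining __hash__ becomes unhashable, so set(original) fails. Whether the real Node class does this is an assumption; the class below is a hypothetical stand-in, not the graf implementation. A minimal sketch of the fix:

```python
class Node:
    """Hypothetical stand-in for a graph node; not the real graf Node."""

    def __init__(self, node_id):
        self.id = node_id

    def __eq__(self, other):
        return isinstance(other, Node) and self.id == other.id

    # Without this method, Python 3 sets __hash__ to None for classes
    # that define __eq__, and set(nodes) raises TypeError.
    def __hash__(self):
        return hash(self.id)


def has_duplicates(nodes):
    """Return True if at least two nodes in the sequence compare equal."""
    nodes = list(nodes)
    return len(set(nodes)) != len(nodes)
```

If the node class cannot be changed, the test could instead build the set from a hashable key, for example the node ids.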

Conversion fails when no media is given in Elan file

  File "/home/snordhoff/workspace/virtualenvironments/poio/local/lib/python2.7/site-packages/poioapi/io/graf.py", line 334, in parse
    self.primary_data = self.parser.get_primary_data()
  File "/home/snordhoff/workspace/virtualenvironments/poio/local/lib/python2.7/site-packages/poioapi/io/elan.py", line 292, in get_primary_data
    primary_data.type = poioapi.io.graf.UNKNOWN
AttributeError: 'module' object has no attribute 'UNKNOWN'

When I set primary_data.type = poioapi.io.graf.VIDEO in line 292 for debugging purposes, a second error appears:

  File "/home/snordhoff/workspace/virtualenvironments/poio/local/lib/python2.7/site-packages/poioapi/io/graf.py", line 555, in _add_primary_data
    self.standoffheader.datadesc.primaryData = {'loc': loc,
UnboundLocalError: local variable 'loc' referenced before assignment

suggestions:

  • define type UNKNOWN
  • initialize loc with '' before the if-statements starting in line 548
  • (even better: implement error handling with useful error messages)
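
The first two suggestions could look roughly like the sketch below. Everything in it is hypothetical: the constant values, the extension table, the "type" key and the function name primary_data_entry are illustrations, not the actual poioapi.io.graf code. It only shows the pattern of defining an UNKNOWN type and initialising loc before any branching.

```python
import os

# Hypothetical type constants; the actual values in poioapi.io.graf may differ.
VIDEO, AUDIO, UNKNOWN = "video", "audio", "unknown"

# Hypothetical lookup table from file extension to media type.
_EXTENSION_TYPES = {".mp4": VIDEO, ".mpg": VIDEO, ".wav": AUDIO, ".mp3": AUDIO}


def primary_data_entry(media_path):
    """Build a header entry that is valid even when no media file is given."""
    loc = ""             # initialised up front, so it can never be unbound
    data_type = UNKNOWN  # fallback when the media type cannot be determined
    if media_path:
        loc = os.path.basename(media_path)
        extension = os.path.splitext(loc)[1].lower()
        data_type = _EXTENSION_TYPES.get(extension, UNKNOWN)
    return {"loc": loc, "type": data_type}
```

With this shape, an Elan file without a media descriptor yields an entry with an empty location and the UNKNOWN type instead of raising an exception.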

Tests throw error on Windows, Python 3.3

h:\ProjectsWin\git-github\poio-api>c:\Python33\Scripts\nosetests.exe
E...............................................................
======================================================================
ERROR: poioapi.tests.io.test_brat.TestBrat.test_write
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Python33\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "h:\ProjectsWin\git-github\poio-api\src\poioapi\tests\io\test_brat.py", line 51, in test_write
    assert len(file_ann.readlines()) == len(file_ann_res.readlines())
  File "c:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 13: character maps to <undefined>

----------------------------------------------------------------------
Ran 64 tests in 3.660s

FAILED (errors=1)
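
The traceback shows the file being decoded with cp1252, the Windows locale default, rather than with an explicit encoding. Assuming the test files are UTF-8 (not confirmed in this report), a minimal sketch of reading them with an explicit encoding follows. The helper count_lines and the sample content are hypothetical.

```python
import io
import os
import tempfile


def count_lines(path, encoding="utf-8"):
    """Count lines, decoding with an explicit encoding instead of the locale default."""
    with io.open(path, "r", encoding=encoding) as handle:
        return len(handle.readlines())


# Demonstration: "Á" is encoded in UTF-8 as the bytes C3 81, and byte 0x81
# is undefined in Python's cp1252 codec.
sample = u"T1\tÁrea 0 4\nT2\tword 5 9\n"
fd, sample_path = tempfile.mkstemp(suffix=".ann")
with os.fdopen(fd, "wb") as handle:
    handle.write(sample.encode("utf-8"))

utf8_line_count = count_lines(sample_path)
try:
    count_lines(sample_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True
os.remove(sample_path)
```

io.open accepts the encoding argument on both Python 2 and Python 3, so the same call works across the versions the tests are run with.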

Regression test for Elan to GrAF conversion

There was a bug in the Elan parser that put integer values in an internal data structure that could not be serialized by the GrAF parser. Here is the commit that fixes the problem:

faef6d4

It was presumably caused by the creation of "missing" timeslots. There is a demo file available on SkyDrive/Nordhoff. Write a regression test for this problem.

preserve correct ordering of elements when importing from eaf

Currently, the eaf-import seems to erroneously assume that ELAN ANNOTATION_IDs follow the linear order. When writing a graf-xml from ELAN import, all edges are ordered according to their ANNOTATION_ID.

However, it is possible to edit elan files non-linearly. Suppose you have an item fefo, which has ANNOTATION_ID 23. If you later add a prefix ba- to it, it will have a higher ANNOTATION_ID (24, or more if other content was added in between). So, this yields

ba- fefo
24 23

When converting this to graf, it would yield

<node xml:id="morphemes..morphemes..na0023"/>
<edge from="words..words..na0001" to="morphemes..morphemes..na0023" xml:id="ea0023"/>
<a as="morphemes" label="morphemes" ref="morphemes..morphemes..na0023" xml:id="a0023">
    <fs>
        <f name="annotation_value">fefo</f>
    </fs>
</a>
<node xml:id="morphemes..morphemes..na0024"/>
<edge from="words..words..na352" to="morphemes..morphemes..na0024" xml:id="ea0024"/>
<a as="morphemes" label="morphemes" ref="morphemes..morphemes..na0024" xml:id="a0024">
    <fs>
        <f name="annotation_value">ba-</f>
    </fs>
</a>

Note that ba- appears after fefo.

As far as I can see, there is no way to reconstruct the original order.

For a real case, see http://www.glottotopia.org/solr/athagram/browse?&q=id%3Atau-811-24, where the items nay of huhnay and diil of natdindiil do not appear at the correct place. This example was converted from Elan via the poio-api.

Suggestions: take advantage of the PREVIOUS_ANNOTATION attribute in ELAN REF_ANNOTATIONs and order REF_ANNOTATIONS accordingly when writing out the graf nodes.
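
The suggested ordering could be sketched as follows. The EAF fragment is a hypothetical, heavily reduced example built from the ba- / fefo case above (the ids a1, a23 and a24 are illustrative), and ordered_children is not an existing poio-api function. It follows the PREVIOUS_ANNOTATION chain among the REF_ANNOTATIONs that share a parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical EAF fragment: "ba-" (a24) was added after "fefo" (a23) but
# precedes it, so a23 points back to a24 via PREVIOUS_ANNOTATION.
EAF_FRAGMENT = """
<TIER TIER_ID="morphemes">
  <ANNOTATION>
    <REF_ANNOTATION ANNOTATION_ID="a23" ANNOTATION_REF="a1" PREVIOUS_ANNOTATION="a24">
      <ANNOTATION_VALUE>fefo</ANNOTATION_VALUE>
    </REF_ANNOTATION>
  </ANNOTATION>
  <ANNOTATION>
    <REF_ANNOTATION ANNOTATION_ID="a24" ANNOTATION_REF="a1">
      <ANNOTATION_VALUE>ba-</ANNOTATION_VALUE>
    </REF_ANNOTATION>
  </ANNOTATION>
</TIER>
"""


def ordered_children(tier, parent_id):
    """Return the REF_ANNOTATIONs of one parent in PREVIOUS_ANNOTATION order."""
    siblings = [a for a in tier.iter("REF_ANNOTATION")
                if a.get("ANNOTATION_REF") == parent_id]
    # Map each predecessor id to the annotation that follows it.
    successor = {a.get("PREVIOUS_ANNOTATION"): a for a in siblings
                 if a.get("PREVIOUS_ANNOTATION")}
    ordered = []
    # Each chain starts at an annotation that has no predecessor.
    for start in (a for a in siblings if not a.get("PREVIOUS_ANNOTATION")):
        current = start
        while current is not None:
            ordered.append(current)
            current = successor.get(current.get("ANNOTATION_ID"))
    return ordered


tier = ET.fromstring(EAF_FRAGMENT)
morpheme_values = [a.findtext("ANNOTATION_VALUE")
                   for a in ordered_children(tier, "a1")]
```

Here morpheme_values comes out in linear order, ba- before fefo, regardless of the numeric ANNOTATION_IDs.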

Add a TierMapping class

We need a way to define which tier types and/or names have content that is somehow similar (semantically equal?) to each other. We define a class "TierMapping" that does the work. Details to be discussed.
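
As a starting point for that discussion, here is one possible shape for such a class. It is only a sketch: the method names (add, names_for, type_of, are_equivalent) and the idea of grouping tier names under a shared tier type are assumptions, not an agreed design.

```python
from collections import defaultdict


class TierMapping:
    """Hypothetical sketch: groups tier names under a shared tier type."""

    def __init__(self):
        self._names_by_type = defaultdict(set)

    def add(self, tier_type, tier_name):
        """Register a tier name as belonging to the given tier type."""
        self._names_by_type[tier_type].add(tier_name)

    def names_for(self, tier_type):
        """Return all tier names registered for a tier type, sorted."""
        return sorted(self._names_by_type.get(tier_type, ()))

    def type_of(self, tier_name):
        """Return the tier type a name was registered under, or None."""
        for tier_type, names in self._names_by_type.items():
            if tier_name in names:
                return tier_type
        return None

    def are_equivalent(self, first, second):
        """Return True if both tier names map to the same tier type."""
        tier_type = self.type_of(first)
        return tier_type is not None and tier_type == self.type_of(second)
```

A converter could then ask, for example, whether a Toolbox marker and an Elan tier name are registered under the same type before copying content between them.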

graf.py tries to import itself

import poioapi.io.elan
Traceback (most recent call last):
  File "", line 1, in <module>
  File "poioapi/io/elan.py", line 28, in <module>
    import poioapi.io.graf
  File "poioapi/io/graf.py", line 25, in <module>
    import graf
ImportError: No module named graf

Tests failing with XML output

The test to convert from Mandinka to Typecraft fails in Python 2. We will implement two different tests for Python 2 and 3, with two different target XML files.

Add a Writer for Latex

Check how to write interlinear glossed text as Latex code. We have one example in our ACRH2 paper on SkyDrive.

Toolbox Parser - No annotation parent

Line 204 evaluates "self._annotations_for_parent[("a{0}".format(id_to_add), last_tier_marker)][-1]", but for the text below there are no parents.

Test text:
\nt Metadata: Thursday 19 Jan 2007
interlinearised once in this file.
\np R occur on short (or sometimes
medium-length) oral vowels, not.

Characters not appearing in Latex conversion

In the conversion to Latex, some characters from the source may be missing in the output. This occurs with some combining characters (one combining character may be missing in one case but appear in another), and with letters (so far: ŋ).

This behaviour is only present if we run the conversion with Python 2.
