cidles / poio-api

Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation Framework” (GrAF), allow unified access to linguistic data from a wide range of sources.

Home Page: http://media.cidles.eu/poio/poio-api

License: Apache License 2.0

Python 98.56% TeX 1.44%

poio-api's Introduction

Poio API

Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan's EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called "Graph Annotation Framework" (GrAF), allow unified access to linguistic data from a wide range of sources.

For documentation, please visit http://media.cidles.eu/poio/poio-api/

License

Poio API source code is distributed under the Apache 2.0 License.

Poio API documentation is distributed under the Creative Commons Attribution 3.0 Unported.

poio-api's People

Forkers

arne-cl dorotheebeermann togg1 fielddb langsci igorbmstu

poio-api's Issues

Update documentation to new GrAF structure

The Elan to GrAF transformation description still describes the old GrAF structures:

https://poio-api.readthedocs.org/en/latest/howto.html#transformation-of-file-formats-from-and-to-graf

Refine the mapping handling

Refine the handling of the case where we have multiple destination tags.
This happens in the Mandinka to Typecraft conversion, for instance.

Failing tests

There is currently one failure and one error when I run the tests with Python 3:

.......................................................F......E.............................
======================================================================
ERROR: poioapi.tests.test_annotationgraph.TestAnnotationGraph.test_for_node_duplicates
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/pbouda/Projects/git-github/poio-api/src/poioapi/tests/test_annotationgraph.py", line 105, in test_for_node_duplicates
    trimmed = set(original)
TypeError: unhashable type: 'Node'

======================================================================
FAIL: poioapi.tests.io.test_typecraft.TestWriter.test_conversion
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/pbouda/Projects/git-github/poio-api/src/poioapi/tests/io/test_typecraft.py", line 116, in test_conversion
    assert os.path.getsize(outputfile) == os.path.getsize(originalfile)
AssertionError

----------------------------------------------------------------------
Ran 92 tests in 6.221s

FAILED (errors=1, failures=1)

Please fix the tests.
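
The TypeError in test_for_node_duplicates is what Python 3 raises when a class defines __eq__ without __hash__: Python 3 then sets __hash__ to None, so instances cannot be placed in a set. Whether that is the actual cause here is an assumption. The Node class below is a hypothetical stand-in, not the GrAF implementation, and has_duplicates is an illustrative helper:

```python
class Node:
    """Hypothetical stand-in for a graph node; not the real GrAF class."""

    def __init__(self, node_id):
        self.id = node_id

    def __eq__(self, other):
        # Two nodes are considered equal when their ids match.
        return isinstance(other, Node) and self.id == other.id

    def __hash__(self):
        # Defining __eq__ alone makes instances unhashable in Python 3,
        # so hash on the same key that __eq__ compares.
        return hash(self.id)


def has_duplicates(nodes):
    """Return True if any node occurs more than once in the sequence."""
    nodes = list(nodes)
    return len(set(nodes)) != len(nodes)
```

With __hash__ defined next to __eq__, set(original) in the test would work again under both Python 2 and Python 3.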

BOM characters - Typecraft conversion

Remove BOM character from the conversion's source files.
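
A minimal sketch of how the BOM could be stripped, assuming the source files are UTF-8 encoded (the issue does not say so). The function name strip_bom is made up for illustration:

```python
import codecs


def strip_bom(path):
    """Rewrite the file at `path` without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, leave the file untouched
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Alternatively, reading the files with the "utf-8-sig" codec skips a leading BOM without modifying them on disk.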

Test if timeslots are created correctly from scratch in Elan writer

The attached script throws an error when converting from a pickle (available on SkyDrive) to Elan. Please check if timeslot creation works correctly.

https://gist.github.com/pbouda/6902620

Conversion files when no media given in Elan file

  File "/home/snordhoff/workspace/virtualenvironments/poio/local/lib/python2.7/site-packages/poioapi/io/graf.py", line 334, in parse
    self.primary_data = self.parser.get_primary_data()
  File "/home/snordhoff/workspace/virtualenvironments/poio/local/lib/python2.7/site-packages/poioapi/io/elan.py", line 292, in get_primary_data
    primary_data.type = poioapi.io.graf.UNKNOWN
AttributeError: 'module' object has no attribute 'UNKNOWN'

When setting primary_data.type = poioapi.io.graf.VIDEO in line 292 for debugging purposes, a different error follows:

  File "/home/snordhoff/workspace/virtualenvironments/poio/local/lib/python2.7/site-packages/poioapi/io/graf.py", line 555, in _add_primary_data
    self.standoffheader.datadesc.primaryData = {'loc': loc,
UnboundLocalError: local variable 'loc' referenced before assignment

suggestions:

define type UNKNOWN
initialize loc with '' before the if-statements starting in line 548
(even better: implement error handling with useful error messages)
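
The second and third suggestions could look roughly like the sketch below. It does not reproduce the code around line 548 of graf.py: the function name, the exception class and the attribute names filename and external_link are all assumptions made for illustration.

```python
class PrimaryDataError(ValueError):
    """Hypothetical error for files whose primary data cannot be described."""


def primary_data_location(primary_data):
    """Return a location string for the primary data, or '' if none is given.

    `primary_data` is assumed to be an object with optional `filename`
    and `external_link` attributes; the real attribute names may differ.
    """
    if primary_data is None:
        raise PrimaryDataError("the parser returned no primary data")

    loc = ""  # default, so `loc` is always bound after the branches below
    filename = getattr(primary_data, "filename", None)
    external_link = getattr(primary_data, "external_link", None)
    if filename:
        loc = filename
    elif external_link:
        loc = external_link
    return loc
```

An Elan file without media then yields an empty location instead of an UnboundLocalError, and a missing primary data object fails with a message that names the problem.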

Tests throw error on Windows, Python 3.3

h:\ProjectsWin\git-github\poio-api>c:\Python33\Scripts\nosetests.exe
E...............................................................

ERROR: poioapi.tests.io.test_brat.TestBrat.test_write

Traceback (most recent call last):
  File "c:\Python33\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "h:\ProjectsWin\git-github\poio-api\src\poioapi\tests\io\test_brat.py", line 51, in test_write
    assert len(file_ann.readlines()) == len(file_ann_res.readlines())
  File "c:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 13: character maps to <undefined>

Ran 64 tests in 3.660s

FAILED (errors=1)
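
The traceback shows the file being decoded with cp1252, the default text encoding on this Windows setup, in which byte 0x81 is not defined. A minimal sketch of a fix, assuming the brat test files are UTF-8 encoded (not stated in the issue), is to pass the encoding explicitly when opening them. The helper name count_lines is made up:

```python
import io


def count_lines(path, encoding="utf-8"):
    """Count the lines of a text file, decoding with an explicit encoding
    instead of the platform default."""
    with io.open(path, "r", encoding=encoding) as f:
        return sum(1 for _ in f)
```

The test could then compare count_lines(...) for the written file and the reference file, and would behave the same on Windows and Linux.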

Regression test for Elan to GrAF conversion

There was a bug in the Elan parser that put integer values into internal data structures, which could then not be serialized by the GrAF parser. Here is the commit that fixes the problem:

faef6d4

It was presumably caused by the creation of "missing" timeslots. There is a demo file available on SkyDrive/Nordhoff. Write a regression test for this problem.

Add a Parser class for Toolbox TXT files

We have some preliminary code for a general Toolbox parser that we can use for this. The code is not public yet.

Preserve correct ordering of elements when importing from EAF

Currently, the EAF import seems to erroneously assume that ELAN ANNOTATION_IDs follow the linear order. When writing GrAF XML from an ELAN import, all edges are ordered according to their ANNOTATION_ID.

However, it is possible to edit ELAN files non-linearly. Suppose you have an item fefo, which has ANNOTATION_ID 23. If you later add a prefix ba- to it, the prefix will get a higher ANNOTATION_ID (24, or more if other content was added in between). So, this yields

ba- fefo
24 23

When converting this to GrAF, it would yield

<node xml:id="morphemes..morphemes..na0023"/>
<edge from="words..words..na0001" to="morphemes..morphemes..na0023" xml:id="ea0023"/>
<a as="morphemes" label="morphemes" ref="morphemes..morphemes..na0023" xml:id="a0023">
    <fs>
        <f name="annotation_value">fefo</f>
    </fs>
</a>
<node xml:id="morphemes..morphemes..na0024"/>
<edge from="words..words..na352" to="morphemes..morphemes..na0024" xml:id="ea0024"/>
<a as="morphemes" label="morphemes" ref="morphemes..morphemes..na0024" xml:id="a0024">
    <fs>
        <f name="annotation_value">ba-</f>
    </fs>
</a>

Note that ba- appears after fefo.

As far as I can see, there is no way to reconstruct the original order.

For a real case, see http://www.glottotopia.org/solr/athagram/browse?&q=id%3Atau-811-24, where the items nay of huhnay and diil of natdindiil do not appear at the correct place. This example was converted from Elan via the poio-api.

Suggestion: take advantage of the PREVIOUS_ANNOTATION attribute in ELAN REF_ANNOTATIONs and order the REF_ANNOTATIONs accordingly when writing out the GrAF nodes.
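
A rough sketch of that suggestion, independent of the poio-api code: group the REF_ANNOTATIONs of a tier by their parent (ANNOTATION_REF) and follow the PREVIOUS_ANNOTATION links within each group. The EAF fragment is a hand-written reconstruction of the ba-/fefo example, and the parent id a1 is invented. The order of the parents themselves is simply taken from the file, which is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment modelled on the ba-/fefo example; ids are illustrative.
EAF_FRAGMENT = """
<TIER TIER_ID="morphemes">
  <ANNOTATION>
    <REF_ANNOTATION ANNOTATION_ID="a23" ANNOTATION_REF="a1" PREVIOUS_ANNOTATION="a24">
      <ANNOTATION_VALUE>fefo</ANNOTATION_VALUE>
    </REF_ANNOTATION>
  </ANNOTATION>
  <ANNOTATION>
    <REF_ANNOTATION ANNOTATION_ID="a24" ANNOTATION_REF="a1">
      <ANNOTATION_VALUE>ba-</ANNOTATION_VALUE>
    </REF_ANNOTATION>
  </ANNOTATION>
</TIER>
"""


def ordered_ref_annotations(tier):
    """Return the REF_ANNOTATION elements of a tier in linear order.

    Annotations are grouped by parent (ANNOTATION_REF); within each group
    the chain of PREVIOUS_ANNOTATION links is followed from its head.
    """
    groups = {}        # parent id -> list of annotations, in file order
    parent_order = []  # parents in order of first appearance in the file
    for ann in tier.iter("REF_ANNOTATION"):
        parent = ann.get("ANNOTATION_REF")
        if parent not in groups:
            groups[parent] = []
            parent_order.append(parent)
        groups[parent].append(ann)

    result = []
    for parent in parent_order:
        siblings = groups[parent]
        # Map each predecessor id to the annotation that follows it.
        successor = {a.get("PREVIOUS_ANNOTATION"): a
                     for a in siblings if a.get("PREVIOUS_ANNOTATION")}
        # Heads have no PREVIOUS_ANNOTATION; normally there is exactly one.
        heads = [a for a in siblings if not a.get("PREVIOUS_ANNOTATION")]
        seen = set()
        for current in heads:
            while current is not None and current.get("ANNOTATION_ID") not in seen:
                seen.add(current.get("ANNOTATION_ID"))
                result.append(current)
                current = successor.get(current.get("ANNOTATION_ID"))
        # Fall back to file order for anything not reachable from a head.
        result.extend(a for a in siblings if a.get("ANNOTATION_ID") not in seen)
    return result


tier = ET.fromstring(EAF_FRAGMENT)
values = [a.findtext("ANNOTATION_VALUE") for a in ordered_ref_annotations(tier)]
print(values)  # prints ['ba-', 'fefo']
```

The GrAF nodes and edges could then be written in the order of the returned list rather than sorted by ANNOTATION_ID.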

Add a TierMapping class

We need a way to define which tier types and/or names have content that is somehow similar (semantically equal?) to each other. We define a class "TierMapping" that does the work. Details to be discussed.

Write test case that demonstrates duplicates in annotation spaces

For performance reasons we commented out one line in graf.py, so that annotations are added to spaces without checking whether they are already in there:

https://github.com/cidles/poio-api/blob/master/src/poioapi/io/graf.py#L409

Is there any case where this might cause problems? Write a test case with an example file that demonstrates when duplicates might be added to the same annotation space.
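
If duplicates do turn out to be a problem, the check need not be expensive. The sketch below uses a hypothetical AnnotationSpace class, not the GrAF implementation, and assumes that every annotation carries a unique id. It keeps a set of ids next to the list so that the membership test takes constant time:

```python
class AnnotationSpace:
    """Hypothetical container; not the GrAF implementation."""

    def __init__(self, name):
        self.name = name
        self.annotations = []  # keeps insertion order
        self._ids = set()      # ids already present, for constant-time lookup

    def add(self, annotation_id, annotation):
        """Add an annotation unless its id is already in the space.

        Returns True if the annotation was added, False for a duplicate.
        """
        if annotation_id in self._ids:
            return False
        self._ids.add(annotation_id)
        self.annotations.append(annotation)
        return True
```

A test case for this issue could add the same annotation twice and assert that the space still contains it only once.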

Add a writer for TCF

Use case is EAF to TCF conversion.

graf.py tries to import itself

import poioapi.io.elan
Traceback (most recent call last):
  File "", line 1, in <module>
  File "poioapi/io/elan.py", line 28, in <module>
    import poioapi.io.graf
  File "poioapi/io/graf.py", line 25, in <module>
    import graf
ImportError: No module named graf
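
The message says that no top-level module named graf could be found at the point where poioapi/io/graf.py imports it. Assuming that this import is meant to load a separately installed package (an assumption based only on the traceback), the failure could at least be made easier to understand. The helper require_module and the hint text are made up for illustration:

```python
import importlib


def require_module(name, hint):
    """Import and return a module, adding an installation hint on failure."""
    try:
        return importlib.import_module(name)
    except ImportError as error:
        raise ImportError("{0} ({1})".format(error, hint))
```

It could be used as graf = require_module("graf", "is the GrAF library installed?"), so that the error names the missing dependency instead of looking like a circular import.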