neocl / speach

🐍🍑 Python 3 library for managing, annotating, and converting natural language corpora using popular formats (CoNLL, ELAN, Praat, CSV, JSON, SQLite, VTT, Audacity, TTL, TIG, ISF, etc.)

Home Page: https://speach.readthedocs.io/

License: MIT License

speach's Introduction

Speach

Speach (🐍🍑, formerly texttaglib) is a Python 3 library for managing, annotating, and converting natural language corpora using popular formats (CoNLL, ELAN, Praat, CSV, JSON, SQLite, VTT, Audacity, TTL, TTLIG, ISF, etc.)

Main functions are:

  • Reading, editing, and writing ELAN transcriptions and related media files directly in ELAN Annotation Format (eaf)
  • Cutting, converting, and merging audio/video files
  • TTLIG (or TIG) - A human-friendly linguistic documentation format with interlinear gloss support
  • Text corpus management using texttaglib format
  • Multiple storage formats (text, CSV, JSON, SQLite databases)

Installation

speach is available on PyPI.

pip install speach

Sample code

Speach can extract annotations and metadata from ELAN transcripts directly, for example:

from speach import elan

# Test ELAN reader function in speach
eaf = elan.read_eaf('./test/data/test.eaf')

# accessing tiers & annotations
for tier in eaf:
    print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
    for ann in tier:
        print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts} :: {ann.to_ts}] {ann.text}")

Speach also provides command line tools for processing EAF files.

# this command converts an eaf file into csv
python -m speach eaf2csv input_elan_file.eaf -o output_file_name.csv
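
The same kind of export can be sketched in Python. The helper below is a hypothetical illustration, not the implementation behind eaf2csv: the function name and the column layout are assumptions, and it relies only on the tier and annotation attributes used in the reader example above (tier.ID, ann.ID, ann.from_ts, ann.to_ts, ann.text).

```python
import csv


def annotations_to_csv(tiers, out):
    """Write one CSV row per annotation and return the number of rows written.

    `tiers` is any iterable of tier-like objects that expose `.ID` and yield
    annotation-like objects when iterated. The column layout is an assumption
    made for this sketch, not the format produced by `eaf2csv`.
    """
    writer = csv.writer(out)
    writer.writerow(["tier", "annotation", "from_ts", "to_ts", "text"])
    count = 0
    for tier in tiers:
        for ann in tier:
            writer.writerow([tier.ID, ann.ID, str(ann.from_ts), str(ann.to_ts), ann.text])
            count += 1
    return count


# Possible usage with speach (requires the package and an EAF file):
#   from speach import elan
#   eaf = elan.read_eaf("input_elan_file.eaf")
#   with open("output_file_name.csv", "w", newline="", encoding="utf-8") as f:
#       annotations_to_csv(eaf, f)
```

Passing an open file object rather than a path keeps the helper easy to try against an in-memory buffer.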

Processing media files

>>> from speach import media
>>> media.convert("~/Documents/test.wav", "~/Documents/test.ogg")
>>> media.cut("test.wav", "test_10-15.ogg", from_ts="00:00:10", to_ts="00:00:15")

Read the Speach documentation for more information.

Contributors

Graphic materials

The Speach logo was created using the snake emoji (created by Selina Bauder) and the peach emoji (created by Marius Schnabel) from the OpenMoji project. License: CC BY-SA 4.0

Contributors are welcome! If you want to help develop speach, please visit the Contributing page.

speach's People

Contributors

letuananh, vicchuayh

speach's Issues

Support removing annotations

  • Need to take care of dependent annotations, maybe by adding a recursive argument that also removes the annotations referencing the deleted one, and raising an error otherwise
  • Clean up TimeSlot objects after deletion, but they may be shared
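
The two points above can be sketched with plain dictionaries. Everything here is hypothetical and independent of speach's actual classes: annotations are modelled as a dict of id to record, where "ref" names the parent annotation and "slots" lists the time slot ids the annotation uses.

```python
def remove_annotation(annotations, ann_id, recursive=False):
    """Remove `ann_id` from `annotations` and return the ids that were removed.

    If other annotations reference `ann_id`, they are removed as well when
    `recursive` is true; otherwise a ValueError is raised and nothing changes.
    """
    if ann_id not in annotations:
        raise KeyError(ann_id)
    children = [aid for aid, rec in annotations.items() if rec.get("ref") == ann_id]
    if children and not recursive:
        raise ValueError(f"{ann_id} has dependent annotations: {children}")
    removed = []
    for child in children:
        # Dependents are removed first, depth-first.
        removed.extend(remove_annotation(annotations, child, recursive=True))
    del annotations[ann_id]
    removed.append(ann_id)
    return removed


def prune_timeslots(timeslots, annotations):
    """Drop time slots that no remaining annotation references.

    Shared slots survive because they are still in use by another annotation.
    """
    used = {slot for rec in annotations.values() for slot in rec.get("slots", ())}
    for slot_id in list(timeslots):
        if slot_id not in used:
            del timeslots[slot_id]
    return timeslots
```

Pruning time slots as a separate pass after deletion sidesteps the sharing problem: a slot is dropped only once nothing refers to it.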

Parse "EXTERNAL_REF" node in ELAN

Parsing an ELAN file with external controlled vocabulary will show this warning:

Unknown element type -- EXTERNAL_REF. Please consider to report an issue at https://github.com/neocl/speach/issues/

The XML node looks something like

    <EXTERNAL_REF EXT_REF_ID="er1" TYPE="ecv" VALUE="file://home/user/transcriptions/vocab.ecv"/>
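
As a rough illustration of what handling this node could involve, the sketch below reads the three attributes with Python's standard xml.etree.ElementTree. It is not speach's parser: the function name and the returned dictionary layout are assumptions, and the sample document is reduced to the single node quoted above.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for an EAF document containing only the node from the report.
SAMPLE = """<ANNOTATION_DOCUMENT>
    <EXTERNAL_REF EXT_REF_ID="er1" TYPE="ecv" VALUE="file://home/user/transcriptions/vocab.ecv"/>
</ANNOTATION_DOCUMENT>"""


def read_external_refs(root):
    """Collect EXTERNAL_REF nodes into a dict keyed by EXT_REF_ID."""
    refs = {}
    for node in root.iter("EXTERNAL_REF"):
        refs[node.get("EXT_REF_ID")] = {
            "type": node.get("TYPE"),
            "value": node.get("VALUE"),
        }
    return refs


external_refs = read_external_refs(ET.fromstring(SAMPLE))
```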

Mistaken ruby generation by ttlig.py's from_furi

ttlig.py's from_furi method fails on words that contain kanji and end with kana, when the ending kana is the same as the last kana of the preceding kanji's reading.
e.g.:

可愛い (かわいい)
憎く (にくく)
低く (ひくく)

from speach import ttlig
ttlig.RubyToken.from_furi(surface='可愛い',kana='かわいい').to_code()
# it returns {可愛/かわ}い
# but the correct result should be {可愛/かわい}い

@larvata pointed this out downstream in a yomikata issue. I fixed this in yomikata by stripping the overlapping trailing characters from surface and kana before running them through the main logic, a pretty ugly hack.
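
A minimal sketch of that kind of workaround, written without speach so it can run on its own. It is not the yomikata code: both function names are hypothetical, and the sketch only handles the trailing overlap described above, not kana in the middle of a word.

```python
def split_common_suffix(surface, kana):
    """Split off the characters that `surface` and `kana` share at the end.

    Returns (surface_stem, kana_stem, shared_suffix).
    """
    n = 0
    limit = min(len(surface), len(kana))
    while n < limit and surface[-1 - n] == kana[-1 - n]:
        n += 1
    if n == 0:
        return surface, kana, ""
    return surface[:-n], kana[:-n], surface[-n:]


def to_ruby_code(surface, kana):
    """Build a {stem/reading}suffix string after stripping the shared suffix."""
    stem, reading, tail = split_common_suffix(surface, kana)
    if not stem:
        # The whole surface is already identical to the reading.
        return tail
    return "{" + stem + "/" + reading + "}" + tail


print(to_ruby_code("可愛い", "かわいい"))  # {可愛/かわい}い
```

Matching from the end, one character at a time, stops at the first kanji, so the full reading of the kanji stem stays inside the braces.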

This is what I needed in my life

I just used speach to batch convert ELAN files into a CLDF dataset, and it was incredibly easy. Looking forward to trying it on other formats. I'll close this since it's not really an issue, just wanted to say: thank you!

ELAN basic edit

  • Create a blank EAF file
  • Update author
  • Update created date
  • Update media file (media_file, media_url, relative_media_url)
  • Create new annotations
  • Update annotation values
  • Create new tiers
  • Change tier names
  • Edit controlled vocabulary (CV)
  • Create new CV entries
  • Edit CV entries
  • Edit Timeslot values
  • Shift timeline (i.e. batch edit timeslots)
  • Remove annotations (Related: #22)
  • Remove tiers
  • Remove CV entries (unused only)
