
arpabet-syllabifier's Introduction

Syllabifier


Syllabifier is a Python module that syllabifies English pronunciations. Currently only ARPABET syllabification is supported. It takes an ARPABET transcription as a string or a list and returns a list in which the phones are grouped into syllables.

Dependencies

  • python>=3.5
  • jupyter>=1.0.0 (only if you want to run the test notebook locally)

How to Use

  • Install syllabifier by running: python setup.py install
  • Import the function: from syllabifier import syllabifyARPA
  • Function parameters
    • A 2-letter ARPABET transcription in string form (with phones delimited by spaces) or as a Python list (stress markers on the vowels are optional)
    • (Optional) bool silence_warnings to suppress ValueErrors thrown because of unsyllabifiable input
  • Sample calls are in the Jupyter Notebook test.ipynb, using CMU Pronouncing Dictionary data.
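
As a rough illustration of the call described above, here is a minimal sketch. The wrapper name try_syllabify is hypothetical and not part of the package; only syllabifyARPA and the ValueError behaviour come from this README. The exact shape of the returned list is not shown here because it is not documented above.

```python
def try_syllabify(transcription):
    """Call syllabifyARPA and return None if the input cannot be syllabified.

    try_syllabify is a hypothetical helper for illustration only.
    """
    # Imported lazily so this snippet loads even before the package is installed.
    from syllabifier import syllabifyARPA

    try:
        # transcription may be a space-delimited string or a list of phones.
        return syllabifyARPA(transcription)
    except ValueError:
        # Raised for unsyllabifiable input; the optional silence_warnings
        # parameter is the package's own way to suppress this.
        return None
```

Once the package is installed, both try_syllabify("HH AH0 L OW1") and try_syllabify(["HH", "AH0", "L", "OW1"]) should be accepted, since the function takes either form.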

Contents

  • syllabifier.py: Core module of this repository which contains all the code that syllabifies an ARPABET transcription
  • tests/test.ipynb: Jupyter Notebook demonstrating sample calls to syllabifyARPA using CMUDict data
  • tests/cmudict.txt: Very large text file containing ARPABET transcriptions of over 100,000 English words
  • tests/cmusubset.txt: Subset of ~60 words and transcriptions from the CMU Dictionary text file for testing convenience
  • tests/test_syllabifier.py: Unit and integration tests for the package

ARPABET

ARPABET is a method of transcribing General American English phonetically with only ASCII characters. Refer here for a table of mappings between IPA and ARPABET. This syllabifier accepts only the 2-letter ARPABET codes, but case does not matter.
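
To make the input conventions concrete, the following sketch shows one way to normalize a transcription given as a string or a list, ignoring case and optionally dropping stress digits. The helper names normalize_arpabet and strip_stress are hypothetical; this is not the module's own code.

```python
import re

# A trailing 0, 1 or 2 is the CMUdict-style stress marker on a vowel.
_STRESS = re.compile(r"[0-2]$")


def normalize_arpabet(transcription):
    """Return a list of upper-cased phones from a string or a list."""
    if isinstance(transcription, str):
        phones = transcription.split()  # phones are delimited by spaces
    else:
        phones = list(transcription)
    return [phone.strip().upper() for phone in phones]


def strip_stress(phone):
    """Remove a trailing stress digit, if any (e.g. 'OW1' becomes 'OW')."""
    return _STRESS.sub("", phone)
```

With these helpers, "hh ah0 l ow1" and ["HH", "AH0", "L", "OW1"] normalize to the same list of phones.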

CMUDict

The Carnegie Mellon University Pronouncing Dictionary is an open-source pronunciation dictionary for North American English. It contains ARPABET transcriptions of 100,000+ words with lexical stress markers on the vowels.
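
For readers who want to feed dictionary entries to the syllabifier, here is a small parsing sketch. It assumes the line layout of the plain-text CMUdict release (a word, whitespace, then the phones; comment lines starting with ";;;"; alternate pronunciations marked like WORD(1)). The files in tests/ may be laid out differently, and parse_cmudict_line is a hypothetical name.

```python
import re

# Alternate pronunciations carry a suffix such as "(1)" on the word.
_VARIANT = re.compile(r"\(\d+\)$")


def parse_cmudict_line(line):
    """Return (word, phones) for an entry line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phones = line.split()
    return _VARIANT.sub("", word), phones
```

The returned list of phones keeps its stress digits, so it can be passed on unchanged.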

English Syllabification Rules

The syllabification rules that this function is based on are from this Wikipedia article, so they should be treated with some suspicion. In addition, I added a few clusters as acceptable at my own discretion, based on my judgment as a native English speaker and on errors thrown when running the code on the CMUDict data. Most of these correspond to clusters that come from loanwords, e.g., SH-N as in schnapps.
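
The general idea of syllabifying against an explicit set of acceptable clusters can be sketched as follows. This is an illustration only, not the code in syllabifier.py: the function names are hypothetical, the onset set is deliberately tiny and incomplete, and the strategy shown (give each vowel the longest acceptable onset) is an assumption about how such rules are usually applied.

```python
# The 15 vowel codes used in CMUdict-style ARPABET (stress digits removed).
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

# Illustrative, incomplete set of multi-consonant onsets (an assumption).
ONSET_CLUSTERS = {
    ("P", "R"), ("T", "R"), ("K", "R"), ("B", "R"), ("D", "R"), ("G", "R"),
    ("F", "R"), ("P", "L"), ("K", "L"), ("B", "L"), ("G", "L"), ("F", "L"),
    ("S", "L"), ("S", "P"), ("S", "T"), ("S", "K"), ("S", "M"), ("S", "N"),
    ("S", "T", "R"), ("S", "P", "R"), ("S", "K", "R"),
    ("SH", "N"),  # loanword cluster, as in schnapps
}


def is_vowel(phone):
    """True if the upper-case phone is a vowel, with or without a stress digit."""
    return phone.rstrip("012") in VOWELS


def is_legal_onset(cluster):
    """Empty onsets and single consonants other than NG are always allowed."""
    if len(cluster) == 0:
        return True
    if len(cluster) == 1:
        return cluster[0] != "NG"
    return tuple(cluster) in ONSET_CLUSTERS


def syllabify_sketch(phones):
    """Split a list of upper-case phones into syllables (lists of phones)."""
    nuclei = [i for i, phone in enumerate(phones) if is_vowel(phone)]
    if not nuclei:
        raise ValueError("no vowel found, cannot syllabify")
    boundaries = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        # Start with every consonant between the vowels in the onset, then
        # move consonants into the previous coda until the onset is legal.
        split = prev + 1
        while not is_legal_onset(phones[split:nxt]):
            split += 1
        boundaries.append(split)
    boundaries.append(len(phones))
    return [phones[a:b] for a, b in zip(boundaries, boundaries[1:])]
```

Adding a cluster to the set is all it takes to accept a new onset, which is why an explicit list is easy to extend when a dictionary entry fails.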

Learning Log

  • Tried to use a sonority-based approach to syllabification but found the explicit lists of acceptable clusters much easier to deal with
  • Had some fun with Python regex before deciding to go with sets to group phones (except for vowels)
  • Used Python sets for the first time
  • Experimented with the logging module and debugged most of the program using the log files I generated
  • Created a function in Python with optional parameters for the first time - this was more exciting than it perhaps should have been
  • Did some good documentation of my code based on the Google Python Style Guide
  • Literally my first time adding CI to a repo

arpabet-syllabifier's People

Contributors

dippedrusk, mepc36
