
pronouncingpy's Introduction

pronouncing

Pronouncing is a simple interface for the CMU Pronouncing Dictionary. It's easy to use and has no external dependencies. For example, here's how to find rhymes for a given word:

>>> import pronouncing
>>> pronouncing.rhymes("climbing")
['diming', 'liming', 'priming', 'rhyming', 'timing']

Read the documentation here: https://pronouncing.readthedocs.org.

I made Pronouncing because I wanted to be able to use the CMU Pronouncing Dictionary in my projects (and teach other people how to use it) without having to install the grand behemoth that is NLTK.

Installation

Install with pip like so:

pip install pronouncing

You can also download the source code and install manually:

python setup.py install

License

The Python code in this module is distributed with a BSD license.

Acknowledgements

This package was originally developed as part of my Spring 2015 research fellowship at ITP. Thank you to the program and its students for their interest and support!

pronouncingpy's People

Contributors

aparrish, darius, davidlday, hugovk, jdbean, salty-horse, wiseman

pronouncingpy's Issues

add license info from cmudict

CMUdict has a 2-clause BSD license which requires reproducing a copyright notice and disclaimer. The license itself is included in the version of CMUdict included in the source distribution, which meets the letter of the license for the purposes of this repository. The version of the package distributed on PyPI isn't technically a binary, so condition (2) in the license does not obtain, but I nevertheless think it might be good form to reproduce the required license in the documentation as well.

extra non-phones in phones for several words

A handful of words have extra non-phone content in their pronunciations. For example (with import pronouncing as pr), [(k, v) for k, v in pr.pronunciations if '#' in v] evaluates to:

[("d'artagnan", 'D AH0 R T AE1 NG Y AH0 N # foreign french'),
 ('danglar', 'D AH0 NG L AA1 R # foreign french'),
 ('danglars', 'D AH0 NG L AA1 R Z # foreign french'),
 ('gdp', 'G IY1 D IY1 P IY1 # abbrev'),
 ('hiv', 'EY1 CH AY1 V IY1 # abbrev'),
 ('porthos', 'P AO0 R T AO1 S # foreign french'),
 ('spieth', 'S P IY1 TH # name'),
 ('spieth', 'S P AY1 AH0 TH # old')]

This should obviously not be the case! There may be other instances like this—I haven't had time to check. I imagine it's a problem with the upstream module providing the pronunciations.
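
Until that's fixed upstream, a caller can strip the comment fragments themselves. A minimal sketch, assuming pronouncing.pronunciations is already populated (any lookup call builds it):

import pronouncing

# Force the lookup tables to load, then drop anything after a '#' marker.
pronouncing.phones_for_word("gdp")
cleaned = [(word, phones.split('#')[0].strip())
           for word, phones in pronouncing.pronunciations]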

Docs for syllable_count assume phones_for_word returns string, not a list

https://pronouncing.readthedocs.io/en/latest/pronouncing.html#pronouncing.syllable_count says:

>>> import pronouncing
>>> phones = pronouncing.phones_for_word("literally")
>>> pronouncing.syllable_count(phones)
4

Parameters: phones – a string containing space-separated CMUdict phones
Returns: integer count of syllables in list of phones


However, phones_for_word returns a list, not a string (Python 2.7.12, pronouncing==0.1.3):

>>> import pronouncing

>>> phones = pronouncing.phones_for_word("literally")
>>> phones
[u'L IH1 T ER0 AH0 L IY0', u'L IH1 T R AH0 L IY0']

>>> pronouncing.syllable_count(phones)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pronouncing/__init__.py", line 72, in syllable_count
    return len(stresses(phones))
  File "/usr/local/lib/python2.7/site-packages/pronouncing/__init__.py", line 108, in stresses
    return re.sub(r"[^012]", "", s)
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

>>> pronouncing.syllable_count(phones[0])
4

The syllable_count examples in
https://pronouncing.readthedocs.io/en/latest/tutorial.html#counting-syllables look fine.

Can I do conda install?

Hi, I am using Anaconda, so how do I do a conda install? I went to Anaconda Cloud and found a package named pypi_pronounce there; is that the same package as this one? I also saw that

"Last upload: 5 years and 7 months ago ",

and I want the latest version. How do I get it? Help much appreciated.

Missing words

Hello Allison,
Would you like to be informed if a word is missing from your module? For example, I learned that the word 'offence' is not in it. Or some others with prefixes and suffixes, like 'disinherited', 'dazzlingly', 'jingling', 'girded'? Or compound words, like 'brickwork' or 'seabird'?
Best regards,
Mya

[Feature Request] Arguments

Hello,

I'd like to request arguments for phones_for_word, such as stress_marks=False. That would be brilliant.

It would also be great to have the ability to convert multiple words in a single string, perhaps via a for_phone_in_words function.
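
For reference, both behaviours can already be approximated on the caller's side. A rough sketch (the helper names here are hypothetical, not part of the library):

import re
import pronouncing

def phones_without_stress(word):
    # Strip the 0/1/2 stress digits from each pronunciation.
    return [re.sub(r"\d", "", p) for p in pronouncing.phones_for_word(word)]

def phones_for_words(text):
    # First pronunciation of each word; empty string if a word is unknown.
    result = []
    for w in text.lower().split():
        p = pronouncing.phones_for_word(w)
        result.append(p[0] if p else "")
    return result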

Docs for rhyming_part assume phones_for_word returns string, not a list

https://pronouncing.readthedocs.io/en/latest/pronouncing.html#pronouncing.rhyming_part says:

>>> import pronouncing
>>> phones = pronouncing.phones_for_word("purple")
>>> pronouncing.rhyming_part(phones)
u'ER1 P AH0 L'

Parameters: phones – a string containing space-separated CMUdict phones
Returns: a string with just the “rhyming part” of those phones


However, phones_for_word returns a list, not a string (Python 2.7.12, pronouncing==0.1.3):

>>> import pronouncing
>>> phones = pronouncing.phones_for_word("purple")
>>> phones
[u'P ER1 P AH0 L']

>>> pronouncing.rhyming_part(phones)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pronouncing/__init__.py", line 143, in rhyming_part
    phones_list = phones.split()
AttributeError: 'list' object has no attribute 'split'

>>> pronouncing.rhyming_part(phones[0])
u'ER1 P AH0 L'

The rhyming_part examples in
https://pronouncing.readthedocs.io/en/latest/tutorial.html?highlight=rhyming_part#rhyme work as expected.

LRU cache for phones_for_word lookups

I'm actually using the library for a thing right now and the performance isn't quite where I want it to be. I'd like to investigate drop-in LRU cache implementations and see if I can add one to at least the phones_for_word call in order to improve performance a little bit without needing to make a big dictionary for word-to-phone lookups. Starting point here
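
For what it's worth, a caller-side version of that idea is nearly a one-liner with functools.lru_cache; a sketch (the same decorator could presumably be applied inside the library):

from functools import lru_cache
import pronouncing

@lru_cache(maxsize=4096)
def cached_phones_for_word(word):
    # Return a tuple so the cached value is immutable.
    return tuple(pronouncing.phones_for_word(word))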

Link to pronouncing-js?

In the README somewhere, how about linking to pronouncing-js?

Some people might find this repo but would prefer to use the JS one, if they knew it existed.

(The pronouncing-js README already links here.)

python3.11

AttributeError: partially initialized module 'pronouncing' has no attribute 'rhymes' (most likely due to a circular import)

Rhyming multiple words?

Apologies if this is the wrong place to ask, or if this is a feature that already exists and I just can't find in the repo.

Is there any way to rhyme multiple words with one? So if I typed in 'laptop', it would come up with 'crack pot', 'slap shot', etc. I understand that this would probably produce dozens if not hundreds more rhymes, especially for longer words, but I would really appreciate it if this is possible in any way.
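
There is no built-in support for this, but here is a rough brute-force sketch of the idea using rhyming_part() over a small candidate vocabulary (the function name is made up, and the nested loop is only practical for short word lists):

import itertools
import re
import pronouncing

def phrase_rhymes(word, vocabulary):
    # Two-word phrases from `vocabulary` whose combined pronunciation has
    # the same rhyming part as `word` (stress digits ignored).
    strip = lambda s: re.sub(r"\d", "", s)
    target = pronouncing.phones_for_word(word)
    if not target:
        return []
    part = strip(pronouncing.rhyming_part(target[0]))
    phones = {}
    for w in vocabulary:
        p = pronouncing.phones_for_word(w)
        if p:
            phones[w] = p[0]
    pairs = []
    for w1, w2 in itertools.product(phones, repeat=2):
        combined = phones[w1] + " " + phones[w2]
        if strip(pronouncing.rhyming_part(combined)) == part:
            pairs.append(w1 + " " + w2)
    return pairs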

Get Phones in German Language

Hello,

thanks for this module. I want to get phonemes for word, e.g.:

import pronouncing
pronouncing.phones_for_word("hello")

However, I suppose these are only valid for English words; I want to use it for German words (e.g. "hallo") as well. What is the best way to extend this library with that functionality?

Regards,
Josef

suggested new methods

Hello developers,
I'm so grateful for your work! I felt huge relief when I found your module.
I'm not a linguist; I'm a data analyst. I have been working on converting text into phones for later analysis. In particular, I discovered that the method pronouncing.phones_for_word() sometimes does not work with "s" and "'s" at the end of a word. For example, I can get phones for "pic" and "Streep", but not for "pics" and "Streep's". In addition, I needed phones for whole sentences. So I created a couple of functions. The first one appends an "s" or "z" sound to an existing phone string, depending on the final sound. The second one takes a sentence containing only letter characters and converts it into a list of phone strings, where possible. Any character sequences for which no phones exist are returned at the end as well; I needed that to check how well my cleaning procedure worked. I understand that there may be foreign words ending in "s" which must be pronounced differently, but I wanted a native English speaker's pronunciation.

import pronouncing

def adding_s(phones):
    # Append an "s" sound to a phone string, choosing S, Z, or IH0 Z
    # depending on the final phone.
    fkptth = ['F', 'K', 'P', 'T', 'TH']
    chsh = ['CH', 'SH']
    last = phones.split()[-1]
    if last in fkptth:
        return phones + ' S'
    elif last in chsh:
        return phones + ' IH0 Z'
    else:
        return phones + ' Z'

def phones_for_sentence(se):
    # Convert a whitespace-separated sentence into a list of phone strings,
    # plus a string of the words for which no pronunciation was found.
    words = se.split()
    sounds = [pronouncing.phones_for_word(w) for w in words]
    no_sound = ""
    for i in range(len(words)):
        if len(sounds[i]) == 0:
            word_ = words[i]
            if word_[-2:] == "'s":
                sou = pronouncing.phones_for_word(word_[:-2])
            elif word_[-1] == "s":
                sou = pronouncing.phones_for_word(word_[:-1])
            else:
                sou = []
            if len(sou) != 0:
                sounds[i] = adding_s(sou[0])
            else:
                no_sound += words[i] + ' '
        else:
            sounds[i] = sounds[i][0]

    return sounds, no_sound

ms = "Meryl Streep's strong speech against Donald Trump at Golden Globes gets thumbs up from Hollywood Bollywood"
phones_for_sentence(ms)
jd = "Johnny Depp Settles $25 Million Suit with Former Business Managers: Report"
phones_for_sentence(jd)
tv = "'The Voice' Kelly Clarkson and Jennifer Hudson Returning As Coaches For Season 15"
phones_for_sentence(tv)

As you can see, sentences require some prior cleaning of punctuation, emoji and the like, but those procedures are well known to people who work with NLP, and if they are not applied correctly the function gives a warning in the form of the returned list.
What do you think of this? I can rework the first function to detect "s" or "'s" at the end itself, if you think that would be more suitable for your module, or change something else that I missed due to my lack of linguistic knowledge. I'm open to suggestions.
I believe other people may benefit from these functions, too. Yes, I know that a word can be pronounced in a few different ways depending on context; I can add a warning about that as well and leave the user a way to change the pronunciation to a more suitable one. My point is that a function like this can streamline the process.
Best regards,
Mya

add info on how to cite

I don't think anyone is using this library in an academic context, but it might nevertheless make sense to add a bibtex template or something to the documentation with proper citation information, to make it easier just in case!

Is source code available?

Hi @aparrish!

Thanks for making this. Is the source code for this library's functions available? For example, for pronouncing.search?

[Question] How to associate pronunciation to senses/synsets/definitions

Hello - using this library (or the cmudict directly, same information), we can get multiple pronunciations. Some of these correspond to a different part of speech (e.g. PRO-ject noun vs. pro-JECT verb). Some are homographs with the same part of speech (e.g. bow).

Here's an example of bow:

>>> import pronouncing
>>> pronouncing.phones_for_word('bow')
['B AW1', 'B OW1']
>>>
>>> from nltk.corpus import wordnet
>>> [ss.definition() for ss in wordnet.synsets('bow')[:2]]
['a knot with two loops and loose ends; used to tie shoelaces', 'a slightly curved piece of resilient wood with taut horsehair strands; used in playing certain stringed instruments']

Does anybody have suggestions for how I could create relations from the pronunciations to the senses/synsets?

One potential path I'm looking at is:

  1. arpabet to IPA (see the sketch below)
  2. look up definitions/senses by IPA (I don't know where, just yet)
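
For step 1, here is a partial sketch of an ARPAbet-to-IPA conversion. The mapping table is my own approximation and covers only a handful of the 39 ARPAbet phones; it is not something pronouncing ships with:

import pronouncing

# Partial ARPAbet -> IPA table (incomplete; extend as needed).
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AW": "aʊ", "AY": "aɪ", "B": "b",
    "EH": "ɛ", "IY": "i", "L": "l", "N": "n", "OW": "oʊ",
    "S": "s", "T": "t",
}

def to_ipa(phones):
    # Strip stress digits, then map each phone; unknown phones pass through.
    out = []
    for p in phones.split():
        p = p.rstrip("012")
        out.append(ARPABET_TO_IPA.get(p, p))
    return "".join(out)

[to_ipa(p) for p in pronouncing.phones_for_word("bow")]   # ['baʊ', 'boʊ']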

Cookbook: Example output incorrect?

In the first Meter example:

>>> import pronouncing
>>> phones_list = pronouncing.phones_for_word("snappiest")
>>> pronouncing.stresses(phones_list[0])
u'0102'

But when I try that code I get this:

>>> import pronouncing
>>> phones_list = pronouncing.phones_for_word("snappiest")
>>> pronouncing.stresses(phones_list[0])
u'102'

rhymes() returns only words rhyming with first pronunciation

The docs state:

The pronouncing.rhymes() function returns a list of all possible rhymes for the given word—i.e., words that rhyme with any of the given word’s pronunciations.

It appears, however, that the pronouncing.rhymes() function in fact returns only the words that rhyme with the first of the given word's pronunciations.
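
Until that changes, one workaround is to collect rhymes for every pronunciation by searching on each rhyming part, following the approach shown in the tutorial's rhyme section. A sketch:

import pronouncing

def rhymes_all_pronunciations(word):
    results = set()
    for phones in pronouncing.phones_for_word(word):
        part = pronouncing.rhyming_part(phones)
        # search() matches a regex against the phone strings; anchor at the end.
        results.update(pronouncing.search(part + "$"))
    results.discard(word)
    return sorted(results)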

Using Threading with pronouncing.rhymes returns inaccurate candidates

I'm making a little project for fun that takes a huge amount of text data and tries to make a song. I tried to speed up the process of finding rhyme candidates by using threading.

def addCandidates(strings, index):
    lastWord = ''.join(filter(str.isalpha, strings[index]["string"].strip().split(" ")[-1])).lower()
    strings[index]["candidates"] = pronouncing.rhymes(lastWord)
    strings[index]["lastWord"] = lastWord


...
...
...

    # Start a new thread for each string to find out all the rhyme candidates
    # Maximum thread pool for this process is 25 threads.
    x = 0
    while activeCount() < 25:

        process = Thread(target=addCandidates, args=[strings, x])
        process.start()
        threads.append(process)
        if x % 100 == 0:
            times[1] = time()
            print("{}% complete // {} sentences // {} threads // {} seconds // {} per second".format(int(x / len(strings) * 100), x, activeCount(), int(times[1]-times[0]), x / (times[1] + 0.00001 - times[0])))
        if x == len(strings):
            break
        x += 1


    # Wait for threads to finish processing
    for process in threads:
        process.join()

I figured that with threading it would complete a lot faster, but it turns out threading slows the process down by 5-10% while also returning incorrect data.

For example, some of the threads will return a blank list for the word 'of' while others will return the correct candidates for the same word. I thought I was doing something wrong, but when I removed the threading, rhymes() was 100% accurate.

Any explanation for this? I would like to speed up the process somehow, because the current rate of about 4-7 words per second is incredibly slow when dealing with thousands of words.
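
Without profiling it is hard to say, but two things seem plausible: these lookups are pure-Python and CPU-bound, so the GIL keeps threads from running them in parallel (hence the slowdown), and the library builds its lookup tables lazily on first use, so many cold threads hitting it at once may be racing on that initialization (hence the occasional blank results). A hedged sketch of a safer setup, which warms the dictionary up once before fanning out:

import pronouncing
from concurrent.futures import ThreadPoolExecutor

words = ["of", "climbing", "laptop"]

# Any call forces the CMUdict tables to load before the threads start.
pronouncing.rhymes("warmup")

with ThreadPoolExecutor(max_workers=25) as pool:
    candidates = dict(zip(words, pool.map(pronouncing.rhymes, words)))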

Switch CMUDICT Source to GitHub

Would you be open to switching to use the cmudict files from their github repo? The data's split out more cleanly, and it opens the possibility of automatically packaging / deploying this library when the cmudict repo is updated. It may also help with a couple of open issues - not sure yet.

I actually started writing a python wrapper for cmudict this morning but stopped when I came across your library. I'd rather contribute here than create something new since this package meets my needs as well. I'd be happy to work on making the changes necessary.

Punctuation at the start of cmudict

>>> import pronouncing
>>> pronouncing.rhymes("period")
[u'.period', u'myriad']
>>> pronouncing.rhymes("hyphen")
[u'-hyphen', u'siphon', u'syphon']

That .period is a bit strange. This is because cmudict-0.7b begins with a load of symbols:

!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE  K L OW1 Z K W OW1 T
"DOUBLE-QUOTE  D AH1 B AH0 L K W OW1 T
"END-OF-QUOTE  EH1 N D AH0 V K W OW1 T
"END-QUOTE  EH1 N D K W OW1 T
"IN-QUOTES  IH1 N K W OW1 T S
...

So this isn't necessarily a bug in Pronouncing, but more a case of how to interpret the input data. (In a way it's . that rhymes with period and not .period; - rhymes with hyphen and not -hyphen.) Would it make sense to filter some of these out?

This isn't a problem for me, I just noticed it in some unusual output and thought I'd mention it.

If they were filtered, care should be taken not to filter things like:

'ALLO  AA2 L OW1
'BOUT  B AW1 T
'CAUSE  K AH0 Z
'COURSE  K AO1 R S
'CUSE  K Y UW1 Z
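
If filtering ever seems worthwhile, a caller-side sketch of the distinction described above (keep leading apostrophes, drop other leading punctuation):

import pronouncing

def filtered_rhymes(word):
    return [w for w in pronouncing.rhymes(word)
            if w and (w[0].isalpha() or w[0] == "'")]

filtered_rhymes("period")   # e.g. ['myriad']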
