danesprite / pyjsgf

JSpeech Grammar Format (JSGF) compiler, matcher and parser package for Python.

License: MIT License

Python 100.00%

jsgf speech-recognition python

pyjsgf's Introduction

pyjsgf

JSpeech Grammar Format (JSGF) compiler, matcher and parser package for Python.

JSGF is a format used to textually represent grammars for speech recognition engines. You can read the JSGF specification here.

pyjsgf can be used to construct JSGF rules and grammars, compile them into strings or files, and find grammar rules that match speech hypothesis strings. Matching speech strings to tags is also supported. There are also parsers for grammars, rules and rule expansions.

pyjsgf has been written and tested for Python 2.7 and Python 3.5.

The documentation for this project is on readthedocs.org.

Installation

To install pyjsgf, run the following:

$ pip install pyjsgf

If you are installing in order to develop pyjsgf, clone/download the repository, move to the root directory and run:

$ pip install -e .

Usage Example

The following example creates a JSGF grammar with one rule, compiles it and finds the rules that match the speech string "hello world":

from jsgf import PublicRule, Literal, Grammar

# Create a public rule with the name 'hello' and a Literal expansion 'hello world'.
rule = PublicRule("hello", Literal("hello world"))

# Create a grammar and add the new rule to it.
grammar = Grammar()
grammar.add_rule(rule)

# Compile the grammar using compile()
# compile_to_file(file_path) may be used to write a compiled grammar to
# a file instead.
# Compilation is not required for finding matching rules.
print(grammar.compile())

# Find rules in the grammar that match 'hello world'.
matching = grammar.find_matching_rules("hello world")
print("Matching: %s" % matching[0])

Running the above code would output:

#JSGF V1.0;
grammar default;
public <hello> = hello world;

Matching: PublicRule(name='hello', expansion=Literal('hello world'))

The first line of the grammar can be changed using the jsgf_version, charset_name, and language_name members of the Grammar class.

There are some usage examples in pyjsgf/examples which may help you get started.

Multilingual support

Due to Python's Unicode support, pyjsgf can be used with Unicode characters for grammar, import and rule names, as well as rule literals. If you need this, it is better to use Python 3 or above where all strings are Unicode strings by default.

If you must use Python 2.x, you'll need to define Unicode strings as either u"text" or unicode(text, encoding), which is a little cumbersome. If you want to define Unicode strings in a source code file, you'll need to define the source code file encoding.
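
As a small illustration of the two points above, the following sketch declares the source file encoding and defines Unicode text in a way that behaves the same on Python 2.7 and Python 3. The variable names are only for illustration and are not part of pyjsgf.

```python
# -*- coding: utf-8 -*-
# The line above declares the source file encoding. Python 2 needs it as soon
# as the file contains non-ASCII characters; Python 3 assumes UTF-8 by default.

# A u"..." literal is a Unicode string on Python 2 and on Python 3.3+.
literal_text = u"привет мир"

# Byte strings obtained elsewhere must be decoded explicitly. On Python 2,
# unicode(raw_bytes, "utf-8") is equivalent to raw_bytes.decode("utf-8").
raw_bytes = b"\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82"
decoded_text = raw_bytes.decode("utf-8")

# The decoded bytes are the first word of the literal defined above.
print(decoded_text == literal_text.split()[0])
```

Strings defined this way can then be passed wherever pyjsgf expects rule names or literal text.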

pyjsgf's Issues

How to serialize PublicRule object?

I want to serialize a PublicRule object using dill, but when I load the dumped object I get the error AttributeError: 'ChildList' object has no attribute '_expansion'.
Is there any way to serialize the object?

NameError: name 'jsgf' is not defined

Steps to reproduce:
Just run the parser example file.

Mainly:
from jsgf import parse_grammar_string

grammar = parse_grammar_string(
"#JSGF V1.0 UTF-8 en;"
"grammar example;"
"public = hello world {tag};"
)

Strange precedence when parsing |

Consider this script:

import jsgf
grammar = jsgf.parser.parse_grammar_string("""
#JSGF V1.0 utf-8 en;
grammar main;
public <main> =
foo (bar|baz) qux | xxx;
""")
print(grammar.get_rule("main").compile())

I would expect the output to be public <main> = foo (bar|baz) qux|xxx;, or possibly something along the lines public <main> = (foo (bar|baz) qux)|(xxx); with additional parentheses to disambiguate.

Instead, I get public <main> = foo (bar|baz) (qux|xxx);. Is this expected behavior? I'm not an expert by far on JSGF, but going by the spec, | should have the lowest precedence of all, which doesn't seem compatible with this result.
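
To make the expected precedence concrete, here is a minimal standalone sketch of a recursive-descent parser for a tiny subset of JSGF (words, parentheses and |). It is not pyjsgf's parser; the function names and the tuple-based tree are assumptions made for illustration. Because a whole sequence is parsed inside each alternative, | ends up with the lowest precedence.

```python
import re


def tokenize(text):
    """Split an expansion string into words, parentheses and '|' tokens."""
    return re.findall(r"[()|]|[^\s()|]+", text)


def parse_alternatives(tokens, pos):
    """alternatives := sequence ('|' sequence)*   (lowest precedence)"""
    first, pos = parse_sequence(tokens, pos)
    alternatives = [first]
    while pos < len(tokens) and tokens[pos] == "|":
        item, pos = parse_sequence(tokens, pos + 1)
        alternatives.append(item)
    if len(alternatives) == 1:
        return first, pos
    return ("alt",) + tuple(alternatives), pos


def parse_sequence(tokens, pos):
    """sequence := (word | '(' alternatives ')')+   (binds tighter than '|')"""
    items = []
    while pos < len(tokens) and tokens[pos] not in ("|", ")"):
        if tokens[pos] == "(":
            item, pos = parse_alternatives(tokens, pos + 1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
        else:
            item = tokens[pos]
        items.append(item)
        pos += 1  # step past the word or the closing parenthesis
    if not items:
        raise ValueError("empty sequence at token %d" % pos)
    if len(items) == 1:
        return items[0], pos
    return ("seq",) + tuple(items), pos


def parse(text):
    """Parse a whole expansion string into a nested tuple tree."""
    tokens = tokenize(text)
    tree, pos = parse_alternatives(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token %r" % tokens[pos])
    return tree


print(parse("foo (bar|baz) qux | xxx"))
```

In the resulting tree the outermost node is the alternative set, with the sequence foo (bar|baz) qux as its first alternative and xxx as its second, which corresponds to the first output I expected.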

Suggestion: it's better to add `import` function

The JSGF specification shows that a grammar can import rules from other JSGF files, but this doesn't work in pyjsgf.
Sample:
- import <com.acme.politeness.startPolite>

#JSGF V1.0 ISO8859-1 en;

grammar com.acme.commands;
import <com.acme.politeness.startPolite>;
import <com.acme.politeness.endPolite>;

/**
  * Basic command.
  * @example please move the window
  * @example open a file
  */

public <basicCmd> = <startPolite> <command> <endPolite>;

<command> = <action> <object>;
<action> = /10/ open |/2/ close |/1/ delete |/1/ move;
<object> = [the | a] (window | file | menu);

Support for jsgf tags?

Are the JSGF tags supported by your implementation? I couldn't find anything.

Parser ignores AlternativeSet

When parsing a rule with a Literal followed by an AlternativeSet and then another Literal, the AlternativeSet is not parsed as such. If there is no Literal on either the right or the left side of the AlternativeSet, everything works fine.
Example:
Rule: public <greet> = i (go | run) to school;
Parser Output: Rule(name='greet', visible=True, expansion=Sequence(Literal('i'), Literal('go'), Literal('run'), Literal('to school')))

Parser keeps adding RequiredGroupings on parsing grammar

I stumbled across this issue on editing an existing grammar.
Every time I parse a grammar from a file, another RequiredGrouping is added around each existing one.

RequiredGrouping(Literal("something")) will become RequiredGrouping(RequiredGrouping(Literal("something"))).

This does not happen if there is only a Literal.

The grammar was written to a file using Grammar.compile_to_file(path).
It looked something like this:

<unk_process> = UNK;
public <answers> = (<X>|  Do <Y>);
public <generic> = (<answers>|<unk_process>);

After reopening it with parser.parse_grammar_file(path),

a pair of parentheses is added, turning the grammar into this:

<unk_process> = UNK;
public <answers> = ((<X>|  Do <Y>));
public <generic> = ((<answers>|<unk_process>));

I'm not sure why it happens, but it doesn't seem to be intended behaviour to me.

This working example shows the issue:

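from jsgf import AlternativeSet, Grammar, HiddenRule, PublicRule, RuleRef, parser

grammar = Grammar("Test")

x = HiddenRule("X", "X")
y = HiddenRule("Y", "Y")
grammar.add_rule(x)
grammar.add_rule(y)
grammar.add_rule(PublicRule("Z", AlternativeSet(RuleRef(x), RuleRef(y))))

# Compiled output before the round trip.
print(grammar.compile())

# Write the grammar to a file, parse it back and compile it again.
grammar.compile_to_file("test.jgram")
grammar = parser.parse_grammar_file("test.jgram")
print(grammar.compile())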

how to parse DictationGrammar for file. like parse_grammar_file(path)

How can I parse a DictationGrammar from a file, in the same way as parse_grammar_file(path)?
And how do I write Dictation in a grammar file?

Caching calculations made during matching

The matching process doesn't scale well to matching strings for large grammars. This is because of many calls to methods like Expansion.mutually_exclusive_of. I'm thinking that intelligent caching of calculations like this could increase the performance. Some work on this is in feat/lookup-optimisations.

The cache for a rule (at the root expansion or in the Rule object) could be populated with calculations using a JointTreeContext. If the rule's joint expansion tree has children lists modified, then the cache would need to be updated. This could be checked using a (string) representation of the joint tree in Rule.matches.
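
A rough standalone sketch of that invalidation idea is below. It does not use pyjsgf: the CachedCalculations class, the count_leaves function and the list-based tree are all hypothetical stand-ins. The cache is keyed on the string representation of the tree, so modifying any children list forces a recalculation on the next lookup.

```python
class CachedCalculations(object):
    """Hypothetical helper: caches a per-tree calculation until the tree changes."""

    def __init__(self, compute):
        self._compute = compute    # expensive function of the tree
        self._tree_key = None      # string form of the tree last seen
        self._value = None
        self.recalculations = 0    # only here to make the behaviour visible

    def get(self, tree):
        key = repr(tree)
        if key != self._tree_key:
            # The tree was modified (or this is the first call): recompute.
            self._value = self._compute(tree)
            self._tree_key = key
            self.recalculations += 1
        return self._value


def count_leaves(tree):
    """Stand-in for an expensive calculation over an expansion tree."""
    if isinstance(tree, list):
        return sum(count_leaves(child) for child in tree)
    return 1


tree = ["hello", ["big", "small"], "world"]
cache = CachedCalculations(count_leaves)
cache.get(tree)           # computed
cache.get(tree)           # served from the cache
tree[1].append("tiny")    # modifying a children list invalidates the cache
cache.get(tree)           # recomputed
print(cache.recalculations)
```

Building the string representation is itself linear in the size of the tree, so this only pays off when the cached calculations are considerably more expensive than that.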

Split up the monstrosity that is jsgf/expansions.py

As the title says, this project's jsgf/expansions.py file really needs to be split up into multiple files. It is nearly 2000 lines long! This makes it difficult to find where to make changes and I can imagine it is quite intimidating for anyone wanting to contribute.

The main barrier to achieving this is that there is far too much type checking in that file, making it difficult to extract any classes without running into import cycles. I will be adjusting the various methods and functions to use duck typing instead. For instance, using getattr() to check for the referenced_rule property instead of checking if an object is a NamedRuleRef. I believe this is also quicker and more Pythonic.
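
A minimal sketch of the duck typing approach, using hypothetical stand-in classes rather than the real pyjsgf ones, could look like this. Only the referenced_rule attribute name comes from the description above; everything else is assumed for illustration.

```python
class FakeLiteral(object):
    """Stand-in for an expansion that references nothing."""

    def __init__(self, text):
        self.text = text


class FakeRuleRef(object):
    """Stand-in for a rule reference; only the attribute matters."""

    def __init__(self, referenced_rule):
        self.referenced_rule = referenced_rule


def referenced_rule_of(expansion):
    """Return the referenced rule, or None if the object has no such attribute.

    No isinstance() check is needed, so the module containing this function
    does not have to import the class that defines rule references.
    """
    return getattr(expansion, "referenced_rule", None)


print(referenced_rule_of(FakeRuleRef("greeting")))
print(referenced_rule_of(FakeLiteral("hello")))
```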

I should be able to do this and keep it backwards compatible by changing jsgf.expansions into a directory and importing classes from the new modules in jsgf/expansions/__init__.py. So something like the following:

from .base import Expansion, ChildList
from .functions import (map_expansion, filter_expansion, flat_map_expansion, TraversalOrder,
                        save_current_matches, restore_current_matches)
from .single_child import SingleChildExpansion, OptionalGrouping, Repeat, KleeneStar
from .multi_child import VariableChildExpansion, AlternativeSet, RequiredGrouping, Sequence
from .references import NamedRuleRef, NullRef, VoidRef, RuleRef
from .other import JointTreeContext, Literal

Ambiguous rule matching limitations

Pyjsgf can't match certain rules where ambiguous speech strings are required/given. For example:

from jsgf import AlternativeSet
e = AlternativeSet("abc", "abcd")

The first alternative can be matched as normal, but the second can't be. If e.matches("abcd") is called, the first alternative will be matched instead of the second, with "d" as the returned string. If used in a rule, r.matches would return False because this is an incomplete match. You can get around this by writing the rule differently, e.g. rearranging the order.
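
The effect can be reproduced with a simplified first-match model. This is not pyjsgf's matcher; the function below is a hypothetical stand-in that tries alternatives in order and returns whatever text is left over.

```python
def match_first_alternative(alternatives, speech):
    """Return the text left over after the first alternative that is a prefix
    of `speech`, or None if no alternative matches."""
    for alternative in alternatives:
        if speech.startswith(alternative):
            return speech[len(alternative):]
    return None


# "abc" wins, leaving "d" unmatched, so a rule using it would not fully match.
print(repr(match_first_alternative(["abc", "abcd"], "abcd")))  # 'd'

# With the longer alternative first, nothing is left over.
print(repr(match_first_alternative(["abcd", "abc"], "abcd")))  # ''
```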

This is also a problem for the jsgf.ext.Dictation expansion class:

from jsgf import Sequence
from jsgf.ext import Dictation
e = Sequence("test", Dictation(), "test")

Calling e.matches("test a test b") would work as expected, setting the match values appropriately for the expansions. However, if "test test test" is used instead, no match values get set and the same speech string gets returned. You can get around this using either the SequenceRule or DictationGrammar classes.

Another limitation is matching successive non-optional dictation expansions:

Sequence(Dictation(), Dictation())

This will raise an ExpansionError because matching requires splitting a speech string between Dictation expansions somewhat arbitrarily. If you must do this, use SequenceRule or DictationGrammar instead.
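
To show why the split is arbitrary, this small standalone sketch (the function name is hypothetical and no pyjsgf code is involved) lists every way of dividing a speech string between two consecutive, non-empty dictation slots. Any string of three or more words has more than one valid split.

```python
def dictation_splits(speech):
    """List every way to divide the words of `speech` between two
    non-empty, consecutive dictation slots."""
    words = speech.split()
    return [(" ".join(words[:i]), " ".join(words[i:]))
            for i in range(1, len(words))]


for first, second in dictation_splits("open the window"):
    print("%r + %r" % (first, second))
```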

Simplify Git branching scheme

I don't think this project really needs a separate develop branch for various reasons. It only complicates things at the moment and I don't see much benefit from keeping it. I will be switching to just using the master branch for latest changes instead and deleting the develop branch.

A few reasons for this change:

There is a pretty decent Keep a Changelog style changelog now that lists released and unreleased changes, so there's no need to have a branch at the latest release commit any more.
Release versions are tagged and can just be checked out if required, e.g. git checkout v1.5.0.
Updates to the changelog and things like the Read the Docs and Travis-CI projects will be simpler.
No confusion about which branch to submit pull requests to.

If you use the develop branch, just switch to using the master branch after this issue is closed.

Documentation

I'm sure a readthedocs project would be useful for pyjsgf. I'll look into getting that up sometime soon.

Why is an error reported when punctuation marks appear in the file and how to solve it

            <Punctuation> = ( ',' | '.' | '。' ｜ '，')

When I set up the grammar file like that, why do I get the following error?
ite-packages/pyparsing.py", line 3846, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected end of text, found '<' (at char 319), (line:11, col:1)

No reset in find_matching_part

In the method Rule.find_matching_part the rule is not reset for a new match. Thus, tags from previous matches will not be erased. The line self.expansion.reset_for_new_match() from Rule.matches is missing in Rule.find_matching_part.

Matching process overhaul

The rule expansion matching process could use an overhaul. Sometimes it doesn't match when it should and it still doesn't scale well. I'm going to try reverting back to using regular expressions or the regex package internally. It needs to work in such a way that the Expansion.current_match values are still properly populated. Current match values are used for functionality such as the Rule.matched_tags property.
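
As a rough sketch of the general idea (not the planned implementation), an expansion tree can be translated into a single regular expression, with named groups keeping the text matched by individual sub-expansions available afterwards. The tuple-based tree, the node kinds and the function names below are all assumptions made for illustration.

```python
import re


def to_pattern(node):
    """Translate a tiny expansion tree (nested tuples) into a regex string."""
    kind = node[0]
    if kind == "lit":
        words = [r"\b%s\b" % re.escape(word) for word in node[1].split()]
        return r"\s+".join(words) + r"\s*"
    if kind == "seq":
        return "".join(to_pattern(child) for child in node[1:])
    if kind == "alt":
        return "(?:%s)" % "|".join(to_pattern(child) for child in node[1:])
    if kind == "opt":
        return "(?:%s)?" % to_pattern(node[1])
    if kind == "named":
        # A named group keeps the text matched by this sub-expansion available.
        return "(?P<%s>%s)" % (node[1], to_pattern(node[2]))
    raise ValueError("unknown node kind: %r" % (kind,))


def match(tree, speech):
    """Return a dict of named sub-matches, or None if `speech` does not match."""
    found = re.match(to_pattern(tree) + r"\Z", speech.strip())
    if found is None:
        return None
    return {name: (text or "").strip()
            for name, text in found.groupdict().items()}


# Tree for: this is a [big] (sentence | file)
tree = ("seq",
        ("lit", "this is a"),
        ("named", "size", ("opt", ("lit", "big"))),
        ("named", "object", ("alt", ("lit", "sentence"), ("lit", "file"))))

print(match(tree, "this is a big file"))
print(match(tree, "this is a sentence"))
print(match(tree, "this is a big"))
```

The regex engine backtracks on its own, so optional and alternative parts do not need special handling in the translation.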

I'm planning to make these changes in v1.4.1.

Load JSGF grammar from file?

I quickly went through the code and I couldn't find any class or function that loads
a JSGF grammar from a file.

Here's the use case: I already have a JSGF grammar in a file, but my speech recognition component
(e.g. DeepSpeech) does not support grammars and only outputs strings. I'd like to use pyjsgf
to match those output strings against my existing grammar.

Is there a way of doing this?

Greedy match or optimal match?

It seems the matching strategy is greedy instead of optimal.

For example:
from jsgf import AlternativeSet, Grammar, PublicRule, Sequence

grammar = Grammar()
play = AlternativeSet("play", "play the")
something = AlternativeSet("the game", "piano")
play_something = PublicRule("play_something", Sequence(play, something))
grammar.add_rules(play_something)
grammar.compile()

grammar.find_matching_rules("play the game")   # no match
grammar.find_matching_rules("play the piano")  # matches

I think the sentence "play the game" should match the rule as "play" followed by "the game".

Can you help to support it? Thanks.
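
For reference, this is roughly the behaviour I have in mind, written as a standalone backtracking sketch over nested tuples. It does not use pyjsgf's classes, and the function names and tree format are my own assumptions. Instead of committing to the first alternative that matches, it tries every alternative and accepts the speech if any combination consumes all of it.

```python
def remainders(expansion, speech):
    """Yield every possible leftover string after matching `expansion`
    at the start of `speech` (backtracking instead of first-match)."""
    kind = expansion[0]
    if kind == "lit":
        words, target = speech.split(), expansion[1].split()
        if words[:len(target)] == target:
            yield " ".join(words[len(target):])
    elif kind == "alt":
        for child in expansion[1:]:
            for rest in remainders(child, speech):
                yield rest
    elif kind == "seq":
        if len(expansion) == 1:
            # Nothing left to match in the sequence.
            yield speech
        else:
            for rest in remainders(expansion[1], speech):
                for final in remainders(("seq",) + expansion[2:], rest):
                    yield final
    else:
        raise ValueError("unknown expansion kind: %r" % (kind,))


def matches(expansion, speech):
    """True if some combination of choices consumes the whole speech string."""
    return any(rest == "" for rest in remainders(expansion, speech))


play = ("alt", ("lit", "play"), ("lit", "play the"))
something = ("alt", ("lit", "the game"), ("lit", "piano"))
play_something = ("seq", play, something)

print(matches(play_something, "play the game"))
print(matches(play_something, "play the piano"))
```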

@Danesprite

Properly implement weights

Weights for alternatives (defined under JSGF spec. section 4.3.3) are not properly implemented at the moment. They can be compiled, but the parser doesn't recognise them and the matching process ignores them.

Some changes can be made to AlternativeSet to implement them properly, such as a set_weight(self, weight, child) method. The matching process could use weights as an optimisation for matching more likely alternatives first or skipping alternatives with weights of zero.
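
A standalone sketch of both ideas follows: reading an optional /weight/ prefix and using weights to decide the matching order. It is only an illustration under stated assumptions. The function names are hypothetical, the default weight of 1.0 for unweighted alternatives is my own choice, and splitting on | ignores nested groupings, which a real parser would have to handle.

```python
import re

# Optional "/number/" prefix followed by the alternative's text.
WEIGHTED = re.compile(r"\s*(?:/(?P<weight>\d+(?:\.\d+)?)/)?\s*(?P<text>.+?)\s*$")


def parse_weighted_alternatives(text):
    """Split '/10/ a | /2/ b' style text into (weight, alternative) pairs."""
    pairs = []
    for part in text.split("|"):
        found = WEIGHTED.match(part)
        if found is None:
            raise ValueError("empty alternative in %r" % text)
        weight = found.group("weight")
        pairs.append((float(weight) if weight else 1.0, found.group("text")))
    return pairs


def matching_order(pairs):
    """Most likely alternatives first; zero-weight alternatives are skipped."""
    usable = [pair for pair in pairs if pair[0] > 0]
    return [text for _, text in sorted(usable, key=lambda pair: -pair[0])]


pairs = parse_weighted_alternatives("/10/ open |/2/ close |/1/ delete |/1/ move")
print(pairs)
print(matching_order(pairs))
```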

It should be possible to change the parser to check for an optional weight value.

parsing error for rules with [x] followed by x

Try the following script:

from jsgf import parser, Grammar
rule1 = parser.parse_rule_string('public <test1> = [la] lady;')
rule2 = parser.parse_rule_string('public <test2> = [la] bady;')
grammar = Grammar()
grammar.add_rule(rule1)
grammar.add_rule(rule2)
print(grammar.compile())
matching1 = grammar.find_matching_rules('lady')
matching2 = grammar.find_matching_rules('bady')
print('Matching "lady":', matching1)
print('Matching "bady":', matching2)

The output will show the following:

#JSGF V1.0;
grammar default;
public <test1> = [la] lady;
public <test2> = [la] bady;

Matching "lady": []
Matching "bady": [Rule(name='test2', visible=True, expansion=Sequence(OptionalGrouping(Literal('la')), Literal('bady')))]

Notice that the string "lady" is not matched by the grammar, even though rule test1 expands to it.

Import resolution

Currently Import objects added to grammars are compiled but do nothing else. It would be nice if imports were resolved prior to matching so that matching works as expected with cross-grammar references. This is much easier now that the parser is implemented.

The import resolver could check if the specified grammar file exists in a folder hierarchy or as a file. For example, the procedure for the import statement import <com.example.grammar.*>; would be:

1. Check if the Import is currently being resolved and raise an error if it is (circular import resolution).
2. Check if the Import has already been resolved and use the appropriate Grammar object instead of parsing it from a file. Skip to 7.
3. Otherwise check if there is a grammar file called com.example.grammar.jsgf in the working directory.
4. If not, then use os.walk to find the com directory (if it exists), collect files below the directory and check if com/example/grammar.jsgf was collected.
   - This should not be done if the import statement doesn't use a qualified name, e.g. something like import <grammar.*>;.
5. Raise an error if neither the grammar.jsgf nor com.example.grammar.jsgf files were found.
6. Parse the file that was found using parse_grammar_file and get a Grammar object.
7. Resolve any import statements in the new Grammar object. Pass a list of Import objects currently being resolved to detect circular importing and a list of already resolved Imports for optimisation.
8. Add the new Grammar object to an imported_grammars list in the grammar it was imported from.
9. After all import statements are resolved, disable public rules in the new Grammar object(s) that were not imported. Not necessary in this case because of the .* in the import statement.

After all this is accomplished, rules that reference a rule in another grammar should find the referenced rule using the imported_grammars list. If the list doesn't contain the required Grammar object, the relevant Import should be resolved. An error should be raised if the import resolution fails.
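
The lookup and circular-import parts of the procedure can be sketched without any pyjsgf code. Everything below is a hypothetical illustration: the function names are made up, and the imports_of dictionary stands in for parsing grammar files and reading their import statements.

```python
import os


def grammar_name_from_import(import_name):
    """'com.example.grammar.*' or 'com.example.grammar.rule' -> 'com.example.grammar'"""
    return import_name.rsplit(".", 1)[0]


def candidate_paths(grammar_name, extension=".jsgf"):
    """Relative file paths to try for a grammar, in order of preference."""
    flat = grammar_name + extension
    nested = os.path.join(*grammar_name.split(".")) + extension
    # An unqualified name has no directory form, so only one candidate exists.
    return [flat] if nested == flat else [flat, nested]


def resolve(grammar_name, imports_of, resolving=None, resolved=None):
    """Resolve a grammar and everything it imports, detecting circular imports.

    `imports_of` maps a grammar name to the names of the grammars it imports.
    """
    resolving = [] if resolving is None else resolving
    resolved = {} if resolved is None else resolved
    if grammar_name in resolving:
        raise ValueError("circular import: %s"
                         % " -> ".join(resolving + [grammar_name]))
    if grammar_name in resolved:
        return resolved  # already done, nothing to parse again
    resolving.append(grammar_name)
    for dependency in imports_of.get(grammar_name, []):
        resolve(dependency, imports_of, resolving, resolved)
    resolving.pop()
    resolved[grammar_name] = list(imports_of.get(grammar_name, []))
    return resolved


print(candidate_paths(grammar_name_from_import("com.example.grammar.*")))
print(sorted(resolve("a", {"a": ["b", "c"], "b": ["c"], "c": []})))
```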

Matching issue with alternatives and OptionalGrouping

The following rule does not always match when it should: public <test> = this is a [big] (sentence | file);
Matches the following:

'this is a sentence'
'this is a big sentence'
'this is a file'

But it does not match:

'this is a big file'

It definitely should match the last sentence, too.

Same error with * instead of [ ].

Recursive rule definitions

JSGF spec sections 4.8-4.9 discuss right recursive rules:

<command> = <action> | (<action> and <command>);
<action> = stop | start | pause | resume | finish;

and nested right recursive rules:

<X> = something | <Y>;
<Y> = another thing <X>;

While constructing or parsing rules like these works, matching speech to them doesn't. Matching will fail with a "maximum recursion depth exceeded" error.

The JSGF spec has some notes on how to support right-recursive rules:

Any right recursive rule can be re-written using the Kleene star '*' and/or the plus operator '+'. For example, the following rule definitions are equivalent:

   <command> = <action> | (<action> and <command>);
   <command> = <action> (and <action>) *;
Although it is possible to re-write right recursive grammars using the '+' and '*' operators, the recursive form is permitted because it allows simpler and more elegant representations of some grammars. Other forms of recursion (left recursion, embedded recursion) are not supported because the re-write condition cannot be guaranteed.

With that in mind, it should be possible to use rewritten matcher elements somehow when right recursion is detected.
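
The equivalence from the spec can be checked with a small standalone sketch. It uses Python's re module rather than pyjsgf's matcher elements, and the function names are hypothetical. The first function follows the recursive definition of <command> directly, while the second uses the re-written form <action> (and <action>) *.

```python
import re

ACTIONS = ("stop", "start", "pause", "resume", "finish")
ACTION = r"(?:%s)" % "|".join(ACTIONS)

# Re-written form: <command> = <action> (and <action>) *;
COMMAND = re.compile(r"%s(?:\s+and\s+%s)*\Z" % (ACTION, ACTION))


def matches_rewritten(speech):
    """Match using the Kleene star form of the rule."""
    return COMMAND.match(speech.strip()) is not None


def matches_recursive(words):
    """Match using <command> = <action> | (<action> and <command>);"""
    if not words or words[0] not in ACTIONS:
        return False
    if len(words) == 1:
        return True
    return len(words) >= 3 and words[1] == "and" and matches_recursive(words[2:])


for speech in ("stop", "start and pause and finish", "stop and", "and stop"):
    print(speech, matches_rewritten(speech), matches_recursive(speech.split()))
```

The recursive function terminates because every recursive call receives a shorter word list, whereas the re-written pattern needs no recursion at all.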

Plan: Fix errors and clean up this project

I wrote this library many years ago now. There are a number of problems with it. It's a bit of a mess, really. It needs to be redesigned.

My plan is to clean it up for the next (and final) major version: 2.0.0. This will involve some backwards-incompatible changes to the API. Many symbols (functions, methods, properties) are unnecessary and will be removed.

I will also try to fix a number of problems in a subsequent release, probably either version 1.9.1 or 1.10.0, without removals or backwards-incompatible changes.

I'll update this issue with more detail, when I have it.

Minor JSGF header format issue

Admittedly a minor issue, but I've just noticed the Grammar class and parser don't make the charset and language identifiers in the grammar header optional (see JSGF spec section 3.1). It shouldn't be difficult to change the parser to make these optional and the Grammar class to compile grammar headers with only the values that have been set.
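
A standalone sketch of what this could look like is below. It uses a plain regular expression rather than the pyparsing-based parser, and the function names are hypothetical. The language identifier is only accepted or emitted when a charset is present, because the header fields are positional.

```python
import re

# Version is required; charset and language are optional and positional.
HEADER = re.compile(
    r"#JSGF\s+(?P<version>V\d+\.\d+)"
    r"(?:\s+(?P<charset>[^\s;]+)(?:\s+(?P<language>[^\s;]+))?)?"
    r"\s*;\Z"
)


def parse_header(line):
    """Return the header fields as a dict; missing optional fields are None."""
    found = HEADER.match(line.strip())
    if found is None:
        raise ValueError("not a JSGF header: %r" % line)
    return found.groupdict()


def compile_header(version, charset=None, language=None):
    """Build a header line containing only the values that have been set."""
    parts = ["#JSGF", version]
    if charset:
        parts.append(charset)
        if language:
            # A language without a charset would be read back as the charset.
            parts.append(language)
    return " ".join(parts) + ";"


print(parse_header("#JSGF V1.0;"))
print(parse_header("#JSGF V1.0 UTF-8 en;"))
print(compile_header("V1.0", "UTF-8"))
```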

This shouldn't make any difference for compiling grammars for CMU Pocket Sphinx, as it accepts grammar headers as per the spec.

Unary operators are not parsed correctly if used as alternatives

I have noticed a bug with the parser where unary operators * and + are not parsed correctly if used directly in alternative sets. For example:

parse_expansion_string('small+|medium+|large+').compile()
# produces '(small|+|medium|+|(large)+)' instead of '((small)+|(medium)+|(large)+)'

Alternative set sequences aren't parsed correctly with parentheses

Alternative sequences aren't parsed correctly with parentheses. For example:

parses to:

instead.

First alternatives cannot be tagged and repeated without additional grouping

There is a parser bug where the first alternative of a set cannot be weighted and repeated simultaneously. A GrammarError will be raised if this is attempted. For example:

parse_expansion_string("/10/a+|/20/b")

This also applies to the Kleene star operator (*). This doesn't occur with subsequent weighted alternatives such as /10/a|/20/b+ or /10/a|/20/b|/30/c+ because of how rule expansion parsing is done.

The problem can be sidestepped by wrapping the repetition in a required or optional grouping: /10/(a+)|/20/b or /10/[a+]|/20/b. This is not required though; I will shortly be releasing a new version that fixes this.