Git Product home page Git Product logo

putput's Introduction

About

Build Status

In NLP, and especially when working with chatbots, it is useful to map utterances to tokens for further processing. For example, if you tell a smart speaker to "play Summer Hits on Spotify", a tokenizer may convert that utterance to the 3 tokens, [ACTION(play)] [PLAYLIST(Summer Hits)] [MUSIC_SERVICE(Spotify)]. Given these tokens, the speaker can process your request.

While building or testing a tokenizer, it's likely that you do not have enough utterances and corresponding tokens. Often, it is not possible to source them organically. Therefore, you have to come up with them yourself.

putput (pattern-utterance, repeated in a pattern) makes it trivial to create utterances and tokens in abundance by defining simple patterns via a YAML file.

Algorithm

With only two types of patterns, all utterances and tokens can be created. Once again, let's consider the example of a smart speaker.

The first type of pattern, a token pattern, describes the combinations of words that make up a token. For instance, one example of the PLAY token could be described with the pattern [[he, she], [wants], [to play, to listen to]]. This pattern means that all ordered permutations, (he wants to play, she wants to play, he wants to listen to, she wants to listen to) describe the PLAY token.

The second type of pattern, an utterance pattern, describes the structure of an utterance as a combination of tokens. Examples of utterance patterns include [WAKE], [WAKE, PLAY, SONG, INDICATOR, ARTIST], and so forth.

By defining these two types of patterns in a YAML file, putput can create utterances and associated tokens by:

  • combining the ordered permutations of words to create phrases described by token patterns
  • combining phrases described by token patterns to create utterances described by utterance patterns

For illustrative purposes, the YAML file may look like:

tokens:
  - static:
    - WAKE:
      - [[hey, ok], [speaker, sound system]]
    - PLAY:
      - [[he, she], [wants, needs], [to play, to listen to]]
      - [[play, turn on]]
    - SONG:
      - [[here comes the sun, stronger]]
    - INDICATOR:
      - [[by, performed by]]
    - ARTIST:
      - [[the beatles, kanye west]]
utterances:
  - [WAKE]
  - [WAKE, PLAY, SONG, INDICATOR, ARTIST]

For the utterance pattern [WAKE], putput will generate the utterances: [hey speaker, hey sound system, ok sound system, ok speaker].

For the utterance pattern [WAKE, PLAY, SONG, INDICATOR, ARTIST], putput can generate 320 utterances. Here is a sample of 5 of them: [ok speaker he wants to play here comes the sun by kanye west, ok sound system she needs to play stronger by the beatles, ok sound system play here comes the sun by the beatles, hey speaker she needs to play here comes the sun performed by kanye west, hey sound system play here comes the sun by kanye west].

Note: Although some of the utterances are not valid (for example, "here comes the sun" is not by kanye west), they can still provide utility in testing.

Even with just a few patterns, it is possible to generate millions of utterances and tokens. Don't worry -- putput can be configured to generate as few utterances as you'd like, so you will not run into the issue of running out of computational power.

Usage

  • clone this repo
  • cd into the top level directory
  • execute pip install -r requirements.txt
  • execute pip install -e putput
  • Sample applications can be found in samples/. Each sample exists in its own directory, with a main.py module. To run a sample, execute python -m samples.specific_sample.main. For instance, to run the smart speaker example from the README, execute python -m samples.smart_speaker.main
  • to run tests, execute nose2 putput --with-coverage

The following notes provide context for trickier parts of the samples:

Static vs dynamic token patterns

Not every token pattern can be specified before runtime. Inside the YAML file, underneath token_patterns, you may specify static or dynamic token patterns. Static means all of the token patterns can be specified before the application runs. For instance, there are only a few ways to tell a smart speaker to wake. Dynamic means the token patterns will be specified at runtime. For instance, you may have thousands of songs that change daily, so you could not specify them as part of a YAML file. Instead, you can load them at runtime, perhaps after calling a web service, and pass them to the input processor.

Token handlers

Applications using putput should have the ability to specify how to create a token. Consider two smart speakers. A simpler speaker can only wake on the phrase "hey speaker". A more advanced speaker can wake on "hey speaker" and "ok speaker", and it performs slightly different actions depending on the phrase. For the simpler speaker, tokenizing "hey speaker" to WAKE is enough. For the advanced speaker, tokenizing "hey speaker" and "ok speaker" to WAKE is not enough because the difference between the two phrases would be lost. Therefore, token handlers, or functions that accept as input the phrase to be tokenized, can be specified. For instance, a token handler could be specified for WAKE so output tokens could look like [WAKE(hey speaker)] and [WAKE(ok speaker)].

By default, phrases are tokenized to [token]. Default behavior can be overridden with a token handler just like an any other token. To do so, specify the token DEFAULT in the token handler dictionary.

putput's People

Contributors

michaelperel avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.