
keras-text's Introduction

Keras Text Classification Library


keras-text is a one-stop text classification library implementing various state-of-the-art models with a clean and extendable interface for implementing custom architectures.

Quick start

Create a tokenizer to build your vocabulary

  • To represent your dataset as (docs, words) use WordTokenizer
  • To represent your dataset as (docs, sentences, words) use SentenceWordTokenizer
  • To create arbitrary hierarchies, extend Tokenizer and implement the token_generator method.
from keras_text.processing import WordTokenizer


tokenizer = WordTokenizer()
tokenizer.build_vocab(texts)

Want to tokenize with character tokens to leverage character models? Use CharTokenizer.
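
For instance, a minimal sketch of character-level tokenization, assuming CharTokenizer is exposed from keras_text.processing alongside WordTokenizer (verify against the API docs):

from keras_text.processing import CharTokenizer


# Build a character-level vocabulary instead of a word-level one.
char_tokenizer = CharTokenizer()
char_tokenizer.build_vocab(texts)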

Build a dataset

A dataset encapsulates the tokenizer, X, y, and the test set. This allows you to focus your efforts on trying various architectures/hyperparameters without having to worry about inconsistent evaluation. A dataset can be saved to and loaded from disk.

from keras_text.data import Dataset


ds = Dataset(X, y, tokenizer=tokenizer)
ds.update_test_indices(test_size=0.1)
ds.save('dataset')

The update_test_indices method automatically stratifies multi-class or multi-label data correctly.
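
A hedged sketch of restoring a saved dataset in a later session; the loader name used here (Dataset.load) is an assumption, so verify it against the API docs:

from keras_text.data import Dataset


# Hypothetical counterpart to ds.save('dataset'); check the actual API before use.
ds = Dataset.load('dataset')
tokenizer = ds.tokenizer  # assumed attribute: the tokenizer is stored with the dataset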

Build text classification models

See tests/ folder for usage.

Word based models

When the dataset is represented as (docs, words), word-based models can be created using TokenModelFactory.

from keras_text.models import TokenModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN


# RNN models can use `max_tokens=None` to indicate a variable number of words per mini-batch.
factory = TokenModelFactory(1, tokenizer.token_index, max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

Currently supported models include:

  • Yoon Kim CNN
  • Stacked RNNs
  • Attention (with/without context) based RNN encoders.

TokenModelFactory.build_model uses the provided word encoder and classifies its output via a Dense block.
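
A hedged sketch of fitting the compiled model. It assumes tokenizer.encode_texts returns nested token indices per document (the method is referenced in the issues below) and that labels is a one-hot label array you supply; padding uses the standard Keras utility. The exact preprocessing pipeline may differ, so treat this as illustrative only:

import numpy as np
from keras.preprocessing.sequence import pad_sequences


# Encode raw texts to token indices and pad to the factory's max_tokens.
encoded = tokenizer.encode_texts(texts)        # assumed to yield [[token_idx, ...], ...] per doc
X_train = pad_sequences(encoded, maxlen=100)   # match max_tokens=100 above
y_train = np.array(labels)                     # hypothetical one-hot labels, shape (num_docs, num_classes)

model.fit(X_train, y_train, batch_size=32, epochs=5)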

Sentence based models

When the dataset is represented as (docs, sentences, words), sentence-based models can be created using SentenceModelFactory.

from keras_text.models import SentenceModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN, AveragingEncoder


# Pad max sentences per doc to 500 and max words per sentence to 200.
# Can also use `max_sents=None` to allow variable sized max_sents per mini-batch.
factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500, max_tokens=200, embedding_type='glove.6B.100d')
word_encoder_model = AttentionRNN()
sentence_encoder_model = AttentionRNN()

# Allows you to compose arbitrary word encoders followed by sentence encoder.
model = factory.build_model(word_encoder_model, sentence_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

Currently supported models include:

  • Yoon Kim CNN
  • Stacked RNNs
  • Attention (with/without context) based RNN encoders.

SentenceModelFactory.build_model creates a tiered model where words within a sentence are first encoded using
word_encoder_model. The resulting sentence encodings are then encoded using sentence_encoder_model.

  • Hierarchical attention networks (HANs) can be built by composing two attention-based RNN models. This is useful when a document is very large.
  • For smaller documents, a reasonable way to encode sentences is to average the words within them. This can be done by using token_encoder_model=AveragingEncoder() (see the sketch after this list).
  • Mix and match encoders as you see fit for your problem.
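
For example, a minimal sketch of the averaging variant for shorter documents, reusing the factory setup from above; passing AveragingEncoder as the word-level encoder follows the note above, but the exact call should be checked against the API docs:

from keras_text.models import SentenceModelFactory
from keras_text.models import AveragingEncoder, AttentionRNN


factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500, max_tokens=200, embedding_type='glove.6B.100d')

# Average word embeddings within each sentence, then encode the sentence vectors with an attention RNN.
model = factory.build_model(AveragingEncoder(), AttentionRNN())
model.compile(optimizer='adam', loss='categorical_crossentropy')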

Resources

TODO: Update documentation and add notebook examples.

Stay tuned for better documentation and examples. Until then, the best resource is the API docs.

Installation

  1. Install Keras with the Theano or TensorFlow backend. Note that this library requires Keras > 2.0

  2. Install keras-text

From sources

sudo python setup.py install

PyPI package

sudo pip install keras-text
  3. Download the target spacy model

keras-text uses the excellent spacy library for tokenization. See the spacy documentation for instructions on how to download the model for your target language.
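
For example, the English model for spacy 2.x is typically downloaded with the standard spacy command below (this is a spacy command, not part of keras-text):

python -m spacy download en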

Citation

Please cite keras-text in your publications if it helped your research. Here is an example BibTeX entry:

@misc{raghakotkerastext,
  title={keras-text},
  author={Kotikalapudi, Raghavendra and contributors},
  year={2017},
  publisher={GitHub},
  howpublished={\url{https://github.com/raghakot/keras-text}},
}

keras-text's People

Contributors

raghakot

keras-text's Issues

Install from Anaconda

I couldn't find any repos for installing it with conda install. Was anyone able to install it from an Anaconda environment?

Using trained model

Hi,

I used the package to train a model correctly; now the question is how I can use the trained model to make predictions. I give my training and testing sets as arrays of lists of strings:

test_x = ['cat fat hat', 'lorem ipsum pretorium', ... ,'this is a list']

The Dataset routine essentially also creates numpy arrays from the lists of strings, so it should work with similar lists/arrays? I tried to use model.predict(test_x), but I get the following error:

Error when checking input: expected input_4 to have 3 dimensions, but got array with shape (20000, 1)

Any advice?

Any example notebooks?

Hi,
Do you have any notebook examples for any real text dataset that I can refer to?
Thanks.
Selva

token_model Problem

After running the following code, I receive 'ModuleNotFoundError: No module named 'token_model''

with open('tweets1k.txt', 'r') as infile:
    tweets = infile.readlines()

tokenizer = WordTokenizer()
tokenizer.build_vocab(tweets)

ds = Dataset(tweets, emojis, tokenizer=tokenizer)
ds.update_test_indices(test_size=0.2)
ds.save('dataset')

factory = TokenModelFactory(1, tokenizer.token_index, max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

How can I solve this, please?

Python 3 issue fix - dict_values does not support indexing

There is an issue with calling the following code in Python 3+:
self.embeddings_index.values()[0]

Reason
In Python 3, dict.values() does not return a list, and the following error will be raised:

dict_values does not support indexing

Solution
In Python 3+, the line should be updated as follows:
list(self.embeddings_index.values())[0].shape[-1]
in the following files:

  • token_model.py
  • embeddings.py

Python 3 issue fix - filter object has no attribute 'sort'

Calling the function "apply_encoding_options" using Python 3 raises the following error:

AttributeError: 'filter' object has no attribute 'sort'

Reason:
The following two lines cause the issue:
token_counts = filter(lambda x: x[1] >= min_token_count, token_counts)
token_counts.sort(key=lambda x: x[1], reverse=True)
In Python 3 the filter function returns a filter object.
In Python 2.7 the filter function returns a list (the code works correctly there).

Suggested Solution:
Edit the function 'apply_encoding_options' inside 'processing.py' to filter and sort without relying on a filter object, as follows:
token_counts = sorted((x for x in token_counts if x[1] >= min_token_count), reverse=True, key=lambda x: x[1])

ValueError in train_val_split() due to multi-class dataset

I get a ValueError when I try to make a dataset with strings as input. I want to assign 1 out of 5 classes to each string. I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [21643, 108215]

even though my labels array has shape (21643,), just like the shape of my input array.

When I change to a two-class problem there is no problem.

Compatible with spacy 2.0.3?

Using this code:

from keras_text.processing import WordTokenizer
tokenizer = WordTokenizer()
tokenizer.build_vocab(["this is a text", "an other "])

I get an error:

TypeError                                 Traceback (most recent call last)
<ipython-input-12-a4643a71418a> in <module>()
      1 from keras_text.processing import WordTokenizer
      2 tokenizer = WordTokenizer()
----> 3 tokenizer.build_vocab(["this is a text hello", "an other "])

~/venvs/srPrimaryPredFull/lib/python3.6/site-packages/keras_text-0.1-py3.6.egg/keras_text/processing.py in build_vocab(self, texts, verbose, **kwargs)
    367         self._num_texts = len(texts)
    368 
--> 369         for token_data in self.token_generator(texts, **kwargs):
    370             indices, token = token_data[:-1], token_data[-1]
    371             count_tracker.update(indices)

~/venvs/srPrimaryPredFull/lib/python3.6/site-packages/keras_text-0.1-py3.6.egg/keras_text/processing.py in token_generator(self, texts, **kwargs)
    549         }
    550 
--> 551         for text_idx, doc in enumerate(nlp.pipe(texts, **kwargs)):
    552             for word in doc:
    553                 processed_word = self._apply_options(word)

TypeError: pipe() got an unexpected keyword argument 'entity'

It seems to me that the code is not compatible with spacy 2.0.3, the latest version.

Inference

Hi,

A silly question, but I'm following along with the tutorial building the model, but I'm having trouble trying to perform inference with new data.

For example, if trained on an IMDB dataset with 0/1 labels, I want to run a new sentence through the model, and I want to do it in a proper way. Right now I'm taking the raw text ("I loved this movie.") and feeding it to the tokenizer.encode_texts() method, then using the tokenizer's embeddings_index to attach the embeddings, etc.

But I'm pretty sure I'm doing it wrong. I was wondering if there is an example of doing out-of-sample inference after training the model. Thank you!

tokenizer.build_vocab error due to progbar.update problem

      3 tokenizer = SentenceWordTokenizer()
----> 4 tokenizer.build_vocab(X)

~/anaconda3/envs/msa/lib/python3.6/site-packages/keras_text/processing.py in build_vocab(self, texts, verbose, **kwargs)
    381         count_tracker.finalize()
    382         self._counts = count_tracker.counts
--> 383         progbar.update(len(texts), force=True)
    384 
    385 

TypeError: update() got an unexpected keyword argument 'force'

Relative reference

from token_model import TokenModelFactory
from sentence_model import SentenceModelFactory
from sequence_encoders import *

It seems a "." is necessary at the head of the __init__.py of models

from .token_model import TokenModelFactory
from .sentence_model import SentenceModelFactory
from .sequence_encoders import *

Failed module import

Maybe related to the whole bunch of python 3 issues around the repo, but a simple

from keras_text.models import TokenModelFactory

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-8-96307cb1e937> in <module>()
      1 # Will automagically handle padding for models that require padding (Ex: Yoon Kim CNN)
----> 2 from keras_text.models import TokenModelFactory
      3 from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN
      4 factory = TokenModelFactory(1, tokenizer.token_index, max_tokens=100, embedding_type='glove.6B.100d')
      5 word_encoder_model = YoonKimCNN()

~/miniconda/envs/deeplearn/lib/python3.6/site-packages/keras_text/models/__init__.py in <module>()
----> 1 from token_model import TokenModelFactory
      2 from sentence_model import SentenceModelFactory
      3 from sequence_encoders import *

Unexpected keyword argument for progress bar

TypeError                                 Traceback (most recent call last)
<ipython-input-41-09a65e684086> in <module>()
      4 print(texts[0])
      5 tokenizer = WordTokenizer()
----> 6 tokenizer.build_vocab(texts)
      7 
      8 #ds = Dataset(X, y, tokenizer=tokenizer)

~/miniconda/envs/deeplearn/lib/python3.6/site-packages/keras_text/processing.py in build_vocab(self, texts, verbose, **kwargs)
    381         count_tracker.finalize()
    382         self._counts = count_tracker.counts
--> 383         progbar.update(len(texts), force=True)
    384 
    385     def get_counts(self, i):

TypeError: update() got an unexpected keyword argument 'force'

joblib and jsonpickle should be added as dependencies

Importing Dataset leads to an error.

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-35-a263e6a09a43> in <module>()
      5 warnings.filterwarnings('ignore')
      6 
----> 7 from keras_text.data import Dataset

~/miniconda/envs/deeplearn/lib/python3.6/site-packages/keras_text/data.py in <module>()
      4 import numpy as np
      5 
----> 6 from . import utils
      7 from . import sampling
      8 

~/miniconda/envs/deeplearn/lib/python3.6/site-packages/keras_text/utils.py in <module>()
      3 import numpy as np
      4 import pickle
----> 5 import joblib
      6 import jsonpickle
      7 

ModuleNotFoundError: No module named 'joblib'

Struggling with visualization of attention tensor

Hi guys,

I have successfully trained a classifier with the attention-with-context mechanism, but I'm struggling with how to call the function get_attention_tensor. Do you have any clues on how to make it work?

Thanks !

Léo

No module named 'build_vocab'

I import the following libraries:

import torch
import matplotlib.pyplot as plt
import numpy as np
import argparse
import pickle
import os
from torchvision import transforms
from build_vocab import Vocabulary
from model import EncoderCNN, DecoderRNN
from PIL import Image

I got an error :

ModuleNotFoundError Traceback (most recent call last)
in ()
6 import os
7 from torchvision import transforms
----> 8 from build_vocab import Vocabulary
9 from model import EncoderCNN, DecoderRNN
10 from PIL import Image

ModuleNotFoundError: No module named 'build_vocab'

Could anyone please help find a solution for this error?

Thanks !
