Git Product home page Git Product logo

brill's Introduction

Transformation-based POS tagger

This is a transformational part-of-speech tagger, implementing 8 basic rule templates from the classic Brill paper [1]:

Change tag a to tag b when:

  1. The preceding (following) word is w.
  2. The word two before (after) is w.
  3. One of the two preceding (following) words is w.
  4. The current word is w and the preceding (following) word is x.
  5. The current word is w and the preceding (following) word is tagged z.
  6. The current word is w.
  7. The preceding (following) word is w and the preceding (following) tag is t.
  8. The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.

Howto

Installation:

  1. git clone https://github.com/szmer/brill.git
  2. cd brill
  3. make && make clean

Use ./Brill-train to learn the transformations (on your training corpus). Use ./Brill-test to perform the test (on your test corpus). Both these programs display some help if invoked without arguments.

Your tagset can have up to 499 distinct entries.

About & Implementation

I wrote this as an assignment for the Linguistic Engineering class (part of the Cognitive Science program in the University of Warsaw). The main assignment was to implement a trigram tagger in Python from scratch, but of course I chose to go as hardcore as I possibly could.

600 transformations learned on training corpus (extracted from the Brown corpus, 900k words, 472 tags) give 93.42% accuracy on the test corpus (100k words). (To train this it takes ~4 hours.) This compares to the 90.14% achieved by an unigram tagger and 92.28% accuracy achieved by a trigram tagger (shipped with NLTK Python library).

The main problem with implementation of a Brill classifier is that the size of space of possible rules is rather large. The Brown corpus, which I used as an example, actually has 472 part-of-speech tags, yielding ca. 670k possible rules just from the templates mentioned above. This figure should be multiplied by the number of the words in the corpus, and the number of algorithm iterations (we need one for each rule to be learned).

Some way of mitigating this problem was proposed e.g. in [2], and in the papers cited there. I had little experience with more advanced data structures at the time when I wrote this program, and eventually came up with my own crude, but working solution. Essentially:

  1. Each possible rule is assigned an unique integer (which we can compute in place) for speed of access.
  2. We compute the introduced error only for the rules which correct some mistake from the previous iterations (these are the eligible rules),
  3. In the second loop, these rules are removed from eligible rules set as soon as they prove to introduce more errors than they correct.

[1] E. Brill, 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computation Linguistics, 21(4):543-565. (pdf)

[2] G. Ngai and R. Florian, 2001. Transformation-Based Learning in the Fast Lane. Second Meeting of the North American Chapter of the Association for Computational Linguistics. (pdf)

License

MIT.

brill's People

Contributors

szmer avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

melodykxq

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.