Course materials for Applied Natural Language Processing (Spring 2019). Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html
Date | Activity | Summary |
---|---|---|
1/22 | Follow setup instructions in 0.setup/ | Install Anaconda and set up the class environment with specific Python libraries. |
1/24 | Complete 1.words/ExploreTokenization_TODO.ipynb before class | This notebook outlines several methods for tokenizing text into words (and sentences), including whitespace, nltk (Penn Treebank tokenizer), nltk (Twitter-aware), spaCy, and custom regular expressions, highlighting differences between them. |
1/24 | Execute 1.words/EvaluateTokenizationForSentiment.ipynb | This notebook evaluates different methods for tokenization and stemming/lemmatization and assesses their impact on binary sentiment classification, using a train/dev sample of 1000 reviews from the Large Movie Review Dataset. Each tokenization method is evaluated with the same learning algorithm (L2-regularized logistic regression); the only difference is the tokenization process. For more, see: http://sentiment.christopherpotts.net/tokenizing.html |
1/24 | Complete 1.words/TokenizePrintedBooks_TODO.ipynb | Design a better tokenizer for printed texts that have been OCR'd (where words are often hyphenated at line breaks). |
1/29 | Complete 2.distinctive_terms/CompareCorpora_TODO.ipynb | This notebook explores two methods for comparing textual datasets to identify the terms that are distinctive of each one: difference of proportions (described in Monroe et al. 2009, Fighting Words, section 3.2.2) and the Mann-Whitney rank-sums test (described in Kilgarriff 2001, Comparing Corpora, section 2.3). |
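
As a rough illustration of the difference-of-proportions idea from the 1/29 notebook, here is a minimal sketch, assuming each corpus is already tokenized into a list of words. The function name `difference_of_proportions` and the two toy corpora are hypothetical and are not taken from the course notebooks; the score for a term is simply its relative frequency in corpus A minus its relative frequency in corpus B.

```python
from collections import Counter


def difference_of_proportions(tokens_a, tokens_b):
    """Map each term to (relative frequency in A) - (relative frequency in B).

    Illustrative helper (not from the course materials). Positive scores mark
    terms relatively more frequent in A, negative scores terms more frequent in B.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")

    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)

    # Score every term seen in either corpus; Counter returns 0 for unseen terms.
    vocab = set(counts_a) | set(counts_b)
    return {
        term: counts_a[term] / total_a - counts_b[term] / total_b
        for term in vocab
    }


# Toy corpora, whitespace-tokenized purely for illustration.
corpus_a = "the movie was great great fun".split()
corpus_b = "the movie was dull and slow".split()

scores = difference_of_proportions(corpus_a, corpus_b)

# Rank terms from most distinctive of A to most distinctive of B.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranked:
    print(f"{term:>6}  {score:+.3f}")
```

Because raw proportion differences tend to be dominated by high-frequency words, this works best as a baseline to compare against the rank-based Mann-Whitney test rather than as a final answer.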