The anlp19's intro from allensmile

Course materials for Applied Natural Language Processing (Spring 2019). Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html

Date	Activity	Summary
1/22	Follow setup instructions in 0.setup/	Install anaconda and set up environment for class with specific Python libraries.
1/24	Complete 1.words/ExploreTokenization_TODO.ipynb before class	This notebook outlines several methods for tokenizing text into words (and sentences), including whitespace, nltk (Penn Treebank tokenizer), nltk (Twitter-aware), spaCy, and custom regular expressions, highlighting differences between them.
1/24	Execute 1.words/EvaluateTokenizationForSentiment.ipynb	This notebook evaluates different methods for tokenization and stemming/lemmatization and assesses their impact on binary sentiment classification, using a train/dev dataset of 1000 reviews sampled from the Large Movie Review Dataset. Each tokenization method is evaluated with the same learning algorithm (L2-regularized logistic regression); the only difference is the tokenization process. For more, see: http://sentiment.christopherpotts.net/tokenizing.html
1/24	Complete 1.words/TokenizePrintedBooks_TODO.ipynb	Design a better tokenizer for printed texts that have been OCR'd (where words are often hyphenated at line breaks).
1/29	Complete 2.distinctive_terms/CompareCorpora_TODO.ipynb	This notebook explores two methods for comparing textual datasets to identify the terms that are distinctive to each one: the difference of proportions (described in Monroe et al. 2009, Fighting Words, section 3.2.2) and the Mann-Whitney rank-sums test (described in Kilgarriff 2001, Comparing Corpora, section 2.3).
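
To give a rough feel for the 1/24 tokenization notebooks, here is a minimal sketch using only the Python standard library. It is not the notebooks' code: the pattern `TOKEN_RE` and the helpers `whitespace_tokenize`, `regex_tokenize`, and `dehyphenate` are hypothetical names chosen for illustration, and the de-hyphenation rule is a deliberately naive assumption.

```python
import re

# Hypothetical pattern (an assumption, not the course's): a word with optional
# internal apostrophes/hyphens, or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def regex_tokenize(text):
    """Split into words and standalone punctuation marks using TOKEN_RE."""
    return TOKEN_RE.findall(text)


def dehyphenate(text):
    """Rejoin a word that was hyphenated at a line break (common in OCR'd books).

    Naive assumption: every hyphen followed by a newline is a line-break hyphen,
    so a genuine compound split across lines ("well-\\nknown") is also merged.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


sample = "The tokeni-\nzation wasn't easy, was it?"
joined = dehyphenate(sample)
ws_tokens = whitespace_tokenize(joined)
re_tokens = regex_tokenize(joined)

print(ws_tokens)
print(re_tokens)
```

The whitespace tokenizer leaves `easy,` and `it?` as single tokens, while the regex tokenizer separates the punctuation; differences of this kind are what the sentiment evaluation notebook measures downstream.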

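The two corpus-comparison measures named in the 1/29 row can be sketched as follows. This is an illustrative sketch under stated assumptions, not the notebook's implementation: the function names, the toy corpora, and the chunk size are all hypothetical, and the rank-sum part assumes each corpus is split into equal-sized chunks whose per-chunk counts of a word are then ranked together.

```python
from collections import Counter


def difference_of_proportions(tokens_a, tokens_b):
    """Return {word: f_a(word) - f_b(word)}, where f is relative frequency."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    return {w: counts_a[w] / n_a - counts_b[w] / n_b for w in vocab}


def chunk_counts(tokens, word, chunk_size):
    """Count `word` in consecutive equal-sized chunks (a trailing partial chunk is dropped)."""
    n_chunks = len(tokens) // chunk_size
    return [
        tokens[k * chunk_size:(k + 1) * chunk_size].count(word)
        for k in range(n_chunks)
    ]


def rank_sum(values_a, values_b):
    """Sum of the ranks of values_a in the pooled sample; ties share the average rank."""
    pooled = sorted(values_a + values_b)
    avg_rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold 1-based ranks i+1..j; store their mean.
        avg_rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(avg_rank[v] for v in values_a)


# Toy corpora (made up for illustration only).
corpus_a = "the cat sat on the mat the cat ran".split()
corpus_b = "a dog sat on a log a dog ran".split()

diff = difference_of_proportions(corpus_a, corpus_b)
the_a = chunk_counts(corpus_a, "the", chunk_size=3)
the_b = chunk_counts(corpus_b, "the", chunk_size=3)
w_the = rank_sum(the_a, the_b)

print(sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:3])
print(the_a, the_b, w_the)
```

Words with large positive differences are more characteristic of the first corpus and large negative ones of the second. The rank sum would then be compared against its distribution under the null hypothesis to obtain a significance level, a step omitted here.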