Introduction to Natural Language Processing

Friday March 2, 2017

Natural language processing (NLP) refers to the methods and technologies that allow computers to process, understand, and perform tasks using human language. Common NLP tasks include sentiment analysis, part-of-speech tagging, named entity recognition, machine translation, document classification, clustering, and topic extraction. This course will introduce fundamental concepts in NLP, including word and document representation, text processing, document classification, document similarity, clustering, and dimensionality reduction. The course will be taught using Jupyter notebooks in Python. The NLP tools covered will be scikit-learn and NLTK.

Who: This course is targeted primarily at graduate students and researchers who have some experience with machine learning and Python but are new to NLP.

Requirements: Participants must bring a laptop with a few specific software packages installed (see Pre-Workshop Instructions).

Prerequisites: A previous course in programming is strongly recommended. Experience with basic machine learning is recommended.

Contact: Please mail [email protected] for more information.


Tentative Schedule

Time
8:30-9:00 Sign-in (coffee & bagels)
9:00-10:30 Text Processing and Document Classification
10:30-10:45 Break
10:45-1:00 Document Similarity and Clustering

Syllabus

  1. Introduction/Preparation
  • Common NLP Tasks
  • Word and Document Representation
  • Text Processing
  2. Document Classification
  • Text Processing
  • TF-IDF
  • Evaluation
  3. Clustering
  • Document Similarity
  4. Dimensionality Reduction
  • Topic Modeling
  • Visualization
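
As a rough illustration of two syllabus items, TF-IDF weighting and document similarity, the sketch below builds TF-IDF vectors by hand and compares documents with cosine similarity. It uses only the Python standard library and a toy corpus invented for this example. The function names are hypothetical, and the weighting is the plain tf * log(N / df) form; scikit-learn's TfidfVectorizer applies its own smoothing and normalization, so its numbers will differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus invented for illustration.
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply today",
]
vectors = tfidf_vectors(docs)
sim_cats = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_cats, sim_unrelated)
```

The two cat sentences share terms and so get a positive similarity, while the first and third documents share no terms and score zero.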

Pre-Workshop Instructions

You will need the following programs to run the Jupyter notebook:

  • Git
  • Python
  • Python modules (see requirements.txt)

Python modules can be installed using either Anaconda (recommended for beginners) or pip.

Git

You will need to install git. After installing git, run the following command to clone the workshop repository:

git clone https://github.com/UCIDataScienceInitiative/NLP.git

Python

You will need Python installed. If you do not already have Python installed, we recommend downloading Anaconda, which includes Python and the modules required for this workshop. The notebook will run with Python 3.5 and 2.7.
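
To confirm which version your interpreter reports, you can run a short check such as the one below in a terminal session or a notebook cell. It is a minimal sketch; the helper name `python_version` is made up for this example.

```python
import sys


def python_version():
    """Return the running interpreter's version as 'major.minor'."""
    return "{}.{}".format(sys.version_info[0], sys.version_info[1])


print("Running Python", python_version())
```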

Anaconda

This is the recommended method of installation for users who are new to Python or have not used pip. Anaconda should have all required modules. After installing Anaconda, run

conda update conda

to update conda itself (conda update --all would update every installed package). If the LatentDirichletAllocation class will not import, it may help to update scikit-learn by running

conda update scikit-learn
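
One way to check whether the import works in your environment is sketched below. The helper name `lda_available` is hypothetical; the import path `sklearn.decomposition.LatentDirichletAllocation` is where scikit-learn exposes the class in releases that include it.

```python
def lda_available():
    """Return True if scikit-learn exposes LatentDirichletAllocation."""
    try:
        from sklearn.decomposition import LatentDirichletAllocation  # noqa: F401
    except ImportError:
        # Raised when scikit-learn is missing or too old to include the class.
        return False
    return True


if lda_available():
    import sklearn
    print("LatentDirichletAllocation found in scikit-learn", sklearn.__version__)
else:
    print("LatentDirichletAllocation not importable; try updating scikit-learn")
```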

Pip

To run this script you will need the packages listed in requirements.txt. To install them, run

pip install -r requirements.txt

in the command line.

Operating Systems

I was able to get the notebook running using Python 3.5 and 2.7 on both Mac and Windows machines. If you are having trouble installing any of the required software, please come to the workshop a few minutes early. Additionally, we will have scheduled setup time to address any problems.

Data

The data are comments and metadata from two mental health subreddits, /r/SuicideWatch and /r/depression. The data were filtered from this dataset.

To unzip the data, run

gunzip RC_2015-05.json.gz
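
The compressed file can also be read directly from Python without unzipping it first. The sketch below assumes the file holds one JSON object per line with `subreddit` and `body` fields, as is typical of Reddit comment dumps. The function name `iter_comments` and both field names are assumptions, not taken from the workshop notebook.

```python
import gzip
import json
import os


def iter_comments(path, subreddits=None):
    """Yield comment dicts from a gzipped JSON-lines file.

    If `subreddits` is given, keep only comments whose "subreddit"
    field (an assumed field name) is in that collection.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            comment = json.loads(line)
            if subreddits is None or comment.get("subreddit") in subreddits:
                yield comment


# Usage: print the start of the first few comments, if the file is present.
data_path = "RC_2015-05.json.gz"
if os.path.exists(data_path):
    wanted = {"SuicideWatch", "depression"}
    for i, comment in enumerate(iter_comments(data_path, wanted)):
        print(comment.get("body", "")[:80])
        if i >= 4:
            break
```

Reading line by line keeps memory use low, which matters if the file is large.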

Slides

If you would like to see the presentation in slide form, rather than as an IPython notebook, you can run the following command:

jupyter nbconvert --to slides IntroNLP.ipynb --post serve
