
Natural Language Processing course resources

This repository contains practical assignments for the Natural Language Processing course by the Higher School of Economics: https://www.coursera.org/learn/language-processing. In this course you will learn how to solve common NLP problems using classical and deep learning approaches.

On the practical side, we expect familiarity with Python, since we will use it for all assignments in the course. Two of the assignments will also involve TensorFlow. You will work with many other libraries, including NLTK, Scikit-learn, and Gensim. You have several options for setting up your environment.

1. Running on Google Colab

Google has released its own flavour of Jupyter called Colab, which has free GPUs!

Here's how you can use it:

  1. Open https://colab.research.google.com, click Sign in in the upper right corner, use your Google credentials to sign in.
  2. Click GITHUB tab, paste https://github.com/hse-aml/natural-language-processing and press Enter
  3. Choose the notebook you want to open, e.g. week1/week1-MultilabelClassification.ipynb
  4. Click File -> Save a copy in Drive... to save your progress in Google Drive
  5. If you need a GPU, click Runtime -> Change runtime type and select GPU in Hardware accelerator box
  6. Execute the following code in the first cell to download dependencies (uncomment the line for your week number):
! wget https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/setup_google_colab.py -O setup_google_colab.py
import setup_google_colab
# please, uncomment the week you're working on
# setup_google_colab.setup_week1()  
# setup_google_colab.setup_week2()
# setup_google_colab.setup_week3()
# setup_google_colab.setup_week4()
# setup_google_colab.setup_project()
# setup_google_colab.setup_honor()
  7. If you run many notebooks on Colab, they can keep consuming memory. You can kill them with ! pkill -9 python3 and check with ! nvidia-smi that GPU memory is freed.

Known issues:

  • No support for ipywidgets, so we cannot use fancy tqdm progress bars. For now, we use a simplified version of a progress bar suitable for Colab.
  • Blinking animation with IPython.display.clear_output(). It's usable, but we are still looking for a workaround.
  • If you see the error "No module named 'common'", make sure you've uncommented the assignment-specific line in step 6, then restart your kernel and execute all cells again.
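The simplified progress bar mentioned above can be approximated with plain carriage-return printing. This is a hypothetical sketch, not the course's actual implementation:

```python
import sys

def progress_bar(current, total, width=30):
    """Render a simple text progress bar that works without ipywidgets."""
    filled = int(width * current / total)
    bar = "[" + "#" * filled + "-" * (width - filled) + "] %d/%d" % (current, total)
    # '\r' rewrites the same line in place, avoiding ipywidgets entirely
    sys.stdout.write("\r" + bar)
    sys.stdout.flush()
    return bar

for i in range(1, 6):
    progress_bar(i, 5)
```

Because the bar is drawn with a carriage return, it degrades gracefully to repeated lines in environments that do not support `\r`.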

2. Running locally

Two options here:

  1. Use the Docker container of our course. It already has all the libraries you will need. The setup is simple: install Docker for your OS, download our container image, and run everything within the container. Please see this detailed Docker tutorial.

  2. Manually install all the libraries for your OS (each task lists the needed libraries at the very beginning). If you use Windows/macOS, you might find the Anaconda distribution useful, as it makes it easy to install most of the needed libraries. However, some tools, like StarSpace for week 2, are not compatible with Windows, so you will likely have to use Docker anyway if you attempt those tasks.

It might take a significant amount of time and resources to run the assignment code, but an average laptop should be enough to accomplish the tasks. All assignments were tested in Docker on a Mac with 8GB RAM. If you encounter memory errors, they could be caused by untested configurations or inefficient code. Consider reporting these cases or double-checking your code.

If you want to run the course code on an AWS machine, we've prepared an AWS tutorial here.

Contributors

akashin, anyap, avbelyy, cgurkan, dependabot[bot], eyurtsev, lydia-gu, maribax, nunorc, rmdr, stallians, voron13e02, zemushka


Issues

Python version in docker image

I retrieved the docker image like so:
# docker pull akashin/coursera-aml-nlp

# python3 --version
shows that Python 3.5 is installed.

Unfortunately, Python 3.5 has this bug. Because of this, test_my_bag_of_words() fails, as the dict order is not maintained. This works correctly in Colab, as the Colab Python version is 3.6.

I tried upgrading Python 3.5 on the Docker image to Python 3.7 using this post. The upgrade seems to work, and I also upgraded the Jupyter notebook. But then the notebook doesn't work properly.

Is it possible to provide a docker image with an upgraded python version?
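As a hedged workaround (assuming the failing test only depends on vocabulary ordering), building the word index from a sorted vocabulary makes the result deterministic even on Python 3.5, where plain dicts do not preserve insertion order:

```python
# Hypothetical vocabulary; the real one is built inside the notebook.
vocab = {"memory", "free", "c++", "arr"}

# sorted() gives a deterministic order regardless of dict/set iteration order,
# so the resulting indices are identical on Python 3.5, 3.6, and later.
words_to_index = {word: i for i, word in enumerate(sorted(vocab))}
print(words_to_index)
```

This avoids relying on the CPython 3.6+ insertion-ordered dict behaviour altogether.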

Module common not found

Hi,

The module common is not found. I tried this on Colab and in the Docker container as well.

Here is the traceback:

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()
      1 import sys
      2 sys.path.append("..")
----> 3 from common.download_utils import download_week1_resources
      4
      5 download_week1_resources()

ImportError: No module named 'common'

Week 1 MultiLabelBinarizer

In the Week 1 module, during multiclass training, scikit-learn raises this kind of exception:
"Scikit Learn Multilabel Classification: ValueError: You appear to be using a legacy multi-label data representation..."

So I've found out that we should use MultiLabelBinarizer to preprocess the labels; done.

But when evaluating the "val" dataset on the trained classifiers, a variable "mlb" is referenced that was never instantiated. I assume it refers to a "MultiLabelBinarizer" instance. As you can see, there is an inconsistency here which currently has to be fixed manually.
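A minimal sketch of the presumed fix, instantiating the mlb variable before evaluation (the variable names and tag data below are guesses, not the notebook's actual code):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag data standing in for the notebook's y_train / y_val.
y_train = [["python", "pandas"], ["c++"], ["python"]]
y_val = [["c++", "python"]]

# Fit the binarizer once on the known tag set...
mlb = MultiLabelBinarizer(classes=sorted({tag for tags in y_train for tag in tags}))
y_train_bin = mlb.fit_transform(y_train)
# ...and reuse the SAME mlb instance for the validation set.
y_val_bin = mlb.transform(y_val)
print(mlb.classes_)
```

The key point is that one binarizer is fitted once and shared, so train and val label matrices use identical column ordering.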

Broken link in `week2-NER` notebook

The following link in the Week 2 assignment notebook appears to be broken:

"First, we need to create [placeholders](https://www.tensorflow.org/versions/master/api_docs/python/tf/placeholder) to specify what data we are going to feed into the network during the execution time. For this task we will need the following placeholders:\n",

I believe it should point instead to https://www.tensorflow.org/api_docs/python/tf/compat/v1/placeholder

ModuleNotFound running notebook in CoLab

When I open one of the notebooks in CoLab, specifically week1-MultilabelClassification.ipynb:
https://colab.research.google.com/github/hse-aml/natural-language-processing/blob/master/week1/week1-MultilabelClassification.ipynb

When I try to run the notebook I get ModuleNotFound error on this line:
from common.download_utils import download_week1_resources

I have never run code from GitHub in CoLab and am not sure if I need to do something so that it can find the common module.

Week1: How to open in Colab?

Sorry, stupid question, but how do I open week 1 in Colab? Usually, for my own files, there is always an "Open in Colab" button, but there is none for the week one task?

Running on Google Colab

You need to sign into your Google account first, or you won't see the GitHub tab from the README instructions.

Week 1 incorrect answers in test_my_bag_of_words

In the function test_my_bag_of_words, answers is defined as a list of lists, while it should be just a list.

Original:
answers = [[1, 1, 0, 1]]

Should be:
answers = [1, 1, 0, 1]
since it is compared against the return value of the my_bag_of_words function, which takes text as input and returns a NumPy array.
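For reference, here is a sketch of a my_bag_of_words implementation consistent with the corrected answers = [1, 1, 0, 1]. The vocabulary below is hypothetical, chosen only to reproduce that vector:

```python
import numpy as np

def my_bag_of_words(text, words_to_index, dict_size):
    """Return a binary bag-of-words vector for a single text."""
    result_vector = np.zeros(dict_size)
    for word in text.split():
        if word in words_to_index:
            result_vector[words_to_index[word]] = 1
    return result_vector

# Hypothetical vocabulary mirroring the shape of the course's test
words_to_index = {"hi": 0, "you": 1, "animals": 2, "are": 3}
print(my_bag_of_words("hi how are you", words_to_index, 4))  # [1. 1. 0. 1.]
```

Note the output is a flat array, which is why comparing it against a list of lists fails.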

Problem opening 'data/text_prepare_tests.tsv' file

Using the Docker container environment, I am getting a UnicodeDecodeError. More specifically:

prepared_questions = []
for line in open('data/text_prepare_tests.tsv'):
     line = text_prepare(line.strip())
     prepared_questions.append(line)
text_prepare_results = '\n'.join(prepared_questions)
grader.submit_tag('TextPrepare', text_prepare_results)

This gives the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)

In order to run it I had to change it to:

prepared_questions = []
for line in open('data/text_prepare_tests.tsv', encoding='utf-8'):
     line = text_prepare(line.strip())
     prepared_questions.append(line)
text_prepare_results = '\n'.join(prepared_questions)
grader.submit_tag('TextPrepare', text_prepare_results)

It can also be solved by using pd.read_csv.

Can anyone else reproduce this error?

Week 1: Invalid Syntax

Hello,
This code returns an "Invalid Syntax" error.

REPLACE_BY_SPACE_RE = re.compile('[/(){}[]|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
    text: a string

    return: modified initial string
    """
    text = # lowercase text
    text = # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = # delete symbols which are in BAD_SYMBOLS_RE from text
    text = # delete stopwords from text
    return text
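The "Invalid Syntax" comes from the unfinished template lines (text = # ...), which leave each assignment without a right-hand side. Below is a hedged completion following the template's apparent intent; the stopword set is a small local stand-in so the sketch does not depend on NLTK data being downloaded, and the bracket characters in the first regex are escaped (the unescaped version closes the character class early):

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # note the escaped brackets
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = {"how", "to", "is", "the", "a", "an"}      # stand-in for nltk stopwords

def text_prepare(text):
    """text: a string. return: modified initial string."""
    text = text.lower()                           # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)     # replace listed symbols by space
    text = BAD_SYMBOLS_RE.sub('', text)           # delete bad symbols
    text = ' '.join(w for w in text.split()
                    if w not in STOPWORDS)        # delete stopwords
    return text

print(text_prepare("How to free C++ Memory?"))  # free c++ memory
```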

Final submission error

I am using Google Colab for the week 3 assignment, and at the final submission step the notebook does not recognize my e-mail ID. Please tell me the solution.

----> 1 STUDENT_EMAIL = [email protected] # EMAIL
      2 STUDENT_TOKEN = AT5ZyzLxuQfnEhNg # TOKEN
      3 grader.status()

NameError: name 'mandloi19faraday96' is not defined

This is the error I am getting. How should I submit my assignment? Please help.
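The NameError arises because the e-mail and token were pasted without quotes, so Python tried to evaluate them as variable names. A sketch with placeholder values (not real credentials):

```python
# Both values must be Python strings, i.e. wrapped in quotes
STUDENT_EMAIL = "your.name@example.com"  # placeholder e-mail
STUDENT_TOKEN = "YOUR_TOKEN_HERE"        # placeholder token
print(type(STUDENT_EMAIL).__name__)      # str
```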

Docker access to shared folder

Hi,

Having issues getting folder access to work with Docker and would really appreciate some straightforward advice, as I have never used Docker before. The Docker installation is the Toolbox version on Windows 10 (not Pro).

I have reached the point where I have the Docker Quickstart Terminal and a Jupyter Notebook session running. Note that the Docker tutorial on GitHub fails at this point:

David@DESKTOP-TLE6KHC MINGW64 /c/Program Files/Docker Toolbox
$ docker run -it -p 8080:8080 --name coursera-aml-nlp --user root -v /C:/Users/David/natural-language-processing-master/week3/data:/root/coursera
"docker run" requires at least 1 argument.
See 'docker run --help'.

Usage:  docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

Run a command in a new container

So, as I am using the Toolbox, I follow the instructions from Dr Shahin Rostami at:

https://shahinrostami.com/posts/tools/docker/docker-toolbox-windows-7-and-shared-volumes/

Which brings me to the point in the instructions:


Sharing Folders with a Docker Container

To create a Docker container from the jupyter/scipy-notebook image, type the following command and wait for it to complete execution: docker run --name="scipy" --user root -v /h/work:/home/jovyan -d -e GRANT_SUDO=yes -p 8888:8888 jupyter/scipy-notebook start-notebook.sh --NotebookApp.token=''

This may take some time, as it will need to download and extract the image. Once it's finished, you should be able to access the Jupyter notebook using 127.0.0.1:8888. I hope this helps you get up and running with Docker Toolbox and shared folders. Of course, the process is typically easier when using the non-legacy Docker solutions.


Which brings up a page prompting for a token or password.

I'm really not sure what the token or password represent here. Access to the folder is all I am after, but I do not know what to try next.

Thanks,

David
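For what it's worth, the failed docker run above appears to be missing the required image argument at the end, which is exactly what the "requires at least 1 argument" error indicates. A sketch of a complete invocation, using the path from the post and assuming the course image name mentioned in another issue:

```shell
# Toolbox-style invocation; the trailing image name is required
docker run -it -p 8080:8080 --name coursera-aml-nlp --user root \
  -v /C:/Users/David/natural-language-processing-master/week3/data:/root/coursera \
  akashin/coursera-aml-nlp
```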

Week 3: cannot import name 'logsumexp' from scipy.misc

Running the Week 3 notebook on Google Colab (after previously encountering #33), I see

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-7-e70e92d32c6e> in <module>()
----> 1 import gensim

3 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/ldamodel.py in <module>()
     49 
     50 # log(sum(exp(x))) that tries to avoid overflow
---> 51 from scipy.misc import logsumexp
     52 
     53 

ImportError: cannot import name 'logsumexp'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
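scipy.misc.logsumexp was removed in newer SciPy releases; the function now lives in scipy.special, and upgrading gensim (whose recent versions import from the new location) is the usual fix. A quick check of the relocated function:

```python
import math
from scipy.special import logsumexp  # new home after removal from scipy.misc

# logsumexp computes log(sum(exp(x))) stably; log(exp(0) + exp(0)) == log(2)
value = logsumexp([0.0, 0.0])
print(abs(value - math.log(2)) < 1e-12)  # True
```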

main_bot struggles if you have non-ascii characters in your name

If your name contains any funny characters in Telegram, the bot will crash:

Ready to talk!
An update received.
Traceback (most recent call last):
  File "main_bot.py", line 111, in <module>
    main()
  File "main_bot.py", line 103, in main
    print("Update content: {}".format(update))
UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 153: ordinal not in range(128)

Although it adds some computational complexity, adding the following function

def cast_to_utf_8(old_dict):
    """
    Encodes the string content of a dict to utf-8

    Parameters
    ----------
    old_dict : dict
        The dict to encode

    Returns
    -------
    new_dict : dict
        The encoded dict
    """

    def walk(node):
        """
        Recursively traverses a node and encodes all strings to utf-8

        Parameters
        ----------
        node : dict
            The node to traverse

        Returns
        -------
        node : dict
            The node where the strings are encoded to utf-8
        """
        for key, item in node.items():
            if type(item)==dict:
                walk(item)
            elif type(item)==list:
                for i, elem in enumerate(item):
                    if type(elem) == str:
                        node[key][i] = elem.encode('utf-8')
            elif type(item)==str:
                node[key] = item.encode('utf-8')
        return node

    new_dict = walk(old_dict)

    return new_dict

and calling it like this in main()

                    if is_unicode(text):
                        update = cast_to_utf_8(update)
                        print("Update content: {}".format(update))
                        bot.send_message(chat_id, bot.get_answer(update["message"]["text"]))
                    else:
                        bot.send_message(chat_id, "Hmm, you are sending some weird characters to me...")

was a remedy for me

Typo in lemmatization notebook example

https://github.com/hse-aml/natural-language-processing/blob/master/week1/lemmatization_demo.ipynb

If you look at cell 5, the string is text = "operates operative operating", but in cell 6, after you apply the Porter stemmer, the stemmed string is entirely different from the one in cell 5. In particular, there are no common characters in the words, most likely because a previous string was still cached before the cell was re-run: u'feet cat wolv talk'. The same goes for the lemmatized string in cell 7.

The intent of the code is clear, but the results should be corrected in a future revision.

Week 1: Traceback & Name Error

Hello, this piece of code returns the following errors:

print(test_text_prepare())

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 print(test_text_prepare())

<ipython-input> in test_text_prepare()
      5         "free c++ memory vectorint arr"]
      6     for ex, ans in zip(examples, answers):
----> 7         if text_prepare(ex) != ans:
      8             return "Wrong answer for the case: '%s'" % ex
      9     return 'Basic tests are passed.'

NameError: name 'text_prepare' is not defined
