
Natural Language Processing course resources

This repository contains practical assignments for the Natural Language Processing course by the Higher School of Economics: https://www.coursera.org/learn/language-processing. In this course you will learn how to solve common NLP problems using classical and deep learning approaches.

On the practical side, we expect familiarity with Python, since we will use it for all assignments in the course. Two of the assignments will also involve TensorFlow. You will work with many other libraries, including NLTK, Scikit-learn, and Gensim. You have several options for setting up your environment.

1. Running on Google Colab

Google has released its own flavour of Jupyter called Colab, which has free GPUs!

Here's how you can use it:

  1. Open https://colab.research.google.com, click Sign in in the upper right corner, use your Google credentials to sign in.
  2. Click GITHUB tab, paste https://github.com/hse-aml/natural-language-processing and press Enter
  3. Choose the notebook you want to open, e.g. week1/week1-MultilabelClassification.ipynb
  4. Click File -> Save a copy in Drive... to save your progress in Google Drive
  5. If you need a GPU, click Runtime -> Change runtime type and select GPU in Hardware accelerator box
  6. Execute the following code in the first cell to download dependencies (uncomment the line for your week number):
! wget https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/setup_google_colab.py -O setup_google_colab.py
import setup_google_colab
# please, uncomment the week you're working on
# setup_google_colab.setup_week1()  
# setup_google_colab.setup_week2()
# setup_google_colab.setup_week3()
# setup_google_colab.setup_week4()
# setup_google_colab.setup_project()
# setup_google_colab.setup_honor()
  7. If you run many notebooks on Colab, they can keep consuming memory. You can kill them with ! pkill -9 python3 and check with ! nvidia-smi that GPU memory is freed.

Known issues:

  • No support for ipywidgets, so we cannot use fancy tqdm progress bars. For now, we use a simplified version of a progress bar suitable for Colab.
  • Blinking animation with IPython.display.clear_output(). It's usable, but we are still looking for a workaround.
  • If you see the error "No module named 'common'", make sure you've uncommented the assignment-specific line in step 6, then restart your kernel and execute all cells again.
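The simplified progress bar mentioned above can be approximated with plain carriage-return printing. This is a hypothetical sketch, not the course's actual implementation:

```python
import sys

def progress_bar(current, total, width=30):
    """Render a simple text progress bar that works without ipywidgets."""
    filled = int(width * current / total)
    bar = "[" + "#" * filled + "-" * (width - filled) + "] %d/%d" % (current, total)
    # '\r' rewrites the same line in place, avoiding ipywidgets entirely
    sys.stdout.write("\r" + bar)
    sys.stdout.flush()
    return bar

for i in range(1, 6):
    progress_bar(i, 5)
```

Because the bar is drawn with a carriage return, it degrades gracefully to repeated lines in environments that do not support `\r`.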

2. Running locally

Two options here:

  1. Use the Docker container of our course. It already has all the libraries you will need. The setup is simple: install Docker for your OS, download our container image, and run everything within the container. Please see this detailed Docker tutorial.

  2. Manually install all the libraries for your OS (each task lists the needed libraries at the very beginning). If you use Windows/macOS, you might find the Anaconda distribution useful, as it makes it easy to install most of the needed libraries. However, some tools, like StarSpace for week 2, are not compatible with Windows, so you will likely have to use Docker anyway if you attempt those tasks.

It might take a significant amount of time and resources to run the assignment code, but an average laptop should be enough to accomplish the tasks. All assignments were tested in Docker on a Mac with 8GB RAM. If you encounter memory errors, they could be caused by untested configurations or inefficient code. Consider reporting these cases or double-checking your code.

If you want to run the course code on an AWS machine, we've prepared an AWS tutorial here.

Contributors

akashin, anyap, avbelyy, cgurkan, dependabot[bot], eyurtsev, lydia-gu, maribax, nunorc, rmdr, stallians, voron13e02, zemushka


Issues

Python version in docker image

I retrieved the docker image like so:
# docker pull akashin/coursera-aml-nlp

# python3 --version
shows that Python 3.5 is installed.

Unfortunately, Python 3.5 has this bug. Because of this, test_my_bag_of_words() fails, as the dict order is not maintained. This works correctly in Colab, as the Colab Python version is 3.6.

I tried upgrading Python 3.5 on the Docker image to Python 3.7 using this post. The upgrade seems to work, and I also upgraded the Jupyter notebook. But then the notebook doesn't work properly.

Is it possible to provide a docker image with an upgraded python version?
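As a hedged workaround (assuming the failing test only depends on vocabulary ordering), building the word index from a sorted vocabulary makes the result deterministic even on Python 3.5, where plain dicts do not preserve insertion order:

```python
# Hypothetical vocabulary; the real one is built inside the notebook.
vocab = {"memory", "free", "c++", "arr"}

# sorted() gives a deterministic order regardless of dict/set iteration order,
# so the resulting indices are identical on Python 3.5, 3.6, and later.
words_to_index = {word: i for i, word in enumerate(sorted(vocab))}
print(words_to_index)
```

This avoids relying on the CPython 3.6+ insertion-ordered dict behaviour altogether.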

Module common not found

Hi,

The module common is not found. I tried this on Colab and in the Docker container as well.

Here is the traceback:

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()
      1 import sys
      2 sys.path.append("..")
----> 3 from common.download_utils import download_week1_resources
      4
      5 download_week1_resources()

ImportError: No module named 'common'

Week 1 MultiLabelBinarizer

In the Week 1 module, during multiclass training, scikit-learn raises this kind of exception:
"Scikit Learn Multilabel Classification: ValueError: You appear to be using a legacy multi-label data representation..."

So I've found out that we should use MultiLabelBinarizer to preprocess the labels; done.

But when evaluating the "val" dataset on the trained classifiers, a variable "mlb" is referenced that was never instantiated. I assume it refers to a "MultiLabelBinarizer" instance. As you can see, there is an inconsistency here which currently has to be fixed manually.
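A minimal sketch of the presumed fix, instantiating the mlb variable before evaluation (the variable names and tag data below are guesses, not the notebook's actual code):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag data standing in for the notebook's y_train / y_val.
y_train = [["python", "pandas"], ["c++"], ["python"]]
y_val = [["c++", "python"]]

# Fit the binarizer once on the known tag set...
mlb = MultiLabelBinarizer(classes=sorted({tag for tags in y_train for tag in tags}))
y_train_bin = mlb.fit_transform(y_train)
# ...and reuse the SAME mlb instance for the validation set.
y_val_bin = mlb.transform(y_val)
print(mlb.classes_)
```

The key point is that one binarizer is fitted once and shared, so train and val label matrices use identical column ordering.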

Broken link in `week2-NER` notebook

The following link in the Week 2 assignment notebook appears to be broken:

"First, we need to create [placeholders](https://www.tensorflow.org/versions/master/api_docs/python/tf/placeholder) to specify what data we are going to feed into the network during the execution time. For this task we will need the following placeholders:\n",

I believe it should point instead to https://www.tensorflow.org/api_docs/python/tf/compat/v1/placeholder

ModuleNotFound running notebook in CoLab

When I open one of the notebooks in CoLab, specifically week1-MultilabelClassification.ipynb:
https://colab.research.google.com/github/hse-aml/natural-language-processing/blob/master/week1/week1-MultilabelClassification.ipynb

When I try to run the notebook I get ModuleNotFound error on this line:
from common.download_utils import download_week1_resources

I have never run code from GitHub in CoLab and am not sure if I need to do something so that it can find the common module.

Week1: How to open in Colab?

Sorry, stupid question, but how do I open week 1 in Colab? Usually, for my own files, there is always an "Open in Colab" button, but there is none for the week one task?

Running on Google Colab

You need to sign into your Google account first, or you won't see the GitHub tab from the README instructions.

Week 1 incorrect answers in test_my_bag_of_words

In the function test_my_bag_of_words, answers is defined as a list of lists, while it should be just a list.

Original:
answers = [[1, 1, 0, 1]]

Should be:
answers = [1, 1, 0, 1]
since it is compared against the return value of the my_bag_of_words function, which takes text as input and returns a NumPy array.
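For reference, here is a sketch of a my_bag_of_words implementation consistent with the corrected answers = [1, 1, 0, 1]. The vocabulary below is hypothetical, chosen only to reproduce that vector:

```python
import numpy as np

def my_bag_of_words(text, words_to_index, dict_size):
    """Return a binary bag-of-words vector for a single text."""
    result_vector = np.zeros(dict_size)
    for word in text.split():
        if word in words_to_index:
            result_vector[words_to_index[word]] = 1
    return result_vector

# Hypothetical vocabulary mirroring the shape of the course's test
words_to_index = {"hi": 0, "you": 1, "animals": 2, "are": 3}
print(my_bag_of_words("hi how are you", words_to_index, 4))  # [1. 1. 0. 1.]
```

Note the output is a flat array, which is why comparing it against a list of lists fails.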

Problem opening 'data/text_prepare_tests.tsv' file

Using the Docker container environment, I am getting a UnicodeDecodeError. More specifically:

prepared_questions = []
for line in open('data/text_prepare_tests.tsv'):
     line = text_prepare(line.strip())
     prepared_questions.append(line)
text_prepare_results = '\n'.join(prepared_questions)
grader.submit_tag('TextPrepare', text_prepare_results)

This gives the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)

In order to run it I had to change it to:

prepared_questions = []
for line in open('data/text_prepare_tests.tsv', encoding='utf-8'):
     line = text_prepare(line.strip())
     prepared_questions.append(line)
text_prepare_results = '\n'.join(prepared_questions)
grader.submit_tag('TextPrepare', text_prepare_results)

It can also be solved by using pd.read_csv.

Can anyone else reproduce this error?

Week 1: Invalid Syntax

Hello,
This code returns an "Invalid Syntax" error.

REPLACE_BY_SPACE_RE = re.compile('[/(){}[]|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
    text: a string

    return: modified initial string
    """
    text = # lowercase text
    text = # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = # delete symbols which are in BAD_SYMBOLS_RE from text
    text = # delete stopwords from text
    return text
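The "Invalid Syntax" comes from the unfinished template lines (text = # ...), which leave each assignment without a right-hand side. Below is a hedged completion following the template's apparent intent; the stopword set is a small local stand-in so the sketch does not depend on NLTK data being downloaded, and the bracket characters in the first regex are escaped (the unescaped version closes the character class early):

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # note the escaped brackets
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = {"how", "to", "is", "the", "a", "an"}      # stand-in for nltk stopwords

def text_prepare(text):
    """text: a string. return: modified initial string."""
    text = text.lower()                           # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)     # replace listed symbols by space
    text = BAD_SYMBOLS_RE.sub('', text)           # delete bad symbols
    text = ' '.join(w for w in text.split()
                    if w not in STOPWORDS)        # delete stopwords
    return text

print(text_prepare("How to free C++ Memory?"))  # free c++ memory
```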

Final submission error

I am using Google Colab for the week 3 assignment, and at the final submission step the notebook does not recognize my e-mail ID. Please tell me the solution.

----> 1 STUDENT_EMAIL = [email protected] # EMAIL
      2 STUDENT_TOKEN = AT5ZyzLxuQfnEhNg # TOKEN
      3 grader.status()

NameError: name 'mandloi19faraday96' is not defined

This is the error I am getting. How should I submit my assignment? Please help.
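The NameError arises because the e-mail and token were pasted without quotes, so Python tried to evaluate them as variable names. A sketch with placeholder values (not real credentials):

```python
# Both values must be Python strings, i.e. wrapped in quotes
STUDENT_EMAIL = "your.name@example.com"  # placeholder e-mail
STUDENT_TOKEN = "YOUR_TOKEN_HERE"        # placeholder token
print(type(STUDENT_EMAIL).__name__)      # str
```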

Docker access to shared folder

Hi,

Having issues getting folder access to work with Docker and would really appreciate some straightforward advice, as I have never used Docker before. The Docker installation is the Toolbox version on Windows 10 (not Pro).

I have reached the point where I have the Docker Quickstart Terminal and a Jupyter Notebook session running. Note that the Docker tutorial on GitHub fails at this point:

David@DESKTOP-TLE6KHC MINGW64 /c/Program Files/Docker Toolbox
$ docker run -it -p 8080:8080 --name coursera-aml-nlp --user root -v /C:/Users/David/natural-language-processing-master/week3/data:/root/coursera
"docker run" requires at least 1 argument.
See 'docker run --help'.

Usage:  docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

Run a command in a new container

So, as I am using the Toolbox, I follow the instructions from Dr Shahin Rostami at:

https://shahinrostami.com/posts/tools/docker/docker-toolbox-windows-7-and-shared-volumes/

Which brings me to the point in the instructions:


Sharing Folders with a Docker Container

To create a Docker container from the jupyter/scipy-notebook image, type the following command and wait for it to complete execution: docker run --name="scipy" --user root -v /h/work:/home/jovyan -d -e GRANT_SUDO=yes -p 8888:8888 jupyter/scipy-notebook start-notebook.sh --NotebookApp.token=''

This may take some time, as it will need to download and extract the image. Once it's finished, you should be able to access the Jupyter notebook using 127.0.0.1:8888. I hope this helps you get up and running with Docker Toolbox and shared folders. Of course, the process is typically easier when using the non-legacy Docker solutions.


Which brings up a page prompting for a token or password.

I'm really not sure what the token or password represent here. Access to the folder is all I am after, but I do not know what to try next.

Thanks,

David
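For what it's worth, the failed docker run above appears to be missing the required image argument at the end, which is exactly what the "requires at least 1 argument" error indicates. A sketch of a complete invocation, using the path from the post and assuming the course image name mentioned in another issue:

```shell
# Toolbox-style invocation; the trailing image name is required
docker run -it -p 8080:8080 --name coursera-aml-nlp --user root \
  -v /C:/Users/David/natural-language-processing-master/week3/data:/root/coursera \
  akashin/coursera-aml-nlp
```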

Week 3: cannot import name 'logsumexp' from scipy.misc

Running the Week 3 notebook on Google Colab (after previously encountering #33), I see

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-7-e70e92d32c6e> in <module>()
----> 1 import gensim

3 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/ldamodel.py in <module>()
     49 
     50 # log(sum(exp(x))) that tries to avoid overflow
---> 51 from scipy.misc import logsumexp
     52 
     53 

ImportError: cannot import name 'logsumexp'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
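scipy.misc.logsumexp was removed in newer SciPy releases; the function now lives in scipy.special, and upgrading gensim (whose recent versions import from the new location) is the usual fix. A quick check of the relocated function:

```python
import math
from scipy.special import logsumexp  # new home after removal from scipy.misc

# logsumexp computes log(sum(exp(x))) stably; log(exp(0) + exp(0)) == log(2)
value = logsumexp([0.0, 0.0])
print(abs(value - math.log(2)) < 1e-12)  # True
```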

main_bot struggles if you have non-ascii characters in your name

If your name contains any funny characters in Telegram, the bot will crash:

Ready to talk!
An update received.
Traceback (most recent call last):
  File "main_bot.py", line 111, in <module>
    main()
  File "main_bot.py", line 103, in main
    print("Update content: {}".format(update))
UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 153: ordinal not in range(128)

Although it adds some computational complexity, adding the following function

def cast_to_utf_8(old_dict):
    """
    Encodes the string content of a dict to utf-8

    Parameters
    ----------
    old_dict : dict
        The dict to encode

    Returns
    -------
    new_dict : dict
        The encoded dict
    """

    def walk(node):
        """
        Recursively traverses a node and encodes all strings to utf-8

        Parameters
        ----------
        node : dict
            The node to traverse

        Returns
        -------
        node : dict
            The node where the strings are encoded to utf-8
        """
        for key, item in node.items():
            if type(item)==dict:
                walk(item)
            elif type(item)==list:
                for i, elem in enumerate(item):
                    if type(elem) == str:
                        node[key][i] = elem.encode('utf-8')
            elif type(item)==str:
                node[key] = item.encode('utf-8')
        return node

    new_dict = walk(old_dict)

    return new_dict

and calling it like this in main()

                    if is_unicode(text):
                        update = cast_to_utf_8(update)
                        print("Update content: {}".format(update))
                        bot.send_message(chat_id, bot.get_answer(update["message"]["text"]))
                    else:
                        bot.send_message(chat_id, "Hmm, you are sending some weird characters to me...")

was a remedy for me

Typo in lemmatization notebook example

https://github.com/hse-aml/natural-language-processing/blob/master/week1/lemmatization_demo.ipynb

If you look at cell 5, the string is text = "operates operative operating", but in cell 6, after you apply the Porter stemmer, the stemmed string is entirely different from the one in cell 5. In particular, there are no common characters in the words, most likely because a previous string was still cached before the cell was re-run: u'feet cat wolv talk'. The same goes for the lemmatized string in cell 7.

The intent of the code is clear, but the results should be corrected in a future revision.

Week 1: Traceback & Name Error

Hello, this piece of code returns the following errors:

print(test_text_prepare())

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 print(test_text_prepare())

<ipython-input> in test_text_prepare()
      5         "free c++ memory vectorint arr"]
      6     for ex, ans in zip(examples, answers):
----> 7         if text_prepare(ex) != ans:
      8             return "Wrong answer for the case: '%s'" % ex
      9     return 'Basic tests are passed.'

NameError: name 'text_prepare' is not defined
