spandan-madan / deeplearningproject


An in-depth machine learning tutorial introducing readers to a whole machine learning pipeline from scratch.

Home Page: https://spandan-madan.github.io/DeepLearningProject/

License: MIT License

Jupyter Notebook 43.69% HTML 56.28% Dockerfile 0.03%

deep-learning machine-learning neural-networks tutorial

deeplearningproject's Issues

Got IOError half way through learning

When I got to the last section, on the textual model, training ran for 5 epochs and then threw this error:

IOError: [Errno 24] Too many open files

I first tried raising the limit with ulimit -n, but realized the easiest fix is to set num_workers to 4. The value is somewhat arbitrary, but someone suggested that 4 times the number of GPUs is a good approximation for num_workers.

This is not a bug in the tutorial per se, but I think it is useful for people to know, since it is one of the nuances of building an ML pipeline that is not necessarily apparent.
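For readers who would rather raise the limit than lower the worker count, here is a minimal sketch using Python's standard resource module (Unix only, not available on Windows). The function name raise_open_file_limit and the target of 4096 are assumptions made for illustration and do not come from the tutorial.

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft limit on open file descriptors toward `target`.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may not raise the soft limit above the hard limit.
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)
    if soft == resource.RLIM_INFINITY or new_soft <= soft:
        return soft, hard  # the current limit is already at least as high
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some platforms refuse the change; keep the current limits.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_open_file_limit()
print("open-file limits (soft, hard):", soft, hard)
```

This only affects the current process and its children, which is roughly what running ulimit -n in the launching shell would do. Lowering num_workers, as described above, avoids the problem from the other side by opening fewer files at once.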

Example binarized vector representation is syntactically incorrect

Now let's store the genre's for these movies in a list that we will later transform into a binarized vector.

Binarized vector representation is a very common and important way data is stored/represented in ML. Essentially, it's a way to reduce a categorical variable with n possible values to n binary indicator variables. What does that mean? For example, let [(1,3),(4)] be the list saying that sample A has two labels 1 and 3, and sample B has one label 4. For every sample, for every possible label, the representation is simply 1 if it has that label, and 0 if it doesn't have that label. So the binarized version of the above list will be -

> [(1,0,1,0]),
> (0,0,0,1])]

This section has output that contains a brackets mismatch syntax error. Not a huge problem, but probably a bit confusing for a beginner. Otherwise, great tutorial!
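To make the intended output concrete, here is a small pure-Python sketch of the binarization described in the quoted passage. The function name binarize and the explicit labels argument are assumptions for illustration, not the tutorial's code. Note that a one-element tuple is written (4,) in Python, since (4) is just the integer 4.

```python
def binarize(samples, labels):
    """Turn per-sample label collections into 0/1 indicator rows.

    `labels` fixes the column order: one column per possible label.
    """
    column = {label: i for i, label in enumerate(labels)}
    rows = []
    for sample_labels in samples:
        row = [0] * len(labels)
        for label in sample_labels:
            row[column[label]] = 1
        rows.append(row)
    return rows


# Sample A has labels 1 and 3; sample B has label 4.
print(binarize([(1, 3), (4,)], labels=[1, 2, 3, 4]))
# [[1, 0, 1, 0], [0, 0, 0, 1]]
```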

Would be nice to support Python 3.

Just to be clear - consider this as just a minor suggestion rather than a complaint.

Thank you for this tutorial! It's rare that people actually spend a lot of time to make a great free learning resource.

Broken images

Check the images, half of them are not loading

the list() method of the Genres() class returns a listing of all genres in the form of a dictionary.

list_of_genres=genres.movie_list()['genres']

Thanks for the amazing tutorial.

I think TensorFlow now supports Python 3.6.

If so, will anything be different now that TF runs on Python 3.5+?

Site not responsive on mobile.

A few things to look at in "Deep Learning to extract visual features from posters" (Section 7)

You instantiated the VGG model and stored it in the variable 'model', but used the variable 'model_viz' for training, which means VGG was not used at all. You can check your model by running 'print(model_viz.layers)'. If you have trouble fixing this issue, I can help you with this section if you add me as an author.
It is also important to show how well your model is trained. I would recommend plotting loss and accuracy curves from the history object returned by 'model.fit()', or a confusion matrix computed from the predictions, to show false positives and false negatives.
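As a rough illustration of the second suggestion, the sketch below counts the four confusion-matrix cells for a single genre and derives precision and recall from them. It uses plain Python rather than the tutorial's Keras code. The function names and the toy labels are made up for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for one binary label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1  # predicted the genre, but it is not a true label
        elif truth:
            fn += 1  # missed a genre that is a true label
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def precision_recall(counts):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    predicted = counts["tp"] + counts["fp"]
    actual = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted if predicted else 0.0
    recall = counts["tp"] / actual if actual else 0.0
    return precision, recall


# Toy data: 1 means the movie has the genre, 0 means it does not.
counts = confusion_counts([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
print(counts)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 1}
print(precision_recall(counts))
```

For a multi-label problem such as genre prediction, the same counts can be computed once per genre column.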

Possible wrong syntax

In [26]: # Create a tmdb genre object!
genres=tmdb.Genres()
the list() method of the Genres() class returns a listing of all genres in the form of a dictionary.
list_of_genres=genres.list()['genres']

The above segment throws this error:
Create a tmdb genre object!
genres=tmdb.Genres()
the list() method of the Genres() class returns a listing of all genres in the form of a dictionary.
list_of_genres=genres.list()['genres']

I apologize if this is a trivial issue. I'm new to Python. It'll be great if someone can help me resolve this. T

Dependency Issue on Windows

I tried to set up the environment with the .yml file, but the setup aborted.
Manually installing the required packages led to an error: TensorFlow on Windows is only supported on 64-bit Python 3.5. Updating Python raises dependency errors for functools32 and subprocess32, which only run on Python 2.7.
So, based on my limited knowledge, there is no way to set up the environment on Windows. Or am I missing something?

appnope 0.1.0 not available on win-64 and linux channels.

The appnope package is made to disable App Nap on OS X. If you are on a different platform you must remove or comment out the line.
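If editing the environment file by hand is tedious, the step above could be scripted along these lines. This is only a sketch: the function name comment_out_packages and the sample file contents (including version strings) are hypothetical, and the real .yml file in the repository may be laid out differently.

```python
import re


def comment_out_packages(yml_text, packages):
    """Comment out dependency lines for the named packages in a conda env file."""
    names = "|".join(re.escape(p) for p in packages)
    # Match list items such as "- appnope=0.1.0=py27_0" or "  - appnope==0.1.0".
    pattern = re.compile(r"^(\s*)(-\s*(?:%s)(?:[=<>!~\s].*)?)$" % names)
    out = []
    for line in yml_text.splitlines():
        match = pattern.match(line)
        # Keep the indentation and prefix the rest of the line with "# ".
        out.append(match.group(1) + "# " + match.group(2) if match else line)
    return "\n".join(out)


# Hypothetical environment file contents, for illustration only.
sample = """dependencies:
- appnope=0.1.0=py27_0
- numpy=1.12.1
- pip:
  - functools32==3.2.3.post2"""

print(comment_out_packages(sample, ["appnope", "functools32"]))
```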

Anyone interested in making a PyTorch version of this on Python 3?

Between research and another tutorial I'm working on (NLP), I have little time left. If someone would like to build a PyTorch version of this, we can add it to this repo.

Would be very helpful. If you're interested, mail me at [email protected].

Best,
Spandan

Can you please mention the prerequisites for the tutorial in the README?

'Genres' object has no attribute 'list'

I get this error when running code cell 15 of the notebook.

Help!

Hi, I am new to ML. Can I start with this tutorial?
Or where should I start, and how?
Thanks in advance.

Possible points of confusion and typos

Points of confusion

This section uses f as the generalized function and g as the exact function, whereas before f was exact and g was generalized. This has the potential to confuse readers.
On In [51] and In [52], id is assigned a value but does not seem to be used
On the section after Out [62] it says that the shape of Y is 1666,20 but the output of print Y.shape is (1595, 20). Where does the 1666 come from?

Typos

In the last sentence of the first paragraph of the same section, "listen to" should be changed to "watch"
In the last paragraph before In [68] (this section), "vocabular" should be "vocabulary"
In the first paragraph of this section, "can only integer values" should be "can only be integer values"
In the second item of the first list in this section, "difference models" should be "different models"

"That" vs "Which" grammatical error.

Kudos on a very well done writeup. I have a simple grammatical correction ... in many cases, you have used 'which' in place of 'that'.

See http://www.writersdigest.com/online-editor/which-vs-that

If 'which' is used to describe something, and is not preceded by a comma, it is a likely candidate for the confusion.

For example,
'use the available data to learn a function which can' ==> 'use the available data to learn a function that can'

TensorFlow version of this tutorial on Python 3.

Feel free to reach out if you are interested in implementing a TensorFlow version of this tutorial.

Variables undefined when running the scripts

In Section 7, when extracting VGG features for the scraped images:
In the for loop that contains the try/except block, the variable 'imname' is never defined. Maybe it should be changed as follows:

for mov in poster_movies:
    i+=1
    mov_name=mov['original_title']
    mov_name1=mov_name.replace(':','/')
    poster_name=mov_name.replace(' ','_')+'.jpg'
    if poster_name in imnames:
        img_path=poster_folder+poster_name
        try:
            img = image.load_img(img_path, target_size=(224, 224))
            succesful_files.append(imname) # **imname undefined , change to poster_name ?**
            x = image.img_to_array(img)
            x = np.expand_dims(x, axis=0)
            x = preprocess_input(x)
            features = model.predict(x)
            file_order.append(img_path)
            feature_list.append(features)
            genre_list.append(mov['genre_ids'])
            if np.max(np.asarray(feature_list))==0.0:
                print('problematic',i)
            if i%250==0 or i==1:
               print "Working on Image : ",i
        except Exception,e:
            print Exception,":",e   # **for debuging**
            failed_files.append(imname) # **imname undefined , change to poster_name ?**
            continue
    else:
        continue

help

When I execute cell In [41], I get the following error:

HTTPError Traceback (most recent call last)
in ()
17 url += '&with_genres=' + str(g_id) + '&page=' + str(page)
18
---> 19 data = urllib2.urlopen(url).read()
20
21 dataDict = json.loads(data)

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
433 for processor in self.process_response.get(protocol, []):
434 meth = getattr(processor, meth_name)
--> 435 response = meth(req, response)
436
437 return response

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in http_response(self, request, response)
546 if not (200 <= code < 300):
547 response = self.parent.error(
--> 548 'http', request, response, code, msg, hdrs)
549
550 return response

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in error(self, proto, *args)
471 if http_err:
472 args = (dict, 'default', 'http_error_default') + orig_args
--> 473 return self._call_chain(*args)
474
475 # XXX probably also want an abstract factory that knows when it makes

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
405 func = getattr(handler, meth_name)
406
--> 407 result = func(*args)
408 if result is not None:
409 return result

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
554 class HTTPDefaultErrorHandler(BaseHandler):
555 def http_error_default(self, req, fp, code, msg, hdrs):
--> 556 raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
557
558 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 401: Unauthorized

help appreciated.
Thanks
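With TMDB, a 401 response most likely means the API key in the request URL is missing or invalid. The sketch below shows one way to surface that more clearly. It is written for Python 3 (urllib.request) rather than the Python 2 urllib2 used in the notebook, and the function name fetch_json and its opener parameter are assumptions introduced for illustration.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen):
    """Fetch a URL and decode its JSON body, explaining 401 errors.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url) as response:
            return json.loads(response.read())
    except urllib.error.HTTPError as err:
        if err.code == 401:
            raise RuntimeError(
                "HTTP 401 Unauthorized: the API key in the request URL "
                "is probably missing or invalid"
            ) from err
        raise  # other HTTP errors are passed through unchanged
```

Other status codes are re-raised untouched, so rate-limit or not-found errors still show their original traceback.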

404 for figure 1

The link to Figure 1, https://spandan-madan.github.io/DeepLearningProject/docs/contour.png, seems to be incorrect.

TMDB API key

Great walkthrough!

One recommendation is to replace your actual TMDB API key with a placeholder. That way no one can abuse your account via your API key.

P.S. Super nitpicky, but in that same block, I think the Jupyter step should read In [5]:
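One common way to keep a key out of a published notebook is to read it from an environment variable. The sketch below assumes a variable named TMDB_API_KEY and a helper named load_tmdb_api_key. Both names are made up for this example and are not part of the tutorial.

```python
import os


def load_tmdb_api_key(env=os.environ, name="TMDB_API_KEY"):
    """Return the API key stored in the environment variable `name`.

    Raises RuntimeError with a readable message when the key is not set.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            "Set the %s environment variable to your own TMDB API key" % name
        )
    return key
```

In the notebook, the hard-coded key could then be replaced by a call such as api_key = load_tmdb_api_key(), with each reader exporting their own key before starting Jupyter.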

Cut out warnings from imports due to numpy ufunc and dtype sizes

Nice job with the notebooks. In block 2, if you'd like to get rid of the "RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88" warnings, you can just add the following at the bottom of the block:

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

Tutorial is broken? Low recall/precision in TF results

Hey Spandan,

It looks like something has changed in the data or the model: the TF precision and recall in the final runs are very low (0.2 or so). The tutorial might need to be future-proofed a bit more against changes to the TMDB or IMDB APIs.

TMDB Genre list() changed to movie_list()

In Section 3 when looking at returning Genres from TMDB the instructions state to use the .list() method of the Genre object returned by tmdb.Genres().

There has been an update to the API and list() no longer exists. There are separate lists for movies, tv, etc. Currently the function we're looking for is movie_list(), which returns the list of movie genres.
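A notebook that has to run against both older and newer versions of the wrapper could pick whichever method exists. The helper below is a sketch under that assumption. The name list_movie_genres is hypothetical, and only the two method names mentioned in this issue, list() and movie_list(), are relied on.

```python
def list_movie_genres(genres):
    """Return the movie genre list from a tmdb.Genres()-like object.

    Prefers the newer movie_list() and falls back to the older list().
    """
    method = getattr(genres, "movie_list", None) or getattr(genres, "list", None)
    if method is None:
        raise AttributeError("Genres object has neither movie_list() nor list()")
    return method()["genres"]


# Intended use in the notebook (assumes tmdb is the imported TMDB wrapper):
# list_of_genres = list_movie_genres(tmdb.Genres())
```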

I want to translate your README.md !

Hello, I'm a university student from Korea. I'm interested in your project!

So I'd like to translate your README.md into Korean.

Please contact me by email or comment below :)

My email address is [email protected]

Thank you

Finding words that are most predictive of a genre

Hi, This was an extremely useful document, and I learnt a lot from the tutorial. An interesting extension to the problem would be to identify the words in the synopses that most distinguish a genre from other genres in the model - I have an analogous task in my project.

Is there a way to find the words that are most predictive of a genre? For example, is there a way to identify that the words ‘battle’, ‘challenge’ and ‘fight’ (for example) are the most predictive of a movie falling into the ‘Action’ category, based on the model we trained? i.e. which are the words (in the synopsis) that most prominently indicate that the synopsis would fall under a particular genre? (Using the model we have fit).
This basically translates to decoding the algorithm to find out how it works “under the hood.” - what features (words?) it uses "under the hood" to classify a synopsis into a genre.

A solution I found online is in the code snippet below - Using the classifier coefficients from clf.coef_ (clf is the name of the model I fit) and picking the top 10 words (which the model uses to distinguish/identify a genre based on a given text).

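import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    # On scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead.
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        # argsort is ascending, so the last 10 indices have the largest coefficients
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))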

Please let me know if this is appropriate and if there is a better way of doing this.