
nlp's Introduction

NLP - Tutorial

Repository showing how NLP can tackle real problems. It includes source code, datasets, and state-of-the-art NLP techniques.

Data Augmentation

General

Text Preprocessing

Section Sub-Section Description Story
Tokenization Subword Tokenization Medium
Tokenization Word Tokenization Medium Github
Tokenization Sentence Tokenization Medium Github
Part of Speech Medium Github
Lemmatization Medium Github
Stemming Medium Github
Stop Words Medium Github
Phrase Word Recognition
Spell Checking Lexicon-based Peter Norvig algorithm Medium Github
Lexicon-based Symspell Medium Github
Machine Translation Statistical Machine Translation Medium
Machine Translation Attention Medium
String Matching Fuzzywuzzy Medium Github

Text Representation

Section Sub-Section Research Lab Story Source
Traditional Method Bag-of-words (BoW) Medium Github
Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) Medium Github
Character Level Character Embedding NYU Medium Github Paper
Word Level Negative Sampling and Hierarchical Softmax Medium
Word2Vec, GloVe, fastText Medium Github
Contextualized Word Vectors (CoVe) Salesforce Medium Github Paper Code
Misspelling Oblivious (word) Embeddings Facebook Medium Paper
Embeddings from Language Models (ELMo) AI2 Medium Github Paper Code
Contextual String Embeddings Zalando Research Medium Paper Code
Sentence Level Skip-thoughts Medium Github Paper Code
InferSent Medium Github Paper Code
Quick-Thoughts Google Medium Paper Code
General Purpose Sentence (GenSen) Medium Paper Code
Bidirectional Encoder Representations from Transformers (BERT) Google Medium Paper(2019) Code
Generative Pre-Training (GPT) OpenAI Medium Paper(2019) Code
Self-Governing Neural Networks (SGNN) Google Medium Paper
Multi-Task Deep Neural Networks (MT-DNN) Microsoft Medium Paper(2019)
Generative Pre-Training-2 (GPT-2) OpenAI Medium Paper(2019) Code
Universal Language Model Fine-tuning (ULMFiT) OpenAI Medium Paper Code
BERT in Science Domain Medium Paper(2019) Paper(2019)
BERT in Clinical Domain NYU/PU Medium Paper(2019) Paper(2019)
RoBERTa UW/Facebook Medium Paper(2019) Paper
Unified Language Model for NLP and NLU (UNILM) Microsoft Medium Paper(2019)
Cross-lingual Language Model (XLMs) Facebook Medium Paper(2019)
Transformer-XL CMU/Google Medium Paper(2019)
XLNet CMU/Google Medium Paper(2019)
CTRL Salesforce Medium Paper(2019)
ALBERT Google/Toyota Medium Paper(2019)
T5 Google Medium Paper(2019)
MultiFiT Medium Paper(2019)
XTREME Medium Paper(2020)
REALM Medium Paper(2020)

Document Level lda2vec Medium Paper
doc2vec Google Medium Github Paper

NLP Problem

Section Sub-Section Description Research Lab Story Paper & Code
Named Entity Recognition (NER) Pattern-based Recognition Medium
Lexicon-based Recognition Medium
spaCy Pre-trained NER Medium Github
Optical Character Recognition (OCR) Printed Text Google Cloud Vision API Google Medium Paper
Handwriting LSTM Google Medium Paper
Text Summarization Extractive Approach Medium Github
Abstractive Approach Medium
Emotion Recognition Audio, Text, Visual 3 Multimodals for Emotion Recognition Medium

Acoustic Problem

Section Sub-Section Description Research Lab Story Paper & Code
Feature Representation Unsupervised Learning Introduction to Audio Feature Learning Medium Paper 1 Paper 2 Paper 3
Feature Representation Unsupervised Learning Speech2Vec and Sentence Level Embeddings Medium Paper 1 Paper 2
Feature Representation Unsupervised Learning Wav2vec Medium Paper
Speech-to-text Introduction to Speech-to-text Medium

Text Distance Measurement

Section Sub-Section Description Research Lab Story Paper & Code
Euclidean Distance, Cosine Similarity and Jaccard Similarity Medium Github
Edit Distance Levenshtein Distance Medium Github
Word Moving Distance (WMD) Medium Github
Supervised Word Moving Distance (S-WMD) Medium
Manhattan LSTM Medium Paper

Model Interpretation

Section Sub-Section Description Research Lab Story Paper & Code
ELI5, LIME and Skater Medium Github
SHapley Additive exPlanations (SHAP) Medium Github
Anchors Medium Github

Graph

Section Sub-Section Description Research Lab Story Paper & Code
Embeddings TransE, RESCAL, DistMult, ComplEx, PyTorch BigGraph Medium RESCAL(2011) TransE(2013) DistMult(2015) ComplEx(2016) PyTorch BigGraph(2019)
Embeddings DeepWalk, node2vec, LINE, GraphSAGE Medium DeepWalk(2014) node2vec(2015) LINE(2015) GraphSAGE(2018)
Embeddings WLG, GCN, GAT, GIN Medium WLG(2011) GCN(2017) GAT(2017) GraphSAGE(2018)
Embeddings PinSAGE(2018) Pinterest Medium
Embeddings HolE(2015), SimplE(2018) Medium
Embeddings ContE(2017), ETE(2017) Medium

Meta-Learning

Section Sub-Section Description Story
Introduction Matching Nets(2016) MANN(2016) LSTM-based meta-learner(2017) Prototypical Networks(2017) ARC(2017) MAML(2017) MetaNet(2017) Medium
NLP Dialog Generation DAML(2019), PAML(2019), NTMS(2019) Medium
Classification Intent Embeddings(2016) LEOPARD(2019) Medium
CV Unsupervised Learning CACTUs(2018) Medium
General Siamese Network(1994), Triplet Network(2015) Medium
MAML+(2018) Medium

Image

Section Sub-Section Description Research Lab Story Paper & Code
Object Detection R-CNN Medium Paper(2013)
Object Detection Fast R-CNN Medium Paper(2015)
Object Detection Faster R-CNN Medium Paper(2015)
Object Detection VGGNet Medium Paper(2014)
Instance Segmentation Mask R-CNN FAIR Medium Paper(2017)
Image Classification ResNet(2015) Microsoft Medium
Image Classification ResNeXt(2016) Medium

Evaluation

Section Sub-Section Description Story
Introduction Medium
Classification Confusion Matrix, ROC, AUC Medium
Regression MAE, MSE, RMSE, MAPE, WMAPE Medium
Textual Perplexity, BLEU, GER, WER, GLUE Medium

Source Code

Section Sub-Section Description Link
Spellcheck Github
InferSent Github

nlp's People

Contributors

makcedward

nlp's Issues

ULMFiT from fastai?

Hi,

I just found your blog, and there is a lot of useful information in it.
I noticed that you put ULMFiT under OpenAI in your README table, but as far as I know it is from fastai.
Thanks for compiling and sharing your knowledge.

No module named aion. Issue while importing aion

Hello Edward,

Thank you for the great article and detailed blog post about ELMo. I previously used one of the resources you posted for ELMo in Keras, but I thought that implementation lacked some things, such as the ability to normalize vectors. Your implementation is certainly better than the one in the resources.

When I try to import aion, I get an error that says "ModuleNotFoundError: No module named 'aion'". I know this is a fairly common packaging issue, but I wasn't able to resolve it, and I didn't find any reliable online source that helped me fix it. Please let me know if you have faced this issue; any pointers will be appreciated.

I was able to install aion successfully with "pip install aion".

I'm using:

Python 3.6.7

Windows 10

Thanks

About DataSet

How can I get glove.6B.50d.vec, which is imported in sample/nlp-word_mover_distance.ipynb of this repository?
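One possible route, offered as a sketch under an assumption: the notebook's glove.6B.50d.vec may simply be the Stanford glove.6B.50d.txt file (shipped inside glove.6B.zip) rewritten in word2vec text format, which differs only by a header line holding the vocabulary size and vector dimension. The source does not confirm this, and the function name `glove_to_word2vec` below is hypothetical.

```python
def glove_to_word2vec(glove_path, output_path):
    """Copy a GloVe text file to output_path with a word2vec-style header.

    Assumption: each input line is "<token> <v1> <v2> ... <vN>" separated
    by single spaces, as in the glove.6B files.
    """
    count, dim = 0, 0
    # First pass: count the vectors and read the dimension from the first line.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue
            if count == 0:
                dim = len(line.rstrip("\n").split(" ")) - 1
            count += 1
    # Second pass: write the "<count> <dim>" header, then the vectors unchanged.
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        with open(glove_path, encoding="utf-8") as src:
            for line in src:
                if line.strip():
                    dst.write(line.rstrip("\n") + "\n")
    return count, dim
```

Calling `glove_to_word2vec("glove.6B.50d.txt", "glove.6B.50d.vec")` would then produce a file that word2vec-format loaders can read, provided the assumption about the notebook holds.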

Cell #31 ValueError master/sample/nlp-model_interpretation_shap.ipynb

explainer = shap.DeepExplainer(pipeline.model, encoded_x_train[:10])
shap_values = explainer.shap_values(encoded_x_test[:1])

x_test_words = prepare_explanation_words(pipeline, encoded_x_test)
y_pred = pipeline.predict(x_test[:1])
print('Actual Category: %s, Predict Category: %s' % (y_test[0], y_pred[0]))

shap.force_plot(explainer.expected_value[0], shap_values[0][0], x_test_words[0])

RETURNS:

ValueError: Dimensions must be equal, but are 10 and 100 for '{{node gradient_tape/functional_1/global_max_pooling1d/truediv_1}} = RealDiv[T=DT_FLOAT](gradient_tape/functional_1/global_max_pooling1d/sub_1, gradient_tape/functional_1/global_max_pooling1d/sub)' with input shapes: [10,512], [10,100,512].

InferSent error (help needed)

Hi, I am getting an error while generating InferSent embeddings. The error is as follows, with details at the end of this email

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 11: invalid start byte

The error occurs after I run infer_sent_embs.build_vocab(x_train, tokenize=True) .

Note that I ran your code in Google Colab. Also note that the links to InferSent in the python file infersent.py also need to be updated (expired links).

The new links are

INFERSENT_GLOVE_MODEL_URL = 'https://dl.fbaipublicfiles.com/infersent/infersent1.pkl'
INFERSENT_FASTTEXT_MODEL_URL = 'https://dl.fbaipublicfiles.com/infersent/infersent2.pkl'

UnicodeDecodeError Traceback (most recent call last)
in ()
----> 1 infer_sent_embs.build_vocab(x_train, tokenize=True)
2 x_train_t = infer_sent_embs.encode(x_train, tokenize=True)
3 x_test_t = infer_sent_embs.encode(x_test, tokenize=True)

3 frames
/usr/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 11: invalid start byte
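The traceback ends in Python's incremental text decoder, which suggests that some file is being read as UTF-8 text. Byte 0x86 can never start a UTF-8 sequence, so that file is probably either in another encoding or not a text file at all (for example a binary or incomplete download). The sketch below shows a generic way to check this; it is not part of the InferSent wrapper, and the helper names `sniff_bytes` and `read_text_lenient` are hypothetical.

```python
def sniff_bytes(path, n=16):
    """Return the first n raw bytes so the file type can be inspected."""
    with open(path, "rb") as fh:
        return fh.read(n)


def read_text_lenient(path, encoding="utf-8"):
    """Read text, replacing undecodable bytes with U+FFFD instead of raising."""
    with open(path, encoding=encoding, errors="replace") as fh:
        return fh.read()
```

If `sniff_bytes` shows a binary header rather than readable words and numbers, the download is likely the wrong file; lenient decoding only helps when the file really is text with a few stray bytes.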

Doc2Vec

@makcedward I am trying to retrieve similar documents from the given document. Here is the code snippet:

x_train_t = doc2vec_embs.encode(documents=x_train)
x_test_t = doc2vec_embs.encode(documents=x_test)

def similiar_docs(doc2vec_embs, test_sample):
    sims = doc2vec_embs.model.docvecs.most_similar([test_sample], topn=1)
    for s in sims:
        print(x_train[s[0]])

test_sample = x_test_t[0]
print(x_test[0])
similiar_docs(doc2vec_embs, test_sample)

However, the retrieved docs aren't similar. Am I missing something here?

nlp-character_embedding.ipynb

Hi, thank you for the awesome code on the character embedding model. I had a lot of fun playing with it.
One small suggestion: in CharCNN's build_char_dictionary, you use chars = list(set(chars)), which scrambles the order of the characters in char_dict. Every time I start a new notebook the order is different, so the resulting dictionary is different too. I tried to load my trained Keras model in a new notebook and found that the model did not work. In the end I figured out that the char_indices from my preprocessing step was totally different from the old one. I hadn't saved the old char_indices, so I had no choice but to retrain the model, lol.
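A minimal sketch of one way to avoid the problem described above, assuming the dictionary only needs to map each character to a stable integer: sort the character set before assigning indices, and save the mapping next to the trained model. This is not the repository's CharCNN code; the function signatures and the JSON file format are illustrative assumptions.

```python
import json


def build_char_dictionary(texts):
    """Map every character seen in texts to a reproducible integer index."""
    chars = set()
    for text in texts:
        chars.update(text)
    # sorted() fixes the order; iterating a set of strings can differ
    # between interpreter sessions because of hash randomization.
    ordered = sorted(chars)
    char_indices = {ch: i for i, ch in enumerate(ordered)}
    indices_char = {i: ch for ch, i in char_indices.items()}
    return char_indices, indices_char


def save_char_dictionary(char_indices, path):
    """Persist the mapping so a later session can reuse it with the model."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(char_indices, fh, ensure_ascii=False)


def load_char_dictionary(path):
    """Load a mapping written by save_char_dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Sorting makes the mapping reproducible for the same training text, while saving it guards against the case where the text itself changes between sessions.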
