
nlp-recipes's Introduction

NLP Best Practices

In recent years, natural language processing (NLP) has seen rapid growth in quality and usability, which has helped to drive business adoption of artificial intelligence (AI) solutions. Over the last few years, researchers have been applying newer deep learning methods to NLP, and data scientists have started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that these tools can significantly reduce the “time to market” by simplifying the path from defining the business problem to developing a solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks like GLUE and SQuAD leaderboards. The models can be used in a number of applications ranging from simple text classification to sophisticated intelligent chat bots.

Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist which do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If these solutions are not applicable, or their accuracy is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:

Text Analytics is a set of pre-trained REST APIs that can be called for sentiment analysis, key phrase extraction, language detection, named entity recognition, and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities.

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service for training and deploying a model as a REST API given a user-provided training set. You can perform intent classification as well as named entity extraction simply by providing example utterances and labelling them. It supports active learning, so your model keeps learning and improving.

Target Audience

Our target audience for this repository includes data scientists and machine learning engineers with varying levels of NLP knowledge, as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions

Scenarios

We aim to have end-to-end examples of common tasks and scenarios such as text classification, named entity recognition, etc.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face which allows users to easily load pretrained models and fine-tune them for different tasks.
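As a minimal illustration of that workflow (generic transformers usage, not this repository's own wrapper API), loading a pretrained checkpoint and running one fine-tuning step looks roughly like this:

# Minimal sketch of loading a pretrained model with the Hugging Face
# transformers package and preparing it for fine-tuning on text classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"   # any supported pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a small batch of example texts.
batch = tokenizer(
    ["the movie was great", "the movie was terrible"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])

# One fine-tuning step; in practice this runs inside an epoch/batch loop.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()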

Languages

We strongly subscribe to the multi-language principles laid down by Emily Bender:

  • "Natural language is not a synonym for English"
  • "English isn't generic for language, despite what NLP papers might lead you to believe"
  • "Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and FastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
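For example, swapping in a multilingual checkpoint is often just a matter of changing the model name. A minimal sketch using the Hugging Face transformers package, assuming the standard bert-base-multilingual-cased checkpoint (not a model provided by this repository):

# Tokenize non-English text with a multilingual pretrained checkpoint;
# the surrounding fine-tuning code stays the same as for English.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("El procesamiento del lenguaje natural es fascinante"))
print(tokenizer.tokenize("自然言語処理"))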

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

| Scenario | Models | Description | Languages |
|---|---|---|---|
| Text Classification | BERT, DistilBERT, XLNet, RoBERTa, ALBERT, XLM | Text classification is a supervised learning method of learning and predicting the category or class of a document given its text content. | English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch |
| Named Entity Recognition | BERT | Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. | English |
| Text Summarization | BERTSumExt, BERTSumAbs, UniLM (s2s-ft), MiniLM | Text summarization is a language generation task of summarizing the input text into a shorter paragraph of text. | English |
| Entailment | BERT, XLNet, RoBERTa | Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine whether the text agrees with the hypothesis or not. | English |
| Question Answering | BiDAF, BERT, XLNet | Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. | English |
| Sentence Similarity | BERT, GenSen | Sentence similarity is the process of computing a similarity score given a pair of text documents. | English |
| Embeddings | Word2Vec, fastText, GloVe | Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in low dimension. | English |
| Sentiment Analysis | Dependency Parser, GloVe | Provides an example of training and using Aspect-Based Sentiment Analysis with Azure ML and Intel NLP Architect. | English |

Getting Started

When solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When your needs go beyond what the prebuilt services offer and you require custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to set up your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is used in notebooks across different scenarios to enhance the efficiency of developing natural language systems at scale and for various tasks related to AI model development.

To successfully run these notebooks, you will need an Azure subscription, or you can try Azure for free. Other Azure services or products may be used in the notebooks; an introduction and/or reference for those is provided in the notebooks themselves.
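As a rough, hypothetical sketch of the AzureML pattern the notebooks follow (the script name "train.py" and compute target "cpu-cluster" below are placeholders, not names from this repository):

# Connect to an Azure ML workspace and submit a training script as an experiment run.
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()                      # reads config.json for the workspace
experiment = Experiment(workspace=ws, name="nlp-recipes-example")

# Submit a training script to a pre-created compute target.
config = ScriptRunConfig(
    source_directory=".",
    script="train.py",                # hypothetical training script
    compute_target="cpu-cluster",     # hypothetical compute cluster name
)
run = experiment.submit(config)
run.wait_for_completion(show_output=True)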

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Blog Posts

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

| Repository | Description |
|---|---|
| Transformers | A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort. |
| Azure Machine Learning Notebooks | ML and deep learning examples with Azure Machine Learning. |
| AzureML-BERT | End-to-end recipes for pre-training and fine-tuning BERT using the Azure Machine Learning service. |
| MASS | MASS: Masked Sequence to Sequence Pre-training for Language Generation. |
| MT-DNN | Multi-Task Deep Neural Networks for Natural Language Understanding. |
| UniLM | Unified Language Model Pre-training. |
| DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. |

Build Status

| Build | Branch | Status |
|---|---|---|
| Linux CPU | master | Build Status |
| Linux CPU | staging | Build Status |
| Linux GPU | master | Build Status |
| Linux GPU | staging | Build Status |


nlp-recipes's Issues

Word embedding loaders

Adding a downloader, extractor, and loader for 3 different sets of pre-trained word vectors (see the sketch after this list):

  • Word2vec
  • FastText
  • GloVe
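A minimal sketch of what such a loader could look like, using gensim's downloader rather than the repository's final API (the dataset identifiers below are real gensim-data names; the helper name is hypothetical):

# Hypothetical helper: download (on first use) and load one of the supported
# pre-trained embedding sets via gensim's dataset downloader.
import gensim.downloader as api

PRETRAINED = {
    "word2vec": "word2vec-google-news-300",
    "fasttext": "fasttext-wiki-news-subwords-300",
    "glove": "glove-wiki-gigaword-300",
}

def load_pretrained_vectors(name):
    """Return a gensim KeyedVectors object for the requested embedding set."""
    return api.load(PRETRAINED[name])

glove = load_pretrained_vectors("glove")
print(glove["language"][:5])   # first 5 components of a 300-dimensional vector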

Finalize API Structure

Decide API Structure:

  • Implicit vs Explicit
  • Wrapper classes over BERT / FastText vs directly using libraries

Model 3:

# Lower-level flow using pytorch-pretrained-bert directly. Assumes text_train,
# num_labels, MAX_LEN, BERT_CACHE_DIR, device, and optimizer_grouped_parameters
# are already defined.
import torch.nn as nn
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam

# get tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True, cache_dir=BERT_CACHE_DIR)

# tokenize and truncate, leaving room for the two special tokens
tokens_train = [tokenizer.tokenize(x)[0 : MAX_LEN - 2] for x in text_train]
# BERT format
tokens_train = [["[CLS]"] + x + ["[SEP]"] for x in tokens_train]
# convert tokens to ids
tokens_train = [tokenizer.convert_tokens_to_ids(x) for x in tokens_train]
# pad to MAX_LEN with id 0
tokens_train = [x + [0] * (MAX_LEN - len(x)) for x in tokens_train]

# create input mask (1 for real tokens, 0 for padding)
input_mask_train = [[min(1, x) for x in y] for y in tokens_train]

# define model
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", cache_dir=BERT_CACHE_DIR, num_labels=num_labels
).to(device)
# define loss function
loss_func = nn.CrossEntropyLoss().to(device)
# define optimizer
opt = BertAdam(optimizer_grouped_parameters, lr=2e-5)

# train
# … loop through epochs and batches, compute loss, update parameters …

# score
# … run the fine-tuned model on the test set …

Model 4:

# create classifier
model = BERTClassifier(language=ENGLISH)
# fit
trained_model = model.fit(
    text=text_train,
    labels=labels_train,
    num_epochs=NUM_EPOCHS,
    device=config.Device.GPU,
    multiple_gpus=True,
)
# save model
trained_model.save(MODEL_FILE)
# score
preds = trained_model.predict(text=text_test)

Replace urlretrieve with requests

The maybe_download function in utils_nlp.dataset.url_utils uses urlretrieve which doesn't work with some URLs like XNLI's (they require a known user agent provided in the header). Can we use requests instead?
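A hypothetical requests-based replacement might look like the sketch below; the actual maybe_download signature in utils_nlp.dataset.url_utils may differ, but the key point is that requests lets us send a User-Agent header:

# Sketch of a requests-based downloader that sets a User-Agent header,
# which some hosts (e.g. the XNLI server) require.
import os
import requests

def maybe_download(url, filename=None, work_directory="."):
    """Download url into work_directory unless the file is already present."""
    filename = filename or url.split("/")[-1]
    filepath = os.path.join(work_directory, filename)
    if not os.path.exists(filepath):
        headers = {"User-Agent": "nlp-recipes dataset downloader"}
        with requests.get(url, headers=headers, stream=True) as response:
            response.raise_for_status()
            with open(filepath, "wb") as f:
                for chunk in response.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
    return filepath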

[FEATURE] Need to detokenize a BertTokenizer output

Description

Currently the output of the NER prediction contains subword tokens, but the end user cares about the original words, not the subwords.

For example, for 'call Qingxiong Daisy':
tokenizer.tokenize([text]) -> [['call', 'Qing', '##xi', '##ong', 'Daisy']]
output label [['O', 'PersonName', 'X', 'X', 'X', 'PersonName', 'X', 'X']]

Expected behavior with the suggested feature

The desired output should be
'Qingxiong Daisy'->PersonName

It can also be helpful to provide the position of the entity
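A hypothetical sketch of the requested post-processing (not an existing utility in this repo): merge WordPiece continuation tokens ("##…") back into whole words, keep the label predicted for the first subword, and report the position:

# Merge subword tokens into words and aggregate their predicted labels.
def merge_subword_predictions(tokens, labels):
    """Return (word, label, start_token_index) triples from subword-level output."""
    merged = []
    for i, (token, label) in enumerate(zip(tokens, labels)):
        if token.startswith("##") and merged:
            # Continuation subword: append to the previous word, keep its label.
            word, word_label, start = merged.pop()
            merged.append((word + token[2:], word_label, start))
        else:
            merged.append((token, label, i))
    return merged

tokens = ["call", "Qing", "##xi", "##ong", "Daisy"]
labels = ["O", "PersonName", "X", "X", "PersonName"]
print(merge_subword_predictions(tokens, labels))
# [('call', 'O', 0), ('Qingxiong', 'PersonName', 1), ('Daisy', 'PersonName', 4)]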

Other Comments

Attribution is untracked for non owners with staging -> master paradigm

I was looking at contributors and all of the MAIDAP team's PRs are not showing up as their contributions since only master is tracked. This is because staging is merged to master by a select few.

I think adopting a fork-based workflow, where contributors open PRs against master and top-level approvers review and merge them, would maintain attribution while still allowing owners to hold off on check-ins to master. MLflow and RStudio have similar contribution models.

Either way, any solution that maintains the attribution works for me

Investigate ML Flow

MLflow has integration with the AzureML experimentation service, so it would be good to investigate whether we can use it throughout the repo and then show how to integrate it with AzureML (a minimal sketch follows below).
https://mlflow.org/
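A minimal sketch of what that could look like, assuming the azureml-mlflow integration package is installed (the experiment name and metric values below are placeholders):

# Point MLflow's tracking URI at an Azure ML workspace so that mlflow.log_*
# calls show up in the AzureML experimentation service.
import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()                       # reads config.json for the workspace
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("nlp-recipes-example")       # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("model", "bert-base-uncased")
    mlflow.log_metric("accuracy", 0.91)            # placeholder value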

[FEATURE] Evaluate PyTorch Text

Description

Look into depending on PyTorch Text. Document and evaluate:

  1. General repo structure
  2. Examples related to our current work and what would have to change for adoption
  3. Aggregate issues and general direction / pace of contributions

Expected behavior with the suggested feature

This change should:

  1. Simplify our Dataset logic by depending on the library and contributing back to it
  2. Allow us to contribute back to the community
  3. Keep up to date with NLP idioms as they are introduced
  4. Increase team familiarity with, and possibly contributions to, the PyTorch + NLP ecosystem

Other Comments

Add an NLP model explanation module proposed in our ICML 2019 paper

Hi, teams.

I am Chaoyu Guan from MSRA. Recently, we published our paper Towards A Deep and Unified Understanding of Deep Neural Models in NLP at ICML 2019. It explains how a hidden state in an NLP model utilizes the input words by visualizing their contributions.

We would like to publish our code in this repo, which will enrich this NLP toolkit. We would do this by creating a folder named 'interpreter' under the tool_nlp folder and adding our main code to it. We'll also provide some examples under the scenarios folder showing how to use the code. Can we move ahead?

Finalize folder structure for examples / scenarios

Finalize folder structure for examples / scenarios, and plan for how it will scale as we get new languages (Hindi, Chinese) and new domains (tax, finance).

Example:

  • utils_nlp
      • common utils submodule (like maybe_download)
      • data loading submodule
      • text_classification
          • text classification class wrapper
      • named_entity_recognition
      • (etc.)
  • scenarios
      • text_classification
          • bert_text_classification [english, tax]
