
nlp-recipes's Introduction

NLP Best Practices

In recent years, natural language processing (NLP) has seen rapid growth in quality and usability, which has helped to drive business adoption of artificial intelligence (AI) solutions. Over the last few years, researchers have been applying newer deep learning methods to NLP, and data scientists have started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that these tools can significantly reduce the “time to market” by simplifying the path from defining the business problem to developing a solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks like GLUE and SQuAD leaderboards. The models can be used in a number of applications ranging from simple text classification to sophisticated intelligent chat bots.

Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist which do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If these solutions are not applicable, or their accuracy is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:

Text Analytics is a set of pre-trained REST APIs that can be called for sentiment analysis, key phrase extraction, language detection, named entity recognition, and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities.

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service for training and deploying a model as a REST API given a user-provided training set. You can perform intent classification as well as named entity extraction simply by providing example utterances and labelling them. It supports active learning, so your model keeps learning and improving.

Target Audience

Our target audience for this repository includes data scientists and machine learning engineers with varying levels of NLP knowledge, as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions

Scenarios

We aim to have end-to-end examples of common tasks and scenarios such as text classification, named entity recognition, etc.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face which allows users to easily load pretrained models and fine-tune them for different tasks.
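As a minimal illustration of that workflow (generic transformers usage, not this repository's own wrapper API), loading a pretrained checkpoint and running one fine-tuning step looks roughly like this:

# Minimal sketch of loading a pretrained model with the Hugging Face
# transformers package and preparing it for fine-tuning on text classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"   # any supported pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a small batch of example texts.
batch = tokenizer(
    ["the movie was great", "the movie was terrible"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])

# One fine-tuning step; in practice this runs inside an epoch/batch loop.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()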

Languages

We strongly subscribe to the multi-language principles laid down by Emily Bender:

  • "Natural language is not a synonym for English"
  • "English isn't generic for language, despite what NLP papers might lead you to believe"
  • "Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and FastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
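For example, swapping in a multilingual checkpoint is often just a matter of changing the model name. A minimal sketch using the Hugging Face transformers package, assuming the standard bert-base-multilingual-cased checkpoint (not a model provided by this repository):

# Tokenize non-English text with a multilingual pretrained checkpoint;
# the surrounding fine-tuning code stays the same as for English.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("El procesamiento del lenguaje natural es fascinante"))
print(tokenizer.tokenize("自然言語処理"))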

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

| Scenario | Models | Description | Languages |
|---|---|---|---|
| Text Classification | BERT, DistilBERT, XLNet, RoBERTa, ALBERT, XLM | Text classification is a supervised learning method of learning and predicting the category or class of a document given its text content. | English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch |
| Named Entity Recognition | BERT | Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. | English |
| Text Summarization | BERTSumExt, BERTSumAbs, UniLM (s2s-ft), MiniLM | Text summarization is a language generation task of summarizing the input text into a shorter paragraph of text. | English |
| Entailment | BERT, XLNet, RoBERTa | Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine whether the text agrees with the hypothesis or not. | English |
| Question Answering | BiDAF, BERT, XLNet | Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. | English |
| Sentence Similarity | BERT, GenSen | Sentence similarity is the process of computing a similarity score given a pair of text documents. | English |
| Embeddings | Word2Vec, fastText, GloVe | Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in low dimension. | English |
| Sentiment Analysis | Dependency Parser, GloVe | Provides an example of training and using Aspect-Based Sentiment Analysis with Azure ML and Intel NLP Architect. | English |

Getting Started

When solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When your needs go beyond what the prebuilt services offer and you require custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to set up your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is used in notebooks across different scenarios to enhance the efficiency of developing natural language systems at scale and for various tasks related to AI model development.

To successfully run these notebooks, you will need an Azure subscription, or you can try Azure for free. Other Azure services or products may be used in the notebooks; an introduction and/or reference for those is provided in the notebooks themselves.
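As a rough, hypothetical sketch of the AzureML pattern the notebooks follow (the script name "train.py" and compute target "cpu-cluster" below are placeholders, not names from this repository):

# Connect to an Azure ML workspace and submit a training script as an experiment run.
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()                      # reads config.json for the workspace
experiment = Experiment(workspace=ws, name="nlp-recipes-example")

# Submit a training script to a pre-created compute target.
config = ScriptRunConfig(
    source_directory=".",
    script="train.py",                # hypothetical training script
    compute_target="cpu-cluster",     # hypothetical compute cluster name
)
run = experiment.submit(config)
run.wait_for_completion(show_output=True)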

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Blog Posts

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

| Repository | Description |
|---|---|
| Transformers | A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort. |
| Azure Machine Learning Notebooks | ML and deep learning examples with Azure Machine Learning. |
| AzureML-BERT | End-to-end recipes for pre-training and fine-tuning BERT using the Azure Machine Learning service. |
| MASS | MASS: Masked Sequence to Sequence Pre-training for Language Generation. |
| MT-DNN | Multi-Task Deep Neural Networks for Natural Language Understanding. |
| UniLM | Unified Language Model Pre-training. |
| DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. |

Build Status

| Build | Branch | Status |
|---|---|---|
| Linux CPU | master | Build Status |
| Linux CPU | staging | Build Status |
| Linux GPU | master | Build Status |
| Linux GPU | staging | Build Status |


nlp-recipes's Issues

Word embedding loaders

Adding a downloader, extractor, and loader for 3 different sets of pre-trained word vectors (see the sketch after this list):

  • Word2vec
  • FastText
  • GloVe
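A minimal sketch of what such a loader could look like, using gensim's downloader rather than the repository's final API (the dataset identifiers below are real gensim-data names; the helper name is hypothetical):

# Hypothetical helper: download (on first use) and load one of the supported
# pre-trained embedding sets via gensim's dataset downloader.
import gensim.downloader as api

PRETRAINED = {
    "word2vec": "word2vec-google-news-300",
    "fasttext": "fasttext-wiki-news-subwords-300",
    "glove": "glove-wiki-gigaword-300",
}

def load_pretrained_vectors(name):
    """Return a gensim KeyedVectors object for the requested embedding set."""
    return api.load(PRETRAINED[name])

glove = load_pretrained_vectors("glove")
print(glove["language"][:5])   # first 5 components of a 300-dimensional vector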

Finalize API Structure

Decide API Structure:

  • Implicit vs Explicit
  • Wrapper classes over BERT / FastText vs directly using libraries

Model 3:

# Lower-level flow using pytorch-pretrained-bert directly. Assumes text_train,
# num_labels, MAX_LEN, BERT_CACHE_DIR, device, and optimizer_grouped_parameters
# are already defined.
import torch.nn as nn
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam

# get tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True, cache_dir=BERT_CACHE_DIR)

# tokenize and truncate, leaving room for the two special tokens
tokens_train = [tokenizer.tokenize(x)[0 : MAX_LEN - 2] for x in text_train]
# BERT format
tokens_train = [["[CLS]"] + x + ["[SEP]"] for x in tokens_train]
# convert tokens to ids
tokens_train = [tokenizer.convert_tokens_to_ids(x) for x in tokens_train]
# pad to MAX_LEN with id 0
tokens_train = [x + [0] * (MAX_LEN - len(x)) for x in tokens_train]

# create input mask (1 for real tokens, 0 for padding)
input_mask_train = [[min(1, x) for x in y] for y in tokens_train]

# define model
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", cache_dir=BERT_CACHE_DIR, num_labels=num_labels
).to(device)
# define loss function
loss_func = nn.CrossEntropyLoss().to(device)
# define optimizer
opt = BertAdam(optimizer_grouped_parameters, lr=2e-5)

# train
# … loop through epochs and batches, compute loss, update parameters …

# score
# … run the fine-tuned model on the test set …

Model 4:

# create classifier
model = BERTClassifier(language=ENGLISH)
# fit
trained_model = model.fit(
    text=text_train,
    labels=labels_train,
    num_epochs=NUM_EPOCHS,
    device=config.Device.GPU,
    multiple_gpus=True,
)
# save model
trained_model.save(MODEL_FILE)
# score
preds = trained_model.predict(text=text_test)

Replace urlretrieve with requests

The maybe_download function in utils_nlp.dataset.url_utils uses urlretrieve which doesn't work with some URLs like XNLI's (they require a known user agent provided in the header). Can we use requests instead?
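A hypothetical requests-based replacement might look like the sketch below; the actual maybe_download signature in utils_nlp.dataset.url_utils may differ, but the key point is that requests lets us send a User-Agent header:

# Sketch of a requests-based downloader that sets a User-Agent header,
# which some hosts (e.g. the XNLI server) require.
import os
import requests

def maybe_download(url, filename=None, work_directory="."):
    """Download url into work_directory unless the file is already present."""
    filename = filename or url.split("/")[-1]
    filepath = os.path.join(work_directory, filename)
    if not os.path.exists(filepath):
        headers = {"User-Agent": "nlp-recipes dataset downloader"}
        with requests.get(url, headers=headers, stream=True) as response:
            response.raise_for_status()
            with open(filepath, "wb") as f:
                for chunk in response.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
    return filepath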

[FEATURE] Need to detokenize a BertTokenizer output

Description

Currently the output of the NER prediction contains subword tokens, but the end user cares about the original words, not the subwords.

For example, for 'call Qingxiong Daisy':
tokenizer.tokenize([text]) -> [['call', 'Qing', '##xi', '##ong', 'Daisy']]
output label [['O', 'PersonName', 'X', 'X', 'X', 'PersonName', 'X', 'X']]

Expected behavior with the suggested feature

The desired output should be
'Qingxiong Daisy'->PersonName

It can also be helpful to provide the position of the entity
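A hypothetical sketch of the requested post-processing (not an existing utility in this repo): merge WordPiece continuation tokens ("##…") back into whole words, keep the label predicted for the first subword, and report the position:

# Merge subword tokens into words and aggregate their predicted labels.
def merge_subword_predictions(tokens, labels):
    """Return (word, label, start_token_index) triples from subword-level output."""
    merged = []
    for i, (token, label) in enumerate(zip(tokens, labels)):
        if token.startswith("##") and merged:
            # Continuation subword: append to the previous word, keep its label.
            word, word_label, start = merged.pop()
            merged.append((word + token[2:], word_label, start))
        else:
            merged.append((token, label, i))
    return merged

tokens = ["call", "Qing", "##xi", "##ong", "Daisy"]
labels = ["O", "PersonName", "X", "X", "PersonName"]
print(merge_subword_predictions(tokens, labels))
# [('call', 'O', 0), ('Qingxiong', 'PersonName', 1), ('Daisy', 'PersonName', 4)]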

Other Comments

Attribution is untracked for non owners with staging -> master paradigm

I was looking at contributors and all of the MAIDAP team's PRs are not showing up as their contributions since only master is tracked. This is because staging is merged to master by a select few.

I think adopting a fork-based workflow, where contributors open PRs against master and top-level approvers review and merge them, would maintain attribution while still allowing owners to hold off on check-ins to master. MLflow and RStudio have similar contribution models.

Either way, any solution that maintains the attribution works for me

Investigate ML Flow

MLflow has integration with the AzureML experimentation service, so it would be good to investigate whether we can use it throughout the repo and then show how to integrate it with AzureML (a minimal sketch follows below).
https://mlflow.org/
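A minimal sketch of what that could look like, assuming the azureml-mlflow integration package is installed (the experiment name and metric values below are placeholders):

# Point MLflow's tracking URI at an Azure ML workspace so that mlflow.log_*
# calls show up in the AzureML experimentation service.
import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()                       # reads config.json for the workspace
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("nlp-recipes-example")       # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("model", "bert-base-uncased")
    mlflow.log_metric("accuracy", 0.91)            # placeholder value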

[FEATURE] Evaluate PyTorch Text

Description

Look into depending on PyTorch Text. Document and evaluate:

  1. General repo structure
  2. Examples related to our current work and what would have to change for adoption
  3. Aggregate issues and general direction / pace of contributions

Expected behavior with the suggested feature

This change should:

  1. Simplify our Dataset logic by depending on the library and contributing back to it
  2. Allow us to contribute back to the community
  3. Keep up to date with NLP idioms as they are introduced
  4. Increase team familiarity with, and possibly contributions to, the PyTorch + NLP ecosystem

Other Comments

Add an NLP model explanation module proposed in our ICML 2019 paper

Hi, teams.

I am Chaoyu Guan from MSRA. Recently, we published our paper Towards A Deep and Unified Understanding of Deep Neural Models in NLP at ICML 2019. It explains how a hidden state in an NLP model utilizes the input words by visualizing their contributions.

We would like to publish our code in this repo, which will enrich this NLP toolkit. We would do this by creating a folder named 'interpreter' under the tool_nlp folder and adding our main code to it. We'll also provide some examples under the scenarios folder showing how to use the code. Can we move ahead?

Finalize folder structure for examples / scenarios

Finalize folder structure for examples / scenarios, and plan for how it will scale as we get new languages (Hindi, Chinese) and new domains (tax, finance).

Example:

  • utils_nlp
      • common utils submodule (like maybe_download)
      • data loading submodule
      • text_classification
          • text classification class wrapper
      • named_entity_recognition
      • (etc.)
  • scenarios
      • text_classification
          • bert_text_classification [english, tax]
