In this project we implement an algorithmic pipeline intended to train a model with minimum intervention. Given a Dataset, a new category name and few positive sentences as an input, train a classifier as described below.
Let’s say we want to create a new topic model that identifies violence, and we have unlabelled data for this new topic. We came up few relevant phrases and keywords such as:
“they were kicking him in the yard”
“we got involved in a fight”
“there was blood everywhere”
and we wish to train a model that will classify similar texts as positive for violence.
The main challenge of the task is to find (automatically) relevant positive and negative samples from the unlabelled data, to construct a small training dataset.
For example, you can select some positive examples by using a Nearest Neighbours database.
After training a model on this small training set and classifying the data (in this case, twitter sentiment analysis dataset – but we don’t use the existing labels, as we want to create a new topic model) we want the model to
classify the following tweets as positive:
"@huntermoore I don't want him to ever punch me.”
"I can't sleep. There's a fight outside. How inconsiderate. I wanna go SLEEP!"
"@JooceGossip Wow.... I wouldn't want to be w/ a man that would hit me in any way! I hope she doesn't go back to him!"
├── /data_models
│ ├── corpus_embeddings.pt <-- Here you should place your dataset with corpus embeddings
│ ├── query_embeddings.pt <-- Here you should place your dataset with query embedding
│
├── config.py <-- Constants and configurations
├── data_creation.py <-- All preprocesses and processes functions are here
├── app.py <-- main python flask application
├── log_file.txt <-- The application will logging in this file
├── requirements.txt <-- The list of mandatory python libraries and versions
├── README.md <-- Documentation to use this repository
This is a flask application starting from Step 4. To run it locally use python app.py
command from command line.
To pass a request use a browser or Postman application. The format of the URL to pass a request is:
http://127.0.0.1:5000/predict?twitts=["they were kicking him in the yard", "we got involved in a fight", "there was blood everywhere"]
or
http://127.0.0.1:5000/predict?twitts=[%22they%20were%20kicking%20him%20in%20the%20yard%22,%20%22we%20got%20involved%20in%20a%20fight%22,%22there%20was%20blood%20everywhere%22]
You should have a dataset with embeddings in the /data_models
folder.
Before Step 5 you can create a data set with enbeddings via googlecolab notebook.
All code can be found in GoogleCollab notebook with some explanations:
https://colab.research.google.com/drive/1MgzivO1sEkSFk-FYnGurVKysNu27popA?usp=sharing
- To download twitter sentiment analysis dataset from Kaggle we used Kaggle API.
You can find it in the first googlecollab cells. Make sure you have your personalkaggle.json
document. How to do it you can read here.
first step will load and unzip the dataset in a temporary folder on googlecollab. - The function
get_dataset(path_to_dataset)
indataset_creation.py
Loads the data (using only the text column). - From Step 2 we have a corpus dataset, now we can Extract features for each sample with
extract_embeddings()
.
Inside it use SentenceTransformers from Hugging Face and distilled Microsoft model with embeddings size of 384 tokens. - After this step you able to download the embeddings data set locally (2.4Gb) and place it into
/data_models
folder.
This dataset contains embeddings + cosine similarity scores + twitts.
Here starts the flask application run
- The next step is to create Nearest neighbour database/dataset with
create_nn_db(corpus_embeddings, query_embeddings)
function. It will label 25% of true positives samples with 1 and the rest 75% of samples will be unknown and marked with -1.
It will callget_samples()
function to sample positive example using the nearest neighbour DB (Based on input sentences) and negative examples - On this Step we Train an SVC model for ElkanotoPuClassifier from pulearn library leveraging the untagged data.
- Evaluation is done in function
evaluate_results()
with regular scikit-learn library: F1, ROC AUC, Recall and Precision scores.
It will be visible in thelog_file.txt
. - The Response will Show top-100 most similar for input sentences results.
Every operation and calculation results will be logged in the log_file.txt
in the project directory.
If you want to read more about Positive Learning please read the research papers below:
https://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Mordelet2013bagging.pdf
https://cseweb.ucsd.edu/~elkan/posonly.pdf
The code examples
https://github.com/roywright/pu_learning
https://github.com/aldro61/pu-learning
https://github.com/AdityaAS/pu-learning