
UNA

This is the official code for our paper "Unsupervised hard Negative Augmentation for contrastive learning".

Environments

This repository has been tested with Python 3.8+.

About

We present Unsupervised hard Negative Augmentation (UNA), a method that generates synthetic negative instances based on the term frequency-inverse document frequency (TF-IDF) retrieval model.
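To make the TF-IDF part concrete, here is a minimal sketch of how per-term TF-IDF weights can be computed for a handful of sentences. It is an illustration only: the whitespace tokenizer, the `tfidf_matrix` helper, and the plain `tf * log(N / df)` weighting are assumptions made for this example, and the repository's own implementation in function/tfidf_una.py may differ. How UNA turns these scores into synthetic negatives is not shown here.

```python
import math
from collections import Counter


def tfidf_matrix(sentences):
    """Return (vocab, rows): one {term: tf-idf weight} dict per sentence.

    Hypothetical helper for illustration; not the repository's API.
    """
    tokenized = [s.lower().split() for s in sentences]
    n_docs = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        rows.append({
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return sorted(df), rows


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vocab, rows = tfidf_matrix(corpus)

# A term that occurs in every sentence (here "the") gets weight 0,
# while a term unique to one sentence scores highest in that sentence.
top_term = max(rows[0], key=rows[0].get)
```

In this toy corpus, the highest-weighted term of the first sentence is the one that appears in no other sentence, which is the sense in which TF-IDF ranks the importance of terms.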

Getting started

Environments

conda create -n una python=3.8
conda activate una
cd UNA
pip install -r requirements.txt

Dataset preparation

Training set

We used the training dataset from SimCSE, which can be downloaded by running the following script.

cd data/
bash download_wiki.sh

Prepare the paraphrased sentences

To create the paraphrased sentences, run the following script:

cd data/augment/
python paraphrase.py

Produce TF-IDF matrix offline

Run the following script to prepare the TF-IDF matrix. If you don't want to prepare the matrix offline, uncomment lines 94-101 in data/dataset.py.

cd data/augment/
python create_dict.py

Change the mode to 'para' to produce the TF-IDF matrix for paraphrasing.
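The idea behind preparing the matrix offline is to pay the computation cost once and reuse the saved result in later runs. The sketch below shows that pattern in general terms. The `load_or_build` helper, the pickle format, and the `matrix.pkl` file name are assumptions made for this example; the format that create_dict.py actually writes under tfidf/ is not specified here.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Load a pickled object from cache_path, or build it and save it first.

    Hypothetical helper for illustration; not part of the repository.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    obj = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(obj, f)
    return obj


calls = []


def build_matrix():
    # Stand-in for the expensive TF-IDF computation (toy values).
    calls.append(1)
    return {"mat": 0.18, "cat": 0.07}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tfidf" / "ori" / "matrix.pkl"  # hypothetical file name
    first = load_or_build(path, build_matrix)   # builds and writes the file
    second = load_or_build(path, build_matrix)  # reads the cached file
```

The second call never invokes `build_matrix`, which is the saving that an offline matrix gives during training.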

Evaluation set

The evaluation set can be downloaded by running the following script:

cd data/downstream/
bash download_dataset.sh

Code structure

After preparing the datasets, the structure of the code should look like this:

.
├── data
│   ├── augment
│   │   ├── paraphrase.py            # code for creating the paraphrased lines
│   │   └── create_dict.py           # code for creating matrices under folder tfidf/
│   ├── downstream                   # folder containing evaluation dataset
│   ├── stsbenchmark                 # folder containing validation dataset
│   └── training                     # folder containing training dataset
├── evaluate                         # Evaluation code *
├── function
│   ├── metrics.py
│   ├── seed.py                      # initialize random seeds
│   └── tfidf_una.py                 # file for calculating the TF-IDF matrices and vectors for UNA
├── model
│   ├── lambda_scheduler.py          # contains different schedulers
│   └── models.py                    # backbone BERT/RoBERTa model
├── script                           # folder that contains .sh scripts to run the pre-training file
├── tfidf
│   ├── ori                          # folder for pre-saved TF-IDF representation of the original training dataset
│   └── para                         # folder for pre-saved TF-IDF representation of the original training and the paraphrased dataset
├── run.py                           # run pretraining with UNA
├── una.py                           # run pretraining with FaceSwap
└── utils.py
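A quick way to confirm that the preparation steps produced this layout is to check for the expected paths before training. The sketch below is one possible check; the `missing_paths` helper is hypothetical and the path list is copied from the tree above, so adjust it if your checkout differs. The demo runs against a temporary directory so that it is self-contained.

```python
import tempfile
from pathlib import Path

# Paths taken from the tree above; adjust if the repository layout differs.
EXPECTED = [
    "data/augment/paraphrase.py",
    "data/augment/create_dict.py",
    "data/downstream",
    "data/stsbenchmark",
    "data/training",
    "tfidf/ori",
    "tfidf/para",
    "run.py",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


# Demo: a temporary directory that contains only data/training.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data" / "training").mkdir(parents=True)
    missing = missing_paths(tmp)
```

From the repository root, `missing_paths(".")` should return an empty list once the datasets and TF-IDF matrices are in place.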

Train UNA

To reproduce our results (for STS) with the UNA framework, run the training script here.

Trained Model

Models can be downloaded from here.

Results

Acknowledgement
