Code repository for Active PETs. This repository would not be possible without the previous open-source projects ALPS and PET.
Our main contribution is a weighted ensemble of PETs, which is used to actively sample the most beneficial examples from an unlabelled pool.
This readme mainly describes how to use the code. To reproduce the experiments reported in the paper, please see the readme file under `scripts`.
- Create a virtual environment with Python 3.7+
- Run the following command:

```shell
pip install -r requirements.txt
```
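For example, one common way to set this up (assuming `python3` 3.7+ is on your PATH; the environment name `venv` is just a convention):

```shell
python3 -m venv venv              # create the virtual environment
source venv/bin/activate          # activate it
pip install -r requirements.txt   # install the repository's dependencies
```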
The repository is organized into the following subfolders:

- `data`: folder for datasets
- `src_pet`: source code for simulating active learning
- `pet`: core code for PETs
- `scripts`: scripts for running experiments
- `pets`: saved models from running experiments
- `results`: results of active learning experiments

All commands below should be run from the top-level directory `activepets`.
To simply train a PET model on the full training dataset, run

```shell
bash scripts/train.sh
```

After training, the model will be saved under a subdirectory called `base` in the `pets` directory. Results on the dev set will be saved in `eval_results.txt`.
You may modify the parameters (such as model type, task, and seed) in `scripts/train.sh` by configuring the variables at the top of the script.
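For illustration, the configurable block at the top of `scripts/train.sh` might look roughly like this (the variable names and values below are hypothetical; check the script itself for the actual ones):

```shell
# Hypothetical configuration sketch -- not the script's actual contents.
MODEL_TYPE=roberta-base   # pretrained model to fine-tune
TASK_NAME=<your-task>     # dataset under data/
SEED=42                   # random seed for reproducibility
```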
To simulate active learning with various strategies, without the ensemble, run

```shell
bash scripts/active_train.sh
```

This script samples data for a fixed number of iterations and fine-tunes the model on the sampled data at each iteration. Each fine-tuned model is saved under a subdirectory called `{strategy}_{size}`, where `strategy` is the active learning strategy used to sample data and `size` is the number of examples used to fine-tune the model. Results on the dev set will be saved in `eval_results.txt`.
To modify the parameters in `scripts/active_train.sh`, configure the variables at the top of the script. Please read the instructions below for more information.
To simulate active learning with the PET ensemble, run

```shell
bash scripts/active_commitee.sh
```

Here are the naming conventions for the strategies from the paper:

- Random sampling: `rand`
- BADGE: `badge`
- CAL: `cal`
- ALPS: `alps`
- Active-PETs: `activepets`
So, whenever you want to use Active-PETs, pass `activepets` as input to the commands above.
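For example, a committee run with Active-PETs might be set up as follows (the `STRATEGY` variable name is an assumption for illustration; check the top of `scripts/active_commitee.sh` for the variable actually used to select the strategy):

```shell
# Hypothetical: edit the strategy variable at the top of the script,
# choosing one of rand, badge, cal, alps, activepets:
#   STRATEGY=activepets
# then run:
bash scripts/active_commitee.sh
```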
To set the number of examples sampled at each iteration, configure the variable `INCREMENT`. To set the maximum total number of examples sampled, configure the variable `MAX_SIZE`. The number of iterations is `MAX_SIZE / INCREMENT`.
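As a quick sanity check of that arithmetic (the values below are illustrative, not the scripts' defaults):

```shell
# Illustrative values; the real defaults are set at the top of
# scripts/active_train.sh and scripts/active_commitee.sh.
INCREMENT=50                           # examples sampled per iteration
MAX_SIZE=500                           # total examples to sample
ITERATIONS=$((MAX_SIZE / INCREMENT))   # integer division in bash
echo "iterations: $ITERATIONS"
```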
I am interested in extending this work in various ways and welcome any collaboration; for example, there is plenty of room to improve its efficiency.

Xia Zeng ([email protected])