This project is part of NLP Tunnel Vision. The overall architecture is shown in the following figure.

The goal of this project is to generate comments for a new article by considering a news reader's history of articles and comments. The project is divided into four parts:

- Data Processing: Download data from the Kaggle website and convert it into a specific format.
- Model Fine-tuning: Fine-tune the model using the processed dataset.
- Model Inference: Generate a comment for a given article.
- Model Deployment (optional): Deploy the model as a web service to Paperspace.
```
.
├── configure
│   └── openai.yaml
├── data
│   ├── processed
│   │   └── <processed data>
│   └── raw
│       ├── kaggle.json
│       └── <raw data from kaggle>
├── models
│   └── <save openai file and model job info>
├── scripts
│   ├── run_fine_tune.sh
│   ├── run_inference.sh
│   ├── run_openai_check_job.sh
│   ├── run_openai_data_formatter.sh
│   ├── run_openai_data_validation.sh
│   └── run_prepare_data.sh
├── src
│   ├── __init__.py
│   ├── fine_tune.py
│   ├── inference.py
│   ├── openai_check_job.py
│   ├── openai_data_formatter.py
│   ├── openai_data_validation.py
│   ├── prepare_data.py
│   ├── serve.py
│   └── utils.py
├── .env
├── .gitignore
├── docker-compose.yml
├── Dockerfile
├── run.sh
├── run_build_and_deployment.sh
├── venv.yaml
└── README.md
```
- Install Kaggle CLI

  ```shell
  pip install kaggle
  ```

- Configure Kaggle CLI

  Download `kaggle.json` from the Kaggle website and move it to the `data/raw` folder. Then run the following commands to configure the Kaggle CLI.

  ```shell
  mkdir ~/.kaggle
  mv kaggle.json ~/.kaggle
  chmod 600 ~/.kaggle/kaggle.json
  ```

- Download data from Kaggle

  ```shell
  kaggle datasets download -d benjaminawd/new-york-times-articles-comments-2020
  unzip new-york-times-articles-comments-2020.zip
  ```
- Install conda

  ```shell
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
  ```

- Create and activate virtual environment

  ```shell
  conda env create -f venv.yaml
  conda activate nlp-tunnel-vision
  ```
- Prepare data

  ```shell
  bash scripts/run_prepare_data.sh
  ```
- Convert data to OpenAI format

  ```shell
  bash scripts/run_openai_data_formatter.sh
  ```
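  The actual conversion logic lives in `src/openai_data_formatter.py`. As a rough sketch of the idea, each reader's article/comment history can be flattened into one chat-format JSONL example per target comment. The helper name and prompt wording below are hypothetical illustrations, not the project's real format:

  ```python
  import json

  def to_chat_example(history, new_article, comment):
      """Build one OpenAI chat-format fine-tuning example (hypothetical prompt wording)."""
      # Flatten the reader's (article, comment) history into a single context string.
      context = "\n".join(f"Article: {a}\nComment: {c}" for a, c in history)
      return {
          "messages": [
              {"role": "system", "content": "Generate a reader comment for a news article."},
              {"role": "user", "content": f"{context}\nArticle: {new_article}"},
              {"role": "assistant", "content": comment},
          ]
      }

  example = to_chat_example(
      history=[("Old article text.", "Old comment text.")],
      new_article="New article text.",
      comment="Expected comment text.",
  )
  line = json.dumps(example)  # one line of the fine-tuning JSONL file
  ```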
- Validate data

  🚨 Make sure no error information is printed out.

  ```shell
  bash scripts/run_openai_data_validation.sh
  ```
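  The checks themselves are defined in `src/openai_data_validation.py`. The minimal sketch below is an assumption about the kind of structural validation OpenAI's chat format calls for, not the project's actual checks:

  ```python
  import json

  VALID_ROLES = {"system", "user", "assistant"}

  def validate_jsonl_line(line):
      """Return a list of format errors for one chat-format JSONL line."""
      errors = []
      record = json.loads(line)
      messages = record.get("messages")
      if not isinstance(messages, list) or not messages:
          return ["missing or empty 'messages' list"]
      for i, msg in enumerate(messages):
          if msg.get("role") not in VALID_ROLES:
              errors.append(f"message {i}: invalid role {msg.get('role')!r}")
          if not isinstance(msg.get("content"), str):
              errors.append(f"message {i}: 'content' must be a string")
      # Fine-tuning examples should end with the assistant's completion.
      if messages[-1].get("role") != "assistant":
          errors.append("last message should be the assistant completion")
      return errors
  ```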
- Fine-tune model

  Add `OPENAI_API_KEY` and its value to the `.env` file, then run the following command to fine-tune the model.

  ```shell
  bash scripts/run_fine_tune.sh
  ```

  The processed file will be uploaded to the OpenAI server, and the file information will be saved in `models/file-xxx.json`.

- Check job status

  To check the job status, run the following command.

  ```shell
  bash scripts/run_openai_check_job.sh
  ```

  If the job is completed, the file `models/ftjob-xxx.json` will be created.
- Generate comment

  Update the values of `OPENAI_API_KEY`, `OPENAI_FINE_TUNED_MODEL_ID` (found in `models/ftjob-xxx.json`), and `OPENAI_TEMPERATURE` in the file `src/inference.py`. Also, set some testing data or read it from a file to generate comments using the fine-tuned model.

  ```shell
  bash scripts/run_inference.sh
  ```
- Install Docker and Docker Compose

  Follow the official Docker and Docker Compose installation instructions.
- Build and start Docker container

  Before running the Docker container, make sure the fine-tuning job is completed and all values in `.env` are updated (including `OPENAI_API_KEY`, `OPENAI_FINE_TUNED_MODEL_ID`, and `OPENAI_TEMPERATURE`). Then run the following command to build and start the Docker container locally; the service will be live at http://127.0.0.1:8080.

  ```shell
  docker compose --env-file .env up --build
  ```
- Send POST request to generate comment

  Use Postman to send a POST request to http://127.0.0.1:8080/infer with the following body to generate a comment.

  ```json
  {
    "history": [
      ["This is first test article.", "this is a test comment."],
      ["This is secondary test article", "this is a secondary test comment."]
    ],
    "new_article": "this is a new article."
  }
  ```
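  If Postman is not at hand, the same request can be scripted with Python's standard library. The payload and the `/infer` endpoint below come from the example above; `send_infer` is just an illustrative helper, not part of the project:

  ```python
  import json
  from urllib import request

  # The request body from the example above.
  payload = {
      "history": [
          ["This is first test article.", "this is a test comment."],
          ["This is secondary test article", "this is a secondary test comment."],
      ],
      "new_article": "this is a new article.",
  }

  def send_infer(url="http://127.0.0.1:8080/infer"):
      """POST the JSON payload to the /infer endpoint and return the parsed response."""
      req = request.Request(
          url,
          data=json.dumps(payload).encode("utf-8"),
          headers={"Content-Type": "application/json"},
          method="POST",
      )
      with request.urlopen(req, timeout=60) as resp:
          return json.loads(resp.read().decode("utf-8"))
  ```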
- Register Docker Hub account

  Register a Docker Hub account before the following steps and note the account name.
- Install Paperspace CLI

  Follow the Paperspace CLI instructions to install and configure the Paperspace SDK.
- Set secret values in Paperspace Secrets

  Log in to Paperspace and create two name-value pairs under `Paperspace --> Account --> Team settings --> Secrets`: `OPENAI_API_KEY` and `OPENAI_FINE_TUNED_MODEL_ID` (found in `models/ftjob-xxx.json`).

- Deploy model API to Paperspace

  ```shell
  ./run_build_and_deployment.sh <docker_hub_account_name> <paperspace_api_key> <paperspace_project_id>
  ```
- Send POST request to generate comment

  Use Postman to send a POST request to http://<paperspace_deployment_endpoint>/infer with the following body to generate a comment. The deployment endpoint can be found in `Paperspace --> <Project> --> Deployments --> <deployment_name> --> Endpoint`.

  ```json
  {
    "history": [
      ["This is first test article.", "this is a test comment."],
      ["This is secondary test article", "this is a secondary test comment."]
    ],
    "new_article": "this is a new article."
  }
  ```
- Optimize the prompt for model fine-tuning
- Try different window sizes for the model
- Create UI for comment generation
- The project Docker image is built using Python 3.10; however, the Paperspace deployment uses Python 3.8 because of the Paperspace SDK. Therefore, the Docker image should be built with Python 3.8 in the future.