This repo holds the code of our solution for the Sign Spotting Challenge at ECCV (Multi-Shot Supervised Learning track).
Our team ranked 3rd in the final test phase.
Team leader: Xilin Chen
Team members: Yuecong Min, Peiqi Jiao, Aiming Hao
We provide features extracted from multiple modalities for sign spotting; with them, the spotting model can be trained in about ten minutes and achieves acceptable performance. To reproduce the results, you need:
Our solution is based on OpenCV and PyTorch; we provide the requirements for a conda environment:
conda create --name <env> --file requirements.txt
The extracted features and the trained model can be downloaded from Google Drive. After downloading the extracted features, unzip them into the dataset folder:
unzip "extracted features.zip" -d ./dataset/
The expected directory tree is as follows:
.
├── configs
│   ├── test
│   └── train
├── dataset
│   ├── data_preprocess.sh
│   └── MSSL_dataset
│       ├── final_train_input.txt
│       ├── train_input.txt
│       ├── valid_input.txt
│       ├── test_input.txt
│       ├── TRAIN
│       │   ├── MSSL_TRAIN_SET_GT.pkl
│       │   └── MSSL_TRAIN_SET_GT_TXT
│       ├── VALIDATION
│       │   ├── MSSL_VAL_SET_GT.pkl
│       │   └── MSSL_VAL_SET_GT_TXT
│       └── processed
│           ├── features
│           │   ├── flow
│           │   ├── mask_video
│           │   ├── skeleton
│           │   └── video
│           ├── test
│           │   ├── clipwise_label
│           │   └── framewise_label
│           ├── train
│           │   ├── clipwise_label
│           │   └── framewise_label
│           └── valid
│               ├── clipwise_label
│               └── framewise_label
├── weights
│   └── final_model.pth
└── submission
    ├── ref
    │   └── ground_truth.pkl
    └── res
        └── predictions.pkl
For evaluation with the provided model (./weights/final_model.pth), simply run:
python generate_predictions.py
The final predictions can be found in submission/prediction_validate/res/predictions.pkl.
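The exact structure of predictions.pkl is defined by the challenge toolkit; as a quick sanity check of the output, it can be loaded with pickle (illustrative only, the printed structure depends on the toolkit's format):

import pickle

# load the produced predictions and inspect their structure
with open("submission/prediction_validate/res/predictions.pkl", "rb") as f:
    preds = pickle.load(f)
print(type(preds))
# if the file is a dict keyed by video id, peek at a few entries
if isinstance(preds, dict):
    for key in list(preds)[:3]:
        print(key, preds[key])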
For training the final spotting model with the extracted features, simply run (it takes about 10 minutes):
python main.py --config ./configs/train/fusion_detector.yml
The feature extraction process generates cropped videos, optical flow, skeletons, and masked videos. To obtain the skeleton data, we adopt MediaPipe for pose and hand estimation, which should be installed first.
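The actual preprocessing is driven by data_preprocess.sh; as a minimal sketch of the skeleton step, MediaPipe's Holistic solution can be run per frame roughly as follows (the helper name and returned structure are illustrative, not the repo's actual interface):

import cv2
import mediapipe as mp

def extract_skeleton(video_path):
    # hypothetical helper: collect pose and hand landmarks for every frame
    landmarks = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            landmarks.append((results.pose_landmarks,
                              results.left_hand_landmarks,
                              results.right_hand_landmarks))
    cap.release()
    return landmarks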
Download the dataset provided by the challenge, put it in ./dataset/, and then run the script:
cd dataset
bash data_preprocess.sh
The organization is expected as follows:
dataset
├── data_preprocess.sh
└── MSSL_dataset
    ├── final_train_input.txt
    ├── train_input.txt
    ├── valid_input.txt
    ├── test_input.txt
    ├── TRAIN
    ├── VALIDATION
    ├── MSSL_TEST_SET_VIDEOS
    └── processed
        ├── train_pose.pkl
        ├── valid_pose.pkl
        ├── test_pose.pkl
        ├── train
        │   ├── original_video
        │   ├── video
        │   ├── flow
        │   ├── pose
        │   ├── clipwise_label
        │   └── framewise_label
        ├── valid
        │   ├── original_video
        │   ├── video
        │   ├── flow
        │   ├── pose
        │   ├── clipwise_label
        │   └── framewise_label
        └── test
            ├── original_video
            ├── video
            ├── flow
            ├── pose
            ├── clipwise_label
            └── framewise_label
We adopt a two-round training scheme for feature extraction. In the first round, only a subset that contains clips of the query signs is built, to increase the discriminative ability of the backbone; in the second round, all clips are used for training.
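As an illustration of the first-round subset (the data structures below are hypothetical, not the repo's actual format), the idea is simply to keep the clips whose label belongs to the query vocabulary:

# illustrative only: keep clips labeled with a query sign
def build_query_subset(clips, query_signs):
    # clips: list of (clip_id, sign_label); query_signs: set of query labels
    return [(cid, lab) for cid, lab in clips if lab in query_signs]

clips = [("video0_clip3", 17), ("video0_clip4", 102), ("video1_clip0", 17)]
print(build_query_subset(clips, query_signs={17, 42}))
# -> [('video0_clip3', 17), ('video1_clip0', 17)]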
For the first-round training, run the commands:
python main.py --config ./configs/train/video_config.yml
python main.py --config ./configs/train/mask_video_config.yml
python main.py --config ./configs/train/skeleton_config.yml
Then select the best (validation) or last (test) weights for the second-round training by setting remove_bg=False and weights=<path_to_best_weight> in the config files, and run the above commands again.
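The edit can be made by hand; for several configs, a small script can patch them instead. The helper below is hypothetical and assumes remove_bg and weights are top-level keys in the YAML files:

import yaml

def make_second_round_config(src, dst, weight_path):
    # load a first-round config, point it at the chosen checkpoint,
    # disable background removal, and save it as the second-round config
    with open(src) as f:
        cfg = yaml.safe_load(f)
    cfg["remove_bg"] = False
    cfg["weights"] = weight_path
    with open(dst, "w") as f:
        yaml.safe_dump(cfg, f)

make_second_round_config("./configs/train/video_config.yml",
                         "./configs/train/video_config_round2.yml",
                         "./weights/video_round1_best.pth")  # hypothetical paths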
For feature extraction, modify the weight paths in feature_extraction.py, and run:
python feature_extraction.py
which will generate the features as shown in Step 1.
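The on-disk feature format is whatever feature_extraction.py writes under ./dataset/processed/features/; assuming, purely for illustration, one .npy array of shape (num_frames, feat_dim) per video, the features could be consumed like this (file name hypothetical):

import numpy as np

# load one video's extracted features for a given modality
feat = np.load("./dataset/processed/features/video/sample_video.npy")
print(feat.shape)  # e.g. (num_frames, feat_dim)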
This work builds on the following code and pretrained models:
I3D code and pretrained model
P3D code and pretrained model
ST-GCN
For more information, please contact Yuecong Min (yuecong.min [AT] vipl.ict.ac.cn)