
clip4str's Introduction

CLIP4STR

This is a dedicated re-implementation of CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model.

News

  • [02/05/2024] Add new CLIP4STR models pre-trained on DataComp-1B, LAION-2B, and DFN-5B. Add CLIP4STR models trained on RBU(6.5M).

Introduction

This is a third-party implementation of the paper CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model.

Figure: The framework of CLIP4STR. It has a visual branch and a cross-modal branch. The cross-modal branch refines the prediction of the visual branch for the final output. The text encoder is partially frozen.

CLIP4STR aims to build a scene text recognizer on top of a pre-trained vision-language model. In this re-implementation, we try to reproduce the performance of the original paper and evaluate the effectiveness of pre-trained VL models in the STR area.

Installation

Prepare data

First of all, you need to download the STR datasets.

Generally, directories are organized as follows:

${ABSOLUTE_ROOT}
├── dataset
│   │
│   ├── str_dataset_ub
│   └── str_dataset           
│       ├── train
│       │   ├── real
│       │   └── synth
│       ├── val     
│       └── test
│
├── code              
│   │
│   └── clip4str
│
├── output (save the output of the program)
│
│
├── pretrained
│   └── clip (download the CLIP pre-trained weights and put them here)
│       └── ViT-B-16.pt
│
...
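
If you are setting this up from scratch, the layout can be created ahead of time; the commands below are only a sketch (the root path is a placeholder, and the datasets and CLIP weights still have to be downloaded separately):

# Sketch: create the expected directory layout under a placeholder root path.
ROOT=/YOUR/ABSOLUTE/ROOT
mkdir -p ${ROOT}/dataset/str_dataset/train/real ${ROOT}/dataset/str_dataset/train/synth
mkdir -p ${ROOT}/dataset/str_dataset/val ${ROOT}/dataset/str_dataset/test ${ROOT}/dataset/str_dataset_ub
mkdir -p ${ROOT}/code/clip4str ${ROOT}/output ${ROOT}/pretrained/clip
# Put the downloaded CLIP weights here, e.g. ${ROOT}/pretrained/clip/ViT-B-16.pt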

Dependency

Requires Python >= 3.8 and PyTorch >= 1.12. The following commands are tested on a Linux machine with CUDA Driver Version 525.105.17 and CUDA Version 11.3.

conda create --name clip4str python=3.8.5
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 -c pytorch
pip install -r requirements.txt 
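
A quick sanity check (a sketch, not part of the repository's scripts) confirms the installed PyTorch version and that CUDA is visible:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"   # expect 1.12.0 and True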

If you encounter problems when resuming training from an intermediate checkpoint, try upgrading PyTorch:

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

Results

CLIP4STR pre-trained on OpenAI WIT-400M

CLIP4STR-B uses CLIP ViT-B/16 as the backbone, and CLIP4STR-L uses CLIP ViT-L/14.

| Method | Train data | IIIT5K (3,000) | SVT (647) | IC13 (1,015) | IC15 (1,811) | IC15 (2,077) | SVTP (645) | CUTE (288) | HOST (2,416) | WOST (2,416) |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4STR-B | MJ+ST | 97.70 | 95.36 | 96.06 | 87.47 | 84.02 | 91.47 | 94.44 | 80.01 | 86.75 |
| CLIP4STR-L | MJ+ST | 97.57 | 95.36 | 96.75 | 88.02 | 84.40 | 91.78 | 94.44 | 81.08 | 87.38 |
| CLIP4STR-B | Real(3.3M) | 99.20 | 98.30 | 98.23 | 91.44 | 90.61 | 96.90 | 99.65 | 77.36 | 87.87 |
| CLIP4STR-L | Real(3.3M) | 99.43 | 98.15 | 98.52 | 91.66 | 91.14 | 97.36 | 98.96 | 79.22 | 89.07 |

| Method | Train data | COCO (9,825) | ArT (35,149) | Uber (80,551) | Checkpoint |
|---|---|---|---|---|---|
| CLIP4STR-B | MJ+ST | 66.69 | 72.82 | 43.52 | a5e3386222 |
| CLIP4STR-L | MJ+ST | 67.45 | 73.48 | 44.59 | 3544c362f0 |
| CLIP4STR-B | Real(3.3M) | 80.80 | 85.74 | 86.70 | d70bde1f2d |
| CLIP4STR-L | Real(3.3M) | 81.97 | 85.83 | 87.36 | f125500adc |

CLIP4STR pre-trained on DataComp-1B, LAION-2B, and DFN-5B

All models are trained on RBU(6.5M).

| Method | Pre-train | Train | IIIT5K (3,000) | SVT (647) | IC13 (1,015) | IC15 (1,811) | IC15 (2,077) | SVTP (645) | CUTE (288) | HOST (2,416) | WOST (2,416) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4STR-B | DC-1B | RBU | 99.5 | 98.3 | 98.6 | 91.4 | 91.1 | 98.0 | 99.0 | 79.3 | 88.8 |
| CLIP4STR-L | DC-1B | RBU | 99.6 | 98.6 | 99.0 | 91.9 | 91.4 | 98.1 | 99.7 | 81.1 | 90.6 |
| CLIP4STR-H | LAION-2B | RBU | 99.7 | 98.6 | 98.9 | 91.6 | 91.1 | 98.5 | 99.7 | 80.6 | 90.0 |
| CLIP4STR-H | DFN-5B | RBU | 99.5 | 99.1 | 98.9 | 91.7 | 91.0 | 98.0 | 99.0 | 82.6 | 90.9 |

| Method | Pre-train | Train | COCO (9,825) | ArT (35,149) | Uber (80,551) | Log | Checkpoint |
|---|---|---|---|---|---|---|---|
| CLIP4STR-B | DC-1B | RBU | 81.3 | 85.8 | 92.1 | 6e9fe947ac_log | 6e9fe947ac, BaiduYun |
| CLIP4STR-L | DC-1B | RBU | 82.7 | 86.4 | 92.2 | 3c9d881b88_log | 3c9d881b88, BaiduYun |
| CLIP4STR-H | LAION-2B | RBU | 82.5 | 86.2 | 91.2 | 5eef9f86e2_log | 5eef9f86e2, BaiduYun |
| CLIP4STR-H | DFN-5B | RBU | 83.0 | 86.4 | 91.7 | 3e942729b1_log | 3e942729b1, BaiduYun |

Training

  • Before training, set the paths properly. Find every /PUT/YOUR/PATH/HERE placeholder in configs, scripts, strhub/vl_str, and strhub/str_adapter (for example, in configs/main.yaml) and replace it with your own path. A global search and replace is recommended, as sketched below.
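
A one-shot replacement could look like the following sketch (GNU sed is assumed, and /YOUR/ABSOLUTE/ROOT is a placeholder for your own root directory):

# Sketch: replace the placeholder path in the config and script directories (GNU sed assumed).
grep -rl "/PUT/YOUR/PATH/HERE" configs scripts strhub/vl_str strhub/str_adapter \
    | xargs sed -i "s|/PUT/YOUR/PATH/HERE|/YOUR/ABSOLUTE/ROOT|g"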

For CLIP4STR with CLIP-ViT-B, refer to

bash scripts/vl4str_base.sh

Training requires 8 NVIDIA GPUs with more than 24GB of memory each. If you have fewer GPUs, change trainer.gpus=A, trainer.accumulate_grad_batches=B, and model.batch_size=C in the bash scripts so that A * B * C = 1024. Do not modify the code; PyTorch Lightning handles the rest. A valid combination is sketched below.
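
For example, with 2 GPUs one combination satisfying A * B * C = 1024 is 2 * 8 * 64. The call below is only a sketch of the hydra overrides in the same style the provided scripts pass to train.py (model=vl4str and dataset=real are taken from an override list shown in an issue log further down; adjust them to your setup):

# Sketch: 2 GPUs x 8 gradient-accumulation steps x batch size 64 = 1024 effective batch size.
python train.py model=vl4str dataset=real \
    trainer.gpus=2 trainer.accumulate_grad_batches=8 model.batch_size=64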

For CLIP4STR with CLIP-ViT-L, refer to

bash scripts/vl4str_large.sh

We also provide the training script for CLIP4STR + Adapter, as described in the original paper:

bash scripts/str_adapter.sh

Inference

bash scripts/test.sh {gpu_id} {subpath_of_ckpt}

For example,

bash scripts/test.sh 0 clip4str_base16x16_d70bde1f2d.ckpt

If you want to read characters from an image, try:

bash scripts/read.sh {gpu_id} {subpath_of_ckpt} {image_folder_path}

For example,

bash scripts/read.sh 0 clip4str_base16x16_d70bde1f2d.ckpt misc/test_images

Output:
image_1576.jpeg: Chicken
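
read.py can also be invoked directly, bypassing the wrapper script; the call below is a sketch (the checkpoint path is a placeholder; the --images_path and --device flags appear in the issue reports further down):

python read.py /YOUR/ABSOLUTE/ROOT/output/clip4str_base16x16_d70bde1f2d.ckpt \
    --images_path misc/test_images --device cuda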

Citations

@article{zhao2023clip4str,
  title={Clip4str: A simple baseline for scene text recognition with pre-trained vision-language model},
  author={Zhao, Shuai and Quan, Ruijie and Zhu, Linchao and Yang, Yi},
  journal={arXiv preprint arXiv:2305.14014},
  year={2023}
}

Acknowledgements

clip4str's People

Contributors

mzhaoshuai, VamosC

clip4str's Issues

Error locating target for VL4STR

Thank you for your great work!
I tried to run train.py (not from a pretrained checkpoint) on Google Colab and I get this error:

Error executing job with overrides: []
Error locating target 'strhub.models.vl_str.system.VL4STR', see chained exception above.
full_key: model
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I have tried using an absolute path for the "target" field in clip4str\configs\model\vl4str.yaml, but I still get the above error.
I am using the hydra-core version pinned in requirements.txt (1.2.0).
Do you have any suggestions? Thank you!

Inference only

I want to run inference on my own dataset without downloading the training sets. Is there a way to modify the code to do this?

Bug when running a bash script

I ran this in Google Colab and got an error in the bash file. How can I fix this so that I can try the model?

The provided lr scheduler `OneCycleLR` doesn't follow PyTorch's LRScheduler API

Thank you for your great work!
I am trying to use Multilingual-CLIP to train CLIP4STR for Vietnamese (with a charset of 229 tokens) on Google Colab.
I have changed the charset and the code in strhub/models/vl_str/system.py and other files so that I can use the text encoder from Multilingual-CLIP for Vietnamese.
Now I am getting the following error from the learning-rate scheduler:

The dimension of the visual decoder is 768.
Len of Tokenizer 232
Done creating model!
| Name | Type | Params

0 | clip_model | CLIP | 427 M
1 | clip_model.visual | VisionTransformer | 303 M
2 | clip_model.transformer | Transformer | 85.1 M
3 | clip_model.token_embedding | Embedding | 37.9 M
4 | clip_model.ln_final | LayerNorm | 1.5 K
5 | M_clip_model | MultilingualCLIP | 560 M
6 | M_clip_model.transformer | XLMRobertaModel | 559 M
7 | M_clip_model.LinearTransformation | Linear | 787 K
8 | visual_decoder | Decoder | 9.8 M
9 | visual_decoder.layers | ModuleList | 9.5 M
10 | visual_decoder.text_embed | TokenEmbedding | 178 K
11 | visual_decoder.norm | LayerNorm | 1.5 K
12 | visual_decoder.dropout | Dropout | 0
13 | visual_decoder.head | Linear | 176 K
14 | cross_decoder | Decoder | 9.8 M
15 | cross_decoder.layers | ModuleList | 9.5 M
16 | cross_decoder.text_embed | TokenEmbedding | 178 K
17 | cross_decoder.norm | LayerNorm | 1.5 K
18 | cross_decoder.dropout | Dropout | 0
19 | cross_decoder.head | Linear | 176 K

675 M Trainable params
332 M Non-trainable params
1.0 B Total params
4,031.815 Total estimated model params size (MB)
[dataset] mean (0.48145466, 0.4578275, 0.40821073), std (0.26862954, 0.26130258, 0.27577711)
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:117: UserWarning: When using Trainer(accumulate_grad_batches != 1) and overriding LightningModule.optimizer_{step,zero_grad}, the hooks will not be called on every batch (rather, they are called on every optimization step).
rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[VL4STR] The length of encoder params with and without weight decay is 259 and 479, respectively.
[VL4STR] The length of decoder params with and without weight decay is 14 and 38, respectively.
Loading train_dataloader to estimate number of stepping batches.
dataset root: /content/drive/MyDrive/clip4str/dataset/str_dataset/train/real
lmdb: ArT num samples: 34984
lmdb: The number of training samples is 34984
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Error executing job with overrides: []
Traceback (most recent call last):
File "/content/drive/MyDrive/clip4str/code/clip4str/train.py", line 145, in
main()
File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 90, in decorated_main
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 216, in run_and_report
raise ex
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/content/drive/MyDrive/clip4str/code/clip4str/train.py", line 104, in main
trainer.fit(model, datamodule=datamodule, ckpt_path=config.ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1217, in _run
self.strategy.setup(self)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/single_device.py", line 72, in setup
super().setup(trainer)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 139, in setup
self.setup_optimizers(trainer)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 128, in setup_optimizers
self.optimizers, self.lr_scheduler_configs, self.optimizer_frequencies = _init_optimizers_and_lr_schedulers(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 195, in _init_optimizers_and_lr_schedulers
_validate_scheduler_api(lr_scheduler_configs, model)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 350, in _validate_scheduler_api
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: The provided lr scheduler OneCycleLR doesn't follow PyTorch's LRScheduler API. You should override the LightningModule.lr_scheduler_step hook with your own logic if you are using a custom LR scheduler.

I cannot see any problem with OneCycleLR. Do you have any suggestions on this matter? Could it be a package-version issue?

Inference error

When I run python read.py clip4str_large_3c9d881b88.pt --images_path misc/test_image/, the following error occurs:

root@e33ba27efab3:/workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main# python read.py clip4str_large_3c9d881b88.pt --images_path misc/test_image/
[2024-07-03 13:45:24,525] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Additional keyword arguments: {}

config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False

Try to load CLIP model from /workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin

config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False

Try to load CLIP model from /workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main/ViT-L-14.pt
loading checkpoint from /workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main/ViT-L-14.pt
The dimension of the visual decoder is 768.
Traceback (most recent call last):
File "/workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main/strhub/models/utils.py", line 108, in load_from_checkpoint
model = ModelClass.load_from_checkpoint(checkpoint_path, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 203, in _load_model_state
model = cls(**_cls_kwargs)
File "/workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main/strhub/models/vl_str/system.py", line 77, in __init__
assert os.path.exists(kwargs["clip_pretrained"])
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main/read.py", line 54, in
main()
File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main/read.py", line 37, in main
model = load_from_checkpoint(args.checkpoint, **kwargs).eval().to(args.device)
File "/workspace/data_dir/data_user/zyy/OCR/CLIP4STR-main/strhub/models/utils.py", line 117, in load_from_checkpoint
model.load_state_dict(checkpoint)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VL4STR:
Missing key(s) in state_dict: "clip_model.positional_embedding", "clip_model.text_projection", "clip_model.logit_scale", "clip_model.visual.class_embedding", "clip_model.visual.positional_embedding", "clip_model.visual.proj", "clip_model.visual.conv1.weight", "clip_model.visual.ln_pre.weight", "clip_model.visual.ln_pre.bias", "clip_model.visual.transformer.resblocks.0.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.0.attn.in_proj_bias", "clip_mode...........

Error While Inferencing

Hello @VamosC @mzhaoshuai

I keep facing the same error:

File "/home/lincode/Documents/OCR/CLIP4STR-main/strhub/models/utils.py", line 110, in load_from_checkpoint
ModelClass, experiment = _get_model_class(checkpoint_path)
File "/home/lincode/Documents/OCR/CLIP4STR-main/strhub/models/utils.py", line 77, in _get_model_class
raise InvalidModelError("Unable to find model class for '{}'".format(key))
strhub.models.utils.InvalidModelError: Unable to find model class for 'clip.ckpt'

I do not understand how to proceed; can anyone help me out?
I have renamed clip4str_base16x16_d70bde1f2d.pt to clip.pt.

The command I ran (inside the cloned CLIP4STR folder) is

python read.py clip.ckpt --images_path abspath/misc/test_image

Thanks in advance. You guys are making me love AI more. One day I'd like to implement papers the way you do; keep going.

Issue with inference !

Hi @VamosC. Thanks for the great work! I'm using your project for my school project, but when I run python3 read.py checkpoint /mnt/d/Users/Downloads/ocr_project-dev/modules/clip4str_base16x16_d70bde1f2d.ckpt --images_path /mnt/d/Users/Downloads/ocr_project-dev/modules/misc/test_image/image_1576.jpeg --device cuda, I encounter this error:

raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'hydra.conf.JobConf.JobConfig.OverrideDirname'> for field override_dirname is not allowed: use default_factory

Can you help me? Once again, thanks @VamosC!

Recognize French accents

Hi @mzhaoshuai, can your model recognize French accented words such as "café" or "réveillait"? I tried, but it does not seem to detect them. Is there any way to handle them? Hope you respond soon! Thanks

Using the LaTeX dataset to train CLIP4STR

Hello, thank you for your published paper and the open model. I am preparing to use your method to train on LaTeX-type data, such as im2latex. I would like to ask for your opinion on this task.

I currently have two concerns:

  1. The textual data corresponding to LaTeX images carries less semantic meaning than in STR (scene text recognition) tasks. I'm unsure whether the CLIP4STR method is applicable and whether it has an advantage over TrOCR.
  2. The character set for LaTeX recognition far exceeds the 94 characters of the English set. For example, the formula-recognition model trained on TrOCR, as seen in this link, has approximately 1,200+ tokens.

I would greatly appreciate any advice you may propose.

CLIP4STR attention maps

This is excellent work, and I am very interested in the CLIP attention maps it shows.

Could you share the code used to generate the CLIP attention maps in the paper?

Thank you very much.

How to fine-tune in Korean (or other languages)

Before I ask my questions, I would like to thank you for sharing this useful information.

I have three questions:

  1. I tried to use the Multilingual-CLIP you mentioned in issue #1, but the only ViT-B models it provides are ViT-B/16+ and ViT-B/32. Is it correct that only ViT-B/16 can serve as the pre-trained CLIP base model of CLIP4STR?

  2. Is there a way to produce inference results in other languages without additional fine-tuning?

  3. Do you have any plans to write a guide for fine-tuning CLIP4STR in another language?

Inference time on CPU/GPU

It would be nice if you could add latency results to the README as well. I am planning to use this for an industry application, but before experimenting it would be good to know whether it is even a feasible option (I have an SLA of about 1 second per image).

Issue with inference

Hi, I am trying to perform inference using the following script:
bash code/clip4str/scripts/read.sh 7 clip4str_b_plus.ckpt /home/shreyans/scratch/tata1mg/clip4str_og/code/clip4str/misc/test_image

The error I get is:

Additional keyword arguments: {}
args.checkpoint /home/shreyans/scratch/tata1mg/clip4str_og/output/clip4str_base16x16_d70bde1f2d.ckpt

config of VL4STR:
image_freeze_nlayer: 0, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False

config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False

loading checkpoint from /home/shreyans/scratch/tata1mg/clip4str_og/pretrained/clip/ViT-B-16.pt
The dimension of the visual decoder is 512.
Traceback (most recent call last):
File "/DATA/scratch/shreyans/tata1mg/clip4str_og/code/clip4str/strhub/models/utils.py", line 104, in load_from_checkpoint
model = ModelClass.load_from_checkpoint(checkpoint_path, **kwargs)
File "/home/shreyans/scratch/miniconda3/envs/clip4str/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
File "/home/shreyans/scratch/miniconda3/envs/clip4str/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 203, in _load_model_state
model = cls(**_cls_kwargs)
File "/DATA/scratch/shreyans/tata1mg/clip4str_og/code/clip4str/strhub/models/vl_str/system.py", line 70, in __init__
assert os.path.exists(kwargs["clip_pretrained"])
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/shreyans/scratch/tata1mg/clip4str_og/code/clip4str/read.py", line 54, in
main()
File "/home/shreyans/scratch/miniconda3/envs/clip4str/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/shreyans/scratch/tata1mg/clip4str_og/code/clip4str/read.py", line 37, in main
model = load_from_checkpoint(args.checkpoint, **kwargs).eval().to(args.device)
File "/DATA/scratch/shreyans/tata1mg/clip4str_og/code/clip4str/strhub/models/utils.py", line 113, in load_from_checkpoint
model.load_state_dict(checkpoint)
File "/home/shreyans/scratch/miniconda3/envs/clip4str/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VL4STR:
Missing key(s) in state_dict: "clip_model.positional_embedding", "clip_model.text_projection", "clip_model.logit_scale", "clip_model.visual.class_embedding", "clip_model.visual.positional_embedding", "clip_model.visual.proj", "clip_model.visual.conv1.weight", "clip_model.visual.ln_pre.weight", "clip_model.visual.ln_pre.bias", "clip_model.visual.transformer.resblocks.0.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.0.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.0.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.0.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.0.ln_1.weight", "clip_model.visual.transformer.resblocks.0.ln_1.bias", "clip_model.visual.transformer.resblocks.0.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.0.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.0.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.0.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.0.ln_2.weight", "clip_model.visual.transformer.resblocks.0.ln_2.bias", "clip_model.visual.transformer.resblocks.1.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.1.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.1.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.1.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.1.ln_1.weight", "clip_model.visual.transformer.resblocks.1.ln_1.bias", "clip_model.visual.transformer.resblocks.1.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.1.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.1.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.1.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.1.ln_2.weight", "clip_model.visual.transformer.resblocks.1.ln_2.bias", "clip_model.visual.transformer.resblocks.2.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.2.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.2.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.2.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.2.ln_1.weight", "clip_model.visual.transformer.resblocks.2.ln_1.bias", "clip_model.visual.transformer.resblocks.2.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.2.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.2.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.2.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.2.ln_2.weight", "clip_model.visual.transformer.resblocks.2.ln_2.bias", "clip_model.visual.transformer.resblocks.3.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.3.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.3.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.3.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.3.ln_1.weight", "clip_model.visual.transformer.resblocks.3.ln_1.bias", "clip_model.visual.transformer.resblocks.3.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.3.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.3.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.3.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.3.ln_2.weight", "clip_model.visual.transformer.resblocks.3.ln_2.bias", "clip_model.visual.transformer.resblocks.4.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.4.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.4.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.4.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.4.ln_1.weight", 
"clip_model.visual.transformer.resblocks.4.ln_1.bias", "clip_model.visual.transformer.resblocks.4.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.4.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.4.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.4.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.4.ln_2.weight", "clip_model.visual.transformer.resblocks.4.ln_2.bias", "clip_model.visual.transformer.resblocks.5.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.5.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.5.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.5.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.5.ln_1.weight", "clip_model.visual.transformer.resblocks.5.ln_1.bias", "clip_model.visual.transformer.resblocks.5.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.5.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.5.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.5.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.5.ln_2.weight", "clip_model.visual.transformer.resblocks.5.ln_2.bias", "clip_model.visual.transformer.resblocks.6.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.6.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.6.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.6.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.6.ln_1.weight", "clip_model.visual.transformer.resblocks.6.ln_1.bias", "clip_model.visual.transformer.resblocks.6.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.6.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.6.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.6.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.6.ln_2.weight", "clip_model.visual.transformer.resblocks.6.ln_2.bias", "clip_model.visual.transformer.resblocks.7.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.7.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.7.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.7.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.7.ln_1.weight", "clip_model.visual.transformer.resblocks.7.ln_1.bias", "clip_model.visual.transformer.resblocks.7.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.7.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.7.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.7.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.7.ln_2.weight", "clip_model.visual.transformer.resblocks.7.ln_2.bias", "clip_model.visual.transformer.resblocks.8.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.8.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.8.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.8.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.8.ln_1.weight", "clip_model.visual.transformer.resblocks.8.ln_1.bias", "clip_model.visual.transformer.resblocks.8.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.8.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.8.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.8.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.8.ln_2.weight", "clip_model.visual.transformer.resblocks.8.ln_2.bias", "clip_model.visual.transformer.resblocks.9.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.9.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.9.attn.out_proj.weight", 
"clip_model.visual.transformer.resblocks.9.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.9.ln_1.weight", "clip_model.visual.transformer.resblocks.9.ln_1.bias", "clip_model.visual.transformer.resblocks.9.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.9.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.9.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.9.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.9.ln_2.weight", "clip_model.visual.transformer.resblocks.9.ln_2.bias", "clip_model.visual.transformer.resblocks.10.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.10.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.10.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.10.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.10.ln_1.weight", "clip_model.visual.transformer.resblocks.10.ln_1.bias", "clip_model.visual.transformer.resblocks.10.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.10.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.10.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.10.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.10.ln_2.weight", "clip_model.visual.transformer.resblocks.10.ln_2.bias", "clip_model.visual.transformer.resblocks.11.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.11.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.11.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.11.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.11.ln_1.weight", "clip_model.visual.transformer.resblocks.11.ln_1.bias", "clip_model.visual.transformer.resblocks.11.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.11.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.11.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.11.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.11.ln_2.weight", "clip_model.visual.transformer.resblocks.11.ln_2.bias", "clip_model.visual.ln_post.weight", "clip_model.visual.ln_post.bias", "clip_model.transformer.resblocks.0.attn.in_proj_weight", "clip_model.transformer.resblocks.0.attn.in_proj_bias", "clip_model.transformer.resblocks.0.attn.out_proj.weight", "clip_model.transformer.resblocks.0.attn.out_proj.bias", "clip_model.transformer.resblocks.0.ln_1.weight", "clip_model.transformer.resblocks.0.ln_1.bias", "clip_model.transformer.resblocks.0.mlp.c_fc.weight", "clip_model.transformer.resblocks.0.mlp.c_fc.bias", "clip_model.transformer.resblocks.0.mlp.c_proj.weight", "clip_model.transformer.resblocks.0.mlp.c_proj.bias", "clip_model.transformer.resblocks.0.ln_2.weight", "clip_model.transformer.resblocks.0.ln_2.bias", "clip_model.transformer.resblocks.1.attn.in_proj_weight", "clip_model.transformer.resblocks.1.attn.in_proj_bias", "clip_model.transformer.resblocks.1.attn.out_proj.weight", "clip_model.transformer.resblocks.1.attn.out_proj.bias", "clip_model.transformer.resblocks.1.ln_1.weight", "clip_model.transformer.resblocks.1.ln_1.bias", "clip_model.transformer.resblocks.1.mlp.c_fc.weight", "clip_model.transformer.resblocks.1.mlp.c_fc.bias", "clip_model.transformer.resblocks.1.mlp.c_proj.weight", "clip_model.transformer.resblocks.1.mlp.c_proj.bias", "clip_model.transformer.resblocks.1.ln_2.weight", "clip_model.transformer.resblocks.1.ln_2.bias", "clip_model.transformer.resblocks.2.attn.in_proj_weight", "clip_model.transformer.resblocks.2.attn.in_proj_bias", "clip_model.transformer.resblocks.2.attn.out_proj.weight", 
"clip_model.transformer.resblocks.2.attn.out_proj.bias", "clip_model.transformer.resblocks.2.ln_1.weight", "clip_model.transformer.resblocks.2.ln_1.bias", "clip_model.transformer.resblocks.2.mlp.c_fc.weight", "clip_model.transformer.resblocks.2.mlp.c_fc.bias", "clip_model.transformer.resblocks.2.mlp.c_proj.weight", "clip_model.transformer.resblocks.2.mlp.c_proj.bias", "clip_model.transformer.resblocks.2.ln_2.weight", "clip_model.transformer.resblocks.2.ln_2.bias", "clip_model.transformer.resblocks.3.attn.in_proj_weight", "clip_model.transformer.resblocks.3.attn.in_proj_bias", "clip_model.transformer.resblocks.3.attn.out_proj.weight", "clip_model.transformer.resblocks.3.attn.out_proj.bias", "clip_model.transformer.resblocks.3.ln_1.weight", "clip_model.transformer.resblocks.3.ln_1.bias", "clip_model.transformer.resblocks.3.mlp.c_fc.weight", "clip_model.transformer.resblocks.3.mlp.c_fc.bias", "clip_model.transformer.resblocks.3.mlp.c_proj.weight", "clip_model.transformer.resblocks.3.mlp.c_proj.bias", "clip_model.transformer.resblocks.3.ln_2.weight", "clip_model.transformer.resblocks.3.ln_2.bias", "clip_model.transformer.resblocks.4.attn.in_proj_weight", "clip_model.transformer.resblocks.4.attn.in_proj_bias", "clip_model.transformer.resblocks.4.attn.out_proj.weight", "clip_model.transformer.resblocks.4.attn.out_proj.bias", "clip_model.transformer.resblocks.4.ln_1.weight", "clip_model.transformer.resblocks.4.ln_1.bias", "clip_model.transformer.resblocks.4.mlp.c_fc.weight", "clip_model.transformer.resblocks.4.mlp.c_fc.bias", "clip_model.transformer.resblocks.4.mlp.c_proj.weight", "clip_model.transformer.resblocks.4.mlp.c_proj.bias", "clip_model.transformer.resblocks.4.ln_2.weight", "clip_model.transformer.resblocks.4.ln_2.bias", "clip_model.transformer.resblocks.5.attn.in_proj_weight", "clip_model.transformer.resblocks.5.attn.in_proj_bias", "clip_model.transformer.resblocks.5.attn.out_proj.weight", "clip_model.transformer.resblocks.5.attn.out_proj.bias", "clip_model.transformer.resblocks.5.ln_1.weight", "clip_model.transformer.resblocks.5.ln_1.bias", "clip_model.transformer.resblocks.5.mlp.c_fc.weight", "clip_model.transformer.resblocks.5.mlp.c_fc.bias", "clip_model.transformer.resblocks.5.mlp.c_proj.weight", "clip_model.transformer.resblocks.5.mlp.c_proj.bias", "clip_model.transformer.resblocks.5.ln_2.weight", "clip_model.transformer.resblocks.5.ln_2.bias", "clip_model.transformer.resblocks.6.attn.in_proj_weight", "clip_model.transformer.resblocks.6.attn.in_proj_bias", "clip_model.transformer.resblocks.6.attn.out_proj.weight", "clip_model.transformer.resblocks.6.attn.out_proj.bias", "clip_model.transformer.resblocks.6.ln_1.weight", "clip_model.transformer.resblocks.6.ln_1.bias", "clip_model.transformer.resblocks.6.mlp.c_fc.weight", "clip_model.transformer.resblocks.6.mlp.c_fc.bias", "clip_model.transformer.resblocks.6.mlp.c_proj.weight", "clip_model.transformer.resblocks.6.mlp.c_proj.bias", "clip_model.transformer.resblocks.6.ln_2.weight", "clip_model.transformer.resblocks.6.ln_2.bias", "clip_model.transformer.resblocks.7.attn.in_proj_weight", "clip_model.transformer.resblocks.7.attn.in_proj_bias", "clip_model.transformer.resblocks.7.attn.out_proj.weight", "clip_model.transformer.resblocks.7.attn.out_proj.bias", "clip_model.transformer.resblocks.7.ln_1.weight", "clip_model.transformer.resblocks.7.ln_1.bias", "clip_model.transformer.resblocks.7.mlp.c_fc.weight", "clip_model.transformer.resblocks.7.mlp.c_fc.bias", "clip_model.transformer.resblocks.7.mlp.c_proj.weight", 
"clip_model.transformer.resblocks.7.mlp.c_proj.bias", "clip_model.transformer.resblocks.7.ln_2.weight", "clip_model.transformer.resblocks.7.ln_2.bias", "clip_model.transformer.resblocks.8.attn.in_proj_weight", "clip_model.transformer.resblocks.8.attn.in_proj_bias", "clip_model.transformer.resblocks.8.attn.out_proj.weight", "clip_model.transformer.resblocks.8.attn.out_proj.bias", "clip_model.transformer.resblocks.8.ln_1.weight", "clip_model.transformer.resblocks.8.ln_1.bias", "clip_model.transformer.resblocks.8.mlp.c_fc.weight", "clip_model.transformer.resblocks.8.mlp.c_fc.bias", "clip_model.transformer.resblocks.8.mlp.c_proj.weight", "clip_model.transformer.resblocks.8.mlp.c_proj.bias", "clip_model.transformer.resblocks.8.ln_2.weight", "clip_model.transformer.resblocks.8.ln_2.bias", "clip_model.transformer.resblocks.9.attn.in_proj_weight", "clip_model.transformer.resblocks.9.attn.in_proj_bias", "clip_model.transformer.resblocks.9.attn.out_proj.weight", "clip_model.transformer.resblocks.9.attn.out_proj.bias", "clip_model.transformer.resblocks.9.ln_1.weight", "clip_model.transformer.resblocks.9.ln_1.bias", "clip_model.transformer.resblocks.9.mlp.c_fc.weight", "clip_model.transformer.resblocks.9.mlp.c_fc.bias", "clip_model.transformer.resblocks.9.mlp.c_proj.weight", "clip_model.transformer.resblocks.9.mlp.c_proj.bias", "clip_model.transformer.resblocks.9.ln_2.weight", "clip_model.transformer.resblocks.9.ln_2.bias", "clip_model.transformer.resblocks.10.attn.in_proj_weight", "clip_model.transformer.resblocks.10.attn.in_proj_bias", "clip_model.transformer.resblocks.10.attn.out_proj.weight", "clip_model.transformer.resblocks.10.attn.out_proj.bias", "clip_model.transformer.resblocks.10.ln_1.weight", "clip_model.transformer.resblocks.10.ln_1.bias", "clip_model.transformer.resblocks.10.mlp.c_fc.weight", "clip_model.transformer.resblocks.10.mlp.c_fc.bias", "clip_model.transformer.resblocks.10.mlp.c_proj.weight", "clip_model.transformer.resblocks.10.mlp.c_proj.bias", "clip_model.transformer.resblocks.10.ln_2.weight", "clip_model.transformer.resblocks.10.ln_2.bias", "clip_model.transformer.resblocks.11.attn.in_proj_weight", "clip_model.transformer.resblocks.11.attn.in_proj_bias", "clip_model.transformer.resblocks.11.attn.out_proj.weight", "clip_model.transformer.resblocks.11.attn.out_proj.bias", "clip_model.transformer.resblocks.11.ln_1.weight", "clip_model.transformer.resblocks.11.ln_1.bias", "clip_model.transformer.resblocks.11.mlp.c_fc.weight", "clip_model.transformer.resblocks.11.mlp.c_fc.bias", "clip_model.transformer.resblocks.11.mlp.c_proj.weight", "clip_model.transformer.resblocks.11.mlp.c_proj.bias", "clip_model.transformer.resblocks.11.ln_2.weight", "clip_model.transformer.resblocks.11.ln_2.bias", "clip_model.token_embedding.weight", "clip_model.ln_final.weight", "clip_model.ln_final.bias", "visual_decoder.pos_queries", "visual_decoder.layers.0.self_attn.in_proj_weight", "visual_decoder.layers.0.self_attn.in_proj_bias", "visual_decoder.layers.0.self_attn.out_proj.weight", "visual_decoder.layers.0.self_attn.out_proj.bias", "visual_decoder.layers.0.cross_attn.in_proj_weight", "visual_decoder.layers.0.cross_attn.in_proj_bias", "visual_decoder.layers.0.cross_attn.out_proj.weight", "visual_decoder.layers.0.cross_attn.out_proj.bias", "visual_decoder.layers.0.linear1.weight", "visual_decoder.layers.0.linear1.bias", "visual_decoder.layers.0.linear2.weight", "visual_decoder.layers.0.linear2.bias", "visual_decoder.layers.0.norm1.weight", "visual_decoder.layers.0.norm1.bias", 
"visual_decoder.layers.0.norm2.weight", "visual_decoder.layers.0.norm2.bias", "visual_decoder.layers.0.norm_q.weight", "visual_decoder.layers.0.norm_q.bias", "visual_decoder.layers.0.norm_c.weight", "visual_decoder.layers.0.norm_c.bias", "visual_decoder.text_embed.embedding.weight", "visual_decoder.norm.weight", "visual_decoder.norm.bias", "visual_decoder.head.weight", "visual_decoder.head.bias", "cross_decoder.pos_queries", "cross_decoder.layers.0.self_attn.in_proj_weight", "cross_decoder.layers.0.self_attn.in_proj_bias", "cross_decoder.layers.0.self_attn.out_proj.weight", "cross_decoder.layers.0.self_attn.out_proj.bias", "cross_decoder.layers.0.cross_attn.in_proj_weight", "cross_decoder.layers.0.cross_attn.in_proj_bias", "cross_decoder.layers.0.cross_attn.out_proj.weight", "cross_decoder.layers.0.cross_attn.out_proj.bias", "cross_decoder.layers.0.linear1.weight", "cross_decoder.layers.0.linear1.bias", "cross_decoder.layers.0.linear2.weight", "cross_decoder.layers.0.linear2.bias", "cross_decoder.layers.0.norm1.weight", "cross_decoder.layers.0.norm1.bias", "cross_decoder.layers.0.norm2.weight", "cross_decoder.layers.0.norm2.bias", "cross_decoder.layers.0.norm_q.weight", "cross_decoder.layers.0.norm_q.bias", "cross_decoder.layers.0.norm_c.weight", "cross_decoder.layers.0.norm_c.bias", "cross_decoder.text_embed.embedding.weight", "cross_decoder.norm.weight", "cross_decoder.norm.bias", "cross_decoder.head.weight", "cross_decoder.head.bias".
Unexpected key(s) in state_dict: "epoch", "global_step", "pytorch-lightning_version", "state_dict", "loops", "callbacks", "optimizer_states", "lr_schedulers", "NativeMixedPrecisionPlugin", "hparams_name", "hyper_parameters"

I have done everything as mentioned; what could be the reason?

Training in another language

I also referred to #1 to proceed with training in another language, following #1 (comment), and downloaded this model (https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-L-14/resolve/main/pytorch_model.bin). I tried to run training after setting it as clip_pretrained, but the following error occurred.

<loading checkpoint from /data1/home/ict07/HSJ/CLIP4STR/pretrained/clip/XLM-Roberta-Large-Vit-L-14.bin
Error executing job with overrides: ['+experiment=vl4str-large', 'model=vl4str', 'dataset=real', 'data.root_dir=/NasData/datasets/str_dataset_ub', 'model.lr=8.4e-5', 'model.batch_size=16', 'trainer.accumulate_grad_batches=8', 'trainer.max_epochs=5', 'trainer.gpus=8', 'trainer.val_check_interval=10000', 'model.clip_pretrained=/data1/home/ict07/HSJ/CLIP4STR/pretrained/clip/clip-vit-large-patch14-ko.bin']
Error in call to target 'strhub.models.vl_str.system.VL4STR':
KeyError('model')
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.>

Please tell me how to deal with it.

I set everything up correctly for my paths and trained vl4str_large.sh successfully using the pretrained weights provided by CLIP4STR. I renamed the downloaded pytorch_model.bin (https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-L-14/resolve/main/pytorch_model.bin) to XLM-Roberta-Large-Vit-L-14.bin.

Convert to ONNX

How can I convert the model to ONNX for inference deployment?

Is there a way to detect spaces?

Thank you for the great work and released models. I noticed the tokenizer does not include spaces. Was the model not trained on them or is there a way to add them to the tokenizer?
