RUCAIBox / TextBox
TextBox 2.0 is a text generation library with pre-trained language models
Home Page: https://github.com/RUCAIBox/TextBox
License: MIT License
When running
python run_textbox.py --model=OpenAI-GPT --model_path=openai-gpt --dataset=gyafc_em
I get this error:
[Errno 2] No such file or directory: 'textbox/evaluator/utils/gyafc_em.ckpt'
This ckpt file is still needed when using the GYAFC dataset from RUCAIBox/StyleTransfer. Where can I download or generate it?
Hi,
After training the SeqGAN model on Arabic text, I have a generated file (SeqGAN-wiki_ar-Dec-05-2022_20-05-03.txt) that contains a huge number of generated Arabic sentences.
My question is: now that training has finished (it took approximately three days), can I run the trained model so that it immediately produces a newly generated sentence each time?
Does TextBox support this feature, and how can I run SeqGAN after training on a custom dataset to generate a new sentence every time?
Describe the bug
Multi-GPU training with accelerate fails.
How to reproduce
accelerate launch run_textbox.py \
    --gpu_id=1,3 \
    --dataset=csl \
    --model=CPT \
    --model_path=fnlp/cpt-base \
    --saved_dir=./saved/ \
    --filename=DEBUG \
    --epochs=5 \
    --learning_rate=1e-5 \
    --train_batch_size=16 \
    --eval_batch_size=16 \
    --max_save=1 \
    --wandb=disabled \
    --quick_test=1000
Log
13 Feb 11:15 ERROR Traceback (most recent call last):
File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/utils/dashboard.py", line 312, in new_experiment
yield True
File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/quick_start/experiment.py", line 136, in run
self._do_train_and_valid()
File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/quick_start/experiment.py", line 111, in _do_train_and_valid
self.valid_result = self.trainer.fit(train_data, valid_data)
File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/trainer/trainer.py", line 451, in fit
loss = self._train_epoch(train_data, epoch_idx, valid_data)['loss']
File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/trainer/trainer.py", line 221, in _train_epoch
loss = self.model(data, epoch_idx=epoch_idx)
File "/home/cqy/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cqy/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Can I run textbox-0.2.1 on a Windows server without using a Linux subsystem?
Does the tool require any dependencies that are specific to Linux or that need a Linux environment to run?
When I run python run_textbox.py --rnn_type=lstm --max_vocab_size=4000, I hit a bug in textbox/module/Decoder/rnn_decoder.py on line 78. The init_hidden() function returns a tuple when initialized with lstm, so I got AttributeError: 'tuple' object has no attribute 'contiguous'.
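For context, an LSTM carries its state as an (h, c) pair while a GRU or vanilla RNN carries a single tensor, so any per-tensor call has to be mapped over the pair. The following is a minimal sketch of that idea only. `apply_to_hidden` and the `FakeState` stand-in are hypothetical names used for illustration; they are not part of TextBox, and a real fix would operate on torch tensors inside rnn_decoder.py.

```python
from dataclasses import dataclass


@dataclass
class FakeState:
    """Hypothetical stand-in for a tensor; it only mimics .contiguous()."""
    name: str
    is_contiguous: bool = False

    def contiguous(self):
        return FakeState(self.name, True)


def apply_to_hidden(hidden, fn):
    """Apply fn to a single state, or to each element of an (h, c) tuple."""
    if isinstance(hidden, tuple):
        return tuple(fn(state) for state in hidden)
    return fn(hidden)


# GRU/RNN-style hidden state: a single object.
gru_hidden = apply_to_hidden(FakeState("h"), lambda s: s.contiguous())

# LSTM-style hidden state: an (h, c) tuple, which has no .contiguous() itself.
lstm_hidden = apply_to_hidden(
    (FakeState("h"), FakeState("c")), lambda s: s.contiguous()
)
```

Routing every hidden-state operation through one helper like this keeps the LSTM and GRU code paths identical at the call site.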
Hello,
Is this a pandas version problem? Which pandas version does TextBox need?
python: 3.8.15
pandas: 1.5.2
(textbox) hy@xxx:~/TextBox$ python run_textbox.py --model_path=facebook/bart-base
Traceback (most recent call last):
File "run_textbox.py", line 2, in <module>
from textbox import run_textbox
File "/home/hy/TextBox/textbox/__init__.py", line 8, in <module>
from textbox.quick_start.hyper_tuning import run_hyper
File "/home/hy/TextBox/textbox/quick_start/hyper_tuning.py", line 14, in <module>
from .experiment import Experiment
File "/home/hy/TextBox/textbox/quick_start/experiment.py", line 12, in <module>
from ..trainer.trainer import Trainer
File "/home/hy/TextBox/textbox/trainer/__init__.py", line 1, in <module>
from textbox.trainer.trainer import Trainer
File "/home/hy/TextBox/textbox/trainer/trainer.py", line 16, in <module>
from textbox.utils.dashboard import get_dashboard, Timestamp, EpochTracker
File "/home/hy/TextBox/textbox/utils/dashboard.py", line 13, in <module>
from pandas.core.resample import f
ImportError: cannot import name 'f' from 'pandas.core.resample'
Does TextBox support custom preprocessing for a dataset?
I work with source code data, and such data may require custom preprocessing, for example extracting an abstract syntax tree or a dataflow graph.
I used run_textbox.py --model=Blenderbot-Small --model_path=facebook/blenderbot_small-90M --dataset=dd to produce a fine-tuned checkpoint folder, renamed it to blenderbot_small-90M, and then loaded that model, which fails with:
Exception: Error while initializing BPE: Token _</w> out of vocabulary
Code:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("~/blenderbot_small-90M")
model = AutoModelForSeq2SeqLM.from_pretrained("~/blenderbot_small-90M")
predict = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
text = "hello"
pred = predict(text)
print(pred)
Describe the bug
AttributeError: 'GPT2LMHeadModel' object has no attribute 'set_efficient_tuning'
When I use efficient_methods:
python run_textbox.py
--model=T5
--model_path=t5-large
--dataset=webnlg
--gpu_id=4
--efficient_methods=['prefix-tuning']
--efficient_kwargs={'prefix_length':\ 100,\ 'prefix_dropout':\ 0.1,\ 'prefix_mid_dim':\ 512}
--filename CP/T5_large_prefix_tuning
Hi! Thanks for sharing this amazing code repo.
First of all, I tried fine-tuning GPT-2 with TextBox; everything is good and works pretty smoothly.
However, I came across an error when I tried to use prompt tuning for GPT-2.
I set the efficient training hyper-parameters in overall.yaml to:
efficient_methods: ['prompt-tuning']
efficient_kwargs: {'prompt_length': 100}
efficient_unfreeze_model: False
Prompt tuning then returns the following error:
generating: 0%| | 0/7 [00:00<?, ?it/s]
generating: 0%| | 0/7 [00:00<?, ?it/s]
24 Nov 21:13 ERROR Traceback (most recent call last):
File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/utils/dashboard.py", line 323, in new_experiment
yield True
File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/quick_start/experiment.py", line 130, in run
self._do_train_and_valid()
File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/quick_start/experiment.py", line 105, in _do_train_and_valid
self.valid_result = self.trainer.fit(train_data, valid_data)
File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/trainer/trainer.py", line 453, in fit
self.stopped |= self._valid(valid_data, 'epoch')
File "/opt/hf_venvs/python3.8/202111/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/trainer/trainer.py", line 297, in _valid
valid_results = self.evaluate(valid_data, is_valid=True)
File "/opt/hf_venvs/python3.8/202111/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/trainer/trainer.py", line 526, in evaluate
generated = self.accelerator.unwrap_model(self.model).generate(batch_data, self.accelerator)
File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/model/abstract_model.py", line 90, in generate
sample_outputs = accelerator.unwrap_model(self.model).generate(**inputs, **self.generation_kwargs)
File "/opt/hf_venvs/python3.8/202111/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/opt/hf_venvs/python3.8/202111/lib/python3.8/site-packages/transformers/generation_utils.py", line 934, in generate
raise ValueError("For decoder-only generation, one must pass `input_ids`.")
ValueError: For decoder-only generation, one must pass `input_ids`
The main error is ValueError: For decoder-only generation, one must pass `input_ids`, raised in transformers/generation_utils.py during validation.
How can I solve this problem?
Hello,
Thank you for this tool. I would like to add support for training with reinforcement learning, using a reward such as ROUGE or BLEU, for seq2seq tasks.
I would be happy to contribute!
Best,
Muhammad
I get the following error:
File "/textbox/Textbox_test/TextBox/textbox/trainer/trainer.py", line 516, in evaluate
generated = self.accelerator.unwrap_model(self.model).generate(batch_data, self.accelerator)
TypeError: generate() missing 1 required positional argument: 'accelerator'
I compared the latest version with the older one:
latest version
line 516: generated = self.accelerator.unwrap_model(self.model).generate(batch_data, self.accelerator)
older version
line 505: generated = self.accelerator.unwrap_model(self.model).generate(batch_data, eval_data, self.accelerator)
So I added the argument "eval_data" to the generate() call.
Sure enough, the problem was solved.
Please confirm this error.
What should I do if a phrase has several translations?
Would the following dataset be correct?
train.src:
I like the color green.
I like the color green.
train.tgt:
Мне нравится зелёный цвет.
Я люблю зелёный цвет.
How will the BLEU metric work on it?
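One common way to represent several valid translations of one source sentence is to keep each source once and attach all of its references, since BLEU is defined to accept multiple references per hypothesis. The sketch below only regroups the parallel lines from the question. `group_references` is a hypothetical helper, and whether TextBox's data loader and BLEU evaluator accept this layout is an assumption that should be checked against the repository.

```python
def group_references(sources, targets):
    """Group parallel (source, target) lines so each source appears once.

    Returns (unique_sources, references), where references[i] is the list of
    all target strings that were paired with unique_sources[i].
    """
    grouped = {}  # dicts preserve insertion order in Python 3.7+
    for src, tgt in zip(sources, targets):
        grouped.setdefault(src, []).append(tgt)
    return list(grouped.keys()), list(grouped.values())


train_src = ["I like the color green.", "I like the color green."]
train_tgt = ["Мне нравится зелёный цвет.", "Я люблю зелёный цвет."]

unique_src, refs = group_references(train_src, train_tgt)
```

With this grouping, a single generated translation can be scored against both references instead of being penalized for matching only one of them.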
I checked the code in textbox/model/ptg.py.
However, I am confused about how task_key is updated.
Is only the last embedding updated?
Also, I am wondering how to determine task_key for a new task.
Doesn't it refer to the other tasks?
I downloaded the PersonaChat dataset from the Baidu Wangpan link given in the README, but it seems to need more processing.
The source code uses 'MultipleSentenceDataset' to handle datasets for the dialogue task. The provided dataset (PersonaChat) is split into source and target and lacks the knowledge part, which breaks training. Could you give more details on how to handle this? Thank you~
Why does TextBox no longer support GAN models for text generation?
How can I use the SeqGAN, TextGAN, and RankGAN models, if that is still possible?
HOW TO REPRODUCE?
embedding_size: 128
hidden_size: 256
num_enc_layers: 2
bidirectional: True
num_dec_layers: 2
dropout_ratio: 0.2
rnn_type: "gru"
#attention_type: "LuongAttention"
#alignment_method: "concat"
beam_size: 5
decoding_strategy: 'beam_search'
model RNNEncDec.
dataset IWSLT14_DE_EN
Thanks for your attention.
I have some data like this:
These are the heads, which I saved in train.input_text:
PersonX uses PersonX's ___ to obtain
PersonX uses PersonX's ___ to obtain
PersonX uses PersonX's ___ to obtain
PersonX uses PersonX's ___ to obtain
PersonX changes men 's ___
PersonX changes men 's ___
The tails are the other node in a relation. For example, for the "intention" relation, these could be the tails, which specify a person's intention in doing an action (the heads above). They are saved as train.target_text:
to have an advantage
to fulfill a desire
to get out of trouble
to be powerful
to be influential
to reform men with a bad attitude.
good about themselves
Each row corresponds to a row in tails. I tried to model them using t5, but I don't know which task_type to use. I tried summarization, and the generated output looks like this:
personx's ____
personx's ____
personx's ____
personx's ____
personx's ____
personx changes men's _____
personx changes men's _____
This is obviously far from the targets. It seems the model just tried to summarize the inputs, but it needs to learn a mapping between heads and tails.
Which task type is suitable for this common task?
Describe the bug
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 744 but got size 768 for tensor number 1 in the list.
How to reproduce
Set train_batch_size to 16 in gyafc_em.yaml (I am using a 3090 with 24 GB; setting it to 64 fails with an out-of-memory error).
Run python run_textbox.py --model=Context_Tuning --dataset=gyafc_em
Log
15 Apr 16:30 ERROR Traceback (most recent call last):
File "/root/TextBox/textbox/utils/dashboard.py", line 311, in new_experiment
yield True
File "/root/TextBox/textbox/quick_start/experiment.py", line 138, in run
self._do_train_and_valid()
File "/root/TextBox/textbox/quick_start/experiment.py", line 113, in _do_train_and_valid
self.valid_result = self.trainer.fit(train_data, valid_data)
File "/root/TextBox/textbox/trainer/trainer.py", line 451, in fit
loss = self._train_epoch(train_data, epoch_idx, valid_data)['loss']
File "/root/TextBox/textbox/trainer/trainer.py", line 221, in _train_epoch
loss = self.model(data, epoch_idx=epoch_idx)
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/root/TextBox/textbox/model/abstract_model.py", line 69, in forward
inputs = self._process_prompt_tuning_input(inputs, batch)
File "/root/TextBox/textbox/model/context_tuning.py", line 88, in _process_prompt_tuning_input
inputs_embeds = torch.cat([prompt_embeds[:, 0], inputs_embeds, prompt_embeds[:, 1]], dim=1) # b, pl+l+pl, e
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 744 but got size 768 for tensor number 1 in the list.
Hey there,
Thanks for the amazing work.
I ran into a codec problem when running the example "python run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base":
Traceback (most recent call last):
File "run_textbox.py", line 12, in <module>
run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\quick_start\quick_start.py", line 20, in run_textbox
experiment = Experiment(model, dataset, config_file_list, config_dict)
File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\quick_start\experiment.py", line 52, in __init__
self._init_data(self.get_config(), self.accelerator)
File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\quick_start\experiment.py", line 78, in _init_data
train_data, valid_data, test_data = data_preparation(config, tokenizer)
File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\data\utils.py", line 23, in data_preparation
train_dataset = AbstractDataset(config, 'train')
File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\data\abstract_dataset.py", line 25, in __init__
self.source_text = load_data(source_filename, max_length=self.quick_test)
File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\data\misc.py", line 25, in load_data
for line in fin:
File "C:\Users\korn2\anaconda3\envs\TextBox\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 6797: character maps to <undefined>
Any ideas?
Cheers,
Kevin L
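The traceback above shows the file being decoded by cp1252.py, which is what Python falls back to on that Windows machine when open() is called without an encoding. A minimal sketch of reading a dataset file with an explicit encoding follows. `load_lines` is a hypothetical helper written for illustration, not TextBox's own load_data, and it assumes the dataset files are UTF-8.

```python
import itertools
import os
import tempfile


def load_lines(dataset_path, max_length=0, encoding="utf-8"):
    """Read stripped lines with an explicit encoding instead of the OS default."""
    with open(dataset_path, "r", encoding=encoding) as fin:
        lines = itertools.islice(fin, max_length) if max_length else fin
        return [line.strip() for line in lines]


# Demo: "č" is stored in UTF-8 as bytes C4 8D, and 0x8D is undefined in cp1252.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.src")
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("č\n")

    utf8_lines = load_lines(path)

    try:
        load_lines(path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
```

Passing the encoding explicitly makes the behavior the same on Linux and Windows, regardless of the system locale.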
Nice work, and thank you for your team's contribution.
Do you plan to release the seq2seq generation results of GPT, BART, etc.?
Hello, I am very interested in your work.
I know that "--pretrain_task=denoising" enables denoising auto-encoding.
However, I did not see the relevant parameter descriptions and restrictions in the code.
If I want to use other pre-training methods, how do I set this parameter?
Thank you very much for your work. I hope you can tell me how when you have time.
To the authors: how can I use this code to reproduce the results in all the tables? Is there a run command for each corresponding dataset?
When I run python run_textbox.py --model=LeakGAN --dataset=COCO --task_type=unconditional, it always raises AttributeError: 'SingleSentenceDataLoader' object has no attribute 'vocab_size'. If I want to use a pre-trained model, where can I find the LeakGAN model to use?
fast-bleu fails to compile on Windows. Could you provide a wheel file for fast-bleu?
I would like to know whether I can enter the path for other models, such as mt5 or mbart-50. Does it support multilingual models?
Thanks for open-sourcing this exciting tool!
When I used TextBox to pre-train BART from scratch, I found that the wudao corpus mentioned in the documentation has not been provided. Where can I get this data?
python run_textbox.py --model=BART --dataset=wudao --pretrain_task=denoising
Since I did not have the wudao dataset, I tried to use the example dataset samsum to pre-train BART with the denoising task.
However, I got the following error:
05 Jan 23:25 INFO ====== Start training ======
train 1: 100%|██████████| 77/77 [03:16<00:00, 2.55s/step, loss=2.78]
05 Jan 23:29 INFO Train epoch 1 [time: 196.60s, loss: 2.78]
generating: 100%|██████████| 52/52 [01:22<00:00, 1.58s/it]
05 Jan 23:30 ERROR Traceback (most recent call last):
File "/mnt/windata/projects/TextBox/textbox/utils/dashboard.py", line 312, in new_experiment
yield True
File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 136, in run
self._do_train_and_valid()
File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 111, in _do_train_and_valid
self.valid_result = self.trainer.fit(train_data, valid_data)
File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 455, in fit
self.stopped |= self._valid(valid_data, 'epoch')
File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 294, in _valid
valid_results = self.evaluate(valid_data, is_valid=True)
File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 548, in evaluate
corpus_len = len(eval_data.dataset.target_text)
AttributeError: 'AbstractDataset' object has no attribute 'target_text'
Could you please help me find my mistake in using TextBox? Thanks!
Hi so which models except gpt support prompt-tuning in your repo?
RuntimeError: Error(s) in loading state_dict for CPTForConditionalGeneration:
size mismatch for model.encoder.embeddings.position_ids: copying a param with shape torch.Size([1, 1024]) from checkpoint, the shape in current model is torch.Size([1, 512]).
size mismatch for model.encoder.embeddings.position_embeddings.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
Loading either cpt-base or cpt-large raises this error. Is a dimension in the config file wrong, so that the dimensions of the initialized model do not match those of the checkpoint weights?
I set max_save=1 and ran 10 epochs. The best result appeared at epoch 8, but both the epoch-8 and epoch-10 folders were saved. According to the README, shouldn't there be only epoch-8? Also, only the epoch-10 folder contains generation.txt. From a quick look at the source code, even with two folders, shouldn't each folder get its own generation.txt?
Running fails with ModuleNotFoundError: No module named 'transformers.models.unilm'.
After investigating, I found that these imports in textbox\utils\utils.py fail:
from transformers.models.unilm.tokenization_unilm import UnilmTokenizer
from transformers.models.mass.tokenization_mass import MassTokenizer
It seems that UnilmTokenizer and MassTokenizer cannot be imported from transformers. The errors reported for this file are:
Import "transformers.models.unilm.tokenization_unilm" could not be resolved
Import "transformers.models.mass.tokenization_mass" could not be resolved
I tried upgrading transformers and reinstalling it, but neither solved the problem.
Hi, thanks for providing such a powerful tool. After cloning TextBox from source, I ran the command "python run_textbox.py" and got: ModuleNotFoundError: No module named 'bert_score'. Is this a bug? How do I run the code correctly?
Hi,
Can I train TextBox's SeqGAN without the evaluation part? What would the command be?
The fast_bleu version in the fast_bleu_wheel4windows folder is 0.0.86, but the environment requires 0.0.89. On Windows, fast_bleu cannot be installed directly. Could you please update the version in the fast_bleu_wheel4windows folder? Thank you.
Hi,
I am using TextBox, specifically the SeqGAN model, for text generation with Arabic text.
When I use a large file (train.tgt, 5 GB in size), running the SeqGAN model gives me:
"
24 Mar 14:25 INFO Loading data from scratch
killed
"
Why does this happen? Is there a limit on data size? When I use a 2 MB train.tgt, the SeqGAN model runs correctly, but its results are poor.
Based on your experience and observations, which factors and parameters can I change to improve the generation of Arabic text?
Thank you :)
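A bare "killed" right after "Loading data from scratch" often means the operating system terminated the process for using too much memory, although the log alone does not confirm that. One way to test the hypothesis is to train on intermediate sizes between 2 MB and 5 GB. The sketch below copies the first N lines of a large file to a smaller one without loading the whole file into memory. `write_subset` and the file names are hypothetical and not part of TextBox.

```python
import itertools
import os
import tempfile


def write_subset(src_path, dst_path, max_lines, encoding="utf-8"):
    """Copy the first max_lines lines of src_path to dst_path, one line at a time."""
    written = 0
    with open(src_path, "r", encoding=encoding) as fin, \
            open(dst_path, "w", encoding=encoding) as fout:
        for line in itertools.islice(fin, max_lines):
            fout.write(line)
            written += 1
    return written


# Demo on a tiny temporary file standing in for a large train.tgt.
with tempfile.TemporaryDirectory() as tmp_dir:
    full_path = os.path.join(tmp_dir, "train.tgt")
    small_path = os.path.join(tmp_dir, "train_small.tgt")
    with open(full_path, "w", encoding="utf-8") as fout:
        for i in range(10):
            fout.write(f"sentence {i}\n")

    copied = write_subset(full_path, small_path, max_lines=3)
    with open(small_path, "r", encoding="utf-8") as fin:
        subset = fin.read().splitlines()
```

If training succeeds on a mid-sized subset and fails only near the full size, available memory is the likely limit rather than a hard cap in the library.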
Hi,
First of all, thanks a lot for your work.
I have a question about the kg2text task in the project. There is no example or documentation for kg2text, but I read your code and found the KG dataset processor.
Could you please pick a dataset and a corresponding model as an example to explain the kg2text details further?
from posix import listdir should be from os import listdir; the current import breaks portability on Windows.
with open(dataset_path, "r") as fin: should specify the encoding as utf-8.
Hello, I would like to ask how this problem can be solved.
example of data:
train.src
ABC
BCA
train.tgt
['Abc', 'ABc']
'bca'
The dataset config textbox/properties/dataset/my_data.yaml is the same as textbox/properties/dataset/wmt19-ru-en.yaml.
command:
python run_textbox.py --model=BART --dataset=my_data --model_path=facebook/bart-base
I got the following error from the tokenizer (transformers.AutoTokenizer):
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
The error occurs in a method of the textbox.data.abstract_dataset.AbstractDataset class, on line 127:
target_ids = tokenizer(
    text_target=self.target_text,
    add_special_tokens=False,
    return_token_type_ids=False,
    return_attention_mask=False,
)["input_ids"]
This happens because self.target_text is created by the textbox.data.misc.load_data function:
import itertools

def load_data(dataset_path: str, max_length: int = 0):
    ...
    text = []
    with open(dataset_path, "r") as fin:
        # Optionally read only the first max_length lines.
        if max_length:
            fin = itertools.islice(fin, max_length)
        for line in fin:
            l = line.strip()
            # Lines that look like a quoted string or a list literal are evaluated.
            if len(l) >= 2 and ((l[0] == '"' and l[-1] == '"') or (l[0] == "'" and l[-1] == "'") or
                                (l[0] == '[' and l[-1] == ']')):
                try:
                    l = eval(l)
                    # Lists are kept as lists; everything else becomes a string.
                    if not isinstance(l, list):
                        l = str(l)
                except:
                    pass
            text.append(l)
    return text
For my_data, this function returns [['Abc', 'ABc'], 'bca'], and that structure does not meet the requirement of transformers.AutoTokenizer: Union[TextInputSequence, Tuple[InputSequence, InputSequence]].
Are there examples of training a model on a one-to-many dataset?
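The mixed return value described above (some entries lists, some strings) can be avoided by normalizing every target line to a list of references and then expanding one-to-many rows into one (source, target) pair per reference. The following is a sketch of that idea, not the library's behavior. `parse_target_line` and `expand_pairs` are hypothetical helpers, and `ast.literal_eval` is swapped in for `eval` because it only accepts Python literals.

```python
import ast


def parse_target_line(line):
    """Return a list of reference strings for one line of a target file."""
    text = line.strip()
    if len(text) >= 2 and text[0] in "\"'[":
        try:
            value = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return [text]  # not a valid literal: keep the raw line
        if isinstance(value, list):
            return [str(item) for item in value]
        return [str(value)]
    return [text]


def expand_pairs(sources, target_lines):
    """Flatten one-to-many rows into one (source, reference) pair per reference."""
    pairs = []
    for src, line in zip(sources, target_lines):
        for ref in parse_target_line(line):
            pairs.append((src, ref))
    return pairs


sources = ["ABC", "BCA"]
target_lines = ["['Abc', 'ABc']", "'bca'"]
pairs = expand_pairs(sources, target_lines)
```

After expansion every target is a plain string, which is the shape a tokenizer call on a flat list of texts expects. For evaluation, the grouped lists from `parse_target_line` can be kept as multiple references.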
if self.rnn_type == "lstm":
    self.hidden_to_mean = nn.Linear(self.num_directions * self.hidden_size, self.latent_size)
    self.hidden_to_logvar = nn.Linear(self.num_directions * self.hidden_size, self.latent_size)
    self.latent_to_hidden = nn.Linear(self.latent_size, 2 * self.hidden_size)
elif self.rnn_type == 'gru' or self.rnn_type == 'rnn':
    self.hidden_to_mean = nn.Linear(self.num_directions * self.hidden_size, self.latent_size)
    self.hidden_to_logvar = nn.Linear(self.num_directions * self.hidden_size, self.latent_size)
    self.latent_to_hidden = nn.Linear(self.latent_size, 2 * self.hidden_size)
Is there any difference between LSTM and GRU branches?
Hi,
"python run_textbox.py --model=RNN --dataset=COCO --task_type=unconditional --load_experiment=<ckpt_file> --test_only=true"
I tried this command after the program stopped unexpectedly, but it cannot find the ckpt_file. Where is the ckpt_file saved, and how do I load it into the program? Looking forward to your reply! Thank you.
When I try to call run_demo like this:
python run_demo.py --model=T5 --dataset=xIntent_en2en --pretrained_model_path=/drive2/pretrained/mt5/hf/mt5-small/
it gives the following error:
...mini/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'saved/T5-xIntent_en2en-Jul-31-2021_13-45-46.pth'
It searches for a file stamped with the current time (13-45), but my model exists in the saved directory with a different timestamp. Why must the hours and minutes match?
(base) pouramini@nlplab-server:~/TextBox/saved$ ls
GPT2-COCO-Jul-30-2021_21-51-08.pth
T5-xIntent_en2en-Jul-31-2021_11-46-29.pth
RNN-COCO-Jul-30-2021_13-34-31.pth
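Since the checkpoint names in the listing follow a `<model>-<dataset>-<timestamp>.pth` pattern, one workaround is to look the file up by prefix instead of reconstructing the timestamp. The sketch below assumes that naming pattern holds for all checkpoints. `latest_checkpoint` is a hypothetical helper, not a TextBox function, and how the resulting path is passed to the script depends on the TextBox version in use.

```python
import glob
import os
import tempfile


def latest_checkpoint(saved_dir, model, dataset):
    """Return the most recently modified '<model>-<dataset>-*.pth' file, or None."""
    pattern = os.path.join(saved_dir, f"{model}-{dataset}-*.pth")
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


# Demo with empty files named like the listing above.
with tempfile.TemporaryDirectory() as saved_dir:
    names = [
        "GPT2-COCO-Jul-30-2021_21-51-08.pth",
        "T5-xIntent_en2en-Jul-31-2021_11-46-29.pth",
        "RNN-COCO-Jul-30-2021_13-34-31.pth",
    ]
    for offset, name in enumerate(names):
        file_path = os.path.join(saved_dir, name)
        open(file_path, "w").close()
        # Give each file a distinct, deterministic modification time.
        os.utime(file_path, (1000 + offset, 1000 + offset))

    found = latest_checkpoint(saved_dir, "T5", "xIntent_en2en")
    found_name = os.path.basename(found)
    missing = latest_checkpoint(saved_dir, "BART", "samsum")
```

Selecting by modification time rather than by parsing the timestamp in the name avoids depending on the exact date format.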
Hello,
I tried to fine-tune gpt2-medium on the e2e dataset with the command below, but got some errors.
Could you please give me an example of how to train the model with TextBox?
python run_textbox.py --model=GPT2 --dataset=e2e --model_path=./PTMs/gpt2-medium/
When generating after training for one epoch, I get this warning:
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
After generating, I get this error:
26 Nov 01:02 ERROR Traceback (most recent call last):
File "/home/hy/TextBox/textbox/utils/dashboard.py", line 320, in new_experiment
yield True
File "/home/hy/TextBox/textbox/quick_start/experiment.py", line 130, in run
self._do_train_and_valid()
File "/home/hy/TextBox/textbox/quick_start/experiment.py", line 105, in _do_train_and_valid
self.valid_result = self.trainer.fit(train_data, valid_data)
File "/home/hy/TextBox/textbox/trainer/trainer.py", line 453, in fit
self.stopped |= self._valid(valid_data, 'epoch')
File "/home/hy/miniconda3/envs/textbox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/hy/TextBox/textbox/trainer/trainer.py", line 297, in _valid
valid_results = self.evaluate(valid_data, is_valid=True)
File "/home/hy/miniconda3/envs/textbox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/hy/TextBox/textbox/trainer/trainer.py", line 571, in evaluate
result = self.evaluator.evaluate(generate_corpus, reference_dataset)
File "/home/hy/TextBox/textbox/evaluator/base_evaluator.py", line 151, in evaluate
metric_result = evaluator.evaluate(generate_corpus, reference_corpus, avg=avg)
File "/home/hy/TextBox/textbox/evaluator/abstract_evaluator.py", line 31, in evaluate
metric_dict = self._calc_metrics_info(generate_corpus=generate_corpus, reference_corpus=reference_corpus)
File "/home/hy/TextBox/textbox/evaluator/bleu_evaluator.py", line 92, in _calc_metrics_info
reference_corpus = list(zip_longest(*reference_corpus))
TypeError: type object argument after * must be an iterable, not Corpus
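As the title says:
a = ['What skills does a programmer need to master?',
'What special skills do you have?',
'I have studied Linux.',
'What are you majoring in?',
'Thanks for asking, I major in databases',
'Can you tell me about my English education?',
'Programming, and learning English well']
b = ['What skills does a programmer need to master?', 'What special skills do you have?', 'I study in the computer science department',
'What are you majoring in?', 'I major in English', 'Can you tell me about my English education?']
Above are context examples from an open-ended dialogue system. Personally, I feel that a is somewhat better than b.
The first sentence, 'What skills does a programmer need to master?', serves as the topic of the dialogue.
Is there a good method for evaluating which of a and b is better?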
For the Quick Start in readme.md, I tried python run_textbox.py; it works and returns a BLEU score.
But when I tried python run_textbox.py --rnn_type=lstm --max_vocab_size=4000, it shows:
06 Apr 10:58 INFO epoch 38 training [time: 0.76s, train loss: 2.8981]
06 Apr 10:58 INFO epoch 38 evaluating [time: 0.23s, valid_loss: 4.281664]
06 Apr 10:58 INFO valid ppl: 72.36075204659402
100%|██████████| 156/156 [00:00<00:00, 206.01it/s]
06 Apr 10:58 INFO epoch 39 training [time: 0.76s, train loss: 2.8782]
06 Apr 10:58 INFO epoch 39 evaluating [time: 0.23s, valid_loss: 4.282278]
06 Apr 10:58 INFO valid ppl: 72.40518443339685
100%|██████████| 156/156 [00:00<00:00, 190.98it/s]
06 Apr 10:58 INFO epoch 40 training [time: 0.82s, train loss: 2.8627]
06 Apr 10:58 INFO epoch 40 evaluating [time: 0.23s, valid_loss: 4.286773]
06 Apr 10:58 INFO valid ppl: 72.73138277179176
06 Apr 10:58 INFO Finished training, best eval result in epoch 37
06 Apr 10:58 INFO best valid loss: 4.267283218029218, best valid ppl: 71.32759063735446
06 Apr 10:58 INFO Loading model structure and parameters from saved/RNN-COCO-Apr-06-2022_10-57-51.pth
0%| | 0/157 [00:00<?, ?it/s]
Traceback (most recent call last):
File "run_textbox.py", line 18, in <module>
run_textbox(model=args.model, dataset=args.dataset, config_file_list=config_file_list, config_dict={})
File "/home/LAB/TextBox/textbox/quick_start/quick_start.py", line 90, in run_textbox
test_result = trainer.evaluate(test_data, load_best_model=saved)
File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/LAB/TextBox/textbox/trainer/trainer.py", line 446, in evaluate
generated = self.model.generate(batch_data, eval_data)
File "/home/LAB/TextBox/textbox/model/LM/rnn.py", line 64, in generate
outputs, hidden_states = self.decoder(decoder_input, hidden_states)
File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/LAB/TextBox/textbox/module/Decoder/rnn_decoder.py", line 79, in forward
outputs, hidden_states = self.decoder(input_embeddings, hidden_states)
File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 689, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 634, in check_forward_args
'Expected hidden[0] size {}, got {}')
File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 226, in check_hidden_size
raise RuntimeError(msg.format(expected_hidden_size, list(hx.size())))
RuntimeError: Expected hidden[0] size (2, 1, 128), got [1, 128]
Can you tell me how to solve it?
ERROR: Could not find a version that satisfies the requirement rouge-score>=0.1.2 (from versions: 0.0.1, 0.0.2, 0.0.3, 0.0.4)
I cannot install rouge-score version 0.1.2. How can I solve this?
Hello,
Thanks for developing such a nice library. I am looking for detailed documentation. The readthedocs.io page is empty, and I feel the instructions given on GitHub are not enough to run experiments. Can you please share proper documentation for TextBox? Moreover, it would be great if example notebooks or scripts could be shared, so that first-time users can get started with training, evaluation, dataset loading, and decoding with different strategies (beam, top_k, etc.).
Thanks a lot.