Comments (8)
The only changes you should have to make are:

- Update `max_length` if you want to train on longer contexts.
- Update `transformer_model` to a HF model that supports longer input sequences (e.g. `allenai/longformer-base-4096`).
Of course I could be forgetting something. Feel free to follow up here if these changes cause an error.
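For concreteness, here is a hedged sketch of what such overrides might look like when passed to AllenNLP. The key paths (`dataset_reader.tokenizer.model_name`, etc.) are assumptions and depend on the layout of your training config, so adjust them to match your jsonnet file:

```python
import json

# Hypothetical override paths -- adjust to match your actual training config.
# "model_name" selects the HF checkpoint; "max_length" controls truncation.
overrides = json.dumps({
    "dataset_reader.tokenizer.model_name": "allenai/longformer-base-4096",
    "dataset_reader.tokenizer.max_length": 4096,
})
print(overrides)
```

The resulting JSON string could then be passed via the `--overrides` flag of `allennlp train`.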
from declutr.
I tried integrating `allenai/longformer-large-4096`, but I am getting the following error:
```
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/usr/local/lib/python3.7/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 80, in __iter__
    for instance in self._instance_generator(self._file_path):
  File "/usr/local/lib/python3.7/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 446, in _instance_iterator
    yield from self._multi_worker_islice(self._read(file_path), ensure_lazy=True)
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/declutr/dataset_reader.py", line 124, in _read
    file_path = cached_path(file_path)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/common/file_utils.py", line 175, in cached_path
    raise FileNotFoundError(f"file {url_or_filename} not found")
FileNotFoundError: file None not found
```
Also, when I downloaded the model locally in Colab and referenced the path, I was able to use the model, but then it stops with this error:
```
  File "/usr/local/lib/python3.7/dist-packages/allennlp/modules/seq2vec_encoders/boe_encoder.py", line 43, in forward
    tokens = tokens * mask.unsqueeze(-1)
RuntimeError: The size of tensor a (512) must match the size of tensor b (503) at non-singleton dimension 1
```
from declutr.
The first error looks like one of your data paths is wrong or non-existent (`train_data_path` or `validation_data_path`).
I am not sure about your second error. At some point AllenNLP tries to multiply tokens of size 512 with a mask of size 503. You may have to look into the specifics of Longformer to see how to use it properly, or try another model that accepts long input sequences.
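One plausible cause (an assumption, not confirmed by this traceback): Longformer internally pads its input up to a multiple of its attention window (512 by default), so a 503-token input becomes 512 tokens inside the model while the mask built outside still has length 503. The arithmetic matches the two sizes in the error:

```python
def pad_to_window(seq_len: int, window: int = 512) -> int:
    """Round seq_len up to the nearest multiple of the attention window,
    mimicking Longformer's internal input padding."""
    return ((seq_len + window - 1) // window) * window

print(pad_to_window(503))  # -> 512: tokens grow to 512, mask stays at 503
```

If this is the cause, the fix would be to pad/truncate the mask with the same logic before it reaches the encoder.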
from declutr.
Unfortunately it is not clear how to train longer-sequence models using DeCLUTR on AllenNLP. Is there a way I can use DeCLUTR-base on longer context sequences with some striding parameters, or any other way to handle longer sequences for semantic tasks?
from declutr.
Could you provide more detail as to exactly what you are trying to do? Are you trying to re-train the model with a longer `max_length`? Or are you trying to use a trained model on longer sequences?
from declutr.
I want to handle longer sequences for semantic tasks, so there are two approaches:

- Use a DeCLUTR objective on a long-sequence model so that it supports longer sequences.
- Use DeCLUTR on RoBERTa-base or RoBERTa-large, but with some sort of striding logic so that longer input sequences do not create bottlenecks during inference.

Hope this explains the task at hand.
from declutr.
Training on longer sequences is going to be tricky, as you would need to collect even longer training documents.

I would go with approach 2. AllenNLP has support for chunking up some text into blocks of `max_length`, embedding each, and then concatenating the embeddings. The general approach would be:

- Set the `max_length` argument of the tokenizer to the maximum length of documents you want to embed.
- Set the `max_length` argument of the `PretrainedTransformerIndexer` to 512 (the maximum input size of our pretrained transformer).
- Set the `max_length` argument of the `PretrainedTransformerEmbedder` to 512 (the maximum input size of our pretrained transformer).
The setup would go something like:

```python
from declutr import Encoder

overrides = '{"dataset_reader.tokenizer.max_length": 1024, "dataset_reader.token_indexers.tokens.max_length": 512, "model.text_field_embedder.token_embedders.tokens.max_length": 512}'
encoder = Encoder("declutr-base", overrides=overrides)
text = " ".join(["this is a very long string"] * 2048)
encoder(text)
```
I have not extensively tested this code, but it should run and be enough to get started. In particular, you should check out the `max_length` argument of `PretrainedTransformerIndexer` and `PretrainedTransformerEmbedder` for more details.
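Independent of AllenNLP, the chunk-and-pool idea behind those `max_length` settings can be sketched in plain NumPy. The chunk size, stride, and mean pooling here are illustrative assumptions, not necessarily what `Encoder` does internally:

```python
import numpy as np

def embed_long(tokens, embed_chunk, chunk_size=512, stride=512):
    """Split a long token list into chunks (overlapping if stride < chunk_size),
    embed each chunk, and mean-pool the chunk embeddings into one vector."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]
    chunk_embs = np.stack([embed_chunk(c) for c in chunks])
    return chunk_embs.mean(axis=0)

# Toy stand-in for a transformer: embeds a chunk as a constant vector
# whose value is the chunk length, so pooling is easy to verify by hand.
fake_embed = lambda chunk: np.full(4, float(len(chunk)))

vec = embed_long(list(range(1124)), fake_embed)  # chunks of 512, 512, 100
```

Swapping `fake_embed` for a real sentence encoder (and possibly a weighted mean that accounts for unequal chunk lengths) gives a simple striding baseline for approach 2.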
from declutr.
Closing; please feel free to re-open and file another issue if your questions are not answered.
from declutr.