
Comments (8)

JohnGiorgi commented on July 23, 2024

The only changes you should have to make are:

  1. Update max_length if you want to train on longer contexts
  2. Update transformer_model to a HF model that supports longer input sequences (e.g. allenai/longformer-base-4096).

Of course I could be forgetting something. Feel free to follow up here if these changes cause an error.
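
For concreteness, a rough sketch of what those two changes could look like when launching training programmatically is below. The config path and exact override keys here are placeholders; they depend on how your jsonnet config is laid out, so treat this as a starting point rather than something guaranteed to run as-is.

from allennlp.commands.train import train_model_from_file
from allennlp.common.util import import_module_and_submodules

# Register declutr's dataset reader and model with AllenNLP.
import_module_and_submodules("declutr")

# Placeholder keys: raise max_length and point every component that loads
# the transformer at a long-input model.
overrides = (
    '{"dataset_reader.tokenizer.max_length": 4096,'
    ' "dataset_reader.tokenizer.model_name": "allenai/longformer-base-4096",'
    ' "dataset_reader.token_indexers.tokens.model_name": "allenai/longformer-base-4096",'
    ' "model.text_field_embedder.token_embedders.tokens.model_name": "allenai/longformer-base-4096"}'
)

train_model_from_file(
    "declutr.jsonnet",  # placeholder: path to your training config
    "output",           # serialization directory
    overrides=overrides,
)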


kingafy commented on July 23, 2024

I tried integrating allenai/longformer-large-4096, but I am running into the following issue:

FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/usr/local/lib/python3.7/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 80, in __iter__
    for instance in self._instance_generator(self._file_path):
  File "/usr/local/lib/python3.7/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 446, in _instance_iterator
    yield from self._multi_worker_islice(self._read(file_path), ensure_lazy=True)
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/declutr/dataset_reader.py", line 124, in _read
    file_path = cached_path(file_path)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/common/file_utils.py", line 175, in cached_path
    raise FileNotFoundError(f"file {url_or_filename} not found")
FileNotFoundError: file None not found

Also, when I downloaded the model locally in Colab and referenced its path, I was able to load the model, but training then stops with this error:

 File "/usr/local/lib/python3.7/dist-packages/allennlp/modules/seq2vec_encoders/boe_encoder.py", line 43, in forward
    tokens = tokens * mask.unsqueeze(-1)
RuntimeError: The size of tensor a (512) must match the size of tensor b (503) at non-singleton dimension 1


JohnGiorgi commented on July 23, 2024

The first error looks like one of your data paths (train_data_path or validation_data_path) is wrong or missing: the message "file None not found" suggests the path was never set.

As for the second, I am not sure. At some point AllenNLP tries to multiply token embeddings of size 512 with a mask of size 503. You may have to look into the specifics of Longformer to see how to use it properly, or try another model that accepts long input sequences.
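
Incidentally, the mismatch is easy to reproduce with plain PyTorch, independent of DeCLUTR; it suggests the embeddings were truncated or padded to 512 positions while the mask was built for a 503-token sequence:

import torch

# Token embeddings padded/truncated to 512 positions...
tokens = torch.randn(1, 512, 768)
# ...but a mask built for a 503-token sequence.
mask = torch.ones(1, 503)

# Broadcasting fails at dimension 1, exactly as in the traceback above:
# RuntimeError: The size of tensor a (512) must match the size of tensor b
# (503) at non-singleton dimension 1
tokens * mask.unsqueeze(-1)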


kingafy commented on July 23, 2024

Unfortunately, it is not clear to me how to train longer-sequence models using DeCLUTR on AllenNLP. Is there a way I can use declutr-base on longer context sequences with some striding parameter, or any other way to handle long sequences for semantic tasks?


JohnGiorgi commented on July 23, 2024

Could you provide more detail as to exactly what you are trying to do? Are you trying to re-train the model with a longer max_length? Or are you trying to use a trained model on longer sequences?


kingafy commented on July 23, 2024

I want to handle longer sequences for semantic tasks, so there are two approaches:

  1. Apply the DeCLUTR objective to a long-sequence model so that it supports longer sequences.
  2. Use DeCLUTR on roberta-base or roberta-large, but with some sort of striding logic so that longer input sequences at inference time do not create bottlenecks.

Hope this explains the task at hand.


JohnGiorgi commented on July 23, 2024

Training on longer sequences is going to be tricky as you would need to collect even longer training documents.

I would go with approach 2. AllenNLP has support for chunking up some text into blocks of max_length, embedding each, and then concatenating the embeddings. The general approach would be:

  1. Set the max_length argument of the tokenizer to the maximum length of documents you want to embed.
  2. Set the max_length argument of the PretrainedTransformerIndexer to 512 (the maximum input size of our pretrained transformer).
  3. Set the max_length argument of the PretrainedTransformerEmbedder to 512 (the maximum input size of our pretrained transformer).

The setup would go something like:

from declutr import Encoder

# The tokenizer keeps up to 1024 tokens; the indexer and embedder then fold
# the input into 512-token blocks (the transformer's maximum input size),
# embed each block, and stitch the results back together.
overrides = '{"dataset_reader.tokenizer.max_length": 1024, "dataset_reader.token_indexers.tokens.max_length": 512, "model.text_field_embedder.token_embedders.tokens.max_length": 512}'

encoder = Encoder("declutr-base", overrides=overrides)

# A toy "document" far longer than the transformer's 512-token limit.
text = " ".join(["this is a very long string"] * 2048)
encoder(text)

I have not extensively tested this code, but it should run and be enough to get started. In particular, you should check out the max_length argument of PretrainedTransformerIndexer and PretrainedTransformerEmbedder for more details.
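
As a quick sanity check (I have not verified the exact return type, so treat this as an assumption), the call should return one embedding per input string:

embeddings = encoder(text)
# Expected: one vector per input, with the dimensionality of the underlying
# transformer (e.g. 768 for declutr-base, which is built on roberta-base).
print(embeddings.shape)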


JohnGiorgi commented on July 23, 2024

Closing. Please feel free to re-open or file another issue if your questions are not answered.

