
Comments (14)

vinid commented on June 11, 2024

Dataset size should not be a problem by itself. How large is your vocab?

from contextualized-topic-models.

josepius-clemson commented on June 11, 2024


vinid commented on June 11, 2024

I think that's the issue: the vocab is probably too large.

Also note that CTM works better with very small vocab sizes, around 2k terms.
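A rough back-of-the-envelope sketch of why vocabulary size matters more than raw dataset size. It assumes the bag-of-words matrix ends up held densely as 32-bit floats at some point, which this thread does not confirm. The function name and the corpus size are hypothetical and chosen only for illustration.

```python
def dense_bow_bytes(n_docs: int, vocab_size: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for a dense n_docs x vocab_size matrix (float32 by default)."""
    return n_docs * vocab_size * bytes_per_value


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return n_bytes / 1024 ** 3


# Hypothetical corpus: one million documents, three candidate vocabulary sizes.
n_docs = 1_000_000
for vocab_size in (2_000, 50_000, 500_000):
    size = to_gib(dense_bow_bytes(n_docs, vocab_size))
    print(f"{vocab_size:>7} terms: {size:,.1f} GiB")
```

Under this assumption the memory footprint grows linearly with the vocabulary, so cutting the vocab by a factor of 25 cuts the matrix by the same factor.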


josepius-clemson commented on June 11, 2024


vinid commented on June 11, 2024

Yes, you are right. To fix this, you can keep only the most frequent words/bigrams (you often do not need to keep the entire vocab). You can probably also lemmatize to restrict the vocabulary even further.
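A minimal sketch of the "keep only the most frequent words" suggestion, using only the Python standard library. The helper name restrict_vocabulary and the toy documents are hypothetical; it assumes the documents are already preprocessed and whitespace-tokenized, and it is not the preprocessing shipped with the library.

```python
from collections import Counter


def restrict_vocabulary(documents, max_vocab_size=2000):
    """Keep only the max_vocab_size most frequent tokens in each document."""
    # Count every whitespace-separated token across the whole corpus.
    counts = Counter(token for doc in documents for token in doc.split())
    keep = {token for token, _ in counts.most_common(max_vocab_size)}
    # Rebuild each document from the tokens that survived the cut.
    reduced = [
        " ".join(token for token in doc.split() if token in keep)
        for doc in documents
    ]
    return reduced, sorted(keep)


# Toy corpus for illustration only.
docs = ["the cat sat on the mat", "the dog sat on the log", "a cat and a dog"]
reduced_docs, vocab = restrict_vocabulary(docs, max_vocab_size=4)
print(len(vocab), reduced_docs)
```

The reduced documents can then be used to build the bag-of-words input, while the original, unreduced text is still the better input for the contextual embeddings.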


josepius-clemson commented on June 11, 2024


vinid commented on June 11, 2024

Could you try setting num_data_loader_workers=1 in CombinedTM or ZeroShotTM?


josepius-clemson commented on June 11, 2024

I just did, and I am getting a similar error:

0it [12:44, ?it/s]
0it [00:00, ?it/s]


OSError Traceback (most recent call last)
in
4 training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
5 ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12,num_epochs=25,num_data_loader_workers=1) # 50 topics
----> 6 ctm.fit(training_dataset) # run the model
7 ctm.get_topic_lists(15)

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose, patience, delta, n_samples)
272 # train epoch
273 s = datetime.datetime.now()
--> 274 sp, train_loss = self._train_epoch(train_loader)
275 samples_processed += sp
276 e = datetime.datetime.now()

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in _train_epoch(self, loader)
171 samples_processed = 0
172
--> 173 for batch_samples in loader:
174 # batch_size x vocab_size
175 X_bow = batch_samples['X_bow']

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __iter__(self)
433 return self._iterator
434 else:
--> 435 return self._get_iterator()
436
437 @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in _get_iterator(self)
379 else:
380 self.check_worker_number_rationality()
--> 381 return _MultiProcessingDataLoaderIter(self)
382
383 @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
1032 # before it starts, and __del__ tries to join but will get:
1033 # AssertionError: can only join a started process.
-> 1034 w.start()
1035 self._index_queues.append(index_queue)
1036 self._workers.append(w)

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/process.py in start(self)
119 'daemonic processes are not allowed to have children'
120 _cleanup()
--> 121 self._popen = self._Popen(self)
122 self._sentinel = self._popen.sentinel
123 # Avoid a refcycle if the target function holds an indirect

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
222 @staticmethod
223 def _Popen(process_obj):
--> 224 return _default_context.get_context().Process._Popen(process_obj)
225
226 class DefaultContext(BaseContext):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
275 def _Popen(process_obj):
276 from .popen_fork import Popen
--> 277 return Popen(process_obj)
278
279 class SpawnProcess(process.BaseProcess):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in __init__(self, process_obj)
17 self.returncode = None
18 self.finalizer = None
---> 19 self._launch(process_obj)
20
21 def duplicate_for_child(self, fd):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in _launch(self, process_obj)
64 parent_r, child_w = os.pipe()
65 child_r, parent_w = os.pipe()
---> 66 self.pid = os.fork()
67 if self.pid == 0:
68 try:

OSError: [Errno 12] Cannot allocate memory


vinid commented on June 11, 2024

And this happens only with the 7GB dataset, am I right?


josepius-clemson commented on June 11, 2024


vinid commented on June 11, 2024

I am not sure yet what could be causing the problem, but I'll look into it.


josepius-clemson commented on June 11, 2024

Hi, it worked on the large dataset when I used SentenceTransformer("bert-base-nli-mean-tokens") to create the contextual embeddings. I hope it is fine to use this model for building the training dataset. Please confirm.


vinid commented on June 11, 2024


josepius-clemson commented on June 11, 2024

