Comments (14)
Dataset size should not be a problem by itself. How large is your vocab?
from contextualized-topic-models.
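For reference, a quick way to measure the working vocabulary size before training (a minimal sketch; `docs` is a hypothetical list of preprocessed, lowercased documents, not a variable from this thread):

```python
# `docs` is a hypothetical stand-in for the preprocessed corpus.
docs = [
    "the quick brown fox",
    "the lazy dog sleeps",
]

# Distinct whitespace-separated tokens across all documents --
# this is the number of dimensions the bag-of-words side must model.
vocab = {token for doc in docs for token in doc.split()}
print(len(vocab))
```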
I think that's the issue: the vocab is probably too large.
Also note that CTM works better with very small vocabularies, on the order of 2k words.
Yes, you are right. To fix this, you can keep only the most frequent words/bigrams (you often do not need to keep the entire vocab). You can probably also lemmatize to restrict the vocabulary even more.
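A minimal, library-free sketch of the frequency pruning suggested above, using only the standard library (a `CountVectorizer` with `max_features` would do the same job; the function name and batch of example documents are my own, not from the thread):

```python
from collections import Counter

def prune_vocab(docs, max_size=2000):
    """Keep only the `max_size` most frequent tokens across `docs`."""
    counts = Counter(token for doc in docs for token in doc.split())
    keep = {token for token, _ in counts.most_common(max_size)}
    # Drop out-of-vocabulary tokens from every document.
    return [" ".join(t for t in doc.split() if t in keep) for doc in docs]

docs = ["the cat sat on the mat", "the cat ran", "a dog ran"]
print(prune_vocab(docs, max_size=3))
```

Applying this before building the bag-of-words representation keeps `bow_size` small, which is what the 2k-vocab advice above is about.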
Could you try setting num_data_loader_workers=1 in CombinedTM or ZeroShotTM?
I just did, and I am getting a similar error:
0it [12:44, ?it/s]
0it [00:00, ?it/s]
OSError Traceback (most recent call last)
<ipython-input> in <module>
4 training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
5 ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12,num_epochs=25,num_data_loader_workers=1) # 50 topics
----> 6 ctm.fit(training_dataset) # run the model
7 ctm.get_topic_lists(15)
~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose, patience, delta, n_samples)
272 # train epoch
273 s = datetime.datetime.now()
--> 274 sp, train_loss = self._train_epoch(train_loader)
275 samples_processed += sp
276 e = datetime.datetime.now()
~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in _train_epoch(self, loader)
171 samples_processed = 0
172
--> 173 for batch_samples in loader:
174 # batch_size x vocab_size
175 X_bow = batch_samples['X_bow']
~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __iter__(self)
433 return self._iterator
434 else:
--> 435 return self._get_iterator()
436
437 @property
~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in _get_iterator(self)
379 else:
380 self.check_worker_number_rationality()
--> 381 return _MultiProcessingDataLoaderIter(self)
382
383 @property
~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
1032 # before it starts, and del tries to join but will get:
1033 # AssertionError: can only join a started process.
-> 1034 w.start()
1035 self._index_queues.append(index_queue)
1036 self._workers.append(w)
/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/process.py in start(self)
119 'daemonic processes are not allowed to have children'
120 _cleanup()
--> 121 self._popen = self._Popen(self)
122 self._sentinel = self._popen.sentinel
123 # Avoid a refcycle if the target function holds an indirect
/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
222 @staticmethod
223 def _Popen(process_obj):
--> 224 return _default_context.get_context().Process._Popen(process_obj)
225
226 class DefaultContext(BaseContext):
/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
275 def _Popen(process_obj):
276 from .popen_fork import Popen
--> 277 return Popen(process_obj)
278
279 class SpawnProcess(process.BaseProcess):
/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in __init__(self, process_obj)
17 self.returncode = None
18 self.finalizer = None
---> 19 self._launch(process_obj)
20
21 def duplicate_for_child(self, fd):
/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in _launch(self, process_obj)
64 parent_r, child_w = os.pipe()
65 child_r, parent_w = os.pipe()
---> 66 self.pid = os.fork()
67 if self.pid == 0:
68 try:
OSError: [Errno 12] Cannot allocate memory
And this happens only with the 7GB dataset, am I right?
I am currently not sure what could be causing the problem, but I'll look into it.
Hi, it worked on the large dataset when I tried SentenceTransformer("bert-base-nli-mean-tokens") for creating the contextual embeddings. I hope it's fine to use it for building the training dataset. Please confirm.
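For anyone landing here, a sketch of how the contextual embeddings above could be built with sentence-transformers, encoding the corpus in batches so a 7GB dataset never has to pass through the encoder at once (the helper names and the batch size are my own assumptions, not from the library or this thread):

```python
def chunked(seq, size):
    """Yield successive `size`-length slices of `seq`."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def embed_corpus(texts, batch_size=10000):
    """Encode `texts` batch by batch and stack the results.

    Requires sentence-transformers and numpy; imports are deferred
    so this module can be loaded without them installed.
    """
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("bert-base-nli-mean-tokens")
    parts = [model.encode(list(batch)) for batch in chunked(texts, batch_size)]
    return np.vstack(parts)
```

The stacked array can then be passed as the `train_contextualized_embeddings` argument shown in the traceback above.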