Contextualized Topic Models version: 2.4.2; Python version: 3.9

How to work with Large dataset? (contextualized-topic-models issue #129, closed)

josepius-clemson commented on June 11, 2024

How to work with Large dataset?

from contextualized-topic-models.

Comments (14)

vinid commented on June 11, 2024

Dataset size should not be a problem by itself. How large is your vocab?


josepius-clemson commented on June 11, 2024

Vocab size: 1976192

On Sat, Mar 4, 2023 at 10:24 PM Federico Bianchi wrote:
> dataset size should not be a problem by itself, how large is your vocab?

-- Best regards, Jose Pius Nedumkallel PhD Candidate, Department of Management, Wilbur O. and Ann Powers College of Business, Clemson University, SC, USA


vinid commented on June 11, 2024

I think that's the issue, the vocab is probably too large.

Also note that CTM works better with very small vocab sizes, around 2k.
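
For a rough sense of scale, here is a back-of-the-envelope sketch of how much memory one dense bag-of-words batch needs (the training loop treats the batch as `batch_size x vocab_size`). The batch size of 64 and 4-byte floats are assumptions for illustration, not values confirmed in this thread, and `dense_batch_bytes` is a made-up helper name.

```python
def dense_batch_bytes(batch_size: int, vocab_size: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one dense batch_size x vocab_size matrix."""
    return batch_size * vocab_size * bytes_per_value


# Compare the vocab size reported above with the ~2k suggested here.
# batch_size=64 and float32 (4 bytes) are assumed, not taken from the thread.
for vocab_size in (1_976_192, 2_000):
    mib = dense_batch_bytes(64, vocab_size) / 2**20
    print(f"vocab={vocab_size:>9,}: {mib:,.2f} MiB per batch")
```

The same ratio applies to any model layer whose width equals the vocab size, so shrinking the vocab helps on several fronts at once.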


josepius-clemson commented on June 11, 2024

I suppose a large dataset will have a large vocab too. Please correct me if I am wrong.


vinid commented on June 11, 2024

Yes, you are right. To fix this, you can keep only the most frequent words/bigrams (you often do not need to keep the entire vocab). You can probably also lemmatize to restrict the vocabulary even more.
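
A minimal, dependency-free sketch of the "keep only the most frequent words/bigrams" idea follows. The helper `top_k_vocab` and the toy documents are made up for illustration, and whitespace tokenization is a simplifying assumption. With scikit-learn, the `max_features` argument of `CountVectorizer` caps the vocabulary in a similar way.

```python
from collections import Counter


def top_k_vocab(docs, k=2000, ngram=1):
    """Return the k most frequent n-grams over whitespace-tokenized documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # Slide a window of length `ngram` over the tokens of this document.
        counts.update(
            " ".join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)
        )
    return [term for term, _ in counts.most_common(k)]


docs = [
    "topic models need a small vocabulary",
    "a small vocabulary keeps topic models fast",
    "rare words add little to topic models",
]
unigram_vocab = top_k_vocab(docs, k=3)          # capped unigram vocabulary
bigram_vocab = top_k_vocab(docs, k=1, ngram=2)  # most frequent bigram only
```

Lemmatizing before counting would merge inflected forms and shrink the vocabulary further; that step is left out here.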


josepius-clemson commented on June 11, 2024

I did text pre-processing as below and got len(vocab) as 1057, but I am still getting the memory error:

```
vectorizer = CountVectorizer(ngram_range=(2,2),min_df=900,max_df=0.50) #from sklearn
```

Error:

```
0it [00:00, ?it/s]

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-18-777c581685e6> in <module>
      4 training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
      5 ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12,num_epochs=25) # 50 topics
----> 6 ctm.fit(training_dataset) # run the model
      7 ctm.get_topic_lists(15)

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose, patience, delta, n_samples)
    272             # train epoch
    273             s = datetime.datetime.now()
--> 274             sp, train_loss = self._train_epoch(train_loader)
    275             samples_processed += sp
    276             e = datetime.datetime.now()

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in _train_epoch(self, loader)
    171         samples_processed = 0
    172
--> 173         for batch_samples in loader:
    174             # batch_size x vocab_size
    175             X_bow = batch_samples['X_bow']

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __iter__(self)
    433             return self._iterator
    434         else:
--> 435             return self._get_iterator()
    436
    437     @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in _get_iterator(self)
    379         else:
    380             self.check_worker_number_rationality()
--> 381             return _MultiProcessingDataLoaderIter(self)
    382
    383     @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
   1032             #     before it starts, and __del__ tries to join but will get:
   1033             #     AssertionError: can only join a started process.
-> 1034             w.start()
   1035             self._index_queues.append(index_queue)
   1036             self._workers.append(w)

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/process.py in start(self)
    119                'daemonic processes are not allowed to have children'
    120         _cleanup()
--> 121         self._popen = self._Popen(self)
    122         self._sentinel = self._popen.sentinel
    123         # Avoid a refcycle if the target function holds an indirect

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
    222     @staticmethod
    223     def _Popen(process_obj):
--> 224         return _default_context.get_context().Process._Popen(process_obj)
    225
    226 class DefaultContext(BaseContext):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
    275         def _Popen(process_obj):
    276             from .popen_fork import Popen
--> 277             return Popen(process_obj)
    278
    279     class SpawnProcess(process.BaseProcess):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in __init__(self, process_obj)
     17         self.returncode = None
     18         self.finalizer = None
---> 19         self._launch(process_obj)
     20
     21     def duplicate_for_child(self, fd):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in _launch(self, process_obj)
     64         parent_r, child_w = os.pipe()
     65         child_r, parent_w = os.pipe()
---> 66         self.pid = os.fork()
     67         if self.pid == 0:
     68             try:

OSError: [Errno 12] Cannot allocate memory
```


vinid commented on June 11, 2024

Could you try setting num_data_loader_workers=1 in CombinedTM or ZeroShotTM?


josepius-clemson commented on June 11, 2024

I just did, and I am getting a similar error:

0it [12:44, ?it/s]
0it [00:00, ?it/s]

OSError Traceback (most recent call last)
in
4 training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
5 ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12,num_epochs=25,num_data_loader_workers=1) # 50 topics
----> 6 ctm.fit(training_dataset) # run the model
7 ctm.get_topic_lists(15)

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose, patience, delta, n_samples)
272 # train epoch
273 s = datetime.datetime.now()
--> 274 sp, train_loss = self._train_epoch(train_loader)
275 samples_processed += sp
276 e = datetime.datetime.now()

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in _train_epoch(self, loader)
171 samples_processed = 0
172
--> 173 for batch_samples in loader:
174 # batch_size x vocab_size
175 X_bow = batch_samples['X_bow']

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __iter__(self)
433 return self._iterator
434 else:
--> 435 return self._get_iterator()
436
437 @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in _get_iterator(self)
379 else:
380 self.check_worker_number_rationality()
--> 381 return _MultiProcessingDataLoaderIter(self)
382
383 @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
1032 # before it starts, and __del__ tries to join but will get:
1033 # AssertionError: can only join a started process.
-> 1034 w.start()
1035 self._index_queues.append(index_queue)
1036 self._workers.append(w)

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/process.py in start(self)
119 'daemonic processes are not allowed to have children'
120 _cleanup()
--> 121 self._popen = self._Popen(self)
122 self._sentinel = self._popen.sentinel
123 # Avoid a refcycle if the target function holds an indirect

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
222 @staticmethod
223 def _Popen(process_obj):
--> 224 return _default_context.get_context().Process._Popen(process_obj)
225
226 class DefaultContext(BaseContext):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
275 def _Popen(process_obj):
276 from .popen_fork import Popen
--> 277 return Popen(process_obj)
278
279 class SpawnProcess(process.BaseProcess):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in __init__(self, process_obj)
17 self.returncode = None
18 self.finalizer = None
---> 19 self._launch(process_obj)
20
21 def duplicate_for_child(self, fd):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in _launch(self, process_obj)
64 parent_r, child_w = os.pipe()
65 child_r, parent_w = os.pipe()
---> 66 self.pid = os.fork()
67 if self.pid == 0:
68 try:

OSError: [Errno 12] Cannot allocate memory
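
Note that the last frame in the traceback is `os.fork()`: the failure happens while the DataLoader is starting a worker process, before any batch reaches the model. As a small aid for reading such errors, this standard-library sketch decodes an errno value; the helper name `describe_errno` is made up for illustration, and the message text depends on the operating system.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and the OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Errno 12 is the value reported by the OSError in the traceback above.
print(describe_errno(12))
```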


vinid commented on June 11, 2024

And this happens only with the 7GB dataset, am I right?


josepius-clemson commented on June 11, 2024

Yes, you are right


vinid commented on June 11, 2024

I am currently not sure about what could be causing the problem, but I'll look into this


josepius-clemson commented on June 11, 2024

Hi, it worked on the large dataset when I used SentenceTransformer("bert-base-nli-mean-tokens") to create the contextual embeddings. I hope it is fine to use it for building the training dataset. Please confirm.


vinid commented on June 11, 2024

Yea, it should work (even if it's not the best one)


josepius-clemson commented on June 11, 2024

OK. Kindly let me know if you find a solution for the best one.
