I'm trying to prepare my own model with 20+ classes (need to add more to create own

java.lang.OutOfMemoryError while preparing model from own datasets.

datumbox commented on July 22, 2024


Comments (1)

datumbox commented on July 22, 2024

Hi,

The code in question seems to be from AbstractNaiveBayes.java:269:

        streamExecutor.forEach(StreamMethods.stream(trainingData.getXDataTypes().keySet().stream(), isParallelized()), feature -> {
            for(Object theClass : classesSet) {
                List<Object> featureClassTuple = Arrays.asList(feature, theClass);
                logLikelihoods.put(featureClassTuple, 0.0); //the key is unique across threads and the map is concurrent
            }
        });

This snippet creates a tuple for every keyword in your vocabulary and every class in your dataset, so with W words and C classes it produces W*C items. Normally the implementation should handle this just fine: I have trained far larger models, with hundreds of classes and several thousand words, without a problem. It is very hard for me to spot the error from the stack trace. Please note that you have not pasted the entire stack trace here, so I can't be sure which memory allocation statement fails.
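To make the W*C growth concrete, here is a minimal sketch of the arithmetic. The class name, method name, and both sizes are hypothetical and chosen only for illustration; they are not taken from the framework or from your dataset.

```java
// Illustrative only: counts the (feature, class) tuples the loop above would create.
final class LikelihoodTableSize {

    /** One tuple per word per class, i.e. W * C. Throws ArithmeticException on overflow. */
    static long tupleCount(long vocabularySize, long classCount) {
        return Math.multiplyExact(vocabularySize, classCount);
    }

    public static void main(String[] args) {
        long words = 50_000;   // hypothetical vocabulary size (W)
        long classes = 25;     // hypothetical number of classes (C)
        System.out.println(tupleCount(words, classes)); // prints 1250000
    }
}
```

If the tokenizer emits whole sentences instead of words, W becomes roughly the number of distinct sentences in the corpus, so the same product can grow far beyond what a word-level vocabulary would give.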

I'll make a big assumption here: that you are trying to extend the classifier with a language (such as Chinese or Thai) that does not use spaces to separate words. Everything I propose below relies on that assumption. For such a language, the keywords extracted by the default tokenizer are actually whole sentences, so they are extremely long (hence the memory error). To parse such languages you need to write a custom tokenizer (have a look here for an example) that splits the words correctly (possibly by character).

One issue you will face is that the high-level TextClassifier does not allow different tokenizers for different languages. This should not be a problem, though, as you can always parse those languages separately and append the items to the Dataset object. It will not be as convenient, but it will do the trick just fine. An easier alternative would be to preprocess the text files of those particular languages to introduce spaces between the characters. Then you just train the classifier as usual using the TextClassifier class.

Hope this helps.
