Comments (1)
Hi,
The code in question seems to be from AbstractNaiveBayes.java:269:
streamExecutor.forEach(StreamMethods.stream(trainingData.getXDataTypes().keySet().stream(), isParallelized()), feature -> {
    for(Object theClass : classesSet) {
        List<Object> featureClassTuple = Arrays.asList(feature, theClass);
        logLikelihoods.put(featureClassTuple, 0.0); //the key is unique across threads and the map is concurrent
    }
});
This snippet creates a tuple for every keyword in your vocabulary and every class in your dataset, so with W words and C classes it stores W*C items. Normally the implementation should handle this just fine: I have trained much larger models, with hundreds of classes and several thousand words, without a problem. It is very hard for me to spot the error from the stacktrace. Please note that you have not pasted the entire stacktrace here, so I can't be sure which memory allocation statement fails.
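To make the W*C growth concrete, here is a minimal standalone sketch of the same initialization pattern. It is not the framework's code: the class TupleCountSketch and the method initLogLikelihoods are hypothetical names, and plain Java collections stand in for the framework's stream executor and storage-backed maps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; these names are not part of Datumbox.
final class TupleCountSketch {

    // Creates one (feature, class) key per combination, each initialized to 0.0.
    static Map<List<Object>, Double> initLogLikelihoods(Set<?> features, Set<?> classes) {
        Map<List<Object>, Double> logLikelihoods = new ConcurrentHashMap<>();
        for (Object feature : features) {
            for (Object theClass : classes) {
                logLikelihoods.put(Arrays.asList(feature, theClass), 0.0);
            }
        }
        return logLikelihoods;
    }

    public static void main(String[] args) {
        Set<String> features = Set.of("good", "bad", "movie"); // W = 3
        Set<String> classes = Set.of("positive", "negative");  // C = 2
        // The map ends up with W * C = 6 entries.
        System.out.println(initLogLikelihoods(features, classes).size());
    }
}
```

The number of entries is not the only cost: every key holds a reference to its feature string, so if the extracted "keywords" are very long, the map gets correspondingly heavier.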
I'll make a big assumption here: that you are trying to extend the classifier to a language (such as Chinese or Thai) that does not use spaces to separate words. Everything I propose in this paragraph relies on that assumption. If you are dealing with such a language, the keywords extracted by the default tokenizer are actually whole sentences and they are extremely long (hence the memory error). To parse such languages you need to write a custom tokenizer (have a look here for an example) that splits the words correctly (possibly by character). One issue you will face is that the high-level TextClassifier does not allow you to use different tokenizers for different languages. This should not be a problem though, as you can always parse those languages separately and append the items to the Dataset object directly. It will not be as convenient, but it will do the trick just fine. An easier alternative would be to preprocess the text files of those particular languages to introduce spaces between the characters. Then you just train the classifier as usual using the TextClassifier class.
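The preprocessing route could look roughly like the sketch below. It is plain Java with no Datumbox calls, and the class CharSpacer and the method spaceOut are hypothetical names. It assumes that one Unicode code point is an acceptable token, which is a simplification: scripts that use combining marks, such as Thai, may need smarter segmentation.

```java
// Hypothetical helper; not part of Datumbox.
final class CharSpacer {

    // Returns the text with a single space between consecutive non-whitespace
    // code points. Existing whitespace is dropped so spaces are not doubled.
    static String spaceOut(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 2);
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.appendCodePoint(cp); // keeps surrogate pairs intact
            });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u4F60\u597D" is a two-character Chinese string, written with escapes
        // so the source file stays ASCII. The output is the same two characters
        // separated by one space.
        System.out.println(spaceOut("\u4F60\u597D"));
    }
}
```

You would apply this to each line of the training files for the affected languages only, write the result to new files, and point the trainer at those. Running it over text that already uses spaces would split every word into single letters.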
Hope this helps.
from datumbox-framework.