Comments (5)
Oh, I'd love to have a pull request in your library, but I suppose I would have to mess around with the unit tests and I'd rather not. I'm not sure about you having to do double the work cleaning up what I tried :-D
On the other hand, I know little about NLP, so no... I wasn't aware of either of the other two things you said (the order of execution for feature selection -> standardizer, and the Naive Bayes method not working with a standardizer).
Btw, awesome work you've done here; I can't believe I haven't told you this yet.
from datumbox-framework.
Slight correction in the post above.
Also, I have noticed something in your code. You have various `double d = it.next()` assignments when using iterators from `common.dataobjects.AbstractDataStructureCollection.iteratorDouble()`, but that method can return a null value through `TypeInference.toDouble()`, and here lies the problem (I think): you can't assign null to a `double` primitive, because unboxing it throws an exception. You can, however, assign it to a `Double` object. Is this intentional?
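To make the unboxing point concrete, here is a minimal, self-contained sketch. It does not use any Datumbox classes; `NullUnboxingDemo` and its methods are hypothetical names, and a plain `List<Double>` stands in for the collection behind `iteratorDouble()`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class, not part of Datumbox.
public class NullUnboxingDemo {

    // Uses a primitive local: a null element is auto-unboxed and
    // triggers a NullPointerException.
    public static double sumWithPrimitive(List<Double> values) {
        double sum = 0.0;
        Iterator<Double> it = values.iterator();
        while (it.hasNext()) {
            double v = it.next(); // implicit it.next().doubleValue()
            sum += v;
        }
        return sum;
    }

    // Uses a boxed local and skips nulls, treating them as missing values.
    public static double sumSkippingNulls(List<Double> values) {
        double sum = 0.0;
        Iterator<Double> it = values.iterator();
        while (it.hasNext()) {
            Double v = it.next();
            if (v != null) {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(1.0, null, 2.0);
        System.out.println(sumSkippingNulls(values)); // prints 3.0
        try {
            sumWithPrimitive(values);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException while unboxing null");
        }
    }
}
```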
That being said, in `Descriptives.variance()` I've made this change and "it seems" to work for me:
    Iterator<Double> it = flatDataCollection.iteratorDouble();
    while (it.hasNext()) {
        Double v = it.next();
        if (v != null) {
            mean += v;
            squaredMean += v * v;
        }
    }
instead of
    Iterator<Double> it = flatDataCollection.iteratorDouble();
    while (it.hasNext()) {
        double v = it.next();
        mean += v;
        squaredMean += v * v;
    }
Of course, if I'm right (and that's a big IF), similar changes should be made in other parts of the code. What do you think? Is it this simple, or is there something I'm missing again?
from datumbox-framework.
@jluis2k10 Thanks a lot for the detailed info.
You are right. As we discussed, the Descriptives can be made to handle null values. I'll have a look soon.
from datumbox-framework.
@jluis2k10 I've checked out all the info you provided and I agree with your comments.
First of all, we should indeed make the Descriptives class ignore nulls and consider them missing values. Currently half of the methods ignore them and half of them fail. This needs to be fixed. As we discussed on your original PR, all counters need to be treated accordingly (replace the use of size() with a private static method that returns the number of non-null values in the collection). I can patch it myself quickly, but since you spotted it, you can contribute it if you want. Just let me know if you plan to send a new PR. :)
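A rough sketch of the counter idea, assuming nulls mean "missing": every statistic divides by the number of non-null values rather than by size(). This is not the actual Descriptives code; `NullAwareStats`, `countNonNull` and `sampleVariance` are hypothetical names, and the use of the sample (n - 1) variance is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical sketch, not the Datumbox Descriptives class.
public class NullAwareStats {

    // Replacement for size(): counts only the non-null entries.
    public static int countNonNull(Collection<Double> values) {
        int n = 0;
        for (Double v : values) {
            if (v != null) {
                n++;
            }
        }
        return n;
    }

    // Mean over the non-null entries.
    public static double mean(Collection<Double> values) {
        int n = countNonNull(values);
        if (n == 0) {
            throw new IllegalArgumentException("No non-null values.");
        }
        double sum = 0.0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
            }
        }
        return sum / n;
    }

    // Sample variance (n - 1 denominator) over the non-null entries.
    public static double sampleVariance(Collection<Double> values) {
        int n = countNonNull(values);
        if (n < 2) {
            throw new IllegalArgumentException("At least two non-null values are required.");
        }
        double m = mean(values);
        double squaredDeviations = 0.0;
        for (Double v : values) {
            if (v != null) {
                squaredDeviations += (v - m) * (v - m);
            }
        }
        return squaredDeviations / (n - 1);
    }

    public static void main(String[] args) {
        Collection<Double> values = Arrays.asList(1.0, null, 2.0, 3.0);
        System.out.println(countNonNull(values));   // prints 3
        System.out.println(sampleVariance(values)); // prints 1.0
    }
}
```

With size() as the denominator, the null entry in the example would be counted as a fourth observation and skew both the mean and the variance.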
Now, concerning the exception. Normally, when you try to standardize a column with a single value, you must get an exception. So that is not something that needs to be fixed (hence my original reluctance to fix this). Your problem is caused by a different bug. You see, the TextClassifier is supposed to be a quick and dirty pipeline for those who don't want to have full control of the steps. Originally the TextClassifier was a separate class from the Modeler, but I recently rewrote it to inherit from it. The problem is that the Modeler first standardizes the data and then performs feature selection. In NLP applications, though, sparsity is a de facto problem, so one should first do feature selection to remove rare words. If feature selection had happened first, the standardizer would not fail. This needs to be fixed by me once the Descriptives are in place.
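The ordering issue can be illustrated with a toy sketch. Nothing here is Datumbox API: `PipelineOrderSketch`, `removeRareFeatures` and `standardize` are hypothetical, each feature is modelled as the list of values observed for it (a rare word has a single observation), and the z-score with a sample standard deviation is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the step ordering, not the Datumbox Modeler.
public class PipelineOrderSketch {

    // Feature selection: keep only features observed in at least minDocs documents.
    public static Map<String, List<Double>> removeRareFeatures(
            Map<String, List<Double>> columns, int minDocs) {
        Map<String, List<Double>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> e : columns.entrySet()) {
            if (e.getValue().size() >= minDocs) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    // Z-score standardization: fails for columns with a single or constant value.
    public static Map<String, List<Double>> standardize(Map<String, List<Double>> columns) {
        Map<String, List<Double>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> e : columns.entrySet()) {
            List<Double> values = e.getValue();
            int n = values.size();
            if (n < 2) {
                throw new IllegalStateException(
                        "Cannot standardize '" + e.getKey() + "': fewer than two values.");
            }
            double mean = 0.0;
            for (double v : values) {
                mean += v;
            }
            mean /= n;
            double squaredDeviations = 0.0;
            for (double v : values) {
                squaredDeviations += (v - mean) * (v - mean);
            }
            double std = Math.sqrt(squaredDeviations / (n - 1));
            if (std == 0.0) {
                throw new IllegalStateException(
                        "Cannot standardize '" + e.getKey() + "': constant column.");
            }
            List<Double> scaled = new ArrayList<>();
            for (double v : values) {
                scaled.add((v - mean) / std);
            }
            result.put(e.getKey(), scaled);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> columns = new LinkedHashMap<>();
        columns.put("good", Arrays.asList(1.0, 3.0, 2.0));
        columns.put("serendipity", Arrays.asList(1.0)); // rare word, one observation

        try {
            standardize(columns); // standardizing first hits the rare word and fails
        } catch (IllegalStateException ex) {
            System.out.println(ex.getMessage());
        }

        // Feature selection first removes the rare word, so standardization succeeds.
        Map<String, List<Double>> scaled = standardize(removeRareFeatures(columns, 2));
        System.out.println(scaled.keySet()); // prints [good]
    }
}
```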
Last but not least, I assume that the snippet that you sent me is for illustration purposes only. I say this because all the NaiveBayes models supported at the moment work with word counts, so you should not use a standardizer.
from datumbox-framework.
I patched the bug and made the changes that we discussed. If you download the latest 0.8.1-SNAPSHOT jar from the repo you should be able to check it out.
Note that the latest update completely removes the unnecessary Categorical Encoding step from the TextClassifier pipeline (already taken care of by the Text Extractor), so I would advise you to remove the setCategoricalEncoderTrainingParameters() method call from your code, as it has no effect on the class.
from datumbox-framework.