medallia / word2vecjava

Word2Vec Java Port

License: MIT License

word2vecjava's Introduction

Word2vecJava

This is a port of the open source C implementation of word2vec (https://code.google.com/p/word2vec/). You can browse or contribute to the repository on GitHub. Alternatively, you can pull it from the central Maven repositories:

<dependency>
  <groupId>com.medallia.word2vec</groupId>
  <artifactId>Word2VecJava</artifactId>
  <version>0.10.3</version>
</dependency>

For more background information about word2vec and neural network training for the vector representation of words, please see the following papers.

For a comprehensive explanation of the training process (the gradient descent formula calculation in the backpropagation training), please see:

http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf

Note that this isn't a completely faithful rewrite, specifically:

When building the vocabulary from the training file:

The original version does a reduction step when learning the vocabulary from the file when the vocab size hits 21 million words, removing any words that do not meet the minimum frequency threshold. This Java port has no such reduction step.
The original version injects a token into the vocabulary (with a word count of 0) as a substitute for newlines in the input file. This Java port's vocabulary excludes the token.
The original version does a quicksort which is not stable, so vocabulary terms with the same frequency may be ordered non-deterministically. The Java port does an explicit sort first by frequency, then by the token's lexicographical ordering.
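
As a rough illustration of the deterministic ordering described above, the following sketch sorts a token-count map by frequency and breaks ties lexicographically. The class and method names are hypothetical and not part of this library, and the descending direction of the frequency sort is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of the Word2VecJava API. */
final class VocabOrdering {
    /** Returns tokens sorted by descending count (assumed), ties broken lexicographically. */
    static List<String> sortVocab(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Higher counts first.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Equal counts fall back to the token's natural String ordering.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> sorted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            sorted.add(e.getKey());
        }
        return sorted;
    }
}
```

Because the comparator never treats two distinct tokens as equal, the result does not depend on the input order or on the stability of the sort.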

In partitioning the file for processing:

The original version assumes that sentences are delimited by newline characters and injects a sentence boundary per 1000 non-filtered tokens, i.e. tokens that are valid according to the vocabulary and not removed by the randomized sampling process. The Java port mimics this behavior for now ...
When the original version encounters an empty line in the input file, it re-processes the first word of the last non-empty line with a sentence length of 0 and updates the random value. The Java port omits this behavior.
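
The 1000-token sentence boundary can be pictured with the simplified sketch below. It only splits an already filtered token list into fixed-size chunks; the vocabulary check and the randomized sampling are left out, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the name and signature are not from Word2VecJava. */
final class SentenceChunker {
    /** Splits one token stream into consecutive chunks of at most maxLen tokens. */
    static List<List<String>> chunk(List<String> tokens, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start += maxLen) {
            int end = Math.min(start + maxLen, tokens.size());
            // Copy the view so each chunk is independent of the input list.
            chunks.add(new ArrayList<>(tokens.subList(start, end)));
        }
        return chunks;
    }
}
```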

In the sampling function:

The original C documentation indicates that the range should be between 0 and 1e-5, but the default value is 1e-3. This Java port retains that confusing information.
The random value generated for comparison to determine if a token should be filtered uses a float. This Java port uses double precision for twice the fun.
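
For orientation, here is a sketch of the down-sampling rule as it appears in the original C code: a token is kept when a random value in [0, 1) does not exceed the quantity computed below. The class and method names are hypothetical, capping the result at 1 is an addition for readability, and whether the Java port computes exactly this expression is an assumption.

```java
/** Hypothetical helper illustrating the C word2vec sub-sampling formula. */
final class SubSampling {
    /**
     * Keep probability for a token:
     * (sqrt(count / (sample * total)) + 1) * (sample * total) / count, capped at 1.
     */
    static double keepProbability(long count, double sample, long totalWords) {
        if (sample <= 0) {
            return 1.0; // sampling disabled, every token is kept
        }
        double threshold = sample * totalWords;
        double p = (Math.sqrt(count / threshold) + 1) * threshold / count;
        return Math.min(1.0, p);
    }
}
```

With a sample value of 1e-3 and one million training words, a token seen 100,000 times is kept roughly 11% of the time, while a token seen 10 times is always kept.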

In the distance function to find the nearest matches to a target query:

The original version includes an unnecessary normalization of the vector for the input query which may lead to tiny inaccuracies. This Java port forgoes this superfluous operation.
The original version has an O(n * k) algorithm for finding top matches and is hardcoded to 40 matches. This Java port uses Google's lovely com.google.common.collect.Ordering.greatestOf(java.util.Iterator, int) which is O(n + k log k) and takes in arbitrary k.
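
To show what selecting the top k of n scores looks like without pulling in Guava, the sketch below uses a bounded min-heap from the standard library. This is a different technique from Ordering.greatestOf: it runs in O(n log k) rather than O(n + k log k), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a top-k selection; not the library's implementation. */
final class TopK {
    /** Returns the k largest scores in descending order. */
    static List<Double> greatest(double[] scores, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap holding the best k scores seen so far.
        PriorityQueue<Double> heap = new PriorityQueue<>(k);
        for (double s : scores) {
            if (heap.size() < k) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll(); // drop the smallest of the current best k
                heap.add(s);
            }
        }
        List<Double> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder());
        return out;
    }
}
```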

Note: The k-means clustering option is excluded in the Java port

Please do not hesitate to peek at the source code. It should be readable, concise, and correct. Please feel free to reach out if it is not.

Building the Project

To verify that the project is building correctly, run

./gradlew build && ./gradlew test

It should run 7 tests without any error.

Note: this project requires Gradle 2.2+. If you are using an older version of Gradle, please upgrade it and run:

./gradlew clean test

to have a clean build and re-run the tests.

Contact

Andrew Ko ([email protected])

word2vecjava's Issues

How to continue training a model that has been trained before?

I know how to reload the model, but after loading the trained model I don't know how to train it further on new sentences. Could you please give me an example of how to achieve that?
Thanks!

Built with mvn but there are problems with pom.xml

I built the project with mvn for Eclipse successfully on my Mac. After importing the mvn project, there are two issues:

the org.apache libraries are missing;
pom.xml shows 13 errors.

Has anyone built this successfully and can share their experience?

Thanks.

Pengchu

Model load takes a very long time!

I am using the Word2VecModel.fromBinFile(modelFile) API to open the default word2vec model, GoogleNews-vectors-negative300.bin (https://code.google.com/p/word2vec/). The model is ~3.6 gigs. My program has been stuck loading this model for more than an hour now. Heap usage is at 11 gigs.

What is the expected load time for a model this size? Why is it taking 11 gigs for a 3.6 gig model?
Would converting the model to the Thrift format be a solution?

Appreciate your help!

Regards,
Sumithra

slf4j error

Hi everyone,
I am trying to compile the Word2VecExamples.java file and I get the error below.

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
at org.slf4j.LoggerFactory.getSingleton(LoggerFactory.java:223)
at org.slf4j.LoggerFactory.bind(LoggerFactory.java:120)
at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:111)
at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:269)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)
at org.apache.thrift.transport.TIOStreamTransport.<init>(TIOStreamTransport.java:38)
at org.apache.thrift.TSerializer.<init>(TSerializer.java:45)
at com.medallia.word2vec.util.ThriftUtils.serializeJson(ThriftUtils.java:16)
at com.medallia.word2vec.Word2VecExamples.demoWord(Word2VecExamples.java:73)
at com.medallia.word2vec.Word2VecExamples.main(Word2VecExamples.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.ClassNotFoundException: org.slf4j.impl.StaticLoggerBinder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 15 more

I downloaded the slf4j-simple 1.7 jar from the link below and added the dependency to the Gradle file:
http://mvnrepository.com/artifact/org.slf4j/slf4j-simple/1.7.13
Gradle file:
compile 'org.slf4j:slf4j-simple:1.7.13'

I recompiled Word2VecExamples.java and got another error, shown below. How can I resolve it?

SLF4J: The requested version 1.6.99 by your slf4j binding is not compatible with [1.5.5, 1.5.6, 1.5.7, 1.5.8]
SLF4J: See http://www.slf4j.org/codes.html#version_mismatch for further details.

Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1

medallia-word2vec shows an error when I run the searcher function:
[cosineDistance("money", "Honey")]
The error looks like this:
Exception in thread "main" com.medallia.word2vec.Searcher$UnknownWordException: Unknown search word 'Honey'
at com.medallia.word2vec.SearcherImpl.getVector(SearcherImpl.java:83)
at com.medallia.word2vec.SearcherImpl.cosineDistance(SearcherImpl.java:45)
at com.medallia.word2vec.Word2VecExamples.interact(Word2VecExamples.java:130)
at com.medallia.word2vec.Word2VecExamples.demoWord(Word2VecExamples.java:73)

at com.medallia.word2vec.Word2VecExamples.main(Word2VecExamples.java:33)

BUILD FAILURE

Total time: 15.075s
Finished at: Fri May 22 13:26:27 IST 2015

Final Memory: 5M/15M

Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default-cli) on project medallia-word2vec: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]

To see the full stack trace of the errors, re-run Maven with the -e switch.
Re-run Maven using the -X switch to enable full debug logging.

For more information about the errors and possible solutions, please read the following articles:
[Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException.

Please, I need your kind help.

Make getMatches(final double[] vec, int maxNumMatches) public

It would be great to make the function List getMatches(final double[] vec, int maxNumMatches) in SearcherImpl public and expose it through the Searcher interface. This way, you can look up matches in the model via a vector, not just by words.
As far as I understand, the analogy mechanism should work that way.
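
To make the request concrete, here is a small self-contained sketch of a vector-based lookup used for an analogy query. It does not use the library's classes; VectorQuery and its methods are hypothetical, and the vectors are held in a plain map.

```java
import java.util.Map;

/** Hypothetical illustration of querying by vector; not the Searcher API. */
final class VectorQuery {
    /** Returns b - a + c, the usual "a is to b as c is to ?" query vector. */
    static double[] analogy(double[] a, double[] b, double[] c) {
        double[] q = new double[a.length];
        for (int i = 0; i < q.length; i++) {
            q[i] = b[i] - a[i] + c[i];
        }
        return q;
    }

    /** Cosine similarity of two vectors of equal length. */
    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return dot / Math.sqrt(nx * ny);
    }

    /** Returns the word whose vector is most similar to the query vector. */
    static String nearest(Map<String, double[]> vectors, double[] query) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            double score = cosine(e.getValue(), query);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A real analogy search would usually also exclude the three input words from the candidates; that step is omitted here.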

Large Bin File Error

DoubleBuffer vectors = ByteBuffer.allocateDirect(vocabSize * layerSize * 8).asDoubleBuffer();

This line was throwing an error: the int multiplication vocabSize * layerSize * 8 exceeds Integer.MAX_VALUE and overflows, so a negative number was passed into the method.

As a dirty fix I changed it to the following:

DoubleBuffer vectors = DoubleBuffer.allocate(1000000000);
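
The overflow itself can be reproduced and guarded against with a few lines. The sketch below computes the byte count in long arithmetic and checks whether it fits into a single buffer; BufferSizing is a hypothetical helper, and the sizes used in the note afterwards (3,000,000 words, 300 dimensions) are assumed example values.

```java
/** Hypothetical helper showing the int overflow and a long-based size check. */
final class BufferSizing {
    /** Bytes needed for vocabSize * layerSize doubles, computed without int overflow. */
    static long requiredBytes(int vocabSize, int layerSize) {
        // The cast promotes the whole product to long before multiplying.
        return (long) vocabSize * layerSize * Double.BYTES;
    }

    /** True if the byte count can be passed to an API that takes an int capacity. */
    static boolean fitsInSingleBuffer(int vocabSize, int layerSize) {
        return requiredBytes(vocabSize, layerSize) <= Integer.MAX_VALUE;
    }

    /** The same product in plain int arithmetic, which silently wraps around. */
    static int overflowingBytes(int vocabSize, int layerSize) {
        return vocabSize * layerSize * 8;
    }
}
```

For 3,000,000 words and 300 dimensions the long result is 7,200,000,000 bytes, while the int version wraps to a negative value, which matches the error described above.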

change CBOW's dimensionality

I'm using this word2vec implementation (with trainer.type(NeuralNetworkType.CBOW)) for some representation learning work, and I wonder how to change the vector dimensionality.
I use the getRawVector() method in Searcher.class, and it seems that the returned vectors all have the same dimensionality.

Word2VecJava takes a very long time to run on text8

I have working code that I have configured in Eclipse, but nothing seems to happen. I have waited for 45 minutes and it still shows the message "Acquire vocab is 0.00% complete". I don't know why it is taking so long.

DeepLearning 4j

cd C:\Users\Abhilasha\Documents\NetBeansProjects\dl4j-0.4-examples-master; "JAVA_HOME=C:\Program Files\Java\jdk1.8.0" cmd /c """C:\Program Files\NetBeans 8.0.1\java\maven\bin\mvn.bat" -Dexec.args="-classpath %classpath org.deeplearning4j.examples.rnn.GravesLSTMCharModellingExample" -Dexec.executable="C:\Program Files\Java\jdk1.8.0\bin\java.exe" -Dexec.classpathScope=runtime -Dmaven.ext.class.path="C:\Program Files\NetBeans 8.0.1\java\maven-nblib\netbeans-eventspy.jar" org.codehaus.mojo:exec-maven-plugin:1.2.1:exec""
Running NetBeans Compile On Save execution. Phase execution is skipped and output directories of dependency projects (with Compile on Save turned on) will be used instead of their jar artifacts.
Scanning for projects...

Some problems were encountered while building the effective model for org.deeplearning4j:deeplearning4j-examples:jar:0.4-rc0-SNAPSHOT
'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 146, column 21

It is highly recommended to fix these problems because they threaten the stability of your build.

For this reason, future Maven versions might no longer support building such malformed projects.

Building DeepLearning4j Examples 0.4-rc0-SNAPSHOT

--- exec-maven-plugin:1.2.1:exec (default-cli) @ deeplearning4j-examples ---
File downloaded to C:\Users\ABHILA~1\AppData\Local\Temp\Shakespeare.txt
Loaded and converted file: 5459809 valid characters of 5465100 total characters (5291 removed)
SLF4J: The requested version 1.6 by your slf4j binding is not compatible with [1.5.5, 1.5.6, 1.5.7, 1.5.8]
SLF4J: See http://www.slf4j.org/codes.html#version_mismatch for further details.
Exception in thread "main" java.lang.ExceptionInInitializerError: unable to load from [netlib-native_system-win-x86_64.dll]
at com.github.fommil.jni.JniLoader.load(JniLoader.java:81)
at com.github.fommil.netlib.NativeSystemBLAS.<clinit>(NativeSystemBLAS.java:42)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:259)
at org.nd4j.linalg.cpu.BlasWrapper.<clinit>(BlasWrapper.java:41)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:259)
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:4532)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:4490)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:136)
at org.deeplearning4j.nn.conf.NeuralNetConfiguration$Builder.seed(NeuralNetConfiguration.java:415)

at org.deeplearning4j.examples.rnn.GravesLSTMCharModellingExample.main(GravesLSTMCharModellingExample.java:73)

BUILD FAILURE

Total time: 1:01.600s
Finished at: Tue Oct 27 21:24:13 IST 2015

Final Memory: 12M/165M

Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default-cli) on project deeplearning4j-examples: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]

To see the full stack trace of the errors, re-run Maven with the -e switch.
Re-run Maven using the -X switch to enable full debug logging.

For more information about the errors and possible solutions, please read the following articles:
[Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

retraining model problem

Getting error in SearcherImpl.java

The method
```
greatestOf(Iterables.transform(model.vocab, new Function<String, Match>() {
    x() {
        super();
    }
    public @Override Match apply(String other) {
        double[] otherVec = getVectorOrNull(other);
        double d = calculateDistance(otherVec, vec);
        return new MatchImpl(other, d);
    }
}), int)
```

How to run the medallia Word2Vec Java implementation on a Windows machine

Sir,
I found the link https://github.com/medallia/Word2VecJava as a Java implementation of word2vec. I would like to run the implementation on my Windows machine. Currently I am using NetBeans 8 for Java development along with Maven.
Thanks in advance.

When loading word embeddings via Thrift, similar words (for and four) are turned into the same word (for and for), resulting in an error

Hey guys,

I exported the word embeddings from polyglot (https://sites.google.com/site/rmyeid/projects/polyglot) in the Thrift/JSON format in order to load them into this model.

I got an error: "Multiple entries with same key". I checked the file and no word appears more than once. So I printed out the duplicated words:

model.vocab.size: 1000
++++ for i: 20| h: 195: for
++++ or i: 43| h: 910: or
++++ p i: 83| h: 203: p
++++ de i: 165| h: 304: de
++++ for i: 195| h: 195: for
++++ p i: 203| h: 203: p
++++ de i: 304| h: 304: de
++++ or i: 910| h: 910: or
Exception in thread "main" java.lang.IllegalArgumentException: Multiple entries with same key: for=[D@1813ed0e and for=[D@44303e7b
at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:150)
at com.google.common.collect.RegularImmutableMap.checkNoConflictInBucket(RegularImmutableMap.java:104)
at com.google.common.collect.RegularImmutableMap.<init>(RegularImmutableMap.java:70)
at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:254)
at com.medallia.word2vec.SearcherImpl.<init>(SearcherImpl.java:39)
at com.medallia.word2vec.Word2VecModel.forSearch(Word2VecModel.java:39)
at com.medallia.word2vec.Word2VecExamples.loadModel(Word2VecExamples.java:93)
at com.medallia.word2vec.Word2VecExamples.main(Word2VecExamples.java:35)

Next, I looked into the Thrift file and found that the entries are similar but different words. The program turns them into the same word after loading them. Here are two examples:

In [75]: words[20]
Out[75]: u'for'

In [76]: words[195]
Out[76]: u'four'

In [77]: words[43]
Out[77]: u'or'

In [78]: words[910]
Out[78]: u'our'

I don't think this behavior is correct. Can you tell me where I can fix it? I can't find the corresponding lines of code... Thanks in advance!

Here is the corresponding part of the code, where I added my printouts:

SearcherImpl(Word2VecModel model) {
    this.model = model;
    ImmutableMap.Builder<String, double[]> result = ImmutableMap.builder();
    System.out.println("model.vocab.size: " + model.vocab.size());
    for (int i = 0; i < model.vocab.size(); i++) {
        double[] m = Arrays.copyOfRange(model.vectors, i * model.layerSize, (i + 1) * model.layerSize);
        normalize(m);
        result.put(model.vocab.get(i), m);
        int count = 0;
        for (int h = 0; h < model.vocab.size(); h++) {
            if (model.vocab.get(i).equals(model.vocab.get(h))) {
                count = count + 1;
            }
            if (count == 2) {
                System.out.println("++++ " + model.vocab.get(i) + " i: " + i + "| h: " + h + ": " + model.vocab.get(h));
                count = 1;
            }
        }
    }
    normalized = result.build();
}

Unrecognized Arabic Text in a Corpus and Understanding the Content of the model file

Hi,

I am trying to use your implementation of Word2Vec to generate features for my text.
My corpus is in Arabic.
When I run Word2VecExamples on the file containing the sentences, none of the words are recognized; they are displayed as sequences of "?".
I get the same issue even in the generated model:

  {"1":{"lst":["str",666,"","??","??","???","????","?",",","??","??","..","??","??"

First, how can I fix this problem?
Then, how should I interpret the content of the generated model file?

Thanks for your help :)
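
One possible cause, offered only as an assumption, is that the text passes through a charset that cannot represent Arabic somewhere between reading the corpus and writing the model. The sketch below reproduces the "?" symptom with the standard library alone; CharsetDemo is a hypothetical name and nothing here is taken from the library's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of how an unsuitable charset turns text into question marks. */
final class CharsetDemo {
    /** Encodes the text with the given charset and decodes it again. */
    static String roundTrip(String text, Charset charset) {
        // String.getBytes replaces unmappable characters with the charset's
        // replacement byte, which is '?' for US-ASCII.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062d\u0628\u0627"; // five Arabic letters
        System.out.println(roundTrip(arabic, StandardCharsets.US_ASCII));
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

If this is what happens, passing StandardCharsets.UTF_8 explicitly wherever the corpus is read and the model is written would be the first thing to try.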

Bad 0.10.2 release in maven central

http://search.maven.org/#search%7Cga%7C1%7CWord2VecJava
In Maven Central, the 0.10.2 release has the groupId com.medallia.word4vec instead of com.medallia.word2vec. Is this a misprint?

error in the pom file netbeans 8

I get this error. What can I do?

"Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default-cli) on project weka-dev: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]

To see the full stack trace of the errors, re-run Maven with the -e switch. Re-run Maven using the -X switch to enable full debug logging."

Following is my POM.

<modelVersion>4.0.0</modelVersion>

<groupId>nz.ac.waikato.cms.weka</groupId>
<artifactId>weka-dev</artifactId>
<version>3.7.14-SNAPSHOT</version><!-- weka-version -->
<packaging>jar</packaging>

<name>weka-dev</name>
<description>The Waikato Environment for Knowledge Analysis (WEKA), a machine 
    learning workbench. This version represents the developer version, the
    "bleeding edge" of development, you could say. New functionality gets added
    to this version.
</description>
<url>http://www.cms.waikato.ac.nz/ml/weka/</url>
<organization>
    <name>University of Waikato, Hamilton, NZ</name>
    <url>http://www.waikato.ac.nz/</url>
</organization>
<licenses>
    <license>
        <name>GNU General Public License 3</name>
        <url>http://www.gnu.org/licenses/gpl-3.0.txt</url>
        <distribution>repo</distribution>
    </license>
</licenses>

<developers>
    <developer>
        <id>wekateam</id>
        <name>The WEKA Team</name>
        <email>[email protected]</email>
    </developer>
</developers>

<mailingLists>
    <mailingList>
        <name>wekalist</name>
        <subscribe>https://list.waikato.ac.nz/mailman/listinfo/wekalist</subscribe>
        <unsubscribe>https://list.waikato.ac.nz/mailman/listinfo/wekalist</unsubscribe>
        <archive>https://list.waikato.ac.nz/mailman/htdig/wekalist/</archive>
    </mailingList>
</mailingLists>

<parent>
    <groupId>org.sonatype.oss</groupId>
    <artifactId>oss-parent</artifactId>
    <version>7</version>
</parent>

<scm>
    <connection>scm:svn:https://svn.cms.waikato.ac.nz/svn/weka/trunk/weka</connection>
    <developerConnection>scm:svn:https://svn.cms.waikato.ac.nz/svn/weka/trunk/weka</developerConnection>
    <url>https://svn.cms.waikato.ac.nz/svn/weka/trunk/weka</url>
</scm>

<profiles>
    <profile>
        <id>release-sign-artifacts</id>
        <activation>
            <property>
                <name>performRelease</name>
                <value>true</value>
            </property>
        </activation>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-gpg-plugin</artifactId>
                    <version>1.1</version>
                    <executions>
                        <execution>
                            <id>sign-artifacts</id>
                            <phase>verify</phase>
                            <goals>
                                <goal>sign</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </profile>

    <profile>
        <!-- used for skipping tests -->
        <id>no-tests</id>
        <properties>
            <skipTests>true</skipTests>
        </properties>
    </profile>
</profiles>

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.2</version>
        <scope>test</scope>
    </dependency>

    <dependency>
        <groupId>nz.ac.waikato.cms.weka.thirdparty</groupId>
        <artifactId>java-cup-11b</artifactId>
        <version>2015.03.26</version>
        <scope>compile</scope>
    </dependency>    

    <dependency>
        <groupId>nz.ac.waikato.cms.weka.thirdparty</groupId>
        <artifactId>java-cup-11b-runtime</artifactId>
        <version>2015.03.26</version>
    </dependency>    

    <dependency>
        <groupId>nz.ac.waikato.cms.weka.thirdparty</groupId>
        <artifactId>bounce</artifactId>
        <version>0.18</version>
    </dependency>    

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-compress</artifactId>
        <version>1.10</version>
        <optional>true</optional>
    </dependency>
</dependencies>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <weka.main.class>weka.gui.GUIChooser</weka.main.class>
</properties>

<build>
    <directory>dist</directory>
    <outputDirectory>build/classes</outputDirectory>
    <testOutputDirectory>build/testcases</testOutputDirectory>

    <resources>
        <resource>
            <targetPath>${project.build.outputDirectory}</targetPath>
            <directory>${project.build.sourceDirectory}</directory>
            <includes>
                <include>**/*.arff</include>
                <include>**/*.cost</include>
                <include>**/*.cup</include>
                <include>**/*.default</include>
                <include>**/*.excludes</include>
                <include>**/*.flex</include>
                <include>**/*.gif</include>
                <include>**/*.icns</include>
                <include>**/*.ico</include>
                <include>**/*.jflex</include>
                <include>**/*.jpeg</include>
                <include>**/*.jpg</include>
                <include>**/*.kfml</include>
                <include>**/*.matrix</include>
                <include>**/*.png</include>
                <include>**/*.properties</include>
                <include>**/*.props</include>
                <include>**/*.txt</include>
                <include>**/*.xml</include>
                <include>**/DatabaseUtils.props.*</include>
                <include>weka/gui/beans/README</include>
            </includes>
        </resource>
        <resource>
            <targetPath>${project.build.testOutputDirectory}</targetPath>
            <directory>${project.build.testSourceDirectory}</directory>
            <includes>
                <include>**/*.arff</include>
                <include>**/*.cost</include>
                <include>**/*.cup</include>
                <include>**/*.default</include>
                <include>**/*.excludes</include>
                <include>**/*.flex</include>
                <include>**/*.gif</include>
                <include>**/*.icns</include>
                <include>**/*.ico</include>
                <include>**/*.jflex</include>
                <include>**/*.jpeg</include>
                <include>**/*.jpg</include>
                <include>**/*.kfml</include>
                <include>**/*.matrix</include>
                <include>**/*.png</include>
                <include>**/*.properties</include>
                <include>**/*.props</include>
                <include>**/*.txt</include>
                <include>**/*.xml</include>
                <include>**/DatabaseUtils.props.*</include>
                <include>weka/gui/beans/README</include>
            </includes>
        </resource>
    </resources>

    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.0.2</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.7.2</version>
                <configuration>
                    <includes>
                        <include>**/*Test.java</include>
                    </includes>
                    <disableXmlReport>true</disableXmlReport>
                    <redirectTestOutputToFile>true</redirectTestOutputToFile>
                    <systemPropertyVariables>
                        <weka.test.Regression.root>src/test/resources/wekarefs</weka.test.Regression.root>
                        <weka.test.maventest>true</weka.test.maventest>
                        <user.timezone>Pacific/Auckland</user.timezone>
                    </systemPropertyVariables>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-release-plugin</artifactId>
                <version>2.1</version>
                <configuration>
                    <tagBase>https://svn.cms.waikato.ac.nz/svn/weka/tags</tagBase>
                    <useReleaseProfile>false</useReleaseProfile>
                    <!-- tests are performed with the ant build file, hence skipped here. -->
                    <preparationGoals>clean verify -P no-tests</preparationGoals>
                    <goals>deploy -P no-tests</goals> 
                </configuration>
            </plugin>
        </plugins>
    </pluginManagement>

    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-clean-plugin</artifactId>
            <version>2.4.1</version>
            <configuration>
                <filesets>
                    <fileset>
                        <directory>.</directory>
                        <includes>
                            <include>**/*~</include>
                            <include>**/.attach_pid*</include>
                            <include>**/hs_err_pid*</include>
                        </includes>
                        <followSymlinks>false</followSymlinks>
                    </fileset>
                </filesets>
            </configuration>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>${weka.main.class}</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>test-jar</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-source-plugin</artifactId>
            <version>2.1.2</version>
            <executions>
                <execution>
                    <id>attach-sources</id>
                    <goals>
                        <goal>jar</goal>
                    </goals>
                    <configuration>
                        <excludeResources>true</excludeResources>
                    </configuration>
                </execution>
                <execution>
                    <id>attach-test-sources</id>
                    <goals>
                        <goal>test-jar</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-javadoc-plugin</artifactId>
            <version>2.8.1</version>
            <configuration>
                <maxmemory>256m</maxmemory>
                <subpackages>weka:org</subpackages>
                <show>public</show>
                <outputDirectory>${project.basedir}/doc</outputDirectory>
                <!-- when using Java 8, use this setting to avoid overly strict javadoclet -->
                <!--additionalparam>-Xdoclint:none</additionalparam-->
            </configuration>
            <executions>
                <execution>
                    <id>attach-javadocs</id>
                    <goals>
                        <goal>jar</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        
   
    </plugins>
</build>

Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:

Sir, I want to use word2vec in my project. I added it to my project, but it shows this error.

Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default-cli) on project medallia-word2vec: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]

To see the full stack trace of the errors, re-run Maven with the -e switch.
Re-run Maven using the -X switch to enable full debug logging.

I am using NetBeans 8.1 with Maven. Please tell me how I can get rid of this error and use the library.
Thanks

Won't read data from UTF-8 model created by C version of word2vec

Hallo,

The code as it stands won't read a UTF-8 vocab from a word2vec binary model created using the C version of word2vec.

This is because the vocab's characters are appended to a string buffer as if a byte is a character.

A workaround/hack like this in Word2VecModel.java's fromBinFile() method gets around this issue and probably still works for single-byte characters:

            byte[] buff = new byte[1024];
            for (int lineno = 0; lineno < vocabSize; lineno++) {
                // read vocab
                int bpos = 0;
                byte b = buffer.get();
                while (b != ' ') {
                    if (b != '\n') {
                        buff[bpos++] = b;
                    }
                    b = buffer.get();
                }
                vocabs.add(new String(buff, 0, bpos, "UTF-8"));
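
The difference between the two approaches can be seen in isolation with the sketch below, which decodes the same bytes once byte by byte and once as UTF-8. Utf8VocabDemo is a hypothetical class written for this illustration and is not taken from Word2VecModel.java.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo contrasting per-byte appends with proper UTF-8 decoding. */
final class Utf8VocabDemo {
    /** Treats every byte as one character, which breaks multi-byte sequences. */
    static String byteAsChar(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    /** Decodes the whole byte sequence as UTF-8. */
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // 5 bytes, 4 characters
        System.out.println(asUtf8(raw).length());
        System.out.println(byteAsChar(raw).length());
    }
}
```

For pure ASCII input both methods return the same string, which is consistent with the workaround still handling single-byte characters.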

Accuracy rate seems to be 20% lower than the original C version

Hello, dear Medallia staff.
Thank you for your nice Java code. It is beautiful and neat, but it does not seem to be precise.

I computed the accuracy rate, and it is 20% lower than that of the original version.
I trained on text8 with the same parameters, which are:

Java

File f = new File("text8");
        if (!f.exists())
            throw new IllegalStateException("Please download and unzip the text8 example from http://mattmahoney.net/dc/text8.zip");
        List<String> read = Common.readToList(f);
        List<List<String>> partitioned = Lists.transform(read, new Function<String, List<String>>() {
            @Override
            public List<String> apply(String input) {
                return Arrays.asList(input.split(" "));
            }
        });

        Word2VecModel model = Word2VecModel.trainer()
                .setMinVocabFrequency(5)
                .useNumThreads(20)
                .setWindowSize(8)
                .type(NeuralNetworkType.CBOW)
                .setLayerSize(200)
                .useNegativeSamples(25)
                .setDownSamplingRate(1e-4)
                .setNumIterations(15)
                .setListener(new TrainingProgressListener() {
                    @Override public void update(Stage stage, double progress) {
                        System.out.println(String.format("%s is %.2f%% complete", Format.formatEnum(stage), progress * 100));
                    }
                })
                .train(partitioned);

        try(final OutputStream os = Files.newOutputStream(Paths.get("vectors.bin"))) {
            model.toBinFile(os);
        }

C

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15

I used the same evaluation program and test file for both models:

./compute-accuracy vectors.bin 30000 < questions-words.txt

Your Java implementation:

capital-common-countries:
ACCURACY TOP1: 58.30 %  (295 / 506)
Total accuracy: 58.30 %   Semantic accuracy: 58.30 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 36.78 %  (534 / 1452)
Total accuracy: 42.34 %   Semantic accuracy: 42.34 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 12.69 %  (34 / 268)
Total accuracy: 38.77 %   Semantic accuracy: 38.77 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 25.21 %  (396 / 1571)
Total accuracy: 33.16 %   Semantic accuracy: 33.16 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 55.23 %  (169 / 306)
Total accuracy: 34.80 %   Semantic accuracy: 34.80 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 8.07 %  (61 / 756)
Total accuracy: 30.64 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 8.07 % 
gram2-opposite:
ACCURACY TOP1: 9.48 %  (29 / 306)
Total accuracy: 29.39 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 8.47 % 
gram3-comparative:
ACCURACY TOP1: 38.25 %  (482 / 1260)
Total accuracy: 31.13 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 24.63 % 
gram4-superlative:
ACCURACY TOP1: 23.91 %  (121 / 506)
Total accuracy: 30.60 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 24.50 % 
gram5-present-participle:
ACCURACY TOP1: 22.08 %  (219 / 992)
Total accuracy: 29.53 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 23.87 % 
gram6-nationality-adjective:
ACCURACY TOP1: 63.17 %  (866 / 1371)
Total accuracy: 34.50 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 34.25 % 
gram7-past-tense:
ACCURACY TOP1: 26.35 %  (351 / 1332)
Total accuracy: 33.47 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 32.64 % 
gram8-plural:
ACCURACY TOP1: 44.25 %  (439 / 992)
Total accuracy: 34.39 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 34.17 % 
gram9-plural-verbs:
ACCURACY TOP1: 18.15 %  (118 / 650)
Total accuracy: 33.53 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 32.90 % 
Questions seen / total: 12268 19544   62.77 %

Original C implementation:

capital-common-countries:
ACCURACY TOP1: 82.81 %  (419 / 506)
Total accuracy: 82.81 %   Semantic accuracy: 82.81 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 62.26 %  (904 / 1452)
Total accuracy: 67.57 %   Semantic accuracy: 67.57 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.22 %   Semantic accuracy: 62.22 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 44.68 %  (702 / 1571)
Total accuracy: 54.96 %   Semantic accuracy: 54.96 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 75.82 %  (232 / 306)
Total accuracy: 56.52 %   Semantic accuracy: 56.52 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 17.20 %  (130 / 756)
Total accuracy: 50.40 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 17.20 % 
gram2-opposite:
ACCURACY TOP1: 21.90 %  (67 / 306)
Total accuracy: 48.71 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 18.55 % 
gram3-comparative:
ACCURACY TOP1: 64.60 %  (814 / 1260)
Total accuracy: 51.83 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 43.54 % 
gram4-superlative:
ACCURACY TOP1: 39.72 %  (201 / 506)
Total accuracy: 50.95 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 42.86 % 
gram5-present-participle:
ACCURACY TOP1: 39.52 %  (392 / 992)
Total accuracy: 49.51 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 41.99 % 
gram6-nationality-adjective:
ACCURACY TOP1: 87.24 %  (1196 / 1371)
Total accuracy: 55.08 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 53.94 % 
gram7-past-tense:
ACCURACY TOP1: 38.21 %  (509 / 1332)
Total accuracy: 52.96 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 50.73 % 
gram8-plural:
ACCURACY TOP1: 67.54 %  (670 / 992)
Total accuracy: 54.21 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 52.95 % 
gram9-plural-verbs:
ACCURACY TOP1: 37.38 %  (243 / 650)
Total accuracy: 53.32 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 51.71 % 
Questions seen / total: 12268 19544   62.77 %

Can you give me any suggestions or ideas about this? I am ready to help you if needed.

Thank you.
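The "Total accuracy" figures in both logs are consistent with a running total over the sections seen so far, so the last line of each run gives the overall number. As a rough check of the reported gap, the sketch below sums the per-section counts copied from the two logs above; the class and method names are hypothetical.

```java
public class AccuracyTotals {
    /** Percentage of correct answers over all sections combined. */
    public static double percent(int[] correct, int[] seen) {
        long correctSum = 0;
        long seenSum = 0;
        for (int i = 0; i < correct.length; i++) {
            correctSum += correct[i];
            seenSum += seen[i];
        }
        return 100.0 * correctSum / seenSum;
    }

    public static void main(String[] args) {
        // Questions seen per section (identical in both runs), in log order.
        int[] seen = {506, 1452, 268, 1571, 306, 756, 306, 1260, 506, 992, 1371, 1332, 992, 650};
        // Correct answers per section, copied from the two logs.
        int[] javaCorrect = {295, 534, 34, 396, 169, 61, 29, 482, 121, 219, 866, 351, 439, 118};
        int[] cCorrect = {419, 904, 62, 702, 232, 130, 67, 814, 201, 392, 1196, 509, 670, 243};

        double javaTotal = percent(javaCorrect, seen);
        double cTotal = percent(cCorrect, seen);
        System.out.printf("Java: %.2f %%, C: %.2f %%, gap: %.2f points%n",
                javaTotal, cTotal, cTotal - javaTotal);
    }
}
```

The totals match the final "Total accuracy" lines of the two runs (33.53 % and 53.32 %), which puts the gap at just under 20 percentage points.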

Cut a new release and publish it?

Could I trouble you to make a new release and upload it to Maven? We are currently using this in a bunch of projects by copying the jar file around, but it would be much better if we could depend on the latest in Maven.

Cheers!

Failed to execute goal org.codehaus.mojo

Exception in thread "main" java.lang.NoClassDefFoundError: groep3/Mail/Mail
at com.mycompany.cvo_crescendo_main.TestMail.main(TestMail.java:29)
Caused by: java.lang.ClassNotFoundException: groep3.Mail.Mail
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

This is the error. There is no fault in the Mail class, but it still gives an error.
Can anyone help with this?

Read from bin file

I think it would be very helpful to have a way to read a word2vec binary model file, so that we can use the many pre-trained models that are available in binary format. I can definitely help with this if no one is currently working on it.
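As a rough illustration of what such a reader involves, here is a minimal sketch. It assumes the layout that, to my understanding, the C tool writes with -binary 1: a text header "vocabSize layerSize" ending in a newline, then for each word its bytes, a space, layerSize 4-byte floats, and a newline. Little-endian byte order is also an assumption (it depends on the machine that wrote the file). BinModelSketch and its methods are hypothetical names, not part of Word2VecJava.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class BinModelSketch {
    /** Reads bytes up to (and consuming) the delimiter, decoded as UTF-8. */
    private static String readUntil(ByteBuffer buf, char delimiter) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (byte b = buf.get(); b != delimiter; b = buf.get()) {
            bytes.write(b);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Parses the assumed layout into a word-to-vector map. */
    public static Map<String, float[]> parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);      // assumption: written on a little-endian machine
        int vocabSize = Integer.parseInt(readUntil(buf, ' '));
        int layerSize = Integer.parseInt(readUntil(buf, '\n'));
        Map<String, float[]> vectors = new LinkedHashMap<>();
        for (int i = 0; i < vocabSize; i++) {
            String word = readUntil(buf, ' ');
            float[] vector = new float[layerSize];
            for (int j = 0; j < layerSize; j++) {
                vector[j] = buf.getFloat();
            }
            vectors.put(word, vector);
            // Consume the newline that follows each vector, if present.
            if (buf.hasRemaining() && buf.get(buf.position()) == '\n') {
                buf.get();
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Build a tiny two-word model in memory instead of reading a real file.
        ByteBuffer buf = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("2 2\n".getBytes(StandardCharsets.US_ASCII));
        buf.put("a ".getBytes(StandardCharsets.US_ASCII)).putFloat(1f).putFloat(2f).put((byte) '\n');
        buf.put("b ".getBytes(StandardCharsets.US_ASCII)).putFloat(3f).putFloat(4f).put((byte) '\n');
        buf.flip();

        Map<String, float[]> model = parse(buf);
        System.out.println(model.keySet());
    }
}
```

For a real file, the ByteBuffer could come from a memory-mapped FileChannel; very large models would need to be mapped in several pieces because a single ByteBuffer is limited to 2 GB.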

Java Heap space error

When I increase the size of the dataset file that I use for training the neural network, I get the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.String.substring(String.java:1956)
at java.lang.String.split(String.java:2340)
at java.lang.String.split(String.java:2409)
at word2vector.Word2VecExamples$1.apply(Word2VecExamples.java:51)
at word2vector.Word2VecExamples$1.apply(Word2VecExamples.java:48)
at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:617)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:548)
at word2vector.Word2VecTrainer.count(Word2VecTrainer.java:43)
at word2vector.Word2VecTrainer.train(Word2VecTrainer.java:77)
at word2vector.Word2VecTrainerBuilder.train(Word2VecTrainerBuilder.java:243)
at word2vector.Word2VecExamples.demoWord(Word2VecExamples.java:69)
at word2vector.Word2VecExamples.main(Word2VecExamples.java:36)

Please tell me what I should do to get rid of this error.
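The trace above fails inside String.split, which suggests that very long lines are being split into tokens all at once. One possible mitigation is to stream tokens from the file and hand them out in fixed-size chunks, so that no single huge line has to be held and split in memory. The sketch below is only that: ChunkedTokens is a hypothetical helper, and whether the trainer's train(...) method accepts an arbitrary Iterable and how much data it keeps in memory itself are assumptions I have not verified.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.function.Supplier;

public class ChunkedTokens implements Iterable<List<String>> {
    private final Supplier<Reader> source;   // opens a fresh reader for every pass over the data
    private final int chunkSize;             // maximum number of tokens per emitted list

    public ChunkedTokens(Supplier<Reader> source, int chunkSize) {
        this.source = source;
        this.chunkSize = chunkSize;
    }

    @Override
    public Iterator<List<String>> iterator() {
        final Scanner scanner = new Scanner(source.get()); // splits on whitespace by default
        return new Iterator<List<String>>() {
            private boolean done;

            @Override
            public boolean hasNext() {
                if (done) {
                    return false;
                }
                if (!scanner.hasNext()) {
                    scanner.close();         // release the underlying reader once exhausted
                    done = true;
                    return false;
                }
                return true;
            }

            @Override
            public List<String> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                List<String> chunk = new ArrayList<>(chunkSize);
                while (chunk.size() < chunkSize && scanner.hasNext()) {
                    chunk.add(scanner.next());
                }
                return chunk;
            }
        };
    }

    public static void main(String[] args) {
        // A StringReader stands in for a file reader in this demo.
        Iterable<List<String>> sentences =
                new ChunkedTokens(() -> new StringReader("a b c\nd e"), 2);
        for (List<String> sentence : sentences) {
            System.out.println(sentence);
        }
    }
}
```

For a real corpus, the supplier would open the training file instead of a StringReader. Note that chunking ignores line boundaries, so it changes where sentence boundaries fall.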

Word2phrase

Do you have any intention of porting word2phrase as well?

Training for a large dataset, Java heap space memory out

I trained with text8 from Eclipse, and it worked fine. But when I loaded a 2.1 GB text file, an OutOfMemoryError occurred even though I had increased the Eclipse memory to the maximum. Any ideas about this?

Thanks.

Pengchu
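One thing worth ruling out here: the memory settings in eclipse.ini apply to the IDE's own JVM, while a program launched from Eclipse takes its heap limit from the VM arguments of its run configuration (for example -Xmx). Whether that is the cause in this case is an assumption, but the limit the program actually runs with is easy to check. HeapCheck is a hypothetical helper class.

```java
public class HeapCheck {
    /** Maximum heap the running JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the printed value is far below the machine's memory, the -Xmx setting did not reach the training process.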

Utilizing only one CPU

I'm running the Word2vec example on Linux with Java 8 on a Core i7 processor (with 8 GB of RAM allocated to the Java process), and I see only one CPU pegged at 100% (numThreads is set to 20). What should I look into?

jar available on a public maven repo

Hi,

I'd like to know whether a build of this project is available as a jar in a public Maven or Gradle repository. I'd like to use it without needing to drop the sources into my project.

Thanks!
