zhujiangang / berkeleylm
Automatically exported from code.google.com/p/berkeleylm
When running MakeLmBinaryFromGoogle I get the exception below (the last lines of
the logger output are also pasted).
The same exception is thrown if I call readLmFromGoogleNgramDir(path, compress)
directly with compress set to true.
I have not yet been able to figure out what is going on.
Do you have any clues?
-Torsten
<trace ---------------------------------------------------------->
Line 13587000
Line 13588000
} [1m14s]
} [1m14s]
Reading ngrams of order 2 {
Exception in thread "main" } [0s]
java.lang.ArrayIndexOutOfBoundsException: 1
at edu.berkeley.nlp.lm.map.CompressedNgramMap.handleNgramsFinished(CompressedNgramMap.java:135)
at edu.berkeley.nlp.lm.io.NgramMapAddingCallback.handleNgramOrderFinished(NgramMapAddingCallback.java:40)
at edu.berkeley.nlp.lm.io.GoogleLmReader.parse(GoogleLmReader.java:99)
at edu.berkeley.nlp.lm.io.GoogleLmReader.parse(GoogleLmReader.java:25)
at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:437)
at edu.berkeley.nlp.lm.io.LmReaders.secondPassGoogle(LmReaders.java:391)
at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:210)
at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:193)
at de.tudarmstadt.ukp.dkpro.teaching.frequency.berkeleylm.CreateGoogleBinary.run(CreateGoogleBinary.java:25)
at de.tudarmstadt.ukp.dkpro.teaching.frequency.berkeleylm.CreateGoogleBinary.main(CreateGoogleBinary.java:18)
</trace ---------------------------------------------------------->
Original issue reported on code.google.com by [email protected]
on 29 Jun 2011 at 8:20
Hi
I am a bit confused about how to find the log probabilities of n-grams. In
PerplexityTest.java the code looks like this:
for (int i = 1; i <= sent.length - lm_.getLmOrder(); ++i) {
    final float score = lm_.getLogProb(sent, i, i + lm_.getLmOrder());
    sentScore += score;
}
What I am not getting is why the loop starts from 1, why the end position is
i + lm_.getLmOrder(), and why sent is only (number of words in the line) + 2 entries long.
I was expecting sent to be (number of words in the line) + 3 entries. So if I have the
sentence "Hello how are you", sent should be START START Hello how are you STOP.
The first trigram would then be START START Hello, so to find its
log probability I would use startPos 0 and endPos 2. The
last trigram would be "are you STOP", with startPos 4 and endPos 6.
Obviously I am making some assumptions here. I tried to dig into the code to prove
myself otherwise but unfortunately could not learn much in this
context.
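For what it's worth, the (startPos, endPos) windows that the quoted loop actually scores can be enumerated with a self-contained sketch (plain Java, no BerkeleyLM dependency; the padding of one START and one STOP is what the "+ 2" implies, and endPos is assumed exclusive — both are assumptions derived from the quote, not confirmed library behavior):

```java
import java.util.ArrayList;
import java.util.List;

public class NgramWindows {
    // Enumerate the (startPos, endPos) windows scored by the quoted loop:
    //   for (int i = 1; i <= sent.length - order; ++i)
    //       lm.getLogProb(sent, i, i + order);
    static List<int[]> windows(int sentLength, int order) {
        List<int[]> result = new ArrayList<>();
        for (int i = 1; i <= sentLength - order; ++i) {
            result.add(new int[] { i, i + order });
        }
        return result;
    }

    public static void main(String[] args) {
        // "Hello how are you" padded as [START, Hello, how, are, you, STOP]
        // has length 6; with a trigram model (order 3) the loop scores:
        for (int[] w : windows(6, 3)) {
            System.out.println(w[0] + ".." + w[1]); // prints 1..4, 2..5, 3..6
        }
        // With endPos exclusive, the first scored trigram is "Hello how are",
        // not "START START Hello" as the doubled-START convention would give.
    }
}
```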
I will be grateful for any help on this.
Regards
Deb
Original issue reported on code.google.com by [email protected]
on 20 Mar 2014 at 9:03
I'm working with text files extracted from the "Reuters-21578, Distribution 1.0"
dataset, and I have had trouble creating and then reading an ARPA file from it.
1. The code seems to be dependent on the use of "." as the decimal separator, so
using a German locale results in this error:
Exception in thread "main" java.lang.NumberFormatException: For input string:
"-2,624282"
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at java.lang.Float.parseFloat(Unknown Source)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:176)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGrams(ArpaLmReader.java:148)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:78)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:136)
at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:131)
at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:112)
at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:108)
at [...]
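The locale dependence is plain-Java behavior, reproducible without BerkeleyLM: presumably the ARPA file was written with locale-sensitive formatting, and under a German default locale that produces the comma that Float.parseFloat later rejects. A minimal sketch:

```java
import java.util.Locale;

public class LocaleDecimal {
    public static void main(String[] args) {
        // Formatting with a German locale uses ',' as the decimal separator...
        String german = String.format(Locale.GERMAN, "%f", -2.624282);
        System.out.println(german); // prints -2,624282

        // ...but Float.parseFloat only accepts '.', so reading it back fails
        // with exactly the NumberFormatException from the report.
        try {
            Float.parseFloat(german);
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException for: " + german);
        }

        // Formatting with Locale.US (or Locale.ROOT) round-trips cleanly.
        System.out.println(String.format(Locale.US, "%f", -2.624282));
    }
}
```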
2. Using text files with multiple tabs results in this exception:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String
index out of range: -4
at java.lang.String.substring(Unknown Source)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGram(ArpaLmReader.java:200)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:172)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGrams(ArpaLmReader.java:148)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:78)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:136)
at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:131)
at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:112)
at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:108)
at [...]
3. Stripping all duplicate whitespace characters and replacing them with a
single space resulted in another error:
Exception in thread "main" java.lang.RuntimeException: Hash map is full with
100 keys. Should never happen.
at edu.berkeley.nlp.lm.map.ExplicitWordHashMap.put(ExplicitWordHashMap.java:56)
at edu.berkeley.nlp.lm.map.HashNgramMap.putHelpWithSuffixIndex(HashNgramMap.java:283)
at edu.berkeley.nlp.lm.map.HashNgramMap.putWithOffsetAndSuffix(HashNgramMap.java:247)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.addNgram(KneserNeyLmReaderCallback.java:171)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.call(KneserNeyLmReaderCallback.java:148)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.call(KneserNeyLmReaderCallback.java:37)
at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:80)
at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:53)
at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:47)
at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:301)
at [...]
I could work around this issue by adding a newline character at the start of
each text file.
I'm creating and reading the model with the following code:
static void createModel(File dir, File arpa) {
    List<String> files = new LinkedList<>();
    for (File file : dir.listFiles())
        files.add(file.getAbsolutePath());
    final StringWordIndexer wordIndexer = new StringWordIndexer();
    wordIndexer.setStartSymbol(ArpaLmReader.START_SYMBOL);
    wordIndexer.setEndSymbol(ArpaLmReader.END_SYMBOL);
    wordIndexer.setUnkSymbol(ArpaLmReader.UNK_SYMBOL);
    LmReaders.createKneserNeyLmFromTextFiles(files, wordIndexer, 3, arpa, new ConfigOptions());
}

public static void main(String[] args) throws IOException {
    Locale.setDefault(Locale.US);
    File arpa = new File([...]);
    File directory = new File([...]);
    createModel(directory, arpa);
    ContextEncodedNgramLanguageModel<String> lm = LmReaders.readContextEncodedLmFromArpa(arpa.getAbsolutePath());
}
Original issue reported on code.google.com by [email protected]
on 16 Oct 2012 at 8:52
Attachments:
make-binary-from-google.sh currently uses -mx1000m
java -ea -mx1000m -server -cp ../src edu.berkeley.nlp.lm.io.MakeLmBinaryFromGoogle ../test/edu/berkeley/nlp/lm/io/googledir google.binary
However, I quickly run out of heap space.
I tried -mx4000m but that ran out of heap space in about 2.5hrs.
What is an appropriate -mx setting for training on all 5-grams?
What size EC2 instance should I spin up?
How long will it take to train on all 5-grams?
Original issue reported on code.google.com by [email protected]
on 24 Nov 2011 at 2:59
What steps will reproduce the problem?
1. Download berkeleylm-1.0.0 or berkeleylm-1.0b3
2. Run examples\make-kneserney-arpa-from-raw-text.sh without the -server option
What is the expected output? What do you see instead?
The n-gram ARPA file should be generated.
What version of the product are you using? On what operating system?
1.0.0 or 1.0b3
Error message:
--------------
Exception in thread "main" java.lang.NoClassDefFoundError:
edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText
Caused by: java.lang.ClassNotFoundException:
edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
Could not find the main class:
edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText. Program will exit.
Original issue reported on code.google.com by [email protected]
on 19 Feb 2012 at 12:38
The link for the Google Books corpus in Web1T format currently points to:
http://tomato.banatao.berkeley.edu:8080/google_books_dirs/books_google_ngrams_gre.tar.gz
... but it should be books_google_ngrams_ger.tar.gz.
Original issue reported on code.google.com by alex.rudnick
on 11 Mar 2014 at 6:37
I have two n-gram language models, A and B. B is a 3-gram LM trained on a
superset of the data used to train the 5-gram LM A. When I use B to estimate
the likelihood of some sequences, the following exception is raised very
frequently:
java.lang.ArrayIndexOutOfBoundsException: 2
at edu.berkeley.nlp.lm.map.HashNgramMap.getOffsetHelpFromMap(HashNgramMap.java:405)
at edu.berkeley.nlp.lm.map.HashNgramMap.getOffsetForContextEncoding(HashNgramMap.java:396)
at edu.berkeley.nlp.lm.map.HashNgramMap.getValueAndOffset(HashNgramMap.java:294)
at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getBackoffSum(ArrayEncodedProbBackoffLm.java:133)
at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getLogProb(ArrayEncodedProbBackoffLm.java:97)
at edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel$DefaultImplementations.getLogProb(ArrayEncodedNgramLanguageModel.java:65)
at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getLogProb(ArrayEncodedProbBackoffLm.java:163)
The exception is not raised when using A.
Interestingly, when using B the exception is not _always_ raised, even for very
similar strings. For example, the string:
"till you drive over the telly ."
does not generate an exception, while
"till you drive over the failure ."
does.
Even though it should not be relevant, both "telly" and "failure" are observed
unigrams.
I am using berkeleylm 1.1.2 on OSX 10.8.2.
java -version:
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06-434-11M3909)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)
Both language models are estimated with make-kneserney-arpa-from-raw-text and
subsequently converted to binary using make-binary-from-arpa.
The problematic language model is quite large, so uploading it for testing
could be complicated. I am wondering whether anyone has ever observed a similar
error and has any clue about the cause of the problem.
Thanks!
Original issue reported on code.google.com by [email protected]
on 3 Feb 2013 at 2:40
I see the example file for training on the Google n-grams.
However, I don't know how the Google n-gram directory should be laid out.
What directory structure should I have?
This is how I currently have things laid out:
.
./web_5gram_2
./web_5gram_2/data
./web_5gram_2/data/3gms
./web_5gram_2/data/4gms
./web_5gram_2/docs
./web_5gram_v1_1.btw
./web_5gram_v1_1.btw/data
./web_5gram_v1_1.btw/data/1gms
./web_5gram_v1_1.btw/data/2gms
./web_5gram_v1_1.btw/data/3gms
./web_5gram_v1_1.btw/docs
./web_5gram_4
./web_5gram_4/data
./web_5gram_4/data/4gms
./web_5gram_4/data/5gms
./web_5gram_4/docs
./web_5gram_5
./web_5gram_5/data
./web_5gram_5/data/5gms
./web_5gram_5/docs
./web_5gram_6
./web_5gram_6/data
./web_5gram_6/data/5gms
./web_5gram_6/docs
./web_5gram_3
./web_5gram_3/data
./web_5gram_3/data/4gms
./web_5gram_3/docs
From looking at src/edu/berkeley/nlp/lm/io/GoogleLmReader.java
it seemed that I should make one directory, alldata/, and put every data file
in there. However, this didn't work either.
What is the correct way to lay out the ngram directory?
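For what it's worth, a single-root layout consistent with the Web1T convention and with the file names mentioned elsewhere in these reports (vocab_cs.gz under 1gms, shards named <n>gm-NNNN) would look like the sketch below; the exact directory and file names are inferred, not confirmed against the reader's source:

```shell
# Hypothetical single-root Web1T-style layout (names inferred from the
# Web1T convention and other reports here, not confirmed in GoogleLmReader).
mkdir -p googledir/1gms googledir/2gms googledir/3gms

touch googledir/1gms/vocab_cs.gz   # sorted vocabulary for the unigrams
touch googledir/2gms/2gm-0001      # numbered shards for higher orders
touch googledir/3gms/3gm-0001

find googledir -type f
```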
Original issue reported on code.google.com by [email protected]
on 19 Nov 2011 at 11:55
I am training multiple language models using Kneser-Ney on different corpora,
and then trying to classify new sentences by scoring them with each language
model and taking the highest score (Naive Bayes).
Does this work using this library's Kneser-Ney smoothing? As in, are the
distributions properly normalized so that I can compare scores across language
models?
Original issue reported on code.google.com by [email protected]
on 18 Jul 2013 at 12:45
I am running edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText on some German
text but keep running into an ArrayIndexOutOfBoundsException. If I
try to build a model from very limited data, no such error arises. Is there a
limit on the number of distinct characters the input text can contain? The out-of-bounds
array index is 256, which is suspiciously the number of distinct byte values.
I have attached the input file (German wikipedia data prepared for a character
level n-gram model).
Here is the output I am seeing:
Reading text files [de-test.txt] and writing to file en-test.model {
Reading from files [de-test.txt] {
On line 0
Writing ARPA {
On order 1
Writing line 0
On order 2
Writing line 0
On order 3
Writing line 0
Writing line 0
On order 4
Writing line 0
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 256
at java.lang.Long.valueOf(Long.java:548)
at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:132)
at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:113)
at edu.berkeley.nlp.lm.collections.Iterators$Transform.next(Iterators.java:107)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.writeToPrintWriter(KneserNeyLmReaderCallback.java:130)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.cleanup(KneserNeyLmReaderCallback.java:111)
at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:85)
at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:51)
at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:44)
at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:280)
at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:55)
Original issue reported on code.google.com by [email protected]
on 9 Aug 2012 at 4:48
Attachments:
Hello,
I would like to use this LM for classification and therefore I need to
calculate the log probability of an entire document.
One of the getLogProb() methods states:
"Calculate language model score of an n-gram. <b>Warning:</b> if you
* pass in an n-gram of length greater than <code>getLmOrder()</code>,
* this call will silently ignore the extra words of context. In other
* words, if you pass in a 5-gram (<code>endPos-startPos == 5</code>) to
* a 3-gram model, it will only score the words from <code>startPos + 2</code>
* to <code>endPos</code>."
Is it correct to assume that the only way to get the log probability of
an entire document (a sentence that contains more than getLmOrder() words) is to split
the document into separate n-grams and query the log probability for each of
them separately?
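If that assumption holds, the summation is a simple sliding window. A minimal self-contained sketch (the `scoreNgram` stub below is a hypothetical stand-in for a real per-n-gram lookup such as `lm.getLogProb(sent, start, end)`; it is not the BerkeleyLM API):

```java
public class DocumentScore {
    // Hypothetical stand-in for lm.getLogProb(sent, start, end);
    // returns a dummy log-prob so the sketch runs without BerkeleyLM.
    static float scoreNgram(String[] sent, int start, int end) {
        return -1.0f * (end - start);
    }

    // Sum log-probs over every full-order window of the document.
    static float scoreDocument(String[] words, int order) {
        float total = 0f;
        for (int i = 0; i + order <= words.length; ++i) {
            total += scoreNgram(words, i, i + order);
        }
        return total;
    }

    public static void main(String[] args) {
        String[] doc = "this is a longer document than the lm order".split(" ");
        // 9 words, order 3 -> 7 windows of the dummy score -3.0 each.
        System.out.println(scoreDocument(doc, 3)); // prints -21.0
    }
}
```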
Original issue reported on code.google.com by [email protected]
on 26 May 2015 at 7:49
Good Afternoon,
How to generate a map of frequency of n-grams?
Thank you.
Original issue reported on code.google.com by [email protected]
on 8 Dec 2014 at 4:48
A request for a feature to also obtain the raw count of an n-gram when
Google n-gram data is used in the back-end.
Original issue reported on code.google.com by [email protected]
on 14 Jul 2011 at 7:14
The method documentation doesn't say, and it is not at all apparent from the code.
Original issue reported on code.google.com by [email protected]
on 6 May 2013 at 9:38
What steps will reproduce the problem?
1. building a LM over some input files consistently generates this exception
2.
3.
What is the expected output? What do you see instead?
The expected output is a learned LM written to a file. Instead, I get the
exception:
Runtime exception: Hash map is full with 100 keys. Should never happen.
What version of the product are you using? On what operating system?
berkeleylm 1.1.3 on Windows 7
Please provide any additional information below.
java.lang.RuntimeException: Hash map is full with 100 keys. Should never happen.
at edu.berkeley.nlp.lm.map.ExplicitWordHashMap.put(ExplicitWordHashMap.java:56)
at edu.berkeley.nlp.lm.map.HashNgramMap.putHelpWithSuffixIndex(HashNgramMap.java:283)
at edu.berkeley.nlp.lm.map.HashNgramMap.putWithOffsetAndSuffix(HashNgramMap.java:247)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.addNgram(KneserNeyLmReaderCallback.java:171)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.call(KneserNeyLmReaderCallback.java:148)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.call(KneserNeyLmReaderCallback.java:37)
at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:80)
at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:53)
at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:47)
at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:301)
at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:57)
at yr.haifa.NLP.lm.BerkleyLanguageModel.train(BerkleyLanguageModel.java:51)
Original issue reported on code.google.com by [email protected]
on 24 Apr 2013 at 1:14
What steps will reproduce the problem?
I am writing code in NetBeans that uses the createKneserNeyLmFromTextFiles function, but I
do not know what the W in WordIndexer<W> wordIndexer is.
What is the expected output? What do you see instead?
I expected NetBeans to recognize W, but it wants me to create a W class.
What version of the product are you using? On what operating system?
berkeleylm-1.1.5.tar.gz
Please provide any additional information below.
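For context, the W in WordIndexer<W> is a Java generic type parameter: each implementation fixes the type used to represent a word, so no "W class" ever needs to be created. A self-contained sketch of the pattern (the toy Indexer/StringIndexer below are illustrative, not the BerkeleyLM classes):

```java
import java.util.ArrayList;
import java.util.List;

// W is a type parameter: the Java type used to represent one word.
interface Indexer<W> {
    int getOrAdd(W word);
}

// An implementation that fixes W = String, analogous to how a
// string-based word indexer would instantiate WordIndexer<W>.
class StringIndexer implements Indexer<String> {
    private final List<String> words = new ArrayList<>();
    public int getOrAdd(String word) {
        int i = words.indexOf(word);
        if (i >= 0) return i;
        words.add(word);
        return words.size() - 1;
    }
}

public class GenericsDemo {
    public static void main(String[] args) {
        // Instantiating with W = String; no new "W class" is needed.
        Indexer<String> indexer = new StringIndexer();
        System.out.println(indexer.getOrAdd("hello")); // prints 0
        System.out.println(indexer.getOrAdd("world")); // prints 1
        System.out.println(indexer.getOrAdd("hello")); // prints 0
    }
}
```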
Original issue reported on code.google.com by [email protected]
on 20 Aug 2015 at 7:26
Hi
Adding to my previous posts in issue 19, I am trying to use a Google binary
(from Google Books) and get log probabilities of trigrams from some text. I am
getting NaN for the last trigram. Attached is the code of what I am trying to
do. I slightly modified these files and added some System.out.printlns to
see the outputs.
The text I am testing with is "Hello how are you". So essentially it gives me
a sent of [7380255 15474 152 26 45 7380256], where 7380255 is the start symbol and
7380256 is the stop symbol.
I am first getting the log probability of the bigram 7380255 15474 by passing
startPos 0 and endPos 2. Thereafter I am getting the log probabilities of
trigrams starting with startPos 0, using the code below:
for (int i = 0; i <= sent.length - 3; i++) {
    System.out.println("Getting score from " + sent[i] + " to " + sent[i + 2]);
    score = lm_.getLogProb(sent, i, i + 3);
    System.out.println("score " + score);
    if (Float.isNaN(score))
        System.out.println("Returned NaN");
    else
        sentScore += score;
}
The problem happens within StupidBackoffLm at the following line
probContext = localMap.getValueAndOffset(probContext, probContextOrder, ngram[i], scratch);
but only for the last trigram, when startPos is 3 and endPos is 6.
scratch.value returns -1 when ngram[i] is the end symbol, 7380256.
This results in a NaN log prob.
I tried the same with scoreSentence; it gives the same problem.
Can you please help me understand what mistake I am making?
Thanks
Regards
Debanjan
Original issue reported on code.google.com by [email protected]
on 24 Mar 2014 at 11:36
Attachments:
I'm trying to evaluate 5-gram model on a Vietnamese corpus but the perplexity
doesn't seem to be right...
What steps will reproduce the problem?
1. Download and extract problem.zip
2. Follow the README file
What is the expected output? What do you see instead?
The results from BerkeleyLM and SRILM should be comparable, but in fact
BerkeleyLM returns an unrealistic perplexity of around 1.
What version of the product are you using? On what operating system?
1.1.5 on Ubuntu.
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 12 Feb 2014 at 3:27
Attachments:
When would it be possible to use the code?
Could you provide scripts so that one can easily import and use the Google
n-gram corpus?
Original issue reported on code.google.com by [email protected]
on 12 May 2011 at 7:19
Hello, I want to build a Chinese language model from an ARPA file. However,
it fails as follows:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8641
at edu.berkeley.nlp.lm.map.ImplicitWordHashMap.setWordRanges(ImplicitWordHashMap.java:84)
at edu.berkeley.nlp.lm.map.ImplicitWordHashMap.<init>(ImplicitWordHashMap.java:52)
at edu.berkeley.nlp.lm.map.HashNgramMap.<init>(HashNgramMap.java:66)
at edu.berkeley.nlp.lm.map.HashNgramMap.createImplicitWordHashNgramMap(HashNgramMap.java:49)
at edu.berkeley.nlp.lm.io.LmReaders.createNgramMap(LmReaders.java:473)
at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:439)
at edu.berkeley.nlp.lm.io.LmReaders.buildMapArpa(LmReaders.java:419)
at edu.berkeley.nlp.lm.io.LmReaders.secondPassArrayEncoded(LmReaders.java:383)
at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:160)
But when I use a smaller file, it is OK. Is there an argument or size that needs
to be adjusted?
Original issue reported on code.google.com by [email protected]
on 4 Oct 2011 at 3:48
What steps will reproduce the problem?
1. An n-gram dataset in Google Web-IT format, but with no unigrams or bigrams
(because I am only interested in higher-order n-grams).
2. To conform to the required format, place an empty vocab_cs.gz file under
subdir "1gms", and create an empty subdir by the name "2gms" with one empty
file in it called "2gm-0001"
3. The file names under the subdirs for higher-order n-grams do not start with
<n>gm-0001 (for example, the files under 3gms start with 3gm-0021).
What is the expected output? What do you see instead?
Expected output:
the expected binary file.
What actually happens:
after reading and adding the n-grams, the following error is thrown:
<a really big number> missing suffixes or prefixes were found, doing another pass to add n-grams {
Exception in thread "main" java.lang.NullPointerException
at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:473)
at edu.berkeley.nlp.lm.io.LmReaders.secondPassGoogle(LmReaders.java:417)
at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:228)
at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:204)
at edu.berkeley.nlp.lm.io.MakeLmBinaryFromGoogle.main(MakeLmBinaryFromGoogle.java:36)
From the source code, I can see that the null pointer exception is thrown at
the line which says
numNgramsForEachWord[ngramOrder].incrementCount(headWord, 1);
What version of the product are you using? On what operating system?
Tried with 1.1.2 and 1.1.5, both on Ubuntu 12.04
Please provide any additional information below.
I am unable to share the dataset here, but I did manage to reproduce the error by making changes in the folder "/test/edu/berkeley/nlp/lm/io/googledir". These changes are the ones I describe in steps 1, 2 and 3 above. It seems that the empty vocab_cs.gz is what is causing this.
So the core of my question is this:
What should I do if I only want to build a language model on 3-, 4- and 5-grams?
Original issue reported on code.google.com by [email protected]
on 21 Nov 2014 at 3:57
If we have a very large corpus that I would like to take counts of in some
distributed way, is there a way to give those raw counts to this code to build
my model for me?
Original issue reported on code.google.com by [email protected]
on 17 Jul 2013 at 7:27
When I try to train a unigram Kneser-Ney model, I get the exception below. This
is the offending line:
dotdotTypeCounts = new LongArray[maxNgramOrder - 2];
Here is the exception:
Exception in thread "main" java.lang.NegativeArraySizeException
at edu.berkeley.nlp.lm.values.KneserNeyCountValueContainer.<init>(KneserNeyCountValueContainer.java:85)
at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.<init>(KneserNeyLmReaderCallback.java:123)
at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:301)
at edu.berkeley.nlp.lm.io.LmReaders.readKneserNeyLmFromTextFile(LmReaders.java:283)
at edu.berkeley.nlp.lm.io.LmReaders.readKneserNeyLmFromTextFile(LmReaders.java:272)
at dragon.lm.NGramLanguageModel.<init>(NGramLanguageModel.java:85)
at dragon.ml.NaiveBayesClassifier.initalizeLanguageModels(NaiveBayesClassifier.java:154)
at dragon.ml.NaiveBayesClassifier.main(NaiveBayesClassifier.java:189)
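The arithmetic is visible in the quoted offending line: with a unigram model, maxNgramOrder - 2 is -1, and Java throws the moment an array is allocated with a negative length. A self-contained sketch (using long[] so no BerkeleyLM classes are needed):

```java
public class NegativeArrayDemo {
    public static void main(String[] args) {
        int maxNgramOrder = 1; // a unigram model
        try {
            // Mirrors `new LongArray[maxNgramOrder - 2]` from the report;
            // maxNgramOrder - 2 == -1, so allocation fails immediately.
            long[][] dotdotTypeCounts = new long[maxNgramOrder - 2][];
            System.out.println(dotdotTypeCounts.length);
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException: " + e.getMessage());
        }
    }
}
```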
Original issue reported on code.google.com by [email protected]
on 18 Jul 2013 at 11:52
I've written my own ARPA file generator, and when I create a small test file
with it, reading it in by doing:
NGramLanguageModel arpaLm = new NGramLanguageModel(arpaLmFilePath);
everything works fine. For ARPA files generated from a larger data set (see
attached), I get an ArrayIndexOutOfBoundsException:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGram(ArpaLmReader.java:201)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:172)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGrams(ArpaLmReader.java:148)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:78)
at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:171)
at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:151)
at dragon.lm.NGramLanguageModel.<init>(NGramLanguageModel.java:68)
at dragon.lm.NGramLanguageModel.main(NGramLanguageModel.java:191)
Any guidance you could give me would be appreciated! The file is encoded as
UTF-8.
Thanks.
Here's the version of Java I'm using:
$ java -version
java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)
Original issue reported on code.google.com by [email protected]
on 17 Jul 2013 at 10:48