
indexwikipedia's Introduction

IndexWikipedia

A simple utility to index wikipedia dumps using Lucene.

This tool can be used to quickly create an index. It is then expected that a programmer will write some code to use the index. This project does not aim to build an end-user index.

It is useful as part of research projects.

Usage:

  • install java (JDK) if needed
  • install maven if needed
  • grab your wikipedia dump: you can quickly grab part of the dump with a command like wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2. (Sorry, the database dumps are not at a fixed location, so we cannot provide a precise URI.) Be mindful that there are many types of Wikipedia dumps and not all of them contain the articles: when in doubt, read the documentation.
  • mvn compile
  • Create a directory where your index will reside, such as WikipediaIndex (e.g., mkdir WikipediaIndex). Be mindful not to reuse the same directory for different projects or different Lucene versions.
  • mvn exec:java -Dexec.args="yourdump someoutputdirectory"

Actual example:

git clone https://github.com/lemire/IndexWikipedia.git
cd IndexWikipedia
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
mkdir Index
mvn compile
mvn exec:java -Dexec.args="enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 Index"

Note that this precise example may fail unless you adjust the URI https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 since Wikipedia dumps are not guaranteed to stay at the same URI.

The documents have title, name, docid and body fields, all of which are stored with the index.

To see how you might then query the index, see the class file 'Query.java' for a working example.

Extracting word-frequency pairs

There is also a (poorly named) utility to extract all word-frequency pairs: me.lemire.lucene.CreateFreqSortedDictionary. It is deliberately left undocumented for now.
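What "word-frequency pairs" means can be sketched in a few lines of Python (illustrative only; the actual utility works over the Lucene index, not raw text, and its tokenization will differ):

```python
from collections import Counter

def word_frequencies(text):
    """Count word occurrences, sorted by decreasing frequency."""
    words = text.lower().split()
    return Counter(words).most_common()

pairs = word_frequencies("the cat sat on the mat the end")
# pairs[0] == ('the', 3)
```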

indexwikipedia's People

Contributors

lemire, vimos


indexwikipedia's Issues

Stored documents' content is not accessible

Hey,
I needed to store the information in my documents in the index, so I set the doc.stored property to true at line 80 of IndexDump.java (https://github.com/lemire/IndexWikipedia/blob/master/src/main/java/me/lemire/lucene/IndexDump.java), and my index indeed grew to around 15 GB. But I can't access the bodies when I query with Luke or when I import the index into Solr. Could you please tell me how to access them?

Thank you!!

Document Id

Hi,
The document ids for the corresponding wiki pages seem to be wrong.
For example, the query doctitle:(cowboys AND aliens)
returned results with docids that are either wrong page ids or point to other pages.

Could anyone please confirm whether the document ids for the corresponding wiki pages are right?

Dump is not fully processed

I've tried to build an index of enwiki-20190820-pages-articles.xml.bz2 (15.3 GB) with mvn exec:java, which terminated very quickly without building a full index. There is no error message, but the short elapsed runtime and the small size (about 900 KB) of the generated index make it impossible that the dump was completely processed.
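One quick sanity check for this kind of silent truncation (not part of this project; a sketch using only the Python standard library) is to verify that the decompressed XML actually ends with the closing </mediawiki> root tag, which a truncated or partial download typically will not:

```python
import bz2

def dump_looks_complete(path):
    """Heuristic check: a complete pages-articles dump decompresses to
    XML that ends with the closing </mediawiki> root tag; a truncated
    download usually ends mid-element (or fails to decompress at all)."""
    tail = b""
    with bz2.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            tail = (tail + chunk)[-64:]  # remember only the last few bytes
    return b"</mediawiki>" in tail
```

This streams the whole file, so it takes a while on a 15 GB dump, but it needs almost no memory.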

Incomplete document returned

Hi,

I've been trying to index a wiki article dump using this library, but when I retrieve documents for some queries, the returned documents only contain docid as a field (there should be more, like "docname", "doctitle", "body", etc.).

Here is how I searched the index (using PyLucene):
reader = DirectoryReader.open(FSDirectory.open(File(INDEX_PATH)))
searcher = IndexSearcher(reader)
analyzer = StandardAnalyzer(Version.LUCENE_4_10_0)
field = "doctitle"
parser = QueryParser(Version.LUCENE_4_10_0, field, analyzer)
query = parser.parse("Sedimentary")
rst = searcher.search(query, 100000)
scoredocs = rst.scoreDocs

I tried printing the document objects during indexing; they seem to have all those fields when they are fed into the IndexWriter, but I just can't retrieve those fields by searching.

Thanks in advance!

multistream bz2 file failed to build index

I downloaded enwiki-latest-pages-articles-multistream.xml.bz2 but failed to build the index with this tool. Below is the error:

➜  IndexWikipedia git:(master) ✗ mvn exec:java -Dexec.args="/home/vimos/Data/Dataset/Wikipedia/enwiki-latest-pages-articles-multistream.xml.bz2 /home/vimos/Data/Dataset/Wikipedia/index"
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for me.lemire.lucene:IndexWikipedia:bundle:0.0.1-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 178, column 17
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-eclipse-plugin is missing. @ line 78, column 12
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING] 
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] Building IndexWikipedia 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] >>> exec-maven-plugin:1.1:java (default-cli) > validate @ IndexWikipedia >>>
[INFO] 
[INFO] --- maven-enforcer-plugin:1.0-beta-1:enforce (enforce-maven) @ IndexWikipedia ---
[INFO] 
[INFO] <<< exec-maven-plugin:1.1:java (default-cli) < validate @ IndexWikipedia <<<
[INFO] 
[INFO] 
[INFO] --- exec-maven-plugin:1.1:java (default-cli) @ IndexWikipedia ---
------------> config properties:
content.source.forever = false
docs.file = /home/vimos/Data/Dataset/Wikipedia/enwiki-latest-pages-articles-multistream.xml.bz2
keep.image.only.docs = false
-------------------------------
Starting Indexing of Wikipedia dump /home/vimos/Data/Dataset/Wikipedia/enwiki-latest-pages-articles-multistream.xml.bz2
org.apache.lucene.benchmark.byTask.feeds.NoMoreDataException
	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.next(EnwikiContentSource.java:95)
	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource.getNextDocData(EnwikiContentSource.java:300)
	at org.apache.lucene.benchmark.byTask.feeds.DocMaker.makeDocument(DocMaker.java:374)
	at me.lemire.lucene.IndexDump.main(IndexDump.java:93)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:290)
	at java.lang.Thread.run(Thread.java:748)
Indexing 0 documents took 46 ms
Index should be located at /home/vimos/Data/Dataset/Wikipedia/index
We are going to test the index by querying the word 'other' and getting the top 3 documents:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.669 s
[INFO] Finished at: 2018-05-15T14:12:19+08:00
[INFO] Final Memory: 16M/374M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.1:java (default-cli) on project IndexWikipedia: An exception occured while executing the Java class. org.xml.sax.SAXParseException; lineNumber: 45; columnNumber: 1; XML document structures must start and end within the same entity. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

I found several links that may relate to this problem.

I am still not sure about this.

A method in explicit-semantic-analysis may be a solution; I haven't tried it yet.

    public void parseXmlDump(File file) {
        try {
            SAXParser saxParser = saxFactory.newSAXParser();
            InputStream wikiInputStream = new FileInputStream(file);
            wikiInputStream = new BufferedInputStream(wikiInputStream);
            // The second argument (decompressConcatenated) tells commons-compress
            // to keep reading past the first bz2 end-of-stream marker, which
            // multistream dumps require.
            wikiInputStream = new BZip2CompressorInputStream(wikiInputStream, true);
            saxParser.parse(wikiInputStream, this);
        } catch (ParserConfigurationException | SAXException | FileNotFoundException ex) {
            Logger.getLogger(WikiIndexer.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(WikiIndexer.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
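The key point in the snippet above is that a "multistream" dump is a concatenation of independent bz2 streams, so the decompressor must keep reading past the first end-of-stream marker. The same behaviour can be demonstrated with Python's standard library, whose bz2 module handles concatenated streams transparently (since Python 3.3):

```python
import bz2

# A "multistream" dump is simply several independent bz2 streams
# concatenated back to back.
multistream = bz2.compress(b"<page>first</page>") + bz2.compress(b"<page>second</page>")

# bz2.decompress keeps reading past the first end-of-stream marker;
# a single-stream decoder would stop after the first <page>.
print(bz2.decompress(multistream))
# b'<page>first</page><page>second</page>'
```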
