
indexwikipedia's Introduction

IndexWikipedia

A simple utility to index wikipedia dumps using Lucene.

This tool can be used to quickly create an index. It is then expected that a programmer will write some code to use the index. This project does not aim to build an end-user index.

It is useful as part of research projects.

Usage:

  • install java (JDK) if needed
  • install maven if needed
  • grab your wikipedia dump: you can quickly grab part of the dump with a command like wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2. (Sorry, the database dumps are not at a fixed location, so we cannot provide a precise URI.) Be mindful that there are many types of Wikipedia dumps and not all of them contain the articles: when in doubt, read the documentation.
  • mvn compile
  • Create a directory where your index will reside, such as WikipediaIndex (e.g., mkdir WikipediaIndex). Be mindful not to reuse the same directory for different projects or different Lucene versions.
  • mvn exec:java -Dexec.args="yourdump someoutputdirectory"

Actual example:

git clone https://github.com/lemire/IndexWikipedia.git
cd IndexWikipedia
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
mkdir Index
mvn compile
mvn exec:java -Dexec.args="enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 Index"

Note that this precise example may fail unless you adjust the URI https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 since Wikipedia dumps are not guaranteed to stay at the same URI.

The documents have title, name, docid and body fields, all of which are stored with the index.

To see how you might then query the index, see the class file 'Query.java' for a working example.

Extracting word-frequency pairs

There is also a (poorly named) utility to extract all word-frequency pairs: me.lemire.lucene.CreateFreqSortedDictionary. It is deliberately left undocumented for now.
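What "word-frequency pairs" means can be sketched in a few lines of Python (illustrative only; the actual utility works over the Lucene index, not raw text, and its tokenization will differ):

```python
from collections import Counter

def word_frequencies(text):
    """Count word occurrences, sorted by decreasing frequency."""
    words = text.lower().split()
    return Counter(words).most_common()

pairs = word_frequencies("the cat sat on the mat the end")
# pairs[0] == ('the', 3)
```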

indexwikipedia's People

Contributors

lemire, vimos


indexwikipedia's Issues

Stored documents' content is not accessible

Hey,
I needed to store the information in my documents in the index, so I set the doc.stored property to true at line 80 of IndexDump.java (https://github.com/lemire/IndexWikipedia/blob/master/src/main/java/me/lemire/lucene/IndexDump.java), and my index indeed grew to around 15 GB. But I can't access the bodies when I query with Luke or when I import the index into Solr. Could you please tell me how to access them?

Thank you!!

Document Id

Hi,
The document ids for the corresponding wiki pages seem to be wrong.
For example, the query doctitle:(cowboys AND aliens)
returned results with docids that are either wrong page ids or point to other pages.

Could anyone please confirm whether the document ids for the corresponding wiki pages are right?

Dump is not fully processed

I've tried to build an index of enwiki-20190820-pages-articles.xml.bz2 (15.3 GB) with mvn exec:java, which terminated very quickly without building a full index. There is no error message, but the short elapsed runtime and the small size (about 900 KB) of the generated index make it impossible that the dump was completely processed.
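One quick sanity check for this kind of silent truncation (not part of this project; a sketch using only the Python standard library) is to verify that the decompressed XML actually ends with the closing </mediawiki> root tag, which a truncated or partial download typically will not:

```python
import bz2

def dump_looks_complete(path):
    """Heuristic check: a complete pages-articles dump decompresses to
    XML that ends with the closing </mediawiki> root tag; a truncated
    download usually ends mid-element (or fails to decompress at all)."""
    tail = b""
    with bz2.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            tail = (tail + chunk)[-64:]  # remember only the last few bytes
    return b"</mediawiki>" in tail
```

This streams the whole file, so it takes a while on a 15 GB dump, but it needs almost no memory.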

Incomplete document returned

Hi,

I've been trying to index a wiki article dump using this library, but when I retrieve documents for some queries, the returned documents only contain docid as a field (there should be more, like "docname", "doctitle", "body", etc.).

Here is how I searched the index (using PyLucene):
reader = DirectoryReader.open(FSDirectory.open(File(INDEX_PATH)))
searcher = IndexSearcher(reader)
analyzer = StandardAnalyzer(Version.LUCENE_4_10_0)
field = "doctitle"
parser = QueryParser(Version.LUCENE_4_10_0, field, analyzer)
query = parser.parse("Sedimentary")
rst = searcher.search(query, 100000)
scoredocs = rst.scoreDocs

I tried printing the document objects during indexing; they seem to have all those fields when they are fed into the IndexWriter, but I just can't retrieve those fields by searching.

Thanks in advance!

multistream bz2 file failed to build index

I downloaded enwiki-latest-pages-articles-multistream.xml.bz2 but failed to build the index with this tool. Below is the error:

➜  IndexWikipedia git:(master) ✗ mvn exec:java -Dexec.args="/home/vimos/Data/Dataset/Wikipedia/enwiki-latest-pages-articles-multistream.xml.bz2 /home/vimos/Data/Dataset/Wikipedia/index"
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for me.lemire.lucene:IndexWikipedia:bundle:0.0.1-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 178, column 17
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-eclipse-plugin is missing. @ line 78, column 12
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING] 
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] Building IndexWikipedia 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] >>> exec-maven-plugin:1.1:java (default-cli) > validate @ IndexWikipedia >>>
[INFO] 
[INFO] --- maven-enforcer-plugin:1.0-beta-1:enforce (enforce-maven) @ IndexWikipedia ---
[INFO] 
[INFO] <<< exec-maven-plugin:1.1:java (default-cli) < validate @ IndexWikipedia <<<
[INFO] 
[INFO] 
[INFO] --- exec-maven-plugin:1.1:java (default-cli) @ IndexWikipedia ---
------------> config properties:
content.source.forever = false
docs.file = /home/vimos/Data/Dataset/Wikipedia/enwiki-latest-pages-articles-multistream.xml.bz2
keep.image.only.docs = false
-------------------------------
Starting Indexing of Wikipedia dump /home/vimos/Data/Dataset/Wikipedia/enwiki-latest-pages-articles-multistream.xml.bz2
org.apache.lucene.benchmark.byTask.feeds.NoMoreDataException
	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.next(EnwikiContentSource.java:95)
	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource.getNextDocData(EnwikiContentSource.java:300)
	at org.apache.lucene.benchmark.byTask.feeds.DocMaker.makeDocument(DocMaker.java:374)
	at me.lemire.lucene.IndexDump.main(IndexDump.java:93)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:290)
	at java.lang.Thread.run(Thread.java:748)
Indexing 0 documents took 46 ms
Index should be located at /home/vimos/Data/Dataset/Wikipedia/index
We are going to test the index by querying the word 'other' and getting the top 3 documents:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.669 s
[INFO] Finished at: 2018-05-15T14:12:19+08:00
[INFO] Final Memory: 16M/374M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.1:java (default-cli) on project IndexWikipedia: An exception occured while executing the Java class. org.xml.sax.SAXParseException; lineNumber: 45; columnNumber: 1; XML document structures must start and end within the same entity. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

I found several links that may relate to this problem.

I am still not sure about this.

A method in explicit-semantic-analysis may be a solution; I haven't tried it yet.

    public void parseXmlDump(File file) {
        try {
            SAXParser saxParser = saxFactory.newSAXParser();
            InputStream wikiInputStream = new FileInputStream(file);
            wikiInputStream = new BufferedInputStream(wikiInputStream);
            // The second argument (decompressConcatenated) tells commons-compress
            // to keep reading past the first bz2 end-of-stream marker, which
            // multistream dumps require.
            wikiInputStream = new BZip2CompressorInputStream(wikiInputStream, true);
            saxParser.parse(wikiInputStream, this);
        } catch (ParserConfigurationException | SAXException | FileNotFoundException ex) {
            Logger.getLogger(WikiIndexer.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(WikiIndexer.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
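The key point in the snippet above is that a "multistream" dump is a concatenation of independent bz2 streams, so the decompressor must keep reading past the first end-of-stream marker. The same behaviour can be demonstrated with Python's standard library, whose bz2 module handles concatenated streams transparently (since Python 3.3):

```python
import bz2

# A "multistream" dump is simply several independent bz2 streams
# concatenated back to back.
multistream = bz2.compress(b"<page>first</page>") + bz2.compress(b"<page>second</page>")

# bz2.decompress keeps reading past the first end-of-stream marker;
# a single-stream decoder would stop after the first <page>.
print(bz2.decompress(multistream))
# b'<page>first</page><page>second</page>'
```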
