priyendumori / wiki-search-engine

A complete search engine built on a 75 GB Wikipedia corpus, with subsecond search latency. Results are Wikipedia pages ranked by TF-IDF relevance to the given search terms. From optimized code to a k-way merge sort, this project addresses latency, indexing, and big-data challenges.

Python 100.00%
search-engine wikipedia-dump external-merge-sort tf-idf-score ranking-algorithm indexing

wiki-search-engine's Introduction

Wikipedia Search Engine

Step 1 : Parsing the Data

To parse the data, run the file "parser.py". The syntax is "python3 parser.py ". It parses the whole dump and writes the index files to the index files directory. It also creates the document-to-title mapping file 'docTitleMap.txt' in the current directory, which the search module uses later.
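As an illustration of the parsing step, here is a minimal sketch of streaming pages out of an XML dump and building a docID-to-title map. It is not the code in "parser.py": the function name `iter_pages`, the sequential docID numbering, and the tiny in-memory sample are all assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_pages(xml_stream):
    """Yield (doc_id, title, text) for each <page> without loading the whole dump."""
    doc_id = 0
    title, text = "", ""
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag == "title":
            title = elem.text or ""
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            doc_id += 1  # assumption: docIDs are assigned in dump order
            yield doc_id, title, text
            elem.clear()  # release the memory held by the finished page
            title, text = "", ""


# Hypothetical two-page sample standing in for the real dump file.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>Alpha</title><revision><text>first page</text></revision></page>"
    b"<page><title>Beta</title><revision><text>second page</text></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(sample))
doc_title_map = {doc_id: title for doc_id, title, _ in pages}
```

Streaming with `iterparse` and clearing each finished page keeps memory use flat, which matters for a dump far larger than RAM.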

Step 2 : Merging the Indexes and Creating Secondary Indexes

To merge the individual index files and create the secondary index, run the file "merge_index.py". It takes no command line arguments. It reads the index files from the index files directory, populates the 'merged_index' directory with index files of the given chunk size, and creates a secondary index named 'secondary_index.txt' in the same folder.
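The merge step can be pictured with the following sketch of a k-way merge over term-sorted runs, followed by chunking and a secondary index that records the first term of each chunk. This is only an in-memory illustration under assumed data shapes (lists of `(term, postings)` pairs); the real on-disk format of the index files and of 'secondary_index.txt' is not described here, and the function names are hypothetical.

```python
import heapq
from itertools import groupby


def merge_postings(sorted_runs):
    """Merge term-sorted runs of (term, postings) pairs into one sorted stream.

    Postings for the same term coming from different runs are concatenated.
    """
    merged = heapq.merge(*sorted_runs, key=lambda entry: entry[0])
    for term, group in groupby(merged, key=lambda entry: entry[0]):
        yield term, [p for _, postings in group for p in postings]


def chunk_with_secondary_index(merged, chunk_size):
    """Split the merged stream into chunks and record each chunk's first term."""
    chunks, secondary = [], []
    for term, postings in merged:
        if not chunks or len(chunks[-1]) >= chunk_size:
            chunks.append([])
            secondary.append(term)
        chunks[-1].append((term, postings))
    return chunks, secondary


# Hypothetical partial indexes, each already sorted by term.
runs = [
    [("apple", ["d1"]), ("cat", ["d1"])],
    [("apple", ["d2"]), ("bird", ["d2"])],
    [("cat", ["d3"]), ("dog", ["d3"])],
]
chunks, secondary = chunk_with_secondary_index(merge_postings(runs), chunk_size=2)
```

Because `heapq.merge` only holds one pending entry per run, the same approach works when each run is a lazily read file rather than a list.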

Step 3 : Running the Search Engine

To run queries, run the file 'search.py'. It loads the primary and secondary indexes from 'merged_index' and uses the file 'docTitleMap.txt' to display the titles corresponding to the docIDs. Once loaded, it prompts the user for a query and then displays the top 10 results.
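To make the ranking idea concrete, here is a small sketch of TF-IDF scoring with a top-10 cutoff. The weighting used (log-scaled term frequency times log inverse document frequency) is one common variant and an assumption; 'search.py' may weight terms differently. The `rank` function and the toy index are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def rank(query_terms, index, total_docs, doc_titles, top_k=10):
    """Return the titles of the top_k documents by summed TF-IDF score."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})  # maps doc_id -> term frequency
        if not postings:
            continue
        idf = math.log(total_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    best = heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])
    return [doc_titles[doc_id] for doc_id, _ in best]


# Toy index and title map standing in for the merged index and docTitleMap.txt.
index = {
    "search": {1: 3, 2: 1},
    "engine": {1: 1, 3: 2},
}
doc_titles = {1: "Search engine", 2: "Binary search", 3: "Jet engine"}
results = rank(["search", "engine"], index, total_docs=4, doc_titles=doc_titles)
```

In this toy case the page containing both terms ranks first, followed by the page with the higher term frequency.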

For Field Queries, follow the format:

	f1:<query> f2:<query> ...

where f1, f2 are fields: title, body, ref, infobox, category, link
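A field query in this format could be split into (field, term) pairs roughly as follows. This is a sketch, not the parser used by 'search.py'; in particular, it assumes that words following a field prefix stay in that field until the next prefix, and that terms are lowercased.

```python
FIELDS = {"title", "body", "ref", "infobox", "category", "link"}


def parse_query(query):
    """Return (field, term) pairs; field is None for terms with no field prefix."""
    parsed = []
    current_field = None
    for token in query.split():
        prefix, sep, rest = token.partition(":")
        if sep and prefix.lower() in FIELDS:
            current_field = prefix.lower()
            token = rest
        if token:
            parsed.append((current_field, token.lower()))
    return parsed


field_query = parse_query("title:albert einstein body:physics")
plain_query = parse_query("albert einstein")
```

For example, "title:albert einstein body:physics" yields two title terms and one body term, while a query with no prefixes yields terms whose field is None.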

