wikivoyage_search

An Elasticsearch search engine with the Wikivoyage data

Instructions:

Run this in a Linux environment, such as: Linux vm-jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux

It also requires:

To build the search engine, run: source build_search_engine.sh

This will run 5 operations:

  • 0: Download the data
  • 1: Build the Docker image to parse the raw data
  • 2: Initialise the Elasticsearch image and start the service
  • 3: Write the mapping for the Wikivoyage data
  • 4: Wait for the Elasticsearch service to be up and running, then write the data to the index
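
Step 4 depends on the Elasticsearch service being reachable before any data is written. The following is a minimal Python sketch of such a wait loop. It is not taken from build_search_engine.sh; the URL http://localhost:9200 (Elasticsearch's default HTTP port), the function names and the timeout values are assumptions for illustration.

```python
import time
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200"):
    """Return True if an HTTP GET to the (assumed) Elasticsearch URL answers with 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP error responses.
        return False


def wait_until_ready(check=elasticsearch_is_up, timeout=60.0, interval=1.0,
                     sleep=time.sleep):
    """Poll check() until it returns True or the timeout (in seconds) expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller would write to the index only when wait_until_ready() returns True. Passing the check and sleep functions as parameters keeps the loop testable without a running service.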

To execute a query, first ensure that curl and jq are installed, then run: curl -XPOST localhost:4567/ -d 'query=museums in melbourne' | jq .

This will return a list of the ten items most relevant to that search.
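
The same query can be sent from Python. This is a rough sketch using only the standard library. It assumes that the service at localhost:4567 accepts a form-encoded "query" field (as the curl command suggests) and that it returns JSON (as the use of jq suggests); the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(query, url="http://localhost:4567/"):
    """Build a POST request carrying the query as a form-encoded field."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def search(query, url="http://localhost:4567/"):
    """Send the query and decode the response, assuming the body is JSON."""
    with urllib.request.urlopen(build_search_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the search engine to be running locally.
    print(json.dumps(search("museums in melbourne"), indent=2))
```

Unlike the curl command, urlencode encodes the spaces in the query as "+", which a standard form parser decodes back to spaces.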

What are the 2-3 most common relevance challenges?

  • Finding terms that are associated with the query but not explicitly stated in it. For example, a search for 'galleries in berlin' should also look for 'arthouses in berlin' or any other places that might house art. To solve this, the stack would need to take each term in the query, find associated terms and include them in the query. This could be achieved with a synonym lookup table such as the WordNet corpus, which has an API in the Python nltk library. Elasticsearch also has the notion of proximity matching of keywords or lookalikes, which could add implicit results to a search response.

  • The next challenge is the ongoing difficulty of measuring the accuracy/precision of relevancy. Relevancy can easily be measured on small data sets with labelling, but as the data set grows larger, the task becomes more onerous. We could label the relevancy of a document for a particular query and measure the success of our engine, but when the query is client facing, and the permutations of queries and possible rankings balloon in size, the measure of success is far more subjective. So we need to find other data to make it as objective as possible. This can be achieved by measuring user behaviour in response to the search engine's results. This would consist of a log of user queries and the respective actions of the user: if they clicked on the first two or three links, completed their business and moved on, then that was a relevant search (with some exceptions, obviously). If they scroll through the results without clicking, that is a strong indication of a poor search result. If you were to map queries to user behaviour, and visualise it in different ways, you could start to understand the health of your relevancy algorithm. This can then be used as an 'objective' metric of relevancy success, using A/B testing and other metrics to fine-tune the algorithm.
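
The query expansion described in the first challenge could look roughly like the Python sketch below. The small SYNONYMS table is a hand-written stand-in for a real WordNet lookup, and the field name "text" is hypothetical; neither comes from this repository. The generated body uses Elasticsearch's bool/should and match_phrase queries, with slop allowing a small amount of proximity matching.

```python
from itertools import product

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {
    "galleries": ["arthouses", "art museums"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by variants with synonyms substituted."""
    options = [[term] + synonyms.get(term, []) for term in query.lower().split()]
    return [" ".join(choice) for choice in product(*options)]


def build_es_query(query, field="text", slop=2):
    """Build a bool query in which any expanded phrase may match."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match_phrase": {field: {"query": variant, "slop": slop}}}
                    for variant in expand_query(query)
                ],
                "minimum_should_match": 1,
            }
        }
    }
```

The number of variants grows multiplicatively with the number of terms that have synonyms, so a real implementation would probably cap the expansion or weight the original phrase higher.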

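The behaviour-based measurement described in the second challenge could be summarised along these lines. This is only a sketch: the log format (one entry per search, with the 1-based ranks of the results the user clicked) is an assumption, and click-through rate and mean reciprocal rank of the first click are just two possible proxies for relevancy.

```python
def summarise_clicks(log):
    """Summarise a search log into two click-based relevancy proxies.

    Each entry is assumed to look like:
        {"query": "museums in melbourne", "clicked_ranks": [1, 3]}
    where clicked_ranks holds the 1-based positions of the clicked results.
    """
    total = len(log)
    if total == 0:
        return {"click_through_rate": 0.0, "mean_reciprocal_rank": 0.0}

    searches_with_clicks = [entry for entry in log if entry["clicked_ranks"]]
    # Share of searches where the user clicked on at least one result.
    click_through_rate = len(searches_with_clicks) / total
    # Reciprocal rank of the highest clicked result; searches with no clicks count as 0.
    reciprocal_sum = sum(1.0 / min(entry["clicked_ranks"])
                         for entry in searches_with_clicks)
    return {
        "click_through_rate": click_through_rate,
        "mean_reciprocal_rank": reciprocal_sum / total,
    }
```

Computing the same summary separately for each variant of the ranking algorithm would give the numbers to compare in an A/B test.
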
wikivoyage_search's People

Contributors

alexlance

wikivoyage_search's Issues

Question with regards to the data source

This is a cool repo! I am trying to do something similar, but with the actual index dumps that Wikimedia generates [https://dumps.wikimedia.org/other/cirrussearch/current/], and got stuck setting up the right coordinate -> geo point mappings. I was wondering why you chose to download the article dumps and perform XML parsing etc. instead of using the index dumps: performance/storage reasons, or something else?
Did you try bulk importing the index dumps? It seems to work for me, but I haven't figured out how to convert the coordinate fields into geo points.
