Yet Another Search Site

This project, while containing no code, acts as the core for exploring how to build a scalable search engine similar to those run by the major search engine sites.

Architecture

Here is an overview of the architecture used. It is by no means everything involved in a modern search engine, but it works as a base from which to learn and grow.

Architecture Image

Kafka is the core of this system because it can support the scale this project needs. Nodes that share the same functionality are assigned to one consumer group. Whenever a consumer in a group asks for the next link, for example, Kafka returns the next link that has not yet been consumed by that consumer group. In this way multiple consumer groups can read the same data, each moving through it at its own speed.
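The consumer group behaviour described above can be sketched with a small in-memory model. This is not the Kafka client API: `TopicLog`, the group names, and the example URLs are all made up for illustration, and the model assumes a single partition, whereas real Kafka assigns partitions to the members of a group.

```python
from collections import defaultdict


class TopicLog:
    """In-memory stand-in for a single-partition topic (illustration only)."""

    def __init__(self):
        self.records = []                # append-only log of messages
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def produce(self, value):
        self.records.append(value)

    def poll(self, group):
        """Return the next record this group has not consumed, or None."""
        offset = self.offsets[group]
        if offset >= len(self.records):
            return None
        self.offsets[group] = offset + 1
        return self.records[offset]


links = TopicLog()
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/c"]:
    links.produce(url)

# Two consumers in the hypothetical "renderers" group share one offset,
# so each call hands out a different link.
first = links.poll("renderers")
second = links.poll("renderers")

# A second hypothetical group, "auditors", reads the same data from the start.
audit_first = links.poll("auditors")
```

Because the offset is tracked per group, a slow group never holds back a fast one.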

Kafka operates similarly to TCP in the sense that it has measures in place to ensure that a consumer was able to process the data. If a consumer dies before acknowledging a link (in Kafka terms, before committing its offset), Kafka can provide that link to the next consumer in the group.
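A minimal sketch of that guarantee, assuming the consumer commits its position only after it has finished processing a record. `CommittedLog` and `consume_one` are hypothetical names used for illustration, not part of any Kafka client library.

```python
class CommittedLog:
    """Toy log where a group's offset only advances on an explicit commit."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = {}  # group name -> next offset to hand out

    def fetch(self, group):
        offset = self.committed.get(group, 0)
        if offset >= len(self.records):
            return None
        return offset, self.records[offset]

    def commit(self, group, offset):
        self.committed[group] = offset + 1


def consume_one(log, group, handler):
    """Fetch one record, process it, and commit only if the handler succeeds."""
    fetched = log.fetch(group)
    if fetched is None:
        return False
    offset, value = fetched
    handler(value)  # may raise; in that case nothing is committed
    log.commit(group, offset)
    return True


def crashing_handler(url):
    raise RuntimeError("consumer died mid-process")


log = CommittedLog(["https://example.com/a", "https://example.com/b"])
processed = []

try:
    consume_one(log, "renderers", crashing_handler)
except RuntimeError:
    pass  # the offset was never committed

# A healthy consumer in the same group receives the same link again.
consume_one(log, "renderers", processed.append)
```

The trade-off of committing after processing is that a link may be handled more than once, so downstream steps should tolerate repeats.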

Kubernetes controls node creation. It ensures that each consumer group has its allotted number of nodes. If nodes die off, Kubernetes will reinstate them. Also, adding containers is as simple as modifying a config file.
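Conceptually, this is a loop that compares a desired count with what is actually running and corrects the difference. The sketch below models that idea in plain Python. It is not the Kubernetes API, and the group names and counts are invented for the example.

```python
def reconcile(desired, running):
    """Return the per-group actions needed to match the desired counts."""
    actions = {}
    for group, want in desired.items():
        have = running.get(group, 0)
        if have < want:
            actions[group] = ("start", want - have)
        elif have > want:
            actions[group] = ("stop", have - want)
    return actions


# Hypothetical desired state, as it might be declared in a config file.
desired = {"renderers": 3, "crawlers": 2, "analyzers": 2}
# Hypothetical observed state after two renderers died.
running = {"renderers": 1, "crawlers": 2, "analyzers": 3}

actions = reconcile(desired, running)
```

Scaling a group up or down then amounts to changing one number in `desired`.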

Download / Render

This process will be managed by a service that interacts with htmlspitter. The service will read URLs from the Kafka Links topic, then download and render the pages. The rendered DOM will then be put onto the DOM stack. If there are errors, they will be put on the error topic, potentially to be retried at a later date.
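The flow might look roughly like the sketch below. The renderer is passed in as a plain function because htmlspitter's actual interface is not described here, so `fake_render` is a hypothetical stand-in. The topics are modelled as in-memory lists rather than real Kafka topics.

```python
from collections import deque


def process_links(links, render, dom_topic, error_topic):
    """Drain the links queue, routing rendered DOMs and failures separately."""
    while links:
        url = links.popleft()
        try:
            dom = render(url)
        except Exception as exc:  # any render failure is recorded for a later retry
            error_topic.append({"url": url, "error": str(exc)})
        else:
            dom_topic.append({"url": url, "dom": dom})


def fake_render(url):
    """Hypothetical renderer standing in for the real download/render call."""
    if "broken" in url:
        raise TimeoutError("render timed out")
    return f"<html><body>{url}</body></html>"


links = deque(["https://example.com/ok", "https://example.com/broken"])
dom_topic, error_topic = [], []
process_links(links, fake_render, dom_topic, error_topic)
```

Keeping the URL alongside the error message means a retry job has everything it needs to try the page again.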

Crawler

Here we grab all of the URLs from a page. Once we have a page's links, we attempt to place them on the Links topic. The Links topic will not allow duplicates.
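A minimal sketch of this step using only the Python standard library. `LinkExtractor` and `LinkTopic` are hypothetical names. Kafka itself does not reject repeated messages, so the duplicate check is modelled here as a separate "seen" set, which is an assumption about how the rule could be enforced.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


class LinkTopic:
    """In-memory stand-in for a links topic that refuses duplicates."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def offer(self, url):
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


html = (
    '<a href="/about">About</a> '
    '<a href="https://example.com/about">About again</a> '
    '<a href="/blog">Blog</a>'
)
extractor = LinkExtractor("https://example.com/")
extractor.feed(html)

topic = LinkTopic()
accepted = [url for url in extractor.links if topic.offer(url)]
```

Resolving relative links before the duplicate check matters: `/about` and `https://example.com/about` only compare equal once both are absolute.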

While this may seem like a simple task, part of the problem is that many modern websites are controlled entirely through JavaScript, which can make traversing them difficult if not impossible. Also, many websites use links to direct you to different parts of the same page.
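The same-page links are the easier of the two problems. One possible approach, sketched below, is to strip the `#fragment` from each link and skip anything that resolves back to the page being crawled. `normalize_link` is a hypothetical helper. It does nothing for JavaScript-driven navigation, which still depends on rendering the page first.

```python
from urllib.parse import urldefrag, urljoin


def normalize_link(base_url, href):
    """Resolve href against base_url and drop any #fragment.

    Returns None for links that point back at the same page.
    """
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    page, _ = urldefrag(base_url)
    if absolute == page:
        return None
    return absolute


same_page = normalize_link("https://example.com/docs", "#install")
other_page = normalize_link("https://example.com/docs", "/guide#setup")
```

Here `same_page` is `None`, while `other_page` keeps the target page and loses only its fragment.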

Analyzer

The analyzers will extract metadata from the DOM and use the DOM to generate data that will be useful for indexing the page.
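As a rough illustration, an analyzer could pull the page title and named `<meta>` tags out of the rendered DOM. The sketch below uses the standard library HTML parser. The choice of fields is an assumption, since the metadata the analyzers should collect is not specified here.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect the <title> text and any <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(dom):
    """Return a flat dict of metadata found in an HTML string."""
    parser = MetadataExtractor()
    parser.feed(dom)
    return {"title": parser.title.strip(), **parser.meta}


dom = (
    "<html><head><title> Demo </title>"
    '<meta name="Description" content="A demo page">'
    "</head><body></body></html>"
)
metadata = extract_metadata(dom)
```

Meta names are lowercased so that `Description` and `description` end up under the same key.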
