
mwmbl-crawler-extension's Introduction

An open source web crawler for the Mwmbl non-profit search engine - Firefox extension

This is the next component in the Mwmbl non-profit search engine project (see the discussion on Hacker News from December 2021): a distributed crawler where the clients run in volunteers' browsers. This repo is for the Firefox extension; see also the Crawler server, which is implemented in Python.

Why?

Our goal is to eventually build a search engine that can compete with commercial ones. Since we don't have very much money, we have to build things differently from commercial search engines. In particular, crawling the web is costly. That's why we are asking you to help us. If many people contribute a small amount of CPU and bandwidth, we can, in time, compete at a very low cost.

Screenshot

[Image: mwmbl-crawler-extension screenshot]

What it does

The pages crawled are determined by a central server at api.crawler.mwmbl.org. They are restricted to a curated set of domains (currently determined by analysing Hacker News votes) and pages linked from those domains.

The URLs to crawl are returned in batches from the central server. The browser extension then crawls each URL in turn. We currently use a single thread because we want to use as little of our supporters' CPU and bandwidth as possible.
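As a rough illustration of crawling "each URL in turn", here is a minimal sketch of a sequential batch loop. It is not the extension's actual code: `crawlBatch`, `crawlOne` and the shape of the result objects are hypothetical names chosen for this example.

```javascript
// Sketch only: crawl a batch strictly one URL at a time.
// `crawlOne` is a hypothetical async function that fetches and processes
// a single URL and resolves to a result object for it.
async function crawlBatch(urls, crawlOne) {
  const results = [];
  for (const url of urls) {
    try {
      // Awaiting inside the loop means the next URL is not started
      // until the current one has finished, so only one request is
      // ever in flight.
      results.push(await crawlOne(url));
    } catch (error) {
      // A failure on one URL is recorded and the batch carries on.
      results.push({ url, error: String(error) });
    }
  }
  return results;
}
```

Awaiting each URL inside the loop, rather than starting them all with `Promise.all`, is what keeps CPU and bandwidth use low, at the cost of a slower batch.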

For each URL, it first checks if downloading is allowed by robots.txt. If it is, it then downloads the URL and attempts to extract the title and the beginning of the body text. An attempt is made to exclude boilerplate, but this is not 100% effective. The results are batched up and the completed batch is then sent to the central server.
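To make the robots.txt step more concrete, here is a simplified sketch of such a check. It is an illustration under stated assumptions, not the parser the extension uses: `isAllowedByRobots` and its parameters are hypothetical, only plain path prefixes are handled (no `*` or `$` wildcards), and rules from the `*` group and from a group naming our user agent are simply merged.

```javascript
// Sketch only: decide whether `path` may be fetched according to the
// text of a robots.txt file. Prefix rules only, no wildcard support.
function isAllowedByRobots(robotsTxt, path, userAgent = "*") {
  const agent = userAgent.toLowerCase();
  const rules = [];       // rules that apply to our user agent
  let groupAgents = [];   // user agents named by the current group
  let inRules = false;    // true once the current group has listed a rule

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // A User-agent line after some rules starts a new group.
      if (inRules) {
        groupAgents = [];
        inRules = false;
      }
      groupAgents.push(value.toLowerCase());
    } else if (field === "allow" || field === "disallow") {
      inRules = true;
      const applies = groupAgents.includes("*") || groupAgents.includes(agent);
      // An empty value (e.g. "Disallow:") restricts nothing, so skip it.
      if (applies && value !== "") {
        rules.push({ allow: field === "allow", prefix: value });
      }
    }
  }

  // The longest matching prefix wins; on a tie, "allow" wins.
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.prefix)) continue;
    if (
      best === null ||
      rule.prefix.length > best.prefix.length ||
      (rule.prefix.length === best.prefix.length && rule.allow)
    ) {
      best = rule;
    }
  }
  // No matching rule means the path is allowed.
  return best === null ? true : best.allow;
}
```

For example, with a robots.txt containing `Disallow: /private/` under `User-agent: *`, the sketch allows `/index.html` and blocks `/private/notes.html`.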

The batches are stored in long-term storage (currently Backblaze) for later indexing. Indexing is currently a manual process, so you won't necessarily see pages you've crawled in search results any time soon.

What do the emojis mean?

When you click the Mwmbl icon, you can get a view into what's happening with the crawler process. Emojis are used as a shorthand for various errors/issues encountered:

  • 🤖 means the URL was blocked by robots.txt
  • ⏰ means the page timed out (we allow 3s for the page to load)
  • 😵 means we got a 404 (with a plan to extend this to 4xx)
  • ❌ means some other kind of error
  • ✅ means we got a 2xx result.
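The list above can be read as a small decision function. The sketch below is purely illustrative: `statusEmoji` and the `outcome` object with its `blockedByRobots`, `timedOut` and `status` fields are assumptions made for this example, not names taken from the extension.

```javascript
// Sketch only: map a crawl outcome to the emoji shown in the popup.
// The outcome shape { blockedByRobots, timedOut, status } is hypothetical.
function statusEmoji(outcome) {
  if (outcome.blockedByRobots) return "🤖"; // blocked by robots.txt
  if (outcome.timedOut) return "⏰";        // page did not load in time
  if (outcome.status === 404) return "😵";  // not found
  if (outcome.status >= 200 && outcome.status < 300) return "✅"; // 2xx
  return "❌";                              // any other kind of error
}
```

Under these assumptions, `statusEmoji({ status: 200 })` gives ✅ and `statusEmoji({ status: 500 })` gives ❌.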

Installation

Currently only Firefox is supported. Either install from Mozilla add-ons, or follow the instructions below to build the extension and then install it by going to about:debugging, selecting "This Firefox" and then "Load Temporary Add-on".

How to deploy/customise your own crawler

If you want to run your own crawler you will first need to deploy the crawler server. This will run happily on Google Cloud Run. You will also need a Backblaze or AWS account for storing the crawled batches. Change this line in background.js:

const DOMAIN = 'https://api.crawler.mwmbl.org'

to point to your crawler server instance. If you want to customise the curated domains (these influence the type of pages crawled), you can edit the hn-top-domains.json file.

How to build

git clone https://github.com/mwmbl/crawler-extension.git
cd crawler-extension
npm install
npm run build

The extension will be created in the dist folder.

mwmbl-crawler-extension's People

Contributors

adjagu, colinespinas, daoudclarke, omasanori
