
crawler's Introduction

Crawler


A high-performance web crawler / scraper in Elixir, with worker pooling and rate limiting via OPQ.

Features

  • Crawl assets (JavaScript, CSS and images).
  • Save to disk.
  • Hook for scraping content.
  • Restrict crawlable domains, paths or content types.
  • Limit concurrent crawlers.
  • Limit rate of crawling.
  • Set the maximum crawl depth.
  • Set timeouts.
  • Set the retry strategy.
  • Set the crawler's user agent.
  • Manually pause/resume/stop the crawler.

See Hex documentation.

Architecture

Below is a very high-level architecture diagram demonstrating how Crawler works.

Usage

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

There are several ways to access the crawled page data:

  1. Use Crawler.Store
  2. Tap into the underlying store, Crawler.Store.DB, directly
  3. Use your own scraper
  4. If the :save_to option is set, pages will be saved to disk in addition to the above-mentioned places
  5. Provide your own custom parser and manage how data is stored and accessed yourself
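
For example, a minimal sketch of option 1, assuming the crawl has had enough time to run (the sleep and the listed URLs are illustrative only; see the Crawler.Store docs for the lookup functions available in your version):

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

# Give the crawl some time to run, then list every URL that has been stored.
Process.sleep(10_000)

Crawler.Store.all_urls()
# => ["http://elixir-lang.org", "http://elixir-lang.org/install.html", ...]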

Configurations

Option Type Default Value Description
:assets list [] Whether to fetch any asset files, available options: "css", "js", "images".
:save_to string nil When provided, the path for saving crawled pages.
:workers integer 10 Maximum number of concurrent workers for crawling.
:interval integer 0 Rate limit control - number of milliseconds before crawling more pages, defaults to 0 which is effectively no rate limit.
:max_depths integer 3 Maximum nested depth of pages to crawl.
:max_pages integer :infinity Maximum amount of pages to crawl.
:timeout integer 5000 Timeout value for fetching a page, in ms. Can also be set to :infinity, useful when combined with Crawler.pause/1.
:retries integer 2 Number of times to retry a fetch.
:store module nil Module for storing the crawled page data and crawling metadata. You can set it to Crawler.Store or use your own module, see Crawler.Store.add_page_data/3 for implementation details.
:force boolean false Force crawling URLs even if they have already been crawled, useful if you want to refresh the crawled data.
:scope term nil Similar to :force, but you can pass a custom :scope to determine how Crawler should treat links it has already seen.
:user_agent string Crawler/x.x.x (...) User-Agent value sent by the fetch requests.
:url_filter module Crawler.Fetcher.UrlFilter Custom URL filter, useful for restricting crawlable domains, paths or content types.
:retrier module Crawler.Fetcher.Retrier Custom fetch retrier, useful for retrying failed crawls, nullifies the :retries option.
:modifier module Crawler.Fetcher.Modifier Custom modifier, useful for adding custom request headers or options.
:scraper module Crawler.Scraper Custom scraper, useful for scraping content as soon as the parser parses it.
:parser module Crawler.Parser Custom parser, useful for handling parsing differently or to add extra functionalities.
:encode_uri boolean false When set to true, applies URI.encode/1 to the URL to be crawled.
:queue pid nil You can pass in an OPQ pid so that multiple crawlers can share the same queue.
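
For example, a hedged sketch combining several of these options (the values are purely illustrative, not recommendations):

Crawler.crawl("https://elixir-lang.org",
  max_depths: 3,
  max_pages: 1_000,
  workers: 5,
  interval: 1_000,               # wait 1 second between pages
  timeout: 10_000,               # allow 10 seconds per fetch
  assets: ["css", "images"],
  save_to: "/tmp/crawled_pages",
  user_agent: "MyCrawler/1.0 (hello@example.com)"
)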

Custom Modules

It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:

Retrier

See Crawler.Fetcher.Retrier.

Crawler uses ElixirRetry's exponential backoff strategy by default.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
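
As a hedged sketch, a retrier that performs no retries at all might look like this. It assumes the Spec callback is perform/2 and that its first argument is the fetch function, as in the default Crawler.Fetcher.Retrier; check that module's source for the exact contract.

defmodule NoRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Invoke the fetch once, without any retry or backoff logic.
  def perform(fetch, _opts), do: fetch.()
end

Crawler.crawl("https://elixir-lang.org", retrier: NoRetrier)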

URL Filter

See Crawler.Fetcher.UrlFilter.

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
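
A hedged sketch of a filter that restricts crawling to a single host, assuming the Spec callback is filter/2 and that it returns {:ok, boolean} (the default Crawler.Fetcher.UrlFilter lets every URL through; check its source for the exact contract):

defmodule SingleHostUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Only allow URLs hosted on elixir-lang.org to be crawled.
  def filter(url, _opts) do
    {:ok, URI.parse(url).host == "elixir-lang.org"}
  end
end

Crawler.crawl("https://elixir-lang.org", url_filter: SingleHostUrlFilter)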

Scraper

See Crawler.Scraper.

defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
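
A hedged sketch of a scraper follows. It assumes the Spec callback is scrape/1, that it receives the crawled page (with url, body and opts) and that it must return {:ok, page} for crawling to continue; check Crawler.Scraper for the exact callback name and page structure.

defmodule LoggingScraper do
  @behaviour Crawler.Scraper.Spec

  # Log the URL and body size of every crawled page, then pass the page on unchanged.
  def scrape(page) do
    IO.puts("Scraped #{page.url} (#{byte_size(page.body)} bytes)")
    {:ok, page}
  end
end

Crawler.crawl("https://elixir-lang.org", scraper: LoggingScraper)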

Parser

See Crawler.Parser.

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end

Modifier

See Crawler.Fetcher.Modifier.

defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end

Pause / Resume / Stop Crawler

Crawler provides pause/1, resume/1 and stop/1, see below.

{:ok, opts} = Crawler.crawl("https://elixir-lang.org")

Crawler.running?(opts) # => true

Crawler.pause(opts)

Crawler.running?(opts) # => false

Crawler.resume(opts)

Crawler.running?(opts) # => true

Crawler.stop(opts)

Crawler.running?(opts) # => false

Please note that when pausing Crawler, you need to set a large enough :timeout (or even set it to :infinity), otherwise the parser will time out due to unprocessed links.
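
For instance, a minimal sketch of a pausable crawl following this advice:

{:ok, opts} = Crawler.crawl("https://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)

# ... do other work while the crawl is paused ...

Crawler.resume(opts)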

Multiple Crawlers

It is possible to start multiple crawlers sharing the same queue.

{:ok, queue} = OPQ.init(worker: Crawler.Dispatcher.Worker, workers: 2)

Crawler.crawl("https://elixir-lang.org", queue: queue)
Crawler.crawl("https://github.com", queue: queue)

Find All Scraped URLs

Crawler.Store.all_urls() # => ["https://elixir-lang.org", "https://google.com", ...]

Examples

Google Search + GitHub

This example performs a Google search, then scrapes the results to find GitHub projects and output their names and descriptions.

See the source code.

You can run the example by cloning the repo and running the command:

mix run -e "Crawler.Example.GoogleSearch.run()"

API Reference

Please see https://hexdocs.pm/crawler.

Changelog

Please see CHANGELOG.md.

Copyright and License

Copyright (c) 2016 Fred Wu

This work is free. You can redistribute it and/or modify it under the terms of the MIT License.

crawler's People

Contributors

arashsc, fredwu, jpyamamoto, kianmeng, merongivian, phereford, rhnonose, supamic


crawler's Issues

plug_cowboy dependency

Hi @fredwu .

Thank you so much for this awesome (and popular) library! One quick question:

  • Is the plug_cowboy dependency used in the library outside of testing?

I can't quite find it being used in the library. If not, would you be OK with me creating a PR that moves the dependency to the :test environment only?

Crawler killed from consuming all memory or processes

I'm having trouble getting the crawler to successfully crawl with a depth of more than 2. It's able to filter many links and scrape pages, but after a couple of minutes the process gets killed. At the end it outputs:

erl_child_setup closed
Killed

I don't see any errors on the console and the erl_crash.dump doesn't really say much. I see a bunch of references to ('Elixir.Crawler.QueueHandler':enqueue/1 + 248), but that's all.

Has anyone else seen this?

Example Project for basic config

I keep getting errors when dropping it into a blank project. What is the basic config for getting it up and running? Maybe an example would help, or instructions in the README on how to integrate it into your project. If I figure it out, I'll make a pull request.

Error:


iex(1)> Scraper.crawl
:ok
iex(2)> 
17:31:26.677 [error] GenServer #PID<0.318.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/getting-started/introduction.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/getting-started/introduction.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.677 [error] GenServer #PID<0.317.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/install.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/install.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.677 [error] GenServer #PID<0.315.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.678 [error] GenServer #PID<0.319.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/learning.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/learning.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.678 [info]  Application crawler exited: shutdown

Filter by dynamic base URL

I'm trying to create a custom UrlFilter that lets through all URLs that start with the base URL passed to Crawler.crawl/2. I know I can use Registry or Agent to track this value, but is there a better way? The url_filter option only accepts a module, not a fun.

How to deal with WX?

When I add crawler as a dependency and run my app, I get this:

core-1  | =INFO REPORT==== 22-Jun-2024::14:41:08.600961 ===
core-1  |     application: kernel
core-1  |     exited: {{shutdown,
core-1  |                  {failed_to_start_child,on_load,
core-1  |                      {on_load_function_failed,gl,
core-1  |                          {error,
core-1  |                              {load_failed,
core-1  |                                  "Failed to load NIF library: '/app/lib/wx-2.4.1/priv/erl_gl.so: cannot open shared object file: No such file or directory'"}}}}},
core-1  |              {kernel,start,[normal,[]]}}
core-1  |     type: permanent
core-1  |
core-1  | Kernel pid terminated (application_controller) ("{application_start_failure,kernel,{{shutdown,{failed_to_start_child,on_load,{on_load_function_failed,gl,{error,{load_failed,\"Failed to load NIF library: '/app/lib/wx-2.4.1/priv/erl_gl.so: cannot open shared object file: No such file or directory'\"}}}}},{kernel,start,[normal,[]]}}}")

Digging into this, I found that _build/prod/lib/crawler/ebin/crawler.app has wx inside applications:

{application,crawler,
             [{config_mtime,1719031138},
              {optional_applications,[]},
              {applications,[kernel,stdlib,elixir,logger,runtime_tools,
                             observer,wx,httpoison,floki,opq,retry]},

So it is clear that crawler introduces this problem for me.
Am I the only one who has this problem in prod? Any suggestions on how to fix it?

I tried adding :wx to my extra_applications, and installing libwxgtk3.0-gtk3-0v5 libwxbase3.0-0v5 erlang-wx in my Dockerfile, but it did not solve the issue.

could not compile dependency :crawler

Hello people,

I've created a simple project (with and without --sup), added {:crawler, "~> 1.1"} to my mix.exs, and ran mix deps.get. Everything was fine. But when I launch it with iex -S mix I get:

# details above omitted

==> crawler
Compiling 32 files (.ex)

== Compilation error in file lib/crawler/fetcher/retrier.ex ==
** (CompileError) lib/crawler/fetcher/retrier.ex:25: undefined function exp_backoff/0
    (elixir 1.10.1) src/elixir_locals.erl:114: anonymous fn/3 in :elixir_locals.ensure_no_undefined_local/3
    (stdlib 3.11.2) erl_eval.erl:680: :erl_eval.do_apply/6
could not compile dependency :crawler, "mix compile" failed. You can recompile this dependency with "mix deps.compile crawler", update it with "mix deps.update crawler" or clean it with "mix deps.clean crawler"

Environment info:

  • Linux x30 5.6.10-3-MANJARO #1 SMP PREEMPT Sun May 3 11:03:28 UTC 2020 x86_64 GNU/Linux
  • Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]
  • Elixir 1.10.1 (compiled with Erlang/OTP 22)

Any idea what's happening? Thanks in advance.

customization strategy

Hi Fred, could I get some advice?
I looked at your code, and here's my understanding so far:

Every new URL to be parsed gets a new child Worker spun up; you then cast an opts message to the worker, and the worker gets the URL (and other options) from the opts.

The worker uses the url to
|> fetch
|> parse

In the fetch step, if the Policer validity check passes and the Recorder successfully stores the URL of the page we're about to fetch, the Fetcher retrieves the page.

In the parse step, the parser finds all the links in the body; at each link-found event ("OnLink", so to speak) the link_handler function is invoked, and by default link_handler == Dispatcher.dispatch, which in turn calls Crawler.crawl recursively.

Now, thinking about my use cases (more later), for a given root domain I will be going 4 layers deep, but I will need different logic at each layer; and, for different root domains the logic will be different. By different "logic," I mean (I think) different Parser and Fetcher logic (how to find the link(s) for the next level, what to find in the body and record/persist).

My first thought of how I might do this is by adding a field to the opts keys. I.e., when Crawler.crawl pipes to |> Options.assign_url(url), I could use pattern matching on the url in the assign_url function to add (depending on root domain and parse depth -- the current state) new opts fields -- :parser_strategy and :recorder_strategy.

These new opts fields could hold values that correspond to states of a Fetcher (or Recorder) state machine and a Parser state machine. I.e., in fetch_url_200, where Recorder.store_page is called, that call will invoke one of many (pattern matched) functions; so, a lot of my domain-specific, level-specific logic would be implemented in Recorder.store_page, based on opts[:recorder_strategy].

In summary, if I followed what I describe above, the places I see where I would be adding my customization would be in:

Options.assign_url
(Use business logic to set :recorder_strategy and :parser_strategy states, based on previous state and url. Hence, "OnLink" are FSM transitions.)

Recorder.store_page
(Call one of multiple versions of do_store_page(), depending on :recorder_strategy state)

Parser.parse_links
(Call one of multiple versions of do_parse_links(), depending on :parser_strategy state)


Thoughts?
Thanks,
David


Also, FYI, to get it to build on Windows I had to bump :httpoison to "~> 0.13".

When finished

How can we detect that the crawler has finished crawling the site?

** (ArgumentError) invalid syntax, only "retry", "after" and "else" are permitted

It seems that updating the ElixirRetry package has introduced an issue. I cannot add this project to my mix.exs and run

iex -S mix

without receiving this error

== Compilation error in file lib/crawler/fetcher/retrier.ex ==
** (ArgumentError) invalid syntax, only "retry", "after" and "else" are permitted
    expanding macro: Retry.retry/2
    lib/crawler/fetcher/retrier.ex:25: Crawler.Fetcher.Retrier.perform/2
    (elixir) lib/kernel/parallel_compiler.ex:198: anonymous fn/4 in Kernel.ParallelCompiler.spawn_workers/6

Retrieving all items in Crawler.Store

It appears you can only find pages that have been crawled if you know the URL; there's no way to return all pages in the store. Is this by design, or just an incomplete API for interfacing with the registry?

I ask because it seems to be getting all the pages, but if I don't know the URL I have no way to get the information out of the store.

High performance: Store gets flooded when too many pages are crawled

If I try to launch 100 pages to be crawled (each with depth 5),
after a bit the Store process gets flooded and starts dropping messages:

2020-02-10 16:07:42.374 [debug] "Failed to fetch https://mystays.rwiths.net/r-withs/tfi0020a.do?GCode=mystays&ciDateY=2020&ciDateM=02&ciDateD=10&coDateY=2020&coDateM=02&coDateD=11&s1=0&s2=0&y1=0&y2=0&y3=0&y4=0&room=1&otona=4&hotelNo=38599&dataPattern=PL&cdHeyaShu=t2&planId=4114101&f_lang=ja, reason: checkout_timeout"

The upside of having a Registry is that the store is global and the crawler can be run from multiple machines.
The downside is that this single process will become a bottleneck for high performance.
Would you be open to using Mnesia (a fast, distributed, in-memory DB)?
If you don't mind losing the distributed part, I would use an ETS table for the store, which should be able to handle more load.

The solution to this is to break up the crawling of all those URLs and not send them all at the same time.

Let me know if you are open to this; I'm happy to put up a tentative PR.

robots.txt

Does Crawler respect robots.txt, or have an option for that?
