
crawler's Introduction

Crawler


A high-performance web crawler / scraper in Elixir, with worker pooling and rate limiting via OPQ.

Features

  • Crawl assets (JavaScript, CSS and images).
  • Save to disk.
  • Hook for scraping content.
  • Restrict crawlable domains, paths or content types.
  • Limit concurrent crawlers.
  • Limit rate of crawling.
  • Set the maximum crawl depth.
  • Set timeouts.
  • Set the retry strategy.
  • Set the crawler's user agent.
  • Manually pause/resume/stop the crawler.

See Hex documentation.

Architecture

Below is a very high-level architecture diagram demonstrating how Crawler works.

Usage

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

There are several ways to access the crawled page data:

  1. Use Crawler.Store
  2. Tap into the underlying store, Crawler.Store.DB, directly
  3. Use your own scraper
  4. If the :save_to option is set, pages will be saved to disk in addition to the above-mentioned places
  5. Provide your own custom parser and manage how data is stored and accessed yourself
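
For example, a minimal sketch of option 1, assuming the crawl has had enough time to run (the sleep and the listed URLs are illustrative only; see the Crawler.Store docs for the lookup functions available in your version):

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

# Give the crawl some time to run, then list every URL that has been stored.
Process.sleep(10_000)

Crawler.Store.all_urls()
# => ["http://elixir-lang.org", "http://elixir-lang.org/install.html", ...]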

Configurations

Option Type Default Value Description
:assets list [] Whether to fetch any asset files, available options: "css", "js", "images".
:save_to string nil When provided, the path for saving crawled pages.
:workers integer 10 Maximum number of concurrent workers for crawling.
:interval integer 0 Rate limit control - number of milliseconds before crawling more pages, defaults to 0 which is effectively no rate limit.
:max_depths integer 3 Maximum nested depth of pages to crawl.
:max_pages integer :infinity Maximum amount of pages to crawl.
:timeout integer 5000 Timeout value for fetching a page, in ms. Can also be set to :infinity, useful when combined with Crawler.pause/1.
:retries integer 2 Number of times to retry a fetch.
:store module nil Module for storing the crawled page data and crawling metadata. You can set it to Crawler.Store or use your own module, see Crawler.Store.add_page_data/3 for implementation details.
:force boolean false Force crawling URLs even if they have already been crawled, useful if you want to refresh the crawled data.
:scope term nil Similar to :force, but you can pass a custom :scope to determine how Crawler should treat links it has already seen.
:user_agent string Crawler/x.x.x (...) User-Agent value sent by the fetch requests.
:url_filter module Crawler.Fetcher.UrlFilter Custom URL filter, useful for restricting crawlable domains, paths or content types.
:retrier module Crawler.Fetcher.Retrier Custom fetch retrier, useful for retrying failed crawls, nullifies the :retries option.
:modifier module Crawler.Fetcher.Modifier Custom modifier, useful for adding custom request headers or options.
:scraper module Crawler.Scraper Custom scraper, useful for scraping content as soon as the parser parses it.
:parser module Crawler.Parser Custom parser, useful for handling parsing differently or to add extra functionalities.
:encode_uri boolean false When set to true, applies URI.encode/1 to the URL to be crawled.
:queue pid nil You can pass in an OPQ pid so that multiple crawlers can share the same queue.
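
For example, a hedged sketch combining several of these options (the values are purely illustrative, not recommendations):

Crawler.crawl("https://elixir-lang.org",
  max_depths: 3,
  max_pages: 1_000,
  workers: 5,
  interval: 1_000,               # wait 1 second between pages
  timeout: 10_000,               # allow 10 seconds per fetch
  assets: ["css", "images"],
  save_to: "/tmp/crawled_pages",
  user_agent: "MyCrawler/1.0 (hello@example.com)"
)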

Custom Modules

It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:

Retrier

See Crawler.Fetcher.Retrier.

Crawler uses ElixirRetry's exponential backoff strategy by default.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
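
As a hedged sketch, a retrier that performs no retries at all might look like this. It assumes the Spec callback is perform/2 and that its first argument is the fetch function, as in the default Crawler.Fetcher.Retrier; check that module's source for the exact contract.

defmodule NoRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Invoke the fetch once, without any retry or backoff logic.
  def perform(fetch, _opts), do: fetch.()
end

Crawler.crawl("https://elixir-lang.org", retrier: NoRetrier)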

URL Filter

See Crawler.Fetcher.UrlFilter.

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
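
A hedged sketch of a filter that restricts crawling to a single host, assuming the Spec callback is filter/2 and that it returns {:ok, boolean} (the default Crawler.Fetcher.UrlFilter lets every URL through; check its source for the exact contract):

defmodule SingleHostUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Only allow URLs hosted on elixir-lang.org to be crawled.
  def filter(url, _opts) do
    {:ok, URI.parse(url).host == "elixir-lang.org"}
  end
end

Crawler.crawl("https://elixir-lang.org", url_filter: SingleHostUrlFilter)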

Scraper

See Crawler.Scraper.

defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
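
A hedged sketch of a scraper follows. It assumes the Spec callback is scrape/1, that it receives the crawled page (with url, body and opts) and that it must return {:ok, page} for crawling to continue; check Crawler.Scraper for the exact callback name and page structure.

defmodule LoggingScraper do
  @behaviour Crawler.Scraper.Spec

  # Log the URL and body size of every crawled page, then pass the page on unchanged.
  def scrape(page) do
    IO.puts("Scraped #{page.url} (#{byte_size(page.body)} bytes)")
    {:ok, page}
  end
end

Crawler.crawl("https://elixir-lang.org", scraper: LoggingScraper)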

Parser

See Crawler.Parser.

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end

Modifier

See Crawler.Fetcher.Modifier.

defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end

Pause / Resume / Stop Crawler

Crawler provides pause/1, resume/1 and stop/1, see below.

{:ok, opts} = Crawler.crawl("https://elixir-lang.org")

Crawler.running?(opts) # => true

Crawler.pause(opts)

Crawler.running?(opts) # => false

Crawler.resume(opts)

Crawler.running?(opts) # => true

Crawler.stop(opts)

Crawler.running?(opts) # => false

Please note that when pausing Crawler, you need to set a large enough :timeout (or even set it to :infinity), otherwise the parser will time out due to unprocessed links.
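
For instance, a minimal sketch of a pausable crawl following this advice:

{:ok, opts} = Crawler.crawl("https://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)

# ... do other work while the crawl is paused ...

Crawler.resume(opts)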

Multiple Crawlers

It is possible to start multiple crawlers sharing the same queue.

{:ok, queue} = OPQ.init(worker: Crawler.Dispatcher.Worker, workers: 2)

Crawler.crawl("https://elixir-lang.org", queue: queue)
Crawler.crawl("https://github.com", queue: queue)

Find All Scraped URLs

Crawler.Store.all_urls() # => ["https://elixir-lang.org", "https://google.com", ...]

Examples

Google Search + GitHub

This example performs a Google search, then scrapes the results to find GitHub projects and output their names and descriptions.

See the source code.

You can run the example by cloning the repo and running the command:

mix run -e "Crawler.Example.GoogleSearch.run()"

API Reference

Please see https://hexdocs.pm/crawler.

Changelog

Please see CHANGELOG.md.

Copyright and License

Copyright (c) 2016 Fred Wu

This work is free. You can redistribute it and/or modify it under the terms of the MIT License.

crawler's People

Contributors

arashsc, fredwu, jpyamamoto, kianmeng, merongivian, phereford, rhnonose, supamic


crawler's Issues

plug_cowboy dependency

Hi @fredwu .

Thank you so much for this awesome (and popular) library! One quick question:

  • Is the plug_cowboy dependency used in the library outside of testing?

I can't quite find it being used in the library. If not, would you be OK with me creating a PR that moves the dependency to the :test environment only?

Crawler killed from consuming all memory or processes

I'm having trouble getting the crawler to successfully crawl with a depth of more than 2. It's able to filter many links and scrape pages, but after a couple of minutes the process gets killed. At the end it outputs:

erl_child_setup closed
Killed

I don't see any errors on the console and the erl_crash.dump doesn't really say much. I see a bunch of references to ('Elixir.Crawler.QueueHandler':enqueue/1 + 248), but that's all.

Has anyone else seen this?

Example Project for basic config

I keep getting errors when dropping it into a blank project. What is the basic config for getting it up and running? Maybe an example would help, or instructions in the README on how to integrate it into your project. If I figure it out, I'll make a pull request.

Error:


iex(1)> Scraper.crawl
:ok
iex(2)> 
17:31:26.677 [error] GenServer #PID<0.318.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/getting-started/introduction.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/getting-started/introduction.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.677 [error] GenServer #PID<0.317.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/install.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/install.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.677 [error] GenServer #PID<0.315.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.678 [error] GenServer #PID<0.319.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/learning.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/learning.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.678 [info]  Application crawler exited: shutdown

Filter by dynamic base URL

I'm trying to create a custom UrlFilter that lets through all URLs that start with the base URL passed to Crawler.crawl/2. I know I can use Registry or Agent to track this value, but is there a better way? The url_filter option only accepts a module, not a fun.

How to deal with WX?

When I add crawler as a dependency and run my app, I get this:

core-1  | =INFO REPORT==== 22-Jun-2024::14:41:08.600961 ===
core-1  |     application: kernel
core-1  |     exited: {{shutdown,
core-1  |                  {failed_to_start_child,on_load,
core-1  |                      {on_load_function_failed,gl,
core-1  |                          {error,
core-1  |                              {load_failed,
core-1  |                                  "Failed to load NIF library: '/app/lib/wx-2.4.1/priv/erl_gl.so: cannot open shared object file: No such file or directory'"}}}}},
core-1  |              {kernel,start,[normal,[]]}}
core-1  |     type: permanent
core-1  |
core-1  | Kernel pid terminated (application_controller) ("{application_start_failure,kernel,{{shutdown,{failed_to_start_child,on_load,{on_load_function_failed,gl,{error,{load_failed,\"Failed to load NIF library: '/app/lib/wx-2.4.1/priv/erl_gl.so: cannot open shared object file: No such file or directory'\"}}}}},{kernel,start,[normal,[]]}}}")

Digging into this, I found that _build/prod/lib/crawler/ebin/crawler.app has wx inside applications:

{application,crawler,
             [{config_mtime,1719031138},
              {optional_applications,[]},
              {applications,[kernel,stdlib,elixir,logger,runtime_tools,
                             observer,wx,httpoison,floki,opq,retry]},

So it is clear that crawler introduces this problem for me.
Am I the only one who has this problem in prod? Any suggestions on how to fix it?

I tried adding :wx to my extra_applications, and installing libwxgtk3.0-gtk3-0v5 libwxbase3.0-0v5 erlang-wx in my Dockerfile, but it did not solve the issue.

could not compile dependency :crawler

Hello people,

I've created a simple project (with and without --sup), added {:crawler, "~> 1.1"} to my mix.exs, and ran mix deps.get. Everything was fine. But when I launch it with iex -S mix I get:

# details above omitted

==> crawler
Compiling 32 files (.ex)

== Compilation error in file lib/crawler/fetcher/retrier.ex ==
** (CompileError) lib/crawler/fetcher/retrier.ex:25: undefined function exp_backoff/0
    (elixir 1.10.1) src/elixir_locals.erl:114: anonymous fn/3 in :elixir_locals.ensure_no_undefined_local/3
    (stdlib 3.11.2) erl_eval.erl:680: :erl_eval.do_apply/6
could not compile dependency :crawler, "mix compile" failed. You can recompile this dependency with "mix deps.compile crawler", update it with "mix deps.update crawler" or clean it with "mix deps.clean crawler"

Environment info:

  • Linux x30 5.6.10-3-MANJARO #1 SMP PREEMPT Sun May 3 11:03:28 UTC 2020 x86_64 GNU/Linux
  • Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]
  • Elixir 1.10.1 (compiled with Erlang/OTP 22)

Any idea what's happening? Thanks in advance.

customization strategy

Hi Fred, could I get some advice?
I looked at your code, and here's my understanding so far:

Every new URL to be parsed gets a new child Worker spun up; you then cast an opts message to the worker, and the worker gets the URL (and other options) from the opts.

The worker uses the url to
|> fetch
|> parse

In the fetch step, if the Policer validity check passes and the Recorder successfully stores the URL of the page we're about to fetch, the Fetcher retrieves the page.

In the parse step, the parser finds all the links in the body; at each link-found event ("OnLink", so to speak) the link_handler function is invoked, and by default link_handler == Dispatcher.dispatch, which in turn calls Crawler.crawl recursively.

Now, thinking about my use cases (more later), for a given root domain I will be going 4 layers deep, but I will need different logic at each layer; and, for different root domains the logic will be different. By different "logic," I mean (I think) different Parser and Fetcher logic (how to find the link(s) for the next level, what to find in the body and record/persist).

My first thought of how I might do this is by adding a field to the opts keys. I.e., when Crawler.crawl pipes to |> Options.assign_url(url), I could use pattern matching on the url in the assign_url function to add (depending on root domain and parse depth -- the current state) new opts fields -- :parser_strategy and :recorder_strategy.

These new opts fields could hold values that correspond to states of a Fetcher (or Recorder) state machine and a Parser state machine. I.e., in fetch_url_200, where Recorder.store_page is called, that call will invoke one of many (pattern matched) functions; so, a lot of my domain-specific, level-specific logic would be implemented in Recorder.store_page, based on opts[:recorder_strategy].

In summary, if I followed what I describe above, the places I see where I would be adding my customization would be in:

Options.assign_url
(Use business logic to set :recorder_strategy and :parser_strategy states, based on previous state and url. Hence, "OnLink" are FSM transitions.)

Recorder.store_page
(Call one of multiple versions of do_store_page(), depending on :recorder_strategy state)

Parser.parse_links
(Call one of multiple versions of do_parse_links(), depending on :parser_strategy state)


Thoughts?
Thanks,
David


Also, FYI, to get it to build on Windows I had to bump :httpoison to "~> 0.13".

When finished

How can we detect that the crawler has finished crawling the site?

** (ArgumentError) invalid syntax, only "retry", "after" and "else" are permitted

It seems that updating the ElixirRetry package has introduced an issue. I cannot add this project to my mix.exs and run

iex -S mix

without receiving this error

== Compilation error in file lib/crawler/fetcher/retrier.ex ==
** (ArgumentError) invalid syntax, only "retry", "after" and "else" are permitted
    expanding macro: Retry.retry/2
    lib/crawler/fetcher/retrier.ex:25: Crawler.Fetcher.Retrier.perform/2
    (elixir) lib/kernel/parallel_compiler.ex:198: anonymous fn/4 in Kernel.ParallelCompiler.spawn_workers/6

Retrieving all items in Crawler.Store

It appears you can only find pages that have been crawled if you know the URL; there's no way to return all pages in the store. Is this by design, or just an incomplete API for interfacing with the registry?

I ask because it seems to be getting all the pages, but if I don't know the URL I have no way to get the information out of the store.

High performance: Store gets flooded when too many pages are crawled

If I try to launch 100 pages to be crawled (each with depth 5),
after a bit the Store process gets flooded and starts dropping messages:

2020-02-10 16:07:42.374 [debug] "Failed to fetch https://mystays.rwiths.net/r-withs/tfi0020a.do?GCode=mystays&ciDateY=2020&ciDateM=02&ciDateD=10&coDateY=2020&coDateM=02&coDateD=11&s1=0&s2=0&y1=0&y2=0&y3=0&y4=0&room=1&otona=4&hotelNo=38599&dataPattern=PL&cdHeyaShu=t2&planId=4114101&f_lang=ja, reason: checkout_timeout"

The upside of having a Registry is that the store is global and the crawler can be run from multiple machines.
The downside is that this single process will become a bottleneck for high performance.
Would you be open to using Mnesia (a fast, distributed, in-memory DB)?
If you don't mind losing the distributed part, I would use an ETS table for the store, which should be able to handle more load.

The solution to this is to break up the crawling of all those URLs and not send them all at the same time.

Let me know if you are open to this; I'm happy to put up a tentative PR.

robots.txt

Does Crawler respect robots.txt, or have an option for that?
