
onigumo's Introduction

Onigumo

About

Onigumo is yet another web-crawler. It “crawls” websites or webapps, storing their data in a structured form suitable for further machine processing.

Architecture

Onigumo is composed of three sequentially interconnected components:

  • the Operator,
  • the Downloader,
  • the Parser.

The flowchart below illustrates the flow of data between those parts:

flowchart LR
    start([START]) --> onigumo_operator[OPERATOR]
    onigumo_operator -- <hash>.urls --> onigumo_downloader[DOWNLOADER]
    onigumo_downloader -- <hash>.raw --> onigumo_parser[PARSER]
    onigumo_parser -- <hash>.json --> onigumo_operator

    onigumo_operator <-.-> spider_operator[OPERATOR]
    onigumo_parser <-.-> spider_parser[PARSER]

    onigumo_operator --> spider_materialization[MATERIALIZER]

    subgraph "Onigumo (kernel)"
        onigumo_operator
        onigumo_downloader
        onigumo_parser
    end

    subgraph "Spider (application)"
        spider_operator
        spider_parser
        spider_materialization
    end

Operator

The Operator determines the URL addresses for the Downloader. A Spider is responsible for adding those URLs, which it extracts from the structured data provided by the Parser.

The Operator’s job is to:

  1. initialize a Spider,
  2. extract new URLs from structured data,
  3. insert those URLs into the Downloader queue.

Downloader

The Downloader fetches and saves the contents and metadata of unprocessed URLs.

The Downloader’s job is to:

  1. read URLs for download,
  2. check for already downloaded URLs,
  3. fetch each URL's contents along with its metadata,
  4. save the downloaded data.

Parser

The Parser processes the downloaded content and metadata into a structured form.

The Parser's job is to:

  1. check which downloaded URLs await processing,
  2. process the content and metadata of the downloaded URLs into a structured form,
  3. save the structured data.

Applications (Spiders)

A Spider extracts the needed information from the structured data.

The nature of the output data and information depends on the user's needs and on the form of the web content. It is impossible to create a universal spider satisfying all requirements arising from combinations of the two above, which is why you need to write a spider of your own.

Materializer

Usage

Credits

© Glutexo, nappex 2019 – 2022

Licensed under the MIT license.


onigumo's Issues

📝 Readme suggestions

I received some feedback on the README:

  • Promote the Spider/Application section to the top. The fact that Onigumo is a framework and not a ready-to-use solution is essential information. (Agreed.)
  • Calling applications “Spiders” is unintuitive. Either explain it right at the top and use punctuation to emphasize it is a custom term, or ditch it entirely and revert to “application” instead.

Use Streams instead of lists

The URL input file is read as a Stream, line by line. This is how the URLs should be handled all the way down, to save resources and keep the code cleaner. The goal is to process the whole input as a Stream of actions. That way, all the steps could be stacked without keeping whole lists in memory, which could get problematic once the actual HTTP request and the save to disk become separate steps, as suggested in #19.

The flow would become like this:

  1. Read a line from the input file. (Done naturally by File.stream!/3)
  2. Trim the trailing newline to get a valid URL.
  3. Perform an HTTP request.
  4. Validate the response.
  5. Extract the body from the response.
  6. Generate a filename. (Not in place yet.)
  7. Save the body to a file.

Steps 1 and 2 are addressed by #66. Steps 3 to 5 would be really resource-heavy if a whole list were rebuilt at every step.
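A minimal sketch of the resulting pipeline, assuming HTTPoison as the HTTP client; filename/1 is a hypothetical helper for step 6, which is not in place yet:

    "urls.txt"
    |> File.stream!()                               # 1. read line by line
    |> Stream.map(&String.trim_trailing(&1, "\n"))  # 2. trim the newline
    |> Stream.map(&HTTPoison.get!/1)                # 3. perform the request
    |> Stream.map(fn %HTTPoison.Response{status_code: 200, body: body} ->
      body                                          # 4.-5. validate and extract
    end)
    |> Stream.each(&File.write!(filename(&1), &1))  # 6.-7. name and save
    |> Stream.run()                                 # force the lazy stream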

Separate download func to smaller units

We'd like to divide Onigumo.download into smaller parts, following the philosophy that one function performs one task.
This approach also lends itself better to thorough testing. We'd like to use tmp_dir in tests that work with files; for that, we need a way to pass the file path as an argument. Once we have that, we will be able to pass tmp_dir as an argument in integration tests.
Onigumo.download performs three different tasks at the moment.

The three tasks, sketched below, are:

  1. make a request,
  2. check the response and match its body to a variable,
  3. write the body content to a file.
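A sketch of what the split could look like, assuming an HTTPoison-style client; the function names are illustrative:

    defmodule Onigumo.Downloader do
      # 1. make a request
      def make_request(url), do: HTTPoison.get!(url)

      # 2. check the response and match its body to a variable
      def extract_body(%HTTPoison.Response{status_code: 200, body: body}), do: body

      # 3. write the body content to a file
      def write_body(body, path), do: File.write!(path, body)
    end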

Improve argument order for download/3

A follow-up to #32 and #40. More information in a review.

tl;dr

def download(url, path, http)

feels better than

def download(url, http, path)

because then the behaviors (modules) sit together on one side of the call and the literals (strings) on the other.

Use function pattern matching to solve naming issues

If we use the function pattern matching available in Elixir, we can use the same name for several functions in the Downloader module.

Consider renaming main to download and download_urls_from_file to download as well; the two will then differ only by their arguments. download without any arguments will behave as our main, while download with a single argument, the input file, will behave as our download_urls_from_file. This approach solves our naming issue, as sketched below.

function pattern matching - https://inquisitivedeveloper.com/lwm-elixir-23/
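A sketch of the proposed shape; the module name is assumed to be the Downloader, and the body of download/1 is only a placeholder:

    defmodule Onigumo.Downloader do
      @input_file "urls.txt"

      # Formerly main/0: no arguments, defaults to the input file.
      def download(), do: download(@input_file)

      # Formerly download_urls_from_file/1: one argument, the file to read.
      def download(file) do
        file
        |> File.stream!()
        |> Enum.each(&IO.puts/1)  # placeholder for the real download work
      end
    end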

Rename Onigumo.CLI to Onigumo.Cli

For consistency with HttpPoison. I believe acronyms should behave as words in identifiers. In the snake case, it would be cli, not c_l_i.

Test Onigumo CLI

Add some basic tests to check that the commands $ ./onigumo and $ ./onigumo Downloader run without any errors.
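A minimal sketch of such a smoke test, assuming the escript is built as ./onigumo in the project root:

    defmodule Onigumo.CLITest do
      use ExUnit.Case

      test "the Downloader command runs without errors" do
        # System.cmd/2 returns the command output and its exit status
        {_output, exit_status} = System.cmd("./onigumo", ["Downloader"])
        assert exit_status == 0
      end
    end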

Handle network errors

But how? What is the correct behavior? Currently the program crashes. Maybe retry? But what if it does not help? I’d say skip and log. But we don’t log anything yet.

🎨 Consolidate flowchart formatting

The flowchart whitespace has become inconsistent. It no longer serves its purpose of centering or alignment. Either reformat it or collapse the whitespace into single spaces.

I prefer the latter because although whitespace alignment looks nice, it brings a considerable overhead to maintenance. Also, once the chart becomes more complex, it may even become impossible to format it nicely.

Consider using code-hygiene tools

This article advises setting up tools that control code quality at the start of every Elixir project.

The basic checking tools are:

  • mix format
  • mix credo --strict
  • mix dialyzer

We can run all of the above tools as one, named hygiene, with the command mix hygiene.
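One way to wire that up is a Mix alias; a sketch for mix.exs, assuming the alias list is hooked into the project definition via aliases: aliases():

    defp aliases do
      [
        hygiene: [
          "format --check-formatted",
          "credo --strict",
          "dialyzer"
        ]
      ]
    end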

Move `mix.lock` to `.gitignore`

Currently, dependency updates via $ mix deps.update change the mix.lock file; in particular, the hashes of the dependencies, which are derived from their versions.
I think mix.lock should be generated locally by running commands such as $ mix deps.update; it should not be created or overwritten by a git pull.
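The change itself would be a matter of two commands, sketched here (run from the project root):

    $ git rm --cached mix.lock
    $ echo "/mix.lock" >> .gitignore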

♻️ Rename Onigumo module to Downloader

The current Onigumo module does the downloading part of the workflow. Other components, like the Parser and the Operator, are to become distinct modules. The rename would improve semantics and prevent confusion once more modules appear in the codebase.

Enclose workers in an endless loop

For the app to become barely usable, its main components must run in an endless loop. The Operator, Downloader, and Parser must be able to pick new items without a need to re-run them manually.

The Downloader can already make use of such a loop.

The loop may belong outside, in the umbrella, rather than in the concrete workers. How does Elixir handle workers and scaling? OTP, and thus Elixir, utilizes the concept of Applications. GenServers and Supervisors look related too.
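A rough sketch of what such a loop could look like as a self-rescheduling GenServer; this is illustrative only, not a design decision:

    defmodule Onigumo.Worker do
      use GenServer

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

      @impl true
      def init(state) do
        send(self(), :work)  # kick off the first iteration
        {:ok, state}
      end

      @impl true
      def handle_info(:work, state) do
        # pick up and process new items here
        Process.send_after(self(), :work, 1_000)  # loop again in a second
        {:noreply, state}
      end
    end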

🏗 Use mix run instead of an escript

Escripts are for development tasks, not for running an application. Citing the Mix documentation:

Escripts should be used as a mechanism to share scripts between developers and not as a deployment mechanism. For running live systems, consider using mix run or building releases.

🏗 Move File cwd to Onigumo.CLI

The command line usage brings in the concept of a current working directory. The CWD does not make sense for a hypothetical GUI or web application. In the case of library usage, the parent application has an interface and determines the working directory for Onigumo.

Moving the working path determination to the CLI module brings another benefit in making the main function testable (#56). It prevents the usage of :tmp_dir in the tests.

Blocks #56

Rename elixir to master

The transition of this project to Elixir is complete, and its Ruby origin is long gone. For simplicity's sake, let's rename the main branch back to master.

Log progress

Log when a download starts and when it finishes.

Improve input path documentation

#45 moved the URLs input filename/path to a configuration file. It needs to be clear that this must be a relative path.

For runtime use, an absolute path in there may make some sense, but not much. The application is planned to operate in a single directory, currently the working directory, which is further to become configurable by an argument (#54).

The tests using temporary directories, however, need all files read and written to live under a relative path.

✅ Test main

This needs good mocks, probably also creating a behavior. Maybe even splitting the Onigumo module.

Add GitHub Actions

We'd like GitHub Actions to run:

  1. the tests: mix test,
  2. the formatting check: mix format --check-formatted, or mix hygiene (see #23).

Add reverted changes from #29

To keep the changeset as small as possible we reverted some improvements originally introduced in #29. Add those back more atomically. The list is in #29 (comment).

  • #34
  • #35
  • #36
  • String composition in the URLs input file from interpolation to concatenation.
  • #37

I reverted my own change too:

Extract input file name to Config

Use the Application config to store the input URL file name. That way, it can be shared between the executive code and the tests, as well as overridden for the test environment, as sketched below.
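A sketch of the idea; the key name urls_filename is illustrative:

    # config/config.exs
    import Config
    config :onigumo, urls_filename: "urls.txt"

    # config/test.exs, overriding for the test environment
    config :onigumo, urls_filename: "test_urls.txt"

    # reading the value anywhere in the code or tests
    filename = Application.fetch_env!(:onigumo, :urls_filename)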

📝 Unify checks in the components’ job description lists

  • The Operator checks the existence of .urls files. If one is there, the Operator has already processed the corresponding structured data in the .json file.
  • The Downloader checks the existence of .raw files. If one is there, the Downloader has already downloaded the corresponding URL.
  • The Parser checks the existence of .json files. If one is there, the Parser has already parsed the corresponding downloaded content in the .raw file.

Consolidate the numbered lists so they describe the check consistently.

There are no tests

Onigumo.download already does stuff. Test it!

Test that it downloads a file and writes it to body.html. Simple as that.
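A minimal sketch of that test, assuming Onigumo.download/0 writes into the current directory:

    defmodule OnigumoTest do
      use ExUnit.Case

      test "downloads a file and writes it to body.html" do
        Onigumo.download()
        assert File.exists?("body.html")
      end
    end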

📝 Extend the flowchart with Spider tasks

The flowchart lacks details about how the data flows from Onigumo to a Spider and back. I don’t want the flowchart to become overly complicated, but the distinction between the framework and its apps is crucial, in my opinion.

My idea is something like

Start → Onigumo Operator → Spider Operator → Downloader → Onigumo Parser → Spider Parser → back to Onigumo Operator → Spider Materialization

The main point here is that Onigumo only determines what to process, and the Spiders do the actual work. The Spider Operator receives a list of parsed entities and tells what to download. The Spider Parser gets a list of unparsed entities and provides structured data.

There is no parsed data at the beginning, which marks a starting state. When there is nothing more to download, it is time for materialization.

The flowchart may be a bit more colorful to distinguish between the core and the userspace.
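In the flowchart notation used above, the idea could be sketched like this (node names are illustrative):

    flowchart LR
        start([START]) --> onigumo_operator[Onigumo OPERATOR]
        onigumo_operator --> spider_operator[Spider OPERATOR]
        spider_operator --> downloader[DOWNLOADER]
        downloader --> onigumo_parser[Onigumo PARSER]
        onigumo_parser --> spider_parser[Spider PARSER]
        spider_parser --> onigumo_operator
        onigumo_operator --> spider_materialization[Spider MATERIALIZATION]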

💚 Make the PR check fail on a warning

Our code currently compiles without any warnings, so newly appearing ones point to genuine errors. I suggest configuring the pipeline to fail if the compilation yields any.
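Mix supports this directly, so the CI step could be as simple as:

    $ mix compile --warnings-as-errors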

Create module parser

We expect input data known in advance, downloaded by the Downloader. The Downloader fetches the textual content of a page. The text data will contain one URL per line (as, for example, Gopher once did). The Parser's output will be a JSON file containing a list of all the URLs found in the text of the given page. A sketch follows below.
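A sketch of such a module, assuming Jason for the JSON encoding; the module name and paths are illustrative:

    defmodule Onigumo.Parser do
      # Reads the one-URL-per-line text saved by the Downloader and
      # writes the list of the found URLs out as JSON.
      def parse(input_path, output_path) do
        urls =
          input_path
          |> File.read!()
          |> String.split("\n", trim: true)

        File.write!(output_path, Jason.encode!(urls))
      end
    end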
