
onigumo's Introduction

Onigumo

About

Onigumo is yet another web-crawler. It “crawls” websites or webapps, storing their data in a structured form suitable for further machine processing.

Architecture

Onigumo is composed of three sequentially interconnected components:

  • the Operator,
  • the Downloader,
  • the Parser.

The flowchart below illustrates the flow of data between those parts:

flowchart LR
    start([START]) --> onigumo_operator[OPERATOR]
    onigumo_operator -- <hash>.urls --> onigumo_downloader[DOWNLOADER]
    onigumo_downloader -- <hash>.raw --> onigumo_parser[PARSER]
    onigumo_parser -- <hash>.json --> onigumo_operator

    onigumo_operator <-.-> spider_operator[OPERATOR]
    onigumo_parser <-.-> spider_parser[PARSER]

    onigumo_operator --> spider_materialization[MATERIALIZER]

    subgraph "Onigumo (kernel)"
        onigumo_operator
        onigumo_downloader
        onigumo_parser
    end

    subgraph "Spider (application)"
        spider_operator
        spider_parser
        spider_materialization
    end

Operator

The Operator determines the URL addresses for the Downloader. A Spider is responsible for adding those URLs, which it extracts from the structured data provided by the Parser.

The Operator’s job is to:

  1. initialize a Spider,
  2. extract new URLs from structured data,
  3. insert those URLs into the Downloader queue.

Downloader

The Downloader fetches and saves the contents and metadata of unprocessed URLs.

The Downloader’s job is to:

  1. read URLs for download,
  2. check for already downloaded URLs,
  3. fetch each URL's contents along with its metadata,
  4. save the downloaded data.

Parser

The Parser processes the downloaded content and metadata into a structured form.

The Parser's job is to:

  1. check which downloaded URLs await processing,
  2. process the content and metadata of the downloaded URLs into a structured form,
  3. save the structured data.

Applications (Spiders)

A Spider extracts the needed information from the structured data.

The nature of the output data and information depends on the user's needs and on the form of the web content. It is impossible to create a universal spider satisfying all requirements arising from combinations of the two above, which is why you need to write a spider of your own.

Materializer

Usage

Credits

© Glutexo, nappex 2019 – 2022

Licensed under the MIT license.


onigumo's Issues

📝 Readme suggestions

I received some feedback on the README:

  • Promote the Spider/Application section to the top. The fact that Onigumo is a framework and not a ready-to-use solution is essential information. (Agreed.)
  • Calling applications “Spiders” is unintuitive. Either explain it right at the top and use punctuation to emphasize it is a custom term, or ditch it entirely and revert to “application” instead.

Use Streams instead of lists

The URL input file is read as a Stream, line by line. This is how the URLs should be handled all the way down, to save resources and keep the code cleaner. The goal is to process the whole input as a Stream of actions. That way, all the steps could be stacked without keeping whole lists in memory, which could get problematic once the actual HTTP request and the save to disk become separate steps, as suggested in #19.

The flow would become like this:

  1. Read a line from the input file. (Done naturally by File.stream!/3)
  2. Trim the trailing newline to get a valid URL.
  3. Perform an HTTP request.
  4. Validate the response.
  5. Extract the body from the response.
  6. Generate a filename. (Not in place yet.)
  7. Save the body to a file.

Steps 1 and 2 are addressed by #66. Steps 3 to 5 would be really resource-heavy if a whole list were rebuilt at every step.
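A minimal sketch of the resulting pipeline, assuming HTTPoison as the HTTP client; filename/1 is a hypothetical helper for step 6, which is not in place yet:

    "urls.txt"
    |> File.stream!()                               # 1. read line by line
    |> Stream.map(&String.trim_trailing(&1, "\n"))  # 2. trim the newline
    |> Stream.map(&HTTPoison.get!/1)                # 3. perform the request
    |> Stream.map(fn %HTTPoison.Response{status_code: 200, body: body} ->
      body                                          # 4.-5. validate and extract
    end)
    |> Stream.each(&File.write!(filename(&1), &1))  # 6.-7. name and save
    |> Stream.run()                                 # force the lazy stream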

Separate download func to smaller units

We'd like to divide Onigumo.download into smaller parts, following the philosophy that one function performs one task.
This approach also lends itself better to thorough testing. We'd like to use tmp_dir in tests that work with files; for that, we need a way to pass the file path as an argument. Once we have that, we will be able to pass tmp_dir as an argument in integration tests.
Onigumo.download performs three different tasks at the moment.

The three tasks, sketched below, are:

  1. make a request,
  2. check the response and match its body to a variable,
  3. write the body content to a file.
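A sketch of what the split could look like, assuming an HTTPoison-style client; the function names are illustrative:

    defmodule Onigumo.Downloader do
      # 1. make a request
      def make_request(url), do: HTTPoison.get!(url)

      # 2. check the response and match its body to a variable
      def extract_body(%HTTPoison.Response{status_code: 200, body: body}), do: body

      # 3. write the body content to a file
      def write_body(body, path), do: File.write!(path, body)
    end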

Improve argument order for download/3

A follow-up to #32 and #40. More information in a review.

tl;dr

def download(url, path, http)

feels better than

def download(url, http, path)

because then the behaviors (modules) sit together on one side of the call and the literals (strings) on the other.

Use function pattern matching to solve naming issues

If we use the function pattern matching available in Elixir, we can use the same name for several functions in the Downloader module.

Consider renaming main to download and download_urls_from_file to download as well; the two will then differ only by their arguments. download without any arguments will behave as our main, while download with a single argument, the input file, will behave as our download_urls_from_file. This approach solves our naming issue, as sketched below.

function pattern matching - https://inquisitivedeveloper.com/lwm-elixir-23/
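A sketch of the proposed shape; the module name is assumed to be the Downloader, and the body of download/1 is only a placeholder:

    defmodule Onigumo.Downloader do
      @input_file "urls.txt"

      # Formerly main/0: no arguments, defaults to the input file.
      def download(), do: download(@input_file)

      # Formerly download_urls_from_file/1: one argument, the file to read.
      def download(file) do
        file
        |> File.stream!()
        |> Enum.each(&IO.puts/1)  # placeholder for the real download work
      end
    end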

Rename Onigumo.CLI to Onigumo.Cli

For consistency with HttpPoison. I believe acronyms should behave as words in identifiers. In the snake case, it would be cli, not c_l_i.

Test Onigumo CLI

Add some basic tests to check that the commands $ ./onigumo and $ ./onigumo Downloader run without any errors.
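A minimal sketch of such a smoke test, assuming the escript is built as ./onigumo in the project root:

    defmodule Onigumo.CLITest do
      use ExUnit.Case

      test "the Downloader command runs without errors" do
        # System.cmd/2 returns the command output and its exit status
        {_output, exit_status} = System.cmd("./onigumo", ["Downloader"])
        assert exit_status == 0
      end
    end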

Handle network errors

But how? What is the correct behavior? Currently the program crashes. Maybe retry? But what if it does not help? I’d say skip and log. But we don’t log anything yet.

🎨 Consolidate flowchart formatting

The flowchart whitespace has become inconsistent. It no longer serves its purpose of centering or alignment. Either reformat it or collapse the whitespace into single spaces.

I prefer the latter because although whitespace alignment looks nice, it brings a considerable overhead to maintenance. Also, once the chart becomes more complex, it may even become impossible to format it nicely.

Consider using code-hygiene tools

This article advises setting up tools that control code quality at the start of every Elixir project.

The basic checking tools are:

  • mix format
  • mix credo --strict
  • mix dialyzer

We can run all of the above tools as one, named hygiene, with the command mix hygiene.
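One way to wire that up is a Mix alias; a sketch for mix.exs, assuming the alias list is hooked into the project definition via aliases: aliases():

    defp aliases do
      [
        hygiene: [
          "format --check-formatted",
          "credo --strict",
          "dialyzer"
        ]
      ]
    end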

Move `mix.lock` to `.gitignore`

Currently, dependency updates via $ mix deps.update change the mix.lock file; in particular, the hashes of the dependencies, which are derived from their versions.
I think mix.lock should be generated locally by running commands such as $ mix deps.update; it should not be created or overwritten by a git pull.
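The change itself would be a matter of two commands, sketched here (run from the project root):

    $ git rm --cached mix.lock
    $ echo "/mix.lock" >> .gitignore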

♻️ Rename Onigumo module to Downloader

The current Onigumo module does the downloading part of the workflow. Other components, like the Parser and the Operator, are to become distinct modules. The rename would improve semantics and prevent confusion once more modules appear in the codebase.

Enclose workers in an endless loop

For the app to become barely usable, its main components must run in an endless loop. The Operator, Downloader, and Parser must be able to pick new items without a need to re-run them manually.

The Downloader can already make use of such a loop.

The loop may belong outside, in the umbrella, rather than in the concrete workers. How does Elixir handle workers and scaling? OTP, and thus Elixir, utilizes the concept of Applications. GenServers and Supervisors look related too.
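A rough sketch of what such a loop could look like as a self-rescheduling GenServer; this is illustrative only, not a design decision:

    defmodule Onigumo.Worker do
      use GenServer

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

      @impl true
      def init(state) do
        send(self(), :work)  # kick off the first iteration
        {:ok, state}
      end

      @impl true
      def handle_info(:work, state) do
        # pick up and process new items here
        Process.send_after(self(), :work, 1_000)  # loop again in a second
        {:noreply, state}
      end
    end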

🏗 Use mix run instead of an escript

Escripts are for development tasks, not for running an application. Citing the Mix documentation:

Escripts should be used as a mechanism to share scripts between developers and not as a deployment mechanism. For running live systems, consider using mix run or building releases.

🏗 Move File cwd to Onigumo.CLI

The command line usage brings in the concept of a current working directory. The CWD does not make sense for a hypothetical GUI or web application. In the case of library usage, the parent application has an interface and determines the working directory for Onigumo.

Moving the working path determination to the CLI module brings another benefit in making the main function testable (#56). It prevents the usage of :tmp_dir in the tests.

Blocks #56

Rename elixir to master

The transition of this project to Elixir is complete, and its Ruby origin is long gone. For simplicity's sake, let's rename the main branch back to master.

Log progress

Log when a download starts and when it finishes.

Improve input path documentation

#45 moved the URLs input filename/path to a configuration file. It needs to be clear that this must be a relative path.

For runtime use, an absolute path in there may make some sense, but not much. The application is planned to operate in a single directory, currently the working directory, which is further to become configurable by an argument (#54).

The tests using temporary directories, however, need all files read and written to live under a relative path.

✅ Test main

This needs good mocks, probably also creating a behavior. Maybe even splitting the Onigumo module.

Add GitHub Actions

We'd like GitHub Actions to run:

  1. the tests: mix test,
  2. the formatting check: mix format --check-formatted, or mix hygiene (see #23).

Add reverted changes from #29

To keep the changeset as small as possible we reverted some improvements originally introduced in #29. Add those back more atomically. The list is in #29 (comment).

  • #34
  • #35
  • #36
  • String composition in the URLs input file from interpolation to concatenation.
  • #37

I reverted my own change too:

Extract input file name to Config

Use the Application config to store the input URL file name. That way, it can be shared between the executive code and the tests, as well as overridden for the test environment, as sketched below.
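A sketch of the idea; the key name urls_filename is illustrative:

    # config/config.exs
    import Config
    config :onigumo, urls_filename: "urls.txt"

    # config/test.exs, overriding for the test environment
    config :onigumo, urls_filename: "test_urls.txt"

    # reading the value anywhere in the code or tests
    filename = Application.fetch_env!(:onigumo, :urls_filename)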

📝 Unify checks in the components’ job description lists

  • The Operator checks the existence of .urls files. If one is there, the Operator has already processed the corresponding structured data in the .json file.
  • The Downloader checks the existence of .raw files. If one is there, the Downloader has already downloaded the corresponding URL.
  • The Parser checks the existence of .json files. If one is there, the Parser has already parsed the corresponding downloaded content in the .raw file.

Consolidate the numbered lists so they describe the check consistently.

There are no tests

Onigumo.download already does stuff. Test it!

Test that it downloads a file and writes it to body.html. Simple as that.
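A minimal sketch of that test, assuming Onigumo.download/0 writes into the current directory:

    defmodule OnigumoTest do
      use ExUnit.Case

      test "downloads a file and writes it to body.html" do
        Onigumo.download()
        assert File.exists?("body.html")
      end
    end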

📝 Extend the flowchart with Spider tasks

The flowchart lacks details about how the data flows from Onigumo to a Spider and back. I don’t want the flowchart to become overly complicated, but the distinction between the framework and its apps is crucial, in my opinion.

My idea is something like

Start → Onigumo Operator → Spider Operator → Downloader → Onigumo Parser → Spider Parser → back to Onigumo Operator → Spider Materialization

The main point here is that Onigumo only determines what to process, and the Spiders do the actual work. The Spider Operator receives a list of parsed entities and tells what to download. The Spider Parser gets a list of unparsed entities and provides structured data.

There is no parsed data at the beginning, which marks a starting state. When there is nothing more to download, it is time for materialization.

The flowchart may be a bit more colorful to distinguish between the core and the userspace.
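In the flowchart notation used above, the idea could be sketched like this (node names are illustrative):

    flowchart LR
        start([START]) --> onigumo_operator[Onigumo OPERATOR]
        onigumo_operator --> spider_operator[Spider OPERATOR]
        spider_operator --> downloader[DOWNLOADER]
        downloader --> onigumo_parser[Onigumo PARSER]
        onigumo_parser --> spider_parser[Spider PARSER]
        spider_parser --> onigumo_operator
        onigumo_operator --> spider_materialization[Spider MATERIALIZATION]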

💚 Make the PR check fail on a warning

Our code currently compiles without any warnings, so newly appearing ones point to genuine errors. I suggest configuring the pipeline to fail if the compilation yields any.
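Mix supports this directly, so the CI step could be as simple as:

    $ mix compile --warnings-as-errors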

Create module parser

We expect input data known in advance, downloaded by the Downloader. The Downloader fetches the textual content of a page. The text data will contain one URL per line (as, for example, Gopher once did). The Parser's output will be a JSON file containing a list of all the URLs found in the text of the given page. A sketch follows below.
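A sketch of such a module, assuming Jason for the JSON encoding; the module name and paths are illustrative:

    defmodule Onigumo.Parser do
      # Reads the one-URL-per-line text saved by the Downloader and
      # writes the list of the found URLs out as JSON.
      def parse(input_path, output_path) do
        urls =
          input_path
          |> File.read!()
          |> String.split("\n", trim: true)

        File.write!(output_path, Jason.encode!(urls))
      end
    end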
