
url-crawler's Introduction

schul-cloud-url-crawler


This crawler fetches resources from URLs and posts them to a server.

Purpose

The purpose of this crawler is:

  • We can provide test data to the API.
  • It can crawl resources on behalf of sources that are not active and cannot post to the API themselves.
  • Other crawl services can use this crawler to upload their conversions.
  • It contains the full crawler logic but does not transform resources into other formats.
    • Maybe we can derive recommendations or a library for other crawlers from this case.

Requirements

The crawler should work as follows:

  • Provide URLs
    • as command line arguments
    • as a link to a file with one URL per line
  • Provide resources
    • as one resource in a file
    • as a list of resources

The crawler must be invoked to crawl.

Example

This example fetches a resource from the URL and posts it to the API.

python3 -m ressource_url_crawler http://localhost:8080 \
        https://raw.githubusercontent.com/schul-cloud/ressources-api-v1/master/schemas/ressource/examples/valid/example-website.json

Authentication

You can specify the authentication like this:

  • --basic=username:password for basic authentication
  • --apikey=apikey for API key authentication
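
For example, the invocation from the example above could be authenticated like this (user:secret is a placeholder, and it is assumed that the option is passed before the positional arguments):

python3 -m ressource_url_crawler --basic=user:secret http://localhost:8080 \
        https://raw.githubusercontent.com/schul-cloud/ressources-api-v1/master/schemas/ressource/examples/valid/example-website.json

The --apikey option is passed in the same way.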

Further Requirements

  • The crawler does not post resources twice. This can be implemented by
    • caching the resources locally to see whether they changed
      • compare the resource content
      • compare the timestamp
    • removing the outdated resources from the database after the updated resources have been posted.

This may require some form of state for the crawler. The state could be added to the resources in an X-Ressources-Url-Crawler-Source field. This allows local caching and requires getting the objects from the database.
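
A minimal sketch of such a local cache, assuming a hypothetical ResourceCache class; the names and structure are illustrative, not the crawler's actual implementation:

import hashlib
import json

class ResourceCache:
    """Remembers which resources were already posted, keyed by their source URL."""

    def __init__(self):
        self._seen = {}  # maps source URL to a hash of the posted resource

    def has_changed(self, url, resource):
        # Hash the resource content so unchanged resources are not posted twice.
        digest = hashlib.sha256(
            json.dumps(resource, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if self._seen.get(url) == digest:
            return False
        self._seen[url] = digest
        return True

The crawler would call has_changed(url, resource) before posting and skip unchanged resources; a timestamp comparison could be added in the same place.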

url-crawler's People

Contributors

niccokunzmann


url-crawler's Issues

Introduce multithreading - speed up crawling

fetch calls fetch recursively.
This is a place where we can speed up crawling by introducing multithreading.

To implement multithreading:

  • use a concurrent.futures.ThreadPoolExecutor
  • make the ResourceClient class thread-safe when accessing the shared state _ids
  • make sure the API supports multithreading

Hints:

  • Maybe rename fetch to _fetch and make fetch create the ThreadPoolExecutor
    • pass the number of threads
    • wait for the completion of the call
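
A minimal sketch of this refactoring, assuming a hypothetical split into fetch/_fetch and a lock around the shared state _ids; the names mirror the hints above, but the code is illustrative only:

from concurrent.futures import ThreadPoolExecutor
import threading

class ResourceClient:
    """Illustrative stand-in: only the parts relevant to thread safety are shown."""

    def __init__(self):
        self._ids = set()
        self._lock = threading.Lock()  # guards the shared state _ids

    def add_id(self, resource_id):
        with self._lock:
            self._ids.add(resource_id)

def _fetch(url, client):
    # The existing recursive fetch logic would live here.
    ...

def fetch(urls, client, threads=4):
    # fetch creates the ThreadPoolExecutor, passes the number of threads,
    # and waits for the completion of all calls.
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(_fetch, url, client) for url in urls]
        for future in futures:
            future.result()  # re-raises exceptions from worker threads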

Add crawl support for a list of resources

This should be a list as described by the JSONAPI specification:

{
   "data": [
      { "attributes": { ... resource ... }}, 
      { "attributes": { ... resource ... }}, 
      ...
  ]
}
  • add tests
  • add test for malformed data
  • add description to the readme
  • add an example, see #4
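
A minimal sketch of how the crawler could handle such a list, assuming a hypothetical post_resource callable for talking to the resources API (illustrative only):

import json
import urllib.request

def crawl_resource_list(url, post_resource):
    """Fetch a JSONAPI-style list and post every entry of its data array.

    post_resource is a placeholder for however the crawler actually
    posts a single resource to the server.
    """
    with urllib.request.urlopen(url) as response:
        document = json.load(response)
    for entry in document["data"]:
        post_resource(entry["attributes"])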

Describe part of the architecture

Add this documentation to the README.rst file:

This crawler

  • is part of the resources api v1
    • describe
    • add a link
  • tests against the test server
    • describe
    • add link
  • is part of the architecture
    • describe
    • link to blog post

Document parameters

If you use --help, you can see the parameters.
They are described in cli.py.
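
For example (assuming --help is accepted after the module name like the other options):

python3 -m ressource_url_crawler --help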

  • add all parameters to the readme file
  • describe what they do

Security: file:// urls

We should allow file:// URLs for locally generated content.

  • test that file:// URLs can be crawled
  • make sure that a URL list fetched via http(s) cannot reference file:// URLs.
    This way, resources from the internet cannot access local files.
  • file:// URLs from a file:// list can be accessed (recursively)
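
A minimal sketch of such a check, assuming a hypothetical is_allowed helper inside the crawler (the names are illustrative):

from urllib.parse import urlparse

def is_allowed(url, source_url=None):
    """Decide whether a crawled URL may be fetched.

    source_url is the URL of the list the entry came from, or None for
    URLs given directly on the command line. file:// targets are only
    allowed when the list itself came from a file:// URL.
    """
    scheme = urlparse(url).scheme
    if scheme != "file":
        return True
    if source_url is None:
        return True  # given directly, e.g. on the command line
    return urlparse(source_url).scheme == "file"

For example, is_allowed("file:///etc/passwd", "http://example.org/list.txt") returns False, while the same target coming from a file:// list is allowed.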

Create examples

The crawler can be used to crawl examples.

  • create a branch for examples
  • include an install.sh file to install the crawler
  • start the test server in an extra file, silently
  • add examples
    • one url from examples
    • at least two urls from examples
    • show what happens with invalid resources (file an issue if the crawler fails)
    • a file with a list of urls
    • using the file:// schema (see the command sketch after this list)
      • point out that this eases crawling content from the local file system
  • add a link to the example usages to the README
  • add a very simple example to the readme
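
For instance, an example using the file:// schema could look like this (the path is a placeholder):

python3 -m ressource_url_crawler http://localhost:8080 \
        file:///path/to/example-website.json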

To work on the issue, choose one point and create a pull request for it.

Test malformed input

Test what happens with

  • invalid resources
  • invalid JSON
  • invalid URLs
  • empty URLs
  • URLs with whitespace at the end/beginning
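
A minimal sketch of such tests, assuming a hypothetical crawl entry point and CrawlError exception; both names are placeholders for the crawler's real API:

import pytest

# Hypothetical import: the real module layout and names may differ.
from ressource_url_crawler import crawl, CrawlError

MALFORMED_INPUTS = [
    "{not valid json",                  # invalid JSON
    '{"data": {"attributes": null}}',   # invalid resource
    "http://",                          # invalid URL
    "",                                 # empty URL
]

# Whether URLs with surrounding whitespace should be stripped or rejected
# is an open question; they would get their own test once that is decided.

@pytest.mark.parametrize("raw_input", MALFORMED_INPUTS)
def test_malformed_input_is_rejected(raw_input):
    """The crawler should fail with a clear error instead of posting garbage."""
    with pytest.raises(CrawlError):
        crawl(raw_input)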
