
datacrawl's Introduction


Tiny Web Crawler


A simple and efficient web crawler for Python.

Features

  • Recursively crawl web pages and extract links, starting from a root URL
  • Concurrent workers and a configurable crawl delay
  • Handles both relative and absolute URLs
  • Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler import Spider, SpiderSettings

settings = SpiderSettings(
    root_url='http://github.com',
    max_links=2
)

spider = Spider(settings)
spider.start()


# Set workers and delay (defaults: delay is 0.5 sec and verbose is True)
# If you do not want a delay, set delay=0

settings = SpiderSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False
)

spider = Spider(settings)
spider.start()

Output Format

Crawled output sample for https://github.com

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}

Contributing

Thank you for considering contributing.

  • If you are a first-time contributor, you can pick a good-first-issue and get started.
  • Please feel free to ask questions.
  • Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
  • We are working towards our first major release. Please check this issue and see if anything interests you.

Dev setup

  • Install poetry on your system: pipx install poetry
  • Clone the repo you forked
  • Create a venv or use poetry shell
  • Run poetry install --with dev
  • Run pre-commit install
  • Run pre-commit install --hook-type pre-push

Before raising a PR, please make sure these checks are covered:

  • An issue exists or has been created which addresses the PR
  • Tests are written for the changes
  • All lint and test checks pass


datacrawl's Issues

Housekeeping: Refactor the code base to a more Modular and Extensible Architecture

  1. Separation of Concerns:

    • Separate the core crawling logic from the utilities and configurations.
    • Define interfaces or abstract classes for components that may need to be extended or swapped out.
  2. Use of Dependency Injection:

    • Allow different components to be injected into the main crawler class, making it easy to extend or replace specific parts of the crawler (see the sketch after this list).
  3. Organize Code into Modules:

    • Organize the code into different files or modules based on their functionality.
  4. Configure a git hook to ensure all tests pass before push #20

  5. Update how we import the package and update the readme usage doc

    • Make imports like from tiny_web_crawler import Spider possible (moved to #26)
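
A rough sketch of the dependency-injection idea in point 2. The Fetcher interface, RequestsFetcher, and the constructor signature shown here are illustrative assumptions, not the project's actual classes, and it assumes requests is available:

from abc import ABC, abstractmethod
from typing import Optional

import requests

class Fetcher(ABC):
    """Interface for components that download a page's HTML."""

    @abstractmethod
    def fetch(self, url: str) -> str:
        ...

class RequestsFetcher(Fetcher):
    """Default implementation backed by the requests library."""

    def fetch(self, url: str) -> str:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text

class Spider:
    def __init__(self, root_url: str, fetcher: Optional[Fetcher] = None) -> None:
        # The fetcher can be swapped out (e.g. for a headless-browser fetcher)
        # without touching the crawling logic.
        self.root_url = root_url
        self.fetcher = fetcher if fetcher is not None else RequestsFetcher()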

PRs: #34

Implement logging

  • Currently we use print for stdout output.
  • Remove print across the code base
  • Use the logging module instead
  • Explore how to provide different coloured logs (see the sketch below)
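
A minimal sketch of what this could look like with the standard logging module; the logger name, formatter, and colour scheme below are illustrative, not a settled design:

import logging

RESET = "\033[0m"
COLOURS = {
    logging.DEBUG: "\033[36m",    # cyan
    logging.INFO: "\033[32m",     # green
    logging.WARNING: "\033[33m",  # yellow
    logging.ERROR: "\033[31m",    # red
}

class ColourFormatter(logging.Formatter):
    """Wrap each record in an ANSI colour code based on its level."""

    def format(self, record: logging.LogRecord) -> str:
        colour = COLOURS.get(record.levelno, RESET)
        return f"{colour}{super().format(record)}{RESET}"

def get_logger(verbose: bool = True) -> logging.Logger:
    logger = logging.getLogger("tiny_web_crawler")
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(ColourFormatter("%(levelname)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    return logger

logger = get_logger()
logger.info("Crawling: https://github.com")  # replaces print(Fore.GREEN + ...)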

Can't run tests on local machine

On my machine, I can't run the unit tests.
When I do poetry run pytest, or when the tests are run through the push hook, it raises:

_______________________________________ ERROR collecting tests/networking/test_validator.py ________________________________________
ImportError while importing test module 'C:\Users\Public\codg\forks\tiny-web-crawler\tests\networking\test_validator.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Users\Tiago Charrão\AppData\Local\Programs\Python\Python312\Lib\importlib\__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests\networking\test_validator.py:1: in <module>
    from tiny_web_crawler.networking.validator import is_valid_url
E   ModuleNotFoundError: No module named 'tiny_web_crawler'

For every test file.

Steps to reproduce

  1. Create a virtual environment with py -m venv .
  2. Install the dependencies with poetry install --with dev
  3. Install the pre-commit hooks with pre-commit install and pre-commit install --hook-type pre-push
  4. Run poetry run pytest

Additional info
The tests still run fine in CI.
I was able to run them by replacing
from tiny_web_crawler.etc...
with
from src.tiny_web_crawler.etc...

Fix `crawl_result` type hint

Because of #19, the type hint for Spider.crawl_result broke, and it was temporarily replaced with Dict[str, Dict[str, Any]].
This should be fixed to actually reflect the contents of crawl_result, which has the following format:

crawl_result = {
    "url1":{
        "urls":["some url", "some other url", ...],
        "body": "the html of the page"
    },
    "url2":{
        "urls":["some url", "some other url", ...],
        "body": "the html of the page"
    },
}

Where body is only present if the include_body argument is set to True, so it might not always be present.
See #19 for previous discussions about this.
You can verify the type hint is working if the mypy checks pass.
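
One possible shape for the type hint, as a sketch only; the CrawlPage/CrawlResult names are made up for illustration, and it assumes Python 3.8+ for typing.TypedDict:

from typing import Dict, List, TypedDict

class _CrawlPageRequired(TypedDict):
    urls: List[str]          # always present

class CrawlPage(_CrawlPageRequired, total=False):
    body: str                # only present when include_body=True

# Spider.crawl_result could then be annotated as:
CrawlResult = Dict[str, CrawlPage]

crawl_result: CrawlResult = {
    "url1": {"urls": ["some url", "some other url"]},
    "url2": {"urls": ["some url"], "body": "the html of the page"},
}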

Respect robots.txt when crawling (when set to True)

  • Option to set respect_robots_txt (default should be True because of legal obligations in some jurisdictions)
  • Fetch and parse robots.txt (urllib.robotparser helps with parsing robots.txt; see the sketch after this list)
  • Create crawl rules per domain
  • Check URL permissions before crawling a URL
  • Make sure it works when concurrent workers are fetching different domains
  • Use the rules provided in robots.txt when fetching (e.g. honour the crawl-delay if present, and check the rules before crawling a path)
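
A rough sketch of the per-domain parsing idea using the standard library; the cache and function names are illustrative, and a shared cache would need a lock (or per-worker caches) once concurrent workers are involved:

from typing import Dict, Optional
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robots_cache: Dict[str, RobotFileParser] = {}  # one parser per domain

def _get_parser(url: str) -> RobotFileParser:
    parts = urlparse(url)
    domain = f"{parts.scheme}://{parts.netloc}"
    if domain not in _robots_cache:
        parser = RobotFileParser()
        parser.set_url(f"{domain}/robots.txt")
        parser.read()  # fetch and parse robots.txt for this domain
        _robots_cache[domain] = parser
    return _robots_cache[domain]

def can_crawl(url: str, user_agent: str = "*") -> bool:
    """Check URL permissions before crawling."""
    return _get_parser(url).can_fetch(user_agent, url)

def crawl_delay(url: str, user_agent: str = "*") -> Optional[float]:
    """Return the crawl-delay for this domain, if robots.txt specifies one."""
    delay = _get_parser(url).crawl_delay(user_agent)
    return float(delay) if delay is not None else None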

Add mypy to pre-commit hooks

I noticed this when working on #19.
Type checking with mypy is done in CI, but it's not included in the pre-commit hooks.
It should probably be added.

Feature: Support for crawling dynamic, JavaScript-heavy sites

Description:

Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.

Objectives:

  • Enable the crawler to fetch and parse content from JavaScript-heavy sites.
  • Use a headless browser to render JavaScript content. (explore playwright-python)
  • Ensure compatibility with the existing crawler structure and options.
  • Maintain the ability to switch between the default fetching method and the headless browser.

Design Considerations:

  • Single Headless Browser Instance:
    • Use a single instance of a headless browser to handle multiple asynchronous requests, reducing resource consumption.
  • Concurrency Management:
    • Utilize asyncio and a semaphore to manage concurrent requests within the same browser context (see the sketch after this list).
    • Integrate the asynchronous fetching logic with the existing web crawler structure.
  • Error Handling:
    • Ensure proper error handling and resource cleanup. (no zombie browsers, they are already headless :p)
    • Fall back to the default fetching mode when there is an error with the headless browser (keep the user informed).
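
A rough concurrency sketch under these assumptions: playwright-python is installed (pip install playwright, then playwright install chromium), and the function name and concurrency limit are illustrative rather than the final design:

import asyncio
from typing import Dict, List

from playwright.async_api import async_playwright

async def fetch_rendered(urls: List[str], max_concurrency: int = 5) -> Dict[str, str]:
    """Render each URL in a shared headless browser, a few pages at a time."""
    results: Dict[str, str] = {}
    semaphore = asyncio.Semaphore(max_concurrency)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # single shared instance

        async def fetch(url: str) -> None:
            async with semaphore:  # cap concurrent pages
                page = await browser.new_page()
                try:
                    await page.goto(url, wait_until="networkidle")
                    results[url] = await page.content()  # rendered HTML
                finally:
                    await page.close()  # always release the page

        await asyncio.gather(*(fetch(url) for url in urls))
        await browser.close()  # no zombie browsers

    return results

# asyncio.run(fetch_rendered(["https://example.com"]))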

First Major Release v1.0.0

This is a placeholder issue for the first major release v1.0.0.

Please feel free to create issues from this list.

Scope and Features: First major version v1.0.0

Functional Requirements

  • Basic Crawling Functionality #1
  • Configurable options for maximum links to crawl #1
  • Handle both relative and absolute URLs #1
  • Save crawl results to a specified file #1
  • Configurable verbosity levels for logging #7
  • Concurrency and custom delay #7
  • Support regular expressions #16
  • Crawl internal / external links only #11
  • Return optional html in response #19
  • Crawl depth per website/domain #37
  • Logging #38
  • Retry mechanism for transient errors #39
  • Support JavaScript-heavy dynamic websites #10
  • (Optional) Respect Robots.txt #42
  • (Optional) User-Agent Customization
  • (Optional) Proxy support
  • (Optional) Use Asynchronous I/O
  • (Optional) Crawl output to database (Mongo maybe)

Non-Functional Requirements

  • Git workflow for CI/CD #4
  • Documentation (API and Developer) #18
  • Test coverage above 80% #28
  • Git hooks #22
  • Modular and Extensible Architecture #17
  • (Optional) Memory benchmark: monitor memory usage during the crawling process
  • (Optional) Security considerations (e.g., handling of malicious content)

Make `Spider` importable from main module

Feels like you should be able to do from tiny_web_crawler import Spider, not just from tiny_web_crawler.crawler import Spider

Would be as simple as adding from tiny_web_crawler.crawler import Spider to __init__.py
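
A sketch of what src/tiny_web_crawler/__init__.py could look like; the __all__ entry is just a common convention, not necessarily how the project will do it:

# src/tiny_web_crawler/__init__.py (sketch)
from tiny_web_crawler.crawler import Spider

__all__ = ["Spider"]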

Edit:

  • Update how we import the package and update the readme usage doc (moved from #9 )

What is the `main` function for?

At the bottom of crawler.py, there is this piece of code:

def main() -> None:
    root_url = 'https://pypi.org/'
    max_links = 5

    crawler = Spider(root_url, max_links, save_to_file='out.json')
    print(Fore.GREEN + f"Crawling: {root_url}")
    crawler.start()


if __name__ == '__main__':
    main()

I'm just curious what the purpose of this is. It looks like a small piece of code to test the module, but if that is the case it should probably live in a separate file like examples.py, not in crawler.py (if it is meant to be in the source at all).

Feature: Add option to return the crawled website body in the response

Currently we do not return the HTML body from the crawled sites; we only return the links we find.

  • Should have a flag to toggle this option
  • Default set to False
  • Modify the JSON response to have keys ['urls', 'body']

E.g. the current output:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}

This is a feature to return the HTML body as well, and the result should look like this:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>stuff</html>"
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>other stuff</html>"
    }
}
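
A hypothetical usage sketch; the flag name include_body is borrowed from the crawl_result type-hint issue above and may end up different in the final API:

from tiny_web_crawler import Spider, SpiderSettings

settings = SpiderSettings(
    root_url='https://github.com',
    max_links=2,
    include_body=True,  # also return the HTML body of each crawled page
)

spider = Spider(settings)
spider.start()

# Each entry in spider.crawl_result would then contain both 'urls' and 'body'.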

Test coverage above 80%

Currently the test coverage is around 79% (this includes the soon-to-be-removed main function in crawler.py).

Use one or more `Options` classes

Instead of having every option for the Spider class be a separate argument, there should be one or more Options classes that store these options and then get passed to the Spider class.

Once this is addressed, the max-attributes option in .pylintrc should be set back to 15
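
A minimal sketch of such an options class, modelled on the SpiderSettings usage shown in the README above; apart from delay and verbose, the defaults here are illustrative:

from dataclasses import dataclass

@dataclass
class SpiderSettings:
    root_url: str
    max_links: int = 5        # illustrative default
    max_workers: int = 1      # illustrative default
    delay: float = 0.5        # default per the README
    verbose: bool = True      # default per the README

# The Spider would then take a single settings object instead of many arguments:
# spider = Spider(SpiderSettings(root_url='https://github.com', max_links=5))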
