
datacrawl's Introduction


Tiny Web Crawler


A simple and efficient web crawler for Python.

Features

  • Recursively crawl web pages and extract links, starting from a root URL
  • Concurrent workers and a configurable crawl delay
  • Handles both relative and absolute URLs
  • Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler import Spider, SpiderSettings

settings = SpiderSettings(
    root_url='http://github.com',
    max_links=2
)

spider = Spider(settings)
spider.start()


# Set workers and delay (defaults: delay is 0.5 sec and verbose is True)
# If you do not want a delay, set delay=0

settings = SpiderSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False
)

spider = Spider(settings)
spider.start()

Output Format

Crawled output sample for https://github.com

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}

Contributing

Thank you for considering contributing.

  • If you are a first-time contributor, you can pick a good-first-issue and get started.
  • Please feel free to ask questions.
  • Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
  • We are working towards our first major release. Please check this issue and see if anything interests you.

Dev setup

  • Install poetry on your system: pipx install poetry
  • Clone the repo you forked
  • Create a venv or use poetry shell
  • Run poetry install --with dev
  • Run pre-commit install
  • Run pre-commit install --hook-type pre-push

Before raising a PR, please make sure these checks are covered:

  • An issue exists or has been created which addresses the PR
  • Tests are written for the changes
  • All lint and test checks pass


datacrawl's Issues

Housekeeping: Refactor the code base to a more Modular and Extensible Architecture

  1. Separation of Concerns:

    • Separate the core crawling logic from the utilities and configurations.
    • Define interfaces or abstract classes for components that may need to be extended or swapped out.
  2. Use of Dependency Injection:

    • Allow different components to be injected into the main crawler class, making it easy to extend or replace specific parts of the crawler (see the sketch after this list).
  3. Organize Code into Modules:

    • Organize the code into different files or modules based on their functionality.
  4. Configure a git hook to ensure all tests pass before push #20

  5. Update how we import the package and update the readme usage doc

    • Make imports like from tiny_web_crawler import Spider possible (moved to #26)
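
A rough sketch of the dependency-injection idea in point 2. The Fetcher interface, RequestsFetcher, and the constructor signature shown here are illustrative assumptions, not the project's actual classes, and it assumes requests is available:

from abc import ABC, abstractmethod
from typing import Optional

import requests

class Fetcher(ABC):
    """Interface for components that download a page's HTML."""

    @abstractmethod
    def fetch(self, url: str) -> str:
        ...

class RequestsFetcher(Fetcher):
    """Default implementation backed by the requests library."""

    def fetch(self, url: str) -> str:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text

class Spider:
    def __init__(self, root_url: str, fetcher: Optional[Fetcher] = None) -> None:
        # The fetcher can be swapped out (e.g. for a headless-browser fetcher)
        # without touching the crawling logic.
        self.root_url = root_url
        self.fetcher = fetcher if fetcher is not None else RequestsFetcher()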

PRs: #34

Implement logging

  • Currently we use print for stdout output.
  • Remove print across the code base
  • Use the logging module instead
  • Explore how to provide different coloured logs (see the sketch below)
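
A minimal sketch of what this could look like with the standard logging module; the logger name, formatter, and colour scheme below are illustrative, not a settled design:

import logging

RESET = "\033[0m"
COLOURS = {
    logging.DEBUG: "\033[36m",    # cyan
    logging.INFO: "\033[32m",     # green
    logging.WARNING: "\033[33m",  # yellow
    logging.ERROR: "\033[31m",    # red
}

class ColourFormatter(logging.Formatter):
    """Wrap each record in an ANSI colour code based on its level."""

    def format(self, record: logging.LogRecord) -> str:
        colour = COLOURS.get(record.levelno, RESET)
        return f"{colour}{super().format(record)}{RESET}"

def get_logger(verbose: bool = True) -> logging.Logger:
    logger = logging.getLogger("tiny_web_crawler")
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(ColourFormatter("%(levelname)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    return logger

logger = get_logger()
logger.info("Crawling: https://github.com")  # replaces print(Fore.GREEN + ...)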

Can't run tests on local machine

On my machine, I can't run the unit tests.
When I do poetry run pytest, or when the tests are run through the push hook, it raises:

_______________________________________ ERROR collecting tests/networking/test_validator.py ________________________________________
ImportError while importing test module 'C:\Users\Public\codg\forks\tiny-web-crawler\tests\networking\test_validator.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Users\Tiago Charrão\AppData\Local\Programs\Python\Python312\Lib\importlib\__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests\networking\test_validator.py:1: in <module>
    from tiny_web_crawler.networking.validator import is_valid_url
E   ModuleNotFoundError: No module named 'tiny_web_crawler'

For every test file.

Steps to reproduce

  1. Create a virtual environment with py -m venv .
  2. Install the dependencies with poetry install --with dev
  3. Install the pre-commit hooks with pre-commit install and pre-commit install --hook-type pre-push
  4. Run poetry run pytest

Additional info
The tests still run fine in CI.
I was able to run them by replacing
from tiny_web_crawler.etc...
with
from src.tiny_web_crawler.etc...

Fix `crawl_result` type hint

Because of #19, the type hint for Spider.crawl_result broke, and it was temporarily replaced with Dict[str, Dict[str, Any]].
This should be fixed to actually reflect the contents of crawl_result, which has the following format:

crawl_result = {
    "url1":{
        "urls":["some url", "some other url", ...],
        "body": "the html of the page"
    },
    "url2":{
        "urls":["some url", "some other url", ...],
        "body": "the html of the page"
    },
}

Where body is only present if the include_body argument is set to True, so it might not always be present.
See #19 for previous discussions about this.
You can verify the type hint is working if the mypy checks pass.
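
One possible shape for the type hint, as a sketch only; the CrawlPage/CrawlResult names are made up for illustration, and it assumes Python 3.8+ for typing.TypedDict:

from typing import Dict, List, TypedDict

class _CrawlPageRequired(TypedDict):
    urls: List[str]          # always present

class CrawlPage(_CrawlPageRequired, total=False):
    body: str                # only present when include_body=True

# Spider.crawl_result could then be annotated as:
CrawlResult = Dict[str, CrawlPage]

crawl_result: CrawlResult = {
    "url1": {"urls": ["some url", "some other url"]},
    "url2": {"urls": ["some url"], "body": "the html of the page"},
}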

Respect robots.txt when crawling (when set to True)

  • Option to set respect_robots_txt (default should be True because of legal obligations in some jurisdictions)
  • Fetch and parse robots.txt (urllib.robotparser helps with parsing robots.txt; see the sketch after this list)
  • Create crawl rules per domain
  • Check URL permissions before crawling a URL
  • Make sure it works when concurrent workers are fetching different domains
  • Use the rules provided in robots.txt when fetching (e.g. honour the crawl-delay if present, and check the rules before crawling a path)
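
A rough sketch of the per-domain parsing idea using the standard library; the cache and function names are illustrative, and a shared cache would need a lock (or per-worker caches) once concurrent workers are involved:

from typing import Dict, Optional
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robots_cache: Dict[str, RobotFileParser] = {}  # one parser per domain

def _get_parser(url: str) -> RobotFileParser:
    parts = urlparse(url)
    domain = f"{parts.scheme}://{parts.netloc}"
    if domain not in _robots_cache:
        parser = RobotFileParser()
        parser.set_url(f"{domain}/robots.txt")
        parser.read()  # fetch and parse robots.txt for this domain
        _robots_cache[domain] = parser
    return _robots_cache[domain]

def can_crawl(url: str, user_agent: str = "*") -> bool:
    """Check URL permissions before crawling."""
    return _get_parser(url).can_fetch(user_agent, url)

def crawl_delay(url: str, user_agent: str = "*") -> Optional[float]:
    """Return the crawl-delay for this domain, if robots.txt specifies one."""
    delay = _get_parser(url).crawl_delay(user_agent)
    return float(delay) if delay is not None else None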

Add mypy to pre-commit hooks

I noticed this when working on #19.
Type checking with mypy is done in CI, but it's not included in the pre-commit hooks.
It should probably be added.

Feature: Support for crawling dynamic, JavaScript-heavy sites

Description:

Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.

Objectives:

  • Enable the crawler to fetch and parse content from JavaScript-heavy sites.
  • Use a headless browser to render JavaScript content. (explore playwright-python)
  • Ensure compatibility with the existing crawler structure and options.
  • Maintain the ability to switch between the default fetching method and the headless browser.

Design Considerations:

  • Single Headless Browser Instance:
    • Use a single instance of a headless browser to handle multiple asynchronous requests, reducing resource consumption.
  • Concurrency Management:
    • Utilize asyncio and a semaphore to manage concurrent requests within the same browser context (see the sketch after this list).
    • Integrate the asynchronous fetching logic with the existing web crawler structure.
  • Error Handling:
    • Ensure proper error handling and resource cleanup. (no zombie browsers, they are already headless :p)
    • Fall back to the default fetching mode when there is an error with the headless browser (keep the user informed).
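
A rough concurrency sketch under these assumptions: playwright-python is installed (pip install playwright, then playwright install chromium), and the function name and concurrency limit are illustrative rather than the final design:

import asyncio
from typing import Dict, List

from playwright.async_api import async_playwright

async def fetch_rendered(urls: List[str], max_concurrency: int = 5) -> Dict[str, str]:
    """Render each URL in a shared headless browser, a few pages at a time."""
    results: Dict[str, str] = {}
    semaphore = asyncio.Semaphore(max_concurrency)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # single shared instance

        async def fetch(url: str) -> None:
            async with semaphore:  # cap concurrent pages
                page = await browser.new_page()
                try:
                    await page.goto(url, wait_until="networkidle")
                    results[url] = await page.content()  # rendered HTML
                finally:
                    await page.close()  # always release the page

        await asyncio.gather(*(fetch(url) for url in urls))
        await browser.close()  # no zombie browsers

    return results

# asyncio.run(fetch_rendered(["https://example.com"]))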

First Major Release v1.0.0

This is a placeholder issue for the first major release v1.0.0.

Please feel free to create issues from this list.

Scope and Features: First major version v1.0.0

Functional Requirements

  • Basic Crawling Functionality #1
  • Configurable options for maximum links to crawl #1
  • Handle both relative and absolute URLs #1
  • Save crawl results to a specified file #1
  • Configurable verbosity levels for logging #7
  • Concurrency and custom delay #7
  • Support regular expressions #16
  • Crawl internal / external links only #11
  • Return optional html in response #19
  • Crawl depth per website/domain #37
  • Logging #38
  • Retry mechanism for transient errors #39
  • Support JavaScript-heavy dynamic websites #10
  • (Optional) Respect Robots.txt #42
  • (Optional) User-Agent Customization
  • (Optional) Proxy support
  • (Optional) Use Asynchronous I/O
  • (Optional) Crawl output to database (Mongo maybe)

Non-Functional Requirements

  • Git workflow for CI/CD #4
  • Documentation (API and Developer) #18
  • Test coverage above 80% #28
  • Git hooks #22
  • Modular and Extensible Architecture #17
  • (Optional) Memory benchmark: monitor memory usage during the crawling process
  • (Optional) Security considerations (e.g., handling of malicious content)

Make `Spider` importable from main module

Feels like you should be able to do from tiny_web_crawler import Spider, not just from tiny_web_crawler.crawler import Spider

Would be as simple as adding from tiny_web_crawler.crawler import Spider to __init__.py
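
A sketch of what src/tiny_web_crawler/__init__.py could look like; the __all__ entry is just a common convention, not necessarily how the project will do it:

# src/tiny_web_crawler/__init__.py (sketch)
from tiny_web_crawler.crawler import Spider

__all__ = ["Spider"]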

Edit:

  • Update how we import the package and update the readme usage doc (moved from #9 )

What is the `main` function for?

At the bottom of crawler.py, there is this piece of code:

def main() -> None:
    root_url = 'https://pypi.org/'
    max_links = 5

    crawler = Spider(root_url, max_links, save_to_file='out.json')
    print(Fore.GREEN + f"Crawling: {root_url}")
    crawler.start()


if __name__ == '__main__':
    main()

I'm just curious what the purpose of this is. It looks like a small piece of code to test the module, but if that is the case it should probably live in a separate file like examples.py, not in crawler.py (if it is meant to be in the source at all).

Feature: Add option to return the crawled website body in the response

Currently we do not return the HTML body from the crawled sites; we only return the links we find.

  • Should have a flag to toggle this option
  • Default set to False
  • Modify the JSON response to have keys ['urls', 'body']

E.g. the current output:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}

This is a feature to return the HTML body as well, and the result should look like this:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>stuff</html>"
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>other stuff</html>"
    }
}
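
A hypothetical usage sketch; the flag name include_body is borrowed from the crawl_result type-hint issue above and may end up different in the final API:

from tiny_web_crawler import Spider, SpiderSettings

settings = SpiderSettings(
    root_url='https://github.com',
    max_links=2,
    include_body=True,  # also return the HTML body of each crawled page
)

spider = Spider(settings)
spider.start()

# Each entry in spider.crawl_result would then contain both 'urls' and 'body'.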

Test coverage above 80%

Currently the test coverage is around 79% (this includes the soon-to-be-removed main function in crawler.py).

Use one or more `Options` classes

Instead of having every option for the Spider class be a separate argument, there should be one or more Options classes that store these options and then get passed to the Spider class.

Once this is addressed, the max-attributes option in .pylintrc should be set back to 15
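
A minimal sketch of such an options class, modelled on the SpiderSettings usage shown in the README above; apart from delay and verbose, the defaults here are illustrative:

from dataclasses import dataclass

@dataclass
class SpiderSettings:
    root_url: str
    max_links: int = 5        # illustrative default
    max_workers: int = 1      # illustrative default
    delay: float = 0.5        # default per the README
    verbose: bool = True      # default per the README

# The Spider would then take a single settings object instead of many arguments:
# spider = Spider(SpiderSettings(root_url='https://github.com', max_links=5))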
