
datacrawl's Issues

Use one or more `Options` classes

Instead of having every option for the Spider class be a separate argument, there should be one or more Options classes that store these options and then get passed to the Spider class.

Once this is addressed, the max-attributes option in .pylintrc should be set back to 15
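
One possible shape for this, sketched below; the class name and fields are illustrative, not the project's final API:

from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlOptions:
    """Groups the Spider's configuration instead of passing many separate arguments."""
    max_links: int = 5
    include_body: bool = False
    save_to_file: Optional[str] = None
    verbose: bool = True


# The Spider would then accept a single options object, e.g.:
# spider = Spider(root_url="https://example.com", options=CrawlOptions(max_links=10))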

Feature: Support for crawling dynamic, JavaScript-heavy sites

Description:

Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.

Objectives:

  • Enable the crawler to fetch and parse content from JavaScript-heavy sites.
  • Use a headless browser to render JavaScript content. (explore playwright-python)
  • Ensure compatibility with the existing crawler structure and options.
  • Maintain the ability to switch between the default fetching method and the headless browser.

Design Considerations:

  • Single Headless Browser Instance:
    • Use a single instance of a headless browser to handle multiple asynchronous requests, reducing resource consumption (see the sketch after this list).
  • Concurrency Management:
    • Utilize asyncio and a semaphore to manage concurrent requests within the same browser context.
    • Integrate the asynchronous fetching logic with our existing web crawler structure.
  • Error Handling:
    • Ensure proper error handling and resource cleanup. (no zombie browsers, they are already headless :p)
    • Fall back to the default fetching mode when there is an error with the headless browser (keep the user informed).
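
A rough sketch of the single-browser-plus-semaphore idea, assuming playwright-python; the function name and defaults below are illustrative, not a final API:

import asyncio

from playwright.async_api import async_playwright


async def fetch_rendered_pages(urls, max_concurrency=5):
    """Fetch fully rendered HTML for each URL through one shared headless browser."""
    semaphore = asyncio.Semaphore(max_concurrency)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def fetch(url):
            async with semaphore:  # limit concurrent pages in the shared browser
                page = await browser.new_page()
                try:
                    await page.goto(url)
                    return await page.content()
                finally:
                    await page.close()  # always release the page, even on errors

        try:
            return await asyncio.gather(*(fetch(url) for url in urls))
        finally:
            await browser.close()  # no zombie browsers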

Test coverage above 80%

Currently the test coverage is around 79% (this includes the soon-to-be-removed main function in crawler.py).

Fix `crawl_result` type hint

Because of #19, the type hint for Spider.crawl_result broke, and it was temporarily replaced with Dict[str, Dict[str, Any]].
This should be fixed to actually reflect the contents of crawl_result, which has the following format:

crawl_result = {
    "url1": {
        "urls": ["some url", "some other url", ...],
        "body": "the html of the page"
    },
    "url2": {
        "urls": ["some url", "some other url", ...],
        "body": "the html of the page"
    },
}

Here, body is only present when the include_body argument is set to True.
See #19 for previous discussions about this.
You can verify the type hint is working if the mypy checks pass.
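
One possible way to express this is a TypedDict with an optional body key. This is a sketch only (it assumes Python 3.11+ for NotRequired; on older versions import it from typing_extensions), and CrawlPageResult is an illustrative name:

from typing import Dict, List, NotRequired, TypedDict


class CrawlPageResult(TypedDict):
    urls: List[str]
    body: NotRequired[str]  # only present when include_body=True


# Spider.crawl_result could then be annotated as:
# crawl_result: Dict[str, CrawlPageResult]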

Can't run tests on local machine

On my machine, I can't run the unit tests.
When I run poetry run pytest, or when the tests are run through the push hook, it raises

_______________________________________ ERROR collecting tests/networking/test_validator.py ________________________________________
ImportError while importing test module 'C:\Users\Public\codg\forks\tiny-web-crawler\tests\networking\test_validator.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Users\Tiago Charrão\AppData\Local\Programs\Python\Python312\Lib\importlib\__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests\networking\test_validator.py:1: in <module>
    from tiny_web_crawler.networking.validator import is_valid_url
E   ModuleNotFoundError: No module named 'tiny_web_crawler'

This happens for every test file.

Steps to reproduce

  1. Create a virtual environment with py -m venv .
  2. Install the dependencies with poetry install --with dev
  3. Install the pre-commit hooks with pre-commit install and pre-commit install --hook-type pre-push
  4. Run poetry run pytest

Additional info
The tests still run fine in CI.
I was able to run them by replacing
from tiny_web_crawler.etc...
with
from src.tiny_web_crawler.etc...

Make `Spider` importable from main module

Feels like you should be able to do from tiny_web_crawler import Spider, not just from tiny_web_crawler.crawler import Spider

Would be as simple as adding from tiny_web_crawler.crawler import Spider to __init__.py
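
A minimal sketch of the change, assuming the package's __init__.py lives at src/tiny_web_crawler/__init__.py (the exact path is an assumption based on the src layout mentioned in the test-import issue):

# src/tiny_web_crawler/__init__.py
from tiny_web_crawler.crawler import Spider

__all__ = ["Spider"]

After that, from tiny_web_crawler import Spider should work as expected.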

Edit:

  • Update how we import the package and update the readme usage doc (moved from #9)

Housekeeping: Refactor the code base to a more Modular and Extensible Architecture

  1. Separation of Concerns:

    • Separate the core crawling logic from the utilities and configurations.
    • Define interfaces or abstract classes for components that may need to be extended or swapped out.
  2. Use of Dependency Injection:

    • Allow different components to be injected into the main crawler class, making it easy to extend or replace specific parts of the crawler (see the sketch after this list).
  3. Organize Code into Modules:

    • Organize the code into different files or modules based on their functionality.
  4. Configure a git hook to ensure all the tests pass before push #20

  5. Update how we import the package and update the readme usage doc

    • Make imports like from tiny_web_crawler import Spider (moved to #26)
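
A rough sketch of the dependency-injection idea from item 2; the Fetcher protocol and DefaultFetcher class are illustrative names, not the project's actual classes:

from typing import Optional, Protocol

import requests


class Fetcher(Protocol):
    """Anything that can turn a URL into raw HTML."""
    def fetch(self, url: str) -> str: ...


class DefaultFetcher:
    def fetch(self, url: str) -> str:
        return requests.get(url, timeout=10).text


class Spider:
    def __init__(self, root_url: str, fetcher: Optional[Fetcher] = None) -> None:
        # Inject an alternative fetcher (e.g. a headless-browser one) or fall back to the default.
        self.root_url = root_url
        self.fetcher = fetcher or DefaultFetcher()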

PRs:
#34

First Major Release v1.0.0

This is a placeholder issue for the first major release, v1.0.0.

Please feel free to create issues from this list.

Scope and Features: First major version v1.0.0

Functional Requirements

  • Basic Crawling Functionality #1
  • Configurable options for maximum links to crawl #1
  • Handle both relative and absolute URLs #1
  • Save crawl results to a specified file #1
  • Configurable verbosity levels for logging #7
  • Concurrency and custom delay #7
  • Support regular expressions #16
  • Crawl internal / external links only #11
  • Return optional HTML in the response #19
  • Crawl depth per website/domain #37
  • Logging #38
  • Retry mechanism for transient errors #39
  • Support JavaScript-heavy dynamic websites #10
  • (Optional) Respect robots.txt #42
  • (Optional) User-Agent Customization
  • (Optional) Proxy support
  • (Optional) Use Asynchronous I/O
  • (Optional) Crawl output to a database (maybe MongoDB)

Non-Functional Requirements

  • Git workflow for CI/CD #4
  • Documentation (API and Developer) #18
  • Test coverage above 80% #28
  • Git hooks #22
  • Modular and Extensible Architecture #17
  • (Optional) Memory benchmark: monitor memory usage during the crawling process
  • (Optional) Security considerations (e.g., handling of malicious content)

Feature: Add option to return the crawled website body in the response

Currently we do not return the HTML body of the crawled sites; we only return the links we find.

  • Should have a flag to toggle this option
  • Default set to False
  • Modify the JSON response to have keys ['urls', 'body']

E.g., the current response looks like this:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}

This feature returns the HTML body as well, so the result should look like this:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>stuff</html>"
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>other stuff</html>"
    }
}

What is the `main` function for?

At the bottom of crawler.py, there is this piece of code:

def main() -> None:
    root_url = 'https://pypi.org/'
    max_links = 5

    crawler = Spider(root_url, max_links, save_to_file='out.json')
    print(Fore.GREEN + f"Crawling: {root_url}")
    crawler.start()


if __name__ == '__main__':
    main()

I'm just curious what the purpose of this is. It looks like a small piece of code to test the module, but if that's the case it should probably live in a separate file like examples.py, not in crawler.py (if it is meant to be in the source at all).

Add mypy to pre-commit hooks

I noticed this when working in #19
Type checking with mypy is done in CI, but it's not included in the pre-commit hooks.
It should probably be added.

Respect robots.txt when crawling if the option is set to True

  • Option to set respect_robots_txt (default should be True because of legal obligations in some jurisdictions)
  • Fetch and parse robots.txt (urllib.robotparser will help with parsing; see the sketch after this list)
  • Create crawl rule per domain
  • Check URL permissions before crawling a URL
  • Make sure it works when concurrent workers are fetching different domains
  • Use the rules provided in robots.txt when fetching (e.g. honour the crawl-delay directive if present, and check the rules before crawling a path)
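
A minimal sketch of per-domain robots.txt handling with urllib.robotparser; the RobotsCache class and the "*" default user agent are illustrative assumptions, not the crawler's actual API:

from typing import Dict, Optional
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Fetches and caches one robots.txt parser per domain."""

    def __init__(self, user_agent: str = "*") -> None:
        self.user_agent = user_agent
        self._parsers: Dict[str, RobotFileParser] = {}

    def _parser_for(self, url: str) -> RobotFileParser:
        domain = urlparse(url).netloc
        if domain not in self._parsers:
            parser = RobotFileParser(f"https://{domain}/robots.txt")
            parser.read()  # fetch and parse the domain's robots.txt
            self._parsers[domain] = parser
        return self._parsers[domain]

    def can_fetch(self, url: str) -> bool:
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def crawl_delay(self, url: str) -> Optional[float]:
        return self._parser_for(url).crawl_delay(self.user_agent)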

Implement logging

  • Currently we use print for stdout.
  • Remove print across the code base
  • Use logging (see the sketch below)
  • Explore how to provide different coloured logs
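
A minimal sketch of what the switch to logging could look like; the logger name and format string are illustrative assumptions:

import logging

logger = logging.getLogger("tiny_web_crawler")


def setup_logging(verbose: bool = False) -> None:
    """Configure the root handler once, e.g. when the Spider starts."""
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    )


# Then, instead of print(Fore.GREEN + f"Crawling: {url}"):
# logger.info("Crawling: %s", url)

For coloured output, the existing colorama dependency could be reused inside a custom logging.Formatter.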
