
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Home Page: https://crawlee.dev/python/

License: Apache License 2.0

Languages: Makefile 0.20%, Python 86.60%, JavaScript 10.34%, CSS 2.67%, Shell 0.20%
Topics: apify, automation, beautifulsoup, crawler, crawling, headless, headless-chrome, pip, playwright, python

crawlee-python's Introduction

Crawlee
A web scraping and browser automation library


Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.

🚀 Crawlee for Python is open to early adopters!

Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.

👉 View full documentation, guides and examples on the Crawlee project website 👈

We also have a TypeScript implementation of Crawlee, which you can explore and use for your projects. For more information, visit our GitHub repository: Crawlee for JS/TS on GitHub.

Installation

We recommend visiting the Introduction tutorial in the Crawlee documentation for more information.

Crawlee is available as the crawlee PyPI package. The core functionality is included in the base package, with additional features available as optional extras to minimize package size and dependencies. To install Crawlee with all features, run the following command:

pip install 'crawlee[all]'

Then, install the Playwright dependencies:

playwright install

Verify that Crawlee is successfully installed:

python -c 'import crawlee; print(crawlee.__version__)'

For detailed installation instructions see the Setting up documentation page.

With Crawlee CLI

The quickest way to get started with Crawlee is by using the Crawlee CLI and selecting one of the prepared templates. First, ensure you have Pipx installed:

pipx --help

Then, run the CLI and choose from the available templates:

pipx run crawlee create my-crawler

If you already have crawlee installed, you can spin it up by running:

crawlee create my-crawler

Examples

Here are some practical examples to help you get started with different types of crawlers in Crawlee. Each example demonstrates how to set up and run a crawler for specific use cases, whether you need to handle simple HTML pages or interact with JavaScript-heavy sites. A crawler run will create a storage/ directory in your current working directory.

BeautifulSoupCrawler

The BeautifulSoupCrawler downloads web pages using an HTTP library and provides HTML-parsed content to the user. By default, it uses HttpxHttpClient for HTTP communication and BeautifulSoup for parsing HTML. It is ideal for projects that require efficient extraction of data from HTML content. This crawler has very good performance since it does not use a browser. However, if you need to execute client-side JavaScript to get your content, this will not be enough and you will need to use PlaywrightCrawler. Also, if you want to use this crawler, make sure you install crawlee with the beautifulsoup extra.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())

PlaywrightCrawler

The PlaywrightCrawler uses a headless browser to download web pages and provides an API for data extraction. It is built on Playwright, an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript for content generation, or tasks requiring interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the BeautifulSoupCrawler. Also, if you want to use this crawler, make sure you install crawlee with the playwright extra.

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

More examples

Explore our Examples page in the Crawlee documentation for a wide range of additional use cases and demonstrations.

Features

Why is Crawlee the preferred choice for web scraping and crawling?

Why use Crawlee instead of just a random HTTP library with an HTML parser?

  • Unified interface for HTTP & headless browser crawling.
  • Automatic parallel crawling based on available system resources.
  • Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
  • Automatic retries on errors or when you’re getting blocked.
  • Integrated proxy rotation and session management.
  • Configurable request routing - direct URLs to the appropriate handlers.
  • Persistent queue for URLs to crawl.
  • Pluggable storage of both tabular data and files.
  • Robust error handling.
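
To make the "configurable request routing" point above concrete, here is a minimal sketch. It assumes, as in the TypeScript version, that the router also exposes a handler(label) decorator next to default_handler, and that enqueue_links accepts selector and label arguments; double-check these against the current API reference.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Send links matching an (illustrative) CSS selector to the labeled handler below.
        await context.enqueue_links(selector='a.category', label='CATEGORY')

    @crawler.router.handler('CATEGORY')
    async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Only requests enqueued with label='CATEGORY' end up here.
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())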

Why use Crawlee rather than Scrapy?

  • Crawlee has out-of-the-box support for headless browser crawling (Playwright).
  • Crawlee has a minimalistic & elegant interface - Set up your scraper with fewer than 10 lines of code.
  • Complete type hint coverage.
  • Based on standard Asyncio.

Running on the Apify platform

Crawlee is open-source and runs anywhere, but since it's developed by Apify, it's easy to set up on the Apify platform and run in the cloud. Visit the Apify SDK website to learn more about deploying Crawlee to the Apify platform.

Support

If you find any bug or issue with Crawlee, please submit an issue on GitHub. For questions, you can ask on Stack Overflow, in GitHub Discussions or you can join our Discord server.

Contributing

Your code contributions are welcome, and you'll be praised for eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

crawlee-python's People

Contributors

asymness, b4nan, barjin, black7375, cadlagtrader, eltociear, fauzaanu, janbuchar, kpcofgs, mantisus, renovate[bot], siddiqkaithodu, souravjain540, tymeek, vdusek


crawlee-python's Issues

Kick off the AutoscaledPool

Tasks

  • Explore the Apify/Crawlee AutoscaledPool in JavaScript (https://crawlee.dev/api/core/class/AutoscaledPool).
  • Figure out how to implement a similar functionality in Python.
  • Prepare a PoC of the AutoscaledPool for Python SDK.
  • Measure the performance of the PoCs

Test Actor for PoCs:

import asyncio
from dataclasses import dataclass, field
from time import time
from typing import Callable
from urllib.parse import urljoin

from apify import Actor
from apify.storages import RequestQueue
from bs4 import BeautifulSoup, Tag
from httpx import AsyncClient


@dataclass(frozen=True)
class ActorInput:
    start_urls: list[dict] = field(default_factory=lambda: [{'url': 'https://apify.com'}])
    max_depth: int = 1
    desired_concurrency: int = 10


class BeautifulSoupCrawler:
    def __init__(self, handle_request: Callable, max_depth: int, desired_concurrency: int) -> None:
        self.handle_request = handle_request
        self.max_depth = max_depth
        self.desired_concurrency = desired_concurrency

    async def run(self, start_urls: list) -> None:
        # TODO: every PoC will implement this differently.
        raise NotImplementedError


async def handle_request(request: dict, request_queue: RequestQueue, max_depth: int) -> None:
    url = request['url']
    depth = request['userData']['depth']
    Actor.log.info(f'Scraping {url} (depth={depth}) ...')

    try:
        async with AsyncClient() as client:
            response = await client.get(url, follow_redirects=True)
        soup = BeautifulSoup(response.content, 'html.parser')

        # If we haven't reached the max depth, look for nested links and enqueue their targets
        if depth < max_depth:
            for link in soup.find_all('a'):
                link_href = link.get('href')
                link_url = urljoin(url, link_href)
                if link_url.startswith(('http://', 'https://')):
                    Actor.log.info(f'Enqueuing {link_url} ...')
                    await request_queue.add_request(
                        {
                            'url': link_url,
                            'userData': {'depth': depth + 1},
                        }
                    )

        result = {
            'url': url,
            'title': soup.title.string if isinstance(soup.title, Tag) else None,
        }
        await Actor.push_data(result)

    except Exception:
        Actor.log.exception(f'Cannot extract data from {url}.')
    finally:
        # Mark the request as handled so it's not processed again
        await request_queue.mark_request_as_handled(request)


async def main() -> None:
    async with Actor:
        actor_input = ActorInput(**(await Actor.get_input() or {}))
        crawler = BeautifulSoupCrawler(handle_request, actor_input.max_depth, actor_input.desired_concurrency)
        start = time()
        await crawler.run(actor_input.start_urls)
        elapsed_time = time() - start
        Actor.log.info(f'Time taken: {elapsed_time}')
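
For comparison, here is one possible shape of a PoC run() method, sketched with a plain asyncio.Semaphore for concurrency limiting instead of a real AutoscaledPool. It reuses the BeautifulSoupCrawler and handle_request from the snippet above; the RequestQueue calls (open, add_request, fetch_next_request, is_finished) follow the Apify SDK and should be double-checked against its current API.

import asyncio

from apify.storages import RequestQueue


async def run_with_semaphore(crawler: BeautifulSoupCrawler, start_urls: list) -> None:
    request_queue = await RequestQueue.open()
    for start_url in start_urls:
        await request_queue.add_request({'url': start_url['url'], 'userData': {'depth': 0}})

    semaphore = asyncio.Semaphore(crawler.desired_concurrency)
    tasks = []

    async def process(request: dict) -> None:
        async with semaphore:
            await crawler.handle_request(request, request_queue, crawler.max_depth)

    while not await request_queue.is_finished():
        request = await request_queue.fetch_next_request()
        if request is None:
            # Nothing to fetch right now; in-flight requests may still enqueue more links.
            await asyncio.sleep(0.5)
            continue
        tasks.append(asyncio.create_task(process(request)))

    await asyncio.gather(*tasks)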

Passing context to crawler handlers

The problem

We want to pick the best approach for passing context data/helpers to various handler functions in crawlee.py. We already have an implementation in place, but if there's a better way, we should switch sooner rather than later.

What OG Crawlee does

new CheerioCrawler({
  requestHandler: async ({ request, pushData, enqueueLinks }) => { // types of helpers are inferred correctly - thanks typescript
    // ...
  }
})
  • types are correct
  • we get suggestions for context helpers
  • implementation is iffy from a type-safety perspective, but salvageable

Python version A (+/- current implementation)

crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:  # explicit type annotation is necessary for type checking and suggestions
  context.push_data(...)
  • if a type checker and annotations are used, types are correct (can't get better than that in Python)
  • we get suggestions for context helpers
  • the implementation is type-safe enough, but very inflexible
    • in contrast to TypeScript, it won't be salvageable anytime soon - not until we have intersection types
  • there is no "object destructuring" in Python, so everything needs to be prefixed with context.

Python version B

This proposal is similar to how pytest fixtures or FastAPI dependencies work.

crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(push_data: PushData, soup: BeautifulSoup) -> None:  # explicit type annotation is necessary for type checking
  push_data(...)
  • no context. prefix
  • the function signature is not checked by a type checker, but we can do it when the handler is registered, which should be fine as well
  • allows for a more flexible implementation with easier code reuse
  • no suggestions of parameter names
  • the "injection" from Crawlee's side can be based on both parameter name and type annotation, so the type annotations are optional for users (but if they don't use it, they miss out on type safety and autocompletions)

Please voice your opinions on the matter 🙂 We also welcome any alternative approaches, of course.

Explore what doc tooling we use in SDK and how it deals with dataclasses docstrings

Let's consider the following example:

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemorySnapshot:
    """A snapshot of memory usage.

    Args:
        total_bytes: Total memory available in the system.
        current_bytes: Memory usage of the current Python process and its children.
        max_memory_bytes: The maximum memory that can be used by `AutoscaledPool`.
        max_used_memory_ratio: The maximum acceptable ratio of `current_bytes` to `max_memory_bytes`.
        created_at: The time at which the measurement was taken.
    """

    total_bytes: int
    current_bytes: int
    max_memory_bytes: int
    max_used_memory_ratio: float
    created_at: datetime = field(default_factory=lambda: datetime.now(tz=timezone.utc))

    @property
    def is_overloaded(self) -> bool:
        """Returns whether the memory is considered as overloaded."""
        return (self.current_bytes / self.max_memory_bytes) > self.max_used_memory_ratio

Is doc tooling (maybe the one we use in SDK) able to handle it properly?

Based on the discussion here: #20 (comment).

Simplify argument type `requests`

Somewhere we use the following:

requests: list[BaseRequestData | Request]

Let's refactor the code to accept only one type.

In the places where we need to use:

arg_name: list[Request | str]

Let's use a different identifier than requests, e.g. sources.

See the following conversation for context - #56 (comment).
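
A hedged sketch of what the refactored signature could look like; the names and the Request.from_url helper are assumptions, not final API.

from __future__ import annotations

from crawlee.models import Request  # assumed import path


async def add_requests(sources: list[Request | str]) -> None:
    """Accept plain URLs or full Request objects and normalize them internally."""
    requests = [
        source if isinstance(source, Request) else Request.from_url(source)  # from_url is assumed
        for source in sources
    ]
    ...  # hand the normalized `requests` over to the request provider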

Configure Renovate bot

Configure Renovate to keep Python dependencies up to date (Poetry lock file, dev dependencies in pyproject.toml, ...), as we use it in JS/TS projects.

I expect that the Renovate bot will update dependencies at regular intervals. If they pass the tests, it will commit the changes directly to master. Otherwise, it will open a pull request.

Once this is done, please open the same issue for SDK, Client, and Shared Python repositories with a link to this one.

Blocked by #6.

Implement auto-purging of storages

We need the same behavior as in the JS version:

  • crawlee implements the base storage classes
  • every async operation checks if it was the first call, and purges automatically unless opted-out via CRAWLEE_PURGE_ON_START env var (with a falsy value like 0 or false)
  • we have this method that is called in many places in the storage methods like open or getInput: https://crawlee.dev/api/core/function/purgeDefaultStorages
  • since the SDK uses those storage classes, it has the same behavior out of the box
  • internally this works by calling the purge method on the storage client, so this also means both the memory storage and the Apify client need to implement this purge method

Related: apify/apify-cli#545
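
A rough sketch of the described first-call check; the names below (_ensure_purged, storage_client.purge) are hypothetical, only the CRAWLEE_PURGE_ON_START semantics come from the list above.

import os

_purge_done = False


async def _ensure_purged(storage_client) -> None:
    """Purge the default storages on the first storage operation, unless opted out."""
    global _purge_done
    if _purge_done:
        return
    _purge_done = True
    purge_on_start = os.environ.get('CRAWLEE_PURGE_ON_START', 'true').lower()
    if purge_on_start not in ('0', 'false'):
        await storage_client.purge()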

Refactor initialization of storages

Description

  • Currently, if you want to initialize a Dataset/KVS/RQ, you should use the open() constructor. It goes like the following:
    • dataset.open()
    • base_storage.open()
    • dataset.__init__()
    • base_storage.__init__()
  • In the base_storage.open() a specific client is selected (local - MemoryStorageClient or cloud - ApifyClient) using StorageClientManager.
  • Refactor initialization of memory storage resource clients as well.

Desired state

  • Make it more readable, less error-prone (e.g. user uses a wrong constructor), and extensible by supporting other clients.

Improve error handling in `Dataset._get_data_internal()`

Simulate this error in Python and handle it accordingly.

    try {
        return await this.client.listItems(options);
    } catch (e) {
        const error = e as Error;
        if (error.message.includes('Cannot create a string longer than')) {
            throw new Error('dataset.getData(): The response is too large for parsing. You can fix this by lowering the "limit" option.');
        }
        throw e;
    }

https://github.com/apify/apify-sdk-python/blob/v1.3.0/src/apify/storages/dataset.py#L240:L249
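
A hedged Python counterpart of the snippet above; the client attribute and method names (self._client.list_items) are placeholders for whatever the dataset resource client exposes, and the Node-specific message check is carried over only to show the structure.

async def _get_data_internal(self, **options):
    try:
        return await self._client.list_items(**options)
    except Exception as exc:
        if 'Cannot create a string longer than' in str(exc):
            raise ValueError(
                'dataset.get_data(): The response is too large for parsing. '
                'You can fix this by lowering the "limit" option.'
            ) from exc
        raise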

Remove `json_` and `order_no` from `Request`

The purpose of the fields is somewhat unclear, but it's certain that they don't belong to the Request class.

We should definitely explore the notion of an internal request in Crawlee and how it translates to the Python version.

Use `uv` as the packaging tool in CI builds

Recently, the creators of Ruff (Astral) released a new package installer and resolver called uv, written in Rust. Perhaps we could integrate it into our CI pipelines, as installing everything for all supported Python versions, as well as on Linux and Windows, can take some time.

This week, a similar approach was implemented in Apache Airflow: apache/airflow#37692.

BasicCrawler status logging

  • configurable interval
  • configurable status message callback (constructor parameter, property or decorator?)
  • we periodically set the crawler status via storage client
  • in JavaScript Crawlee, this does nothing when MemoryStorage is being used
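
A purely hypothetical sketch of what the configuration surface could look like; the parameter and decorator names below are made up to illustrate the options listed above.

from datetime import timedelta

from crawlee.basic_crawler import BasicCrawler  # import path as in the current layout; may change

crawler = BasicCrawler(
    # Hypothetical parameter: how often the status message is recomputed and stored.
    status_message_interval=timedelta(seconds=10),
)


# Hypothetical decorator variant of the status message callback.
@crawler.status_message_callback
async def status_message(state) -> str:
    return f'Crawled {state.requests_finished} pages so far.'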

BasicCrawler statistics

  • Statistics shall be collected during the crawler run
  • BasicCrawler.run should return a (non-empty) statistics object
  • statistics should be logged periodically

Improve unit testing of Snapshotter

We're touching a lot of private stuff there; let's do it in a better way.

We discussed it in discussion_r1521267138.

Mainly, this suggestion is a good idea:

"Or we could make a testing implementation of EventManager where emitting events could be done from the outside (I mean from the test)."

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.


Detected dependencies

github-actions
.github/workflows/_check_changelog_entry.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_check_docs_build.yaml
  • actions/checkout v4
  • actions/setup-node v4
  • actions/setup-node v4
.github/workflows/_check_version_conflict.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_linting.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_publish_to_pypi.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_type_checking.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_unit_tests.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/docs.yml
  • actions/checkout v4
  • actions/setup-node v4
  • actions/configure-pages v5
  • actions/upload-pages-artifact v3
  • actions/deploy-pages v4
.github/workflows/run_release.yaml
.github/workflows/update_new_issue.yaml
  • actions/github-script v7
npm
website/package.json
  • @apify/utilities ^2.8.0
  • @docusaurus/core 3.4.0
  • @docusaurus/mdx-loader 3.4.0
  • @docusaurus/plugin-client-redirects 3.4.0
  • @docusaurus/preset-classic 3.4.0
  • @giscus/react ^3.0.0
  • @mdx-js/react ^3.0.1
  • axios ^1.5.0
  • buffer ^6.0.3
  • clsx ^2.0.0
  • crypto-browserify ^3.12.0
  • docusaurus-gtm-plugin ^0.0.2
  • docusaurus-plugin-typedoc-api ^4.2.0
  • prism-react-renderer ^2.1.0
  • process ^0.11.10
  • prop-types ^15.8.1
  • raw-loader ^4.0.2
  • react ^18.2.0
  • react-dom ^18.2.0
  • react-lite-youtube-embed ^2.3.52
  • stream-browserify ^3.0.0
  • unist-util-visit ^5.0.0
  • @apify/eslint-config-ts ^0.4.0
  • @apify/tsconfig ^0.1.0
  • @docusaurus/module-type-aliases 3.4.0
  • @docusaurus/types 3.4.0
  • @types/react ^18.0.28
  • @typescript-eslint/eslint-plugin ^7.0.0
  • @typescript-eslint/parser ^7.0.0
  • eslint ^8.35.0
  • eslint-plugin-react ^7.32.2
  • eslint-plugin-react-hooks ^4.6.0
  • fs-extra ^11.1.0
  • patch-package ^8.0.0
  • path-browserify ^1.0.1
  • prettier ^3.0.0
  • rimraf ^5.0.0
  • typescript 5.5.2
  • yarn 4.3.1
website/roa-loader/package.json
  • loader-utils ^3.2.1
pep621
pyproject.toml
templates/beautifulsoup/{{cookiecutter.project_name}}/pyproject.toml
templates/playwright/{{cookiecutter.project_name}}/pyproject.toml
poetry
pyproject.toml
  • python ^3.9
  • aiofiles ^23.2.1
  • aioshutil ^1.3
  • beautifulsoup4 ^4.12.3
  • colorama ^0.4.6
  • docutils ^0.21.0
  • eval-type-backport ^0.2.0
  • html5lib ^1.1
  • httpx ^0.27.0
  • lxml ^5.2.1
  • more_itertools ^10.2.0
  • playwright ^1.43.0
  • psutil ^6.0.0
  • pydantic ^2.6.3
  • pydantic-settings ^2.2.1
  • pyee ^11.1.0
  • python-dateutil ^2.9.0
  • sortedcollections ^2.1.0
  • typing-extensions ^4.1.0
  • tldextract ^5.1.2
  • cookiecutter ^2.6.0
  • typer ^0.12.3
  • inquirer ^3.3.0
  • build ~1.2.0
  • filelock ~3.15.0
  • ipdb ^0.13.13
  • mypy ~1.10.0
  • pre-commit ~3.7.0
  • pydoc-markdown ~4.8.2
  • pytest ~8.2.0
  • pytest-asyncio ~0.23.5
  • pytest-cov ~5.0.0
  • pytest-only ~2.1.0
  • pytest-timeout ~2.3.0
  • pytest-xdist ~3.6.0
  • respx ~0.21.0
  • ruff ~0.5.0
  • setuptools ^70.0.0
  • proxy-py ^2.4.4
templates/beautifulsoup/{{cookiecutter.project_name}}/pyproject.toml
  • python ^3.9
  • crawlee *
templates/playwright/{{cookiecutter.project_name}}/pyproject.toml
  • python ^3.9
  • crawlee *


Separate `MemoryStorageClient` and `FilesystemStorageClient`

Description

Currently, we have a MemoryStorageClient that can persist the data in the file system.

Let's separate them; FilesystemStorageClient could probably extend MemoryStorageClient.

Other related things

  • There are memory storage-only data models in the storage/models.py module. Move them to the memory storage subpackage.

Add base storage client and resource subclients

Description

Currently, our resource clients are memory storage specific. Let's update them to be storage-agnostic. It will probably require updating the BaseStorageClient & MemoryStorageClient as well.

Storage-agnostic resource clients are not an option regarding the structure of Apify (platform) clients. So instead of that, let's implement a unified interface (abstract base classes) for BaseStorage and all resource sub-clients (it will be based on the ApifyClient). All of the specific storage clients should inherit the base class and implement the relevant methods.

Soon we will have MemoryStorageClient, FileSystemStorageClient (probably extending the MemoryStorageClient), and ApifyStorageClient (implemented in the apify-sdk or in apify-client). All of them should implement an interface from BaseStorageClient.

StorageClientManager will take care of setting the specific StorageClient.
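
A hedged sketch of that unified interface; the class and method names below are illustrative (the real set of abstract methods would mirror the ApifyClient's resource sub-clients).

from abc import ABC, abstractmethod


class BaseStorageClient(ABC):
    """Unified interface for MemoryStorageClient, FileSystemStorageClient and ApifyStorageClient."""

    @abstractmethod
    def dataset(self, id: str) -> 'BaseDatasetClient':  # illustrative resource sub-client type
        ...

    @abstractmethod
    def key_value_store(self, id: str) -> 'BaseKeyValueStoreClient':
        ...

    @abstractmethod
    def request_queue(self, id: str) -> 'BaseRequestQueueClient':
        ...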

Implement fingerprinting

Coordinate with @barjin before implementing anything.

There is a possibility of developing a dedicated fingerprinting library (in Rust?). In that case, we will just do some wrapping in the Python tooling (same as in JavaScript).

Add `enqueue_links` helper

We should provide a similar helper to what we have in crawlee.

https://crawlee.dev/api/core/function/enqueueLinks

In a nutshell, there is a base implementation, which requires a list of URLs, filters them based on the provided options (e.g. globs/regexps or the enqueue strategies) and adds them to the RQ. Then we have contextual helpers in each crawler, e.g. CheerioCrawler has its own context-aware variant, which operates on the current page, and automatically finds all the links (matching the selector option, which defaults to just a).

The enqueuing strategies are described here:

https://crawlee.dev/api/core/enum/EnqueueStrategy

We should first come up with the basic support for autoscaling, and have BasicCrawler and BeautifulsoupCrawler classes.

We could start with a simple variant that will only work with regexps, and add more features/options going forward.
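
A minimal sketch of the regexp-only first variant mentioned above, assuming a request queue object with an add_request method as used elsewhere in this document; the option name is illustrative.

from __future__ import annotations

import re


async def enqueue_links(urls: list[str], request_queue, include: list[re.Pattern] | None = None) -> None:
    """Filter the candidate URLs by the provided regexps and add the matches to the request queue."""
    for url in urls:
        if include is not None and not any(pattern.search(url) for pattern in include):
            continue
        await request_queue.add_request({'url': url})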

Introduce a better solution for dealing with byte size

Current state

  • Currently, we have many variables describing "byte size" as integers. It leads to identifiers with _bytes suffixes, e.g. max_memory_bytes, buffer_memory_bytes, threshold_memory_bytes, ...
  • Then we have to use some conversion functions, for example when we want to log it (e.g. a to_mb function).

Goal state

  • Use some more clever way of dealing with these kinds of variables, similar to how we utilize datetime.timedelta for durations.
  • Either by implementing our own solution, e.g. like this:
# src/crawlee/_utils/byte_size.py

from __future__ import annotations

from dataclasses import dataclass

_BYTES_PER_KB = 1024
_BYTES_PER_MB = _BYTES_PER_KB**2
_BYTES_PER_GB = _BYTES_PER_KB**3
_BYTES_PER_TB = _BYTES_PER_KB**4

@dataclass
class ByteSize:
    """Represents a size in bytes."""

    bytes_: int

    def to_kb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_KB

    def to_mb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_MB

    def to_gb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_GB

    def to_tb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_TB

    def __str__(self: ByteSize) -> str:
        if self.bytes_ >= _BYTES_PER_TB:
            return f'{self.to_tb():.2f} TB'

        if self.bytes_ >= _BYTES_PER_GB:
            return f'{self.to_gb():.2f} GB'

        if self.bytes_ >= _BYTES_PER_MB:
            return f'{self.to_mb():.2f} MB'

        if self.bytes_ >= _BYTES_PER_KB:
            return f'{self.to_kb():.2f} KB'

        return f'{self.bytes_} Bytes'
  • Or use some existing solution. Explore the following:
    • typing.NewType;
    • packages on PyPI.
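
For illustration, usage of the ByteSize sketch above would look like this:

size = ByteSize(3 * 1024**2)
print(size.to_kb())  # 3072.0
print(size)          # 3.00 MB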

Add package for run-time type checking

Description

Based on the PR apify/apify-sdk-python#171, @janbuchar suggested the usage of some run-time checking for Python.

E.g. typeguard: it can be applied either using the @typechecked decorator for a specific function or the typeguard.install_import_hook() import hook for the whole module.

For some methods/functions where we manually check the types of arguments/return values, it could make sense to use it. E.g. here: https://github.com/apify/apify-sdk-python/blob/v1.5.1/src/apify/scrapy/utils.py#L44.

Potential problems

I suppose it is implemented by using typing.get_type_hints to get the type hints for a specific function. I ran into a bug when typing.get_type_hints and from __future__ import annotations are used together; see the issue apify/apify-sdk-python#151. However, tests should reveal it.
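
For illustration, the decorator approach turns a wrong argument type into a runtime error at call time; @typechecked is typeguard's real decorator, while the function below is made up.

from typeguard import typechecked


@typechecked
def to_kilobytes(size_in_bytes: int) -> float:
    return size_in_bytes / 1024


to_kilobytes(2048)      # returns 2.0
# to_kilobytes('2048')  # would raise a type-check error at runtime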

Better approach of making a cache

  • This is a follow-up issue to the discussion in #82 (comment).
  • Currently, we have our own implementation of LRU cache in crawlee/_utils/lru_cache.py.
  • Let's do it in a more Pythonic way, maybe utilizing the built-in caching from the functools std module (the lru_cache decorator)?
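
A minimal illustration of the functools approach on a synchronous helper; note that lru_cache cannot be applied directly to async functions (it would cache the coroutine objects, not their results), so the async storage-opening path would still need a small wrapper. The function name below is illustrative.

import hashlib
from functools import lru_cache


@lru_cache(maxsize=256)
def unique_key_to_request_id(unique_key: str) -> str:
    """Illustrative pure function whose results are worth caching."""
    return hashlib.sha256(unique_key.encode()).hexdigest()[:15]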

Request queue v2 support

  • Implement methods for request queue v2 (locking, batch operations)
  • Implement request queue v2 into local request queue (locking, batch operations)
  • On top of that, there is a difference between the Python and JS clients: we are missing parallelism and retries in the Python client, so we need to implement them in the SDK
