
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Home Page: https://crawlee.dev/python/

License: Apache License 2.0

Languages: Makefile 0.20%, Python 86.60%, JavaScript 10.34%, CSS 2.67%, Shell 0.20%
Topics: apify, automation, beautifulsoup, crawler, crawling, headless, headless-chrome, pip, playwright, python

crawlee-python's Introduction

Crawlee
A web scraping and browser automation library


Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.

🚀 Crawlee for Python is open to early adopters!

Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.

👉 View full documentation, guides and examples on the Crawlee project website 👈

We also have a TypeScript implementation of Crawlee, which you can explore and use for your projects. For more information, visit our GitHub repository: Crawlee for JS/TS on GitHub.

Installation

We recommend visiting the Introduction tutorial in the Crawlee documentation for more information.

Crawlee is available as the crawlee PyPI package. The core functionality is included in the base package, with additional features available as optional extras to minimize package size and dependencies. To install Crawlee with all features, run the following command:

pip install 'crawlee[all]'

Then, install the Playwright dependencies:

playwright install

Verify that Crawlee is successfully installed:

python -c 'import crawlee; print(crawlee.__version__)'

For detailed installation instructions see the Setting up documentation page.

With Crawlee CLI

The quickest way to get started with Crawlee is by using the Crawlee CLI and selecting one of the prepared templates. First, ensure you have Pipx installed:

pipx --help

Then, run the CLI and choose from the available templates:

pipx run crawlee create my-crawler

If you already have crawlee installed, you can spin it up by running:

crawlee create my-crawler

Examples

Here are some practical examples to help you get started with different types of crawlers in Crawlee. Each example demonstrates how to set up and run a crawler for specific use cases, whether you need to handle simple HTML pages or interact with JavaScript-heavy sites. A crawler run will create a storage/ directory in your current working directory.

BeautifulSoupCrawler

The BeautifulSoupCrawler downloads web pages using an HTTP library and provides HTML-parsed content to the user. By default, it uses HttpxHttpClient for HTTP communication and BeautifulSoup for parsing HTML. It is ideal for projects that require efficient extraction of data from HTML content. This crawler has very good performance since it does not use a browser. However, if you need to execute client-side JavaScript to get your content, this will not be enough and you will need to use PlaywrightCrawler. Also, if you want to use this crawler, make sure you install crawlee with the beautifulsoup extra.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())

PlaywrightCrawler

The PlaywrightCrawler uses a headless browser to download web pages and provides an API for data extraction. It is built on Playwright, an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript for content generation, or tasks requiring interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the BeautifulSoupCrawler. Also, if you want to use this crawler, make sure you install crawlee with the playwright extra.

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

More examples

Explore our Examples page in the Crawlee documentation for a wide range of additional use cases and demonstrations.

Features

Why is Crawlee the preferred choice for web scraping and crawling?

Why use Crawlee instead of just a random HTTP library with an HTML parser?

  • Unified interface for HTTP & headless browser crawling.
  • Automatic parallel crawling based on available system resources.
  • Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
  • Automatic retries on errors or when you’re getting blocked.
  • Integrated proxy rotation and session management.
  • Configurable request routing - direct URLs to the appropriate handlers.
  • Persistent queue for URLs to crawl.
  • Pluggable storage of both tabular data and files.
  • Robust error handling.
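
To make the "configurable request routing" point above concrete, here is a minimal sketch. It assumes, as in the TypeScript version, that the router also exposes a handler(label) decorator next to default_handler, and that enqueue_links accepts selector and label arguments; double-check these against the current API reference.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Send links matching an (illustrative) CSS selector to the labeled handler below.
        await context.enqueue_links(selector='a.category', label='CATEGORY')

    @crawler.router.handler('CATEGORY')
    async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Only requests enqueued with label='CATEGORY' end up here.
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())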

Why use Crawlee rather than Scrapy?

  • Crawlee has out-of-the-box support for headless browser crawling (Playwright).
  • Crawlee has a minimalistic & elegant interface - Set up your scraper with fewer than 10 lines of code.
  • Complete type hint coverage.
  • Based on standard Asyncio.

Running on the Apify platform

Crawlee is open-source and runs anywhere, but since it's developed by Apify, it's easy to set up on the Apify platform and run in the cloud. Visit the Apify SDK website to learn more about deploying Crawlee to the Apify platform.

Support

If you find any bug or issue with Crawlee, please submit an issue on GitHub. For questions, you can ask on Stack Overflow, in GitHub Discussions or you can join our Discord server.

Contributing

Your code contributions are welcome, and you'll be praised for eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

crawlee-python's People

Contributors

asymness, b4nan, barjin, black7375, cadlagtrader, eltociear, fauzaanu, janbuchar, kpcofgs, mantisus, renovate[bot], siddiqkaithodu, souravjain540, tymeek, vdusek


crawlee-python's Issues

Kick off the AutoscaledPool

Tasks

  • Explore the Apify/Crawlee AutoscaledPool in JavaScript (https://crawlee.dev/api/core/class/AutoscaledPool).
  • Figure out how to implement a similar functionality in Python.
  • Prepare a PoC of the AutoscaledPool for Python SDK.
  • Measure the performance of the PoCs

Test Actor for PoCs:

import asyncio
from dataclasses import dataclass, field
from time import time
from typing import Callable
from urllib.parse import urljoin

from apify import Actor
from apify.storages import RequestQueue
from bs4 import BeautifulSoup, Tag
from httpx import AsyncClient


@dataclass(frozen=True)
class ActorInput:
    start_urls: list[dict] = field(default_factory=lambda: [{'url': 'https://apify.com'}])
    max_depth: int = 1
    desired_concurrency: int = 10


class BeautifulSoupCrawler:
    def __init__(self, handle_request: Callable, max_depth: int, desired_concurrency: int) -> None:
        self.handle_request = handle_request
        self.max_depth = max_depth
        self.desired_concurrency = desired_concurrency

    async def run(self, start_urls: list) -> None:
        # TODO: every PoC will implement this differently.
        raise NotImplementedError


async def handle_request(request: dict, request_queue: RequestQueue, max_depth: int) -> None:
    url = request['url']
    depth = request['userData']['depth']
    Actor.log.info(f'Scraping {url} (depth={depth}) ...')

    try:
        async with AsyncClient() as client:
            response = await client.get(url, follow_redirects=True)
        soup = BeautifulSoup(response.content, 'html.parser')

        # If we haven't reached the max depth, look for nested links and enqueue their targets
        if depth < max_depth:
            for link in soup.find_all('a'):
                link_href = link.get('href')
                link_url = urljoin(url, link_href)
                if link_url.startswith(('http://', 'https://')):
                    Actor.log.info(f'Enqueuing {link_url} ...')
                    await request_queue.add_request(
                        {
                            'url': link_url,
                            'userData': {'depth': depth + 1},
                        }
                    )

        result = {
            'url': url,
            'title': soup.title.string if isinstance(soup.title, Tag) else None,
        }
        await Actor.push_data(result)

    except Exception:
        Actor.log.exception(f'Cannot extract data from {url}.')
    finally:
        # Mark the request as handled so it's not processed again
        await request_queue.mark_request_as_handled(request)


async def main() -> None:
    async with Actor:
        actor_input = ActorInput(**(await Actor.get_input() or {}))
        crawler = BeautifulSoupCrawler(handle_request, actor_input.max_depth, actor_input.desired_concurrency)
        start = time()
        await crawler.run(actor_input.start_urls)
        elapsed_time = time() - start
        Actor.log.info(f'Time taken: {elapsed_time}')
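
For comparison, here is one possible shape of a PoC run() method, sketched with a plain asyncio.Semaphore for concurrency limiting instead of a real AutoscaledPool. It reuses the BeautifulSoupCrawler and handle_request from the snippet above; the RequestQueue calls (open, add_request, fetch_next_request, is_finished) follow the Apify SDK and should be double-checked against its current API.

import asyncio

from apify.storages import RequestQueue


async def run_with_semaphore(crawler: BeautifulSoupCrawler, start_urls: list) -> None:
    request_queue = await RequestQueue.open()
    for start_url in start_urls:
        await request_queue.add_request({'url': start_url['url'], 'userData': {'depth': 0}})

    semaphore = asyncio.Semaphore(crawler.desired_concurrency)
    tasks = []

    async def process(request: dict) -> None:
        async with semaphore:
            await crawler.handle_request(request, request_queue, crawler.max_depth)

    while not await request_queue.is_finished():
        request = await request_queue.fetch_next_request()
        if request is None:
            # Nothing to fetch right now; in-flight requests may still enqueue more links.
            await asyncio.sleep(0.5)
            continue
        tasks.append(asyncio.create_task(process(request)))

    await asyncio.gather(*tasks)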

Passing context to crawler handlers

The problem

We want to pick the best approach for passing context data/helpers to various handler functions in crawlee.py. We already have an implementation in place, but if there's a better way, we should switch sooner rather than later.

What OG Crawlee does

new CheerioCrawler({
  requestHandler: async ({ request, pushData, enqueueLinks }) => { // types of helpers are inferred correctly - thanks typescript
    // ...
  }
})
  • types are correct
  • we get suggestions for context helpers
  • implementation is iffy from a type-safety perspective, but salvageable

Python version A (+/- current implementation)

crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:  # explicit type annotation is necessary for type checking and suggestions
  context.push_data(...)
  • if a type checker and annotations are used, types are correct (can't get better than that in Python)
  • we get suggestions for context helpers
  • the implementation is type-safe enough, but very inflexible
    • in contrast to TypeScript, it won't be salvageable anytime soon - not until we have intersection types
  • there is no "object destructuring" in Python, so everything needs to be prefixed with context.

Python version B

This proposal is similar to how pytest fixtures or FastAPI dependencies work.

crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(push_data: PushData, soup: BeautifulSoup) -> None:  # explicit type annotation is necessary for type checking
  push_data(...)
  • no context. prefix
  • the function signature is not checked by a type checker, but we can do it when the handler is registered, which should be fine as well
  • allows for a more flexible implementation with easier code reuse
  • no suggestions of parameter names
  • the "injection" from Crawlee's side can be based on both parameter name and type annotation, so the type annotations are optional for users (but if they don't use it, they miss out on type safety and autocompletions)

Please voice your opinions on the matter 🙂 We also welcome any alternative approaches, of course.

Explore what doc tooling we use in SDK and how it deals with dataclasses docstrings

Let's consider the following example:

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemorySnapshot:
    """A snapshot of memory usage.

    Args:
        total_bytes: Total memory available in the system.
        current_bytes: Memory usage of the current Python process and its children.
        max_memory_bytes: The maximum memory that can be used by `AutoscaledPool`.
        max_used_memory_ratio: The maximum acceptable ratio of `current_bytes` to `max_memory_bytes`.
        created_at: The time at which the measurement was taken.
    """

    total_bytes: int
    current_bytes: int
    max_memory_bytes: int
    max_used_memory_ratio: float
    created_at: datetime = field(default_factory=lambda: datetime.now(tz=timezone.utc))

    @property
    def is_overloaded(self) -> bool:
        """Returns whether the memory is considered as overloaded."""
        return (self.current_bytes / self.max_memory_bytes) > self.max_used_memory_ratio

Is doc tooling (maybe the one we use in SDK) able to handle it properly?

Based on the discussion here: #20 (comment).

Simplify argument type `requests`

Somewhere we use the following:

requests: list[BaseRequestData | Request]

Let's refactor the code to accept only one type.

In the places where we need to use:

arg_name: list[Request | str]

Let's use a different identifier than requests, e.g. sources.

See the following conversation for context - #56 (comment).
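
A hedged sketch of what the refactored signature could look like; the names and the Request.from_url helper are assumptions, not final API.

from __future__ import annotations

from crawlee.models import Request  # assumed import path


async def add_requests(sources: list[Request | str]) -> None:
    """Accept plain URLs or full Request objects and normalize them internally."""
    requests = [
        source if isinstance(source, Request) else Request.from_url(source)  # from_url is assumed
        for source in sources
    ]
    ...  # hand the normalized `requests` over to the request provider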

Configure Renovate bot

Configure Renovate to keep Python dependencies up to date (Poetry lock file, dev dependencies in pyproject.toml, ...), as we use it in JS/TS projects.

I expect that the Renovate bot will update dependencies at regular intervals. If they pass the tests, it will commit the changes directly to master. Otherwise, it will open a pull request.

Once this is done, please open the same issue for SDK, Client, and Shared Python repositories with a link to this one.

Blocked by #6.

Implement auto-purging of storages

We need the same behavior as in the JS version:

  • crawlee implements the base storage classes
  • every async operation checks if it was the first call, and purges automatically unless opted-out via CRAWLEE_PURGE_ON_START env var (with a falsy value like 0 or false)
  • we have this method that is called in many places in the storage methods like open or getInput: https://crawlee.dev/api/core/function/purgeDefaultStorages
  • since the SDK uses those storage classes, it has the same behavior out of the box
  • internally this works by calling the purge method on the storage client, so this also means both the memory storage and the Apify client need to implement this purge method

Related: apify/apify-cli#545
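
A rough sketch of the described first-call check; the names below (_ensure_purged, storage_client.purge) are hypothetical, only the CRAWLEE_PURGE_ON_START semantics come from the list above.

import os

_purge_done = False


async def _ensure_purged(storage_client) -> None:
    """Purge the default storages on the first storage operation, unless opted out."""
    global _purge_done
    if _purge_done:
        return
    _purge_done = True
    purge_on_start = os.environ.get('CRAWLEE_PURGE_ON_START', 'true').lower()
    if purge_on_start not in ('0', 'false'):
        await storage_client.purge()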

Refactor initialization of storages

Description

  • Currently, if you want to initialize a Dataset/KVS/RQ, you should use the open() constructor. It goes like the following:
    • dataset.open()
    • base_storage.open()
    • dataset.__init__()
    • base_storage.__init__()
  • In the base_storage.open() a specific client is selected (local - MemoryStorageClient or cloud - ApifyClient) using StorageClientManager.
  • Refactor initialization of memory storage resource clients as well.

Desired state

  • Make it more readable, less error-prone (e.g. user uses a wrong constructor), and extensible by supporting other clients.

Improve error handling in `Dataset._get_data_internal()`

Simulate this error in Python and handle it accordingly.

    try {
        return await this.client.listItems(options);
    } catch (e) {
        const error = e as Error;
        if (error.message.includes('Cannot create a string longer than')) {
            throw new Error('dataset.getData(): The response is too large for parsing. You can fix this by lowering the "limit" option.');
        }
        throw e;
    }

https://github.com/apify/apify-sdk-python/blob/v1.3.0/src/apify/storages/dataset.py#L240:L249
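
A hedged Python counterpart of the snippet above; the client attribute and method names (self._client.list_items) are placeholders for whatever the dataset resource client exposes, and the Node-specific message check is carried over only to show the structure.

async def _get_data_internal(self, **options):
    try:
        return await self._client.list_items(**options)
    except Exception as exc:
        if 'Cannot create a string longer than' in str(exc):
            raise ValueError(
                'dataset.get_data(): The response is too large for parsing. '
                'You can fix this by lowering the "limit" option.'
            ) from exc
        raise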

Remove `json_` and `order_no` from `Request`

The purpose of the fields is somewhat unclear, but it's certain that they don't belong to the Request class.

We should definitely explore the notion of an internal request in Crawlee and how it translates to the Python version.

Use `uv` as the packaging tool in CI builds

Recently, the creators of Ruff (Astral) released a new package installer and resolver called uv, written in Rust. Perhaps we could integrate it into our CI pipelines, as installing everything for all supported Python versions, as well as on Linux and Windows, can take some time.

This week, a similar approach was implemented in Apache Airflow: apache/airflow#37692.

BasicCrawler status logging

  • configurable interval
  • configurable status message callback (constructor parameter, property or decorator?)
  • we periodically set the crawler status via storage client
  • in JavaScript Crawlee, this does nothing when MemoryStorage is being used
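
A purely hypothetical sketch of what the configuration surface could look like; the parameter and decorator names below are made up to illustrate the options listed above.

from datetime import timedelta

from crawlee.basic_crawler import BasicCrawler  # import path as in the current layout; may change

crawler = BasicCrawler(
    # Hypothetical parameter: how often the status message is recomputed and stored.
    status_message_interval=timedelta(seconds=10),
)


# Hypothetical decorator variant of the status message callback.
@crawler.status_message_callback
async def status_message(state) -> str:
    return f'Crawled {state.requests_finished} pages so far.'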

BasicCrawler statistics

  • Statistics shall be collected during the crawler run
  • BasicCrawler.run should return a (non-empty) statistics object
  • statistics should be logged periodically

Improve unit testing of Snapshotter

We're touching a lot of private stuff there; let's do it in a better way.

We discussed it in discussion_r1521267138.

Mainly, this suggestion is a good idea:

"Or we could make a testing implementation of EventManager where emitting events could be done from the outside (I mean from the test)."

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.


Detected dependencies

github-actions
.github/workflows/_check_changelog_entry.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_check_docs_build.yaml
  • actions/checkout v4
  • actions/setup-node v4
  • actions/setup-node v4
.github/workflows/_check_version_conflict.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_linting.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_publish_to_pypi.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_type_checking.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_unit_tests.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/docs.yml
  • actions/checkout v4
  • actions/setup-node v4
  • actions/configure-pages v5
  • actions/upload-pages-artifact v3
  • actions/deploy-pages v4
.github/workflows/run_release.yaml
.github/workflows/update_new_issue.yaml
  • actions/github-script v7
npm
website/package.json
  • @apify/utilities ^2.8.0
  • @docusaurus/core 3.4.0
  • @docusaurus/mdx-loader 3.4.0
  • @docusaurus/plugin-client-redirects 3.4.0
  • @docusaurus/preset-classic 3.4.0
  • @giscus/react ^3.0.0
  • @mdx-js/react ^3.0.1
  • axios ^1.5.0
  • buffer ^6.0.3
  • clsx ^2.0.0
  • crypto-browserify ^3.12.0
  • docusaurus-gtm-plugin ^0.0.2
  • docusaurus-plugin-typedoc-api ^4.2.0
  • prism-react-renderer ^2.1.0
  • process ^0.11.10
  • prop-types ^15.8.1
  • raw-loader ^4.0.2
  • react ^18.2.0
  • react-dom ^18.2.0
  • react-lite-youtube-embed ^2.3.52
  • stream-browserify ^3.0.0
  • unist-util-visit ^5.0.0
  • @apify/eslint-config-ts ^0.4.0
  • @apify/tsconfig ^0.1.0
  • @docusaurus/module-type-aliases 3.4.0
  • @docusaurus/types 3.4.0
  • @types/react ^18.0.28
  • @typescript-eslint/eslint-plugin ^7.0.0
  • @typescript-eslint/parser ^7.0.0
  • eslint ^8.35.0
  • eslint-plugin-react ^7.32.2
  • eslint-plugin-react-hooks ^4.6.0
  • fs-extra ^11.1.0
  • patch-package ^8.0.0
  • path-browserify ^1.0.1
  • prettier ^3.0.0
  • rimraf ^5.0.0
  • typescript 5.5.2
  • yarn 4.3.1
website/roa-loader/package.json
  • loader-utils ^3.2.1
pep621
pyproject.toml
templates/beautifulsoup/{{cookiecutter.project_name}}/pyproject.toml
templates/playwright/{{cookiecutter.project_name}}/pyproject.toml
poetry
pyproject.toml
  • python ^3.9
  • aiofiles ^23.2.1
  • aioshutil ^1.3
  • beautifulsoup4 ^4.12.3
  • colorama ^0.4.6
  • docutils ^0.21.0
  • eval-type-backport ^0.2.0
  • html5lib ^1.1
  • httpx ^0.27.0
  • lxml ^5.2.1
  • more_itertools ^10.2.0
  • playwright ^1.43.0
  • psutil ^6.0.0
  • pydantic ^2.6.3
  • pydantic-settings ^2.2.1
  • pyee ^11.1.0
  • python-dateutil ^2.9.0
  • sortedcollections ^2.1.0
  • typing-extensions ^4.1.0
  • tldextract ^5.1.2
  • cookiecutter ^2.6.0
  • typer ^0.12.3
  • inquirer ^3.3.0
  • build ~1.2.0
  • filelock ~3.15.0
  • ipdb ^0.13.13
  • mypy ~1.10.0
  • pre-commit ~3.7.0
  • pydoc-markdown ~4.8.2
  • pytest ~8.2.0
  • pytest-asyncio ~0.23.5
  • pytest-cov ~5.0.0
  • pytest-only ~2.1.0
  • pytest-timeout ~2.3.0
  • pytest-xdist ~3.6.0
  • respx ~0.21.0
  • ruff ~0.5.0
  • setuptools ^70.0.0
  • proxy-py ^2.4.4
templates/beautifulsoup/{{cookiecutter.project_name}}/pyproject.toml
  • python ^3.9
  • crawlee *
templates/playwright/{{cookiecutter.project_name}}/pyproject.toml
  • python ^3.9
  • crawlee *


Separate `MemoryStorageClient` and `FilesystemStorageClient`

Description

Currently, we have a MemoryStorageClient that can persist the data in the file system.

Let's separate them; FilesystemStorageClient could probably extend MemoryStorageClient.

Other related things

  • There are memory storage-only data models in the storage/models.py module. Move them to the memory storage subpackage.

Add base storage client and resource subclients

Description

Currently, our resource clients are memory storage specific. Let's update them to be storage-agnostic. It will probably require updating the BaseStorageClient & MemoryStorageClient as well.

Storage-agnostic resource clients are not an option regarding the structure of Apify (platform) clients. So instead of that, let's implement a unified interface (abstract base classes) for BaseStorage and all resource sub-clients (it will be based on the ApifyClient). All of the specific storage clients should inherit the base class and implement the relevant methods.

Soon we will have MemoryStorageClient, FileSystemStorageClient (probably extending the MemoryStorageClient), and ApifyStorageClient (implemented in the apify-sdk or in apify-client). All of them should implement an interface from BaseStorageClient.

StorageClientManager will take care of setting the specific StorageClient.
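
A hedged sketch of that unified interface; the class and method names below are illustrative (the real set of abstract methods would mirror the ApifyClient's resource sub-clients).

from abc import ABC, abstractmethod


class BaseStorageClient(ABC):
    """Unified interface for MemoryStorageClient, FileSystemStorageClient and ApifyStorageClient."""

    @abstractmethod
    def dataset(self, id: str) -> 'BaseDatasetClient':  # illustrative resource sub-client type
        ...

    @abstractmethod
    def key_value_store(self, id: str) -> 'BaseKeyValueStoreClient':
        ...

    @abstractmethod
    def request_queue(self, id: str) -> 'BaseRequestQueueClient':
        ...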

Implement fingerprinting

Coordinate with @barjin before implementing anything.

There is a possibility of developing a dedicated fingerprinting library (in Rust?). In that case, we will just do some wrapping in the Python tooling (same as in JavaScript).

Add `enqueue_links` helper

We should provide a similar helper to what we have in crawlee.

https://crawlee.dev/api/core/function/enqueueLinks

In a nutshell, there is a base implementation, which requires a list of URLs, filters them based on the provided options (e.g. globs/regexps or the enqueue strategies) and adds them to the RQ. Then we have contextual helpers in each crawler, e.g. CheerioCrawler has its own context-aware variant, which operates on the current page, and automatically finds all the links (matching the selector option, which defaults to just a).

The enqueuing strategies are described here:

https://crawlee.dev/api/core/enum/EnqueueStrategy

We should first come up with the basic support for autoscaling, and have BasicCrawler and BeautifulsoupCrawler classes.

We could start with a simple variant that will only work with regexps, and add more features/options going forward.
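
A minimal sketch of the regexp-only first variant mentioned above, assuming a request queue object with an add_request method as used elsewhere in this document; the option name is illustrative.

from __future__ import annotations

import re


async def enqueue_links(urls: list[str], request_queue, include: list[re.Pattern] | None = None) -> None:
    """Filter the candidate URLs by the provided regexps and add the matches to the request queue."""
    for url in urls:
        if include is not None and not any(pattern.search(url) for pattern in include):
            continue
        await request_queue.add_request({'url': url})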

Introduce a better solution for dealing with byte size

Current state

  • Currently, we have many variables describing "byte size" as integers. It leads to identifiers with _bytes suffixes, e.g. max_memory_bytes, buffer_memory_bytes, threshold_memory_bytes, ...
  • Then we have to use some conversion functions, for example when we want to log it (e.g. a to_mb function).

Goal state

  • Use some more clever way of dealing with these kinds of variables, similar to how we utilize datetime.timedelta for durations.
  • Either by implementing our own solution, e.g. like this:
# src/crawlee/_utils/byte_size.py

from __future__ import annotations

from dataclasses import dataclass

_BYTES_PER_KB = 1024
_BYTES_PER_MB = _BYTES_PER_KB**2
_BYTES_PER_GB = _BYTES_PER_KB**3
_BYTES_PER_TB = _BYTES_PER_KB**4

@dataclass
class ByteSize:
    """Represents a size in bytes."""

    bytes_: int

    def to_kb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_KB

    def to_mb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_MB

    def to_gb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_GB

    def to_tb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_TB

    def __str__(self: ByteSize) -> str:
        if self.bytes_ >= _BYTES_PER_TB:
            return f'{self.to_tb():.2f} TB'

        if self.bytes_ >= _BYTES_PER_GB:
            return f'{self.to_gb():.2f} GB'

        if self.bytes_ >= _BYTES_PER_MB:
            return f'{self.to_mb():.2f} MB'

        if self.bytes_ >= _BYTES_PER_KB:
            return f'{self.to_kb():.2f} KB'

        return f'{self.bytes_} Bytes'
  • Or use some existing solution. Explore the following:
    • typing.NewType;
    • packages on PyPI.
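
For illustration, usage of the ByteSize sketch above would look like this:

size = ByteSize(3 * 1024**2)
print(size.to_kb())  # 3072.0
print(size)          # 3.00 MB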

Add package for run-time type checking

Description

Based on the PR apify/apify-sdk-python#171, @janbuchar suggested the usage of some run-time checking for Python.

E.g. typeguard: it can be applied either using the @typechecked decorator for a specific function or the typeguard.install_import_hook() import hook for the whole module.

For some methods/functions where we manually check the types of arguments/return values, it could make sense to use it. E.g. here: https://github.com/apify/apify-sdk-python/blob/v1.5.1/src/apify/scrapy/utils.py#L44.

Potential problems

I suppose it is implemented by using typing.get_type_hints to get the type hints for a specific function. I ran into a bug when typing.get_type_hints and from __future__ import annotations are used together; see the issue apify/apify-sdk-python#151. However, tests should reveal it.
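
For illustration, the decorator approach turns a wrong argument type into a runtime error at call time; @typechecked is typeguard's real decorator, while the function below is made up.

from typeguard import typechecked


@typechecked
def to_kilobytes(size_in_bytes: int) -> float:
    return size_in_bytes / 1024


to_kilobytes(2048)      # returns 2.0
# to_kilobytes('2048')  # would raise a type-check error at runtime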

Better approach of making a cache

  • This is a follow-up issue to the discussion in #82 (comment).
  • Currently, we have our own implementation of LRU cache in crawlee/_utils/lru_cache.py.
  • Let's do it in a more Pythonic way, maybe utilizing the built-in caching from the functools std module (the lru_cache decorator)?
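
A minimal illustration of the functools approach on a synchronous helper; note that lru_cache cannot be applied directly to async functions (it would cache the coroutine objects, not their results), so the async storage-opening path would still need a small wrapper. The function name below is illustrative.

import hashlib
from functools import lru_cache


@lru_cache(maxsize=256)
def unique_key_to_request_id(unique_key: str) -> str:
    """Illustrative pure function whose results are worth caching."""
    return hashlib.sha256(unique_key.encode()).hexdigest()[:15]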

Request queue v2 support

  • Implement methods for request queue v2 (locking, batch operations)
  • Implement request queue v2 into local request queue (locking, batch operations)
  • On top of that, there is a difference between the Python and JS clients: we are missing parallelism and retries in the Python client, so we need to implement them in the SDK
