
python-scrapfly's Introduction

Scrapfly SDK

Installation

pip install scrapfly-sdk

You can also install extra dependencies

  • pip install "scrapfly-sdk[seepdup]" for performance improvement
  • pip install "scrapfly-sdk[concurrency]" for concurrency out of the box (asyncio / thread)
  • pip install "scrapfly-sdk[scrapy]" for scrapy integration
  • pip install "scrapfly-sdk[all]" Everything!

To use the built-in HTML parser (via the ScrapeApiResponse.selector property), either parsel or scrapy must also be installed.
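
For illustration, here is a minimal sketch of the selector property (the URL and CSS selector are placeholders, not part of the SDK docs):

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="Your Scrapfly API key")
api_response = client.scrape(ScrapeConfig(url="https://web-scraping.dev/products"))
# .selector exposes the response HTML as a parsel Selector
print(api_response.selector.css("a::text").getall())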

For usage references and examples, please check out the /examples folder in this repository.

Integrations

Scrapfly Python SDKs are integrated with LlamaIndex and LangChain. Both frameworks allow using Large Language Models (LLMs) with augmented context.

This augmented context is built by grounding LLMs in private or domain-specific data, which supports common use cases:

  • Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation")
  • Document Understanding and Extraction
  • Autonomous Agents that can perform research and take actions

In the context of web scraping, web page data can be extracted as Text or Markdown using Scrapfly's format feature, and the scraped results can then be fed to LLMs.
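
As an illustrative sketch, the same conversion can be requested directly through the Python SDK (assuming your SDK version supports the format parameter on ScrapeConfig):

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="Your Scrapfly API key")
# Ask the API to return the page converted to Markdown
api_response = client.scrape(
    ScrapeConfig(url="https://web-scraping.dev/products", format="markdown")
)
print(api_response.scrape_result['content'])  # the Markdown text of the page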

LlamaIndex

Installation

Install llama-index, llama-index-readers-web, and scrapfly-sdk using pip:

pip install llama-index llama-index-readers-web scrapfly-sdk

Usage

Scrapfly is available in LlamaIndex as a data connector, known as a Reader. This reader gathers web page data into a Document representation that can be used with the LLM directly. Below is an example of building a RAG system using LlamaIndex and scraped data. See the LlamaIndex use cases for more.

import os

from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex

# Initialize ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)

# After creating the documents, index them for querying with an LLM
# LlamaIndex uses OpenAI by default; other options can be found in the examples directory:
# https://docs.llamaindex.ai/en/stable/examples/llm/openai/

# Add your OpenAI API key (a paid account is required) from: https://platform.openai.com/api-keys/
os.environ['OPENAI_API_KEY'] = "Your OpenAI Key"
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the dark energy potion is bold cherry cola."

The load_data function accepts a ScrapeConfig object to use the desired Scrapfly API parameters:

from llama_index.readers.web import ScrapflyReader

# Initialize ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraper blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code with the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown`(default) or `text`
)
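
Markdown is usually the better choice of scrape format for this use case since it preserves the page's headings, links, and list structure, giving the LLM more signal than flattened plain text.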

LangChain

Installation

Install langchain, langchain-community, and scrapfly-sdk using pip:

pip install langchain langchain-community scrapfly-sdk

Usage

Scrapfly is available in LangChain as a document loader, known as a Loader. This loader gathers web page data into a Document representation that can be used with the LLM after a few operations. Below is an example of building a RAG system with LangChain using scraped data; see the LangChain tutorials for further use cases.

import os

from langchain import hub # pip install langchainhub
from langchain_chroma import Chroma # pip install langchain_chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI # pip install langchain_openai
from langchain_text_splitters import RecursiveCharacterTextSplitter # pip install langchain_text_splitters
from langchain_community.document_loaders import ScrapflyLoader


scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()

# This example uses OpenAI. For more see: https://python.langchain.com/v0.2/docs/integrations/platforms/
os.environ["OPENAI_API_KEY"] = "Your OpenAI key"

# Create a retriever
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

model = ChatOpenAI()
prompt = hub.pull("rlm/rag-prompt")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = rag_chain.invoke("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the Dark Energy Potion is bold cherry cola."

To use the full Scrapfly features with LangChain, pass a ScrapeConfig object to the ScrapflyLoader:

from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraper blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code with the headless browser
}

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapfly_scrape_config,  # Pass the scrape_config object
    scrape_format="markdown",  # The scrape result format, either `markdown`(default) or `text`
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)

Get Your API Key

You can create a free account on Scrapfly to get your API Key.

Migration

Migrate from 0.7.x to 0.8

The asyncio-pool dependency has been dropped.

scrapfly.concurrent_scrape is now an async generator. If the concurrency is None or not defined, the max concurrency allowed by your current subscription is used.

    async for result in scrapfly.concurrent_scrape(concurrency=10, scrape_configs=[ScrapeConfig(...), ...]):
        print(result)
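
A self-contained sketch of driving the generator (the URLs are illustrative):

import asyncio

from scrapfly import ScrapflyClient, ScrapeConfig

async def main():
    scrapfly = ScrapflyClient(key="Your Scrapfly API key")
    configs = [ScrapeConfig(url=f"https://web-scraping.dev/product/{i}") for i in range(1, 4)]
    # Without an explicit concurrency argument, the subscription maximum is used
    async for result in scrapfly.concurrent_scrape(scrape_configs=configs):
        print(result)

asyncio.run(main())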

The brotli argument is deprecated and will be removed in the next minor release. In most cases it offers no size benefit over gzip while using more CPU.

What's new

0.8.x

  • Better error logging
  • Async improvements for concurrent scraping with asyncio
  • Scrapy media pipelines are now supported out of the box

python-scrapfly's People

Contributors

granitosaurus, jjsaunier, mazen-r


python-scrapfly's Issues

AttributeError with Scrapy examples

When I run the Scrapy examples like the Bea one, I get the following error:

2023-10-22 05:57:11 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2023-10-22 05:57:11 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (main, Jul 30 2023, 10:41:08) [GCC 11.3.0], pyOpenSSL 23.2.0 (OpenSSL 3.1.3 19 Sep 2023), cryptography 41.0.4, Platform Linux-5.15.0-86-generic-x86_64-with-glibc2.35
Unhandled error in Deferred:
2023-10-22 05:57:11 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/scrapy/crawler.py", line 265, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/scrapy/crawler.py", line 269, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1947, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1857, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status, _copy_context())
--- <exception caught here> ---
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/scrapy/crawler.py", line 155, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/scrapy/crawler.py", line 169, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/scrapfly/scrapy/spider.py", line 126, in from_crawler
    crawler.stats.set_value('scrapfly/api_call_cost', 0)
builtins.AttributeError: 'NoneType' object has no attribute 'set_value'

2023-10-22 05:57:11 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/scrapy/crawler.py", line 155, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/scrapy/crawler.py", line 169, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/matthew/repos/payclass/settlement-data-parser/.venv/lib/python3.11/site-packages/scrapfly/scrapy/spider.py", line 126, in from_crawler
    crawler.stats.set_value('scrapfly/api_call_cost', 0)
AttributeError: 'NoneType' object has no attribute 'set_value'

I'm using Python 3.11 with:
scrapfly-sdk==0.8.9
Scrapy==2.11.0

Unable to execute demo

With Scrapy 2.11.0 I am unable to launch anything; demo.py or a custom spider throws errors:

2023-11-20 09:46:35 [scrapy.core.engine] INFO: Spider opened
2023-11-20 09:46:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-20 09:46:35 [scrapfly.scrapy.spider] INFO: ==> Retrying request for reason [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('STORE routines', '', 'unregistered scheme'), ('STORE routines', '', 'unsupported'), ('STORE routines', '', 'unregistered scheme'), ('system library', '', ''), ('STORE routines', '', 'unregistered scheme'), ('STORE routines', '', 'unsupported'), ('STORE routines', '', 'unregistered scheme'), ('system library', '', ''), ('STORE routines', '', 'unregistered scheme'), ('STORE routines', '', 'unsupported'), ('STORE routines', '', 'unregistered scheme'), ('system library', '', ''), ('STORE routines', '', 'unregistered scheme'), ('STORE routines', '', 'unsupported'), ('SSL routines', '', 'certificate verify failed')]>]
2023-11-20 09:46:35 [scrapfly.scrapy.spider] WARNING: Retrying <GET https://web-scraping.dev/product/5> for x0: twisted.web._newclient.ResponseNeverReceived

(the same INFO/WARNING retry pair repeats for https://web-scraping.dev/product/1 through /4 and for https://httpbin.dev/status/400, /403, and /404)

2023-11-20 09:46:36 [scrapfly] ERROR: [Failure instance: Traceback: <class 'TypeError'>: ExecutionEngine.crawl() takes 2 positional arguments but 3 were given
F:\python310\lib\site-packages\twisted\internet\defer.py:661:callback
F:\python310\lib\site-packages\twisted\internet\defer.py:763:_startRunCallbacks
F:\python310\lib\site-packages\twisted\internet\defer.py:857:_runCallbacks
F:\python310\lib\site-packages\twisted\internet\defer.py:1750:gotResult
--- <exception caught here> ---
F:\python310\lib\site-packages\twisted\internet\defer.py:1656:_inlineCallbacks
F:\python310\lib\site-packages\twisted\python\failure.py:514:throwExceptionIntoGenerator
F:\python310\lib\site-packages\scrapy\core\downloader\middleware.py:86:process_exception
F:\python310\lib\site-packages\twisted\internet\defer.py:857:_runCallbacks
F:\python310\lib\site-packages\twisted\internet\task.py:869:cb ]

(this ERROR is repeated once per request)

Python SDK Import Failure

I just started with the first Python program example and it fails on this initial line:

"from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse"

ImportError: cannot import name 'ScrapeConfig' from partially initialized module 'scrapfly' (most likely due to a circular import)

Latest Version Release?

Hi @jjsaunier,

Thanks for your recent commit that removes the excess print statements.

If possible please can you release this version on pypi soon? Our team is waiting on it.

Thanks

raise_on_upstream_error=False does not work

Hi, I tried

api_response: ScrapeApiResponse = scrapfly.scrape(
    scrape_config=ScrapeConfig(url="https://louisabraham.github.io/404error", raise_on_upstream_error=False)
)

but it raised

ScrapflyScrapeError: 404 404 - Not Found: Upstream respond with http code >= 299. Learn more: https://scrapfly.io/docs/scrape-api/error/ERR::SCRAPE::BAD_UPSTREAM_RESPONSE

Exceptions from concurrent_scrape?

We have code that looks like this:

        scrapfly = ScrapflyClient(key=self.__scrapfly_api_key, max_concurrency=15)
        targets = [
            ScrapeConfig(
                url=url,
                render_js=True,
                raise_on_upstream_error=False,
                country="us",
                asp=True,
            )
            for url in urls
        ]
        async for result in scrapfly.concurrent_scrape(scrape_configs=targets):
            self.__logger.info(f"Got result: {result}")  # when this code explodes, no log appears
            if isinstance(result, ScrapflyError):  # error from scrapfly itself
                ...
            elif result.error:  # error from upstream
                ...
            else:  # success
                ...

That said, this code sometimes explodes on the async iterator, throwing an error like the following without returning a result at all.

<-- 422 | ERR::PROXY::TIMEOUT - Proxy connection or website was too slow and timeout - Proxy or website do not respond after 15s - Check if the website is online or geoblocking, if you are using session, rotate it..Checkout the related doc: https://scrapfly.io/docs/scrape-api/error/ERR::PROXY::TIMEOUT

It seems there is a bug where the async iterator can itself throw rather than yield an exception, which means the entire process blows up. Any ideas on how we might go about fixing this?

As an aside, I wanted to point out that the inconsistent use of typing throughout the library makes it very hard to debug what's actually going on and to reason about which errors can happen and when.

disable brotli

Hi, it seems that brotli is making our application use much more CPU. Could we have a parameter to disable brotli, even if it is installed?

Error on api_response

I'm having the following issue when scraping by yielding ScrapflyScrapyRequest:

ERROR scraper.py:246 Error downloading <GET https://immobilienscout24.de/expose/146870274>
Traceback (most recent call last):
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/api_response.py", line 105, in __call__
    return self.content_loader(content)
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/api_response.py", line 51, in _date_parser
    value[k] = _date_parser(v)
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/api_response.py", line 53, in _date_parser
    value[k] = v
TypeError: 'bytes' object does not support item assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 75, in process_exception
    response = yield deferred_from_coro(method(request=request, exception=exception, spider=spider))
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/scrapy/middleware.py", line 70, in process_exception
    raise exception
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/scrapy/downloader.py", line 82, in on_body_downloaded
    scrapfly_api_response:ScrapeApiResponse = spider.scrapfly_client._handle_response(
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/client.py", line 295, in _handle_response
    api_response = self._handle_api_response(
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/client.py", line 453, in _handle_api_response
    body = self.body_handler(response.content)
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/api_response.py", line 107, in __call__
    raise EncoderError(content=content.decode('utf-8')) from e
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0: invalid start byte

This error occurs in about 2% of my total requests and it's completely random: some URLs hit it within a few tries, but in most cases it doesn't repeat.

Environment Setup:

  • python = 3.9.9
  • MacOS = Apple M2
  • scrapfly-sdk = {extras = ["all"], version = "^0.8.9"}

"NameError: name 'url' is not defined" in pipelines.py

In pipelines.py, in the 2 places where it says

for config in scrape_configs:
      if isinstance(config, str): # string link - dummy convert
          config = scrape_config=ScrapeConfig(url=url)

it should say

for config in scrape_configs:
      if isinstance(config, str): # string link - dummy convert
          config = scrape_config=ScrapeConfig(url=config)  # <------------------------------ OBS!

302 Redirects and Browser Automation in js_scenario

Dear Scrapfly Support Team,

302 Redirect Handling: When my requests encounter a 302 redirect, the redirected request method remains 'POST'. However, the standard behavior should switch the method to 'GET'. I would like to know if there is a way to configure Scrapfly to handle 302 redirects by changing the method to 'GET'. Additionally, could you please inform me if there is an option to add allow_redirect functionality to control this behavior?

js_scenario Browser Automation: I am utilizing js_scenario for browser automation to perform actions such as clicking and inputting data. Unfortunately, these actions do not appear to be effective. I am concerned that this might be due to Cloudflare's protection, possibly identifying my requests as non-HTML content. Could you provide guidance on how to ensure that my browser automation actions are executed correctly and how to handle potential Cloudflare interruptions?
I would greatly appreciate your expertise and any recommendations you can provide to resolve these issues. Your assistance will be invaluable in helping me continue my work efficiently.

Thank you very much for your support.

Best regards
