scrapeghost's Introduction

scrapeghost

scrapeghost logo

scrapeghost is an experimental library for scraping websites using OpenAI's GPT.

Source: https://github.com/jamesturk/scrapeghost

Documentation: https://jamesturk.github.io/scrapeghost/

Issues: https://github.com/jamesturk/scrapeghost/issues

Use at your own risk. This library makes considerably expensive API calls ($0.36 for a single GPT-4 call on a moderately sized page). Cost estimates are based on the OpenAI pricing page and are not guaranteed to be accurate.

Features

The purpose of this library is to provide a convenient interface for exploring web scraping with GPT.

While the bulk of the work is done by the GPT model, scrapeghost provides a number of features to make it easier to use; a short usage sketch follows the feature lists below.

Python-based schema definition - Define the shape of the data you want to extract as any Python object, with as much or as little detail as you want.

Preprocessing

  • HTML cleaning - Remove unnecessary HTML to reduce the size and cost of API requests.
  • CSS and XPath selectors - Pre-filter HTML by writing a single CSS or XPath selector.
  • Auto-splitting - Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped.

Postprocessing

  • JSON validation - Ensure that the response is valid JSON. (With the option to kick it back to GPT for fixes if it's not.)
  • Schema validation - Go a step further, use a pydantic schema to validate the response.
  • Hallucination check - Does the data in the response truly exist on the page?

Cost Controls

  • Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
  • Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)
  • Allows setting a budget and stops the scraper if the budget is exceeded.
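
Here is a short usage sketch showing how these pieces fit together, modeled on the tutorial code later on this page (the specific schema, URL, and selector are illustrative):

    # A minimal sketch based on the tutorial code further down this page;
    # the schema, URL, and selector here are illustrative.
    from scrapeghost import SchemaScraper, CSS

    episode_scraper = SchemaScraper(
        {
            "title": "str",
            "episode_number": "int",
            "release_date": "YYYY-MM-DD",
        },
        extra_preprocessors=[CSS("div.page-content")],
    )

    resp = episode_scraper("https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb")
    print(resp.data)        # parsed data matching the schema
    print(resp.total_cost)  # running cost (USD) for this call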

scrapeghost's People

Contributors

jamesturk

scrapeghost's Issues

Extending the project for older python versions

Currently the project only supports Python 3.10 or higher; it does not support versions like 3.8 or 3.9.

Otherwise, if backward compatibility is not an option, please indicate with a badge which Python versions are supported.

Example Data/Case Studies Needed!

Hi!

This took off today on a couple of sites, I'm glad people found this interesting :)

I'm working on some real world tests that are giving me more ideas of how to improve the tooling & DX.

Since I'm scraping tons of pages anyway, I'd love to make it actually useful for someone. If you happen to know of someone doing meaningful work or research that could use some one-off data scraped, please put them in touch.

Ideally they'd:

  • Need data scraped that exists spread across <10k pages.
  • Be working with data that is 100% public.
  • Be doing something good for the world.
  • Not have a particularly time sensitive need.
  • Have some existing data already collected to compare against (not a hard requirement.)

Figured someone coming across this might know someone in that boat. My contact info is on my GitHub profile.

add non-JSON output option

CSV could save tokens for large pages but I'm concerned about parsing since JSON is well specified and CSV is not.

Probably not worth it but might be something to investigate.

`Response` class

  • raw response
  • API response metadata
  • parsed object
  • token counts & cost
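
A hypothetical sketch of what such a class might hold (the field names are assumptions, not the final API):

    # Hypothetical sketch of a Response container; field names are assumptions.
    from dataclasses import dataclass, field
    from typing import Any

    @dataclass
    class Response:
        api_responses: list = field(default_factory=list)  # raw responses / API metadata
        data: Any = None              # parsed object
        prompt_tokens: int = 0        # token counts...
        completion_tokens: int = 0
        total_cost: float = 0.0       # ...and cost in USD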

More Docs

  • pros and cons
  • usage / cookbook
  • cool ghost

Pros

  • nearly free in terms of time/energy to see if it works or not: throw it at a problem and see if the results are high enough quality; you might need to write one or two CSS selectors
  • great for a few pages where content changes frequently and scrapers might otherwise be too expensive to maintain

Cons

  • possibility of random errors that are hard to detect; should not be used without validation if results matter
  • scrapers will be dependent upon OpenAI until other models are available/equally good
  • every run costs something, even if pages don't change at all
  • not good for scrapers with lots of pages & lots of runs
  • bad at list pages due to context size issues and speed

mypy

Casually typing as I went, could use some attention.

  • fix mypy errors
  • add py.typed
  • add to linting

Better Automatic Token Reduction

To get this working in more places, more experimentation with token reduction is needed. How stripped down/minified can we get the HTML without causing reliability issues?

This isn't as straightforward as it seems, and many off-the-shelf tools are focused on different problems:

  • Minifiers seem to confuse GPT-4 a fair bit, so using off-the-shelf obfuscators/minifiers isn't the right solution here.
  • A lot of tools exist to sanitize HTML, but they often remove class names/etc. that are important to keep as hints. (and will be important if we get to the point of generating XPath)

It seems like the right approach is going to be an allow/disallow-list based cleaner, extending what's already been done in lxml.clean.
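
For example, a minimal allow-list cleaner could be built on lxml's Cleaner; the tag and attribute lists below are illustrative assumptions, not the library's defaults:

    # Sketch of an allow-list based cleaner built on lxml's Cleaner.
    # The allowed tags/attributes are illustrative, not the library's defaults.
    import lxml.html
    from lxml.html.clean import Cleaner

    ALLOWED_TAGS = ["a", "p", "div", "span", "table", "tr", "td", "th",
                    "ul", "ol", "li", "h1", "h2", "h3"]
    ALLOWED_ATTRS = frozenset(["class", "id", "href"])  # keep hints for selectors/XPath

    cleaner = Cleaner(
        scripts=True,               # drop <script>
        style=True,                 # drop <style> blocks and style attributes
        allow_tags=ALLOWED_TAGS,    # strip tags not on the list (their text is kept)
        remove_unknown_tags=False,  # required when allow_tags is used
        safe_attrs_only=True,
        safe_attrs=ALLOWED_ATTRS,
    )

    def shrink_html(html: str) -> str:
        doc = lxml.html.fromstring(html)
        return lxml.html.tostring(cleaner.clean_html(doc), encoding="unicode")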

scrapeghost.errors.TooManyTokens even though I am using auto_split_length

I am trying to scrape text information from a page and even if I use auto_split_length=1500 I get this error:
scrapeghost.errors.TooManyTokens: HTML is 4662 tokens, max for gpt-3.5-turbo is 4096

However, it appears that the code tries to create the chunks, but does so incorrectly, creating only a single chunk of 4662 tokens. This is the output before the error:
2023-05-14 21:55:04 [debug ] got HTML length=664813 url=https://www.ndpa.ch/
2023-05-14 21:55:04 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-05-14 21:55:04 [debug ] preprocessor from_nodes=1 name=CSS(header.SITE_HEADER_WRAPPER a[class!='image link-internal']) nodes=14
2023-05-14 21:55:04 [debug ] chunked tags num=1 sizes=[775]
2023-05-14 21:55:04 [info ] API request html_tokens=775 model=gpt-3.5-turbo
2023-05-14 21:55:16 [info ] API response completion_tokens=183 cost=0.002072 duration=12.036354303359985 finish_reason=stop prompt_tokens=853
2023-05-14 21:55:16 [debug ] postprocessor data=[{"url": "https://www.ndpa.ch"}, {"url": "https://www.ndpa.ch"}, {"url": "https://www.ndpa.ch/nativedigital"}, {"url": "https://www.ndpa.ch/authentication"}, {"url": "https://www.nativedigital.ch"}, {"url": "https://www.ndpa.ch/investors"}, {"url": "https://www.ndpa.ch/shop"}, {"url": "https://www.ndpa.ch/about-us"},
{"url": "https://www.ndpa.ch/team-1"}, {"url": "https://www.ndpa.ch/contacts"}, {"url": "https://www.ndpa.ch/privacy-policy"}, {"url": "https://www.ndpa.ch/terms-and-conditions"}, {"url": "https://www.ndpa.ch/items"}, {"url": "https://www.ndpa.ch/news"}] data_type=<class 'str'> postprocessor=JSONPostprocessor(nudge=False))
Scraped 14 site URLs, cost 0.002072
https://www.ndpa.ch
2023-05-14 21:55:16 [debug ] got HTML length=664813 url=https://www.ndpa.ch
2023-05-14 21:55:16 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-05-14 21:55:16 [debug ] preprocessor from_nodes=1 name=CSS(main.PAGES_CONTAINER) nodes=1
2023-05-14 21:55:16 [debug ] chunked tags num=1 sizes=[4662]
The error starts here, at line 32

My code:

    from dotenv import load_dotenv
    load_dotenv()

    import json
    from scrapeghost import SchemaScraper, CSS

    url = "https://www.ndpa.ch/"
    link_scraper = SchemaScraper(
        '{"url": "url"}',
        auto_split_length=1500,
        models=["gpt-3.5-turbo"],
        extra_preprocessors=[CSS("header.SITE_HEADER_WRAPPER a[class!='image link-internal']")],
    )
    schema = {
        "title": "str",
        "text": "str",
    }
    text_scraper = SchemaScraper(
        schema,
        auto_split_length=1500,
        models=["gpt-3.5-turbo"],
        extra_preprocessors=[CSS("main.PAGES_CONTAINER")],
    )

    resp = link_scraper(url)
    web_urls = resp.data
    print(f"Scraped {len(web_urls)} site URLs, cost {resp.total_cost}")

    web_data = []
    for web_url in web_urls:
        print(web_url)
        web_data.append(
            text_scraper(
                web_url["url"],
            ).data
        )

    print(f"Scraped {len(web_data)} site, ${text_scraper.stats()['total_cost']}")

    with open("scraped_info.json", "w") as f:
        json.dump(web_data, f, indent=2)

Discussion: Relicensing

I don't think the current license is the one I'll go with long term, but I also know from experience that the world of web scraping has a lot of people doing good work, but also a lot of spammy/abusive use cases. I want to make this as unappealing as possible to those people, but I also know they aren't going to mind violating a license with no teeth.

I'm just sort of rambling about this publicly in case anybody reading this has thoughts. Right now I'm considering going AGPL, but I'm open to ideas.

more examples

  • check all example output once more
  • more specific instructions
  • nudge
  • scrapple test w/ paginator
  • CSV instead of JSON
  • Open States people tester

explore using new functions-mode in GPT

The new feature announced yesterday explicitly mentions content extraction & indeed seems like it'd be a good choice to help constrain the output to JSON.

Altering the underlying request should be easy enough, and will likely improve output. This also suggests an opportunity to test the pluggable request feature that'll be needed for supporting multiple backends.
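
A rough sketch of what that request could look like with the function-calling API (the function name and schema below are illustrative; this uses the pre-1.0 openai Python client that was current at the time):

    # Illustrative sketch of constraining extraction output via function calling.
    import json
    import openai

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "system", "content": "Extract the requested fields from the provided HTML."},
            {"role": "user", "content": "<div class='episode'><h1>Operation Golden Orb</h1></div>"},
        ],
        functions=[
            {
                "name": "extract_data",  # hypothetical function name
                "description": "Return the extracted fields.",
                "parameters": {
                    "type": "object",
                    "properties": {"title": {"type": "string"}},
                    "required": ["title"],
                },
            }
        ],
        function_call={"name": "extract_data"},  # force the model to call the function
    )
    data = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])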

Optionally use puppeteer chromium and/or beautiful soup?

Here's some example code (using async, which you may consider removing). html5lib is a better parser than the default bs4 parser, but slightly slower.

    # Imports assumed by this snippet: async_playwright from playwright.async_api,
    # BeautifulSoup from bs4, and Optional from typing; url_obj, DOWNLOAD_HTML_TIMEOUT,
    # and logger are defined elsewhere in the poster's project.
    async def get_browser_html(browser_type):
        async with await browser_type.launch() as browser:
            context = await browser.new_context()
            page = await context.new_page()
            await page.goto(url_obj.url, timeout=DOWNLOAD_HTML_TIMEOUT * 1000)
            content = await page.content()
            await browser.close()
        logger.error(f"Exiting get_browser_html function for {browser_type.name}")
        return content

    try:
        async with async_playwright() as p:
            logger.error("Async playwright started")
            chromium_content = await get_browser_html(p.chromium)
            logger.error("Async playwright executed")

        def html_to_html(html: str) -> Optional[str]:
            soup = BeautifulSoup(html, "html5lib")
            text = soup.get_text().strip()
            html = str(soup).strip()
            if text == "":
                return None
            else:
                return html

        url_obj.chromium_html = html_to_html(chromium_content)

Functionality to JUST update existing CSS / XPath Selectors

I love this as a concept and would love to implement something like this in my project https://github.com/srhinos/primelooter, but my biggest holdup is overall cost.

One alternative I'd really support would be the ability to have the scraper, on top of returning the JSON data, also return selectors that could fetch that data without this library, using traditional (free) scraping libraries.

This would give me all the benefits of AI to continuously and autonomously update my selectors somewhere in code, saving me tons of time, while minimizing the drawback of a ton of cost.

Not sure if this necessarily fits the scope of this project, but IMO it would make this much safer to adopt in a lot more projects (especially mine <3). I'd be more than happy to work on something like this when I catch time over the next few weekends and contribute it back if it's out of scope for your current short-term plans.

meta

  • set up CI
  • pre-commit config
  • release script
  • cool ghost logo?

Example error

In the example in the following section of the tutorial, https://jamesturk.github.io/scrapeghost/tutorial/#putting-it-all-together, there are several errors in the code.

The extra_preprocessors field expects a list, not the CSS helper directly. The corrected code is below:

    import json
    from scrapeghost import SchemaScraper, CSS

    episode_list_scraper = SchemaScraper(
        '{"url": "url"}',
        auto_split_length=1500,
        # restrict this to GPT-3.5-Turbo to keep the cost down
        models=["gpt-3.5-turbo"],
        extra_preprocessors=[CSS(".mw-parser-output a[class!='image link-internal']")],
    )

    episode_scraper = SchemaScraper(
        {
            "title": "str",
            "episode_number": "int",
            "release_date": "YYYY-MM-DD",
            "guests": ["str"],
            "characters": ["str"],
        },
        extra_preprocessors=[CSS("div.page-content")],
    )

    resp = episode_list_scraper(
        "https://comedybangbang.fandom.com/wiki/Category:Episodes",
    )
    episode_urls = resp.data
    print(f"Scraped {len(episode_urls)} episode URLs, cost {resp.total_cost}")

    episode_data = []
    for episode_url in episode_urls:
        print(episode_url)
        episode_data.append(
            episode_scraper(
                episode_url["url"],
            ).data
        )

    # scrapers have a stats() method that returns a dict of statistics across all calls
    print(f"Scraped {len(episode_data)} episodes, ${episode_scraper.stats()['total_cost']}")

    with open("episode_data.json", "w") as f:
        json.dump(episode_data, f, indent=2)

request->response extensibility

Ideally there'd be a way to augment/replace any part of the process:

  • URL -> HTML (can pass in any HTML)
  • HTML -> clean HTML (preprocessors)
  • #18
  • JSON validation & refinement (postprocessors)

HallucinationChecker error

This is a very promising and simple tool, thanks for sharing it!

I was playing with it, but I'm getting the following error:

scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.

full trace:

File "[...]/scraper.py", line 17, in <module>
    response = scraper(url)
               ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/scrapers.py", line 142, in scrape
    return self._apply_postprocessors(  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/apicall.py", line 207, in _apply_postprocessors
    response = pp(response, self)
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/postprocessors.py", line 102, in __call__
    raise PostprocessingError(
scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.

My scraper.py code is close to the tutorial example, but maybe I'm doing it wrong:

from scrapeghost import SchemaScraper, CSS
from pprint import pprint

url = "https://www.boredpanda.com/bruce-lee-quotes"
schema = {
    "index": "int",
    "quote": "str",
}

scraper = SchemaScraper(
    schema,
    extra_preprocessors=[CSS(".open-list-items div.open-list-item:nth-child(-n+10) .bordered-description")],
)

response = scraper(url)
pprint(response.data)

I tried to disable the HallucinationChecker by overriding the postprocessors but it wasn't clear to me how to do that properly.

Thanks again for your work on this, it's very cool and exciting!

help improve tests

If you want to make a small contribution:

Run `just test` to run the tests with a coverage report.

  • Push coverage to 100% with a few small tests.
  • Mocked versions of tests/live/ (where practical)
  • fix mypy ignores

Add ability to provide few-shot examples

I was playing around with this and found that the scraping got way, way better when I provided one or two short HTML snippets and JSON output based on each. I was primarily interested in this project as a way to self-heal CSS selectors, and ran into similar issues where the selectors weren't 100% accurate. When I modified the code to allow me to add an example, the success rate went up to 100% in my handful of tests.
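
A hedged sketch of what injecting examples might look like at the prompt level (the message layout and function name are assumptions, not the library's API):

    # Hypothetical sketch: prepend one or two HTML -> JSON example pairs as
    # user/assistant turns before the real page is sent to the model.
    import json

    def build_messages(system_prompt: str, examples: list[tuple[str, dict]], html: str) -> list[dict]:
        messages = [{"role": "system", "content": system_prompt}]
        for example_html, example_output in examples:
            messages.append({"role": "user", "content": example_html})
            messages.append({"role": "assistant", "content": json.dumps(example_output)})
        messages.append({"role": "user", "content": html})
        return messages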

Make API backend pluggable to allow for non-OpenAI models

This seems like it'll be the most important task to make this more viable for people.

Alternative models will be cheaper, potentially much faster, allow running on someone's own hardware (LLaMa), and allow for more experimentation (e.g. models that are trained on HTML->JSON).

Quite a few models are attention free, which would remove the token limit altogether.

Models

OpenAssistant

No API as of June 2023, their FAQ makes it sound unlikely.

Cohere

TBD; a commenter below says it didn't work well, haven't evaluated.

Anthropic Claude

100k token limits were added in May; as soon as I get access, this will be my guinea pig for adding pluggable model support.

Others

Please add comments below if you've tried this approach with others that have an API.

Not all chunk sizes are limited by the auto_split_length parameter

Following the tutorial which fetches a single legislator, I tried to scrape them all by following the links from the index page (https://www.ilga.gov/house/default.asp).

When I do, there is always 1 chunk that is too large.

Here is an example output when auto_split_length is set to 1000:

2023-08-04 13:06:30 [debug    ] got HTML                       length=83130 url=https://www.ilga.gov/house/default.asp
2023-08-04 13:06:30 [debug    ] preprocessor                   from_nodes=1 name=CleanHTML nodes=1
2023-08-04 13:06:30 [debug    ] preprocessor                   from_nodes=1 name=XPath(//table/tr/td//table/tr) nodes=137
2023-08-04 13:06:30 [debug    ] chunked tags                   num=28 sizes=[576, 22215, 978, 903, 896, 902, 907, 904, 902, 902, 905, 899, 898, 900, 908, 903, 903, 895, 905, 903, 897, 901, 906, 901, 904, 907, 860, 449]
2023-08-04 13:06:30 [info     ] API request                    html_tokens=576 model=gpt-3.5-turbo-16k

As shown, all of the chunked tags are below the threshold of 1000 except the second one (22215). I suspect it may be a large script block, but I cannot tell.

I tried adding an additional preprocessor:

    # Imports assumed by this snippet: import lxml.html and
    # from lxml.html.clean import Cleaner.
    class DropJSandStyle:
        def __str__(self) -> str:
            return "DropJSandStyle()"

        def __call__(self, node: lxml.html.HtmlElement) -> list[lxml.html.HtmlElement]:
            cleaner = Cleaner()
            cleaner.javascript = True  # enable the JavaScript filter
            cleaner.style = True       # enable the style/stylesheet filter

            return cleaner.clean_html(node)

This didn't help either, so maybe it isn't JavaScript or styles inflating that second chunk.

What am I doing wrong, or how can I get that second chunk under 16k?

cost tracking

could just be accessors on Scraper class? hard code prices?
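
A minimal sketch of what hard-coded prices plus a running-cost accessor could look like (prices and attribute names here are illustrative assumptions, not library values):

    # Illustrative sketch: hard-coded per-1K-token prices and a running-cost accessor.
    PRICES_PER_1K = {
        "gpt-3.5-turbo": (0.0015, 0.002),  # (prompt, completion) USD per 1K tokens
        "gpt-4": (0.03, 0.06),
    }

    class CostTracker:
        def __init__(self, model: str = "gpt-3.5-turbo") -> None:
            self.model = model
            self.prompt_tokens = 0
            self.completion_tokens = 0

        @property
        def total_cost(self) -> float:
            prompt_price, completion_price = PRICES_PER_1K[self.model]
            return (
                self.prompt_tokens / 1000 * prompt_price
                + self.completion_tokens / 1000 * completion_price
            )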

Hallucination checker improvements

Right now it's a simple check of top-level strings. Could get much fancier if this proves to be a problem in practice:

  • subfields
  • way to select which fields are checked
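
A sketch of what a subfield-aware version might look like, walking nested values and optionally restricting which fields are checked (an assumption about the implementation, not current behavior):

    # Sketch: recurse into nested dicts/lists and report extracted strings that
    # don't appear in the page text. Field filtering is an assumed option.
    def find_hallucinations(data, page_text: str, fields: set[str] | None = None) -> list[str]:
        missing: list[str] = []

        def walk(value, path: str = "") -> None:
            if isinstance(value, dict):
                for key, val in value.items():
                    if fields is None or key in fields:
                        walk(val, f"{path}.{key}" if path else key)
            elif isinstance(value, list):
                for i, item in enumerate(value):
                    walk(item, f"{path}[{i}]")
            elif isinstance(value, str) and value and value not in page_text:
                missing.append(path)

        walk(data)
        return missing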

auto-repair

automatic repair of common mistakes should be possible with a second request
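
A rough sketch of that second-request repair pattern (the prompt wording is an assumption, and this uses the pre-1.0 openai client):

    # Sketch of auto-repair: send the invalid output and the parse error back to
    # the model and ask for corrected JSON. Prompt wording is an assumption.
    import json
    import openai

    def repair_json(bad_output: str, error: str, model: str = "gpt-3.5-turbo") -> dict:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "Fix the following output so that it is valid JSON. Respond with JSON only."},
                {"role": "user", "content": f"Error: {error}\n\nOutput:\n{bad_output}"},
            ],
        )
        return json.loads(resp["choices"][0]["message"]["content"])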

JavaScript Enabling

Some websites need JS to be enabled to get HTML content. In scrapers.py you use the Requests API to get content (screenshot of the relevant scrapers.py code was attached). But with the first website I tried, I got the result shown in the attached screenshot. I used the "playwright" library to fix it (screenshot attached).
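
A sketch of fetching rendered HTML with Playwright's sync API before handing it to the scraper (whether the scraper accepts pre-fetched HTML directly is an assumption here, per the extensibility issue above):

    # Sketch: render a JS-dependent page with Playwright, then scrape the HTML.
    from playwright.sync_api import sync_playwright

    def get_rendered_html(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, timeout=30_000)  # milliseconds
            html = page.content()
            browser.close()
        return html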

breaking change: adjust how models are selected

probably going to introduce a breaking change in 0.6 or 0.7 to how models are selected

Now that OpenAI is releasing different versions of models (3.5-turbo has 4 current versions between the token limit and the different iterations) I think model configuration/fallback needs to change a bit.

If 3.5 parsed the results but didn't do well, the next fallback should be 4; if the token limit was exceeded, however, it should go to 3.5-16k. There are a lot of possible conditions for this, and people who want 100% control can explicitly pass a single model, but the way the fallback chain is traversed can improve to reduce redundant requests.

Something like:

models=[GPT35T(allow_16k=True), GPT4(allow_32k=False)]

This would try gpt-3.5-turbo only once, either at 4k or 16k based on input.
Then it would try gpt-4.
Selecting particular revisions could work this way as well.

Probably makes the most sense to do this as part of #18
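
A hypothetical sketch of the traversal described above, expressed as "which model should be tried next given why the last attempt failed" (the outcome labels are assumptions, not library types):

    # Hypothetical sketch of the proposed fallback chain: upgrade within a model
    # family on token-limit errors, escalate to GPT-4 on bad results.
    def next_model(current: str, outcome: str) -> str | None:
        if current == "gpt-3.5-turbo":
            if outcome == "token_limit_exceeded":
                return "gpt-3.5-turbo-16k"  # same family, larger context
            if outcome == "bad_result":
                return "gpt-4"              # escalate to the stronger model
        if current == "gpt-3.5-turbo-16k" and outcome == "bad_result":
            return "gpt-4"
        return None                         # no further fallback; give up or raise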

pagination

If the website I scrape needs pagination to get all the data, can this tool click on "next" (or similar) and continue with the next page?

0.3.0

  • get snippets working in docs
  • put example output into tutorial
  • CLI test
  • a couple of small examples in docs (splits, additional instructions, selectors)
  • README
  • basic tests
  • cost predictor utility
  • max-cost parameter
  • selector overhaul

pydantic improvements

Right now the method utils._pydantic_to_simple_schema is pretty weak; it could use a more comprehensive version and tests.
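
A sketch of a fuller conversion, assuming pydantic v2's model_fields API (this is not the library's actual utility):

    # Sketch: convert a pydantic model into the simple {"field": "type"} schema
    # style used elsewhere on this page. Assumes pydantic v2.
    from typing import get_args, get_origin
    from pydantic import BaseModel

    def pydantic_to_simple_schema(model: type[BaseModel]) -> dict:
        schema: dict = {}
        for name, field in model.model_fields.items():
            ann = field.annotation
            if isinstance(ann, type) and issubclass(ann, BaseModel):
                schema[name] = pydantic_to_simple_schema(ann)  # nested model
            elif get_origin(ann) is list:
                (inner,) = get_args(ann) or (str,)
                if isinstance(inner, type) and issubclass(inner, BaseModel):
                    schema[name] = [pydantic_to_simple_schema(inner)]
                else:
                    schema[name] = [getattr(inner, "__name__", str(inner))]
            else:
                schema[name] = getattr(ann, "__name__", str(ann))
        return schema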

Preprocessing

Maybe more of a general scraping question; also maybe I'm in over my head, as I'm new to this. After preprocessing and the XPath/CSS selector, what gets sent to OpenAI?

Is it less helpful to just scrape all plain text on a page and then use the auto-splitter?

Autoscraper memoization?

Related to #7.

LLMs are seemingly happy to take even raw text and extract the structure from it, arguably even better than from needlessly verbose HTML, not to mention more cheaply.

Packages like autoscraper can take a web page with known structured data and build a model of what selectors most likely yielded that text.

So perhaps the proposed hybrid mode can be implemented not through the LLM generating code from HTML, but by first using it to extract the raw unformatted data and then writing small lambdas to normalize it to your expected format.

clean HTML

beyond mere selection, filter attributes that aren't likely to be used, remove extra characters, etc.
