scrapeghost's Introduction

scrapeghost

scrapeghost logo

scrapeghost is an experimental library for scraping websites using OpenAI's GPT.

Source: https://github.com/jamesturk/scrapeghost

Documentation: https://jamesturk.github.io/scrapeghost/

Issues: https://github.com/jamesturk/scrapeghost/issues

Use at your own risk. This library makes considerably expensive API calls ($0.36 for a single GPT-4 call on a moderately sized page). Cost estimates are based on the OpenAI pricing page and are not guaranteed to be accurate.

Features

The purpose of this library is to provide a convenient interface for exploring web scraping with GPT.

While the bulk of the work is done by the GPT model, scrapeghost provides a number of features to make it easier to use; a short usage sketch follows the feature lists below.

Python-based schema definition - Define the shape of the data you want to extract as any Python object, with as much or as little detail as you want.

Preprocessing

  • HTML cleaning - Remove unnecessary HTML to reduce the size and cost of API requests.
  • CSS and XPath selectors - Pre-filter HTML by writing a single CSS or XPath selector.
  • Auto-splitting - Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped.

Postprocessing

  • JSON validation - Ensure that the response is valid JSON. (With the option to kick it back to GPT for fixes if it's not.)
  • Schema validation - Go a step further, use a pydantic schema to validate the response.
  • Hallucination check - Does the data in the response truly exist on the page?

Cost Controls

  • Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
  • Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)
  • Allows setting a budget and stops the scraper if the budget is exceeded.
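
Here is a short usage sketch showing how these pieces fit together, modeled on the tutorial code later on this page (the specific schema, URL, and selector are illustrative):

    # A minimal sketch based on the tutorial code further down this page;
    # the schema, URL, and selector here are illustrative.
    from scrapeghost import SchemaScraper, CSS

    episode_scraper = SchemaScraper(
        {
            "title": "str",
            "episode_number": "int",
            "release_date": "YYYY-MM-DD",
        },
        extra_preprocessors=[CSS("div.page-content")],
    )

    resp = episode_scraper("https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb")
    print(resp.data)        # parsed data matching the schema
    print(resp.total_cost)  # running cost (USD) for this call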

scrapeghost's People

Contributors

jamesturk

scrapeghost's Issues

Extending the project for older python versions

Currently the project only supports Python 3.10 or higher; it does not support versions like 3.8 or 3.9.

Otherwise, if backward compatibility is not an option, please indicate with a badge which Python versions are supported.

Example Data/Case Studies Needed!

Hi!

This took off today on a couple of sites, I'm glad people found this interesting :)

I'm working on some real world tests that are giving me more ideas of how to improve the tooling & DX.

Since I'm scraping tons of pages anyway, I'd love to make it actually useful for someone. If you happen to know of someone doing meaningful work or research that could use some one-off data scraped, please put them in touch.

Ideally they'd:

  • Need data scraped that exists spread across <10k pages.
  • Be working with data that is 100% public.
  • Be doing something good for the world.
  • Not have a particularly time sensitive need.
  • Have some existing data already collected to compare against (not a hard requirement.)

Figured someone coming across this might know someone in that boat. My contact info is on my GitHub profile.

add non-JSON output option

CSV could save tokens for large pages but I'm concerned about parsing since JSON is well specified and CSV is not.

Probably not worth it but might be something to investigate.

`Response` class

  • raw response
  • API response metadata
  • parsed object
  • token counts & cost
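
A hypothetical sketch of what such a class might hold (the field names are assumptions, not the final API):

    # Hypothetical sketch of a Response container; field names are assumptions.
    from dataclasses import dataclass, field
    from typing import Any

    @dataclass
    class Response:
        api_responses: list = field(default_factory=list)  # raw responses / API metadata
        data: Any = None              # parsed object
        prompt_tokens: int = 0        # token counts...
        completion_tokens: int = 0
        total_cost: float = 0.0       # ...and cost in USD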

More Docs

  • pros and cons
  • usage / cookbook
  • cool ghost

Pros

  • nearly free in terms of time/energy to see if it works or not: throw it at a problem and see if the results are high enough quality; you might need to write one or two CSS selectors
  • great for a few pages where content changes frequently and scrapers might otherwise be too expensive to maintain

Cons

  • possibility of random errors that are hard to detect; should not be used without validation if results matter
  • scrapers will be dependent upon OpenAI until other models are available/equally good
  • every run costs something, even if pages don't change at all
  • not good for scrapers with lots of pages & lots of runs
  • bad at list pages due to context size issues and speed

mypy

Casually typing as I went, could use some attention.

  • fix mypy errors
  • add py.typed
  • add to linting

Better Automatic Token Reduction

To get this working in more places, more experimentation with token reduction is needed. How stripped down/minified can we get the HTML without causing reliability issues?

This isn't as straightforward as it seems, and many off-the-shelf tools are focused on different problems:

  • Minifiers seem to confuse GPT-4 a fair bit, so using off-the-shelf obfuscators/minifiers isn't the right solution here.
  • A lot of tools exist to sanitize HTML, but they often remove class names/etc. that are important to keep as hints. (and will be important if we get to the point of generating XPath)

It seems like the right approach is going to be an allow/disallow-list based cleaner, extending what's already been done in lxml.clean.
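
For example, a minimal allow-list cleaner could be built on lxml's Cleaner; the tag and attribute lists below are illustrative assumptions, not the library's defaults:

    # Sketch of an allow-list based cleaner built on lxml's Cleaner.
    # The allowed tags/attributes are illustrative, not the library's defaults.
    import lxml.html
    from lxml.html.clean import Cleaner

    ALLOWED_TAGS = ["a", "p", "div", "span", "table", "tr", "td", "th",
                    "ul", "ol", "li", "h1", "h2", "h3"]
    ALLOWED_ATTRS = frozenset(["class", "id", "href"])  # keep hints for selectors/XPath

    cleaner = Cleaner(
        scripts=True,               # drop <script>
        style=True,                 # drop <style> blocks and style attributes
        allow_tags=ALLOWED_TAGS,    # strip tags not on the list (their text is kept)
        remove_unknown_tags=False,  # required when allow_tags is used
        safe_attrs_only=True,
        safe_attrs=ALLOWED_ATTRS,
    )

    def shrink_html(html: str) -> str:
        doc = lxml.html.fromstring(html)
        return lxml.html.tostring(cleaner.clean_html(doc), encoding="unicode")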

scrapeghost.errors.TooManyTokens even though I am using auto_split_length

I am trying to scrape text information from a page and even if I use auto_split_length=1500 I get this error:
scrapeghost.errors.TooManyTokens: HTML is 4662 tokens, max for gpt-3.5-turbo is 4096

However, it appears that the code tries to create the chunks, but does so incorrectly, creating only a single chunk of 4662 tokens. This is the output before the error:
2023-05-14 21:55:04 [debug ] got HTML length=664813 url=https://www.ndpa.ch/
2023-05-14 21:55:04 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-05-14 21:55:04 [debug ] preprocessor from_nodes=1 name=CSS(header.SITE_HEADER_WRAPPER a[class!='image link-internal']) nodes=14
2023-05-14 21:55:04 [debug ] chunked tags num=1 sizes=[775]
2023-05-14 21:55:04 [info ] API request html_tokens=775 model=gpt-3.5-turbo
2023-05-14 21:55:16 [info ] API response completion_tokens=183 cost=0.002072 duration=12.036354303359985 finish_reason=stop prompt_tokens=853
2023-05-14 21:55:16 [debug ] postprocessor data=[{"url": "https://www.ndpa.ch"}, {"url": "https://www.ndpa.ch"}, {"url": "https://www.ndpa.ch/nativedigital"}, {"url": "https://www.ndpa.ch/authentication"}, {"url": "https://www.nativedigital.ch"}, {"url": "https://www.ndpa.ch/investors"}, {"url": "https://www.ndpa.ch/shop"}, {"url": "https://www.ndpa.ch/about-us"},
{"url": "https://www.ndpa.ch/team-1"}, {"url": "https://www.ndpa.ch/contacts"}, {"url": "https://www.ndpa.ch/privacy-policy"}, {"url": "https://www.ndpa.ch/terms-and-conditions"}, {"url": "https://www.ndpa.ch/items"}, {"url": "https://www.ndpa.ch/news"}] data_type=<class 'str'> postprocessor=JSONPostprocessor(nudge=False))
Scraped 14 site URLs, cost 0.002072
https://www.ndpa.ch
2023-05-14 21:55:16 [debug ] got HTML length=664813 url=https://www.ndpa.ch
2023-05-14 21:55:16 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-05-14 21:55:16 [debug ] preprocessor from_nodes=1 name=CSS(main.PAGES_CONTAINER) nodes=1
2023-05-14 21:55:16 [debug ] chunked tags num=1 sizes=[4662]
The error starts here, at line 32

My code:

    from dotenv import load_dotenv
    load_dotenv()

    import json
    from scrapeghost import SchemaScraper, CSS

    url = "https://www.ndpa.ch/"
    link_scraper = SchemaScraper(
        '{"url": "url"}',
        auto_split_length=1500,
        models=["gpt-3.5-turbo"],
        extra_preprocessors=[CSS("header.SITE_HEADER_WRAPPER a[class!='image link-internal']")],
    )
    schema = {
        "title": "str",
        "text": "str",
    }
    text_scraper = SchemaScraper(
        schema,
        auto_split_length=1500,
        models=["gpt-3.5-turbo"],
        extra_preprocessors=[CSS("main.PAGES_CONTAINER")],
    )

    resp = link_scraper(url)
    web_urls = resp.data
    print(f"Scraped {len(web_urls)} site URLs, cost {resp.total_cost}")

    web_data = []
    for web_url in web_urls:
        print(web_url)
        web_data.append(
            text_scraper(
                web_url["url"],
            ).data
        )

    print(f"Scraped {len(web_data)} site, ${text_scraper.stats()['total_cost']}")

    with open("scraped_info.json", "w") as f:
        json.dump(web_data, f, indent=2)

Discussion: Relicensing

I don't think the current license is the one I'll go with long term, but I also know from experience that the world of web scraping has a lot of people doing good work, but also a lot of spammy/abusive use cases. I want to make this as unappealing as possible to those people, but I also know they aren't going to mind violating a license with no teeth.

I'm just sort of rambling about this publicly in case anybody reading this has thoughts. Right now I'm considering going AGPL, but I'm open to ideas.

more examples

  • check all example output once more
  • more specific instructions
  • nudge
  • scrapple test w/ paginator
  • CSV instead of JSON
  • Open States people tester

explore using new functions-mode in GPT

The new feature announced yesterday explicitly mentions content extraction & indeed seems like it'd be a good choice to help constrain the output to JSON.

Altering the underlying request should be easy enough, and will likely improve output. This also suggests an opportunity to test the pluggable request feature that'll be needed for supporting multiple backends.
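
A rough sketch of what that request could look like with the function-calling API (the function name and schema below are illustrative; this uses the pre-1.0 openai Python client that was current at the time):

    # Illustrative sketch of constraining extraction output via function calling.
    import json
    import openai

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "system", "content": "Extract the requested fields from the provided HTML."},
            {"role": "user", "content": "<div class='episode'><h1>Operation Golden Orb</h1></div>"},
        ],
        functions=[
            {
                "name": "extract_data",  # hypothetical function name
                "description": "Return the extracted fields.",
                "parameters": {
                    "type": "object",
                    "properties": {"title": {"type": "string"}},
                    "required": ["title"],
                },
            }
        ],
        function_call={"name": "extract_data"},  # force the model to call the function
    )
    data = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])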

Optionally use puppeteer chromium and/or beautiful soup?

Here's some example code (using async, which you may consider removing). html5lib is a better parser than the default bs4 parser, but slightly slower.

    # Imports assumed by this snippet: async_playwright from playwright.async_api,
    # BeautifulSoup from bs4, and Optional from typing; url_obj, DOWNLOAD_HTML_TIMEOUT,
    # and logger are defined elsewhere in the poster's project.
    async def get_browser_html(browser_type):
        async with await browser_type.launch() as browser:
            context = await browser.new_context()
            page = await context.new_page()
            await page.goto(url_obj.url, timeout=DOWNLOAD_HTML_TIMEOUT * 1000)
            content = await page.content()
            await browser.close()
        logger.error(f"Exiting get_browser_html function for {browser_type.name}")
        return content

    try:
        async with async_playwright() as p:
            logger.error("Async playwright started")
            chromium_content = await get_browser_html(p.chromium)
            logger.error("Async playwright executed")

        def html_to_html(html: str) -> Optional[str]:
            soup = BeautifulSoup(html, "html5lib")
            text = soup.get_text().strip()
            html = str(soup).strip()
            if text == "":
                return None
            else:
                return html

        url_obj.chromium_html = html_to_html(chromium_content)

Functionality to JUST update existing CSS / XPath Selectors

I love this as a concept and would love to implement something like this in my project https://github.com/srhinos/primelooter, but my biggest holdup is overall cost.

One alternative I'd really support would be the ability to have the scraper, on top of returning the JSON data, also return selectors that could fetch that data without this library, using traditional (free) scraping libraries.

This would give me all the benefits of AI to continuously and autonomously update my selectors somewhere in code, saving me tons of time, while minimizing the drawback of a ton of cost.

Not sure if this necessarily fits the scope of this project, but IMO it would make this much safer to adopt in a lot more projects (especially mine <3). I'd be more than happy to work on something like this when I catch time over the next few weekends and contribute it back if it's out of scope for your current short-term plans.

meta

  • set up CI
  • pre-commit config
  • release script
  • cool ghost logo?

Example error

In the example in the following section of the tutorial, https://jamesturk.github.io/scrapeghost/tutorial/#putting-it-all-together, there are several errors in the code.

The extra_preprocessors field expects a list, not the CSS helper directly. The corrected code is below:

    import json
    from scrapeghost import SchemaScraper, CSS

    episode_list_scraper = SchemaScraper(
        '{"url": "url"}',
        auto_split_length=1500,
        # restrict this to GPT-3.5-Turbo to keep the cost down
        models=["gpt-3.5-turbo"],
        extra_preprocessors=[CSS(".mw-parser-output a[class!='image link-internal']")],
    )

    episode_scraper = SchemaScraper(
        {
            "title": "str",
            "episode_number": "int",
            "release_date": "YYYY-MM-DD",
            "guests": ["str"],
            "characters": ["str"],
        },
        extra_preprocessors=[CSS("div.page-content")],
    )

    resp = episode_list_scraper(
        "https://comedybangbang.fandom.com/wiki/Category:Episodes",
    )
    episode_urls = resp.data
    print(f"Scraped {len(episode_urls)} episode URLs, cost {resp.total_cost}")

    episode_data = []
    for episode_url in episode_urls:
        print(episode_url)
        episode_data.append(
            episode_scraper(
                episode_url["url"],
            ).data
        )

    # scrapers have a stats() method that returns a dict of statistics across all calls
    print(f"Scraped {len(episode_data)} episodes, ${episode_scraper.stats()['total_cost']}")

    with open("episode_data.json", "w") as f:
        json.dump(episode_data, f, indent=2)

request->response extensibility

Ideally there'd be a way to augment/replace any part of the process:

  • URL -> HTML (can pass in any HTML)
  • HTML -> clean HTML (preprocessors)
  • #18
  • JSON validation & refinement (postprocessors)

HallucinationChecker error

This is a very promising and simple tool, thanks for sharing it!

I was playing with it, but I'm getting the following error:

scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.

full trace:

File "[...]/scraper.py", line 17, in <module>
    response = scraper(url)
               ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/scrapers.py", line 142, in scrape
    return self._apply_postprocessors(  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/apicall.py", line 207, in _apply_postprocessors
    response = pp(response, self)
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/postprocessors.py", line 102, in __call__
    raise PostprocessingError(
scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.

My scraper.py code is close to the tutorial example, but maybe I'm doing it wrong:

from scrapeghost import SchemaScraper, CSS
from pprint import pprint

url = "https://www.boredpanda.com/bruce-lee-quotes"
schema = {
    "index": "int",
    "quote": "str",
}

scraper = SchemaScraper(
    schema,
    extra_preprocessors=[CSS(".open-list-items div.open-list-item:nth-child(-n+10) .bordered-description")],
)

response = scraper(url)
pprint(response.data)

I tried to disable the HallucinationChecker by overriding the postprocessors but it wasn't clear to me how to do that properly.

Thanks again for your work on this, it's very cool and exciting!

help improve tests

If you want to make a small contribution:

Run `just test` to run the tests with a coverage report.

  • Push coverage to 100% with a few small tests.
  • Mocked versions of tests/live/ (where practical)
  • fix mypy ignores

Add ability to provide few-shot examples

I was playing around with this and found that the scraping got way, way better when I provided one or two short HTML snippets and JSON output based on each. I was primarily interested in this project as a way to self-heal CSS selectors, and ran into similar issues where the selectors weren't 100% accurate. When I modified the code to allow me to add an example, the success rate went up to 100% in my handful of tests.
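
A hedged sketch of what injecting examples might look like at the prompt level (the message layout and function name are assumptions, not the library's API):

    # Hypothetical sketch: prepend one or two HTML -> JSON example pairs as
    # user/assistant turns before the real page is sent to the model.
    import json

    def build_messages(system_prompt: str, examples: list[tuple[str, dict]], html: str) -> list[dict]:
        messages = [{"role": "system", "content": system_prompt}]
        for example_html, example_output in examples:
            messages.append({"role": "user", "content": example_html})
            messages.append({"role": "assistant", "content": json.dumps(example_output)})
        messages.append({"role": "user", "content": html})
        return messages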

Make API backend pluggable to allow for non-OpenAI models

This seems like it'll be the most important task to make this more viable for people.

Alternative models will be cheaper, potentially much faster, allow running on someone's own hardware (LLaMa), and allow for more experimentation (e.g. models that are trained on HTML->JSON).

Quite a few models are attention free, which would remove the token limit altogether.

Models

OpenAssistant

No API as of June 2023, their FAQ makes it sound unlikely.

Cohere

TBD; a commenter below says it didn't work well, haven't evaluated.

Anthropic Claude

100k token limits were added in May; as soon as I get access, this will be my guinea pig for adding pluggable model support.

Others

Please add comments below if you've tried this approach with others that have an API.

Not all chunk sizes are limited by the auto_split_length parameter

Following the tutorial which fetches a single legislator, I tried to scrape them all by following the links from the index page (https://www.ilga.gov/house/default.asp).

When I do, there is always 1 chunk that is too large.

Here is an example output when auto_split_length is set to 1000:

2023-08-04 13:06:30 [debug    ] got HTML                       length=83130 url=https://www.ilga.gov/house/default.asp
2023-08-04 13:06:30 [debug    ] preprocessor                   from_nodes=1 name=CleanHTML nodes=1
2023-08-04 13:06:30 [debug    ] preprocessor                   from_nodes=1 name=XPath(//table/tr/td//table/tr) nodes=137
2023-08-04 13:06:30 [debug    ] chunked tags                   num=28 sizes=[576, 22215, 978, 903, 896, 902, 907, 904, 902, 902, 905, 899, 898, 900, 908, 903, 903, 895, 905, 903, 897, 901, 906, 901, 904, 907, 860, 449]
2023-08-04 13:06:30 [info     ] API request                    html_tokens=576 model=gpt-3.5-turbo-16k

As shown, all of the chunked tags are below the threshold of 1000 except the second one (22215). I suspect it may be a large script block, but I cannot tell.

I tried adding an additional preprocessor:

    # Imports assumed by this snippet: import lxml.html and
    # from lxml.html.clean import Cleaner.
    class DropJSandStyle:
        def __str__(self) -> str:
            return "DropJSandStyle()"

        def __call__(self, node: lxml.html.HtmlElement) -> list[lxml.html.HtmlElement]:
            cleaner = Cleaner()
            cleaner.javascript = True  # enable the JavaScript filter
            cleaner.style = True       # enable the style/stylesheet filter

            return cleaner.clean_html(node)

This didn't help either, so maybe it isn't JavaScript or styles inflating that second chunk.

What am I doing wrong, or how can I get that second chunk under 16k?

cost tracking

could just be accessors on Scraper class? hard code prices?
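
A minimal sketch of what hard-coded prices plus a running-cost accessor could look like (prices and attribute names here are illustrative assumptions, not library values):

    # Illustrative sketch: hard-coded per-1K-token prices and a running-cost accessor.
    PRICES_PER_1K = {
        "gpt-3.5-turbo": (0.0015, 0.002),  # (prompt, completion) USD per 1K tokens
        "gpt-4": (0.03, 0.06),
    }

    class CostTracker:
        def __init__(self, model: str = "gpt-3.5-turbo") -> None:
            self.model = model
            self.prompt_tokens = 0
            self.completion_tokens = 0

        @property
        def total_cost(self) -> float:
            prompt_price, completion_price = PRICES_PER_1K[self.model]
            return (
                self.prompt_tokens / 1000 * prompt_price
                + self.completion_tokens / 1000 * completion_price
            )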

Hallucination checker improvements

Right now it's a simple check of top-level strings. Could get much fancier if this proves to be a problem in practice:

  • subfields
  • way to select which fields are checked
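
A sketch of what a subfield-aware version might look like, walking nested values and optionally restricting which fields are checked (an assumption about the implementation, not current behavior):

    # Sketch: recurse into nested dicts/lists and report extracted strings that
    # don't appear in the page text. Field filtering is an assumed option.
    def find_hallucinations(data, page_text: str, fields: set[str] | None = None) -> list[str]:
        missing: list[str] = []

        def walk(value, path: str = "") -> None:
            if isinstance(value, dict):
                for key, val in value.items():
                    if fields is None or key in fields:
                        walk(val, f"{path}.{key}" if path else key)
            elif isinstance(value, list):
                for i, item in enumerate(value):
                    walk(item, f"{path}[{i}]")
            elif isinstance(value, str) and value and value not in page_text:
                missing.append(path)

        walk(data)
        return missing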

auto-repair

automatic repair of common mistakes should be possible with a second request
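
A rough sketch of that second-request repair pattern (the prompt wording is an assumption, and this uses the pre-1.0 openai client):

    # Sketch of auto-repair: send the invalid output and the parse error back to
    # the model and ask for corrected JSON. Prompt wording is an assumption.
    import json
    import openai

    def repair_json(bad_output: str, error: str, model: str = "gpt-3.5-turbo") -> dict:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "Fix the following output so that it is valid JSON. Respond with JSON only."},
                {"role": "user", "content": f"Error: {error}\n\nOutput:\n{bad_output}"},
            ],
        )
        return json.loads(resp["choices"][0]["message"]["content"])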

JavaScript Enabling

Some websites need JS to be enabled to get HTML content. In scrapers.py you use the Requests API to get content (screenshot of the relevant scrapers.py code was attached). But with the first website I tried, I got the result shown in the attached screenshot. I used the "playwright" library to fix it (screenshot attached).
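
A sketch of fetching rendered HTML with Playwright's sync API before handing it to the scraper (whether the scraper accepts pre-fetched HTML directly is an assumption here, per the extensibility issue above):

    # Sketch: render a JS-dependent page with Playwright, then scrape the HTML.
    from playwright.sync_api import sync_playwright

    def get_rendered_html(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, timeout=30_000)  # milliseconds
            html = page.content()
            browser.close()
        return html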

breaking change: adjust how models are selected

probably going to introduce a breaking change in 0.6 or 0.7 to how models are selected

Now that OpenAI is releasing different versions of models (3.5-turbo has 4 current versions between the token limit and the different iterations) I think model configuration/fallback needs to change a bit.

If 3.5 parsed the results but didn't do well, the next fallback should be 4; if the token limit was exceeded, however, it should go to 3.5-16k. There are a lot of possible conditions for this, and people who want 100% control can explicitly pass a single model, but the way the fallback chain is traversed can improve to reduce redundant requests.

Something like:

models=[GPT35T(allow_16k=True), GPT4(allow_32k=False)]

This would try gpt-3.5-turbo only once, either at 4k or 16k based on input.
Then it would try gpt-4.
Selecting particular revisions could work this way as well.

Probably makes the most sense to do this as part of #18
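
A hypothetical sketch of the traversal described above, expressed as "which model should be tried next given why the last attempt failed" (the outcome labels are assumptions, not library types):

    # Hypothetical sketch of the proposed fallback chain: upgrade within a model
    # family on token-limit errors, escalate to GPT-4 on bad results.
    def next_model(current: str, outcome: str) -> str | None:
        if current == "gpt-3.5-turbo":
            if outcome == "token_limit_exceeded":
                return "gpt-3.5-turbo-16k"  # same family, larger context
            if outcome == "bad_result":
                return "gpt-4"              # escalate to the stronger model
        if current == "gpt-3.5-turbo-16k" and outcome == "bad_result":
            return "gpt-4"
        return None                         # no further fallback; give up or raise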

pagination

If the website I scrape needs pagination to get all the data, can this tool click on "next" (or similar) and continue with the next page?

0.3.0

  • get snippets working in docs
  • put example output into tutorial
  • CLI test
  • a couple of small examples in docs (splits, additional instructions, selectors)
  • README
  • basic tests
  • cost predictor utility
  • max-cost parameter
  • selector overhaul

pydantic improvements

Right now the method utils._pydantic_to_simple_schema is pretty weak; it could use a more comprehensive version and tests.
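
A sketch of a fuller conversion, assuming pydantic v2's model_fields API (this is not the library's actual utility):

    # Sketch: convert a pydantic model into the simple {"field": "type"} schema
    # style used elsewhere on this page. Assumes pydantic v2.
    from typing import get_args, get_origin
    from pydantic import BaseModel

    def pydantic_to_simple_schema(model: type[BaseModel]) -> dict:
        schema: dict = {}
        for name, field in model.model_fields.items():
            ann = field.annotation
            if isinstance(ann, type) and issubclass(ann, BaseModel):
                schema[name] = pydantic_to_simple_schema(ann)  # nested model
            elif get_origin(ann) is list:
                (inner,) = get_args(ann) or (str,)
                if isinstance(inner, type) and issubclass(inner, BaseModel):
                    schema[name] = [pydantic_to_simple_schema(inner)]
                else:
                    schema[name] = [getattr(inner, "__name__", str(inner))]
            else:
                schema[name] = getattr(ann, "__name__", str(ann))
        return schema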

Preprocessing

Maybe more of a general scraping question; also maybe I'm in over my head, as I'm new to this. After preprocessing and the XPath/CSS selector, what gets sent to OpenAI?

Is it less helpful to just scrape all plain text on a page and then use the auto-splitter?

Autoscraper memoization?

Related to #7.

LLMs are seemingly happy to take even raw text and extract the structure from it, arguably even better than from needlessly verbose HTML, not to mention more cheaply.

Packages like autoscraper can take a web page with known structured data and build a model of what selectors most likely yielded that text.

So perhaps the proposed hybrid mode can be implemented not through the LLM generating code from HTML, but by first using it to extract the raw unformatted data and then writing small lambdas to normalize it to your expected format.

clean HTML

beyond mere selection, filter attributes that aren't likely to be used, remove extra characters, etc.
