jamesturk / scrapeghost
Experimental library for scraping websites using OpenAI's GPT API.
Home Page: https://jamesturk.github.io/scrapeghost/
License: Other
needs to be rethought with the new API
If the website I scrape needs pagination to get all the data, can this tool click on "next" (or similar) and continue with the next page?
I don't think the current license is the one I'll go with long term, but I also know from experience that the world of web scraping includes a lot of people doing good work alongside a lot of spammy/abusive use cases. I want to make this as unappealing as possible to those people, but I also know they aren't going to mind violating a license with no teeth.
I'm just sort of rambling about this publicly in case anybody reading this has thoughts; right now I'm considering going AGPL, but I'm open to ideas.
Casually typing as I went, could use some attention.
py.typed
count tokens in/out for easier reporting on total usage
automatic repair of common mistakes should be possible with a second request
The new feature announced yesterday explicitly mentions content extraction & indeed seems like it'd be a good choice to help constrain the output to JSON.
Altering the underlying request should be easy enough, and will likely improve output. This also suggests an opportunity to test the pluggable request feature that'll be needed for supporting multiple backends.
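For context, a rough sketch of what a function-calling request looks like with the openai<1.0 client; the schema and messages below are illustrative, not the library's actual prompt:

    # Sketch: OpenAI function calling (June 2023 API, openai<1.0) to constrain
    # output to JSON. Schema and messages are illustrative placeholders.
    import json
    import openai

    functions = [{
        "name": "extract_data",
        "description": "Extract structured data from the provided HTML.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "url": {"type": "string"},
            },
            "required": ["title", "url"],
        },
    }]

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": "<li><a href='/ep/1'>Episode 1</a></li>"}],
        functions=functions,
        function_call={"name": "extract_data"},  # force the model to "call" the function
    )
    print(json.loads(resp["choices"][0]["message"]["function_call"]["arguments"]))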
If you want to make a small contribution:
Run `just test` to run the tests with a coverage report.
To get this working in more places, more experimentation with token reduction is needed. How stripped down/minified can we get the HTML without causing reliability issues?
This isn't as straightforward as it seems, and many off-the-shelf tools are focused on different problems:
It seems like the right approach is going to be an allow/disallow list to extend and expand upon what's already been done in lxml.clean.
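For illustration, a rough sketch of an allow-list cleaner built on lxml's Cleaner; the tag and attribute lists are placeholder guesses, not settled defaults:

    # Sketch: aggressive allow-list cleaning with lxml's Cleaner to cut token count.
    # The allow_tags / safe_attrs choices are illustrative only.
    import lxml.html
    from lxml.html.clean import Cleaner

    cleaner = Cleaner(
        allow_tags=["a", "p", "li", "ul", "ol", "table", "tr", "td", "th", "h1", "h2", "h3"],
        remove_unknown_tags=False,  # required when allow_tags is used
        kill_tags=["script", "style", "nav", "footer"],
        safe_attrs_only=True,
        safe_attrs={"href"},        # drop every attribute except links
    )

    doc = lxml.html.fromstring(
        "<div><script>x()</script><p class='big'><a href='/a'>hi</a></p></div>"
    )
    print(lxml.html.tostring(cleaner.clean_html(doc), encoding="unicode"))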
I was playing around with this and found that the scraping got way, way better when I provided one or two short HTML snippets and JSON output based on each. I was primarily interested in this project as a way to self-heal CSS selectors, and ran into similar issues where the selectors weren't 100% accurate. When I modified the code to allow me to add an example, the success rate went up to 100% in my handful of tests.
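A minimal sketch of that few-shot idea using a plain OpenAI chat call (openai<1.0 interface) rather than scrapeghost's internal request builder; the example snippet and schema are made up:

    # Sketch: prepend one HTML -> JSON example pair to the chat messages (few-shot).
    # Uses the legacy openai<1.0 ChatCompletion interface; adjust for newer SDKs.
    import json
    import openai

    example_html = '<li><a href="/ep/1">Episode 1</a> 2020-01-05</li>'  # illustrative
    example_json = {"url": "/ep/1", "title": "Episode 1", "date": "2020-01-05"}

    def scrape_with_example(page_html: str, schema: dict) -> dict:
        messages = [
            {"role": "system",
             "content": f"Extract data from HTML matching this schema: {json.dumps(schema)}. "
                        "Respond with JSON only."},
            # the few-shot example: show the model one input/output pair first
            {"role": "user", "content": example_html},
            {"role": "assistant", "content": json.dumps(example_json)},
            # then the real page
            {"role": "user", "content": page_html},
        ]
        resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
        return json.loads(resp["choices"][0]["message"]["content"])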
CSV could save tokens for large pages but I'm concerned about parsing since JSON is well specified and CSV is not.
Probably not worth it but might be something to investigate.
Maybe this is more of a general scraping question, and maybe I'm in over my head since I'm new to this: after preprocessing and the XPath/CSS selectors, what actually gets sent to OpenAI?
Is it less helpful to just scrape all the plain text on a page and then use the auto splitter?
I am trying to scrape text information from a page and even if I use auto_split_length=1500 I get this error:
scrapeghost.errors.TooManyTokens: HTML is 4662 tokens, max for gpt-3.5-turbo is 4096
However, it appears that the code tries to create the chunks but does so incorrectly, producing only a single chunk of 4662 tokens. This is the output before the error:
2023-05-14 21:55:04 [debug ] got HTML length=664813 url=https://www.ndpa.ch/
2023-05-14 21:55:04 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-05-14 21:55:04 [debug ] preprocessor from_nodes=1 name=CSS(header.SITE_HEADER_WRAPPER a[class!='image link-internal']) nodes=14
2023-05-14 21:55:04 [debug ] chunked tags num=1 sizes=[775]
2023-05-14 21:55:04 [info ] API request html_tokens=775 model=gpt-3.5-turbo
2023-05-14 21:55:16 [info ] API response completion_tokens=183 cost=0.002072 duration=12.036354303359985 finish_reason=stop prompt_tokens=853
2023-05-14 21:55:16 [debug ] postprocessor data=[{"url": "https://www.ndpa.ch"}, {"url": "https://www.ndpa.ch"}, {"url": "https://www.ndpa.ch/nativedigital"}, {"url": "https://www.ndpa.ch/authentication"}, {"url": "https://www.nativedigital.ch"}, {"url": "https://www.ndpa.ch/investors"}, {"url": "https://www.ndpa.ch/shop"}, {"url": "https://www.ndpa.ch/about-us"},
{"url": "https://www.ndpa.ch/team-1"}, {"url": "https://www.ndpa.ch/contacts"}, {"url": "https://www.ndpa.ch/privacy-policy"}, {"url": "https://www.ndpa.ch/terms-and-conditions"}, {"url": "https://www.ndpa.ch/items"}, {"url": "https://www.ndpa.ch/news"}] data_type=<class 'str'> postprocessor=JSONPostprocessor(nudge=False))
Scraped 14 site URLs, cost 0.002072
https://www.ndpa.ch
2023-05-14 21:55:16 [debug ] got HTML length=664813 url=https://www.ndpa.ch
2023-05-14 21:55:16 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-05-14 21:55:16 [debug ] preprocessor from_nodes=1 name=CSS(main.PAGES_CONTAINER) nodes=1
2023-05-14 21:55:16 [debug ] chunked tags num=1 sizes=[4662]
The error starts here, at line 32
My code:
from dotenv import load_dotenv
load_dotenv()

import json
from scrapeghost import SchemaScraper, CSS

url = "https://www.ndpa.ch/"

link_scraper = SchemaScraper(
    '{"url": "url"}',
    auto_split_length=1500,
    models=["gpt-3.5-turbo"],
    extra_preprocessors=[CSS("header.SITE_HEADER_WRAPPER a[class!='image link-internal']")],
)

schema = {
    "title": "str",
    "text": "str",
}

text_scraper = SchemaScraper(
    schema,
    auto_split_length=1500,
    models=["gpt-3.5-turbo"],
    extra_preprocessors=[CSS("main.PAGES_CONTAINER")],
)

resp = link_scraper(url)
web_urls = resp.data
print(f"Scraped {len(web_urls)} site URLs, cost {resp.total_cost}")

web_data = []
for web_url in web_urls:
    print(web_url)
    web_data.append(
        text_scraper(
            web_url["url"],
        ).data
    )

print(f"Scraped {len(web_data)} site, ${text_scraper.stats()['total_cost']}")

with open("scraped_info.json", "w") as f:
    json.dump(web_data, f, indent=2)
could just be accessors on Scraper class? hard code prices?
See FAQ: https://jamesturk.github.io/scrapeghost/faq/#why-not-ask-the-scraper-to-write-css-xpath-selectors
There's an alternate version of the long-page scraper that could generate extraction selectors and then apply them client-side. Would be a huge cost savings for simple list pages. I'm exploring ideas related to this and will start posting updates on it soon.
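As a sketch of the client-side half of that idea: one paid call to suggest a selector, then free extraction with lxml. The prompt, helper name, and URL are placeholders, and doc.cssselect needs the cssselect package installed:

    # Sketch: ask the model for a CSS selector once, then apply it locally for free.
    # Not part of scrapeghost's API; the prompt and URL are illustrative.
    import json
    import lxml.html
    import openai
    import requests

    def suggest_selector(sample_html: str, field: str) -> str:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"Return only a CSS selector that matches the {field} "
                           f"elements in this HTML:\n{sample_html}",
            }],
        )
        return resp["choices"][0]["message"]["content"].strip()

    html = requests.get("https://example.com/list-page").text  # placeholder URL
    selector = suggest_selector(html[:4000], "episode link")   # the one paid call
    doc = lxml.html.fromstring(html)
    links = [a.get("href") for a in doc.cssselect(selector)]   # free from here on
    print(json.dumps(links, indent=2))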
In the example of the following section https://jamesturk.github.io/scrapeghost/tutorial/#putting-it-all-together, there are several errors in the code.
The extra_preprocessors field expects a list, not the CSS preprocessor directly. I've added the correction just below:
import json
from scrapeghost import SchemaScraper, CSS

episode_list_scraper = SchemaScraper(
    '{"url": "url"}',
    auto_split_length=1500,
    # restrict this to GPT-3.5-Turbo to keep the cost down
    models=["gpt-3.5-turbo"],
    extra_preprocessors=[CSS(".mw-parser-output a[class!='image link-internal']")],
)

episode_scraper = SchemaScraper(
    {
        "title": "str",
        "episode_number": "int",
        "release_date": "YYYY-MM-DD",
        "guests": ["str"],
        "characters": ["str"],
    },
    extra_preprocessors=[CSS("div.page-content")],
)

resp = episode_list_scraper(
    "https://comedybangbang.fandom.com/wiki/Category:Episodes",
)
episode_urls = resp.data
print(f"Scraped {len(episode_urls)} episode URLs, cost {resp.total_cost}")

episode_data = []
for episode_url in episode_urls:
    print(episode_url)
    episode_data.append(
        episode_scraper(
            episode_url["url"],
        ).data
    )

print(f"Scraped {len(episode_data)} episodes, ${episode_scraper.stats()['total_cost']}")

with open("episode_data.json", "w") as f:
    json.dump(episode_data, f, indent=2)
I love this as a concept and would love to implement something like this in my project https://github.com/srhinos/primelooter, but my biggest holdup is overall cost.
One alternative I'd really support would be the ability to have the scraper, on top of returning the JSON-ified data, also return selectors so that data can be fetched with traditional (free) libraries instead of the LLM.
This would give me all the benefits of AI continuously and autonomously updating my selectors somewhere in code, saving me tons of time, while minimizing the drawback of a ton of cost.
Not sure if this necessarily fits the scope of this project, but IMO it would make this much safer to implement in a lot more projects (especially mine <3). I'd be more than happy to work on something like this when I catch time over the next few weekends and contribute it back if it's out of scope for your current short-term plans.
Related to #7.
LLMs are seemingly happy to take even raw text and extract the structure from it, often better than from the needlessly verbose HTML, not to mention cheaper.
Packages like autoscraper can take a web page with known structured data and build a model of what selectors most likely yielded that text.
So perhaps the proposed hybrid mode can be implemented not by having the LLM generate code from HTML, but by first using it to extract the raw unformatted data and then writing small lambdas to normalize it to your expected format.
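A rough sketch of that hybrid flow using autoscraper, where the LLM-extracted values serve as the wanted list; the URLs and values are placeholders:

    # Sketch: use LLM-extracted values as "wanted" examples and let autoscraper
    # infer reusable rules/selectors from them. URLs and values are illustrative.
    from autoscraper import AutoScraper

    url = "https://example.com/episodes"          # placeholder URL
    llm_extracted = ["Episode 1", "Episode 2"]    # values a prior LLM pass pulled from this page

    scraper = AutoScraper()
    # build() learns which page rules yield the wanted values
    scraper.build(url=url, wanted_list=llm_extracted)

    # from here on, similar pages can be scraped without any LLM calls
    print(scraper.get_result_similar("https://example.com/episodes?page=2"))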
Use case:
Fill forms and click on buttons online...
when the examples are run, the timestamp should be disabled and the log level should be WARNING
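A possible sketch, assuming the examples log through structlog (the log lines elsewhere on this page look like structlog output); the exact processor list would need to match however the library actually configures logging:

    # Sketch: quiet logging for examples -- WARNING level, no timestamp.
    # Assumes structlog; adjust if the library uses stdlib logging instead.
    import logging
    import structlog

    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.dev.ConsoleRenderer(),  # no TimeStamper => no date prefix
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.WARNING),
    )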
probably going to introduce a breaking change in 0.6 or 0.7 to how models are selected
Now that OpenAI is releasing different versions of models (3.5-turbo has 4 current versions between the token limit and the different iterations) I think model configuration/fallback needs to change a bit.
If 3.5 parsed the results and didn't do well, the next fallback should be 4; if the token limit was exceeded, however, it should go to 3.5-16k instead. There are a lot of possible conditions for this, and people who want 100% control can explicitly pass a single model, but the way the fallback chain is traversed can improve to reduce redundant requests.
Something like:
models=[GPT35T(allow_16k=True), GPT4(allow_32k=False)]
This would try gpt-3.5-turbo only once, either at 4k or 16k based on input.
Then it would try gpt-4.
Selecting particular revisions could work this way as well.
Probably makes the most sense to do this as part of #18
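For illustration, a rough sketch of how that chain could be traversed; the GPT35T/GPT4 classes and token limits here are hypothetical stand-ins mirroring the proposal above, not real scrapeghost classes:

    # Hypothetical model wrappers matching the proposal above.
    from dataclasses import dataclass

    @dataclass
    class GPT35T:
        allow_16k: bool = True
        def pick(self, tokens: int) -> str | None:
            if tokens <= 4096:
                return "gpt-3.5-turbo"
            if self.allow_16k and tokens <= 16384:
                return "gpt-3.5-turbo-16k"
            return None  # too big for this family, move on

    @dataclass
    class GPT4:
        allow_32k: bool = False
        def pick(self, tokens: int) -> str | None:
            if tokens <= 8192:
                return "gpt-4"
            if self.allow_32k and tokens <= 32768:
                return "gpt-4-32k"
            return None

    def fallback_chain(models, prompt_tokens: int) -> list[str]:
        """Each family contributes at most one model, sized to the input."""
        return [m for m in (f.pick(prompt_tokens) for f in models) if m]

    # a 6,000-token prompt tries gpt-3.5-turbo-16k once, then gpt-4
    print(fallback_chain([GPT35T(allow_16k=True), GPT4(allow_32k=False)], 6000))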
Following along with https://jamesturk.github.io/scrapeghost/tutorial/#enhancing-the-schema
Things are working as expected when using "release_date": "str",
Things break for me when I try to format that date using "release_date": "YYYY-MM-DD",
Any idea what could be causing this?
Right now it's a simple check of top-level strings. Could get much fancier if this proves to be a problem in practice:
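As an illustration, a minimal sketch of what a top-level string check could look like (a guess at the idea, not the actual HallucinationChecker code):

    # Sketch: flag top-level string values in the parsed result that never appear
    # in the source HTML -- a rough hallucination signal.
    def find_hallucinated_strings(data: dict, html: str) -> list[str]:
        suspicious = []
        for value in data.values():
            if isinstance(value, str) and value and value not in html:
                suspicious.append(value)
        return suspicious

    html = "<li><a href='/ep/1'>Episode 1</a></li>"
    print(find_hallucinated_strings({"title": "Episode 1", "guest": "Nobody"}, html))
    # -> ['Nobody']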
Currently the project only supports Python 3.10 or higher; it does not support versions like 3.8 or 3.9.
If backward compatibility isn't planned, please indicate with a badge which Python versions are supported.
This seems like it'll be the most important task to make this more viable for people.
Alternative models will be cheaper, potentially much faster, allow running on someone's own hardware (LLaMa), and allow for more experimentation (e.g. models that are trained on HTML->JSON).
Quite a few models are attention free, which would remove the token limit altogether.
No API as of June 2023, their FAQ makes it sound unlikely.
TBD, commenter below says it didn't work well, haven't evaluated.
100k limits added in May, as soon as I get access this will be my guinea pig to add support for pluggable models.
Please add comments below if you've tried this approach with others that have an API.
Right now the method utils._pydantic_to_simple_schema is pretty weak; it could use a more comprehensive version and tests.
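For reference, a toy sketch of the kind of conversion that helper performs, written against Pydantic v1-style model fields; this is an illustration, not the library's actual implementation:

    # Toy sketch: flatten a Pydantic v1 model into the simple {"field": "type"}
    # schema strings SchemaScraper accepts. Not the real _pydantic_to_simple_schema.
    from pydantic import BaseModel

    def pydantic_to_simple_schema(model: type[BaseModel]) -> dict:
        schema = {}
        for name, field in model.__fields__.items():
            outer = field.outer_type_
            if getattr(outer, "__origin__", None) is list:
                schema[name] = [outer.__args__[0].__name__]  # e.g. ["str"]
            else:
                schema[name] = outer.__name__                # e.g. "int"
        return schema

    class Episode(BaseModel):
        title: str
        episode_number: int
        guests: list[str]

    print(pydantic_to_simple_schema(Episode))
    # {'title': 'str', 'episode_number': 'int', 'guests': ['str']}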
Ideally there'd be a way to augment/replace any part of the process:
Hi!
This took off today on a couple of sites, I'm glad people found this interesting :)
I'm working on some real world tests that are giving me more ideas of how to improve the tooling & DX.
Since I'm scraping tons of pages anyway, I'd love to make it actually useful for someone. If you happen to know of someone doing meaningful work or research that could use some one-off data scraped please put them in touch.
Ideally they'd:
Figured someone coming across this might know someone in that boat. My contact info is on my GitHub profile.
Following the tutorial which fetches a single legislator, I tried to scrape them all by following the links from the index page (https://www.ilga.gov/house/default.asp).
When I do, there is always 1 chunk that is too large.
Here is an example output when auto_split_length is set to 1000:
2023-08-04 13:06:30 [debug ] got HTML length=83130 url=https://www.ilga.gov/house/default.asp
2023-08-04 13:06:30 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-08-04 13:06:30 [debug ] preprocessor from_nodes=1 name=XPath(//table/tr/td//table/tr) nodes=137
2023-08-04 13:06:30 [debug ] chunked tags num=28 sizes=[576, 22215, 978, 903, 896, 902, 907, 904, 902, 902, 905, 899, 898, 900, 908, 903, 903, 895, 905, 903, 897, 901, 906, 901, 904, 907, 860, 449]
2023-08-04 13:06:30 [info ] API request html_tokens=576 model=gpt-3.5-turbo-16k
As shown, all of the chunked tags are below the threshold of 1000 except the second one (22215). I suspect it may be a large script block, but I cannot tell.
I tried adding an additional preprocessor:
import lxml.html
from lxml.html.clean import Cleaner

class DropJSandStyle:
    def __str__(self) -> str:
        return "DropJSandStyle()"

    def __call__(self, node: lxml.html.HtmlElement) -> list[lxml.html.HtmlElement]:
        cleaner = Cleaner()
        cleaner.javascript = True  # strip <script> tags and JS attributes
        cleaner.style = True       # strip <style> tags and stylesheets
        # wrap in a list to match the annotated return type
        return [cleaner.clean_html(node)]
This didn't help either so maybe it's not javascript or styles inflating that second chunk.
What am I doing wrong? Or how can I get that second chunk under 16k?
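One way to narrow it down, as a hedged debugging sketch outside scrapeghost: fetch the page with requests, apply the same XPath with lxml, and see which node is the outlier (sizes below are characters, not tokens, but the outlier should still stand out):

    # Debugging sketch: reproduce the preprocessing step and inspect node sizes.
    import lxml.html
    import requests

    html = requests.get("https://www.ilga.gov/house/default.asp").text
    doc = lxml.html.fromstring(html)
    nodes = doc.xpath("//table/tr/td//table/tr")

    for i, node in enumerate(nodes):
        text = lxml.html.tostring(node, encoding="unicode")
        if len(text) > 5000:  # arbitrary threshold to surface the oversized node
            print(i, len(text), text[:200])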
This is a very promising and simple tool, thanks for sharing it!
I was playing with it, but I'm getting the following error:
scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.
full trace:
File "[...]/scraper.py", line 17, in <module>
response = scraper(url)
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapeghost/scrapers.py", line 142, in scrape
return self._apply_postprocessors( # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapeghost/apicall.py", line 207, in _apply_postprocessors
response = pp(response, self)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapeghost/postprocessors.py", line 102, in __call__
raise PostprocessingError(
scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.
My scraper.py code is close to the tutorial example, but maybe I'm doing it wrong:
from scrapeghost import SchemaScraper, CSS
from pprint import pprint

url = "https://www.boredpanda.com/bruce-lee-quotes"

schema = {
    "index": "int",
    "quote": "str",
}

scraper = SchemaScraper(
    schema,
    extra_preprocessors=[CSS(".open-list-items div.open-list-item:nth-child(-n+10) .bordered-description")],
)

response = scraper(url)
pprint(response.data)
I tried to disable the HallucinationChecker by overriding the postprocessors but it wasn't clear to me how to do that properly.
Thanks again for your work on this, it's very cool and exciting!
Here's some example code (using async, which you may consider removing). html5lib is a better parser than the default bs4 parser, but slightly slower.
import asyncio
import logging
from typing import Optional

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

logger = logging.getLogger(__name__)
DOWNLOAD_HTML_TIMEOUT = 30  # seconds

async def get_browser_html(browser_type, url: str) -> str:
    # launch the browser, load the page, and return the rendered HTML
    async with await browser_type.launch() as browser:
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url, timeout=DOWNLOAD_HTML_TIMEOUT * 1000)
        content = await page.content()
    logger.info(f"Exiting get_browser_html for {browser_type.name}")
    return content

def html_to_html(html: str) -> Optional[str]:
    # html5lib is a more forgiving parser than bs4's default, but slightly slower
    soup = BeautifulSoup(html, "html5lib")
    text = soup.get_text().strip()
    if text == "":
        return None
    return str(soup).strip()

async def fetch_rendered_html(url: str) -> Optional[str]:
    async with async_playwright() as p:
        logger.info("Async playwright started")
        chromium_content = await get_browser_html(p.chromium, url)
        logger.info("Async playwright executed")
    # the original stored this on url_obj.chromium_html
    return html_to_html(chromium_content)

if __name__ == "__main__":
    # example URL; the original read url_obj.url
    print(asyncio.run(fetch_rendered_html("https://example.com")))
Is there a way to disable HallucinationChecker? If so, how? If not, can we add the ability to do this?
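One possibility, assuming SchemaScraper accepts a postprocessors argument (check the signature of your installed version; this is an assumption, not confirmed): supplying only JSONPostprocessor would skip HallucinationChecker entirely.

    # Hedged sketch: override the postprocessor list to omit HallucinationChecker.
    # The postprocessors= keyword is an assumption about the current API.
    from scrapeghost import SchemaScraper, CSS
    from scrapeghost.postprocessors import JSONPostprocessor

    scraper = SchemaScraper(
        {"index": "int", "quote": "str"},
        extra_preprocessors=[CSS(".bordered-description")],
        postprocessors=[JSONPostprocessor()],  # no HallucinationChecker
    )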
some websites do not have working ssl.
max_cost_per_request
max_total_cost
beyond mere selection, filter attributes that aren't likely to be used, remove extra characters, etc.
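As a rough sketch of that attribute filtering with plain lxml and an allowlist (illustrative, not an existing scrapeghost preprocessor):

    # Sketch: drop attributes outside an allowlist to shave tokens before the API call.
    import lxml.html

    KEEP_ATTRS = {"href", "src", "alt", "title"}

    def strip_attributes(node: lxml.html.HtmlElement) -> lxml.html.HtmlElement:
        for el in node.iter():
            for attr in list(el.attrib):
                if attr not in KEEP_ATTRS:
                    del el.attrib[attr]
        return node

    doc = lxml.html.fromstring('<a href="/x" class="btn btn-lg" data-track="17">go</a>')
    print(lxml.html.tostring(strip_attributes(doc), encoding="unicode"))
    # <a href="/x">go</a>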