jamesturk / scrapeghost
Experimental library for scraping websites using OpenAI's GPT API.
Home Page: https://jamesturk.github.io/scrapeghost/
License: Other
needs to be rethought with the new API
If the website I scrape needs pagination to get all the data, can this tool click on "next" (or similar) and continue with the next page?
I don't think the current license is the one I'll go with long term, but I also know from experience that the world of web scraping includes a lot of people doing good work alongside a lot of spammy/abusive use cases. I want to make this as unappealing as possible to those people, but I also know they aren't going to mind violating a license with no teeth.
I'm just sort of rambling about this publicly in case anybody reading this has thoughts; right now I'm considering going AGPL, but I'm open to ideas.
Casually typing as I went, could use some attention.
py.typed
count tokens in/out for easier reporting on total usage
automatic repair of common mistakes should be possible with a second request
The new feature announced yesterday explicitly mentions content extraction & indeed seems like it'd be a good choice to help constrain the output to JSON.
Altering the underlying request should be easy enough, and will likely improve output. This also suggests an opportunity to test the pluggable request feature that'll be needed for supporting multiple backends.
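For context, a rough sketch of what a function-calling request looks like with the openai<1.0 client; the schema and messages below are illustrative, not the library's actual prompt:

    # Sketch: OpenAI function calling (June 2023 API, openai<1.0) to constrain
    # output to JSON. Schema and messages are illustrative placeholders.
    import json
    import openai

    functions = [{
        "name": "extract_data",
        "description": "Extract structured data from the provided HTML.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "url": {"type": "string"},
            },
            "required": ["title", "url"],
        },
    }]

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": "<li><a href='/ep/1'>Episode 1</a></li>"}],
        functions=functions,
        function_call={"name": "extract_data"},  # force the model to "call" the function
    )
    print(json.loads(resp["choices"][0]["message"]["function_call"]["arguments"]))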
If you want to make a small contribution:
Run `just test` to run the tests with a coverage report.
To get this working in more places, more experimentation with token reduction is needed. How stripped down/minified can we get the HTML without causing reliability issues?
This isn't as straightforward as it seems, and many off-the-shelf tools are focused on different problems:
It seems like the right approach is going to be an allow/disallow list to extend and expand upon what's already been done in lxml.clean.
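For illustration, a rough sketch of an allow-list cleaner built on lxml's Cleaner; the tag and attribute lists are placeholder guesses, not settled defaults:

    # Sketch: aggressive allow-list cleaning with lxml's Cleaner to cut token count.
    # The allow_tags / safe_attrs choices are illustrative only.
    import lxml.html
    from lxml.html.clean import Cleaner

    cleaner = Cleaner(
        allow_tags=["a", "p", "li", "ul", "ol", "table", "tr", "td", "th", "h1", "h2", "h3"],
        remove_unknown_tags=False,  # required when allow_tags is used
        kill_tags=["script", "style", "nav", "footer"],
        safe_attrs_only=True,
        safe_attrs={"href"},        # drop every attribute except links
    )

    doc = lxml.html.fromstring(
        "<div><script>x()</script><p class='big'><a href='/a'>hi</a></p></div>"
    )
    print(lxml.html.tostring(cleaner.clean_html(doc), encoding="unicode"))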
I was playing around with this and found that the scraping got way, way better when I provided one or two short HTML snippets and JSON output based on each. I was primarily interested in this project as a way to self-heal CSS selectors, and ran into similar issues where the selectors weren't 100% accurate. When I modified the code to allow me to add an example, the success rate went up to 100% in my handful of tests.
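A minimal sketch of that few-shot idea using a plain OpenAI chat call (openai<1.0 interface) rather than scrapeghost's internal request builder; the example snippet and schema are made up:

    # Sketch: prepend one HTML -> JSON example pair to the chat messages (few-shot).
    # Uses the legacy openai<1.0 ChatCompletion interface; adjust for newer SDKs.
    import json
    import openai

    example_html = '<li><a href="/ep/1">Episode 1</a> 2020-01-05</li>'  # illustrative
    example_json = {"url": "/ep/1", "title": "Episode 1", "date": "2020-01-05"}

    def scrape_with_example(page_html: str, schema: dict) -> dict:
        messages = [
            {"role": "system",
             "content": f"Extract data from HTML matching this schema: {json.dumps(schema)}. "
                        "Respond with JSON only."},
            # the few-shot example: show the model one input/output pair first
            {"role": "user", "content": example_html},
            {"role": "assistant", "content": json.dumps(example_json)},
            # then the real page
            {"role": "user", "content": page_html},
        ]
        resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
        return json.loads(resp["choices"][0]["message"]["content"])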
CSV could save tokens for large pages but I'm concerned about parsing since JSON is well specified and CSV is not.
Probably not worth it but might be something to investigate.
Maybe this is more of a general scraping question, and maybe I'm in over my head since I'm new to this: after preprocessing and the XPath/CSS selectors, what actually gets sent to OpenAI?
Is it less helpful to just scrape all the plain text on a page and then use the auto splitter?
I am trying to scrape text information from a page and even if I use auto_split_length=1500 I get this error:
scrapeghost.errors.TooManyTokens: HTML is 4662 tokens, max for gpt-3.5-turbo is 4096
However, it appears that the code tries to create the chunks but does so incorrectly, producing only a single chunk of 4662 tokens. This is the output before the error:
2023-05-14 21:55:04 [debug ] got HTML length=664813 url=https://www.ndpa.ch/
2023-05-14 21:55:04 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-05-14 21:55:04 [debug ] preprocessor from_nodes=1 name=CSS(header.SITE_HEADER_WRAPPER a[class!='image link-internal']) nodes=14
2023-05-14 21:55:04 [debug ] chunked tags num=1 sizes=[775]
2023-05-14 21:55:04 [info ] API request html_tokens=775 model=gpt-3.5-turbo
2023-05-14 21:55:16 [info ] API response completion_tokens=183 cost=0.002072 duration=12.036354303359985 finish_reason=stop prompt_tokens=853
2023-05-14 21:55:16 [debug ] postprocessor data=[{"url": "https://www.ndpa.ch"}, {"url": "https://www.ndpa.ch"}, {"url": "https://www.ndpa.ch/nativedigital"}, {"url": "https://www.ndpa.ch/authentication"}, {"url": "https://www.nativedigital.ch"}, {"url": "https://www.ndpa.ch/investors"}, {"url": "https://www.ndpa.ch/shop"}, {"url": "https://www.ndpa.ch/about-us"},
{"url": "https://www.ndpa.ch/team-1"}, {"url": "https://www.ndpa.ch/contacts"}, {"url": "https://www.ndpa.ch/privacy-policy"}, {"url": "https://www.ndpa.ch/terms-and-conditions"}, {"url": "https://www.ndpa.ch/items"}, {"url": "https://www.ndpa.ch/news"}] data_type=<class 'str'> postprocessor=JSONPostprocessor(nudge=False))
Scraped 14 site URLs, cost 0.002072
https://www.ndpa.ch
2023-05-14 21:55:16 [debug ] got HTML length=664813 url=https://www.ndpa.ch
2023-05-14 21:55:16 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-05-14 21:55:16 [debug ] preprocessor from_nodes=1 name=CSS(main.PAGES_CONTAINER) nodes=1
2023-05-14 21:55:16 [debug ] chunked tags num=1 sizes=[4662]
The error starts here, at line 32
My code:
from dotenv import load_dotenv
load_dotenv()

import json
from scrapeghost import SchemaScraper, CSS

url = "https://www.ndpa.ch/"

link_scraper = SchemaScraper(
    '{"url": "url"}',
    auto_split_length=1500,
    models=["gpt-3.5-turbo"],
    extra_preprocessors=[CSS("header.SITE_HEADER_WRAPPER a[class!='image link-internal']")],
)

schema = {
    "title": "str",
    "text": "str",
}

text_scraper = SchemaScraper(
    schema,
    auto_split_length=1500,
    models=["gpt-3.5-turbo"],
    extra_preprocessors=[CSS("main.PAGES_CONTAINER")],
)

resp = link_scraper(url)
web_urls = resp.data
print(f"Scraped {len(web_urls)} site URLs, cost {resp.total_cost}")

web_data = []
for web_url in web_urls:
    print(web_url)
    web_data.append(
        text_scraper(
            web_url["url"],
        ).data
    )

print(f"Scraped {len(web_data)} site, ${text_scraper.stats()['total_cost']}")

with open("scraped_info.json", "w") as f:
    json.dump(web_data, f, indent=2)
could just be accessors on Scraper class? hard code prices?
See FAQ: https://jamesturk.github.io/scrapeghost/faq/#why-not-ask-the-scraper-to-write-css-xpath-selectors
There's an alternate version of the long-page scraper that could generate extraction selectors and then apply them client-side. Would be a huge cost savings for simple list pages. I'm exploring ideas related to this and will start posting updates on it soon.
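As a sketch of the client-side half of that idea: one paid call to suggest a selector, then free extraction with lxml. The prompt, helper name, and URL are placeholders, and doc.cssselect needs the cssselect package installed:

    # Sketch: ask the model for a CSS selector once, then apply it locally for free.
    # Not part of scrapeghost's API; the prompt and URL are illustrative.
    import json
    import lxml.html
    import openai
    import requests

    def suggest_selector(sample_html: str, field: str) -> str:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"Return only a CSS selector that matches the {field} "
                           f"elements in this HTML:\n{sample_html}",
            }],
        )
        return resp["choices"][0]["message"]["content"].strip()

    html = requests.get("https://example.com/list-page").text  # placeholder URL
    selector = suggest_selector(html[:4000], "episode link")   # the one paid call
    doc = lxml.html.fromstring(html)
    links = [a.get("href") for a in doc.cssselect(selector)]   # free from here on
    print(json.dumps(links, indent=2))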
In the example of the following section https://jamesturk.github.io/scrapeghost/tutorial/#putting-it-all-together, there are several errors in the code.
The extra_preprocessors field expects a list, not the CSS preprocessor directly. I've added the correction just below:
import json
from scrapeghost import SchemaScraper, CSS

episode_list_scraper = SchemaScraper(
    '{"url": "url"}',
    auto_split_length=1500,
    # restrict this to GPT-3.5-Turbo to keep the cost down
    models=["gpt-3.5-turbo"],
    extra_preprocessors=[CSS(".mw-parser-output a[class!='image link-internal']")],
)

episode_scraper = SchemaScraper(
    {
        "title": "str",
        "episode_number": "int",
        "release_date": "YYYY-MM-DD",
        "guests": ["str"],
        "characters": ["str"],
    },
    extra_preprocessors=[CSS("div.page-content")],
)

resp = episode_list_scraper(
    "https://comedybangbang.fandom.com/wiki/Category:Episodes",
)
episode_urls = resp.data
print(f"Scraped {len(episode_urls)} episode URLs, cost {resp.total_cost}")

episode_data = []
for episode_url in episode_urls:
    print(episode_url)
    episode_data.append(
        episode_scraper(
            episode_url["url"],
        ).data
    )

print(f"Scraped {len(episode_data)} episodes, ${episode_scraper.stats()['total_cost']}")

with open("episode_data.json", "w") as f:
    json.dump(episode_data, f, indent=2)
I love this as a concept and would love to implement something like this in my project https://github.com/srhinos/primelooter, but my biggest holdup is overall cost.
One alternative I'd really support would be the ability to have the scraper, on top of returning the JSON-ified data, also return selectors so that data can be fetched with traditional (free) libraries instead of the LLM.
This would give me all the benefits of AI continuously and autonomously updating my selectors somewhere in code, saving me tons of time, while minimizing the drawback of a ton of cost.
Not sure if this necessarily fits the scope of this project, but IMO it would make this much safer to implement in a lot more projects (especially mine <3). I'd be more than happy to work on something like this when I catch time over the next few weekends and contribute it back if it's out of scope for your current short-term plans.
Related to #7.
LLMs are seemingly happy to take even raw text and extract the structure from it, often better than from the needlessly verbose HTML, not to mention cheaper.
Packages like autoscraper can take a web page with known structured data and build a model of what selectors most likely yielded that text.
So perhaps the proposed hybrid mode can be implemented not by having the LLM generate code from HTML, but by first using it to extract the raw unformatted data and then writing small lambdas to normalize it to your expected format.
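A rough sketch of that hybrid flow using autoscraper, where the LLM-extracted values serve as the wanted list; the URLs and values are placeholders:

    # Sketch: use LLM-extracted values as "wanted" examples and let autoscraper
    # infer reusable rules/selectors from them. URLs and values are illustrative.
    from autoscraper import AutoScraper

    url = "https://example.com/episodes"          # placeholder URL
    llm_extracted = ["Episode 1", "Episode 2"]    # values a prior LLM pass pulled from this page

    scraper = AutoScraper()
    # build() learns which page rules yield the wanted values
    scraper.build(url=url, wanted_list=llm_extracted)

    # from here on, similar pages can be scraped without any LLM calls
    print(scraper.get_result_similar("https://example.com/episodes?page=2"))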
Use case:
Fill forms and click on buttons online...
when the examples are run, the timestamp should be disabled and the log level should be WARNING
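A possible sketch, assuming the examples log through structlog (the log lines elsewhere on this page look like structlog output); the exact processor list would need to match however the library actually configures logging:

    # Sketch: quiet logging for examples -- WARNING level, no timestamp.
    # Assumes structlog; adjust if the library uses stdlib logging instead.
    import logging
    import structlog

    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.dev.ConsoleRenderer(),  # no TimeStamper => no date prefix
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.WARNING),
    )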
probably going to introduce a breaking change in 0.6 or 0.7 to how models are selected
Now that OpenAI is releasing different versions of models (3.5-turbo has 4 current versions between the token limit and the different iterations) I think model configuration/fallback needs to change a bit.
If 3.5 parsed the results and didn't do well, the next fallback should be 4; if the token limit was exceeded, however, it should go to 3.5-16k instead. There are a lot of possible conditions for this, and people who want 100% control can explicitly pass a single model, but the way the fallback chain is traversed can improve to reduce redundant requests.
Something like:
models=[GPT35T(allow_16k=True), GPT4(allow_32k=False)]
This would try gpt-3.5-turbo only once, either at 4k or 16k based on input.
Then it would try gpt-4.
Selecting particular revisions could work this way as well.
Probably makes the most sense to do this as part of #18
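For illustration, a rough sketch of how that chain could be traversed; the GPT35T/GPT4 classes and token limits here are hypothetical stand-ins mirroring the proposal above, not real scrapeghost classes:

    # Hypothetical model wrappers matching the proposal above.
    from dataclasses import dataclass

    @dataclass
    class GPT35T:
        allow_16k: bool = True
        def pick(self, tokens: int) -> str | None:
            if tokens <= 4096:
                return "gpt-3.5-turbo"
            if self.allow_16k and tokens <= 16384:
                return "gpt-3.5-turbo-16k"
            return None  # too big for this family, move on

    @dataclass
    class GPT4:
        allow_32k: bool = False
        def pick(self, tokens: int) -> str | None:
            if tokens <= 8192:
                return "gpt-4"
            if self.allow_32k and tokens <= 32768:
                return "gpt-4-32k"
            return None

    def fallback_chain(models, prompt_tokens: int) -> list[str]:
        """Each family contributes at most one model, sized to the input."""
        return [m for m in (f.pick(prompt_tokens) for f in models) if m]

    # a 6,000-token prompt tries gpt-3.5-turbo-16k once, then gpt-4
    print(fallback_chain([GPT35T(allow_16k=True), GPT4(allow_32k=False)], 6000))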
Following along with https://jamesturk.github.io/scrapeghost/tutorial/#enhancing-the-schema
Things are working as expected when using "release_date": "str",
Things break for me when I try to format that date using "release_date": "YYYY-MM-DD",
Any idea what could be causing this?
Right now it's a simple check of top-level strings. Could get much fancier if this proves to be a problem in practice:
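As an illustration, a minimal sketch of what a top-level string check could look like (a guess at the idea, not the actual HallucinationChecker code):

    # Sketch: flag top-level string values in the parsed result that never appear
    # in the source HTML -- a rough hallucination signal.
    def find_hallucinated_strings(data: dict, html: str) -> list[str]:
        suspicious = []
        for value in data.values():
            if isinstance(value, str) and value and value not in html:
                suspicious.append(value)
        return suspicious

    html = "<li><a href='/ep/1'>Episode 1</a></li>"
    print(find_hallucinated_strings({"title": "Episode 1", "guest": "Nobody"}, html))
    # -> ['Nobody']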
Currently the project only supports Python 3.10 or higher; it does not support versions like 3.8 or 3.9.
If backward compatibility isn't planned, please indicate with a badge which Python versions are supported.
This seems like it'll be the most important task to make this more viable for people.
Alternative models will be cheaper, potentially much faster, allow running on someone's own hardware (LLaMa), and allow for more experimentation (e.g. models that are trained on HTML->JSON).
Quite a few models are attention free, which would remove the token limit altogether.
No API as of June 2023, their FAQ makes it sound unlikely.
TBD, commenter below says it didn't work well, haven't evaluated.
100k limits added in May, as soon as I get access this will be my guinea pig to add support for pluggable models.
Please add comments below if you've tried this approach with others that have an API.
Right now the method utils._pydantic_to_simple_schema is pretty weak; it could use a more comprehensive version and tests.
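For reference, a toy sketch of the kind of conversion that helper performs, written against Pydantic v1-style model fields; this is an illustration, not the library's actual implementation:

    # Toy sketch: flatten a Pydantic v1 model into the simple {"field": "type"}
    # schema strings SchemaScraper accepts. Not the real _pydantic_to_simple_schema.
    from pydantic import BaseModel

    def pydantic_to_simple_schema(model: type[BaseModel]) -> dict:
        schema = {}
        for name, field in model.__fields__.items():
            outer = field.outer_type_
            if getattr(outer, "__origin__", None) is list:
                schema[name] = [outer.__args__[0].__name__]  # e.g. ["str"]
            else:
                schema[name] = outer.__name__                # e.g. "int"
        return schema

    class Episode(BaseModel):
        title: str
        episode_number: int
        guests: list[str]

    print(pydantic_to_simple_schema(Episode))
    # {'title': 'str', 'episode_number': 'int', 'guests': ['str']}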
Ideally there'd be a way to augment/replace any part of the process:
Hi!
This took off today on a couple of sites, I'm glad people found this interesting :)
I'm working on some real world tests that are giving me more ideas of how to improve the tooling & DX.
Since I'm scraping tons of pages anyway, I'd love to make it actually useful for someone. If you happen to know of someone doing meaningful work or research that could use some one-off data scraped please put them in touch.
Ideally they'd:
Figured someone coming across this might know someone in that boat. My contact info is on my GitHub profile.
Following the tutorial which fetches a single legislator, I tried to scrape them all by following the links from the index page (https://www.ilga.gov/house/default.asp).
When I do, there is always 1 chunk that is too large.
Here is an example output when auto_split_length is set to 1000:
2023-08-04 13:06:30 [debug ] got HTML length=83130 url=https://www.ilga.gov/house/default.asp
2023-08-04 13:06:30 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-08-04 13:06:30 [debug ] preprocessor from_nodes=1 name=XPath(//table/tr/td//table/tr) nodes=137
2023-08-04 13:06:30 [debug ] chunked tags num=28 sizes=[576, 22215, 978, 903, 896, 902, 907, 904, 902, 902, 905, 899, 898, 900, 908, 903, 903, 895, 905, 903, 897, 901, 906, 901, 904, 907, 860, 449]
2023-08-04 13:06:30 [info ] API request html_tokens=576 model=gpt-3.5-turbo-16k
As shown, all of the chunked tags are below the threshold of 1000 except the second one (22215). I suspect it may be a large script block, but I cannot tell.
I tried adding an additional preprocessor:
import lxml.html
from lxml.html.clean import Cleaner

class DropJSandStyle:
    def __str__(self) -> str:
        return "DropJSandStyle()"

    def __call__(self, node: lxml.html.HtmlElement) -> list[lxml.html.HtmlElement]:
        cleaner = Cleaner()
        cleaner.javascript = True  # strip <script> tags and JS attributes
        cleaner.style = True       # strip <style> tags and stylesheets
        # wrap in a list to match the annotated return type
        return [cleaner.clean_html(node)]
This didn't help either so maybe it's not javascript or styles inflating that second chunk.
What am I doing wrong? Or how can I get that second chunk under 16k?
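One way to narrow it down, as a hedged debugging sketch outside scrapeghost: fetch the page with requests, apply the same XPath with lxml, and see which node is the outlier (sizes below are characters, not tokens, but the outlier should still stand out):

    # Debugging sketch: reproduce the preprocessing step and inspect node sizes.
    import lxml.html
    import requests

    html = requests.get("https://www.ilga.gov/house/default.asp").text
    doc = lxml.html.fromstring(html)
    nodes = doc.xpath("//table/tr/td//table/tr")

    for i, node in enumerate(nodes):
        text = lxml.html.tostring(node, encoding="unicode")
        if len(text) > 5000:  # arbitrary threshold to surface the oversized node
            print(i, len(text), text[:200])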
This is a very promising and simple tool, thanks for sharing it!
I was playing with it, but I'm getting the following error:
scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.
full trace:
File "[...]/scraper.py", line 17, in <module>
response = scraper(url)
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapeghost/scrapers.py", line 142, in scrape
return self._apply_postprocessors( # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapeghost/apicall.py", line 207, in _apply_postprocessors
response = pp(response, self)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapeghost/postprocessors.py", line 102, in __call__
raise PostprocessingError(
scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.
My scraper.py code is close to the tutorial example, but maybe I'm doing it wrong:
from scrapeghost import SchemaScraper, CSS
from pprint import pprint

url = "https://www.boredpanda.com/bruce-lee-quotes"

schema = {
    "index": "int",
    "quote": "str",
}

scraper = SchemaScraper(
    schema,
    extra_preprocessors=[CSS(".open-list-items div.open-list-item:nth-child(-n+10) .bordered-description")],
)

response = scraper(url)
pprint(response.data)
I tried to disable the HallucinationChecker by overriding the postprocessors but it wasn't clear to me how to do that properly.
Thanks again for your work on this, it's very cool and exciting!
Here's some example code (using async, which you may consider removing). html5lib is a better parser than the default bs4 parser, but slightly slower.
import asyncio
import logging
from typing import Optional

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

logger = logging.getLogger(__name__)
DOWNLOAD_HTML_TIMEOUT = 30  # seconds

async def get_browser_html(browser_type, url: str) -> str:
    # launch the browser, load the page, and return the rendered HTML
    async with await browser_type.launch() as browser:
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url, timeout=DOWNLOAD_HTML_TIMEOUT * 1000)
        content = await page.content()
    logger.info(f"Exiting get_browser_html for {browser_type.name}")
    return content

def html_to_html(html: str) -> Optional[str]:
    # html5lib is a more forgiving parser than bs4's default, but slightly slower
    soup = BeautifulSoup(html, "html5lib")
    text = soup.get_text().strip()
    if text == "":
        return None
    return str(soup).strip()

async def fetch_rendered_html(url: str) -> Optional[str]:
    async with async_playwright() as p:
        logger.info("Async playwright started")
        chromium_content = await get_browser_html(p.chromium, url)
        logger.info("Async playwright executed")
    # the original stored this on url_obj.chromium_html
    return html_to_html(chromium_content)

if __name__ == "__main__":
    # example URL; the original read url_obj.url
    print(asyncio.run(fetch_rendered_html("https://example.com")))
Is there a way to disable HallucinationChecker? If so, how? If not, can we add the ability to do this?
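One possibility, assuming SchemaScraper accepts a postprocessors argument (check the signature of your installed version; this is an assumption, not confirmed): supplying only JSONPostprocessor would skip HallucinationChecker entirely.

    # Hedged sketch: override the postprocessor list to omit HallucinationChecker.
    # The postprocessors= keyword is an assumption about the current API.
    from scrapeghost import SchemaScraper, CSS
    from scrapeghost.postprocessors import JSONPostprocessor

    scraper = SchemaScraper(
        {"index": "int", "quote": "str"},
        extra_preprocessors=[CSS(".bordered-description")],
        postprocessors=[JSONPostprocessor()],  # no HallucinationChecker
    )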
some websites do not have working ssl.
max_cost_per_request
max_total_cost
beyond mere selection, filter attributes that aren't likely to be used, remove extra characters, etc.
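As a rough sketch of that attribute filtering with plain lxml and an allowlist (illustrative, not an existing scrapeghost preprocessor):

    # Sketch: drop attributes outside an allowlist to shave tokens before the API call.
    import lxml.html

    KEEP_ATTRS = {"href", "src", "alt", "title"}

    def strip_attributes(node: lxml.html.HtmlElement) -> lxml.html.HtmlElement:
        for el in node.iter():
            for attr in list(el.attrib):
                if attr not in KEEP_ATTRS:
                    del el.attrib[attr]
        return node

    doc = lxml.html.fromstring('<a href="/x" class="btn btn-lg" data-track="17">go</a>')
    print(lxml.html.tostring(strip_attributes(doc), encoding="unicode"))
    # <a href="/x">go</a>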