
Comments (7)

eric-czech commented on May 31, 2024

Have you considered prompting for code to be generated for dynamic evaluation in the pagination case? I.e. as opposed to prompting for extraction directly.

It would clearly be scary to be automatically running whatever code comes back, but presumably there are reasonable ways to put guardrails on that?
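Roughly what I have in mind, as a sketch (here complete() is a hypothetical stand-in for whatever LLM call is used, and the prompt wording is only illustrative):

PROMPT = """
Write a Python function next_page_url(html) that, given HTML like the
document below, returns the URL of the "next page" link, or None if
there is no next page. Return only the code.

{html}
"""

def generate_pagination_code(html: str) -> str:
    # One code-generation call yields a reusable extractor, rather than
    # paying for a direct-extraction call on every page of results.
    return complete(PROMPT.format(html=html))  # complete() is hypothetical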

jamesturk commented on May 31, 2024

I'm taking a general philosophical stance for this library that I won't execute code that GPT generates, there are a couple of reasons for this:

  • Converting HTML to JSON is relatively safe, even with things like prompt injection out there. Sure, it is possible for a page to say "disregard all instructions...", but the processing pipeline rejects any output that isn't valid JSON, so the worst a malicious page can do is produce incorrect JSON that would presumably fail validation and be rejected in most cases. (Unless they were explicitly targeting /your/ scraper.) Generating and executing Python would add a whole additional layer of safety concerns. (A sketch of the kind of guardrail I mean follows at the end of this comment.)
  • One direction I want to explore with this is using smaller/cheaper/faster models that can do the HTML->JSON bit. Generating Python code is something GPT-4 is great at in most cases, but that added complexity might restrict the approach to top-of-the-line expensive models, when a lot can potentially be done with simpler models (arguably more, if speed/size is the trade-off being optimized). (Note: I haven't proven this concept yet; it may be that the quality difference of these models isn't worth it, but I don't want to give up on the idea prematurely.)

That isn't to say I don't think there's room for others to explore other techniques! They could probably even work as pre/postprocessors in scrapeghost's architecture if someone so chose.
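To illustrate the JSON guardrail mentioned above, something along these lines (a minimal sketch, not scrapeghost's actual validation code; expected_keys stands in for a real schema):

import json

def validate_output(raw: str, expected_keys: set[str]) -> list[dict]:
    # Anything that isn't valid JSON is rejected outright.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("response is not valid JSON")
    # Anything that doesn't match the expected shape is rejected too, so
    # injected instructions can at worst yield wrong-but-valid JSON that
    # usually fails this check.
    if not isinstance(data, list):
        raise ValueError("expected a list of objects")
    for item in data:
        if not isinstance(item, dict) or set(item) != expected_keys:
            raise ValueError(f"unexpected item: {item!r}")
    return data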

eric-czech commented on May 31, 2024

I'm taking a general philosophical stance for this library that I won't execute code that GPT generates

Makes sense, though (for posterity) some reasonable ways to sandbox the execution could be gVisor or Pyodide. Pyodide is particularly attractive IMO since it seems to be optimizing for ML/data use cases with scikit-learn, pandas, etc. support, and the security model for browser execution is probably sufficient for a lot of use cases.
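As a crude illustration of the general shape of such a guardrail (just a subprocess with a hard timeout; a real sandbox like gVisor or Pyodide would add actual filesystem/network isolation on top of this):

import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    # Write the generated code to a temp file and run it in a separate,
    # isolated interpreter (-I ignores the environment and user site dir).
    # This is NOT a real sandbox -- it only bounds runtime, which is why
    # something like gVisor or Pyodide is the more serious option.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, "-I", path],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout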

One direction I want to explore with this is using smaller/cheaper/faster models that can do the HTML->JSON bit

FWIW I have been looking for this as well, in places like Microsoft's (now old-ish) Document AI models (XDoc, MarkupLM, etc.) and more recent distillations of large LLMs (e.g. https://github.com/mbzuai-nlp/LaMini-LM), with no luck. They're just not accurate enough.

If I do come across some more efficient HTML extraction models though, I'll try to remember to share them here.

trompx commented on May 31, 2024

Excellent @eric-czech, it seems we went through the exact same search path. I was just about to experiment with XDoc and MarkupLM, but did not consider LLMs at all after the various papers I've read in which even GPT-4 was not on par with fine-tuned models for data extraction. There are also VrDU models like LayoutLM and Ernie-Layout, but I'd prefer not to have to render the page to extract data. Do you have a small repo with your tests with XDoc/MarkupLM? How inaccurate were the results, compared to GPT-3.5/4?

eric-czech commented on May 31, 2024

Hey @trompx, I was using examples like this (for LaMini in this case):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# LaMini instruction-tuned Flan-T5 (783M parameters)
tokenizer = AutoTokenizer.from_pretrained("MBZUAI/LaMini-Flan-T5-783M")
model = AutoModelForSeq2SeqLM.from_pretrained("MBZUAI/LaMini-Flan-T5-783M")
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

input_prompt = """
Convert the table in the following HTML document to a list of JSON objects:

--- BEGIN DOCUMENT ---
<html><body><h1>Solar System Planets</h1><table><tr><th>Planet</th><th>Diameter (km)</th><th>Mass (kg)</th><th>Orbit Radius (km)</th><th>Orbital Period (Earth days)</th></tr><tr><td>Mercury</td><td>4,879</td><td>3.3011 × 10^23</td><td>57,910,000</td><td>88</td></tr><tr><td>Venus</td><td>12,104</td><td>4.8675 × 10^24</td><td>108,200,000</td><td>225</td></tr><tr><td>Earth</td><td>12,742</td><td>5.972 × 10^24</td><td>149,600,000</td><td>365.25</td></tr><tr><td>Mars</td><td>6,779</td><td>6.4171 × 10^23</td><td>227,940,000</td><td>687</td></tr><tr><td>Jupiter</td><td>139,820</td><td>1.8982 × 10^27</td><td>778,570,000</td><td>4,333</td></tr><tr><td>Saturn</td><td>116,460</td><td>5.6834 × 10^26</td><td>1,429,400,000</td><td>10,759</td></tr><tr><td>Uranus</td><td>50,724</td><td>8.6810 × 10^25</td><td>2,870,990,000</td><td>30,687</td></tr><tr><td>Neptune</td><td>49,244</td><td>1.02413 × 10^26</td><td>4,504,300,000</td><td>60,190</td></tr></table></body></html>
--- END DOCUMENT ---
"""
# sample a generation and take the text of the first returned sequence
pipe(input_prompt, max_length=512, do_sample=True)[0]["generated_text"]

The point was to provide some very clean HTML with little more than a table to try to query. I also posed prompts like "Which planet is the largest?", "Extract the names of the planets as a JSON list", etc. and the XDoc models were somewhat better at this but still not great. The LaMini model responses were just unusable. For example, here is a response I get for the prompt above:

To convert the given HTML document to a list of JSON objects, you can use indexing
and encoding to extract the json classes from the table> element and replace the element
with its reference number from the following example:   "Begin Document":   
"List": [ [1, 2] , [3, 4, , 7, 1, 429 , 2, 8, 5, 4, 6,, 7, 2, 12]

which makes no sense. And in any case, all of these models have a 512-token context window, so they're not really practical for real HTML scraping use cases anyhow.
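For what it's worth, it is easy to see how quickly that window fills up by counting tokens with the same tokenizer (a quick check, reusing input_prompt from the snippet above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MBZUAI/LaMini-Flan-T5-783M")

# Count how much of the 512-token window the prompt above consumes;
# anything much past the limit is effectively out of reach for these models.
n_tokens = len(tokenizer(input_prompt)["input_ids"])
print(f"{n_tokens} tokens (window: 512)")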

trompx commented on May 31, 2024

Thanks for the detailed examples @eric-czech. Yes, the 512-token context is problematic. For tables, you could first parse the HTML to extract only the tables and, if one exceeds 512 tokens, break it down so that X rows fit in the context and extract recursively (a rough sketch of that below). But in general you can easily extract data from tables without LLMs, and could have a preprocessing table parser alongside other extraction systems; the problem is more about extracting the data when the structure is more complex. Have you tried upgrading to Flan-T5-XXL? It is relatively more performant.
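Something like this sketch of the chunking idea (using BeautifulSoup; in practice rows_per_chunk would be driven by a token count rather than a fixed row count, and the first-row-is-header assumption is just for illustration):

from bs4 import BeautifulSoup

def table_chunks(html: str, rows_per_chunk: int = 10) -> list[str]:
    # Pull each <table> out of the page and re-emit it in row-sized
    # chunks, repeating the header row so every chunk stands alone.
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        if not rows:
            continue
        header, body = rows[0], rows[1:]  # assumes row 0 is the header
        for i in range(0, len(body), rows_per_chunk):
            part = body[i:i + rows_per_chunk]
            chunks.append(
                "<table>" + str(header) + "".join(map(str, part)) + "</table>"
            )
    return chunks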

eric-czech commented on May 31, 2024

Had some more thoughts on this and moved them to #54 instead.
