
Comments (7)

eric-czech commented on May 31, 2024

Have you considered prompting for code to be generated for dynamic evaluation in the pagination case? I.e. as opposed to prompting for extraction directly.

It would clearly be scary to be automatically running whatever code comes back, but presumably there are reasonable ways to put guardrails on that?
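Roughly what I have in mind, as a sketch (here complete() is a hypothetical stand-in for whatever LLM call is used, and the prompt wording is only illustrative):

PROMPT = """
Write a Python function next_page_url(html) that, given HTML like the
document below, returns the URL of the "next page" link, or None if
there is no next page. Return only the code.

{html}
"""

def generate_pagination_code(html: str) -> str:
    # One code-generation call yields a reusable extractor, rather than
    # paying for a direct-extraction call on every page of results.
    return complete(PROMPT.format(html=html))  # complete() is hypothetical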

jamesturk commented on May 31, 2024

I'm taking a general philosophical stance for this library that I won't execute code that GPT generates, there are a couple of reasons for this:

  • Converting HTML to JSON is relatively safe, even with things like prompt injection out there. Sure, it is possible for a page to say "disregard all instructions...", but the processing pipeline rejects any output that isn't valid JSON, so the worst a malicious page can do is produce incorrect JSON that would presumably fail validation and be rejected in most cases. (Unless they were explicitly targeting /your/ scraper.) Generating and executing Python would add a whole additional layer of safety concerns. (A sketch of the kind of guardrail I mean follows at the end of this comment.)
  • One direction I want to explore with this is using smaller/cheaper/faster models that can do the HTML->JSON bit. Generating Python code is something GPT-4 is great at in most cases, but that added complexity might restrict the approach to top-of-the-line expensive models, when a lot can potentially be done with simpler models (arguably more, if speed/size is the trade-off being optimized). (Note: I haven't proven this concept yet; it may be that the quality difference of these models isn't worth it, but I don't want to give up on the idea prematurely.)

That isn't to say I don't think there's room for others to explore other techniques! They could probably even work as pre/postprocessors in scrapeghost's architecture if someone so chose.
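To illustrate the JSON guardrail mentioned above, something along these lines (a minimal sketch, not scrapeghost's actual validation code; expected_keys stands in for a real schema):

import json

def validate_output(raw: str, expected_keys: set[str]) -> list[dict]:
    # Anything that isn't valid JSON is rejected outright.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("response is not valid JSON")
    # Anything that doesn't match the expected shape is rejected too, so
    # injected instructions can at worst yield wrong-but-valid JSON that
    # usually fails this check.
    if not isinstance(data, list):
        raise ValueError("expected a list of objects")
    for item in data:
        if not isinstance(item, dict) or set(item) != expected_keys:
            raise ValueError(f"unexpected item: {item!r}")
    return data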

eric-czech commented on May 31, 2024

I'm taking a general philosophical stance for this library that I won't execute code that GPT generates

Makes sense, though (for posterity) some reasonable ways to sandbox the execution could be gVisor or Pyodide. Pyodide is particularly attractive IMO since it seems to be optimizing for ML/data use cases with scikit-learn, pandas, etc. support, and the security model for browser execution is probably sufficient for a lot of use cases.
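As a crude illustration of the general shape of such a guardrail (just a subprocess with a hard timeout; a real sandbox like gVisor or Pyodide would add actual filesystem/network isolation on top of this):

import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    # Write the generated code to a temp file and run it in a separate,
    # isolated interpreter (-I ignores the environment and user site dir).
    # This is NOT a real sandbox -- it only bounds runtime, which is why
    # something like gVisor or Pyodide is the more serious option.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, "-I", path],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout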

One direction I want to explore with this is using smaller/cheaper/faster models that can do the HTML->JSON bit

FWIW I have been looking for this as well, in places like Microsoft's (now old-ish) Document AI models (XDoc, MarkupLM, etc.) and more recent distillations of large LLMs (e.g. https://github.com/mbzuai-nlp/LaMini-LM), with no luck. They're just not accurate enough.

If I do come across some more efficient HTML extraction models though, I'll try to remember to share them here.

trompx commented on May 31, 2024

Excellent @eric-czech, it seems we went through the exact same search path. I was just about to experiment with XDoc and MarkupLM, but did not consider LLMs at all after the various papers I've read in which even GPT-4 was not on par with fine-tuned models for data extraction. There are also VrDU models like LayoutLM and Ernie-Layout, but I'd prefer not to have to render the page to extract data. Do you have a small repo with your tests with XDoc/MarkupLM? How inaccurate were the results, compared to GPT-3.5/4?

eric-czech commented on May 31, 2024

Hey @trompx, I was using examples like this (for LaMini in this case):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# LaMini instruction-tuned Flan-T5 (783M parameters)
tokenizer = AutoTokenizer.from_pretrained("MBZUAI/LaMini-Flan-T5-783M")
model = AutoModelForSeq2SeqLM.from_pretrained("MBZUAI/LaMini-Flan-T5-783M")
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

input_prompt = """
Convert the table in the following HTML document to a list of JSON objects:

--- BEGIN DOCUMENT ---
<html><body><h1>Solar System Planets</h1><table><tr><th>Planet</th><th>Diameter (km)</th><th>Mass (kg)</th><th>Orbit Radius (km)</th><th>Orbital Period (Earth days)</th></tr><tr><td>Mercury</td><td>4,879</td><td>3.3011 × 10^23</td><td>57,910,000</td><td>88</td></tr><tr><td>Venus</td><td>12,104</td><td>4.8675 × 10^24</td><td>108,200,000</td><td>225</td></tr><tr><td>Earth</td><td>12,742</td><td>5.972 × 10^24</td><td>149,600,000</td><td>365.25</td></tr><tr><td>Mars</td><td>6,779</td><td>6.4171 × 10^23</td><td>227,940,000</td><td>687</td></tr><tr><td>Jupiter</td><td>139,820</td><td>1.8982 × 10^27</td><td>778,570,000</td><td>4,333</td></tr><tr><td>Saturn</td><td>116,460</td><td>5.6834 × 10^26</td><td>1,429,400,000</td><td>10,759</td></tr><tr><td>Uranus</td><td>50,724</td><td>8.6810 × 10^25</td><td>2,870,990,000</td><td>30,687</td></tr><tr><td>Neptune</td><td>49,244</td><td>1.02413 × 10^26</td><td>4,504,300,000</td><td>60,190</td></tr></table></body></html>
--- END DOCUMENT ---
"""
# sample a generation and take the text of the first returned sequence
pipe(input_prompt, max_length=512, do_sample=True)[0]["generated_text"]

The point was to provide some very clean HTML with little more than a table to try to query. I also posed prompts like "Which planet is the largest?", "Extract the names of the planets as a JSON list", etc. and the XDoc models were somewhat better at this but still not great. The LaMini model responses were just unusable. For example, here is a response I get for the prompt above:

To convert the given HTML document to a list of JSON objects, you can use indexing
and encoding to extract the json classes from the table> element and replace the element
with its reference number from the following example:   "Begin Document":   
"List": [ [1, 2] , [3, 4, , 7, 1, 429 , 2, 8, 5, 4, 6,, 7, 2, 12]

which makes no sense. And in any case, all of these models have a 512-token context window, so they're not really practical for real HTML scraping use cases anyhow.
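For what it's worth, it is easy to see how quickly that window fills up by counting tokens with the same tokenizer (a quick check, reusing input_prompt from the snippet above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MBZUAI/LaMini-Flan-T5-783M")

# Count how much of the 512-token window the prompt above consumes;
# anything much past the limit is effectively out of reach for these models.
n_tokens = len(tokenizer(input_prompt)["input_ids"])
print(f"{n_tokens} tokens (window: 512)")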

trompx commented on May 31, 2024

Thanks for the detailed examples @eric-czech. Yes, the 512-token context is problematic. For tables, you could first parse the HTML to extract only the tables and, if one exceeds 512 tokens, break it down so that X rows fit in the context and extract recursively (a rough sketch of that below). But in general you can easily extract data from tables without LLMs, and could have a preprocessing table parser alongside other extraction systems; the problem is more about extracting the data when the structure is more complex. Have you tried upgrading to Flan-T5-XXL? It is relatively more performant.
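Something like this sketch of the chunking idea (using BeautifulSoup; in practice rows_per_chunk would be driven by a token count rather than a fixed row count, and the first-row-is-header assumption is just for illustration):

from bs4 import BeautifulSoup

def table_chunks(html: str, rows_per_chunk: int = 10) -> list[str]:
    # Pull each <table> out of the page and re-emit it in row-sized
    # chunks, repeating the header row so every chunk stands alone.
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        if not rows:
            continue
        header, body = rows[0], rows[1:]  # assumes row 0 is the header
        for i in range(0, len(body), rows_per_chunk):
            part = body[i:i + rows_per_chunk]
            chunks.append(
                "<table>" + str(header) + "".join(map(str, part)) + "</table>"
            )
    return chunks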

eric-czech commented on May 31, 2024

Had some more thoughts on this and moved them to #54 instead.
