harvard-lil / warc-gpt
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
License: MIT License
Thanks for this elegant example of how to do RAG with WARC data! I also very much appreciated how the blog post highlighted limitations with citation (which is important for web archives).
I was wondering if it might be useful to use the text/plain WARC resource records that browsertrix-crawler creates from the rendered page (not just scraped from the static HTML). This could be important for social media content where the page is assembled dynamically?
I think it would mostly be a matter of adding some logic to ingest.py to look for records with WARC-Type: resource, and then use the URL in the WARC-Target-URI header to determine the URL to associate the text with?
Here's an example for the text generated on the initial page render:
WARC/1.1
Content-Type: text/plain
WARC-Target-URI: urn:text:https://genart.social/tags/genuary
WARC-Date: 2024-02-18T16:58:12.661Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:1d657dd4-1b01-4e76-bba2-ea641d74c029>
WARC-Payload-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
WARC-Block-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
Content-Length: 897
Mastodon
Create account
Login
Recent searches
No recent searches
Search options
Not available on genart.social.
genart.social
is part of the decentralized social network powered by
Mastodon
.
...
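To make the idea concrete, here is a rough stdlib-only sketch of the kind of logic ingest.py could grow. The function name and the manual header parsing are just for illustration; a real patch would presumably go through a WARC library such as warcio rather than parsing records by hand.

```python
# Sketch (hypothetical, stdlib only): scan a WARC file for
# WARC-Type: resource records whose WARC-Target-URI carries the
# urn:text: / urn:textFinal: prefix that browsertrix-crawler writes,
# and map the extracted text back to the original page URL.

def parse_warc_resources(data: bytes):
    """Return (page_url, text) pairs for rendered-text resource records."""
    results = []
    pos = 0
    while True:
        start = data.find(b"WARC/1.", pos)
        if start == -1:
            break
        header_end = data.find(b"\r\n\r\n", start)
        if header_end == -1:
            break
        # Parse the WARC header block (skip the WARC/1.x version line).
        headers = {}
        for line in data[start:header_end].decode("utf-8").splitlines()[1:]:
            key, _, value = line.partition(": ")
            headers[key] = value
        length = int(headers.get("Content-Length", "0"))
        body = data[header_end + 4 : header_end + 4 + length]
        pos = header_end + 4 + length
        if headers.get("WARC-Type") != "resource":
            continue
        uri = headers.get("WARC-Target-URI", "")
        # urn:text: = initial render; urn:textFinal: = after behaviors ran.
        for prefix in ("urn:text:", "urn:textFinal:"):
            if uri.startswith(prefix):
                results.append((uri[len(prefix):], body.decode("utf-8", "replace")))
                break
    return results
```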
The WARC-Target-URI could also look like WARC-Target-URI: urn:textFinal:{url}, which is the text in the page after the behaviors have run. But maybe this would complicate the retrieval step if there are multiple records for the same resource?
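On the multiple-records concern: one simple approach would be to keep only one text record per page at ingest time, preferring the post-behavior snapshot. A hypothetical sketch (the function name and the input shape are mine, not anything in the codebase):

```python
# Hypothetical dedup step for ingest: when both urn:text: (initial render)
# and urn:textFinal: (after behaviors ran) records exist for the same page,
# keep only the post-behavior text so retrieval sees one record per URL.

def dedupe_rendered_text(pairs):
    """pairs: iterable of (warc_target_uri, text); returns {page_url: text}."""
    best = {}
    for uri, text in pairs:
        for prefix in ("urn:text:", "urn:textFinal:"):
            if uri.startswith(prefix):
                url = uri[len(prefix):]
                # The textFinal snapshot wins over the initial render.
                if url not in best or prefix == "urn:textFinal:":
                    best[url] = text
                break
    return best
```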
Environment:
Apple M1 Pro, macOS 14.3.1, Chrome
I initially uploaded 41 WARC files into WARCgpt. Among these files was an email containing titles and links to several papers related to AI. When I queried WARCgpt about the email's content regarding AI, the system responded that the email did not directly mention AI. Instead, it referenced links to product pages on B&H Photo Video's website for various computer components, such as processors and memory, with encoded parameters that specify the products linked. It was unclear how these components were connected to AI. Although I have some WARC files containing emails from B&H Photo Video, they pertain to cameras, video equipment, etc.
Later, I crawled the page https://arxiv.org/html/2303.08774v5 and ingested it into WARCgpt. When I asked the system what the emails said regarding transformers, it accurately responded that the emails discussed several transformer models, such as "gpt-j-6b," "gpt-neo," "bloom," and "opt," describing them as large-scale autoregressive language models. Some emails covered aspects like training, deployment, alignment, and human data collection for these models, in addition to contributions to datasets. The emails were authored by individuals ranging from researchers and engineers to product managers at companies including Microsoft, Meta, and Google, providing the correct sources.
I'm puzzled as to why the initial 41 WARC files didn't yield the correct response.
Here's the link to the files I used:
https://drive.google.com/drive/folders/1eMilyDZ9Bc3HrTuu429DtMM7oqPnbOnH
warc_old.zip contains the 41 WARC files.
Archive.zip contains the WARC files I crawled using Browsertrix crawler.
I have just installed and tested with the test dataset on Chandrayaan-3 of ISRO, using the mistral:latest model. It works nicely, but I've observed that it is a bit slow (my laptop is an i7 with 12 cores and 24 GB RAM, running Ubuntu 22.04). For example, a question takes almost 4 minutes on average to get a response:
[2024-04-02 00:13:22,123] WARNING in api: litellm could not trim messages for ollama/mistral:latest
127.0.0.1 - - [02/Apr/2024 00:18:56] "POST /api/completion HTTP/1.1" 200 -
127.0.0.1 - - [02/Apr/2024 00:26:34] "POST /api/completion HTTP/1.1" 200 -
Whereas when I use the ollama prompt directly, the response is almost instantaneous (but of course it is not contextualized with the test dataset on the subject):
ollama run mistral
>>> What is Chandrayaan-3?
Chandrayaan-3 is a proposed lunar mission by the Indian Space Research Organization (ISRO). It is the third lunar expedition by ISRO, following Chandrayaan-1 and Chandrayaan-2. The primary
objective of Chandrayaan-3 is to soft land a rover on the Moon's South Polar Region to carry out scientific explorations and studies. The mission also includes a orbiter that will map the Moon in
various wavelengths and study its resources and exosphere, as well as a lander. However, as of now, the launch has not been scheduled yet.
Any suggestions?