harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

Home Page: https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/

License: MIT License

Languages: Python 45.37%, CSS 12.56%, HTML 2.80%, JavaScript 32.99%, Shell 6.27%
Topics: ai, rag, warc, webarchiving

warc-gpt's Introduction

WARC-GPT

WARC + AI: Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

More info: see the home page link above.

[Demo video: Screen.Recording.2024-03-20.at.5.09.28.PM.mov]

Features

  • Retrieval Augmented Generation pipeline for WARC files
  • Highly customizable: can interact with many different LLMs, providers, and embedding models
  • REST API
  • Web UI
  • Embeddings visualization



Installation

WARC-GPT requires the following machine-level dependencies: Python 3.11 and Poetry.

Use the following commands to clone the project and install its dependencies:

git clone https://github.com/harvard-lil/warc-gpt.git
poetry env use 3.11
poetry install



Configuring the application

This program uses environment variables to handle settings. Copy .env.example into a new .env file and edit it as needed.

cp .env.example .env

See details for individual settings in .env.example.

A few notes:

  • WARC-GPT can interact with both the OpenAI API and Ollama for local inference.
    • Both can be used at the same time, but at least one is needed.
    • By default, the program will try to communicate with Ollama's API at http://localhost:11434.
    • It is also possible to use OpenAI's client to interact with compatible providers, such as HuggingFace's Message API or vLLM. To do so, set values for both OPENAI_BASE_URL and OPENAI_COMPATIBLE_MODEL environment variables.
  • Prompts can be edited directly in the configuration file.
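
For example, pointing WARC-GPT at an OpenAI-compatible provider might look like this in .env (a sketch only: the values are placeholders, and .env.example remains the authoritative list of settings):

# Hypothetical .env excerpt; values are placeholders.
OPENAI_BASE_URL="http://localhost:8000/v1"
OPENAI_COMPATIBLE_MODEL="example-org/example-model"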



Ingesting WARCs

Place the WARC files you would like to explore with WARC-GPT under ./warc and run the following command to:

  • Extract text from all the text/html and application/pdf response records present in the WARC files.
  • Generate text embeddings for this text. WARC-GPT will automatically split text based on the embedding model's context window.
  • Store these embeddings in a vector store, which serves as WARC-GPT's knowledge base.

poetry run flask ingest

# May help with performance in certain cases: only ingest 1 chunk of text at a time.
poetry run flask ingest --batch-size 1

Note: Running ingest clears the ./chromadb folder.



Starting the server

The following command will start WARC-GPT's server on port 5000.

poetry run flask run
# Note: use --port to run on a different port



Interacting with the Web UI

Once the server is started, the application's web UI is available at http://localhost:5000.

Unless RAG search is disabled in settings, the system will try to find relevant excerpts in its knowledge base (populated ahead of time from WARC files via the ingest command) to answer the questions it is asked.

The interface also automatically handles a basic chat history, allowing for few-shot / chain-of-thought prompting.



Interacting with the API

[GET] /api/models

Returns a list of available models as JSON.
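
For instance, listing models from Python (a sketch assuming a local server on port 5000; the requests library is an external dependency, not part of WARC-GPT):

import requests

# List the models WARC-GPT can use for text completion.
response = requests.get("http://localhost:5000/api/models")
response.raise_for_status()
print(response.json())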

[POST] /api/search

Performs a search against the vector store for a given message.

Accepts a JSON body with the following properties:
  • message: User prompt (required)
Returns a JSON array of objects containing the following properties:
  • [].warc_filename: Filename of the WARC file the excerpt comes from.
  • [].warc_record_content_type: Can start with either text/html or application/pdf.
  • [].warc_record_id: Individual identifier of the WARC record within the WARC file.
  • [].warc_record_date: Date at which the WARC record was created.
  • [].warc_record_target_uri: Target URI of the WARC record (the WARC-Target-URI header).
  • [].warc_record_text: Text excerpt.
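
For example, a minimal Python sketch of a search call (same assumptions as above: a local server on port 5000 and the requests library; the question is a placeholder):

import requests

# Ask the vector store for excerpts relevant to a question.
response = requests.post(
    "http://localhost:5000/api/search",
    json={"message": "What is this collection about?"},
)
response.raise_for_status()

# Each result pairs a text excerpt with provenance metadata.
for result in response.json():
    print(result["warc_filename"], result["warc_record_target_uri"])
    print(result["warc_record_text"][:200])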

[POST] /api/complete

Uses an LLM to generate a text completion.

Accepts a JSON body with the following properties:
  • model: One of the models /api/models lists (required)
  • message: User prompt (required)
  • temperature: Defaults to 0.0
  • max_tokens: If provided, caps the number of tokens generated in the response.
  • search_results: Array, output of /api/search.
  • history: A list of chat completion objects representing the chat history. Each object must contain role and content.

Returns a raw text stream as output.
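
Putting the two endpoints together, a sketch of an end-to-end retrieval-augmented call from Python (assumptions as above; the model name is a placeholder, use one returned by /api/models):

import requests

question = "What is this collection about?"

# 1. Retrieve supporting excerpts from the knowledge base.
search = requests.post(
    "http://localhost:5000/api/search",
    json={"message": question},
)
search.raise_for_status()

# 2. Stream a completion grounded in those excerpts.
completion = requests.post(
    "http://localhost:5000/api/complete",
    json={
        "model": "ollama/mistral",  # placeholder model name
        "message": question,
        "temperature": 0.0,
        "search_results": search.json(),
    },
    stream=True,  # /api/complete returns a raw text stream
)
completion.raise_for_status()

for chunk in completion.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)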



Visualizing embeddings

WARC-GPT can generate basic interactive 2D scatter plots (using t-SNE) of the vector stores it creates.

Use the visualize command to do so:

poetry run flask visualize

visualize takes a --questions option, which lets you place questions on the plot:

poetry run flask visualize --questions="Who am I?;Who are you?"



Disclaimer

The Library Innovation Lab is an organization based at the Harvard Law School Library. We are a cross-functional group of software developers, librarians, lawyers, and researchers doing work at the edges of technology and digital information.

Our work is rooted in library principles including longevity, authenticity, reliability, and privacy. Any work that we produce takes these principles as a primary lens. However, due to the nature of exploration and a desire to prototype our work with real users, we do not guarantee service or performance at the level of a production-grade platform for all of our releases. This includes WARC-GPT, which is an experimental boilerplate released under the MIT License.

Successful experimentation hinges on user feedback, so we encourage anyone interested in trying out our work to do so. It is all open-source and available on GitHub.

Please keep in mind:

  • We are an innovation lab leveraging our resources and flexibility to conduct explorations for a broader field. Projects may eventually be passed off to another group, take a totally unexpected turn, or be sunset completely.
  • While we always set priorities around security and privacy, each of those topics is complex in its own right and often requires work at a grand scale. Experiments can sometimes initially prioritize closed-loop feedback over broader questions of security. We will always disclose when this is the case.
  • There are some experiments that are destined to become mainstays in our established platforms and tools. We will also disclose when that’s the case.


warc-gpt's People

Contributors

bensteinberg, matteocargnelutti

warc-gpt's Issues

Use extracted text in WARC resource records

Thanks for this elegant example of how to do RAG with WARC data! I also very much appreciated how the blog post highlighted limitations with citation (which is important for web archives).

I was wondering if it might be useful to use the text/plain WARC resource records that browsertrix-crawler creates from the rendered page (not just scraped from the static HTML). This could be important for social media content where the page is assembled dynamically?

I think it would mostly be a matter of adding some logic to ingest.py to look for records with WARC-Type: resource and then use the URL that's in the WARC-Target-URI header to determine the URL to associate the text with?
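
For instance, something along these lines (a rough, untested sketch using the warcio library, not actual WARC-GPT code):

from warcio.archiveiterator import ArchiveIterator

# Collect the rendered-page text that browsertrix-crawler stores as
# text/plain resource records, recovering the page URL from the
# urn:text:{url} target URI. Illustrative only.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "resource":
            continue

        target_uri = record.rec_headers.get_header("WARC-Target-URI") or ""
        if not target_uri.startswith("urn:text:"):
            continue

        page_url = target_uri[len("urn:text:"):]
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(page_url, len(text))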

Here's an example for the text generated on the initial page render:

WARC/1.1
Content-Type: text/plain
WARC-Target-URI: urn:text:https://genart.social/tags/genuary
WARC-Date: 2024-02-18T16:58:12.661Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:1d657dd4-1b01-4e76-bba2-ea641d74c029>
WARC-Payload-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
WARC-Block-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
Content-Length: 897

Mastodon
Create account
Login
Recent searches
No recent searches
Search options
Not available on genart.social.
genart.social
is part of the decentralized social network powered by
Mastodon
.
...

The WARC-Target-URI could also look like WARC-Target-URI: urn:textFinal:{url} which is text in the page after the behaviors have run. But maybe this would complicate the retrieval step if there are multiple records for the same resource?

warc-gpt unable to find information

Environment:
Apple M1 Pro, macOS 14.3.1, Chrome

I initially uploaded 41 WARC files into WARC-GPT. Among these files was an email containing titles of and links to several papers related to AI. When I queried WARC-GPT about the email's content regarding AI, the system responded that the email did not directly mention AI. Instead, it referenced links to product pages on B&H Photo Video's website for various computer components, such as processors and memory, with encoded parameters specifying the linked products. It was unclear how these components were connected to AI. Although I have some WARC files containing emails from B&H Photo Video, they pertain to cameras, video equipment, etc.

Later, I crawled the page https://arxiv.org/html/2303.08774v5 and ingested it into WARC-GPT. When I asked the system what the email said regarding transformers, it accurately responded that the emails discussed several transformer models, such as "gpt-j-6b," "gpt-neo," "bloom," and "opt," describing them as large-scale autoregressive language models. Some emails covered aspects like training, deployment, alignment, and human data collection for these models, in addition to contributions to datasets. The emails were authored by individuals ranging from researchers and engineers to product managers at companies including Microsoft, Meta, and Google, and the system provided the correct sources.

I'm puzzled as to why the initial 41 WARC files didn't yield the correct response.

Here's the link to the files I used:
https://drive.google.com/drive/folders/1eMilyDZ9Bc3HrTuu429DtMM7oqPnbOnH
warc_old.zip contains the 41 WARC files.
Archive.zip contains the WARC files I crawled using Browsertrix Crawler.

[Screenshot 2024-03-06 at 10:36:35]
[Screenshot 2024-03-06 at 10:43:12]

Slow response

I have just installed WARC-GPT and tested it with the test dataset on ISRO's Chandrayaan-3, using the mistral:latest model. It works nicely, but I've observed that it is a bit slow (my laptop has a 12-core i7 with 24 GB RAM, running Ubuntu 22.04). For example, a question takes almost 4 minutes on average to answer:

[2024-04-02 00:13:22,123] WARNING in api: litellm could not trim messages for ollama/mistral:latest
127.0.0.1 - - [02/Apr/2024 00:18:56] "POST /api/completion HTTP/1.1" 200 -
127.0.0.1 - - [02/Apr/2024 00:26:34] "POST /api/completion HTTP/1.1" 200 -

Whereas when I use the ollama prompt directly, the response is almost instantaneous (but of course it is not contextualized with the test dataset on the subject):

ollama run mistral
>>> What is Chandrayaan-3?
 Chandrayaan-3 is a proposed lunar mission by the Indian Space Research Organization (ISRO). It is the third lunar expedition by ISRO, following Chandrayaan-1 and Chandrayaan-2. The primary 
objective of Chandrayaan-3 is to soft land a rover on the Moon's South Polar Region to carry out scientific explorations and studies. The mission also includes a orbiter that will map the Moon in 
various wavelengths and study its resources and exosphere, as well as a lander. However, as of now, the launch has not been scheduled yet.

Any suggestions?
