GenAI-Batch-Ingester

The GenAI-Batch-Ingester pattern addresses the complexities and challenges of bulk processing and ingesting documents, both as a historical backfill and on an ongoing basis, at volumes ranging from a few hundred to tens of millions.

This pattern is built on Azure Cloud services, harnessing Azure Functions, Storage Queues, Azure Document Intelligence, Azure OpenAI, CosmosDB, and Azure Storage. It streamlines data ingestion, offers scalability and reliability, and is well suited for operations that apply generative AI language model prompts to documents and store the results for analysis or further processing.

This pattern provides a versatile framework that can be adapted to various scenarios, including extending it to vectorize content and ingest it into vector stores such as the Azure AI Search service.

The current version supports PDF file ingestion.

Caution

  • Do not run this directly on thousands of documents.
  • Get the solution working locally through Azure Functions Core Tools as the first step.
  • Run it first on a very small set (1-2) of small documents (< 10 pages).
  • This pattern assumes incremental addition of new document batches and tracks progress one batch at a time.
  • Incorrect configuration or changes to the code may lead to an unexpected Azure bill.

Running in Dev Container

  • Clone this repository and open it in a VS Code dev container
  • Provision Azure services
    • Azure Storage Account
      • 2 containers and 4 queues (please refer to the Storage Setup section below, which includes a provisioning sketch)
    • Azure CosmosDB
    • Azure OpenAI Service
    • Azure Function App [Python 3.10] (not required unless deploying to Azure)
    • You will need to create a Python virtual environment and install the packages listed in the requirements.txt file before running the func start command. Scripts to install Miniconda are available in the ./azure_functions/scripts directory; run these scripts in bash.
  • Rename the template_local.settings.json file to local.settings.json
  • Update values / keys in local.settings.json

Storage Setup
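
The pattern expects two blob containers (upload and content) and four storage queues (pdf-submit-queue, pdf-polling-queue, non-pdf-submit-queue and chunks-queue), matching the settings table below. The following is a minimal provisioning sketch, assuming those names and the same connection string used for BLOB_CONNECTION_STRING:

# Minimal provisioning sketch; container/queue names assume the settings table below.
# pip install azure-storage-blob azure-storage-queue
import os

from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueServiceClient

conn_str = os.environ["BLOB_CONNECTION_STRING"]  # same connection string used by the app

blob_service = BlobServiceClient.from_connection_string(conn_str)
for container in ["upload", "content"]:          # input and output containers
    try:
        blob_service.create_container(container)
    except ResourceExistsError:
        pass                                     # already provisioned

queue_service = QueueServiceClient.from_connection_string(conn_str)
for queue in ["pdf-submit-queue", "pdf-polling-queue", "non-pdf-submit-queue", "chunks-queue"]:
    try:
        queue_service.create_queue(queue)
    except ResourceExistsError:
        pass                                     # already provisioned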

Azure Functions

A template file is included; rename it to local.settings.json and update the configuration values in it.

Please note that the solution supports single or multiple Azure OpenAI and Document Intelligence endpoints and spreads calls across them. To use multiple endpoints, separate the values with the pipe '|' delimiter; this applies to keys and deployment names as well.

Example below.

"AZURE_OPENAI_DEPLOYMENT_ID": "gpt-35-turbo-1106|gpt-35-turbo-0613|gpt-35-turbo-0125"

| App Setting | Explanation | Example Value | Note |
| --- | --- | --- | --- |
| BLOB_STORAGE_ACCOUNT_UPLOAD_CONTAINER_NAME | Storage container where files to be ingested are added. Ensure the uploaded files follow the 2-level hierarchy username --> batch_id, e.g. system --> 202404161240 | upload | Add each new set of documents with a new batch_id (e.g. 202404161240); this helps track the progress of each document set |
| BLOB_STORAGE_ACCOUNT_LOG_CONTAINER_NAME | | logs | Not required |
| COSMOSDB_LOG_DATABASE_NAME | CosmosDB database where processing logs are saved; can be used to report progress and view errors (if any) | statusdb | This database will be created by the code if it does not exist |
| COSMOSDB_LOG_CONTAINER_NAME | CosmosDB container to store the log artefacts | statuscontainer | This container will be created by the code if it does not exist |
| COSMOSDB_TAGS_CONTAINER_NAME | | tagcontainer | Not required |
| COSMOSDB_TAGS_DATABASE_NAME | | tagdb | Not required |
| COSMOSDB_PROMPT_DATABASE_NAME | CosmosDB database to store the default prompt (and user-defined prompts in future) | promptdb | |
| COSMOSDB_PROMPT_CONTAINER_NAME | CosmosDB container to store the default prompt (and user-defined prompts in future) | promptcontainer | |
| COSMOSDB_URL | CosmosDB endpoint | https://xxxxx.documents.azure.com:443/ | |
| AZURE_FORM_RECOGNIZER_ENDPOINT | Document Intelligence / Form Recognizer endpoint | https://xxxxx.cognitiveservices.azure.com/ | |
| ENRICHMENT_LOCATION | | | Not required |
| AZURE_SEARCH_INDEX | | | Not implemented in this version |
| AZURE_SEARCH_SERVICE_ENDPOINT | | | Not implemented in this version |
| BLOB_STORAGE_ACCOUNT | Azure storage account name | | Input, intermediate and processed files will be saved here |
| BLOB_STORAGE_ACCOUNT_OUTPUT_CONTAINER_NAME | Storage container where outputs are saved | content | Outputs saved here |
| BLOB_STORAGE_ACCOUNT_ENDPOINT | Azure Storage blob endpoint | https://xxxxx.blob.core.windows.net/ | |
| CHUNK_TARGET_SIZE | Token count | 256 | Used for chunking the input document text |
| MERGED_CHUNK_TARGET_SIZE | Token count | 512 | Rolls chunks up into a larger size according to your use case requirements |
| DEPLOYMENT_KEYVAULT_NAME | | | Not required |
| ENABLE_DEV_CODE | | false | Not required |
| ENRICHMENT_ENDPOINT | | | Not required |
| ENRICHMENT_NAME | | | Not required |
| FR_API_VERSION | Document Intelligence API version | 2023-07-31 | This repo uses the prebuilt-layout model offered by Document Intelligence |
| TARGET_PAGES | | | Not required |
| TARGET_TRANSLATION_LANGUAGE | | | Not required |
| FUNCTIONS_WORKER_RUNTIME | | python | |
| AzureWebJobs.parse_html_w_form_rec.Disabled | | true | Not required |
| MAX_SECONDS_HIDE_ON_UPLOAD | A random number of seconds between 1 and this value sets the initial visibility of a message in the queue | 10 | |
| MAX_SUBMIT_REQUEUE_COUNT | Requeue the message until this many attempts are exhausted | 10 | |
| POLL_QUEUE_SUBMIT_BACKOFF | Seconds after which the message to poll Document Intelligence becomes visible in the queue following a successful submission | 10 | |
| PDF_SUBMIT_QUEUE_BACKOFF | Seconds of visibility delay used to requeue the message when the Document Intelligence API returns status_code 429, i.e. throttled | 10 | |
| MAX_POLLING_REQUEUE_COUNT | Keep polling the Document Intelligence API until this many attempts are exhausted | 10 | |
| SUBMIT_REQUEUE_HIDE_SECONDS | Seconds of visibility delay used to requeue the message when the Document Intelligence API returns an unexpected error, such as internal capacity overload | 120 | |
| POLLING_BACKOFF | Backoff used when the Document Intelligence API is still processing the request | 10 | |
| MAX_READ_ATTEMPTS | | 5 | |
| MAX_ENRICHMENT_REQUEUE_COUNT | | 10 | Not required |
| ENRICHMENT_BACKOFF | | 60 | Not required |
| EMBEDDINGS_QUEUE | | | Not implemented in this version |
| MEDIA_SUBMIT_QUEUE | | | Not required |
| NON_PDF_SUBMIT_QUEUE | | non-pdf-submit-queue | Not implemented in this version |
| PDF_POLLING_QUEUE | Azure storage queue name | pdf-polling-queue | Polls Document Intelligence for the resultID received after submission |
| PDF_SUBMIT_QUEUE | Azure storage queue name | pdf-submit-queue | PDFs to process |
| CHUNKS_QUEUE | Azure storage queue name | chunks-queue | Individual merged chunks to be processed |
| TEXT_ENRICHMENT_QUEUE | | | Not required |
| IMAGE_ENRICHMENT_QUEUE | | | Not required |
| AZURE_SEARCH_SERVICE_KEY | | | Not implemented in this version |
| AZURE_BLOB_STORAGE_KEY | Azure storage key | | |
| BLOB_CONNECTION_STRING | Azure storage connection string | DefaultEndpointsProtocol=https;AccountName=xxxxx;AccountKey=xxxxx;EndpointSuffix=core.windows.net | |
| AzureWebJobsStorage | Azure storage connection string | DefaultEndpointsProtocol=https;AccountName=xxxxx;AccountKey=xxxxx;EndpointSuffix=core.windows.net | Azure Web Jobs Storage |
| COSMOSDB_KEY | Azure CosmosDB key | | |
| AZURE_FORM_RECOGNIZER_KEY | Azure Document Intelligence key | | |
| ENRICHMENT_KEY | | | Not required |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint | https://xxxxx.openai.azure.com | |
| AZURE_OPENAI_KEY | Azure OpenAI key | | |
| AZURE_OPENAI_DEPLOYMENT_ID | Azure OpenAI deployment name | gpt-35-turbo-1106 | |
| AZURE_OPENAI_API_VERSION | Azure OpenAI API version | 2024-02-01 | |
| AZURE_OPENAI_TEMPERATURE | Temperature | 0 | |
| AZURE_OPENAI_TOP_P | Top P | 0.95 | |
| AZURE_OPENAI_MAX_TOKENS | Maximum tokens | 200 | |
| AZURE_OPENAI_SYSTEM_MESSAGE | System message | You are AI assistant. Do not make up facts. | |
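
For example, a new batch can be uploaded to the upload container using the 2-level hierarchy described for BLOB_STORAGE_ACCOUNT_UPLOAD_CONTAINER_NAME. The sketch below is illustrative only; the local file paths and the use of 'system' as the username level are assumptions:

import os
from datetime import datetime

from azure.storage.blob import BlobServiceClient

# Illustrative sketch: upload a small batch of PDFs under <username>/<batch_id>/.
conn_str = os.environ["BLOB_CONNECTION_STRING"]
container_name = os.environ.get("BLOB_STORAGE_ACCOUNT_UPLOAD_CONTAINER_NAME", "upload")

batch_id = datetime.utcnow().strftime("%Y%m%d%H%M")  # e.g. 202404161240
container = BlobServiceClient.from_connection_string(conn_str).get_container_client(container_name)

for local_path in ["./docs/sample1.pdf", "./docs/sample2.pdf"]:  # hypothetical files
    blob_name = f"system/{batch_id}/{os.path.basename(local_path)}"
    with open(local_path, "rb") as data:
        container.upload_blob(name=blob_name, data=data, overwrite=True)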

Deploy Azure Functions

Open a terminal in VS Code where this project is open as a dev container

az login --tenant <TENANT_ID>

az account set --subscription <SUBSCRIPTION_ID>

cd azure_functions

func azure functionapp publish <NAME_OF_YOUR_FUNCTION_APP> --publish-local-settings
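
Note that the --publish-local-settings flag also pushes the values from local.settings.json into the Function App's application settings, so double-check keys and endpoints in that file before publishing.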

Reviewing Processing Logs

Each chunk's processing is logged to CosmosDB, which is handy for reviewing completion status, troubleshooting, and benchmarking processing performance. The CosmosDB NoSQL queries below can help with this.

--merged_chunks processed so far

SELECT VALUE COUNT(1) FROM c WHERE c.doc_type = 'chunk_log' and c.chunk_state = 'Complete'

--merged_chunks with errors

SELECT * FROM c WHERE c.doc_type = 'chunk_log' and c.chunk_state = 'Error'

--merged_chunks that were Throttled

SELECT * FROM c WHERE c.doc_type = 'chunk_log' and c.chunk_state = 'Throttled'

--merged_chunks that were Skipped for any other reason / unexpected response from the LLM API

SELECT * FROM c WHERE c.doc_type = 'chunk_log' and c.chunk_state = 'Skipped'
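
The same checks can be run programmatically. Below is a minimal sketch using the azure-cosmos Python SDK, assuming the statusdb / statuscontainer names from the settings table above:

import os

from azure.cosmos import CosmosClient

# Count merged chunks processed so far (doc_type = 'chunk_log', chunk_state = 'Complete').
client = CosmosClient(os.environ["COSMOSDB_URL"], credential=os.environ["COSMOSDB_KEY"])
container = client.get_database_client("statusdb").get_container_client("statuscontainer")

query = ("SELECT VALUE COUNT(1) FROM c "
         "WHERE c.doc_type = 'chunk_log' AND c.chunk_state = 'Complete'")
completed = list(container.query_items(query, enable_cross_partition_query=True))[0]
print(f"merged_chunks processed so far: {completed}")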

