neo4j-labs / llm-graph-builder

Neo4j graph construction from unstructured data using LLMs

Home Page: https://neo4j.com/labs/genai-ecosystem/llm-graph-builder/

License: Apache License 2.0

Python 19.02% HTML 0.06% TypeScript 32.45% CSS 0.71% Jupyter Notebook 46.78% JavaScript 0.04% Dockerfile 0.19% Roff 0.74%
data-import genai graph graph-rag graph-search graphdb graphrag knowledge-graph langchain neo4j

llm-graph-builder's Introduction

Knowledge Graph Builder App

Creating knowledge graphs from unstructured data

Python · FastAPI · React

Overview

This application is designed to turn unstructured data (PDFs, DOCs, TXT files, YouTube videos, web pages, etc.) into a knowledge graph stored in Neo4j. It uses the power of large language models (OpenAI, Gemini, etc.) to extract nodes, relationships and their properties from the text, and creates a structured knowledge graph using the LangChain framework.

Upload your files from your local machine, a GCS or S3 bucket, or web sources, choose your LLM model, and generate a knowledge graph.

Key Features

  • Knowledge Graph Creation: Transform unstructured data into structured knowledge graphs using LLMs.
  • Providing Schema: Provide your own custom schema, or use the existing schema in the settings, to generate the graph.
  • View Graph: View the graph for a particular source, or for multiple sources at a time, in Bloom.
  • Chat with Data: Interact with your data in a Neo4j database through conversational queries; also retrieve metadata about the sources of the responses to your queries.

Getting started

⚠️ You will need a Neo4j database, v5.15 or later, with APOC installed to use this Knowledge Graph Builder. You can use any Neo4j Aura database (including the free tier). If you are using Neo4j Desktop, you will not be able to use docker-compose; instead, follow the section on running the backend and frontend separately. ⚠️
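
As a quick sanity check, you can confirm the database version and that APOC is present before you start. A minimal sketch using the official neo4j Python driver; the URI and credentials are placeholders:

from neo4j import GraphDatabase

# Placeholders: point this at your own instance.
driver = GraphDatabase.driver("neo4j+s://<your-instance>.databases.neo4j.io",
                              auth=("neo4j", "<password>"))
with driver.session() as session:
    # Expect a 5.15+ version string here.
    print(session.run("CALL dbms.components() YIELD name, versions RETURN name, versions").data())
    # This raises an error if APOC is not installed.
    print(session.run("RETURN apoc.version() AS apoc").data())
driver.close()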

Deployment

Local deployment

Running through docker-compose

By default, only OpenAI and Diffbot are enabled, since Gemini requires extra GCP configuration.

In your root folder, create a .env file with your OPENAI and DIFFBOT keys (if you want to use both):

OPENAI_API_KEY="your-openai-key"
DIFFBOT_API_KEY="your-diffbot-key"

If you only want OpenAI:

LLM_MODELS="diffbot,openai-gpt-3.5,openai-gpt-4o"
OPENAI_API_KEY="your-openai-key"

If you only want Diffbot:

LLM_MODELS="diffbot"
DIFFBOT_API_KEY="your-diffbot-key"

You can then run Docker Compose to build and start all components:

docker-compose up --build

Additional configs

By default, the input sources are: local files, YouTube, Wikipedia, AWS S3 and web pages, i.e. this default config is applied:

REACT_APP_SOURCES="local,youtube,wiki,s3,web"

If, however, you want the Google GCS integration, add gcs and your Google client ID:

REACT_APP_SOURCES="local,youtube,wiki,s3,gcs,web"
GOOGLE_CLIENT_ID="xxxx"

You can of course combine all of them (local, youtube, wiki, s3, gcs and web) or remove any you don't want/need.

Chat Modes

By default, all chat modes are available: vector, graph+vector and graph. If no mode is specified in the chat modes variable, all modes will be available:

CHAT_MODES=""

If, however, you want to make only certain modes available, you can do so by specifying them in the env:

CHAT_MODES="vector,graph+vector"

Running Backend and Frontend separately (dev environment)

Alternatively, you can run the backend and frontend separately:

  • For the frontend:
  1. Create the frontend/.env file by copy/pasting the frontend/example.env.
  2. Change values as needed
  3. cd frontend
    yarn
    yarn run dev
  • For the backend:
  1. Create the backend/.env file by copy/pasting the backend/example.env.
  2. Change values as needed
  3. cd backend
    python -m venv envName
    source envName/bin/activate 
    pip install -r requirements.txt
    uvicorn score:app --reload
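
If you are unsure what to put in backend/.env, a minimal sketch (the values are placeholders; the variable names follow the ENV table below):

OPENAI_API_KEY="sk-..."
DIFFBOT_API_KEY="your-diffbot-key"
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="password"

With these defaults, the backend API docs should then be reachable at http://localhost:8000/docs.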

Deploy in Cloud

To deploy the app and packages on Google Cloud Platform, run the following commands with Google Cloud Run (answer the interactive prompts as shown):

# Frontend deploy
gcloud run deploy
Source location: current directory > frontend
Region: 32 [us-central1]
Allow unauthenticated requests: Yes

# Backend deploy
gcloud run deploy --set-env-vars "OPENAI_API_KEY=" --set-env-vars "DIFFBOT_API_KEY=" --set-env-vars "NEO4J_URI=" --set-env-vars "NEO4J_PASSWORD=" --set-env-vars "NEO4J_USERNAME="
Source location: current directory > backend
Region: 32 [us-central1]
Allow unauthenticated requests: Yes

ENV

| Env Variable Name | Mandatory/Optional | Default Value | Description |
|---|---|---|---|
| OPENAI_API_KEY | Mandatory | | API key for OpenAI |
| DIFFBOT_API_KEY | Mandatory | | API key for Diffbot |
| EMBEDDING_MODEL | Optional | all-MiniLM-L6-v2 | Model for generating the text embedding (all-MiniLM-L6-v2, openai, vertexai) |
| IS_EMBEDDING | Optional | true | Flag to enable text embedding |
| KNN_MIN_SCORE | Optional | 0.94 | Minimum score for the KNN algorithm |
| GEMINI_ENABLED | Optional | False | Flag to enable Gemini |
| GCP_LOG_METRICS_ENABLED | Optional | False | Flag to enable Google Cloud logs |
| NUMBER_OF_CHUNKS_TO_COMBINE | Optional | 5 | Number of chunks to combine when processing embeddings |
| UPDATE_GRAPH_CHUNKS_PROCESSED | Optional | 20 | Number of chunks processed before updating progress |
| NEO4J_URI | Optional | neo4j://database:7687 | URI of the Neo4j database |
| NEO4J_USERNAME | Optional | neo4j | Username for the Neo4j database |
| NEO4J_PASSWORD | Optional | password | Password for the Neo4j database |
| LANGCHAIN_API_KEY | Optional | | API key for LangChain |
| LANGCHAIN_PROJECT | Optional | | Project for LangChain |
| LANGCHAIN_TRACING_V2 | Optional | true | Flag to enable LangChain tracing |
| LANGCHAIN_ENDPOINT | Optional | https://api.smith.langchain.com | Endpoint for the LangChain API |
| BACKEND_API_URL | Optional | http://localhost:8000 | URL of the backend API |
| BLOOM_URL | Optional | https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph&featureGenAISuggestions=true&featureGenAISuggestionsInternal=true | URL for Bloom visualization |
| REACT_APP_SOURCES | Optional | local,youtube,wiki,s3 | List of input sources that will be available |
| LLM_MODELS | Optional | diffbot,openai-gpt-3.5,openai-gpt-4o | Models available for selection on the frontend, used for entity extraction and Q&A |
| CHAT_MODES | Optional | vector,graph+vector,graph | Chat modes available for Q&A |
| ENV | Optional | DEV | Environment for the app |
| TIME_PER_CHUNK | Optional | 4 | Time per chunk for processing |
| CHUNK_SIZE | Optional | 5242880 | Size of each file chunk for upload |
| GOOGLE_CLIENT_ID | Optional | | Client ID for Google authentication |
| GCS_FILE_CACHE | Optional | False | If True, save the files to process in GCS; if False, save them locally |
| ENTITY_EMBEDDING | Optional | False | If True, add embeddings for each entity in the database |
| LLM_MODEL_CONFIG_ollama_<model_name> | Optional | | Set Ollama config as model_name,model_local_url for local deployments |
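
For example, to register a local Ollama model, the last variable would look something like this (the model name is illustrative; 11434 is Ollama's default port):

LLM_MODEL_CONFIG_ollama_llama3="llama3,http://localhost:11434"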

Usage

  1. Connect to a Neo4j Aura instance by passing the URI and password, or by using a Neo4j credentials file.
  2. Choose a source from the list of unstructured sources to create a graph.
  3. Change the LLM from the dropdown (if required); it will be used to generate the graph.
  4. Optionally, define the schema (node and relationship labels) in the entity graph extraction settings.
  5. Either select multiple files to 'Generate Graph', or all files in 'New' status will be processed for graph creation.
  6. Have a look at the graph for individual files using 'View' in the grid, or select one or more files and 'Preview Graph'.
  7. Ask the chatbot questions about the processed/completed sources, and get detailed information about the answers generated by the LLM.

Links

LLM Knowledge Graph Builder Application

Neo4j Workspace

Reference

Demo of application

Contact

For any inquiries or support, feel free to raise a GitHub issue.

Happy Graph Building!

llm-graph-builder's People

Contributors

aashipandya, abhishekkumar-27, eltociear, im-ajaymeena, jexp, karanchellani, kartikpersistent, manjupatel1, msenechal, nielsdejong, prakriti-solankey, praveshkumar1988, rakshita-arora, tomasonjo, vasanthasaikalluri


llm-graph-builder's Issues

GCS support

Handle GCS support and move the cloud selection out of dropZone.

Backend API

When adding :Source nodes to the graph to represent the files, add a /sources/list endpoint that returns the list of sources ordered by updatedAt descending, together with all the metadata that was added/updated when creating the nodes (see the sketch after this list):

  • fileName (for the time being this can be the id - unique constraint)
  • fileType
  • fileSize
  • createdAt
  • updatedAt
  • processingTime
  • status
  • errorMessage
  • nodeCount
  • relationshipCount
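
A minimal sketch of what such an endpoint could look like with FastAPI and the neo4j Python driver; the :Source property names are taken from the list above, and the driver setup is illustrative:

from fastapi import FastAPI
from neo4j import GraphDatabase

app = FastAPI()
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

@app.get("/sources/list")
def sources_list():
    # Return all :Source metadata, newest first.
    query = """
    MATCH (s:Source)
    RETURN s.fileName AS fileName, s.fileType AS fileType, s.fileSize AS fileSize,
           s.createdAt AS createdAt, s.updatedAt AS updatedAt,
           s.processingTime AS processingTime, s.status AS status,
           s.errorMessage AS errorMessage, s.nodeCount AS nodeCount,
           s.relationshipCount AS relationshipCount
    ORDER BY s.updatedAt DESC
    """
    with driver.session() as session:
        return [record.data() for record in session.run(query)]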

Front-end pass neo4j connection information to backend

If separate connection information is provided in the front-end, it should be passed to the backend in a suitable way when making requests.

e.g. for processing files, the connection information from the front-end (if available) should be passed on as an extra nested payload and used in the processing.

The same applies to listing sources for the table: it should use the front-end connection information.

If the backend is configured with a Neo4j connection but the front-end is not connected, it should still work, automatically using the backend's connection config inside the backend. One possible payload shape is sketched below.
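
A sketch of that nested payload as a Pydantic model; the field names here are assumptions, not the actual API:

from typing import Optional
from pydantic import BaseModel

class Neo4jConnection(BaseModel):
    # Front-end connection details, forwarded with each request (names assumed).
    uri: str
    userName: str
    password: str

class ExtractRequest(BaseModel):
    file_name: str
    model: str
    # Optional: when absent, the backend falls back to its own env-configured connection.
    connection: Optional[Neo4jConnection] = None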

Frontend- Code Cleanup

Need to create a common function to call the APIs, plus some function name changes and logic formatting.

Add Access key and secret key Check

If the secret key and access key are already available in the source node, add a check for their existence; if they are present, show the file as available for processing/New.

Bug: Issue on populating data of multiple files.

Working on handling a bug found while testing:
When all the files are processing, their records populate correctly in the table. But when, say, processing is ongoing for 3 files and I upload a new large file in "New" status that has not started processing, the data of the records in the UI table gets shuffled.
On refresh they revert to their original data.

front-end-backend communication

There seems to be a CORS issue.

-> OK, this seems to be related to GH Codespaces; we need to make the backend URL public to make it work for the time being. It should be resolved when running with Docker or deploying elsewhere.
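
For reference, the usual way to allow cross-origin requests in a FastAPI backend is the CORS middleware; a sketch (whether the project configures it exactly like this is an assumption):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow the front-end origin to call the API; "*" is fine for development only.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)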

But it is also connecting to the wrong back-end? Not sure if you hard-coded it, but it should just connect to localhost:8000 on the machine where the backend is running, or to the configured base URL.

It seems you have that hard-coded:

https://github.com/neo4j-labs/llm-graph-builder/blob/main/frontend/src/components/DropZone.tsx#L10

  1. it should not be hard-coded but come from an .env file (also provide an example.env)
  2. it should not just be hidden inside a UI component but live in a proper backend/REST API component!
  3. there should be a health check that validates that the backend is correctly running, and indicate that to the user!
Access to XMLHttpRequest at 'https://animated-space-broccoli-jpgjg6pg59qcp7pg-8000.app.github.dev/predict' from origin 'https://studious-dollop-979pxr45x3p4p4-5173.app.github.dev' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

POST https://animated-space-broccoli-jpgjg6pg59qcp7pg-8000.app.github.dev/predict net::ERR_FAILED 404 (Not Found)

my backend is running on: https://studious-dollop-979pxr45x3p4p4-8000.app.github.dev/docs

Later: perhaps mid-term we can even serve the frontend assets from the backend as static assets.

Data Model Cleanup

  • rename :Source -> :Document
  • invert the HAS_CHILD relationship: (:Chunk)-[:PART_OF]->(:Document)
  • add a single first relationship: (:Document)-[:FIRST_CHUNK]->(:Chunk)
  • create a NEXT relationship between the chunks of each document (see the sketch below)
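
A rough sketch of the chunk linking, run via the Python driver; the chunk `position` property is an assumption, and apoc.nodes.link creates the NEXT chain:

from neo4j import GraphDatabase

# Link the chunks of each document in order and mark the first one.
LINK_CHUNKS = """
MATCH (d:Document)<-[:PART_OF]-(c:Chunk)
WITH d, c ORDER BY c.position
WITH d, collect(c) AS chunks
CALL apoc.nodes.link(chunks, 'NEXT')
WITH d, head(chunks) AS first
MERGE (d)-[:FIRST_CHUNK]->(first)
"""

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(LINK_CHUNKS)
driver.close()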

Create a progress bar for the Generate button

Add some sort of feedback when the user clicks "Generate Graph": the button should show that the files are processing and then indicate completion once the job is done.

Backend configuration

Can we make all environment variables uppercase, and add a section to the backend readme on configuration and the env file?

Also call out in the file and the readme what is optional and what can be overridden, e.g. from the client.

It should be more aligned with the usual style of config variables that we use elsewhere:

#OPENAI_API_KEY="sk-..."
DIFFBOT_API_KEY=""
NEO4J_URI=""
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD=""

Model selection state management issue

Handle the model bug fix.
Change the failed-response alert position from top center to bottom left.
Add a check for the disabled state of the Generate Graph button and dropdown.

Bug Fixing for Frontend UI

  1. Fix the white-space issue by dynamically adjusting the height, so the table stays responsive even after changing the page size
  2. Fix auto page shifting
  3. New items should be shown on the first page rather than the last
  4. If the file is already processed, show it as Completed
  5. Remove the extra check for disabling the dropdown and Generate Graph button
  6. The connection modal should be displayed if the user is not connected to a Neo4j database
  7. Add the Neo4j favicon

Frontend Connection Status

The front end should indicate whether the back-end is running.

Right now it shows the file-drop area if Neo4j/the backend is connected, but there should be a clearer indication.

Update the readme

  • instructions how to run / deploy / configure
  • link to the public google cloud run URL + link to neo4j workspace
  • list of features (upload, s3/gcs, connection to neo4j, file + chunk handling, extract entities with different models, create embeddings, create kNN graph)
  • screenshot or short animated gif
  • graph model
  • screenshot of the graph model in neo4j workspace + query that I shared

Backend API

change the API name from /predict to /extract

  • spell out knowledge graph in the description

  • rename the body object in the docs to something more consistent and descriptive than Body_kg_creation_predict_post

also add metadata about the file:

  • filename
  • file-size
  • if available file date
  • store those on a :Source node (or equivalent, if the graph transformer already creates a metadata node) in the graph

and in the response, at least prepare the numeric processingTime, nodeCount and relationshipCount response fields, plus status and errorMessage.

Backend URL handling

You have an inconsistency in how you use BACKEND_URL -> url(): sometimes {url()}sources, sometimes {url()}/extract.
I changed it now to always use a slash, i.e. {url()}/extract,
so the environment variable has to be set without a trailing slash: export BACKEND_API_URL="https://studious-dollop-979pxr45x3p4p4-8000.app.github.dev"
Ideally, url() would strip trailing slashes.

Duplicate entities & PDF processing fails with 422


I tried to run the app; it still creates duplicates of the file with the same name,

and when trying to process the file I get a 422 error:

backend   | INFO:     172.18.0.1:44740 - "GET /sources_list HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:44732 - "GET /health HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:44748 - "GET /health HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:44754 - "GET /sources_list HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:35078 - "POST /sources HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:55262 - "POST /extract HTTP/1.1" 422 Unprocessable Entity
