aws-samples / text-embeddings-pipeline-for-rag


A pipeline to convert contextual knowledge stored in documents and databases into text embeddings, and store them in a vector store

License: MIT No Attribution


Text Embeddings Pipeline for Retrieval Augmented Generation (RAG)

This solution is a pipeline to convert contextual knowledge stored in documents and databases into text embeddings, and store them in a vector store. Applications built with Large Language Models (LLMs) can perform a similarity search on the vector store to retrieve the contextual knowledge before generating a response. This technique is known as Retrieval Augmented Generation (RAG), and it is often used to improve the quality and accuracy of the responses.
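
To make the similarity search step concrete, here is a minimal sketch in plain Python. It is not the code used by this solution: the vector store here is an in-memory list, the 3-dimensional vectors are made up for illustration, and the helper names `cosine_similarity` and `top_k` are hypothetical. The distance metric the deployed vector store uses is also an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query, store, k=1):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy store: (document text, embedding). Real embeddings have far more dimensions.
store = [
    ("What is text embeddings pipeline?", [0.9, 0.1, 0.0]),
    ("AWS Health provides improved visibility", [0.0, 0.2, 0.9]),
]

# Pretend this is the embedding of the user's question.
query = [0.8, 0.2, 0.1]
best = top_k(query, store, k=1)
print(best[0][1])
```

An application would pass the retrieved text to the LLM as context before asking it to generate a response.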

❗ Warning ❗

  • Review and change the configuration before using this solution in production: the current configuration should not be used in production without further review and adaptation. Many anti-patterns are adopted to save cost, such as disabling backups and multi-AZ deployment.

  • Be mindful of the costs incurred: although this solution is designed to be cost-effective, the resources it deploys still incur charges.

Architecture

(Figure: architecture diagram)

Prerequisites

  1. AWS CDK CLI installed
  2. AWS CLI set up with a default profile
  3. Python v3.11 installed
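
A quick way to check the prerequisites above is sketched below. It assumes the CLIs are on `PATH` under the names `cdk` and `aws`, and it treats any Python version from 3.11 upward as acceptable, which is an assumption rather than something this README states. It does not verify that the AWS CLI has a default profile configured. The helper name `check_prerequisites` is hypothetical.

```python
import shutil
import sys


def check_prerequisites(tools=("cdk", "aws")):
    """Report whether each CLI is on PATH and whether Python is recent enough."""
    status = {tool: shutil.which(tool) is not None for tool in tools}
    # Assumption: versions newer than 3.11 are also fine.
    status["python>=3.11"] = sys.version_info >= (3, 11)
    return status


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```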

Setup

  1. Clone this repository.

  2. Create an EC2 Key Pair named "EC2DefaultKeyPair" in your AWS account.

  3. Install dependencies.

npm install
  4. Bootstrap your AWS account with CDK Toolkit (if not done for your AWS account yet).
cdk bootstrap
  5. Package Lambda function and its dependencies.

    • macOS: sh prepare-lambda-package.sh
    • Windows: .\prepare-lambda-package.ps1
  6. Deploy the CDK stacks.

cdk deploy --all --require-approval never
  7. While waiting for the previous step to complete, go to Amazon Bedrock in us-east-1 and grant access to "Amazon Titan Embeddings G1 - Text".
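
The command-line steps above can be collected into a single ordered list, which is handy when scripting the setup. The sketch below only prints the commands and does not run them. The helper name `setup_commands` is hypothetical, and using the `sh` script on every non-Windows system (not only macOS) is an assumption.

```python
import platform


def setup_commands(system=None):
    """Return the setup commands from this README, in order, as argument lists."""
    system = system or platform.system()
    if system == "Windows":
        package_cmd = [".\\prepare-lambda-package.ps1"]
    else:
        # Assumption: the shell script also works on Linux, not only macOS.
        package_cmd = ["sh", "prepare-lambda-package.sh"]
    return [
        ["npm", "install"],
        ["cdk", "bootstrap"],
        package_cmd,
        ["cdk", "deploy", "--all", "--require-approval", "never"],
    ]


# Print the commands for macOS ("Darwin" is what platform.system() reports there).
for cmd in setup_commands("Darwin"):
    print(" ".join(cmd))
```

Creating the EC2 key pair and granting model access in Amazon Bedrock are not covered by this list and still need to be done separately.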

Walkthrough

  1. There are two ways to upload data to the S3 bucket.

    • (a) Upload a .txt file with some content (sample.txt is an example) to the S3 bucket created by one of the stacks.

    • (b) Start the DMS replication task in the AWS management console. The data from the source database will be replicated to the S3 bucket and stored in .csv files.

      (Figure: Start DMS replication task)

    The Lambda function will create text embeddings of the content in .txt / .csv files and store them in the vector store.

  2. Connect (SSH or instance connect) to the bastion host. Run the following command (and provide the password) to authenticate. The credentials can be found in the "text-embeddings-pipeline-vector-store" secret in AWS Secrets Manager.

psql --port=5432 --dbname=postgres --username=postgres --host=<RDS instance DNS name>
  3. Run \dt to list the database tables. Tables whose names start with the prefix "langchain" are created automatically by LangChain as it creates and stores the embeddings.
                  List of relations
 Schema |          Name           | Type  |  Owner
--------+-------------------------+-------+----------
 public | langchain_pg_collection | table | postgres
 public | langchain_pg_embedding  | table | postgres
 public | upsertion_record        | table | postgres
(3 rows)
  4. The documents and embeddings are stored in the "langchain_pg_embedding" table. The following queries show truncated values, because the actual values are too long to display.

    • SELECT embedding::varchar(80) FROM langchain_pg_embedding;
                                          embedding                                     
      ----------------------------------------------------------------------------------
      [-0.005340576,-0.61328125,0.13769531,0.7890625,0.4296875,-0.13671875,-0.01379394 ...
      [0.59375,-0.23339844,0.45703125,-0.14257812,-0.18164062,0.0030517578,-0.00933837 ...
      (2 rows)
      
    • SELECT document::varchar(80) FROM langchain_pg_embedding;
                                          document                                     
      ----------------------------------------------------------------------------------
      What is text embeddings pipeline?,Text embeddings pipeline allows you to create ...
      AWS Health provides improved visibility into planned lifecycle events ...
      (2 rows)
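
The sample rows above suggest that each row of a .csv file ends up as one document, while a .txt file is stored as text. The sketch below illustrates that idea with the standard library only. It is not the Lambda function shipped in this repository: the helper name `to_documents`, the one-document-per-row rule, and the treatment of a .txt file as a single document are all assumptions made for illustration.

```python
import csv
import io


def to_documents(key, body):
    """Split the content of an uploaded object into texts to embed (hypothetical)."""
    lowered = key.lower()
    if lowered.endswith(".csv"):
        rows = csv.reader(io.StringIO(body))
        # One document per non-empty row, with the cells joined back by commas.
        return [",".join(row) for row in rows if any(cell.strip() for cell in row)]
    if lowered.endswith(".txt"):
        text = body.strip()
        return [text] if text else []
    raise ValueError(f"unsupported file type: {key}")


# Made-up sample content, not taken from the repository.
print(to_documents("faq.csv", "question one,answer one\nquestion two,answer two\n"))
print(to_documents("sample.txt", "Some plain text content.\n"))
```

Each returned string would then be sent to the embedding model and written to the vector store together with its embedding.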
      

Clean Up

  1. Destroy all CDK stacks.
cdk destroy --all

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
