Hello! My name is Cauê and I am a computer science student at USP.

🔭 Currently, my main interest is developing with the LangChain framework using Python and Large Language Models, like ChatGPT, as well as building applications that extract high-quality data and feed it into Machine Learning models.
🌱 I am learning how to use various language models, including open-source ones.
📫 You can reach me at the email [email protected].

My projects

Projects I develop as part of a Brazilian government R&D grant program (PIBIT CNPq)

Educational Chatbot for Brazilian high school students

The project builds on the educational capabilities of Large Language Models (e.g., GPT-3.5 and GPT-4) while mitigating weaknesses such as hallucination and lack of knowledge about certain subjects and tests within the Brazilian standardized university admission exam (ENEM).

To achieve these results, an LLM application was developed using OpenAI models (gpt-3.5 turbo or gpt-4) along with additional modules, such as internet search and retrieval-augmented generation, for extra functionality.

According to feedback, over 60% of users said our solution gives better and more accurate answers than ChatGPT.

CustomGPTs using APIs hosted on AWS

An implementation of the Educational Chatbot described above, using the new OpenAI customGPTs service.

Helpful prompts and data extracted from official sources about the ENEM test were used for better results.

For RAG over ENEM test questions, a GPT action and its associated API were used. The API is hosted on AWS API Gateway and uses a Lambda function that takes user inputs, embeds them with OpenAI embeddings, and then queries the Qdrant vector DB for the N questions most similar to the user input, where N is the number of questions the user asked for.
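As a rough illustration of that request flow (not the project's actual code), the sketch below shows a Lambda-style handler that embeds the user input and returns the N most similar questions. To keep it self-contained, a toy bag-of-words `embed` function stands in for the OpenAI embedding call and an in-memory list stands in for the Qdrant collection; the event shape (`query` and `n` inside a JSON body), the vocabulary, and the sample questions are all assumptions made for this example.

```python
import json
import math

# Toy stand-in for the OpenAI embedding call: a word-count vector over a
# tiny fixed vocabulary. The real service would call an embedding API.
VOCAB = ["cell", "energy", "equation", "triangle", "revolution", "war"]


def embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# In-memory stand-in for the vector DB collection (hypothetical sample data).
QUESTIONS = [
    {"id": 1, "text": "Which organelle supplies energy to the cell"},
    {"id": 2, "text": "Solve the equation for the side of the triangle"},
    {"id": 3, "text": "What caused the revolution and the war"},
]
INDEX = [(q, embed(q["text"])) for q in QUESTIONS]


def top_n(query, n):
    """Return the n stored questions most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [question for question, _ in ranked[:n]]


def lambda_handler(event, context):
    # Assumed event shape: a JSON string body with "query" and optional "n".
    body = json.loads(event["body"])
    matches = top_n(body["query"], int(body.get("n", 1)))
    return {"statusCode": 200, "body": json.dumps({"questions": matches})}
```

In a real deployment the two stand-ins would be replaced by calls to the embedding service and the vector database, while the handler's overall shape (parse input, embed, search, return the top N) would stay the same.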

ETL pipeline for processing PDFs and feeding data into vectorDBs

For the educational chatbots, both the website and the customGPT version, I needed a large dataset of ENEM questions and their correct answers to support RAG and reduce LLM hallucinations (such as giving the wrong answer to a question), but no such large-scale data was available online.

In that context I created this project, which combines PDF/data mining through libraries like PyMuPDF2 to transform the ENEM PDFs into either textual data or JSON files (the Extract and Transform parts) with a Qdrant vector DB loader that loads the data into the vector store (the Load part). The combination can process either single test PDFs (and their associated answer PDFs) or entire folders with multiple tests, loading hundreds of questions at once, while writing metadata and stats about the extraction process (number of extracted questions per year and subject) to a CSV file through a Pandas DataFrame.

Projects I developed to learn new technologies and concepts!
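The Transform and stats steps described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's code: it assumes the text has already been extracted from the PDF, that each question starts with a header of the form `QUESTÃO <number>`, and that the answer key is available as a number-to-letter mapping. It uses the standard `csv` module instead of Pandas so the example has no dependencies, and the function names and sample text are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Assumed header format for each question in the extracted text.
QUESTION_HEADER = re.compile(r"QUESTÃO\s+(\d+)")


def split_questions(raw_text, answers, year, subject):
    """Transform step: cut raw text into one record per question."""
    # With one capture group, split() yields [preamble, number, body, number, body, ...]
    parts = QUESTION_HEADER.split(raw_text)
    records = []
    for number, body in zip(parts[1::2], parts[2::2]):
        num = int(number)
        records.append({
            "year": year,
            "subject": subject,
            "number": num,
            "text": " ".join(body.split()),  # collapse line breaks and extra spaces
            "answer": answers.get(num),      # None if the key has no entry
        })
    return records


def extraction_stats_csv(records):
    """Stats step: count extracted questions per (year, subject) as CSV text."""
    counts = Counter((r["year"], r["subject"]) for r in records)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["year", "subject", "questions"])
    for (year, subject), total in sorted(counts.items()):
        writer.writerow([year, subject, total])
    return buffer.getvalue()


# Hypothetical extracted text and answer key, for illustration only.
sample_text = "header\nQUESTÃO 01\nFirst statement.\nA) x B) y\nQUESTÃO 02\nSecond statement."
records = split_questions(sample_text, {1: "A", 2: "C"}, 2022, "math")
stats = extraction_stats_csv(records)
```

The resulting records are the kind of JSON-ready dictionaries that a loader could then embed and write to the vector store.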

Crypto Data ETL pipeline with Airflow and AWS

This project aims to collect and update data on cryptocurrencies like Bitcoin and Ethereum, storing the information in CSV files. These files cover extensive periods of trading data collected from the Binance US API.

The main technologies used are AWS Cloud (Lambda, API Gateway, EC2 and S3), Apache Airflow for data pipeline orchestration, and Python with Pandas for manipulating the data.

Here's the architecture of the project/pipeline:

Projects I developed as part of the University of São Paulo Scientific Initiation Symposium (SIICUSP 2023)
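As a small sketch of the "collect and update" idea (an assumption-laden illustration, not the pipeline's code), the snippet below turns candlestick rows into CSV records and merges new rows into existing ones by open time. It assumes each row follows the array layout of Binance-style kline responses (open time in milliseconds, then open, high, low, close, volume); the sample values are made up, no network call is made, and the standard `csv` module is used in place of Pandas to keep the example dependency-free.

```python
import csv
import io
from datetime import datetime, timezone

COLUMNS = ["open_time", "open", "high", "low", "close", "volume"]


def klines_to_rows(klines):
    """Convert raw kline arrays into dictionaries with a readable UTC timestamp."""
    rows = []
    for k in klines:
        opened = datetime.fromtimestamp(k[0] / 1000, tz=timezone.utc)
        rows.append({
            "open_time": opened.strftime("%Y-%m-%d %H:%M:%S"),
            "open": k[1], "high": k[2], "low": k[3], "close": k[4], "volume": k[5],
        })
    return rows


def merge_rows(existing, new):
    """Update step: add new candles; a newer row replaces one with the same open_time."""
    merged = {row["open_time"]: row for row in existing}
    merged.update({row["open_time"]: row for row in new})
    return [merged[key] for key in sorted(merged)]


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample candles (hourly), for illustration only.
sample = [
    [1672531200000, "16541.77", "16545.70", "16508.39", "16529.67", "4364.83"],
    [1672534800000, "16529.59", "16556.80", "16525.78", "16551.47", "3590.06"],
]
rows = klines_to_rows(sample)
merged = merge_rows(rows[:1], rows)
```

Keying the merge on `open_time` means a scheduled run can fetch an overlapping window without duplicating candles in the stored file.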

Robot with Computer Vision and Speech Recognition

Project developed as a group for an electronics class at university.

The goal of this effort was to integrate Machine Learning models, such as computer vision and text classification, with a robot powered by a microcontroller (ESP-32).

My main contribution was software development for the ESP-32 embedded system, using C++ and modules such as Wi-Fi HTTP request handlers.

Here's the certificate for the Symposium

Technologies I am familiar with:
