
Hello! My name is Cauê and I am a computer science student at USP.

  • 🔭 Currently, my main interest is developing with the LangChain framework using Python and Large Language Models, like ChatGPT, as well as building applications that extract high-quality data and feed it into Machine Learning models.
  • 🌱 I am learning how to use various language models, including open-source ones.
  • 📫 You can reach me at the email [email protected].


My projects

Projects I developed as part of a Brazilian government R&D grant program (PIBIT CNPq)

The project builds on the educational capabilities of Large Language Models (e.g., GPT-3.5 and GPT-4) while mitigating weaknesses such as hallucination and lack of knowledge about certain subjects and tests within the Brazilian standardized university admission test (ENEM).

To achieve these results, an LLM application was developed using OpenAI models (gpt-3.5-turbo or gpt-4), along with additional modules, such as internet search and retrieval-augmented generation (RAG), for extra functionality.
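As a rough illustration of how retrieved material can be used to limit hallucination, here is a minimal sketch of assembling a grounded prompt. The function name `build_grounded_prompt`, its parameters, and the prompt wording are assumptions made for this example, not the project's actual code.

```python
def build_grounded_prompt(question, snippets, max_snippets=3):
    """Compose a prompt asking the model to answer only from the given context.

    Hypothetical helper: the snippets would come from a search or RAG module.
    """
    selected = snippets[:max_snippets]
    # Number each snippet so the model can refer to its sources.
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(selected, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting string would then be sent to the chat model as the user or system message.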

According to user feedback, over 60% of users said our solution gives better and more accurate answers than ChatGPT.

An implementation of the educational chatbot described above, built with OpenAI's new custom GPTs service.

Helpful prompts and data extracted from official sources about the ENEM test were used for better results.

RAG over ENEM test questions is handled by a GPT action and its associated API. The API is hosted on AWS API Gateway and uses a Lambda function that takes the user input, embeds it with OpenAI embeddings, and then queries a Qdrant vector database for the N questions most similar to that input, where N is the number of questions the user asked for.
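The retrieval flow described here can be sketched in a dependency-free way. In this sketch, `toy_embed` is a hypothetical stand-in for the OpenAI embedding call, and an in-memory cosine-similarity ranking stands in for the Qdrant query; the event shape (`query` and `n` fields in a JSON body) is also an assumption, not the project's real API contract.

```python
import json
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts of a few keywords."""
    vocabulary = ["cell", "energy", "equation", "war"]
    lowered = text.lower()
    return [float(lowered.count(word)) for word in vocabulary]


def top_n_questions(query, questions, n, embed=toy_embed):
    """Return the n stored questions most similar to the query."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(q)), q) for q in questions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:n]]


def handler(event, context=None, questions=()):
    """Lambda-style entry point: parse the request body, search, return JSON."""
    body = json.loads(event["body"])
    matches = top_n_questions(body["query"], list(questions), int(body["n"]))
    return {"statusCode": 200, "body": json.dumps({"questions": matches})}
```

In a real deployment the embedding and the similarity search would be delegated to the embedding service and the vector database; only the overall shape (parse input, embed, rank, return the top N) is meant to carry over.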

For both educational chatbots (the website and the custom GPT version), I needed a large dataset of ENEM questions and their correct answers, both for RAG and to reduce LLM hallucinations (such as giving the wrong answer to a question), but no such large-scale dataset was available online.

In that context I created this project, which combines PDF data mining through libraries like PyMuPDF2 to transform the ENEM PDFs into either plain text or JSON files (the Extract and Transform steps) with a Qdrant vector database loader that loads the data into the vector store (the Load step). The combination can process either a single test PDF (and its associated answer PDF) or entire folders with multiple tests, loading hundreds of questions at once, while writing metadata and stats about the extraction process (number of extracted questions per year and subject) to a CSV file through a Pandas DataFrame.
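The transform and stats steps can be illustrated with a small sketch that works on text already extracted from a PDF. Everything here is an assumption made for illustration: the `QUESTAO <number>` header pattern, the record fields, and the function names are hypothetical, and the standard-library `csv` module is swapped in for the Pandas DataFrame so the example has no dependencies.

```python
import csv
import io
import re
from collections import Counter

# Assumed layout: each question starts with a line like "QUESTAO 12".
QUESTION_HEADER = re.compile(r"^QUESTAO (\d+)[ \t]*$", re.MULTILINE)


def split_questions(raw_text, year, subject):
    """Split extracted test text into one record per question."""
    matches = list(QUESTION_HEADER.finditer(raw_text))
    records = []
    for index, match in enumerate(matches):
        # A question's body runs until the next header (or the end of the text).
        end = matches[index + 1].start() if index + 1 < len(matches) else len(raw_text)
        records.append({
            "year": year,
            "subject": subject,
            "number": int(match.group(1)),
            "text": raw_text[match.end():end].strip(),
        })
    return records


def attach_answers(records, answer_key):
    """Add the correct alternative from an answer key {number: letter}."""
    return [dict(r, answer=answer_key.get(r["number"])) for r in records]


def stats_csv(records):
    """Count questions per (year, subject) and render the counts as CSV text."""
    counts = Counter((r["year"], r["subject"]) for r in records)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year", "subject", "questions"])
    for (year, subject), total in sorted(counts.items()):
        writer.writerow([year, subject, total])
    return buffer.getvalue()
```

Each record produced this way is a plain dictionary, so it can be dumped to JSON or passed to a vector store loader as payload alongside its embedding.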

Projects I developed to learn new technologies and concepts!

This project aims to collect and update data on cryptocurrencies like Bitcoin and Ethereum, storing the information in CSV files. These files cover extensive periods of trading data collected from the Binance US API.

The main technologies used are AWS (Lambda, API Gateway, EC2 and S3), Apache Airflow for data pipeline orchestration, and Python with Pandas for manipulating the data.
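One step such a pipeline typically needs is merging freshly fetched rows into an existing CSV without duplicating timestamps. The sketch below shows one way to do that with the standard library only; the column names in `FIELDS` and the function names are hypothetical and are not taken from the project or from the Binance US API.

```python
import csv
import io

# Hypothetical column layout for one candle (trading interval) per row.
FIELDS = ["open_time", "open", "high", "low", "close", "volume"]


def read_rows(csv_text):
    """Parse CSV text into a list of dict rows (empty text gives no rows)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def merge_candles(existing_rows, new_rows):
    """Merge rows keyed by open_time; newer data overwrites older duplicates."""
    by_time = {int(row["open_time"]): row for row in existing_rows}
    for row in new_rows:
        by_time[int(row["open_time"])] = row
    # Return rows in chronological order.
    return [by_time[t] for t in sorted(by_time)]


def write_rows(rows):
    """Serialize rows back to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keying on the interval's opening timestamp makes the update idempotent, so a scheduled task can safely re-run over an overlapping time window.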

Here's the architecture of the project/pipeline:

[Image: Airflow pipeline architecture diagram]

Projects I developed as part of the University of São Paulo Scientific Initiation Symposium (SIICUSP 2023)

Project developed as a group for a university electronics class.

The goal of this effort was to integrate Machine Learning models, such as computer vision and text classification, with a robot powered by a microcontroller (ESP-32).

My main contribution was software development for the ESP-32 embedded system, using C++ and modules such as Wi-Fi HTTP request handlers.

Here's the certificate for the Symposium


Technologies I am familiar with:

JavaScript, HTML, Python, GPT, Qdrant, AWS

My social networks:

Caue Paiva's Projects

c_cpp_repo

repository for C and C++ personal libs, macros and other stuff

enem_pdf_parser

This project provides a tool to extract ENEM (Brazilian SAT) tests into parsed .txt and JSON files.

intent_classifier

ML projects based on intent classification, used to develop LangChain agents

querido-diario

📰 Brazilian government gazettes, accessible to everyone.
