
Streamlines the creation of datasets to train a Large Language Model with instruction-input-output triplets. The default configuration fits the requirements of github.com/tloen/alpaca-lora.

Home Page: https://www.linkedin.com/in/ausseil/

License: Other

Python 91.57% Shell 8.43%
alpaca chatgpt llama lora

llm-training-dataset-builder

This project processes sample orders in various formats (XML, JSON, or a PostgreSQL database) and generates question-answer pairs from the order information 😊

The code is designed to be modular and easily customizable, allowing for various pretreatment methods and instruction generation.

Features

  • Supports processing of XML, JSON, and PostgreSQL database input formats.
  • Customizable dataset preprocessing and instruction generation.
  • Option to merge output files into a single file.
  • Configurable parameters via config.py or command-line arguments.

Files and Functions

main.py

This is the main entry point for the program. It handles command-line arguments, processes the input files or database, and generates the output files.
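The command-line handling described above can be sketched with argparse. The flag names below are taken from the example commands in Getting Started; everything else (function name, help texts) is an illustrative assumption, not the repository's exact code:

```python
import argparse

def parse_args(argv=None):
    # Flags mirror the example commands in "Getting Started";
    # main.py's actual implementation may differ.
    parser = argparse.ArgumentParser(
        description="Build an instruction-input-output training dataset.")
    parser.add_argument("--xml", help="Path to a single XML file")
    parser.add_argument("--xmls", help="Directory of XML files")
    parser.add_argument("--json", help="Path to a single JSON file")
    parser.add_argument("--jsons", help="Directory of JSON files")
    parser.add_argument("--user", help="PostgreSQL user")
    parser.add_argument("--password", help="PostgreSQL password")
    parser.add_argument("--ip", help="PostgreSQL host")
    parser.add_argument("--database", help="PostgreSQL database name")
    return parser.parse_args(argv)
```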

config.py

This file contains configuration parameters that are used throughout the project.

  • PARAM_ACTIVATE_CONFIG: Whether to use config.py parameters or command-line arguments (True/False).
  • PARAM_OUTPUT_DIR: The directory where the training set is created.
  • PARAM_OUTPUT_MERGE_FILE: Whether to merge output files (True/False).
  • PARAM_OUTPUT_MERGE_FILE_NAME: The name of the merged output file.
  • PARAM_METHOD: The processing method (values: xmls, xml, jsons, json, database).
  • PARAM_XML_PATH, PARAM_XMLS_PATH, PARAM_JSON_PATH, PARAM_JSONS_PATH: Input file/directory paths for XML and JSON files.
  • PARAM_DATABASE_HOST, PARAM_DATABASE_USER, PARAM_DATABASE_DBNAME, PARAM_DATABASE_DBPASSWORD, PARAM_DATABASE_PORT: PostgreSQL database connection parameters.
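Put together, a config.py along these lines would satisfy the parameters listed above. The values shown are illustrative assumptions, not the repository's defaults:

```python
# Hypothetical config.py sketch; actual values in the repository may differ.
PARAM_ACTIVATE_CONFIG = True          # use these values instead of CLI flags
PARAM_OUTPUT_DIR = "./output/"        # where the training set is created
PARAM_OUTPUT_MERGE_FILE = True        # merge per-input files into one
PARAM_OUTPUT_MERGE_FILE_NAME = "training_set.json"
PARAM_METHOD = "xmls"                 # one of: xmls, xml, jsons, json, database
PARAM_XMLS_PATH = "./input/sample-order-xml/"
PARAM_DATABASE_HOST = "localhost"
PARAM_DATABASE_PORT = 5432
```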

config_parser.py

This file contains functions to process XML, JSON, and PostgreSQL database inputs and generate question-answer pairs based on the dataset.

  • dataset_pretreatment(dataset): Preprocesses the dataset. Can be customized.
  • generate_instructions(dataset): Generates question-answer pairs based on the dataset. Can be customized.
  • process_xml_file(filename): Processes an XML file and generates question-answer pairs.
  • process_json_file(filename): Processes a JSON file and generates question-answer pairs.
  • process_database(user, password, host, port, database): Fetches data from a PostgreSQL database, processes it, and generates question-answer pairs.
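As a rough sketch of the XML path, process_xml_file could parse the file and emit triplets like this. The element names (order, id, total) are assumptions for illustration, not the repository's actual schema, and the real function delegates to dataset_pretreatment() and generate_instructions():

```python
import xml.etree.ElementTree as ET

def process_xml_file(filename):
    """Parse one order XML file and return instruction-input-output triplets.

    Simplified sketch: element names are assumed, and the real
    config_parser.py routes the data through dataset_pretreatment()
    and generate_instructions() instead of building triplets inline.
    """
    tree = ET.parse(filename)
    root = tree.getroot()
    triplets = []
    for order in root.iter("order"):
        order_id = order.findtext("id", default="unknown")
        total = order.findtext("total", default="unknown")
        triplets.append({
            "instruction": f"What is the total price of order {order_id}?",
            "input": "",
            "output": total,
        })
    return triplets
```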

sample_orders_parser.py

This file contains custom functions to pretreat datasets and generate question-answer pairs.

  • remove_duplicates(items_node): Removes duplicate items from the items_node based on their description.
  • update_sku_price(item_node, sku_dict, price_dict): Updates the SKU and price of the item_node based on the description.
  • apply_inflation(order_date, price, quantity): Applies inflation based on the order_date to the price and quantity.
  • calculate_total_price(items_node): Calculates the total price of all items in the items_node.
  • update_items_with_inflation(items, order_date): Updates the items with inflated prices and quantities based on the order_date.
  • generate_general_instructions(dataset): Generates general instructions based on the dataset.
  • generate_item_instructions(item_node): Generates item-specific instructions based on the item_node.
  • dataset_pretreatment_custom(dataset): Custom function to preprocess the dataset.
  • generate_instructions_custom(dataset): Custom function to generate question-answer pairs based on the dataset.
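Two of these helpers can be sketched as follows, assuming items are plain dicts with description/price/quantity fields. The repository operates on XML nodes, so the real signatures and field access differ:

```python
def calculate_total_price(items):
    """Sum price * quantity over all items (dict-based sketch)."""
    return sum(item["price"] * item["quantity"] for item in items)

def remove_duplicates(items):
    """Keep only the first item for each distinct description."""
    seen = set()
    unique = []
    for item in items:
        if item["description"] not in seen:
            seen.add(item["description"])
            unique.append(item)
    return unique
```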

Getting Started

  1. Choose one of the three processing methods:
    • process_xml_file(filename): processes XML files (implemented in the example)
    • process_json_file(filename): processes JSON files (implemented)
    • process_database(user, password, host, port, database): processes records from a PostgreSQL database (implemented)
  2. Modify the dataset_pretreatment(dataset) function to preprocess the data before generating instructions.
  3. Modify the generate_instructions(treated_dataset) function to generate the desired instructions.
  4. To test the provided example, run one of the following commands:
python main.py --xmls=./input/sample-order-xml/

or

python main.py --xml=./input/sample-order-xml/sample-file.xml

For JSON files:

python main.py --jsons=./input/sample-order-json/

or

python main.py --json=./input/sample-order-json/sample-file.json

For PostgreSQL database:

python main.py --user=<db_user> --password=<db_password> --ip=<db_host> --database=<db_name>
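The generated files follow the JSON layout that alpaca-lora trains on: a list of objects with instruction, input, and output keys. A sketch of what an output file might contain (the triplet content is illustrative, not actual program output):

```python
import json

# Illustrative triplets in the alpaca-lora training format; real content
# comes from the parsed orders.
dataset = [
    {
        "instruction": "What is the total price of order 1001?",
        "input": "",
        "output": "The total price of order 1001 is 59.97.",
    },
    {
        "instruction": "List the items in order 1001.",
        "input": "",
        "output": "Order 1001 contains 3 x Widget at 19.99 each.",
    },
]

with open("training_set.json", "w") as f:
    json.dump(dataset, f, indent=2)
```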

Example provided

The example in this program demonstrates a realistic use case that requires a custom parser tailored to a specific business context. Data preprocessing happens in the dataset_pretreatment_custom function, and the question-answer pairs are produced by the generate_instructions_custom function. By following these steps and using the provided code as a reference, you can adapt this program to other input formats and produce instructions tailored to your own needs.

Disclaimer for my employer

This project was developed over one weekend on my personal time.

Author

Pierre-Henri AUSSEIL LinkedIn: linkedin.com/in/ausseil/ GitHub: github.com/ph-ausseil/

About the author

I work in Data Integration (Middleware, DataHub, API...) & Service Management. I am not a developer. I wanted to build a proof of concept for using LLMs in a company stack, so that the LLM would know the business environment and improve the company's decision-making.
