
jpllm's Introduction

Introduction:

This research project explores LLM behavior on code-switched English-Spanish corpora. Foundational work in this area includes the LinCE benchmark for code-switching tasks (Aguilar, Kar, Solorio). Existing work leaves two gaps. First, LinCE does not cover generative tasks, and with the rise of generative models, examining generative LLM behavior on code-switched data is important. Second, much of the existing data is pulled from short social media interactions on Twitter (now X). Short tweets or posts may not accurately represent human interaction, even though human speech data is available. Using the Bangor Miami Corpus, I can validate and/or fine-tune Sagor Sarker's code-switched models (GitHub: sagorbrur). In this way, I can extend existing research toward both generative tasks and human speech data.

Files: job*.slurm files are compute requests for the clusters I use. You can use them as guidelines for the RAM, CPU, and GPU needs of the code I am running.

prompting.py files send different kinds of prompts to the Mixtral-8x7B Instruct model. The prompts can be in Spanish, in English, or a mix of the two.

The databricks files contain a large instruction dataset for LLMs, databricks-dolly-15k (https://huggingface.co/datasets/databricks/databricks-dolly-15k). This project may use it in addition to the human speech data.

prompts.tsv contains prompts adapted from a similar study on Southeast Asian languages. References to out-of-scope languages (anything other than English or Spanish) are changed to refer to the English/Spanish language pair. Cultural and geographic references are adapted in the same way.

References:

Sagor Sarker: https://github.com/sagorbrur/codeswitch and https://huggingface.co/sagorsarker/codeswitch-spaeng-lid-lince/tree/main

Aguilar, Kar, Solorio: https://arxiv.org/abs/2005.04322

Bangor Miami Corpus: http://bangortalk.org.uk/speakers.php?c=miami

Prompting Data: https://github.com/Southeast-Asia-NLP/LLM-Code-Mixing

jpllm's People

Contributors

ctarnold


jpllm's Issues

Investigate and implement appropriate hyperparameters (temperature, etc.).

In web Mistral interfaces, the model code-switches to a limited but consistent degree. In my implementation, I am seeing no code-switching. I suspect the decoding settings: the temperature is probably left at its default of 1 and, if sampling is disabled, the model decodes greedily, which makes the output deterministic and may explain the default to English. A preliminary attempt to set the temperature to a value between 0 and 1 threw an error.
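To make the role of temperature concrete, here is a minimal sketch in plain Python. It is not the project's code: the two-token "vocabulary" and its scores are made up for illustration, and the function names are hypothetical. It shows that greedy decoding always picks the top-scoring token, while sampling can pick the lower-scoring one, and that a lower temperature makes the lower-scoring token less likely. If the implementation uses Hugging Face transformers (an assumption), sampling is typically enabled by passing `do_sample=True` to `generate` alongside `temperature`.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Temperature 0 is undefined here; greedy decoding covers that case.
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Deterministic choice: always the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Random choice weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical next-token scores: index 0 = an English token, index 1 = a Spanish token.
toy_logits = [2.0, 1.0]

if __name__ == "__main__":
    rng = random.Random(0)
    print("greedy:", greedy_pick(toy_logits))
    for t in (0.5, 1.0):
        print("temperature", t, "->", softmax_with_temperature(toy_logits, t))
    print("sampled:", [sample_pick(toy_logits, 1.0, rng) for _ in range(10)])
```

Under these assumptions, a model that slightly prefers English at every step would never emit the Spanish token with greedy decoding, but would emit it some of the time once sampling is on.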

Clean slurm scripts

Make the Slurm scripts consistent in requested time, memory, etc. So far, 30 minutes is a reasonable time limit for a few messages to the model, 190GB a reasonable CPU memory limit, and 80GB a reasonable GPU memory limit.

Finish prompt transcription

Transcribe the linked prompting data from the Southeast Asia code-mixing study and adapt it to an English/Spanish context.

Prompt Mixtral w/ human speech data

Because the input must conform to the user/assistant chat template, we can alternate roles across the dialogue: whenever the speaker changes, switch from user to assistant or vice versa. Steps:

1. Clean the corpus data into the chat template
2. Provide the chat template to the model
3. Collect and save the output
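The first step can be sketched as follows. This is a hedged illustration, not the project's code: it assumes the corpus has already been parsed into (speaker, utterance) pairs, the helper names are hypothetical, and the sample utterances are invented rather than taken from the Bangor Miami Corpus. Consecutive utterances by the same speaker are merged so that roles strictly alternate, which instruct-style chat templates generally expect.

```python
def turns_to_messages(turns):
    """Map (speaker, utterance) pairs to alternating user/assistant messages.

    Consecutive utterances by the same speaker are merged into one message,
    and the role flips every time the speaker changes.
    """
    messages = []
    previous_speaker = None
    for speaker, utterance in turns:
        text = utterance.strip()
        if not text:
            continue  # skip empty utterances
        if messages and speaker == previous_speaker:
            messages[-1]["content"] += " " + text
        else:
            # The first message is "user"; roles alternate from there.
            role = "user" if len(messages) % 2 == 0 else "assistant"
            messages.append({"role": role, "content": text})
        previous_speaker = speaker
    return messages


def trim_to_user_turn(messages):
    """Drop trailing assistant messages so the model is asked to answer a user turn."""
    trimmed = list(messages)
    while trimmed and trimmed[-1]["role"] != "user":
        trimmed.pop()
    return trimmed


# Invented example dialogue (not corpus data).
example_turns = [
    ("A", "so what did you do yesterday?"),
    ("B", "fui al mercado"),
    ("B", "and then I went home"),
    ("A", "qué bueno"),
    ("B", "yeah it was nice"),
]

if __name__ == "__main__":
    for message in trim_to_user_turn(turns_to_messages(example_turns)):
        print(message["role"], ":", message["content"])
```

The resulting list of dictionaries is the shape that chat-template tooling usually consumes (for example, a tokenizer's `apply_chat_template` in Hugging Face transformers, if that is the library in use). The model's reply can then be collected and saved alongside the prompt.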

Implement lang ID

Once the model shows code-switching behavior (at least some, as in the web interface), add language identification via the BERT model. Test on the Bangor Miami Corpus or the LinCE benchmark.
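Once token-level language tags are available, one simple way to quantify code-switching in a model response is to count the points where the language changes. The sketch below assumes the tagger yields one tag per token and that the two languages are labeled "en" and "spa"; both the tag names and the function name are assumptions, so adjust them to whatever the language ID model actually emits.

```python
def count_switch_points(tags, languages=("en", "spa")):
    """Count language changes between consecutive language-tagged tokens.

    Tags outside `languages` (punctuation, named entities, and so on) are
    skipped, so they neither create nor hide a switch.
    """
    switches = 0
    previous = None
    for tag in tags:
        if tag not in languages:
            continue
        if previous is not None and tag != previous:
            switches += 1
        previous = tag
    return switches


# Hypothetical tag sequence for one model response.
example_tags = ["en", "en", "other", "spa", "spa", "en"]

if __name__ == "__main__":
    print(count_switch_points(example_tags))
```

A response with zero switch points is monolingual under this measure, which gives a quick check for the lack of code-switching described in the hyperparameter issue.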

Parallel Computations

Eventually, scaling to large instruction datasets will require parallel computation for efficiency (e.g., split 1000 prompts into sets of 50, 100, 250, or 500, depending on cluster availability; test experimentally).
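The batching idea can be sketched as below. This is an illustration under stated assumptions: `run_batch` is a hypothetical placeholder for the real model call, and a local thread pool stands in for whatever the cluster provides (in practice each batch might instead be submitted as its own Slurm job).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(prompts, batch_size):
    """Split a list of prompts into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]


def run_batch(batch):
    """Placeholder for the real model call; here it only returns prompt lengths."""
    return [len(prompt) for prompt in batch]


def run_all(prompts, batch_size, max_workers=4):
    """Process all batches in parallel and return results in the original order."""
    batches = split_into_batches(prompts, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the batches were submitted.
        batch_results = list(pool.map(run_batch, batches))
    return [result for batch in batch_results for result in batch]


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(1000)]
    for size in (50, 100, 250, 500):
        print(size, "->", len(split_into_batches(prompts, size)), "batches")
```

Timing `run_all` for each candidate batch size is one way to run the experiment the issue describes.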
