
dl-register-kor's Introduction

Can Korean Language Models Detect Social Registers in Utterances?

This repository contains the routines for the experiments supporting:

Lee & Song. (to appear). Can Korean Language Models Detect Social Registers in Utterances? Korean Journal of Linguistics (under revision).

What does it do?

  • utils/handle_data.py: builds a unified data file ready for the deep learning objectives.
    • function prepare: reads the JSON data files from the National Institute of Korean Language.
    • function label: from the data converted with prepare, makes labels using the scheme given as an external JSON file.
      • file labeling_scheme.json: example labeling scheme file, the same one used for the published experiments.
    • usage: see function run_prepare in main.py
    • NOTE: due to licensing restrictions, we do not include the data in any format. To obtain the data, visit corpus.korean.go.kr. The prepare function is designed for the particular format of the data provided there. The particular data sets we used are titled:
      • 일상 대화 말뭉치 2021 (Everyday Conversation Corpus 2021)
      • 일상 대화 말뭉치 2020 (Everyday Conversation Corpus 2020)
  • train.py: performs fine-tuning with the PyTorch framework and HuggingFace libraries.
    • Given the labeled data prepared as a JSON file, i.e., as produced by function run_prepare in main.py,
      • make PyTorch dataset instances
      • (by default, the run and test functions save the datasets as pickles in the datajar directory for later reference)
      • load a model checkpoint from the HuggingFace Hub with a sequence classification objective
      • perform fine-tuning with the HuggingFace Trainer, reporting to wandb
  • test_openai.ipynb: in-context inference test with the OpenAI API
    • Notebook with outputs for:
      • loading test data from a local pickle file
      • constructing prompts
      • making API calls
      • writing a JSON file for RegEx-based cleaning (The expressions are included as text in sed-ready format, not as Python re code. Since OpenAI output can be in literally any format, it is best to edit the text file while monitoring the substitution results and making some manual edits. The expressions given were used in Visual Studio Code to produce the cleaned output file.)
    • NOTE: you need to provide your own OpenAI API key. By default, the notebook loads the key stored in a plain text file named .creds. You can either fill in the .creds file, adjust the notebook cells to assign the API key as a string literal, or load the key by some other method.
    • NOTE: this notebook will make a few API calls on your account. Make sure you have enough credit to make the calls. The free credit provided when signing up was more than enough for this, but your mileage can always vary with OpenAI :/
    • NOTE: OpenAI API calls fail for several reasons, from lack of credit to a busy server. Unfortunately, the error codes from the API are far from informative enough to make smart adjustments for the failures. As many such errors are temporary and come with an instruction to 'try again later', each call sits in an infinite loop that repeats the same call 10 seconds after each failure. If it fails too many times, stop the cell (send ctrl+c) to get out of the loop.
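
The retry behavior described in the last note can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the names call_with_retry, make_call, max_attempts, and sleep are hypothetical, and only the 10-second wait and the "repeat until it succeeds or the user interrupts" behavior come from the description above. The optional max_attempts cap is an added assumption for users who prefer not to loop forever.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    make_call: Callable[[], T],
    delay_seconds: float = 10.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Repeat make_call until it returns without raising.

    make_call is a hypothetical zero-argument function wrapping one API
    request. With max_attempts=None the loop is unbounded, as in the
    note above; ctrl+c still stops it, because KeyboardInterrupt is not
    a subclass of Exception and is therefore not caught below.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return make_call()
        except Exception as error:
            # Give up only if the caller set a cap (an assumption, not
            # part of the original notebook's described behavior).
            if max_attempts is not None and attempt >= max_attempts:
                raise
            print(f"Attempt {attempt} failed ({error!r}); "
                  f"retrying in {delay_seconds} seconds")
            sleep(delay_seconds)
```

In a notebook cell, one would pass a small function or lambda that performs a single API request, for example call_with_retry(lambda: send_prompt(prompt)), where send_prompt stands in for whatever function actually issues the request.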

Contributors

  • gyulukeyi
