

https://github.com/tzador/tglang

TGLang - Computer Language Detector

This is a library (in C) for detecting the programming language of a code snippet.

It relies heavily on ChatGPT to generate a dataset of code snippets in 100 languages, as well as to create indicative substrings that frequently occur in those languages. We use the substrings to vectorize a snippet of code: each feature is 1 if the given substring occurs in the snippet and 0 otherwise.
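The substring vectorization can be sketched in a few lines of Python. This is only an illustration, not the library's C implementation; the function name `vectorize` and the tiny `FEATURES` list are made up for the example.

```python
from typing import List, Sequence


def vectorize(snippet: str, features: Sequence[str]) -> List[int]:
    """Return one entry per feature: 1 if the substring occurs in the snippet, else 0."""
    return [1 if feature in snippet else 0 for feature in features]


# Hypothetical feature list, for illustration only (the real set is much larger).
FEATURES = ["def ", "#include", "console.log", "fn ", "public static void"]

example = vectorize("#include <stdio.h>\nint main(void) { return 0; }", FEATURES)
print(example)  # [0, 1, 0, 0, 0]
```

The order of the feature list fixes the meaning of each vector position, so the same list has to be used for training and for detection.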

Data set generation

First we generate the code snippets using a TypeScript script and the OpenAI API. It boils down to "generate me a code snippet in language X". Both to speed up generation and because chat completions tend to be repetitive, I used the gpt-3.5-instruct model to complete string prefixes. More details are in src/1_dataset.ts. Two languages, TL and FUNC, were skipped because ChatGPT does not seem to know about them.

Feature generation

For each of the 100 languages, we ask ChatGPT 30 times to give us a JSON list of strings that occur in the language's source code and are indicative of the language. We collect all of them into a set and use it to vectorize our snippets.
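Merging the repeated replies into one feature set could look roughly like the sketch below. It assumes each reply is a raw string that should contain a JSON list. The helper name `collect_features`, the sample replies, and the choice to silently skip malformed replies are all assumptions made for illustration.

```python
import json
from typing import Iterable, Set


def collect_features(responses: Iterable[str]) -> Set[str]:
    """Union all string items from replies that parse as JSON lists."""
    features: Set[str] = set()
    for raw in responses:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # assumption: replies that are not valid JSON are ignored
        if isinstance(parsed, list):
            # Keep only non-empty strings; duplicates collapse in the set.
            features.update(item for item in parsed if isinstance(item, str) and item)
    return features


# Hypothetical replies for one language, for illustration only.
replies = ['["def ", "import "]', '["import ", "self."]', "not json", '{"a": 1}']
print(sorted(collect_features(replies)))  # ['def ', 'import ', 'self.']
```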

Training a Random Forest to figure out which features are important

After vectorizing our snippets using all of the features, we train a random forest. As a side effect, the random forest can tell us which features are more important and which are less. We keep the 1000 most important features.
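Once the forest has produced a per-feature importance score, picking the best ones is a simple ranking. The sketch below assumes the scores arrive as a list parallel to the feature list (scikit-learn's `feature_importances_` attribute has this shape, though the source does not say which library is used). The name `top_k_features` and the sample values are hypothetical.

```python
from typing import List, Sequence


def top_k_features(
    features: Sequence[str], importances: Sequence[float], k: int
) -> List[str]:
    """Return the k features with the highest importance, most important first."""
    if len(features) != len(importances):
        raise ValueError("features and importances must have the same length")
    # Sort indices by importance, descending; ties keep their original order.
    ranked = sorted(range(len(features)), key=lambda i: importances[i], reverse=True)
    return [features[i] for i in ranked[:k]]


# Made-up scores for illustration only.
names = ["def ", "#include", "fn ", "console.log"]
scores = [0.1, 0.5, 0.3, 0.1]
print(top_k_features(names, scores, 2))  # ['#include', 'fn ']
```

In the pipeline described here, `k` would be 1000.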

Train final random forest using selected features

We retrain the random forest, using only the top 1000 most important features for vectorization. We store it in a file for further processing.

Transpile the random forest into C code (if else branches)

A Python script walks the trees in the forest and emits equivalent C code, which is bundled as a library.
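To give a feel for this step, here is a minimal sketch of a tree walker that emits nested if/else C code. It is not the project's actual script. The nested-dict tree format (`feature`, `absent`, `present`, `label`), the function names, and the emitted C signature are all assumptions chosen for the example.

```python
from typing import Any, Dict

Node = Dict[str, Any]


def emit_tree(node: Node, indent: int = 1) -> str:
    """Recursively emit C if/else branches for one decision tree."""
    pad = "    " * indent
    if "label" in node:
        # Leaf: return the predicted language id.
        return f"{pad}return {node['label']};\n"
    # Inner node: binary features, so the split is "substring absent or present".
    return (
        f"{pad}if (x[{node['feature']}] == 0) {{\n"
        + emit_tree(node["absent"], indent + 1)
        + f"{pad}}} else {{\n"
        + emit_tree(node["present"], indent + 1)
        + f"{pad}}}\n"
    )


def emit_function(name: str, tree: Node) -> str:
    """Wrap one tree in a C function taking the binary feature vector."""
    return f"static int {name}(const unsigned char *x) {{\n" + emit_tree(tree) + "}\n"


# Hypothetical one-split tree: feature 3 absent -> language 0, present -> language 7.
tree = {"feature": 3, "absent": {"label": 0}, "present": {"label": 7}}
print(emit_function("tree_0", tree))
```

A full forest would emit one such function per tree and then combine their outputs, for example by counting votes per language.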
