jetbrains-research / buckwheat

A multi-language tokenizer for extracting identifiers from source code.

License: Apache License 2.0

Python 29.29% C 0.42% C++ 0.30% C# 0.49% Go 0.25% Haskell 0.31% Java 0.56% JavaScript 0.12% Kotlin 0.62% PHP 0.24% Ruby 0.20% Rust 0.36% Scala 1.16% Shell 0.50% Swift 0.35% TypeScript 0.35% Jupyter Notebook 64.48%

buckwheat's Introduction

Source Code Identifiers

A multi-language tokenizer for extracting identifiers (or, theoretically, anything else) from source code.

The tool is already employed in searching for similar repositories and studying the dynamics of topics in code.

How to use

The tool currently works on Linux and macOS; the correct versions of the required files are downloaded automatically.

  1. The project uses tree-sitter and its grammars as submodules, so update them after cloning:

    git submodule update --init --recursive --depth 1
  2. Install the required dependencies:

    pip3 install cython
    pip3 install -r requirements.txt
  3. Create an input file with a list of repositories. In the default mode, the list must contain links to GitHub repositories. In the local mode (activated by passing the -l argument), the list must contain paths to local directories.

  4. Run from the command line with python3 -m identifiers_extractor.run and the following arguments:

    • -i: a path to the input file;
    • -o: a path to the output directory;
    • -b: the size of the batch of projects that will be saved together (by default 100);
    • -l: if passed, switches the tokenization into the local mode, where the input file must contain the paths to local directories.
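The four arguments above could be declared roughly as follows. This is only a sketch with argparse from the standard library, not the project's actual code: the destination names (input, output, batch_size, local) are hypothetical, and marking -i and -o as required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four documented flags; the dest names are hypothetical."""
    parser = argparse.ArgumentParser(prog="identifiers_extractor.run")
    # Assumption: the input and output paths are mandatory.
    parser.add_argument("-i", dest="input", required=True,
                        help="path to the input file with repositories")
    parser.add_argument("-o", dest="output", required=True,
                        help="path to the output directory")
    # The documented default batch size is 100 projects.
    parser.add_argument("-b", dest="batch_size", type=int, default=100,
                        help="number of projects saved together")
    # A bare switch: present means local mode, absent means GitHub links.
    parser.add_argument("-l", dest="local", action="store_true",
                        help="treat input lines as paths to local directories")
    return parser


args = build_parser().parse_args(["-i", "repos.txt", "-o", "out", "-l"])
print(args.batch_size, args.local)
```

With this layout, leaving out -b keeps the batch size at 100, and leaving out -l keeps the default GitHub mode.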

For every batch, two files will be created:

  • docword: for every repository, all of its subtokens are listed as id:count, one repository per line, in descending order of counts. The ids are the same for the entire batch.
  • vocab: all unique subtokens are listed as id;subtoken, one subtoken per line, in ascending order of ids.
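As an illustration of how the two files relate, the sketch below joins a docword line with its vocab. It assumes only what is stated above (id:count entries and id;subtoken lines). The separator between entries is not specified here, so the sketch finds the pairs with a regular expression instead of splitting on a fixed character. The function names and the sample data are hypothetical.

```python
import re
from typing import Dict, List

# Assumption: every entry looks like "<digits>:<digits>". Anything else on the
# line is ignored, so the separator between entries does not matter here.
PAIR = re.compile(r"(\d+):(\d+)")


def parse_vocab(lines: List[str]) -> Dict[int, str]:
    """Map id -> subtoken from lines of the form 'id;subtoken'."""
    vocab: Dict[int, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        token_id, subtoken = line.split(";", 1)
        vocab[int(token_id)] = subtoken
    return vocab


def parse_docword_line(line: str, vocab: Dict[int, str]) -> Dict[str, int]:
    """Turn one repository line into a subtoken -> count mapping."""
    return {vocab[int(i)]: int(c) for i, c in PAIR.findall(line)}


# Made-up sample data; the comma separator is only for illustration.
vocab = parse_vocab(["0;pars", "1;token", "2;file"])
counts = parse_docword_line("1:7,0:3,2:1", vocab)
print(counts)
```

Because the ids are shared across a batch, one vocab mapping is enough to decode every repository line in the matching docword file.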

How it works

After the target project is downloaded, it is processed in three main steps:

  1. Language recognition. Firstly, the languages of the project are recognized with enry. This operation returns a dictionary with languages as keys and corresponding lists of files as values. Only the files in supported languages are passed on to the next step (see the full list below).
  2. Parsing. Every file is parsed with one of the two parsers. The most popular languages are parsed with tree-sitter, and the languages that do not yet have a tree-sitter grammar are parsed with pygments. At this point, identifiers are extracted and every identifier is passed on to the next step.
  3. Subtokenizing. Every identifier is split into subtokens by camelCase and snake_case. Small subtokens are then connected to longer ones, and the subtokens are stemmed. In general, the preprocessing is carried out as described in this paper.

The subtoken counters are aggregated per project and saved to a file.
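The subtokenizing and counting steps can be sketched with the standard library alone. This is a simplified illustration, not the tool's implementation: the merging rule (glue subtokens shorter than three characters onto a neighbour) is a hypothetical reading of "small subtokens are connected to longer ones", stemming is left out, and all function names are made up.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches acronyms followed by a capitalized word, ordinary words, and trailing
# acronyms. Digits and other characters are skipped by findall().
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def split_identifier(identifier: str) -> List[str]:
    """Split by snake_case first, then by camelCase, and lowercase the parts."""
    parts: List[str] = []
    for chunk in identifier.split("_"):
        parts.extend(match.lower() for match in _CAMEL.findall(chunk))
    return parts


def merge_short(parts: List[str], min_len: int = 3) -> List[str]:
    """Glue subtokens shorter than min_len onto a neighbour (hypothetical rule)."""
    merged: List[str] = []
    pending = ""
    for part in parts:
        pending += part
        if len(pending) >= min_len:
            merged.append(pending)
            pending = ""
    if pending:
        # A short tail joins the previous subtoken, or stands alone if there is none.
        if merged:
            merged[-1] += pending
        else:
            merged.append(pending)
    return merged


def count_subtokens(identifiers: Iterable[str]) -> Counter:
    """Aggregate subtoken counts over all identifiers of one project."""
    counter: Counter = Counter()
    for identifier in identifiers:
        counter.update(merge_short(split_identifier(identifier)))
    return counter


counts = count_subtokens(["parseHTTPResponse", "parse_file", "isValid"])
print(counts.most_common())
```

A stemmer would slot in between merge_short and the counter update, so that variants such as "parse" and "parsing" end up under one key.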

Advanced use

Every step of the pipeline can be modified:

  1. Languages can be added by modifying SUPPORTED_LANGUAGES in parsing.py.
  2. The tool can extract not only identifiers, but anything that is detected by either tree-sitter or pygments. This can be done by modifying NODE_TYPES in the TreeSitterParser class and TYPES in the PygmentsParser class.
  3. Subtokenization can be modified in subtokenizing.py. The tokens can be connected together, stemmed, filtered by length, etc.

Supported languages

Currently, the following languages are supported: C, C#, C++, Go, Haskell, Java, JavaScript, Kotlin, PHP, Python, Ruby, Rust, Scala, Shell, Swift, and TypeScript.

buckwheat's Issues

Prebuilt version of enry might not work on some platforms

To detect languages, we download a prebuilt release of enry chosen according to the OS. It seems that in some cases the prebuilt binary does not work (see the issue in Sosed). Possible workarounds are:

  1. Build enry from source as a fallback if the prebuilt version fails.
  2. Specify precisely which software is needed to run enry smoothly (e.g., the g++ version).

Buckwheat module can not be found in python3

Hello,

I am trying to use Buckwheat on a Mac, but when I execute the aforementioned command "python3 -m buckwheat.run",
I get the error (ImportError: No module named 'buckwheat').
Any help would be appreciated!
