
neural-code-search-evaluation-dataset's Introduction

Neural-Code-Search-Evaluation-Dataset

Neural-Code-Search-Evaluation-Dataset presents an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models (NCS, UNIF) from recent work.

The full paper is available at Neural Code Search Evaluation Dataset.

Dataset contents

All the dataset contents are in the data directory.

GitHub Repositories

The most popular Android repositories on GitHub (ranked by the number of stars) are used to create the search corpus. For each repository that we indexed, we provide the link, specific to the commit that was used. In total, there are 24,549 repositories. This is located in data/android_repositories_download_links.txt. We also provide a Python script (download.py) that downloads these GitHub repositories.

Example:

https://github.com/00-00-00/ably-chat/archive/9bb2e36acc24f1cd684ef5d1b98d837055ba9cc8.zip
https://github.com/01sadra/Detoxiom/archive/c3fffd36989b0cd93bd09cbaa35123b9d605f989.zip
https://github.com/0411ameya/MPG_update/archive/27ac5531ca2c2f123e0cb854ebcb4d0441e2bc98.zip
...
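download.py is the script of record for fetching these archives; purely as an illustration of what a loop over the links file involves, a minimal sketch might look like the following (the output directory, file-naming scheme, and one-second delay are assumptions, not part of the released script):

import os
import time
import urllib.request

LINKS_FILE = "data/android_repositories_download_links.txt"
OUT_DIR = "downloads"  # assumed output directory

os.makedirs(OUT_DIR, exist_ok=True)
with open(LINKS_FILE) as f:
    for url in (line.strip() for line in f):
        if not url:
            continue
        # Each line looks like https://github.com/:owner/:repo/archive/<commit>.zip
        owner, repo = url.split("/")[3:5]
        dest = os.path.join(OUT_DIR, f"{owner}__{repo}__{url.rsplit('/', 1)[-1]}")
        if not os.path.exists(dest):
            urllib.request.urlretrieve(url, dest)
            time.sleep(1)  # throttle requests to avoid hammering GitHub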

Search Corpus

The search corpus is indexed using all method bodies parsed from the 24,549 GitHub repositories. In total, there are 4,716,814 methods in this corpus. The code search model finds relevant code snippets (i.e., method bodies) from this corpus given a natural language query. In this data release, we provide the following information for each method in the corpus:

  • id: Each method in the corpus has a unique numeric identifier. This ID number will also be referenced in our evaluation dataset.
  • filepath: The file path is in the format of :owner/:repo/relative-file-path-to-the-repo
  • method_name: Name of the method.
  • start_line: Starting line number of the method in the file.
  • end_line: Ending line number of the method in the file.
  • url: GitHub link to the method body with commit ID and line numbers encoded.

This is located in two parts (due to GitHub file size constraints): data/search_corpus_1.tar.gz and data/search_corpus_2.tar.gz.

Example:

{
  "id": 4716813,
  "filepath": "Mindgames/VideoStreamServer/playersdk/src/main/java/com/kaltura/playersdk/PlayerViewController.java",
  "method_name": "notifyKPlayerEvent",
  "start_line": 506,
  "end_line": 566,
  "url":  "https://github.com/Mindgames/VideoStreamServer/blob/b7c73d2bcd296b3a24f83cf67d6a5998c7a1af6b/playersdk/src/main/java/com/kaltura/playersdk/PlayerViewController.java\#L506-L566"
}
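As a minimal sketch of loading the corpus, assuming the files inside the two archives store one JSON object per line in the format shown above (adjust the parsing if they instead hold a single JSON array):

import json
import tarfile

# Build an id -> record index for the search corpus. Note that this keeps all
# 4,716,814 records in memory.
corpus = {}
for archive in ("data/search_corpus_1.tar.gz", "data/search_corpus_2.tar.gz"):
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            for line in tar.extractfile(member):
                record = json.loads(line)
                corpus[record["id"]] = record

print(len(corpus))                       # expected: 4,716,814
print(corpus[4716813]["method_name"])    # "notifyKPlayerEvent", per the example above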

Evaluation Dataset

The evaluation dataset is composed of 287 Stack Overflow question and answer pairs, for which we release the following information:

  • stackoverflow_id: Stack Overflow post ID.
  • question: Title of the Stack Overflow post.
  • question_url: URL of the Stack Overflow post.
  • answer: Code snippet answer to the question.

The questions were collected from a data dump publicly released by Stack Exchange here. This is located in data/287_android_questions.json.

Example:

{
  "stackoverflow_id": 1109022,
  "question": "Close/hide the Android Soft Keyboard",
  "question_url": "https://stackoverflow.com/questions/1109022/close-hide-the-android-soft-keyboard",
  "question_author": "Vidar Vestnes",
  "question_author_url": "https://stackoverflow.com/users/133858",
  "answer": "// Check if no view has focus:\nView 
        view = this.getCurrentFocus(); \nif view != null) {InputMethodManager 
        imm = (InputMethodManager) getSystemService(Context.INPUT_METHOD_SERVICE);       
        imm.hideSoftInputFromWindow(view.getWindowToken(), 0);}",
  "answer_url": "https://stackoverflow.com/a/1109108",
  "answer_author": "Reto Meier",
  "answer_author_url": "https://stackoverflow.com/users/822",
  "examples": [1841045, 1800067, 1271795],
  "examples_url": [
    "https://github.com/alextselegidis/easyappointmentsandroid-client/blob/39f1e8...",
    "https://github.com/zelloptt/zello-android-clientsdk/blob/87b45b6...",
    "https://github.com/systers/conference-android/blob/a67982abf54e0...",
  ]
}
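A minimal sketch of loading this file, assuming it is a JSON array of objects with the fields shown above (examples appears to list search-corpus ids corresponding to the URLs in examples_url):

import json

with open("data/287_android_questions.json") as f:
    questions = json.load(f)

print(len(questions))  # expected: 287
q = next(q for q in questions if q["stackoverflow_id"] == 1109022)
print(q["question"])   # "Close/hide the Android Soft Keyboard"
print(q["examples"])   # assumed to be ids into the search corpus, e.g. [1841045, ...]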

NCS / UNIF Score Sheet

We provide the evaluation results for two code search models of our creation, each with two variations:

  • NCS: an unsupervised model that uses word embeddings derived directly from the search corpus.
  • NCS_postrank: an extension of the base NCS model that performs a post-pass ranking, as explained here.
  • UNIF_android, UNIF_stackoverflow: supervised extensions of the NCS model that use a bag-of-words-based neural network with attention. The supervision is learned from the GitHub-Android-Train and StackOverflow-Android-Train datasets, respectively, as described in the UNIF paper.

We provide the rank of the first correct answer (FRank) for each question in our evaluation dataset. The score sheet is saved as a comma-delimited CSV file at data/score_sheet.csv.

Example:

No.,StackOverflow ID,NCS FRank,NCS_postrank FRank, UNIF_android FRank,UNIF_stackoverflow FRank
1,1109022,NF,1,1,1
2,4616095,17,1,31,19
3,3004515,2,1,5,2
4,1560788,1,4,5,1
5,3423754,5,1,22,10
6,1397361,NF,3,2,1
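Each FRank value is the rank of the first correct result for that question, and NF marks questions with no correct result. As a minimal sketch, assuming the header shown above and nothing beyond the standard csv module, the number of questions each model answers within the top k can be tallied like this:

import csv

# Tally, per model, how many questions have a correct answer within the top k
# results. "NF" rows count as misses.
def answered_at_k(path="data/score_sheet.csv", k=10):
    counts = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for column, value in row.items():
                column = column.strip()
                if not column.endswith("FRank"):
                    continue
                model = column.rsplit(" ", 1)[0]  # e.g. "NCS_postrank"
                hit = value.strip() != "NF" and int(value) <= k
                counts[model] = counts.get(model, 0) + int(hit)
    return counts

print(answered_at_k(k=5))  # e.g. {'NCS': ..., 'NCS_postrank': ..., ...}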

License

Neural-Code-Search-Evaluation-Dataset is licensed under CC-BY-NC 4.0 (Attribution-NonCommercial 4.0 International), as found in the LICENSE file.

neural-code-search-evaluation-dataset's People

Contributors

skim9


neural-code-search-evaluation-dataset's Issues

Which corpus was used to train the supervised methods?

Hi,
I have downloaded the search corpus you provided, which contains about 4,679,758 methods. However, when we parse these methods, most of them do not have docstrings (i.e., natural language descriptions); only 436,450 (less than 10%) of the methods have docstrings. Since your paper at https://arxiv.org/pdf/1908.09804.pdf does not report statistics on how many methods have docstrings, does this seem reasonable, or could you provide more information about it?

Other questions that we want to figure out are:

  1. Which dataset did you use to train your supervised baseline methods (such as UNIF_android and UNIF_stackoverflow in the paper), and how many samples does it contain?
  2. I also observed that many docstrings are non-English (e.g., Chinese); how do you process these docstrings?
  3. Are the samples you used for training included in the search corpus we downloaded?
     If so, does that mean the training dataset shares samples with the testing/evaluation dataset?

Sorry to bother you, but I hope you can help.
Thank you very much!

Can we calculate the MRR metric from the NCS dataset?

I've noticed that the score sheet only has the FRank score for each query, i.e., the rank of the first correct result.
But the original paper reports the MRR (in Table 1) of NCS and UNIF.
So how can I calculate the MRR metric using the NCS dataset?
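Since the score sheet stores only FRank, one way to recover MRR is to average 1/FRank over all queries; treating NF as a reciprocal rank of 0 is a common convention, though whether it matches the paper's exact treatment is an assumption. A minimal sketch:

import csv

# Mean reciprocal rank for one model, computed from its FRank column.
# Queries marked "NF" contribute a reciprocal rank of 0 (assumed convention).
def mrr(path="data/score_sheet.csv", model="NCS_postrank"):
    reciprocal_ranks = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # header names may carry stray spaces, so match on stripped keys
            frank = next(v.strip() for k, v in row.items()
                         if k.strip() == f"{model} FRank")
            reciprocal_ranks.append(0.0 if frank == "NF" else 1.0 / int(frank))
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

for m in ("NCS", "NCS_postrank", "UNIF_android", "UNIF_stackoverflow"):
    print(m, mrr(model=m))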

Is there a backup dataset of the project repositories?

I have tried to download the repositories listed by the download.py script, but some of them no longer exist on GitHub at the provided links.

Is there a backup of all the repository code that I can download?

Thanks for your help

How to download the data?

It is a big challenge for me to download so much data. My scripts do not work because of GitHub's rate limiting: the GitHub website does not allow high-frequency requests.

Could you please provide the data directly?
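One possible workaround (an illustration, not project guidance) is to throttle requests, back off exponentially on failures, skip archives already on disk, and skip repositories that now return 404:

import os
import time
import urllib.error
import urllib.request

# Fetch one archive politely. Delays, retry counts, and the skip-on-404
# behaviour are assumptions, not part of the released download.py.
def fetch(url, dest, retries=5, base_delay=2.0):
    if os.path.exists(dest):
        return True  # already downloaded in an earlier run
    for attempt in range(retries):
        try:
            urllib.request.urlretrieve(url, dest)
            time.sleep(1)  # stay well below GitHub's request limits
            return True
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False  # repository no longer exists; skip it
            time.sleep(base_delay * 2 ** attempt)  # back off, then retry
        except urllib.error.URLError:
            time.sleep(base_delay * 2 ** attempt)
    return False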

Tree-sitter-based method extractor for one-line Java files, to handle some invalid code samples

I've found that some Java files in the search corpus only have an L1-L1 form permalink in the original GitHub link, e.g.: https://github.com/cymcsg/UltimateAndroid/blob/678afdda49d1e7c91a36830946a85e0fda541971/deprecated/UltimateAndroidGradle/ultimateandroiduianimation/src/main/java/com/marshalchen/common/uimodule/imageprocessing/FastImageProcessingPipeline.java#L1-L1.

This causes duplicate URLs in the search corpus and missing method snippets in the evaluation set.

I've found a simple solution for this issue based on tree-sitter; here's the code:

from tree_sitter import Language, Parser

def get_java_methods(code: str):
    '''Extract method bodies from a Java source string using tree-sitter.

    Returns (methods_list, method_nodes): the method source strings and the
    corresponding tree-sitter method_declaration nodes.
    '''
    def dfs(node, method_list):
        # Walk down the syntax tree until a class_body is reached, then
        # collect all of its method_declaration children.
        for child in node.children:
            if child.type == 'class_declaration':
                dfs(child, method_list)
            elif child.type == 'class_body':
                for member in child.children:
                    if member.type == 'method_declaration':
                        method_list.append(member)
                return

    # Path to a compiled tree-sitter Java grammar.
    JAVA_LANGUAGE = Language('Your tree sitter pretrained model file path/java-language.so', 'java')
    parser = Parser()
    parser.set_language(JAVA_LANGUAGE)
    tree = parser.parse(bytes(code, 'utf8'))

    method_nodes = []
    dfs(tree.root_node, method_nodes)

    # Slice the original source by each node's start/end line to get the body.
    code_list = code.split('\n')
    methods_list = []
    for n in method_nodes:
        method = '\n'.join(code_list[n.start_point[0]:n.end_point[0] + 1])
        methods_list.append(method)
    return methods_list, method_nodes

The method_nodes return value carries the original span (the starting and ending line numbers) of each Java method, so you can extract the complete method body.
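A hypothetical usage sketch, assuming a compiled tree-sitter Java grammar at the path configured inside get_java_methods:

# Read one Java source file and list the methods it contains.
with open("FastImageProcessingPipeline.java") as f:
    source = f.read()

methods, nodes = get_java_methods(source)
for body, node in zip(methods, nodes):
    start, end = node.start_point[0] + 1, node.end_point[0] + 1  # 1-based lines
    print(f"lines {start}-{end}: {len(body.splitlines())} source lines")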

Inspired by this blog post: https://blog.csdn.net/weixin_43646592/article/details/120639861. Many thanks.
