github / codesearchnet

Datasets, tools, and benchmarks for representation learning of code.

Home Page: https://arxiv.org/abs/1909.09436

License: MIT License

Languages: Jupyter Notebook 55.31%, Python 43.86%, Dockerfile 0.60%, Shell 0.23%
Topics: deep-learning, natural-language-processing, programming-language-theory, machine-learning, tensorflow, datasets, data-science, machine-learning-on-source-code, representation-learning, neural-networks

codesearchnet's Issues

Comments are not removed in code_tokens

This is a record from the java_test_0.jsonl file. There are still comments in code_tokens; I highlighted them using black.

"repo": "ReactiveX/RxJava", "path": "src/main/java/io/reactivex/Observable.java", "func_name": "Observable.skipLast", "original_string": "@CheckReturnValue\n @SchedulerSupport(SchedulerSupport.CUSTOM)\n public final Observable skipLast(long time, TimeUnit unit, Scheduler scheduler, boolean delayError, int bufferSize) {\n ObjectHelper.requireNonNull(unit, "unit is null");\n ObjectHelper.requireNonNull(scheduler, "scheduler is null");\n ObjectHelper.verifyPositive(bufferSize, "bufferSize");\n // the internal buffer holds pairs of (timestamp, value) so double the default buffer size\n int s = bufferSize << 1;\n return RxJavaPlugins.onAssembly(new ObservableSkipLastTimed(this, time, unit, scheduler, s, delayError));\n }", "language": "java", "code": "@CheckReturnValue\n @SchedulerSupport(SchedulerSupport.CUSTOM)\n public final Observable skipLast(long time, TimeUnit unit, Scheduler scheduler, boolean delayError, int bufferSize) {\n ObjectHelper.requireNonNull(unit, "unit is null");\n ObjectHelper.requireNonNull(scheduler, "scheduler is null");\n ObjectHelper.verifyPositive(bufferSize, "bufferSize");\n // the internal buffer holds pairs of (timestamp, value) so double the default buffer size\n int s = bufferSize << 1;\n return RxJavaPlugins.onAssembly(new ObservableSkipLastTimed(this, time, unit, scheduler, s, delayError));\n }", "code_tokens": ["@", "CheckReturnValue", "@", "SchedulerSupport", "(", "SchedulerSupport", ".", "CUSTOM", ")", "public", "final", "Observable", "<", "T", ">", "skipLast", "(", "long", "time", ",", "TimeUnit", "unit", ",", "Scheduler", "scheduler", ",", "boolean", "delayError", ",", "int", "bufferSize", ")", "{", "ObjectHelper", ".", "requireNonNull", "(", "unit", ",", ""unit is null"", ")", ";", "ObjectHelper", ".", "requireNonNull", "(", "scheduler", ",", ""scheduler is null"", ")", ";", "ObjectHelper", ".", "verifyPositive", "(", "bufferSize", ",", ""bufferSize"", ")", ";", "// the internal buffer holds pairs of (timestamp, value) so double the default buffer size", "int", "s", "=", "bufferSize", "<<", "1", ";", "return", "RxJavaPlugins", ".", "onAssembly", "(", "new", "ObservableSkipLastTimed", "<", "T", ">", "(", "this", ",", "time", ",", "unit", ",", "scheduler", ",", "s", ",", "delayError", ")", ")", ";", "}"]

Generating a PyPI module for function_parser

Hi,

Are there any plans to export the function_parser library this repo contains into a proper PyPI module that people could easily install? It looks awesome and I bet a ton of SE people could make use of it, especially if it were made a bit easier to set up. I'm more than happy to help with a PR to do it and, if need be, with finishing any final touches it needs (I understand how research tools sometimes are :)).

BTW, love the research and uploading all your data and code. I've used it numerous times in my research πŸ€“.

Preprocessing of docstrings can be improved

Hello,

first thanks for the challenge, the code and the dataset! Really cool stuff that you're doing and I want to work on this task. :)

I've read the Contribution Guidelines and know that you will not change any of the preprocessing code, but nevertheless I want to discuss the preprocessing of the docstrings here in case someone wants to produce a similar dataset (or maybe v3 ;) ) .

Current preprocessing of docstrings

I read your code and it seems that this is the way you preprocess the docstrings:

  1. You extract the documentation from the method and strip C-style delimiters:

     def strip_c_style_comment_delimiters(comment: str) -> str:

  2. Then you extract the relevant part of the docstring (which acts as a summary) using the following heuristics:
     2.1 if \n\n is found, you take the part before it,
     2.2 otherwise you take the part before the first @,
     2.3 otherwise you take the full docstring.

     def get_docstring_summary(docstring: str) -> str:
         """Get the first lines of the documentation comment up to the empty lines."""
         if '\n\n' in docstring:
             return docstring.split('\n\n')[0]
         elif '@' in docstring:
             return docstring[:docstring.find('@')]  # This usually is the start of a JavaDoc-style @param comment.
         return docstring

  3. Finally you tokenize the extracted summary using the following regex:

     DOCSTRING_REGEX_TOKENIZER = re.compile(r"[^\s,'\"`.():\[\]=*;>{\}+-/\\]+|\\+|\.+|\(\)|{\}|\[\]|\(+|\)+|:+|\[+|\]+|{+|\}+|=+|\*+|;+|>+|\++|-+|/+")

Problems with this pipeline

This way of preprocessing produces a couple of results that are probably not wanted and could be improved.

Extracting the summary

Compare the first 12 lines of the tokenized docstrings of the Java train set to the raw ones:

Bind indexed elements to the supplied collection .
Set {
Add {
Set servlet names that the filter will be registered against . This will replace any previously specified servlet names .
Add servlet names for the filter .
Set the URL patterns that the filter will be registered against . This will replace any previously specified URL patterns .
Add URL patterns as defined in the Servlet specification that the filter will be registered against .
Convenience method to {
Configure registration settings . Subclasses can override this method to perform additional configuration if required .
Create a nested {
Create a nested {
Create a nested {

As you can see, 6 of the 12 are basically unusable.

**Bind indexed elements to the supplied collection.** @param name the name of the property to bind @param target the target bindable @param elementBinder the binder to use for elements @param aggregateType the aggregate type, may be a collection or an array @param elementType the element type @param result the destination for results
**Set** {@link **ServletRegistrationBean**}**s that the filter will be registered against.** @param servletRegistrationBeans the Servlet registration beans
**Add** {@link **ServletRegistrationBean**}**s for the filter.** @param servletRegistrationBeans the servlet registration beans to add @see #setServletRegistrationBeans
**Set servlet names that the filter will be registered against. This will replace any previously specified servlet names.** @param servletNames the servlet names @see #setServletRegistrationBeans @see #setUrlPatterns
**Add servlet names for the filter.** @param servletNames the servlet names to add
**Set the URL patterns that the filter will be registered against. This will replace any previously specified URL patterns.** @param urlPatterns the URL patterns @see #setServletRegistrationBeans @see #setServletNames
**Add URL patterns, as defined in the Servlet specification, that the filter will be registered against.** @param urlPatterns the URL patterns
**Convenience method to** {@link **#setDispatcherTypes(EnumSet) set dispatcher types**} **using the specified elements.** @param first the first dispatcher type @param rest additional dispatcher types
**Configure registration settings. Subclasses can override this method to perform additional configuration if required.** @param registration the registration
**Create a nested** {@link **DependencyCustomizer**} **that only applies if any of the specified class names are not on the class path.** @param classNames the class names to test @return a nested {@link DependencyCustomizer}
**Create a nested** {@link **DependencyCustomizer**} **that only applies if all of the specified class names are not on the class path.** @param classNames the class names to test @return a nested {@link DependencyCustomizer}
**Create a nested** {@link **DependencyCustomizer**} **that only applies if the specified paths are on the class path.** @param paths the paths to test @return a nested {@link DependencyCustomizer}

However, the relevant information is in the raw docstrings (I added the ** to highlight the relevant passages). Simply using the part before the first @ produces pretty bad results (at least in Java), as it is common practice to highlight code blocks or links with Javadoc tags.
Possible solution: take everything before the first param (or maybe @param) as the summary, and afterwards remove Javadoc tags (maybe keeping the tokens inside).
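
(A rough sketch of that idea; java_docstring_summary is a hypothetical alternative to get_docstring_summary, not something in the repo: take everything before the first block tag, unwrap inline Javadoc tags while keeping the tokens inside, and drop HTML tags.)

import re

def java_docstring_summary(docstring: str) -> str:
    """Hypothetical Javadoc-aware summary heuristic."""
    # unwrap inline tags like {@link Foo} or {@code bar}, keeping the inner tokens
    text = re.sub(r'\{@\w+\s*([^}]*)\}', r'\1', docstring)
    # keep only the text before the first block tag such as @param/@return/@see
    summary = re.split(r'@\w+', text, maxsplit=1)[0]
    # drop HTML tags such as <p>, <a href=...>, </a>
    summary = re.sub(r'</?\w+[^>]*>', ' ', summary)
    return ' '.join(summary.split())

print(java_docstring_summary(
    'Set {@link ServletRegistrationBean}s that the filter will be registered against.\n'
    '@param servletRegistrationBeans the Servlet registration beans'
))
# -> Set ServletRegistrationBeans that the filter will be registered against.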

Cleaning

The preprocessing does not include any cleaning. This manifests in docstrings that contain HTML tags (which are commonly found in Javadoc), as well as URLs (which the tokenizer then splits rather verbosely). See these two samples from Java:

Determine if a uri is in origin - form according to <a href = https : // tools . ietf . org / html / rfc7230#section - 5 . 3 > rfc7230 5 . 3< / a > .
Determine if a uri is in asterisk - form according to <a href = https : // tools . ietf . org / html / rfc7230#section - 5 . 3 > rfc7230 5 . 3< / a > .

Some stats

>>> wc -l java.train.comment 
454436 java.train.comment
>>> grep -E '<p >|<p >' java.train.comment | wc -l
42750

At least 10% of the tokenized Java docstrings still contain HTML tags.

>>> grep '{ @' java.train.comment | wc -l
44500

Another 10% still contain Javadoc tags.

>>> grep "{ @inheritDoc }" java.train.comment | wc -l
1685

About 2k consist only of a single Javadoc tag indicating that the doc was inherited.

Many of the Go docstrings contain URLs, which are not very useful in the tokenized version the regex produces (see the Java examples above).

>>> wc -l go.train.comment 
317822 go.train.comment
>>> grep -E 'http :|https :' go.train.comment | wc -l
19753

~6% contain URLs (starting with http :)

>>> grep -E "autogenerated|auto generated" go.train.comment | wc -l
4850

Around 5k comments belong to auto-generated methods.

>>> grep "/ *" go.train.comment | wc -l
33620

10% still contain C-style comment delimiters.

Tokenization

Any specific reason you keep punctuation symbols like ., ,, -, /, :, <, >, *, =, @, (, ) as tokens? Is it to keep code in the docstrings?

Summary

I really think better cleaning and language-dependent preprocessing would produce higher-quality docstrings. At least for Java, removing Javadoc and HTML could be beneficial, as well as using everything before the first @param as the summary (maybe in combination with the first-paragraph \n\n heuristic).

Missing annoy module

I tried to execute the predict.py script but I have got the following error:
Traceback (most recent call last):
File "predict.py", line 56, in <module>
from annoy import AnnoyIndex

ModuleNotFoundError: No module named 'annoy'

What version of the annoy module should I use?
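
(For what it's worth, annoy is published on PyPI, so pip install annoy should resolve the import; the repo does not seem to pin a specific version. Below is a minimal, self-contained sketch of the AnnoyIndex workflow that predict.py imports, with random vectors standing in for real embeddings and the dimension chosen arbitrarily.)

import numpy as np
from annoy import AnnoyIndex

dim = 128                                 # embedding dimensionality (placeholder)
code_vectors = np.random.rand(1000, dim)  # stand-in for code embeddings

index = AnnoyIndex(dim, 'angular')        # 'angular' approximates cosine distance
for i, vec in enumerate(code_vectors):
    index.add_item(i, vec)
index.build(10)                           # 10 trees; more trees -> better recall, slower build

query_vec = np.random.rand(dim)           # stand-in for a query embedding
nearest = index.get_nns_by_vector(query_vec, 100)  # ids of the 100 nearest code vectors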

question about NDCG calculation

I've noticed that the official calculation of NDCG is here.

def ndcg(predictions: Dict[str, List[str]], relevance_scores: Dict[str, Dict[str, float]],

Based on this, the original paper reports the NDCG of six languages on the code search task.

[screenshot: NDCG results table from the paper]

I re-implemented a baseline search model based on an MLP, and I calculated the MRR, MAP, and NDCG metrics myself.

The MRR is 0.5128467211800546, MAP is 0.19741363623638755 and NDCG is 0.6274463943748803.

Both MRR and MAP seem reasonable, but the NDCG is nearly 3 times higher than the results reported in the original paper.

I think this may not be due to the power of my baseline model; there must be something wrong with my NDCG implementation.

Here are the functions I used for the calculation:

import numpy as np

def iDCG(true_list: list, topk=-1):

    true_descend_list = np.sort(true_list)[::-1]
    
    idcg_list = [(np.power(2,label)-1)/(np.log2(num+1)) for num,label in enumerate(true_descend_list,start=1)]
    if not topk==-1:
        idcg_list = idcg_list[:topk]
    idcg = np.sum(idcg_list)
    return idcg

def DCG(true_list,pred_list,topk=-1):

    pred_descend_order = np.argsort(pred_list)[::-1]
    
    true_descend_list = [true_list[i] for i in pred_descend_order]
    
    dcg_list = [(np.power(2,label)-1)/(np.log2(num+1)) for num,label in enumerate(true_descend_list,start=1)]
    
    if not topk==-1:
        dcg_list = dcg_list[:topk]
    
    dcg = np.sum(dcg_list)
    return dcg

def NDCG(true_rank_dict,pred_rank_dict,topk=-1):
    
    ndcg_lst = []
    
    for qid in pred_rank_dict:
        temp_pred = pred_rank_dict[qid]
        temp_true = true_rank_dict[qid]
        
        idcg = iDCG(true_list=temp_true,topk=topk)
        
        dcg = DCG(true_list=temp_true,pred_list=temp_pred,topk=topk)
        
        ndcg = dcg / idcg if not idcg == 0 else 0
        
        ndcg_lst.append(ndcg)
        
    return np.average(ndcg_lst)
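
(To sanity-check these functions, here is a minimal usage example with made-up relevance labels and scores; the dictionaries map a query id to the per-candidate relevance labels and predicted scores that the functions above expect.)

true_rank_dict = {'q1': [3, 2, 0, 1]}          # graded relevance labels per candidate
pred_rank_dict = {'q1': [0.9, 0.2, 0.4, 0.7]}  # model scores for the same candidates

print(NDCG(true_rank_dict, pred_rank_dict, topk=10))  # NDCG@10 averaged over the queries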

I'm confused about how the original paper calculates the NDCG metric, especially how the cutoff K of NDCG@K is chosen, which is not mentioned in the paper.

Please help.

Request for pre-fine-tuning self-att weights

Hi

I would like to obtain the self-attention model weights before any fine-tuning was done to them.
Does this link from the leaderboard contain such a weights file? If it is the fine-tuned one, could you make the base self-att model for Python available?

My intention is to pass these weights to a model benefiting from contextual word embeddings in PyTorch. Any information pertaining to the structure of the weights file would be helpful.

Thank you

Carlos

GPU Docker image cannot be built

When building the GPU Docker image, the following error occurs:

W: The repository 'http://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial Release' does not have a Release file. E: Failed to fetch http://ppa.launchpad.net/jonathonf/python-3.6/ubuntu/dists/xenial/main/binary-amd64/Packages 403 Forbidden E: Some index files failed to download. They have been ignored, or old ones used instead.

Seems like the PPA for Python 3.6 was removed. See: https://launchpad.net/~jonathonf/+archive/ubuntu/python-3.6

Extract descriptions from `@return` doc comments if method has no summary?

E.g. some short methods may contain a description in the @return tag, but not a description of the method itself (to avoid redundancy).

Doing this would extract more methods, but they may be of lower quality if incorrectly parsed or automatically generated. I'd expect a description such as @return bool STARTDESCRIPTION true if this is a float to be extracted. (I'm not familiar with how the data representation works.)

  • Some libraries/applications are strict about always having a summary in their coding standards, others aren't.
  • e.g. I've seen @return the description without a type for php
  • Could try to annotate the fact that the fallback was used and this is a non-official summary
    /**
     * @return bool true if this is a float
     */
    public function isFloat()

It would be nice to account for code such as @return HasTemplate<string, stdClass>, etc. Making sure that <([{ in the first token are matched up may be useful as a basic approach (and giving up if that fails); a rough sketch of this idea is included below. (There's no official standard, and different static analyzers have their own extensions to the type syntax.)
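
(Below is a sketch of that bracket-matching heuristic in Python; return_type_and_description is a hypothetical helper, not part of function_parser.)

import re

PAIRS = {'<': '>', '(': ')', '[': ']', '{': '}'}

def return_type_and_description(return_tag: str):
    """Split '@return TYPE description...' where TYPE may contain <([{ that must balance."""
    body = re.sub(r'^@return\s+', '', return_tag.strip())
    stack, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
        elif ch.isspace() and not stack:
            break  # first whitespace outside any brackets ends the type
        i += 1
    if stack:
        return None  # unbalanced brackets -> give up, as suggested above
    return body[:i], body[i + 1:].strip()

print(return_type_and_description('@return HasTemplate<string, stdClass> the parsed value'))
# -> ('HasTemplate<string, stdClass>', 'the parsed value')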

An example implementation for PHP is https://github.com/phan/phan/blob/2.2.12/src/Phan/Language/Element/MarkupDescription.php - examples of what it generates as descriptions are https://github.com/phan/phan/blob/2.2.12/tests/Phan/Language/Element/MarkupDescriptionTest.php#L132-L155

PHP UTF-8 tokenization is broken

I've tried to use php_dedupe_definitions_v2.pkl for my own project and found many functions with broken tokenization. For example, searching for functions with empty ('') tokens turns up more than 8000 of them. And if we look for all 1-letter tokens, we get tons of 1-letter UTF-8 tokens, which should be impossible.
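
(A small inspection sketch that reproduces those counts from the pickle; note that the 'function_tokens' key is an assumption about the record layout and may need adjusting to the actual field name.)

import pickle
from collections import Counter

with open('php_dedupe_definitions_v2.pkl', 'rb') as f:
    definitions = pickle.load(f)

# 'function_tokens' is assumed to be the token field of each record
empty = sum(1 for d in definitions for t in d['function_tokens'] if t == '')
one_char = Counter(t for d in definitions for t in d['function_tokens'] if len(t) == 1)

print('records:', len(definitions))
print('empty tokens:', empty)
print('most common 1-character tokens:', one_char.most_common(20))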

NDCG computation

Hi there! In the function corresponding to the NDCG calculation, I noticed that an ignore_rank_of_non_annotated_urls flag is present.

ignore_rank_of_non_annotated_urls: bool=True) -> float:

The question is: are your results in the paper calculated with ignore_rank_of_non_annotated_urls=True?

RELEVANCE_ANNOTATIONS_CSV_PATH file for running evaluation

Hi,

I wanted to run an evaluation using the NDCG score as done in the paper.

Where is the RELEVANCE_ANNOTATIONS_CSV_PATH for the 99 queries as mentioned in the README to run the /src/relevanceeval.py file?

Just want to test my results.

Error while processing a single Python file.

In CodeSearchNet/function_parser/function_parser/demo.ipynb, I kept everything the same until the third cell and then ran processor.process_single_file(py_file_path), where py_file_path contains the complete path of the .py file that I want to process.

After executing the above line I got the following error:
unhashable type: 'tree_sitter.Node' in file function_parser/function_parser/parsers/language_parser.py.

Am I missing something?

Calculating NDCG for a Keras model during run-time.

Hi

I am trying to build a Keras model from scratch, using only the Python dataset for a start. But I am confused as to how NDCG can be calculated during training or even testing.

According to my understanding, calculating NDCG will require a ground truth of which pages are relevant to a query in a ranked order and the model's predictions for that query. But in the dataset, each code-block is provided with its own query (docstring), but no ranking order of pages for a query.

I am using the same network architecture as proposed in this repository, but only have Python as the language input for now; the final layer is a 2D matrix (softmax applied) of shape (BS x BS), with each cell holding the relevance score for a (query, code-page) pair. As the ground truth for this, I prepared a 2D matrix with 1 on the diagonal and 0 elsewhere.

But how can I calculate NDCG in this scenario, as only 1 page is relevant for each query, instead of a relevance list of pages?
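
(Not an official answer, but with exactly one relevant item per query the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 1), where rank is the position of the correct code snippet. A minimal sketch under that assumption, taking the (BS x BS) score matrix described above with the diagonal as the correct pairs; ndcg_single_relevant is just an illustrative helper.)

import numpy as np

def ndcg_single_relevant(similarity: np.ndarray) -> float:
    """Mean NDCG over a batch where row i's only relevant item is column i."""
    correct = np.diag(similarity)
    # rank of the correct item = number of candidates scored at least as high as it
    ranks = (similarity >= correct[:, None]).sum(axis=1)
    # with a single relevant item, DCG = 1 / log2(rank + 1) and the ideal DCG = 1
    return float(np.mean(1.0 / np.log2(ranks + 1)))

scores = np.random.rand(8, 8)  # stand-in for the (BS x BS) relevance matrix
print(ndcg_single_relevant(scores))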

Support Dart and Flutter

Flutter is a UI DSL currently pushed by Google. It uses the Dart programming language.

Is there any chance that you will create a Dart and Flutter dataset, just like the Java, Python, PHP, Go, JS, and Ruby ones?

Thanks for your work!

Shuffle Validation Set when calculating MRR in included scripts

from @mallamanis


A master's student at Berkeley has asked me the following question about CodeSearchNet.

The validation loss logging shows that the MRR performance decreases as it is being computed (at the end of the epoch). This seems to be the case with many of the runs on W&B. Do you have any idea why this might be happening? I don't see anything obviously wrong.

For example,

11 (valid): Batch     0 (has 200 samples). Processed 0 samples. Loss so far: 0.0000.  MRR so far: 0.0000 
11 (valid): Batch     1 (has 200 samples). Processed 200 samples. Loss so far: 2.3159.  MRR so far: 0.5468 
11 (valid): Batch     2 (has 200 samples). Processed 400 samples. Loss so far: 2.3237.  MRR so far: 0.5623 
11 (valid): Batch     3 (has 200 samples). Processed 600 samples. Loss so far: 2.3163.  MRR so far: 0.5652 
11 (valid): Batch     4 (has 200 samples). Processed 800 samples. Loss so far: 2.3615.  MRR so far: 0.5568 
11 (valid): Batch     5 (has 200 samples). Processed 1000 samples. Loss so far: 2.5153.  MRR so far: 0.5323 
11 (valid): Batch     6 (has 200 samples). Processed 1200 samples. Loss so far: 2.8651.  MRR so far: 0.4921 
11 (valid): Batch     7 (has 200 samples). Processed 1400 samples. Loss so far: 2.7642.  MRR so far: 0.5085 
11 (valid): Batch     8 (has 200 samples). Processed 1600 samples. Loss so far: 2.7468.  MRR so far: 0.5091 
11 (valid): Batch     9 (has 200 samples). Processed 1800 samples. Loss so far: 2.7153.  MRR so far: 0.5134 
11 (valid): Batch    10 (has 200 samples). Processed 2000 samples. Loss so far: 2.7024.  MRR so far: 0.5131 
11 (valid): Batch    11 (has 200 samples). Processed 2200 samples. Loss so far: 2.7061.  MRR so far: 0.5125 
11 (valid): Batch    12 (has 200 samples). Processed 2400 samples. Loss so far: 2.6665.  MRR so far: 0.5183 
11 (valid): Batch    13 (has 200 samples). Processed 2600 samples. Loss so far: 2.6986.  MRR so far: 0.5157 
11 (valid): Batch    14 (has 200 samples). Processed 2800 samples. Loss so far: 2.6975.  MRR so far: 0.5166 
11 (valid): Batch    15 (has 200 samples). Processed 3000 samples. Loss so far: 2.7363.  MRR so far: 0.5118 
11 (valid): Batch    16 (has 200 samples). Processed 3200 samples. Loss so far: 2.7226.  MRR so far: 0.5137 
11 (valid): Batch    17 (has 200 samples). Processed 3400 samples. Loss so far: 2.7146.  MRR so far: 0.5153 
11 (valid): Batch    18 (has 200 samples). Processed 3600 samples. Loss so far: 2.7491.  MRR so far: 0.5115 
11 (valid): Batch    19 (has 200 samples). Processed 3800 samples. Loss so far: 2.7468.  MRR so far: 0.5108 
11 (valid): Batch    20 (has 200 samples). Processed 4000 samples. Loss so far: 2.7470.  MRR so far: 0.5097 
11 (valid): Batch    21 (has 200 samples). Processed 4200 samples. Loss so far: 2.7783.  MRR so far: 0.5070 
11 (valid): Batch    22 (has 200 samples). Processed 4400 samples. Loss so far: 2.7725.  MRR so far: 0.5086 
11 (valid): Batch    23 (has 200 samples). Processed 4600 samples. Loss so far: 2.7606.  MRR so far: 0.5096 
11 (valid): Batch    24 (has 200 samples). Processed 4800 samples. Loss so far: 2.7733.  MRR so far: 0.5069 
11 (valid): Batch    25 (has 200 samples). Processed 5000 samples. Loss so far: 2.8067.  MRR so far: 0.5030 
11 (valid): Batch    26 (has 200 samples). Processed 5200 samples. Loss so far: 2.7878.  MRR so far: 0.5054 
11 (valid): Batch    27 (has 200 samples). Processed 5400 samples. Loss so far: 2.7869.  MRR so far: 0.5054 
11 (valid): Batch    28 (has 200 samples). Processed 5600 samples. Loss so far: 2.8128.  MRR so far: 0.5003 
11 (valid): Batch    29 (has 200 samples). Processed 5800 samples. Loss so far: 2.8420.  MRR so far: 0.4959 
11 (valid): Batch    30 (has 200 samples). Processed 6000 samples. Loss so far: 2.8311.  MRR so far: 0.4981 
11 (valid): Batch    31 (has 200 samples). Processed 6200 samples. Loss so far: 2.8291.  MRR so far: 0.4978 
11 (valid): Batch    32 (has 200 samples). Processed 6400 samples. Loss so far: 2.8190.  MRR so far: 0.4993 
11 (valid): Batch    33 (has 200 samples). Processed 6600 samples. Loss so far: 2.8345.  MRR so far: 0.4980 
11 (valid): Batch    34 (has 200 samples). Processed 6800 samples. Loss so far: 2.8100.  MRR so far: 0.5011 
11 (valid): Batch    35 (has 200 samples). Processed 7000 samples. Loss so far: 2.7998.  MRR so far: 0.5026 
11 (valid): Batch    36 (has 200 samples). Processed 7200 samples. Loss so far: 2.7841.  MRR so far: 0.5052 
11 (valid): Batch    37 (has 200 samples). Processed 7400 samples. Loss so far: 2.7836.  MRR so far: 0.5052 
11 (valid): Batch    38 (has 200 samples). Processed 7600 samples. Loss so far: 2.7906.  MRR so far: 0.5044 
11 (valid): Batch    39 (has 200 samples). Processed 7800 samples. Loss so far: 2.8042.  MRR so far: 0.5020 
11 (valid): Batch    40 (has 200 samples). Processed 8000 samples. Loss so far: 2.8095.  MRR so far: 0.5013 
11 (valid): Batch    41 (has 200 samples). Processed 8200 samples. Loss so far: 2.8119.  MRR so far: 0.5008 
11 (valid): Batch    42 (has 200 samples). Processed 8400 samples. Loss so far: 2.7931.  MRR so far: 0.5038 
11 (valid): Batch    43 (has 200 samples). Processed 8600 samples. Loss so far: 2.7902.  MRR so far: 0.5043 
11 (valid): Batch    44 (has 200 samples). Processed 8800 samples. Loss so far: 2.7918.  MRR so far: 0.5041 
11 (valid): Batch    45 (has 200 samples). Processed 9000 samples. Loss so far: 2.7948.  MRR so far: 0.5044 
11 (valid): Batch    46 (has 200 samples). Processed 9200 samples. Loss so far: 2.7991.  MRR so far: 0.5031 
11 (valid): Batch    47 (has 200 samples). Processed 9400 samples. Loss so far: 2.8023.  MRR so far: 0.5028 
11 (valid): Batch    48 (has 200 samples). Processed 9600 samples. Loss so far: 2.8052.  MRR so far: 0.5020 
11 (valid): Batch    49 (has 200 samples). Processed 9800 samples. Loss so far: 2.8261.  MRR so far: 0.4986 
11 (valid): Batch    50 (has 200 samples). Processed 10000 samples. Loss so far: 2.8538.  MRR so far: 0.4944 
11 (valid): Batch    51 (has 200 samples). Processed 10200 samples. Loss so far: 2.8511.  MRR so far: 0.4946 
11 (valid): Batch    52 (has 200 samples). Processed 10400 samples. Loss so far: 2.8531.  MRR so far: 0.4949 
11 (valid): Batch    53 (has 200 samples). Processed 10600 samples. Loss so far: 2.8617.  MRR so far: 0.4928 
11 (valid): Batch    54 (has 200 samples). Processed 10800 samples. Loss so far: 2.8791.  MRR so far: 0.4897 
11 (valid): Batch    55 (has 200 samples). Processed 11000 samples. Loss so far: 2.8967.  MRR so far: 0.4876 
11 (valid): Batch    56 (has 200 samples). Processed 11200 samples. Loss so far: 2.9021.  MRR so far: 0.4871 
11 (valid): Batch    57 (has 200 samples). Processed 11400 samples. Loss so far: 2.9030.  MRR so far: 0.4873 
11 (valid): Batch    58 (has 200 samples). Processed 11600 samples. Loss so far: 2.9257.  MRR so far: 0.4843 
11 (valid): Batch    59 (has 200 samples). Processed 11800 samples. Loss so far: 2.9270.  MRR so far: 0.4841 
11 (valid): Batch    60 (has 200 samples). Processed 12000 samples. Loss so far: 2.9342.  MRR so far: 0.4836 
11 (valid): Batch    61 (has 200 samples). Processed 12200 samples. Loss so far: 2.9443.  MRR so far: 0.4819 
11 (valid): Batch    62 (has 200 samples). Processed 12400 samples. Loss so far: 2.9565.  MRR so far: 0.4798 
11 (valid): Batch    63 (has 200 samples). Processed 12600 samples. Loss so far: 2.9494.  MRR so far: 0.4810 
11 (valid): Batch    64 (has 200 samples). Processed 12800 samples. Loss so far: 2.9631.  MRR so far: 0.4785 
11 (valid): Batch    65 (has 200 samples). Processed 13000 samples. Loss so far: 2.9674.  MRR so far: 0.4779 
11 (valid): Batch    66 (has 200 samples). Processed 13200 samples. Loss so far: 2.9685.  MRR so far: 0.4778 
11 (valid): Batch    67 (has 200 samples). Processed 13400 samples. Loss so far: 2.9789.  MRR so far: 0.4769 
11 (valid): Batch    68 (has 200 samples). Processed 13600 samples. Loss so far: 2.9794.  MRR so far: 0.4765 
11 (valid): Batch    69 (has 200 samples). Processed 13800 samples. Loss so far: 2.9754.  MRR so far: 0.4766 
11 (valid): Batch    70 (has 200 samples). Processed 14000 samples. Loss so far: 2.9731.  MRR so far: 0.4767 
11 (valid): Batch    71 (has 200 samples). Processed 14200 samples. Loss so far: 2.9822.  MRR so far: 0.4751 
11 (valid): Batch    72 (has 200 samples). Processed 14400 samples. Loss so far: 2.9755.  MRR so far: 0.4756 
11 (valid): Batch    73 (has 200 samples). Processed 14600 samples. Loss so far: 2.9747.  MRR so far: 0.4758 
11 (valid): Batch    74 (has 200 samples). Processed 14800 samples. Loss so far: 2.9666.  MRR so far: 0.4770 
11 (valid): Batch    75 (has 200 samples). Processed 15000 samples. Loss so far: 2.9757.  MRR so far: 0.4758 
11 (valid): Batch    76 (has 200 samples). Processed 15200 samples. Loss so far: 2.9775.  MRR so far: 0.4756 
11 (valid): Batch    77 (has 200 samples). Processed 15400 samples. Loss so far: 2.9813.  MRR so far: 0.4745 
11 (valid): Batch    78 (has 200 samples). Processed 15600 samples. Loss so far: 2.9830.  MRR so far: 0.4739 
11 (valid): Batch    79 (has 200 samples). Processed 15800 samples. Loss so far: 2.9905.  MRR so far: 0.4726 
11 (valid): Batch    80 (has 200 samples). Processed 16000 samples. Loss so far: 3.0173.  MRR so far: 0.4688 
11 (valid): Batch    81 (has 200 samples). Processed 16200 samples. Loss so far: 3.0520.  MRR so far: 0.4645 
11 (valid): Batch    82 (has 200 samples). Processed 16400 samples. Loss so far: 3.0599.  MRR so far: 0.4630 
11 (valid): Batch    83 (has 200 samples). Processed 16600 samples. Loss so far: 3.0639.  MRR so far: 0.4625 
11 (valid): Batch    84 (has 200 samples). Processed 16800 samples. Loss so far: 3.0691.  MRR so far: 0.4616 
11 (valid): Batch    85 (has 200 samples). Processed 17000 samples. Loss so far: 3.0723.  MRR so far: 0.4614 
11 (valid): Batch    86 (has 200 samples). Processed 17200 samples. Loss so far: 3.0691.  MRR so far: 0.4615 
11 (valid): Batch    87 (has 200 samples). Processed 17400 samples. Loss so far: 3.0919.  MRR so far: 0.4589 
11 (valid): Batch    88 (has 200 samples). Processed 17600 samples. Loss so far: 3.0917.  MRR so far: 0.4587 
11 (valid): Batch    89 (has 200 samples). Processed 17800 samples. Loss so far: 3.0887.  MRR so far: 0.4592 
11 (valid): Batch    90 (has 200 samples). Processed 18000 samples. Loss so far: 3.1092.  MRR so far: 0.4561 
11 (valid): Batch    91 (has 200 samples). Processed 18200 samples. Loss so far: 3.1391.  MRR so far: 0.4515 
11 (valid): Batch    92 (has 200 samples). Processed 18400 samples. Loss so far: 3.1650.  MRR so far: 0.4482 
11 (valid): Batch    93 (has 200 samples). Processed 18600 samples. Loss so far: 3.1630.  MRR so far: 0.4486 
11 (valid): Batch    94 (has 200 samples). Processed 18800 samples. Loss so far: 3.1624.  MRR so far: 0.4488 
11 (valid): Batch    95 (has 200 samples). Processed 19000 samples. Loss so far: 3.1692.  MRR so far: 0.4480 
11 (valid): Batch    96 (has 200 samples). Processed 19200 samples. Loss so far: 3.1660.  MRR so far: 0.4486 
11 (valid): Batch    97 (has 200 samples). Processed 19400 samples. Loss so far: 3.1729.  MRR so far: 0.4478 
11 (valid): Batch    98 (has 200 samples). Processed 19600 samples. Loss so far: 3.1779.  MRR so far: 0.4468 
11 (valid): Batch    99 (has 200 samples). Processed 19800 samples. Loss so far: 3.1967.  MRR so far: 0.4445 
11 (valid): Batch   100 (has 200 samples). Processed 20000 samples. Loss so far: 3.1978.  MRR so far: 0.4443 
11 (valid): Batch   101 (has 200 samples). Processed 20200 samples. Loss so far: 3.1949.  MRR so far: 0.4443 
11 (valid): Batch   102 (has 200 samples). Processed 20400 samples. Loss so far: 3.1932.  MRR so far: 0.4444 
11 (valid): Batch   103 (has 200 samples). Processed 20600 samples. Loss so far: 3.1864.  MRR so far: 0.4453 
11 (valid): Batch   104 (has 200 samples). Processed 20800 samples. Loss so far: 3.1881.  MRR so far: 0.4453 
11 (valid): Batch   105 (has 200 samples). Processed 21000 samples. Loss so far: 3.1855.  MRR so far: 0.4457 
11 (valid): Batch   106 (has 200 samples). Processed 21200 samples. Loss so far: 3.1818.  MRR so far: 0.4460 
11 (valid): Batch   107 (has 200 samples). Processed 21400 samples. Loss so far: 3.1841.  MRR so far: 0.4458 
11 (valid): Batch   108 (has 200 samples). Processed 21600 samples. Loss so far: 3.1811.  MRR so far: 0.4462 
11 (valid): Batch   109 (has 200 samples). Processed 21800 samples. Loss so far: 3.1806.  MRR so far: 0.4463 
11 (valid): Batch   110 (has 200 samples). Processed 22000 samples. Loss so far: 3.1816.  MRR so far: 0.4464 
11 (valid): Batch   111 (has 200 samples). Processed 22200 samples. Loss so far: 3.1796.  MRR so far: 0.4467 
11 (valid): Batch   112 (has 200 samples). Processed 22400 samples. Loss so far: 3.1842.  MRR so far: 0.4457 
  Epoch 11 (valid) took 6.12s [processed 3689 samples/second]
 Validation:  Loss: 3.190802 | MRR: 0.444851

Ground truth

Is there a plan to release the annotations?

999 distractor snippets

Thank you for CodeSearchNet. A quick question: if we wish to replicate Table 3 of your paper, where can we find the 999 distractor snippets, and how did you form them?

Missing code to build files *_dedupe_definitions_v2.pkl

I have noticed the usage of the *_dedupe_definitions_v2.pkl files in predict.py.
However, I cannot find the code that builds the *_dedupe_definitions_v2.pkl files.
How should I build those files?
What is the purpose of those files?

thanks

Thanks

Just wanted to leave a note of appreciation. I've been looking for parsers that would deconstruct code into tokens easily (specifically for Ruby code; I took a look at parser, which only produces an AST and would have required more processing) so that I could figure out potential call sites and relations between test files and source code.

I'm looking to prioritize tests that should be run based on which functions are added/removed/updated, similar to what's found in the paper The art of testing less without sacrificing quality.

(I also found that Facebook has done something like this and they have a paper on it)

script/setup failed

Thanks for your cool project!
I followed the quickstart instructions, but it seems that I can't pip install tensorflow_gpu in the Docker container because of a network problem. Here is the error output:
Step 13/21 : RUN pip --no-cache-dir install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
---> Running in 10ecd6cca56f
Collecting tensorflow-gpu==1.12.0 from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
Downloading https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl (281.7MB)
Requirement already satisfied: six>=1.10.0 in /usr/lib/python3/dist-packages (from tensorflow-gpu==1.12.0) (1.10.0)
Collecting termcolor>=1.1.0 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Collecting keras-preprocessing>=1.0.5 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/28/6a/8c1f62c37212d9fc441a7e26736df51ce6f0e38455816445471f10da4f0a/Keras_Preprocessing-1.1.0-py2.py3-none-any.whl (41kB)
Collecting keras-applications>=1.0.6 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
Collecting numpy>=1.13.3 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/e5/e6/c3fdc53aed9fa19d6ff3abf97dfad768ae3afce1b7431f7500000816bda5/numpy-1.17.2-cp36-cp36m-manylinux1_x86_64.whl (20.4MB)

ERROR: Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/response.py", line 397, in _error_catcher
yield
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/response.py", line 479, in read
data = self._fp.read(amt)
File "/usr/lib/python3.6/http/client.py", line 449, in read
n = self.readinto(b)
File "/usr/lib/python3.6/http/client.py", line 493, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.6/ssl.py", line 874, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.6/ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

Error when executing docker run

I have tried to execute the scripts that run the Docker images, such as
docker run -v $(pwd):/home/dev preprocessing

However, I have got the following:
standard_init_linux.go:219: exec user process caused: no such file or directory

Additionally, there is no script to run docker-cpu.Dockerfile.
How can I run that container?

I have tried something like
docker run --net=host -v $(pwd):/home/dev csnet:cpu bash
but it doesn't return anything; I can see the instance going down by observing the syslog file.

I am using an ubuntu 18.04 virtual machine with python 3.6.5

about

The train set and valid set are used for training, and the codebase is used for evaluating the model, so what is the test set for?
Deep code search (DeepCS) says it relies on manual evaluation; is there a way to do automatic evaluation?

Also, for the Java dataset, some descriptions are empty and some are not in English (e.g. German).

script to request and download model NDCG

Since the relevance_annotations.csv is not available, a script to upload model_predictions.csv and download a model statistics file (NDCG, MRR, for example) would be great.

Request for a smaller dataset for researchers with lesser resources

Thank you for making this amazing problem statement public, along with a very comprehensive dataset!

Could a relatively smaller dataset (a subset of it) be made available for independent developers/researchers who might try running this on their personal machines?

This will open up the problem for a larger audience and may bring in some innovative solutions!

question: calculating mrr and loss

I'm making a model for this using my own codebase. I wanted to confirm whether this way of calculating the loss and MRR is correct.

import tensorflow as tf
from tensorflow.keras import backend as K


def softmax_loss(y_true, y_pred):
    q, c = y_pred
    similarity_score = tf.matmul(q, K.transpose(c))
    per_sample_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=similarity_score,
        labels=tf.range(q.shape[0])
    )
    return tf.reduce_sum(per_sample_loss) / q.shape[0]


def mrr(y_true, y_pred):
    q, c = y_pred
    similarity_scores = tf.matmul(q, K.transpose(c))
    correct_scores = tf.linalg.diag_part(similarity_scores)
    compared_scores = similarity_scores >= tf.expand_dims(correct_scores, axis=-1)
    compared_scores = tf.cast(compared_scores, tf.float16)
    return K.mean(tf.math.reciprocal(tf.reduce_sum(compared_scores, axis=1)))

Here, q and c are the query and code feature vectors of shape (batch_size, vector_dimension).
I'm a bit hesitant because I feel that the MRR and loss depend on the batch_size and on the kind of examples in the batch (closely related examples might give a low MRR; examples that are far apart can give a high MRR).

EDIT:
I've looked through some of the existing issues. Is my understanding correct that, during testing, we will have a batch_size of 1000 and we won't be shuffling the data?

A minor Java tokenization UTF-related issue

This may not be something very important or worth fixing immediately, but there may be a small bug in Java function tokenization.

At least one function in the dataset has code_tokens that do not include a { token.


Quick inspection with

with pd.option_context('display.max_colwidth', -1):
    display(jdf.loc[jdf['url'] == 'https://github.com/jbehave/jbehave-core/blob/bdd6a6199528df3c35087e72d4644870655b23e6/examples/i18n/src/main/java/org/jbehave/examples/trader/i18n/steps/DeSteps.java#L22-L25'][['code', 'code_tokens']])

shows tokens like tring and ymbol for this code:

@Given("ich habe eine Aktion mit dem Symbol $sΓΌmbol und eine Schwelle von $threshold")
public void aStock(@Named("sΓΌmbol") String symbol, @Named("threshold") double threshold) { ...

code_tokens looks like this

[@, Given, (, "ich habe eine Aktion mit dem Symbol  π‘ ΓΌπ‘šπ‘π‘œπ‘™π‘’π‘›π‘‘π‘’π‘–π‘›π‘’π‘†π‘β„Žπ‘€π‘’π‘™π‘™π‘’π‘£π‘œπ‘› threshold"), , public, void, aStock, (, @, Named, (, "sΓΌmbol"), , tring , ymbol,, , N, amed(, ", threshold"), ...]

I'm not very familiar with the extraction pipeline codebase, but the fact that tree-sitter seems to identify the locations well makes me think that JavaParser.get_definition(), which is doing some index math, may be worth closer inspection.

How to submit custom models?

How do I submit a custom model whose only output is a model_predictions.csv file? It seems like the CI tests fail on custom model submissions and thus the PR gets closed (see #184)?

Fewer data found than stated in the paper

The paper says that there are 503502 examples available for Python, but when I download the Python data I get 457461 examples combining the 14 train files, the test file, and the valid file. Using the whole corpus (of size 1.1M) to find the examples with a non-empty 'docstring' field, I do end up with the reported number of 503502. I assume roughly 46k examples have been filtered out but cannot seem to find why.

W&B When You Tag A Run, Doesn't Appear Until Refresh

See the animation below; perhaps something is broken in the JavaScript.

[animation: the run tag does not appear until the page is refreshed]

I do get the following error in the console: Error while trying to use the following icon from the Manifest: https://app.wandb.ai/favicon.png (Resource size is not correct - typo in the Manifest?)

The function processor.process_single_file() fails to produce output.

Hi, I want to use the parser to process my own data. First I want to try a few single files, but the function processor.process_single_file() fails to output anything (it just returns an empty list). How can I debug this? I have used the provided Docker image to set up my environment.

Is the MRR calculation reasonable now?

CodeSearchNet is a very good task. I think this will greatly promote the development of this field.
But I have some questions, as follows:

  1. The MRR and NDCG performances are inconsistent. The model has a high MRR value on the test sets but a low score on the leaderboard. Why is that?
  2. According to my understanding, the calculation of MRR is based on the order within a batch. However, the examples in a batch are random and the batch size affects the calculation of the MRR. Is this appropriate?
  3. If I want to train a model, how should I evaluate it and select the best model during training? Is there a better way?

Where can I find the pre-trained model

I saw that one of the goals of this project is to "Open source code for a range of baseline models, along with pre-trained weights".
So is there a pre-trained model? I couldn't find it here.

How can I get the annotated code?

Hello, I'd like to do some L2R (learning-to-rank) research with the CodeSearchNet code. I wonder if I can get the data that you've already annotated, as mentioned in the "annotation statistics" of your paper. Thanks a lot!

wandb custom submission error

I'm currently trying to submit a custom model. After entering the run, evaluating the NDCG score, and writing a brief note on how we approached the results, I get the following error:

[screenshot: wandb submission error]

Now, when I click refresh, I am redirected to the submission page again. When trying to re-submit, I get the following error: Invalid CSV format. Please upload a well-formatted CSV file.

W&B Default Screen - Wasted Real Estate

Consider opening the Auto-Visualizations by default. Right now, when you navigate to a project it opens with the view below, which seems like wasted real estate.

[screenshot: default project view]

tokens_str in Encoder

I have been trying to implement a "custom" encoder within this codebase, and I was wondering how to get access to the raw tokens (in string form).

What I have tried so far:

  • Setting up the tokens_str placeholder:
self.placeholders['tokens_str'] = \
            tf.placeholder(tf.string,
                           shape=[None, self.get_hyper('max_num_tokens')],
                           name='tokens_str')

Once I have those, I am trying to pipe the tokens into the tf_hub Elmo module, as follows:

seq_tokens = self.placeholders['tokens_str']
seq_tokens_lengths = self.placeholders['tokens_lengths']
            
# ## DEBUGGING: OUTPUT SHAPES
# print("Sequence Tokens Shape: %s" % seq_tokens.shape)
# print("Sequence Tokens Lengths: %s" % seq_tokens_lengths)

## pull elmo model from tensorflow hub
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=is_train)
token_embeddings = elmo(
     {
           "tokens": seq_tokens,
           "sequence_len": seq_tokens_lengths
      },
     signature='tokens',
    as_dict=True
)['elmo'] ## [batch_size, max_length, 1024 or 512]

I can see from my debugging statements that the shape of seq_tokens is (?, 200), which is expected, and the shape of seq_tokens_lengths is (?), which is also expected.

The error I get is len(seq_lens) != input.dims(0), (1000 vs. 0), which is coming from the ELMo model. I'm guessing that this is because seq_tokens is not being fed into the ELMo model.
Any help is appreciated!

How to deconstruct code into tokens to extract functions and comments?

I want to make a code search corpus. I have collected a lot of GitHub repositories. Now I need to deconstruct the code into tokens to extract functions and comments. You describe in the paper "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search": We then tokenize all Go, Java, JavaScript, Python, PHP and Ruby functions (or methods) using TreeSitter, GitHub's universal parser, and, where available, their respective documentation text using a heuristic regular expression.

I can extract functions in Python, but they don't come with their comments. How do you extract functions together with their comments? Can you share your code?
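
(This is not the repo's tree-sitter pipeline, but as a rough illustration of the idea for Python only, here is a minimal sketch that uses the standard-library ast module to pair each function with its docstring; the repo's function_parser does the analogous thing with tree-sitter for all six languages.)

import ast

def extract_functions_with_docstrings(source: str):
    """Yield (function_name, code, docstring) triples from a Python source string."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node) or ''
            code = ast.get_source_segment(source, node)  # requires Python 3.8+
            yield node.name, code, docstring

example = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''
for name, code, doc in extract_functions_with_docstrings(example):
    print(name, '->', doc)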

Sorry, I still don't understand the whole train-test workflow after reading all the instructions.

Greetings, I'm a PhD student working in IR/SE. I'm very happy to see the CSNet project, and thank you for your great work.

I'm working on software representation and code retrieval/search. I'm familiar with IR datasets, but forgive me, I don't quite understand how the CSNet dataset provides data annotations.

I've got all six languages from AWS, and I've found the queries here https://github.com/github/CodeSearchNet/blob/master/resources/queries.csv in the resources folder.

I explored the data with a pandas DataFrame and see that all six languages have been divided into train/test/valid partitions.

But I still cannot find the relation between the queries and the code data. In other words, no query text has related source code data associated with it.

I'm new to the WandB platform; I wonder whether there are gold labels I can use to train a model offline, or whether I strictly need to write a dataloader or function for accepting WandB's online label data.
