github / codesearchnet
Datasets, tools, and benchmarks for representation learning of code.
Home Page: https://arxiv.org/abs/1909.09436
License: MIT License
This is a record from the java_test_0.jsonl file. There are still comments in code_tokens; note the // comment token in the record below.
"repo": "ReactiveX/RxJava", "path": "src/main/java/io/reactivex/Observable.java", "func_name": "Observable.skipLast", "original_string": "@CheckReturnValue\n @SchedulerSupport(SchedulerSupport.CUSTOM)\n public final Observable skipLast(long time, TimeUnit unit, Scheduler scheduler, boolean delayError, int bufferSize) {\n ObjectHelper.requireNonNull(unit, "unit is null");\n ObjectHelper.requireNonNull(scheduler, "scheduler is null");\n ObjectHelper.verifyPositive(bufferSize, "bufferSize");\n // the internal buffer holds pairs of (timestamp, value) so double the default buffer size\n int s = bufferSize << 1;\n return RxJavaPlugins.onAssembly(new ObservableSkipLastTimed(this, time, unit, scheduler, s, delayError));\n }", "language": "java", "code": "@CheckReturnValue\n @SchedulerSupport(SchedulerSupport.CUSTOM)\n public final Observable skipLast(long time, TimeUnit unit, Scheduler scheduler, boolean delayError, int bufferSize) {\n ObjectHelper.requireNonNull(unit, "unit is null");\n ObjectHelper.requireNonNull(scheduler, "scheduler is null");\n ObjectHelper.verifyPositive(bufferSize, "bufferSize");\n // the internal buffer holds pairs of (timestamp, value) so double the default buffer size\n int s = bufferSize << 1;\n return RxJavaPlugins.onAssembly(new ObservableSkipLastTimed(this, time, unit, scheduler, s, delayError));\n }", "code_tokens": ["@", "CheckReturnValue", "@", "SchedulerSupport", "(", "SchedulerSupport", ".", "CUSTOM", ")", "public", "final", "Observable", "<", "T", ">", "skipLast", "(", "long", "time", ",", "TimeUnit", "unit", ",", "Scheduler", "scheduler", ",", "boolean", "delayError", ",", "int", "bufferSize", ")", "{", "ObjectHelper", ".", "requireNonNull", "(", "unit", ",", ""unit is null"", ")", ";", "ObjectHelper", ".", "requireNonNull", "(", "scheduler", ",", ""scheduler is null"", ")", ";", "ObjectHelper", ".", "verifyPositive", "(", "bufferSize", ",", ""bufferSize"", ")", ";", "// the internal buffer holds pairs of (timestamp, value) so double the default buffer size", "int", "s", "=", "bufferSize", "<<", "1", ";", "return", "RxJavaPlugins", ".", "onAssembly", "(", "new", "ObservableSkipLastTimed", "<", "T", ">", "(", "this", ",", "time", ",", "unit", ",", "scheduler", ",", "s", ",", "delayError", ")", ")", ";", "}"]
Hi,
Are there any plans to export the function_parser library this repo contains into a proper PyPI module that people could easily install? It looks awesome and I bet a ton of SE people could make use of it, especially if it were a bit easier to set up. I'm more than happy to help with a PR to do it and, if need be, with finishing any final touches it needs (I understand how research tools sometimes are :)).
BTW, love the research and that you uploaded all your data and code. I've used it numerous times in my research.
Hello,
First, thanks for the challenge, the code and the dataset! Really cool stuff that you're doing, and I want to work on this task. :)
I've read the Contribution Guidelines and know that you will not change any of the preprocessing code, but I nevertheless want to discuss the preprocessing of the docstrings here in case someone wants to produce a similar dataset (or maybe a v3 ;) ).
I read your code and it seems that this is the way you preprocess the docstrings: if a paragraph break (\n\n) is found, you take the part before it; otherwise, if an @ is found, you take the part before that:
CodeSearchNet/function_parser/function_parser/parsers/commentutils.py
Lines 18 to 24 in 9356b31
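In other words, the summary extraction is roughly equivalent to this sketch (my paraphrase of the referenced lines, not the exact code):

def get_docstring_summary(docstring):
    """Take the first paragraph; otherwise cut at the first Javadoc-style @tag."""
    if '\n\n' in docstring:
        return docstring.split('\n\n')[0]
    elif '@' in docstring:
        return docstring[:docstring.find('@')]
    return docstring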
This way of preprocessing produces a couple of results that are probably not wanted and could be improved.
Compare the first 12 lines of the tokenized docstrings of the Java train set to the raw ones:
Bind indexed elements to the supplied collection .
Set {
Add {
Set servlet names that the filter will be registered against . This will replace any previously specified servlet names .
Add servlet names for the filter .
Set the URL patterns that the filter will be registered against . This will replace any previously specified URL patterns .
Add URL patterns as defined in the Servlet specification that the filter will be registered against .
Convenience method to {
Configure registration settings . Subclasses can override this method to perform additional configuration if required .
Create a nested {
Create a nested {
Create a nested {
As you can see, 6 of the 12 are basically unusable.
**Bind indexed elements to the supplied collection.** @param name the name of the property to bind @param target the target bindable @param elementBinder the binder to use for elements @param aggregateType the aggregate type, may be a collection or an array @param elementType the element type @param result the destination for results
**Set** {@link **ServletRegistrationBean**}**s that the filter will be registered against.** @param servletRegistrationBeans the Servlet registration beans
**Add** {@link **ServletRegistrationBean**}**s for the filter.** @param servletRegistrationBeans the servlet registration beans to add @see #setServletRegistrationBeans
**Set servlet names that the filter will be registered against. This will replace any previously specified servlet names.** @param servletNames the servlet names @see #setServletRegistrationBeans @see #setUrlPatterns
**Add servlet names for the filter.** @param servletNames the servlet names to add
**Set the URL patterns that the filter will be registered against. This will replace any previously specified URL patterns.** @param urlPatterns the URL patterns @see #setServletRegistrationBeans @see #setServletNames
**Add URL patterns, as defined in the Servlet specification, that the filter will be registered against.** @param urlPatterns the URL patterns
**Convenience method to** {@link **#setDispatcherTypes(EnumSet) set dispatcher types**} **using the specified elements.** @param first the first dispatcher type @param rest additional dispatcher types
**Configure registration settings. Subclasses can override this method to perform additional configuration if required.** @param registration the registration
**Create a nested** {@link **DependencyCustomizer**} **that only applies if any of the specified class names are not on the class path.** @param classNames the class names to test @return a nested {@link DependencyCustomizer}
**Create a nested** {@link **DependencyCustomizer**} **that only applies if all of the specified class names are not on the class path.** @param classNames the class names to test @return a nested {@link DependencyCustomizer}
**Create a nested** {@link **DependencyCustomizer**} **that only applies if the specified paths are on the class path.** @param paths the paths to test @return a nested {@link DependencyCustomizer}
However, the relevant information is in the raw docstrings (I added the ** to highlight the relevant passages). Simply using the part before the first @ produces pretty bad results (at least in Java), as it's common practice to highlight code blocks or links with Javadoc tags.
Possible solution: stripping everything from the first param tag (or maybe @param) onwards, and afterwards removing Javadoc tags (maybe keeping the tokens inside).
The preprocessing does not include any cleaning. This manifests in docstrings that contain HTML tags (which are common in Javadoc), as well as URLs (which then get tokenized pretty verbosely). See these two samples from Java:
Determine if a uri is in origin - form according to <a href = https : // tools . ietf . org / html / rfc7230#section - 5 . 3 > rfc7230 5 . 3< / a > .
Determine if a uri is in asterisk - form according to <a href = https : // tools . ietf . org / html / rfc7230#section - 5 . 3 > rfc7230 5 . 3< / a > .
Some stats
>>> wc -l java.train.comment
454436 java.train.comment
>>> grep -E '<p >|</p >' java.train.comment | wc -l
42750
Nearly 10% of the tokenized Java docstrings still contain HTML tags.
>>> grep '{ @' java.train.comment | wc -l
44500
Another ~10% still contain Javadoc tags.
>>> grep "{ @inheritDoc }" java.train.comment | wc -l
1685
~2k consist only of a single Javadoc tag indicating that the doc was inherited.
Many of the Go docstrings contain URLs, which are not very useful in the tokenized version the regex produces (see the Java examples above).
>>> wc -l go.train.comment
317822 go.train.comment
>>> grep -E 'http :|https :' go.train.comment | wc -l
19753
~6% contain URLs (starting with http : ).
>>> grep -E "autogenerated|auto generated" go.train.comment | wc -l
4850
Around 5k are auto-generated methods.
>>> grep "/ *" go.train.comment | wc -l
33620
10% still contain C-style comment delimiters.
Any specific reason you keep punctuation symbols like . , - / : < > * = @ ( ) as tokens? Is it to keep code in the docstrings?
I really think better cleaning and language-dependent preprocessing would produce higher-quality docstrings. At least for Java, removing Javadoc and HTML could be beneficial, as well as using everything before the first param tag as a summary (maybe in combination with the first-paragraph \n\n heuristic).
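A rough sketch of that suggestion (a hypothetical helper, not part of the repo): cut at the first @param, unwrap {@...} tags while keeping their inner tokens, and strip HTML:

import re

def clean_java_docstring(docstring):
    """Summary = text before the first @param, with Javadoc/HTML markup removed."""
    summary = re.split(r'@param\b', docstring)[0]
    summary = summary.split('\n\n')[0]  # optionally combine with the first-paragraph heuristic
    summary = re.sub(r'\{@\w+\s*([^}]*)\}', r'\1', summary)  # {@link Foo} -> Foo
    summary = re.sub(r'<[^>]+>', ' ', summary)               # drop HTML tags
    return ' '.join(summary.split())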
I tried to execute the predict.py script but got the following error:
Traceback (most recent call last):
File "predict.py", line 56, in
from annoy import AnnoyIndex
ModuleNotFoundError: No module named 'annoy'
What version of the Annoy module should I use?
I've noticed that the official calculation of NDCG is here.
CodeSearchNet/src/relevanceeval.py
Line 75 in 3f999d5
On this basis, the original paper reports the NDCG of six languages on the code search task.
I re-implemented a baseline search model based on an MLP and calculated the MRR, MAP and NDCG metrics myself.
The MRR is 0.5128467211800546, MAP is 0.19741363623638755 and NDCG is 0.6274463943748803.
Both MRR and MAP seem reasonable, but the NDCG is nearly 3 times higher than the results reported in the original paper.
I don't think this is the power of my baseline model; there must be something wrong with my NDCG implementation.
Here are the functions I used for the calculation:
import numpy as np

def iDCG(true_list, topk=-1):
    # ideal DCG: labels sorted in descending relevance order
    true_descend_list = np.sort(true_list)[::-1]
    idcg_list = [(np.power(2, label) - 1) / np.log2(num + 1)
                 for num, label in enumerate(true_descend_list, start=1)]
    if topk != -1:
        idcg_list = idcg_list[:topk]
    return np.sum(idcg_list)

def DCG(true_list, pred_list, topk=-1):
    # DCG of the ranking induced by the predicted scores
    pred_descend_order = np.argsort(pred_list)[::-1]
    true_descend_list = [true_list[i] for i in pred_descend_order]
    dcg_list = [(np.power(2, label) - 1) / np.log2(num + 1)
                for num, label in enumerate(true_descend_list, start=1)]
    if topk != -1:
        dcg_list = dcg_list[:topk]
    return np.sum(dcg_list)

def NDCG(true_rank_dict, pred_rank_dict, topk=-1):
    ndcg_lst = []
    for qid in pred_rank_dict:
        temp_pred = pred_rank_dict[qid]
        temp_true = true_rank_dict[qid]
        idcg = iDCG(true_list=temp_true, topk=topk)
        dcg = DCG(true_list=temp_true, pred_list=temp_pred, topk=topk)
        ndcg_lst.append(dcg / idcg if idcg != 0 else 0)
    return np.average(ndcg_lst)
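For reference, a toy sanity check of these functions (made-up labels and scores):

true = {'q1': [3, 2, 0, 1]}          # graded relevance labels
pred = {'q1': [0.9, 0.8, 0.1, 0.3]}  # scores that rank the labels perfectly
print(NDCG(true, pred))              # -> 1.0
print(NDCG(true, pred, topk=2))      # NDCG@2, also 1.0 here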
I'm confused about how the original paper calculates the NDCG metric, especially how the threshold K of NDCG@K is chosen, which is not mentioned in the paper.
Please help.
Hi
I would like to obtain the self-attention model weights before any fine-tuning was done.
Does this link from the leaderboard contain such a weights file? If it is the fine-tuned one, could you make the base self-attention model for Python available?
My intention is to pass these weights to a model benefitting from contextual word embeddings in PyTorch. Any information pertaining to the structure of the weights file would be helpful.
Thank you
Carlos
When building the gpu-docker image, the following error occurs:
W: The repository 'http://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial Release' does not have a Release file. E: Failed to fetch http://ppa.launchpad.net/jonathonf/python-3.6/ubuntu/dists/xenial/main/binary-amd64/Packages 403 Forbidden E: Some index files failed to download. They have been ignored, or old ones used instead.
Seems like the PPA for Python 3.6 was removed. See: https://launchpad.net/~jonathonf/+archive/ubuntu/python-3.6
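If it helps: the deadsnakes PPA (ppa:deadsnakes/ppa) still publishes Python 3.6 packages for xenial, so pointing the Dockerfile at it instead of the removed jonathonf PPA is a likely fix (I haven't verified this against this exact image).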
E.g. some short methods may contain a description in the return tag, but not a description of the method itself (to avoid redundancy).
Doing this would extract more methods, but they may be of lower quality if incorrectly parsed or automatically generated. I'd expect a description such as @return bool STARTDESCRIPTION true if this is a float to be extracted (I'm not familiar with how the data representation works).
It would also be nice to handle @return the description without a type. For PHP:
/**
 * @return bool true if this is a float
 */
public function isFloat()
It would be nice to account for code such as @return HasTemplate<string, stdClass>, etc. Making sure that <([{ in the first token are matched up may be useful as a basic approach (and give up if that fails). (There's no official standard, and different static analyzers have their own extensions to the type syntax.)
An example implementation for PHP is https://github.com/phan/phan/blob/2.2.12/src/Phan/Language/Element/MarkupDescription.php - examples of what it generates as descriptions are https://github.com/phan/phan/blob/2.2.12/tests/Phan/Language/Element/MarkupDescriptionTest.php#L132-L155
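As a sketch of that bracket-matching idea (a hypothetical helper, in Python rather than PHP for brevity): split an @return tag into a type token and a description, giving up on unbalanced brackets:

def split_return_type(tag_text):
    """Split '@return <type> <description>' where the type may contain
    nested <>, (), [] or {}; return None if the brackets don't match up."""
    pairs = {'<': '>', '(': ')', '[': ']', '{': '}'}
    closers = set(pairs.values())
    stack = []
    for i, ch in enumerate(tag_text):
        if ch in pairs:
            stack.append(pairs[ch])
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return None  # unbalanced: give up
        elif ch.isspace() and not stack:
            return tag_text[:i], tag_text[i + 1:].strip()
    return (tag_text, '') if not stack else None

print(split_return_type('HasTemplate<string, stdClass> the description'))
# -> ('HasTemplate<string, stdClass>', 'the description')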
How do I submit only a model_predictions.csv file from an external project?
I've tried to use php_dedupe_definitions_v2.pkl for my own project and found many functions with broken tokenization. For example, searching for functions with empty ('') tokens turns up more than 8000 of them. And if we look for all 1-letter tokens, we get tons of 1-letter UTF-8 tokens, which should be impossible.
Hi there! In the function corresponding to the NDCG calculation I noticed that an ignore_rank_of_non_annotated_urls flag is present.
CodeSearchNet/src/relevanceeval.py
Line 76 in 76a006f
The question is: are the results in the paper calculated with ignore_rank_of_non_annotated_urls=True?
Hi,
I wanted to run an evaluation using the NDCG score as done in the paper.
Where is the RELEVANCE_ANNOTATIONS_CSV_PATH for the 99 queries, mentioned in the README, that is needed to run the /src/relevanceeval.py script?
Just want to test my results..
In CodeSearchNet/function_parser/function_parser/demo.ipynb, I kept everything the same until the third cell and then ran processor.process_single_file(py_file_path), where py_file_path contains the complete path of the .py file that I want to process.
After executing the above line I got the following error:
unhashable type: 'tree_sitter.Node'
in file function_parser/function_parser/parsers/language_parser.py.
Am I missing something?
Hi
I am trying to build a Keras model from scratch, using only the Python dataset for a start, but I am confused as to how NDCG can be calculated during training or even testing.
According to my understanding, calculating NDCG requires a ground truth of which pages are relevant to a query in ranked order, plus the model's predictions for that query. But in the dataset each code block comes with its own query (docstring); there is no ranking of pages per query.
I am using the same network architecture as proposed in this repository, but only have Python as the language input for now. The final layer is a 2D matrix (softmax applied) of shape (BS x BS), with each cell holding the relevance score for a (query, code-page) pair. As ground truth for this I prepared a 2D matrix with 1 on the diagonal and 0 elsewhere.
But how can I calculate NDCG in this scenario, where only one page is relevant for each query instead of a relevance list of pages?
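For what it's worth, NDCG is still well defined in that setting: with binary relevance and exactly one relevant page per query, the ideal DCG is 1, so per-query NDCG collapses to a function of the correct page's rank (a sketch):

import numpy as np

def ndcg_single_relevant(rank):
    """NDCG for a query whose single relevant item lands at the given 1-based rank."""
    return 1.0 / np.log2(rank + 1)  # DCG = 1/log2(rank+1), iDCG = 1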
Flutter is a UI DSL currently pushed by Google; it uses the Dart programming language.
Is there any chance that you create a Dart/Flutter dataset just like the Java, Python, PHP, Go, JS and Ruby ones?
Thanks for your work!
from @mallamanis
One master's student at Berkeley has asked me the following question about CodeSearchNet.
The validation loss logging shows that the MRR performance decreases as it is being computed (at the end of the epoch). This seems to be the case with many of the runs on W&B. Do you have any idea why this might be happening? I don't see anything obviously wrong.
For example,
11 (valid): Batch 0 (has 200 samples). Processed 0 samples. Loss so far: 0.0000. MRR so far: 0.0000
11 (valid): Batch 1 (has 200 samples). Processed 200 samples. Loss so far: 2.3159. MRR so far: 0.5468
11 (valid): Batch 2 (has 200 samples). Processed 400 samples. Loss so far: 2.3237. MRR so far: 0.5623
11 (valid): Batch 3 (has 200 samples). Processed 600 samples. Loss so far: 2.3163. MRR so far: 0.5652
11 (valid): Batch 4 (has 200 samples). Processed 800 samples. Loss so far: 2.3615. MRR so far: 0.5568
11 (valid): Batch 5 (has 200 samples). Processed 1000 samples. Loss so far: 2.5153. MRR so far: 0.5323
11 (valid): Batch 6 (has 200 samples). Processed 1200 samples. Loss so far: 2.8651. MRR so far: 0.4921
11 (valid): Batch 7 (has 200 samples). Processed 1400 samples. Loss so far: 2.7642. MRR so far: 0.5085
11 (valid): Batch 8 (has 200 samples). Processed 1600 samples. Loss so far: 2.7468. MRR so far: 0.5091
11 (valid): Batch 9 (has 200 samples). Processed 1800 samples. Loss so far: 2.7153. MRR so far: 0.5134
11 (valid): Batch 10 (has 200 samples). Processed 2000 samples. Loss so far: 2.7024. MRR so far: 0.5131
11 (valid): Batch 11 (has 200 samples). Processed 2200 samples. Loss so far: 2.7061. MRR so far: 0.5125
11 (valid): Batch 12 (has 200 samples). Processed 2400 samples. Loss so far: 2.6665. MRR so far: 0.5183
11 (valid): Batch 13 (has 200 samples). Processed 2600 samples. Loss so far: 2.6986. MRR so far: 0.5157
11 (valid): Batch 14 (has 200 samples). Processed 2800 samples. Loss so far: 2.6975. MRR so far: 0.5166
11 (valid): Batch 15 (has 200 samples). Processed 3000 samples. Loss so far: 2.7363. MRR so far: 0.5118
11 (valid): Batch 16 (has 200 samples). Processed 3200 samples. Loss so far: 2.7226. MRR so far: 0.5137
11 (valid): Batch 17 (has 200 samples). Processed 3400 samples. Loss so far: 2.7146. MRR so far: 0.5153
11 (valid): Batch 18 (has 200 samples). Processed 3600 samples. Loss so far: 2.7491. MRR so far: 0.5115
11 (valid): Batch 19 (has 200 samples). Processed 3800 samples. Loss so far: 2.7468. MRR so far: 0.5108
11 (valid): Batch 20 (has 200 samples). Processed 4000 samples. Loss so far: 2.7470. MRR so far: 0.5097
11 (valid): Batch 21 (has 200 samples). Processed 4200 samples. Loss so far: 2.7783. MRR so far: 0.5070
11 (valid): Batch 22 (has 200 samples). Processed 4400 samples. Loss so far: 2.7725. MRR so far: 0.5086
11 (valid): Batch 23 (has 200 samples). Processed 4600 samples. Loss so far: 2.7606. MRR so far: 0.5096
11 (valid): Batch 24 (has 200 samples). Processed 4800 samples. Loss so far: 2.7733. MRR so far: 0.5069
11 (valid): Batch 25 (has 200 samples). Processed 5000 samples. Loss so far: 2.8067. MRR so far: 0.5030
11 (valid): Batch 26 (has 200 samples). Processed 5200 samples. Loss so far: 2.7878. MRR so far: 0.5054
11 (valid): Batch 27 (has 200 samples). Processed 5400 samples. Loss so far: 2.7869. MRR so far: 0.5054
11 (valid): Batch 28 (has 200 samples). Processed 5600 samples. Loss so far: 2.8128. MRR so far: 0.5003
11 (valid): Batch 29 (has 200 samples). Processed 5800 samples. Loss so far: 2.8420. MRR so far: 0.4959
11 (valid): Batch 30 (has 200 samples). Processed 6000 samples. Loss so far: 2.8311. MRR so far: 0.4981
11 (valid): Batch 31 (has 200 samples). Processed 6200 samples. Loss so far: 2.8291. MRR so far: 0.4978
11 (valid): Batch 32 (has 200 samples). Processed 6400 samples. Loss so far: 2.8190. MRR so far: 0.4993
11 (valid): Batch 33 (has 200 samples). Processed 6600 samples. Loss so far: 2.8345. MRR so far: 0.4980
11 (valid): Batch 34 (has 200 samples). Processed 6800 samples. Loss so far: 2.8100. MRR so far: 0.5011
11 (valid): Batch 35 (has 200 samples). Processed 7000 samples. Loss so far: 2.7998. MRR so far: 0.5026
11 (valid): Batch 36 (has 200 samples). Processed 7200 samples. Loss so far: 2.7841. MRR so far: 0.5052
11 (valid): Batch 37 (has 200 samples). Processed 7400 samples. Loss so far: 2.7836. MRR so far: 0.5052
11 (valid): Batch 38 (has 200 samples). Processed 7600 samples. Loss so far: 2.7906. MRR so far: 0.5044
11 (valid): Batch 39 (has 200 samples). Processed 7800 samples. Loss so far: 2.8042. MRR so far: 0.5020
11 (valid): Batch 40 (has 200 samples). Processed 8000 samples. Loss so far: 2.8095. MRR so far: 0.5013
11 (valid): Batch 41 (has 200 samples). Processed 8200 samples. Loss so far: 2.8119. MRR so far: 0.5008
11 (valid): Batch 42 (has 200 samples). Processed 8400 samples. Loss so far: 2.7931. MRR so far: 0.5038
11 (valid): Batch 43 (has 200 samples). Processed 8600 samples. Loss so far: 2.7902. MRR so far: 0.5043
11 (valid): Batch 44 (has 200 samples). Processed 8800 samples. Loss so far: 2.7918. MRR so far: 0.5041
11 (valid): Batch 45 (has 200 samples). Processed 9000 samples. Loss so far: 2.7948. MRR so far: 0.5044
11 (valid): Batch 46 (has 200 samples). Processed 9200 samples. Loss so far: 2.7991. MRR so far: 0.5031
11 (valid): Batch 47 (has 200 samples). Processed 9400 samples. Loss so far: 2.8023. MRR so far: 0.5028
11 (valid): Batch 48 (has 200 samples). Processed 9600 samples. Loss so far: 2.8052. MRR so far: 0.5020
11 (valid): Batch 49 (has 200 samples). Processed 9800 samples. Loss so far: 2.8261. MRR so far: 0.4986
11 (valid): Batch 50 (has 200 samples). Processed 10000 samples. Loss so far: 2.8538. MRR so far: 0.4944
11 (valid): Batch 51 (has 200 samples). Processed 10200 samples. Loss so far: 2.8511. MRR so far: 0.4946
11 (valid): Batch 52 (has 200 samples). Processed 10400 samples. Loss so far: 2.8531. MRR so far: 0.4949
11 (valid): Batch 53 (has 200 samples). Processed 10600 samples. Loss so far: 2.8617. MRR so far: 0.4928
11 (valid): Batch 54 (has 200 samples). Processed 10800 samples. Loss so far: 2.8791. MRR so far: 0.4897
11 (valid): Batch 55 (has 200 samples). Processed 11000 samples. Loss so far: 2.8967. MRR so far: 0.4876
11 (valid): Batch 56 (has 200 samples). Processed 11200 samples. Loss so far: 2.9021. MRR so far: 0.4871
11 (valid): Batch 57 (has 200 samples). Processed 11400 samples. Loss so far: 2.9030. MRR so far: 0.4873
11 (valid): Batch 58 (has 200 samples). Processed 11600 samples. Loss so far: 2.9257. MRR so far: 0.4843
11 (valid): Batch 59 (has 200 samples). Processed 11800 samples. Loss so far: 2.9270. MRR so far: 0.4841
11 (valid): Batch 60 (has 200 samples). Processed 12000 samples. Loss so far: 2.9342. MRR so far: 0.4836
11 (valid): Batch 61 (has 200 samples). Processed 12200 samples. Loss so far: 2.9443. MRR so far: 0.4819
11 (valid): Batch 62 (has 200 samples). Processed 12400 samples. Loss so far: 2.9565. MRR so far: 0.4798
11 (valid): Batch 63 (has 200 samples). Processed 12600 samples. Loss so far: 2.9494. MRR so far: 0.4810
11 (valid): Batch 64 (has 200 samples). Processed 12800 samples. Loss so far: 2.9631. MRR so far: 0.4785
11 (valid): Batch 65 (has 200 samples). Processed 13000 samples. Loss so far: 2.9674. MRR so far: 0.4779
11 (valid): Batch 66 (has 200 samples). Processed 13200 samples. Loss so far: 2.9685. MRR so far: 0.4778
11 (valid): Batch 67 (has 200 samples). Processed 13400 samples. Loss so far: 2.9789. MRR so far: 0.4769
11 (valid): Batch 68 (has 200 samples). Processed 13600 samples. Loss so far: 2.9794. MRR so far: 0.4765
11 (valid): Batch 69 (has 200 samples). Processed 13800 samples. Loss so far: 2.9754. MRR so far: 0.4766
11 (valid): Batch 70 (has 200 samples). Processed 14000 samples. Loss so far: 2.9731. MRR so far: 0.4767
11 (valid): Batch 71 (has 200 samples). Processed 14200 samples. Loss so far: 2.9822. MRR so far: 0.4751
11 (valid): Batch 72 (has 200 samples). Processed 14400 samples. Loss so far: 2.9755. MRR so far: 0.4756
11 (valid): Batch 73 (has 200 samples). Processed 14600 samples. Loss so far: 2.9747. MRR so far: 0.4758
11 (valid): Batch 74 (has 200 samples). Processed 14800 samples. Loss so far: 2.9666. MRR so far: 0.4770
11 (valid): Batch 75 (has 200 samples). Processed 15000 samples. Loss so far: 2.9757. MRR so far: 0.4758
11 (valid): Batch 76 (has 200 samples). Processed 15200 samples. Loss so far: 2.9775. MRR so far: 0.4756
11 (valid): Batch 77 (has 200 samples). Processed 15400 samples. Loss so far: 2.9813. MRR so far: 0.4745
11 (valid): Batch 78 (has 200 samples). Processed 15600 samples. Loss so far: 2.9830. MRR so far: 0.4739
11 (valid): Batch 79 (has 200 samples). Processed 15800 samples. Loss so far: 2.9905. MRR so far: 0.4726
11 (valid): Batch 80 (has 200 samples). Processed 16000 samples. Loss so far: 3.0173. MRR so far: 0.4688
11 (valid): Batch 81 (has 200 samples). Processed 16200 samples. Loss so far: 3.0520. MRR so far: 0.4645
11 (valid): Batch 82 (has 200 samples). Processed 16400 samples. Loss so far: 3.0599. MRR so far: 0.4630
11 (valid): Batch 83 (has 200 samples). Processed 16600 samples. Loss so far: 3.0639. MRR so far: 0.4625
11 (valid): Batch 84 (has 200 samples). Processed 16800 samples. Loss so far: 3.0691. MRR so far: 0.4616
11 (valid): Batch 85 (has 200 samples). Processed 17000 samples. Loss so far: 3.0723. MRR so far: 0.4614
11 (valid): Batch 86 (has 200 samples). Processed 17200 samples. Loss so far: 3.0691. MRR so far: 0.4615
11 (valid): Batch 87 (has 200 samples). Processed 17400 samples. Loss so far: 3.0919. MRR so far: 0.4589
11 (valid): Batch 88 (has 200 samples). Processed 17600 samples. Loss so far: 3.0917. MRR so far: 0.4587
11 (valid): Batch 89 (has 200 samples). Processed 17800 samples. Loss so far: 3.0887. MRR so far: 0.4592
11 (valid): Batch 90 (has 200 samples). Processed 18000 samples. Loss so far: 3.1092. MRR so far: 0.4561
11 (valid): Batch 91 (has 200 samples). Processed 18200 samples. Loss so far: 3.1391. MRR so far: 0.4515
11 (valid): Batch 92 (has 200 samples). Processed 18400 samples. Loss so far: 3.1650. MRR so far: 0.4482
11 (valid): Batch 93 (has 200 samples). Processed 18600 samples. Loss so far: 3.1630. MRR so far: 0.4486
11 (valid): Batch 94 (has 200 samples). Processed 18800 samples. Loss so far: 3.1624. MRR so far: 0.4488
11 (valid): Batch 95 (has 200 samples). Processed 19000 samples. Loss so far: 3.1692. MRR so far: 0.4480
11 (valid): Batch 96 (has 200 samples). Processed 19200 samples. Loss so far: 3.1660. MRR so far: 0.4486
11 (valid): Batch 97 (has 200 samples). Processed 19400 samples. Loss so far: 3.1729. MRR so far: 0.4478
11 (valid): Batch 98 (has 200 samples). Processed 19600 samples. Loss so far: 3.1779. MRR so far: 0.4468
11 (valid): Batch 99 (has 200 samples). Processed 19800 samples. Loss so far: 3.1967. MRR so far: 0.4445
11 (valid): Batch 100 (has 200 samples). Processed 20000 samples. Loss so far: 3.1978. MRR so far: 0.4443
11 (valid): Batch 101 (has 200 samples). Processed 20200 samples. Loss so far: 3.1949. MRR so far: 0.4443
11 (valid): Batch 102 (has 200 samples). Processed 20400 samples. Loss so far: 3.1932. MRR so far: 0.4444
11 (valid): Batch 103 (has 200 samples). Processed 20600 samples. Loss so far: 3.1864. MRR so far: 0.4453
11 (valid): Batch 104 (has 200 samples). Processed 20800 samples. Loss so far: 3.1881. MRR so far: 0.4453
11 (valid): Batch 105 (has 200 samples). Processed 21000 samples. Loss so far: 3.1855. MRR so far: 0.4457
11 (valid): Batch 106 (has 200 samples). Processed 21200 samples. Loss so far: 3.1818. MRR so far: 0.4460
11 (valid): Batch 107 (has 200 samples). Processed 21400 samples. Loss so far: 3.1841. MRR so far: 0.4458
11 (valid): Batch 108 (has 200 samples). Processed 21600 samples. Loss so far: 3.1811. MRR so far: 0.4462
11 (valid): Batch 109 (has 200 samples). Processed 21800 samples. Loss so far: 3.1806. MRR so far: 0.4463
11 (valid): Batch 110 (has 200 samples). Processed 22000 samples. Loss so far: 3.1816. MRR so far: 0.4464
11 (valid): Batch 111 (has 200 samples). Processed 22200 samples. Loss so far: 3.1796. MRR so far: 0.4467
11 (valid): Batch 112 (has 200 samples). Processed 22400 samples. Loss so far: 3.1842. MRR so far: 0.4457
Epoch 11 (valid) took 6.12s [processed 3689 samples/second]
Validation: Loss: 3.190802 | MRR: 0.444851
While others have made the same typo, neural-networks is far more popular. Thanks!
Is there a plan to release the annotations?
Thank you for CodeSearchNet. A quick question: if we wish to replicate Table 3 of your paper, where can we find the 999 distractor snippets, and how did you form them?
I have noticed the usage of the *_dedupe_definitions_v2.pkl files when using predict.py. However, I cannot find the code that builds them.
How should I build those files? What is their purpose?
Thanks.
Just wanted to leave a note of appreciation. I've been looking for parsers that would deconstruct code into tokens easily (specifically for Ruby code; I took a look at parser, which produces just an AST and would have required more processing) so that I could figure out potential call sites and relations between test files and source code.
I'm looking to prioritize tests that should be run based on which functions are added/removed/updated, similar to what's found in the paper The art of testing less without sacrificing quality.
(I also found that Facebook has done something like this and they have a paper on it)
Thanks for your cool project!
I followed the quickstart instructions, but it seems that I can't pip install tensorflow_gpu in the Docker container because of a network problem. Here is the error:
Step 13/21 : RUN pip --no-cache-dir install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
---> Running in 10ecd6cca56f
Collecting tensorflow-gpu==1.12.0 from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
Downloading https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl (281.7MB)
Requirement already satisfied: six>=1.10.0 in /usr/lib/python3/dist-packages (from tensorflow-gpu==1.12.0) (1.10.0)
Collecting termcolor>=1.1.0 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Collecting keras-preprocessing>=1.0.5 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/28/6a/8c1f62c37212d9fc441a7e26736df51ce6f0e38455816445471f10da4f0a/Keras_Preprocessing-1.1.0-py2.py3-none-any.whl (41kB)
Collecting keras-applications>=1.0.6 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
Collecting numpy>=1.13.3 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/e5/e6/c3fdc53aed9fa19d6ff3abf97dfad768ae3afce1b7431f7500000816bda5/numpy-1.17.2-cp36-cp36m-manylinux1_x86_64.whl (20.4MB)
ERROR: Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/response.py", line 397, in _error_catcher
yield
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/response.py", line 479, in read
data = self._fp.read(amt)
File "/usr/lib/python3.6/http/client.py", line 449, in read
n = self.readinto(b)
File "/usr/lib/python3.6/http/client.py", line 493, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.6/ssl.py", line 874, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.6/ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
I have tried to execute the scripts that run the Docker images, such as
docker run -v $(pwd):/home/dev preprocessing
However, I have got the following:
standard_init_linux.go:219: exec user process caused: no such file or directory
Additionally, there is no script to run docker-cpu.Dockerfile. How can I run that container? I have tried something like
docker run --net=host -v $(pwd):/home/dev csnet:cpu bash
but it doesn't return anything; I can see the instance go down by observing the syslog file.
I am using an Ubuntu 18.04 virtual machine with Python 3.6.5.
The train set and valid set are used for training, and the codebase for evaluating the model, so what is the test set for?
Deep Code Search relies on manual evaluation; is there a way to evaluate automatically?
Also, in the Java dataset some descriptions are empty and some are not English (German, for example).
Since the relevance_annotations.csv is not available, a script to upload model_predictions.csv and download a model statistics file (NDCG and MRR, for example) would be great.
Thank you for making this amazing problem statement public, along with a very comprehensive dataset!
Can a relatively smaller dataset (a subset) be made available for independent developers/researchers who might try running this on their personal machines?
This will open up the problem for a larger audience and may bring in some innovative solutions!
The default ordering on the leaderboard is "Mean NDCG" over all results that are not None.
This means it is very easy to appear #1 by overfitting on a single language.
What about computing "Mean NDCG" over all 6 languages, replacing None by 0?
Did I get it right, that for a submission on the challenge, I have to run those 99 queries against the whole code corpus, and not just the test set?
Thanks in advance :)
I'm making a model for this using my own codebase. I wanted to confirm if this way of calculating loss and mrr is correct.
import tensorflow as tf
from tensorflow.keras import backend as K

def softmax_loss(y_true, y_pred):
    q, c = y_pred
    similarity_scores = tf.matmul(q, K.transpose(c))
    per_sample_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=similarity_scores,
        labels=tf.range(q.shape[0])  # the correct code snippet sits on the diagonal
    )
    return tf.reduce_sum(per_sample_loss) / q.shape[0]

def mrr(y_true, y_pred):
    q, c = y_pred
    similarity_scores = tf.matmul(q, K.transpose(c))
    correct_scores = tf.linalg.diag_part(similarity_scores)
    # count how many scores per row are >= the correct (diagonal) score
    compared_scores = similarity_scores >= tf.expand_dims(correct_scores, axis=-1)
    compared_scores = tf.cast(compared_scores, tf.float16)
    return K.mean(tf.math.reciprocal(tf.reduce_sum(compared_scores, axis=1)))
Here, q and c are the query and code feature vectors, each of shape (batch_size, vector_dimension).
I'm a bit hesitant because I feel that MRR and loss depend on the batch size and on the kind of examples in the batch (closely related: MRR might be low; far apart: MRR can be high).
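As a quick sanity check of the mrr function above (eager mode, toy inputs of my own):

q = tf.eye(8, 16)                # 8 orthonormal "query" vectors
print(float(mrr(None, (q, q))))  # identical code vectors -> every rank is 1, so MRR == 1.0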
EDIT:
I've looked through the existing issues. Is my understanding correct that during testing we will have a batch_size of 1000 and we won't be shuffling the data?
This may not be something very important or worth fixing immediately, but there may be a small bug in Java function tokenization.
At least one function in the dataset has code_tokens that do not include a { token.
Quick inspection with
with pd.option_context('display.max_colwidth', -1):
    display(jdf.loc[jdf['url'] == 'https://github.com/jbehave/jbehave-core/blob/bdd6a6199528df3c35087e72d4644870655b23e6/examples/i18n/src/main/java/org/jbehave/examples/trader/i18n/steps/DeSteps.java#L22-L25'][['code', 'code_tokens']])
shows tokens like tring and ymbol for this code:
@Given("ich habe eine Aktion mit dem Symbol $sΓΌmbol und eine Schwelle von $threshold")
public void aStock(@Named("sΓΌmbol") String symbol, @Named("threshold") double threshold) { ...
code_tokens looks like this:
[@, Given, (, "ich habe eine Aktion mit dem Symbol $sümbol und eine Schwelle von $threshold"), , public, void, aStock, (, @, Named, (, "sümbol"), , tring , ymbol,, , N, amed(, ", threshold"), ...]
I'm not very familiar with the extraction pipeline codebase, but the fact that tree-sitter seems to identify the locations well makes me think that JavaParser.get_definition(), which does some index math, may be worth a closer inspection.
How do I submit custom models with only a model_predictions.csv file as the output? It seems like the CI tests fail on custom model submissions and thus the PR is closed (see #184)?
The paper says that there are 503,502 examples available for Python, but when I download the Python data I get 457,461 examples combining the 14 train files, the test file and the valid file. Using the whole corpus (of size 1.1M) to find the examples with a non-empty 'docstring' field, I do end up with the reported number of 503,502. I assume 46k examples have been filtered out, but I cannot seem to find why.
Is there a link where we can download the pretrained baseline models?
Hi, I want to use the parser to process my own data. First I want to try several single files, but the function processor.process_single_file() fails to output anything (just an empty list). How can I debug this? I have used the provided Docker image to set up my environment.
CodeSearchNet is a very good task. I think this will greatly promote the development of this field.
But I have some questions, as follows. One is about changing _declarator to _declaration.
I also saw that one of the goals of this project is to "Open source code for a range of baseline models, along with pre-trained weights".
So is there a pre-trained model? I couldn't find it here.
Hi!
I've made a custom model and now I'm trying to submit it to the leaderboard. Here is the run: https://app.wandb.ai/github/codesearchnet/runs/lqqo1i4m
Above it says 'Awaiting review from codesearchnet benchmark', and that's probably the reason why I get an error when I try to 'Publish to GitHub'. Am I doing something wrong, or do I just have to wait?
Thanks.
Hello, I'd like to do some L2R (learning-to-rank) research with the CodeSearchNet code. I wonder if I can get the data that you've already annotated, as described in the paper's annotation statistics. Thanks a lot!
I'm currently trying to submit a custom model. After entering the run, evaluating the NDCG score, and writing a brief note on how we approached the results, I get an error.
When I click refresh, I am redirected to the submission page again. When trying to re-submit, I get the following error: Invalid CSV format. Please upload a well-formatted CSV file.
I have been trying to implement a "custom" encoder within this codebase, and I was wondering how to get access to the raw tokens (in string form).
What I have tried so far: adding a tokens_str placeholder:

self.placeholders['tokens_str'] = \
    tf.placeholder(tf.string,
                   shape=[None, self.get_hyper('max_num_tokens')],
                   name='tokens_str')
Once I have those, I am trying to pipe the tokens into the tf_hub ELMo module, as follows:

seq_tokens = self.placeholders['tokens_str']
seq_tokens_lengths = self.placeholders['tokens_lengths']

# ## DEBUGGING: OUTPUT SHAPES
# print("Sequence Tokens Shape: %s" % seq_tokens.shape)
# print("Sequence Tokens Lengths: %s" % seq_tokens_lengths)

## pull elmo model from tensorflow hub
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=is_train)
token_embeddings = elmo(
    {
        "tokens": seq_tokens,
        "sequence_len": seq_tokens_lengths
    },
    signature='tokens',
    as_dict=True
)['elmo']  ## [batch_size, max_length, 1024 or 512]
I can see from my debugging statements that the seq_tokens shape is (?, 200), which is expected, and the seq_tokens_lengths shape is (?), which is also expected.
The error I get is len(seq_lens) != input.dims(0), (1000 vs. 0), which is coming from the ELMo model. I'm guessing that this is because seq_tokens is not actually being fed into the ELMo model.
Any help is appreciated!
Thank you for doing a great job putting this together and publishing it, and congratulations on the launch!
This is most probably a nitpick, but a link in https://github.com/github/CodeSearchNet/blob/master/resources/README.md#data-format points to non-existing resources/docs/DATA_FORMAT.md
I'm not familiar with the project yet, but it looks like it can maybe be replaced with https://github.com/github/CodeSearchNet#schema--format
Hope this helps!
I want to make a code search corpus. I have collected a lot of GitHub repositories, and now I need to deconstruct the code into tokens to extract functions and comments. You describe in the paper CodeSearchNet Challenge: Evaluating the State of Semantic Code Search:
We then tokenize all Go, Java, JavaScript, Python, PHP and Ruby functions (or methods) using TreeSitter - GitHub's universal parser - and, where available, their respective documentation text using a heuristic regular expression.
I can extract functions in Python, but without comments. How do you extract functions together with their comments? Can you share your code?
Greetings, I'm a PhD student working in IR/SE. I'm very happy to see the CSNet project; thank you for your great work.
I'm working on software representation and code retrieval/search. I'm familiar with IR datasets, but forgive me, I don't quite understand how the CSNet dataset provides data annotations.
I've downloaded all six languages from AWS, and I've found the queries here: https://github.com/github/CodeSearchNet/blob/master/resources/queries.csv in the resources folder.
I explored the data with a pandas.DataFrame and see that all six languages have been divided into train/test/valid partitions.
But I still cannot find the relation between the queries and the code data. In other words, every query text has no related source code data.
I'm new to the WandB platform; I wonder whether there are gold labels for me to train a model offline, or whether I strictly need to write a dataloader or function for accepting WandB's online label data.
Instead of loading all training and test data, can we load the data in memory in batches, i.e. on the fly during training and evaluation?
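One way to do that (a sketch, assuming the standard gzipped-JSONL layout of the dataset files):

import gzip
import json

def stream_batches(paths, batch_size=200):
    """Yield lists of parsed records without holding whole files in memory."""
    batch = []
    for path in paths:
        with gzip.open(path, 'rt', encoding='utf-8') as f:
            for line in f:
                batch.append(json.loads(line))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch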
Hi,
Fantastic initiative, thanks a lot :-)
Are you planning to publish the relevance judgements, ie, the 4k expert relevance annotations?