assert-kth / codrep

58069 Java source code diffs. http://arxiv.org/pdf/1807.03200

Home Page: http://arxiv.org/pdf/1807.03200

codrep's People

egorbu vmarkovtsev egor-bogomolov martinezmatias cdmatters tdurieux mkkaushiksjce guillaumeuvhc mrezende cloudstdiolab xwixcn benhoff epicfaace gildasmulders nashid

codrep's Issues

Participant %9: Singapore Management University

Created for Nghi D. Q. Bui and Lingxiao Jiang from Singapore Management University for discussions. Welcome!

Compiler warning appended at the end of some files

I noticed that in some task files there is an extra line appended at the end of the file.
I think this is probably a compiler warning that you appended by accident.
An example of this is 3.txt in Dataset 2. https://github.com/KTH/CodRep-competition/blob/master/Datasets/Dataset2/Tasks/3.txt

Participant %13: Team madPL, University of Wisconsin--Madison & Microsoft Research

Created for Team madPL from University of Wisconsin--Madison & Microsoft Research for discussions. Welcome!

Jordan Henkel, Shuvendu Lahiri, Ben Liblit, Thomas Reps

Participant %12: Team COINSE, KAIST

Created for Team COINSE (Gabin An, Shin Yoo) from KAIST, South Korea, for discussions. Welcome!

"Hidden" dataset for weekly rankings without any solutions?

Follow up of #6 (@EgorBu)

I recommend a Kaggle-style approach: publish the test dataset without solutions, so that you can receive predictions, publish a public score, and compute a private score.

Announcements about CodRep

Watch the repo to get notified about important news!

Add explanation about loss function in case of several predictions

Hello,
Thank you for a great competition!
One thing is not very clear to me: how is the loss function computed when several lines are predicted for one file?
As I understand it, it will be the minimal loss among all predictions for that file. Is that correct?
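The reading asked about above can be sketched as follows. This is only an illustration under stated assumptions: the per-prediction loss is taken to be tanh of the absolute distance between the predicted and the correct line number, a file without any prediction is given a loss of 1, and the names `line_loss` and `min_loss` are hypothetical rather than taken from evaluate.py.

```python
import math


def line_loss(predicted, correct):
    """Assumed per-prediction loss: tanh of the absolute line distance."""
    return math.tanh(abs(predicted - correct))


def min_loss(predictions, correct):
    """Loss for one file under the 'minimal loss among all predictions' reading."""
    if not predictions:
        # Assumption: a file with no prediction gets the maximal loss.
        return 1.0
    return min(line_loss(p, correct) for p in predictions)


# One of the three guesses is exact, so the file's loss is zero.
print(min_loss([10, 35, 80], correct=35))
# A single guess that is off by one line is already penalized noticeably.
print(min_loss([34], correct=35))
```

Under this reading, any extra guess can only lower the loss for a file, which is why the aggregation rule matters.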

Participant %7: Team CSV, Universidad Central "Marta Abreu" de Las Villas

Created for Team CSV (@cesarsotovalero) from the Universidad Central "Marta Abreu" de Las Villas for discussions. Welcome!

Submission process for intermediate ranking (deadline July 4th 2018)

As of now, and before July 4th, you send us your solution:
- by adding us as a collaborator to your GitHub/GitLab/Bitbucket project (preferred, so that we can discuss on the issue tracker directly)
- or by sending an archive by email, with the code (source or binary) and instructions on how to compile it or install the dependencies

Your program must run locally: it cannot use the network to download data or query a server. The network will be cut during the evaluation on the hidden dataset.
We execute your tool on the hidden dataset.
If we have problems compiling or executing your code, we will discuss them with you incrementally.

Machine used for evaluation: Ubuntu 18.04 LTS, Intel CPU at 2299 MHz, 16 GB RAM

Don't hesitate to comment here about the process.

Participant %3: Team Avmb, The University of Edinburgh

Created for Team Avmb from the University of Edinburgh for discussions. Welcome!

predicting all lines results in a minimum loss

If several predictions are given for a file, the loss function should be the average or the maximum loss.
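A small sketch of why the aggregation rule matters, assuming the per-prediction loss is tanh of the absolute line distance and using a hypothetical `aggregate` helper (neither is taken from evaluate.py). Predicting every line of a 100-line file yields zero loss under the minimum, while the mean and the maximum stay close to 1.

```python
import math


def line_loss(predicted, correct):
    """Assumed per-prediction loss: tanh of the absolute line distance."""
    return math.tanh(abs(predicted - correct))


def aggregate(predictions, correct, how):
    """Combine the losses of several predictions for one file."""
    losses = [line_loss(p, correct) for p in predictions]
    if not losses:
        return 1.0  # assumption: no prediction counts as maximal loss
    if how == "min":
        return min(losses)
    if how == "max":
        return max(losses)
    if how == "mean":
        return sum(losses) / len(losses)
    raise ValueError("unknown aggregation: " + how)


# "Predict everything": one guess per line of a 100-line file.
all_lines = list(range(1, 101))
for how in ("min", "mean", "max"):
    print(how, aggregate(all_lines, correct=35, how=how))
```

With the mean or the maximum, adding wrong guesses makes the score worse, so the "predict all lines" strategy no longer pays off.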

Participant %1: Bogomolov et al., JetBrains Research, HSE

Team name: SPbAU, Bogomolov
Naive solution
Error on Dataset1: 0.164
Error on Dataset2: 0.1235

Participant %10: Team Ericsson-RISE, Ericsson & RISE

Created for Jesper and Olof from Ericsson and RISE for discussions. Welcome!

Discussion about baselines for CodRep

Hi,
Thanks a lot for organizing this :) Hope that you don't mind the drive-by issue submission: I would like to suggest three additional, strong, but reasonable, baselines:

- Random prediction over the lines where the code still parses after the replacement;
- The line that is the most similar to the line being added (e.g. max % common tokens between the lines);
- The combination of the above.

The reason I am suggesting this is that these baselines seem like easy "hacks" that achieve reasonable performance without any machine learning.
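The second baseline could look roughly like the sketch below. It is an illustration, not a reference implementation: "% common tokens" is interpreted here as the Jaccard overlap of token sets, the tokenizer is a simple regular expression rather than a Java lexer, and `tokens` and `most_similar_line` are hypothetical names. The "still parses" filter is left out because it would need a Java parser.

```python
import re


def tokens(line):
    """Split a line into identifiers, numbers, and single punctuation marks."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|\S", line))


def most_similar_line(new_line, source_lines):
    """Return the 1-based number of the line sharing the most tokens with new_line."""
    target = tokens(new_line)
    best_index, best_score = None, -1.0
    for index, line in enumerate(source_lines, start=1):
        candidate = tokens(line)
        union = target | candidate
        # Jaccard overlap: shared tokens divided by all distinct tokens.
        score = len(target & candidate) / len(union) if union else 0.0
        if score > best_score:
            best_index, best_score = index, score
    return best_index


source = [
    "int count = 0;",
    "for (int i = 0; i < n; i++) {",
    "    count += i;",
    "}",
]
print(most_similar_line("for (int i = 1; i < n; i++) {", source))
```

Ties are resolved in favor of the earliest line, which is an arbitrary choice.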

Participant %6: source{d}

Created for the source{d} team. We plan to keep track of our approaches and solutions in this issue.

Add explanation about hidden dataset

Hello,
Thank you for a great competition!
Could you add more information about the hidden dataset?
Off-topic: I recommend a Kaggle-style approach: publish the test dataset without solutions, so that you can receive predictions, publish a public score, and compute a private score.

Participant %4: @tdurieux, INRIA

Hi all,

I just did a quick naive solution based on string distance:

Dataset    Perfect Match    In Top 10      Recall    Loss
Bench 1    3791             4322 (98%)     0.86      0.13615878141899027
Bench 2    9910             10805 (97%)    0.89      0.10263617900182995
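For readers who want to try something similar, here is a minimal sketch of a string-distance ranking. The post does not say which distance was used, so this assumes the similarity ratio from Python's standard `difflib.SequenceMatcher`. The function name `rank_lines` and the `top_k` parameter are hypothetical.

```python
import difflib


def rank_lines(new_line, source_lines, top_k=10):
    """Return up to top_k 1-based line numbers, most similar to new_line first."""
    stripped = new_line.strip()
    scored = []
    for index, line in enumerate(source_lines, start=1):
        # ratio() is in [0, 1]; 1.0 means the two strings are identical.
        ratio = difflib.SequenceMatcher(None, stripped, line.strip()).ratio()
        scored.append((ratio, index))
    # Highest similarity first; earlier lines win ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:top_k]]


source = ["int a = 1;", "return a + b;", "int b = 2;"]
print(rank_lines("return a - b;", source))
```

The first element of the returned list would be the single prediction, and the full list corresponds to an "In Top 10" style measurement.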

Participant %8: Marcelo Martins, IPT Sao Paulo

Created for Marcelo Martins (@mrezende) from IPT Sao Paulo for discussions. Welcome!

Participant %2: Allamanis et al., Microsoft Research

Hi @mallamanis!

According to the interesting points discussed in #13, you may submit a proposal (and we do hope so :-), so here is your participant wall!

The idea is to post here findings that are specific to your solution and to tease with the corresponding scores on Dataset1 and Dataset2.

Note that it's also perfectly OK to open other issues.

Participant %5: Team ARD, Siemens Technology

Team ARD
Institution: STSPL (Siemens Technology and Services Private Limited)
Github contact: @Amulyard

Welcome!

Participant %11: Team OttoRepairs, Otto-von-Guericke University Magdeburg

Created for Team OttoRepairs from Otto-von-Guericke University Magdeburg for discussions. Welcome!

evaluate.py does not accept empty line numbers

As stated in the README:
"Your program does not have to predict something for all input files, if there is no clear answer, simply don't output anything, the error computation takes that into account, more information about this in Loss function below."

This is a bit ambiguous: does it mean that the output should be skipped altogether, or that one could output just the filename with no line number?
In either case, it should be either fixed or clarified in the README.

The latter does not work (full path omitted):
echo "CodRep-competition/Datasets/Dataset1/Tasks/2703.txt 35" | python evaluate.py
Total files: 34096
Average line error: 0.999970671046 (the lower, the better)
Recall@1: 2.93289535429e-05 (the higher, the better)

echo "CodRep-competition/Datasets/Dataset1/Tasks/2703.txt" | python evaluate.py
Traceback (most recent call last):
  File "evaluate.py", line 183, in <module>
    main()
  File "evaluate.py", line 168, in main
    prediction = inputs[1]
IndexError: list index out of range

echo "CodRep-competition/Datasets/Dataset1/Tasks/2703.txt " | python evaluate.py
Traceback (most recent call last):
  File "evaluate.py", line 183, in <module>
    main()
  File "evaluate.py", line 168, in main
    prediction = inputs[1]
IndexError: list index out of range
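One possible way to tolerate a filename without a line number is sketched below. This is not the actual code of evaluate.py: `parse_prediction` is a hypothetical helper, and it assumes paths contain no whitespace and that a missing or non-numeric second field should count as "no prediction".

```python
def parse_prediction(raw_line):
    """Return (path, line_number); line_number is None when no prediction is given."""
    fields = raw_line.split()
    if not fields:
        return None, None  # blank input line
    path = fields[0]
    if len(fields) < 2:
        return path, None  # filename only: treat as "no prediction"
    try:
        return path, int(fields[1])
    except ValueError:
        return path, None  # second field is not a line number


print(parse_prediction("Tasks/2703.txt 35"))
print(parse_prediction("Tasks/2703.txt"))
print(parse_prediction("Tasks/2703.txt "))
```

A caller could then skip files whose line number is `None`, which would avoid the `IndexError` shown above.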