flagopen / taco
License: Apache License 2.0
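Your work is fantastic! TACO is the best open-source code generation dataset I have seen so far.
I am curious whether you considered data contamination when building the dataset. In the LLM era, whether the test set is contaminated is a very important point of reference.
I could not find any related information in the paper.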
When I use compute_metric.py to evaluate the generation results, the console reports "no module named pyext". I installed it using pip and got the following error:
Collecting pyext
Using cached pyext-0.7.tar.gz (7.8 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "...\AppData\Local\Temp\pip-install-exx9gznl\pyext_28002897deae467da164cbba24ad8613\setup.py", line 6, in <module>
import pyext
File "...\AppData\Local\Temp\pip-install-exx9gznl\pyext_28002897deae467da164cbba24ad8613\pyext.py", line 117, in <module>
oargspec = inspect.getargspec
^^^^^^^^^^^^^^^^^^
AttributeError: module 'inspect' has no attribute 'getargspec'. Did you mean: 'getargs'?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
After searching online, I found that this is a compatibility problem rather than a pip problem: inspect.getargspec was removed in Python 3.11, which I am using to support OpenAI's models.
The issue does not occur in Python 3.8, so I downgraded to Python 3.8.
Someone also mentioned that the problem may be resolved in the second quarter of 2024: https://community.privacyidea.org/t/python-3-11-support/3115/2
I suggest adding detailed environment setup instructions to the guide page.
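For readers hitting the same traceback, the sketch below illustrates the cause. On Python 3.11 and later, inspect.getargspec no longer exists, while inspect.getfullargspec does. The helper name getargspec_compat and the ArgSpec tuple are hypothetical, introduced only for this illustration. The shim would not fix "pip install pyext" itself, because setup.py runs in a separate process. It could only help if applied before importing a locally available copy of the module, which is an assumption about your setup.

```python
import collections
import inspect

# Hypothetical stand-in for the 4-field result of the removed inspect.getargspec.
ArgSpec = collections.namedtuple("ArgSpec", "args varargs keywords defaults")


def getargspec_compat(func):
    """Approximate the removed inspect.getargspec using inspect.getfullargspec."""
    full = inspect.getfullargspec(func)
    # Keyword-only arguments and annotations are dropped in this simplified view.
    return ArgSpec(full.args, full.varargs, full.varkw, full.defaults)


# Install the shim only where the attribute is missing (Python 3.11+).
if not hasattr(inspect, "getargspec"):
    inspect.getargspec = getargspec_compat
```

Running the evaluation under an older interpreter such as Python 3.8, as described above, avoids the need for any shim.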
I am trying to evaluate codellama-7b on the easy-difficulty problems of the TACO dataset. However, I find that the generated code must follow a specific format in order to pass the test cases, for example
s = input()\nprint(s.swapcase())
rather than
def solve(s):\n return s.swapcase()
How can I get the model to produce output in the appropriate format?
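One possible workaround, sketched here under assumptions, is to post-process function-style generations into stdin/stdout programs before evaluation. The helpers wrap_as_stdio_program and run_program are hypothetical and not part of the TACO repo. The driver assumes the function takes a single line of input as a string, which will not hold for every problem.

```python
import contextlib
import io
import sys


def wrap_as_stdio_program(function_source: str, func_name: str = "solve") -> str:
    """Append a driver that feeds one line of stdin to the function and prints the result."""
    driver = f"print({func_name}(input()))"
    return function_source.rstrip() + "\n\n" + driver + "\n"


def run_program(program: str, stdin_text: str) -> str:
    """Execute a program string with the given stdin text and return what it printed."""
    old_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)
    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            exec(program, {"__name__": "__main__"})
    finally:
        sys.stdin = old_stdin  # always restore the real stdin
    return captured.getvalue()


generated = "def solve(s):\n    return s.swapcase()"
program = wrap_as_stdio_program(generated)
output = run_program(program, "Hello\n")
```

Adjusting the prompt so the model writes a complete program that reads from standard input is the other obvious route. Which of the two matches the intended evaluation setup is a question for the maintainers.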
It seems that the reported pass@1, pass@10, and pass@100 results were all measured on the train split (or on train + test?). Have you evaluated on the test split separately?
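I sampled 20 problems from the test set and used compute_metric.py to evaluate the solution code that ships with the dataset, but only about 60%-70% of the solutions passed. I picked one of the failing solutions: by manual inspection it is correct, and it was also Accepted when submitted to Codeforces, so the problem is likely in the evaluation. Also, were the results reported in the paper computed with this script?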
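I tried the 200 problems in the easy part of the test split, and the pass@1 I measured was only about 3, which does not match the accuracy reported in the paper.
I used the prompt and evaluation code from the repo, with n_samples=1 and temperature=0.8.
Could you help me find the cause? Thank you very much!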
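For context on the numbers discussed here, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that pass. Whether compute_metric.py uses exactly this formula is an assumption. The function name pass_at_k is chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With a single sample per problem, pass@1 is simply 0 or 1 for that problem.
single_fail = pass_at_k(n=1, c=0, k=1)
single_pass = pass_at_k(n=1, c=1, k=1)
# With 10 samples of which 3 pass, pass@1 equals the passing fraction.
ten_samples = pass_at_k(n=10, c=3, k=1)
```

With n_samples=1, the dataset-level pass@1 is just the fraction of problems whose single sample passed, so it is a noisy estimate when sampling at a nonzero temperature.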
Hi, thanks for the great work! I am curious about the criteria used for the difficulty level annotations in the files. Are they based on each website's own tagging or on user pass rates? Can you share more details on this? Thank you!
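I tried again to evaluate the solution code that ships with the dataset, but every evaluation seems to fail: all results returned by check_correctness() are -1. I am not familiar with the APPS testing framework, so I did not investigate further. Could you look into this issue?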
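As a debugging aid, the sketch below tallies per-test result codes. It assumes the convention used by the APPS evaluation code (-2 for a compile error, -1 for a runtime error or timeout, False for a failed test, True for a passed test). Whether TACO's check_correctness() follows the same convention is an assumption, and the helper names are hypothetical.

```python
from collections import Counter

# Assumed meaning of the integer codes, following the APPS convention.
CODE_LABELS = {-2: "compile error", -1: "runtime error or timeout"}


def label_result(result) -> str:
    """Map one per-test result to a readable label."""
    if isinstance(result, bool):
        return "passed" if result else "failed"
    return CODE_LABELS.get(result, "unknown")


def summarize(results) -> Counter:
    """Count how many test cases fall into each outcome."""
    return Counter(label_result(r) for r in results)


summary = summarize([-1, -1, True, False, -2])
```

Under that assumed convention, a run where every result is -1 points to the harness failing to execute the code (for example an import error or a timeout) rather than to wrong answers.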
Hello! I'm having some trouble reproducing the finetuning with the script – would you mind releasing the trained models so I can verify their evaluation results and build on top of them? Thank you!
Thanks for releasing this dataset and for all the amazing work you have done! Do you have any data on the specific performance of GPT-4 on the test set? If so, could you share it?