Hi! I tried to use GPT-3.5-turbo for the ToT experiment on Game24 and got similar resu


GPT3.5 ToT Performance is a lot lower (tree-of-thought-llm issue, closed)

princeton-nlp commented on May 27, 2024

GPT3.5 ToT Performance is a lot lower

from tree-of-thought-llm.

Comments (5)

ysymyth commented on May 27, 2024

Hi @IsThatYou, this is a great point. I tried GPT-3.5 and it indeed performs badly on Game of 24. Note, though, that the IO: 36% and CoT: 42% figures are pass@100.
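To make the pass@100 caveat concrete, here is a minimal sketch, assuming pass@k means "a puzzle counts as solved if any of its k sampled answers is correct". The function names and the toy data are hypothetical and are not taken from the repository.

```python
from typing import Sequence


def pass_at_k(samples_correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one correct sample (best-of-k)."""
    if not samples_correct:
        raise ValueError("need at least one problem")
    solved = sum(1 for samples in samples_correct if any(samples))
    return solved / len(samples_correct)


def mean_sample_accuracy(samples_correct: Sequence[Sequence[bool]]) -> float:
    """Average correctness of a single sample, pooled over all problems."""
    total = sum(len(samples) for samples in samples_correct)
    if total == 0:
        raise ValueError("need at least one sample")
    correct = sum(sum(samples) for samples in samples_correct)
    return correct / total


# Hypothetical toy data: 4 puzzles, 3 sampled answers each (k = 3).
toy = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]

best_of_k = pass_at_k(toy)              # 3 of 4 puzzles have a correct sample
per_sample = mean_sample_accuracy(toy)  # 4 of 12 samples are correct
```

On this toy data the best-of-k score is much higher than the per-sample accuracy, which is why a pass@100 number is not directly comparable to a single-run success rate.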

We also tried ToT using GPT-3.5-turbo instead of GPT-4 on Creative Writing (scoring is still via GPT-4). We find all methods perform worse, but ToT is still significantly better than other methods.

| Creative Writing | GPT-4 (in paper) | GPT-3.5-turbo |
|---|---|---|
| IO | 6.19 | 4.47 |
| CoT | 6.93 | 5.16 |
| ToT | 7.56 | 6.62 |

In general, I believe proposing and evaluating diverse thoughts is an "emerging capability" that is hard even for GPT-4, but significantly harder for smaller/weaker models. It would be important and interesting to study how to make smaller models better at ToT reasoning!
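The propose-and-evaluate loop mentioned above can be sketched as a breadth-first search that keeps only the best few candidates at each step. This is a simplified illustration, not the repository's implementation: `tot_bfs`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the two toy functions stand in for what would be LLM calls in practice.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

S = TypeVar("S", bound=Hashable)


def tot_bfs(
    initial: S,
    propose: Callable[[S], Iterable[S]],
    evaluate: Callable[[S], float],
    steps: int,
    breadth: int,
) -> List[S]:
    """Expand every kept state, score the candidates, keep the top `breadth`."""
    frontier: List[S] = [initial]
    for _ in range(steps):
        # Propose next thoughts from each kept state; drop duplicates, keep order.
        candidates = list(dict.fromkeys(c for s in frontier for c in propose(s)))
        if not candidates:
            break
        # Evaluate and prune: higher score means more promising.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:breadth]
    return frontier


# Hypothetical toy task: reach 10 starting from 1 using "+1" or "*2" moves.
def toy_propose(n: int) -> List[int]:
    return [n + 1, n * 2]


def toy_evaluate(n: int) -> float:
    return -abs(10 - n)  # crude heuristic: closer to the target scores higher


narrow = tot_bfs(1, toy_propose, toy_evaluate, steps=4, breadth=2)
wide = tot_bfs(1, toy_propose, toy_evaluate, steps=4, breadth=4)
```

In this toy run the narrow search prunes the path 1, 2, 4, 5, 10 at step 3 because the crude heuristic ranks 5 below 8 and 6, while the wider search keeps it and reaches 10. A weaker proposer or evaluator may hurt in a similar way: good intermediate thoughts are either never generated or are discarded too early.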


ysymyth commented on May 27, 2024

Yes, I agree, and perhaps better prompt engineering can help with the issue.


GithungDang commented on May 27, 2024

Does this mean that open-source models like Vicuna will perform even worse?


GithungDang commented on May 27, 2024

Actually, I want to use ToT to improve the reasoning ability of open-source models so that they get close to the reasoning level of GPT-3.5, rather than only matching its superficial dialogue style.


IsThatYou commented on May 27, 2024

Hi @ysymyth, thank you for the response! I looked closely at some of the generations from gpt-3.5 and gpt-4 and compared them. I found gpt-4 to be better at task understanding in general, and gpt-3.5 degenerates more often than gpt-4. Anyway, this is pretty interesting. It would definitely be interesting to see how to make smaller models better at these tasks. :D

