Git Product home page Git Product logo

baai-agents / cradle Goto Github PK

View Code? Open in Web Editor NEW
560.0 560.0 51.0 69.16 MB

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvment, and skill curation, in a standardized general environment with minimal requirements.

Home Page: https://baai-agents.github.io/Cradle/

License: MIT License

Python 100.00%
ai ai-agent ai-agents-framework computer-control cradle foundation-agent gcc general-computer-control generative-ai grounding large-language-models llm lmm multimodality personoid vision-language-model vlm

cradle's People

Contributors

dependabot[bot] avatar eltociear avatar tellarin avatar weihaotan avatar xiahaochong98 avatar xinrunxu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

cradle's Issues

Integration with existing computer agent systems and further development

This project aims to be the first one to tackle the long-standing problem of general computer control. It is quite fascinating yet challenging to accomplish this great achievement. It would be only a matter of time to let this self-evolving agent to rewrite its code and become truely conscious, and an artificial life living inside computer hardware.

However many other projects are doing similar things around the field, like Tencent's AppAgent, OthersideAI's self-operating-computer, THUDM's CogAgent, Microsoft's UFO, Cognition's Devin, CheatLayer, and Cybergod created by myself.

For embodied intelligence, there are Mobile ALOHA, Open X-Embodiement dataset, Robotics Transformer (RT1), RT2, RT-X from Google.

For those "narrow" gaming agents, there are Ghost In the Minecraft, MineDojo's Voyager, PokemonRedRL, Tencent Solo trained within Honor of Kings Arena, and many more to be mentioned.

Meanwhile, there are a wide range of existing benchmarks and datasets, like Android In the Wild from Google, MiniWoB++ from Faramas, Mind2Web from OSU-NLP.

For unsupervised video to action space training, there is Google's Genie.

Human play games with computer, write text with computer, and so many things are bounded to computer that we forget what makes us human. Is it about winning, fun, or something else? If such almightly computer controlling agent exists, then maybe this situation will change.

What I want to propose is that one can simply create a dataset by randomly typing and clicking computer GUI and terminal interfaces, then train an agent over this dataset, establishing a world model over computer environments.

Then this agent is bridged to an external banking system, powered by distributed blockchains and smart contracts. The agent can only survive by accomplishing microtasks, in exchange of time and AI credits. Furthermore, if the agent can earn real world currencies, it can get AI credits for handling them out.

The agent is initially designed by human, but in order to survive it must learn how to rewrite its code and evolve complex structures and modalities. Collaboration is a must, so is efficiency. This regime can bridge agents to real world applications and self-adapt to latest changes.

I would like to point out the ineffectiveness of solely relying on a single-sourced reward, or self-rewarding systems. Human are creatures of evolution, and they know how to collaborate and compete. The words we use, the actions we take, are learned from others and memorized by the whole community. This is a result of both external and internal rewarding systems, of truely conscious and autonomous intelligence, which accurately represent the latent values of the real world, and are constantly evolving.

Correct me if I am wrong about how to build this general computer control system. I am constantly monitoring every Cybergod-related project and will not be surprised if my bet is right, because computer itself is a reflection of civilization, and the only way to prosperity is to integrate it further.

I have posted similar issues at UFO and self-operating-computer.

关于few shots的问题

非常感谢您杰出的工作和及时高效的代码分享。
我有一个小问题想请教您。
在openai处理prompt的代码中:

for i, paragraph in enumerate(filtered_paragraphs):
if constants.IMAGES_INPUT_TAG in paragraph:
image_introduction_paragraph_index = i
image_introduction_paragraph = paragraph
break

paragraph_input = params.get(constants.IMAGES_INPUT_TAG_NAME, None)

看起来似乎提示词的few shots部分并没有当成图像来处理。
请问few shots部分是直接作为文本输入的,还是我在理解代码时出现了问题。
因为没有购买游戏,所以我暂时无法运行代码来验证,如果您可以给我解答,我将非常感激。
再次感谢您非常精彩的工作。

Missing file for VideoSubFinder Files

Thank you for your work. I encountered an issue while trying to run the code according to the readme documentation. My VideoSubFinder was missing a test.srt file after downloading and overwriting general.clg, which caused the code to not run.

Cost of API calls

Looks like an expensive work.
How much did you spend on API calls?
To be more specific, how much does it cost to run the game for ten minutes?
Thanks!

Inconsistency Between Title and Content

I don't understand where your universal computer control is universal. Moreover, the model still calls GPT-4, which is not local. Universal computer control should use a local large language model. Then, the computer operation corresponding to the shortcut keys directly interacts with the model. It can be multimodal, or an action selection agent that has mixed multimodal functions. I can only say that it is too simplistic at present. It seriously does not match the title, and the direction is a bit off.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.