The All-Seeing Project
[Paper][AS-1B Dataset Browser]
This is the official implementation of the paper "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World". (The name "All-Seeing" comes from "The All-Seeing Eye", which signifies complete knowledge, awareness, or insight into all aspects of existence. The logo is the Millennium Puzzle, an artifact from the manga "Yu-Gi-Oh!".)
- Release the ASM model.
- Release the human verification results of AS-1B.
- Release the detailed region annotations of AS-1B.
- Release the semantic tags of AS-1B.
- Release an online demo, including the dataset browser and the ASM demo.
We present the All-Seeing Project with:
All-Seeing 1B (AS-1B) dataset: we propose a new large-scale dataset (AS-1B) for open-world panoptic visual recognition and understanding, built with an economical semi-automatic data engine that combines off-the-shelf vision/language models with human feedback.
All-Seeing Model (ASM): we develop a unified vision-language foundation model (ASM) for open-world panoptic visual recognition and understanding. Aligned with LLMs, our ASM supports versatile image-text retrieval and generation tasks and demonstrates impressive zero-shot capability.
![image](https://private-user-images.githubusercontent.com/8529570/258205443-e43ab8db-6437-46f1-8aa1-c95f012e9147.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMzNzE2MDEsIm5iZiI6MTcyMzM3MTMwMSwicGF0aCI6Ii84NTI5NTcwLzI1ODIwNTQ0My1lNDNhYjhkYi02NDM3LTQ2ZjEtOGFhMS1jOTVmMDEyZTkxNDcucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDgxMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA4MTFUMTAxNTAxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NGQxZDQxNDdjMmFjNjI3NTExZDY3YzZkYTdhYTgwYzdmOGFmYTA4NDc4N2ViMmI0NmJlMDEyM2FhYTk2M2JkNSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.lLZawPpOyLXxQAmHCYtocBq_RlWEM6OwMHyr6BmSAxE)
Figure 1: Overview and comparison of our All-Seeing project with other popular large foundation models.
[TODO] The ASM model will be integrated into InternGPT.
Dataset Browser will be available here.
AS-1B contains over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers 3.5 million common and rare real-world concepts and includes 132.2 billion tokens describing the concepts and their attributes.
![image](https://private-user-images.githubusercontent.com/8529570/258205645-adac37ed-312f-4f11-ba8a-6bc62067438f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMzNzE2MDEsIm5iZiI6MTcyMzM3MTMwMSwicGF0aCI6Ii84NTI5NTcwLzI1ODIwNTY0NS1hZGFjMzdlZC0zMTJmLTRmMTEtYmE4YS02YmM2MjA2NzQzOGYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDgxMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA4MTFUMTAxNTAxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ODM3MmY4ZDY5MTRlYjliM2MzZGUyMzVhOTNkN2M0ZjgyNDM2ODNlZWNhODdiYjdjZjZiMDhiMzc3NmIxODk5NiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.uTVOWURkRZrHX6d4I56LuxJ_8q_4kvPy7G3izLphG7g)
Some examples:
![image](https://private-user-images.githubusercontent.com/8529570/258205789-fcf6ab07-c4ba-441c-aa6c-111c769f75b1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMzNzE2MDEsIm5iZiI6MTcyMzM3MTMwMSwicGF0aCI6Ii84NTI5NTcwLzI1ODIwNTc4OS1mY2Y2YWIwNy1jNGJhLTQ0MWMtYWE2Yy0xMTFjNzY5Zjc1YjEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDgxMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA4MTFUMTAxNTAxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YzgzZTE2N2VkMGQ0Njg3MmNkZjNjN2Y3MzQ0ODM5YjNmN2M3ZDAzNDQ4MmFmODk0ZDAxNjUzN2NlNTE0M2MwYSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.1WOfcffguqIO2cR2W8fnN0-s6mK9auLI8IoMSKeoFIk)
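To make the three annotation types concrete, here is a minimal sketch of what a single region record could look like. This is an illustrative structure only, not the released AS-1B schema; all field names and values are assumptions.

```python
# Illustrative sketch (NOT the released AS-1B schema): each region in
# AS-1B carries a semantic tag, question-answering pairs, and a detailed
# caption. A hypothetical record bundling the three annotation types:
region_annotation = {
    "bbox": [48, 102, 310, 415],        # region location in pixel coords (assumed field)
    "semantic_tag": "golden retriever",  # open-world concept label
    "qa_pairs": [
        {"question": "What is the dog doing?",
         "answer": "It is lying on the grass."},
    ],
    "caption": "A golden retriever lying on a sunlit lawn.",
}

print(sorted(region_annotation))
```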
The All-Seeing Model (ASM) is a unified framework for panoptic visual recognition and understanding, supporting image/region-text retrieval, image/region recognition, captioning, and question answering.
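At a high level, the retrieval tasks above reduce to nearest-neighbour search in a shared embedding space. The sketch below illustrates that idea with toy vectors; it is not the official ASM API, and the embeddings stand in for model outputs.

```python
import numpy as np

# Illustrative sketch (NOT the official ASM API): image/region-text
# retrieval as cosine similarity between embeddings in a shared space.
# The toy vectors below stand in for real model outputs.

def retrieve(region_embeds: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """For each region embedding, return the index of the best-matching text."""
    # L2-normalise so the dot product equals cosine similarity.
    r = region_embeds / np.linalg.norm(region_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = r @ t.T                  # (num_regions, num_texts) similarity matrix
    return sim.argmax(axis=1)

# Toy example: two regions, three candidate captions.
regions = np.array([[1.0, 0.1], [0.1, 1.0]])
texts = np.array([[0.9, 0.0], [0.0, 0.8], [0.5, 0.5]])
print(retrieve(regions, texts))    # region 0 matches text 0, region 1 matches text 1
```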
This project is released under the Apache 2.0 license.