
fastdup's Introduction


Manage, Clean & Curate Visual Data - Fast and at Scale

An unsupervised and free tool for image and video dataset analysis.



🔥 We've released fastdup V1.0! View the release notes here.

What's Included

fastdup lets you identify -

Additional features -

Why fastdup?

  • Quality: Find and remove anomalies and outliers from your dataset, including duplicate and similar images and videos, at large scale.
  • Cost: Reduce data operation costs by intelligently sampling high-quality or novel data before labeling, and by assessing the quality of data that is already labeled.
  • Scale: fastdup's C++ graph engine is highly efficient and can handle up to 400M images on a single CPU machine.

Setting up

Prerequisites

Supported Python versions:


Supported operating systems:

Windows 10, Windows 11, Windows Server 2019, Windows WSL, Ubuntu 20.04 LTS, Ubuntu 18.04 LTS, macOS 10+ (Intel), macOS 10+ (M1), Amazon Linux 2, CentOS 7, RedHat 4.8

Installation

Option 1 - Install fastdup via PyPI:

# upgrade pip to its latest version
pip install -U pip

# install fastdup
pip install fastdup
    
# Alternatively, use explicit python version (XX)
python3.XX -m pip install fastdup 

Option 2 - Install fastdup via an Ubuntu 20.04 Docker image on DockerHub:

docker pull karpadoni/fastdup-ubuntu-20.04

Detailed installation instructions and common errors here.
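
After either pip command above, a quick way to confirm that the package landed in the interpreter you intend to use is to ask Python's standard library for the installed version. This is a minimal sketch that relies only on `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of fastdup.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether fastdup is visible to this interpreter.
version = installed_version("fastdup")
if version is None:
    print("fastdup is not installed in this environment")
else:
    print(f"fastdup {version} is installed")
```

If the check reports that fastdup is missing even though pip succeeded, pip and this interpreter probably belong to different Python installations; the `python3.XX -m pip install fastdup` form shown above avoids that mismatch.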

Getting Started

Run fastdup with only 3 lines of code.


Visualize the result.


Here are the 8 lines of code you'll need in most cases.

import fastdup

fd = fastdup.create(work_dir, images_dir)
fd.run(nearest_neighbors_k=5, cc_threshold=0.96)

fd.vis.duplicates_gallery()    # create a visual gallery of found duplicates
fd.vis.outliers_gallery()      # create a visual gallery of anomalies
fd.vis.component_gallery()     # create a visualization of connected components
fd.vis.stats_gallery()         # create a visualization of images statistics (for example blur)
fd.vis.similarity_gallery()    # create a gallery of similar images

View the API docs here.
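
In the snippet above, `work_dir` and `images_dir` are placeholders that you need to define yourself. The sketch below shows one way to wire them up. The helper name `run_fastdup` and the example paths are hypothetical, and the calls simply mirror the `fastdup.create(...)` and `fd.run(...)` lines shown above rather than documenting the full API.

```python
from pathlib import Path


def run_fastdup(images_dir, work_dir, k=5, threshold=0.96):
    """Run fastdup on images_dir and write its output to work_dir.

    Hypothetical convenience wrapper around the calls shown in the README.
    """
    images = Path(images_dir)
    if not images.is_dir():
        raise FileNotFoundError(f"image folder not found: {images}")

    # Create the output folder up front so a typo in the path fails early.
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)

    # Imported lazily so this module can be loaded without fastdup installed.
    import fastdup

    fd = fastdup.create(str(work), str(images))
    fd.run(nearest_neighbors_k=k, cc_threshold=threshold)
    return fd


# Example usage (paths are placeholders):
# fd = run_fastdup("data/images", "fastdup_work")
# fd.vis.duplicates_gallery()
```

Keeping the output folder separate from the image folder makes it easy to delete the results and rerun with different settings.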

Learn from Examples

Quick Dataset Analysis: In this example, learn how to quickly analyze a dataset for potential issues. Identify duplicates, outliers, dark/bright/blurry images, and cluster similar images with only a few lines of code. If you're new, start here.
Dino v2 Embeddings: In this example, learn how to use the DINOv2 model to create and visualize image embeddings.
Cleaning Image Dataset: In this tutorial, learn how to clean a dataset of broken images, duplicates, and outliers, and how to identify dark/bright/blurry images.
Analyzing Labeled Image Classification Dataset: In this tutorial, learn how to analyze a labeled image classification dataset for potential issues. We use the Imagenette dataset, a 10-class, 13k-image subset of ImageNet, as a working example.
Analyzing Labeled Object Detection Dataset: In this tutorial, learn how to load and analyze an object detection dataset with labeled bounding boxes and classes. We use the mini-coco dataset as a working example. Learn how to discover duplicates, outliers, and possibly mislabeled bounding boxes.

Advanced Features

The following advanced features of fastdup are still in beta testing. Sign up for free to be a beta tester and get early access. Drop us an email at [email protected].

Face Detection Video Analysis: In this tutorial, learn how to use fastdup with a face detection model to detect and crop faces from videos. We then analyze the cropped faces for issues such as duplicates, near-duplicates, outliers, and bright/dark/blurry faces.
YOLOv5 Object Detection Video Analysis: In this tutorial, learn how to use fastdup with a pre-trained YOLOv5 object detection model to detect and crop objects from videos. We then analyze the cropped objects for issues such as duplicates, near-duplicates, outliers, and bright/dark/blurry objects.
Satellite Image Analysis: In this tutorial, learn how to use fastdup to load 16-bit grayscale satellite images, work with rotated bounding boxes, understand your dataset, find issues with the data, and check the quality of annotations.
Surveillance Camera Analysis: In this tutorial, learn how to use fastdup to analyze surveillance camera videos, caption the activity inside the videos, and detect whether a scene is indoor or outdoor.

Getting Help

Get help from the fastdup team or community members via the following channels -

Community Contributions

The following are community-contributed blog posts about fastdup -

What our users say


License

fastdup is licensed under the Creative Commons 4.0 license. See LICENSE.

For any queries, reach us at [email protected]

Disclaimer

Usage Tracking

We have added experimental crash report collection using sentry.io. It does not collect user data other than anonymized IP addresses, and it logs only the fastdup library's own actions. We do NOT collect folder names, user names, image names, or image content; we collect only aggregate performance statistics such as the total number of images, average runtime per image, total free memory, total free disk space, and number of cores. Collecting fastdup crashes will help us improve stability.

The code for the data collection is found here. On macOS, we use Google Crashpad.

It is always possible to opt out of the experimental crash report collection via either of the following two options:

  • Define an environment variable called SENTRY_OPT_OUT
  • or run() with turi_param='run_sentry=0'
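
The first opt-out option can be applied from Python itself. This is a minimal sketch under two assumptions: the value "1" is arbitrary, since the text above only says the variable must be defined, and setting it before importing fastdup is a cautious choice rather than a documented requirement.

```python
import os

# Define the opt-out variable for this process.
# The value "1" is an assumption: the README only requires the variable to exist.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("SENTRY_OPT_OUT", "1")

# Import fastdup only after this point, so the variable is already defined
# by the time the library loads.
# import fastdup

# The second option from the list above is to pass
# turi_param='run_sentry=0' to run(); it needs fastdup installed,
# so it is not executed in this sketch.
```

Exporting the variable in your shell profile or container image has the same effect and covers every script that runs in that environment.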

About Visual-Layer

fastdup was founded by the authors of XGBoost, Apache TVM & Turi Create: Danny Bickson, Carlos Guestrin and Amir Alush.

Learn more about Visual Layer here.

fastdup's People

Contributors

amiralush, amirmk89, dbickson, dimafrid, dnth, nagar-omer, rosenfeldamir, sanster, sourabmaity, tompil3r, visualdatabase
