split-folders

Split folders with files (e.g. images) into train, validation and test (dataset) folders.

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

Splitting it gives you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
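To check that a split produced the layout above, it can help to count the files per set and class. The following is a minimal sketch using only the standard library; the helper name `count_files` is hypothetical and not part of split-folders.

```python
from pathlib import Path


def count_files(output_dir):
    """Return {split: {class_name: number_of_files}} for an output folder.

    Hypothetical helper, not part of split-folders. It assumes the
    output/<split>/<class>/<file> layout shown above.
    """
    summary = {}
    for split_dir in sorted(Path(output_dir).iterdir()):
        if not split_dir.is_dir():
            continue  # ignore stray files next to train/val/test
        summary[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return summary
```

Calling `count_files("output")` after a split returns a nested dictionary keyed by set name and then class name, which makes an unexpectedly empty class easy to spot.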

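This should get you started on some serious deep learning with your data. It is a good idea to split your data into three different sets: the training set is used to fit the model, the validation set to tune it, and the test set to check the final result on data the model has never seen.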

  • Split files into a training set and a validation set (and optionally a test set).
  • Works with any file type.
  • The files get shuffled.
  • A seed makes splits reproducible.
  • Allows randomized oversampling for imbalanced datasets.
  • Optionally group files by prefix.
  • (Should) work on all operating systems.
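The shuffling and seeding listed above can be illustrated with a short sketch. This is not the library's implementation, only one plausible way to get a reproducible split with the standard library; the function name `split_by_ratio` and the rounding rule are assumptions.

```python
import random


def split_by_ratio(files, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle with a fixed seed and cut the list into consecutive parts.

    Illustrative only: split-folders may round or order files differently.
    """
    files = sorted(files)               # stable starting order, independent of input order
    random.Random(seed).shuffle(files)  # the same seed always yields the same order
    parts, start = [], 0
    for share in ratio[:-1]:
        end = start + int(len(files) * share)
        parts.append(files[start:end])
        start = end
    parts.append(files[start:])         # the last set takes whatever remains
    return parts


train, val, test = split_by_ratio([f"img{i}.jpg" for i in range(10)])
print(len(train), len(val), len(test))
```

Because the list is sorted before shuffling, running the function twice with the same seed assigns each file to the same set, regardless of the order in which the file system listed it.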

Install

This package is pure Python and has no external dependencies.

pip install split-folders

Optionally, you may install tqdm to get a progress bar when moving files.

pip install split-folders[full]

Usage

You can use split-folders as a Python module or through its Command Line Interface (CLI).

If your dataset is balanced (each class has the same number of samples), choose ratio; otherwise, choose fixed. NB: oversampling is turned off by default. Oversampling is only applied to the train folder since having duplicates in val or test would be considered cheating.
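Randomized oversampling of the training set can be sketched as follows. This is a simplified illustration, not the code split-folders runs: the function name `oversample_train` is hypothetical, and topping every class up to the size of the largest class is an assumption about the target size.

```python
import random


def oversample_train(train_by_class, seed=1337):
    """Duplicate randomly chosen files so every class matches the largest one.

    Hypothetical sketch. Assumes every class has at least one file and that
    only the training set is passed in; val and test are left untouched.
    """
    rng = random.Random(seed)
    target = max(len(files) for files in train_by_class.values())
    balanced = {}
    for name, files in train_by_class.items():
        # Draw with replacement from the class's own files to fill the gap.
        extra = [rng.choice(files) for _ in range(target - len(files))]
        balanced[name] = list(files) + extra
    return balanced


print(oversample_train({"class1": ["a.jpg", "b.jpg", "c.jpg"], "class2": ["x.jpg"]}))
```

The duplicates only ever come from the class's own training files, which is why the validation and test sets stay free of repeated samples.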

Module

import splitfolders

# Split with a ratio.
# To only split into a training and a validation set, pass a two-value tuple as `ratio`, e.g. `(.8, .2)`.
splitfolders.ratio("input_folder", output="output",
    seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=False) # default values

# Split val/test with a fixed number of items, e.g. `(100, 100)`, for each set.
# To only split into a training and a validation set, pass a single number as `fixed`, e.g. `10`.
# Set 3 values, e.g. `(300, 100, 100)`, to limit the number of training items.
splitfolders.fixed("input_folder", output="output",
    seed=1337, fixed=(100, 100), oversample=False, group_prefix=None, move=False) # default values

Occasionally, you may have items that comprise more than a single file (e.g. picture (.png) + annotation (.txt)). splitfolders lets you split files into equally-sized groups based on their prefix. Set group_prefix to the length of the group (e.g. 2). Once group_prefix is set, every file must belong to a group.
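The idea behind prefix grouping can be sketched like this. The exact matching rule split-folders uses is not described here, so this sketch assumes that the prefix is the file name without its extension; `group_by_stem` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(filenames, group_size=2):
    """Bundle files that share a name (ignoring the extension) into groups.

    Hypothetical sketch: treats the stem as the prefix and rejects any
    file that does not end up in a group of exactly `group_size`.
    """
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[Path(name).stem].append(name)
    incomplete = sorted(s for s, g in groups.items() if len(g) != group_size)
    if incomplete:
        raise ValueError(f"files without a complete group: {incomplete}")
    return [tuple(g) for g in groups.values()]


print(group_by_stem(["img1.png", "img1.txt", "img2.txt", "img2.png"]))
```

Shuffling and splitting then operate on whole groups rather than single files, so a picture and its annotation always land in the same set.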

Set move=True if you want to move the files instead of copying.
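The difference between copying and moving can be shown with the standard library. This is a sketch of the general technique, not the library's internals; the helper name `transfer` is hypothetical.

```python
import shutil
from pathlib import Path


def transfer(src, dst_dir, move=False):
    """Copy `src` into `dst_dir`, or move it there when `move` is True.

    Hypothetical helper. Copying keeps the input folder intact and needs
    extra disk space; moving empties the input folder as it goes.
    """
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    if move:
        return Path(shutil.move(str(src), str(dst_dir / Path(src).name)))
    return Path(shutil.copy2(str(src), str(dst_dir)))  # copy2 also keeps timestamps
```

Copying is the safer default, since the original dataset is still available if a different ratio or seed is wanted later.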

CLI

Usage:
    splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] [--move] folder_with_images
Options:
    --output        path to the output folder. defaults to `output`. Is created if it does not exist.
    --ratio         the ratio to split. e.g. for train/val/test `.8 .1 .1 --` or for train/val `.8 .2 --`.
    --fixed         set the absolute number of items per validation/test set. The remaining items constitute
                    the training set. e.g. for train/val/test `100 100` or for train/val `100`.
                    Set 3 values, e.g. `300 100 100`, to limit the number of training values.
    --seed          set seed value for shuffling the items. defaults to 1337.
    --oversample    enable oversampling of imbalanced datasets, works only with --fixed.
    --group_prefix  split files into equally-sized groups based on their prefix
    --move          move the files instead of copying
Example:
    splitfolders --ratio .8 .1 .1 -- folder_with_images

Because --ratio accepts a variable number of values, you have to add -- after them (before the folder path) so that the path is not read as another ratio value.
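The need for the separator can be reproduced with a small argparse parser. This is a stand-alone demonstration; that the split-folders CLI is built the same way is an assumption, and the parser below is not its actual code.

```python
import argparse

# A toy parser with one option that takes one or more values,
# followed by a positional argument.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--ratio", nargs="+", type=float)
parser.add_argument("folder")

# With `--`, everything after it is treated as positional.
args = parser.parse_args(["--ratio", ".8", ".1", ".1", "--", "folder_with_images"])
print(args.ratio, args.folder)

# Without `--`, the folder name is consumed as a fourth ratio value,
# which fails the float conversion and makes argparse exit with an error.
try:
    parser.parse_args(["--ratio", ".8", ".1", ".1", "folder_with_images"])
except SystemExit:
    print("parsing failed without the -- separator")
```

Placing the folder before the option, so that nothing follows the ratio values, avoids the ambiguity in this toy parser as well.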

Instead of the command splitfolders you can also use split_folders or split-folders.

Development

Install and use poetry.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

MIT

Contributors

jfilter, mariusmez, nicholastzx, ghltshubh, andife, dependabot[bot], snul2
