fs-curator's Introduction

FS-Curator

The curator is a meta-data repository that organizes your files. It is designed to utilize modern filesystems to their full potential while keeping the operator in total control.

The service

fs-curator workflow

  • Makes no assumptions about, nor places any demands on, your workflow(s)
  • Fault-tolerant design that leverages OS guarantees for graceful degradation in the case of user error
  • Configure for what you want, not what to do: write 100 lines of config and it reorganizes 100K files

Other tools

other tools workflow

  • Demand that your workflow adapt to their assumptions, and behave unexpectedly if those assumptions are violated
  • Usually a single point of failure that degrades terribly if corrupted
  • A narrow purpose restricts the number of compatible workflows, frequently involving repetitive list sifting

No risk design

Rest assured, the curator doesn't do anything risky or evil with your data:

  • No vendor lock-in! Delete the curator's DB at no risk to your directory trees or your stored meta-data
  • No proprietary meta-data files. All meta-data are expressed as directory trees or attached via NTFS streams or xattrs. Access them directly via Notepad or CLI commands, respectively
  • No networking capabilities; the curator respects your privacy
    • It uses Unix domain sockets, which are literally incapable of connecting to another machine
    • For networked clients that need to access meta-data, the attributes used by the curator are fully compatible with both Windows SMB and Samba (with some config)
  • No data-loss risk in the repository. The curator will never run the equivalent of rm -rf or overwrite files. In fact, to regenerate a directory tree, you must delete it yourself (otherwise the command fails)

What exactly is in the box

  • 100% native program written in C++20 with the resource efficiency you'd expect
  • Easy to understand & write ini configurations
  • Monitors multiple paths for directories & files to ingest
  • Incrementally dedupes files as they are added
  • Murmur3 hash based binary-level deduplication (a sketch of the idea follows this list)
  • PHash perceptual deduplication for images
  • Integrated FFMPEG thumbnailer for images & videos
  • Groups "related" files and maintains file ordering
  • Regex based renaming capabilities (with named capture groups)
  • Transform files by invoking other programs (un-archiving, re-encoding, etc)
  • Rules based directory tree generation
  • Hard links support to keep file contents synced & reduce duplication
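
To make the deduplication bullets concrete, here is a minimal sketch of binary-level dedup keyed on a 128-bit Murmur3 digest. It assumes the reference MurmurHash3.h/.cpp from the SMHasher repository is on the include path; the whole-file read and in-memory map are simplifications for illustration, not the curator's actual mechanism.

```cpp
// Minimal sketch: detect byte-identical files via their Murmur3 x64 128-bit digest.
#include <array>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

#include "MurmurHash3.h"  // reference implementation from SMHasher (assumption)

using Digest = std::array<std::uint64_t, 2>;

struct DigestHash {
    std::size_t operator()(const Digest &d) const noexcept {
        return d[0] ^ (d[1] * 0x9e3779b97f4a7c15ULL);
    }
};

Digest hash_file(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    Digest out{};
    MurmurHash3_x64_128(bytes.data(), static_cast<int>(bytes.size()),
                        /*seed=*/0, out.data());
    return out;
}

// Returns true if the file's contents are new, false if an identical file was seen before.
bool note_file(std::unordered_map<Digest, std::string, DigestHash> &seen,
               const std::string &path) {
    return seen.emplace(hash_file(path), path).second;
}
```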

See the configuration manual for how it works

fs-curator's People

Contributors

unreadablewxy


fs-curator's Issues

Group level metadata storage

I hate losing data. And while I'm also not a fan of data models that can represent inconsistent state, functional necessities beat philosophical design preferences.

As I try to build the version 0.0.1 features, I'm increasingly confronted by the need to retain certain group-level properties.

  • Properties that are undeniably for the group and not the file.
  • If we are just talking about file level properties then xattrs would prove sufficient.
  • Group level data can't be handled in the same fashion.
  • Scattering it among the constituent files is a no-go, as doing so means:
    • N different ways the data can be inconsistent.
    • Should it ever be updated, there will be N different values to update
    • File systems don't support transactions across file boundaries
  • The necessity derives from the fact that the import process irrevocably loses data, namely the progenitor file's name & attributes. That means stores, as they currently exist, are stateful entities, since the system does not have all the data needed to regenerate them. I would much prefer them to be like SQL views: something that can be regenerated at will from a single source of truth (the mono-collection)
  • The storage dimension of the problem stems from the fact that groups have no physical entity to which data can be anchored.
  • A candidate solution is to create directories for them and set xattrs on the directory (see the sketch after this list).
    • The biggest drawback of this solution is that we absolutely lose the ability to use the by-order index as a "see-all" directory. But then again, its contents have no file extensions, so maybe it's all for the best?
  • Another candidate solution is to just keep meta-data files for groups, where each file contains a list of its member files by ID.
    • The biggest issue with this solution is that if the files change, a gigantic scan needs to happen in which every one of these meta-data files must be read.
    • The gains for this approach are equally unimpressive: an entirely speculative reduction in directory-structure storage, while guaranteeing increased inode usage and forcing a trade-off between readability and storing file identifiers in an inefficient encoding.
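
A minimal sketch of the "directory per group, xattrs on the directory" candidate, using the Linux setxattr/getxattr calls. The user.curator.title attribute name and the groups/<id> layout are hypothetical, not the curator's actual schema.

```cpp
// Sketch: anchor group-level meta-data to a per-group directory via xattrs (Linux).
#include <sys/xattr.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

void set_group_attr(const std::string &group_dir, const std::string &name,
                    const std::string &value) {
    if (setxattr(group_dir.c_str(), name.c_str(), value.data(), value.size(), 0) != 0)
        throw std::runtime_error(std::strerror(errno));
}

std::string get_group_attr(const std::string &group_dir, const std::string &name) {
    char buf[4096];
    ssize_t n = getxattr(group_dir.c_str(), name.c_str(), buf, sizeof buf);
    if (n < 0)
        throw std::runtime_error(std::strerror(errno));
    return std::string(buf, static_cast<std::size_t>(n));
}

// Usage (hypothetical layout): set_group_attr("groups/42", "user.curator.title", "Beach trip");
```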

Inline capture groups

Regexes are supposed to be fluid; positional group assignment sucks. The C++ standard library's regex implementation seems woefully underwhelming, and the Boost one is flagged by engsec.

Let's build a RAII wrapper around PCRE2.
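
A minimal sketch of what such a wrapper could look like, using only documented PCRE2 calls (pcre2_compile, pcre2_match, pcre2_substring_number_from_name). The class and method names are hypothetical, not the curator's actual API.

```cpp
// Sketch: RAII ownership of a compiled PCRE2 pattern plus named-group extraction.
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

#include <optional>
#include <stdexcept>
#include <string>

class Regex {
  public:
    explicit Regex(const std::string &pattern) {
        int err;
        PCRE2_SIZE off;
        code_ = pcre2_compile(reinterpret_cast<PCRE2_SPTR>(pattern.c_str()),
                              PCRE2_ZERO_TERMINATED, 0, &err, &off, nullptr);
        if (!code_)
            throw std::runtime_error("bad pattern at offset " + std::to_string(off));
    }
    ~Regex() { pcre2_code_free(code_); }
    Regex(const Regex &) = delete;
    Regex &operator=(const Regex &) = delete;

    // Returns the text captured by the named group, if the subject matches.
    std::optional<std::string> named(const std::string &subject,
                                     const std::string &group) const {
        pcre2_match_data *md = pcre2_match_data_create_from_pattern(code_, nullptr);
        std::optional<std::string> result;
        const int rc = pcre2_match(code_,
                                   reinterpret_cast<PCRE2_SPTR>(subject.c_str()),
                                   subject.size(), 0, 0, md, nullptr);
        const int idx = pcre2_substring_number_from_name(
            code_, reinterpret_cast<PCRE2_SPTR>(group.c_str()));
        if (rc > 0 && idx > 0 && idx < rc) {
            PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
            if (ov[2 * idx] != PCRE2_UNSET)
                result = subject.substr(ov[2 * idx], ov[2 * idx + 1] - ov[2 * idx]);
        }
        pcre2_match_data_free(md);
        return result;
    }

  private:
    pcre2_code *code_ = nullptr;
};

// Usage: Regex r(R"((?<artist>[^_]+)_(?<index>\d+))");
//        auto artist = r.named("someone_012", "artist");  // "someone"
```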

Windows support

Though I don't see a lot of value in porting to Windows, this might be something someone wants.

SHA1 and configurable hashing range support

Downloading a torrent only for the curator to smack it away is wasteful.

The scenario is to integrate with a Transmission frontend so that we can pre-filter which files to download & then add them to the correct groups automagically.

xxhash support

Trying out NVMe arrays for some of my mods, it turns out Murmur3 was decidedly the bottleneck. XXH3 removes this and also addresses the hash-stability issue between architectures (though that isn't really a realistic problem).

APNG thumbnails

  • The current thumbnailer snaps the first visible frame. This is not ideal for non-images, which frequently start with solid colors.
  • We can't just transcode an hour-long APNG if the source is that long.
  • Perhaps a configurable strategy? Possibly +X seconds from the first contrast change, or when the histogram drastically changes?
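
A hedged sketch of the "snap a frame once the histogram changes drastically" strategy, using OpenCV. Whether VideoCapture can decode a given APNG depends on the FFmpeg backend it was built with, and the 0.6 correlation threshold is an arbitrary example, not a tuned value.

```cpp
// Sketch: pick the first frame whose grayscale histogram differs markedly from frame 0.
#include <string>

#include <opencv2/opencv.hpp>

static cv::Mat grayHist(const cv::Mat &frame) {
    cv::Mat gray, hist;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    int bins = 64;
    float range[] = {0, 256};
    const float *ranges[] = {range};
    int channels[] = {0};
    cv::calcHist(&gray, 1, channels, cv::Mat(), hist, 1, &bins, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);
    return hist;
}

// Returns the first frame whose histogram correlation with the first frame drops
// below `threshold`, or the first frame if nothing ever changes.
cv::Mat pickThumbnailFrame(const std::string &path, double threshold = 0.6) {
    cv::VideoCapture cap(path);
    cv::Mat first, frame;
    if (!cap.read(first)) return first;
    const cv::Mat firstHist = grayHist(first);
    while (cap.read(frame)) {
        if (cv::compareHist(firstHist, grayHist(frame), cv::HISTCMP_CORREL) < threshold)
            return frame.clone();
    }
    return first;
}
```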

Lossy Thumbnails

Saw a group's 512x512 resolution thumbnails take up 1/3 the size of the actual content.

  • Thumbnails don't necessarily just have to be smaller in resolution.
  • JPEG thumbnails with very aggressive compression might be acceptable to some.
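
A tiny sketch of the aggressive-compression idea using OpenCV's imwrite; the quality value of 30 is an arbitrary example of "very aggressive", not a recommended default.

```cpp
// Sketch: write a heavily compressed JPEG thumbnail.
#include <string>
#include <vector>

#include <opencv2/opencv.hpp>

void writeLossyThumbnail(const cv::Mat &thumb, const std::string &path) {
    const std::vector<int> params = {cv::IMWRITE_JPEG_QUALITY, 30};
    cv::imwrite(path, thumb, params);
}
```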

Create a docker image

This has been requested recently, but I'm not well versed in Docker administration. I need a better understanding; I will allocate some hosts in my personal lab and investigate.

Crop to aspect thumbnails

Since FS-Viewer now supports covering thumbnails, resolution has been highlighted as an issue.

  • Not all images & short video clips have the same aspect ratio.
  • Thumbnails are previews, not necessarily scaled-down originals.
  • It might make sense to scale down and crop to a particular aspect ratio (see the sketch after this list).
  • This thumbnailing strategy reduces waste and produces thumbnails that can be easily tessellated, which is usually visually appealing
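
A minimal sketch of the scale-then-center-crop idea with OpenCV; the function name and the choice to crop around the center are illustrative, not the curator's actual thumbnailer.

```cpp
// Sketch: center-crop to a target aspect ratio, then scale to the output size.
#include <opencv2/opencv.hpp>

cv::Mat cropToAspect(const cv::Mat &src, int outWidth, int outHeight) {
    const double targetAspect = double(outWidth) / outHeight;
    const double srcAspect = double(src.cols) / src.rows;

    // Shrink whichever dimension is "too long" for the target aspect ratio.
    cv::Rect roi(0, 0, src.cols, src.rows);
    if (srcAspect > targetAspect) {
        roi.width = int(src.rows * targetAspect);
        roi.x = (src.cols - roi.width) / 2;
    } else {
        roi.height = int(src.cols / targetAspect);
        roi.y = (src.rows - roi.height) / 2;
    }

    cv::Mat out;
    cv::resize(src(roi), out, cv::Size(outWidth, outHeight), 0, 0, cv::INTER_AREA);
    return out;
}
```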

Conflict resolution strategy support

  • Currently, if a conflict is found after transforms, the import is aborted.
  • There exists no instrument to indicate what to do when this happens, so the user is just stuck. We should provide a way to signal what to do with a conflict

Solution?

  • Accept a ".merge" suffix on files indicating what to do
  • Accept a "merge group" strategy, that combines all groups overlapping with the current group into one big group
  • Accept a "use existing" strategy, that accepts the file as a repeat and link it to the newly created group

Sort by length then lexical difference

Let's say we have the files a1, a2, a3, a10, a11. Lexical ordering dictates a1, a10, a11, a2, a3, which is semantically wrong. Sorting by length and then by character codes seems like the safer option.
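
A minimal sketch of that comparator; it orders the example above the way a human expects.

```cpp
// Sketch: order names by length first, then by character codes.
#include <algorithm>
#include <string>
#include <vector>

bool shorterThenLexical(const std::string &a, const std::string &b) {
    if (a.size() != b.size()) return a.size() < b.size();
    return a < b;
}

// Usage:
//   std::vector<std::string> names{"a10", "a2", "a11", "a1", "a3"};
//   std::sort(names.begin(), names.end(), shorterThenLexical);
//   // names is now a1, a2, a3, a10, a11
```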

Experimental rust rewrite

Picked up Rust as a language for fun. Decided to try it out with this project as a testbed.

If successful, it should:

  • shorten the road to supporting OSX & BSD
  • be able to add more intuitive named capture groups
  • allow additional parallelism

Build a FS-Viewer extension

  • Find phash-similar files in stage mode
  • Import into collection function
  • Reorder files in group
  • Infinite scroll through all the files in the mono-collection ordered by group + index

Transform to multiple groups support

Currently, all files generated from a transform are made into one group.

But it is sometimes possible that a single archive file may contain many groups of files that either look similar to each other or share a common name prefix.

The specific example given was unpacking CG bundles of art rips downloaded from torrent sites.

I never understood why people try to compress already-compressed images. But I suppose we could always support finer-granularity grouping and ship a separate utility that does phash grouping for small sets of files and emits meta-data in an understandable format.

Tags support

Part space efficiency, part ease of use problem.

From a data perspective, there's something appealing about the binary nature of tags: something either has the attribute or it does not.

  • FS-Viewer's tagging option makes a lot of sense and we should support something like it
  • The problem with namespaced tags is that the curator can't be made to understand namespaces.
    • Scope is not a concept in the mono-collection, and can't be made into one without coupling the mono-collection to views
  • One option is to just ignore the nuances of the viewer and create a single mono-collection-scoped tag namespace

Remove distinction between stores and hoppers

The ideal vision is a central store that projects views into directories, a lot like how a SQL database works, but with different constraints.

A design needs to be written for how to deal with files showing up at ingestion points, since there is no efficient way to know whether such a file is a new, conflicting file or a projection of a known-good one. At least, there is no portable way to do this.

Regex replace support in attributes

I've lived long enough to know why programmers loathed spaces in file names, and I can definitively say those days are in the past. Nowadays I find the use of hyphens and underscores unnecessary and aesthetically displeasing. Let's build in some mechanism to replace characters in attributes.
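
A tiny sketch of the transformation, here using std::regex_replace purely for illustration; per the inline capture groups issue above, the project leans toward PCRE2 rather than the standard library's regex.

```cpp
// Sketch: collapse runs of underscores/hyphens in an attribute value into single spaces.
#include <regex>
#include <string>

std::string despace(const std::string &attribute) {
    static const std::regex separators("[-_]+");
    return std::regex_replace(attribute, separators, " ");
}

// despace("some_artist-name") == "some artist name"
```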

Hash aliasing support

There have been cases where crawlers drag in files that look the same and definitely are the same, but hash slightly differently due to re-encoding or other transformative processes.

  • We should be able to easily build up a list of known hash aliases by creating hard links into, say, collection/by-id-alias, where files are named based on alias_size.alias_hash.extension (see the sketch after this list)
  • Would require a new conflict resolution option file = alias HASH|GROUP+INDEX
  • Would require a new test at import time against known aliases
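
A hedged sketch of registering an alias by hard-linking the canonical file into such a directory; the collection/by-id-alias layout and the <size>.<hash>.<extension> naming follow the proposal above, and everything else is illustrative.

```cpp
// Sketch: record a known hash alias as a hard link to the canonical file.
#include <cstdint>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

void addHashAlias(const fs::path &canonicalFile, const fs::path &collection,
                  std::uint64_t aliasSize, const std::string &aliasHashHex) {
    const fs::path aliasDir = collection / "by-id-alias";
    fs::create_directories(aliasDir);
    const fs::path aliasName = std::to_string(aliasSize) + "." + aliasHashHex +
                               canonicalFile.extension().string();
    fs::create_hard_link(canonicalFile, aliasDir / aliasName);
}
```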

Subset patrolling support

Apparently not everyone knows what meta-data stripping is, so some people end up with thumbnails embedded in JPEGs messing up binary dedupe.

Some of these problems were found in considerably large collections, so batch fixing might be a problem for the inevitable patrolling read that needs to come afterwards. So we need to support the curator patrolling a subset of files, perhaps even acting as the orchestrator of these batch commands so that it can safely & automatically do the necessary index updates afterwards.

Runtime declared properties support

WIP files currently only support a handful of predefined properties to be applied to the group. Let's expand that to any property we like, including properties that were never declared in the config files.

Property existence static validations

A problem has been brought to my attention. Once a config becomes sufficiently complicated, some static validation makes sense to catch potential runtime errors.

  • It is not hard to scan path formats to determine the property requirements of a store.
  • Produced properties are already parsed from hoppers
  • We also know the names of all properties during reoffering
  • All that is needed is to run a quick comparison between source & store to see if the source eclipses the store.

User configurable per-file attributes

It could be that I relied too much on group attributes to notice it before, but I've realized that the program doesn't actually support user-configured file-level attributes as a first-class concept.

  • This makes sense, because hoppers correspond to groups, which don't necessarily have a 1:1 relationship with files
  • The service doesn't really do any file-level inspection that doesn't result in an attribute it will write anyway
  • Maybe we ought to expose more FFMPEG & OpenCV functionality
  • Maybe we ought to let users set attributes on files via their own phase in the import process, so this doesn't have to be hacked into existence via transforms
  • Quite possibly it is a good idea to format paths purely based on file attributes, so we can generate more nuanced file names instead of some combination of group-level properties & file index

OpenCV integration

Currently, import is just a matter of creating a plan and seeing whether the plan would be obstructed in any way by existing files, which usually it is not. Obstructions rename the obstructed incoming files and abort the import.

I'm thinking we should promote this obstruction detection into its own category of actions. CV can happen as a similar "thing", where we redefine obstruction as either a binary hash conflict or a perceptual hash clash. The only difference in conflict resolution strategies is that with perceptual clashes, it's valid to ignore the clash.

The only issue with this approach is that it doesn't have a retroactive element. People will already have collections built, perhaps with similar files that should be grouped; a story for them may be required.
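
A hedged sketch of detecting a perceptual clash with the pHash implementation from OpenCV's contrib img_hash module; whether this matches the curator's own PHash code is an assumption, and the distance threshold of 8 is an arbitrary example.

```cpp
// Sketch: flag two images as a perceptual clash when their pHash distance is small.
#include <opencv2/img_hash.hpp>
#include <opencv2/opencv.hpp>

bool perceptualClash(const cv::Mat &a, const cv::Mat &b, double threshold = 8) {
    auto hasher = cv::img_hash::PHash::create();
    cv::Mat hashA, hashB;
    hasher->compute(a, hashA);
    hasher->compute(b, hashB);
    return hasher->compare(hashA, hashB) <= threshold;
}
```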

Add example-config.txt

Dear @unreadablewxy,
I got inspired to try your app, read a local wiki twice, but failed to proceed further. Consider adding example-config.txt (like DSNCrypt does) adapted for some common scenario. For example, a user has C:\Downloads with a bunch of executable, text and audio files. Show how to move them to sub-folders (Bin, Docs, Media) accordingly.

Property indexing

Let's expand & generalize the querying functionality of the system.

  • The file system already serves as a good mechanism for discrete querying, i.e. questions like "is there a file satisfying predicate P?"
  • The question then becomes: is there a good way to do intersections of predicates?
    • Sort-merge join. Global ordering allows us to do this without needing to sort first, cutting the sort step out of the time complexity (see the sketch after this list)
    • Build SQLite indices. Not as appealing, but joins are what they're good at, so maybe we've finally found a valid purpose for them in this project?
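
A minimal sketch of that merge-based intersection: because both predicate results come back in the same global order (e.g. sorted file IDs), intersecting them is a single linear pass with no extra sort.

```cpp
// Sketch: intersect two already-sorted ID lists in one merge pass.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

std::vector<std::uint64_t> intersect(const std::vector<std::uint64_t> &a,
                                     const std::vector<std::uint64_t> &b) {
    std::vector<std::uint64_t> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```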
