fs-curator's Introduction

FS-Curator

The curator is a meta-data repository that organizes your files. It is designed to utilize modern filesystems to their full potential while keeping the operator in total control.

The service

fs-curator workflow

  • Makes no assumptions about, nor places any demands on, your workflow(s)
  • Fault-tolerant design that leverages OS guarantees for graceful degradation in the case of user error
  • Configure for what you want, not what to do: write 100 lines of config and it reorganizes 100K files

Other tools

other tools workflow

  • Demand that your workflow adapt to their assumptions, and behave unexpectedly if those assumptions are violated
  • Usually a single point of failure that degrades terribly if corrupted
  • A narrow purpose restricts the number of compatible workflows, frequently involving repetitive list sifting

No risk design

Rest assured, the curator doesn't do anything risky or evil with your data:

  • No vendor lock-in! Delete the curator's DB at no risk to your directory trees or your stored meta-data
  • No proprietary meta-data files. All meta-data are expressed as directory trees or attached via NTFS streams or xattrs. Access them directly via Notepad or CLI commands, respectively
  • No networking capabilities; the curator respects your privacy
    • It uses Unix domain sockets, which are literally incapable of connecting to another machine
    • For networked clients that need to access meta-data, the attributes used by the curator are fully compatible with both Windows SMB and Samba (with some config)
  • No data-loss risk in the repository. The curator will never run the equivalent of rm -rf or overwrite files. In fact, to regenerate a directory tree, you must delete it yourself (otherwise the command fails)

What exactly is in the box

  • 100% native program written in C++20 with the resource efficiency you'd expect
  • Easy to understand & write ini configurations
  • Monitors multiple paths for directories & files to ingest
  • Incrementally dedupes files as they are added
  • Murmur3 hash based binary-level deduplication (a sketch of the idea follows this list)
  • PHash perceptual deduplication for images
  • Integrated FFMPEG thumbnailer for images & videos
  • Groups "related" files and maintains file ordering
  • Regex based renaming capabilities (with named capture groups)
  • Transform files by invoking other programs (un-archiving, re-encoding, etc)
  • Rules based directory tree generation
  • Hard links support to keep file contents synced & reduce duplication
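
To make the deduplication bullets concrete, here is a minimal sketch of binary-level dedup keyed on a 128-bit Murmur3 digest. It assumes the reference MurmurHash3.h/.cpp from the SMHasher repository is on the include path; the whole-file read and in-memory map are simplifications for illustration, not the curator's actual mechanism.

```cpp
// Minimal sketch: detect byte-identical files via their Murmur3 x64 128-bit digest.
#include <array>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

#include "MurmurHash3.h"  // reference implementation from SMHasher (assumption)

using Digest = std::array<std::uint64_t, 2>;

struct DigestHash {
    std::size_t operator()(const Digest &d) const noexcept {
        return d[0] ^ (d[1] * 0x9e3779b97f4a7c15ULL);
    }
};

Digest hash_file(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    Digest out{};
    MurmurHash3_x64_128(bytes.data(), static_cast<int>(bytes.size()),
                        /*seed=*/0, out.data());
    return out;
}

// Returns true if the file's contents are new, false if an identical file was seen before.
bool note_file(std::unordered_map<Digest, std::string, DigestHash> &seen,
               const std::string &path) {
    return seen.emplace(hash_file(path), path).second;
}
```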

See the configuration manual for how it works

fs-curator's People

Contributors

unreadablewxy


fs-curator's Issues

Group level metadata storage

I hate losing data. And while I'm also not a fan of data models that can represent inconsistent state, functional necessities beat philosophical design preferences.

As I try to build the version 0.0.1 features, I'm increasingly confronted by the need to retain certain group-level properties.

  • Properties that are undeniably for the group and not the file.
  • If we are just talking about file level properties then xattrs would prove sufficient.
  • Group level data can't be handled in the same fashion.
  • Scattering it among the constituent files is a no-go, as doing so means:
    • N different ways the data can be inconsistent.
    • Should it ever be updated, there will be N different values to update
    • File systems don't support transactions across file boundaries
  • The necessity derives from the fact that the import process irrevocably loses data, namely the progenitor file's name & attributes. That means stores, as they currently exist, are stateful entities, since the system does not have all the data needed to regenerate them. I would much prefer them to be like SQL views: something that can be regenerated at will from a single source of truth (the mono-collection)
  • The storage dimension of the problem stems from the fact that groups have no physical entity to which data can be anchored.
  • A candidate solution is to create directories for them and set xattrs on the directory (see the sketch after this list).
    • The biggest drawback of this solution is that we absolutely lose the ability to use the by-order index as a "see-all" directory. But then again, its contents have no file extensions, so maybe it's all for the best?
  • Another candidate solution is to just keep meta-data files for groups, where each file contains a list of its member files by ID.
    • The biggest issue with this solution is that if the files change, a gigantic scan needs to happen in which every one of these meta-data files must be read.
    • The gains for this approach are equally unimpressive: an entirely speculative reduction in directory-structure storage, while guaranteeing increased inode usage and forcing a trade-off between readability and storing file identifiers in an inefficient encoding.
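
A minimal sketch of the "directory per group, xattrs on the directory" candidate, using the Linux setxattr/getxattr calls. The user.curator.title attribute name and the groups/<id> layout are hypothetical, not the curator's actual schema.

```cpp
// Sketch: anchor group-level meta-data to a per-group directory via xattrs (Linux).
#include <sys/xattr.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

void set_group_attr(const std::string &group_dir, const std::string &name,
                    const std::string &value) {
    if (setxattr(group_dir.c_str(), name.c_str(), value.data(), value.size(), 0) != 0)
        throw std::runtime_error(std::strerror(errno));
}

std::string get_group_attr(const std::string &group_dir, const std::string &name) {
    char buf[4096];
    ssize_t n = getxattr(group_dir.c_str(), name.c_str(), buf, sizeof buf);
    if (n < 0)
        throw std::runtime_error(std::strerror(errno));
    return std::string(buf, static_cast<std::size_t>(n));
}

// Usage (hypothetical layout): set_group_attr("groups/42", "user.curator.title", "Beach trip");
```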

Inline capture groups

Regexes are supposed to be fluid; positional group assignment sucks. The C++ standard library's regex implementation seems woefully underwhelming, and the Boost one is flagged by engsec.

Let's build a RAII wrapper around PCRE2.
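
A minimal sketch of what such a wrapper could look like, using only documented PCRE2 calls (pcre2_compile, pcre2_match, pcre2_substring_number_from_name). The class and method names are hypothetical, not the curator's actual API.

```cpp
// Sketch: RAII ownership of a compiled PCRE2 pattern plus named-group extraction.
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

#include <optional>
#include <stdexcept>
#include <string>

class Regex {
  public:
    explicit Regex(const std::string &pattern) {
        int err;
        PCRE2_SIZE off;
        code_ = pcre2_compile(reinterpret_cast<PCRE2_SPTR>(pattern.c_str()),
                              PCRE2_ZERO_TERMINATED, 0, &err, &off, nullptr);
        if (!code_)
            throw std::runtime_error("bad pattern at offset " + std::to_string(off));
    }
    ~Regex() { pcre2_code_free(code_); }
    Regex(const Regex &) = delete;
    Regex &operator=(const Regex &) = delete;

    // Returns the text captured by the named group, if the subject matches.
    std::optional<std::string> named(const std::string &subject,
                                     const std::string &group) const {
        pcre2_match_data *md = pcre2_match_data_create_from_pattern(code_, nullptr);
        std::optional<std::string> result;
        const int rc = pcre2_match(code_,
                                   reinterpret_cast<PCRE2_SPTR>(subject.c_str()),
                                   subject.size(), 0, 0, md, nullptr);
        const int idx = pcre2_substring_number_from_name(
            code_, reinterpret_cast<PCRE2_SPTR>(group.c_str()));
        if (rc > 0 && idx > 0 && idx < rc) {
            PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
            if (ov[2 * idx] != PCRE2_UNSET)
                result = subject.substr(ov[2 * idx], ov[2 * idx + 1] - ov[2 * idx]);
        }
        pcre2_match_data_free(md);
        return result;
    }

  private:
    pcre2_code *code_ = nullptr;
};

// Usage: Regex r(R"((?<artist>[^_]+)_(?<index>\d+))");
//        auto artist = r.named("someone_012", "artist");  // "someone"
```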

Windows support

Though I don't see a lot of value in porting to Windows, this might be something someone wants.

SHA1 and configurable hashing range support

Downloading a torrent only for the curator to smack it away is wasteful.

The scenario is to integrate with a Transmission frontend so that we can pre-filter which files to download & then add them to the correct groups automagically.

xxhash support

Trying out NVMe arrays for some of my mods, it turns out Murmur3 was decidedly the bottleneck. XXH3 removes this and also addresses the hash-stability issue between architectures (though that isn't really a realistic problem).

APNG thumbnails

  • The current thumbnailer snaps the first visible frame. This is not ideal for non-images, which frequently start with solid colors.
  • We can't just transcode an hour-long APNG if the source is that long.
  • Perhaps a configurable strategy? Possibly +X seconds from the first contrast change, or when the histogram drastically changes?
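
A hedged sketch of the "snap a frame once the histogram changes drastically" strategy, using OpenCV. Whether VideoCapture can decode a given APNG depends on the FFmpeg backend it was built with, and the 0.6 correlation threshold is an arbitrary example, not a tuned value.

```cpp
// Sketch: pick the first frame whose grayscale histogram differs markedly from frame 0.
#include <string>

#include <opencv2/opencv.hpp>

static cv::Mat grayHist(const cv::Mat &frame) {
    cv::Mat gray, hist;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    int bins = 64;
    float range[] = {0, 256};
    const float *ranges[] = {range};
    int channels[] = {0};
    cv::calcHist(&gray, 1, channels, cv::Mat(), hist, 1, &bins, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);
    return hist;
}

// Returns the first frame whose histogram correlation with the first frame drops
// below `threshold`, or the first frame if nothing ever changes.
cv::Mat pickThumbnailFrame(const std::string &path, double threshold = 0.6) {
    cv::VideoCapture cap(path);
    cv::Mat first, frame;
    if (!cap.read(first)) return first;
    const cv::Mat firstHist = grayHist(first);
    while (cap.read(frame)) {
        if (cv::compareHist(firstHist, grayHist(frame), cv::HISTCMP_CORREL) < threshold)
            return frame.clone();
    }
    return first;
}
```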

Lossy Thumbnails

Saw a group's 512x512 resolution thumbnails take up 1/3 the size of the actual content.

  • Thumbnails don't necessarily just have to be smaller in resolution.
  • JPEG thumbnails with very aggressive compression might be acceptable to some.
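
A tiny sketch of the aggressive-compression idea using OpenCV's imwrite; the quality value of 30 is an arbitrary example of "very aggressive", not a recommended default.

```cpp
// Sketch: write a heavily compressed JPEG thumbnail.
#include <string>
#include <vector>

#include <opencv2/opencv.hpp>

void writeLossyThumbnail(const cv::Mat &thumb, const std::string &path) {
    const std::vector<int> params = {cv::IMWRITE_JPEG_QUALITY, 30};
    cv::imwrite(path, thumb, params);
}
```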

Create a docker image

This has been requested recently, but I'm not well versed in Docker administration. I need a better understanding; I will allocate some hosts in my personal lab and investigate.

Crop to aspect thumbnails

Since FS-Viewer now supports covering thumbnails, resolution has been highlighted as an issue.

  • Not all images & short video clips have the same aspect ratio.
  • Thumbnails are previews, not necessarily scaled-down originals.
  • It might make sense to scale down and crop to a particular aspect ratio (see the sketch after this list).
  • This thumbnailing strategy reduces waste and produces thumbnails that can be easily tessellated, which is usually visually appealing
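
A minimal sketch of the scale-then-center-crop idea with OpenCV; the function name and the choice to crop around the center are illustrative, not the curator's actual thumbnailer.

```cpp
// Sketch: center-crop to a target aspect ratio, then scale to the output size.
#include <opencv2/opencv.hpp>

cv::Mat cropToAspect(const cv::Mat &src, int outWidth, int outHeight) {
    const double targetAspect = double(outWidth) / outHeight;
    const double srcAspect = double(src.cols) / src.rows;

    // Shrink whichever dimension is "too long" for the target aspect ratio.
    cv::Rect roi(0, 0, src.cols, src.rows);
    if (srcAspect > targetAspect) {
        roi.width = int(src.rows * targetAspect);
        roi.x = (src.cols - roi.width) / 2;
    } else {
        roi.height = int(src.cols / targetAspect);
        roi.y = (src.rows - roi.height) / 2;
    }

    cv::Mat out;
    cv::resize(src(roi), out, cv::Size(outWidth, outHeight), 0, 0, cv::INTER_AREA);
    return out;
}
```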

Conflict resolution strategy support

  • Currently, if a conflict is found after transforms, the import is aborted.
  • There exists no instrument to indicate what to do when this happens, so the user is just stuck. We should provide a way to signal what to do with a conflict

Solution?

  • Accept a ".merge" suffix on files indicating what to do
  • Accept a "merge group" strategy, that combines all groups overlapping with the current group into one big group
  • Accept a "use existing" strategy, that accepts the file as a repeat and link it to the newly created group

Sort by length then lexical difference

Let's say we have the files a1, a2, a3, a10, a11. Lexical ordering dictates a1, a10, a11, a2, a3, which is semantically wrong. Sorting by length and then by character codes seems like the safer option.
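
A minimal sketch of that comparator; it orders the example above the way a human expects.

```cpp
// Sketch: order names by length first, then by character codes.
#include <algorithm>
#include <string>
#include <vector>

bool shorterThenLexical(const std::string &a, const std::string &b) {
    if (a.size() != b.size()) return a.size() < b.size();
    return a < b;
}

// Usage:
//   std::vector<std::string> names{"a10", "a2", "a11", "a1", "a3"};
//   std::sort(names.begin(), names.end(), shorterThenLexical);
//   // names is now a1, a2, a3, a10, a11
```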

Experimental rust rewrite

Picked up Rust as a language for fun. Decided to try it out with this project as a testbed.

If successful, it should:

  • shorten the road to supporting OSX & BSD
  • be able to add more intuitive named capture groups
  • allow additional parallelism

Build a FS-Viewer extension

  • Find phash-similar files in stage mode
  • Import into collection function
  • Reorder files in group
  • Infinite scroll through all the files in the mono-collection ordered by group + index

Transform to multiple groups support

Currently, all files generated from a transform are made into one group.

But it is sometimes possible that a single archive file may contain many groups of files that either look similar to each other or share a common name prefix.

The specific example given was unpacking CG bundles of art rips downloaded from torrent sites.

I never understood why people try to compress already-compressed images. But I suppose we could always support finer-granularity grouping and ship a separate utility that does phash grouping for small sets of files and emits meta-data in an understandable format.

Tags support

Part space efficiency, part ease of use problem.

From a data perspective, there's something appealing about the binary nature of tags: something either has the attribute or it does not.

  • FS-Viewer's tagging option makes a lot of sense and we should support something like it
  • The problem with namespaced tags is that the curator can't be made to understand namespaces.
    • Scope is not a concept in the mono-collection, and can't be made into one without coupling the mono-collection to views
  • One option is to just ignore the nuances of the viewer and create a single mono-collection-scoped tag namespace

Remove distinction between stores and hoppers

The ideal vision is a central store that projects views into directories, a lot like how a SQL database works, but with different constraints.

A design needs to be written for how to deal with files showing up at ingestion points, since there is no efficient way to know whether such a file is a new, conflicting file or a projection of a known-good one. At least, there is no portable way to do this.

Regex replace support in attributes

I've lived long enough to know why programmers loathed spaces in file names, and I can definitively say those days are in the past. Nowadays I find the use of hyphens and underscores unnecessary and aesthetically displeasing. Let's build in some mechanism to replace characters in attributes.
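
A tiny sketch of the transformation, here using std::regex_replace purely for illustration; per the inline capture groups issue above, the project leans toward PCRE2 rather than the standard library's regex.

```cpp
// Sketch: collapse runs of underscores/hyphens in an attribute value into single spaces.
#include <regex>
#include <string>

std::string despace(const std::string &attribute) {
    static const std::regex separators("[-_]+");
    return std::regex_replace(attribute, separators, " ");
}

// despace("some_artist-name") == "some artist name"
```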

Hash aliasing support

There have been cases where crawlers drag in files that look the same and definitely are the same, but hash slightly differently due to re-encoding or other transformative processes.

  • We should be able to easily build up a list of known hash aliases by creating hard links into, say, collection/by-id-alias, where files are named based on alias_size.alias_hash.extension (see the sketch after this list)
  • Would require a new conflict resolution option file = alias HASH|GROUP+INDEX
  • Would require a new test at import time against known aliases
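
A hedged sketch of registering an alias by hard-linking the canonical file into such a directory; the collection/by-id-alias layout and the <size>.<hash>.<extension> naming follow the proposal above, and everything else is illustrative.

```cpp
// Sketch: record a known hash alias as a hard link to the canonical file.
#include <cstdint>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

void addHashAlias(const fs::path &canonicalFile, const fs::path &collection,
                  std::uint64_t aliasSize, const std::string &aliasHashHex) {
    const fs::path aliasDir = collection / "by-id-alias";
    fs::create_directories(aliasDir);
    const fs::path aliasName = std::to_string(aliasSize) + "." + aliasHashHex +
                               canonicalFile.extension().string();
    fs::create_hard_link(canonicalFile, aliasDir / aliasName);
}
```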

Subset patrolling support

Apparently not everyone knows what meta-data stripping is, so some people end up with thumbnails embedded in JPEGs messing up binary dedupe.

Some of these problems were found in considerably large collections, so batch fixing might be a problem for the inevitable patrolling read that needs to come afterwards. So we need to support the curator patrolling a subset of files, perhaps even acting as the orchestrator of these batch commands so that it can safely & automatically do the necessary index updates afterwards.

Runtime declared properties support

WIP files currently only support a handful of predefined properties to be applied to the group. Let's expand that to any property we like, including properties that were never declared in the config files.

Property existence static validations

A problem has been brought to my attention. Once a config becomes sufficiently complicated, some static validation makes sense to catch potential runtime errors.

  • It is not hard to scan path formats to determine the property requirements of a store.
  • Produced properties are already parsed from hoppers
  • We also know the names of all properties during reoffering
  • All that is needed is to run a quick comparison between source & store to see if the source eclipses the store.

User configurable per-file attributes

It could be that I relied too much on group attributes to notice it before, but I've realized that the program doesn't actually support user-configured file-level attributes as a first-class concept.

  • This makes sense, because hoppers correspond to groups, which don't necessarily have a 1:1 relationship with files
  • The service doesn't really do any file-level inspection that doesn't result in an attribute it will write anyway
  • Maybe we ought to expose more FFMPEG & OpenCV functionality
  • Maybe we ought to let users set attributes on files via their own phase in the import process, so this doesn't have to be hacked into existence via transforms
  • Quite possibly it is a good idea to format paths purely based on file attributes, so we can generate more nuanced file names instead of some combination of group-level properties & file index

OpenCV integration

Currently, import is just a matter of creating a plan and seeing whether the plan would be obstructed in any way by existing files, which usually it is not. Obstructions rename the obstructed incoming files and abort the import.

I'm thinking we should promote this obstruction detection into its own category of actions. CV can happen as a similar "thing", where we redefine obstruction as either a binary hash conflict or a perceptual hash clash. The only difference in conflict resolution strategies is that with perceptual clashes, it's valid to ignore the clash.

The only issue with this approach is that it doesn't have a retroactive element. People will already have collections built, perhaps with similar files that should be grouped; a story for them may be required.
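
A hedged sketch of detecting a perceptual clash with the pHash implementation from OpenCV's contrib img_hash module; whether this matches the curator's own PHash code is an assumption, and the distance threshold of 8 is an arbitrary example.

```cpp
// Sketch: flag two images as a perceptual clash when their pHash distance is small.
#include <opencv2/img_hash.hpp>
#include <opencv2/opencv.hpp>

bool perceptualClash(const cv::Mat &a, const cv::Mat &b, double threshold = 8) {
    auto hasher = cv::img_hash::PHash::create();
    cv::Mat hashA, hashB;
    hasher->compute(a, hashA);
    hasher->compute(b, hashB);
    return hasher->compare(hashA, hashB) <= threshold;
}
```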

Add example-config.txt

Dear @unreadablewxy,
I got inspired to try your app, read a local wiki twice, but failed to proceed further. Consider adding example-config.txt (like DSNCrypt does) adapted for some common scenario. For example, a user has C:\Downloads with a bunch of executable, text and audio files. Show how to move them to sub-folders (Bin, Docs, Media) accordingly.

Property indexing

Let's expand & generalize the querying functionality of the system.

  • The file system already serves as a good mechanism for discrete querying, i.e. questions like "is there a file satisfying predicate P?"
  • The question then becomes: is there a good way to do intersections of predicates?
    • Sort-merge join. Global ordering allows us to do this without needing to sort first, cutting the sort step out of the time complexity (see the sketch after this list)
    • Build SQLite indices. Not as appealing, but joins are what they're good at, so maybe we've finally found a valid purpose for them in this project?
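
A minimal sketch of that merge-based intersection: because both predicate results come back in the same global order (e.g. sorted file IDs), intersecting them is a single linear pass with no extra sort.

```cpp
// Sketch: intersect two already-sorted ID lists in one merge pass.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

std::vector<std::uint64_t> intersect(const std::vector<std::uint64_t> &a,
                                     const std::vector<std::uint64_t> &b) {
    std::vector<std::uint64_t> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```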
