living-with-machines / deduplify
A Python tool to search for and remove duplicated files in messy datasets
License: MIT License
The prefix /Users/${USER} is prepended to the filepaths that are saved in the JSON file. This means that if we run the hash process with the --restart flag under a different user account, the code does not actually skip the files that have already been hashed.
/Users/${USER} needs to be removed from the filepaths when they are saved. The prefix can be stored in a variable in Python like so:
import os
exp_usr = os.path.expanduser("~")
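The prefix removal described above could be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names strip_home_prefix and restore_home_prefix and the optional home parameter are hypothetical, introduced only so the behaviour can be shown without depending on the current user account.

```python
import os


def strip_home_prefix(filepath, home=None):
    """Return filepath relative to the home directory, if it lives under it.

    `home` defaults to the current user's home directory; the parameter is
    a hypothetical addition that makes the function easy to test.
    """
    home = os.path.expanduser("~") if home is None else home
    home = home.rstrip(os.sep)
    # Only strip when the path is really inside the home directory, so that
    # /Users/alicia is not mistaken for a path under /Users/alice.
    if filepath.startswith(home + os.sep):
        return filepath[len(home) + len(os.sep):]
    return filepath


def restore_home_prefix(relpath, home=None):
    """Re-attach the home directory of whichever account is running now."""
    home = os.path.expanduser("~") if home is None else home
    # os.path.join leaves absolute paths untouched, so paths that were
    # never stripped are returned unchanged.
    return os.path.join(home, relpath)
```

Saving the stripped path and re-attaching the prefix at load time would let a --restart run under a different account recognise files that were already hashed, assuming the data sits at the same location relative to each user's home directory.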
The following code block needs updating for cases where len(name_freq) > 1 and len(set(file_list)) == 1, i.e., there are multiple filepaths that are different but, by coincidence, have the same length.
deduplify/deduplify/compare_files.py
Lines 65 to 71 in cca36e5
Currently, the GitHub Action to publish the package to Test PyPI runs on every push to an open Pull Request against the main branch. This is problematic since, after the first push, a distribution exists on the Test PyPI server and subsequent pushes fail due to a naming clash.
The GitHub Action workflow needs to be tweaked so that multiple pushes to a PR do not break pushing distributions to Test PyPI. Maybe this could be manually triggered by a comment on the PR?
Could this block:
deduplify/deduplify/hash_files.py
Lines 94 to 100 in 112fa62
be better written as:
for k, v in counted_hashes.items():
    db.update({"duplicate": v > 1}, where("hash") == k)
Is that valid Python?
Or maybe:
for k, v in counted_hashes.items():
    cond = v > 1
    db.update({"duplicate": cond}, where("hash") == k)
This option would definitely work.
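On the question of validity: a comparison such as v > 1 is an ordinary expression that evaluates to a bool, so it can be used directly as a dictionary value. The sketch below checks this with the standard library only. The records list is made-up data standing in for the database table, and flag_duplicates is a hypothetical helper that mimics an update filtered on the hash field; the original block in hash_files.py is not reproduced here.

```python
from collections import Counter

# Made-up records standing in for rows in the database.
records = [
    {"filepath": "a.txt", "hash": "abc"},
    {"filepath": "b.txt", "hash": "abc"},
    {"filepath": "c.txt", "hash": "def"},
]

# Count how many files share each hash.
counted_hashes = Counter(record["hash"] for record in records)


def flag_duplicates(records, counted_hashes):
    """Set a boolean "duplicate" field on every record, keyed by hash."""
    for k, v in counted_hashes.items():
        # The comparison is evaluated first, so the value is True or False.
        fields = {"duplicate": v > 1}
        for record in records:
            if record["hash"] == k:
                record.update(fields)
    return records


flag_duplicates(records, counted_hashes)
```

Both spellings in the issue therefore behave the same; the intermediate cond variable only affects readability.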
Update the Contributing Guide to explain how to make a version bump with incremental (https://pypi.org/project/incremental/) and then make a release.
Build across many platforms and Python versions using cibuildwheel (https://cibuildwheel.readthedocs.io/en/stable/setup/#github-actions).