imgdupes's People

Contributors

hilkoc, jesjimher, tomhoover, top-on

imgdupes's Issues

Detect duplicated images rotated with jpegtran

When a JPEG file has been rotated with jpegtran or any other lossless JPEG rotation utility, imgdupes can't find its duplicates, because this kind of rotation alters the original image data. "Standard" rotation (switching the EXIF rotation tag) is fully detected.

One way to detect this kind of transformation would be to generate and store in .signatures up to 4 hashes (one per possible rotation) instead of just one. This would slow things down quite a bit, although perhaps not that much, since the image data would already be in memory and imgdupes is usually I/O bound. Some multiprocessing would help.

One thing to note is that jpegtran can also flip images losslessly, so in theory imgdupes should store all 4 possible rotations, 2 possible flips (horizontal and vertical), and all possible combinations of rotation and flip. Since this is obviously unfeasible, I think flipping may be ignored for the moment. After all, it is not as common an operation as rotation.
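
The multi-hash idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the decoded image is modelled as a plain grid of 8-bit values rather than whatever imgdupes actually holds in memory, and all function names are hypothetical. It also shows that rotations and flips combine into only eight distinct orientations, because a vertical flip is the same as a horizontal flip plus a 180-degree rotation. Whether a jpegtran-rotated file really decodes to exactly the rotated pixels depends on the image dimensions and the jpegtran options used, so this is not a guaranteed fix.

```python
import hashlib


def rotate90(grid):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror a 2D grid left to right."""
    return [list(row[::-1]) for row in grid]


def grid_hash(grid):
    """Hash the grid contents together with its dimensions."""
    digest = hashlib.md5()
    digest.update("{}x{}:".format(len(grid), len(grid[0])).encode())
    for row in grid:
        digest.update(bytes(row))  # assumes values in 0..255
    return digest.hexdigest()


def orientation_hashes(grid, include_flips=False):
    """Return one hash per orientation: 4 rotations, or 8 with flips."""
    hashes = []
    current = [list(row) for row in grid]
    for _ in range(4):
        hashes.append(grid_hash(current))
        if include_flips:
            hashes.append(grid_hash(flip_horizontal(current)))
        current = rotate90(current)
    return hashes


def canonical_hash(grid, include_flips=False):
    """Pick the smallest hash so every orientation maps to one signature."""
    return min(orientation_hashes(grid, include_flips))
```

Storing only the smallest of the orientation hashes is one possible variation: it keeps a single signature per file while still matching rotated copies.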

installation on ubuntu 20.04 failed

root@myhostname:/tmp/imgdupes# apt-get install python3-dev libjpeg-dev gir1.2-gexiv2-0.10 jpeginfo
...
root@myhostname:/tmp/imgdupes# python3 setup.py build
...
root@myhostname:/tmp/imgdupes# python3 setup.py install
ModuleNotFoundError: No module named 'cffi'

releases on github

Hi,

Thanks for notifying me of your updates and posting them on PyPI.

For the Arch Linux AUR package, you should publish the releases on GitHub. The AUR package downloads its sources from GitHub, not PyPI. I've already created a jpegdupes-git package, which uses your latest commit as its source, but the non-git version of the AUR package would require a release on GitHub.

For simplicity's sake, it would also be nice to rename the GitHub repo to jpegdupes. Otherwise people will wonder what line 4 in my PKGBUILD is.

Thanks!

Pieter

‘struct.error: integer out of range for 'H' format code’

Hello,

I get an error after a certain number of files analysed:

Traceback (most recent call last):
  File "/home/gilles/bin/imgdupes.py", line 259, in <module>
    'hash':hashcalc(ruta,pool,args.method),
  File "/home/gilles/bin/imgdupes.py", line 63, in hashcalc
    results=pool.map(phash,lista)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'H' format code

If I relaunch the command, it continues from where it stopped until the next error.
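
For context, 'H' in Python's struct module is the unsigned 16-bit format, so this error means that some value outside the range 0..65535 was packed somewhere in the hashing path. The traceback does not show where. A tiny illustration of the limit, with a hypothetical helper name:

```python
import struct


def fits_unsigned_short(value):
    """Return True if value can be packed with the 'H' (unsigned 16-bit) code."""
    try:
        struct.pack("H", value)
        return True
    except struct.error:
        return False


print(fits_unsigned_short(65535))  # largest value 'H' can hold
print(fits_unsigned_short(65536))  # one past the limit
```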

$ python --version 
Python 2.7.6

and

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:    14.04
Codename:   trusty

Thank you!

delete option crashes it

I just did a
$ jpegdupes -d /home/turgut/Pictures/

and got:
(...)
Exploring ./2018/07
Exploring ./2018/07/06
Exploring ./2018/07/07
Exploring ./2018/07/08
Exploring ./2018/07/24
Exploring ./2018/07/27
Exploring ./2018/08
Exploring ./2018/08/19
Exploring ./2018/08/21
Exploring ./2018/08/23
Exploring ./2018/08/11
Exploring ./2018/08/12
Exploring ./2018/08/17
Exploring ./2018/08/18
Exploring ./2018/08/20
Exploring ./2018/09
Exploring ./2018/09/07
Exploring ./2018/09/08
Exploring ./2018/09/09
Exploring ./2018/09/14
Exploring ./2018/09/16
Exploring ./2009
Exploring ./2009/09
Exploring ./2009/09/22

Traceback (most recent call last):
  File "/usr/local/bin/jpegdupes", line 11, in
    load_entry_point('jpegdupes==2.0.13', 'console_scripts', 'jpegdupes')()
  File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 337, in main
  File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 337, in
  File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 145, in metadata_summary
AttributeError: 'Metadata' object has no attribute 'get_tags'
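
The traceback points at a get_tags call that this Metadata object does not provide. A possible workaround sketch, assuming the installed GExiv2 binding exposes per-family getters instead (get_exif_tags, get_iptc_tags, get_xmp_tags). The helper all_tags is hypothetical and not part of jpegdupes.

```python
def all_tags(metadata):
    """Collect tag names from a metadata object, with or without get_tags()."""
    # Older wrappers: a single get_tags() call returns everything.
    if hasattr(metadata, "get_tags"):
        return list(metadata.get_tags())
    # Assumption: newer bindings split the tags per metadata family.
    tags = []
    for getter_name in ("get_exif_tags", "get_iptc_tags", "get_xmp_tags"):
        getter = getattr(metadata, getter_name, None)
        if getter is not None:
            tags.extend(getter())
    return tags
```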

Dupes found are actually the same file

Got result like this:

(all 4 files are SAME file)

.... dupes that are ok ...

./2018-09-17/IMG_20180819_193752.jpg 
 ./2018-09-17/IMG_20180819_193752.jpg 
 ./2018-09-17/IMG_20180819_193752.jpg 
 ./2018-09-17/IMG_20180819_193752.jpg 

... some more dupes that are ok ...

I don't know why, but it worked perfectly (it detected all the dupes it should have) except that it reported this one file as a dupe of itself. Weird.
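
One possible guard, assuming the repeated entries come from the same path being recorded more than once (the actual cause is not established in this report). The function names are hypothetical.

```python
import os


def unique_paths(paths):
    """Drop entries that resolve to the same file, keeping the first spelling."""
    seen = set()
    result = []
    for path in paths:
        key = os.path.realpath(path)
        if key not in seen:
            seen.add(key)
            result.append(path)
    return result


def real_dupe_groups(groups):
    """Keep only groups that still hold more than one distinct file."""
    cleaned = (unique_paths(group) for group in groups)
    return [group for group in cleaned if len(group) > 1]
```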

recompile with -fPIC

Hi, I tried running this script on Linux Mint.
To avoid an error, I had to add
import gi
gi.require_version('GExiv2', '0.10')
before the line from gi.repository import GExiv2.

On the next attempt, the error below appeared at the top of the output, and the script simply finished after crawling some folders:

/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libturbojpeg.a(libturbojpeg_la-turbojpeg.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libturbojpeg.a: error adding symbols: Bad value
collect2: error: ld returned 1 exit status

Have you thought about doing whole-file MD5 for other image types such as png and nef?

Have you thought about doing whole-file MD5 for other image types such as png and nef?
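
A minimal sketch of what a whole-file MD5 could look like, with a hypothetical function name. Note that such a hash only matches byte-identical files, so two copies that differ only in metadata would not be reported as duplicates.

```python
import hashlib


def whole_file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```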

I have forked your code and made some adjustments. I have some files that crash imgdupes because they contain "truncated jpg block" data. Also, imgdupes seems to list the same file multiple times for HDR files re-developed by Shotwell. Choosing one to keep then fails with an error saying it cannot delete the extras, e.g.

If you are still interested in this project, I'm planning to send you some PRs for:

  1. Do not crash on truncated JPG data blocks, catch the exception and do whole-file hash for those
  2. Do not crash on files with non-jpg content such as misnamed PNG files
  3. Automatic mode: non-interactively select to keep the best duplicate of a set, with the most tags, residing in the shallowest directory tree, and with the longest directory path in case of ties (prefer more descriptive directory names and shallower trees)
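
The selection rule in item 3 could be expressed as a sort key along these lines. This is a sketch only: the names are hypothetical, and the tag counts are assumed to be supplied by the caller.

```python
import os


def keep_rank(path, tag_count):
    """Lower rank is better: most tags, then shallowest tree, then longest directory."""
    directory = os.path.dirname(os.path.normpath(path))
    depth = len([part for part in directory.split(os.sep) if part])
    return (-tag_count, depth, -len(directory))


def choose_keeper(candidates):
    """candidates: iterable of (path, tag_count) pairs; returns the path to keep."""
    return min(candidates, key=lambda item: keep_rank(item[0], item[1]))[0]
```

For example, between two files with the same number of tags at the same depth, the one in the directory with the longer (presumably more descriptive) name would be kept.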

Exception not properly caught

imgdupes stops on this image:

  Calculating hash of ./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg...
Traceback (most recent call last):
  File "/root/imgdupes/imgdupes.py", line 256, in <module>
    'hash':hashcalc(ruta,pool,args.method),
  File "/root/imgdupes/imgdupes.py", line 66, in hashcalc
    results=pool.map(phash,lista)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
NameError: global name 'path' is not defined
root@nas:/yyyyy/yyyyy/yyyyyy/yyyyyy# ls -l "./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg"
-rwxrwxr-x+ 1 root root 2960434 Jun  8  2007 ./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg
root@nas:/yyyyy/yyyyy/yyyyyy/yyyyyy# file "./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg"
./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg: JPEG image data, Exif standard: [TIFF image data, big-endian, direntries=11, manufacturer=CASIO COMPUTER CO.,LTD , model=QV-R61 , orientation=upper-left, xresolution=178, yresolution=186, resolutionunit=2, software=1.00                 , datetime=2007:02:01 22:11:18]
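
The exception is raised inside a worker and re-raised by pool.map, which aborts the whole run. A hedged sketch of one way to isolate per-file failures: safe_hash is a hypothetical wrapper, and hash_func stands in for the script's phash.

```python
def safe_hash(path, hash_func):
    """Return (path, hash, error); never raise for a single bad file."""
    try:
        return path, hash_func(path), None
    except Exception as exc:  # broad on purpose: one bad file must not stop the run
        return path, None, "{}: {}".format(type(exc).__name__, exc)
```

With multiprocessing, this could be used as something like pool.map(functools.partial(safe_hash, hash_func=phash), lista), after which the entries with a non-empty error can be reported and skipped.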
