Comments (46)
Hello!
I wrote a testing tool (which I'll not release due to the terrible code) and tested @elixire's image database (which contained 116280 unique images at the time of this test) with meow.
In the end I found 13 collisions. There's some relevant information here but overall it'll be useless without the files themselves (see bottom for more information about them):
512b collision found on /home/elixire/elixire/images/0/0344d20115f24ac00d6a13451b69725680cdf8fe4b09f50b15793d1cf6e500ec.a, prev on /home/elixire/elixire/images/0/020eb2406e841bb696fe7893ec7dd2e744228f133949fec7e835d490e879f0aa.png
16085343-C5231457-C21CC8EF-A0FB6063 D0C7D198-AF1398A7-300DCDBE-4D12D370
503B7AD3-9AF85B58-5D874BD6-D7399A16 F870D6D2-90BB54D1-67C0E2E6-19C12B83
512b collision found on /home/elixire/elixire/images/4/45091ded5e28408daeb05d97c9dee6f7188dd52278dfba3dbf01c7bf85f413eb, prev on /home/elixire/elixire/images/4/4e9b524fbae3c735c71f738d63362379f294930f2603d54dcf0a81cf908a8611.png
6921AB42-2AC9DC20-18FD8C44-C4FB0CDF 2F43440E-3AA56D48-539BE78C-827A39E7
30702F50-2C166578-19C59A34-A41B3AD6 42FC5EE7-C92FA010-48699D26-3B2DD1D6
512b collision found on /home/elixire/elixire/images/8/8333583c5a1b06a03ae2ffa562f6afc4f40e336d316b753f0e3e7c77f66bdf98, prev on /home/elixire/elixire/images/8/8dea60fec26a04d75af2d55bac78689b552c24e00b806c383268f6a1a84c9139.png
C6F1E3B5-0E4B20E5-A4F982B6-E7AB5C7E BAF5E5F4-B2A0D3EC-2A4C57D8-BB762BC2
276E27C6-E91CF118-D850A76D-E39B69EB 28217905-820BDA79-019ADADE-0EB0DA59
512b collision found on /home/elixire/elixire/images/5/594c41df3803343e1e68b554fc048faa7d0eef1b9bd01a088c30b6c6379bb555.a, prev on /home/elixire/elixire/images/5/58f4efcd23692e92cafdd0760d60c22555c6f53cf20e4fc6227308a6c91c66b7.png
48B8DFCC-A17F7839-1D0B4234-B2D95373 ECDB27E2-8E16EA54-38853FBD-9C8D39C2
56F7AB29-FC03B560-320AF1FC-EB21C7F7 23A29E62-99B766DC-302BEB6E-AE68BF7F
512b collision found on /home/elixire/elixire/images/a/a5ce1fc952617593233b4581de5c7fc787be9561ba268aa362ee6eeb5dbab4cd.a, prev on /home/elixire/elixire/images/a/abb2eab0eef9645973cdbd055d5cbface5e1646286fe7524f0bd3a784a171b45.png
2DADB7EF-F1044DCB-7EEF651E-4C0F7937 089DAA95-AB2AE79A-48AC58FB-9A951577
4026A0E4-D438C4A2-61C660E2-97FDCAB0 43438359-F8179CDA-323BC39D-87CB4594
512b collision found on /home/elixire/elixire/images/9/9aa8308660421b115d79a12f5b924da6df2f847c0e0999802b2d14ed0c0b68cc.mp4, prev on /home/elixire/elixire/images/9/97749f72e9d571c3cda0091eba6e4d16fe0c12c623c933ad008b42c11ba7009b.png
5884B4F7-FDBDC14A-01276417-4FF0AEBE BFA621CA-F89CFCE8-257DD2BA-0C566587
A0C68D7F-4A0C1938-40CDD295-B909BB7E C9F5B305-A9E8FBAE-09BEC8BF-AAFE55B7
512b collision found on /home/elixire/elixire/images/9/92b6244f6ea31f666ffa2e34836f29c8eef009c2d30770e8f90ba6ff29ecc34e, prev on /home/elixire/elixire/images/9/9d9413239486bcde407db8c4db288f69e824015f554152c34b738dff7fdedb4f.png
0020EC7D-43A56393-AD572478-9073AB55 27573B94-65F10EDE-68DC4B73-20BE7702
39BAE273-2CAB6136-D5DC2B10-1BA16DA6 ED85E19F-E3F1BAB3-F95BC2BD-C294654A
512b collision found on /home/elixire/elixire/images/6/61c2055bbded8bc6a5094a81c355b9db9e07e0a1646c14b6b811d98edbf4fa3d.a, prev on /home/elixire/elixire/images/6/693705033cff0c1ff465f56f5ee689c64deecea7890bffba6757be159642e747.png
B33EBE1A-977CC27D-F6245928-5633B169 BB200935-8A39E282-E53613B6-0657AFCD
9995C798-ECB8711B-7272A2DF-B8B80613 36F8A3FB-6BECF882-75719EC3-141D6A9E
512b collision found on /home/elixire/elixire/images/e/ea4773089bc133dc675b9e5d204ff892d3b492d730901362641213ce68324e8f.diff, prev on /home/elixire/elixire/images/e/e40e3f0296a232b3a4c3277c0f63674c046b928008c3c5a45b80d284ad2acefd.png
F67884B4-60E3EE07-41F154D5-E013758B F13E763B-75189041-0B1BED4F-D153C74F
84745696-DF172BE6-7E44AF7E-A0A5BB6A E1C29690-B2FFF1E8-8B79F967-D9FD3F24
512b collision found on /home/elixire/elixire/images/e/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, prev on /home/elixire/elixire/images/e/e2e277754e52134d949d632cc1c9f70696f893754cd2f5d45750405efcdb95e5.png
1FE6423A-82F81B46-3147FFF8-9F3D3B34 1FE6423A-82F81B46-3147FFF8-9F3D3B34
1FE6423A-82F81B46-3147FFF8-9F3D3B34 1FE6423A-82F81B46-3147FFF8-9F3D3B34
512b collision found on /home/elixire/elixire/images/3/39639c94587e4602695846b3f39895fa33ce54c9847174dfa0997eaca86d744f.jpg, prev on /home/elixire/elixire/images/3/3aedd8b209e0d887d60b91b2c07815dd935e819eebf795cd311ac25fcc26fd03.png
EE66394D-19E4A2F9-ED775D18-581B504E A133E1EF-91EDA6AC-A56CFC1D-DAD7CF3D
1CC710BD-2E0F167F-C8DEBAD7-5236DF47 6DEC75F5-DE20B5BE-531A823C-6F96CD73
512b collision found on /home/elixire/elixire/images/3/3867ace1999377aa2ae7596a09dc2d9dc56079b06ca81863d27961a9f1294723.a, prev on /home/elixire/elixire/images/3/3806faf92cbf32cb620c5356eb08c4c51312f3a9216bfcf450c9c4adb43c4fe8.png
9719599C-7A8D21FA-9EA5469A-CE0BC8B5 DACF079C-58242403-28F64A70-EF23A9B0
B6FB41A3-1D7C6CD8-DD5D60D4-465412FA AD6BA2B8-815DFDB1-9623D999-DE5CFACA
512b collision found on /home/elixire/elixire/images/7/776740f4c805bdff46ee3230513018ec52c26496add39d6c753d98bf686534b2, prev on /home/elixire/elixire/images/7/73a15c2e7d6f852c73c71812ed82f81f15b1d740377b1b6f239f4ce36a39ecbe.png
D7B18DC8-648EA5FB-566FB7EF-A396029D 07C87AAC-CE1E12DD-FE97BA16-4ECA8652
DB5807D6-EE794CF1-3B06E998-E3C360A3 0F0B1B9A-C1CE1A9E-8BDEDAF9-E888EEBF
I'll try to contact owners of those files to see if they're okay with the files being shared here for research purposes.
from meow_hash.
@Qix- They are different, yes. The filenames are their SHA256 hashes.
from meow_hash.
@aveao That would be very helpful. Although I suspect we do not actually need the files if you can't get them, because honestly we probably won't update the main loop of Meow hash, so it would only matter if the output of the main loop was different but the hash was not.
So maybe as a middle ground for now, could you send the complete values of S0123, S4567, S89AB, SCDEF as they are at the end of the function? If they are the same for the collisions, they are probably unfixable collisions. But if they are different, then it's our mixdown that is at fault and we can probably improve the mixdown to make them not collide, which would be very good to do!
Thanks,
- Casey
PS. My e-mail address is [email protected] if you'd like to send files.
from meow_hash.
Battlefront 2 + Battlefield 5 data together in one database....
92.2 GB over 946,378 files...
0 collisions with truncation down to 128-bit.
from meow_hash.
I threw all of Battlefield 5's content at it.
Running into 0 collisions, over 338,760 files, about 29.6GB.
Edited to correct numbers, so people reading the thread don't read wrong info at the top.
from meow_hash.
@qix I have 120 pairs, 17 triplets, 4 quadruplets, and 2 quintuplets...
from meow_hash.
We have substantially updated the Meow Hash for v0.5. We believe it now has significantly improved collision resistance. If you would like to conduct testing, the v0.5 branch is now available here as a pull request for early access testing :)
- Casey
from meow_hash.
Have to correct my stats.... my testing code was bugged :( crazy little typo and embarrassing stuff..
don't open and read binary files with fopen(filename, "r")
Anyway... with that corrected... 0 collisions for BF5. I'll see if I can throw Battlefront 2 at it later today.
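For anyone writing their own test harness, here is a minimal sketch of reading a file in binary mode before hashing it. The function name `read_file_binary` is made up for illustration and is not part of Meow hash. The only point is the `"rb"` mode string: in text mode some platforms translate line endings, and on Windows a 0x1A byte can be treated as end-of-file, so the hash would see different bytes than are on disk.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read an entire file into a malloc'd buffer.
   Returns NULL on failure; on success *out_size holds the byte count
   and the caller must free() the buffer. */
unsigned char *read_file_binary(const char *path, size_t *out_size)
{
    /* "rb", not "r": text mode may alter or truncate binary data. */
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    unsigned char *buf = NULL;
    size_t size = 0, cap = 0;
    for (;;) {
        if (size == cap) {                      /* grow the buffer when full */
            size_t new_cap = cap ? cap * 2 : 65536;
            unsigned char *tmp = realloc(buf, new_cap);
            if (!tmp) { free(buf); fclose(f); return NULL; }
            buf = tmp;
            cap = new_cap;
        }
        size_t got = fread(buf + size, 1, cap - size, f);
        size += got;
        if (got == 0) break;                    /* EOF or read error */
    }
    if (ferror(f)) { free(buf); fclose(f); return NULL; }
    fclose(f);
    *out_size = size;
    return buf;
}
```

On Linux and macOS `"r"` and `"rb"` behave the same, so a bug like this may only show up on Windows.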
from meow_hash.
Stupid question @aveao but did you confirm the files were actually different? Is that what the following lines are?
from meow_hash.
This is wonderful! Thank you very much! I will look at these today.
- Casey
from meow_hash.
Curious @tvandijck - any triple-pairs? As in, are there any three-or-more groups of files that all collide? Or are they all double pairs?
from meow_hash.
That sounds more like it :) Thanks for testing!
So we are now down to only one person (@aveao) who has reported 13 collisions, but we have not heard anything more from them. Can we verify that these are real collisions somehow, and if they are, get some repro cases? I still have found zero collisions for Meow and so I really need more testing...
- Casey
from meow_hash.
In @aveao's collisions, the first character of the SHA256 filename is identical for the two files in every colliding pair. That seems very peculiar, and it definitely seems worth double-checking that test code.
from meow_hash.
Is there some way to send a message to a GitHub user? If we can't contact @aveao, then I'm going to call it a misreport for now, since it does seem awfully suspicious and we can't get the files to verify.
- Casey
from meow_hash.
I've got access to 140TB of audio data, with files ranging in size from 10MB to 2GB, which are all unique in terms of SHA256. Would you be interested in me running meow_hash across it?
from meow_hash.
@atruskie That sounds fantastic.
from meow_hash.
@cmuratori I'm working on an official collision tester that I'll PR in, by the way. I'm also getting it to work on clang/macOS.
from meow_hash.
@cmuratori Heyo! Sorry for late reply.
I didn't get a chance to check stuff yet, work and all. I'm not too experienced in C++ so it's possible that I made a mistake. I'll test with @Qix-'s collision tester once it's PR'd in.
from meow_hash.
@Qix-, I put mine in a gist here: https://gist.github.com/tvandijck/e8ac50f01b6c656f5599d50b83e35ca9
it's windows only though...
from meow_hash.
I'm debugging a few issues so it might be a little while. I'm using an mmap solution that I'll have to port to windows at some point (or just fall back to using regular streaming, but I wanted to avoid using buffers for the sake of I/O throughput).
It's almost done, I'm just working through a bit of a puddle of platform issues, most of which I'll submit as separate PRs.
from meow_hash.
Collision checker has been PR'd into #15. There are a number of dependency PR's and windows support needs to be tested (sorry).
from meow_hash.
> I've got access to 140TB of audio data, with files ranging in size from 10MB to 2GB, which are all unique in terms of SHA256. Would you be interested in me running meow_hash across it?
Yes, definitely, that would be awesome! I am working on the 0.2 release of Meow right now, and it will come with Linux/Windows buildable utility that checks directory trees for collisions. Please stay tuned :)
- Casey
from meow_hash.
Looking to use this as a block hash for a game deployment pipeline: ~400 builds of ~200-500MB each chunked into 1MB blocks. Total is ~40GB. Will test this sometime soon.
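A rough sketch of the per-block hashing described above, under loud assumptions: the function names are invented for illustration, and the hash is 64-bit FNV-1a used purely as a stand-in so the example compiles on its own. It is not Meow hash; the real Meow call would replace `fnv1a64`.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash (64-bit FNV-1a). NOT Meow hash; swap in the real call. */
static uint64_t fnv1a64(const unsigned char *p, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 0x100000001b3ULL;              /* FNV prime */
    }
    return h;
}

/* Hash `data` in fixed-size blocks; the last block may be shorter.
   Stores at most max_out hashes in `out` and returns the total number
   of blocks, so the caller can detect a too-small output array. */
size_t hash_blocks(const unsigned char *data, size_t len, size_t block_size,
                   uint64_t *out, size_t max_out)
{
    if (block_size == 0) return 0;
    size_t count = 0, off = 0;
    while (off < len) {
        size_t n = len - off < block_size ? len - off : block_size;
        if (count < max_out) out[count] = fnv1a64(data + off, n);
        off += n;
        ++count;
    }
    return count;
}
```

With `block_size` set to `1024 * 1024` this gives one hash per 1MB block. Identical blocks across builds then show up as identical hashes, which is presumably the point of a block hash in a deployment pipeline.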
from meow_hash.
I've run meow hash on 2709214 files in a Linux distribution build folder, which has all the source code, build temporary files, output packages, compiler, dependencies, sysroot - a lot of stuff. Total size 131GB.
No collisions, both for the 512-bit hash output and when truncated to the first 128 bits.
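For reference, one simple way to count collisions at a truncated width is to mask every hash down to the width of interest, sort, and look for equal neighbours. The sketch below assumes the hashes have already been reduced to 64-bit values from files known to be distinct, and that truncation keeps the low bits. Both are assumptions for illustration, not a description of how any particular tool in this thread does it.

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Count colliding pairs among hashes truncated to their low `bits` bits.
   Masks and sorts `h` in place. A run of k equal values counts as
   k*(k-1)/2 pairs, so a triplet contributes 3. */
size_t count_truncated_collisions(uint64_t *h, size_t n, unsigned bits)
{
    uint64_t mask = bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
    for (size_t i = 0; i < n; ++i) h[i] &= mask;

    qsort(h, n, sizeof *h, cmp_u64);

    size_t pairs = 0, run = 1;
    for (size_t i = 1; i < n; ++i) {
        if (h[i] == h[i - 1]) {
            pairs += run;   /* pairs this value forms with earlier equals */
            ++run;
        } else {
            run = 1;
        }
    }
    return pairs;
}
```

Because the function overwrites its input, pass a copy if the full-width hashes are still needed afterwards.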
from meow_hash.
Ran it on approximately 41,000 separate 1MB chunks of mostly LZMA compressed game bundles. 0 collisions.
from meow_hash.
Meow v0.2 is now available and includes a collision search utility called "meow_search". It should build on both Windows/MSVC and Linux/CLANG, so you can search Windows machines or Linux machines for files that produce hash collisions. Hopefully it is robust. Please report any bugs!
It will report collisions for 128-bit, and also 64-bit and 32-bit truncations. I have not found any 128-bit or 64-bit collisions. 32-bit collisions are expected on anything in the tens-of-thousands range, so I have found some but they were expected - however it is still useful to note them, just in case they show up in suspiciously large numbers!
The hashing function has been changed to be more efficient in this version, and may have been weakened, so if everyone who has a chance could re-run their datasets with v0.2, that would be very much appreciated! Barring major revelations, this is basically the construction Meow will use, so I'd like to get it thoroughly vetted against as many datasets as possible.
Thanks,
- Casey
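To put a number on "expected" for the 32-bit truncation: for an ideal hash with uniformly random output, the expected count of colliding pairs among n distinct inputs is n(n-1)/2 divided by 2^bits. This is the standard birthday estimate, not something taken from the Meow source, and the function name below is made up for illustration.

```c
/* Expected number of colliding pairs among n distinct inputs for an
   ideal (uniformly random) hash with `bits` output bits:
   n * (n - 1) / 2 / 2^bits. Hypothetical helper, not part of Meow. */
double expected_colliding_pairs(double n, unsigned bits)
{
    double pairs = 0.5 * n * (n - 1.0);     /* number of distinct pairs */
    for (unsigned i = 0; i < bits; ++i)
        pairs /= 2.0;                       /* divide by 2^bits */
    return pairs;
}
```

For roughly 273,000 distinct files, `expected_colliding_pairs(273043.0, 32)` comes out at about 8.7, so a handful of 32-bit collisions on a dataset of that size is unremarkable. At 64 bits the same call gives about 2e-9, which is why a single 64-bit or 128-bit collision is worth reporting.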
from meow_hash.
Just ran Meow v0.2 on parts of our Mercurial repository store.
The "regular versioned source files" part:
meow_search 0.2/Ragdoll results:
Root: /Users/aras/unity/graphics/.hg/store
Completed on: Sat Oct 27 08:41:34 2018
Files: 285500
Total size: 18gb
Duplicate files: 12457
Access failures: 0
Allocation failures: 0
Read failures: 0
[Meow128] Meow 128-bit AES-NI 128-wide collisions: 0
[Meow64] Meow 64-bit AES-NI 128-wide collisions: 0
[Meow32] Meow 32-bit AES-NI 128-wide collisions: 13
<list of collisions; expected for 32 bit>
The "large versioned binary files" part:
meow_search 0.2/Ragdoll results:
Root: /Users/aras/unity/graphics/.hg/largefiles
Completed on: Sat Oct 27 08:38:39 2018
Files: 938
Total size: 31gb
Duplicate files: 0
Access failures: 0
Allocation failures: 0
Read failures: 0
[Meow128] Meow 128-bit AES-NI 128-wide collisions: 0
[Meow64] Meow 64-bit AES-NI 128-wide collisions: 0
[Meow32] Meow 32-bit AES-NI 128-wide collisions: 0
from meow_hash.
Well... I just ran meow_search and it did find 13 dupes (and 2 collisions, see below).
Yes, 13 dupes, not collisions. Sigh. But why did they get different SHA256 hashes? Why did they look different when I observed them? Well that's up to us to determine now I suppose.
It's good to know that I'm not going crazy, and that I can actually write acceptable C++.
But yes, there's 2 meow 32-bit 128-wide collisions:
meow_search 0.2/Ragdoll results:
Root: /home/elixire/elixire/images
Completed on: Sat Oct 27 08:31:16 2018
Files: 120204
Total size: 22gb
Duplicate files: 13
Access failures: 21
Allocation failures: 0
Read failures: 0
[Meow128] Meow 128-bit AES-NI 128-wide collisions: 0
[Meow64] Meow 64-bit AES-NI 128-wide collisions: 0
[Meow32] Meow 32-bit AES-NI 128-wide collisions: 2
00000000-00000000-00000000-6BFA518B:
/home/elixire/elixire/images/3/3b352d0439009a4af1514db9ae466a0c1398075ef74fe99fb8746725a3727b09.png
/home/elixire/elixire/images/f/fd645dcd303959facc4864a0ce48125ddbfbd55128402a8bbbf857ccff0096a1.png
00000000-00000000-00000000-F748E353:
/home/elixire/elixire/images/6/66d3f7a0eefd88a254d460e23aac31f9cb0e101919978a032e8a2b0d5fe13725.png
/home/elixire/elixire/images/f/f21859d7210a4060c906b242a446e5e2207edd90e08110aee1387d9c1f22c5ff.png
Wouldn't it be better to have a Makefile instead of a build.sh, btw?
from meow_hash.
@aveao When you run your utility, and it reports 13 collisions, what happens if you run Beyond Compare or some other diff utility on one of the pairs it reports? Or, can you perhaps send me one pair that collides for me to look at, if indeed you still think they are not identical files?
Thanks,
- Casey
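If no diff utility is handy, a byte-for-byte comparison is short enough to write directly. This is a minimal sketch with a made-up function name, assuming both paths are regular files (so a short read only happens at end-of-file).

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if both files contain identical bytes, 0 if they differ,
   and -1 if either file cannot be opened or read. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");          /* binary mode on purpose */
    FILE *b = fopen(path_b, "rb");
    int result = -1;
    if (a && b) {
        unsigned char buf_a[65536], buf_b[65536];
        result = 1;
        for (;;) {
            size_t na = fread(buf_a, 1, sizeof buf_a, a);
            size_t nb = fread(buf_b, 1, sizeof buf_b, b);
            if (ferror(a) || ferror(b)) { result = -1; break; }
            if (na != nb || memcmp(buf_a, buf_b, na) != 0) { result = 0; break; }
            if (na == 0) break;             /* both files ended together */
        }
    }
    if (a) fclose(a);
    if (b) fclose(b);
    return result;
}
```

A result of 1 for a reported pair means the pair is a duplicate rather than a hash collision, whatever the filenames suggest.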
from meow_hash.
> Just ran Meow v0.2 on parts of our Mercurial repository store.
Thanks very much! Glad to see there were no collision issues.
- Casey
from meow_hash.
Ran it again on my Linux build folder, no meow128 collisions:
Files: 2758869
Total size: 132gb
Duplicate files: 1660001
Access failures: 0
Allocation failures: 0
Read failures: 0
[Meow128] Meow 128-bit AES-NI 128-wide collisions: 0
[Meow64] Meow 64-bit AES-NI 128-wide collisions: 0
[Meow32] Meow 32-bit AES-NI 128-wide collisions: 762
The large number of duplicates is expected.
from meow_hash.
@cmuratori they probably are identical, I doubt that we found 13 or even 1 sha256 collision.
from meow_hash.
@aveao But I thought you said they had different SHA256 hashes? Is your SHA hashing code messed up, maybe?
- Casey
from meow_hash.
If you're looking for collisions, I found a 128-bit collision in https://github.com/dvyukov/go-fuzz-corpus
[Meow128] Meow 128-bit AES-NI 128-wide collisions: 1
D45A69A1-1ACC0AD1-9E7602B3-E1CE650F:
dvyukov/go-fuzz-corpus/htmltemplate/corpus/88a5085c17659654909710074e536e95d2be3acc-18
dvyukov/go-fuzz-corpus/url/values/corpus/f85d300aa6ced0edf61be35bc8d45c0c0adf961b
One consists of 65 "%", the other of 65 ";".
from meow_hash.
Yes! That is very helpful. I will download the corpus and try to repro it.
Which Meow hash was this? (v0.1 or v0.2)
Thanks,
- Casey
from meow_hash.
This was at git sha1 67ac7f3, so currently HEAD.
from meow_hash.
There is a new candidate for v0.3 (see the v0.3 branch). It does not collide on go-fuzz, but it has also not been tested on any of the large datasets that folks have tested prior, so we may have some regressions. Any testing of the new function would be greatly appreciated, and if you can find any collisions please send them!
Thanks,
- Casey
from meow_hash.
Just ran Meow v0.3 on parts of our Mercurial repository store, similar to previous test on v0.2. TLDR: 32 bit hash has fewer collisions, yay!
The "regular versioned source files" part:
meow_search 0.3/snowshoe results:
Root: /Users/aras/unity/graphics/.hg/store
Completed on: Sat Nov 3 17:12:44 2018
Files: 285501
Total size: 17.63gb
Duplicate files: 12457
Files changed during search: 0
Access failures: 0
Allocation failures: 0
Read failures: 0
[Meow128] Meow 128-bit AES-NI 128-wide collisions: 0
[Meow64] Meow 64-bit AES-NI 128-wide collisions: 0
[Meow32] Meow 32-bit AES-NI 128-wide collisions: 7
<list of collisions; expected for 32 bit>
The "large versioned binary files" part:
meow_search 0.3/snowshoe results:
Root: /Users/aras/unity/graphics/.hg/largefiles
Completed on: Sat Nov 3 17:10:30 2018
Files: 939
Total size: 30.52gb
Duplicate files: 0
Files changed during search: 0
Access failures: 0
Allocation failures: 0
Read failures: 0
[Meow128] Meow 128-bit AES-NI 128-wide collisions: 0
[Meow64] Meow 64-bit AES-NI 128-wide collisions: 0
[Meow32] Meow 32-bit AES-NI 128-wide collisions: 0
from meow_hash.
Excellent!
- Casey
from meow_hash.
We are now getting down to the nitty-gritty. v0.4 has been posted, and should hopefully provide the same collision resistance as v0.3, while now being substantially faster on small inputs (we are now the fastest smhasher-passing hash we know of for all input sizes, period). We will be doing a little more testing, but unless something new comes up, I will try to have a v0.5 branch up sometime before the end of the year that we can definitively test as the "final" Meow hash to be christened v1.0.
Thanks everyone for your testing help!
- Casey
from meow_hash.
Ran 0.4 on the same dataset as before. 128 & 64 bit hashes still zero collisions, 32 bit now has 9 collisions where 0.3 had 7. Not sure if that's worth worrying about at all; I think the amount of collisions is ballpark where I would expect with only 32 bits of hash.
If you really want to look at it, here's the files from that data set where 0.3 had collisions, and where 0.4 had collisions.
collisions-32bit-0.3.zip
collisions-32bit-0.4.zip
from meow_hash.
Hi @cmuratori,
Thanks for your brilliant work!
Ran HEAD (0.4) on the ImageNet 2017 dataset (a computer vision dataset). Only 32-bit collisions!
Data processed: 156GB
File Count: 2 025 710
Dupes: 8502
32-bit collisions: 511
With this, I can conclude that meowhash is reliable and ImageNet2017 is a high quality dataset!
Best fortunes,
Tjad
from meow_hash.
> the v0.5 branch is now available here as a pull request for early access testing
A bunch of test executables (test, bench, search) do not compile with the provided scripts (on Mac/Linux at least) due to wrong paths, but the test programs also seem to want to include files that have been removed (meow_intrinsics.h and meow_hash.h are gone but still used from meow_test.h, etc.).
from meow_hash.
Ah crap - somehow the cpps were all the old ones. I admit I am not particularly good with the GitHub web client (or anything Git-related for that matter). They should be updated now.
Keep in mind, though, we haven't posted a Mac/Linux build yet. So the build.sh is actually old, and may not work. This is Windows-only at the moment, although it probably isn't far off from working on Mac/Linux.
- Casey
from meow_hash.
@aras-p I pushed a new build.sh and meow_test.cpp today. Those are the only two things that had issues on Linux. Everything should be compiling now on Windows and Linux. I don't test on Mac, so YMMV there, but IIRC the Linux and Mac builds were very similar so it shouldn't be hard to get it running.
- Casey
from meow_hash.
Paused some training to obtain results. They appear consistent, with improvements 👍
Best fortunes,
Tjad
from meow_hash.