Discussion:
binary file compare...
SpreadTooThin
2009-04-13 19:54:11 UTC
Permalink
I want to compare two binary files and see if they are the same.
I see the filecmp.cmp function, but I don't get a warm fuzzy feeling
that it is doing a byte-by-byte comparison of the two files to see if
they are the same.

What should I be using if not filecmp.cmp?
Przemyslaw Kaminski
2009-04-13 20:00:26 UTC
Permalink
SpreadTooThin wrote:

> I want to compare two binary files and see if they are the same.
> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
> that it is doing a byte by byte comparison of two files to see if they
> are they same.
>
> What should I be using if not filecmp.cmp?

Well, here's something
http://www.daniweb.com/forums/thread115959.html
but it seems from the post on the bottom that filecmp does comparison of
binary files.
SpreadTooThin
2009-04-13 20:03:05 UTC
Permalink
On Apr 13, 2:00 pm, Przemyslaw Kaminski <cge... at gmail.com> wrote:
> SpreadTooThin wrote:
> > I want to compare two binary files and see if they are the same.
> > I see the filecmp.cmp function but I don't get a warm fuzzy feeling
> > that it is doing a byte by byte comparison of two files to see if they
> > are they same.
>
> > What should I be using if not filecmp.cmp?
>
> Well, here's something http://www.daniweb.com/forums/thread115959.html
> but it seems from the post on the bottom that filecmp does comparison of
> binary files.

I just want to be clear: the comparison should not be based just on file
size and creation date, but on a byte-by-byte comparison of the data in
each file.
Grant Edwards
2009-04-13 20:03:32 UTC
Permalink
On 2009-04-13, SpreadTooThin <bjobrien62 at gmail.com> wrote:

> I want to compare two binary files and see if they are the same.
> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
> that it is doing a byte by byte comparison of two files to see if they
> are they same.

Perhaps I'm being dim, but how else are you going to decide if
two files are the same unless you compare the bytes in the
files?

You could hash them and compare the hashes, but that's a lot
more work than just comparing the two byte streams.

> What should I be using if not filecmp.cmp?

I don't understand what you've got against comparing the files
when you stated that what you wanted to do was compare the files.

--
Grant Edwards grante Yow! Look DEEP into the
at OPENINGS!! Do you see any
visi.com ELVES or EDSELS ... or a
HIGHBALL?? ...
SpreadTooThin
2009-04-13 20:17:02 UTC
Permalink
On Apr 13, 2:03 pm, Grant Edwards <invalid at invalid> wrote:
> On 2009-04-13, SpreadTooThin <bjobrie... at gmail.com> wrote:
>
> > I want to compare two binary files and see if they are the same.
> > I see the filecmp.cmp function but I don't get a warm fuzzy feeling
> > that it is doing a byte by byte comparison of two files to see if they
> > are they same.
>
> Perhaps I'm being dim, but how else are you going to decide if
> two files are the same unless you compare the bytes in the
> files?
>
> You could hash them and compare the hashes, but that's a lot
> more work than just comparing the two byte streams.
>
> > What should I be using if not filecmp.cmp?
>
> I don't understand what you've got against comparing the files
> when you stated that what you wanted to do was compare the files.
>

I think it's just the way the documentation was worded
http://www.python.org/doc/2.5.2/lib/module-filecmp.html

Unless shallow is given and is false, files with identical os.stat()
signatures are taken to be equal.
Files that were compared using this function will not be compared
again unless their os.stat() signature changes.

So to do a comparison:
filecmp.cmp(filea, fileb, False) ?

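For illustration, here is a minimal sketch of that non-shallow call (the
filenames are hypothetical):

import filecmp

# Passing shallow=False makes filecmp read and compare the file contents
# instead of trusting the os.stat() signatures.
same = filecmp.cmp('a.bin', 'b.bin', shallow=False)
print('identical' if same else 'different')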
Grant Edwards
2009-04-13 20:37:19 UTC
Permalink
On 2009-04-13, Grant Edwards <invalid at invalid> wrote:
> On 2009-04-13, SpreadTooThin <bjobrien62 at gmail.com> wrote:
>
>> I want to compare two binary files and see if they are the same.
>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
>> that it is doing a byte by byte comparison of two files to see if they
>> are they same.
>
> Perhaps I'm being dim, but how else are you going to decide if
> two files are the same unless you compare the bytes in the
> files?
>
> You could hash them and compare the hashes, but that's a lot
> more work than just comparing the two byte streams.
>
>> What should I be using if not filecmp.cmp?
>
> I don't understand what you've got against comparing the files
> when you stated that what you wanted to do was compare the files.

Doh! I misread your post and thought you weren't getting a
warm fuzzy feeling _because_ it was doing a byte-byte
compare. Now I'm a bit confused. Are you under the impression
it's _not_ doing a byte-byte compare? Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me. Note that when
this function is called the file lengths have already been
compared and found to be equal.

--
Grant Edwards grante Yow! Alright, you!!
at Imitate a WOUNDED SEAL
visi.com pleading for a PARKING
SPACE!!
SpreadTooThin
2009-04-13 20:46:18 UTC
Permalink
On Apr 13, 2:37 pm, Grant Edwards <invalid at invalid> wrote:
> On 2009-04-13, Grant Edwards <invalid at invalid> wrote:
>
>
>
> > On 2009-04-13, SpreadTooThin <bjobrie... at gmail.com> wrote:
>
> >> I want to compare two binary files and see if they are the same.
> >> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
> >> that it is doing a byte by byte comparison of two files to see if they
> >> are they same.
>
> > Perhaps I'm being dim, but how else are you going to decide if
> > two files are the same unless you compare the bytes in the
> > files?
>
> > You could hash them and compare the hashes, but that's a lot
> > more work than just comparing the two byte streams.
>
> >> What should I be using if not filecmp.cmp?
>
> > I don't understand what you've got against comparing the files
> > when you stated that what you wanted to do was compare the files.
>
> Doh! I misread your post and thought were weren't getting a
> warm fuzzying feeling _because_ it was doing a byte-byte
> compare. Now I'm a bit confused. Are you under the impression
> it's _not_ doing a byte-byte compare? Here's the code:
>
> def _do_cmp(f1, f2):
>     bufsize = BUFSIZE
>     fp1 = open(f1, 'rb')
>     fp2 = open(f2, 'rb')
>     while True:
>         b1 = fp1.read(bufsize)
>         b2 = fp2.read(bufsize)
>         if b1 != b2:
>             return False
>         if not b1:
>             return True
>
> It looks like a byte-by-byte comparison to me. Note that when
> this function is called the file lengths have already been
> compared and found to be equal.
>

I am indeed under the impression that it is not always doing a byte by
byte comparison...
Also, the documentation states:
Compare the files named f1 and f2, returning True if they seem equal,
False otherwise.

That word... Seeeeem... makes me wonder.

Thanks for the code! :)
Dave Angel
2009-04-14 01:11:28 UTC
Permalink
SpreadTooThin wrote:
> On Apr 13, 2:37 pm, Grant Edwards <invalid at invalid> wrote:
>
>> On 2009-04-13, Grant Edwards <invalid at invalid> wrote:
>>
>>
>>
>>
>>> On 2009-04-13, SpreadTooThin <bjobrie... at gmail.com> wrote:
>>>
>>>> I want to compare two binary files and see if they are the same.
>>>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
>>>> that it is doing a byte by byte comparison of two files to see if they
>>>> are they same.
>>>>
>>> Perhaps I'm being dim, but how else are you going to decide if
>>> two files are the same unless you compare the bytes in the
>>> files?
>>>
>>> You could hash them and compare the hashes, but that's a lot
>>> more work than just comparing the two byte streams.
>>>
>>>> What should I be using if not filecmp.cmp?
>>>>
>>> I don't understand what you've got against comparing the files
>>> when you stated that what you wanted to do was compare the files.
>>>
>> Doh! I misread your post and thought were weren't getting a
>> warm fuzzying feeling _because_ it was doing a byte-byte
>> compare. Now I'm a bit confused. Are you under the impression
>> it's _not_ doing a byte-byte compare? Here's the code:
>>
>> def _do_cmp(f1, f2):
>>     bufsize = BUFSIZE
>>     fp1 = open(f1, 'rb')
>>     fp2 = open(f2, 'rb')
>>     while True:
>>         b1 = fp1.read(bufsize)
>>         b2 = fp2.read(bufsize)
>>         if b1 != b2:
>>             return False
>>         if not b1:
>>             return True
>>
>> It looks like a byte-by-byte comparison to me. Note that when
>> this function is called the file lengths have already been
>> compared and found to be equal.
>>
>>
>
> I am indeed under the impression that it is not always doing a byte by
> byte comparison...
> as well the documentation states:
> Compare the files named f1 and f2, returning True if they seem equal,
> False otherwise.
>
> That word... Seeeeem... makes me wonder.
>
> Thanks for the code! :)
>
>
>
Some of this discussion depends on the version of Python, though nobody
has said which version they're using. In version 2.6.1, the code is
different (and more complex) than what's listed above. The docs are
different too. In this version, at least, you'll want to explicitly pass
the shallow=False parameter. It defaults to 1, by which they must mean
True. I think it's a bad default, but it's still a useful function. Just
be careful to include that parameter in your call.

Further, you'll want to check the copy included with your own version. The
file filecmp.py is in the Lib directory, so it's no trouble to check it.
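
One quick way to check what your own installation actually does is to print
the module's source (a sketch; _do_cmp is the private helper quoted earlier
in this thread, and the exact code will vary by Python version):

import inspect
import filecmp

# Show where the module lives and what its comparison helper really does.
print(filecmp.__file__)
print(inspect.getsource(filecmp._do_cmp))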
Peter Otten
2009-04-13 21:25:37 UTC
Permalink
Grant Edwards wrote:

> On 2009-04-13, Grant Edwards <invalid at invalid> wrote:
>> On 2009-04-13, SpreadTooThin <bjobrien62 at gmail.com> wrote:
>>
>>> I want to compare two binary files and see if they are the same.
>>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
>>> that it is doing a byte by byte comparison of two files to see if they
>>> are they same.
>>
>> Perhaps I'm being dim, but how else are you going to decide if
>> two files are the same unless you compare the bytes in the
>> files?
>>
>> You could hash them and compare the hashes, but that's a lot
>> more work than just comparing the two byte streams.
>>
>>> What should I be using if not filecmp.cmp?
>>
>> I don't understand what you've got against comparing the files
>> when you stated that what you wanted to do was compare the files.
>
> Doh! I misread your post and thought were weren't getting a
> warm fuzzying feeling _because_ it was doing a byte-byte
> compare. Now I'm a bit confused. Are you under the impression
> it's _not_ doing a byte-byte compare? Here's the code:
>
> def _do_cmp(f1, f2):
>     bufsize = BUFSIZE
>     fp1 = open(f1, 'rb')
>     fp2 = open(f2, 'rb')
>     while True:
>         b1 = fp1.read(bufsize)
>         b2 = fp2.read(bufsize)
>         if b1 != b2:
>             return False
>         if not b1:
>             return True
>
> It looks like a byte-by-byte comparison to me. Note that when
> this function is called the file lengths have already been
> compared and found to be equal.

But there's a cache. A change of file contents may go undetected as long as
the file stats don't change:

$ cat fool_filecmp.py
import filecmp, shutil, sys

for fn in "adb":
with open(fn, "w") as f:
f.write("yadda")

shutil.copystat("d", "a")
filecmp.cmp("a", "b", False)

with open("a", "w") as f:
f.write("*****")
shutil.copystat("d", "a")

if "--clear" in sys.argv:
print "clearing cache"
filecmp._cache.clear()

if filecmp.cmp("a", "b", False):
print "file a and b are equal"
else:
print "file a and b differ"
print "a's contents:", open("a").read()
print "b's contents:", open("b").read()

$ python2.6 fool_filecmp.py
file a and b are equal
a's contents: *****
b's contents: yadda

Oops. If you are paranoid you have to clear the cache before doing the
comparison:

$ python2.6 fool_filecmp.py --clear
clearing cache
file a and b differ
a's contents: *****
b's contents: yadda

Peter
Grant Edwards
2009-04-14 02:39:26 UTC
Permalink
On 2009-04-13, Peter Otten <__peter__ at web.de> wrote:

> But there's a cache. A change of file contents may go
> undetected as long as the file stats don't change:

Good point. You can fool it if you force the stats to their
old values after you modify a file and you don't clear the
cache.

--
Grant
Adam Olsen
2009-04-14 23:53:19 UTC
Permalink
On Apr 13, 8:39 pm, Grant Edwards <gra... at visi.com> wrote:
> On 2009-04-13, Peter Otten <__pete... at web.de> wrote:
>
> > But there's a cache. A change of file contents may go
> > undetected as long as the file stats don't change:
>
> Good point. You can fool it if you force the stats to their
> old values after you modify a file and you don't clear the
> cache.

The timestamps stored on the filesystem (for ext3 and most other
filesystems) are fairly coarse, so it's quite possible for a check/
update/check sequence to have the same timestamp at the beginning and
end.
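
A small sketch of the kind of thing he means (purely illustrative; whether
the two signatures actually match depends on your filesystem's timestamp
resolution and on timing):

import os

with open('t.bin', 'wb') as f:
    f.write(b'one')
st1 = os.stat('t.bin')

with open('t.bin', 'wb') as f:
    f.write(b'two')        # same length, rewritten almost immediately
st2 = os.stat('t.bin')

# With one-second timestamps these stat signatures can be identical even
# though the contents changed, which is what fools a shallow comparison.
print((st1.st_size, st1.st_mtime) == (st2.st_size, st2.st_mtime))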
Steven D'Aprano
2009-04-14 00:05:52 UTC
Permalink
On Mon, 13 Apr 2009 15:03:32 -0500, Grant Edwards wrote:

> On 2009-04-13, SpreadTooThin <bjobrien62 at gmail.com> wrote:
>
>> I want to compare two binary files and see if they are the same. I see
>> the filecmp.cmp function but I don't get a warm fuzzy feeling that it
>> is doing a byte by byte comparison of two files to see if they are they
>> same.
>
> Perhaps I'm being dim, but how else are you going to decide if two files
> are the same unless you compare the bytes in the files?

If you start with an image in one format (e.g. PNG), and convert it to
another format (e.g. JPEG), you might want the two files to compare equal
even though their byte contents are completely different, because their
content (the image itself) is visually identical.

Or you might want a heuristic as a short cut for comparing large files,
and decide that if two files have the same size and modification dates,
and the first (say) 100KB are equal, that you will assume the rest are
probably equal too.
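
As a rough sketch, that short-cut might look something like this (the
function name and the 100KB threshold are illustrative choices only):

import os

def probably_equal(path1, path2, prefix_bytes=100 * 1024):
    # Cheap checks first: size and modification time from os.stat().
    st1, st2 = os.stat(path1), os.stat(path2)
    if st1.st_size != st2.st_size or st1.st_mtime != st2.st_mtime:
        return False
    # Then compare only the first prefix_bytes and assume the rest matches.
    with open(path1, 'rb') as f1:
        with open(path2, 'rb') as f2:
            return f1.read(prefix_bytes) == f2.read(prefix_bytes)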

Neither of these are what the OP wants, I'm just mentioning them to
answer your rhetorical question :)



--
Steven
Martin
2009-04-15 05:54:20 UTC
Permalink
Hi,

On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards <invalid at invalid> wrote:
> On 2009-04-13, SpreadTooThin <bjobrien62 at gmail.com> wrote:
>
>> I want to compare two binary files and see if they are the same.
>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
>> that it is doing a byte by byte comparison of two files to see if they
>> are they same.
>
> Perhaps I'm being dim, but how else are you going to decide if
> two files are the same unless you compare the bytes in the
> files?

I'd say checksums; just about every download relies on checksums to
verify that you do indeed have the same file.

>
> You could hash them and compare the hashes, but that's a lot
> more work than just comparing the two byte streams.

Hashing is not exactly much work; in its simplest form it's 2 lines per file.

$ dd if=/dev/urandom of=testfile.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.4491 s, 3.6 MB/s
$ dd if=/dev/urandom of=testfile2.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.92479 s, 2.7 MB/s
$ cp testfile.data testfile3.data
$ python
Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> sha = hashlib.sha256()
>>> sha.update(file("testfile.data").read())
>>> sha.hexdigest()
'a0a8b5d1fd7b8181e0131fff8fd6acce39917e4498c86704354221fd96815797'
>>> sha2=hashlib.sha256()
>>> sha2.update(file("testfile2.data").read())
>>> sha2.hexdigest()
'25597380f833f287e8dad936b15ddb616669102c38f54dbd60ce57998d99ad3b'
>>> sha3=hashlib.sha256()
>>> sha3.update(file("testfile3.data").read())
>>> sha3.hexdigest()
'a0a8b5d1fd7b8181e0131fff8fd6acce39917e4498c86704354221fd96815797'
>>> sha.hexdigest() == sha2.hexdigest()
False
>>> sha.hexdigest() == sha3.hexdigest()
True
>>> sha2.hexdigest() == sha3.hexdigest()
False
>>>
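
As an aside, reading each file with a single .read() pulls the whole thing
into memory; here is a sketch of the same idea with bounded memory use (the
chunk size is arbitrary):

import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    # Feed the file to the hash one chunk at a time.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

# Matching digests strongly suggest (but, as argued below, do not strictly
# prove) that the underlying bytes are identical.
print(sha256_of_file('testfile.data') == sha256_of_file('testfile3.data'))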



--
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
Steven D'Aprano
2009-04-15 09:03:16 UTC
Permalink
On Wed, 15 Apr 2009 07:54:20 +0200, Martin wrote:

>> Perhaps I'm being dim, but how else are you going to decide if two
>> files are the same unless you compare the bytes in the files?
>
> I'd say checksums, just about every download relies on checksums to
> verify you do have indeed the same file.

The checksum does look at every byte in each file. Checksumming isn't a
way to avoid looking at each byte of the two files, it is a way of
mapping all the bytes to a single number.



>> You could hash them and compare the hashes, but that's a lot more work
>> than just comparing the two byte streams.
>
> hashing is not exactly much mork in it's simplest form it's 2 lines per
> file.

Hashing is a *lot* more work than just comparing two bytes. The MD5
checksum has been specifically designed to be fast and compact, and the
algorithm is still complicated:

http://en.wikipedia.org/wiki/MD5#Pseudocode

The reference implementation is here:

http://www.fastsum.com/rfc1321.php#APPENDIXA

SHA-1 is even more complicated still:

http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1_pseudocode


Just because *calling* some checksum function is easy doesn't make the
checksum function itself simple. They do a LOT more work than just a
simple comparison between bytes, and that's totally unnecessary work if
you are making a one-off comparison of two local files.



--
Steven
Martin
2009-04-15 13:25:27 UTC
Permalink
On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
<steven at remove.this.cybersource.com.au> wrote:
> The checksum does look at every byte in each file. Checksumming isn't a
> way to avoid looking at each byte of the two files, it is a way of
> mapping all the bytes to a single number.

My understanding of the original question was a way to determine
whether 2 files are equal or not. Creating a checksum of 1-n files and
comparing those checksums IMHO is a valid way to do that. I know it's
a (one way) mapping between a (possibly) longer byte sequence and
another one, but I don't see how checksumming fails to take each byte in
the original sequence into account.

I'd still say rather burn CPU cycles than development hours (if I got
the question right); if not, then with binary files you will have to
find some way of representing differences between the 2 files in a
readable manner anyway.

> Hashing is a *lot* more work than just comparing two bytes. The MD5
> checksum has been specifically designed to be fast and compact, and the
> algorithm is still complicated:

I know that the various checksum algorithms aren't exactly cheap, but
I do think that, just to know whether 2 files are different, a solution
which takes 5 minutes to implement wins hands down against a lengthy
discussion which optimizes too early.

regards,
martin

--
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
Nigel Rantor
2009-04-15 17:04:18 UTC
Permalink
Martin wrote:
> On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
> <steven at remove.this.cybersource.com.au> wrote:
>> The checksum does look at every byte in each file. Checksumming isn't a
>> way to avoid looking at each byte of the two files, it is a way of
>> mapping all the bytes to a single number.
>
> My understanding of the original question was a way to determine
> wether 2 files are equal or not. Creating a checksum of 1-n files and
> comparing those checksums IMHO is a valid way to do that. I know it's
> a (one way) mapping between a (possibly) longer byte sequence and
> another one, how does checksumming not take each byte in the original
> sequence into account.

The fact that two md5 hashes are equal does not mean that the sources
they were generated from are equal. To do that you must still perform a
byte-by-byte comparison which is much less work for the processor than
generating an md5 or sha hash.

If you insist on using a hashing algorithm to determine the equivalence
of two files you will eventually realise that it is a flawed plan
because you will eventually find two files with different contents that
nonetheless hash to the same value.

The more files you test with the quicker you will find out this basic truth.

This is not complex, it's a simple fact about how hashing algorithms work.

n
Adam Olsen
2009-04-15 18:37:41 UTC
Permalink
On Apr 15, 11:04 am, Nigel Rantor <wig... at wiggly.org> wrote:
> The fact that two md5 hashes are equal does not mean that the sources
> they were generated from are equal. To do that you must still perform a
> byte-by-byte comparison which is much less work for the processor than
> generating an md5 or sha hash.
>
> If you insist on using a hashing algorithm to determine the equivalence
> of two files you will eventually realise that it is a flawed plan
> because you will eventually find two files with different contents that
> nonetheless hash to the same value.
>
> The more files you test with the quicker you will find out this basic truth.
>
> This is not complex, it's a simple fact about how hashing algorithms work.

The only flaw in a cryptographic hash is the increasing number of
attacks that are found against it. You need to pick a trusted one when you
start and consider replacing it every few years.

The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.
Nigel Rantor
2009-04-15 18:56:50 UTC
Permalink
Adam Olsen wrote:
> The chance of *accidentally* producing a collision, although
> technically possible, is so extraordinarily rare that it's completely
> overshadowed by the risk of a hardware or software failure producing
> an incorrect result.

Not when you're using them to compare lots of files.

Trust me. Been there, done that, got the t-shirt.

Using hash functions to tell whether or not files are identical is an
error waiting to happen.

But please, do so if it makes you feel happy, you'll just eventually get
an incorrect result and not know it.

n
Lawrence D'Oliveiro
2009-04-18 00:27:12 UTC
Permalink
In message <mailman.3934.1239821812.11746.python-list at python.org>, Nigel
Rantor wrote:

> Adam Olsen wrote:
>
>> The chance of *accidentally* producing a collision, although
>> technically possible, is so extraordinarily rare that it's completely
>> overshadowed by the risk of a hardware or software failure producing
>> an incorrect result.
>
> Not when you're using them to compare lots of files.
>
> Trust me. Been there, done that, got the t-shirt.

Not with any cryptographically-strong hash, you haven't.
Grant Edwards
2009-04-15 14:11:56 UTC
Permalink
On 2009-04-15, Martin <martin at marcher.name> wrote:
> On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano

> I'd still say rather burn CPU cycles than development hours (if I got
> the question right),

_Hours_? Calling the file compare module takes _one_line_of_code_.

Implementing a file compare from scratch takes about a half
dozen lines of code.

> if not then with binary files you will have to find some way
> of representing differences between the 2 files in a readable
> manner anyway.

1) Who said anything about a readable representation of the
differences?

2) How does a checksum provide that?

>> Hashing is a *lot* more work than just comparing two bytes.
>> The MD5 checksum has been specifically designed to be fast and
>> compact, and the algorithm is still complicated:
>
> I know that the various checksum algorithms aren't exactly
> cheap, but I do think that just to know wether 2 files are
> different a solution which takes 5mins to implement wins
> against a lengthy discussion

Bah. A direct compare is trivial. The discussion of which
checksum to use, how to implement it, and how reliable it is
will be far longer than any discussion over a direct
comparison.

> which optimizes too early wins hands down.

Optimizes too early? Comparing the bytes is the simplest and
most direct, obvious solution. It takes one line of code to
call the file compare module. Implementing it from scratch
takes about five lines of code.
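
For reference, a sketch of such a from-scratch comparison (essentially the
same loop as the filecmp helper quoted earlier in the thread):

def files_equal(path1, path2, bufsize=8 * 1024):
    # Read both files in parallel chunks; any differing chunk means the
    # files differ, including the case where one file ends early.
    with open(path1, 'rb') as f1:
        with open(path2, 'rb') as f2:
            while True:
                b1, b2 = f1.read(bufsize), f2.read(bufsize)
                if b1 != b2:
                    return False
                if not b1:
                    return True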

We all rail against premature optimization, but using a
checksum instead of a direct comparison is premature
unoptimization. ;)

--
Grant Edwards grante Yow! Hmmm ... A hash-singer
at and a cross-eyed guy were
visi.com SLEEPING on a deserted
island, when ...
Nigel Rantor
2009-04-15 17:09:27 UTC
Permalink
Grant Edwards wrote:
> We all rail against premature optimization, but using a
> checksum instead of a direct comparison is premature
> unoptimization. ;)

And more than that, it will provide false positives for some inputs.

So, basically it's a worse-than-useless approach for determining if two
files are the same.

n
Grant Edwards
2009-04-15 14:04:35 UTC
Permalink
On 2009-04-15, Martin <martin at marcher.name> wrote:
> Hi,
>
> On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards <invalid at invalid> wrote:
>> On 2009-04-13, SpreadTooThin <bjobrien62 at gmail.com> wrote:
>>
>>> I want to compare two binary files and see if they are the same.
>>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
>>> that it is doing a byte by byte comparison of two files to see if they
>>> are they same.
>>
>> Perhaps I'm being dim, but how else are you going to decide if
>> two files are the same unless you compare the bytes in the
>> files?
>
> I'd say checksums, just about every download relies on checksums to
> verify you do have indeed the same file.

That's slower than a byte-by-byte compare.

>> You could hash them and compare the hashes, but that's a lot
>> more work than just comparing the two byte streams.
>
> hashing is not exactly much mork in it's simplest form it's 2
> lines per file.

I meant a lot more CPU time/cycles.

--
Grant Edwards grante Yow! Was my SOY LOAF left
at out in th'RAIN? It tastes
visi.com REAL GOOD!!
SpreadTooThin
2009-04-15 17:26:18 UTC
Permalink
On Apr 15, 8:04 am, Grant Edwards <invalid at invalid> wrote:
> On 2009-04-15, Martin <mar... at marcher.name> wrote:
>
>
>
> > Hi,
>
> > On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards <invalid at invalid> wrote:
> >> On 2009-04-13, SpreadTooThin <bjobrie... at gmail.com> wrote:
>
> >>> I want to compare two binary files and see if they are the same.
> >>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
> >>> that it is doing a byte by byte comparison of two files to see if they
> >>> are they same.
>
> >> Perhaps I'm being dim, but how else are you going to decide if
> >> two files are the same unless you compare the bytes in the
> >> files?
>
> > I'd say checksums, just about every download relies on checksums to
> > verify you do have indeed the same file.
>
> That's slower than a byte-by-byte compare.
>
> >> You could hash them and compare the hashes, but that's a lot
> >> more work than just comparing the two byte streams.
>
> > hashing is not exactly much mork in it's simplest form it's 2
> > lines per file.
>
> I meant a lot more CPU time/cycles.
>

I'd like to add my 2 cents here... (that's 1.8 cents US)
All I was trying to get was a clarification of the documentation of
the cmp method.
It isn't clear.

A byte-by-byte comparison is good enough for me as long as there are no
cache issues.
A simple checksum is no good because it can't tell reordered data apart
(1 + 2 + 3 == 3 + 2 + 1).
A CRC of any sort is more work than a byte-by-byte comparison and
doesn't give you any more information.
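
A one-line illustration of that ordering point (it applies to a naive
additive checksum, not to the cryptographic hashes discussed elsewhere in
the thread):

# Two different byte strings, same sum-of-bytes "checksum".
print(sum(bytearray(b'\x01\x02\x03')) == sum(bytearray(b'\x03\x02\x01')))  # True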
Adam Olsen
2009-04-15 19:05:39 UTC
Permalink
On Apr 15, 12:56 pm, Nigel Rantor <wig... at wiggly.org> wrote:
> Adam Olsen wrote:
> > The chance of *accidentally* producing a collision, although
> > technically possible, is so extraordinarily rare that it's completely
> > overshadowed by the risk of a hardware or software failure producing
> > an incorrect result.
>
> Not when you're using them to compare lots of files.
>
> Trust me. Been there, done that, got the t-shirt.
>
> Using hash functions to tell whether or not files are identical is an
> error waiting to happen.
>
> But please, do so if it makes you feel happy, you'll just eventually get
> an incorrect result and not know it.

Please tell us what hash you used and provide the two files that
collided.

If your hash is 256 bits, then you need around 2**128 files to produce
a collision. This is known as a Birthday Attack. I seriously doubt
you had that many files, which suggests something else went wrong.
Nigel Rantor
2009-04-16 09:16:02 UTC
Permalink
Adam Olsen wrote:
> On Apr 15, 12:56 pm, Nigel Rantor <wig... at wiggly.org> wrote:
>> Adam Olsen wrote:
>>> The chance of *accidentally* producing a collision, although
>>> technically possible, is so extraordinarily rare that it's completely
>>> overshadowed by the risk of a hardware or software failure producing
>>> an incorrect result.
>> Not when you're using them to compare lots of files.
>>
>> Trust me. Been there, done that, got the t-shirt.
>>
>> Using hash functions to tell whether or not files are identical is an
>> error waiting to happen.
>>
>> But please, do so if it makes you feel happy, you'll just eventually get
>> an incorrect result and not know it.
>
> Please tell us what hash you used and provide the two files that
> collided.

MD5

> If your hash is 256 bits, then you need around 2**128 files to produce
> a collision. This is known as a Birthday Attack. I seriously doubt
> you had that many files, which suggests something else went wrong.

Okay, before I tell you about the empirical, real-world evidence I have
could you please accept that hashes collide and that no matter how many
samples you use the probability of finding two files that do collide is
small but not zero.

Which is the only thing I've been saying.

Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice.

If you are of the opinion though that a hash function can be used to
tell you whether or not two files are identical then you are wrong. It
really is that simple.

I'm not sitting here discussing this for my health, I'm just trying to
give the OP the benefit of my experience, I have worked with other
people who insisted on this route and had to find out the hard way that
it was a Bad Idea (tm). They just wouldn't be told.

Regards,

Nige
Adam Olsen
2009-04-16 09:44:06 UTC
Permalink
On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
> Adam Olsen wrote:
> > On Apr 15, 12:56 pm, Nigel Rantor <wig... at wiggly.org> wrote:
> >> Adam Olsen wrote:
> >>> The chance of *accidentally* producing a collision, although
> >>> technically possible, is so extraordinarily rare that it's completely
> >>> overshadowed by the risk of a hardware or software failure producing
> >>> an incorrect result.
> >> Not when you're using them to compare lots of files.
>
> >> Trust me. Been there, done that, got the t-shirt.
>
> >> Using hash functions to tell whether or not files are identical is an
> >> error waiting to happen.
>
> >> But please, do so if it makes you feel happy, you'll just eventually get
> >> an incorrect result and not know it.
>
> > Please tell us what hash you used and provide the two files that
> > collided.
>
> MD5
>
> > If your hash is 256 bits, then you need around 2**128 files to produce
> > a collision. ?This is known as a Birthday Attack. ?I seriously doubt
> > you had that many files, which suggests something else went wrong.
>
> Okay, before I tell you about the empirical, real-world evidence I have
> could you please accept that hashes collide and that no matter how many
> samples you use the probability of finding two files that do collide is
> small but not zero.

I'm afraid you will need to back up your claims with real files.
Although MD5 is a smaller, older hash (128 bits, so you only need
2**64 files to find collisions), and it has substantial known
vulnerabilities, the scenario you suggest where you *accidentally*
find collisions (and you imply multiple collisions!) would be a rather
significant finding.

Please help us all by justifying your claim.

Mind you, since you use MD5 I wouldn't be surprised if your files were
maliciously produced. As I said before, you need to consider
upgrading your hash every few years to avoid new attacks.
Nigel Rantor
2009-04-16 10:12:57 UTC
Permalink
Adam Olsen wrote:
> On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
>> Adam Olsen wrote:
>>> On Apr 15, 12:56 pm, Nigel Rantor <wig... at wiggly.org> wrote:
>>>> Adam Olsen wrote:
>>>>> The chance of *accidentally* producing a collision, although
>>>>> technically possible, is so extraordinarily rare that it's completely
>>>>> overshadowed by the risk of a hardware or software failure producing
>>>>> an incorrect result.
>>>> Not when you're using them to compare lots of files.
>>>> Trust me. Been there, done that, got the t-shirt.
>>>> Using hash functions to tell whether or not files are identical is an
>>>> error waiting to happen.
>>>> But please, do so if it makes you feel happy, you'll just eventually get
>>>> an incorrect result and not know it.
>>> Please tell us what hash you used and provide the two files that
>>> collided.
>> MD5
>>
>>> If your hash is 256 bits, then you need around 2**128 files to produce
>>> a collision. This is known as a Birthday Attack. I seriously doubt
>>> you had that many files, which suggests something else went wrong.
>> Okay, before I tell you about the empirical, real-world evidence I have
>> could you please accept that hashes collide and that no matter how many
>> samples you use the probability of finding two files that do collide is
>> small but not zero.
>
> I'm afraid you will need to back up your claims with real files.
> Although MD5 is a smaller, older hash (128 bits, so you only need
> 2**64 files to find collisions), and it has substantial known
> vulnerabilities, the scenario you suggest where you *accidentally*
> find collisions (and you imply multiple collisions!) would be a rather
> significant finding.

No. It wouldn't. It isn't.

The files in question were millions of audio files. I no longer work at
the company where I had access to them so I cannot give you examples,
and even if I did Data Protection regulations wouldn't have allowed it.

If you still don't believe me you can easily verify what I'm saying by
doing some simple experiments. Go spider the web for images, keep
collecting them until you get an MD5 hash collision.

It won't take long.

> Please help us all by justifying your claim.

Now, please go and re-read my request first and admit that everything I
have said so far is correct.

> Mind you, since you use MD5 I wouldn't be surprised if your files were
> maliciously produced. As I said before, you need to consider
> upgrading your hash every few years to avoid new attacks.

Good grief, this is nothing to do with security concerns, this is about
someone suggesting to the OP that they use a hash function to determine
whether or not two files are identical.

Regards,

Nige
Grant Edwards
2009-04-16 14:59:19 UTC
Permalink
On 2009-04-16, Adam Olsen <rhamph at gmail.com> wrote:

>>>>> The chance of *accidentally* producing a collision, although
>>>>> technically possible, is so extraordinarily rare that it's
>>>>> completely overshadowed by the risk of a hardware or software
>>>>> failure producing an incorrect result.
>>>>
>>>> Not when you're using them to compare lots of files.
>>
>>>> Trust me. Been there, done that, got the t-shirt.

I must admit, that is a bit hard to believe.

>> Okay, before I tell you about the empirical, real-world
>> evidence I have could you please accept that hashes collide
>> and that no matter how many samples you use the probability of
>> finding two files that do collide is small but not zero.
>
> I'm afraid you will need to back up your claims with real files.
> Although MD5 is a smaller, older hash (128 bits, so you only need
> 2**64 files to find collisions),

You don't need quite that many to have a significant chance of
a collision. With "only" something on the order of 2**61
files, you still have about a 1% chance of a collision.

Just for fun here are a few data points showing the probability
of at least one collision in a sample of size n using a perfect
128-bit hash. These were calculated using the taylor-series
approximation shown at

http://en.wikipedia.org/wiki/Birthday_paradox#Calculating_the_probability

For "a few million files" (we'll say 4e6), the probability of a
collision is so close to 0 that it can't be calculated using
double-precision IEEE floats.

For a few _billion_ files, the taylor series approximation is
still 0 to about 15 significant digits.

For a few _trillion_ files, you finally get a "non-zero"
probability of about 2e-14.

Here are a few more points:

n = 1806791225998070; p(n) = 0.00000000
n = 1987470348597877; p(n) = 0.00000001
n = 2186217383457665; p(n) = 0.00000001
n = 2404839121803431; p(n) = 0.00000001
n = 2645323033983774; p(n) = 0.00000001
n = 2909855337382151; p(n) = 0.00000001
n = 3200840871120366; p(n) = 0.00000002
n = 3520924958232403; p(n) = 0.00000002
n = 3873017454055643; p(n) = 0.00000002
n = 4260319199461207; p(n) = 0.00000003
n = 4686351119407328; p(n) = 0.00000003
n = 5154986231348061; p(n) = 0.00000004
n = 5670484854482868; p(n) = 0.00000005
n = 6237533339931155; p(n) = 0.00000006
n = 6861286673924271; p(n) = 0.00000007
n = 7547415341316699; p(n) = 0.00000008
n = 8302156875448370; p(n) = 0.00000010
n = 9132372562993208; p(n) = 0.00000012
n = 10045609819292530; p(n) = 0.00000015
n = 11050170801221784; p(n) = 0.00000018
n = 12155187881343964; p(n) = 0.00000022
n = 13370706669478362; p(n) = 0.00000026
n = 14707777336426200; p(n) = 0.00000032
n = 16178555070068822; p(n) = 0.00000038
n = 17796410577075706; p(n) = 0.00000047
n = 19576051634783280; p(n) = 0.00000056
n = 21533656798261608; p(n) = 0.00000068
n = 23687022478087772; p(n) = 0.00000082
n = 26055724725896552; p(n) = 0.00000100
n = 28661297198486208; p(n) = 0.00000121
n = 31527426918334832; p(n) = 0.00000146
n = 34680169610168320; p(n) = 0.00000177
n = 38148186571185152; p(n) = 0.00000214
n = 41963005228303672; p(n) = 0.00000259
n = 46159305751134040; p(n) = 0.00000313
n = 50775236326247448; p(n) = 0.00000379
n = 55852759958872200; p(n) = 0.00000458
n = 61438035954759424; p(n) = 0.00000555
n = 67581839550235368; p(n) = 0.00000671
n = 74340023505258912; p(n) = 0.00000812
n = 81774025855784816; p(n) = 0.00000983
n = 89951428441363312; p(n) = 0.00001189
n = 98946571285499648; p(n) = 0.00001439
n = 108841228414049616; p(n) = 0.00001741
n = 119725351255454592; p(n) = 0.00002106
n = 131697886381000064; p(n) = 0.00002548
n = 144867675019100096; p(n) = 0.00003084
n = 159354442521010112; p(n) = 0.00003731
n = 175289886773111136; p(n) = 0.00004515
n = 192818875450422272; p(n) = 0.00005463
n = 212100762995464512; p(n) = 0.00006610
n = 233310839295010976; p(n) = 0.00007998
n = 256641923224512096; p(n) = 0.00009678
n = 282306115546963328; p(n) = 0.00011710
n = 310536727101659712; p(n) = 0.00014169
n = 341590399811825728; p(n) = 0.00017144
n = 375749439793008320; p(n) = 0.00020744
n = 413324383772309184; p(n) = 0.00025099
n = 454656822149540160; p(n) = 0.00030369
n = 500122504364494208; p(n) = 0.00036745
n = 550134754800943680; p(n) = 0.00044460
n = 605148230281038080; p(n) = 0.00053794
n = 665663053309141888; p(n) = 0.00065088
n = 732229358640056192; p(n) = 0.00078751
n = 805452294504061824; p(n) = 0.00095280
n = 885997523954468096; p(n) = 0.00115278
n = 974597276349915008; p(n) = 0.00139469
n = 1072057003984906624; p(n) = 0.00168733
n = 1179262704383397376; p(n) = 0.00204131
n = 1297188974821737216; p(n) = 0.00246945
n = 1426907872303911168; p(n) = 0.00298726
n = 1569598659534302464; p(n) = 0.00361345
n = 1726558525487732736; p(n) = 0.00437061
n = 1899214378036506112; p(n) = 0.00528601
n = 2089135815840156928; p(n) = 0.00639252
n = 2298049397424172800; p(n) = 0.00772975
n = 2527854337166590464; p(n) = 0.00934539
n = 2780639770883249664; p(n) = 0.01129680

Here's the Python function I'm using:

from math import exp

def bp(n, d):
    return 1.0 - exp(-n*(n-1.)/(2.*d))

I haven't spent much time studying the numerical issues of the
way that the exponent is calculated, so I'm not entirely
confident in the results for "small" n values such that
p(n) == 0.0.
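
As a quick sanity check against the table above, the function can be
evaluated at a couple of the sample sizes mentioned (a sketch; the exact
digits depend on floating-point behaviour):

print(bp(4e6, 2**128))                  # "a few million files": underflows to 0.0
print(bp(2780639770883249664, 2**128))  # last row of the table: roughly 0.0113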

--
Grant Edwards grante Yow! ... the MYSTERIANS are
at in here with my CORDUROY
visi.com SOAP DISH!!
Adam Olsen
2009-04-16 19:03:43 UTC
Permalink
On Apr 16, 8:59 am, Grant Edwards <invalid at invalid> wrote:
> On 2009-04-16, Adam Olsen <rha... at gmail.com> wrote:
> > I'm afraid you will need to back up your claims with real files.
> > Although MD5 is a smaller, older hash (128 bits, so you only need
> > 2**64 files to find collisions),
>
> You don't need quite that many to have a significant chance of
> a collision. ?With "only" something on the order of 2**61
> files, you still have about a 1% chance of a collision.

Aye, 2**64 is more of the middle of the curve or so. You can still go
either way. What's important is the order of magnitude required.


> For "a few million files" (we'll say 4e6), the probability of a
> collision is so close to 0 that it can't be calculated using
> double-precision IEEE floats.

Roughly 0.000000000000000000000000023509887

Or 42535296000000000000000000 to 1.

Or 42 trillion trillion to 1.


> Here's the Python function I'm using:
>
> def bp(n, d):
>     return 1.0 - exp(-n*(n-1.)/(2.*d))
>
> I haven't spent much time studying the numerical issues of the
> way that the exponent is calculated, so I'm not entirely
> confident in the results for "small" n values such that
> p(n) == 0.0.

Try using Qalculate. I always resort to it for things like this.
Rhodri James
2009-04-16 22:27:46 UTC
Permalink
On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rhamph at gmail.com> wrote:

> On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
>> Okay, before I tell you about the empirical, real-world evidence I have
>> could you please accept that hashes collide and that no matter how many
>> samples you use the probability of finding two files that do collide is
>> small but not zero.
>
> I'm afraid you will need to back up your claims with real files.

So that would be a "no" then. If the implementation of dicts in Python,
say, were to assert as you are that the hashes aren't going to collide,
then I'd have to walk away from it. There's no point in using something
that guarantees a non-zero chance of corrupting your data.

Why are you advocating a solution to the OP's problem that is more
computationally expensive than a simple byte-by-byte comparison and
doesn't guarantee to give the correct answer?

--
Rhodri James *-* Wildebeeste Herder to the Masses
Adam Olsen
2009-04-17 04:44:42 UTC
Permalink
On Apr 16, 4:27 pm, "Rhodri James" <rho... at wildebst.demon.co.uk>
wrote:
> On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rha... at gmail.com> wrote:
> > On Apr 16, 3:16?am, Nigel Rantor <wig... at wiggly.org> wrote:
> >> Okay, before I tell you about the empirical, real-world evidence I have
> >> could you please accept that hashes collide and that no matter how many
> >> samples you use the probability of finding two files that do collide is
> >> small but not zero.
>
> > I'm afraid you will need to back up your claims with real files.
>
> So that would be a "no" then. ?If the implementation of dicts in Python,
> say, were to assert as you are that the hashes aren't going to collide,
> then I'd have to walk away from it. ?There's no point in using something
> that guarantees a non-zero chance of corrupting your data.

Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or
65 thousand) will give you a decent chance of a collision. In
contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or
2**256 (SHA-512). The two are at totally different extremes.

There is *always* a non-zero chance of corruption, due to software
bugs, hardware defects, or even operator error. It is only in that
broader context that you can realize just how minuscule the risk is.

Can you explain to me why you justify great lengths of paranoia, when
the risk is so much lower?


> Why are you advocating a solution to the OP's problem that is more
> computationally expensive than a simple byte-by-byte comparison and
> doesn't guarantee to give the correct answer?

For single, one-off comparison I have no problem with a byte-by-byte
comparison. There's a decent chance the files won't be in the OS's
cache anyway, so disk IO will be your bottleneck.

Only if you're doing multiple comparisons is a hash database
justified. Even then, if you expect matching files to be fairly rare
I won't lose any sleep if you're paranoid and do a byte-by-byte
comparison anyway. New vulnerabilities are found, and if you don't
update promptly there is a small (but significant) chance of a
malicious file leading to collision.

That's not my concern though. What I'm responding to is Nigel
Rantor's grossly incorrect statements about probability. The chance
of collision, in our life time, is *insignificant*.

The Wayback Machine has 150 billion pages, so 2**37. Google's index
is a bit larger at over a trillion pages, so 2**40. A little closer
than I'd like, but that's still 562949950000000 to 1 odds of having
*any* collisions between *any* of the files. Step up to SHA-256 and
it becomes 191561940000000000000000000000000000000000000000000000 to
1. Sadly, I can't even give you the odds for SHA-512, Qalculate
considers that too close to infinite to display. :)

You should worry more about your head spontaneously exploding than you
should about a hash collision on that scale. To do otherwise is
irrational paranoia.
Nigel Rantor
2009-04-17 10:49:09 UTC
Permalink
Adam Olsen wrote:
> On Apr 16, 4:27 pm, "Rhodri James" <rho... at wildebst.demon.co.uk>
> wrote:
>> On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rha... at gmail.com> wrote:
>>> On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
>>>> Okay, before I tell you about the empirical, real-world evidence I have
>>>> could you please accept that hashes collide and that no matter how many
>>>> samples you use the probability of finding two files that do collide is
>>>> small but not zero.
>>> I'm afraid you will need to back up your claims with real files.
>> So that would be a "no" then. If the implementation of dicts in Python,
>> say, were to assert as you are that the hashes aren't going to collide,
>> then I'd have to walk away from it. There's no point in using something
>> that guarantees a non-zero chance of corrupting your data.
>
> Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or
> 65 thousand) will give you a decent chance of a collision. In
> contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or
> 2**256 (SHA-512). The two are at totally different extremes.

I'm just going to go ahead and take the above as an admission by you
that the chance of collision is non-zero, and that if we accept that
fact you cannot rely on a hash function to tell you if two files are
identical.

Thanks.

> There is *always* a non-zero chance of corruption, due to software
> bugs, hardware defects, or even operator error. It is only in that
> broader context that you can realize just how minuscule the risk is.

Please explain why you're dragging the notion of corruption into this
when it seems to be beside the point?

> Can you explain to me why you justify great lengths of paranoia, when
> the risk is so much lower?

Because in the real world, where I work, in practical, real, factual
terms I have seen it happen. Not once. Not twice. But many, many times.

>> Why are you advocating a solution to the OP's problem that is more
>> computationally expensive than a simple byte-by-byte comparison and
>> doesn't guarantee to give the correct answer?
>
> For single, one-off comparison I have no problem with a byte-by-byte
> comparison. There's a decent chance the files won't be in the OS's
> cache anyway, so disk IO will be your bottleneck.
>
> Only if you're doing multiple comparisons is a hash database
> justified. Even then, if you expect matching files to be fairly rare
> I won't lose any sleep if you're paranoid and do a byte-by-byte
> comparison anyway. New vulnerabilities are found, and if you don't
> update promptly there is a small (but significant) chance of a
> malicious file leading to collision.

If I have a number of files then I would certainly use a hash as a quick
test, but if two files hash to the same value I still have to go compare
them. Hashing in this case saves time doing a comparison for each file
but it doesn't obviate the need to do a byte-by-byte comparison to see
if the files that hash to the same value are actually the same.
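
A sketch of that two-stage idea, with the hash as a cheap filter and a
byte-by-byte check to confirm (the function name and layout are
illustrative only):

import filecmp
import hashlib

def find_duplicates(paths):
    # Stage 1: group files by digest; only files in the same group can
    # possibly be identical.
    by_digest = {}
    for path in paths:
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        by_digest.setdefault(h.hexdigest(), []).append(path)

    # Stage 2: confirm each candidate pair byte-by-byte, so a hash
    # collision can never cause two different files to be reported equal.
    duplicates = []
    for group in by_digest.values():
        first = group[0]
        for other in group[1:]:
            if filecmp.cmp(first, other, shallow=False):
                duplicates.append((first, other))
    return duplicates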

> That's not my concern though. What I'm responding to is Nigel
> Rantor's grossly incorrect statements about probability. The chance
> of collision, in our life time, is *insignificant*.

Please tell me which statements? That the chance of two files hashing to
the same value is non-zero? You admit as much above.

Also, please remember I gave you a way of verifying what I said, go
crawl the web for images, pages, whatever, build a hash DB and tell me
how long it takes you to find a collision using MD5 (which is the hash I
was talking about when I told you I had real-world experience to back
up the theoretical claim that collisions occur).

Regards,

Nigel
Tim Wintle
2009-04-17 11:30:49 UTC
Permalink
On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote:
> The Wayback Machine has 150 billion pages, so 2**37. Google's index
> is a bit larger at over a trillion pages, so 2**40. A little closer
> than I'd like, but that's still 562949950000000 to 1 odds of having
> *any* collisions between *any* of the files. Step up to SHA-256 and
> it becomes 191561940000000000000000000000000000000000000000000000 to
> 1. Sadly, I can't even give you the odds for SHA-512, Qalculate
> considers that too close to infinite to display. :)

That might be true as long as your data is completely uniformly
distributed. For the example you give there's:

a) a high chance that there's "<html>" near the top

b) a non-uniform distribution of individual words within the text.

c) a non-uniform distribution of all n-grams within the text (as there is
in natural language)

So it's very far from uniformly distributed. Just about the only
situation where I could imagine that holding would be where you are
hashing uniformly random data for the sake of testing the hash.


I believe the point being made is that comparing hash values is a
probabilistic algorithm anyway, which is fine if you're ok with that,
but for mission critical software it's crazy.

Tim Wintle
Adam Olsen
2009-04-17 18:19:31 UTC
Permalink
On Apr 17, 5:30 am, Tim Wintle <tim.win... at teamrubber.com> wrote:
> On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote:
> > The Wayback Machine has 150 billion pages, so 2**37. ?Google's index
> > is a bit larger at over a trillion pages, so 2**40. ?A little closer
> > than I'd like, but that's still 562949950000000 to 1 odds of having
> > *any* collisions between *any* of the files. ?Step up to SHA-256 and
> > it becomes 191561940000000000000000000000000000000000000000000000 to
> > 1. ?Sadly, I can't even give you the odds for SHA-512, Qalculate
> > considers that too close to infinite to display. :)
>
> That might be true as long as your data is completely uniformly
> distributed. For the example you give there's:
>
> a) a high chance that there's "<html>" near the top
>
> b) a non-uniform distribution of individual words within the text.
>
> c) a non-unifom distribution of all n-grams within the text (as there is
> in natural language)
>
> So it's very far from uniformly distributed. Just about the only
> situation where I could imagine that holding would be where you are
> hashing uniformly random data for the sake of testing the hash.
>
> I believe the point being made is that comparing hash values is a
> probabilistic algorithm anyway, which is fine if you're ok with that,
> but for mission critical software it's crazy.

Actually, *cryptographic* hashes handle that just fine. Even for
files with just a 1 bit change the output is totally different. This
is known as the Avalanche Effect. Otherwise they'd be vulnerable to
attacks.

Which isn't to say you couldn't *construct* a pattern that it would be
vulnerable to. Figuring that out is pretty much the whole point of
attacking a cryptographic hash. MD5 has significant vulnerabilities
by now, and others will in the future. That's just a risk you need to
manage.
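
A quick illustration of the avalanche effect he describes (the inputs are
arbitrary; they differ by a single bit):

import hashlib

a = b'comparing binary files 0'
b = b'comparing binary files 1'   # '0' (0x30) vs '1' (0x31): one bit apart

# Despite the one-bit difference, the digests share no visible structure.
print(hashlib.sha256(a).hexdigest())
print(hashlib.sha256(b).hexdigest())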
Steven D'Aprano
2009-04-18 03:47:20 UTC
Permalink
On Fri, 17 Apr 2009 11:19:31 -0700, Adam Olsen wrote:

> Actually, *cryptographic* hashes handle that just fine. Even for files
> with just a 1 bit change the output is totally different. This is known
> as the Avalanche Effect. Otherwise they'd be vulnerable to attacks.
>
> Which isn't to say you couldn't *construct* a pattern that it would be
> vulnerable to. Figuring that out is pretty much the whole point of
> attacking a cryptographic hash. MD5 has significant vulnerabilities by
> now, and other will in the future. That's just a risk you need to
> manage.

There are demonstrated methods for constructing MD5 collisions against an
arbitrary file in about a minute on a laptop computer.

http://en.wikipedia.org/wiki/MD5



--
Steven
Piet van Oostrum
2009-04-18 08:19:21 UTC
Permalink
>>>>> Adam Olsen <rhamph at gmail.com> (AO) wrote:

>AO> The Wayback Machine has 150 billion pages, so 2**37. Google's index
>AO> is a bit larger at over a trillion pages, so 2**40. A little closer
>AO> than I'd like, but that's still 562949950000000 to 1 odds of having
>AO> *any* collisions between *any* of the files. Step up to SHA-256 and
>AO> it becomes 191561940000000000000000000000000000000000000000000000 to
>AO> 1. Sadly, I can't even give you the odds for SHA-512, Qalculate
>AO> considers that too close to infinite to display. :)

>AO> You should worry more about your head spontaneously exploding than you
>AO> should about a hash collision on that scale. To do otherwise is
>AO> irrational paranoia.

And that is the probability of there being two files in that huge
collection with the same hash. If you just take two files, not
fabricated to collide, the probability of them having the same hash
under MD5 is 2**-128 which I think is way smaller than the probability
of the bit representing the answer being swapped by some physical cause
in your computer. But then again, it doesn't make sense to use that
instead of byte-by-byte comparison if the files are on the same machine.
--
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org
SpreadTooThin
2009-04-16 17:15:14 UTC
Permalink
On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
> Adam Olsen wrote:
> > On Apr 15, 12:56 pm, Nigel Rantor <wig... at wiggly.org> wrote:
> >> Adam Olsen wrote:
> >>> The chance of *accidentally* producing a collision, although
> >>> technically possible, is so extraordinarily rare that it's completely
> >>> overshadowed by the risk of a hardware or software failure producing
> >>> an incorrect result.
> >> Not when you're using them to compare lots of files.
>
> >> Trust me. Been there, done that, got the t-shirt.
>
> >> Using hash functions to tell whether or not files are identical is an
> >> error waiting to happen.
>
> >> But please, do so if it makes you feel happy, you'll just eventually get
> >> an incorrect result and not know it.
>
> > Please tell us what hash you used and provide the two files that
> > collided.
>
> MD5
>
> > If your hash is 256 bits, then you need around 2**128 files to produce
> > a collision. This is known as a Birthday Attack. I seriously doubt
> > you had that many files, which suggests something else went wrong.
>
> Okay, before I tell you about the empirical, real-world evidence I have
> could you please accept that hashes collide and that no matter how many
> samples you use the probability of finding two files that do collide is
> small but not zero.
>
> Which is the only thing I've been saying.
>
> Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice.
>
> If you are of the opinion though that a hash function can be used to
> tell you whether or not two files are identical then you are wrong. It
> really is that simple.
>
> I'm not sitting here discussing this for my health, I'm just trying to
> give the OP the benefit of my experience, I have worked with other
> people who insisted on this route and had to find out the hard way that
> it was a Bad Idea (tm). They just wouldn't be told.
>
> Regards,
>
>    Nige

And yes, he is right: CRCs, hashes, all have a probability of saying that
the files are identical when in fact they are not.
Adam Olsen
2009-04-16 19:39:57 UTC
Permalink
On Apr 16, 11:15 am, SpreadTooThin <bjobrie... at gmail.com> wrote:
> And yes, he is right: CRCs, hashes, all have a probability of saying that
> the files are identical when in fact they are not.

Here's the bottom line. It is either:

A) Several hundred years of mathematics and cryptography are wrong.
The birthday problem as described is incorrect, so a collision is far
more likely than 42 trillion trillion to 1. You are simply the first
person to have noticed it.

B) Your software was buggy, or possibly the input was maliciously
produced. Or, a really tiny chance that your particular files
contained a pattern that provoked bad behaviour from MD5.

Finding a specific limitation of the algorithm is one thing. Claiming
that the math is fundamentally wrong is quite another.
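
(The birthday arithmetic behind those odds is easy to check. A rough
sketch using the figures quoted earlier in the thread, about 2**40 files
and MD5's 128-bit digest, with the usual n**2/2 approximation for the
number of pairs; the function name is just for illustration:)

def collision_odds(num_files, hash_bits):
    # Birthday bound: P(any collision) is roughly pairs / 2**hash_bits,
    # so the odds against a collision are roughly the reciprocal.
    pairs = num_files * num_files // 2      # ~ n*(n-1)/2
    return 2 ** hash_bits // pairs

print(collision_odds(2 ** 40, 128))
# -> 562949953421312, i.e. the ~5.6e14-to-1 figure quoted above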
Nigel Rantor
2009-04-17 10:54:35 UTC
Permalink
Adam Olsen wrote:
> On Apr 16, 11:15 am, SpreadTooThin <bjobrie... at gmail.com> wrote:
>> And yes, he is right: CRCs, hashes, all have a probability of saying that
>> the files are identical when in fact they are not.
>
> Here's the bottom line. It is either:
>
> A) Several hundred years of mathematics and cryptography are wrong.
> The birthday problem as described is incorrect, so a collision is far
> more likely than 42 trillion trillion to 1. You are simply the first
> person to have noticed it.
>
> B) Your software was buggy, or possibly the input was maliciously
> produced. Or, a really tiny chance that your particular files
> contained a pattern that provoked bad behaviour from MD5.
>
> Finding a specific limitation of the algorithm is one thing. Claiming
> that the math is fundamentally wrong is quite another.

You are confusing yourself about probabilities, young man.

Just because something is extremely unlikely does not mean it can't
happen on the first attempt.

This is true *no matter how big the numbers are*.

If you persist in making these ridiculous claims that people *cannot*
have found collisions then, as I said, that's up to you, but I'm not
going to employ you to do anything except make tea.

Thanks,

Nigel
norseman
2009-04-17 15:59:27 UTC
Permalink
Adam Olsen wrote:
> On Apr 16, 11:15 am, SpreadTooThin <bjobrie... at gmail.com> wrote:
>> And yes, he is right: CRCs, hashes, all have a probability of saying that
>> the files are identical when in fact they are not.
>
> Here's the bottom line. It is either:
>
> A) Several hundred years of mathematics and cryptography are wrong.
> The birthday problem as described is incorrect, so a collision is far
> more likely than 42 trillion trillion to 1. You are simply the first
> person to have noticed it.
>
> B) Your software was buggy, or possibly the input was maliciously
> produced. Or, a really tiny chance that your particular files
> contained a pattern that provoked bad behaviour from MD5.
>
> Finding a specific limitation of the algorithm is one thing. Claiming
> that the math is fundamentally wrong is quite another.
>
================================
Spending a lifetime in applied math has taught me:
1) All applied math is finite.
2) Any algorithm failing to handle all contingencies is flawed.

The meaning of 1) is that it is limited in what it can actually do.
The meaning of 2) is that the designer missed or left out something.

Neither should be taken as bad. Both need to be accepted 'as is' and
the decision to use (when,where,conditions) based on the probability of
non-failure.


"...a pattern that provoked bad behavior... " does mean the algorithm is
incomplete and may be fundamentally wrong. Underscore "is" and "may".

The more complicated the math, the harder it is to keep a higher form of
math from checking (or improperly displacing) a lower one. Which, of
course, breaks the rules. Commonly called improper thinking. A number
of math teasers make use of that.



Steve
SpreadTooThin
2009-04-17 15:59:32 UTC
Permalink
On Apr 17, 4:54 am, Nigel Rantor <wig... at wiggly.org> wrote:
> Adam Olsen wrote:
> > On Apr 16, 11:15 am, SpreadTooThin <bjobrie... at gmail.com> wrote:
> >> And yes, he is right: CRCs, hashes, all have a probability of saying that
> >> the files are identical when in fact they are not.
>
> > Here's the bottom line. It is either:
>
> > A) Several hundred years of mathematics and cryptography are wrong.
> > The birthday problem as described is incorrect, so a collision is far
> > more likely than 42 trillion trillion to 1. You are simply the first
> > person to have noticed it.
>
> > B) Your software was buggy, or possibly the input was maliciously
> > produced. Or, a really tiny chance that your particular files
> > contained a pattern that provoked bad behaviour from MD5.
>
> > Finding a specific limitation of the algorithm is one thing. Claiming
> > that the math is fundamentally wrong is quite another.
>
> You are confusing yourself about probabilities, young man.
>
> Just because something is extremely unlikely does not mean it can't
> happen on the first attempt.
>
> This is true *no matter how big the numbers are*.
>
> If you persist in making these ridiculous claims that people *cannot*
> have found collisions then, as I said, that's up to you, but I'm not
> going to employ you to do anything except make tea.
>
> Thanks,
>
>    Nigel

You know this is just insane. I'd be satisfied with a CRC16 or
something in the situation I'm in.
I have two large files, one local and one remote. Transferring every
byte across the internet to be sure that the two files are identical
is just not feasible. If two servers one on one side and the other on
the other side both calculate the CRCs and transmit the CRCs for
comparison I'm happy.
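
(A sketch of that workflow with zlib.crc32 from the standard library,
computed in chunks so only the 32-bit checksum ever crosses the network;
the helper name and chunk size are arbitrary. As the next reply notes, a
cryptographic hash buys far better collision resistance for the same
amount of traffic:)

import zlib

def file_crc32(path, chunk_size=1024 * 1024):
    # Running CRC-32 of the whole file, fed to zlib.crc32 a chunk at a time.
    crc = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xffffffff   # mask so the value matches across Python versions

# Each server computes this locally; only the two checksums are compared.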
Adam Olsen
2009-04-17 18:32:34 UTC
Permalink
On Apr 17, 9:59 am, SpreadTooThin <bjobrie... at gmail.com> wrote:
> You know this is just insane. I'd be satisfied with a CRC16 or
> something in the situation I'm in.
> I have two large files, one local and one remote. Transferring every
> byte across the internet to be sure that the two files are identical
> is just not feasible. If two servers one on one side and the other on
> the other side both calculate the CRCs and transmit the CRCs for
> comparison I'm happy.

Definitely use a hash, ignore Nigel. SHA-256 or SHA-512. Or, if you
might need to update one of the files, look at rsync. Rsync still
uses MD4 and MD5 (optionally!), but they're fine in a trusted
environment.
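
(The hash-based version is the same chunked loop with hashlib; this
sketch assumes SHA-256, but any algorithm hashlib provides slots in the
same way. Each machine hashes its own copy and only the digests travel
over the network:)

import hashlib

def file_digest(path, chunk_size=1024 * 1024):
    # Incremental SHA-256 of a file that may be too large to read in one go.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

# Transfer the full file only if the local and remote digests differ.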
Adam Olsen
2009-04-17 18:29:59 UTC
Permalink
On Apr 17, 9:59 am, norseman <norse... at hughes.net> wrote:
> The more complicated the math, the harder it is to keep a higher form of
> math from checking (or improperly displacing) a lower one. Which, of
> course, breaks the rules. ?Commonly called improper thinking. A number
> of math teasers make use of that.

Of course, designing a hash is hard. That's why the *recommended*
ones get so many years of peer review and attempted attacks first.

I'd love it if Nigel provided evidence that MD5 was broken, I really
would. It'd be quite interesting to investigate, assuming malicious
content can be ruled out. Of course even he doesn't think that. He
claims that his 42 trillion trillion to 1 odds happened not just once,
but multiple times.