CRC was: Re: [HACKERS] beta testing version 
Author Message
 CRC was: Re: [HACKERS] beta testing version

Quote:
> This may be implemented very fast (if someone points me where
> I can find CRC func). And I could implement "physical log"
> till next monday.

I have been experimenting with CRCs for the past 6 month in our database for
internal logging purposes. Downloaded a lot of hash libraries, tried
different algorithms, and implemented a few myself. Which algorithm do you
want? Have a look at the openssl libraries (www.openssl.org) for a start -if
you don't find what you want let me know.

As the logging might include large data blocks, especially now that we can
TOAST our data, I would strongly suggest to use strong hashes like RIPEMD or
MD5 instead of CRC-32 and the like. Sure, it takes more time tocalculate and
more place on the hard disk, but then: a database without data integrity
(and means of _proofing_ integrity) is pretty worthless.

Horst



Mon, 26 May 2003 15:48:12 GMT
 CRC was: Re: [HACKERS] beta testing version

Quote:

> > This may be implemented very fast (if someone points me where
> > I can find CRC func). And I could implement "physical log"
> > till next monday.

> I have been experimenting with CRCs for the past 6 month in our database for
> internal logging purposes. Downloaded a lot of hash libraries, tried
> different algorithms, and implemented a few myself. Which algorithm do you
> want? Have a look at the openssl libraries (www.openssl.org) for a start -if
> you don't find what you want let me know.

> As the logging might include large data blocks, especially now that we can
> TOAST our data, I would strongly suggest to use strong hashes like RIPEMD or
> MD5 instead of CRC-32 and the like. Sure, it takes more time tocalculate and
> more place on the hard disk, but then: a database without data integrity
> (and means of _proofing_ integrity) is pretty worthless.

The choice of hash algoritm could be made a compile-time switch quite
easyly I guess.

---------
Hannu



Mon, 26 May 2003 18:38:43 GMT
 CRC was: Re: [HACKERS] beta testing version

Quote:

> > This may be implemented very fast (if someone points me where
> > I can find CRC func). And I could implement "physical log"
> > till next monday.

> As the logging might include large data blocks, especially now that
> we can TOAST our data, I would strongly suggest to use strong hashes
> like RIPEMD or MD5 instead of CRC-32 and the like.

Cryptographically-secure hashes are unnecessarily expensive to compute.
A simple 64-bit CRC would be of equal value, at much less expense.

Nathan Myers



Tue, 27 May 2003 04:07:27 GMT
 CRC was: Re: [HACKERS] beta testing version

Quote:
> > This may be implemented very fast (if someone points me where
> > I can find CRC func). And I could implement "physical log"
> > till next monday.

> I have been experimenting with CRCs for the past 6 month in
> our database for internal logging purposes. Downloaded a lot of
> hash libraries, tried different algorithms, and implemented a few
> myself. Which algorithm do you want? Have a look at the openssl
> libraries (www.openssl.org) for a start -if you don't find what
> you want let me know.

Thanks.

Quote:
> As the logging might include large data blocks, especially
> now that we can TOAST our data,

TOAST breaks data into a few 2K (or so) tuples to be inserted
separately. But first after checkpoint btree split will require
logging of 2x8K record -:(

Quote:
> I would strongly suggest to use strong hashes like RIPEMD or
> MD5 instead of CRC-32 and the like. Sure, it takes more time
> tocalculate and more place on the hard disk, but then: a database
> without data integrity (and means of _proofing_ integrity) is
> pretty worthless.

Other opinions? Also, we shouldn't forget licence issues.

Vadim



Tue, 27 May 2003 04:41:26 GMT
 CRC was: Re: [HACKERS] beta testing version

Quote:

>> I would strongly suggest to use strong hashes like RIPEMD or
>> MD5 instead of CRC-32 and the like.
> Other opinions? Also, we shouldn't forget licence issues.

I agree with whoever commented that crypto hashes are silly for this
application.  A 64-bit CRC *might* be enough stronger than a 32-bit
CRC to be worth the extra calculation, but frankly I doubt that too.

Remember that we are already sitting atop hardware that's really pretty
reliable, despite the carping that's been going on in this thread.  All
that we have to do is detect the infrequent case where a block of data
didn't get written due to system failure.  It's wildly pessimistic to
think that we might get called on to do so as much as once a day (if
you are trying to run a reliable database, and are suffering power
failures once a day, and haven't bought a UPS, you're a lost cause).
A 32-bit CRC will fail to detect such an error with a probability of
about 1 in 2^32.  So, a 32-bit CRC will have an MBTF of 2^32 days, or
11 million years, on the wildly pessimistic side --- real installations
probably 100 times better.  That's plenty for me, and improving the odds
to 2^64 or 2^128 is not worth any slowdown IMHO.

                        regards, tom lane



Tue, 27 May 2003 05:45:58 GMT
 CRC was: Re: [HACKERS] beta testing version

Quote:

> Remember that we are already sitting atop hardware that's really
> pretty reliable, despite the carping that's been going on in this
> thread. All that we have to do is detect the infrequent case where a
> block of data didn't get written due to system failure. It's wildly
> pessimistic to think that we might get called on to do so as much as
> once a day (if you are trying to run a reliable database, and are
> suffering power failures once a day, and haven't bought a UPS, you're
> a lost cause). A 32-bit CRC will fail to detect such an error with a
> probability of about 1 in 2^32. So, a 32-bit CRC will have an MBTF of
> 2^32 days, or 11 million years, on the wildly pessimistic side ---
> real installations probably 100 times better. That's plenty for me,
> and improving the odds to 2^64 or 2^128 is not worth any slowdown
> IMHO.

1. Computing a CRC-64 takes only about twice as long as a CRC-32, for
   2^32 times the confidence.  That's pretty cheap confidence.

2. I disagree with way the above statistics were computed.  That eleven
   million-year figure gets whittled down pretty quickly when you
   factor in all the sources of corruption, even without crashes.  
   (Power failures are only one of many sources of corruption.)  They
   grow with the size and activity of the database.  Databases are
   getting very large and busy indeed.

3. Many users clearly hope to be able to pull the plug on their hardware
   and get back up confidently.  While we can't promise they won't have
   to go to their backups, we should at least be equipped to promise,
   with confidence, that they will know whether they need to.

4. For a way to mark the "current final" log entry, you want a lot more
   confidence, because you read a lot more of them, and reading beyond
   the end may cause you to corrupt a currently-valid database, which
   seems a lot worse than just using a corrupted database.

Still, I agree that a 32-bit CRC is better than none at all.  

Nathan Myers



Tue, 27 May 2003 08:02:40 GMT
 CRC was: Re: [HACKERS] beta testing version

Quote:

> 2. I disagree with way the above statistics were computed.  That eleven
>    million-year figure gets whittled down pretty quickly when you
>    factor in all the sources of corruption, even without crashes.  
>    (Power failures are only one of many sources of corruption.)  They
>    grow with the size and activity of the database.  Databases are
>    getting very large and busy indeed.

Sure, but the argument still holds.  If the net MTBF of your underlying
system is less than a day, it's too unreliable to run a database that
you want to trust.  Doesn't matter what the contributing failure
mechanisms are.  In practice, I'd demand an MTBF of a lot more than a
day before I'd accept a hardware system as satisfactory...

Quote:
> 3. Many users clearly hope to be able to pull the plug on their hardware
>    and get back up confidently.  While we can't promise they won't have
>    to go to their backups, we should at least be equipped to promise,
>    with confidence, that they will know whether they need to.

And the difference in odds between 2^32 and 2^64 matters here?  I made
a numerical case that it doesn't, and you haven't refuted it.  By your
logic, we might as well say that we should be using a 128-bit CRC, or
256-bit, or heck, a few kilobytes.  It only takes a little longer to go
up each step, right, so where should you stop?  I say MTBF measured in
megayears ought to be plenty.  Show me the numerical argument that 64
bits is the right place on the curve.

Quote:
> 4. For a way to mark the "current final" log entry, you want a lot more
>    confidence, because you read a lot more of them,

You only need to make the distinction during a restart, so I don't
think that argument is correct.

                        regards, tom lane



Tue, 27 May 2003 08:39:11 GMT
 CRC was: Re: [HACKERS] beta testing version
--bp/iNruPH9dso1Pn
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Quote:

> 1. Computing a CRC-64 takes only about twice as long as a CRC-32, for
>    2^32 times the confidence.  That's pretty cheap confidence.

Incidentally, I benchmarked the previously mentioned 64-bit fingerprint,
the standard 32-bit CRC, MD5 and SHA, and the fastest algorithm on my
Celeron and on a PIII was MD5.  The 64-bit fingerprint was only a hair
slower, the CRC was (quite surprisingly) about 40% slower, and the
implementation of SHA that I had available was a real dog.  Taking an
arbitrary 32 bits of a MD5 would likely be less collision prone than
using a 32-bit CRC, and it appears faster as well.
--=20

--bp/iNruPH9dso1Pn
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.4 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE6MSY76W+y3GmZgOgRAsHAAJoDZ91+xKebYuObyUSZKTg7AHLAqgCdGQDD
Rb4jdsO2Inr/+b4EjobvRbM=
=RpGr
-----END PGP SIGNATURE-----

--bp/iNruPH9dso1Pn--



Wed, 28 May 2003 02:20:41 GMT
 CRC was: Re: [HACKERS] beta testing version
   Date: Fri, 8 Dec 2000 12:19:39 -0600

   Incidentally, I benchmarked the previously mentioned 64-bit fingerprint,
   the standard 32-bit CRC, MD5 and SHA, and the fastest algorithm on my
   Celeron and on a PIII was MD5.  The 64-bit fingerprint was only a hair
   slower, the CRC was (quite surprisingly) about 40% slower, and the
   implementation of SHA that I had available was a real dog.  Taking an
   arbitrary 32 bits of a MD5 would likely be less collision prone than
   using a 32-bit CRC, and it appears faster as well.

I just want to confirm that you used something like the fast 32-bit
CRC algorithm, appended.  The one posted earlier was accurate but
slow.

Ian

/*
 * Copyright (C) 1986 Gary S. Brown.  You may use this program, or
 * code or tables extracted from it, as desired without restriction.
 */


   Taylor UUCP.  */

#include "uucp.h"
#include "prot.h"

/* First, the polynomial itself and its table of feedback terms.  The  */
/* polynomial is                                                       */
/* X^32+X^26+X^23+X^22+X^16+X^12+X^11+X^10+X^8+X^7+X^5+X^4+X^2+X^1+X^0 */
/* Note that we take it "backwards" and put the highest-order term in  */
/* the lowest-order bit.  The X^32 term is "implied"; the LSB is the   */
/* X^31 term, etc.  The X^0 term (usually shown as "+1") results in    */
/* the MSB being 1.                                                    */

/* Note that the usual hardware shift register implementation, which   */
/* is what we're using (we're merely optimizing it by doing eight-bit  */
/* chunks at a time) shifts bits into the lowest-order term.  In our   */
/* implementation, that means shifting towards the right.  Why do we   */
/* do it this way?  Because the calculated CRC must be transmitted in  */
/* order from highest-order term to lowest-order term.  UARTs transmit */
/* characters in order from LSB to MSB.  By storing the CRC this way,  */
/* we hand it to the UART in the order low-byte to high-byte; the UART */
/* sends each low-bit to hight-bit; and the result is transmission bit */
/* by bit from highest- to lowest-order term without requiring any bit */
/* shuffling on our part.  Reception works similarly.                  */

/* The feedback terms table consists of 256, 32-bit entries.  Notes:   */
/*                                                                     */
/*     The table can be generated at runtime if desired; code to do so */
/*     is shown later.  It might not be obvious, but the feedback      */
/*     terms simply represent the results of eight shift/xor opera-    */
/*     tions for all combinations of data and CRC register values.     */
/*     [this code is no longer present--ian]                           */
/*                                                                     */
/*     The values must be right-shifted by eight bits by the "updcrc"  */
/*     logic; the shift must be unsigned (bring in zeroes).  On some   */
/*     hardware you could probably optimize the shift in assembler by  */
/*     using byte-swap instructions.                                   */

static const unsigned long aicrc32tab[] = { /* CRC polynomial 0xedb88320 */
0x00000000L, 0x77073096L, 0xee0e612cL, 0x990951baL, 0x076dc419L, 0x706af48fL, 0xe963a535L, 0x9e6495a3L,
0x0edb8832L, 0x79dcb8a4L, 0xe0d5e91eL, 0x97d2d988L, 0x09b64c2bL, 0x7eb17cbdL, 0xe7b82d07L, 0x90bf1d91L,
0x1db71064L, 0x6ab020f2L, 0xf3b97148L, 0x84be41deL, 0x1adad47dL, 0x6ddde4ebL, 0xf4d4b551L, 0x83d385c7L,
0x136c9856L, 0x646ba8c0L, 0xfd62f97aL, 0x8a65c9ecL, 0x14015c4fL, 0x63066cd9L, 0xfa0f3d63L, 0x8d080df5L,
0x3b6e20c8L, 0x4c69105eL, 0xd56041e4L, 0xa2677172L, 0x3c03e4d1L, 0x4b04d447L, 0xd20d85fdL, 0xa50ab56bL,
0x35b5a8faL, 0x42b2986cL, 0xdbbbc9d6L, 0xacbcf940L, 0x32d86ce3L, 0x45df5c75L, 0xdcd60dcfL, 0xabd13d59L,
0x26d930acL, 0x51de003aL, 0xc8d75180L, 0xbfd06116L, 0x21b4f4b5L, 0x56b3c423L, 0xcfba9599L, 0xb8bda50fL,
0x2802b89eL, 0x5f058808L, 0xc60cd9b2L, 0xb10be924L, 0x2f6f7c87L, 0x58684c11L, 0xc1611dabL, 0xb6662d3dL,
0x76dc4190L, 0x01db7106L, 0x98d220bcL, 0xefd5102aL, 0x71b18589L, 0x06b6b51fL, 0x9fbfe4a5L, 0xe8b8d433L,
0x7807c9a2L, 0x0f00f934L, 0x9609a88eL, 0xe10e9818L, 0x7f6a0dbbL, 0x086d3d2dL, 0x91646c97L, 0xe6635c01L,
0x6b6b51f4L, 0x1c6c6162L, 0x856530d8L, 0xf262004eL, 0x6c0695edL, 0x1b01a57bL, 0x8208f4c1L, 0xf50fc457L,
0x65b0d9c6L, 0x12b7e950L, 0x8bbeb8eaL, 0xfcb9887cL, 0x62dd1ddfL, 0x15da2d49L, 0x8cd37cf3L, 0xfbd44c65L,
0x4db26158L, 0x3ab551ceL, 0xa3bc0074L, 0xd4bb30e2L, 0x4adfa541L, 0x3dd895d7L, 0xa4d1c46dL, 0xd3d6f4fbL,
0x4369e96aL, 0x346ed9fcL, 0xad678846L, 0xda60b8d0L, 0x44042d73L, 0x33031de5L, 0xaa0a4c5fL, 0xdd0d7cc9L,
0x5005713cL, 0x270241aaL, 0xbe0b1010L, 0xc90c2086L, 0x5768b525L, 0x206f85b3L, 0xb966d409L, 0xce61e49fL,
0x5edef90eL, 0x29d9c998L, 0xb0d09822L, 0xc7d7a8b4L, 0x59b33d17L, 0x2eb40d81L, 0xb7bd5c3bL, 0xc0ba6cadL,
0xedb88320L, 0x9abfb3b6L, 0x03b6e20cL, 0x74b1d29aL, 0xead54739L, 0x9dd277afL, 0x04db2615L, 0x73dc1683L,
0xe3630b12L, 0x94643b84L, 0x0d6d6a3eL, 0x7a6a5aa8L, 0xe40ecf0bL, 0x9309ff9dL, 0x0a00ae27L, 0x7d079eb1L,
0xf00f9344L, 0x8708a3d2L, 0x1e01f268L, 0x6906c2feL, 0xf762575dL, 0x806567cbL, 0x196c3671L, 0x6e6b06e7L,
0xfed41b76L, 0x89d32be0L, 0x10da7a5aL, 0x67dd4accL, 0xf9b9df6fL, 0x8ebeeff9L, 0x17b7be43L, 0x60b08ed5L,
0xd6d6a3e8L, 0xa1d1937eL, 0x38d8c2c4L, 0x4fdff252L, 0xd1bb67f1L, 0xa6bc5767L, 0x3fb506ddL, 0x48b2364bL,
0xd80d2bdaL, 0xaf0a1b4cL, 0x36034af6L, 0x41047a60L, 0xdf60efc3L, 0xa867df55L, 0x316e8eefL, 0x4669be79L,
0xcb61b38cL, 0xbc66831aL, 0x256fd2a0L, 0x5268e236L, 0xcc0c7795L, 0xbb0b4703L, 0x220216b9L, 0x5505262fL,
0xc5ba3bbeL, 0xb2bd0b28L, 0x2bb45a92L, 0x5cb36a04L, 0xc2d7ffa7L, 0xb5d0cf31L, 0x2cd99e8bL, 0x5bdeae1dL,
0x9b64c2b0L, 0xec63f226L, 0x756aa39cL, 0x026d930aL, 0x9c0906a9L, 0xeb0e363fL, 0x72076785L, 0x05005713L,
0x95bf4a82L, 0xe2b87a14L, 0x7bb12baeL, 0x0cb61b38L, 0x92d28e9bL, 0xe5d5be0dL, 0x7cdcefb7L, 0x0bdbdf21L,
0x86d3d2d4L, 0xf1d4e242L, 0x68ddb3f8L, 0x1fda836eL, 0x81be16cdL, 0xf6b9265bL, 0x6fb077e1L, 0x18b74777L,
0x88085ae6L, 0xff0f6a70L, 0x66063bcaL, 0x11010b5cL, 0x8f659effL, 0xf862ae69L, 0x616bffd3L, 0x166ccf45L,
0xa00ae278L, 0xd70dd2eeL, 0x4e048354L, 0x3903b3c2L, 0xa7672661L, 0xd06016f7L, 0x4969474dL, 0x3e6e77dbL,
0xaed16a4aL, 0xd9d65adcL, 0x40df0b66L, 0x37d83bf0L, 0xa9bcae53L, 0xdebb9ec5L, 0x47b2cf7fL, 0x30b5ffe9L,
0xbdbdf21cL, 0xcabac28aL, 0x53b39330L, 0x24b4a3a6L, 0xbad03605L, 0xcdd70693L, 0x54de5729L, 0x23d967bfL,
0xb3667a2eL, 0xc4614ab8L, 0x5d681b02L, 0x2a6f2b94L, 0xb40bbe37L, 0xc30c8ea1L, 0x5a05df1bL, 0x2d02ef8dL

Quote:
};

/*
 * IUPDC32 macro derived from article Copyright (C) 1986 Stephen Satchell.
 *  NOTE: First argument must be in range 0 to 255.
 *        Second argument is referenced twice.
 *
 * Programmers may incorporate any or all code into their programs,
 * giving proper credit within the source. Publication of the
 * source routines is permitted so long as proper credit is given
 * to Stephen Satchell, Satchell Evaluations and Chuck Forsberg,
 * Omen Technology.
 */

#define IUPDC32(b, ick) \
  (aicrc32tab[((int) (ick) ^ (b)) & 0xff] ^ (((ick) >> 8) & 0x00ffffffL))

unsigned long
icrc (z, c, ick)
     const char *z;
     size_t c;
     unsigned long ick;
{
  while (c > 4)
    {
      ick = IUPDC32 (*z++, ick);
      ick = IUPDC32 (*z++, ick);
      ick = IUPDC32 (*z++, ick);
      ick = IUPDC32 (*z++, ick);
      c -= 4;
    }
  while (c-- != 0)
    ick = IUPDC32 (*z++, ick);
  return ick;

Quote:
}



Wed, 28 May 2003 02:38:01 GMT
 CRC was: Re: [HACKERS] beta testing version
--A9z/3b/E4MkkD+7G
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Quote:

>    Incidentally, I benchmarked the previously mentioned 64-bit fingerprin=
t,
>    the standard 32-bit CRC, MD5 and SHA, and the fastest algorithm on my
>    Celeron and on a PIII was MD5.  The 64-bit fingerprint was only a hair
>    slower, the CRC was (quite surprisingly) about 40% slower, and the
>    implementation of SHA that I had available was a real dog.  Taking an
>    arbitrary 32 bits of a MD5 would likely be less collision prone than
>    using a 32-bit CRC, and it appears faster as well.
>=20
> I just want to confirm that you used something like the fast 32-bit
> CRC algorithm, appended.  The one posted earlier was accurate but
> slow.

Yes.  I just rebuilt the framework using this exact code, and it
performed identically to the previous CRC code (which didn't have an
unrolled inner loop).  These were compiled with -O6 with egcs 1.1.2.
--=20

--A9z/3b/E4MkkD+7G
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.4 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE6MTAH6W+y3GmZgOgRAly7AJsFT7v+c7YrPrw67GrjrMr9Ho1yxgCaAkqu
hF6uAa7UXhYJtKV77lZ/r1M=
=G3Bq
-----END PGP SIGNATURE-----

--A9z/3b/E4MkkD+7G--



Wed, 28 May 2003 03:04:29 GMT
 CRC was: Re: [HACKERS] beta testing version
--AGZzQgpsuUlWC1xT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Quote:


> > ... Taking an
> > arbitrary 32 bits of a MD5 would likely be less collision prone than
> > using a 32-bit CRC, and it appears faster as well.
>=20
> ... but that would be an algorithm that you know NOTHING about the
> properties of.  What is your basis for asserting it's better than CRC?

MD5 is a cryptographic hash, which means (AFAIK) that ideally it is
impossible to produce a collision using any other method than brute
force attempts.  In other words, any stream of input to the hash that is
longer than the hash length (8 bytes for MD5) is equally probable to
produce a given hash code.

Quote:
> CRC is pretty well studied and its error-detection behavior is known
> (and good).  MD5 has been studied less thoroughly AFAIK, and in any
> case what's known about its behavior is that the *entire* MD5 output
> provides a good signature for a datastream.  If you pick some ad-hoc
> method like taking a randomly chosen subset of MD5's output bits,
> you really don't know anything at all about what the error-detection
> properties of the method are.

Actually, in my reading reagarding the properties of MD5, I read an
article that stated that if a smaller number of bits was desired, one
could either (and here's where my memory fails me) just select the
middle N bits from the resulting hash, or fold the hash using XOR until
the desired number of bits was reached.  I'll see if I can find a
reference...

RFC2289 (http://www.ietf.org/rfc/rfc2289.txt) includes an algorithm for
folding MD5 digests down to 64 bits by XORing the top half with the
bottom half.  See appendix A.
--=20

--AGZzQgpsuUlWC1xT
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.4 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE6MTQK6W+y3GmZgOgRAtUTAJsETWwvBZkZHh/1f4xHdFCFRkoegACfcx/F
Zw+y6RT9ZUe1anq8isRJQ2I=
=BVNd
-----END PGP SIGNATURE-----

--AGZzQgpsuUlWC1xT--



Wed, 28 May 2003 03:20:22 GMT
 CRC was: Re: [HACKERS] beta testing version

Quote:

> ... Taking an
> arbitrary 32 bits of a MD5 would likely be less collision prone than
> using a 32-bit CRC, and it appears faster as well.

... but that would be an algorithm that you know NOTHING about the
properties of.  What is your basis for asserting it's better than CRC?

CRC is pretty well studied and its error-detection behavior is known
(and good).  MD5 has been studied less thoroughly AFAIK, and in any
case what's known about its behavior is that the *entire* MD5 output
provides a good signature for a datastream.  If you pick some ad-hoc
method like taking a randomly chosen subset of MD5's output bits,
you really don't know anything at all about what the error-detection
properties of the method are.

I am reminded of Knuth's famous advice about random number generators:
"Random numbers should not be generated with a method chosen at random.
Some theory should be used."  Error-detection codes, like random-number
generators, have decades of theory behind them.  Seat-of-the-pants
tinkering, even if it starts with a known-good method, is not likely to
produce an improvement.

                        regards, tom lane



Wed, 28 May 2003 03:53:16 GMT
 CRC was: Re: [HACKERS] beta testing version

Quote:

> MD5 is a cryptographic hash, which means (AFAIK) that ideally it is
> impossible to produce a collision using any other method than brute
> force attempts.

True but irrelevant.  What we need to worry about is the probability
that a random error will be detected, not the computational effort that
a malicious attacker would need in order to insert an undetectable
error.

MD5 is designed for a purpose that really doesn't have much to do with
error detection, when you think about it.  It says "you will have a hard
time computing a different string that produces the same hash as some
prespecified string".  This is not the same as promising
better-than-random odds against a damaged copy of some string having the
same hash as the original.  CRC, on the other hand, is specifically
designed for error detection, and for localized errors (such as a
corrupted byte or two) it does a provably better-than-random job.
For nonlocalized errors you don't get a guarantee, but you do get
same-as-random odds of detection (ie, 1 in 2^N for an N-bit CRC).
I really doubt that MD5 can beat a CRC with the same number of output
bits for the purpose of error detection; given the lack of guarantee
about short burst errors, I doubt it's even as good.  (Wild-pointer
stomps on disk buffers are an example of the sort of thing that may
look like a burst error.)

Now, if you are worried about crypto-capable gremlins living in your
file system, maybe what you want is MD5.  But I'm inclined to think that
CRC is more appropriate for the job at hand.

                        regards, tom lane



Wed, 28 May 2003 04:39:01 GMT
 CRC was: Re: [HACKERS] beta testing version
--NMuMz9nt05w80d4+
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Quote:


> > MD5 is a cryptographic hash, which means (AFAIK) that ideally it is
> > impossible to produce a collision using any other method than brute
> > force attempts.
> True but irrelevant.  What we need to worry about is the probability
> that a random error will be detected,

Which I indicated immediately after the sentence you quoted.  The
probability that a random error will be detected is the same as the
probability of a collision in the hash given two different inputs.  The
brute force note means that the probability of a collision is as good as
random.

Quote:
> MD5 is designed for a purpose that really doesn't have much to do with
> error detection, when you think about it.  It says "you will have a hard
> time computing a different string that produces the same hash as some
> prespecified string".  This is not the same as promising
> better-than-random odds against a damaged copy of some string having the
> same hash as the original.

It does provide as-good-as-random odds against a damaged copy of some
string having the same hash as the original -- nobody has been able to
exhibit any collisions in MD5 (see http://cr.yp.to/papers/hash127.ps,
page 18 for notes on this).

Quote:
> CRC, on the other hand, is specifically
> designed for error detection, and for localized errors (such as a
> corrupted byte or two) it does a provably better-than-random job.
> For nonlocalized errors you don't get a guarantee, but you do get
> same-as-random odds of detection (ie, 1 in 2^N for an N-bit CRC).

For the log, the CRC's primary function (as far as I understand it)
would be to guard against inconsistent transaction being treated as
consistent data.  Such inconsistent transactions would be partially
written, resulting in errors much larger than a small burst.

For guarding the actual record data, I agree with you 100% -- what we're
likely to see is a few localized bytes with flipped bits due to hardware
failure of one kind or another.  However, if the data is really
critical, an ECC may be more appropriate, but that would make the data
significantly larger (9/8 for the algorithms I've seen).

Quote:
> I really doubt that MD5 can beat a CRC with the same number of output
> bits for the purpose of error detection;

Agreed.  However, MD5 provides four times as many bits as the standard
32-bit CRC.

(I think I initially suggested you could take an arbitrary 32 bits out
of MD5 to provide a check code "as good as CRC-32".  I now take that
back.  Due to the burst error nature of CRCs, nothing else could be as
good as it, unless the alternate algorithm also made some guarantees,
which MD5 definitely doesn't.)

Quote:
> (Wild-pointer
> stomps on disk buffers are an example of the sort of thing that may
> look like a burst error.)

Actually, wild-pointer incidents involving disk buffers at the kernel
level, from my experience, are characterized by content from one file
appearing in another, which is distinctly different than a burst error,
and more like what would be seen if a log record were partially written.
--=20

--NMuMz9nt05w80d4+
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.4 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE6MVJ56W+y3GmZgOgRAuWFAJ41HmfDwc9XtBis1q4kokqWch6T2gCgpb8P
9bbQUTCGnylo9fDwEPm9Z5c=
=ygUQ
-----END PGP SIGNATURE-----

--NMuMz9nt05w80d4+--



Wed, 28 May 2003 05:54:30 GMT
 
 [ 14 post ] 

 Relevant Pages 

1. CRC was: Re: [HACKERS] beta testing version

2. Fwd: Re: [HACKERS] CRC, hash & Co.

3. beta testing version (not really anymore ;)

4. Beta Test a new version of Ecora's Configuration Management Solution for Oracle

5. beta testing version

6. beta testing version

7. beta testing version

8. CRCs (was: beta testing version)

9. Open Items (was: RE: [HACKERS] Beta going well)

10. Open Items (was: RE: [HACKERS] Beta going well)

11. TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST

12. Open Items (was: RE: [HACKERS] Beta going well)


 
Powered by phpBB® Forum Software