Normalization, Natural Keys, Surrogate Keys 
Author Message
 Normalization, Natural Keys, Surrogate Keys

Quote:
>> Without sounding too dim, what you just said has confused me. It

seems like
you're denormalising the schema here?  Are you saying that you
shouldn't store a list of codes in a separate table (providing they're
not supposed to change)? <<

Whar we would really like is to have a CREATE DOMAIN statement that
would give us a domain for the attributes at the scheam level.  Thus,
ALL zipcodes would checked and validated the same way, UPC/EAN numbers
would be the same, etc. byt virtue of being declared to be of that
type.

Well, we don't have that in most SQL products, so we can use CHECK()
constraints or REFERENCES clauses to ensure that an attribute has only
valid values.  My rule of thumb -- and that is all it is -- is to put
short lists of
constant values into a CHECK() clause and long lists of varying values
into a auxillary code table.  But use one and only one table per code;
if a table has more than one code, then it is not normalized because
columns have more than one meaning.



Sat, 06 Nov 2004 01:11:44 GMT
 Normalization, Natural Keys, Surrogate Keys


Quote:

> >       This said, the only thing I see wrong with surrogate keys is
> > that they can cause you extra work. You must, in some cases, think
> > about the "properness" of both the surrogate and the natural,
> > alternate key.

> True.  But "thinking" about the surrogate is brain-dead.  Its the
> natural key that must be thought of and is always the hardest.

        Yes, but what I meant is that it is an additional worry, if not a
difficult one. :)

Quote:
> > Furthermore, it requires additional rules on the physical
> > implementation which could hamper performance

> There will be one extra index to maintain on every table.

        Agreed. Indices are the ones that will add to the performance
consideration.

Quote:
> > or which might not be doable depending on the engine used.

> Hm...  Haven't thought of this one.  Hm...  What engines would have
> issues?

        Don't know. Haven't used one where this would be an issue. It's a
CYA... :)

(Any opinions expressed are strictly mine only and not my employer's)
--------------------------------------------------------------------
Paul Tiseo, Intermediate Systems Programmer



Sat, 06 Nov 2004 02:00:14 GMT
 Normalization, Natural Keys, Surrogate Keys


Quote:
> If I didn't get paid so darned much money dumping surrogate keys and cleaning
> up the messes I'd actually be upset. And not once has there been a resulting
> performance problem that warranted recreating a surrogate key and instituting
> a fake primary key using a unique index with a not null constraint.

        The SSN is a surrogate key itself, and a BAD one at that. There
are duplicates out there. It is "just as good" as a surrogate. In fact,
it might be worse, because at least I can QA my surrogate, but I can't
QA the government's surrogate for Person.

        Secondly, as someone else explained, the error you illustrate of
duplication is NOT due to the use of a surrogate key, but of bad usage
of integrity operations (checks, triggers or stored procedures) that
would have prevented it.

(Any opinions expressed are strictly mine only and not my employer's)
--------------------------------------------------------------------
Paul Tiseo, Intermediate Systems Programmer



Sat, 06 Nov 2004 02:04:05 GMT
 Normalization, Natural Keys, Surrogate Keys

Quote:



> > If I didn't get paid so darned much money dumping surrogate keys and cleaning
> > up the messes I'd actually be upset. And not once has there been a resulting
> > performance problem that warranted recreating a surrogate key and instituting
> > a fake primary key using a unique index with a not null constraint.

>         The SSN is a surrogate key itself, and a BAD one at that. There
> are duplicates out there. It is "just as good" as a surrogate. In fact,
> it might be worse, because at least I can QA my surrogate, but I can't
> QA the government's surrogate for Person.

>         Secondly, as someone else explained, the error you illustrate of
> duplication is NOT due to the use of a surrogate key, but of bad usage
> of integrity operations (checks, triggers or stored procedures) that
> would have prevented it.

> (Any opinions expressed are strictly mine only and not my employer's)
> --------------------------------------------------------------------
> Paul Tiseo, Intermediate Systems Programmer


Just to dwell on a point ... the origin of a value, for example a Social Security
Number is not important. What is important is the way that the value is used within
your system.

I keep hearing that Social Security Numbers are not unique but have yet to come
upon a situation where that was true. But assuming it is and you are creating a
system for the XYZ company (lets say a health insurance provider or employer). If
you are going to set up your policies using Social Security Number as the method of
identifying a unique individual ... you had better know right up front that you
have two people with the same number. If you do not, if some system, somehow,
assigns different surrogate keys, you are going to have a huge mess with legal
implications somewhere down the line.

Personally; I'd rather have the system catch that one up front and with a blaring
of trumpets and beating of drums.

Daniel Morgan



Sat, 06 Nov 2004 03:47:21 GMT
 Normalization, Natural Keys, Surrogate Keys


Quote:
>> I know I would never use this manuafactured key because I
>> wouldn't be able to depend on it.  Is it a child to some other
>> table?  No.  It is just some number.  I don't think it would ever
>> get used.

>> You sound like you really want to go with natural keys.  Go with
>> it.  It certainly is a sound design.

>Kind of jumping in to the end of a discussion here, but...

>the surrogate keys look more and more to me like the data base is no
>longer relational. It smells a lot like a network model database.  I'm
>not saying we should never use surrogate keys. But using the real data
>can help avoid a lot of other problems as has already been mentioned,
>primarily "duplicates".

>This topic is dear to me right now, because I'm dealing with the clean
>up of data that uses ID's and we have all the problems associated with
>this "network model" style of database design: duplicates, invalid
>relations (the keys relate, but the "natural" data does not), orphans
>(child without a true parent).

The surrogate key has its uses (others disagree on that) but you do have
to recognise that there are risks. Any move away from the logical data
structure has a risk. It's perfectly possible to take two copies of the
same datum and attach two different surrogate key values. If you allow
this to happen then your data has been corrupted, and corrupted in a way
that's not immediately obvious.

Creating a surrogate key does not absolve you from the need to identify
a true key if it exists, and to check its uniqueness in your database.

--
Bernard Peek

In search of cognoscenti



Sat, 06 Nov 2004 04:01:07 GMT
 Normalization, Natural Keys, Surrogate Keys
Quote:
>> The SSN is a surrogate key itself, and a BAD one at that. There are

duplicates out there. <<

With Mexican illegals in Calfornia, Texas and other states, the
duplication rate might be as high as 5%, according to the last figures
I saw.  But the problem is that people do not want to know if their
workers are illegal.

Quote:
>>  It is "just as good" as a surrogate. In fact, it might be worse,

because at least I can QA my surrogate, but I can't QA the
government's surrogate for Person. <<

Actually, there are several websites that will validate SSN for you
and the US Government used to have a "Project Clean Data" where they
would send you files of SSNs with the date and place of issue.  You
can check the data yourself and might even have gotten mailing from
the Social Security Administration summarizing your account.



Sat, 06 Nov 2004 05:59:15 GMT
 Normalization, Natural Keys, Surrogate Keys


Quote:
> >> The SSN is a surrogate key itself, and a BAD one at that. There
are
> duplicates out there. <<

> With Mexican illegals in Calfornia, Texas and other states,

Singling out one ethnic group isn't very productive.... oh well ...
amazing ... no, actually it's not ...

Businesses operated by good old American citizens are generating
duplicate social security numbers for illegal aliens.
--
Pablo Sanchez, High-Performance Database Engineering

http://www.hpdbe.com
Available for short-term and long-term contracts



Sat, 06 Nov 2004 06:37:30 GMT
 Normalization, Natural Keys, Surrogate Keys


Quote:



> > >> The SSN is a surrogate key itself, and a BAD one at that. There
> are
> > duplicates out there. <<

> > With Mexican illegals in Calfornia, Texas and other states,

> Singling out one ethnic group isn't very productive.... oh well ...
> amazing ... no, actually it's not ...

> Businesses operated by good old American citizens are generating
> duplicate social security numbers for illegal aliens.

It's not even that.  Under the original scheme there was no method of
generating unique SSN's.  Since the first 3 digits were code digits, you had
a smaller set of unique numbers to chose from.  The original thinking was
that people would die off fast enough they'd have no proble recycling
numbers.

At this point I'm not aware that the SS Administration has handed out any
legit dupicates and I know they've taken steps in the past decade or so to
eliminate this problem.

Quote:
> --
> Pablo Sanchez, High-Performance Database Engineering

> http://www.hpdbe.com
> Available for short-term and long-term contracts



Sat, 06 Nov 2004 08:24:02 GMT
 Normalization, Natural Keys, Surrogate Keys

Quote:

> The surrogate key has its uses (others disagree on that) but you
> do have to recognise that there are risks. Any move away from the
> logical data structure has a risk. It's perfectly possible to
> take two copies of the same datum and attach two different
> surrogate key values.

Well, its perfectly possible to attach the wrong ssn to a person's row.
How can you be assured that the correct ssn gets tagged to the correct
person?  The arguments against surrogate keys haven't helped me change
my mind, although I wasn't the one asking.  It should be quite obvious
that surrogate keys are there as dummy keys so don't act like a dummy
and assume that their uniqueness defines the true uniqueness of the
row.

Quote:
> If you allow this to happen then your data has been corrupted, and
> corrupted in a way that's not immediately obvious.

What happens if a customer id is email address?  There are apps around
where you can't get a natural, non-changing primary key for the
customer.  But, a completely logical identifier for, say, an online app
could be an email address, or maybe even a credit card.  But, this
identifier can change, or they could have more than one.  What do you do
then?

Or, is the whole surrogate/natural argument centered around creating
keys for data that already has "natural" keys and has nothing to do with
new apps/data?

Quote:
> Creating a surrogate key does not absolve you from the need to
> identify a true key if it exists, and to check its uniqueness in
> your database.

Bingo!

--
Galen deForest Boyer
Sweet dreams and flying machines in pieces on the ground.



Sat, 06 Nov 2004 09:59:14 GMT
 Normalization, Natural Keys, Surrogate Keys


Quote:
> I keep hearing that Social Security Numbers are not unique but have
> yet to come upon a situation where that was true.

        Now that you have made such a statement in public, Murphy's Law
will come along and prove my point for me. :)

        Maybe it's just my luck, but I learned early on that as soon as
someone says something like: "Yeah, it's not truly unique, but it's good
enough for us." in a database implementation is the moment just before a
duplicate is entered... :)

Quote:
> If you do not, if some system, somehow, assigns different surrogate
> keys, you are going to have a huge mess with legal implications
> somewhere down the line.

        Well, the concept was the usefulness of surrogates to replace
complex, natural keys and the example chosen was SSNs used as PKs
instead of another common bad choice of last name/DOB. The example was
given to be used generally, and not for getting into the politics and
pros-and-cons of SSNs in particular.

        However, I'll allow myself the diversion to bring up three
counterpoints:

        First, do you know which of the two (ex: the new applicant and the
ten-year employee) is the legal holder of that erroneous SSN? Why should
the database be the draconic arbiter, rather than a human policy
officer? The best thing would be to be able to have both entered and
flagged for administrative attention giving you flexibility and legal
protection.

        Secondly, legal implication also exist if you use the SSN as a
basis to deny hiring or fire someone. From http://www.ssa.gov/: "Any
employer that uses the information SSA provides regarding name/SSN
verification to justify taking adverse action against an employee may
violate state or federal law and be subject to legal consequences." Is a
duplicate SSN immediate cause for suspicion that should affect hiring or
employement?

        Thirdly, and given the SSA doesn't even trust their own numbers,
do you want to prevent yourself from hiring that much-needed employee
because your database design is too tight and he/she's going to have to
do whatever he/she needs to with the Social Security Administration
rather than use your own surrogate key in your design? Along with my
first point, I'd say that policy makers should dictate the data model,
not the other way around.

        Personally, I don't use SSNs as PKs. I've heard too many horror
stories.

Quote:
> Personally; I'd rather have the system catch that one up front and
> with a blaring of trumpets and beating of drums.

        IMO, that's a policy-level decision that has no bearing on the
fact that surrogate keys that replace complex, natural keys can be a
good idea at times, depending on the model.

(Any opinions expressed are strictly mine only and not my employer's)
--------------------------------------------------------------------
Paul Tiseo, Intermediate Systems Programmer



Sat, 06 Nov 2004 22:01:17 GMT
 Normalization, Natural Keys, Surrogate Keys

Quote:
> Creating a surrogate key does not absolve you from the need to identify
> a true key if it exists, and to check its uniqueness in your database.

        Well said.

        Surrogate keys usage exists in two situations. First, if there is
no *easily identifiable* natural key, either complex or simple. Note
that everything must have a set of one or more attributes that are
unqiue for each instance of the entity, or else how can we distinguish
the instances of that entity in real-life anyways? However, sometimes,
the number of attributes can be high to get guaranteed uniqueness. This
begets the next situation, where there is a natural key, but it is very
complex and you use a surrogate for purposes of joins and modeling ease,
but not for purposes of uniqueness guarantee. The natural key must still
be monitored. The natural key is an alternate key to the surrogate, or
vice versa.

(Any opinions expressed are strictly mine only and not my employer's)
--------------------------------------------------------------------
Paul Tiseo, Intermediate Systems Programmer



Sat, 06 Nov 2004 22:09:26 GMT
 Normalization, Natural Keys, Surrogate Keys
I am about to commit thread drift, but just the same, you might want to
cut and paste this one for future reference:

 ======
SSN

The first three numbers are geographic in where they are issued, and one
series was used for Railroad Retirement.  You would probably need a
"table" to verify the groups.

The middle two digits were issued in odd numbers 01 03 05 07 09 then
they were issued in evens since then 10 12 14 etc....  Not all evens
have been used in all areas.

The last four are sequential and there is no way of telling if there is
a transposition.

I have used the GO TRACE service on CompuServe for looking up death
records from Social Security and the first 5 digits of the social
security number are displayed, along with where issued with approximate
year of issue.

The ancestry web site has a Social Security death search that give the
full 9 digit number of the deceased individual.  It does not supply the
years or location of issue.

This information is from 1988 and there may be changes in the
particulars since then.  However, the Social Security Administration
actively resists giving information to citizens on their number system.
However, there are commercial firms such as Security Software Solutions
(Box 30125; Tucson, AZ 85751-0125; phone 800-681-8933;
http://www.***.com/ ), which will verify SSNs for living and
deceased persons.  

The Social Security Number (SSN) is composed of 3 parts, XXX-XX-XXXX,
called the Area, Group, and Serial.  For the most part, (there are
exceptions), the Area is determined by where the individual APPLIED for
the SSN (before 1972) or RESIDED at time
of application (after 1972).  The areas are assigned as follows:

  000 unused
  001-003 NH
  004-007 ME
  008-009 VT
  010-034 MA
  035-039 RI
  040-049 CT
  050-134 NY
  135-158 NJ
  159-211 PA
  212-220 MD
  221-222 DE
  223-231 VA
  232-236 WV
  237-246 NC
  247-251 SC
  252-260 GA
  261-267 FL
  268-302 OH
  303-317 IN
  318-361 IL
  362-386 MI
  387-399 WI
  400-407 KY
  408-415 TN
  416-424 AL
  425-428 MS
  429-432 AR
  433-439 LA
  440-448 OK
  449-467 TX
  468-477 MN
  478-485 IA
  486-500 MO
  501-502 ND
  503-504 SD
  505-508 NE
  509-515 KS
  516-517 MT
  518-519 ID
  520 WY
  521-524 CO
  525 NM
  526-527 AZ
  528-529 UT
  530 NV
  531-539 WA
  540-544 OR
  545-573 CA
  574 AK
  575-576 HI
  577-579 DC
  580 VI {*filter*} Islands
  581-584 PR Puerto Rico
  585 NM
  586 PI Pacific Islands
         (Northern Mariana Islands,Guam, American Samoa,
          Philippine Islands)
  587-588 MS
  589-595 FL
  596-599 PR Puerto Rico
  600-601 AZ
  602-626 CA
  627-699 unassigned, for future use
  700-728 Railroad workers through 1963, then discontinued
  729-899 unassigned, for future use
  900-999 not valid SSNs, but were used for program purposes when state
aid to the aged, blind and disabled was converted to a federal program
administered by SSA.

As the Areas assigned to a locality are exhausted, new areas from the
pool are assigned.  This is why some states have non- contiguous groups
of Areas.

As the Areas assigned to a locality are exhausted, new areas from the
pool are assigned.  This is why some states have non- contiguous groups
of Areas.

The Group portion of the SSN has no meaning other than to determine
whether or not a number has been assigned.  SSA publishes a list every
month of the highest group assigned for each SSN Area.  The order of
assignment for the Groups is: odd numbers under 10, even numbers over 9,
even numbers under 9 except for 00 which is never used, and odd numbers
over 10.  For example, if the highest group assigned for area 999 is 72,
then we know that the number 999-04-1234 is an invalid number because
even Groups under 9 have not yet been assigned.

The Serial portion of the SSN has no meaning.  The Serial is not
assigned in strictly numerical order.  The Serial 0000 is never
assigned.

Before 1973, Social Security Cards with pre-printed numbers were issued
to each local SSA office.  The numbers were assigned by the local
office.  In 1973, SSN assignment was automated and outstanding stocks of
pre-printed cards were destroyed.  All SSNs are now assigned by computer
from headquarters.  There are rare cases in which the computer system
can be forced to accept a manual assignment such as a person refusing a
number with 666 in it.

A pamphlet entitled "The Social Security Number" (Pub.  No. 05-10633)
provides an explanation of the SSN's structure and the method of
assigning and validating Social Security numbers.

--CELKO--
 ===========================
 Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.***.com/ ***
Don't just participate in USENET...get rewarded for it!



Sat, 06 Nov 2004 23:50:20 GMT
 Normalization, Natural Keys, Surrogate Keys

Quote:
>>  .. another common bad choice of last name/DOB. <<

Date of birth is part of the national identity card numbers in Nordic
countries.  There were political problems when the place of birth part
of the code could be used to identify immigrants.  But otherwise, the
system seems to have worked for them for over 50 years to administer an
elaborate system of social welfare programs.  

But you are right about the real point; how do I keep the
surrogate/artificial key linked to the natural/verifiable key?  If I
make that link physical -- like putting UPC codes on packages -- then
the surrogate/artificial key becomes a natural/verifiable key.

My arguments are

1) Why create the coirespondence problem for yourself in the first
place?  

2) The natural key will naturally appear in other places (i.e. the SSN
is the driver license number, the HMO account number, the public library
card number, etc.) and the odds of getting all the 1 to 1 links right is
too high.  

3) People do not know a machine generated number, but they can discover
or verify a natural key.  

4) Most natural keys were created for human use, so they are not much
longer than the machine generated ones.  In case of the GUID in SQL
Server, they are usually considerably shorter, in fact!

SSN was popular because people actually used to carry their Social
Security Card in their wallets.  If you did not have your card, then
your driver's license in most states used the SSN, etc.

Today, a common intelligent key is the email address, which is very easy
to verify.  The problem is updating the key when someone changes email
providers.  But this update can be done by emailing the old address and
getting the new one automatically.  

--CELKO--
 ===========================
 Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!



Sun, 07 Nov 2004 00:10:18 GMT
 Normalization, Natural Keys, Surrogate Keys


Quote:



>> >> The SSN is a surrogate key itself, and a BAD one at that. There
>are
>> duplicates out there. <<

>> With Mexican illegals in Calfornia, Texas and other states,

>Singling out one ethnic group isn't very productive.... oh well ...
>amazing ... no, actually it's not ...

>Businesses operated by good old American citizens are generating
>duplicate social security numbers for illegal aliens.

In the UK the SSN is issued by the government and is usually based on
the persons place and time of birth. We still had duplicates, even
before we had a significant number of illegal immigrants.

If you really need a surrogate key that you can trust implicitly then
you need to issue it yourself.

--
Bernard Peek

In search of cognoscenti



Sat, 06 Nov 2004 20:59:24 GMT
 Normalization, Natural Keys, Surrogate Keys

writes

Quote:
>> If you allow this to happen then your data has been corrupted, and
>> corrupted in a way that's not immediately obvious.

>What happens if a customer id is email address?  There are apps around
>where you can't get a natural, non-changing primary key for the
>customer.  But, a completely logical identifier for, say, an online app
>could be an email address, or maybe even a credit card.  But, this
>identifier can change, or they could have more than one.  What do you do
>then?

You find out who decided to use e-mail address as a unique identifier
and tell them to fix it before they leave. I have multiple entries in
several on-line databases because I use multiple e-mail addresses. This
is fine for me as long as I can remember which address I used where and
when.

A common fix is to have the customer generate their own identifier as a
login ID and just check to make sure that it is unique within your own
database. You can't prevent a customer from using multiple IDs if they
choose, so don't bother trying. You might de-duplicate the data later.
There are companies that exist to do nothing else but de-duplicate lists
of names.

Quote:

>Or, is the whole surrogate/natural argument centered around creating
>keys for data that already has "natural" keys and has nothing to do with
>new apps/data?

We've been through this one before in comp.databases.theory and I
proposed that we use latitude, longitude, altitude and birth-time as a
natural key for people. Anything else is a surrogate. We can't usually
get a real natural key for people so surrogates are all we have.

Other types of data have natural keys and I would strongly advise using
them wherever possible. I know that there are people who always use
surrogate keys even when they have natural keys available. I hope I
don't have to use any of the databases they designed.

For me there are three types of key and I use them in this order of
preference.

        true natural keys guaranteed by the laws of physics.

        surrogate keys that I generate and therefore control.

        surrogate keys issued by *trusted* third-parties.

Anything else shouldn't be completely trusted and perhaps shouldn't have
a unique index in a database. If you choose to make the SSN (or any
other third-party surrogate) a unique field in the database then you
must have a procedure in place to handle duplicate values.

--
Bernard Peek

In search of cognoscenti



Sat, 06 Nov 2004 21:37:18 GMT
 
 [ 123 post ]  Go to page: [1] [2] [3] [4] [5] [6] [7] [8] [9]

 Relevant Pages 

 
Powered by phpBB® Forum Software