Gut feeling: Which query looks better 
Author Message
 Gut feeling: Which query looks better

Without getting into details about table structure, indexes, number of rows,
etc. Which query do you think would run better on SQL Server.

Or do you think that SQL Server will always use the EXACT same plan, since
they are (or seem to me to be) the same query:

Query1
=====
SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
INNER JOIN (
   SELECT DISTINCT(UserGUID) FROM AuditLog
   WHERE TableName = 'Reservations'
   AND ChangeType = 'INSERTED'
) dt
ON Users.UserGUID = dt.UserGUID

Query2
=====
SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
WHERE Users.UserGUID IN (
   SELECT DISTINCT(UserGUID)
   FROM AuditLog
   WHERE TableName = 'Reservations'
   AND ChangeType = 'INSERTED')

The english explanation for the query is, i want all users who can created a
reservation. So i find all audit log table entries for "Insert" into the
"Reservations" table.

The sweeping generalization i want addressed is:
"Which is better, INNER JOIN or WHERE IN?"

And don't get hung up on "depends on the situation, table sizes, indexing
strategies, etc."
Feel free to make broad sweeping generalizations.



Fri, 28 Jul 2006 19:10:24 GMT
 Gut feeling: Which query looks better

Which application is better?  Don't spend too much time on what it is supposed to do or on what platform it has to run.

I would lead with the inner join solution.  If it does an adequate job, I would quit.  If there is a problem, then I would see if the subquery solution works better.

You forgot one other option, I think this may run faster than the in clause.

SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
WHERE exists (
   SELECT *
   FROM AuditLog
   WHERE TableName = 'Reservations'
   AND ChangeType = 'INSERTED' and UserGUID = Users.UserGUID )



Fri, 28 Jul 2006 19:26:05 GMT
 Gut feeling: Which query looks better
In this case, it should be the same on both.

--
-oj
http://www.rac4sql.net


Quote:
> Without getting into details about table structure, indexes, number of
rows,
> etc. Which query do you think would run better on SQL Server.

> Or do you think that SQL Server will always use the EXACT same plan, since
> they are (or seem to me to be) the same query:

> Query1
> =====
> SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
> INNER JOIN (
>    SELECT DISTINCT(UserGUID) FROM AuditLog
>    WHERE TableName = 'Reservations'
>    AND ChangeType = 'INSERTED'
> ) dt
> ON Users.UserGUID = dt.UserGUID

> Query2
> =====
> SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
> WHERE Users.UserGUID IN (
>    SELECT DISTINCT(UserGUID)
>    FROM AuditLog
>    WHERE TableName = 'Reservations'
>    AND ChangeType = 'INSERTED')

> The english explanation for the query is, i want all users who can created
a
> reservation. So i find all audit log table entries for "Insert" into the
> "Reservations" table.

> The sweeping generalization i want addressed is:
> "Which is better, INNER JOIN or WHERE IN?"

> And don't get hung up on "depends on the situation, table sizes, indexing
> strategies, etc."
> Feel free to make broad sweeping generalizations.



Fri, 28 Jul 2006 19:33:43 GMT
 Gut feeling: Which query looks better
Of the two, I would go with query #2, and drop that DISTINCT in the the
IN predicate.  The optimizer ought to be smart enough by now to do
figure it out.  I don't think it will make any difference, tho.

I hope that "userGUID" doesn't actually mean that you are using GUIDs to
identify users (no data integrity or verification).  I also hope that
you did not try to write your own audit log routines instead of buying a
package for that function (high overhead and needless complexity
compared to a package).  See
http://www.sswug.org/searchresults.asp?keywordstofind2=lumigent

I hope fullname did not mean the user's full name in one column.  Talk
to anyone in direct mail, and you'll find out about keeping first and
last names separate (Google "Search Software America" and ask for their
booklet "The Math, Myth & Magic of Name Searching and Matching"; it is
quite informative).

Quote:
>> The English explanation for the query is, I want all users who can

created a reservation. So I find all audit log table entries for
"Insert" into the "Reservations" table. <<

Most of us would query the Reservations table to find who made
reservations.  You would think of this kind of query instead of either
of yours:

SELECT user_id, first_name, last_name
  FROM Users AS U1
 WHERE EXISTS
       (SELECT *
          FROM Reservations AS R1
         WHERE R1.user_id = U1.user_id);

This does not tell us who had the power to make a reservation but has
not exercised it yet.  Need to know a bit more about the application for
that one.  

--CELKO--
 ===========================
 Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!



Fri, 28 Jul 2006 22:43:40 GMT
 Gut feeling: Which query looks better
At least some of the time all three variations of the query will produce
plans that look the same in the plan display.
(goahead and try it on some of your own tables)

Bye,
Delbert Glass



Quote:
> Which application is better?  Don't spend too much time on what it is

supposed to do or on what platform it has to run.
Quote:

> I would lead with the inner join solution.  If it does an adequate job, I

would quit.  If there is a problem, then I would see if the subquery
solution works better.
Quote:

> You forgot one other option, I think this may run faster than the in
clause.

> SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
> WHERE exists (
>    SELECT *
>    FROM AuditLog
>    WHERE TableName = 'Reservations'
>    AND ChangeType = 'INSERTED' and UserGUID = Users.UserGUID )



Fri, 28 Jul 2006 23:21:27 GMT
 Gut feeling: Which query looks better

Quote:
> I hope that "userGUID" doesn't actually mean that you are using GUIDs to
> identify users (no data integrity or verification).

At the least, GUIDs are being used for linking.
Look through the newsgroup for previous threads by Ian Boyd
to find a thread in which a performance problem related to the GUIDs was
addressed.

Quote:
> ... and drop that DISTINCT in the the
> IN predicate.  The optimizer ought to be smart enough by now to do
> figure it out.  I don't think it will make any difference, tho.

If you leave out DISTINCT, the optimizer is smart enought to put it in;
*whenever*, it deems doing so is benefitial.

If you put in DISTINCT, I don't recall off hand whether or not
the optimizer ever elects to ignore you.

Bye,
Delbert Glass


Quote:
> Of the two, I would go with query #2, and drop that DISTINCT in the the
> IN predicate.  The optimizer ought to be smart enough by now to do
> figure it out.  I don't think it will make any difference, tho.

> I hope that "userGUID" doesn't actually mean that you are using GUIDs to
> identify users (no data integrity or verification).  I also hope that
> you did not try to write your own audit log routines instead of buying a
> package for that function (high overhead and needless complexity
> compared to a package).  See
> http://www.sswug.org/searchresults.asp?keywordstofind2=lumigent

> I hope fullname did not mean the user's full name in one column.  Talk
> to anyone in direct mail, and you'll find out about keeping first and
> last names separate (Google "Search Software America" and ask for their
> booklet "The Math, Myth & Magic of Name Searching and Matching"; it is
> quite informative).

> >> The English explanation for the query is, I want all users who can
> created a reservation. So I find all audit log table entries for
> "Insert" into the "Reservations" table. <<

> Most of us would query the Reservations table to find who made
> reservations.  You would think of this kind of query instead of either
> of yours:

> SELECT user_id, first_name, last_name
>   FROM Users AS U1
>  WHERE EXISTS
>        (SELECT *
>           FROM Reservations AS R1
>          WHERE R1.user_id = U1.user_id);

> This does not tell us who had the power to make a reservation but has
> not exercised it yet.  Need to know a bit more about the application for
> that one.

> --CELKO--
>  ===========================
>  Please post DDL, so that people do not have to guess what the keys,
> constraints, Declarative Referential Integrity, datatypes, etc. in your
> schema are.

> *** Sent via Developersdex http://www.developersdex.com ***
> Don't just participate in USENET...get rewarded for it!



Fri, 28 Jul 2006 23:21:30 GMT
 Gut feeling: Which query looks better
Query2/IN:
IN makes it clear what _this_ query does.

Query3/EXISTS:
Sometimes queries do not skip work they could -- exists tends to be better
at actually skipping.

Query1/INNERJOIN:
This arrangement tends to avoid "my query is talking way longer than really
necessary".
This arrangement can be used for a variety of tasks. While suffering from
not being as clear as IN,
this arrangement has the benefit that if your requirements change,
like say order the users by who has been waiting the longest etc,
you be already be halfway there and thus will be able to fullfill the change
request
more easily, quicker, and confidently.

Bye,
Delbert Glass


Quote:
> Without getting into details about table structure, indexes, number of
rows,
> etc. Which query do you think would run better on SQL Server.

> Or do you think that SQL Server will always use the EXACT same plan, since
> they are (or seem to me to be) the same query:

> Query1
> =====
> SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
> INNER JOIN (
>    SELECT DISTINCT(UserGUID) FROM AuditLog
>    WHERE TableName = 'Reservations'
>    AND ChangeType = 'INSERTED'
> ) dt
> ON Users.UserGUID = dt.UserGUID

> Query2
> =====
> SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
> WHERE Users.UserGUID IN (
>    SELECT DISTINCT(UserGUID)
>    FROM AuditLog
>    WHERE TableName = 'Reservations'
>    AND ChangeType = 'INSERTED')

> The english explanation for the query is, i want all users who can created
a
> reservation. So i find all audit log table entries for "Insert" into the
> "Reservations" table.

> The sweeping generalization i want addressed is:
> "Which is better, INNER JOIN or WHERE IN?"

> And don't get hung up on "depends on the situation, table sizes, indexing
> strategies, etc."
> Feel free to make broad sweeping generalizations.



Fri, 28 Jul 2006 23:59:08 GMT
 Gut feeling: Which query looks better
Quote:
>> If you put in DISTINCT, I don't recall off hand whether or not the

optimizer ever elects to ignore you. <<

Some of the smarter optimizer look at an estimate of the size of the
subquery, then either just do it (small, fits into main storage) or
prune it (bigger, might page in and out a bit, so sort it and DISTINCT
it) or go to a hash table (honking big thing, need to get parallel
processors and DISTINCT it in the hash buckets).  

--CELKO--
 ===========================
 Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!



Sat, 29 Jul 2006 02:19:05 GMT
 Gut feeling: Which query looks better
At least there were no barn-yard references. :P


Quote:
> At least some of the time all three variations of the query will produce
> plans that look the same in the plan display.
> (goahead and try it on some of your own tables)

> Bye,
> Delbert Glass



> > Which application is better?  Don't spend too much time on what it is
> supposed to do or on what platform it has to run.

> > I would lead with the inner join solution.  If it does an adequate job,
I
> would quit.  If there is a problem, then I would see if the subquery
> solution works better.

> > You forgot one other option, I think this may run faster than the in
> clause.

> > SELECT Users.UserGUID, Users.Username, Users.Fullname FROM Users
> > WHERE exists (
> >    SELECT *
> >    FROM AuditLog
> >    WHERE TableName = 'Reservations'
> >    AND ChangeType = 'INSERTED' and UserGUID = Users.UserGUID )



Mon, 31 Jul 2006 14:55:13 GMT
 Gut feeling: Which query looks better

Quote:
> I hope that "userGUID" doesn't actually mean that you are using GUIDs to
> identify users

i am.

Quote:
> (no data integrity or verification).

i don't see why this would be the case.

Quote:
> I also hope that you did not try to write your own audit log routines
instead of buying a
> package for that function (high overhead and needless complexity compared

to a package).

Yeah, i saw a nice program by "Red Matrix Technologies."
i wrote my own, "Red Matrix Reloaded"

Plus, it's not like it's a horribly complicated thing to implement; and we
get to customize it to boot!

Quote:
> I hope fullname did not mean the user's full name in one column.

You can hope all you want, it will still be the full name.
i took the idea from Windows' own user manager (the pre-Win2k user manager
if you
want to get picky about it)

Quote:
> Most of us would query the Reservations table to find who made
> reservations.

Yeah, i would have done that, except i forgot to add a "CreatedByUserGUID"
column
to the Reservations table (oops). Not to worry too much, since i now have
the information
in the Audit Log.

Quote:
> --CELKO--

Holy hell! You mean i'm getting a smack-down from _the_ Joe Celko?
i've hit the big time.


Mon, 31 Jul 2006 15:30:41 GMT
 Gut feeling: Which query looks better

Quote:
>> I am. [using GUIDs to identify users] <<

Got to your data model and explain what that GUID means in terms of the
LOGICAL model.  You cannot because this thing was generated in a
particular piece of PHYSICAL storage, based on the PHYSICAL state of
particular release of a particular proprietyary softeware package at the
time the PHYSICAL record was created.  

The first practical consideration is that GUID is proprietary and
non-portable, so you know that you will have maintenance problems when
you change releases or port your system to other products.  Newbies
actually think they will never port code!  Perhaps they only work for
companies that are failing and will be gone.  Perhaps their code is such
{*filter*}nobody else want their application.

But let's look at the logical problems.  Create a simple table with one
auto-generated key of any kind and a few other columns.  Use a few
statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the auto-generated key
numbered them in the order they were presented.  But now use a statement
with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
  FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the auto-generated key be?  The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to arrange (n) rows, so which one do you pick?  The
answer has been to use whatever the PHYSICAL order of the result set
happened to be.  That non-relational phrase "physical order" again!

But it is actually worse than that.  If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order.

Can you explain from a logical model why the same rows in the second
query get different PHYSICAL numbers?  In the relational model, they
should be treated the same if all the values of all the attributes are
identical.  

Using auto-generated values as a primary key is a sign that there is no
data model, only an imitation of a sequential file system.  Since this
"magic, all-purpose, one-size-fits-all" pseudo-identifier exists only as
a result of the physical state of a particular piece of hardware at a
particular time as read by the current release of a particular database
product, how do you verify that an entity has such a number in the
reality you are modeling?  

You will see newbies who design tables like this:

CREATE Drivers
(driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,  
 ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
 vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

Now input data and submit the same row a thousand times, a million
times.  Your data integrity is trashed.  The natural key was this:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
 vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
 PRIMARY KEY (ssn, vin));

To demonstrate, here is a typical idiot newbie schema -- you will them
all over the news groups.  I call them "idiots" because they always name
the IDENTITY property column "id" in EVERY table.  They don't understand
basic data modeling -- one and only name for an attribute.  About half
the time they don't use any DRI, but let's show it.

CREATE TABLE MotorPool
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
 ssn CHAR(9) NOT NULL REFERENCES Personnel(id),
 vin CHAR(17) NOT NULL REFERENCES Vehicle(id));

CREATE TABLE Personnel
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
 ssn CHAR(9) NOT NULL UNIQUE,
 ..);

CREATE TABLE Vehicles
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
 vin CHAR(17) NOT NULL UNIQUE,
.);

Now change the natural key in Personnel:

UPDATE Personnel
   SET ssn = '666666666'
 WHERE ssn = '000000000';

Nothing happened in Motorpool, did it?  You can do the same thing with a
VIN.  

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering.  Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.

Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
users may cause the system to generate or delete a surrogate, but they
have no control over its value, nor is its value ever displayed to them
.."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning.  ACM Transactions
on Database Systems, 4(4).  pp. 397-434.  

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user.  That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1)  The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result that
some or all of the serial numbers might be changed.).

(2)  Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3)  It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities.  A solution - proposed in part [4] and more fully in [14] - is
to introduce entity domains which contain system-assigned surrogates.
Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever displayed
to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more
meaning.  ACM Transactions on Database Systems, 4(4).  pp. 397-434

Quote:
>> I wrote my own, "Red Matrix Reloaded" Plus, it's not like it's a

horribly complicated thing to implement; and we
get to customize it to boot! <<

I always tend toward buying instead of building.  People forget that
building software **also** means maintaining it and that the code
developement is only a small fraction of the total cost of software.
You might want to look at the SEI and DoD studies on the total cost of
each approach.  

I am always afraid of mixing different applications into one block of
code.  It tends to destroy cohesion and coupling.  If I can put a
"border" around my database and have my private "INS"  to log things as
they cross, I am free to design the best schema for the enterprise's
business operations.  If I can hire those guards without putting in the
time and effort to train them, even better.

Quote:
>> You can hope all you want, it will still be the full name. <<

Again, get that little booklet on names; it makes you appreciate what a
direct mail list company has to do to avoid redundancy with names.  Oh,
and having the same data (last name) in two columns in the same row is
not a good design.

Quote:
>> Yeah, I would have done that [query the Reservations table], except I

forgot to add a "CreatedByUserGUID"
column to the Reservations table (oops). Not to worry too much, since I
now have the information in the Audit Log.<<

ARRRGHH!  Is the creator an attribute of a reservation or not? If it is,
then put it in the Reservations table.  Drop a row from either table,
and I bet you have an orphan in the other.  

1) Orphan row = a row in one table that should reference another table,
but due to lack of DRI, does not.  Example:
an Order Details table row which does not reference a row in the Orders
table.  This usually occurs when a referenced table is changed and the
action is not cascaded to the referencing tables.  

2) Crippled Orphan row = an orphaned row which does not hold a complete
fact in itself and cannot be manually associated to a referenced table
row.  Usually there is a third table that held the relationship and it
has been trashed.  Example: Orders and Order Details are keyed with
GUIDs.  The relationship that associates them is in a table with two
columns for their GUIDs and a UNIQUE constraint on the details; this
table is trashed and there is no way to match Order GUIDs to Order
Detail GUIDs.  

3) Adopted Orphan row = An orphan that is associated with an unrelated
table.  Example: an Order Details table row which uses a GUID or other
unverifiable exposed locator as part of a key that does not reference a
row in the Orders table.  This GUID might appear in a totally different
table that has nothing to do with Orders.  

All this kind of thing used to happen in the old network databases when
pointer chains wwre corrupted, but we had some tools to fix them.  You
are repeating the mistakes we made over 30 years ago, but without the
repair kit.  

Quote:
>> Holy hell! You mean i'm getting a smack-down from _the_ Joe Celko?

i've hit the big time. <<

It's not the big time yet, kid, but it is a "Rite of Passage" :)  

--CELKO--
 ===========================
 Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.***.com/ ***
Don't just participate in USENET...get rewarded for it!



Mon, 31 Jul 2006 19:19:40 GMT
 Gut feeling: Which query looks better

Quote:
>Create a simple table with one
>auto-generated key of any kind and a few other columns.  Use a few
>statements like

>INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
>INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
>INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

>To put a few rows into the table and notice that the auto-generated key
>numbered them in the order they were presented.

True, the *numbering* of the rows occurred in the sequence
the rows where inserted -- as expected since each insert query
is executed sequentially per the design of the system.

However, the *value* of the numbers assigned are not necessarily
positively correlated with the sequence of insertion.
For example, auto-generation via IDENTITY with a negative step size
is negatively correlated and auto-generation via NEWID is suppose
to be non-correlated.

Quote:
>                  But now use a statement
>with a query expression in it, like this:

>INSERT INTO Foobar (a, b, c)
>SELECT x, y, z
>  FROM Floob;

>Since a query result is a table, and a table is a set which has no
>ordering, what should the auto-generated key be?  The entire, whole,
>completed set is presented to Foobar all at once, not a row at a time.
>There are (n!) ways to arrange (n) rows, so which one do you pick?

Any one I want -- but does my pick matter?

Although at the highest level the query inserts all the rows at once,
in some systems (in which you might not of even
had auto-generation capabilities),
one could use an ORDER BY clause (or lack thereof)
to control what sequence the individual new rows where inserted
and thus have an impact on how effectively the query was executed.
It would also be allowing one to pick the arrangement
of new rows vs auto-generated values (which are not necessarily keys).

However, it appears to me that the SQL-92 does not even
specify an ORDER BY clause can present; long less,
specify what should happen when one is present.

TSQL does allow an ORDER BY clause to be present.
However, I have not seen any documentation
(which says nothing about whether or not such documentation exist)
that specifies what TSQL should do when an ORDER BY clause is present.

One possibility is that the ORDER BY clause is excepted but ignored.
But that does not appear to be the case (as the newsgroup is fond
of pointing out) including the ORDER BY clause effects how long
it takes for the query to execute.

Is TSQL really wasting effort (perhaps, ordering the rows
and then processing in a manner such that the order
was a waste of effort)?

Could it be that the new rows are utilized in the sorted sequence,
but are not necessarily matched up to
effectively created candidate rows
in the order they where effectively created?

Or could it be that misinformation
has been created due to confusing
this situation with the situation discussed in:

Q273586 - BUG: SELECT INTO With ORDER BY Clause Ignores Order By During the
Insert
http://www.***.com/ ;en-us;273586

?

Can anyone point out documentation that (non-mis-informatively)
states (or at least reveals) whether or not
one can expect the rows to get particular
auto-generated values in *this* case?

Can anyone create an example which illustrates that
one cannot expect the rows to get particular
auto-generated values in *this* case?
Is that example behaving as designed/documented?

Bye,
Delbert Glass


Quote:
> >> I am. [using GUIDs to identify users] <<

> Got to your data model and explain what that GUID means in terms of the
> LOGICAL model.  You cannot because this thing was generated in a
> particular piece of PHYSICAL storage, based on the PHYSICAL state of
> particular release of a particular proprietyary softeware package at the
> time the PHYSICAL record was created.

> The first practical consideration is that GUID is proprietary and
> non-portable, so you know that you will have maintenance problems when
> you change releases or port your system to other products.  Newbies
> actually think they will never port code!  Perhaps they only work for
> companies that are failing and will be gone.  Perhaps their code is such
> {*filter*}nobody else want their application.

> But let's look at the logical problems.  Create a simple table with one
> auto-generated key of any kind and a few other columns.  Use a few
> statements like

> INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
> INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
> INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

> To put a few rows into the table and notice that the auto-generated key
> numbered them in the order they were presented.  But now use a statement
> with a query expression in it, like this:

> INSERT INTO Foobar (a, b, c)
> SELECT x, y, z
>   FROM Floob;

> Since a query result is a table, and a table is a set which has no
> ordering, what should the auto-generated key be?  The entire, whole,
> completed set is presented to Foobar all at once, not a row at a time.
> There are (n!) ways to arrange (n) rows, so which one do you pick?  The
> answer has been to use whatever the PHYSICAL order of the result set
> happened to be.  That non-relational phrase "physical order" again!

> But it is actually worse than that.  If the same query is executed
> again, but with new statistics or after an index has been dropped or
> added, the new execution plan could bring the result set back in a
> different physical order.

> Can you explain from a logical model why the same rows in the second
> query get different PHYSICAL numbers?  In the relational model, they
> should be treated the same if all the values of all the attributes are
> identical.

> Using auto-generated values as a primary key is a sign that there is no
> data model, only an imitation of a sequential file system.  Since this
> "magic, all-purpose, one-size-fits-all" pseudo-identifier exists only as
> a result of the physical state of a particular piece of hardware at a
> particular time as read by the current release of a particular database
> product, how do you verify that an entity has such a number in the
> reality you are modeling?

> You will see newbies who design tables like this:

> CREATE Drivers
> (driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,
>  ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
>  vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

> Now input data and submit the same row a thousand times, a million
> times.  Your data integrity is trashed.  The natural key was this:

> CREATE Drivers
> (ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
>  vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
>  PRIMARY KEY (ssn, vin));

> To demonstrate, here is a typical idiot newbie schema -- you will them
> all over the news groups.  I call them "idiots" because they always name
> the IDENTITY property column "id" in EVERY table.  They don't understand
> basic data modeling -- one and only name for an attribute.  About half
> the time they don't use any DRI, but let's show it.

> CREATE TABLE MotorPool
> (id IDENTITY (1,1) NOT NULL PRIMARY KEY,
>  ssn CHAR(9) NOT NULL REFERENCES Personnel(id),
>  vin CHAR(17) NOT NULL REFERENCES Vehicle(id));

> CREATE TABLE Personnel
> (id IDENTITY (1,1) NOT NULL PRIMARY KEY,
>  ssn CHAR(9) NOT NULL UNIQUE,
>  ..);

> CREATE TABLE Vehicles
> (id IDENTITY (1,1) NOT NULL PRIMARY KEY,
>  vin CHAR(17) NOT NULL UNIQUE,
> .);

> Now change the natural key in Personnel:

> UPDATE Personnel
>    SET ssn = '666666666'
>  WHERE ssn = '000000000';

> Nothing happened in Motorpool, did it?  You can do the same thing with a
> VIN.

> Now you are REALLY thinking about relations and keys instead of 1950's
> sequential record numbering.  Adding an IDENTITY column to either of
> these tables as a candidate key would be dangerously redundant; one
> query uses the IDENTITY and another uses the real key, and like a man
> with two watches, you are never sure what time it is.

> Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
> users may cause the system to generate or delete a surrogate, but they
> have no control over its value, nor is its value ever displayed to them
> .."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
> the database relational model to capture more meaning.  ACM Transactions
> on Database Systems, 4(4).  pp. 397-434.

> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user.  That means never
> used in queries, DRI or anything else that a user does.

> Codd also wrote the following:

> "There are three difficulties in employing user-controlled keys as
> permanent surrogates for entities.

> (1)  The actual values of user-controlled keys are determined by users
> and must therefore be subject to change by them (e.g. if two companies
> merge, the two employee databases might be combined with the result that
> some or all of the serial numbers might be changed.).

> (2)  Two relations may have user-controlled keys defined on distinct
> domains (e.g. one uses social security, while the other uses employee
> serial numbers) and yet the entities denoted are the same.

> (3)  It may be necessary to carry information about an entity either
> before it has been assigned a user-controlled key value or after it has
> ceased to have one (e.g. and applicant for a job and a retiree).

> These difficulties have the important consequence that an equi-join on
> common key values may not yield the same result as a join on common
> entities.  A solution - proposed in part [4] and more fully in [14] - is
> to introduce entity domains which contain system-assigned surrogates.
> Database users may cause the system to generate or delete a surrogate,
> but they have no control over its value, nor is its value ever displayed
> to them....." (Codd in ACM TODS,

...

read more »



Tue, 01 Aug 2006 01:24:08 GMT
 Gut feeling: Which query looks better

Quote:
> >> I am. [using GUIDs to identify users] <<

> Got to your data model and explain what that GUID means in terms of the
> LOGICAL model.

Logically, it means nothing. It is my surrogate key.

Quote:
> The first practical consideration is that GUID is proprietary and
> non-portable, so you know that you will have maintenance problems when
> you change releases or port your system to other products.

i switched to GUID's for a surrogate key insteady of IDENTITY's because they
are more portable.

Quote:
> To put a few rows into the table and notice that the auto-generated key
> numbered them in the order they were presented.  But now use a statement
> with a query expression in it, like this:
...
> There are (n!) ways to arrange (n) rows, so which one do you pick?

i don't have to care. It's a surrogate key that the server can generate as
it sees fit.
It's not part of my logical model.

Quote:
> Can you explain from a logical model why the same rows in the second
> query get different PHYSICAL numbers?  In the relational model, they
> should be treated the same if all the values of all the attributes are
> identical.

Are surrogate key fields supposed to have some logical meaning?

Quote:
> How do you verify that an entity has such a number in the
> reality you are modeling?

It's a surrogate key, it has no meaning in reality.

Quote:
> You will see newbies who design tables like this:

> CREATE Drivers
> (driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,
>  ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
>  vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

> Now input data and submit the same row a thousand times, a million
> times.  Your data integrity is trashed.

You mean the data integrity is trashed becase i managed to insert a million
times a row with the same SSN and VIN?
i don't think that would happen, since i would also have a unique constraint
on fields that must remain unique.

Quote:
> To demonstrate, here is a typical idiot newbie schema

> CREATE TABLE MotorPool
> (id IDENTITY (1,1) NOT NULL PRIMARY KEY,
>  ssn CHAR(9) NOT NULL REFERENCES Personnel(id),
>  vin CHAR(17) NOT NULL REFERENCES Vehicle(id));

> CREATE TABLE Personnel
> (id IDENTITY (1,1) NOT NULL PRIMARY KEY,
>  ssn CHAR(9) NOT NULL UNIQUE,
>  ..);

> CREATE TABLE Vehicles
> (id IDENTITY (1,1) NOT NULL PRIMARY KEY,
>  vin CHAR(17) NOT NULL UNIQUE,
> .);

Well i don't think the idiot newbie will be able to create these tables,
since a CHAR(9) field cannot reference an IDENTITY field. But if you meant:

CREATE TABLE MotorPool
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
 PersonID INT NOT NULL REFERENCES Personnel(id),
 VehicleID INT NOT NULL REFERENCES Vehicle(id));

Then okay, we're on the same page.

Quote:
> Now change the natural key in Personnel:

> UPDATE Personnel
>    SET ssn = '666666666'
>  WHERE ssn = '000000000';

> Nothing happened in Motorpool, did it?  You can do the same thing with a
> VIN.

Right, nothing happened. Is this a good thing or a bad thing?
A person's SSN can change, without affecting them being a driver.

Quote:
> Adding an IDENTITY column to either of
> these tables as a candidate key would be dangerously redundant;

Are you saying that i would then have essentially two unique keys in my
table,
    1. the IDENTITY and
    2. the ssn/vin combination field?

If so, then are you advocating only having the ssn/vin as a primary key, and
not creating a surrogate IDENTITY key?

Quote:
> Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
> users may cause the system to generate or delete a surrogate, but they
> have no control over its value, nor is its value ever displayed to them
> .."
> This means that a surrogate ought to act like an index;

My surrogate UserGUID is like an index.

Quote:
> created by the user,

i thought it was generated by the system

Quote:
>managed by the system and NEVER seen by a user.  That means never
> used in queries, DRI or anything else that a user does.

My users never see the GUID. But in this case am _I_ the user, or _my_ users
the user?


Tue, 01 Aug 2006 22:24:57 GMT
 
 [ 13 post ] 

 Relevant Pages 

1. jTree - Look and Feel

2. APT Motif look-and-feel

3. Ingres Windows4gl, look and feel portability.

4. Filemaker Apps - generic look and feel

5. Exporting Look and Feel

6. FoxPro look and feel (Changing)

7. Looking to feel, touch, taste

8. Ingres Windows4gl, look and feel portability.

9. Feel better, lose we1ght, and signs of aging

10. 5.0 feels good.

11. vwbcl feel better wywj


 
Powered by phpBB® Forum Software