Help... float data resulting in inaccurate statistics :( 
Author Message
 Help... float data resulting in inaccurate statistics :(

Because of the sqrt in the formula for stdev, floats will be used no
matter whether you use decimals or floats to store values, if you use
that builtin function.  And I'm not sure that even for decimal values,
AVG() will return the expected answer when all values are identical,
since division is not exact for decimals.

You are seeing different answers because you use SELECT in one place and
PRINT in another.  They don't display values in the same way or with the
same precision.

A safe way to proceed is


  select
    case when min(x) = max(x) or count(x) = 0
    then 0 else stdev(x) end
  from Data
)


You might think twice about any situation where sigma=0 is meaningful,
since statistics is founded on measuring data, and no non-integer data
measurement should be assumed to exact.  Just because all the recordings
are identical shouldn't lead you to the assumption that the underlying
physical data are also identical.  They are only identical to the
precision of the measurement.

All that said, though, there may be a bug here.  ;)

Here's a repro that gives an unexpected value of the standard
deviation.  See if my safe approach helps, and I'll keep looking into this.

create table Data (
  x float
)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)

select stdev(x) as stdev_x from Data
-- The result I get is 14654.295097342621

-- Steve Kass
-- Drew University
-- Ref: 6F22694F-5415-4A17-9B80-2C93BECACC12


Quote:
>I've got a problem stemming from the fact that our company's software
>stores data (measurements) as float. What I am doing is calculating a
>statistic (Cpk) that sometimes uses the standard deviation of the
>data. Due to the inexactness of the float fields, I am getting back
>non-zero standard deviation when it should be zero, and this is a
>critical value for the Cpk formula.

>For example I have four rows in the result set each having the data
>field with a float value of "exactly" 2.3501. The STDEV function
>returns a non-zero value and the VAR function returns a non-zero
>value.

>There is no possibility that the database structure will change. I
>have tried casting the field to a decimal type, which works as long as
>my precision is 5 or less... which is too restrictive for my purposes.

>I am working in a user-defined function here and the rules seem to be
>different, because in Query Analyzer if I directly run a "select
>STDEV(value_)...." I get back zero. But if I paste my code in I get
>back nonzero... any suggestions would be appreciated. I have tried
>variations on the data type of the local variable into which I am


>-- precision 5 would work

>-- precision 6 would not work


>declare cur_stddev cursor for select stdev(value_) from [...]
>open cur_stddev



>close cur_stddev
>deallocate cur_stddev




Sat, 01 Jul 2006 22:35:48 GMT
 Help... float data resulting in inaccurate statistics :(

It turns out that nothing is amiss.  This is effectively a textbook
example for a Numerical Analysis class.  Here's a simple example that
returns a seemingly wrong standard deviation, with some explanatory
notes.  Numerical Analysis is hard stuff that never stops biting when
you aren't on guard.

create table Data (
  x float
)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)
insert into Data values (1111777777777)

select stdev(x) from Data
-- returns 16384.0
go



select



go
-- returns these three values:
-- 6.1802491357938216E+24
-- 6.1802491357938226E+24 (note it ends in 26, not 16)
-- 16384.0

-- The first two expressions selected above
-- are mathematically equal, but the calculated
-- results differ by one part in about 10^16.
-- This is acceptable for 8-byte floating-point
-- arithmetic.  The third expression is the
-- formula for the standard deviation, and simply
-- through arithmetic becomes a noticeable error,
-- a standard deviation of about 7 parts per billion
-- instead of zero.

drop table Data

-- Steve Kass
-- Drew University
-- Ref: 053B98F8-14EB-4A2E-8762-7957ED37BA42


Quote:
>I've got a problem stemming from the fact that our company's software
>stores data (measurements) as float. What I am doing is calculating a
>statistic (Cpk) that sometimes uses the standard deviation of the
>data. Due to the inexactness of the float fields, I am getting back
>non-zero standard deviation when it should be zero, and this is a
>critical value for the Cpk formula.

>For example I have four rows in the result set each having the data
>field with a float value of "exactly" 2.3501. The STDEV function
>returns a non-zero value and the VAR function returns a non-zero
>value.

>There is no possibility that the database structure will change. I
>have tried casting the field to a decimal type, which works as long as
>my precision is 5 or less... which is too restrictive for my purposes.

>I am working in a user-defined function here and the rules seem to be
>different, because in Query Analyzer if I directly run a "select
>STDEV(value_)...." I get back zero. But if I paste my code in I get
>back nonzero... any suggestions would be appreciated. I have tried
>variations on the data type of the local variable into which I am


>-- precision 5 would work

>-- precision 6 would not work


>declare cur_stddev cursor for select stdev(value_) from [...]
>open cur_stddev



>close cur_stddev
>deallocate cur_stddev




Sat, 01 Jul 2006 23:26:21 GMT
 Help... float data resulting in inaccurate statistics :(
Quote:
>> Due to the inexactness of the float fields, I am getting back

non-zero standard deviation when it should be zero, and this is a
critical value for the Cpk formula. <<

Ouch!  Any chance of doign this in a stat package which will have the
extra code to correct floating point errors?

--CELKO--
 ===========================
 Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!



Sun, 02 Jul 2006 00:33:19 GMT
 
 [ 3 post ] 

 Relevant Pages 

1. Decimal / float data - Help Help!

2. HELP - MOLAP aggregations are inaccurate

3. Help formating float/integer fields via data module

4. Create View Help - change column data type from Money(16,2) to FLOAT in the view

5. Help with ESQL/C and float data types

6. Help with ESQL/C and float data types

7. Help - Inaccurate calculated fields in reports (currency) More Info

8. Invalid Float values resulting in exception

9. FLOAT yields wrong results

10. Weird results when rounding floats?

11. Retrieve index statistics (estimated results for search term)?

12. Statistics IO results


 
Powered by phpBB® Forum Software