Help... float data resulting in inaccurate statistics :(

Because of the sqrt in the formula for stdev, floats will be used no

matter whether you use decimals or floats to store values, if you use

that builtin function. And I'm not sure that even for decimal values,

AVG() will return the expected answer when all values are identical,

since division is not exact for decimals.

You are seeing different answers because you use SELECT in one place and

PRINT in another. They don't display values in the same way or with the

same precision.

A safe way to proceed is

select

case when min(x) = max(x) or count(x) = 0

then 0 else stdev(x) end

from Data

)

You might think twice about any situation where sigma=0 is meaningful,

since statistics is founded on measuring data, and no non-integer data

measurement should be assumed to exact. Just because all the recordings

are identical shouldn't lead you to the assumption that the underlying

physical data are also identical. They are only identical to the

precision of the measurement.

All that said, though, there may be a bug here. ;)

Here's a repro that gives an unexpected value of the standard

deviation. See if my safe approach helps, and I'll keep looking into this.

create table Data (

x float

)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

insert into Data values (1111777777777)

select stdev(x) as stdev_x from Data

-- The result I get is 14654.295097342621

-- Steve Kass

-- Drew University

-- Ref: 6F22694F-5415-4A17-9B80-2C93BECACC12

Quote:

>I've got a problem stemming from the fact that our company's software

>stores data (measurements) as float. What I am doing is calculating a

>statistic (Cpk) that sometimes uses the standard deviation of the

>data. Due to the inexactness of the float fields, I am getting back

>non-zero standard deviation when it should be zero, and this is a

>critical value for the Cpk formula.

>For example I have four rows in the result set each having the data

>field with a float value of "exactly" 2.3501. The STDEV function

>returns a non-zero value and the VAR function returns a non-zero

>value.

>There is no possibility that the database structure will change. I

>have tried casting the field to a decimal type, which works as long as

>my precision is 5 or less... which is too restrictive for my purposes.

>I am working in a user-defined function here and the rules seem to be

>different, because in Query Analyzer if I directly run a "select

>STDEV(value_)...." I get back zero. But if I paste my code in I get

>back nonzero... any suggestions would be appreciated. I have tried

>variations on the data type of the local variable into which I am

>-- precision 5 would work

>-- precision 6 would not work

>declare cur_stddev cursor for select stdev(value_) from [...]

>open cur_stddev

>close cur_stddev

>deallocate cur_stddev