In class yesterday we were looking at descriptive statistics, and I showed the formulae for calculating the standard deviation of a population and sample. The only real difference (besides notation) is that for the former you divide by the population size, N, whereas for the latter you divide by n-1. Why?
My answer went along the following lines... it's difficult to explain, but should make more sense as we move through the course. In both cases we're not so much dividing by the size of the data set as by the 'Degrees of Freedom', which is the number of independent pieces of data being used. When we move on to a Student's t distribution you'll see that we divide by n-2. In this case, our sample is generating the mean and variance, and therefore we have fewer pieces of information to utilise. Understandably, this left some students dissatisfied.
A quick look on Google (there's no mention of this in the textbook we use, Curwin & Slater) reassures me that this isn't easy to explain (or, for that matter, to understand). Most answers are either: (i) it's complicated; or (ii) it produces an unbiased estimator and is therefore 'better'. Regarding the latter, this is a useful experiment that inductively shows why 'n-1' gives a better estimate than 'n'.
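For anyone who wants to try that kind of experiment themselves, here is a rough sketch in Python. It is my own illustration rather than the linked experiment: the normal population with variance 4, the sample size of 5, the number of trials and the function names are all assumptions I picked for convenience.

```python
import random

def variance(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - ddof)

def average_estimates(n=5, trials=20000, sigma=2.0, seed=1):
    """Draw many small samples and average both variance estimates.

    The population is assumed normal with mean 0 and standard deviation
    `sigma` (an illustrative choice), so the true variance is sigma ** 2.
    """
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_n += variance(sample, ddof=0)          # divide by n
        total_n_minus_1 += variance(sample, ddof=1)  # divide by n-1
    return total_n / trials, total_n_minus_1 / trials

avg_n, avg_n_minus_1 = average_estimates()
print("true variance:            4.0")
print(f"average dividing by n:    {avg_n:.2f}")
print(f"average dividing by n-1:  {avg_n_minus_1:.2f}")
```

With these settings the 'n' average should come out near 3.2, i.e. too small by a factor of (n-1)/n, while the 'n-1' average should sit close to the true value of 4. Note that this shows the variance estimate is unbiased; the standard deviation obtained by taking its square root is still slightly biased.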
If anyone has a relatively succinct explanation, I'd love to hear it.
Update: this is a neat article, which offers:
it is not possible to obtain an estimate of σ from a sample of size n=1, because there is no internal variation of any degree within such a sample. Having n-1 in the denominator reflects this impossibility, and therefore at least n=2 data points are needed if we want to make the formula work
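Both this point and the 'Degrees of Freedom' idea can be checked with a few lines of Python. This is only a small sketch using the standard library's statistics module; the three data values and the helper name sample_sd_or_none are made up for illustration.

```python
import statistics

sample = [4.0, 7.0, 10.0]  # made-up data for illustration
mean = statistics.mean(sample)
deviations = [x - mean for x in sample]

# Deviations from the sample mean always sum to zero, so once n-1 of
# them are known the last one is fixed: only n-1 are free to vary.
last_from_others = -sum(deviations[:-1])
print(deviations, last_from_others)

def sample_sd_or_none(data):
    """Sample standard deviation (n-1 denominator), or None if undefined."""
    try:
        return statistics.stdev(data)
    except statistics.StatisticsError:
        return None  # fewer than two data points

print(sample_sd_or_none(sample))   # n = 3: defined
print(sample_sd_or_none([5.0]))    # n = 1: the n-1 formula has nothing to work with
print(statistics.pstdev([5.0]))    # dividing by n instead reports zero spread
```

With a single observation the divide-by-n formula happily reports a spread of zero, which says nothing about the population, whereas the n-1 version declines to give an answer at all.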

excellent link to the article!
Posted by: Darin Ellis | July 08, 2009 at 12:08 AM
Dear Sir/Madam,
Please correct my understanding, as set out below.
The reason for n-1 is that in a sample of n there are n-1 independent values that can vary.
The question is: why is that one value thought to be unable to vary?
Posted by: Bharat Pandit | August 25, 2009 at 03:16 AM