2 Statistical estimation
When we are dealing with large populations (the production of items such as LEDs, light bulbs, piston rings etc.) it is extremely unlikely that we will be able to calculate population parameters such as the mean and variance directly from the full population.
We have to use processes which enable us to estimate these quantities. There are two basic methods, called point estimation and interval estimation. The essential difference is that point estimation gives single numbers which, in the sense defined below, are ‘best estimates’ of population parameters, while interval estimation gives a range of values together with a figure, called the confidence, that the true value of a parameter lies within the calculated range. Such ranges are usually called confidence intervals.
In statistics, the word ‘estimate’ does not mean a guess or something rough-and-ready. It means that an agreed, precise procedure has been (or will be) used to find the required values, and that these values are ‘best values’ in some sense. Often this means that the procedure used, which is called the ‘estimator’, is:
- consistent in the sense that the difference between the true value and the estimate approaches zero as the sample size used to do the calculation increases;
- unbiased in the sense that the expected value of the estimator is equal to the true value;
- efficient in the sense that the variance of the estimator is small.
Expectation is covered in Workbooks 37 and 38. You should note that it is not always possible to find a ‘best’ estimator. You might have to decide (for example) between one which is
consistent, biased and efficient
and one which is
consistent, unbiased and inefficient
when what you really want is one which is
consistent, unbiased and efficient.
2.1 Point estimation
We will look at the point estimation of the mean and variance of a population and use the following notation.
Notation
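Quantity | Population | Sample | Estimator |
Size | $N$ | $n$ | |
Mean | $\mu$ or $E(x)$ | $\bar{x}$ | $\hat{\mu}$ for $\mu$ |
Variance | $\sigma^2$ or $V(x)$ | $s^2$ | $\hat{\sigma}^2$ for $\sigma^2$ |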
Estimating the mean
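This is straightforward. The estimator
$$\hat{\mu} = \bar{x}$$
is a sensible estimate since the difference between the population mean and the sample mean disappears with increasing sample size. We can show that this estimator is unbiased. Symbolically we have:
$$\hat{\mu} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
so that
$$E(\hat{\mu}) = \frac{E(x_1) + E(x_2) + \cdots + E(x_n)}{n} = \frac{\mu + \mu + \cdots + \mu}{n} = \mu.$$
Note that the expected value of $x_1$ is $\mu$, i.e. $E(x_1) = \mu$. Similarly for $x_2, x_3, \ldots, x_n$.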
Estimating the variance
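This is a little more difficult. The true variance of the population is
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
which suggests the estimator, calculated from a sample, should be
$$\hat{\sigma}^2 = \frac{\sum (x - \mu)^2}{n}.$$
However, we do not know the true value of $\mu$, but we do have the estimator $\hat{\mu} = \bar{x}$.
Replacing $\mu$ by the estimator $\hat{\mu} = \bar{x}$ gives
$$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n}.$$
This can be written in the form
$$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \bar{x}^2.$$
Hence
$$E(\hat{\sigma}^2) = \frac{E\left(\sum x^2\right)}{n} - E(\bar{x}^2) = E(x^2) - E(\bar{x}^2).$$
We already have the important result
$$E(x) = E(\bar{x})$$
and
$$V(\bar{x}) = \frac{V(x)}{n}.$$
Using the result $E(x) = E(\bar{x}) = \mu$ gives us
$$E(\hat{\sigma}^2) = E(x^2) - E(\bar{x}^2)$$
$$= E(x^2) - [E(x)]^2 - \left(E(\bar{x}^2) - [E(\bar{x})]^2\right)$$
$$= V(x) - V(\bar{x}) = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2.$$
This estimator is biased: for an unbiased estimator the result should be $\sigma^2$, not $\frac{n-1}{n}\,\sigma^2$.
Fortunately, the remedy is simple: we just multiply by the so-called Bessel’s correction, namely $\frac{n}{n-1}$, and obtain the result
$$\hat{\sigma}^2 = \frac{n}{n-1} \cdot \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum (x - \bar{x})^2}{n-1}.$$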
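The bias of the divide-by-$n$ estimator can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the function name, the sample size of 5, the number of trials and the use of a standard normal population (true variance 1) are all illustrative assumptions. It averages both estimators over many samples and compares the averages with $(n-1)/n$ and 1.

```python
import random

def average_variance_estimates(n, trials, seed=0):
    """Average the divide-by-n and divide-by-(n-1) variance estimates
    over many simulated samples (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    total_biased = 0.0
    total_unbiased = 0.0
    for _ in range(trials):
        # Assumed population: standard normal, so the true variance is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
        total_biased += ss / n
        total_unbiased += ss / (n - 1)
    return total_biased / trials, total_unbiased / trials

biased, unbiased = average_variance_estimates(n=5, trials=20000)
print(f"divide by n:     {biased:.3f}  (theory: {(5 - 1) / 5:.3f})")
print(f"divide by n - 1: {unbiased:.3f}  (theory: 1.000)")
```

With small samples the gap between the two averages is easy to see; it shrinks as $n$ grows, since $(n-1)/n$ approaches 1.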
There are two points to note here. Firstly (and rather obviously), you should not take samples of size 1, since the variance cannot be estimated from such samples. Secondly, you should check the operation of any hand calculators (and spreadsheets!) that you use, to find out exactly what you are calculating when you press the button for standard deviation. You might find that you are calculating either
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
or
$$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n-1}.$$
It is just as well to know which, as the first formula assumes that you are calculating the variance of a population while the second assumes that you are estimating the variance of a population from a random sample of size $n$ taken from that population.
From now on we will assume that we divide by $n - 1$ in the sample variance and we will simply write $s^2$ for $s^2_{n-1}$.
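The same caution applies to software libraries. As a small sketch, Python's standard `statistics` module provides both calculations under different names: `pvariance` divides by the number of values, while `variance` divides by one less. The data values below are made up purely for illustration.

```python
import statistics

# Illustrative values only; they do not come from the text.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var = statistics.pvariance(data)   # treats data as a whole population: divides by n
samp_var = statistics.variance(data)   # treats data as a sample: divides by n - 1

print(f"population variance (divide by n):     {pop_var:.4f}")
print(f"sample variance     (divide by n - 1): {samp_var:.4f}")
```

Other tools make different default choices. NumPy's `var`, for instance, divides by $n$ unless `ddof=1` is passed, and spreadsheets typically offer separate population and sample functions, so it is worth checking the documentation of whatever you use.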
2.2 Interval estimation
We will look at the process of finding interval estimates of the mean and variance of a population and use the notation used above.
Interval estimation for the mean
This interval is commonly called the Confidence Interval for the Mean.
Firstly, we know that the sample mean $\bar{x}$ is a good estimator of the population mean $\mu$. We also know that the calculated mean $\bar{x}$ of a sample of size $n$ is unlikely to be exactly equal to $\mu$. We will now construct an interval around $\bar{x}$ in such a way that we can quantify the confidence that the interval actually contains the population mean $\mu$.
Secondly, we know that for sufficiently large samples taken from a large population, $\bar{x}$ approximately follows a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. Thirdly, looking at the following extract from the normal probability tables,
$z$ | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
1.9 | .4713 | .4719 | .4726 | .4732 | .4738 | .4744 | .4750 | .4756 | .4762 | .4767 |
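we can see that $2 \times 47.5\% = 95\%$ of the values in the standard normal distribution lie within $\pm 1.96$ standard deviations of the mean.
So before we see the data we may say that
$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95.$$
After we see the data we say with 95% confidence that
$$\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}$$
which leads to
$$\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}.$$
This interval is called a 95% confidence interval for the mean $\mu$.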
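One way to see what the 95% figure means is to simulate the procedure many times and count how often the interval, built as the sample mean plus or minus 1.96 standard errors with the population standard deviation known, contains the true mean. The sketch below is illustrative only: the function name and the population values (mean 10, standard deviation 2, samples of size 30) are assumptions, not taken from the text.

```python
import math
import random

def coverage(mu, sigma, n, trials, z=1.96, seed=1):
    """Fraction of simulated known-sigma intervals that contain mu
    (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # 1.96 standard errors
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials

# Assumed population parameters, chosen only for the demonstration.
rate = coverage(mu=10.0, sigma=2.0, n=30, trials=10000)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

The fraction should settle close to 0.95. Any single interval either contains the true mean or does not; the 95% describes the long-run behaviour of the procedure.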
Note that while the 95% level is very commonly used, there is nothing sacrosanct about this level. If we go through the same argument but demand that we be 99% certain that $\mu$ lies within the confidence interval developed, we obtain the interval
$$\bar{x} - 2.58\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 2.58\frac{\sigma}{\sqrt{n}}$$
since an inspection of the standard normal tables reveals that 99% of the values in a standard normal distribution lie within 2.58 standard deviations of the mean.
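The multipliers 1.96 and 2.58 can also be obtained without tables. As a brief sketch, Python's standard library class `statistics.NormalDist` has an `inv_cdf` method that returns the point below which a given proportion of the distribution lies; the helper name `z_critical` is a hypothetical one chosen for this illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided standard normal multiplier for a given confidence level
    (hypothetical helper for illustration)."""
    tail = (1.0 - confidence) / 2.0          # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)  # standard normal: mean 0, sd 1

z95 = z_critical(0.95)
z99 = z_critical(0.99)
print(round(z95, 2), round(z99, 2))  # prints 1.96 2.58
```

Other confidence levels follow the same pattern; a 90% interval, for example, would use `z_critical(0.90)`.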
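The above argument assumes that we know the population variance. In practice this is often not the case and we have to estimate the population variance from a sample. From the work we have seen above, we know that the best estimate of the population variance from a sample of size $n$ is given by the formula
$$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n-1}.$$
It follows that if we do not know the population variance, we must use the estimate $\hat{\sigma}$ in place of $\sigma$. Our 95% and 99% confidence intervals (for large samples) become
$$\bar{x} - 1.96\frac{\hat{\sigma}}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\hat{\sigma}}{\sqrt{n}}$$
and
$$\bar{x} - 2.58\frac{\hat{\sigma}}{\sqrt{n}} \le \mu \le \bar{x} + 2.58\frac{\hat{\sigma}}{\sqrt{n}}$$
where
$$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n-1}.$$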
When we do not know the population variance, we need to estimate it. Hence we need to gauge the confidence we can have in the estimate.
In small samples, when we need to estimate the variance, the values 1.96 and 2.58 need to be replaced by values from the Student’s $t$-distribution. See HELM booklet 41.
Example 1
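After 1000 hours of use the weight loss, in gm, due to wear in certain rollers in machines, is normally distributed with mean $\mu$ and variance $\sigma^2$. Fifty independent observations are taken. (This may be regarded as a “large” sample.) If observation $i$ is $x_i$, the data are summarised by the sums $\sum x_i$ and $\sum x_i^2$.
Estimate $\mu$ and $\sigma^2$ and give a 95% confidence interval for $\mu$.
Solution
We estimate $\mu$ using the sample mean: $\bar{x} = \frac{1}{n}\sum x_i$, with $n = 50$.
We estimate $\sigma^2$ using the sample variance: $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]$.
The estimated standard error of the mean is $\sqrt{s^2/n}$.
The 95% confidence interval for $\mu$ is $\bar{x} \pm 1.96\sqrt{s^2/n}$. That is, $\bar{x} - 1.96\sqrt{s^2/n} < \mu < \bar{x} + 1.96\sqrt{s^2/n}$.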
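When only the sum of the observations and the sum of their squares are available, the sample variance can be computed from those two totals, using the fact that the sum of squared deviations about the mean equals the sum of squares minus the squared sum divided by $n$. The sketch below checks this shortcut against a direct calculation. The function name and the five data values are hypothetical, chosen only to illustrate the arithmetic; they are not the data of this example.

```python
import math
import statistics

def sample_variance_from_sums(n, sum_x, sum_x2):
    """Sample variance (divide by n - 1) from the sum and the sum of squares
    (hypothetical helper for illustration)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Illustrative values only; they are not the observations from the example.
data = [9.1, 10.4, 8.7, 11.2, 10.0]
n = len(data)
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)

s2_shortcut = sample_variance_from_sums(n, sum_x, sum_x2)
s2_direct = statistics.variance(data)  # computed from deviations about the mean
std_error = math.sqrt(s2_shortcut / n)  # estimated standard error of the mean

print(f"shortcut: {s2_shortcut:.6f}, direct: {s2_direct:.6f}")
print(f"estimated standard error of the mean: {std_error:.4f}")
```

The shortcut subtracts two nearly equal numbers when the spread is small relative to the mean, so in floating-point work the direct form is usually the safer choice when the raw data are available.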
Exercises
- The voltages of sixty nominally 10 volt cells are measured. Assuming these to be independent observations from a normal distribution with mean $\mu$ and variance $\sigma^2$, estimate $\mu$ and $\sigma^2$. Regarding this as a “large” sample, find a 99% confidence interval for $\mu$. The data are:
10.3 10.5 9.6 9.7 10.6 9.9 10.1 10.1 9.9 10.5
10.1 10.1 9.9 9.8 10.6 10.0 9.9 10.0 10.3 10.1
10.1 10.3 10.5 9.7 10.1 9.7 9.8 10.3 10.2 10.2
10.1 10.5 10.0 10.0 10.6 10.9 10.1 10.1 9.8 10.7
10.3 10.4 10.4 10.3 10.4 9.9 9.9 10.5 10.0 10.7
10.1 10.6 10.0 10.7 9.8 10.4 10.3 10.0 10.5 10.1
- The natural logarithms of the times in minutes taken to complete a certain task are normally distributed with mean $\mu$ and variance $\sigma^2$. Seventy-five independent observations are taken. (This may be regarded as a “large” sample.) If the natural logarithm of the time for observation $i$ is $x_i$, the data are summarised by the sums $\sum x_i$ and $\sum x_i^2$.
Estimate $\mu$ and $\sigma^2$ and give a 95% confidence interval for $\mu$.
Use your confidence interval to find a 95% confidence interval for the median time to complete the task.
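- From the data, $n = 60$, $\sum x_i = 611.0$ and $\sum x_i^2 = 6227.34$.
We estimate $\mu$ using the sample mean: $\bar{x} = \frac{611.0}{60} = 10.1833$ V.
We estimate $\sigma^2$ using the sample variance: $s^2 = \frac{1}{59}\left[6227.34 - \frac{611.0^2}{60}\right] = 0.0902$ V$^2$.
The estimated standard error of the mean is $\sqrt{s^2/n} = \sqrt{0.0902/60} = 0.0388$ V.
The 99% confidence interval for $\mu$ is $\bar{x} \pm 2.58\sqrt{s^2/n}$. That is, $10.08 < \mu < 10.28$.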
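- We estimate $\mu$ using the sample mean: $\bar{x} = \frac{1}{n}\sum x_i$, with $n = 75$.
We estimate $\sigma^2$ using the sample variance: $s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]$.
The estimated standard error of the mean is $\sqrt{s^2/n}$.
The 95% confidence interval for $\mu$ is $\bar{x} \pm 1.96\sqrt{s^2/n}$. That is, $\bar{x} - 1.96\sqrt{s^2/n} < \mu < \bar{x} + 1.96\sqrt{s^2/n}$.
Because the logarithm of the time is normally distributed with mean $\mu$, the median time is $M = e^{\mu}$. The 95% confidence interval for the median time, in minutes, to complete the task is therefore obtained by exponentiating the end points of the interval for $\mu$.
That is, $e^{\bar{x} - 1.96\sqrt{s^2/n}} < M < e^{\bar{x} + 1.96\sqrt{s^2/n}}$.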
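The sixty cell voltages listed in the first exercise can be run through the large-sample procedure directly. The following sketch uses Python's standard `statistics` module and the multiplier 2.58 for 99% confidence; the variable names are arbitrary choices for this illustration, and the code is one possible way to check the working rather than a prescribed method.

```python
import math
import statistics

# The sixty voltages from the first exercise.
voltages = [
    10.3, 10.5, 9.6, 9.7, 10.6, 9.9, 10.1, 10.1, 9.9, 10.5,
    10.1, 10.1, 9.9, 9.8, 10.6, 10.0, 9.9, 10.0, 10.3, 10.1,
    10.1, 10.3, 10.5, 9.7, 10.1, 9.7, 9.8, 10.3, 10.2, 10.2,
    10.1, 10.5, 10.0, 10.0, 10.6, 10.9, 10.1, 10.1, 9.8, 10.7,
    10.3, 10.4, 10.4, 10.3, 10.4, 9.9, 9.9, 10.5, 10.0, 10.7,
    10.1, 10.6, 10.0, 10.7, 9.8, 10.4, 10.3, 10.0, 10.5, 10.1,
]

n = len(voltages)
mean = statistics.mean(voltages)       # point estimate of the population mean
s2 = statistics.variance(voltages)     # sample variance, dividing by n - 1
std_error = math.sqrt(s2 / n)          # estimated standard error of the mean

z = 2.58                               # large-sample multiplier for 99% confidence
lower = mean - z * std_error
upper = mean + z * std_error

print(f"n = {n}, mean = {mean:.4f}, s^2 = {s2:.6f}, standard error = {std_error:.5f}")
print(f"99% confidence interval: {lower:.2f} < mu < {upper:.2f}")
```

For the second exercise, the same steps applied to the logged times give an interval for the mean of the logarithms, and applying `math.exp` to each end point converts it into an interval for the median time.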