Statistical estimation

2 Statistical estimation

When we are dealing with large populations (the production of items such as LEDs, light bulbs, piston rings etc.) it is extremely unlikely that we will be able to calculate population parameters such as the mean and variance directly from the full population.

We have to use processes which enable us to estimate these quantities. There are two basic methods, called point estimation and interval estimation. The essential difference is that point estimation gives single numbers which, in the sense defined below, are ‘best estimates’ of population parameters, while interval estimation gives a range of values together with a figure, called the confidence, that the true value of a parameter lies within the calculated range. Such ranges are usually called confidence intervals.

In statistics, the word ‘estimate’ does not mean a rough-and-ready guess. It means that an agreed, precise procedure has been (or will be) used to find the required values, and that these values are ‘best values’ in some sense. Often this means that the procedure used, which is called the ‘estimator’, is:

consistent in the sense that the difference between the true value and the estimate approaches zero as the sample size used to do the calculation increases;
unbiased in the sense that the expected value of the estimator is equal to the true value;
efficient in the sense that the variance of the estimator is small.

Expectation is covered in Workbooks 37 and 38. You should note that it is not always possible to find a ‘best’ estimator. You might have to decide (for example) between one which is

consistent, biased and efficient

and one which is

consistent, unbiased and inefficient

when what you really want is one which is

consistent, unbiased and efficient.

2.1 Point estimation

We will look at the point estimation of the mean and variance of a population and use the following notation.

Notation

	Population	Sample	Estimator
Size	$N$	$n$
Mean	$μ$ or $E (x)$	$\bar{x}$	$\hat{μ}$ for $μ$
Variance	$σ^{2}$ or $V (x)$	$s^{2}$	${\hat{σ}}^{2}$ for $σ^{2}$

Estimating the mean

This is straightforward.

$\hat{μ} = \bar{x}$

is a sensible estimate since the difference between the population mean and the sample mean disappears with increasing sample size. We can show that this estimator is unbiased. Symbolically we have:

$\hat{μ} = \frac{x_{1} + x_{2} + \cdots + x_{n}}{n}$

so that

$\begin{array}{rcl} E (\hat{μ}) & = & \frac{E (x_{1}) + E (x_{2}) + \cdots + E (x_{n})}{n} \\ & = & \frac{E (X) + E (X) + \cdots + E (X)}{n} \\ & = & E (X) \\ & = & μ \end{array}$

Note that the expected value of $x_{1}$ is $E (X)$, i.e. $E (x_{1}) = E (X)$. The same holds for $x_{2}, \ldots, x_{n}$.
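The unbiasedness argument above can be checked exactly on a toy case. The following is a minimal sketch, assuming a hypothetical four-item population and samples of size 2 drawn with replacement (both chosen only for illustration). It averages the sample mean over every equally likely sample and compares the result with the population mean.

```python
from fractions import Fraction
from itertools import product

# Hypothetical small population, chosen only for illustration.
population = [2, 3, 5, 9]
n = 2  # sample size

mu = Fraction(sum(population), len(population))

# Every ordered sample of size n drawn with replacement is equally likely,
# so E(x-bar) is the plain average of the sample means over all such samples.
samples = list(product(population, repeat=n))
expected_mean = sum(Fraction(sum(s), n) for s in samples) / len(samples)

print(mu, expected_mean)  # 19/4 19/4
```

Exact fractions are used so that the two values can be compared without rounding error.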

Estimating the variance

This is a little more difficult. The true variance of the population is $σ^{2} = \frac{\sum {(x - μ)}^{2}}{N}$ which suggests the estimator, calculated from a sample, should be ${\hat{σ}}^{2} = \frac{\sum {(x - μ)}^{2}}{n}$ .

However, we do not know the true value of $μ$ , but we do have the estimator $\hat{μ} = \bar{x}$ .

Replacing $μ$ by the estimator $\hat{μ} = \bar{x}$ gives

${\hat{σ}}^{2} = \frac{\sum {(x - \bar{x})}^{2}}{n}$

This can be written in the form

${\hat{σ}}^{2} = \frac{\sum {(x - \bar{x})}^{2}}{n} = \frac{\sum x^{2}}{n} - {(\bar{x})}^{2}$

Hence

$E ({\hat{σ}}^{2}) = \frac{E (\sum x^{2})}{n} - E \{{(\bar{x})}^{2}\} = E (x^{2}) - E \{{(\bar{x})}^{2}\}$

We already have the important result

$E (x) = E (\bar{x})$ and $V (\bar{x}) = \frac{V (x)}{n}$

Using the result $E (x) = E (\bar{x})$ gives us

$\begin{array}{rcl} E ({\hat{σ}}^{2}) & = & E (x^{2}) - E \{{(\bar{x})}^{2}\} \\ & = & E (x^{2}) - {\{E (x)\}}^{2} - E \{{(\bar{x})}^{2}\} + {\{E (\bar{x})\}}^{2} \\ & = & E (x^{2}) - {\{E (x)\}}^{2} - \left(E \{{(\bar{x})}^{2}\} - {\{E (\bar{x})\}}^{2}\right) \\ & = & V (x) - V (\bar{x}) \\ & = & σ^{2} - \frac{σ^{2}}{n} \\ & = & \frac{n - 1}{n} σ^{2} \end{array}$

This estimator is biased: for an unbiased estimator the expected value should be $σ^{2}$, not $\frac{n - 1}{n} σ^{2}$.

Fortunately, the remedy is simple. We multiply by the so-called Bessel’s correction, namely $\frac{n}{n - 1}$, and obtain the result

${\hat{σ}}^{2} = \frac{n}{n - 1} \frac{\sum {(x - \bar{x})}^{2}}{n} = \frac{\sum {(x - \bar{x})}^{2}}{n - 1}$
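The effect of dividing by $n$ rather than $n - 1$ can be verified exactly on a small case. This is a sketch only, assuming a hypothetical four-item population and samples of size 3 drawn with replacement, so that the observations are independent as in the derivation. It averages each version of the estimator over all equally likely samples.

```python
from fractions import Fraction
from itertools import product

population = [2, 3, 5, 9]  # hypothetical population, for illustration only
n = 3                      # sample size
N = len(population)

mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # true population variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# All ordered samples of size n, drawn with replacement, are equally likely.
samples = list(product(population, repeat=n))
mean_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_div_n1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, mean_div_n, mean_div_n1)  # 115/16 115/24 115/16
```

Here the divide-by-$n$ version averages to $\frac{2}{3} σ^{2}$, which is $\frac{n - 1}{n} σ^{2}$ for $n = 3$, while the corrected version averages to $σ^{2}$ itself.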

There are two points to note here. Firstly (and rather obviously) you should not take samples of size 1 since the variance cannot be estimated from such samples. Secondly, you should check the operation of any hand calculators (and spreadsheets!) that you use to find out exactly what you are calculating when you press the button for standard deviation. You might find that you are calculating either

$σ^{2} = \frac{\sum {(x - μ)}^{2}}{N}$ or ${\hat{σ}}^{2} = \frac{\sum {(x - \bar{x})}^{2}}{n - 1}$

It is just as well to know which, as the first formula assumes that you are calculating the variance of a population while the second assumes that you are estimating the variance of a population from a random sample of size $n$ taken from that population.
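Software libraries make the same distinction as calculators do. As an illustration, Python's standard `statistics` module offers both versions under different names: `pvariance` divides by the number of values supplied, and `variance` divides by one less. The data below are hypothetical readings invented for this sketch.

```python
import statistics

data = [9.6, 9.9, 10.1, 10.3, 10.6]  # hypothetical readings, for illustration
n = len(data)

pop_var = statistics.pvariance(data)    # treats data as the whole population: divisor n
sample_var = statistics.variance(data)  # treats data as a sample: divisor n - 1

# The two results differ exactly by the factor (n - 1) / n.
print(pop_var, sample_var, sample_var * (n - 1) / n)
```

Other tools use other defaults, so it is worth checking the documentation of whichever one you use.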

From now on we will assume that we divide by $n - 1$ in the sample variance and we will simply write $s^{2}$ for $s_{n - 1}^{2} .$

2.2 Interval estimation

We will look at the process of finding interval estimates of the mean and variance of a population, using the notation introduced above.

Interval estimation for the mean

This interval is commonly called the Confidence Interval for the Mean.

Firstly, we know that the sample mean $\bar{x} = \frac{x_{1} + x_{2} + \cdots + x_{n}}{n}$ is a good estimator of the population mean $μ$. We also know that the calculated mean $\bar{x}$ of a sample of size $n$ is unlikely to be exactly equal to $μ$. We will now construct an interval around $\bar{x}$ in such a way that we can quantify the confidence that the interval actually contains the population mean $μ$.

Secondly, we know that for sufficiently large samples taken from a large population, $\bar{x}$ follows a normal distribution with mean $μ$ and standard deviation $\frac{σ}{\sqrt{n}}$ . Thirdly, looking at the following extract from the normal probability tables,

$Z = \frac{X - μ}{σ}$	0.00	0.01	0.02	0.03	0.04	0.05	0.06	0.07	0.08	0.09
1.9	.4713	.4719	.4726	.4732	.4738	.4744	.4750	.4756	.4762	.4767

we can see that the area under the standard normal curve between $Z = 0$ and $Z = 1.96$ is 0.4750. Hence $2 \times 47.5\% = 95\%$ of the values in the standard normal distribution lie within 1.96 standard deviations of the mean.

So before we see the data we may say that

$P (μ - 1.96 \frac{σ}{\sqrt{n}} \leq \bar{x} \leq μ + 1.96 \frac{σ}{\sqrt{n}}) = 0.95$

After we see the data we say with $95\%$ confidence that

$μ - 1.96 \frac{σ}{\sqrt{n}} \leq \bar{x} \leq μ + 1.96 \frac{σ}{\sqrt{n}}$

which leads to

$\bar{x} - 1.96 \frac{σ}{\sqrt{n}} \leq μ \leq \bar{x} + 1.96 \frac{σ}{\sqrt{n}}$

This interval is called a 95% confidence interval for the mean $μ$ .

Note that while the $95\%$ level is very commonly used, there is nothing sacrosanct about this level. If we go through the same argument but require $99\%$ confidence that $μ$ lies within the interval, we obtain

$\bar{x} - 2.58 \frac{σ}{\sqrt{n}} \leq μ \leq \bar{x} + 2.58 \frac{σ}{\sqrt{n}}$

since an inspection of the standard normal tables reveals that $99\%$ of the values in a standard normal distribution lie within 2.58 standard deviations of the mean.
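The multipliers 1.96 and 2.58 can also be obtained without tables. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8). The helper name `z_value` is an assumption made for this illustration.

```python
from statistics import NormalDist

def z_value(confidence):
    """Two-sided standard normal multiplier for a given confidence level.

    For 95% confidence, 2.5% is left in each tail, so we need the point
    below which 97.5% of the standard normal distribution lies.
    """
    return NormalDist().inv_cdf(0.5 + confidence / 2)

print(round(z_value(0.95), 2))  # 1.96
print(round(z_value(0.99), 2))  # 2.58
```

The same function gives the multiplier for any other level, for example `z_value(0.90)`.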

The above argument assumes that we know the population variance. In practice this is often not the case and we have to estimate the population variance from a sample. From the work above, we know that the best estimate of the population variance from a sample of size $n$ is given by the formula

${\hat{σ}}^{2} = \frac{\sum {(x - \bar{x})}^{2}}{n - 1}$

It follows that if we do not know the population variance, we must use the estimate $\hat{σ}$ in place of $σ$ . Our 95% and 99% confidence intervals (for large samples) become

$\bar{x} - 1.96 \frac{\hat{σ}}{\sqrt{n}} \leq μ \leq \bar{x} + 1.96 \frac{\hat{σ}}{\sqrt{n}}$ and $\bar{x} - 2.58 \frac{\hat{σ}}{\sqrt{n}} \leq μ \leq \bar{x} + 2.58 \frac{\hat{σ}}{\sqrt{n}}$

where

${\hat{σ}}^{2} = \frac{\sum {(x - \bar{x})}^{2}}{n - 1}$

When we do not know the population variance, we need to estimate it. Hence we need to gauge the confidence we can have in the estimate.

In small samples, when we need to estimate the variance, the values 1.96 and 2.58 need to be replaced by values from the Student’s $t$ -distribution. See HELM booklet 41.

Example 1

After 1000 hours of use the weight loss, in gm, due to wear in certain rollers in machines, is normally distributed with mean $μ$ and variance $σ^{2} .$ Fifty independent observations are taken. (This may be regarded as a “large” sample.) If observation $i$ is $y_{i},$ then $\sum_{i = 1}^{50} y_{i} = 497.2$ and $\sum_{i = 1}^{50} y_{i}^{2} = 5473.58 .$

Estimate $μ$ and $σ^{2}$ and give a 95% confidence interval for $μ .$

Solution

We estimate $μ$ using the sample mean: $ȳ = \frac{\sum y_{i}}{n} = \frac{497.2}{50} = 9.944 gm$

We estimate $σ^{2}$ using the sample variance:

$\begin{array}{rcl} s^{2} & = & \frac{1}{n - 1} \sum {(y_{i} - ȳ)}^{2} = \frac{1}{n - 1} \{\sum y_{i}^{2} - \frac{1}{n} {[\sum y_{i}]}^{2}\} \\ & = & \frac{1}{49} \{5473.58 - \frac{1}{50} 497.2^{2}\} = 10.8046 \; {gm}^{2} \end{array}$

The estimated standard error of the mean is $\sqrt{\frac{s^{2}}{n}} = \sqrt{\frac{10.8046}{50}} = 0.4649 gm$

The 95% confidence interval for $μ$ is $ȳ \pm 1.96 \sqrt{\frac{s^{2}}{n}} .$ Since $1.96 \times 0.4649 = 0.911,$ this is $9.944 \pm 0.911.$ That is $9.033 < μ < 10.855$
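The arithmetic of Example 1 can be reproduced from the summary statistics alone. The following is a minimal sketch. The function name `mean_ci_from_sums` and its parameters are assumptions made for this illustration, and the large-sample normal multiplier is passed in as `z`.

```python
from math import sqrt

def mean_ci_from_sums(n, sum_y, sum_y2, z=1.96):
    """Large-sample confidence interval for the mean from n, sum(y) and sum(y^2)."""
    ybar = sum_y / n
    s2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)  # sample variance, divisor n - 1
    se = sqrt(s2 / n)                         # estimated standard error of the mean
    return ybar, s2, se, (ybar - z * se, ybar + z * se)

# Summary statistics quoted in Example 1.
ybar, s2, se, (lo, hi) = mean_ci_from_sums(50, 497.2, 5473.58)
print(round(ybar, 3), round(s2, 4), round(se, 4))  # 9.944 10.8046 0.4649
print(round(lo, 3), round(hi, 3))                  # 9.033 10.855
```

Passing `z=2.58` instead gives the corresponding 99% interval.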

Exercises

The voltages of sixty nominally 10 volt cells are measured. Assuming these to be independent observations from a normal distribution with mean $μ$ and variance $σ^{2},$ estimate $μ$ and $σ^{2} .$ Regarding this as a “large” sample, find a 99% confidence interval for $μ .$

The data are:

10.3	10.5	9.6	9.7	10.6	9.9	10.1	10.1	9.9	10.5
10.1	10.1	9.9	9.8	10.6	10.0	9.9	10.0	10.3	10.1
10.1	10.3	10.5	9.7	10.1	9.7	9.8	10.3	10.2	10.2
10.1	10.5	10.0	10.0	10.6	10.9	10.1	10.1	9.8	10.7
10.3	10.4	10.4	10.3	10.4	9.9	9.9	10.5	10.0	10.7
10.1	10.6	10.0	10.7	9.8	10.4	10.3	10.0	10.5	10.1

The natural logarithms of the times in minutes taken to complete a certain task are normally distributed with mean $μ$ and variance $σ^{2} .$ Seventy-five independent observations are taken. (This may be regarded as a “large” sample.) If the natural logarithm of the time for observation $i$ is $y_{i},$ then $\sum y_{i} = 147.75$ and $\sum y_{i}^{2} = 292.8175 .$
Estimate $μ$ and $σ^{2}$ and give a 95% confidence interval for $μ .$

Use your confidence interval to find a 95% confidence interval for the median time to complete the task.

Answers

For the sixty cell voltages, $\sum y_{i} = 611.0,$ $\sum y_{i}^{2} = 6227.34$ and $n = 60.$ We estimate $μ$ using the sample mean:
$ȳ = \frac{\sum y_{i}}{n} = \frac{611.0}{60} = 10.1833 V$

We estimate $σ^{2}$ using the sample variance:
$\begin{array}{rcl} s^{2} & = & \frac{1}{n - 1} \sum {(y_{i} - ȳ)}^{2} = \frac{1}{n - 1} \{\sum y_{i}^{2} - \frac{1}{n} {[\sum y_{i}]}^{2}\} \\ & = & \frac{1}{59} \{6227.34 - \frac{1}{60} 611.0^{2}\} = 0.090226 \; V^{2} \end{array}$
The estimated standard error of the mean is

$\sqrt{\frac{s^{2}}{n}} = \sqrt{\frac{0.090226}{60}} = 0.03878 V$

The 99% confidence interval for $μ$ is $ȳ \pm 2.58 \sqrt{s^{2} / n} .$ That is

$10.08 < μ < 10.28$
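The same results can be obtained directly from the sixty voltages listed in the exercise. This is a sketch using Python's standard `statistics` module, where `variance` uses the divisor $n - 1$. The variable names are chosen for this illustration.

```python
import statistics
from math import sqrt

# The sixty cell voltages from the exercise, row by row.
voltages = [
    10.3, 10.5, 9.6, 9.7, 10.6, 9.9, 10.1, 10.1, 9.9, 10.5,
    10.1, 10.1, 9.9, 9.8, 10.6, 10.0, 9.9, 10.0, 10.3, 10.1,
    10.1, 10.3, 10.5, 9.7, 10.1, 9.7, 9.8, 10.3, 10.2, 10.2,
    10.1, 10.5, 10.0, 10.0, 10.6, 10.9, 10.1, 10.1, 9.8, 10.7,
    10.3, 10.4, 10.4, 10.3, 10.4, 9.9, 9.9, 10.5, 10.0, 10.7,
    10.1, 10.6, 10.0, 10.7, 9.8, 10.4, 10.3, 10.0, 10.5, 10.1,
]

n = len(voltages)
ybar = statistics.mean(voltages)
s2 = statistics.variance(voltages)  # sample variance, divisor n - 1
se = sqrt(s2 / n)                   # estimated standard error of the mean

# Large-sample 99% interval, using the multiplier 2.58 from the text.
lo, hi = ybar - 2.58 * se, ybar + 2.58 * se

print(n, round(ybar, 4), round(s2, 6), round(se, 5))  # 60 10.1833 0.090226 0.03878
print(round(lo, 2), round(hi, 2))                     # 10.08 10.28
```
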

For the logarithms of the task times, we estimate $μ$ using the sample mean:
$ȳ = \frac{\sum y_{i}}{n} = \frac{147.75}{75} = 1.97$

We estimate $σ^{2}$ using the sample variance:
$\begin{array}{rcl} s^{2} & = & \frac{1}{n - 1} \sum {(y_{i} - ȳ)}^{2} = \frac{1}{n - 1} \{\sum y_{i}^{2} - \frac{1}{n} {[\sum y_{i}]}^{2}\} \\ & = & \frac{1}{74} \{292.8175 - \frac{1}{75} 147.75^{2}\} = 0.02365 \end{array}$
The estimated standard error of the mean is

$\sqrt{\frac{s^{2}}{n}} = \sqrt{\frac{0.02365}{75}} = 0.01776$

The 95% confidence interval for $μ$ is $ȳ \pm 1.96 \sqrt{s^{2} / n} .$ That is

$1.935 < μ < 2.005$

The 95% confidence interval for the median time, in minutes, to complete the task is

$e^{1.935} < M < e^{2.005}$

That is

$6.93 < M < 7.42$
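The step from the interval for $μ$ to the interval for the median can be sketched in code. Because the logarithm of the time is normally distributed with mean $μ$, the median time is $e^{μ}$, and since the exponential function is increasing, applying it to both endpoints preserves the interval. The variable names below are chosen for this illustration.

```python
from math import exp, sqrt

# Summary statistics of the log times, as quoted in the exercise.
n, sum_y, sum_y2 = 75, 147.75, 292.8175

ybar = sum_y / n
s2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)  # sample variance, divisor n - 1
half_width = 1.96 * sqrt(s2 / n)          # 1.96 estimated standard errors

# exp is increasing, so it maps the interval for mu onto an interval
# for exp(mu), the median of the task times in minutes.
median_lo = exp(ybar - half_width)
median_hi = exp(ybar + half_width)

print(round(median_lo, 2), round(median_hi, 2))  # 6.93 7.42
```

The endpoints are exponentiated before rounding, which is why they agree with the figures quoted above rather than with $e^{1.935}$ and $e^{2.005}$ evaluated from the rounded limits.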
