1 Regression
As we have already noted, relationship(s) between variables are of interest to engineers who may wish to determine the degree of association existing between independent and dependent variables. Knowing this often helps engineers to make predictions and, on this basis, to forecast and plan. Essentially, regression analysis provides a sound knowledge base from which accurate estimates of the values of a dependent variable may be made once the values of related independent variables are known.
It is worth noting that in practice the choice of independent variable(s) may be made by the engineer on the basis of experience and/or prior knowledge, since this may indicate which independent variables are likely to have a substantial influence on the dependent variable. In summary, we may state that the principal objectives of regression analysis are:
- to enable accurate estimates of the values of a dependent variable to be made from known values of a set of independent variables;
- to enable estimates to be made of the errors resulting from the use of a regression line as a basis for prediction.
Note that if a regression line is represented as $y = f(x)$ where $x$ is the independent variable, then the actual function used (linear, quadratic, higher degree polynomial etc.) may be obtained via the use of a theoretical analysis or perhaps a scatter diagram (see below) of some real data. Note that a regression line represented as $y = f(x)$ is called a regression line of $y$ on $x$.
1.1 Scatter diagrams
A useful first step in establishing the degree of association between two variables is the plotting of a scatter diagram. Examples of pairs of measurements which an engineer might plot are:
- volume and pressure;
- acceleration and tyre wear;
- current and magnetic field;
- torsion strength of an alloy and purity.
If there exists a relationship between measured variables, it can take many forms. Even though an outline introduction to non-linear regression is given at the end of this Workbook, we shall focus on the linear relationship only.
In order to produce a good scatter diagram you should follow the steps given below:
- Give the diagram a clear title and indicate exactly what information is being displayed;
- Choose and clearly mark the axes;
- Choose carefully and clearly mark the scales on the axes;
- Indicate the source of the data.
Examples of scatter diagrams are shown below.
- Figure 1 shows an association which follows a curve, possibly exponential, quadratic or cubic;
- Figure 2 shows a reasonable degree of linear association where the points of the scatter diagram lie in an area surrounding a straight line;
- Figure 3 represents a randomly placed set of points and no linear association is present between the variables.
Note that in Figure 2, the word ‘reasonable’ is not defined and that while points ‘close’ to the indicated straight line may be explained by random variation, those ‘far away’ may be due to assignable variation.
The rest of this Section will deal with linear association only although it is worth noting that techniques do exist for transforming many non-linear relationships into linear ones. We shall investigate linear association in two ways, firstly by using educated guesswork to obtain a regression line 'by eye' and secondly by using the well-known technique called the method of least squares.
1.2 Regression lines by eye
Note that at a very simple level, we may look at the data and, using an ‘educated guess’, draw a line of regression ‘by eye’ through a set of points. However, finding a regression line by eye is unsatisfactory as a general statistical method since it involves guess-work in drawing the line with the associated errors in any results obtained. The guess-work can be removed by the method of least squares in which the equation of a regression line is calculated using data. Essentially, we calculate the equation of the regression line by minimising the sum of the squared vertical distances between the data points and the line.
1.3 The method of least squares - an elementary view
We assume that an experiment has been performed which has resulted in $n$ pairs of values, say $(x_1, y_1), (x_2, y_2), \dots , (x_n, y_n)$, and that these results have been checked for approximate linearity on the scatter diagram given below.
Figure 4
The vertical distances of each point from the line $y = a + bx$ are easily calculated as

$y_i - a - bx_i \qquad i = 1, 2, \dots , n$
These distances are squared to guarantee that they are positive and calculus is used to minimise the sum of the squared distances. Effectively we are minimizing the sum of a two-variable expression and need to use partial differentiation. If you wish to follow this up and look in more detail at the technique, any good book (engineering or mathematics) containing sections on multi-variable calculus should suffice. We will not look at the details of the calculations here but simply note that the process results in two equations in the two unknowns $a$ and $b$ being formed. These equations are:
(i) $\quad \sum x_i y_i - a\sum x_i - b\sum x_i^2 = 0$

and

(ii) $\quad \sum y_i - na - b\sum x_i = 0$
The second of these equations (ii) immediately gives a useful result. Rearranging the equation we get

$\dfrac{\sum y_i}{n} = a + b\,\dfrac{\sum x_i}{n}$

or, put more simply

$\bar{y} = a + b\bar{x}$

where $(\bar{x}, \bar{y})$ is the mean of the array of data points $(x_i, y_i)$.
This shows that the mean of the array always lies on the regression line. Since the mean is easily calculated, the result forms a useful check for a plotted regression line. Ensure that any regression line you draw passes through the mean of the array of data points.
Eliminating $a$ from the equations gives a formula for the gradient $b$ of the regression line, this is:

$b = \dfrac{\dfrac{\sum x_i y_i}{n} - \dfrac{\sum x_i}{n}\,\dfrac{\sum y_i}{n}}{\dfrac{\sum x_i^2}{n} - \left(\dfrac{\sum x_i}{n}\right)^2}$

often written as

$b = \dfrac{\dfrac{\sum x_i y_i}{n} - \bar{x}\bar{y}}{\dfrac{\sum x_i^2}{n} - \bar{x}^2}$

The quantity $\dfrac{\sum x_i^2}{n} - \bar{x}^2$ is, of course, the variance of the $x$-values. The quantity $\dfrac{\sum x_i y_i}{n} - \bar{x}\bar{y}$ is known as the covariance (of $x$ and $y$) and will appear again later in this Workbook when we measure the degree of linear association between two variables.

Knowing the value of $b$ enables us to obtain the value of $a$ from the equation

$a = \bar{y} - b\bar{x}$
Key Point 1
Least Squares Regression - $y$ on $x$

The least squares regression line of $y$ on $x$ has the equation $y = a + bx$, where

$b = \dfrac{\sum x_i y_i - \dfrac{1}{n}\sum x_i \sum y_i}{\sum x_i^2 - \dfrac{1}{n}\left(\sum x_i\right)^2}$

and $a$ is given by the equation

$a = \bar{y} - b\bar{x}$
It should be noted that the coefficients $a$ and $b$ obtained here give us the regression line of $y$ on $x$. This line is used to predict $y$ values given $x$ values. If we need to predict the values of $x$ from given values of $y$ we need the regression line of $x$ on $y$. The two lines are not the same except in the (very) special case where all of the points lie exactly on a straight line. It is worth noting however, that the two lines cross at the point $(\bar{x}, \bar{y})$. It can be shown that the regression line of $x$ on $y$ is given by Key Point 2:

Key Point 2

Least Squares Regression - $x$ on $y$

The least squares regression line of $x$ on $y$ has the equation $x = a' + b'y$, where

$b' = \dfrac{\sum x_i y_i - \dfrac{1}{n}\sum x_i \sum y_i}{\sum y_i^2 - \dfrac{1}{n}\left(\sum y_i\right)^2}$

and $a'$ is given by the equation

$a' = \bar{x} - b'\bar{y}$
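The formulae in Key Point 1 translate directly into a few lines of code. Below is a minimal Python sketch (the function name `regression_line` is our own, not a library routine) which computes $b$ and then $a$ from the raw sums:

```python
def regression_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x (Key Point 1)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Gradient b from Key Point 1
    b = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # Intercept a: the line must pass through the mean point (x-bar, y-bar)
    a = sum_y / n - b * sum_x / n
    return a, b
```

Swapping the roles of `x` and `y` in a call to this function gives the regression line of $x$ on $y$ of Key Point 2.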
Example 1
A warehouse manager of a company dealing in large quantities of steel cable needs to be able to estimate how much cable is left on his partially used drums. A random sample of twelve partially used drums is taken and each drum is weighed and the corresponding length of cable measured. The results are given in the table below:
Weight of drum and cable ($x$) kg | Measured length of cable ($y$) m |
30 | 70 |
40 | 90 |
40 | 100 |
50 | 120 |
50 | 130 |
50 | 150 |
60 | 160 |
70 | 190 |
70 | 200 |
80 | 200 |
80 | 220 |
80 | 230 |
Find the least squares regression line in the form $y = a + bx$ and use it to predict the lengths of cable left on drums whose weights are:
(i) 35 kg (ii) 85 kg (iii) 100 kg
In the latter case state any assumptions which you make in order to find the length of cable left on the drum.
Solution
Excel calculations give $\sum x_i = 700$, $\sum x_i^2 = 44200$, $\sum y_i = 1860$ and $\sum x_i y_i = 118600$, so that the formulae

$b = \dfrac{\sum x_i y_i - \dfrac{1}{n}\sum x_i \sum y_i}{\sum x_i^2 - \dfrac{1}{n}\left(\sum x_i\right)^2}$

and

$a = \bar{y} - b\bar{x}$

give $b = 3$ and $a = -20$. Our regression line is $y = a + bx$, so $y = 3x - 20$.
Hence, the required predicted values are:

$y(35) = 3 \times 35 - 20 = 85 \qquad y(85) = 3 \times 85 - 20 = 235 \qquad y(100) = 3 \times 100 - 20 = 280$

all results being in metres.
To obtain the last result we have assumed that the linearity of the relationship continues beyond the range of values actually taken.
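As a quick check on the Excel figures, the drum and cable data can be run through the `regression_line` sketch given after Key Point 2:

```python
# Drum and cable data from Example 1
weights = [30, 40, 40, 50, 50, 50, 60, 70, 70, 80, 80, 80]
lengths = [70, 90, 100, 120, 130, 150, 160, 190, 200, 200, 220, 230]

a, b = regression_line(weights, lengths)
print(a, b)                    # -20.0 3.0, i.e. y = 3x - 20
for w in (35, 85, 100):
    print(w, a + b * w)        # 85.0, 235.0 and 280.0 metres
```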
Task!
An article in the Journal of Sound and Vibration 1991 (151) explored a possible relationship between hypertension (defined as blood pressure rise in mm of mercury) and exposure to noise levels (measured in decibels). Some of the data given is as follows:
Noise Level ($x$) | Blood pressure rise ($y$) | Noise Level ($x$) | Blood pressure rise ($y$) |
60 | 1 | 85 | 5 |
63 | 0 | 89 | 4 |
65 | 1 | 90 | 6 |
70 | 2 | 90 | 8 |
70 | 5 | 90 | 4 |
70 | 1 | 90 | 5 |
80 | 4 | 94 | 7 |
90 | 6 | 100 | 9 |
80 | 2 | 100 | 7 |
80 | 3 | 100 | 6 |
- Draw a scatter diagram of the data.
- Comment on whether a linear model is appropriate for the data.
- Calculate a line of best fit of $y$ on $x$ for the data given.
- Use your regression line to predict the expected rise in blood pressure for exposure to a noise level of 97 decibels.
- Entering the data into Microsoft Excel and plotting gives
Blood Pressure increase versus recorded sound level
- A linear model is appropriate.
- Excel calculations give $\sum x_i = 1656$, $\sum x_i^2 = 140176$, $\sum y_i = 86$ and $\sum x_i y_i = 7654$, so that $b \approx 0.1743$ and $a \approx -10.1315$. Our regression line is $y = 0.1743x - 10.1315$.
- The predicted value is: $y = 0.1743 \times 97 - 10.1315 \approx 6.8$ mm mercury.
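Equivalently, a library routine does the arithmetic; the following sketch (assuming NumPy is available) uses `numpy.polyfit`, which for degree 1 performs exactly this least squares fit:

```python
import numpy as np

noise = [60, 63, 65, 70, 70, 70, 80, 90, 80, 80,
         85, 89, 90, 90, 90, 90, 94, 100, 100, 100]
rise  = [1, 0, 1, 2, 5, 1, 4, 6, 2, 3,
         5, 4, 6, 8, 4, 5, 7, 9, 7, 6]

b, a = np.polyfit(noise, rise, 1)   # degree-1 fit returns (slope, intercept)
print(round(b, 4), round(a, 4))     # 0.1743 -10.1315
print(round(a + b * 97, 2))         # predicted rise at 97 dB: about 6.8
```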
1.4 The method of least squares - a modelling view
We take the dependent variable $Y$ to be a random variable whose value, for a fixed value of $x$, depends on the value of $x$ and a random error component, say $e$, and we write

$Y = \alpha + \beta x + e$

Adopting the notation of conditional probability, we are looking for the expected value of $Y$ for a given value of $x$. The expected value of $Y$ for a given value of $x$ is denoted by

$E(Y \mid x)$

The variance of $Y$ for a given value of $x$ is given by the relationship

$V(Y \mid x) = V(\alpha + \beta x + e) = V(\alpha + \beta x) + V(e)$, assuming independence.

If $\mu_{Y|x}$ represents the true mean value of $Y$ for a given value of $x$ then

$\mu_{Y|x} = \alpha + \beta x$, assuming a linear relationship holds,

is a straight line of mean values. If we now assume that the errors $e$ are distributed with mean 0 and variance $\sigma^2$ we may write

$E(Y \mid x) = E(\alpha + \beta x + e) = \alpha + \beta x$ since $E(e) = 0$,

and

$V(Y \mid x) = V(\alpha + \beta x + e) = V(e) = \sigma^2$ since $V(\alpha + \beta x) = 0$.

This implies that for each value of $x$, $Y$ is distributed with mean $\alpha + \beta x$ and variance $\sigma^2$. Hence when the variance $\sigma^2$ is small the observed values of $Y$ will be close to the regression line and when the variance is large, at least some of the observed values of $Y$ may not be close to the line. Note that the assumption that the errors $e$ are distributed with mean 0 may be made without loss of generality. If the errors had any other mean, we could subtract it from the errors and then add that mean to the value of $\alpha$. The ideas are illustrated in the following diagram.
Figure 5
The regression line is shown passing through the means of the distributions for the individual values of $x$. The value of $Y$ corresponding to the $x$-value $x_i$ can be represented by the equation

$Y_i = \alpha + \beta x_i + e_i$

where $e_i$ is the error of the observed value of $Y_i$, that is the difference from its expected value, namely

$E(Y_i \mid x_i) = \mu_{Y|x_i} = \alpha + \beta x_i$

Now, if we estimate $\alpha$ and $\beta$ with $a$ and $b$, the residual $\hat{e}_i$, or estimated error, becomes

$\hat{e}_i = y_i - a - bx_i$

so that the sum of the squares of the residuals is given by

$SS_E = \sum \hat{e}_i^2 = \sum (y_i - a - bx_i)^2$

and we may minimize the quantity $SS_E$ by using the method of least squares as before. The mathematical details are omitted as before and the equations obtained for $a$ and $b$ are as before, namely

$b = \dfrac{\sum x_i y_i - \dfrac{1}{n}\sum x_i \sum y_i}{\sum x_i^2 - \dfrac{1}{n}\left(\sum x_i\right)^2}$ and $a = \bar{y} - b\bar{x}$.
Note that since the error $e_i$ in the $i$th observation essentially describes the error in the fit of the model to the $i$th observation, the sum of the squares of the errors $SS_E$ will now be used to allow us to comment on the adequacy of fit of a linear model to a given data set.
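The modelling assumptions above are easily explored by simulation. The following sketch (all parameter values here are our own choices for illustration) generates data from $Y = \alpha + \beta x + e$ with normally distributed errors and confirms that least squares recovers $\alpha$ and $\beta$, more closely as $\sigma^2$ shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, sigma = -20.0, 3.0, 10.0           # 'true' parameters (illustrative)

x = np.repeat(np.arange(30.0, 90.0, 10.0), 50)  # several observations per x-value
e = rng.normal(0.0, sigma, size=x.size)         # errors: mean 0, variance sigma**2
y = alpha + beta * x + e                        # the model Y = alpha + beta*x + e

b, a = np.polyfit(x, y, 1)                      # least squares estimates
print(a, b)                                     # close to -20 and 3
```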
1.5 Adequacy of fit
We now know that the variance $\sigma^2$ is the key to describing the adequacy of fit of our simple linear model. In general, the smaller the variance, the better the fit, although you should note that it is wise to distinguish between 'poor fit' and a large error variance. Poor fit may suggest, for example, that the relationship is not in fact linear and that a fundamental assumption made has been violated. A large value of $\sigma^2$ does not necessarily mean that a linear model is a poor fit.
It can be shown that the sum of the squares of the errors, say $SS_E$, can be used to give an unbiased estimator of $\sigma^2$ via the formula

$\hat{\sigma}^2 = \dfrac{SS_E}{n - p - 1}$

where $p$ is the number of independent variables used in the regression equation. In the case of simple linear regression $n - p - 1 = n - 2$, since we are estimating just the two parameters $a$ and $b$, and the estimator becomes:

$\hat{\sigma}^2 = \dfrac{SS_E}{n - 2}$
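For the drum and cable data, for example, the spreadsheet later in this Section gives $SS_E = 1200$ with $n = 12$, so $\hat{\sigma}^2 = 1200/10 = 120$. As a one-line sketch:

```python
def sigma_hat_squared(sse, n):
    """Unbiased estimate of the error variance in simple linear regression."""
    return sse / (n - 2)

print(sigma_hat_squared(1200, 12))   # 120.0 for the drum and cable data
```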
The quantity $\hat{\sigma}^2$ is usually used explicitly in formulae whose purpose is to determine the adequacy of a linear model to explain the variability found in data. Two ways in which the adequacy of a regression model may be judged are given by the so-called Coefficient of Determination and the Adjusted Coefficient of Determination.
1.6 The coefficient of determination
Denoted by $R^2$, the Coefficient of Determination is defined by the formula

$R^2 = 1 - \dfrac{SS_E}{SS_T}$

where $SS_E = \sum \hat{e}_i^2$ is the sum of the squares of the errors and $SS_T$ is the total sum of squares given by $SS_T = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$. The value of $R^2$ is sometimes described as representing the amount of variability explained or accounted for by a regression model. For example, if after a particular calculation it was found that $R^2 = 0.88$, we could say that the model accounts for about 88% of the variability found in the data. However, deductions made on the basis of the value of $R^2$ should be treated cautiously; the reasons for this are embedded in the following properties of the statistic. It can be shown that:
- a large value of $R^2$ does not necessarily imply that a model is a good fit;
- adding a regressor variable (simple regression becomes multiple regression) always increases the value of $R^2$. This is one reason why a large value of $R^2$ does not necessarily imply a good model;
- models giving large values of $R^2$ can be poor predictors of new values if the fitted model does not apply at the appropriate $x$-value.
Finally, it is worth noting that to check the fit of a linear model properly, one should look at plots of residual values. In some cases, tests of goodness-of-fit are available although this topic is not covered in this Workbook.
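The calculation of $R^2$ itself is straightforward; here is a minimal Python sketch (the helper name `r_squared` is our own):

```python
def r_squared(y, y_pred):
    """Coefficient of determination R^2 = 1 - SSE/SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # error sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    return 1.0 - sse / sst
```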
1.7 The adjusted coefficient of determination
Denoted (often) by $R^2_{adj}$, the Adjusted Coefficient of Determination is defined as

$R^2_{adj} = 1 - \dfrac{SS_E/(n - p - 1)}{SS_T/(n - 1)}$

where $p$ is the number of variables in the regression equation. For the simple linear model $p = 1$, and we have two unknown parameters in the regression equation, the intercept $a$ and the coefficient $b$ of $x$. It can be shown that:
- $R^2_{adj}$ is a better indicator of the adequacy of predictive power than $R^2$ since it takes into account the number of regressor variables used in the model;
- $R^2_{adj}$ does not necessarily increase when a new regressor variable is added.
Both coefficients claim to measure the adequacy of the predictive power of a regression model and their values indicate the proportion of variability explained by the model. For example, a value of

$R^2_{adj} = 0.9751$

may be interpreted as indicating that a model explains 97.51% of the variability it describes. The drum and cable example considered previously gives the results outlined below, with

$R^2 \approx 0.962 \quad \text{and} \quad R^2_{adj} \approx 0.958$

In general, $R^2_{adj}$ is (perhaps) more useful than $R^2$ for comparing alternative models. In the context of a simple linear model, $R^2$ is easier to interpret. In the drum and cable example we would claim that the linear model explains some 96.2% of the variation it describes.
Drum & Cable $x$ | $x^2$ | Cable Length $y$ | $y^2$ | $xy$ | Predicted Values | Error Squares |
30 | 900 | 70 | 4900 | 2100 | 70 | 0.00 |
40 | 1600 | 90 | 8100 | 3600 | 100 | 100.00 |
40 | 1600 | 100 | 10000 | 4000 | 100 | 0.00 |
50 | 2500 | 120 | 14400 | 6000 | 130 | 100.00 |
50 | 2500 | 130 | 16900 | 6500 | 130 | 0.00 |
50 | 2500 | 150 | 22500 | 7500 | 130 | 400.00 |
60 | 3600 | 160 | 25600 | 9600 | 160 | 0.00 |
70 | 4900 | 190 | 36100 | 13300 | 190 | 0.00 |
70 | 4900 | 200 | 40000 | 14000 | 190 | 100.00 |
80 | 6400 | 200 | 40000 | 16000 | 220 | 400.00 |
80 | 6400 | 220 | 48400 | 17600 | 220 | 0.00 |
80 | 6400 | 230 | 52900 | 18400 | 220 | 100.00 |
Sum of $x$ = 700 | Sum of $x^2$ = 44200 | Sum of $y$ = 1860 | Sum of $y^2$ = 319800 | Sum of $xy$ = 118600 | | $SS_E$ = 1200 |
| | | $SS_T = \sum y^2 - (\sum y)^2/n = 31500$ | | | |
Task!
Use the drum and cable data given in Example 1 (page 7) and set up a spreadsheet to verify the values of the Coefficient of Determination and the Adjusted Coefficient of Determination calculated on page 12.
As per the table on page 12, giving $R^2 \approx 0.962$ and $R^2_{adj} \approx 0.958$.
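A short script gives the same verification as the spreadsheet; this sketch recomputes both coefficients from scratch:

```python
weights = [30, 40, 40, 50, 50, 50, 60, 70, 70, 80, 80, 80]
lengths = [70, 90, 100, 120, 130, 150, 160, 190, 200, 200, 220, 230]

n = len(lengths)
pred = [3 * w - 20 for w in weights]                     # fitted line y = 3x - 20
y_bar = sum(lengths) / n
sse = sum((y - p) ** 2 for y, p in zip(lengths, pred))   # 1200
sst = sum((y - y_bar) ** 2 for y in lengths)             # 31500
r2 = 1 - sse / sst                                       # 0.962 (3 d.p.)
r2_adj = 1 - (sse / (n - 2)) / (sst / (n - 1))           # 0.958 (3 d.p.)
print(round(r2, 3), round(r2_adj, 3))
```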
1.8 Significance testing for regression
Note that the results in this Section apply to the simple linear model only. Some additions are necessary before the results can be generalized.
The discussion so far presupposes that a linear model adequately describes the relationship between the variables. We can use a significance test involving the $F$-distribution to decide whether or not $Y$ is linearly dependent on $x$. We set up the following hypotheses:

$H_0 : \beta = 0 \qquad \text{and} \qquad H_1 : \beta \neq 0$
Key Point 3
Significance Test for Regression
The test statistic is

$F = \dfrac{SS_R}{SS_E/(n-2)}$

where $SS_R = SS_T - SS_E$, and rejection at the 5% level of significance occurs if

$F > F_{0.05, 1, n-2}$
Note that we have one degree of freedom since we are testing only one parameter ($\beta$) and that $n$ denotes the number of pairs of $(x, y)$ values. A set of tables giving the 5% values of the $F$-distribution is given at the end of this Workbook (Table 1).
Example 2
Test to determine whether a simple linear model is appropriate for the data previously given in the drum and cable example above.
Solution
We know that

$SS_R = SS_T - SS_E$

where $SS_T = \sum y_i^2 - \dfrac{1}{n}\left(\sum y_i\right)^2$ is the total sum of squares (of $y$) so that (from the spreadsheet above) we have:

$SS_R = 31500 - 1200 = 30300$

Hence

$F = \dfrac{SS_R}{SS_E/(n-2)} = \dfrac{30300}{1200/10} = 252.5$

From Table 1, the critical value is $F_{0.05, 1, 10} = 4.96$.

Hence, since $F > F_{0.05, 1, 10}$, we reject the null hypothesis and conclude that $\beta \neq 0$.
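The test is quickly reproduced in code; the sketch below (assuming SciPy is available) uses `scipy.stats.f.ppf` to supply the critical value that Table 1 tabulates:

```python
from scipy.stats import f

sst, sse, n = 31500, 1200, 12
ssr = sst - sse                          # regression sum of squares: 30300
f_test = ssr / (sse / (n - 2))           # 252.5
f_crit = f.ppf(0.95, 1, n - 2)           # about 4.96
print(f_test, f_crit, f_test > f_crit)   # True: reject H0, beta != 0
```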
1.9 Regression curves
This Section should be regarded as introductory only. The reason for including non-linear regression is to demonstrate how the method of least squares can be extended to deal with cases where the relationship between variables is, for example, quadratic or exponential.
A regression curve is defined to be the curve passing through the expected value of $y$ for a set of given values of $x$. The idea is illustrated by the following diagram.
Figure 6
We will look at the quadratic and exponential cases in a little detail.
1.10 The quadratic case
We are looking for a functional relation of the form

$y = a + bx + cx^2$

and so, using the method of least squares, we require the values of $a$, $b$ and $c$ which minimize the expression

$f(a, b, c) = \sum_{r=1}^{n} (y_r - a - bx_r - cx_r^2)^2$

Note here that the regression described by the form

$y = a + bx + cx^2$

is actually a linear regression since the expression is linear in $a$, $b$ and $c$.
Omitting the subscripts and using partial differentiation gives

$\dfrac{\partial f}{\partial a} = -2\sum (y - a - bx - cx^2)$

$\dfrac{\partial f}{\partial b} = -2\sum x(y - a - bx - cx^2)$

$\dfrac{\partial f}{\partial c} = -2\sum x^2 (y - a - bx - cx^2)$

At a minimum we require

$\dfrac{\partial f}{\partial a} = \dfrac{\partial f}{\partial b} = \dfrac{\partial f}{\partial c} = 0$

which results in the three linear equations

$\sum y - na - b\sum x - c\sum x^2 = 0$

$\sum xy - a\sum x - b\sum x^2 - c\sum x^3 = 0$

$\sum x^2 y - a\sum x^2 - b\sum x^3 - c\sum x^4 = 0$

which can be solved to give the values of $a$, $b$ and $c$.
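Since the normal equations are linear in $a$, $b$ and $c$, they can be solved directly as a $3 \times 3$ system. The sketch below (the data values are invented purely for illustration) builds and solves the system with NumPy:

```python
import numpy as np

# Illustrative data lying roughly on y = 1 + 2x + 0.5x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 3.4, 7.2, 11.4, 17.1, 23.6])

n = len(x)
# Coefficient matrix and right-hand side of the three normal equations
A = np.array([[n,            x.sum(),      (x**2).sum()],
              [x.sum(),      (x**2).sum(), (x**3).sum()],
              [(x**2).sum(), (x**3).sum(), (x**4).sum()]])
rhs = np.array([y.sum(), (x * y).sum(), (x**2 * y).sum()])

a, b, c = np.linalg.solve(A, rhs)
print(a, b, c)   # close to 1, 2 and 0.5 for this data
```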
1.11 The exponential case
We use the same technique to look for a functional relation of the form

$y = ae^{bx}$

As before, using the method of least squares, we require the values of $a$ and $b$ which minimize the expression

$f(a, b) = \sum_{r=1}^{n} (y_r - ae^{bx_r})^2$

Again omitting the subscripts and using partial differentiation gives

$\dfrac{\partial f}{\partial a} = -2\sum e^{bx}(y - ae^{bx})$

$\dfrac{\partial f}{\partial b} = -2\sum axe^{bx}(y - ae^{bx})$

At a minimum we require

$\dfrac{\partial f}{\partial a} = \dfrac{\partial f}{\partial b} = 0$

which results in the two non-linear equations

$\sum e^{bx}(y - ae^{bx}) = 0$

$\sum xe^{bx}(y - ae^{bx}) = 0$

which can be solved by iterative methods to give the values of $a$ and $b$.
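In practice the iteration is usually delegated to a general non-linear least squares routine. The sketch below (assuming SciPy is available; the data are invented and `p0` is the rough starting guess the iteration needs) uses `scipy.optimize.curve_fit`, which minimizes the same sum of squares numerically:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(b * x)        # the functional form y = a*exp(b*x)

# Illustrative data lying roughly on y = 2*exp(0.3x)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.6, 3.7, 4.8, 6.7, 9.1])

(a, b), _ = curve_fit(model, x, y, p0=(1.0, 0.1))
print(a, b)   # close to 2 and 0.3 for this data
```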
Note that it is possible to combine (for example) linear and exponential regression to obtain a regression equation of the form

$y = a + bx + ce^{dx}$

The method of least squares may then be used to find estimates of $a$, $b$, $c$ and $d$.