Correlation

1 Correlation

So far we have assumed that a random variable $Y$ is related to an independent variable $x$ which can be measured with some accuracy. In the equation below, the dependent variable $Y$ is a random variable whose value, for a fixed value of $x$ , depends on a random error component, say $e$ , so that

$Y = m x + c + e$

In some situations both $X$ and $Y$ are random variables. We can still use a regression line of $y$ on $x$ if we are required to predict values of $y$ from observations made on $x$ ; in that case the variables $x$ and $y$ play different roles. In correlation, the two variables are interchangeable. Frequently quoted examples involving two random variables are the shear strength ( $y$ ) and diameter ( $x$ ) of spot welds (neither can be precisely controlled) and the bending moment ( $y$ ) and shear ( $x$ ) at the fixed point of a beam, as illustrated below.

Figure 6


Again, neither variable (shear or moment) can be precisely controlled; each is a random variable. In cases such as these, we turn to the correlation coefficient (sometimes called Pearson’s coefficient of correlation or simply Pearson’s $r$ ), defined as

$r = \frac{σ_{x y}}{σ_{x} σ_{y}}$

where $σ_{x y}$ is the covariance between $X$ and $Y$ and $σ_{x}$ and $σ_{y}$ are the standard deviations of $X$ and $Y$ . We need to express this formula in terms of quantities which facilitate the easy calculation of the correlation coefficient.
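
One way to see where a computable form comes from is sketched below. The sketch assumes (this is not stated in the text) that the covariance and both variances are estimated from $n$ sample pairs $(x, y)$ using the same divisor, written here as $n$ , with sample means $\bar{x}$ and $\bar{y}$ . The divisor cancels in the ratio, so using $n - 1$ throughout would lead to the same result.

$σ_{x y} = \frac{1}{n} \sum (x - \bar{x}) (y - \bar{y}) = \frac{1}{n} \sum x y - \bar{x} \bar{y} = \frac{n \sum x y - \sum x \sum y}{n^{2}}$

$σ_{x}^{2} = \frac{1}{n} \sum {(x - \bar{x})}^{2} = \frac{n \sum x^{2} - {(\sum x)}^{2}}{n^{2}} \qquad σ_{y}^{2} = \frac{n \sum y^{2} - {(\sum y)}^{2}}{n^{2}}$

Dividing $σ_{x y}$ by $σ_{x} σ_{y}$ cancels the factors of $n^{2}$ , leaving an expression that involves only the five sums $\sum x$ , $\sum y$ , $\sum x^{2}$ , $\sum y^{2}$ and $\sum x y$ .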

Key Point 4

Pearson’s Coefficient of Correlation , $r$

In terms of corresponding sample values $(x, y)$ ,

$r = \frac{n \sum x y - \sum x \sum y}{\sqrt{(n \sum x^{2} - {(\sum x)}^{2}) (n \sum y^{2} - {(\sum y)}^{2})}}$

Further, it can also be shown that $- 1 \leq r \leq 1$ and that:

$r = - 1$ represents perfect negative correlation with all $(x, y)$ lying on a straight line with negative gradient;
$r = 1$ represents perfect positive correlation with all $(x, y)$ lying on a straight line with positive gradient;
$r = 0$ represents the situation where there is no linear relationship between the variables; any relationship that does exist is non-linear.

1.1 The calculation of Pearson’s $r$

The worked example below shows the setting out of a table which will facilitate the easy calculation of Pearson’s $r$ .

Example 3

Find the value of Pearson’s $r$ for the following set of data obtained by reading seven torque values ( $x$ ) from an electric motor using current ( $y$ ).

Reading	1	2	3	4	5	6	7
$x$ -Value	16	14	12	10	8	6	4
$y$ -Value	12	8	16	14	4	10	6

Solution

The calculation is done as follows:

$x$	$y$	$x^{2}$	$y^{2}$	$x y$
16	12	256	144	192
14	8	196	64	112
12	16	144	256	192
10	14	100	196	140
8	4	64	16	32
6	10	36	100	60
4	6	16	36	24
$\sum x = 70$	$\sum y = 70$	$\sum x^{2} = 812$	$\sum y^{2} = 812$	$\sum x y = 752$

Substituting in the formula we developed for $r$ gives the result:

$r = \frac{7 \times 752 - 70 \times 70}{\sqrt{(7 \times 812 - 70^{2}) (7 \times 812 - 70^{2})}} = 0.46$

In practice, one would set up a spreadsheet or use a specialist statistical software package to do the calculations.
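
The tabular calculation above can also be scripted. The following is a minimal sketch in Python using only the standard library; the function name `pearson_r` and the variable names are illustrative choices, not part of the original text. It forms the same five sums as the table and substitutes them into the formula of Key Point 4.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from the raw sums used in the worked table."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


torque = [16, 14, 12, 10, 8, 6, 4]     # x values from Example 3
current = [12, 8, 16, 14, 4, 10, 6]    # y values from Example 3

r = pearson_r(torque, current)
print(round(r, 2))  # 0.46
```

The function would fail with a division by zero if all the $x$ values (or all the $y$ values) were identical, since the correlation coefficient is undefined in that case.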

Comment

Any calculated value of $r$ says something about the degree of correlation between the two random variables involved. To give real meaning to the value of the correlation coefficient we should test its significance, in this case the significance of $r = 0.46$ .

1.2 The significance of Pearson’s $r$

In order to test the significance of a calculated value of $r$ we assume that both $x$ and $y$ are normally distributed and set up the hypotheses:

$H_{0} : ρ = 0 \qquad H_{1} : ρ \neq 0$

where $ρ$ is the ‘true’ value of the population correlation. If the assumption of normality is false the test must not be used. We know that $- 1 \leq r \leq 1$ and we wish to know whether our correlation coefficient is significantly different from zero.

Key Point 5

Significance of Pearson’s $r$

It can be shown that the test statistic

$r_{test} = \frac{|r| \sqrt{n - 2}}{\sqrt{1 - r^{2}}}$

calculated from a sample of $n$ pairs of values, follows a $t$-distribution with $n - 2$ degrees of freedom.

Note that many authors simply miss out the modulus sign and ignore the sign of $r$ should it be negative. The test statistic is then written

$r_{test} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^{2}}}$

and critical values depending on the level of significance required are read off from $t$ -tables in the usual way. A copy of $t$ -distribution tables is included at the end of this Workbook (Table 2).

Example 4

Test the significance of the value of $r$ obtained from Example 3 concerning electric motor torque values. Use the 5% level of significance.

Solution

The sample size is 7 so we have 5 degrees of freedom. The value of $r_{test}$ is given by

$r_{test} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^{2}}} = \frac{0.46 \times \sqrt{7 - 2}}{\sqrt{1 - 0.46^{2}}} = 1.158$

From Table 2, the critical value for a two-sided test at the 5% level of significance is 2.571. In this case, since $1.158 < 2.571$ we cannot reject the null hypothesis at the 5% level of significance and conclude that for the motor under investigation, there is no evidence of a relationship between torque produced and current used.
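
The test in Example 4 can be sketched in a few lines of Python. This is an illustration only: the function name `t_statistic` is hypothetical, and the critical value is typed in from Table 2 rather than computed, so the sketch needs nothing beyond the standard library.

```python
from math import sqrt


def t_statistic(r, n):
    """Test statistic |r| * sqrt(n - 2) / sqrt(1 - r^2) from Key Point 5."""
    return abs(r) * sqrt(n - 2) / sqrt(1 - r * r)


r, n = 0.46, 7                 # values from Examples 3 and 4
critical_value = 2.571         # Table 2: 5 degrees of freedom, two-sided 5% level

r_test = t_statistic(r, n)
reject_h0 = r_test > critical_value
print(round(r_test, 3), reject_h0)  # 1.158 False
```

Taking the modulus of $r$ inside the function means the same comparison with a positive critical value works whether the sample correlation is positive or negative.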

Task!

Hooke’s law relates the extension of a spring to the load applied to it. The following results were obtained experimentally.

Load (N)	2	5	8	11	15
Extension (mm)	2	23	62	119	223

Calculate Pearson’s $r$ and test its significance at the 5% level. What conclusion can you draw?

Setting up a spreadsheet to do the calculations gives:

Load $(x)$	Exten. $(y)$	$x y$	$x^{2}$	$y^{2}$
2	2	4	4	4
5	23	115	25	529
8	62	496	64	3844
11	119	1309	121	14161
15	223	3345	225	49729
Sum $(x) = 41$	Sum $(y) = 429$	Sum $(x y) = 5269$	Sum $(x^{2}) = 439$	Sum $(y^{2}) = 68267$

$r = 0.97379629 \qquad r_{test} = 7.41645174$

The critical value for a two-sided $t$-test at the 5% level with 3 degrees of freedom, read from tables, is 3.182. Since $7.416 > 3.182$ we reject the null hypothesis at the 5% level and conclude that the correlation coefficient is significantly different from zero.

Comments on interpretation

Some care should always be taken when interpreting results obtained from correlation coefficient calculations.

A high correlation does not necessarily imply that a causal relationship exists between the variables considered. For example, it may be that a high degree of correlation exists between the number of road accidents in a particular city and the number of late trains arriving at a station in another city both over the same time period. In general one would not expect to find a causal relation between the variables involved. Similar comments apply to, for example, water hardness and average income for towns in the UK.
When considering the behaviour of two variables, one should realize that it is possible that both variables may change because of the influence of a third variable. An example often quoted in this context is the Gas law
$\frac{P V}{T} = \text{constant}$

where say, pressure and volume may change because of a change in temperature.
A low value of the correlation coefficient does not necessarily imply that no relationship exists between the variables being considered. Remember that the correlation coefficient is indicative of a linear relationship only and that a low or zero value of $r$ may indicate that a non-linear relationship exists. For example a set of points lying on the curve $y = x^{2}$ might (see the Tasks below) result in a zero value of $r$ .

Task!

Write down five $(x, y)$ points (symmetrical about zero) lying on the parabola $y = x^{2}$ . Show that the correlation coefficient between $x$ and $y$ is zero.

Let the five points be (for example) $(- 2, 4), (- 1, 1), (0, 0), (1, 1), (2, 4)$

$x$	$y$	$x y$	$x^{2}$	$y^{2}$
-2	4	-8	4	16
-1	1	-1	1	1
0	0	0	0	0
1	1	1	1	1
2	4	8	4	16
Sum $(x) = 0$	Sum $(y) = 10$	Sum $(x y) = 0$	Sum $(x^{2}) = 10$	Sum $(y^{2}) = 34$

The value of $r$ is given by

$r = \frac{n \sum x y - \sum x \sum y}{\sqrt{(n \sum x^{2} - {(\sum x)}^{2}) (n \sum y^{2} - {(\sum y)}^{2})}} = \frac{5 \times 0 - 0 \times 10}{\sqrt{(5 \times 10 - 0^{2}) (5 \times 34 - 10^{2})}} = 0$

Task!

Write down five $(x, y)$ points (all involving non-negative values of $x$ and $y$ ) lying on the parabola $y = x^{2}$ . Show that the correlation coefficient between $x$ and $y$ is non-zero.

Let the five points be (for example) $(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)$

$x$	$y$	$x y$	$x^{2}$	$y^{2}$
0	0	0	0	0
1	1	1	1	1
2	4	8	4	16
3	9	27	9	81
4	16	64	16	256
Sum $(x) = 10$	Sum $(y) = 30$	Sum $(x y) = 100$	Sum $(x^{2}) = 30$	Sum $(y^{2}) = 354$

$r = \frac{n \sum x y - \sum x \sum y}{\sqrt{(n \sum x^{2} - {(\sum x)}^{2}) (n \sum y^{2} - {(\sum y)}^{2})}} = \frac{5 \times 100 - 10 \times 30}{\sqrt{(5 \times 30 - 10^{2}) (5 \times 354 - 30^{2})}} = 0.959$
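
The two parabola Tasks can be checked numerically. The sketch below is one possible way of doing so; it computes $r$ from deviations about the sample means, which is algebraically equivalent to the sums formula of Key Point 4, and the names `pearson_r`, `symmetric` and `one_sided` are illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the sample means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


symmetric = [-2, -1, 0, 1, 2]   # x values symmetrical about zero
one_sided = [0, 1, 2, 3, 4]     # x values on one side of zero only

# In both cases every point lies exactly on y = x^2.
r_sym = pearson_r(symmetric, [x * x for x in symmetric])
r_one = pearson_r(one_sided, [x * x for x in one_sided])
print(round(r_sym, 3), round(r_one, 3))  # 0.0 0.959
```

Both data sets follow the same exact non-linear law, yet the values of $r$ differ greatly, which illustrates that $r$ measures only the linear component of a relationship over the range sampled.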

1.3 Spearman’s coefficient of correlation

There are times when data cannot be expressed in terms of numbers directly. For example, an audio engineer might be asked to give an opinion on the quality of sound produced by three sets of speakers. The results will represent a judgement made by the engineer. The engineer could adopt a set of criteria including, for example, the clarity of the treble, the power of the bass and the ability of the speakers to distinguish between instruments. Suppose the results are as follows:

Test Item	Rating	Rank Order
Speaker Pair B	9/10	1
Speaker Pair A	8/10	2
Speaker Pair C	5/10	3

Note that the results are not numeric in an arithmetic sense so you cannot do meaningful arithmetic using the results. In order to see this, just ask what a calculation based on the ranks such as

$\frac{1 + 2^{2}}{3}$

would actually mean. The answer is, of course, nothing!

During your career as an engineer you may be asked to rank data in a similar way to that outlined above. You may be asked to assess the work of colleagues for promotion purposes or give an opinion on the visual appeal of alternative designs of manufactured objects such as mobile telephones, food containers or television sets.

Assigning numbers to data in order of size (often called ranking methods) can also be useful if one does not wish to make assumptions about the nature of the distributions underlying the data. (For example whenever at least one of the distributions describing the behaviour of the variables may not be normal.) In order to check the level of correlation between results obtained by ranking data we calculate Spearman’s coefficient of correlation.

Key Point 6

Spearman’s Coefficient of Correlation , $R$

$R = 1 - \frac{6 \sum D^{2}}{n (n^{2} - 1)}$

where $D = R_{X} - R_{Y}$ is the difference between the rank $R_{X}$ of an item according to variable $X$ and the rank $R_{Y}$ of the item according to variable $Y$ .

The formula indicates that the differences of each pair of ranked values are to be found, squared and summed. It is worth noting that, although it is not obvious, Spearman’s coefficient is simply Pearson’s coefficient applied to the ranks; the formula above gives exactly the same value provided there are no tied ranks.

1.4 The calculation of Spearman’s $R$

The following worked example illustrates the procedure.

Example 5

A production engineer is asked to grade, on the basis of 12 criteria $A$ to $L$ , a junior colleague who has applied for promotion. In order to try to ensure that he treats the colleague fairly, the engineer repeats his gradings after a few days. On the basis of the results below, can you conclude that the results are consistent? The gradings are percentages.

Criterion	First Grading ( $X$ )	$R_{X}$	Second Grading ( $Y$ )	$R_{Y}$
$A$	55	8	75	7
$B$	53	9	80	6
$C$	78	3	89	4
$D$	50	10	63	11
$E$	48	11	67	10
$F$	61	7	69	9
$G$	66	6	73	8
$H$	76	4	93	2
$I$	85	2	87	5
$J$	90	1	95	1
$K$	69	5	92	3
$L$	45	12	59	12

Solution

The calculation may be set out as follows:

Criterion	$R_{X}$	$R_{Y}$	$D = R_{X} - R_{Y}$	$D^{2}$
$A$	8	7	1	1
$B$	9	6	3	9
$C$	3	4	-1	1
$D$	10	11	-1	1
$E$	11	10	1	1
$F$	7	9	-2	4
$G$	6	8	-2	4
$H$	4	2	2	4
$I$	2	5	-3	9
$J$	1	1	0	0
$K$	5	3	2	4
$L$	12	12	0	0
				$\sum D^{2} = 38$

Substituting in the formula for $R$ gives the value

$R = 1 - \frac{6 \times 38}{12 \times 143} = 0.87$

Note that we have not made any attempt to interpret the meaning of this figure of 0.87. Methods for doing this are discussed below.
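
The ranking and the calculation of $R$ in Example 5 can be sketched in Python as follows. The helper names (`ranks_descending`, `spearman_R`, `pearson_r`) are hypothetical, and the ranking helper assumes there are no tied values, which is true for these gradings. The last lines also check the remark that Spearman’s coefficient is Pearson’s coefficient applied to the ranks.

```python
from math import sqrt


def ranks_descending(values):
    """Rank 1 goes to the largest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_R(xs, ys):
    """Return (R, sum of D^2) using R = 1 - 6 * sum(D^2) / (n (n^2 - 1))."""
    n = len(xs)
    rx, ry = ranks_descending(xs), ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2


def pearson_r(xs, ys):
    """Pearson's r from the sums formula of Key Point 4."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


first = [55, 53, 78, 50, 48, 61, 66, 76, 85, 90, 69, 45]    # first grading, A to L
second = [75, 80, 89, 63, 67, 69, 73, 93, 87, 95, 92, 59]   # second grading, A to L

R, sum_d2 = spearman_R(first, second)
r_on_ranks = pearson_r(ranks_descending(first), ranks_descending(second))
print(sum_d2, round(R, 2), round(r_on_ranks, 2))  # 38 0.87 0.87
```

If the data contained ties, a different ranking rule (for example average ranks) would be needed and the short formula would no longer agree exactly with Pearson’s $r$ on the ranks.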

1.5 The significance of Spearman’s $R$

Like Pearson’s $r$ the value of $R$ may be shown to lie in the range $- 1 \leq R \leq 1$ and in order to test the significance of a calculated value of $R$ we set up the hypotheses

$H_{0} : ρ = 0 \qquad H_{1} : ρ \neq 0$

Key Point 7

Significance of Spearman’s $R$

We wish to know whether our correlation coefficient is significantly different to zero. It can be shown that for large samples, the test statistic

$R_{test} = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^{2}}}$

calculated from a sample of $n$ pairs of values, follows a $t$-distribution with $n - 2$ degrees of freedom.

Critical values depending on the level of significance required are read from $t$ -tables. When dealing with Spearman’s coefficient of correlation, the size of the sample is important. Different authors recommend different minimum sample sizes, a common recommendation being a minimum of $n = 10$ . Even though they are not used here, you should note that tables are available which allow us to read critical values corresponding to small sample sizes.

Example 6

A production engineer is asked to grade, on the basis of 12 criteria (say) $A$ to $L$ a junior colleague who has applied for promotion. He repeats his gradings after a few days. The results (calculated in Example 5) gave a value of $R = 0.87$ . Test at the 5% level to determine whether the results are consistent.

Solution

The calculation is:

$R_{test} = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^{2}}} = \frac{0.87 \times \sqrt{12 - 2}}{\sqrt{1 - 0.87^{2}}} = 5.580$

The 5% critical value for a two-sided test with 10 degrees of freedom, read from tables, is 2.228. Since $5.580 > 2.228$ we reject the null hypothesis that the correlation coefficient is zero and conclude that the two sets of gradings are consistent.

Task!

As a result of two tests given to 10 students studying laboratory safety, the students were placed in the following class order.

Student	Test 1	Test 2
A	2	3
B	4	5
C	3	7
D	5	9
E	1	10
F	6	2
G	8	6
H	7	8
I	9	4
J	10	1

Use Spearman’s $R$ to discuss the consistency of their performances. Can you make any meaningful comment regarding the two tests as a means of assessing laboratory safety?

Setting up the hypotheses

$H_{0} : ρ = 0 \qquad H_{1} : ρ \neq 0$

and doing the appropriate calculations using a spreadsheet gives:

Test 1	Test 2	$D$	$D^{2}$
2	3	-1	1
4	5	-1	1
3	7	-4	16
5	9	-4	16
1	10	-9	81
6	2	4	16
8	6	2	4
7	8	-1	1
9	4	5	25
10	1	9	81
			$\sum D^{2} = 242$

$R = - 0.4666667 \qquad R_{test} = 1.49240501$

From $t$ -tables it may be seen that the critical value (8 degrees of freedom) at the 5% level of significance is 2.306. Since $1.492 < 2.306$ we cannot reject the null hypothesis that there is no correlation between the results. This implies that the performances of the students on the tests may not be related and we should question at least one of the tests as a means of assessing laboratory safety. One could, of course, question the usefulness of both tests!
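
Because the data in this Task are already ranks, the spreadsheet calculation reduces to a few lines. The sketch below is illustrative only (the variable names are assumptions, and the critical value is copied from Table 2 rather than computed). It takes the magnitude of $R$ when forming the test statistic, which is why the quoted $R_{test}$ is positive although $R$ is negative.

```python
from math import sqrt

test1 = [2, 4, 3, 5, 1, 6, 8, 7, 9, 10]    # class order in Test 1, students A to J
test2 = [3, 5, 7, 9, 10, 2, 6, 8, 4, 1]    # class order in Test 2, students A to J

n = len(test1)
sum_d2 = sum((a - b) ** 2 for a, b in zip(test1, test2))
R = 1 - 6 * sum_d2 / (n * (n * n - 1))

# Magnitude of the test statistic, compared with a positive critical value.
R_test = abs(R) * sqrt(n - 2) / sqrt(1 - R * R)
critical_value = 2.306    # Table 2: 8 degrees of freedom, two-sided 5% level
reject_h0 = R_test > critical_value

print(sum_d2, round(R, 4), round(R_test, 3), reject_h0)
# 242 -0.4667 1.492 False
```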

Task!

As part of an educational research project, twelve engineering students were given an intelligence test (IQ score) at the start of their first year course. At the end of the first year their results in engineering science (ES score) were noted down on the expectation that they would correlate with the results of the intelligence test. The results were as follows:

Student	1	2	3	4	5	6	7	8	9	10	11	12
IQ Score	135	120	125	135	125	140	135	140	135	140	120	135
ES Score	85	74	76	90	85	87	94	98	81	91	76	74

Calculate Pearson’s $r$ for these data. Can you conclude that there is a linear relationship between IQ scores and ES scores? You may assume that the IQ scores and the ES scores are each normally distributed.

Setting up the hypotheses

$H_{0} : ρ = 0 \qquad H_{1} : ρ \neq 0$

and doing the appropriate calculations using a spreadsheet gives:

$I Q (x)$	$E S (y)$	$x y$	$x^{2}$	$y^{2}$
135	85	11475	18225	7225
120	74	8880	14400	5476
125	76	9500	15625	5776
135	90	12150	18225	8100
125	85	10625	15625	7225
140	87	12180	19600	7569
135	94	12690	18225	8836
140	98	13720	19600	9604
135	81	10935	18225	6561
140	91	12740	19600	8281
120	76	9120	14400	5776
135	74	9990	18225	5476
sum $x = 1585$	sum $y = 1011$	sum $x y = 134005$	sum $x^{2} = 209975$	sum $y^{2} = 85905$

$r = 0.696 \qquad r_{test} = 3.065$

From $t$-tables it may be seen that the critical value (10 degrees of freedom) for a two-sided test at the 5% level of significance is 2.228. Since $3.065 > 2.228$ we reject the null hypothesis that there is no linear association between the results. This suggests that the performances of the students on the ES tests are linearly related to their IQ scores.

1.6 Table 1: Upper 5% points of the $F$ distribution


1.7 Table 2: Critical points of Student’s $t$ distribution


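In the table, $α$ is the upper-tail probability and $v$ is the number of degrees of freedom, so a two-sided test at the 5% level uses the column headed .025.

$v$ \ $α$	.40	.25	.10	.05	.025	.01	.005	.0025	.001	.0005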
1	.325	1.000	3.078	6.314	12.706	31.821	63.657	127.32	318.31	636.62
2	.289	.816	1.886	2.920	4.303	6.965	9.925	14.089	22.327	31.598
3	.277	.765	1.638	2.353	3.182	4.541	5.841	7.453	10.213	12.924
4	.271	.741	1.533	2.132	2.776	3.747	4.604	5.598	7.173	8.610
5	.267	.727	1.476	2.015	2.571	3.365	4.032	4.773	5.893	6.869
6	.265	.718	1.440	1.943	2.447	3.143	3.707	4.317	5.208	5.959
7	.263	.711	1.415	1.895	2.365	2.998	3.499	4.029	4.785	5.408
8	.262	.706	1.397	1.860	2.306	2.896	3.355	3.833	4.501	5.041
9	.261	.703	1.383	1.833	2.262	2.821	3.250	3.690	4.297	4.781
10	.260	.700	1.372	1.812	2.228	2.764	3.169	3.581	4.144	4.587
11	.260	.697	1.363	1.796	2.201	2.718	3.106	3.497	4.025	4.437
12	.259	.695	1.356	1.782	2.179	2.681	3.055	3.428	3.930	4.318
13	.259	.694	1.350	1.771	2.160	2.650	3.012	3.372	3.852	4.221
14	.258	.692	1.345	1.761	2.145	2.624	2.977	3.326	3.787	4.140
15	.258	.691	1.341	1.753	2.131	2.602	2.947	3.286	3.733	4.073
16	.258	.690	1.337	1.746	2.120	2.583	2.921	3.252	3.686	4.015
17	.257	.689	1.333	1.740	2.110	2.567	2.898	3.222	3.646	3.965
18	.257	.688	1.330	1.734	2.101	2.552	2.878	3.197	3.610	3.922
19	.257	.688	1.328	1.729	2.093	2.539	2.861	3.174	3.579	3.883
20	.257	.687	1.325	1.725	2.086	2.528	2.845	3.153	3.552	3.850
21	.257	.686	1.323	1.721	2.080	2.518	2.831	3.135	3.527	3.819
22	.256	.686	1.321	1.717	2.074	2.508	2.819	3.119	3.505	3.792
23	.256	.685	1.319	1.714	2.069	2.500	2.807	3.104	3.485	3.767
24	.256	.685	1.318	1.711	2.064	2.492	2.797	3.091	3.467	3.745
25	.256	.684	1.316	1.708	2.060	2.485	2.787	3.078	3.450	3.725
26	.256	.684	1.315	1.706	2.056	2.479	2.779	3.067	3.435	3.707
27	.256	.684	1.314	1.703	2.052	2.473	2.771	3.057	3.421	3.690
28	.256	.683	1.313	1.701	2.048	2.467	2.763	3.047	3.408	3.674
29	.256	.683	1.311	1.699	2.045	2.462	2.756	3.038	3.396	3.659
30	.256	.683	1.310	1.697	2.042	2.457	2.750	3.030	3.385	3.646
40	.255	.681	1.303	1.684	2.021	2.423	2.704	2.971	3.307	3.551
60	.254	.679	1.296	1.671	2.000	2.390	2.660	2.915	3.232	3.460
120	.254	.677	1.289	1.658	1.980	2.358	2.617	2.860	3.160	3.373
$\infty$	.253	.674	1.282	1.645	1.960	2.326	2.576	2.807	3.090	3.291
