1 Correlation
So far we have assumed that we have a random variable $Y$ related to an independent variable $x$ which can be measured with some accuracy. In the equation below, the dependent variable $Y$ is a random variable whose value, for a fixed value of $x$, depends on a random error component, $e$ say, and we have
$$Y = mx + c + e$$
In some situations, both $X$ and $Y$ are random variables. Note that we can still use a regression line of $Y$ on $X$ if we are required to predict values of $Y$ from observations made on $X$; in that case the variables $X$ and $Y$ play different roles. In correlation, the two variables are interchangeable. Frequently quoted examples involving two random variables are the shear strength and diameter of spot welds (neither can be precisely controlled) and the bending moment and shear at the fixed point of a beam, as illustrated below.
Figure 6
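Again, neither variable (shear or moment) can be precisely controlled; each is a random variable. In cases such as these, we turn to the correlation coefficient (sometimes called Pearson’s coefficient of correlation or simply Pearson’s $r$) defined as
$$r = \frac{S_{XY}}{S_X S_Y}$$
where $S_{XY}$ is the covariance between $X$ and $Y$ and $S_X$ and $S_Y$ are the standard deviations of $X$ and $Y$. We need to express this formula in terms of quantities which facilitate the easy calculation of the correlation coefficient. For $n$ pairs of observations, writing the covariance and the standard deviations in terms of sums gives
$$r = \frac{\sum XY - \dfrac{\sum X \sum Y}{n}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right)}}$$
Further, it can also be shown that $-1 \le r \le 1$ and that: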
- $r = -1$ represents perfect negative correlation with all points lying on a straight line with negative gradient;
- $r = 1$ represents perfect positive correlation with all points lying on a straight line with positive gradient;
- $r = 0$ represents the situation where either there is no relationship between the variables or any relationship that exists is non-linear.
1.1 The calculation of Pearson’s $r$
The worked example below shows the setting out of a table which will facilitate the easy calculation of Pearson’s $r$.
Example 3
Find the value of Pearson’s $r$ for the following set of data obtained by reading seven torque values ($Y$) from an electric motor using current ($X$).
Reading | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
$X$-Value | 16 | 14 | 12 | 10 | 8 | 6 | 4 |
$Y$-Value | 12 | 8 | 16 | 14 | 4 | 10 | 6 |
Solution
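The calculation is done as follows:
$X$ | $Y$ | $X^2$ | $Y^2$ | $XY$ |
16 | 12 | 256 | 144 | 192 |
14 | 8 | 196 | 64 | 112 |
12 | 16 | 144 | 256 | 192 |
10 | 14 | 100 | 196 | 140 |
8 | 4 | 64 | 16 | 32 |
6 | 10 | 36 | 100 | 60 |
4 | 6 | 16 | 36 | 24 |
Sum = 70 | Sum = 70 | Sum = 812 | Sum = 812 | Sum = 752 |
Substituting in the formula we developed for $r$ gives the result:
$$r = \frac{752 - \dfrac{70 \times 70}{7}}{\sqrt{\left(812 - \dfrac{70^2}{7}\right)\left(812 - \dfrac{70^2}{7}\right)}} = \frac{52}{112} \approx 0.46$$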
In practice, one would set up a spreadsheet or use a specialist statistical software package to do the calculations.
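As a minimal sketch of such a calculation, the Python below forms the five column sums used in the table and combines them into Pearson’s coefficient. The function name `pearson_r` is a hypothetical helper chosen for illustration; only the standard library is assumed.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's coefficient built from the column sums of the worked table."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return numerator / denominator

# Data of Example 3
x = [16, 14, 12, 10, 8, 6, 4]
y = [12, 8, 16, 14, 4, 10, 6]
r = pearson_r(x, y)
print(round(r, 2))  # 0.46
```

Working from the sums keeps the code in step with the hand calculation, so each intermediate quantity can be checked against the table.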
Comment
Any calculated value of $r$ says something about the degree of linear correlation between the two random variables in the calculation. In order to give real meaning to the value of the correlation coefficient we should test the significance of the value of $r$, in this case 0.46.
1.2 The significance of Pearson’s $r$
In order to test the significance of a calculated value of $r$ we assume that both $X$ and $Y$ are normally distributed and set up the hypotheses:
$$H_0: \rho = 0 \qquad H_1: \rho \neq 0$$
where $\rho$ is the ‘true’ value of the population correlation. If the assumption of normality is false the test must not be used. We know the value of $r$ calculated from the sample and we wish to know whether it is significantly different from zero.
Key Point 5
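Significance of Pearson’s $r$
It can be shown that, when the null hypothesis is true, the test statistic
$$t = |r|\sqrt{\frac{n-2}{1-r^2}}$$
can be compared with critical values of the $t$-distribution with $n-2$ degrees of freedom, where $n$ is the number of pairs of observations.
Note that many authors simply miss out the modulus sign and ignore the sign of $r$ should it be negative. The test statistic is then written
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
and critical values depending on the level of significance required are read off from $t$-tables in the usual way. A copy of $t$-distribution tables is included at the end of this Workbook (Table 2).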
Example 4
Test the significance of the value of $r$ obtained from Example 3 concerning electric motor torque values. Use the 5% level of significance.
Solution
The sample size is 7 so we have 5 degrees of freedom. The value of $t$ is given by
$$t = 0.46\sqrt{\frac{7-2}{1-0.46^2}} \approx 1.16$$
From Table 2, the critical value for a two-sided test at the 5% level of significance is 2.571. In this case, since $1.16 < 2.571$, we cannot reject the null hypothesis at the 5% level of significance and conclude that, for the motor under investigation, there is no evidence of a linear relationship between torque produced and current used.
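The test in Example 4 can be sketched in a few lines of Python. The helper name `t_statistic` is hypothetical, and the critical value 2.571 is simply copied from Table 2 (5 degrees of freedom, two-sided 5% level) rather than computed.

```python
from math import sqrt

def t_statistic(r, n):
    """|r| * sqrt((n - 2) / (1 - r^2)), referred to t-tables with n - 2 d.o.f."""
    return abs(r) * sqrt((n - 2) / (1 - r * r))

r, n = 0.46, 7     # rounded coefficient and sample size from Example 3
t_crit = 2.571     # two-sided 5% point for 5 degrees of freedom (Table 2)
t = t_statistic(r, n)
reject = t > t_crit
print(round(t, 2), reject)  # 1.16 False
```

Taking the modulus inside the helper means the same comparison works unchanged when the coefficient is negative.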
Task!
Hooke’s law relates the extension of a spring to the load applied to it. The following results were obtained experimentally.
Load ($X$) | 2 | 5 | 8 | 11 | 15 |
Extension ($Y$) | 2 | 23 | 62 | 119 | 223 |
Calculate Pearson’s $r$ and test its significance at the 5% level. What conclusion can you draw?
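Setting up a spreadsheet to do the calculations gives:
Load $X$ | Exten. $Y$ | $XY$ | $X^2$ | $Y^2$ |
2 | 2 | 4 | 4 | 4 |
5 | 23 | 115 | 25 | 529 |
8 | 62 | 496 | 64 | 3844 |
11 | 119 | 1309 | 121 | 14161 |
15 | 223 | 3345 | 225 | 49729 |
Sum = 41 | Sum = 429 | Sum = 5269 | Sum = 439 | Sum = 68267 |
$$r = \frac{5269 - \dfrac{41 \times 429}{5}}{\sqrt{\left(439 - \dfrac{41^2}{5}\right)\left(68267 - \dfrac{429^2}{5}\right)}} = \frac{1751.2}{\sqrt{102.8 \times 31458.8}} \approx 0.974$$
$$t = |r|\sqrt{\frac{n-2}{1-r^2}} = 0.974\sqrt{\frac{5-2}{1-0.974^2}} \approx 7.4$$
Hence, since the critical value for a two-sided $t$-test (3 degrees of freedom) at the 5% level read off from tables is 3.182 and $7.4 > 3.182$, we can reject the null hypothesis at the 5% level and conclude that the correlation coefficient is significantly different from zero.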
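The spreadsheet calculation for the spring data can be reproduced with the sketch below. The helper names are hypothetical, and the final check is an observation added here rather than part of the original solution: the five tabulated points happen to satisfy extension = load² − 2 exactly.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's coefficient from the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    return (sum_xy - sum_x * sum_y / n) / sqrt(
        (sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n)
    )

def t_statistic(r, n):
    """|r| * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom."""
    return abs(r) * sqrt((n - 2) / (1 - r * r))

load = [2, 5, 8, 11, 15]
extension = [2, 23, 62, 119, 223]
r = pearson_r(load, extension)
t = t_statistic(r, len(load))
print(round(r, 3), round(t, 1))  # 0.974 7.4

# Added observation: every tabulated point lies on extension = load**2 - 2
exact_quadratic = all(y == x * x - 2 for x, y in zip(load, extension))
print(exact_quadratic)  # True
```

A significant value of $r$ therefore indicates a strong linear association over this range of loads, but it does not by itself confirm that the underlying relationship is linear.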
Comments on interpretation
Some care should always be taken when interpreting results obtained from correlation coefficient calculations.
- A high correlation does not necessarily imply that a causal relationship exists between the variables considered. For example, it may be that a high degree of correlation exists between the number of road accidents in a particular city and the number of late trains arriving at a station in another city both over the same time period. In general one would not expect to find a causal relation between the variables involved. Similar comments apply to, for example, water hardness and average income for towns in the UK.
- When considering the behaviour of two variables, one should realize that both variables may change because of the influence of a third variable. An example often quoted in this context is the Gas law
$$\frac{PV}{T} = \text{constant}$$
where, say, pressure and volume may change because of a change in temperature.
- A low value of the correlation coefficient does not necessarily imply that no relationship exists between the variables being considered. Remember that the correlation coefficient is indicative of a linear relationship only and that a low or zero value of $r$ can occur even when a non-linear relationship exists. For example, a set of points lying on the curve $Y = X^2$ might (see the Tasks below) result in a zero value of $r$.
Task!
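Write down five points (symmetrical about zero) lying on the parabola $Y = X^2$. Show that the correlation coefficient between $X$ and $Y$ is zero.
Let the five points be (for example)
$X$ | $Y$ | $XY$ | $X^2$ | $Y^2$ |
-2 | 4 | -8 | 4 | 16 |
-1 | 1 | -1 | 1 | 1 |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 | 1 |
2 | 4 | 8 | 4 | 16 |
Sum = 0 | Sum = 10 | Sum = 0 | Sum = 10 | Sum = 34 |
The value of $r$ is given by
$$r = \frac{0 - \dfrac{0 \times 10}{5}}{\sqrt{\left(10 - \dfrac{0^2}{5}\right)\left(34 - \dfrac{10^2}{5}\right)}} = 0$$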
Task!
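Write down five points (all involving non-negative values of $X$ and $Y$) lying on the parabola $Y = X^2$. Show that the correlation coefficient between $X$ and $Y$ is non-zero.
Let the five points be (for example)
$X$ | $Y$ | $XY$ | $X^2$ | $Y^2$ |
0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 | 1 |
2 | 4 | 8 | 4 | 16 |
3 | 9 | 27 | 9 | 81 |
4 | 16 | 64 | 16 | 256 |
Sum = 10 | Sum = 30 | Sum = 100 | Sum = 30 | Sum = 354 |
The value of $r$ is given by
$$r = \frac{100 - \dfrac{10 \times 30}{5}}{\sqrt{\left(30 - \dfrac{10^2}{5}\right)\left(354 - \dfrac{30^2}{5}\right)}} = \frac{40}{\sqrt{10 \times 174}} \approx 0.96$$
which is non-zero, and indeed close to 1, even though the relationship is not linear.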
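Both parabola Tasks can be checked side by side with a short sketch. The helper `pearson_r` is a hypothetical name, and the two point sets are the ones tabulated in the Tasks.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's coefficient from the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    return (sum_xy - sum_x * sum_y / n) / sqrt(
        (sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n)
    )

symmetric = [-2, -1, 0, 1, 2]   # points symmetrical about zero
one_sided = [0, 1, 2, 3, 4]     # points on one arm of the parabola

r_sym = pearson_r(symmetric, [x * x for x in symmetric])
r_one = pearson_r(one_sided, [x * x for x in one_sided])
print(r_sym)            # 0.0
print(round(r_one, 2))  # 0.96
```

The same curve gives a coefficient of zero or of nearly one depending only on which part of it is sampled, which is why a scatter plot should accompany any interpretation of the coefficient.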
1.3 Spearman’s coefficient of correlation
There are times when data cannot be expressed in terms of numbers directly. For example, an audio engineer might be asked to give an opinion on the quality of sound produced by three sets of speakers. The results will represent a judgement made by the engineer. The engineer could adopt a set of criteria including, for example, the clarity of the treble, the power of the bass and the ability of the speakers to distinguish between instruments. Suppose the results are as follows:
Test Item | Rating | Rank Order |
Speaker Pair B | 9/10 | 1 |
Speaker Pair A | 8/10 | 2 |
Speaker Pair C | 5/10 | 3 |
Note that the results are not numeric in an arithmetic sense so you cannot do meaningful arithmetic using the results. In order to see this, just ask what a calculation based on the ranks, such as adding two of them or dividing one by another, would actually mean. The answer is, of course, nothing!
During your career as an engineer you may be asked to rank data in a similar way to that outlined above. You may be asked to assess the work of colleagues for promotion purposes or give an opinion on the visual appeal of alternative designs of manufactured objects such as mobile telephones, food containers or television sets.
Assigning numbers to data in order of size (often called ranking methods) can also be useful if one does not wish to make assumptions about the nature of the distributions underlying the data. (For example whenever at least one of the distributions describing the behaviour of the variables may not be normal.) In order to check the level of correlation between results obtained by ranking data we calculate Spearman’s coefficient of correlation.
Key Point 6
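Spearman’s Coefficient of Correlation, $R$
$$R = 1 - \frac{6\sum D^2}{n(n^2-1)}$$
where $D$ is the difference between the two ranks in each pair and $n$ is the number of pairs.
The formula indicates that the differences of each pair of ranked values are to be found, squared and summed. It is worth noting that even though it is not obvious, Spearman’s coefficient is just Pearson’s coefficient applied to ranks (provided that there are no tied ranks).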
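The remark that Spearman’s coefficient is Pearson’s coefficient applied to ranks can be checked numerically. The sketch below assumes the usual rank-difference formula $1 - 6\sum D^2 / (n(n^2-1))$ and uses a small made-up set of untied ranks (hypothetical data, not taken from the text); the function names are also illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's coefficient from the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    return (sum_xy - sum_x * sum_y / n) / sqrt(
        (sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n)
    )

def spearman_r(rank_x, rank_y):
    """Rank-difference formula; assumes no tied ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Hypothetical ranks for five items judged twice (no ties)
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]

print(round(spearman_r(rank_x, rank_y), 4))  # 0.8
print(round(pearson_r(rank_x, rank_y), 4))   # 0.8
```

When ranks are tied the two calculations can differ, so the shortcut formula should then be used with care.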
1.4 The calculation of Spearman’s $R$
The following worked example illustrates the procedure.
Example 5
A production engineer is asked to grade, on the basis of 12 criteria (numbered 1 to 12 here), a junior colleague who has applied for promotion. In order to try to ensure that he treats the colleague fairly, the engineer repeats his gradings after a few days. On the basis of the results below, can you conclude that the results are consistent? The gradings are percentages, and each set is ranked from 1 (highest) to 12 (lowest).
Criterion | First Grading ($X$) | Rank $R_X$ | Second Grading ($Y$) | Rank $R_Y$ |
1 | 55 | 8 | 75 | 7 |
2 | 53 | 9 | 80 | 6 |
3 | 78 | 3 | 89 | 4 |
4 | 50 | 10 | 63 | 11 |
5 | 48 | 11 | 67 | 10 |
6 | 61 | 7 | 69 | 9 |
7 | 66 | 6 | 73 | 8 |
8 | 76 | 4 | 93 | 2 |
9 | 85 | 2 | 87 | 5 |
10 | 90 | 1 | 95 | 1 |
11 | 69 | 5 | 92 | 3 |
12 | 45 | 12 | 59 | 12 |
Solution
The calculation may be set out as follows:
Criterion | $R_X$ | $R_Y$ | $D = R_X - R_Y$ | $D^2$ |
1 | 8 | 7 | 1 | 1 |
2 | 9 | 6 | 3 | 9 |
3 | 3 | 4 | -1 | 1 |
4 | 10 | 11 | -1 | 1 |
5 | 11 | 10 | 1 | 1 |
6 | 7 | 9 | -2 | 4 |
7 | 6 | 8 | -2 | 4 |
8 | 4 | 2 | 2 | 4 |
9 | 2 | 5 | -3 | 9 |
10 | 1 | 1 | 0 | 0 |
11 | 5 | 3 | 2 | 4 |
12 | 12 | 12 | 0 | 0 |
Sum | | | | 38 |
Substituting in the formula for $R$ gives the value
$$R = 1 - \frac{6 \times 38}{12(12^2 - 1)} = 1 - \frac{228}{1716} \approx 0.87$$
Note that we have not made any attempt to interpret the meaning of this figure of 0.87. Methods for doing this are discussed below.
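The ranking step and the coefficient for Example 5 can be reproduced with the sketch below. The helper names are hypothetical, and the ranking helper assumes there are no tied gradings, which holds for these data.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(rank_x, rank_y):
    """1 - 6 * sum(D^2) / (n (n^2 - 1)) for untied ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Percentage gradings from Example 5
first = [55, 53, 78, 50, 48, 61, 66, 76, 85, 90, 69, 45]
second = [75, 80, 89, 63, 67, 69, 73, 93, 87, 95, 92, 59]

rx = ranks_descending(first)
ry = ranks_descending(second)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
big_r = spearman_r(rx, ry)
print(sum_d2)           # 38
print(round(big_r, 2))  # 0.87
```

Deriving the ranks in code rather than typing them in guards against transcription slips when the list of criteria is long.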
1.5 The significance of Spearman’s $R$
Like Pearson’s $r$, the value of $R$ may be shown to lie in the range $-1 \le R \le 1$, and in order to test the significance of a calculated value of $R$ we set up the hypotheses
$$H_0: \rho = 0 \qquad H_1: \rho \neq 0$$
Key Point 7
Significance of Spearman’s $R$
We wish to know whether our correlation coefficient is significantly different from zero. It can be shown that for large samples, the test statistic
$$t = |R|\sqrt{\frac{n-2}{1-R^2}}$$
can be compared with critical values of the $t$-distribution with $n-2$ degrees of freedom.
Critical values depending on the level of significance required are read from $t$-tables. When dealing with Spearman’s coefficient of correlation, the size of the sample is important, and different authors recommend different minimum sample sizes. Even though they are not used here, you should note that tables are available which allow us to read critical values corresponding to small sample sizes.
Example 6
A production engineer is asked to grade, on the basis of 12 criteria, a junior colleague who has applied for promotion. He repeats his gradings after a few days. The results (calculated in Example 5) gave a value of $R = 0.87$. Test at the 5% level to determine whether the results are consistent.
Solution
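The calculation is:
$$t = 0.87\sqrt{\frac{12-2}{1-0.87^2}} \approx 5.58$$
The 5% critical value for a two-sided test (10 degrees of freedom) read from tables is 2.228 and since $5.58 > 2.228$ we conclude that we must reject the null hypothesis that the correlation coefficient is zero. The two sets of gradings are therefore consistent.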
Task!
As a result of two tests given to 10 students studying laboratory safety, the students were placed in the following class order.
Student | Test 1 | Test 2 |
A | 2 | 3 |
B | 4 | 5 |
C | 3 | 7 |
D | 5 | 9 |
E | 1 | 10 |
F | 6 | 2 |
G | 8 | 6 |
H | 7 | 8 |
I | 9 | 4 |
J | 10 | 1 |
Use Spearman’s $R$ to discuss the consistency of their performances. Can you make any meaningful comment regarding the two tests as a means of assessing laboratory safety?
Setting up the hypotheses
$$H_0: \rho = 0 \qquad H_1: \rho \neq 0$$
and doing the appropriate calculations using a spreadsheet gives:
Test 1 | Test 2 | $D$ | $D^2$ |
2 | 3 | -1 | 1 |
4 | 5 | -1 | 1 |
3 | 7 | -4 | 16 |
5 | 9 | -4 | 16 |
1 | 10 | -9 | 81 |
6 | 2 | 4 | 16 |
8 | 6 | 2 | 4 |
7 | 8 | -1 | 1 |
9 | 4 | 5 | 25 |
10 | 1 | 9 | 81 |
Sum | | | 242 |
$$R = 1 - \frac{6 \times 242}{10(10^2 - 1)} = 1 - \frac{1452}{990} \approx -0.467$$
$$t = |R|\sqrt{\frac{n-2}{1-R^2}} = 0.467\sqrt{\frac{10-2}{1-0.467^2}} \approx 1.49$$
From $t$-tables it may be seen that the two-sided critical value (8 degrees of freedom) at the 5% level of significance is 2.306. Since $1.49 < 2.306$ we cannot reject the null hypothesis that there is no correlation between the results. This implies that the performances of the students on the tests may not be related and we should question at least one of the tests as a means of assessing laboratory safety. One could, of course, question the usefulness of both tests!
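The whole of this Task, from class positions to the test decision, can be sketched as follows. The helper names are hypothetical, and the critical value 2.306 is copied from Table 2 (8 degrees of freedom, two-sided 5% level) rather than computed.

```python
from math import sqrt

def spearman_r(rank_x, rank_y):
    """1 - 6 * sum(D^2) / (n (n^2 - 1)) for untied ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

def t_statistic(coeff, n):
    """|coeff| * sqrt((n - 2) / (1 - coeff^2)) with n - 2 degrees of freedom."""
    return abs(coeff) * sqrt((n - 2) / (1 - coeff * coeff))

# Class positions of students A to J in the two tests
test1 = [2, 4, 3, 5, 1, 6, 8, 7, 9, 10]
test2 = [3, 5, 7, 9, 10, 2, 6, 8, 4, 1]

big_r = spearman_r(test1, test2)
t = t_statistic(big_r, len(test1))
t_crit = 2.306  # two-sided 5% point for 8 degrees of freedom (Table 2)
print(round(big_r, 2), round(t, 2), t > t_crit)  # -0.47 1.49 False
```

The coefficient is negative here, which is exactly the case in which the modulus sign in the test statistic matters.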
Task!
As part of an educational research project, twelve engineering students were given an intelligence test (IQ score) at the start of their first year course. At the end of the first year their results in engineering science (ES score) were noted down on the expectation that they would correlate with the results of the intelligence test. The results were as follows:
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
IQ Score | 135 | 120 | 125 | 135 | 125 | 140 | 135 | 140 | 135 | 140 | 120 | 135 |
ES Score | 85 | 74 | 76 | 90 | 85 | 87 | 94 | 98 | 81 | 91 | 76 | 74 |
Calculate Pearson’s $r$ for these data. Can you conclude that there is a linear relationship between IQ scores and ES scores? You may assume that the IQ scores and the ES scores are each normally distributed.
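Setting up the hypotheses
$$H_0: \rho = 0 \qquad H_1: \rho > 0$$
(a one-sided alternative, in line with the expectation that the scores correlate) and doing the appropriate calculations using a spreadsheet gives:
IQ $X$ | ES $Y$ | $XY$ | $X^2$ | $Y^2$ |
135 | 85 | 11475 | 18225 | 7225 |
120 | 74 | 8880 | 14400 | 5476 |
125 | 76 | 9500 | 15625 | 5776 |
135 | 90 | 12150 | 18225 | 8100 |
125 | 85 | 10625 | 15625 | 7225 |
140 | 87 | 12180 | 19600 | 7569 |
135 | 94 | 12690 | 18225 | 8836 |
140 | 98 | 13720 | 19600 | 9604 |
135 | 81 | 10935 | 18225 | 6561 |
140 | 91 | 12740 | 19600 | 8281 |
120 | 76 | 9120 | 14400 | 5776 |
135 | 74 | 9990 | 18225 | 5476 |
Sum = 1585 | Sum = 1011 | Sum = 134005 | Sum = 209975 | Sum = 85905 |
$$r = \frac{134005 - \dfrac{1585 \times 1011}{12}}{\sqrt{\left(209975 - \dfrac{1585^2}{12}\right)\left(85905 - \dfrac{1011^2}{12}\right)}} = \frac{468.75}{\sqrt{622.92 \times 728.25}} \approx 0.696$$
$$t = 0.696\sqrt{\frac{12-2}{1-0.696^2}} \approx 3.07$$
From $t$-tables it may be seen that the one-sided critical value (10 degrees of freedom) at the 5% level of significance is 1.812; the two-sided value, 2.228, leads to the same conclusion. Since $3.07 > 1.812$ we reject the null hypothesis that there is no linear association between the results. This implies that the performance of the students on the ES tests is linearly related to their IQ scores.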
1.6 Table 1: Upper 5% points of the distribution
1.7 Table 2: Critical points of Student’s $t$ distribution
Columns give the upper-tail area $\alpha$ and rows give the degrees of freedom $\nu$, so a two-sided test at the 5% level uses the .025 column.
$\nu$ | .40 | .25 | .10 | .05 | .025 | .01 | .005 | .0025 | .001 | .0005 |
1 | .325 | 1.000 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657 | 127.32 | 318.31 | 636.62 |
2 | .289 | .816 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 | 14.089 | 22.327 | 31.598 |
3 | .277 | .765 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 | 7.453 | 10.213 | 12.924 |
4 | .271 | .741 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 | 5.598 | 7.173 | 8.610 |
5 | .267 | .727 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 | 4.773 | 5.893 | 6.869 |
6 | .265 | .718 | 1.440 | 1.943 | 2.447 | 3.143 | 3.707 | 4.317 | 5.208 | 5.959 |
7 | .263 | .711 | 1.415 | 1.895 | 2.365 | 2.998 | 3.499 | 4.029 | 4.785 | 5.408 |
8 | .262 | .706 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 | 3.833 | 4.501 | 5.041 |
9 | .261 | .703 | 1.383 | 1.833 | 2.262 | 2.821 | 3.250 | 3.690 | 4.297 | 4.781 |
10 | .260 | .700 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 | 3.581 | 4.144 | 4.587 |
11 | .260 | .697 | 1.363 | 1.796 | 2.201 | 2.718 | 3.106 | 3.497 | 4.025 | 4.437 |
12 | .259 | .695 | 1.356 | 1.782 | 2.179 | 2.681 | 3.055 | 3.428 | 3.930 | 4.318 |
13 | .259 | .694 | 1.350 | 1.771 | 2.160 | 2.650 | 3.012 | 3.372 | 3.852 | 4.221 |
14 | .258 | .692 | 1.345 | 1.761 | 2.145 | 2.624 | 2.977 | 3.326 | 3.787 | 4.140 |
15 | .258 | .691 | 1.341 | 1.753 | 2.131 | 2.602 | 2.947 | 3.286 | 3.733 | 4.073 |
16 | .258 | .690 | 1.337 | 1.746 | 2.120 | 2.583 | 2.921 | 3.252 | 3.686 | 4.015 |
17 | .257 | .689 | 1.333 | 1.740 | 2.110 | 2.567 | 2.898 | 3.222 | 3.646 | 3.965 |
18 | .257 | .688 | 1.330 | 1.734 | 2.101 | 2.552 | 2.878 | 3.197 | 3.610 | 3.922 |
19 | .257 | .688 | 1.328 | 1.729 | 2.093 | 2.539 | 2.861 | 3.174 | 3.579 | 3.883 |
20 | .257 | .687 | 1.325 | 1.725 | 2.086 | 2.528 | 2.845 | 3.153 | 3.552 | 3.850 |
21 | .257 | .686 | 1.323 | 1.721 | 2.080 | 2.518 | 2.831 | 3.135 | 3.527 | 3.819 |
22 | .256 | .686 | 1.321 | 1.717 | 2.074 | 2.508 | 2.819 | 3.119 | 3.505 | 3.792 |
23 | .256 | .685 | 1.319 | 1.714 | 2.069 | 2.500 | 2.807 | 3.104 | 3.485 | 3.767 |
24 | .256 | .685 | 1.318 | 1.711 | 2.064 | 2.492 | 2.797 | 3.091 | 3.467 | 3.745 |
25 | .256 | .684 | 1.316 | 1.708 | 2.060 | 2.485 | 2.787 | 3.078 | 3.450 | 3.725 |
26 | .256 | .684 | 1.315 | 1.706 | 2.056 | 2.479 | 2.779 | 3.067 | 3.435 | 3.707 |
27 | .256 | .684 | 1.314 | 1.703 | 2.052 | 2.473 | 2.771 | 3.057 | 3.421 | 3.690 |
28 | .256 | .683 | 1.313 | 1.701 | 2.048 | 2.467 | 2.763 | 3.047 | 3.408 | 3.674 |
29 | .256 | .683 | 1.311 | 1.699 | 2.045 | 2.462 | 2.756 | 3.038 | 3.396 | 3.659 |
30 | .256 | .683 | 1.310 | 1.697 | 2.042 | 2.457 | 2.750 | 3.030 | 3.385 | 3.646 |
40 | .255 | .681 | 1.303 | 1.684 | 2.021 | 2.423 | 2.704 | 2.971 | 3.307 | 3.551 |
60 | .254 | .679 | 1.296 | 1.671 | 2.000 | 2.390 | 2.660 | 2.915 | 3.232 | 3.460 |
120 | .254 | .677 | 1.289 | 1.658 | 1.980 | 2.358 | 2.617 | 2.860 | 3.160 | 3.373 |
$\infty$ | .253 | .674 | 1.282 | 1.645 | 1.960 | 2.326 | 2.576 | 2.807 | 3.090 | 3.291 |