1 Correlation

So far we have assumed that we have a random variable Y related to an independent variable x which can be measured with some accuracy. In the equation below, the dependent variable Y is a random variable whose value, for a fixed value of x, depends on a random error component e, so that

$$Y = mx + c + e$$

In some situations both X and Y are random variables. We can still use a regression line of y on x if we are required to predict values of y from observations made on x; in that case the variables x and y play different roles. In correlation, the two variables are interchangeable. Frequently quoted examples involving two random variables are the shear strength (y) and diameter (x) of spot welds (neither can be precisely controlled) and the bending moment (y) and shear (x) at the fixed point of a beam, as illustrated below.

Figure 6


Again, neither variable (shear or moment) can be precisely controlled; each is a random variable. In cases such as these, we turn to the correlation coefficient (sometimes called Pearson’s coefficient of correlation or simply Pearson’s r), defined as

$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

where $\sigma_{xy}$ is the covariance between X and Y and $\sigma_x$ and $\sigma_y$ are the standard deviations of X and Y. We need to express this formula in terms of quantities which facilitate the easy calculation of the correlation coefficient.
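The step from the definition to a form suited to hand calculation can be sketched as follows. This is a minimal derivation, assuming the population quantities are replaced by the usual sample estimates $s_{xy}$, $s_x$ and $s_y$ with divisor $n-1$ (these symbols are introduced here for illustration; the divisor cancels, so the same result follows with divisor $n$).

```latex
\begin{align*}
s_{xy} &= \frac{1}{n-1}\sum (x-\bar{x})(y-\bar{y})
        = \frac{1}{n-1}\left(\sum xy - \frac{\sum x \sum y}{n}\right), \\
s_x^2  &= \frac{1}{n-1}\sum (x-\bar{x})^2
        = \frac{1}{n-1}\left(\sum x^2 - \frac{\left(\sum x\right)^2}{n}\right), \\
r &= \frac{s_{xy}}{s_x s_y}
   = \frac{\sum xy - \frac{1}{n}\sum x \sum y}
          {\sqrt{\left(\sum x^2 - \frac{1}{n}\left(\sum x\right)^2\right)
                 \left(\sum y^2 - \frac{1}{n}\left(\sum y\right)^2\right)}}
   = \frac{n\sum xy - \sum x \sum y}
          {\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)
                 \left(n\sum y^2 - \left(\sum y\right)^2\right)}}.
\end{align*}
```

The last step multiplies numerator and denominator by $n$, leaving only the five sums $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$ and $\sum y^2$.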

Key Point 4

Pearson’s Coefficient of Correlation , r

In terms of corresponding sample values $(x, y)$,

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}$$

Further, it can also be shown that $-1 \le r \le 1$ and that:

  1. $r = -1$ represents perfect negative correlation with all $(x, y)$ lying on a straight line with negative gradient;
  2. $r = 1$ represents perfect positive correlation with all $(x, y)$ lying on a straight line with positive gradient;
  3. $r = 0$ represents the situation where there is no linear relationship between the variables: either they are unrelated or any relationship that exists is non-linear.

1.1 The calculation of Pearson’s r

The worked example below shows the setting out of a table which will facilitate the easy calculation of Pearson’s r .

Example 3

Find the value of Pearson’s r for the following set of data, obtained by reading seven torque values (x) from an electric motor together with the corresponding current (y).

Reading 1 2 3 4 5 6 7
x-Value 16 14 12 10 8 6 4
y-Value 12 8 16 14 4 10 6

Solution

The calculation is done as follows:

x y $x^2$ $y^2$ $xy$
16 12 256 144 192
14 8 196 64 112
12 16 144 256 192
10 14 100 196 140
8 4 64 16 32
6 10 36 100 60
4 6 16 36 24
$\sum x = 70$ $\sum y = 70$ $\sum x^2 = 812$ $\sum y^2 = 812$ $\sum xy = 752$

Substituting in the formula we developed for r gives the result:

$$r = \frac{752 \times 7 - 70 \times 70}{\sqrt{(7 \times 812 - 70^2)(7 \times 812 - 70^2)}} = 0.46$$

In practice, one would set up a spreadsheet or use a specialist statistical software package to do the calculations.
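As an alternative to a spreadsheet, the same column sums can be formed in a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_r` is chosen here for illustration and is not part of the Workbook.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the five column sums of the worked table."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data from Example 3
x = [16, 14, 12, 10, 8, 6, 4]
y = [12, 8, 16, 14, 4, 10, 6]
r = pearson_r(x, y)
print(round(r, 2))  # 0.46
```

The function mirrors the tabular layout deliberately, so each intermediate sum can be compared with the bottom row of the table.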

Comment

Any calculated value of r says something about the degree of correlation present between the two random variables involved. To give real meaning to the value of the correlation coefficient we should test the significance of the value of r, in this case 0.46.

1.2 The significance of Pearson’s r

In order to test the significance of a calculated value of r we assume that both x and y are normally distributed and set up the hypotheses:

$$H_0: \rho = 0 \qquad H_1: \rho \neq 0$$

where ρ is the ‘true’ value of the population correlation. If the assumption of normality is false the test must not be used. We know that $-1 \le r \le 1$ and we wish to know whether our correlation coefficient is significantly different from zero.

Key Point 5

Significance of Pearson’s r

It can be shown that the test statistic

$$r_{test} = |r| \sqrt{\frac{n-2}{1-r^2}}$$
calculated from a sample of n pairs of values, follows a t-distribution with $n - 2$ degrees of freedom.

Note that many authors simply miss out the modulus sign and ignore the sign of r should it be negative. The test statistic is then written

$$r_{test} = r \sqrt{\frac{n-2}{1-r^2}}$$

and critical values depending on the level of significance required are read off from t-tables in the usual way. A copy of t-distribution tables is included at the end of this Workbook (Table 2).

Example 4

Test the significance of the value of r obtained from Example 3 concerning electric motor torque values. Use the 5% level of significance.

Solution

The sample size is 7 so we have 5 degrees of freedom. The value of $r_{test}$ is given by

$$r_{test} = r \sqrt{\frac{n-2}{1-r^2}} = 0.46 \times \sqrt{\frac{7-2}{1-0.46^2}} = 1.158$$

From Table 2, the critical value for a two-sided test at the 5% level of significance is 2.571. In this case, since 1.158 < 2.571 we cannot reject the null hypothesis at the 5% level of significance and conclude that for the motor under investigation, there is no evidence of a relationship between torque produced and current used.
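The test in Example 4 can be sketched in Python as below. The function name `t_statistic` is an illustrative choice, and the critical value is typed in by hand from Table 2 rather than computed.

```python
from math import sqrt

def t_statistic(r, n):
    """Test statistic |r| * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return abs(r) * sqrt((n - 2) / (1 - r ** 2))

r, n = 0.46, 7
t_crit = 2.571  # Table 2: 5 degrees of freedom, two-sided test at the 5% level
r_test = t_statistic(r, n)
reject_h0 = r_test > t_crit
print(round(r_test, 3), reject_h0)  # 1.158 False
```

Taking the modulus inside the function means the same comparison works for negative values of r in a two-sided test.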

Task!

Hooke’s law relates the extension of a spring to the load applied to it. The following results were obtained experimentally.

Load (N) 2 5 8 11 15
Extension (mm) 2 23 62 119 223

Calculate Pearson’s r and test its significance at the 5% level. What conclusion can you draw?

Setting up a spreadsheet to do the calculations gives:

Load (x) Exten. (y) $xy$ $x^2$ $y^2$
2 2 4 4 4
5 23 115 25 529
8 62 496 64 3844
11 119 1309 121 14161
15 223 3345 225 49729
Sum(x) = 41 Sum(y) = 429 Sum(xy) = 5269 Sum($x^2$) = 439 Sum($y^2$) = 68267

$r = 0.97379629$ $r_{test} = 7.41645174$

The critical value for a two-sided t-test at the 5% level (3 degrees of freedom), read off from tables, is 3.182. Since 7.416 > 3.182 we can reject the null hypothesis at the 5% level and conclude that the correlation coefficient is significantly different from zero.
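The spreadsheet figures can be cross-checked by working from the definition of r as covariance divided by the product of standard deviations, instead of from the column sums. The sketch below is an illustration only; the helper name `pearson_r_from_deviations` is an assumption, not something used in the Workbook.

```python
from math import sqrt

def pearson_r_from_deviations(xs, ys):
    """Pearson's r from deviations about the means (the common divisor cancels)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

load = [2, 5, 8, 11, 15]              # N
extension = [2, 23, 62, 119, 223]     # mm
n = len(load)
r = pearson_r_from_deviations(load, extension)
r_test = abs(r) * sqrt((n - 2) / (1 - r ** 2))
print(round(r, 4), round(r_test, 3))  # 0.9738 7.416
```

Both routes give the same value of r up to rounding, which is a useful check when the sums in a table become large.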

Comments on interpretation

Some care should always be taken when interpreting results obtained from correlation coefficient calculations.

  1. A high correlation does not necessarily imply that a causal relationship exists between the variables considered. For example, it may be that a high degree of correlation exists between the number of road accidents in a particular city and the number of late trains arriving at a station in another city both over the same time period. In general one would not expect to find a causal relation between the variables involved. Similar comments apply to, for example, water hardness and average income for towns in the UK.
  2. When considering the behaviour of two variables, one should realize that it is possible that both variables may change because of the influence of a third variable. An example often quoted in this context is the Gas law

    $$\frac{PV}{T} = \text{constant}$$

    where say, pressure and volume may change because of a change in temperature.

  3. A low value of the correlation coefficient does not necessarily imply that no relationship exists between the variables being considered. Remember that the correlation coefficient is indicative of a linear relationship only and that a low or zero value of r may indicate that a non-linear relationship exists. For example a set of points lying on the curve $y = x^2$ might (see the Tasks below) result in a zero value of r.

Task!

Write down five $(x, y)$ points (symmetrical about zero) lying on the parabola $y = x^2$. Show that the correlation coefficient between x and y is zero.

Let the five points be (for example) $(-2, 4)$, $(-1, 1)$, $(0, 0)$, $(1, 1)$, $(2, 4)$.

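x y $xy$ $x^2$ $y^2$
-2 4 -8 4 16
-1 1 -1 1 1
0 0 0 0 0
1 1 1 1 1
2 4 8 4 16
Sum(x) = 0 Sum(y) = 10 Sum(xy) = 0 Sum($x^2$) = 10 Sum($y^2$) = 34

The value of r is given by

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}} = \frac{5 \times 0 - 0 \times 10}{\sqrt{(5 \times 10 - 0^2)(5 \times 34 - 10^2)}} = 0$$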

Task!

Write down five $(x, y)$ points (all involving non-negative values of x and y) lying on the parabola $y = x^2$. Show that the correlation coefficient between x and y is non-zero.

Let the five points be (for example) $(0, 0)$, $(1, 1)$, $(2, 4)$, $(3, 9)$, $(4, 16)$.

x y $xy$ $x^2$ $y^2$
0 0 0 0 0
1 1 1 1 1
2 4 8 4 16
3 9 27 9 81
4 16 64 16 256
Sum(x) = 10 Sum(y) = 30 Sum(xy) = 100 Sum($x^2$) = 30 Sum($y^2$) = 354

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}} = \frac{5 \times 100 - 10 \times 30}{\sqrt{(5 \times 30 - 10^2)(5 \times 354 - 30^2)}} = 0.959$$
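The contrast between the two parabola Tasks can be reproduced with a short Python sketch. The function and variable names are illustrative assumptions; the points are the ones suggested in the two solutions above.

```python
from math import sqrt

def pearson_r(points):
    """Pearson's r for a list of (x, y) pairs, using the column-sum formula."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sx2 = sum(x * x for x, _ in points)
    sy2 = sum(y * y for _, y in points)
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

symmetric = [(x, x ** 2) for x in range(-2, 3)]  # x = -2, ..., 2
one_sided = [(x, x ** 2) for x in range(0, 5)]   # x = 0, ..., 4

print(pearson_r(symmetric))            # 0.0
print(round(pearson_r(one_sided), 3))  # 0.959
```

Every point lies exactly on $y = x^2$ in both cases, yet r ranges from 0 to nearly 1, which illustrates why r should be read only as a measure of linear association.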

1.3 Spearman’s coefficient of correlation

There are times when data cannot be expressed in terms of numbers directly. For example, an audio engineer might be asked to give an opinion on the quality of sound produced by three sets of speakers. The results will represent a judgement made by the engineer. The engineer could adopt a set of criteria including, for example, the clarity of the treble, the power of the bass and the ability of the speakers to distinguish between instruments. Suppose the results are as follows:

Test Item Rating Rank Order
Speaker Pair B 9/10 1
Speaker Pair A 8/10 2
Speaker Pair C 5/10 3

Note that the results are not numeric in an arithmetic sense so you cannot do meaningful arithmetic using the results. In order to see this, just ask what a calculation based on the ranks such as

1 + 2 2 3

would actually mean. The answer is, of course, nothing!

During your career as an engineer you may be asked to rank data in a similar way to that outlined above. You may be asked to assess the work of colleagues for promotion purposes or give an opinion on the visual appeal of alternative designs of manufactured objects such as mobile telephones, food containers or television sets.

Assigning numbers to data in order of size (often called ranking methods) can also be useful if one does not wish to make assumptions about the nature of the distributions underlying the data. (For example whenever at least one of the distributions describing the behaviour of the variables may not be normal.) In order to check the level of correlation between results obtained by ranking data we calculate Spearman’s coefficient of correlation.

Key Point 6

Spearman’s Coefficient of Correlation , R

$$R = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
where $D = R_X - R_Y$ is the difference between the rank $R_X$ of an item according to variable X and the rank $R_Y$ of the item according to variable Y.

The formula indicates that the differences of each pair of ranked values are to be found, squared and summed. It is worth noting that even though it is not obvious, Spearman’s coefficient is just Pearson’s coefficient applied to ranks.
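Why Spearman’s formula is Pearson’s coefficient applied to ranks can be sketched as follows. The derivation assumes that there are no tied ranks, so that each set of ranks is a rearrangement of $1, 2, \dots, n$.

```latex
\begin{align*}
\sum R_X = \sum R_Y &= \frac{n(n+1)}{2}, \qquad
\sum R_X^2 = \sum R_Y^2 = \frac{n(n+1)(2n+1)}{6}, \\
\sum D^2 &= \sum R_X^2 + \sum R_Y^2 - 2\sum R_X R_Y
\;\Longrightarrow\;
\sum R_X R_Y = \frac{n(n+1)(2n+1)}{6} - \frac{1}{2}\sum D^2, \\
n\sum R_X^2 - \Bigl(\sum R_X\Bigr)^2 &= \frac{n^2(n+1)(2n+1)}{6} - \frac{n^2(n+1)^2}{4}
 = \frac{n^2(n^2-1)}{12}, \\
r &= \frac{n\sum R_X R_Y - \sum R_X \sum R_Y}{n\sum R_X^2 - \left(\sum R_X\right)^2}
   = \frac{\dfrac{n^2(n^2-1)}{12} - \dfrac{n}{2}\sum D^2}{\dfrac{n^2(n^2-1)}{12}}
   = 1 - \frac{6\sum D^2}{n(n^2-1)}.
\end{align*}
```

The denominator of Pearson’s formula reduces to a single term here because the two sets of ranks have identical sums and sums of squares.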

1.4 The calculation of Spearman’s R

The following worked example illustrates the procedure.

Example 5

A production engineer is asked to grade, on the basis of 12 criteria A to L , a junior colleague who has applied for promotion. In order to try to ensure that he treats the colleague fairly, the engineer repeats his gradings after a few days. On the basis of the results below, can you conclude that the results are consistent? The gradings are percentages.

Criterion First Grading (X) $R_X$ Second Grading (Y) $R_Y$
A 55 8 75 7
B 53 9 80 6
C 78 3 89 4
D 50 10 63 11
E 48 11 67 10
F 61 7 69 9
G 66 6 73 8
H 76 4 93 2
I 85 2 87 5
J 90 1 95 1
K 69 5 92 3
L 45 12 59 12
Solution

The calculation may be set out as follows:

Criterion $R_X$ $R_Y$ $D = R_X - R_Y$ $D^2$
A 8 7 1 1
B 9 6 3 9
C 3 4 -1 1
D 10 11 -1 1
E 11 10 1 1
F 7 9 -2 4
G 6 8 -2 4
H 4 2 2 4
I 2 5 -3 9
J 1 1 0 0
K 5 3 2 4
L 12 12 0 0
$\sum D^2 = 38$

Substituting in the formula for R gives the value

$$R = 1 - \frac{6 \times 38}{12 \times 143} = 0.87$$

Note that we have not made any attempt to interpret the meaning of this figure of 0.87. Methods for doing this are discussed below.
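The ranking and the calculation of R in Example 5 can be sketched in Python as below. The helper names are illustrative assumptions, and the ranking function assumes there are no tied gradings, as is the case here. The final lines check numerically that Pearson’s formula applied to the ranks gives the same value.

```python
from math import sqrt

def ranks_descending(values):
    """Rank 1 for the largest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_R(rank_x, rank_y):
    """Spearman's R = 1 - 6 * sum(D^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def pearson_r(xs, ys):
    """Pearson's r from the column sums."""
    n = len(xs)
    num = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    den = sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2)
               * (n * sum(y * y for y in ys) - sum(ys) ** 2))
    return num / den

# Gradings for criteria A to L from Example 5
first = [55, 53, 78, 50, 48, 61, 66, 76, 85, 90, 69, 45]
second = [75, 80, 89, 63, 67, 69, 73, 93, 87, 95, 92, 59]

rank_x = ranks_descending(first)
rank_y = ranks_descending(second)
R = spearman_R(rank_x, rank_y)
r_on_ranks = pearson_r(rank_x, rank_y)
print(round(R, 2), round(r_on_ranks, 2))  # 0.87 0.87
```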

1.5 The significance of Spearman’s R

Like Pearson’s r, the value of R may be shown to lie in the range $-1 \le R \le 1$, and in order to test the significance of a calculated value of R we set up the hypotheses

$$H_0: \rho = 0 \qquad H_1: \rho \neq 0$$

Key Point 7

Significance of Spearman’s R

We wish to know whether our correlation coefficient is significantly different to zero. It can be shown that for large samples, the test statistic

$$R_{test} = |R| \sqrt{\frac{n-2}{1-R^2}}$$
calculated from a sample of n pairs of values, follows a t-distribution with $n - 2$ degrees of freedom.

Critical values depending on the level of significance required are read from t -tables. When dealing with Spearman’s coefficient of correlation, the size of the sample is important. Different authors recommend different minimum sample sizes, a common recommendation being a minimum of n = 10 . Even though they are not used here, you should note that tables are available which allow us to read critical values corresponding to small sample sizes.

Example 6

A production engineer is asked to grade, on the basis of 12 criteria (say) A to L a junior colleague who has applied for promotion. He repeats his gradings after a few days. The results (calculated in Example 5) gave a value of R = 0.87 . Test at the 5% level to determine whether the results are consistent.

Solution

The calculation is:

$$R_{test} = |R| \sqrt{\frac{n-2}{1-R^2}} = 0.87 \times \sqrt{\frac{12-2}{1-0.87^2}} = 5.580$$

The 5% critical value for a two sided test read from tables is 2.228 and since 5.580 > 2.228 we conclude that we must reject the null hypothesis that the correlation coefficient is zero.
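Reading the critical value from tables and comparing it with the test statistic can be sketched as a small lookup. The dictionary below holds only the four two-sided 5% values quoted in this section (the .025 column of Table 2); the names `T_CRIT_5PCT_TWO_SIDED` and `is_significant` are illustrative assumptions.

```python
from math import sqrt

# Two-sided 5% critical values (alpha = .025 column of Table 2),
# keyed by degrees of freedom. Only the rows used in this section are included.
T_CRIT_5PCT_TWO_SIDED = {3: 3.182, 5: 2.571, 8: 2.306, 10: 2.228}

def is_significant(coef, n):
    """Return the test statistic and whether H0 is rejected at the 5% level."""
    df = n - 2
    statistic = abs(coef) * sqrt(df / (1 - coef ** 2))
    return statistic, statistic > T_CRIT_5PCT_TWO_SIDED[df]

statistic, significant = is_significant(0.87, 12)
print(round(statistic, 3), significant)  # 5.58 True
```

The same function applies to Pearson’s r, since both coefficients use the same test statistic; for Spearman’s R it relies on the large-sample approximation described in Key Point 7.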

Task!

As a result of two tests given to 10 students studying laboratory safety, the students were placed in the following class order.

Student Test 1 Test 2
A 2 3
B 4 5
C 3 7
D 5 9
E 1 10
F 6 2
G 8 6
H 7 8
I 9 4
J 10 1

Use Spearman’s R to discuss the consistency of their performances. Can you make any meaningful comment regarding the two tests as a means of assessing laboratory safety?

Setting up the hypotheses

$$H_0: \rho = 0 \qquad H_1: \rho \neq 0$$

and doing the appropriate calculations using a spreadsheet gives:

Test 1 Test 2 D D 2
2 3 -1 1
4 5 -1 1
3 7 -4 16
5 9 -4 16
1 10 -9 81
6 2 4 16
8 6 2 4
7 8 -1 1
9 4 5 25
10 1 9 81
$\sum D^2 = 242$

$R = -0.4666667$ $R_{test} = 1.49240501$

From t -tables it may be seen that the critical value (8 degrees of freedom) at the 5% level of significance is 2.306. Since 1.492 < 2.306 we cannot reject the null hypothesis that there is no correlation between the results. This implies that the performances of the students on the tests may not be related and we should question at least one of the tests as a means of assessing laboratory safety. One could, of course, question the usefulness of both tests!
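The figures for this Task can be checked with the short sketch below, which works directly from the two class orders. Note that R is negative here, so the modulus in the test statistic matters. Variable names are illustrative.

```python
from math import sqrt

# Class orders for students A to J
test1 = [2, 4, 3, 5, 1, 6, 8, 7, 9, 10]
test2 = [3, 5, 7, 9, 10, 2, 6, 8, 4, 1]

n = len(test1)
sum_d2 = sum((a - b) ** 2 for a, b in zip(test1, test2))
R = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
R_test = abs(R) * sqrt((n - 2) / (1 - R ** 2))
print(sum_d2, round(R, 4), round(R_test, 3))  # 242 -0.4667 1.492
```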

Task!

As part of an educational research project, twelve engineering students were given an intelligence test (IQ score) at the start of their first year course. At the end of the first year their results in engineering science (ES score) were noted down on the expectation that they would correlate with the results of the intelligence test. The results were as follows:

Student 1 2 3 4 5 6 7 8 9 10 11 12
IQ Score 135 120 125 135 125 140 135 140 135 140 120 135
ES Score 85 74 76 90 85 87 94 98 81 91 76 74

Calculate Pearson’s r for these data. Can you conclude that there is a linear relationship between IQ scores and ES scores? You may assume that the IQ scores and the ES scores are each normally distributed.

Setting up the hypotheses

$$H_0: \rho = 0 \qquad H_1: \rho \neq 0$$

and doing the appropriate calculations using a spreadsheet gives:

IQ (x) ES (y) $xy$ $x^2$ $y^2$
135 85 11475 18225 7225
120 74 8880 14400 5476
125 76 9500 15625 5776
135 90 12150 18225 8100
125 85 10625 15625 7225
140 87 12180 19600 7569
135 94 12690 18225 8836
140 98 13720 19600 9604
135 81 10935 18225 6561
140 91 12740 19600 8281
120 76 9120 14400 5776
135 74 9990 18225 5476
$\sum x = 1585$ $\sum y = 1011$ $\sum xy = 134005$ $\sum x^2 = 209975$ $\sum y^2 = 85905$

$r = 0.696$ $r_{test} = 3.065$

From t-tables it may be seen that the critical value (10 degrees of freedom) for a two-sided test at the 5% level of significance is 2.228. Since 3.065 > 2.228 we reject the null hypothesis that there is no linear association between the results. This implies that the performances of the students on the ES tests are linearly related to their IQ scores.
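The calculation for this Task can be sketched in Python as follows. For 10 degrees of freedom Table 2 gives 2.228 in the .025 column (two-sided 5% test) and 1.812 in the .05 column (one-sided 5% test); the sketch compares the statistic with both. Variable names are illustrative.

```python
from math import sqrt

iq = [135, 120, 125, 135, 125, 140, 135, 140, 135, 140, 120, 135]
es = [85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74]

n = len(iq)
num = n * sum(x * y for x, y in zip(iq, es)) - sum(iq) * sum(es)
den = sqrt((n * sum(x * x for x in iq) - sum(iq) ** 2)
           * (n * sum(y * y for y in es) - sum(es) ** 2))
r = num / den
r_test = abs(r) * sqrt((n - 2) / (1 - r ** 2))

T_TWO_SIDED, T_ONE_SIDED = 2.228, 1.812  # Table 2, 10 degrees of freedom
print(round(r, 3), round(r_test, 3))                 # 0.696 3.065
print(r_test > T_TWO_SIDED, r_test > T_ONE_SIDED)    # True True
```

The statistic exceeds both critical values, so the decision to reject the null hypothesis does not depend on whether the test is taken as one-sided or two-sided.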

1.6 Table 1: Upper 5% points of the F distribution


1.7 Table 2: Critical points of student’s t distribution


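Each entry is the value exceeded with upper-tail probability α by a t-distribution with ν degrees of freedom; for a two-sided test at the 5% level use the .025 column.

ν \ α .40 .25 .10 .05 .025 .01 .005 .0025 .001 .0005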
1 .325 1.000 3.078 6.314 12.706 31.821 63.657 127.32 318.31 636.62
2 .289 .816 1.886 2.920 4.303 6.965 9.925 14.089 22.327 31.598
3 .277 .765 1.638 2.353 3.182 4.541 5.841 7.453 10.213 12.924
4 .271 .741 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610
5 .267 .727 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869
6 .265 .718 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7 .263 .711 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8 .262 .706 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041
9 .261 .703 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781
10 .260 .700 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
11 .260 .697 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437
12 .259 .695 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318
13 .259 .694 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221
14 .258 .692 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140
15 .258 .691 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073
16 .258 .690 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015
17 .257 .689 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965
18 .257 .688 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922
19 .257 .688 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883
20 .257 .687 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850
21 .257 .686 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819
22 .256 .686 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792
23 .256 .685 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.767
24 .256 .685 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745
25 .256 .684 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725
26 .256 .684 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707
27 .256 .684 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.690
28 .256 .683 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674
29 .256 .683 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659
30 .256 .683 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646
40 .255 .681 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551
60 .254 .679 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460
120 .254 .677 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373
∞ .253 .674 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291