Data are often presented to statisticians in raw form and need organising so that statisticians and non-statisticians alike can view the information they contain. Simple columns of figures do not mean a lot to most people! As a start, we usually organise the data into a frequency table. The way in which this may be done is illustrated below.
The following data are the heights (to the nearest tenth of a centimetre) of 30 students studying engineering statistics.
Notice first of all that all of the numbers lie in the range 150 cm to 185 cm. This suggests that we try to organise the data into classes as shown below. This first attempt deliberately uses simple class intervals which give a reasonable number of classes and span the numerical range covered by the data.
|Class||Class Interval (cm)|
|1||150 - 155|
|2||155 - 160|
|3||160 - 165|
|4||165 - 170|
|5||170 - 175|
|6||175 - 180|
|7||180 - 185|
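The construction of this first set of classes can be sketched in Python. This is only an illustration: the function name `make_classes` and its parameters are assumptions introduced here, with the range 150 to 185 and the class width of 5 taken from the table above.

```python
def make_classes(lower, upper, width):
    """Return consecutive (start, end) class intervals covering lower..upper."""
    classes = []
    start = lower
    while start < upper:
        classes.append((start, start + width))
        start += width
    return classes


# Range and width as used in the first attempt above.
classes = make_classes(150, 185, 5)
for number, (start, end) in enumerate(classes, start=1):
    print(f"{number}: {start} - {end}")
```

Changing `width` is a quick way to see how the number of classes varies for the same range.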
Note that in the extreme we could argue that the original data are already represented by one class with thirty members, or that we already have 30 classes with one member each!
Neither interpretation is helpful, and we usually look to use about 5 to 8 classes. Note that this number may be varied depending on the data under investigation.
When we attempt to allocate data to classes, difficulties can arise. For example, to which class should the number 165 be allocated? Clearly we have no reason for choosing the class 160 - 165 in preference to the class 165 - 170; either class would do equally well.
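The ambiguity can be seen in a short Python sketch. It assumes that each class includes both of its end points, which is one possible reading of the first table; the helper name `classes_containing` is hypothetical.

```python
def classes_containing(value, classes):
    """Return every class whose closed interval [low, high] contains value."""
    return [(low, high) for low, high in classes if low <= value <= high]


# The two neighbouring classes discussed above.
neighbouring_classes = [(160, 165), (165, 170)]

matches = classes_containing(165, neighbouring_classes)
print(matches)  # both classes claim the value 165
```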
Rather than adopt an arbitrary convention such as always placing boundary values in the higher (or lower) class we usually define the class boundaries in such a way that such difficulties do not occur.
This can always be done by using one more decimal place for the class boundaries than is used in the data themselves although sometimes it is not necessary to use an extra decimal place. Two possible alternatives for the data set above are shown below.
|Class||Class Interval 1||Class Interval 2|
|1||149.5 - 154.5||149.55 - 154.55|
|2||154.5 - 159.5||154.55 - 159.55|
|3||159.5 - 164.5||159.55 - 164.55|
|4||164.5 - 169.5||164.55 - 169.55|
|5||169.5 - 174.5||169.55 - 174.55|
|6||174.5 - 179.5||174.55 - 179.55|
|7||179.5 - 184.5||179.55 - 184.55|
Notice that no member of the original data set can possibly lie on a boundary in the case of Class Intervals 2; this is the advantage of using an extra decimal place to define the boundaries. Notice also that in this particular case the first alternative suffices, since it happens that no member of the original data set lies on a boundary defined by Class Intervals 1.
Since Class Intervals 1 is the simpler of the two alternatives, we shall use it to obtain a frequency table of our data.
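The effect of the extra decimal place can be checked with a small Python sketch. It does not use the students' heights; instead it assumes only that a measurement is recorded to one decimal place and tests every such value from 149.5 to 184.5 against both sets of boundaries. `Decimal` is used so that values such as 149.55 are compared exactly.

```python
from decimal import Decimal

# The eight boundaries of each alternative, taken from the table above.
boundaries_1 = {Decimal("149.5") + 5 * k for k in range(8)}
boundaries_2 = {Decimal("149.55") + 5 * k for k in range(8)}

# Every value with one decimal place from 149.5 to 184.5 inclusive
# (an assumed set of possible measurements, not the actual data).
candidates = [Decimal(n) / 10 for n in range(1495, 1846)]

hits_1 = sum(1 for value in candidates if value in boundaries_1)
hits_2 = sum(1 for value in candidates if value in boundaries_2)

print(hits_1)  # possible measurements that coincide with a Class Intervals 1 boundary
print(hits_2)  # possible measurements that coincide with a Class Intervals 2 boundary
```

With Class Intervals 1 a clash is possible in principle, so the data must be inspected; with Class Intervals 2 no one-decimal value can ever coincide with a boundary.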
The data are organised into a frequency table using a tally count. To do a tally count you simply lightly mark or cross off a data item with a pencil as you work through the data set to determine how many members belong to each class. Light pencil marks enable you to check that you have allocated all of the data to a class when you have finished. The number of tally marks must equal the number of data items. This process gives the tally marks and the corresponding frequencies as shown below.
|Class Interval (cm)||Tally||Frequency|
|149.5 - 154.5||11||2|
|154.5 - 159.5|| ||0|
|159.5 - 164.5||1111||4|
|164.5 - 169.5||11111 111||8|
|169.5 - 174.5||11111||5|
|174.5 - 179.5||11111 11||7|
|179.5 - 184.5||1111||4|
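A tally count of this kind can be sketched in Python as follows. The function name `tally` is an assumption, and the five sample values are made up purely for illustration; they are not the heights of the 30 students. The boundaries are those of Class Intervals 1.

```python
def tally(values, boundaries):
    """Count how many values fall strictly between each pair of adjacent boundaries."""
    counts = [0] * (len(boundaries) - 1)
    for value in values:
        for i in range(len(counts)):
            if boundaries[i] < value < boundaries[i + 1]:
                counts[i] += 1
                break
        else:
            # The value lies on a boundary or outside the range of the classes.
            raise ValueError(f"{value} cannot be allocated to a class")
    return counts


# Class Intervals 1: 149.5, 154.5, ..., 184.5
boundaries = [149.5 + 5 * k for k in range(8)]

# Hypothetical sample values, NOT the original data set.
sample = [151.2, 163.8, 166.0, 168.4, 172.9]

counts = tally(sample, boundaries)
print(counts)

# The number of tally marks must equal the number of data items.
assert sum(counts) == len(sample)
```

Raising an error for a value on a boundary mirrors the point made earlier: the class boundaries should be chosen so that this never happens.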
It is now easier to see some of the information contained in the original data set. For example, we now know that there are no data in the class 154.5 - 159.5 and that the class 164.5 - 169.5 contains the most entries.
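These observations can be read off the frequency table mechanically. The Python sketch below uses the frequencies exactly as given in the table; the variable names are assumptions chosen for illustration.

```python
# Frequencies copied from the frequency table above.
frequency_table = {
    "149.5 - 154.5": 2,
    "154.5 - 159.5": 0,
    "159.5 - 164.5": 4,
    "164.5 - 169.5": 8,
    "169.5 - 174.5": 5,
    "174.5 - 179.5": 7,
    "179.5 - 184.5": 4,
}

# The frequencies should account for all 30 students.
total = sum(frequency_table.values())

# Classes containing no data, and the class with the most entries.
empty_classes = [c for c, f in frequency_table.items() if f == 0]
largest_class = max(frequency_table, key=frequency_table.get)

print(total)
print(empty_classes)
print(largest_class)
```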
Understanding the information contained in the original table is now rather easier but, as in all branches of mathematics, diagrams make the situation easier to visualise.