Chapter 1. Principles of Measurement Part 2

Measurement by Arranging in Assessed Order of Magnitude: Rank-Order Scales and Rating Scales

The temperature inside my office at a certain moment on a winter afternoon is 67.2°F, and the outdoor temperature at the same moment is 22.8°F. The equal interval property of the Fahrenheit scale permits you to conclude not only that it is warmer inside than out, but also that the magnitude of the difference is exactly 44.4°F. The same is true of equal interval scales in general: first, they allow us to sort out the relationships of 'greater than' and 'less than'; and second, they allow us to specify just how much greater-than or less-than one measure is in comparison with another.

In general, the relationships of 'greater than' and 'less than' are spoken of as ordinal relationships, so named because they specify the order of size, quantity, or magnitude among the particular items that are being measured. Although in principle any measurement scale at all that delineates such ordinal relationships could be described as an ordinal scale, the designation is superfluous for equal interval scales, since these are all ordinal scales necessarily. In practice, the term "ordinal scale" is confined to measurement scales that measure only ordinal relationships. Suppose, for example, that I did not have a thermometer. I could judge accurately enough that it is warmer inside than outside. I might even be able to judge that the indoor temperature is "slightly warmer," "moderately warmer," or "quite a lot warmer" than the outdoor temperature. But unless I had an extraordinarily well developed talent for judging temperature differences in fine-grained detail, I would not be in a position to specify precisely how much warmer the one is than the other.

If you ever have a choice between measuring a variable with an equal interval scale (which is also ordinal) or an ordinal scale (which is merely ordinal), choose the equal interval scale, for that is obviously the stronger form of measurement, since it contains a greater amount of information about the item or items that are being measured. As it happens, however, there are many kinds of variables for which measurement by an equal interval scale is either not possible or not practicable, and for which one therefore simply does not have the choice. For example, suppose we were interested in measuring the importance of the following four life goals for each of the students in your class: (i) enjoying a good family life, (ii) enjoying good friends, (iii) achieving career success, and (iv) having plenty of money. For better or worse, there is no standard meter stick, no clever electronic instrument, no mental calipers nor anything of the sort that will give us equal interval measures of the variable described by the phrase "importance of life goals." The same is true of any variable that in the final analysis comes down to a matter of preference, attitude, opinion, or judgment. Perhaps someday some really ingenious researcher will come along to invent the requisite equal interval mental calipers. In the meantime, no matter how deft or subtle we might be in attempting to measure such variables, all we will ever end up with is measurement on an ordinal scale.

¶Rank-Order Scales

The most obvious form of ordinal measurement is rank-ordering. In order to get the feel of this form of measurement, please take a moment to arrange the items that were just mentioned—(i) enjoying a good family life, (ii) enjoying good friends, (iii) achieving career success, and (iv) having plenty of money—in accordance with how important a role they play in your aims and hopes for what your life will be ten or fifteen years from now. Give the most important item the rank of "1", the second most important item the rank of "2", and so on. And then, having done it, reflect upon whether this particular measure of your life priorities, incomplete though it certainly is, would be capable of telling an outside observer anything about you that is worth knowing. Assuming that you have performed the rank-ordering seriously and honestly, I think you will agree that the answer is yes. If your rank-ordering ended up as

(1) family; (2) friends; (3) career; (4) money

it says one thing about you; and if it ended up as

(1) money; (2) career; (3) friends; (4) family

it surely says something quite different. Clearly this form of measurement is quite a lot weaker than the measurement of inches of width, pounds of weight, and so forth, but is a potentially useful form of measurement all the same. The key to using it well is to understand its limitations.

The central limitation of rank-order scales is simply that the successive points on such scales, which in all cases translate down to the ordinal succession "first," "second," "third," and so forth, are not intrinsically separated by equal intervals. Thus, for the hypothetical student who ranks the life-goal items as

(1) family; (2) friends; (3) career; (4) money

there is no reason whatever to suppose that the difference in importance between family and friends is of the same size as the difference in importance between friends and career, or between career and money. Similarly, for any two students whose rankings came out as

Student A: (1) family; (2) friends; (3) career; (4) money
Student B: (1) family; (2) friends; (3) career; (4) money

we could certainly say for both of them that enjoying good friends appears to be more important than achieving career success, but there would be no reason at all to suppose that the magnitude of the preference was the same for the first student as for the second. It would be entirely possible that the first student regarded friends as only marginally more important than career, while the second considered friends to be quite a lot more important than career.

The far-reaching consequence of this limitation is that mathematical operations involving addition, subtraction, and the calculation of averages can be applied to data that derive from a rank-order scale only in certain special circumstances, which in turn severely restricts the kinds of statistical analyses that can be legitimately performed upon such data. We will describe these special circumstances in later chapters as they become relevant. Until then, just bear in mind that while rank-order scales are capable of effectively delineating the ordinal relationships of 'greater than' and 'less than,' that is all they are capable of doing.

¶Rating Scales

A useful but potentially problematical variation on the theme of ordinal scales involves measurement by way of assigning numerical ratings. The usefulness of it stems from the fact that a rating scale is capable of providing somewhat more information than a rank-ordering. To illustrate, consider the two hypothetical students mentioned in the preceding section who had both ranked the four life-importance items, from most important to least important, as:

family > friends > career > money

Suppose that they were now to rate these same items on a scale from zero to 4, with the units of the scale defined as

zero = completely unimportant
1 = slightly important
2 = moderately important
3 = quite important
4 = exceedingly important

To allow them to make fine-grained distinctions, we will stipulate that they can break the scale into fractional units if they so wish; thus, in addition to using the integer values zero, 1, 2, 3, and 4, they can also rate items as, for example, 0.5, 2.6, 3.4, etc. When we mentioned the rank-orderings of these two hypothetical students in the preceding section, we indicated it would be entirely possible that the first student regarded friends as only marginally more important than career, while the second considered friends to be quite a lot more important than career. Here is a set of ratings for them that would correspond to this possibility:

	family		friends		career		money
Student A:	4.0	>	3.8	>	3.6	>	3.4
Student B:	4.0	>	3.5	>	2.0	>	1.0

As you can see, the rating scale would still reflect the same bare ordinal relationships of greater than and less than, but it would also potentially reveal a texture in these relationships that goes beyond the merely ordinal. Both students rate family and friends quite high on the scale, but whereas student A also rates career and money within the range between quite important and exceedingly important, student B places them down toward the bottom of the scale.
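The contrast between the two students can be checked with a short Python sketch. The ratings are the ones in the table above; the function names `rank_order` and `adjacent_gaps` are hypothetical helpers introduced only for this illustration.

```python
# Ratings taken from the table above
# (0 = completely unimportant, 4 = exceedingly important).
ratings = {
    "Student A": {"family": 4.0, "friends": 3.8, "career": 3.6, "money": 3.4},
    "Student B": {"family": 4.0, "friends": 3.5, "career": 2.0, "money": 1.0},
}


def rank_order(item_ratings):
    """Return the items sorted from highest to lowest rating (rank 1 first)."""
    return sorted(item_ratings, key=item_ratings.get, reverse=True)


def adjacent_gaps(item_ratings):
    """Return the rating difference between each pair of adjacent ranks."""
    ordered = sorted(item_ratings.values(), reverse=True)
    # Round to one decimal place to hide floating-point noise.
    return [round(hi - lo, 1) for hi, lo in zip(ordered, ordered[1:])]


for student, item_ratings in ratings.items():
    print(student, rank_order(item_ratings), adjacent_gaps(item_ratings))
```

Both students yield the same rank-order (family, friends, career, money), while the gaps between adjacent ratings come out as 0.2, 0.2, 0.2 for student A and 0.5, 1.5, 1.0 for student B. The gaps are shown only to make the added texture visible; they should not be read as equal interval quantities.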

The critical difference between rank-ordering and rating is that, whereas the points on a rank-ordering scale (first, second, third, etc.) delineate ordinal relationships and nothing other than ordinal relationships, the successive points on a rating scale represent intervals, in something of the same way that the scale points on a yard stick (6 inches, 7 inches, 8 inches, etc.) or a thermometer (66°F, 67°F, 68°F, etc.) represent intervals. Hence the ability of a rating scale to take us somewhat beyond the merely ordinal.

Please note well, however, that the operative word here is "somewhat." Rating scales can take us somewhat beyond the merely ordinal—but that does not mean that they take us anywhere near where an equal interval scale would take us. When rating scales are represented pictorially, they usually end up looking something like this

completely    slightly      moderately    quite         exceedingly
unimportant   important     important     important     important
     |             |             |             |             |
     0             1             2             3             4

which is, of course, highly suggestive of an equal interval scale. The resemblance, however, is superficial and entirely illusory; for while it is true that the rating scale is divided up into intervals, these intervals are not intrinsically equal. For student A, there is no reason at all to suppose that the distance between 4 and 3 on the rating scale is the same as the difference between 3 and 2, or between 2 and 1, or between 1 and zero. The same is true for student B and for anyone else who might be rating items on this or any other rating scale. Moreover, there is no reason to suppose that the difference between 4 and 3 for student A is the same as the difference between 4 and 3 for student B; and so on.

The implication is the same as described earlier in connection with rank-ordering. Basic mathematical operations involving addition, subtraction, and the calculation of averages can be applied to data that derive from a rating scale only in certain special circumstances, which in turn severely restricts the kinds of statistical analyses that can be legitimately performed upon such data. Here again, we will describe these special circumstances in later chapters as they become relevant. Until then, please keep clearly in mind that while rating scales do go somewhat beyond the merely ordinal delineations provided by rank-ordering, they still do not possess the properties of equal interval scales, and thus they do not easily permit the kinds of mathematical operations that are so readily permitted by equal interval scales. This caution is all the more important for rating scales, since their superficial resemblance to an equal interval scale can easily beguile the mind into thinking that the measurement is stronger than it really is.
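One way to see why averaging calls for caution is the following Python sketch. Everything in it is an assumption made for illustration: the two groups of ratings are invented, and the `stretched_top` labeling is simply one of many order-preserving ways of numbering the same five scale points. If the intervals between scale points are not known to be equal, neither labeling has a stronger claim than the other.

```python
from statistics import mean

# Hypothetical ratings on the 0-4 scale, invented for this illustration.
group_x = [0, 4, 4]
group_y = [3, 3, 3]

# Two numberings of the same five scale points. Both preserve the order
# of the points; they differ only in the spacing they assume.
conventional = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
stretched_top = {0: 0, 1: 1, 2: 2, 3: 3, 4: 10}


def mean_under(labeling, group):
    """Average a group's ratings after renumbering the scale points."""
    return mean(labeling[rating] for rating in group)


for name, labeling in [("conventional", conventional),
                       ("stretched_top", stretched_top)]:
    mx = mean_under(labeling, group_x)
    my = mean_under(labeling, group_y)
    print(name, "-> X higher" if mx > my else "-> Y higher")
```

Under the conventional numbering group Y has the higher average, while under the stretched numbering group X does. The ordinal relationships among the individual ratings are identical in both cases; only the assumed spacing has changed.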

A familiar example of this tendency to beguile is the standard letter grade scale of A, A-, B+, B, B-, etc., commonly used for the measurement of academic achievement. Clearly it is only a rating scale. The various letters are really only shorthand ways of saying "excellent, superior, top-drawer" for an A, "not quite a full A, but still very good" for an A-, and so on down the line for B+, B, B-, C+, C, etc. And yet, once the merely ordinal verbal phrases are translated into letters, and the practice becomes enshrined in academic tradition, how very easy it is to drift into the fallacy of supposing that letter grades have the properties of an equal interval scale: to suppose, for example, that the difference between an A and a B is of the same magnitude as the difference between a B and a C, or between a C and a D. In most academic institutions this fallacy is even given official sanction by way of further translating the letter grades into numerical values known as quality points. At my own institution the conventional translation is as follows:

A	A-	B+	B	B-	C+	C	C-	D+	D	D-	F
4.0	3.7	3.3	3.0	2.7	2.3	2.0	1.7	1.3	1.0	0.7	0

Now this certainly looks like an equal interval scale, and one, moreover, that is capable of making some rather fine distinctions. But here again the resemblance is only superficial and illusory. The letter grade system is fundamentally an ordinal rating scale, and merely exchanging the letters for numbers will not make it anything more than an ordinal rating scale.
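As a rough illustration, the quality-point translation in the table above can be written out in Python. The helper names `step_sizes` and `grade_point_average` are hypothetical, and the averaging function shows only the arithmetic that the translation invites, not a recommendation to use it.

```python
from decimal import Decimal

# Quality-point translation as given in the table above.
# Decimal avoids floating-point noise in the subtractions.
quality_points = {
    grade: Decimal(points)
    for grade, points in [
        ("A", "4.0"), ("A-", "3.7"), ("B+", "3.3"), ("B", "3.0"),
        ("B-", "2.7"), ("C+", "2.3"), ("C", "2.0"), ("C-", "1.7"),
        ("D+", "1.3"), ("D", "1.0"), ("D-", "0.7"), ("F", "0"),
    ]
}


def step_sizes(points):
    """Numeric step between each grade and the next lower grade."""
    grades = list(points)  # insertion order runs from A down to F
    return {(hi, lo): points[hi] - points[lo]
            for hi, lo in zip(grades, grades[1:])}


def grade_point_average(grades):
    """Average the quality points of a list of letter grades."""
    return sum(quality_points[g] for g in grades) / len(grades)


for (hi, lo), step in step_sizes(quality_points).items():
    print(f"{hi} to {lo}: {step}")

print(grade_point_average(["A", "C"]) == quality_points["B"])
```

Two things stand out. The numerical steps are not even uniform on their own terms (0.3 or 0.4 between most adjacent grades, 0.7 between D- and F). And averaging an A with a C gives exactly the value assigned to a B, which is meaningful only if the A-to-B and B-to-C differences really are equal, the very assumption the text calls into question.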

¶Measurement by Sorting: Categorical Measurement

Measurement of this type involves examining each item in a set of items to determine whether its properties match the defining criteria of this, that, or the other category. Thus, the measure of gender has two possible categorical outcomes, female and male. Sort the students in your classroom according to the defining criteria of these two categories, and the result for each student is a categorical measure of gender. Sort the same students into the four categories freshman, sophomore, junior, and senior, and the result for each student is a categorical measure of academic class.

In some cases of categorical measurement, such as the division of gender into female and male or the division of religious affiliation into Catholic, Protestant, Jewish, Moslem, etc., there are no intrinsic relationships of 'greater than' or 'less than' among the categories. It would be nonsense, for example, to say that females have more gender than males, or vice versa. Equally nonsensical would be the suggestion that the category Catholic, in and of itself, denotes a greater or lesser degree of religious affiliation than the category Protestant. Categories of this type are spoken of as nominal categories (from the Latin nomen, meaning "name").

In other cases, however, such as the delineation of academic class into freshman, sophomore, junior, and senior, the categories do have intrinsic quantitative relationships of 'greater than' or 'less than'. The category of sophomore represents a higher academic class than the category of freshman; junior represents a higher academic class than sophomore; and so on. Another fairly obvious example would be the sorting of college faculty into the categories instructor, assistant professor, associate professor, and full professor. Categories of this type are spoken of as ordinal categories, so named because there is a quantitative order that exists among them.

At first glance, categorical measurement is likely to seem rather inconsequential. True, you can measure each person in a group with respect to the categorical variable of gender, but then what? Clearly it would make no sense to say "one unit of gender plus two units of gender equals three units of gender," for there is simply no such thing as a unit of gender. It would be equally bizarre to speak of the average gender of a group of persons, or of the average of maleness or femaleness, because there is nothing here that you can take an average of. Categorical measurement is sometimes described as measurement on a "nominal scale," but that is actually a misnomer, since there is really no scale involved here at all. It is simply a matter of sorting items into categories.

Note, however, that even the most purely nominal and non-quantitative form of categorical measurement is capable of producing data that do have a distinctly quantitative dimension. Sort the students in your classroom into the two strictly nominal categories female and male, and you are not only taking the categorical measure of gender for each student; you are also forming the basis for measuring how many students there are that fall within each of the two categories. Thus, in a class of 29 students you might find by enumeration that there are 15 females and 14 males. As we noted earlier, anytime you count up the number of discrete items in a set (females, males, Catholics, Protestants, etc.) you are using the scale of cardinal numbers, which, since it is both an equal interval scale and a ratio scale, readily permits the basic mathematical operations of addition, subtraction, multiplication, and division. Hence, even though categorical measurement starts out as a rather weak form of measurement, it is capable of yielding secondary measures that are quite strong. There are certain cautions that do need to be kept in mind with respect to categorical measurement, but we will save these until later chapters.
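The step from sorting to counting can be sketched in Python. The roster below is hypothetical; it is constructed only so that its counts match the 15 females and 14 males of the example.

```python
from collections import Counter

# Hypothetical class roster: one categorical measure of gender per student.
roster = ["female"] * 15 + ["male"] * 14

# Sorting into categories yields counts, and counts are cardinal numbers.
counts = Counter(roster)
total = sum(counts.values())

print(counts["female"], counts["male"], total)

# Because counts lie on a ratio scale, proportions are meaningful.
proportion_female = counts["female"] / total
print(round(proportion_female, 3))
```

The individual measures ("female", "male") support no arithmetic at all, but the counts derived from them can be added, subtracted, multiplied, and divided freely.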

An especially useful variation on the theme of categorical measurement is the situation in which you are cross-categorizing items according to two or more dimensions of classification concurrently. This, in fact, is what we did earlier for Table 1.1 when we cross-classified equal interval scales of measurement according to the two dimensions, non-ratio versus ratio and discrete versus continuous. Similarly, you could sort the students in a classroom concurrently according to any two or more applicable dimensions of categorical measurement, such as gender, academic class, ethnic identity, liberal versus conservative, geographical origins, and so on. The scientific utility of such bivariate (two variables) or multivariate (three or more variables) cross-classification is that it provides a basis for determining whether the two or more categorical variables tend to be systematically associated with each other. Here as well is a point that will be examined further in later chapters.
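A minimal Python sketch of bivariate cross-classification follows. The eight students and their categories are invented for illustration, and `contingency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical students, each measured on two categorical variables:
# (gender, academic class).
students = [
    ("female", "freshman"), ("male", "freshman"), ("female", "sophomore"),
    ("female", "junior"), ("male", "junior"), ("male", "senior"),
    ("female", "senior"), ("female", "freshman"),
]

genders = ["female", "male"]
classes = ["freshman", "sophomore", "junior", "senior"]

# Count how many students fall into each (gender, class) cell.
cells = Counter(students)


def contingency_table(cell_counts, rows, cols):
    """Arrange the cell counts as rows (gender) by columns (class)."""
    return [[cell_counts[(r, c)] for c in cols] for r in rows]


table = contingency_table(cells, genders, classes)
for gender, row in zip(genders, table):
    print(f"{gender:<8}", row)
```

Each cell of the resulting table is a count, so the table as a whole is the raw material for asking whether the two variables tend to go together.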

Taking the Measure of Things: Beyond the Obvious

The next time you take an examination in a course, you can console yourself with the thought that you are participating in the process of measurement. Your instructor wishes to assess your knowledge and understanding of the subject matter; to that end, he or she will give you a set of tasks to perform; and the degree to which you perform well on those tasks will then be taken as the measure of your knowledge and understanding.

And now for a practice mini-exam. When your instructor measures your knowledge and understanding of the subject matter in this fashion, which of the following forms of measurement does it most closely resemble: (a) the measurement of the width of a desk with a tape measure, (b) the measurement of the number of students in a room by counting them, or (c) the measurement of temperature with a conventional bulb and glass-tube thermometer?

The answer is that it most closely resembles (c), and here is the reason for it. When you measure the width of an object by holding a tape measure right up against it, you are measuring width directly. Similarly, when you are measuring the number of students in a room, you are directly counting them off as one, two, three, four, etc. When you measure temperature with a thermometer, on the other hand, what you are measuring directly is not temperature itself, but rather the effect that temperature has upon the fluid that is inside the glass tube. As the temperature rises or falls, the fluid in the tube expands or contracts. Hence, the column of fluid rises and falls in accordance with increases and decreases of temperature. Similarly for the exam: what is being measured directly is not your knowledge and understanding of the subject, but rather the effect that knowledge and understanding are presumed to have upon your performance on the exam. Your instructor's assumption is that performance on the exam will be relatively high for those students whose mastery of the subject is strong, and relatively low for those whose mastery of it is weak. But even if the exam is the best and fairest test that any instructor has ever devised, the fact remains that your score on the exam is a direct measure only of how well you perform on the exam. As a measure of the knowledge and understanding that are presumed to underlie the performance, it is strictly indirect.

Scientific researchers are very often in the same position as your instructor. They can get nowhere at all without taking the measure of the variables that are of interest to them; and as many of these variables cannot be measured directly, ways must be found of measuring them indirectly. Indeed, a very considerable part of scientific progress over the past several centuries has resulted from precisely this task of devising indirect ways of measuring variables that cannot be measured directly.

Although indirect measurement has played a prominent role in the development of the hard-core physical sciences, such as physics and chemistry, it is especially important in those realms of scientific research that are concerned with what goes on inside living organisms. For not only is it often the case that the internal states and processes of an organism cannot be directly measured; in many instances they cannot even be directly observed. Take, for example, the internal state known as fear. Now there is certainly a sense in which I can directly observe my own state of fear. If called upon to do so, I could even quantify it somewhat by rating it as 'slight fear', 'moderate fear', 'intense fear', and so on. But I have no such direct access to the fear that you or any other human being might experience, nor to the fear that any non-human organism might experience. In those instances, the only way I could measure fear would be indirectly by way of measuring the effect that fear has upon this, that, or the other aspect of the behavior or physiology of the organism. In fact, if I were to ask you or some other verbal organism to rate the fear on an ordinal scale, that too would be only an indirect measure, for in this case the only thing I would be measuring directly would be the effect that the fear (in conjunction with the request) has upon this particular momentary aspect of your verbal behavior. The same general point applies to any presumed internal state or process of an organism that cannot be directly observed by the investigator. These include a very large part of what is of interest to the behavioral sciences, such as psychology, as well as a considerable portion of what is of interest nowadays to the biological sciences: in general, mental states, mental processes, perception, intelligence, emotion, motivation, the organismic states produced by learning, conditioning, habituation, and the like, to mention but a few.

Although the indirectness of a measure certainly does not disqualify it from being useful, it does raise two flags of caution. Actually, these are flags that should be raised, at least tacitly, every time you come across any kind of measurement at all, even if it seems to be entirely direct and straightforward. The first has to do with what is conventionally spoken of as the reliability of a measure, and the second pertains to what is described as the validity of a measure.

¶Reliability

The dimensions of my solid wooden desk do not vary substantially from one time to another nor from one place to another; and, providing that I avoid extremes of temperature, neither do the dimensions of my steel tape measure. Hence, if the width of the desk measures out at precisely 60.0 inches on one occasion, it is virtually certain to measure out at precisely 60.0 inches on all other occasions as well. Thus, if I were to take three successive measures of the width of my desk at one-hour intervals (at normal room temperature), the three results would almost certainly be: 60.0 inches, 60.0 inches, and 60.0 inches. This is Scenario 1. Now consider two alternative scenarios. Scenario 2. My tape measure is made of a cloth fabric that stretches slightly in accordance with the amount of tension that is placed upon it. Each time I use it I inadvertently apply varying amounts of tension, with the result that the three measures of desk width come out as: 60.5 inches, 60.0 inches, and 60.3 inches. Scenario 3. My tape measure is made of a highly elastic substance that stretches quite a lot when tension is applied to it. In addition, it undergoes a quite large degree of expansion and contraction in response to small moment-to-moment fluctuations in normal room temperature. The three measures in this scenario come out as: 3.0 inches, 1,098.7 inches, and 63.2 inches.

Here in a nutshell is the meaning of the concept of the reliability of measurement: if you repeatedly measure the same item by the same process, and the result is precisely the same on each occasion (Scenario 1), the measurement process is highly reliable; if the results are somewhat variable but consistent (Scenario 2), it is relatively reliable; and if they are variable and wildly inconsistent (Scenario 3), then it is highly unreliable.
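The three scenarios can be compared with a small Python sketch. The measurements are those given in the text; summarizing each set by its range (largest minus smallest) is merely one simple choice made here for illustration, not a formal definition of reliability.

```python
# The three repeated measurements (in inches) from each scenario in the text.
scenarios = {
    "Scenario 1": [60.0, 60.0, 60.0],    # steel tape
    "Scenario 2": [60.5, 60.0, 60.3],    # cloth tape
    "Scenario 3": [3.0, 1098.7, 63.2],   # highly elastic tape
}


def spread(measures):
    """Range of repeated measures: largest minus smallest."""
    return max(measures) - min(measures)


for name, measures in scenarios.items():
    print(f"{name}: range = {spread(measures):.1f} inches")
```

The smaller the spread among repeated measures of the same item, the more reliable the measurement process: no spread at all in Scenario 1, half an inch in Scenario 2, and over a thousand inches in Scenario 3.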

¶Validity

Obviously, a procedure of measurement is worth bothering with only in the degree that it is reliable, in the sense that we have just defined. But equally obvious is that reliability, in and of itself, is not sufficient. Here is a classic example. In the early decades of the nineteenth century there developed a would-be method of psychological assessment known as phrenology, which claimed that important variables of personality and mentality could be measured by mapping out the contours of a person's skull. Thus, the protuberance or depression of the skull in one location was taken as a measure of self-esteem (the greater the protuberance, the higher the self-esteem, and vice versa); elsewhere, it was a measure of cautiousness, benevolence, combativeness, or of some other item in a rather long list of such attributes. There was even a bump on the head that was alleged to measure a person's capacity for reverence.

But preposterous though these measures of the phrenologists might seem, the fact is that they were very highly reliable. Three or five or ten separate measures of the contours of a person's skull made by phrenologist A would all end up virtually identical. Phrenologist B, mapping the same skull, would also end up with virtually identical results. The shortcoming of the phrenologists' measures was not that they were unreliable, but that they were not measuring what they were supposed to be measuring. As direct measures of the depressions and protuberances of the skull they were perfectly fine. But as indirect measures of the dimensions of personality and mentality, they were simply humbug.

In brief, the measures of the phrenologists had plenty of reliability, but as they were not really measuring what they purported to be measuring, they had no validity. The basic meaning of this concept is probably already familiar to you through your experience with school or college examinations of the type described earlier. You have in all likelihood taken some exams that seemed to you to be poor assessments of a student's mastery of the subject, even though you personally might have done well on them; and you have no doubt taken other exams that seemed to you to be good assessments, even though you personally might have done less well on them than you would have liked. In the language of students, it is the distinction between a "fair" exam and an "unfair" exam. A fair exam is one that actually measures what it purports to measure, namely, the student's knowledge and understanding of the subject matter; and an unfair exam is one for which the student's score substantially reflects something other than knowledge and understanding, for example, the student's ability to spot and deal with trick questions, to remember picayune details mentioned by the instructor or the textbook en passant, or to adhere to some particular theoretical or ideological party line favored by the instructor.

End of Chapter 1.