Chapter 4 Measurement

Modified: 2008-02-22


In The Know--Perhaps the most important step you can take if you want to move from confusion to confidence about a scientific topic is to find out how the concepts are measured. Knowing how schizophrenia or intelligence or time is measured gives you a big boost toward understanding the topic.


In The Know--Magazines and the Internet are the source of many tests. People like to take tests. Using tests published in magazines, you can determine your score as a friend, lover, or intellectual. You can determine whether you are right-brained, depressed, or destined to become rich or lose weight. Unfortunately, most of these tests have not been checked for reliability or validity. Their purpose is entertainment, not measurement. Of course, now that you know the basics of reliability, you can check a test's reliability yourself.

GLOSSARY


An explanation of how to make a measurement is called an operational definition. More formally, an operational definition is a description of the procedures used by the researcher to measure a variable or create levels of a variable. These procedures are connected in a logical way to the concept or condition that is being operationally defined.


To illustrate operational definitions of psychological concepts, we contrast them with their dictionary definitions. Table 4.1 lists six psychological variables, their dictionary definitions, and one or more operational definitions. You already know that dictionary definitions almost always help us understand a concept better. An operational definition adds to that understanding in a special way by telling how the concept is measured. Notice in Table 4.1 that several variables have two or more operational definitions. A variable such as ADHD can be measured in different ways by different researchers. Thus, not all studies that refer to ADHD are referring to the same thing. However, in every case, the operational definition of ADHD tells you what the researcher meant by that concept.

Table 4.1  Dictionary definitions and operational definitions of psychological terms

Frustration
  Dictionary definition*: State of insecurity and dissatisfaction arising from unresolved problems or unfulfilled needs
  Operational definitions (to measure or create):
    1. Abruptly remove toys that a child is playing with
    2. Withhold reinforcement for performing a previously reinforced behavior

Hunger
  Dictionary definition*: A craving or urgent need for food
  Operational definitions (to measure or create):
    1. Hours of food deprivation
    2. Percent of ad libitum weight

Depression
  Dictionary definition*: A state of feeling sad
  Operational definition (to measure or create): Beck Depression Inventory score†

Happiness
  Dictionary definition*: A state of well-being and contentment
  Operational definition (to measure or create): Subjective Happiness Scale score‡

Memory
  Dictionary definition*: The power or process of reproducing or recalling what has been learned and retained
  Operational definitions (to measure or create):
    1. Trials to relearn
    2. Custom-made multiple-choice test
    3. fMRI activity

Attention deficit/hyperactivity disorder (ADHD)
  Dictionary definition*: A syndrome of learning and behavioral problems that is not caused by any serious underlying physical or mental disorder and is characterized esp. by difficulty in sustaining attention, by impulsive behavior (as in speaking out of turn), and usu. by excessive activity
  Operational definitions (to measure or create):
    1. ACTeRS checklist
    2. ADDES-2 checklist
    3. ADHD-IV checklist
    4. ADHDT checklist
    5. CPR-S checklist§
    6. T.O.V.A.¶

*Merriam-Webster's Collegiate Dictionary, 11th edition (2004).
†Beck, Steer, and Garbin (1988).
‡Lyubomirsky (1999).
§Demaray, Elting, and Schaefer (2003) describe and compare the five checklists.
¶T.O.V.A.--http://www.tovacompany.com/

One way to distinguish among variables is to determine the variable's scale of measurement. To determine the scale of measurement, focus on the thing that is being measured and not on the numbers themselves. To illustrate, look at Table 4.2, which shows three different measurements for two teams participating in an adventure race. Look at the first line, an 8 and a 16, which are the identification numbers of the two teams. What can you say about the two teams based on these two numbers? The two numbers tell you only that they are two different teams and nothing more. Agree?

Table 4.2  Measurements of two teams on three variables in an adventure race

Thing measured       Team A    Team B
Identification #        8        16
Order of finish         8        16
Hours required          8        16


Now look at the next pair of scores, which are also 8 and 16. These two numbers indicate the order of crossing the finish line. Of course, the two different orders also tell us that there were two different teams. Team A finished before Team B, but these two numbers tell us nothing about the time between the two crossings.


Finally, the bottom row again shows an 8 and a 16, which are the number of hours required to finish the race. The interpretation of these two numbers is that Team A was twice as fast as Team B (and also that it was faster and different). The three 8s for Team A are all the same, but each 8 carries different information. The take-home message about scales of measurement is that the interpretation of numbers such as 8 and 16 depends on the scale of measurement. Four scales of measurement are typically identified: nominal, ordinal, interval, and ratio.
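If you like to see ideas worked out in code, the short Python sketch below uses the numbers from Table 4.2 to show how the very same pair of values supports different conclusions under each interpretation.

```python
# The same numbers from Table 4.2, interpreted three ways.
team_a, team_b = 8, 16

# As identification numbers (nominal): only "same or different" is meaningful.
print(team_a != team_b)        # True -- two different teams, nothing more

# As order of finish (ordinal): "more than / less than" is also meaningful.
print(team_a < team_b)         # True -- Team A crossed the line first

# As hours required (ratio): differences and ratios are meaningful as well.
print(team_b - team_a)         # 8   -- Team B needed 8 more hours
print(team_b / team_a)         # 2.0 -- Team A was twice as fast
```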


Nominal Scale  On a nominal scale, numbers carry the same information as names – the numbers carry no quantitative information. Examples of psychological variables that are measured on a nominal scale are psychological disorders, personality types, and gender. The variable psychological disorders, for example, consists of different categories. One person might be diagnosed with paranoid-type schizophrenia (which is numbered 295.30) and another with obsessive-compulsive disorder (which has a numerical classification of 300.3). The numbers associated with these two diagnoses are from the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV), which describes different psychological disorders and gives them numbers (American Psychiatric Association, 1994). The numbers are just substitute labels and carry no quantitative information, either about amount or order. In Table 4.2, the numbers used to identify the teams are from a nominal variable.


Ordinal Scale  Numbers that result from ordinal measurement indicate more than and less than (in addition to different). If a participant is asked to rank animals from most frightening to least frightening, then fear of animals is being measured with an ordinal scale. Scores based on checklists in which the score is the sum of the number of items checked are ordinal scores. Judgments of recovery such as much improved, improved, no change, and worse produce an ordinal scale. Ordinal scales are characterized by rank order. In Table 4.2, the numbers used to identify the order of finish of teams are from an ordinal variable.


Interval Scale  For an interval scale measurement, equal intervals between numbers indicate equal amounts of the thing being measured, and the zero point (if any) is arbitrarily defined. To illustrate, we turn to a nonpsychological example: temperature. Everyday thermometers use an interval scale. The amount of heat needed to increase the temperature from 35° F to 45° F is the same amount of heat that is needed to raise the temperature from 80° F to 90° F. Any two temperatures that differ by 10 degrees indicate equal amounts of heat. Note that a value of zero on the Fahrenheit scale does not mean that no heat exists.

Turning to psychological variables measured on an interval scale, we have to acknowledge a problem. We know of no measurement of a psychological variable that definitely has the characteristics of an interval scale. Many concepts measured by psychologists produce numbers that clearly carry more than ordinal scale information, but proving equal intervals requires assumptions that are not universally accepted. Nevertheless, most researchers treat these measurements as if they were interval data, even though they cannot prove that equal intervals indicate equal amounts of the variable being measured. Thus, variables such as intelligence, depression, happiness, sociability, and many others are usually treated as interval scale variables.


Ratio Scale  The characteristic that distinguishes a ratio scale from the others is that zero on this scale means that there is a zero amount of the thing being measured. Variables such as reaction time, errors, height, and weight are all measured with ratio scales. In each case, zero means that none of the thing measured was detected. With ratio scale measurements, interpretations such as “twice as much” or “a 10 percent increase” are permissible. Such statements are inappropriate for the other three scales. In Table 4.2, the numbers used to identify the hours required to finish are from a ratio variable. The differences among these four scales of measurement are summarized in Table 4.3.


Table 4.3  Characteristics of the four scales of measurement (based on Spatz, 2005)

Scale of       Different numbers       Numbers convey more    Equal differences      Zero means none of what
measurement    for different things    than & less than       mean equal amounts     was measured was detected
Nominal        Yes                     No                     No                     No
Ordinal        Yes                     Yes                    No                     No
Interval       Yes                     Yes                    Yes                    No
Ratio          Yes                     Yes                    Yes                    Yes

Using scales of measurement     Knowing a variable's scale of measurement is important when you determine what descriptive statistic to use. For example, nominal scale data permit only a mode. Ordinal scale data permit the use of the median or the mode. Interval and ratio scale data permit the calculation of the mean (as well as the median and the mode). Interval or ratio data are required if a standard deviation is to be meaningful. As for determining an NHST statistical test to use, those decisions should be made on the basis of the assumptions of the test, rather than on the scale of measurement.
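To make the rule concrete, here is a brief Python sketch with made-up scores treated as interval data. It computes the statistics that interval and ratio data permit; for nominal data only the mode would be meaningful, and for ordinal data only the mode and median.

```python
import statistics

# Hypothetical scores treated as interval data.
scores = [12, 15, 15, 18, 20, 22, 15]

print("Mode:  ", statistics.mode(scores))    # permitted for any scale
print("Median:", statistics.median(scores))  # ordinal, interval, and ratio data
print("Mean:  ", statistics.mean(scores))    # interval and ratio data only
print("SD:    ", statistics.stdev(scores))   # interval and ratio data only
```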

In The Know--The question of how to use the information in Table 4.3 is one that has produced different answers at different times. When this catalog of scales of measurement was first presented by Stevens (1946), it was promoted as a way to determine what inferential statistical tests were appropriate for a particular set of data. The idea was that different classes of inferential tests required a particular scale of measurement. Over the years (but with occasional resurgence), this idea has lost adherents. Our position is that the distinctions among scales of measurement are not helpful in deciding what statistical test to use. What is required is knowledge of the nature of the distributions of the numbers and not the scale of measurement.

Adventure races have coed teams that travel by foot, water, and bicycle for hours and hours across rugged terrain with only a map and a compass to guide them.

Other ways to look at variables:

Continuous, Discrete, and Dichotomous Variables


Continuous variables have, in theory, an infinite number of different values between the highest and lowest score. Thus, continuous variables can have any number of decimal places. In practice, the number of values that can actually be recorded is limited by the precision of the measuring instrument, although that number is still large. Length, time, and cognitive ability are all examples of continuous variables.

Discrete variables have only a limited and countable number of distinct steps between the highest score and the lowest score, both in theory and in practice. Discrete variables include birth order (first born, second born), college class standing (sophomore, junior), and the number of people in your research methods class.

Dichotomous variables have only two levels. Gender (male and female), performance (pass or fail), and psychiatric diagnoses such as schizophrenic and not schizophrenic are examples of dichotomous variables. Testing a child for ADHD results in a diagnosis of ADHD or not ADHD, even though everyone recognizes that there are variations in children with an ADHD diagnosis.


Quantitative and Qualitative Variables


Quantitative variables are those whose levels are characterized by numerical differences. Scores on a quantitative variable tell you something about the amount or degree of the variable.  Time, distance, and size are typically measured on a quantitative scale. Psychological measures such as intelligence, shyness, and attitudes are also measured quantitatively, using tests that are designed to measure a particular trait.

Qualitative variables are those whose levels are characterized by differences in kind. Each level has a different label, but the labels do not indicate an amount or degree of difference. For some qualitative variables, the levels are ordered (military rank and college class); for others, there is no inherent order (gender, political affiliation, and race).


Category, Ranked, and Scaled Data


This classification of data is most useful when you are deciding what statistical test is appropriate for a particular set of data.

Category data consist of frequency counts of defined groups. Examples include the number of participants who cheat on an exam and the number who do not cheat. The categories of cheater and noncheater represent values on a qualitative variable. Another example is the number of students with grade point averages in the categories of

less than 0.99,
1.00 to 1.99,
2.00 to 2.99,
3.00 to 3.99,
4.00


The categories of GPA are divisions of a quantitative variable. The critical point about category data is that measuring a participant consists simply of counting that participant in a category.
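As an illustration, the Python sketch below (the GPAs are made up) reduces a set of GPA scores to category data by counting how many students fall in each of the divisions listed above.

```python
# Hypothetical GPAs reduced to category data: each student is simply
# counted in one of the divisions listed above.
gpas = [0.85, 1.40, 2.10, 2.75, 3.05, 3.60, 3.95, 4.00, 2.50, 1.80]

def gpa_category(gpa):
    if gpa < 1.00:
        return "less than 0.99"
    elif gpa < 2.00:
        return "1.00 to 1.99"
    elif gpa < 3.00:
        return "2.00 to 2.99"
    elif gpa < 4.00:
        return "3.00 to 3.99"
    else:
        return "4.00"

counts = {}
for gpa in gpas:
    label = gpa_category(gpa)
    counts[label] = counts.get(label, 0) + 1

print(counts)   # the frequency count in each category is the category data
```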

Ranked data show each participant's position among all participants. Children's artwork that is judged and ranked by a professional artist produces ranked data. Class ranks among this year's graduates at your university are ranked data, as is the order of finish of participants in a race.

Scaled data are what we usually think of when the word measurement is used. Scaled data are quantitative data for which a participant's score does not depend on other participants' performance. Whereas a person's rank depends on the other participants, a person's scaled score stands on its own. Measures of time, distance, and psychological characteristics are scaled data.
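The difference between scaled data and ranked data can also be seen in a few lines of Python. The race times below are invented; converting them to ranks shows that a team's rank depends on everyone else's performance, whereas its time stands on its own.

```python
# Hypothetical race times in hours (scaled data): each time stands on its own.
hours = {"Team A": 8, "Team B": 16, "Team C": 11, "Team D": 9}

# Converting scaled data to ranked data: sort by time, then assign positions.
# A team's rank depends on how everyone else performed; its time does not.
order = sorted(hours, key=hours.get)
ranks = {team: position for position, team in enumerate(order, start=1)}

print(ranks)   # {'Team A': 1, 'Team D': 2, 'Team C': 3, 'Team B': 4}
```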

Errors in Measurement

Any score or label obtained by measurement is always subject to some degree of error. The classic way to convey the error that exists in a measurement is a formula that shows the obtained measurement (X) as the sum of a true score (T) plus some error (e). One version of this formula is:


X = T + e


With this formula in mind, obtaining more precise measures depends on reducing error.
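A minimal Python simulation with made-up numbers shows the formula at work: each observation is a true score of 100 plus an error that differs from one measurement to the next, and averaging several observations lets those errors partly cancel.

```python
import random

random.seed(1)   # fixed seed so the sketch is reproducible
T = 100          # the true score, free of error

# Five obtained measurements, X = T + e, each with its own chance error e.
observations = [T + random.gauss(0, 5) for _ in range(5)]

print(observations)                           # each X misses T by some amount
print(sum(observations) / len(observations))  # errors tend to cancel in the mean
```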

Experts in measurement find it helpful to distinguish between two kinds of error, systematic error and random error.

Systematic errors are variations in the data for which a cause can be specified. Examples of systematic errors include transposing two numbers, making a check mark in the wrong column, and always rounding numbers down. It is also a systematic error if a participant's response on a questionnaire fails to acknowledge behavior that actually occurred (but now seems too embarrassing to admit). As you might expect, researchers develop and use procedures that reduce systematic errors, but despite their best efforts, research data are not error-free.

Random error results from the chance fluctuations that accompany every set of measurements and from undetected systematic error. Even the measurements of the 10-gram weight at the U.S. National Bureau of Standards show random variation. As reported by Freedman, Pisani, Purves, and Adhikari (1991), five consecutive weighings under very controlled conditions by experienced employees produced weights of:


9.999591 grams
9.999600 grams
9.999594 grams
9.999601 grams
9.999598 grams


In these five weighings, the effects of random error show up in the last three decimal places. In the case of biological and psychological measurements, where data are gathered under much less controlled conditions, random error is even more apparent. Fortunately, there are statistical methods that allow the measurement of random error. Of course, the greater the size of the error, the less trust we should put in a particular measurement.
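One simple way to summarize the random error in the five weighings reported above is to compute their mean and standard deviation; the brief Python sketch below does exactly that.

```python
import statistics

# The five weighings of the 10-gram check weight, in grams (from above).
weighings = [9.999591, 9.999600, 9.999594, 9.999601, 9.999598]

print(statistics.mean(weighings))    # the usual single best estimate
print(statistics.stdev(weighings))   # summarizes the random variation
```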

 

Reliability refers to the degree to which scores are free from the influence of chance. Tests that are reliable produce consistent scores; that is, the same people make about the same scores if they are measured a second time. Random error plays only a small role in a reliable score.

Validity refers to whether the test measures what it claims to measure. Scores on a valid test of memory, for example, are related to actual evidence of memory. The scores on a valid test of depression or happiness or procrastination show a relationship to other independent measures of depression, happiness, or procrastination.

Sensitivity refers to a test’s ability to make fine distinctions. A test that divides participants into two personality types is not as sensitive as one that gives each person a quantitative score.

Test-Retest Reliability  One way to determine the reliability of a test is to administer it to a group of individuals and then, after a period of time ranging from days to months, administer it to them again. If the instrument is reliable and the individuals have not changed, the scores from time 1 to time 2 will be consistent. That is, individuals with low scores the first time will get just about the same low scores the second time, and high-scoring individuals will get similar high scores the second time.
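In practice, test-retest reliability is usually expressed as the correlation between the two sets of scores. Here is a small Python sketch with hypothetical scores for eight people tested on two occasions.

```python
import statistics

# Hypothetical scores for eight people tested on two occasions.
time1 = [12, 18, 25, 31, 36, 42, 47, 55]
time2 = [14, 17, 27, 30, 38, 40, 49, 53]

# Test-retest reliability coefficient: the Pearson correlation between
# the two administrations (statistics.correlation requires Python 3.10+).
r = statistics.correlation(time1, time2)
print(round(r, 2))   # values near 1.0 indicate consistent scores
```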

Split-Half Reliability  Psychologists call techniques that measure reliability with just one testing session measures of internal consistency. One of the simplest of these is split-half reliability. To determine split-half reliability, divide the items that make up the test into two halves. A common way to do this is to separate the even-numbered items from the odd-numbered items. The scores on the even-numbered items could be the X variable and the odd scores the Y variable. Each test taker has two scores. A Pearson r measures the degree of consistency between the odd scores and the even scores. Even if the even scores are generally higher than the odd scores, r is a measure of consistency. A high correlation (.80 or higher) indicates that the test is internally reliable. One of the problems with the split-half reliability coefficient is that its value depends on whether you divide the items by odd-even, or first half-second half, or some other method.
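The sketch below works through the odd-even procedure with made-up item scores for five test takers: sum the odd-numbered items, sum the even-numbered items, and correlate the two half-test scores.

```python
import statistics

# Hypothetical item scores (1 = correct, 0 = incorrect) for five test
# takers on a ten-item test; each inner list is one person's answers.
answers = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
]

# Odd-even split: items 1, 3, 5, ... versus items 2, 4, 6, ...
odd_scores = [sum(person[0::2]) for person in answers]
even_scores = [sum(person[1::2]) for person in answers]

# Split-half reliability: the Pearson correlation between the two halves
# (statistics.correlation requires Python 3.10+).
r = statistics.correlation(odd_scores, even_scores)
print(round(r, 2))
```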

Interobserver Reliability  In some research situations, measurement of the participants does not involve a test. Indeed, the participants may not even know they are in an experiment. Instead, they are simply observed for a period of time by a researcher who looks for particular behaviors such as head swaying, approaching, retreating, touching, or trumpeting (if the participants are elephants in a zoo). After the observation period, the researcher has a record of the elephant's behavior, but is it a reliable record? A technique called interobserver reliability, in which two or more observers independently record the same behavior and the agreement between their records is assessed, can answer that question.
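One common index of interobserver reliability is the percentage of observation intervals on which two independent observers agree. The Python sketch below uses invented observation records to show the arithmetic.

```python
# Made-up records from two observers watching the same elephant:
# 1 = the behavior occurred in that interval, 0 = it did not.
observer_1 = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]
observer_2 = [1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]

agreements = sum(a == b for a, b in zip(observer_1, observer_2))
percent_agreement = 100 * agreements / len(observer_1)

print(percent_agreement)   # higher agreement means a more reliable record
```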

Content Validity  A test has content validity if the items on the test are indeed related to the characteristic or trait that the test is measuring. For example, the purpose of a test in a college course is to measure whether the students learned the assigned material. If the test has items about material that was not assigned, then those items have zero content validity because they are measuring something other than whether the students learned the assignment. Content validity is not just a concern for conventional tests; it is also a concern for researchers.

Criterion-Related Validity  To establish criterion-related validity, a researcher must have an independent, outside-of-the-experiment standard to compare the test to. This standard (or criterion) is a measure of the same thing as the test, and it is one whose validity has already been established. If there is a match between the test and the criterion, then the test is measuring what it was intended to measure.
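Criterion-related validity is often expressed as the correlation between the new test and the criterion. The Python sketch below uses hypothetical scores for a new memory test and an established criterion measure.

```python
import statistics

# Hypothetical scores on a new memory test and on an established
# criterion measure whose validity is already accepted.
new_test  = [4, 6, 5, 8, 9, 3, 7, 10, 6, 8]
criterion = [11, 14, 12, 18, 20, 9, 15, 22, 13, 17]

# A strong correlation is evidence of criterion-related validity
# (statistics.correlation requires Python 3.10+).
print(round(statistics.correlation(new_test, criterion), 2))
```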

Operational Definitions
Web pages

Wikipedia on Operational Definitions
Wikipedia’s page covers operational definitions in terms of science and business. It also gives examples.
http://en.wikipedia.org/wiki/Operational_definition

Operational Definitions
Covers the difference between conceptual definitions and operational definitions.
http://web.utk.edu/~wrobinso/540_lec_opdefs.html

Variables
Web pages
Independent variable
Page displays definitions of independent variable.
http://www.answers.com/topic/independent-variable

Dependent variable
Page displays definitions of dependent variable.
http://www.answers.com/topic/dependent-variable

Continuous variable
Definition of continuous variable from David Lane’s HyperStat pages.
http://davidmlane.com/hyperstat/A97418.html

Discrete variable
Definition of discrete variable from David Lane’s Hyperstat pages.
http://davidmlane.com/hyperstat/A96915.html

Dichotomous variable
Short definition of a dichotomous variable.
http://riskinstitute.ch/00011136.htm

Quantitative variable
Short definition of a quantitative variable.
http://onlinestatbook.com/glossary/quantitative_variable.html

Qualitative variable
Short definition of a qualitative variable.
http://onlinestatbook.com/glossary/qualitative_variable.html

Categorical data
Definition of categorical data with several examples.
http://www.childrens-mercy.org/stats/definitions/categorical.htm

Ranked data
Short example shows how to convert data into ranked data.
http://www.variation.com/cpa/help/hs105.htm

Scales of measurement
Long page provides overview, measurement principles, examples, and appropriate statistics for the four scales of measurement.
http://web.uccs.edu/lbecker/SPSS/scalemeas.htm

Errors in Measurement
Web pages

Wikipedia on systematic error
Discusses systematic errors and how to reduce them.
http://en.wikipedia.org/wiki/Systematic_error

Random error
Defines random error and relates it to the normal distribution and to the error distribution.
http://amsglossary.allenpress.com/glossary/search?id=random-error1

Trustworthy Measures
Web pages

Correlation coefficient
Interactive page allows user to see regression line and scatterplot of bivariate distribution whose values range from -1.00 to +1.00.
http://noppa5.pc.helsinki.fi/koe/corr/cor7.html

Wikipedia on correlation coefficient
Long page discusses correlation coefficient in detail including mathematical properties, non-parametric coefficients, and common misconceptions about correlation.
http://en.wikipedia.org/wiki/Correlation

Scatterplot
Shows sample scatterplots and explains positive and negative associations between bivariate data.
http://www.stat.yale.edu/Courses/1997-98/101/scatter.htm

Reliability (test-retest)
Definition of test-retest reliability relates the concept to the evaluation of survey instruments.
http://www.statistics.com/resources/glossary/t/trtreliab.php

Reliability (split-half)
Short definition of split-half reliability and its interpretation.
http://www.alleydog.com/glossary/definition.cfm?term=Split-Half%20Reliability

Reliability (interobserver)
Defines interobserver reliability and shows how to compute it.
http://www.ccny.cuny.edu/bbpsy/modules/interob_reliability.htm

Measurement validity types
Comprehensive page describes and explains two types of content validity and four types of criterion-related validity.
http://www.socialresearchmethods.net/kb/measval.htm

Wikipedia on content validity
Defines content validity and distinguishes it from face validity.
http://en.wikipedia.org/wiki/Content_validity

Validity (criterion-related)
Defines criterion-related validity and gives an example.
http://writing.colostate.edu/guides/research/relval/com2b3.cfm

Wikipedia on sensitivity of tests
Defines sensitivity and how it is used to make decisions in experiments.
http://en.wikipedia.org/wiki/Sensitivity_(tests)

Ceiling effect
Short definition of ceiling effect explains why it should be avoided.
http://www.everything2.com/index.pl?node_id=807148

Floor effect
Short definition of floor effect explains why it should be avoided.
http://everything2.com/index.pl?node_id=807173

Tests and Measurements
Testing and assessment
From APA, this FAQ gives information about finding and using published psychological tests.
http://www.apa.org/science/faq-findtests.html

Buros Institute of Mental Measurements
Site allows users to search for published tests from their Mental Measurements Yearbook by name or by category.
http://buros.unl.edu/buros/jsp/search.jsp

Questionnaire Design
Long page gives tips on how to write and test custom-made questions and tests.
http://www.cc.gatech.edu/classes/cs6751_97_winter/Topics/quest-design/

Writing Effective Tests
Helps users write better test questions.
http://www.glencoe.com/sec/teachingtoday/educationupclose.phtml/40
