We distinguish among three types of reliability.

- Test-retest reliability (a measure of stability)
- Interrater reliability (a measure of agreement)
- Internal consistency reliability (a measure of how correlated the items of a measure are with one another)

Test-retest and interrater reliability are most often indexed with a
product-moment correlation if our measure is continuous. To assess test-retest
reliability, for example, we would give the measure to a group of people on two
occasions, separated by a specified period of time. We then compute the
correlation between the measures given on the two occasions. With test-retest
reliability, we should ALWAYS specify the length of time when we are reporting
the level of reliability (e.g., "The test-retest reliability for this
measure over 10 weeks was .72."). Similarly, we assess interrater
reliability by having two raters rate the same group of subjects and then
compute the product-moment correlation between their ratings. The procedures for
computing these reliability indices using *SPSS for Windows* are detailed elsewhere
on this website.
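The same correlation can be computed outside SPSS. As a minimal sketch in Python, with invented scores purely for illustration, a test-retest correlation over 10 weeks might look like this:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Product-moment correlation between two paired lists of scores."""
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical scores for five people tested on two occasions, 10 weeks apart
time1 = [12, 18, 25, 9, 30]
time2 = [14, 16, 27, 11, 28]
print(round(pearson_r(time1, time2), 2))  # 0.97
```

The identical function serves for interrater reliability: pass the two raters' ratings of the same subjects instead of the two testing occasions.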

Note that unless you expect a characteristic to be stable over time, you should not expect high test-retest reliability. Sometimes we naively think that all measures should ideally have all types of reliability, but it depends entirely on what we are measuring and what its theoretical characteristics are. For example, if we are measuring something like anxiety, which we would expect to go up and down depending on the situation, we would not generally expect high test-retest reliability. In fact, we could say it more strongly. If we did get high test-retest reliability on our anxiety measure, we might well question whether we are really measuring anxiety as we conceptualize this construct.

When the measure being used is categorical, typically a measure of percent agreement is used. For example, if we were categorizing patients on the basis of clinical diagnoses, we would use an agreement measure. We could technically use an agreement measure for continuous data as well, although this is rarely done in practice. The percent agreement index is simply the percentage of times that two raters agree or the percentage of times that the same classification is made on retest. The procedures for computing this simple index are described in Chapter 9 of the text.
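The computation is simple enough to sketch directly; here is one way to do it in Python, using made-up diagnostic categories:

```python
def percent_agreement(ratings1, ratings2):
    """Percentage of cases on which two raters assign the same category."""
    matches = sum(a == b for a, b in zip(ratings1, ratings2))
    return 100 * matches / len(ratings1)

# Hypothetical categorizations of five patients by two raters
rater_a = ["psychotic", "not", "not", "psychotic", "not"]
rater_b = ["psychotic", "not", "psychotic", "psychotic", "not"]
print(percent_agreement(rater_a, rater_b))  # 80.0
```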

The problem with the percent agreement index is that it is easier to agree
consistently if the number of categories is small or if most of the
participants fall into a single category. If, for example, we were diagnosing
patients as either psychotic or not psychotic (just two categories), this is a
relatively easy discrimination compared to giving them precise diagnoses. If one
rater thought that a patient qualified for the diagnosis of bipolar disorder and the other rater
thought the patient qualified for the diagnosis of schizophrenia, they would still be in
agreement, because both of these are considered psychotic disorders. However, if
dozens of diagnoses are possible, the raters must agree exactly on which
diagnosis applies to be in agreement. In other words, the amount of agreement
will depend on how fine a discrimination is required. The Kappa coefficient
takes this into account and also takes into account the fact that it is easier
to agree if almost everyone is in one category. Many advanced statistical
packages can compute Kappa, although the version of *SPSS for Windows*
included with this text does not. The procedures for the manual computation of
Kappa are detailed elsewhere on this website.
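For readers who want to check a manual computation, here is a sketch of Cohen's Kappa in Python, again with invented ratings. Kappa corrects the observed agreement for the agreement expected by chance given each rater's category frequencies:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n   # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    # agreement expected by chance from each rater's marginal proportions
    pe = sum((c1[k] / n) * (c2[k] / n) for k in set(r1) | set(r2))
    return (po - pe) / (1 - pe)

# Hypothetical diagnoses of five patients by two raters
rater_a = ["psychotic", "not", "not", "psychotic", "not"]
rater_b = ["psychotic", "not", "psychotic", "psychotic", "not"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.62
```

Note that the 80% raw agreement in this example drops to a Kappa of about .62 once chance agreement is removed.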

Internal consistency reliability is a measure of how intercorrelated the items of a measure are. When the items of a measure are highly intercorrelated, it means that the items are all apparently measuring the same characteristic or trait. We expect to get high internal consistency reliability when our measure is theoretically trait-like. For example, if we are measuring a construct such as "knowledge of research methods" with a 50-item test based on material in this textbook, we would think of this as a trait and would expect students who know a lot about research methods to do consistently well on most items and those who know very little to do consistently poorly on most items. If, on the other hand, we were measuring dozens of different knowledge domains in a single test, we would expect much less consistency, with some people being knowledgeable on certain topics, while others are knowledgeable about different topics. In this latter case, we would expect low to moderate internal consistency reliability.

The most widely used index of internal consistency reliability is called
coefficient alpha. [Please note that coefficient alpha has nothing to do with
the alpha level of statistical tests, even though the names are similar.]
Unfortunately, as in many areas of research, independent development of
concepts has led to a confusing array of names and procedures for internal
consistency reliability. We will briefly review them here, BUT we would like to
make it clear that coefficient alpha is the most appropriate and most general of
the internal consistency coefficients and is the one that should be used. Many
statistical analysis packages will compute coefficient alpha for you, although
the version of *SPSS for Windows* that can be bundled with this text does not.

Two other terms that are often used to refer to internal consistency
reliability are *split-half reliability* and *KR-20*. Splitting the
test into two halves and correlating those halves will give us some sense of how
intercorrelated the items of the test are. There are two problems with this
approach. The first is that there are many different ways in which one could
split a test, and most of the time, different splits of the test will produce
different levels of correlation. The second is that when you correlate half the
test with the other half of the test, you get a measure of the reliability of
half of the test, not the reliability of the whole test. There is a statistical
method for estimating what the reliability of the whole test is, but it requires
additional computations. Coefficient alpha is equal to the mean of all possible
split-half reliabilities, yet it is much easier to compute directly than by
calculating every possible split.
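The statistical method for stepping up a half-test correlation to an estimate for the whole test is the Spearman-Brown prophecy formula. A one-line sketch:

```python
def spearman_brown(r_half):
    """Estimate full-test reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)

# A split-half correlation of .60 steps up to a full-test estimate of .75
print(round(spearman_brown(0.6), 2))  # 0.75
```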

KR-20 is short for Kuder-Richardson formula #20, which is the 20th formula in a classic paper by Kuder and Richardson detailing this method of quantifying internal consistency reliability. The term KR-20 is sometimes used interchangeably with coefficient alpha, although this is technically not a correct usage. Technically, KR-20 is a computational formula for internal consistency that can be used ONLY when all items are dichotomous (either entirely right or entirely wrong). Many measures are of this nature. In such a case, KR-20 and coefficient alpha will give exactly the same reliability. However, if some or all of the items can be partially correct, then KR-20 cannot be used. KR-20 was developed long before computers were available for statistical analysis. It was designed to simplify computation so that one could compute an index of reliability in just a few hours. With modern computers and computer software, KR-20 has been rendered obsolete.
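For completeness, here is a sketch of the KR-20 computation in Python, with invented right/wrong (1/0) answers. The formula replaces the sum of item variances in alpha with the sum of p × q over items, where p is the proportion answering an item correctly; for 0/1 items, p × q is exactly the item's (population) variance, which is why KR-20 and alpha agree when the same variance convention is used throughout:

```python
def kr20(items):
    """KR-20 for dichotomous (0/1) items; items: one list per item."""
    n = len(items[0])                                  # number of examinees
    k = len(items)                                     # number of items
    # for 0/1 items, p*q equals the item's population variance
    pq_sum = sum((sum(it) / n) * (1 - sum(it) / n) for it in items)
    totals = [sum(scores) for scores in zip(*items)]   # each person's total
    m = sum(totals) / n
    var_total = sum((t - m) ** 2 for t in totals) / n  # population variance
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Hypothetical right/wrong answers on a 3-item test, four examinees
items = [[1, 1, 0, 1],
         [1, 0, 0, 1],
         [1, 1, 0, 0]]
print(round(kr20(items), 2))  # 0.63
```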