The most important attribute of any measurement device is its consistency. We
refer to this consistency as reliability.
For example, think of a tape measure. Each time we use the same tape measure we get
approximately the same answer.
Sometimes we may be off by a few millimeters. Why?
This is termed the measurement error.
Unlike a tape measure, psychological tests are not and never will be as reliable as tape
measures. Why?
What factors are responsible for the consistency and inconsistency in psychological test
scores?
Thorndike (1949) listed (see pg. 110 in your text) possible sources of variability on
any particular psychological test.
Thorndike's categories of factors include:
- Lasting and general characteristics of the individual:
- Lasting but specific characteristics of the individual:
- Temporary but general characteristics of the individual:
- Temporary and specific characteristics of the individual:
- Systematic or chance factors affecting the administration of the test:
- Variance not otherwise accounted for (chance factors):
- Reliability tells us how accurate and trustworthy a test is. Because of this,
reliability will be your friend when you are defending the results of your test.
- However, theories of reliability suggest that the accuracy of any psychological measure
is influenced by 2 main factors:
- 1) Factors that contribute to test consistency: Stable characteristics of the person or
of the attribute that one is trying to measure.
- 2) Factors that contribute to inconsistency: As noted above.
- Or: Observed Test Score = True Score + Measurement Error (X = T + e)
- What does the above formula really suggest?
- It suggests that the scores you gather on psychological tests are not in fact
true or real scores. Instead, those scores represent a combination of
many factors.
- The ultimate goals of reliability theory are to:
- 1) Estimate errors in psychological measurement;
- 2) Then devise techniques to improve testing so errors are reduced.
Assumptions Made About Errors:
- The central assumption of reliability theory is that measurement errors, pertaining to
large groups of people, are random.
- If errors are essentially random then it is reasonable to assume that:
- i) The mean error of measurement = 0 (positive and negative errors balance out);
- ii) True test scores and errors are not correlated: (rte = 0);
- iii) Errors on different measures are not correlated: (re1e2 = 0).
- Because of the above assumptions it is theorized that:
- The variance of an obtained score is the sum of the variance of the true score plus the
variance of measurement error (Variance of Score (X) = Variance of True Score (T) +
variance of error (E)) or (VX = VT + VE).
- If errors are responsible for most of the variability in a test score, then it makes
sense that test scores will be inconsistent. If the same test is then given again, will
the test scores remain stable? No!
- However, if measurement errors have little effect on test scores, then test results are
demonstrating true scores, and subsequent test scores should be more similar.
- In theory, the reliability coefficient (rxx) gives us an index of the
influence of true scores and error scores on any given test. It is the ratio of true score
variance to the total variance of the test. rxx = VT / VX
; or rxx = VT / (VT + VE).
- In words, rxx is the proportion of variance in test scores that is accounted
for by variability in a test's true scores.
- In actuality, rxx is very similar to r. The addition of two similar
subscripts tells us that this correlation represents a reliability coefficient. The
calculation of the reliability coefficient for two interval or ratio measures is the same
as the calculation for the Pearson r that we discussed earlier.
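As a rough illustration of the two relationships above (VX = VT + VE, and rxx = VT / VX), the sketch below simulates true scores and random errors for two administrations of the same test. The normal distributions, the standard deviations, the seed, and the names `pearson_r` and `simulate` are all assumptions chosen for illustration; they are not taken from your text. In real testing we never observe T or E directly, which is why reliability has to be estimated.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=20_000, true_sd=15.0, error_sd=5.0, seed=1):
    """Two administrations sharing the same true scores, each with its own random error."""
    rng = random.Random(seed)
    t = [rng.gauss(100.0, true_sd) for _ in range(n)]         # true scores (T)
    x1 = [ti + rng.gauss(0.0, error_sd) for ti in t]          # X = T + e, occasion 1
    x2 = [ti + rng.gauss(0.0, error_sd) for ti in t]          # X = T + e, occasion 2
    return t, x1, x2


t, x1, x2 = simulate()
e1 = [x - ti for x, ti in zip(x1, t)]  # errors on occasion 1

vt = statistics.pvariance(t)
ve = statistics.pvariance(e1)
vx = statistics.pvariance(x1)

# With random errors, VT + VE should be close to (not exactly equal to) VX.
print(f"VT + VE = {vt + ve:.1f}   VX = {vx:.1f}")
# With these settings the theoretical ratio is 225 / (225 + 25) = 0.90.
print(f"VT / VX = {vt / vx:.3f}")
# The correlation between the two administrations approximates the same ratio.
print(f"r between administrations = {pearson_r(x1, x2):.3f}")
```

Raising `error_sd` in this sketch lowers both the variance ratio and the correlation, which is the sense in which error variance makes scores inconsistent.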
Types of Reliability
- If you were not sure that your measurement of a room was accurate, what would you do?
- Most persons would measure it again, using the same tape measure.
- A few would measure it again with a different tape measure.
- Psychology uses the same strategies.
- Note:
Not all of the following reliability strategies are used for all psychological
tests. The strategy you pick depends on the type of test you have, and the conditions
under which it is measured.
Test-Retest Reliability
- With test-retest reliability a test developer gives the same test to the same group of
test takers on 2 different occasions.
- Scores on the 1st administration are compared to scores on the 2nd
administration using correlation (r).
- This method examines performance over time and gives an estimate of stability.
- Often researchers consider test-retest reliability to be a better measure of temporal
stability, which refers to consistency of test scores, rather than true reliability, which
is defined as the ratio of true to observed variance.
- The interval between the administration of the 2 tests can be either a few hours or
several years.
- What should you expect as the time interval between test administration increases?
- Reliability probably decreases! Why?
- Thus, when looking at test-retest reliability you must be aware of the time interval.
- An assumption made with test-retest reliability is that test takers do not or have not
changed over the time period of the 2 administrations.
- Can you think of any factors that could change a person's responses very quickly?
- One concern with test-retest reliability is termed practice or carryover effects.
- Practice or carryover effects are benefits test takers derive from already having taken
a test. This enables them to solve problems more quickly or correctly the second time they
take the same test.
- The reason why practice or carryover effects are of concern has to do with the attribution
of error.
- Some researchers argue that carryover effects should be regarded as sources of real
stability or instability in measurement; while others consider it to be a source of
measurement error.
- Can you think of a real example?
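To make the test-retest correlation and the carryover debate more concrete, here is a minimal sketch. The scores and the practice gains are invented, and `pearson_r` is a helper written for this illustration. It shows one property of the correlation that is certain: if every test taker gains the same number of points from practice, r does not change at all, whereas gains that differ from person to person can lower it.

```python
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Invented scores for six test takers on two occasions.
time1 = [12, 15, 9, 20, 17, 11]
time2 = [13, 14, 10, 19, 18, 12]
r_retest = pearson_r(time1, time2)

# Uniform practice effect: everyone gains exactly 3 points on the retest.
time2_uniform = [score + 3 for score in time2]
r_uniform = pearson_r(time1, time2_uniform)

# Uneven practice effect: some people benefit far more than others.
gains = [0, 6, 1, 0, 5, 8]
time2_uneven = [score + gain for score, gain in zip(time2, gains)]
r_uneven = pearson_r(time1, time2_uneven)

print(f"original retest r:     {r_retest:.3f}")
print(f"uniform practice gain: {r_uniform:.3f}")  # same as the original r
print(f"uneven practice gain:  {r_uneven:.3f}")   # lower for these data
```

Whether the drop in the last case reflects real change in the test takers or measurement error is exactly the question the researchers above disagree on.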
Alternate Forms Reliability
- To eliminate practice effects and other problems with the test-retest method (e.g.,
reactivity), test developers often give 2 highly similar forms of the test to the same
people at different times.
- Reliability, in this case, is again assessed by correlation. What is correlated?
- The key aspect of this reliability is to develop an alternate form that is equivalent in
terms of content, response processes, and statistical characteristics.
- Do you think that reactivity and carryover effects are totally eliminated?
- Can you think of any drawbacks of the alternate forms method?
- Order effects!
- Forms are not really equivalent! This is difficult enough to do in math ability; can you
imagine how difficult it is for something like personality or intelligence?
Split-Half Reliability
- Split-half methods of reliability measure the internal consistency of a test. Remember
the tape measure: it has great internal consistency. The first foot is the same length as
the second and third foot, and the length of every centimeter is also uniform.
- Split-half methods also eliminate or reduce the following problems:
- The need for 2 administrations of a test;
- The difficulty of developing another form;
- Carryover and reactivity effects;
- Changes in a person over time.
- The simplest way to perform split-half reliability is to:
- 1) Administer a test to a group of individuals;
- 2) Randomly or by some other predetermined method (e.g., split on similar content;
odd-even split) divide or split the test into halves (each half is an alternative form);
- 3) Correlate the scores on one half with those on the other half.
- 4) This correlation can be used in estimating the reliability of the test.
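The four steps above can be sketched as follows, using an odd-even split. The response matrix is invented (6 test takers, 8 items scored 0 or 1), and the function names are hypothetical helpers for this illustration rather than anything from your text.

```python
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_r(responses):
    """Correlate each person's total on odd-numbered items with their total on even-numbered items."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


# Step 1: one administration. Rows are test takers, columns are items (1 = correct).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

# Steps 2-4: split into halves, correlate the halves, report the correlation.
r_halves = split_half_r(responses)
print(f"split-half correlation (odd vs. even items): {r_halves:.3f}")
```

Note that this correlation describes two 4-item halves, not the full 8-item test.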
- A concern with split-half models of internal consistency revolves around the shortening
of a test. When we take a long test, say 100 questions, and split it into two 50-question
tests we are decreasing its reliability. Why? A larger number of homogeneous questions
reveals more information about the test taker's trait, skill, or knowledge. This provides
more specific information about each test taker and produces more variation in test scores,
which increases reliability.
- For this reason an adjustment to the split-half reliability is recommended. The Spearman-Brown
formula can be employed when estimating the reliability using the split-half method.
- rxx = k r / (1 + (k - 1) r)
- Where: k = the number of items in the 'new' test (i.e., usually the original
number of questions your test had before you split it) divided by the number of
items in the split-half test (i.e., the number of questions behind your split-half
correlation). In other words, k is the number of times longer the 'new' test will
be. For example, assume your test has 80 questions. You perform a split-half
reliability and obtain an r = 0.8. That r = 0.8 is based on 40 items. Those 40
questions comprise your original items in the split-half test. Now you want to adjust
your reliability since the test actually had 80 questions. 80 is considered the
length of the 'new' test. Thus, k = 80/40 = 2. Note: The Spearman-Brown
formula is also used to estimate how much a test's reliability will increase when the test
is lengthened by adding parallel items.
- r = the correlation between the original split-halves.
- We'll do examples in class.
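As a quick check on the 80-question example, the Spearman-Brown adjustment can be written as a one-line function. The function name `spearman_brown` is just a label for this sketch.

```python
def spearman_brown(r, k):
    """Estimated reliability of a test k times as long as the one that produced r."""
    return (k * r) / (1 + (k - 1) * r)


# Example from the notes: r = 0.8 from two 40-item halves, full test has 80 items.
k = 80 / 40
print(round(spearman_brown(0.8, k), 3))    # 0.889

# The same formula estimates the effect of lengthening a test with parallel items,
# e.g. making a test with reliability 0.8 one and a half times as long.
print(round(spearman_brown(0.8, 1.5), 3))  # 0.857
```

With k = 1 the formula returns r unchanged, which is a useful sanity check.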
- Another weakness in the split-half method is the number of different ways a test can be
split. Why is this a problem?
- Some splits may give a much higher correlation than other splits.
- An even better way to measure internal consistency is to compare individuals' scores on
all possible ways of splitting the test in halves. This will compensate for any
error introduced by any lack of equivalence in the two halves. We can do this by
describing the amount of intercorrelation between questions on a test or subscale, and the
number of questions we have on the test.
- 1) KR-20 (Kuder & Richardson, 1937, 1939). This method is used for tests
whose questions can be scored as either 0 or 1.
- 2) Coefficient alpha (Cronbach, 1951). This method is used for questions such as
rating scales that have 2 or more possible answers.
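The sketch below computes both indices using their usual textbook forms: alpha = (k / (k - 1)) × (1 - sum of item variances / variance of total scores), and KR-20 with each item variance written as p × q (proportion passing times proportion failing). The data are invented, the function names are hypothetical, and the use of population variances (dividing by N) throughout is an assumption; some texts use sample variances, and what matters is that item and total variances are computed the same way.

```python
from statistics import pvariance


def coefficient_alpha(responses):
    """Cronbach's alpha from a people-by-items matrix of item scores."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def kr20(responses):
    """KR-20 for items scored 0 or 1; each item variance is p * (1 - p)."""
    n = len(responses)
    k = len(responses[0])
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_var)


# Invented data: 6 test takers x 8 items, scored 1 (correct) or 0 (incorrect).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

print(f"KR-20: {kr20(responses):.3f}")
print(f"alpha: {coefficient_alpha(responses):.3f}")  # same value for 0/1 items
```

For 0/1 items the two functions agree, which is why KR-20 is often described as a special case of coefficient alpha.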
- Another problem for the split-half method is whether the test being split is homogeneous
(i.e., measuring one characteristic) or heterogeneous (measuring many characteristics).
Which do you think creates more of a concern?
- One solution here is to determine reliability for each heterogeneous component of the
test and then compare, using correlation, those components.
- Note:
Your text makes the distinction between Split-half methods of reliability and
internal consistency methods. Many researchers and most other texts consider split-half
reliability as a component of internal consistency.
- Since the mathematics of split-half and internal consistency methods seems to be linked
(e.g., through coefficient alpha), I'll tend to describe the split-half method traditionally.
However, your text notes that the main difference between split-half and internal
consistency is the unit of measure. Split-half methods compare halves of tests, whereas
internal consistency methods compare each item to every other item.
Scorer Reliability
- The above methods have not considered the person who administers the test. Can
individuals make mistakes in scoring that add error? Yes!
- Thus, judgements or ratings made by different scorers are often compared using
correlation to see how much they agree.
- This is termed the scorer reliability or the inter-rater reliability.
- Can you think of tests that require high rates of inter-rater reliability?