Whatever the type of test (standardized, ability, or personality), a post hoc
(after-the-fact) analysis of the results should be carried out. Why?
To answer questions such as: Were the time limits adequate? Did the subjects understand
the instructions? Were any questions ambiguous?
The major purpose of item analysis is to improve tests by revising or eliminating
ineffective items, and to increase our understanding of a test (why a test is reliable,
valid, or not). Another important aspect of item analysis relates specifically to
achievement tests. Here, item analysis can provide important diagnostic information on
what examinees have learned and what they have not learned.
Item analysis refers to a varied group of statistics that are computed for each item on
a test. These item statistics help to determine the role each item plays with respect to
the entire test.
There are many different types/procedures for determining item statistics. The procedure
employed in evaluating an item's effectiveness depends to some extent on the researcher's
preference and on the purpose of the test.
Let's consider some of these item analysis procedures. I'll base a good deal of my
discussion on achievement/multiple-choice tests to illustrate the item-analysis
methods. Please remember that these concepts can also apply to other types of tests (e.g.,
personality).
Distractor Analysis
- With multiple choice tests there is usually one correct answer and a few wrong answers
or distractors. A lot can be learned from analyzing the frequency with which test-takers
choose distractors.
- Consider that perfect multiple-choice questions should have 2 distinctive features:
- 1) People who know the answer pick the correct answer;
- 2) People who do not know the answer guess among the possible responses. This means that
each distractor should be equally popular. It also means that the number of correct
answers = those who truly know + some random amount. To account for this, should
professors subtract the randomness factor from each person's score to get a more accurate
view of a person's true knowledge?
- The number of people expected to pick each distractor is easily calculated by:
- # expected to choose distractor = (# answered item wrong) / (# of distractors)
- Why might the number of people choosing a distractor be lower or higher than
expected?
- A) Partial knowledge;
- B) Poorly constructed item;
- C) Distractor is outside of the domain content;
- How does distractor analysis contribute to overall test reliability and validity?
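The expected-count formula above can be sketched in a few lines of Python. This is only an illustration: the function name `distractor_table` and the response data are hypothetical, and the sketch assumes every wrong answer is one of the listed distractors.

```python
from collections import Counter

def distractor_table(responses, correct, options):
    """Return {distractor: (observed count, expected count)} for one item."""
    counts = Counter(responses)
    distractors = [opt for opt in options if opt != correct]
    # Number of test-takers who answered the item wrong.
    n_wrong = sum(counts[d] for d in distractors)
    # Expected count per distractor if wrong answers were spread evenly.
    expected = n_wrong / len(distractors)
    return {d: (counts[d], expected) for d in distractors}

# Hypothetical responses of 20 test-takers to one item keyed "B".
responses = ["B"] * 8 + ["A"] * 7 + ["C"] * 3 + ["D"] * 2
table = distractor_table(responses, correct="B", options="ABCD")
for option, (observed, expected) in table.items():
    print(f"{option}: observed={observed}, expected={expected:.1f}")
```

In this made-up data, distractor A is chosen far more often than expected, which is the kind of pattern that would prompt a closer look at the item.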
Item Difficulty
- Item difficulty is most commonly measured by calculating the percentage of test-takers
who answer the item correctly {p value for an item = (# of people responding correctly) /
(# of people taking the test)}.
- Generally, items with p values of 0.5 yield test scores with the most variation. Thus,
most test developers seek to develop tests where the average difficulty score is about
0.5. Why?
- Often items with difficulty levels between 0 - 0.2 and 0.8 - 1.0 are discarded because
they are either too difficult or too easy, respectively. They are not differentiating the
population.
- The logic behind item difficulty is easy to see for knowledge or skill based tests where
there is only one right answer. But why are difficulty levels also calculated for
personality and attitude tests?
- Even though there is no one right answer for personality or attitude items, test
developers often consider the item that indicates the presence of a construct or attitude
as correct. Item difficulty analysis assures the test developer that the items are not
being answered in the same direction or with the same answer by everyone.
- Although p values provide an indication of how difficult test items are, they tell us
very little about an item's usefulness in measuring the test's construct.
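As a rough sketch of the p value calculation described above, the following Python computes item difficulty from a matrix of 0/1 scores and flags items outside the 0.2 to 0.8 range. The score matrix and the names `item_difficulty`, `scores`, and `flagged` are made up for illustration.

```python
def item_difficulty(item_scores):
    """p value: proportion of test-takers who answered the item correctly."""
    return sum(item_scores) / len(item_scores)

# Hypothetical 0/1 scores: rows are test-takers, columns are items.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
]

# zip(*scores) turns the rows into one tuple of scores per item.
p_values = [item_difficulty(column) for column in zip(*scores)]

# Items that are candidates for review because they are too hard or too easy.
flagged = [i for i, p in enumerate(p_values) if p < 0.2 or p > 0.8]
print(p_values, flagged)
```

Here the first item is answered correctly by everyone (p = 1.0), so it tells us nothing about differences among test-takers.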
Item Discrimination
The Discrimination Index:
- To measure how well a test item separates those test takers who show a high degree of a
skill, knowledge, attitude, or personality from those who have low skill, knowledge, etc.,
a discrimination index (D) is calculated.
- This index compares, for each test item, the performance of those who scored the best (U,
the upper group) with those who scored the worst (L, the lower group).
- The procedure for calculating D is straightforward:
- 1) Rank-order your test scores from lowest to highest.
- 2) The upper 25-35% and the lower 25-35% form your analysis groups.
- 3) Calculate the percentage of test-takers passing each item in both groups {i.e., U =
(# of uppers who responded correctly) / (Total number in the Upper group); L = (# of
lowers who responded correctly) / (Total number in the lower group)};
- 4) D = U - L
- The logic of the D statistic is simple. Tests are more difficult for those who score
poorly (lower group). If an item is measuring the same thing as a test, then the item
should be more difficult for the lower group. The D statistic provides a measure of each
item's discriminating power with respect to the upper and lower groups.
- What should you look for?
- Generally, a high positive value indicates a good discriminating item. A value near zero or a
negative value indicates that the item was equally or more difficult for the upper group. This is
not a good thing, and generally the item is rewritten or eliminated.
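The four-step procedure for D can be sketched as follows. This is a minimal illustration, not a standard routine: the function name, the data, and the choice of 30% for the group size (any value in the 25-35% range would do) are assumptions, and tied total scores are split arbitrarily.

```python
def discrimination_index(total_scores, item_scores, fraction=0.30):
    """D = U - L for one item, using the top and bottom `fraction` of scorers."""
    n = len(total_scores)
    k = max(1, round(n * fraction))  # size of each analysis group
    # Step 1: rank test-takers from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    # Step 2: form the lower and upper groups.
    lower, upper = order[:k], order[-k:]
    # Step 3: proportion passing the item in each group.
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    # Step 4: D is the difference between the two proportions.
    return p_upper - p_lower

# Hypothetical data for 10 test-takers: total test score and 0/1 score on one item.
totals = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
item = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
d = discrimination_index(totals, item)
print(round(d, 2))
```

With these numbers all three people in the upper group pass the item and only one of three in the lower group does, so D is clearly positive.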
Inter-Item Correlations:
- The inter-item correlation matrix is another important component of item
analysis.
- This matrix displays the correlation of each item with every other item.
- Usually each item is coded as dichotomous (incorrect = 0, correct = 1), and the
resulting matrix is composed of phi coefficients, which are interpreted much like
Pearson product-moment correlation coefficients.
- This matrix provides important information about a test's internal consistency, and
what could be done to improve it. Ideally each item should be correlated highly with the
other items measuring the same construct. Items that do not correlate with the other items
measuring the same construct can be dropped without reducing the test's reliability.
Why?
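For 0/1 items, the phi coefficient is simply the Pearson correlation computed on the dichotomous scores, so an inter-item matrix can be sketched as below. The data and the names `pearson`, `items`, and `matrix` are hypothetical, and the sketch assumes every item has some variance (an item that everyone passes or fails has no defined correlation).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; on 0/1 data this equals the phi coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # fails if an item has zero variance

# Hypothetical 0/1 scores: rows are test-takers, columns are items.
scores = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
]
items = list(zip(*scores))  # one tuple of scores per item

# Correlate every item with every other item.
matrix = [[pearson(x, y) for y in items] for x in items]
for row in matrix:
    print([round(r, 2) for r in row])
```

In this made-up example the third item correlates negatively with the second, which would make it a candidate for review.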
Item-Total Correlations
- Point-biserial or item-total correlations assess the usefulness of an item as a measure
of individual differences in knowledge, ability, or personality characteristic.
- Here each dichotomous test item (incorrect = 0; correct = 1) is correlated with the
person's total test score.
- Interpretation of the item-total correlation is similar to that of the D statistic. A
modest positive correlation illustrates 2 things:
- 1) That the item in question is measuring the same construct as the test;
- 2) The item is successfully discriminating between those who perform well and those who
perform poorly.
- In reality, the item-total correlation provides the same sort of information that the D
statistic does. It is, however, more popular and can easily be computed with the help of a
statistical software package.
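A statistical package would normally do this for you, but the calculation itself is short. The sketch below correlates each 0/1 item with the total score, which is the point-biserial correlation. The data and names are hypothetical, and it assumes the total includes the item being examined (some analysts remove the item from the total first).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; with a 0/1 item this is the point-biserial."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical 0/1 scores: rows are test-takers, columns are items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
totals = [sum(row) for row in scores]  # each person's total test score

# One item-total correlation per item.
item_total = [pearson(item, totals) for item in zip(*scores)]
print([round(r, 2) for r in item_total])
```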
Item Characteristic Curves (ICC) and Item Response Theory (IRT): The Essentials:
- Recently, test developers have begun to use the concepts associated with IRT for item
analysis. In essence, IRT relates each test item's performance to a complex statistical
estimate (beyond this course's scope) of the test taker's knowledge or ability on the
measured construct.
- A basic feature of IRT is the ICC. An ICC is a graphical representation of the
probability of answering an item correctly as a function of the level of ability on the construct
being measured. It gives you a picture of:
- 1) The item's difficulty;
- 2) The item's discriminatory power;
- 3) The probability of answering correctly by guessing.
- IRT logically assumes that individuals who have high scores on the test have greater
ability than those who score low on the test.
- With this in mind we can conclude that the greater the slope of the ICC the better the
item is at discriminating between high and low test performers.
- Difficulty, on an ICC, is operationally defined by the point at which the curve
indicates a chance probability of 0.5 (a 50-50 chance) for answering the item correctly.
The higher the level of ability needed to obtain a 0.5 probability (curve shifted to the
right) the more difficult the item.
- We'll look at some examples in class.
Some More Detailed Notes on IRT
- I originally did not want to cover this topic in any more detail than above because I
remembered IRT to be largely theoretical. However, after a literature search on IRT, I've
come to the conclusion that the amount of recent empirical IRT research warrants a closer look
at IRT. It is an emerging measurement trend that you should be aware of.
- As mentioned previously, IRT is a group of statistical procedures aimed at assessing the
quality of each test item with respect to a person's specific trait level (i.e., how much
motivation, self-esteem, or knowledge does that person possess?). In other words, IRT
seeks to understand how individual differences in attributes affect the behaviour of an
individual when confronted with a specific item.
- IRT has a set of assumptions about the mathematical relationship between a person's
true ability and the likelihood that that individual will answer an item successfully. The
assumptions are as follows:
- A) A relationship should exist between what the measurement instrument is trying to
measure (i.e., the specific trait or attribute) and the test-taker's responses to the
items on the instrument.
- B) There should be a simple mathematical relationship between the individual's
ability on some attribute and the likelihood that the individual will answer the question
successfully.
- When these 2 assumptions are true, IRT allows researchers to make some very precise
inferences about the underlying attribute on the basis of observed behaviour.
- One of the most prominent IRT techniques is the item response function.
- The item response function generates a curve (I consider these curves to be standardized
ICC curves) which answers the question: Does this item measure this person's trait or
proficiency adequately? Some items will do a better job at this than others, and these
items may not necessarily be the same for every individual! Thus, 2 questions emerge:
- 1) How do you determine the individual's trait level?
- 2) What constitutes a good item?
- An estimate of an individual's true trait or attribute level is their score on
the test. With IRT, the raw test scores are standardized (mean=0, SD=1). Why? To identify
outliers (at least 2 SD from the mean), and to equate the test scores to other data sets
of the same test. Note: Identification of outliers is an extremely important aspect of
measurement since many of the decisions based on the test pertain to those outliers (e.g.,
diagnosis, treatment, award of money, getting a job).
- A test item is considered to be 'good' or 'informative' to the extent that it is free of
measurement imprecision at some specific trait level. This concept is similar to that of a
test's reliability, except IRT provides knowledge about an item's usefulness not only at
the level of the item, but also at some specific attribute level.
- We'll look at some examples in class.
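The standardization of raw scores mentioned above (mean = 0, SD = 1, with outliers at least 2 SD from the mean) can be illustrated with a small sketch. The scores are invented, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption made here for simplicity.

```python
from statistics import mean, pstdev

def standardize(raw_scores):
    """Convert raw scores to z scores (mean 0, SD 1)."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population SD; an assumption of this sketch
    return [(x - m) / sd for x in raw_scores]

# Hypothetical raw test scores for 8 people.
raw_scores = [10, 12, 11, 13, 12, 11, 12, 30]
z_scores = standardize(raw_scores)

# Outliers: scores at least 2 SD away from the mean.
outliers = [x for x, z in zip(raw_scores, z_scores) if abs(z) >= 2]
print(outliers)
```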
- The most widely used IRT model for estimating standardized ICC curves is
the 3-parameter logistic model (3PL). This model incorporates the following 3 parameters
into its estimates of the ICC curve:
- 1) Susceptibility of the item to guessing;
- 2) The role of ability in terms of item response;
- 3) The item's discriminating power.
- The formula, for those interested in the relationship between these 3 parameters, is:
$$p(\theta) = c + \frac{(1 - c)\, e^{da(\theta - b)}}{1 + e^{da(\theta - b)}}$$
- Where: $p(\theta)$ = p value for a given level $\theta$ of skill,
personality, or attribute; e ≈ 2.718 (the base of the natural logarithm); d is a scaling
factor, usually of value 1.7 or 1. If d = 1.7 then the results are comparable to a
cumulative normal distribution, if d = 1 then it falls out of the equation; c = the guessing
parameter (the lower asymptote of the curve); b = an item's difficulty or location parameter
(the trait level at which the curve is steepest); a = slope parameter or the discrimination index.
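For those who find code easier to read than formulas, here is a minimal sketch of the 3PL formula. The function name `p_3pl` and the parameter values are hypothetical; `theta` stands for the trait level, and `a`, `b`, `c`, and `d` are the slope, difficulty, guessing, and scaling values described above.

```python
from math import exp

def p_3pl(theta, a, b, c, d=1.7):
    """Probability of a correct response under the 3-parameter logistic model."""
    z = exp(d * a * (theta - b))
    return c + (1 - c) * z / (1 + z)

# Hypothetical item: moderate slope, average difficulty, 4-option guessing level.
a, b, c = 1.0, 0.0, 0.25
for theta in (-3, -1, 0, 1, 3):
    print(theta, round(p_3pl(theta, a, b, c), 3))
```

Two things are worth noticing when you run it. For very low trait levels the probability settles near c, the guessing level, and at theta = b the probability is halfway between c and 1 (so it equals 0.5 only when c = 0).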
- Some of the advantages of IRT:
- 1) One of the principal advantages of IRT is that it provides information on measures
that are not dependent on the specific sample. In other words, the measures used to
describe an ICC are not dependent on the population sample. This is not true for standard
item analysis. Why?
- 2) With IRT, you can explicitly compare tests comprised of different items. This is
termed 'test-free measurement' and is important for computerized adaptive testing.
- 3) With IRT, persons with the same score on a test may be shown to differ in abilities
depending on the assumptions made by the IRT model.
Practical Applications of IRT
Distractors:
- With traditional item analysis, a multiple-choice answer is either right or wrong. Can you
see a problem with this?
- IRT models allow for construction of separate ICC curves for each distractor, which may
provide more information about one's knowledge or ability.
- See example p. 209 of your text.
Adaptive Testing:
- Adaptive tests are tests made up of questions chosen from a large test bank to match the
skill and ability level of the test taker. These tests are administered via computer.
- IRT models can allow adaptive tests to zero in on a test-taker's ability or skill level.
This, in turn, gives a more precise measure of true ability, while at the same time
shortening the test.
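To give a feel for how a computer might match questions to the test taker, here is a deliberately simplified sketch. It assumes each item in the bank has a known difficulty value and simply picks the unused item whose difficulty is closest to the current ability estimate. Real adaptive tests use more sophisticated selection and scoring rules; the function name, the item bank, and the ability estimates are all hypothetical.

```python
def next_item(item_difficulties, ability_estimate, used):
    """Pick the unused item whose difficulty is closest to the current estimate."""
    candidates = [i for i in range(len(item_difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(item_difficulties[i] - ability_estimate))

# Hypothetical item bank: one difficulty value per item.
bank = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Start near the middle, then suppose the estimate rises after a correct answer.
first = next_item(bank, ability_estimate=0.3, used=set())
second = next_item(bank, ability_estimate=0.8, used={first})
print(first, second)
```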
Screening and Criterion-Keyed Tests:
- Please read this on your own. We'll talk about the ICC curves (Fig. 10-6, 10.7 in your
text) that could be very helpful.