• Whatever the type of test, standardized, ability or personality, a post hoc (after-the-fact) analysis of the results should be carried out. Why?
  • To answer questions like: Were the time limits okay? Did the subjects understand the instructions? Were any questions ambiguous?
  • The major purpose of item analysis is to improve tests by revising or eliminating ineffective items, and to increase our understanding of a test (why a test is reliable, valid, or not). Another important aspect of item analysis relates specifically to achievement tests. Here, item analysis can provide important diagnostic information on what examinees have learned and what they have not learned.
  • Item analysis refers to a varied group of statistics that are computed for each item on a test. These item statistics help to determine the role each item plays with respect to the entire test.
  • There are many different types/procedures for determining item statistics. The procedure employed in evaluating an item's effectiveness depends to some extent on the researcher's preference and on the purpose of the test.
  • Let’s consider some of these item analysis procedures. I'll base a good deal of my discussion on achievement/multiple choice types of tests to illustrate the item-analysis methods. Please remember that these concepts can also apply to other types of test (e.g., personality).

Distractor Analysis

  • With multiple choice tests there is usually one correct answer and a few wrong answers or distractors. A lot can be learned from analyzing the frequency with which test-takers choose distractors.
  • Consider that perfect multiple-choice questions should have 2 distinctive features:
  • 1) Persons who know the answer pick the correct answer;
  • 2) People who do not know the answer guess among the possible responses. This means that each distractor should be equally popular. It also means that the number of correct answers = those who truly know + some random amount. To account for this, should professors subtract the randomness factor from each person's score to get a more accurate view of a person's true knowledge?
  • The number of people expected to pick each distractor is easily calculated by:
  • # expected to choose each distractor = (# who answered the item wrong) / (# of distractors)
  • Why might the number of people choosing a distractor be lower or higher than expected?
  • A) Partial knowledge;
  • B) Poorly constructed item;
  • C) Distractor is outside of the domain content;
  • How does distractor analysis contribute to overall test reliability and validity?
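The expected-count formula above can be sketched in a few lines of Python. This is only an illustration: the function name `distractor_analysis` and the response data are made up, and "answered wrong" is taken to mean "chose one of the distractors" (omitted answers are ignored).

```python
from collections import Counter

def distractor_analysis(responses, correct, options):
    """Compare observed distractor counts with the count expected
    if everyone who missed the item had guessed at random."""
    counts = Counter(responses)
    distractors = [opt for opt in options if opt != correct]
    # Assumption: "wrong" means the person picked one of the distractors.
    n_wrong = sum(counts[d] for d in distractors)
    expected = n_wrong / len(distractors)
    return {d: {"observed": counts[d], "expected": expected} for d in distractors}

# Hypothetical responses of 20 test-takers to one item keyed "B".
responses = list("BBBBBBBBBBAAAAAACCCD")
result = distractor_analysis(responses, correct="B", options="ABCD")
for option, row in result.items():
    print(option, row["observed"], round(row["expected"], 2))
```

In this made-up data, distractor A is chosen almost twice as often as expected and D is rarely chosen, which is the kind of pattern that would prompt the questions listed above (partial knowledge, a poorly constructed item, and so on).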

Item Difficulty

  • Item difficulty is most commonly measured by calculating the percentage of test-takers who answer the item correctly {p value for an item = (# of people responding correctly) / (# of people taking the test)}.
  • Generally, items with p values of 0.5 yield test scores with the most variation. Thus, most test developers seek to develop tests where the average difficulty score is about 0.5. Why?
    • Often items with difficulty levels between 0 - 0.2 and 0.8 - 1.0 are discarded because they are either too difficult or too easy, respectively. They are not differentiating the population.
  • The logic behind item difficulty is easy to see for knowledge or skill based tests where there is only one right answer. But why are difficulty levels also calculated for personality and attitude tests?
    • Even though there is no one right answer for personality or attitude items, test developers often consider the item that indicates the presence of a construct or attitude as correct. Item difficulty analysis assures the test developer that the items are not being answered in the same direction or with the same answer by everyone.
  • Although p values provide an indication of how difficult test items are, they tell us very little about the item’s usefulness in measuring the test’s construct.
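As a rough sketch of the p value calculation and the 0.2 / 0.8 screening rule described above, assuming items are scored 0 (incorrect) or 1 (correct). The function names, the data, and the choice to treat the cut-offs as inclusive are my own assumptions for illustration.

```python
def item_difficulty(scores):
    """scores: one row per test-taker, each row a list of 0/1 item scores.
    Returns the p value (proportion answering correctly) for each item."""
    n_people = len(scores)
    n_items = len(scores[0])
    return [sum(row[j] for row in scores) / n_people for j in range(n_items)]

def flag_items(p_values, low=0.2, high=0.8):
    """Label each item using the cut-offs from the notes.
    Assumption: the boundaries themselves count as too difficult / too easy."""
    flags = []
    for p in p_values:
        if p <= low:
            flags.append("too difficult")
        elif p >= high:
            flags.append("too easy")
        else:
            flags.append("keep")
    return flags

# Hypothetical data: 5 test-takers, 4 items.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
p_values = item_difficulty(scores)
flags = flag_items(p_values)
print(p_values)
print(flags)
```

Here the first item is answered correctly by everyone and the third by no one, so neither differentiates the group; the other two sit near the middle of the difficulty range.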

Item Discrimination

The Discrimination Index:

  • To measure how well a test item separates those test takers who show a high degree of a skill, knowledge, attitude, or personality from those who have low skill, knowledge, etc., a discrimination index (D) is calculated.
  • This index compares, for each test item, the performance of those who scored the best (U – upper group) with those who scored the worst (L – lower group).
  • The procedure for calculating D is straightforward:
  • 1) Rank-order your test scores from lowest to highest.
  • 2) The upper 25-35% and the lower 25-35% form your analysis groups.
  • 3) Calculate the percentage of test-takers passing each item in both groups {i.e., U = (# of uppers who responded correctly) / (Total number in the Upper group); L = (# of lowers who responded correctly) / (Total number in the lower group)};
  • 4) D = U – L
  • The logic of the D statistic is simple. Tests are more difficult for those who score poorly (lower group). If an item is measuring the same thing as a test, then the item should be more difficult for the lower group. The D statistic provides a measure of each item’s discriminating power with respect to the upper and lower groups.
  • What should you look for?
  • Generally a high positive value indicates a good discriminating item. A low or negative value indicates that the item was equally or more difficult for the upper group. This is not a good thing, and generally the item is rewritten or eliminated.
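The four steps for calculating D can be sketched as follows. This is a minimal illustration, not a standard routine: the function name and data are hypothetical, the group size is fixed at 30% (one value within the 25-35% range given above), and tied total scores are split by their original order.

```python
def discrimination_index(scores, fraction=0.30):
    """scores: one row per test-taker, each row a list of 0/1 item scores.
    Returns D = U - L for every item."""
    totals = [sum(row) for row in scores]
    # Step 1: rank-order test-takers from lowest to highest total score.
    order = sorted(range(len(scores)), key=lambda i: totals[i])
    # Step 2: form the lower and upper analysis groups.
    k = max(1, round(len(scores) * fraction))
    lower = [scores[i] for i in order[:k]]
    upper = [scores[i] for i in order[-k:]]
    d_values = []
    for j in range(len(scores[0])):
        # Step 3: proportion passing the item in each group.
        u_pass = sum(row[j] for row in upper) / k
        l_pass = sum(row[j] for row in lower) / k
        # Step 4: D = U - L.
        d_values.append(u_pass - l_pass)
    return d_values

# Hypothetical data: 10 test-takers, 4 items.
scores = [
    [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0],                 # high scorers
    [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0],   # middle
    [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1],                 # low scorers
]
d_values = discrimination_index(scores)
print([round(d, 2) for d in d_values])  # [0.67, 1.0, 1.0, -0.33]
```

The last item has a negative D: in this made-up data it was passed more often by the lower group than by the upper group, so it would be a candidate for rewriting or elimination.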

Inter-Item Correlations:

  • The inter-item correlation matrix is another important component of item analysis.
  • This matrix displays the correlation of each item with every other item.
  • Usually each item is coded as dichotomous (incorrect = 0, correct = 1), and the resulting matrix is composed of phi coefficients, which are interpreted much like Pearson product-moment correlation coefficients.
  • This matrix provides important information about a test’s internal consistency, and what could be done to improve it. Ideally each item should be correlated highly with the other items measuring the same construct. Items that do not correlate with the other items measuring the same construct can be dropped without reducing the test’s reliability. Why?
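A small sketch of an inter-item correlation matrix for 0/1 items. It relies on the fact that the Pearson formula applied to two dichotomous variables gives the phi coefficient. The function names and the data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; for two 0/1 variables this equals phi."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # An item everyone passed (or failed) has no variance to correlate.
        return float("nan")
    return sxy / sqrt(sxx * syy)

def inter_item_matrix(scores):
    """scores: one row per test-taker, each row a list of 0/1 item scores."""
    columns = list(zip(*scores))  # one tuple of responses per item
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Hypothetical data: 6 test-takers, 3 items.
scores = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
]
matrix = inter_item_matrix(scores)
for row in matrix:
    print([round(value, 2) for value in row])
```

In this made-up data the first and third items correlate negatively (about -0.33), which is the sort of entry that would lead you to ask whether both items measure the same construct.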

Item-Total Correlations

  • Point-biserial or item-total correlations assess the usefulness of an item as a measure of individual differences in knowledge, ability, or personality characteristic.
  • Here each dichotomous test item (incorrect = 0; correct = 1) is correlated with the person’s total test score.
  • Interpretation of the item-total correlation is similar to that of the D statistic. A modest positive correlation illustrates 2 things:
  • 1) That the item in question is measuring the same construct as the test;
  • 2) The item is successfully discriminating between those who perform well and those who perform poorly.
  • In reality, item-total correlation provides the same sort of information that the D statistic does. It is, however, more popular and can easily be done with the help of a statistical software package.
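For those who want to see the calculation behind a software package's output, here is a minimal sketch of the point-biserial item-total correlation. It uses the textbook form r = ((M1 - M0) / SD) * sqrt(p(1 - p)), where M1 and M0 are the mean total scores of those who passed and failed the item and SD is the population standard deviation of the totals. The function name and data are hypothetical, and the total score here includes the item itself (an uncorrected item-total correlation).

```python
from math import sqrt
from statistics import mean, pstdev

def item_total_correlations(scores):
    """scores: one row per test-taker, each row a list of 0/1 item scores.
    Returns the point-biserial correlation of each item with the total."""
    totals = [sum(row) for row in scores]
    sd_total = pstdev(totals)  # population SD of the total scores
    results = []
    for j in range(len(scores[0])):
        passed = [t for row, t in zip(scores, totals) if row[j] == 1]
        failed = [t for row, t in zip(scores, totals) if row[j] == 0]
        if not passed or not failed or sd_total == 0:
            # No variation in the item or in the totals: undefined.
            results.append(float("nan"))
            continue
        p = len(passed) / len(scores)  # the item's p value
        r_pb = (mean(passed) - mean(failed)) / sd_total * sqrt(p * (1 - p))
        results.append(r_pb)
    return results

# Hypothetical data: 6 test-takers, 3 items.
scores = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
]
r_values = item_total_correlations(scores)
print([round(r, 2) for r in r_values])  # [0.52, 0.87, 0.52]
```

With only three items, each item makes up a large share of the total, so these made-up correlations are inflated; on a real test the values are usually more modest.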

Item Characteristic Curves (ICC) and Item Response Theory (IRT): The Essentials:

  • Recently, test developers have begun to use the concepts associated with IRT for item analysis. In essence, IRT relates each test item's performance to a complex statistical estimate (beyond this course's scope) of the test taker's knowledge or ability on the measured construct.
  • A basic characteristic of IRT is an ICC. An ICC is a graphical representation of the probability of answering an item correctly as a function of the level of ability on the construct being measured. It gives you a picture of:
  • 1) The item's difficulty;
  • 2) The item's discriminatory power;
  • 3) The probability of answering correctly by guessing.
  • IRT logically assumes that individuals who have high scores on the test have greater ability than those who score low on the test.
    • With this in mind we can conclude that the greater the slope of the ICC the better the item is at discriminating between high and low test performers.
  • Difficulty, on an ICC, is operationally defined by the point at which the curve indicates a chance probability of 0.5 (a 50-50 chance) for answering the item correctly. The higher the level of ability needed to obtain a 0.5 probability (curve shifted to the right) the more difficult the item.
  • We’ll look at some examples in class.

Some More Detailed Notes on IRT

  • I originally did not want to cover this topic in any more detail than above because I remembered IRT to be largely theoretical. However, after a literature search on IRT, I've come to the conclusion that the amount of recent empirical IRT research warrants a closer look at IRT. It is an emerging measurement trend that you should be aware of.
  • As mentioned previously, IRT is a group of statistical procedures aimed at assessing the quality of each test item with respect to a person's specific trait level (i.e., how much motivation, self-esteem, or knowledge does that person possess?). In other words, IRT seeks to understand how individual differences in attributes affect the behaviour of an individual when confronted with a specific item.
  • IRT has a set of assumptions about the mathematical relationship between a person’s true ability and the likelihood that that individual will answer an item successfully. The assumptions are as follows:
  • A) A relationship should exist between what the measurement instrument is trying to measure (i.e., the specific trait or attribute) and the test-takers’ responses to the items on the instrument.
  • B) There should be a simple mathematical relationship between the individual’s ability on some attribute and the likelihood that the individual will answer the question successfully.
  • When these 2 assumptions are true, IRT allows researchers to make some very precise inferences about the underlying attribute on the basis of observed behaviour.
  • One of the most prominent IRT techniques is the item response function.
  • Item response function generates a curve (I consider these curves to be standardized ICC curves) which answers the question: Does this item measure this person's trait or proficiency adequately? Some items will do a better job at this than others, and these items may not necessarily be the same for every individual! Thus, 2 questions emerge:
  • 1) How do you determine the individual's trait level?
  • 2) What constitutes a good item?
  • An estimate of an individual's true trait or attribute level is their score on the test. With IRT, the raw test scores are standardized (mean=0, SD=1). Why? To identify outliers (at least 2 SD from the mean), and to equate the test scores to other data sets of the same test. Note: Identification of outliers is an extremely important aspect of measurement since many of the decisions based on the test pertain to those outliers (i.e., diagnosis, treatment, award of money, getting a job).
  • A test item is considered to be 'good' or 'informative' to the extent that it is free of measurement imprecision at some specific trait level. This concept is similar to that of a test's reliability, except IRT provides knowledge about an item's usefulness not only at the level of the item, but also at some specific attribute level.
  • We’ll look at some examples in class.
  • The most widely used IRT model for constructing standardized ICC curves is the 3-parameter logistic model (3PL). This model incorporates the following 3 parameters into its estimates of the ICC curve:
  • 1) Susceptibility of the item to guessing;
  • 2) The role of ability in terms of item response;
  • 3) The item's discriminating power.
  • The formula, for those interested in the relationship between these 3 parameters, is:

$$p(\theta) = c + \frac{(1 - c)\, e^{da(\theta - b)}}{1 + e^{da(\theta - b)}}$$

  • Where: p(θ) = p value for a given level (θ) of skill, personality, or attribute; e ≈ 2.718 (the base of the natural logarithm); c = the guessing parameter (the probability of answering correctly by guessing); d is a scaling factor, usually of value 1.7 or 1. If d = 1.7 then the results are comparable to a cumulative normal distribution; if d = 1 then it falls out of the equation; b = an item's difficulty or location parameter (the position of the curve along the ability axis); a = slope parameter or the discrimination index.
  • Some of the advantages of IRT:
  • 1) One of the principal advantages of IRT is that it provides information on measures that are not dependent on the specific sample. In other words, the measures used to describe an ICC are not dependent on the population sample. This is not true for standard item analysis. Why?
  • 2) With IRT, you can explicitly compare tests comprised of different items. This is termed 'test-free measurement' and is important for computerized adaptive testing.
  • 3) With IRT, persons with the same score on a test may be shown to differ in abilities depending on the assumptions made by the IRT model.
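To make the 3PL formula given earlier in this section more concrete, here is a short sketch that evaluates it at a few trait levels. The function name `three_pl` and the parameter values are hypothetical; `theta` stands for the trait level (the argument of p in the formula), and the other argument names follow the formula's symbols. The code uses the algebraically equivalent form 1 / (1 + e^(-z)) in place of e^z / (1 + e^z).

```python
from math import exp

def three_pl(theta, a, b, c, d=1.7):
    """Probability of a correct response under the 3-parameter logistic model.
    theta: trait level; a: slope (discrimination); b: difficulty (location);
    c: guessing parameter; d: scaling factor (1.7 or 1)."""
    z = d * a * (theta - b)
    # e^z / (1 + e^z) rewritten as 1 / (1 + e^(-z)).
    return c + (1 - c) / (1 + exp(-z))

# Hypothetical item: moderate slope, average difficulty, and a guessing
# parameter of 0.2 (e.g., a five-option multiple-choice item).
curve = [(theta, three_pl(theta, a=1.0, b=0.0, c=0.2)) for theta in range(-3, 4)]
for theta, p in curve:
    print(theta, round(p, 3))
```

Two things are worth noticing when you run this. The curve never drops below c, because even very low-ability test-takers can guess. And when c = 0 the curve passes through 0.5 at theta = b, which matches the definition of difficulty given for the ICC, whereas with c > 0 the probability at theta = b is (1 + c) / 2 (0.6 in this example).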

Practical Applications of IRT

Distractors:

  • With traditional item analysis a multiple-choice answer is either right or wrong. Can you see a problem with this?
  • IRT models allow for construction of separate ICC curves for each distractor, which may provide more information about one's knowledge or ability.
  • See example p. 209 of your text.

Adaptive Testing:

  • Adaptive tests are tests made up of questions chosen from a large test bank to match the skill and ability level of the test taker. These tests are administered via computer.
  • IRT models can allow adaptive tests to zero in on a test-taker's ability or skill level. This, in turn, gives a more precise measure of true ability, while at the same time shortening the test.
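The idea of "zeroing in" can be illustrated with a deliberately simplified toy. Please note the assumptions: this is not how operational adaptive tests estimate ability (they use IRT-based estimation), the item bank, difficulty values, and function names are all made up, and the update rule (move the estimate up after a correct answer, down after an incorrect one, by a step that halves each time) is just a stand-in to show the logic of matching items to the test-taker.

```python
def next_item(bank, estimate, used):
    """Pick the unused item whose difficulty is closest to the current estimate."""
    candidates = [name for name in bank if name not in used]
    return min(candidates, key=lambda name: abs(bank[name] - estimate))

def run_adaptive_test(bank, answer, n_items=3, start=0.0, step=1.0):
    """bank: dict mapping item name -> difficulty (hypothetical values).
    answer: function that returns True if the item is answered correctly."""
    estimate, used = start, []
    for _ in range(n_items):
        item = next_item(bank, estimate, used)
        used.append(item)
        # Toy update rule (an assumption, not an IRT estimator).
        estimate += step if answer(item) else -step
        step /= 2
    return estimate, used

# Hypothetical item bank: item name -> difficulty on the ability scale.
bank = {"q1": -2.0, "q2": -1.0, "q3": 0.0, "q4": 1.0, "q5": 2.0, "q6": 1.5, "q7": 0.5}

# Simulated test-taker with ability 1.2 who passes any item that is not
# harder than that (real responses are probabilistic, not deterministic).
true_ability = 1.2
estimate, used = run_adaptive_test(bank, lambda item: bank[item] <= true_ability)
print(used, estimate)  # ['q3', 'q4', 'q6'] 1.25
```

After only three of the seven items the estimate (1.25) is already close to the simulated ability (1.2), and the very easy items were never administered. That is the sense in which an adaptive test can be both shorter and more precise.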

Screening and Criterion-Keyed Tests:

  • Please read this on your own. We'll talk about the ICC curves (Fig. 10-6, 10.7 in your text) that could be very helpful.