Step 1: Define the Test Universe, Test Takers, and Purpose
- Prepare a working definition of the construct the test will measure.
- Be sure to check the psychological literature to help with this.
- List the characteristics of the people you expect to take the test, especially those
that will affect how test takers respond. Some examples include:
- Reading level; disability; motivation to answer honestly; language.
Purpose of The Test:
- Include what the test will measure, e.g., temptation.
- Also include how the outcomes will be used. Will the scores be used to compare test
takers (normative) or will they be compared to some achievement level (criterion)? Will
the scores be used to predict some performance or make a diagnosis? Will the scores be
used cumulatively to help prove/disprove a theory, or individually to provide information
about the test taker?
- Why is it important to include information about the test universe, test takers and
purpose of the test?
- This type of information helps the user determine whether the test is appropriate for
things such as group administration, or paper-and-pencil versus oral administration. It
helps the user make a more informed decision about the test's usefulness.
Step 2: Develop a Test Plan
- Now you want to provide your construct with a concise definition. The definition should include:
- 1) An operationalized statement regarding observable and measurable behaviours.
- 2) Boundaries of the test domain, i.e., the content that you are testing. Also note
content that falls outside the domain and is not appropriate to include.
- Include an estimate (percentage) of how many questions are needed to sample the test domain.
- Choose the test format (e.g., objective or subjective) and the type of questions the
test will contain (e.g., multiple-choice, true/false, short answer, or verbal questions).
- Most tests use a consistent format. However, if you use different formats be sure to
provide detailed instructions about each type of question.
Specify Administration and Scoring Methods:
- Specify how the test should be administered and scored.
- Answer questions like: How will the test be administered? How long do test takers have
to complete the test? Should the test be given in a group setting or to individuals? Does
the test need to be scored by the publisher or the administrator? Is there a particular
weighting
for each question? What type of data will the test yield?
- The most common method for scoring is termed the cumulative model. This type of model
states that the more the test taker responds in a given fashion, the greater the
exhibition of the attribute being measured. The most common method gives one point for
each measure of the attribute; the total accumulation of these points becomes the raw
score. These tests typically yield interval-level data.
- Other scoring methods include:
- A) Categorical model: This type of method is used to place test takers into a given
group. This type of model generally yields nominal data.
- B) Ipsative model: The test taker's scores on various scales within the test are
compared to each other and yield a profile of the individual.
- C) Note: All three of the above models can be combined in any fashion on any given test.
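The cumulative model described above can be sketched in a few lines; the answer key and responses here are hypothetical examples, not items from any real test.

```python
# Cumulative scoring sketch: one point per keyed response; the total of these
# points is the raw score. The key and responses below are hypothetical.
answer_key = ["T", "F", "T", "T", "F"]

def raw_score(responses, key):
    """Count the responses that match the keyed (attribute-consistent) answers."""
    return sum(1 for r, k in zip(responses, key) if r == k)

print(raw_score(["T", "F", "F", "T", "F"], answer_key))  # 4
```

An ipsative profile would compare several such raw scores, each computed on a different scale within the same test, against one another for a single test taker.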
Step 3: Develop Test Items
- The most commonly used objective formats are multiple choice, true/false, forced-choice,
and Likert scales.
- The multiple choice (MC) format consists of a sentence (or part of a sentence), called
the stem, followed by a number of responses (usually 3-5), of which only one is correct.
Incorrect responses are termed distractors.
- When composing a MC question you should try to clearly differentiate the correct
response from the distractors. Distractors that are clearly wrong or almost right often
detract from your test's accuracy, unless you take this into account with IRT.
- See handout on some MC questions guidelines.
- The stem component of a true/false item is usually something like, "Which of the
following is true?" Some researchers like to convert true/false items into MC items to
decrease the guessing advantage.
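The guessing advantage is easy to quantify with the binomial distribution. The 20-item test and pass mark of 12 below are hypothetical numbers chosen for illustration.

```python
from math import comb

def p_pass_by_guessing(n_items, p_correct, n_needed):
    """Binomial probability of getting at least n_needed items right by blind guessing."""
    return sum(comb(n_items, k) * p_correct**k * (1 - p_correct)**(n_items - k)
               for k in range(n_needed, n_items + 1))

# Hypothetical 20-item test with a pass mark of 12 correct:
print(round(p_pass_by_guessing(20, 1/2, 12), 3))  # true/false items: 0.252
print(round(p_pass_by_guessing(20, 1/4, 12), 3))  # 4-option MC items: 0.001
```

In this example, converting true/false items to four-option MC cuts the chance of passing by blind guessing from roughly one in four to roughly one in a thousand.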
- Forced-choice items typically have a format similar to that of MC. For example, a
question may read,
- Place an X in the space to the left of the word that best describes you:
- _____ cheery
- Forced choice often makes it difficult to guess or fake, since the paired words often
appear to have nothing in common; thus it is difficult to guess what the correct response
is.
- Can you think of a shortfall of the forced-choice method?
- Likert Scales are generally used when expressing positive or negative attitudes towards
a specific object or event.
- A large number of item statements are presented and test takers are asked to indicate,
on a 4- to 7-point scale (5-point scales are most common), how they feel about the
statement. Each point on the scale is assigned a value, and the test taker's score is
calculated in some fashion.
- Since Likert scales are assumed to be equal-interval, most statistical procedures can be
applied to the resulting scores.
- We'll look at an example in class.
- Can you think of any pros and cons?
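A minimal Likert scoring sketch, assuming a 5-point scale and hypothetical items. Reverse-scoring negatively worded items before summing is one common way the score is "calculated in some fashion".

```python
# Likert scoring sketch: responses are 1-5; negatively worded (reverse-keyed)
# items are flipped (1 <-> 5, 2 <-> 4) before summing. Items are hypothetical.
def score_likert(responses, reverse_keyed, scale_max=5):
    total = 0
    for i, r in enumerate(responses):
        total += (scale_max + 1 - r) if i in reverse_keyed else r
    return total

# Item 1 is negatively worded, so its response of 2 is scored as 4.
print(score_likert([4, 2, 5], reverse_keyed={1}))  # 13
```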
- Subjective formats include essay questions, interview questions, projective techniques,
and sentence completion.
- Essay questions are popular in educational settings, as I'm sure you are aware, and
allow freedom of response, and generally require higher cognitive functioning (analysis,
synthesis, and evaluation) to answer.
- What are some pros and cons to this method?
- Interview questions tend to also be general in scope, and judging the quality of answers
is left to the interviewer. Interview tests should be planned (not just sit down and chat)
and focus on
the knowledge, skills, abilities, and other characteristics needed for the job (KSAOs).
Clinical interviews should also be highly structured. Why?
- Projective techniques are often used in clinical settings and use highly ambiguous
stimuli to elicit unstructured responses from test takers. Result interpretation is often
difficult, and often requires significant study and special skills.
- Sentence completion is another subjective technique, especially in personality testing.
This technique presents a partial sentence and the test taker is asked to complete it. For
example, "I feel betrayed when ________." Scoring is often accomplished by comparing
responses with those provided by the test developer. If a match occurs, a point is awarded
for a particular trait.
Writing Good Items
- As you can see, a lot of creativity, originality, and knowledge goes into writing an
item. Here are some rules of thumb that may help.
- 1) Identify item topics by consulting the test plan.
- 2) Be sure each item is based on an important learning objective or topic.
- 3) Write items that assess information or skills drawn only from your testing universe.
- 4) Write each item as clearly and directly as possible.
- 5) Use appropriate language.
- 6) Try to make all items independent.
- 7) Ask someone to review the items.
Some Things To Be Aware Of With Both Objective and Subjective Items
- Some test takers have response sets, or styles, for choosing answers on tests. These
response sets often lead to false or misleading information. For example, test takers who
are unsure of a response may always say 'yes' or 'no', or always pick 'c'.
- There are some methods that attempt to detect or minimize response bias errors.
We'll discuss some later in the course.
- Some of the more common response biases are described below.
- One of the most problematic of the response biases is social desirability.
- Often test takers have a tendency to choose answers that make them look good or are more
socially acceptable.
- It is important that you take this into account when developing your test items. Why?
- Try to balance the social desirability of the distractors. Or use an ipsative format.
How would you do this?
- Some researchers believe that socially desirable responding is a part of personality and
should not
be removed statistically since that will detract from the validity of your results.
- Sometimes test takers cannot respond accurately to a test item. Thus, they respond in a
random fashion.
- How can you check to see if a person is responding randomly?
- Add a scale that attempts to identify this. For example, include questions that almost
everyone in the population answers correctly.
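One way to implement such a check scale, sketched below with hypothetical item IDs, keyed answers, and a hypothetical miss threshold: count misses on items nearly everyone answers the same way, and flag respondents who exceed the threshold.

```python
# Infrequency ("random responding") check sketch. The item IDs, keyed answers,
# and miss threshold are hypothetical choices for illustration.
infrequency_key = {"q3": "T", "q7": "T", "q12": "F"}

def flag_random_responding(responses, key, max_misses=1):
    """Flag a respondent who misses more than max_misses near-universal items."""
    misses = sum(responses.get(q) != ans for q, ans in key.items())
    return misses > max_misses

print(flag_random_responding({"q3": "T", "q7": "F", "q12": "T"}, infrequency_key))  # True
```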
- Faking refers to the inclination of test takers to try to answer items in a way that
will cause a desirable outcome or diagnosis.
- You can 'fake good': try to answer items in a way that makes you appear to have more of
a desirable trait.
- You can 'fake bad': answer items in a way that makes you appear to have more of an
undesirable trait.
- What types of situations generate faking?
- How can you prevent faking or cheating?
- You can use what are referred to as subtle questions. These questions have no real
relation to the test purpose and thus are hard to fake, e.g., a personality test with a
question like, "Birds fly south for the winter." Or you can build in catch scales (see
pp. 224-225 of your text).
- Research is mixed with respect to faking. Some studies show that even if faking is
detected, there's no way of estimating a person's true score. Thus, what do you do with
such scores?
- Some research suggests that even if persons fake, this may not affect the validity of
predicting future behaviours. What does this suggest?
- Acquiescence refers to the tendency of some test takers ('yes people') to agree with any
ideas or behaviours that are presented to them.
- An example could be someone who only responds true on true/false test items.
- Because of acquiescence it is a good idea to balance items for which the correct
response would be positive with an equal number of items for which the correct response
would be negative.
- How will this affect your scoring?
- We'll consider an example in class.
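The scoring effect of balancing items can be seen in a small sketch (the key is hypothetical): with half the keyed answers 'T' and half 'F', an acquiescent responder who answers 'T' throughout lands at chance level instead of at the maximum.

```python
# Balanced-key sketch: half the keyed answers are "T", half are "F".
balanced_key = ["T", "F", "T", "F", "T", "F"]

def score(responses, key):
    """One point per response that matches the key (cumulative scoring)."""
    return sum(r == k for r, k in zip(responses, key))

yes_sayer = ["T"] * 6                  # answers "T" to every item
print(score(yes_sayer, balanced_key))  # 3, i.e., chance level, not 6
```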
Writing Test Administration Instructions
- It is important to remember that even though the test items make up the bulk of any new
test, they are meaningless without good, specific instructions on how the test is to be
administered.
- As the test developer you need to develop two sets of instructions: one for the persons
who will be administering the test, and another set for the test takers.
- The testing environment, the circumstances under which the test is administered, can
affect how test takers respond.
- Standardized testing environments decrease scoring error or variation that cannot be
attributed to the attribute being measured.
- Specific and concise instructions should address the following:
- A) Group or individual administration;
- B) Specific requirements for the location and equipment. Include things like privacy
needs, quiet, chairs, tables, desks, and required equipment (e.g., pencils, computers);
- C) Specify time limitations, or approximate completion times if there is no time limit.
- D) Prepare a script for the administrator to read to the test takers. The script should
include answers to specific questions that test takers are likely to ask.
Instructions for the Test Taker
- Test taker instructions are usually, but not always, delivered orally by test
administrators, who read your prepared script. Can you see a problem with this?
- Instructions also appear in writing at the beginning of the test or in the test booklet.
- Instructions should include things like:
- A) Where the test taker should respond.
- B) How the test taker should respond. An example is always helpful.
- C) Instructions should encourage accurate and honest answering.
- D) Some tests need to have test takers thinking of a specific context or environment
(e.g., at home, work, school). Thus provide statements like, "Think of your current work
situation when responding to the following questions."
- E) If the instructions are too complicated you are very likely to confuse some test
takers and thus increase the probability of response errors. If you find that your
instructions are too complicated then revise them, or alternatively revise your test!!
- I'll bring an example to class.
Revising the Test
- Revision of the test is a major part of the test development process.
- Usually test developers write more items than are needed and then use the quantitative
and qualitative analyses of the test to choose those items that together provide the most
information about the construct being measured (see handout).
Choosing the Final Items
- To choose the test's final items, you must weigh each of the following for each item:
- A) Content Validity;
- B) Item difficulty and discrimination;
- C) Inter-item correlation (reliability measure);
- D) All biases.
- E) You should also take into account test length and face validity.
- Often a matrix highlighting all the above characteristics is created to help in the
selection process. The matrix organizes all the information, in clear view, for you to
consider. We'll go through one in class.
- Can you recall what it is you'll be looking for as evidence of a good item?
- Now you should be able to see why you need to start with so many items.
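Two of the quantitative entries in such a matrix, item difficulty and discrimination, can be computed from pilot data along these lines; the 0/1 response matrix below is hypothetical.

```python
# Item-analysis sketch on a hypothetical response matrix
# (rows = test takers, columns = items; 1 = correct/keyed, 0 = not).
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

def item_difficulty(data, item):
    """Proportion of test takers getting the item right (the p-value)."""
    return sum(row[item] for row in data) / len(data)

def discrimination_index(data, item):
    """Upper-lower index: item p-value in the top half (by total score)
    minus the p-value in the bottom half."""
    ranked = sorted(data, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[half:]
    return (sum(r[item] for r in upper) - sum(r[item] for r in lower)) / half

print(item_difficulty(data, 0))       # 2/3 of test takers answer item 0 correctly
print(discrimination_index(data, 2))  # 1.0: item 2 cleanly separates high and low scorers
```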
- Don't forget about the test instructions. They too should be revised. You will
undoubtedly discover instructions that you forgot to include, or that were unclear or
difficult to follow. These should be changed.
Validation and Cross Validation
- Now that you are satisfied with your final test items, your test needs to be evaluated to
ensure that it is reliable and valid.
- How should you go about this? That depends on the type of test and what it will be used
for. But here are some general guidelines:
- 1) The first thing that usually is done in validation is establishing content validity.
If you followed all the steps required in defining the construct initially, this is
already taken care of.
- 2) To establish other types of validity (construct and criterion-related) you'll
need to run another round of data collection. These additional rounds of data collection
are similar to the pilot round except they may match the actual testing protocol better
than your pilot study. That is, they may use a sample of people who closely resemble the
target audience. Your additional samples should also be large enough so you'll be
able to comfortably run the analyses in question. Power analysis will help you here.
- 3) You should also be collecting data on the demographic characteristics of the test
takers. Things like sex, race, age, SES, etc. will help when checking for bias.
Developing Norms And Cut Scores
- Norms and cut scores are decision points for dividing test scores into groups (e.g.,
pass/fail, depressed/not depressed).
- These norms and cut scores help the test user with interpretation of the test data.
- Not all tests will have norms. Norms and cut scores depend on two things:
- A) The purpose of the test;
- B) How widely used the test is.
How do you develop norms?
- Norms are based on the distribution of test scores and provide a reference point or
structure for understanding one's own score.
- Here are the steps required for developing norms:
- 1) Obtain a random sample of the target audience. As you know, larger is better. This is
difficult to do, why? What is the next best thing? Can you use some of the pilot data or
the data obtained during validation?
- 2) Once the sample is large enough, norms are calculated. As the database grows, the
norms should be adjusted as well. Why?
- 3) Larger databases also allow for the development of subgroup norms: statistics that
describe a specific subgroup of the target audience (e.g., males only).
- 4) The norms should be published in the test manual.
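Percentile-rank norms, one common form, can be sketched as follows. The normative sample is hypothetical, and this version counts only scores strictly below the obtained score (some definitions add half of the tied scores).

```python
# Percentile-rank norm sketch on a hypothetical normative sample of raw scores.
norm_sample = [55, 60, 62, 65, 65, 70, 72, 75, 80, 90]

def percentile_rank(score, sample):
    """Percentage of the normative sample scoring strictly below the given score."""
    below = sum(s < score for s in sample)
    return 100 * below / len(sample)

print(percentile_rank(72, norm_sample))  # 60.0 -> above 60% of the sample
```

Subgroup norms would simply re-run the same calculation on the subset of the sample belonging to that subgroup.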
Identifying Cut Scores
- Cut scores are scores at which a decision changes.
- Setting cut scores is not an easy process and has all kinds of legal, professional, and
psychometric implications. Can you think of any?
- There are two main approaches to setting cut scores:
- 1) With employment tests, a panel of expert judges provides an opinion or rating about the
number of test items that a barely qualified person is likely to get right. This
information then becomes the cut score.
- 2) A more empirical approach can also be used. Here the correlation between the test and
an outside criterion is used in a regression equation to predict the test score that a
minimally acceptable candidate is likely to achieve. This predicted score then becomes the
cut score.
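The regression approach can be sketched with ordinary least squares. The ratings, test scores, and the "minimally acceptable" rating of 3 below are all hypothetical.

```python
# Regression cut-score sketch (hypothetical data): predict the test score
# of a candidate rated minimally acceptable (rating = 3 on a 1-5 scale).
ratings = [1, 2, 3, 4, 5]        # criterion: supervisor ratings
scores = [40, 50, 55, 70, 85]    # test scores for the same people

n = len(ratings)
mean_x = sum(ratings) / n
mean_y = sum(scores) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ratings, scores))
         / sum((x - mean_x) ** 2 for x in ratings))
intercept = mean_y - slope * mean_x

cut_score = intercept + slope * 3    # predicted score at the minimal rating
print(cut_score)  # 60.0
```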
- A major problem with setting cut scores is error. Recall that SEM is an indicator of how
much error exists in someone's test score. It is very likely that a person who scores only
a point or two below the cut score will score above the cut score if asked to take the
test again. And the increase in score can be solely due to test error, and not ability.
- Because of this Anastasi and Urbina (1997) suggest that cut scores be a band of scores
rather than a single score. Instead of a cut score of 60, you use the SEM to compute a
band of scores, say 58 - 62.
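The band can be computed directly from the SEM formula; the standard deviation and reliability below are hypothetical values chosen to reproduce the 58-62 example.

```python
# SEM band sketch: SEM = SD * sqrt(1 - reliability); the band is the cut
# score plus or minus one SEM. SD and reliability here are hypothetical.
sd = 10.0           # standard deviation of test scores
reliability = 0.96  # e.g., an internal-consistency estimate

sem = sd * (1 - reliability) ** 0.5
cut = 60
band = (round(cut - sem, 2), round(cut + sem, 2))
print(band)  # (58.0, 62.0)
```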
Developing The Test Manual
- We have noted previously that the test manual is an important component of any test.
- The manual includes things such as:
- A) Rationale for test construction;
- B) A history of the developmental process;
- C) Results of validation studies;
- D) The target audience;
- E) Instructions for administration and scoring;
- F) Norms;
- G) Information on interpreting individual scores;
- H) Limitations of use and measurement accuracy.
- The writing of the manual is not left until the end. It is an ongoing process that
begins with your conception of the test, and continues throughout the developmental
phases. If you diligently record things in your test manual it will serve as a source of
documentation and reference for each part of the developmental process.
- We'll look at the Wisconsin Card Sorting Test manual in class.