- Often people believe that the title of a test tells them what the test measures. This is a poor strategy.
- For instance, consider a test titled 'The Math Achievement Test'. It may measure the broader concept of academic achievement, or the narrower concept of achievement in multiplication/division, or it may even measure a very different attribute, like general intelligence.
- While reliability tells us whether a test is measuring something consistently, only validity
can provide us with information about what a test really measures.
- Validity then is evidence that a test is being used appropriately and measures what it
is supposed to measure.
Major Types of Validity
- The APA's Standards for Educational and Psychological Testing (1985) and specialists in
psychological testing generally agree that there are 3 ways of deciding whether a test is
valid enough to be useful. Those 3 strategies are named:
- Content Validity;
- Construct Validity;
- Criterion-Related Validity (often referred to as Predictive & Concurrent Validity).
- Another type of validity, termed Face Validity, is not recognized as a primary type of
validity, but is commonly used by test developers and test users.
The Use of Validation Strategies
- A valid test is one that does the job it is supposed to do. It measures the construct it is supposed to measure or predicts the outcome it claims to predict.
- For example, your take-home test is an achievement test that is supposed to measure how well you understand or have mastered the content of the first half of this course. Employment tests, on the other hand, are supposed to predict future job performance.
- With this in mind, tests must be used for their intended purposes. Neither a valid reading test nor your take-home test should be used as a measure of intelligence.
- Some tests measure concrete attributes like the ability to throw a baseball. Most
persons agree on the specific behaviours associated with throwing a baseball.
- Other tests measure abstract attributes like love, personality, intelligence, or creativity. These attributes are more difficult to describe because people disagree on which behaviours represent them. What does it mean, for example, to be aggressive?
- Thus, depending on the type of test you have, it is important to understand the different types of validity, what they mean, and when they should be used.
- For example, for achievement tests it is important and relatively easy to demonstrate
content validity (see below).
- However, gathering content validity for something abstract, like personality, may be
more difficult (but not necessarily less important).
- Criterion-related validity is generally used for tests that claim to predict outcomes.
Can you think of some examples?
- Construct validity is appropriate when your test is measuring an abstract concept like
beauty. Construct validity involves the accumulation of a variety of evidence (reliability
and other types of validity) that shows the test is functioning as it was intended to. For
example, your test of beauty may correlate highly with another test of beauty, or may
cause specific behaviours (e.g., dilation of pupils) as predicted by some theory.
- One of the simplest ways to obtain evidence for the validation of a test is to examine
the content of the test.
- Content validity then is the extent to which the questions on a test are representative
of the trait, behaviour, or attribute that is being measured.
- Content validity focuses on the questions of the test itself; criterion-related and construct validity, in contrast, correlate test scores with other measures.
- As such, content validity need not involve a statistical procedure.
- How would you show that the following are content-valid?
- Classroom statistical achievement test.
- Employment test to measure mechanical ability.
- Paper-and-pencil test of 'life attitude'.
- How do you go about obtaining evidence of content validity?
- A) Systematically defining the testing universe: The testing universe, or content domain, is the set of all possible behaviours associated with the attribute or trait being measured. This is typically defined before the test is developed and gives the user confidence that the test samples the domain representatively.
- B) Expert ratings: Once the test is developed, experts should be consulted to evaluate how relevant each test question is to what is being measured.
The Testing Universe or Content Domain
- The first step for the development of any test is to determine the testing universe
(i.e., content domain)--the set of knowledge or behaviours the test represents. Usually,
this step involves locating theoretical or empirical research, talking with experts, or
reviewing other similar instruments.
- A content-valid test will representatively sample the testing universe.
- Content domains often have defined boundaries, and can usually be structured into
distinct subcategories (see pg. 149). Describing the boundaries and categories of a
content domain facilitates test question development (i.e., it is easy to see if and where
a specific question may fit in) and is crucial in evaluating content validity.
- There are no formal statistical measures of content validity. It is a judgement call.
However, the judgements are not made haphazardly or arbitrarily. The general procedure for
determining content validity is simple.
- 1) Describe the content domain and subcategories of the testing universe (this is the
most difficult step).
- 2) Determine where each test item fits with respect to the testing universe.
- 3) Compare the test's structure with that of the testing universe.
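The three steps above can be sketched as a simple blueprint comparison. Everything below is hypothetical: the subcategory names, the item tags, and the target weights are invented for illustration, not drawn from any real test.

```python
# Hypothetical sketch of steps 1-3: describe the domain's subcategories,
# tag each test item with the subcategory it fits, then compare the
# test's structure against the domain's target structure.
from collections import Counter

def blueprint_coverage(item_tags, target_weights):
    """Map each subcategory to (observed proportion of items, target weight)."""
    counts = Counter(item_tags)
    total = len(item_tags)
    return {cat: (counts.get(cat, 0) / total, target)
            for cat, target in target_weights.items()}

# A 10-item statistics test tagged by subcategory (step 2)...
tags = ["descriptives"] * 5 + ["correlation"] * 3 + ["inference"] * 2
# ...compared with the domain structure described in step 1.
targets = {"descriptives": 0.4, "correlation": 0.3, "inference": 0.3}
coverage = blueprint_coverage(tags, targets)  # step 3: observed vs. target
```

Large gaps between observed and target proportions (here, inference is under-represented at 0.2 against a target of 0.3) would weaken a content-validity argument.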
- Although there is no statistical measure of content validity, tests that provide more
detail about the structure and boundaries of content domains generate more confidence
about content validity.
- Content validity by itself cannot guarantee the validity of a measure! Why?
- After the development of a test, test-users should not assume that a test is content
valid. Publishers should provide evidence, in the test manual, that a test demonstrates
content validity. What types of evidence should they provide?
- Content validity ratios! A content validity ratio is the proportion of experts (i.e., the number who rate a question as essential divided by the total number of experts) who state that the question is essential. A question is usually described as content valid when more than half of the experts deem it essential.
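As a minimal sketch of the ratio just described, with invented ratings (note that Lawshe's published content validity ratio uses a slightly different formula; the simple proportion shown here follows these notes):

```python
# Sketch: content validity ratio as the proportion of experts who rate
# a question "essential". The ratings below are hypothetical.

def content_validity_ratio(ratings):
    """ratings: one boolean per expert, True meaning 'rated essential'."""
    return sum(ratings) / len(ratings)

# Five experts rate one question; four deem it essential.
cvr = content_validity_ratio([True, True, True, True, False])  # 0.8
content_valid = cvr > 0.5  # more than half deem it essential
```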
- You may have noticed that content validity is conceptually similar to reliability.
- The main difference between the two is that content validity places an emphasis on
providing a detailed description of the content domain. In comparison, reliability assumes
that a domain exists, but makes little effort to define it. Thus, you could have a
reliable test that has little content validity.
- Can you think of an example where content validity could be used to establish the validity of a decision on the basis of test scores?
- Consider personnel selection for a statistical position with Manitoba Health.
- What would you do to determine the quality of an applicant? (see text pg. 153).
- Face validity tells us nothing about what a test actually measures.
- Face validity refers to how test takers perceive the attractiveness and appropriateness
of a test. Why then is it important?
- If test takers consider the test to have face validity, they may make a more conscientious effort to complete it. If a test lacks face validity, they might hurry through it and take it less seriously.
Construct Validity
- Construct validity is defined as the extent to which a test measures some theoretical construct.
- The process of establishing construct validity for a test is somewhat tedious, and requires the gradual accumulation of evidence illustrating that the test's scores relate to observable behaviours in the way predicted by the underlying theory.
- Note: If you accept the evidence provided by construct validation, then you are obligated to accept the underlying definition of the construct used in the process of validation. In other words, you accept the definition provided by those who developed and validated the test.
- What is a Construct?
- Constructs are attributes that exist in the theoretical sense. Thus, they do not exist
in either the literal or physical sense. Despite this, we can observe and measure
behaviours that provide evidence of these constructs.
- For example, consider gravity. We cannot see gravity, but we can see what we assume to be its results: a falling apple.
- Definitions of constructs often vary from person to person, even among persons who are
considered experts in an area of study. For example, take the construct of alcoholism. If
we surveyed the class, how many different definitions would we generate?
- Consider the construct introduced by Bandura (1977) of self-efficacy. It is defined as a
person's expectations about his or her own competence and ability to accomplish an
activity or task.
- From this model, Bandura (1977) proposed the following about the construct of self-efficacy: "expectations of personal efficacy determine whether coping behavior will be initiated, how much effort will be expended, and how long it will be sustained in the face of obstacles and aversive experiences."
- Since our ability to measure an abstract concept like self-efficacy depends on our
ability to observe and measure related behaviour, how should we go about defining or
explaining a psychological construct?
- Your text describes 3 steps, referred to as construct explication, which outlines
the process of defining a construct.
- 1) Identify the behaviours that relate to the construct. The more you can generate the
better able you are to define the construct.
- 2) Identify other constructs that may be related or unrelated to the construct being
explained. This will help determine the boundaries of the construct.
- 3) Identify behaviours related to these similar and dissimilar constructs and determine
whether these behaviours are related to the current construct being measured.
- We'll do an example in class, and your text also gives an excellent example (p. 157).
- Once you've completed your detailed descriptions of the relationships between sets of constructs and their behavioural universes, you've generated what is referred to as a nomological network.
- The nomological network defines constructs by illustrating their relationships to as many other constructs and behaviours as possible. This network provides the starting point for establishing a test's construct validity, since it supplies a number of hypotheses about the behaviours that people who have small or large amounts of the construct should display.
Gathering Evidence for Construct Validity
- There are 2 main ways that we can obtain scientific evidence for construct validity:
- 1) Gathering Theoretical Evidence: (see above)
- The nomological network;
- Proposal of experimental hypotheses;
- 2) Gathering Psychometric evidence:
- Evidence of reliability;
- Convergent and discriminant validity;
- Experimental interventions;
- Let's consider some of the psychometric evidence in a bit more detail.
- Recall that reliability is a necessary characteristic for a psychological test.
- High reliability scores generally indicate that a single theoretical construct is being measured.
- Also, psychological testing theory suggests that a test should not have a stronger
correlation with any other variable than it does with itself. With this in mind,
reliability estimates may be used to evaluate the relative strength of a test's
correlations with other variables that are related to the theoretical construct.
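This use of reliability can be made concrete through a classical psychometric result: the observed correlation between two measures cannot exceed the square root of the product of their reliabilities. A small sketch, with invented reliability values:

```python
# Sketch of the classical ceiling that reliability places on validity
# coefficients: r_xy <= sqrt(r_xx * r_yy). Reliabilities are hypothetical.
import math

def max_validity(rel_test, rel_criterion):
    """Upper bound on the observed correlation between two measures."""
    return math.sqrt(rel_test * rel_criterion)

# With reliabilities of .81 and .64, no observed validity coefficient
# should exceed .72; anything larger suggests something is wrong.
ceiling = max_validity(0.81, 0.64)  # 0.72
```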
- Convergent and Discriminant Validity:
- If a test is construct valid, then we should expect that our test's scores will
correlate strongly with the scores on other tests that measure the same construct. This is
termed convergent validity.
- This raises another intriguing question. If we already have a test that measures some
construct, why would we want to develop another one?
- Discriminant validity is the opposite of convergent validity. If different constructs are not considered to be related, then we should expect to find no correlation between test scores measuring them.
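Convergent and discriminant patterns can be illustrated with simulated scores. Everything below is synthetic; a real study would correlate scores from actual instruments.

```python
# Sketch: convergent vs. discriminant evidence as correlations between
# simulated test scores. All data are synthetic, for illustration only.
import random

random.seed(0)
true_construct = [random.gauss(0, 1) for _ in range(200)]
test_a = [t + random.gauss(0, 0.3) for t in true_construct]  # same construct
test_b = [t + random.gauss(0, 0.3) for t in true_construct]  # same construct
unrelated = [random.gauss(0, 1) for _ in range(200)]         # different construct

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

convergent = pearson(test_a, test_b)       # expected to be high
discriminant = pearson(test_a, unrelated)  # expected to be near zero
```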
- The Multitrait-Multimethod Design: This method creatively combines the need to collect evidence of convergent validity, discriminant validity, and reliability into one design.
- With this method researchers can test for convergence across different measures of the same construct, and for divergence between measures of related but conceptually different constructs.
- In essence, you choose 3 constructs that are unrelated in theory and 3 different types of tests (i.e., maximal performance, projective, and peer review) that measure each of the constructs.
- You then collect data on each participant in the study on each construct and using each method. Each person should have 9 scores.
- You should then generate a correlation matrix. The headings for the horizontal and vertical axes of the matrix should be identical, each combining the method name and the construct being measured. Your correlation matrix should have 81 possible values. See Table 8.7, p. 163 of your text.
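The layout itself can be sketched as follows. The trait and method names are invented; the point is the 9 labelled scores per person and the resulting 81-cell matrix.

```python
# Sketch of the multitrait-multimethod layout: 3 constructs measured by
# 3 methods yields 9 scores per participant and a 9 x 9 correlation
# matrix with 81 cells. Trait and method names are hypothetical.
traits = ["anxiety", "sociability", "diligence"]
methods = ["self-report", "peer-rating", "observation"]

# One heading per method-construct pair, used on both axes of the matrix.
labels = [f"{m}:{t}" for m in methods for t in traits]  # 9 headings

# Every pairing of headings is one cell of the matrix (81 in all):
#   same trait, different method  -> convergent evidence;
#   different trait, same method  -> discriminant evidence;
#   same trait and method (diagonal) -> reliability.
cells = [(row, col) for row in labels for col in labels]
```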
- We'll go through the table in class.
- The multitrait-multimethod design is an efficient and very informative method for
studying construct validity. Be sure you understand it. It makes for a great final exam
question! Hint, hint!!
- Experimental Intervention:
- When a test is used as an independent or dependent variable in a research study, the results of the study can make a substantial contribution to the argument for construct validity.
- Hint: Think of a significant difference between pre- and post-test scores on some
construct that was predicted to change due to experimental treatment.
Factor Analysis (FA)
- Note: Your text briefly describes FA in chapter 4, p. 81-83.
- FA is an analytical/statistical technique based on correlation that takes a large number
of interrelated variables or items on a test/scale and reduces them to a smaller number of
latent or hidden dimensions that we refer to as factors.
- FA has helped researchers and test developers broaden studies of construct validity by allowing investigation of the underlying factors that a test is measuring.
- In confirmatory FA, test developers consider the underlying theory associated with the construct in question and propose a set of underlying factors that they expect the test to contain. Developers then conduct the FA to see whether the proposed factors exist. If the factors do exist, then this is considered good evidence of construct validity.
- Consider that you administered a 50-question test intended to measure self-control in young offenders to 100 troubled youth. After reviewing the literature you find that adolescent self-control has at least 3 underlying features, termed impulsiveness, self-centeredness, and physical activity. To see if the test you used has construct validity, you may want to run a FA and see if 3 underlying factors emerge from your test data.
- Factors are determined by examining each test question's relationship to the other test questions. As test questions group together, they form factors. These factors represent underlying dimensions of questions that measure the same attribute or trait.
- You can name the factors by looking at the questions that were grouped together to form each factor. Names are generally based on the content of the questions that group together or are most highly correlated.
- There are many ways of conducting a factor analysis, depending on basic assumptions such as whether the underlying factors are assumed to be uncorrelated (orthogonal) or correlated (oblique).
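The intuition behind FA (items driven by the same latent factor correlate strongly and group together) can be illustrated with a crude pure-Python simulation. This is not a real FA routine, and all data below are synthetic.

```python
# Crude stand-in for the idea behind factor analysis: items sharing a
# latent factor correlate strongly and can be grouped. Real analyses
# would use dedicated FA software; everything here is simulated.
import random

random.seed(1)
n = 300
factor1 = [random.gauss(0, 1) for _ in range(n)]  # e.g., impulsiveness
factor2 = [random.gauss(0, 1) for _ in range(n)]  # e.g., self-centeredness

def item(factor, noise=0.5):
    """Simulate one test question loading on a latent factor."""
    return [f + random.gauss(0, noise) for f in factor]

# Four test questions: two load on each latent factor.
items = {"q1": item(factor1), "q2": item(factor1),
         "q3": item(factor2), "q4": item(factor2)}

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

same_factor = pearson(items["q1"], items["q2"])   # high: shared factor
cross_factor = pearson(items["q1"], items["q3"])  # near zero
```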
- We'll do a few examples in class using SPSS.