Validity

  • Often people believe that the title of a test tells them what the test measures. This strategy is a poor one to use.
  • For instance, consider a test titled, 'The Math Achievement Test'. It may measure the broader concept of academic achievement, or the narrower concept of achievement in multiplication/division, or it may even measure a very different attribute like general intelligence.
  • While reliability tells us whether a test is measuring something consistently, only validity can provide us with information about what a test really measures.
  • Validity then is evidence that a test is being used appropriately and measures what it is supposed to measure.

Major Types of Validity

  • The APA's Standards for Educational and Psychological Testing (1985) and specialists in psychological testing generally agree that there are 3 ways of deciding whether a test is valid enough to be useful. Those 3 strategies are named:
  • Content Validity;
  • Construct Validity;
  • Criterion-Related Validity (often referred to as Predictive & Concurrent Validity).
  • Another type of validity, termed Face Validity, is not recognized as a primary type of validity, but is commonly used by test developers and test users.

The Use of Validation Strategies

  • A valid test is one that does the job it is supposed to do. It measures the construct it is supposed to measure or predicts the outcome it claims to predict.
  • For example, your take home test is an achievement test that is supposed to measure how well you understand or have mastered the content of the first half of this course. Employment tests, on the other hand, are supposed to predict future job performance.
  • With this in mind, tests must be used for their intended purposes. Neither a valid reading test nor your take home test should be used as a valid measure of intelligence.
  • Some tests measure concrete attributes like the ability to throw a baseball. Most persons agree on the specific behaviours associated with throwing a baseball.
  • Other tests measure abstract attributes like love, personality, intelligence, or creativity. These attributes are more difficult to describe because many people disagree on which behaviours represent them. What does it mean to be aggressive?
  • Thus, depending on the type of test you have it is important to understand the different types of validity, what they mean, and when they should be used.
  • For example, for achievement tests it is important and relatively easy to demonstrate content validity (see below).
  • However, gathering content validity for something abstract, like personality, may be more difficult (but not necessarily less important).
  • Criterion-related validity is generally used for tests that claim to predict outcomes. Can you think of some examples?
  • Construct validity is appropriate when your test is measuring an abstract concept like beauty. Construct validity involves the accumulation of a variety of evidence (reliability and other types of validity) that shows the test is functioning as it was intended to. For example, your test of beauty may correlate highly with another test of beauty, or may cause specific behaviours (e.g., dilation of pupils) as predicted by some theory.

Content Validity

  • One of the simplest ways to obtain evidence for the validation of a test is to examine the content of the test.
  • Content validity then is the extent to which the questions on a test are representative of the trait, behaviour, or attribute that is being measured.
  • Content validity focuses on the questions of the test. In this it differs from criterion-related and construct validity, which correlate test scores with other measures of performance.
  • As such, content validity need not involve a statistical procedure.
  • How would you show that the following are content-valid?
  • Classroom statistical achievement test.
  • Employment test to measure mechanical ability.
  • Paper-and-pencil test of 'life attitude'.
  • How do you go about obtaining evidence of content validity?
  • A) Systematically defining the Testing Universe: The testing universe or content domain is the set of all possible behaviours of the attribute or trait being measured. This is typically done before the test is developed and gives the user confidence that the test is representative.
  • B) Expert ratings. Once the test is developed, experts should be consulted so they can evaluate how relevant each test question is to what is being measured.

The Testing Universe or Content Domain

  • The first step for the development of any test is to determine the testing universe (i.e., content domain)--the set of knowledge or behaviours the test represents. Usually, this step involves locating theoretical or empirical research, talking with experts, or reviewing other similar instruments.
  • A content-valid test will representatively sample the testing universe.
  • Content domains often have defined boundaries, and can usually be structured into distinct subcategories (see pg. 149). Describing the boundaries and categories of a content domain facilitates test question development (i.e., it is easy to see if and where a specific question may fit in) and is crucial in evaluating content validity.
  • There are no formal statistical measures of content validity. It is a judgement call. However, the judgements are not made haphazardly or arbitrarily. The general procedure for determining content validity is simple.
  • 1) Describe the content domain and subcategories of the testing universe (this is the most difficult step).
  • 2) Determine where each test item fits with respect to the testing universe.
  • 3) Compare the test's structure with the structure of the testing universe.
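  • The three steps above can be sketched in a few lines of Python. This is only an illustration: the subcategories, their intended weights, and the item assignments are hypothetical, and the comparison is a simple bookkeeping aid for the judgement call, not a statistical measure of content validity.

```python
from collections import Counter

# Step 1 (hypothetical): subcategories of the content domain and the share
# of the test each one is intended to receive.
blueprint = {
    "descriptive statistics": 0.40,
    "correlation": 0.30,
    "hypothesis testing": 0.30,
}

# Step 2 (hypothetical): the subcategory each of the 20 test items was judged to fit.
item_categories = (
    ["descriptive statistics"] * 10
    + ["correlation"] * 3
    + ["hypothesis testing"] * 7
)

# Step 3: compare the structure of the test with the structure of the domain.
counts = Counter(item_categories)
n_items = len(item_categories)
gaps = {
    category: counts.get(category, 0) / n_items - weight
    for category, weight in blueprint.items()
}
# Items that fall outside the defined boundaries of the domain.
unmapped = [category for category in counts if category not in blueprint]

for category, gap in gaps.items():
    print(f"{category}: {gap:+.2f} relative to the blueprint")
```

  • A positive gap means the subcategory is over-represented on the test, and a negative gap means it is under-represented. Deciding how large a gap is acceptable remains a judgement.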
  • Although there is no statistical measure of content validity, tests that provide more detail about the structure and boundaries of content domains generate more confidence about content validity.
  • Content validity by itself cannot guarantee the validity of a measure! Why?
  • After the development of a test, test-users should not assume that a test is content valid. Publishers should provide evidence, in the test manual, that a test demonstrates content validity. What types of evidence should they provide?
  • Content validity ratios! For each question, this is the proportion of experts who rate the question as essential (i.e., the number of experts rating it essential divided by the total number of experts). A question is usually described as content valid when more than half of the experts deem it essential.
  • You may have noticed that content validity is conceptually similar to reliability.
  • The main difference between the two is that content validity places an emphasis on providing a detailed description of the content domain. In comparison, reliability assumes that a domain exists, but makes little effort to define it. Thus, you could have a reliable test that has little content validity.
  • Can you think of an example where content validity could be used to establish the validity of a decision on the basis of test scores?
  • Consider personnel selection for a statistical position with Manitoba Health.
  • What would you do to determine the quality of an applicant? (see text pg. 153).
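  • The content validity ratio described earlier in this section can be computed as below. The expert ratings are hypothetical. The first function is the simple proportion given in these notes. The second is Lawshe's (1975) rescaled version, which is not described in these notes and is included only for comparison: it is 0 when exactly half of the experts rate a question essential and 1 when all of them do.

```python
def proportion_essential(ratings):
    """Proportion of experts who rated the question essential (True)."""
    return sum(ratings) / len(ratings)


def lawshe_cvr(ratings):
    """Lawshe's content validity ratio: (n_e - N/2) / (N/2)."""
    n_experts = len(ratings)
    n_essential = sum(ratings)
    return (n_essential - n_experts / 2) / (n_experts / 2)


# Hypothetical ratings from 8 experts for three questions (True = "essential").
expert_ratings = {
    "Q1": [True, True, True, True, True, True, False, False],
    "Q2": [True, True, True, True, False, False, False, False],
    "Q3": [True, True, False, False, False, False, False, False],
}

results = {
    question: (proportion_essential(r), lawshe_cvr(r))
    for question, r in expert_ratings.items()
}

for question, (proportion, cvr) in results.items():
    print(f"{question}: proportion = {proportion:.2f}, Lawshe CVR = {cvr:+.2f}")
```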

Face Validity

  • Face validity tells us nothing about what a test actually measures.
  • Face validity refers to how test takers perceive the attractiveness and appropriateness of a test. Why then is it important?
  • If test takers consider the test to have face validity, they may make a more conscientious effort to complete the test. If a test does not have face validity, they might hurry through it and take it less seriously.

Construct Validity

  • Construct validity is defined as the extent to which a test measures some theoretical construct.
  • The process of establishing construct validity for a test is somewhat tedious, and requires the gradual accumulation of evidence that the test's scores relate to observable behaviours in the way predicted by the underlying theory.
  • Note: If you accept the evidence provided by construct validity, then you are obligated to accept the underlying definition of the construct used in the process of validation. In other words, you accept the definition provided by those who developed and validated the test.
  • What is a Construct?
  • Constructs are attributes that exist in the theoretical sense. Thus, they do not exist in either the literal or physical sense. Despite this, we can observe and measure behaviours that provide evidence of these constructs.
  • For example, consider gravity. We cannot see gravity, but we can see what we assume to be its results: a falling apple.
  • Definitions of constructs often vary from person to person, even among persons who are considered experts in an area of study. For example, take the construct of alcoholism. If we surveyed the class, how many different definitions would we generate?
  • Consider the construct introduced by Bandura (1977) of self-efficacy. It is defined as a person's expectations about his or her own competence and ability to accomplish an activity or task.
  • From the model, Bandura (1977) proposed the following about the construct of self-efficacy: "expectations of personal efficacy determine whether coping behavior will be initiated, how much effort will be expended, and how long it will be sustained in the face of obstacles and aversive experiences."
  • Since our ability to measure an abstract concept like self-efficacy depends on our ability to observe and measure related behaviour, how should we go about defining or explaining a psychological construct?
  • Your text describes 3 steps, referred to as construct explication, which outline the process of defining a construct.
    • 1) Identify the behaviours that relate to the construct. The more you can generate the better able you are to define the construct.
    • 2) Identify other constructs that may be related or unrelated to the construct being explained. This will help determine the boundaries of the construct.
    • 3) Identify behaviours related to these similar and dissimilar constructs and determine whether these behaviours are related to the current construct being measured.
  • We'll do an example in class, and your text also gives an excellent example (p. 157).
  • Once you've completed your detailed descriptions of the relationships between sets of constructs and their behavioural universes, you've now generated what is referred to as a nomological network.
  • The nomological network or method defines constructs by illustrating their relationship to as many other constructs and behaviours as possible. This network then provides the starting point for establishing a test's construct validity since it provides a number of hypotheses about the behaviours that people who have small or large amounts of the construct should display.

Gathering Evidence for Construct Validity

  • There are 2 main ways that we can obtain scientific evidence for construct validity:
  • 1) Gathering Theoretical Evidence: (see above)
  • The nomological network;
  • Proposal of experimental hypotheses;
  • 2) Gathering Psychometric evidence:
  • Evidence of reliability;
  • Convergent and discriminant validity;
  • Experimental interventions;
  • Let's consider some of the psychometric evidence in a bit more detail.
  • Reliability:
  • Recall that reliability is a necessary characteristic for a psychological test.
  • High reliability scores generally indicate that a single theoretical construct is present.
  • Also, psychological testing theory suggests that a test should not have a stronger correlation with any other variable than it does with itself. With this in mind, reliability estimates may be used to evaluate the relative strength of a test's correlations with other variables that are related to the theoretical construct.
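  • One common way to make this use of reliability estimates concrete comes from classical test theory, where the observed correlation between two measures is limited by the square root of the product of their reliabilities. The sketch below assumes that framework (it is not taken from your text) and uses hypothetical reliability and correlation values.

```python
import math


def max_observed_correlation(rel_x, rel_y):
    """Ceiling on the observed correlation implied by the two reliabilities
    under classical test theory assumptions."""
    return math.sqrt(rel_x * rel_y)


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimated correlation between true scores (Spearman's correction)."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values: reliability of the test, reliability of the other
# measure, and the observed correlation between the two.
rel_test = 0.81
rel_other = 0.64
observed_r = 0.42

ceiling = max_observed_correlation(rel_test, rel_other)
corrected = correct_for_attenuation(observed_r, rel_test, rel_other)

print(f"ceiling on observed r: {ceiling:.2f}")
print(f"observed r as a share of the ceiling: {observed_r / ceiling:.2f}")
print(f"correlation corrected for attenuation: {corrected:.2f}")
```

  • Read this way, an observed correlation that looks modest may still be fairly strong relative to what the reliabilities of the two measures allow.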
  • Convergent and Discriminant Validity:
  • If a test is construct valid, then we should expect that our test's scores will correlate strongly with the scores on other tests that measure the same construct. This is termed convergent validity.
  • This raises another intriguing question. If we already have a test that measures some construct, why would we want to develop another one?
  • Discriminant validity is the opposite of convergent validity. If different constructs are not considered to be related, then we should expect to find no correlation between test scores measuring these different constructs.
  • The Multitrait-Multimethod Design: This method creatively combines the need to collect evidence of convergent and discriminant validity, and reliability into one research study.
  • With this method researchers can test for convergence across different measures of the same construct, and for divergence between measures of related but conceptually different constructs.
  • In essence, you choose 3 constructs that are unrelated in theory and 3 different types of tests (e.g., maximal performance, projective, and peer review) that measure each of the constructs.
  • You then collect data from each participant in the study on each construct using each method. Each person should have 9 scores (3 constructs x 3 methods).
  • You should then generate a correlation matrix. Your headings for the horizontal and vertical axes of the matrix should be identical, and include the method name and construct being measured. Your correlation matrix should have 81 possible values (9 x 9). See table 8.7, p. 163 of your text.
  • We'll go through the table in class.
  • The multitrait-multimethod design is an efficient and very informative method for studying construct validity. Be sure you understand it. It makes for a great final exam question! Hint, hint!!
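  • To see what the 9 x 9 matrix looks like, here is a small simulation sketch. Everything in it is hypothetical: the construct names, the method names, the sample size, and the way scores are generated (a trait score plus a small method effect plus random error). It is not a reproduction of the table in your text.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
constructs = ["construct A", "construct B", "construct C"]  # hypothetical
methods = ["performance", "projective", "peer review"]      # hypothetical
n_people = 500

# One row/column heading per method-construct combination: 9 in total.
labels = [(m, c) for m in methods for c in constructs]
scores = {label: [] for label in labels}

for _ in range(n_people):
    trait = {c: rng.gauss(0, 1) for c in constructs}      # unrelated constructs
    method_effect = {m: rng.gauss(0, 1) for m in methods}
    for m, c in labels:
        # Each person ends up with 9 scores.
        scores[(m, c)].append(trait[c] + 0.3 * method_effect[m] + 0.6 * rng.gauss(0, 1))

# 9 x 9 matrix with identical row and column headings (81 cells).
# The diagonal is trivially 1 here; a real study would report reliabilities there.
matrix = [[pearson(scores[a], scores[b]) for b in labels] for a in labels]

pairs = [(i, j) for i in range(len(labels)) for j in range(i + 1, len(labels))]
# Same construct, different method: evidence of convergent validity.
convergent = [matrix[i][j] for i, j in pairs if labels[i][1] == labels[j][1]]
# Different constructs: evidence of discriminant validity.
discriminant = [matrix[i][j] for i, j in pairs if labels[i][1] != labels[j][1]]

mean_convergent = sum(convergent) / len(convergent)
mean_discriminant = sum(discriminant) / len(discriminant)
print(f"mean convergent r   = {mean_convergent:.2f}")
print(f"mean discriminant r = {mean_discriminant:.2f}")
```

  • Of the 81 cells, 9 lie on the diagonal and the remaining 72 are 36 unique correlations, each appearing twice. In this sketch 9 of those are convergent and 27 are discriminant.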
  • Experimental Intervention:
  • When a test is used as an independent or dependent variable in a research study, the results of the study can make a substantial contribution to the argument of construct validity. How?
  • Hint: Think of a significant difference between pre- and post-test scores on some construct that was predicted to change due to experimental treatment.
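  • Following the hint, one simple way to examine a pre- to post-test change is a paired t statistic on the difference scores. The scores below are hypothetical, and the choice of a paired t-test is an assumption made for illustration; a real study would also want a comparison group.

```python
import math
import statistics

# Hypothetical scores for 8 participants on a test of the construct,
# before and after a treatment that theory predicts should raise scores.
pre_scores = [12, 15, 11, 14, 13, 16, 10, 12]
post_scores = [15, 18, 13, 17, 15, 19, 13, 14]

differences = [post - pre for pre, post in zip(pre_scores, post_scores)]
n = len(differences)

mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)         # sample standard deviation
t_statistic = mean_diff / (sd_diff / math.sqrt(n))
degrees_of_freedom = n - 1

# For 7 degrees of freedom, the two-tailed critical value at alpha = .05 is about 2.365.
critical_t = 2.365
print(f"mean change = {mean_diff}, t({degrees_of_freedom}) = {t_statistic:.2f}")
print("predicted change supported" if t_statistic > critical_t else "no reliable change")
```

  • If the test detects the change that the theory predicted, that result is one more piece of evidence that the test measures the construct.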

Factor Analysis (FA)

- Note: Your text briefly describes FA in chapter 4, p. 81-83.

  • FA is an analytical/statistical technique based on correlation that takes a large number of interrelated variables or items on a test/scale and reduces them to a smaller number of latent or hidden dimensions that we refer to as factors.
  • FA has helped researchers and test developers to broaden studies of construct validity by allowing investigation of the underlying factors that a test is measuring.
  • Using FA allows test developers to consider the underlying theory associated with the construct in question, and to propose a set of underlying factors that they expect the test to contain. Developers then conduct FA to see whether the factors they proposed exist. This is called confirmatory FA. If the factors do exist, then this is considered good evidence of construct validity.
  • Suppose you administered a 50-question test intended to measure self-control in young offenders to 100 troubled youth. After reviewing the literature you find that adolescent self-control has at least 3 underlying features termed impulsiveness, self-centeredness, and physical activity. To see if the test you used has construct validity you may want to run a FA and see if 3 underlying factors emerge from your test sample.
  • Factors are determined by examining how each test question relates to the other test questions. As test questions begin to group together, they begin to form factors. These factors represent underlying dimensions: groups of questions that measure the same attribute or trait.
  • You can name the factors by looking at the questions that were grouped together to form each factor. Names are generally based on the content of the questions that group together or are most highly correlated with the factor.
  • There are many ways of conducting a factor analysis. The choice depends on basic assumptions, such as whether the underlying factors are uncorrelated (independent) or correlated (dependent).
  • We'll do a few examples in class using SPSS.
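  • Before the SPSS examples, here is a rough Python sketch of the idea that correlated questions group into factors. It assumes NumPy is available and uses a hypothetical correlation matrix for 9 questions built from 3 clusters of 3. It only counts eigenvalues greater than 1 (a common rule of thumb for the number of factors, based on the principal components of the correlation matrix). It is not a full factor analysis with factor extraction and rotation.

```python
import numpy as np

# Hypothetical structure: 3 clusters of 3 questions each.
n_clusters, items_per_cluster = 3, 3
within_r, between_r = 0.6, 0.1  # hypothetical correlations

cluster_of = np.repeat(np.arange(n_clusters), items_per_cluster)
same_cluster = cluster_of[:, None] == cluster_of[None, :]

# Correlation matrix: questions in the same cluster correlate more strongly.
R = np.where(same_cluster, within_r, between_r)
np.fill_diagonal(R, 1.0)

# Eigenvalues of the correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(R)[::-1]

# Rule of thumb: retain as many factors as there are eigenvalues above 1.
n_factors = int(np.sum(eigenvalues > 1.0))

print("eigenvalues:", np.round(eigenvalues, 2))
print("number of factors suggested:", n_factors)
```

  • With real test data you would first compute the correlation matrix from the item responses (for example with np.corrcoef) and then inspect which questions load on which factor before naming the factors.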