Criterion-Related Validity:

  • How well do our tests predict behaviour or events? This is an important question for making decisions on the basis of our tests. The more strongly our test scores correlate with independent behaviours, attitudes, or events, the better our decisions will be and the greater our criterion-related validity will be.
  • For example, suppose a test developer wants to develop a test to be used in a clinical setting to diagnose depression. What criterion would you use to determine whether the test does indeed diagnose depression? Probably a diagnosis made by a psychologist or psychiatrist, independent of the test developer's test. If the test correlates well with the independent diagnosis of professionals, then the test is said to have criterion-related validity.
  • A criterion is simply the measure of performance that is correlated with the test scores. Put another way, the criterion is the measure that is used to determine how accurate your decision is.
  • The criterion for the GRE test is most often the student's grade point average.

Methods for Demonstrating Criterion-Related Validity:

  • Basically, there are 2 methods used for demonstrating criterion-related validity.
  • The Predictive Method:
  • The Concurrent Method:

The Predictive Method:

  • When you are trying to show a relationship between test scores and some future behaviour, the predictive method should be used to determine validity.
  • The general procedure for predictive validity is as follows:
  • 1) A large group of people take the test;
  • 2) The scores for those people are held for a predetermined period of time;
  • 3) Once the time period elapses, a measure of some behaviour (i.e., the criterion) is taken.
  • 4) The test scores are then correlated with the criterion scores.
  • 5) If the scores correlate, the test has predictive validity.
  • 6) The resulting correlation coefficient is called the validity coefficient.
  • Can you think of some situations that are meant for predictive validity?
  • What would the ideal predictive validation study be like? Is it realistic?
  • When determining predictive validity, it is important that everyone who took the test is also measured on the criterion. Why?
  • Because a restriction in the range of the distribution of test scores will lower your correlation.
  • This 'restriction of range' occurs fairly often in the real world. Why?
  • Consider an industrial setting where 1000 candidates apply for 100 jobs. Of all those who take the test, those who score low are usually weeded out, and only those with high scores are selected. Criterion scores (e.g., job performance) are then available only for the high scorers.
  • Thus, we need to make a correction in our correlation when we know that a restriction in range has occurred. See pg. 176 in your text.
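
To see why restriction of range lowers the correlation, here is a small simulation sketch. It is an illustration only, not a procedure from the text: the function names, the normally distributed scores, the population validity of .5, and the pool of 20,000 applicants (larger than the 1000 in the example, so the pattern is stable) are all assumptions. Only the top 10% of test scorers are "hired", and the correlation is computed once for everyone and once for the hired group.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_range_restriction(n_applicants=20000, keep_fraction=0.10,
                               true_r=0.5, seed=0):
    """Return (r in the full applicant pool, r among the top scorers only).

    Assumes standard-normal test scores and a criterion built to correlate
    with the test at roughly `true_r` in the full pool (hypothetical values).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_applicants):
        test_score = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        criterion = true_r * test_score + math.sqrt(1 - true_r ** 2) * noise
        pairs.append((test_score, criterion))

    r_full = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

    # Keep only the highest test scorers, as in a selection setting.
    n_keep = int(n_applicants * keep_fraction)
    hired = sorted(pairs, key=lambda p: p[0], reverse=True)[:n_keep]
    r_restricted = pearson_r([p[0] for p in hired], [p[1] for p in hired])
    return r_full, r_restricted


r_full, r_restricted = simulate_range_restriction()
print(f"r in full applicant pool: {r_full:.2f}")
print(f"r among hired (top 10%):  {r_restricted:.2f}")
```

The correlation among the hired group comes out well below the correlation in the full pool, even though the underlying relationship between test and criterion is the same.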

rc = {r (SDu / SDres)} / SQRT {1 - r^2 + r^2 (SDu^2 / SDres^2)}

  • Where:
  • rc = correlation corrected for restriction of range;
  • r = sample observed correlation;
  • SDu = standard deviation of the sample before range restriction;
  • SDres = standard deviation of sample after range restriction.
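
As a worked example of the correction formula, here is a minimal sketch. The function name and the input values (an observed r of .30, SDu = 10, SDres = 5) are hypothetical and chosen only for illustration.

```python
import math


def correct_for_range_restriction(r, sd_unrestricted, sd_restricted):
    """Apply the restriction-of-range correction given above.

    r               -- correlation observed in the restricted sample
    sd_unrestricted -- SD of test scores before range restriction (SDu)
    sd_restricted   -- SD of test scores after range restriction (SDres)
    """
    ratio = sd_unrestricted / sd_restricted
    return (r * ratio) / math.sqrt(1 - r ** 2 + r ** 2 * ratio ** 2)


# Hypothetical values: the restricted sample has half the spread of the
# full applicant group.
rc = correct_for_range_restriction(r=0.30, sd_unrestricted=10, sd_restricted=5)
print(f"corrected r = {rc:.2f}")  # about 0.53
```

Note that when SDu = SDres (no restriction) the formula returns the observed r unchanged, which is a quick sanity check on the implementation.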

The Concurrent Method:

  • Concurrent validity is the practical alternative to the ideal predictive method.
  • With concurrent validity you obtain at roughly the same time both test scores and criterion scores in some predetermined population. Once this is accomplished, you simply correlate test scores with the criterion scores.
  • What is the basic difference between concurrent and predictive validation strategies?
  • With predictive validation you use a more or less random sample of the population, and with concurrent validation you use a preselected sample. The preselected sample may differ from the population at large, so the concurrent method is not as powerful as the predictive method; in theory it underestimates the true population validity. However, most studies show that concurrent validities and predictive validities are very similar.
  • The concurrent method is more practical and thus is more commonly used than the predictive method.
  • The concurrent method does not predict! Instead it provides information about the present state of affairs and the status quo. Why is this distinction important? See pg. 177 of your text.
  • Can you think of some methods used for concurrent validity?
  • Since your selected sample may also be restricted in range, the adjustment formula for restriction of range can also apply here.

Interpreting Validity Coefficients:

  • In theory, validity coefficients have values that, like correlation, range from -1 to +1.
  • However, in practice most of the validity coefficients you'll see will be relatively small. Most fall in the .3 to .5 range, and few exceed .6 or .7. Thus, there's lots of room for improvement in most of our psychological measurements.
  • Suppose a test used to select graduate students has a criterion-related validity coefficient of rv = .5. How might you interpret this finding?
  • One way of interpreting the finding is to consider the squared correlation coefficient (rv^2). The squared coefficient gives you an indication of how much of the variation in the criterion can be accounted for by the predictor (your test). Thus, in our example, 25% of the variance in graduate student performance can be accounted for by our test. Or, 75% of the variance in graduate student performance cannot be accounted for by our test.
  • Are tests with such low criterion-related validity coefficients really helping us to make important decisions?
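
The squared-coefficient interpretation can be tabulated for the range of validity coefficients mentioned above. This is a minimal sketch; the function name is hypothetical, and the values .3 to .7 are simply the typical range cited in these notes.

```python
def variance_accounted_for(validity):
    """Proportion of criterion variance accounted for by the test (r squared)."""
    return validity ** 2


for rv in (0.3, 0.4, 0.5, 0.6, 0.7):
    explained = variance_accounted_for(rv)
    print(f"rv = {rv:.1f}: {explained:.0%} accounted for, "
          f"{1 - explained:.0%} not accounted for")
```

For rv = .5 this reproduces the 25% / 75% split from the example; even a coefficient of .7, which few tests reach, leaves about half of the criterion variance unaccounted for.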

Tests and Decisions:

  • The validity coefficient is only one factor, of many, that helps us to determine the degree to which a test may improve or detract from the quality of a decision.
  • You must also consider base rates and the selection ratio when making decisions.
  • The 'Base Rate' is considered to be the level of successful performance on the criterion in the general population. For example, if 80% of all applicants for a job at a bakery perform successfully, then the base rate is considered to be .80.
  • The 'selection ratio' is the ratio of the number of available positions to the number of applicants. For example, if the Psychology department is looking for 3 new professors and there are 9 applicants for the 3 positions, then the selection ratio is 33%. If there are 10 psychology graduate student positions at the UofM this year and 700 applicants, then the selection ratio is about 1.4%.
  • Consider the following example of accepting students into a graduate program on the prediction that they will succeed.
  • 1) You can accept the student. They will either succeed or fail.
  • 2) You can reject the student. They would have either succeeded or would have failed.
  • Suppose you want to evaluate the accuracy of your decisions. What would you do?
  • An easy way is to compare predictions with the decision outcome.
  • There are 4 possible outcomes associated with this evaluation.
  • 1) TP (true positive): a person is predicted to succeed and does succeed.
  • 2) TN (true negative): a person is predicted to fail (thus they are not accepted) and would have failed if they were accepted.
  • 3) FP (false positive): a person who is predicted to succeed actually fails.
  • 4) FN (false negative): a person is not accepted, but if given the chance would have been successful.
  • Outcomes labelled 'true' (TP and TN) represent accurate decisions, and outcomes labelled 'false' (FP and FN) represent inaccurate decisions, or errors.
  • Not surprisingly, both base rates and selection ratios will affect the quality of our decisions. How?
  • Consider a base rate of 90%. If 90% of the population can perform the criterion successfully, then it should not be too difficult to select a highly successful candidate (true positive). You could randomly pick from your applicants and be successful 9 out of 10 times.
  • Unfortunately, a high base rate also generates many false negatives. Why?
  • Would the addition of a measurement instrument to the decision-making process really help in situations where high base rates exist?
  • Can you think of some of the problems associated with very low base rates?
  • Tests are most likely to contribute to decisions when the base rate is around .5.
  • Selection ratio also affects the quality of our decisions. If only 10 people apply for 9 positions, then you do not have the luxury of being selective. Thus, whatever decision rule you use, the outcome will be much the same, because nearly everyone is accepted anyway.
  • Contrast this with 10 people applying for 1 position. Here the validity of the test will greatly impact the quality of the decision. Why?
  • See table 9-4, on pg. 185 of your text. It highlights the relationship between selection ratio and validity. With low selection ratios and a base rate = .5, tests with moderate validity (.4) greatly improve the quality of a decision.
  • Let's put this all together:
  • If you were to make your selection decision on the basis of randomness, the probability of a TP (true positive) would depend on the base rate (BR) and the selection ratio (SR).

    P (TP) = BR x SR

  • Thus, with a BR = .5 & a SR = .5, 25% of the decisions made at random will be TPs.
  • The probability of the remaining outcomes can be obtained by subtraction:
  • P (FN) = BR - P (TP) → P (FN) = .5 - .25 = .25 or 25%.
  • P (FP) = SR - P (TP) → P (FP) = .5 - .25 = .25 or 25%.
  • P (TN) = 1 - P (TP) - P (FP) - P (FN) → P (TN) = 1 - .25 - .25 - .25 = .25 or 25%.
  • In other words, with a BR = .5 and a SR = .5, 50% of all random decisions will be correct (TP and TN) and 50% will be incorrect (FP and FN).
  • These numbers provide a baseline against which we can evaluate how much our psychological tests improve our decision-making ability.
  • When a valid test is used to help us make a decision, the P (TP) increases to:

    P(TP) = BR x SR + rvxy SQRT {BR (1 - BR) SR (1 - SR)}

  • Thus, suppose we have the BR and SR as above, but we also use a test with a validity of .4 to help us with our decision. How much will our decision effectiveness change?
  • You do the math.
  • P (TP) = .35; P (FP) = .15; P (FN) = .15; P (TN) = .35.
  • Now 70% of our decisions will be correct. Is this a worthwhile gain? Consider Utility Theory in your decision.
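
The arithmetic above can be checked with a short sketch. The function name and the dictionary layout are hypothetical; the formulas are the ones given in these notes, with a validity of 0 standing in for purely random selection.

```python
import math


def decision_outcomes(base_rate, selection_ratio, validity=0.0):
    """Probabilities of the four decision outcomes.

    With validity = 0 the P(TP) formula reduces to BR x SR (random selection).
    The remaining outcomes are obtained by subtraction, as in the notes.
    """
    tp = (base_rate * selection_ratio
          + validity * math.sqrt(base_rate * (1 - base_rate)
                                 * selection_ratio * (1 - selection_ratio)))
    fn = base_rate - tp
    fp = selection_ratio - tp
    tn = 1 - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


for label, validity in (("random selection", 0.0), ("test with validity .4", 0.4)):
    outcomes = decision_outcomes(base_rate=0.5, selection_ratio=0.5,
                                 validity=validity)
    correct = outcomes["TP"] + outcomes["TN"]
    summary = ", ".join(f"P({k}) = {v:.2f}" for k, v in outcomes.items())
    print(f"{label}: {summary}; correct decisions = {correct:.0%}")
```

With BR = .5 and SR = .5 this gives 50% correct decisions at random and 70% with a validity of .4, matching the figures above. Trying other base rates and selection ratios (for example BR = .9, or SR = .9) is a quick way to explore the questions raised earlier about when a test adds little.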