Cognitive ability vs personality tests for selection

CATs vs. personality tests practice essay

Definition

The following post is an essay researched and written in anticipation of an exam question for the module Selection and Assessment in pursuit of the Organizational Psychology Master's qualification within the University of London's International Programmes (Birkbeck College). Excerpts may be used with the citation:

Aylsworth, J. (2010). CATS versus personality measures: New implications for fairness. (url) Accessed: (Month year).

Exam essay question:
"Using a broad range of appropriate criteria, discuss the use of psychometric measures as a method of selection"

CATs versus Personality Measures:
New Implications for Fairness

Introduction
Psychometric (psychological) tests formally and systematically measure certain psychological characteristics that are assumed to play an important role in predicting job performance (Dewberry, 2009). We can think of “job performance” as “the total expected value to the organization of the discrete behavioral episodes that an individual carries out over a standard period of time” (Motowidlo, 2003).

The question will be addressed in two parts. Part 1 will compare cognitive ability tests (CATs) and personality measures with regard to domain, tests and accountability. Part 2 will examine validity and fairness. We will conclude that selection methods should continue to be evaluated by our familiar criteria of validity, reliability, acceptability, usability, fairness, generality and utility. However, we may need to expand our view of fairness, with regard to personality measures in particular.

Part 1: Domain, Tests and Accountability
CATs and personality measures are the psychometric tests most often used in selection (Dewberry, 2009). Murphy & Dzieweczynski (2005) help us understand CATs versus personality measures based on differences in domain, the tests themselves, and accountability.

Domain. CATs all relate to the idea of “g” (Spearman, 1904), a “general” factor representing intelligence. CATs are tests in the true sense: the higher the score, the higher the expected job performance. Because CATs correlate positively with one another (a positive manifold), the implication for practice is that any well-constructed CAT should be about as good as the next at predicting job performance.

Personality tests are only tests in the sense that they “test” for a particular attribute that the organization may desire for a particular job. The Big Five (Fiske, 1979; in Goldberg, 1993) dominates personality research at present. The taxonomy consists of conscientiousness, extraversion, agreeableness, openness to experience, and emotional stability (formerly neuroticism). Correlations among these facets are weak at best (Digman, 1990). In relation to job performance, correlations with personality measures may actually be U-shaped, which means that either too little or too much of a characteristic could be dysfunctional. This is consistent with individuals high in agreeableness being judged less favorably than those with low levels of agreeableness (Bartram, 2005) on overall job performance.

Differences. CATs require a demonstration of ability and are tests of maximal performance. Personality measures, however, demonstrate only the individual’s self-reported perception of past behavior and are tests of typical performance.

Accountability. Based on validity and fairness, CATs have been held accountable by courts of law, while personality tests have rarely been challenged, except on the basis of privacy (e.g. Merenda, 1995).

Part 2: Validity and Fairness
Validity. The validity coefficient (VC) tells us how strongly a particular selection method correlates with performance. It is based on criterion-related validity, which is a characteristic of the relationship between the selection method and the particular performance criterion being measured. On average, CATs have a VC of 0.51, while no personality measure reaches higher than conscientiousness at 0.31 (Schmidt & Hunter, 1998). While this may seem to argue for routine use of CATs, it does not. We need to take a closer look.

Using VC-squared times 100 to calculate the percentage of variance explained, CATs account for 26 percent of the variance in job performance, while conscientiousness accounts for 10 percent. Conversely, this means that variables other than CATs and conscientiousness account for 74 percent and 90 percent of the variance, respectively. So, while CATs are near the top of the criterion-related validity list, they are far from the final word in predicting job performance.
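The arithmetic above can be checked with a short sketch. It only squares the two validity coefficients quoted from Schmidt & Hunter (1998); the function and variable names are illustrative and not taken from any source.

```python
def variance_explained(validity_coefficient: float) -> float:
    """Percentage of variance in job performance accounted for (VC squared times 100)."""
    return validity_coefficient ** 2 * 100


# Validity coefficients as quoted in the essay (Schmidt & Hunter, 1998).
coefficients = {"CATs": 0.51, "conscientiousness": 0.31}

for predictor, vc in coefficients.items():
    explained = variance_explained(vc)
    unexplained = 100 - explained
    print(f"{predictor}: {explained:.0f}% explained, {unexplained:.0f}% unexplained")
```

Rounded to whole percentages, this reproduces the 26/74 and 10/90 splits given above.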

CATs’ criterion-related validity appears to diminish as the skill level of the job decreases. CATs have VCs of 0.40 and 0.23, respectively, for semi-skilled and unskilled jobs (Schmidt & Hunter, 1998). So, for an unskilled job, a valid measure of conscientiousness – with a VC of 0.31 – is a better choice than a CAT on the basis of criterion-related validity. At the other extreme, CATs have a VC of 0.58 for the highest-level professional and managerial jobs. Thus, for C-level and other upper-level management positions requiring good decision-making under difficult conditions, CATs should be a more valid performance predictor than work sample tests, which have a VC of 0.54. Work sample tests are also typically more expensive to administer.

Fairness. CATs meet Cleary’s (1968) standard of fairness, which requires only a comparison of final scores and does not consider group membership. However, CATs do not satisfy Thorndike (1971). He views fairness as each group’s probability of being selected and then performing the job at an acceptable level, although “acceptable level” is a bit murky. Thorndike’s view of fairness accommodates adverse impact, which means that selection is more likely among members of some groups versus others.

CATs are not “fair” by Thorndike’s definition because, for example, blacks as a group score about one standard deviation lower than whites (Arvey et al., 1994). Statistical methods, such as banding and differential item functioning, can be used to offset this impact, or other selection methods could be considered instead. For example, structured interviews have the same VC as CATs on average – 0.51.
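To see how a one-standard-deviation gap in group means can turn into adverse impact, consider the following minimal sketch. It rests on assumptions that are not in the sources cited here: scores in both groups are normally distributed with equal spread, selection is based on the test score alone, and the cut score is set, hypothetically, at the higher-scoring group's mean. All names are illustrative.

```python
from statistics import NormalDist


def selection_rate(group_mean: float, cutoff: float, sd: float = 1.0) -> float:
    """Proportion of a normally distributed group scoring at or above the cutoff."""
    return 1 - NormalDist(mu=group_mean, sigma=sd).cdf(cutoff)


# Scores are expressed in standard-deviation units (assumption: equal SDs).
cutoff = 0.0  # hypothetical cut score at the higher-scoring group's mean
higher_group_rate = selection_rate(group_mean=0.0, cutoff=cutoff)
lower_group_rate = selection_rate(group_mean=-1.0, cutoff=cutoff)

print(f"Higher-scoring group selected: {higher_group_rate:.1%}")
print(f"Lower-scoring group selected:  {lower_group_rate:.1%}")
```

Under these assumptions about half of the higher-scoring group clears the cut score, against roughly 16 percent of the lower-scoring group. A different cut score changes both rates, so the figures illustrate the mechanism rather than estimate any real selection outcome.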

Cognitive (learning) styles are not theoretically mature enough to be used for selection (Kozhevnikov, 2007), so they cannot be challenged on issues of fairness. However, we cannot assume that they would get a free ride. We believe the same to be the case with personality measures. All of the Big Five (Costa & McCrae, 1992) and locus of control (Miller & Rose, 1982) are believed to be partly inherited. Heritability suggests the existence of psychological groups, which would allow us to look at correlations of those groups with CATs and, subsequently, adverse impact.

In addition, distinctions between states and traits are not as clear-cut as we might like to believe (Chaplin et al., 1988), and traits (e.g. neuroticism) do not appear to be entirely stable (Hampson, 2006). This calls the validity of fairness measures into question because it suggests that they are susceptible to the influence of context – e.g. situational and temporal factors.

Conclusion
We have examined CATs and personality measures with regard to differences in domain, tests and accountability, and we have also looked at issues of validity and fairness. We conclude that selection methods should continue to be evaluated by our familiar criteria of validity, reliability, acceptability, usability, fairness, generality and utility. However, our view of validity and fairness, in particular, may need to be expanded.

Exam performance: This essay was used – but adapted – under exam conditions because the question as it appeared on the exam was slightly different from the version that informed the practice draft above. The actual exam answer was marked only as a "pass," possibly due to a combination of the following factors: 1) The writer experienced a cognitive lapse in correctly referencing an important definition, 2) Sources cited for partial heritability of personality constructs, lack of clarity between states and traits, and possible lack of trait stability were from readings outside of the module, 3) Examiners may simply have found that the argument did not follow a logical structure that supported the conclusion. Since the year-end exam mark for each module is that module's only – and ultimate – mark within this program of study, this example illustrates an important point: A module's mark does not always accurately reflect a student's investment in exam preparation or the scope and breadth of learning available for application in real-world environments.