A Critique of Psychological Tests Commonly Used with Chronic Pain Patients
by William W. Deardorff, Ph.D., ABPP
Course content © Copyright 2011 - 2024 by William W. Deardorff, Ph.D., ABPP. All rights reserved.
Course Outline
Introduction
Learning Objectives
The Basics of Statistics in Psychological Testing
  Reliability
    Test-Retest Reliability
    Inter-rater Reliability
    Parallel Forms Reliability
    Internal Consistency Reliability
  Validity
    Content Validity
    Face Validity
    Criterion-related Validity
      Concurrent validity
      Predictive validity
    Convergent validity
    Discriminant validity
    Construct validity
  Generalizability
  Standardization and Normative Samples
Overview of Psychological Testing of Chronic Pain Patients
  Types of Instruments
    Broadband-General
    Broadband-Health
    Narrow Focus
    Narrow Focus – Health
  Examples of Psychometric Tests Used with Chronic Pain Patients
Psychological Tests Commonly Used in the Assessment of Chronic Pain Patients: Attributes, Strengths and Weaknesses – Published by the Colorado Division of Workers’ Compensation
Introduction
The various Practice Guides relating to the assessment of chronic pain (including ACOEM and ODG) discuss the use of psychological testing. Psychological testing is one of the most objective components of assessing a chronic pain patient. However, to achieve valid results, the clinician must understand how psychological tests behave with this special population. As will be discussed in this course, many psychological tests have been specifically designed for use with chronic pain patients. However, the vast majority of tests used with chronic pain patients (e.g. the MMPI-2) were not originally designed for this purpose. Being aware of the strengths and weaknesses of various psychological tests when they are used with chronic pain patients is essential to assessing the validity of the results. An understanding of these issues is important for all practitioners who evaluate and treat chronic pain patients.
Please note that for this course, the individual Help-Feature for each question on the test is available only for the material that is presented on the web site. It is not available for the document to be reviewed which constitutes the second part of the course. However, you can take the test as many times as you like, until you pass. You will receive feedback each time you submit the test for scoring, until you pass. We hope you enjoy the course and reviewing this valuable information.
The Basics of Statistics in Psychological Testing
To evaluate a psychological test, either in general or for use with pain patients, one must have some familiarity with the statistics of psychometrics. For psychologists and others who are trained to do testing, this will be review material. For anyone else who works with chronic pain patients who have undergone psychological testing, an understanding of these concepts will make it possible to judge the validity of the results presented in a testing report.
Reliability
Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly. For example, if a test is designed to measure a trait (such as introversion), then each time the test is administered to a subject, the results should be approximately the same. Unfortunately, it is impossible to calculate reliability exactly, but there are several different ways to estimate it. Reliability is the extent to which a test is repeatable and yields consistent scores.
A perfectly reliable test is one that is completely accurate and free from error. In other words, the same test, given to the same individual in the same way, should always yield the same value from moment to moment, assuming the thing being measured has not itself changed. This also assumes that any change in the test results is due only to a change in the thing being measured, and not to the imperfection of the test. This is the reliability one strives for but never achieves. The portion of a score that reflects the thing being measured is the true variance. All psychological tests also have some degree of measurement error (error variance): variation in the test score that is not related to the thing being measured; it is the imperfection of the test. Every test tries to maximize true variance and minimize error variance.
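The relationship between true variance and error variance can be illustrated with a small simulation (a hypothetical sketch, not part of the course material): each observed score is modeled as a stable true score plus random measurement error, and reliability is estimated as the proportion of observed-score variance that is true variance.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical model: observed score = true score + measurement error.
true_scores = [random.gauss(50, 10) for _ in range(1000)]   # true variance ~ 10^2
errors = [random.gauss(0, 5) for _ in range(1000)]          # error variance ~ 5^2
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability framed as the share of observed variance that is true variance.
reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(f"estimated reliability: {reliability:.2f}")  # close to 100 / (100 + 25) = 0.80
```

A test with no measurement error would have reliability 1.0; as error variance grows relative to true variance, the estimate falls toward 0.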
Test-Retest Reliability
The test-retest method of estimating a test's reliability involves administering the test to the same group of people at least twice; the first set of scores is then correlated with the second set. Test-retest correlations range between 0 (low reliability) and 1 (high reliability). This kind of reliability is used to assess the consistency of a test across time, and it assumes that there will be no change in the quality or construct being measured. Test-retest reliability is therefore best suited to attributes that are stable over time, such as intelligence. Generally, reliability will be higher when little time has passed between tests and lower when more time has passed. Reliability is negatively impacted by measurement error, so one wants a test with low measurement error. Change due to measurement error is not related to actual changes in the variable being measured (e.g. if you use a tape measure to measure a room on two different days, any difference in the results is likely due to measurement error rather than a change in the room size). The test-retest reliability of tests that assess variables expected to change over time (e.g. level of depression) is evaluated with short test-retest intervals and by other methods.

Inter-rater Reliability
This type of reliability is assessed by having two or more independent judges score the test. The scores are then compared to determine the consistency of the raters’ estimates. One way to test inter-rater reliability is to have each rater assign each test item a score. For example, each rater might score items on a scale from 1 to 10. Next, you would calculate the correlation between the two sets of ratings to determine the level of inter-rater reliability. Another means of testing inter-rater reliability is to have raters determine which category each observation falls into and then calculate the percentage of agreement between the raters. So, if the raters agree 8 out of 10 times, the test has an 80% inter-rater reliability rate. Inter-rater reliability is certainly important for such measures as the GAF. For instance, if two practitioners independently assessed a patient and each assigned a GAF, how closely would those results correlate? If the practitioners are well trained in the use of the GAF, the inter-rater reliability should be high.
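Both approaches to inter-rater reliability described above can be sketched in a few lines (all ratings below are hypothetical):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two lists of ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-10 item scores assigned by two independent raters.
rater_1 = [7, 4, 9, 5, 8, 3, 6, 7]
rater_2 = [8, 4, 9, 6, 7, 3, 5, 7]
print(f"correlation between raters: {pearson_r(rater_1, rater_2):.2f}")

# Categorical approach: percentage of observations placed in the same category.
cat_1 = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes", "no", "yes"]
cat_2 = ["yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
agreement = sum(a == b for a, b in zip(cat_1, cat_2)) / len(cat_1)
print(f"agreement: {agreement:.0%}")  # 8 of 10 pairs match -> 80%
```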
Parallel-Forms Reliability
Parallel-forms (or alternate forms) reliability is gauged by comparing two different tests that were created using the same content. This is accomplished by creating a large pool of test items that measure the same quality and then randomly dividing the items into two separate tests. The two tests should then be administered to the same subjects at the same time.
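The random division of an item pool into two forms can be sketched as follows (the item pool is hypothetical):

```python
import random

random.seed(1)  # fixed seed so the split is repeatable

# Hypothetical pool of 20 items that all measure the same quality.
item_pool = [f"item_{i:02d}" for i in range(1, 21)]

# Randomly divide the pool into two non-overlapping 10-item parallel forms.
shuffled = random.sample(item_pool, k=len(item_pool))
form_a, form_b = shuffled[:10], shuffled[10:]
print(form_a)
print(form_b)
```

Both forms would then be administered to the same subjects, and the correlation between scores on the two forms estimates the parallel-forms reliability.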
Internal Consistency Reliability
This form of reliability is used to judge the consistency of results across items on the same test. Essentially, you are comparing test items that measure the same construct to determine the test’s internal consistency. When you see a question that seems very similar to another test question, it may indicate that the two questions are being used to gauge reliability. Because the two questions are similar and designed to measure the same thing, the test taker should answer both questions the same, which would indicate that the test has internal consistency.
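Internal consistency is commonly summarized with Cronbach's alpha, which compares the variance of individual items to the variance of total scores. The course text does not present this formula; the sketch below uses the standard computation with hypothetical data:

```python
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of per-item score lists (one list per item)."""
    k = len(item_scores)
    item_vars = sum(statistics.variance(item) for item in item_scores)
    totals = [sum(person) for person in zip(*item_scores)]   # total score per person
    total_var = statistics.variance(totals)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 4-item scale answered by six people (rows = items, columns = people).
items = [
    [3, 4, 3, 5, 1, 2],
    [3, 5, 3, 4, 1, 1],
    [2, 4, 4, 5, 1, 2],
    [3, 4, 3, 5, 2, 3],
]
print(f"alpha = {cronbach_alpha(items):.2f}")
```

Because the four hypothetical items rank the six respondents very similarly, alpha comes out high; items that did not track the same construct would pull it down.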
Interpreting Reliability Correlations
So what do reliability figures indicate? Test manuals and independent reviews of tests will provide information about reliability. The reliability of a test is indicated by a reliability coefficient. It is denoted by the letter “r” and is expressed as a number ranging from 0 (no reliability or correlation) to 1 (perfect reliability or correlation). Generally, reliability coefficients are expressed as a decimal (e.g. r = .75), and the larger the reliability coefficient, the more repeatable or reliable the test scores. However, this does not indicate the test’s validity, which will be discussed later. For a test to be valid it MUST have reasonable reliability, but a highly reliable test is not necessarily valid. Some general guidelines for interpreting test reliability can be found in Table 2.

Validity
Validity is the extent to which a test measures what it claims to measure. It is vital for a test to be valid in order for the results to be accurately applied and interpreted. In order to be valid, a test must be reliable; but reliability does not guarantee validity. Validity isn’t determined by a single statistic, but by a body of research that demonstrates the relationship between the test and the behavior it is intended to measure. There are three types of validity:
Content Validity
When a test has content validity, the items on the test represent the entire range of possible items the test should cover (that is, the test items adequately and representatively sample the content area to be measured). Individual test questions may be drawn from a large pool of items that cover a broad range of topics. In some instances where a test measures a trait that is difficult to define, an expert judge may rate each item’s relevance. Because each judge bases the rating on opinion, two independent judges rate the test separately, and items that are rated as strongly relevant by both judges are included in the final test. Content validity is primarily an issue for educational tests, certain industrial tests, and other tests of content knowledge such as the Psychology Licensing Exam. Expert judgment (not statistics) is the primary method used to determine whether a test has content validity. Nevertheless, the test should have a high correlation with other tests that purport to sample the same content domain.

Face Validity
Face validity is the least important aspect of validity, because validity still needs to be checked directly through other methods. All that face validity means is: "Does the measure, on the face of it, seem to measure what is intended?" Some commonly used measures have very high face validity, such as the Beck Depression Inventory: anyone looking at the items on the test can tell exactly what is being measured (symptoms of depression). Such a test may have excellent validity for measuring depression in a person who is committed to answering the questions truthfully. However, the test is easily manipulated, and a person who is not depressed can easily produce results on the BDI indicating high levels of depression (it is easily faked). For that reason, researchers will sometimes purposely obscure a measure’s face validity in an effort to attain improved validity elsewhere. The MMPI-2 is probably the best example of deliberately low face validity, which was built into its construction: the items on the MMPI-2 have been found to be associated with the traits being measured, but one often cannot tell what a given question is measuring. In such cases, the other types of validity (criterion and construct) are more important. Face validity is not a technical sense of test validity; just because a test has face validity does not mean it will be valid in the technical sense of the word.

Criterion-related Validity
A test is said to have criterion-related validity when the test has demonstrated its effectiveness in predicting criterion or indicators of a construct. There are two different types of criterion validity:
Concurrent Validity. Concurrent validity occurs when the criterion measures are obtained at the same time as the test scores. This indicates the extent to which the test scores accurately estimate an individual’s current state with regard to the criterion. For example, a test that measures levels of depression would be said to have concurrent validity if it measured the actual current level of depression experienced by the test taker. Often, concurrent validity is established by correlating the test findings (e.g. level of depression) with some “gold standard” (e.g. a structured interview that assesses depression).

Predictive Validity. Predictive validity occurs when the criterion measures are obtained at some time after the test is given. This indicates the extent to which the test scores forecast the individual’s future standing on the criterion (e.g. scores on a pre-surgical psychological screening predicting later surgical outcome).
Convergent Validity
It is important to know whether the test being used (or developed) returns results similar to other tests that purport to measure the same or related constructs. Questions to be addressed include: Does the measure match an external 'criterion' such as a behavior or another, well-established, test? Does it match the criterion concurrently, and can it predict this behavior? An example might be a self-report measure of pain level compared with trained observers' ratings of a patient's pain behavior.
Discriminant Validity
Just as it is important to show that a test returns results similar to other tests of the same trait, it is also important to show that a measure doesn't measure what it isn't meant to measure (i.e. it discriminates). For example, discriminant validity would be evidenced by a low correlation between a depression measure and one of self-efficacy or self-esteem (one would expect depression and self-esteem to be inversely related). Also, a test of depression that correlates highly with an anxiety test will not have good discriminant validity (it cannot discriminate between depression and anxiety).
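Convergent and discriminant validity can be contrasted in a hypothetical sketch: a new depression measure should correlate highly with an established depression measure (convergent) and weakly with an unrelated trait (discriminant). All scores below are invented for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for eight patients.
new_depression_test = [10, 22, 15, 30, 8, 18, 25, 12]          # measure being validated
established_depression_test = [11, 20, 16, 28, 9, 17, 26, 13]  # same construct
unrelated_trait = [5, 5, 3, 4, 3, 5, 3, 4]                     # construct the test should NOT track

convergent = pearson_r(new_depression_test, established_depression_test)
discriminant = pearson_r(new_depression_test, unrelated_trait)
print(f"convergent r = {convergent:.2f} (should be high)")
print(f"discriminant r = {discriminant:.2f} (should be near zero)")
```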
Construct Validity
Construct validity is the most important kind of validity. A test has construct validity if it demonstrates an association between the test scores and the theoretical trait it is intended to measure. If a measure has construct validity, it measures what it purports to measure. Establishing construct validity is a long and complex process. The various qualities that contribute to construct validity include criterion validity (both predictive and concurrent), convergent validity, and discriminant validity.

Generalizability
Reliability and validity are often discussed separately, but sometimes you will see them both referred to as aspects of generalizability. Often we want to know whether the results of a measure or a test used with a particular group can be generalized to other tests or other groups. This is especially important relative to the topic being discussed. For instance, are the results of the MMPI (which was developed using non-pain patients) generalizable when the test is given to a patient with chronic pain? A test may be reliable and it may be valid, but its results may not be generalizable to other tests measuring the same construct, nor to populations other than the one sampled.

Standardization and the Normative Sample
To understand norms and statistical assessment, one first needs to understand standardization. Standardization is the process of testing a group of people to see what scores are typically attained on the test. With a standardized test (such as the MMPI), the patient’s raw score is interpreted by determining where it falls relative to the standardization group’s performance; this yields the standardized score. With standardization, the normative group upon which the test was developed must reflect the population on which the test is being used. Most commonly used major psychological measures are norm-based (again, meaning that the score for an individual is interpreted by comparing his or her score with the scores of a group of people who define the norms for the test). Often the test manual, or subsequent publications, will provide data for different “normative” or standardization groups. For instance, there may be community norms, medical patient norms, psychiatric patient norms, etc. Which set of normative data is used will change the standardized score for the individual patient. Often, if multiple norms are available, they will be reported as part of a computerized scoring report (the largest provider of which is NCS-Pearson). The concepts of norms and standardization are very important to test interpretation but are often ignored when common psychological tests are used with chronic pain patients (who were not included in the normative sample).
In summary, standardized tests are:
Administered under uniform conditions (no matter where, when, by whom or to whom it is given, the test is administered in a similar way).
Scored objectively (the procedures for scoring the test are specified in detail so that any number of trained scorers will arrive at the same score for the same set of responses). Questions that need subjective evaluation (e.g. essay questions, responses to open-ended questions) are generally not considered standardized tests.
Designed to measure relative results on the test as compared with the normative sample (as discussed above). In order to measure relative results, standardized tests are interpreted with reference to a comparable group of people (the standardization or normative sample). One example is a test of depression. The cut-off scores for a depression test (e.g. not depressed, mildly depressed, severely depressed) are determined during the test development phase using the standardization group (or normative sample). The normative sample should be representative of the target population; however, this is not always the case, and the test then needs to be interpreted with appropriate caution. This is one of the weaknesses of most tests that were not originally designed for use with pain patients but are commonly used with this population (e.g. MMPI, BDI, etc.). This “off-label” use of a test can still be effective if it is taken into account in the interpretation of results. In most cases there is substantial research on the use of these tests with pain patients and, in some cases, special normative data is available (e.g. the MMPI standardized on a chronic pain population).
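The norm-referenced scoring described above can be sketched as converting a raw score to a T-score (mean 50, SD 10) using the normative sample's mean and standard deviation, then applying cut-offs defined on that scale. The norm values and cut-offs below are invented for illustration and are not taken from any real test manual:

```python
# Hypothetical normative-sample statistics (invented for illustration).
norm_mean, norm_sd = 14.0, 6.0

def t_score(raw):
    """Convert a raw score to a T-score (mean 50, SD 10) via the normative sample."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

def classify(raw):
    """Apply illustrative cut-offs defined on the T-score scale."""
    t = t_score(raw)
    if t >= 70:
        return "clinically elevated"
    if t >= 60:
        return "moderately elevated"
    return "within normal limits"

print(t_score(26), classify(26))  # T = 70.0 -> clinically elevated
print(t_score(14), classify(14))  # T = 50.0 -> within normal limits
```

Note that the same raw score yields a different T-score, and possibly a different classification, under a different normative group's mean and standard deviation, which is exactly why the choice of norms matters for chronic pain patients.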
Overview of Psychological Testing of Chronic Pain Patients
Psychometric instruments used in the assessment of chronic pain can be categorized into four general types, as seen in Table 3.
Broadband-General
The broadband-general measures are those that were not originally designed to assess medical patients, including patients with pain. These measures often assess a number of personality, behavioral, or other variables. Although these instruments were not originally designed to assess medical issues, normative data for specific populations has often been developed to help with generalizability. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2; Butcher, Graham, Ben-Porath et al., 2001) is the most widely used and researched personality inventory. The MMPI-2 was designed to identify psychopathology and personality features; however, it is also one of the most commonly used measures in areas such as chronic pain assessment and pre-surgical screening. When using broadband-general measures, the clinician must be well versed in validity, standardization, and interpretation issues to avoid misuse of the test. Example excerpts of three different interpretations of a 1-3/3-1 codetype on the MMPI-2 illustrate this point and can be seen in Table 4.
Broadband-Health
The broadband-health measures are measures that have been specifically developed to assess a number of issues related to health and medical issues, without necessarily focusing on one particular health problem. Examples can be seen in Table 3. These tests will often assess psychological and behavioral issues that are intimately related to medical treatment. For instance, the Battery for Health Improvement-2 (BHI-2; Bruns and Disorbio, 2003) is designed “for the psychological assessment of medical patient” and includes scales organized into five domains: validity, physical symptoms, affective, character, and psychosocial variables. Similarly, the Millon Behavioral Medicine Diagnostic (MBMD; Millon, Antoni, Millon, Minor and Grossman, 2001) includes domains of response patterns, psychiatric indications, coping styles, stress moderators, treatment prognostics, management guides and negative health habits. The MBMD now has normative data for general medical patients, chronic pain and bariatric surgery candidates.
Narrow Focus
The narrow focus measures assess a particular psychological issue such as depression, anxiety, suicidality, stress, or coping. Probably the two most commonly used measures in this category are the Beck Depression Inventory (BDI-2; Beck, Steer, & Brown, 1996) and the Beck Anxiety Inventory (BAI; Beck and Steer, 1993). As with the MMPI-2, when these measures are used with medical patients one must be very cautious with interpretation. For instance, the BDI-2 is a measure of self-rated depression that contains a number of physical (e.g. weight, sleep, energy) and cognitive (concentration, memory) symptoms, all of which can be differentially affected by depression, by pain or some other medical condition, or by both. Therefore, the clinician should always be aware of the impact of the actual medical problem on scores from a narrow focus psychological instrument.

Narrow Focus-Health
The narrow focus-health test is designed to be a brief measure of a specific medical or health condition (See Table 5). These tests are valuable for assessing and treating a specific condition. Examples of these tests have been developed for the assessment of chronic pain (often used in conjunction with some of the broad based measures). For instance, the Multidimensional Pain Inventory (MPI; Kerns, Turk, & Rudy, 1985) includes 13 scales that yield assignment to one of three profiles based on cluster analysis: Dysfunctional, Interpersonally Distressed, and Adaptive Coper. Some examples of psychometric tests of each type that are used with chronic pain patients can be found in Table 5.
One of the primary pitfalls in psychometric assessment of the chronic pain patient is not paying attention to the validity of the test instrument relative to the problem being assessed along with concomitant interpretation issues. It is always important to keep in mind standardization and basic psychometric issues when using any test on a medical patient population including chronic pain.
Psychological Tests Commonly Used in the Assessment of Chronic Pain Patients
The following document, entitled “Psychological Tests Commonly Used in the Assessment of Chronic Pain Patients,” was published by the Colorado Division of Workers’ Compensation. It is an excellent review of these tests, including:
Test Characteristics
Attributes of the Tests
Strengths and Weaknesses of Each Test
The review includes categories that are listed in Table 6. These roughly correspond to the types of tests as discussed in Table 3 with slightly more specificity.
The Colorado review is referenced in many other practice guidelines. The remainder of the course consists of the material contained in the document. The document is available in PDF format and can either be reviewed online or printed. The questions on the post-course test refer to both the preceding material and the following document.
REFERENCES
Beck, A.T. & Steer, R.A. (1993). BAI, Beck Anxiety Inventory Manual. San Antonio, TX: Psychological Corporation.
Beck, A.T., Steer, R.A., & Brown, G.K. (1996). Manual for the Beck Depression Inventory-II. San Antonio, TX: Psychological Corporation.
Bergner, M., Bobbitt, R.A., Carter, W.B., & Gilson, B.S. (1981). The Sickness Impact Profile: Development and final revision of a health status measure. Medical Care, 19, 787-806.
Bruns, D. & Disorbio, J.M. (2003). Battery for Health Improvement – 2. Minneapolis, MN: Pearson.
Block, A.R., Gatchel, R.J., Deardorff, W.W., & Guyer, R.D. (2003). The Psychology of Spine Surgery. Washington, DC: American Psychological Association.
Butcher, J.N., Graham, J.R., Ben-Porath, Y.S. et al. (2001). MMPI-2: Manual for administration, scoring, and interpretation. Minneapolis, MN: University of Minnesota Press.
Derogatis, L.R. (1983). SCL-90-R: Administration, scoring and procedures manual-II. Towson, MD: Clinical Psychometric Research.
Foa, E.B. (1995). Posttraumatic Stress Diagnostic Scale Manual. National Computer Systems Inc.
Kerns, R.D., Turk, D.C., & Rudy, T.E. (1985). The West Haven-Yale Multidimensional Pain Inventory (WHYMPI). Pain, 23, 345-356.
Millon, T., Davis, R.D., & Millon, C. (1997). MCMI-III manual (2nd ed.). Minneapolis, MN: National Computer Systems.
Millon, T., Antoni, M., Millon, C., Minor, S., & Grossman, S. (2001). Millon Behavioral Medicine Diagnostic. Bloomington, MN: Pearson Assessments.
Morey, L.C. (1991). Personality Assessment Inventory: Professional Manual. Tampa, FL: Psychological Assessment Resources.
Wahler, H.J. (1983). Wahler Physical Symptoms Inventory Manual. Los Angeles, CA: Western Psychological Services.
Wallston, K.A., Wallston, B.S., & Devellis, R. (1978). Development of the Multidimensional Health Locus of Control (MHLC) Scales. Health Education Monographs, 6, 160-170.