If your jurisdiction has ever been the target of a cheating scheme or scandal, then you are very familiar with the costs of cheating. Writing new tests, conducting new recruitments and administering new selection procedures are time-consuming and costly, yet they represent only a portion of the costs of cheating. Perhaps the biggest potential costs arise when cheating goes undetected, resulting in incorrect selections for promotions or entry-level positions. Whatever the costs and whatever the long-term impacts may be, the sad truth is that cheating still occurs.
A cheating incident from 2009 underscores the impact of cheating scandals on agencies and should serve as a warning for individuals with the responsibility of preventing cheating, as well as any individual who may consider cheating. In this case, a member of the testing committee provided test questions to a lieutenant who subsequently provided them to an officer studying for the test. That officer went to internal affairs and an investigation was launched. In addition to scrapping the test, four police officials were temporarily removed from their duties pending the conclusion of the investigation. Two eventually left the department. The cloud of suspicion and distrust continues to hang over the department and the ultimate negative effects are incalculable.
As stated before, incidents like these should serve as lessons to those on both sides of the situation. Test users can always do more in terms of providing security for their tests and cheaters rarely succeed, ultimately paying the highest price of all involved.
Unfortunately, I had the opportunity to witness the impact of cheating firsthand when one of the jurisdictions I worked for in the past was rocked by a scandal involving a wrongful promotion and subsequent misbehavior on the part of the individual who was promoted. Not only was the department liable for the actions of the individual who had been promoted to sergeant, but it also suffered incalculable consequences in regard to those who were not promoted and became disenchanted with serving this particular department.
Among the many lessons learned in this experience was one I see many practitioners failing to learn: complacency is a cheater’s biggest ally. While it is critical to follow all the known test security and anti-cheating measures that are available in the literature, through common sense and in written test security agreements, it is also critical to learn from the mistakes and experiences of others. Therefore, I think it is valuable to relate the circumstances surrounding the cheating incident I was involved in uncovering so I can share the lessons that I learned and provide food for thought for individual readers in terms of how their agencies may be vulnerable to cheating.
In that regard, IPMA-HR has had the foresight to develop very strict item challenge procedures for promotional tests that include not allowing candidates to take any test materials from the testing site or from any item challenge meetings held within the department.
In particular, the police agency that I worked for had experienced an incident of cheating before my tenure there, and it had an elaborate test review and item challenge procedure. The regulations promulgated to prevent cheating and provide a “fair” challenge system created the perfect conditions for cheating. Unfortunately, this was not recognized until the incident occurred; we had overlooked the potential for a group of individuals working together to pull off an intricate scheme. So the lessons learned included being aware that cheaters do not always act alone and that plans for cheating can be elaborate. Hopefully the details of what occurred will provide insights regarding things you or your jurisdiction might have overlooked in your test security.
In this particular case, the jurisdiction’s review process involved providing each candidate with a list of the items he missed along with the correct or keyed answer immediately after completing the test. This was done to allow each candidate the opportunity to compare his answers with the keyed answers and determine if he wanted to challenge an item. Challenges could take the form of asserting that the candidate’s answer was as good as or better than the keyed answer or that the item itself was flawed and should be removed from the test. The challenge process was as competitive as the rest of the selection process, with test participants being aware that the outcome of appeals could impact their ultimate ranking on the certification list. In particular, candidates with the lowest scores had the most to gain by appealing items and winning their arguments, so the system encouraged them to file numerous appeals regardless of their merit.
The test was administered following standard protocol with the test session being limited to test takers and proctors. Each candidate signed in with picture ID and was given one test booklet and a Scantron answer sheet, both of which were numbered. Candidates were allowed as much time as they wanted to complete the test and were escorted to the restroom as needed, but were not allowed to leave the test room for any other reason. Upon completing the exam, each candidate was given a computer printout listing the number of the items he had missed along with his answer and the keyed answer.
There were two unusual occurrences during the test. Within the first twenty minutes of the exam, one officer, who had a history of discontent with the department, stood up and said something to the effect that he’d “had enough of this B.S.” and left the room. He proceeded to take his answer sheet to the scoring table, “just for the heck of it,” as he announced.
The second unusual occurrence was that a sergeant who was on duty stopped by in his police unit to “see how the test was going.” He was told that he was not permitted to be at the test site. After agreeing to leave, he asked if he could use the restroom, which he was permitted to do, and then he left.
It was not until the test had ended and all candidates had finished that a third piece of information was provided, one that put the other two occurrences in a different light. One of the proctors informed HR staff that he had accompanied one particular candidate to the restroom more than ten times during the two hours the candidate spent working on the test. At that point, considering that the sergeant who had dropped by, the officer who finished early, and the officer who made excessive trips to the restroom were all on the same squad and known to be very close, we were able to put together a theory of what happened.
The candidate who finished the test first left the test site with an almost complete key to the test, since he had done so poorly that the printout of his missed items gave him nearly all the correct answers. He then met up with the on-duty sergeant and gave him the key. The sergeant, per a prearranged agreement, went to the test site and deposited a crib sheet in the toilet paper dispenser in a predetermined stall in the restroom. This happened to be stall four, the same stall that the candidate who was the focus of our investigation used every time he went into the restroom.
As part of our investigation, we reviewed the answer sheets provided by all candidates. We determined that the average number of erasures and item changes per answer sheet was approximately three, while the candidate in question had over eighty. In addition, among candidates in general, the erasure marks indicated that about fifty percent of the time an answer was changed from wrong to right and about fifty percent of the time it was changed from right to wrong. The answer sheet completed by the candidate under suspicion indicated that among his erasures, he had gone from wrong to right in every change he made except one. That meant that somehow he gained an insight that allowed him to change his original wrong answer to a correct answer more than 98% of the time. So in addition to the other suspicious behavior, this candidate defied all odds and was able to correct nearly every mistake he had originally made.
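To put a rough number on “defied all odds,” the erasure pattern can be treated as a series of coin flips. The sketch below is an illustration only, not the analysis HR actually ran. It assumes exactly 80 answer changes (the article says only “over eighty”), assumes each change is independent, and assumes each change has the 50% chance of going from wrong to right that was observed among the other candidates.

```python
from math import comb


def prob_at_least(successes: int, trials: int, p: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical figures: the article reports "over eighty" changes, one exception.
TOTAL_CHANGES = 80
WRONG_TO_RIGHT = 79

chance = prob_at_least(WRONG_TO_RIGHT, TOTAL_CHANGES)
print(f"Wrong-to-right share: {WRONG_TO_RIGHT / TOTAL_CHANGES:.1%}")
print(f"Probability of doing at least this well by chance: {chance:.2e}")
```

Under these assumptions the probability is far below one in a billion billion, which is the kind of figure that can help frame erasure evidence for decision makers.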
While HR believed we had solid proof of what had happened and how, the Sheriff in office at the time chose not to take any action, believing that the evidence would not withstand a challenge by the union. Instead, the list was certified without challenge and the individual who had cheated was promoted, only to be terminated at a later date for use of excessive force that was caught on tape.
While the next article will delineate a number of precautions that should be taken regarding test security, along with several steps involved in creating an anti-cheating environment, it is hoped that this particular story of cheating will encourage you to take a hard look at the procedures you have in place and identify areas that may make your jurisdiction vulnerable. HR needs to be as creative in prevention procedures as those who would cheat us are in developing their schemes. Knowing what has been attempted before is a critical way of recognizing and stopping schemes before they happen.
The reading list for the Police Lieutenant Test (PL 301) has been updated to reflect the release of a new edition of one of the books that appears on the list.
The updated reading list table is below:
TYPE          TEST NAME             LAST UPDATED
Police        PSUP 301/302/303      Mar 2012
              PL 301                May 2012
              PDET 201              Mar 2012
Fire          FCO 101-EM/102-EM     Mar 2011
              FCO 103/104           Mar 2011
Corrections   CF-FLS 102            Feb 2011
ECC           ECC-FLS 102           Oct 2010

The articles in this series when taken as a whole present a picture of the challenges and potential pitfalls presented in the development of effective selection instruments and test batteries. In addition to the need to make sure instruments are reliable and valid so that they support the selection of the best available work force, they must also withstand legal scrutiny. Unfortunately, experience has shown that local laws, statutes and/or civil service rules that provide the blueprint for how HR work is to be done are many times in conflict with exam development and validation procedures. In particular, certification rules that dictate the number of candidates from a ranked list that can be certified for a hiring authority to consider for selection can be responsible for undoing the efforts made to conform to professional standards.
Many individuals tasked with writing civil service rules, particularly in the infancy of the development of merit systems, did not have the benefit of possessing a test development and statistical background. Many systems focused on fairness and avoiding abuses of differing forms of the spoils system or the good ol’ boy system, but they did not take into consideration statistical concepts related to test scores, and in particular whether or not meaningful differences existed between scores. Sometimes, certification rules narrowly defined the group eligible for certification, and in other instances, rules were modified in an attempt to address equal employment issues. These modifications often took the form of certification of the whole list, which meant the hiring agency could select anyone on the entire list to put through the final selection interviews.
Both approaches ignore what we know about test scores and test development. First of all, if certification rules are too narrow and allow only three to five names to be certified, they may be placing an emphasis on differences in test scores that are not truly meaningful. While limiting the number of candidates that can participate in the final interview or hiring interview process may be efficient, it may be excluding candidates who are essentially as well qualified for the job as those who are included for selection. Secondly and conversely, if all candidates can be certified and are eligible to participate in hiring interviews, we have ignored the concept of test utility and the fact that well-constructed and valid selection instruments can be appropriate for ranking. This is particularly true if a criterion-related validity study has been conducted that demonstrates a correlation between test scores and job performance, as is the case for IPMA-HR entry level tests.
To understand the first concern with certification rules that unduly limit the number going on for final consideration, we need to look at whether or not differences in test scores translate into differences in expected job performance. In other words, can we expect someone with a 91 to perform better on the job than someone with a 90? Most of us can agree that this would be a difficult thing to prove and probably should not be our focus in establishing certification rules. Rather, we should be looking at whether the top group certified for selection, when taken as a whole, is better than those not certified.
Again, this may be difficult to prove, but statistics and logic can be utilized to make our certification rules more defensible and work better for us. First of all, we should be able to agree that we should use whole scores and avoid setting certify/don’t-certify points at fractions. If whole point differences don’t translate into meaningful distinctions in scores and job performance, then how can fractions? So we want to avoid certification rules that would certify someone with an 89.5, but not someone with an 89.2.
In addition, since we do want to make the differences meaningful, we should look at the distinction between the top score certified and the bottom score to be certified. In other words, we should look at the group certified as a range. Rather than comparing the lowest score in the range to the next score below it, we should be looking at the distinction between the top score in the range and the first score that falls below the range. That is, if your certification rule, as applied, would indicate the top score to be certified is a 95 and the lowest score to be certified is an 88, you want to have some confidence that the next lowest score, say an 85, is significantly different from a 95.
Candidates have attempted to challenge pass points and certification rules by comparing the 88 and the 85 in our little example and saying there is no difference in these scores. We want to be sure that we are not making the claim that there is. What we are trying to say is that there is a difference between a 95 and an 85. Since we want to select top performers, we don’t want to include anyone in our certification group that could be considered significantly below our top group.
If we are working with rules that allow for broad ranges of groups to be certified, we will need to accept that not all will be top performers. On the other hand, when we are working with rules that allow for only the top three to five scorers to be certified, we may be working with rules that are too narrow and that unnecessarily eliminate candidates with the potential to be top performers. While probability statistics can be difficult to compute, understand and apply, a look at the normal curve helps illustrate some of the things that we may want to look at when examining our current certification rules.
This is the final part of a four-part series on successive hurdles, test weighting and certification rules. If you’ve just joined us, we suggest you catch up with part 1, part 2 and part 3 first. In case you missed it, check out Robert Burd’s previous series, Item Analysis In Public Safety.
In the last article we focused on weighting the tests and subtests that comprise the total selection process. We identified some instruments that should only be used as pass/fail and we identified others that can be used to rank candidates. Those tests and subtests suitable for ranking are those identified through the job analysis as assisting in differentiating potential job performance. We also identified an issue with weighting tests and subtests if we rely on simply multiplying test results by the percentage we want them to weigh in our total. Tests with greater variance tend to impact ranking more than the desired weight. Simply put, tests tend to self-weight based on their variance.
With a simple illustration we can see that tests that spread test scores out (have greater variance) will have a greater impact on the final ranking of candidates than tests that tend to lump everyone together (have less variance). Taking this concept to its extreme, it can be seen that if a group of five people all got the same score on a multiple-choice exam but achieved widely divergent scores on a structured interview, the multiple-choice exam would carry zero weight in our final ranking and the interview would carry one hundred percent.
Multiple-Choice Test Scores    Structured Interview Scores
80                             95
80                             85
80                             80
80                             75
80                             70

This would be an undesirable result if you actually wanted each test to have a different weight or equal weights in the final ranking. Also note that simply multiplying each test score by the desired weight, for example multiplying by .5 if you wanted each component to weigh 50%, will not achieve the desired results in that it will not change the ranking established by the Structured Interview.
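The five-candidate example above can be checked with a short script. This is a minimal sketch using the scores from the table; the function names are illustrative and not part of any standard scoring package.

```python
from statistics import pstdev

# Scores from the example table, one entry per candidate.
mc_scores = [80, 80, 80, 80, 80]
interview_scores = [95, 85, 80, 75, 70]


def weighted_composite(first, second, weight_first=0.5, weight_second=0.5):
    """Multiply each raw score by its nominal weight and sum the results."""
    return [weight_first * a + weight_second * b for a, b in zip(first, second)]


def rank_order(scores):
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


composite = weighted_composite(mc_scores, interview_scores)
print(composite)             # [87.5, 82.5, 80.0, 77.5, 75.0]
print(pstdev(mc_scores))     # 0.0, so the written test cannot separate candidates
print(rank_order(composite) == rank_order(interview_scores))  # True
```

Despite the nominal 50/50 weighting, the final order is exactly the interview order, because the multiple-choice scores have no spread to contribute.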
Even though this may be an extreme example, it does help us picture what is happening within the tests we use for ranking in regard to their self-weighting. Further, utilizing an extreme example also illustrates the decision making process a jurisdiction can go through in determining whether or not taking the time and effort to adjust scores for the desired weighting is warranted. In most cases, if the means and the standard deviations for the tests within the battery do not vary widely, it may not be necessary to apply corrections.
In addition to the challenges posed by the variance of the selection instruments within our test battery, we can also have other sources of weighting errors when we utilize multiple panels for conducting structured interviews or multiple panels of assessors for rating candidates in multiple assessment center exercises. Frequently, very large jurisdictions will have to employ both of these strategies in dealing with large numbers of candidates competing for positions at each rank. A review of the panels’ scores may indicate that panels are varying their application of ratings, with some boards being very strict, some being quite lenient and others being conservative, despite best efforts to train all raters. Although these rating differences should be avoided through sufficient training, agencies can use the corrections outlined below to address differences in the way panels score candidates.
If you do have differences in the way panels are scoring candidates, you may want to consider some type of correction so that the panel that is considered the “easiest” and tends to give the highest scores will not end up determining who gets ranked the highest. As suggested previously, the issues related to combining scores in proportion to their desired weights can be resolved by utilizing standardized scores.
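One way to picture a panel correction is to standardize each candidate’s score against the mean and standard deviation of the panel that rated him. The sketch below is an illustration under stated assumptions, not a prescribed procedure: the panel names and scores are hypothetical, the population standard deviation is an arbitrary choice, and the approach is only reasonable if candidates were assigned to panels in a way that makes the panels’ candidate pools comparable.

```python
from statistics import mean, pstdev

# Hypothetical raw interview scores, grouped by the panel that gave them.
panel_scores = {
    "Panel A (lenient)": {"Cand 1": 90, "Cand 2": 85, "Cand 3": 80},
    "Panel B (strict)": {"Cand 4": 80, "Cand 5": 75, "Cand 6": 70},
}


def standardize_within_panel(scores):
    """Convert one panel's raw scores to Z scores using that panel's own norms."""
    values = list(scores.values())
    panel_mean = mean(values)
    panel_sd = pstdev(values)
    if panel_sd == 0:
        raise ValueError("panel gave identical scores; Z scores are undefined")
    return {name: (raw - panel_mean) / panel_sd for name, raw in scores.items()}


z_by_candidate = {}
for panel, scores in panel_scores.items():
    z_by_candidate.update(standardize_within_panel(scores))

for name, z in sorted(z_by_candidate.items(), key=lambda item: -item[1]):
    print(f"{name}: {z:+.2f}")
```

On raw scores, Cand 2 (85) would outrank Cand 4 (80). After standardizing within panels, Cand 4 is the top scorer of the strict panel and ties with Cand 1, while Cand 2 sits at the lenient panel’s average.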
All of us have been subjected to the use of standardized scores in some fashion or another since they are widely used in the educational system in our country. Their utility in the educational model is similar to that in HR selection: they provide a means of comparing candidate performance on different tests. Essentially, they allow us to use the test group to establish the norms for comparing candidates’ scores within that group. In the case of Z scores, the standardization process (i.e., the process used to convert a score) involves expressing an individual’s score in terms of its distance from the mean (the arithmetic average) of the test, measured in standard deviations (SD, a measure of how far scores vary from the mean). Other types of standardized scores fix the mean and the standard deviation at chosen values, with T scores and Deviation IQ scores being examples of these. T scores use a mean of 50 and an SD of 10, and IQ scores use a mean of 100 and an SD of 15. Thus someone who earned a Z score of 1, a T score of 60 and a standard IQ score of 115 would have had the same level of performance on each instrument when compared to the norm groups. Such comparisons are illustrated in the chart of the normal curve below, with the first line under the normal curve representing Z scores, which range from -4 to +4 as they divide the normal curve by standard deviations. That is, a Z score of +1 is equivalent to being one standard deviation above the mean and -1 is equivalent to one standard deviation below the mean.
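The equivalence between a Z score of 1, a T score of 60 and a Deviation IQ of 115 follows from rescaling the same Z score to a different mean and SD. A minimal sketch, with an illustrative function name:

```python
def from_z(z: float, scale_mean: float, scale_sd: float) -> float:
    """Express a Z score on a scale with the given mean and standard deviation."""
    return scale_mean + scale_sd * z


z = 1.0
t_score = from_z(z, scale_mean=50, scale_sd=10)        # T score scale
deviation_iq = from_z(z, scale_mean=100, scale_sd=15)  # Deviation IQ scale

print(t_score, deviation_iq)  # 60.0 115.0
```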
The most commonly used standardized scores are Z scores. To compute a Z score, we obtain the difference between an individual’s raw score and the mean of the normative group and then divide the difference by the Standard Deviation (SD) of the normative group. That is, if an individual earned 85 on a test with a Mean of 75 and a Standard Deviation of 10 their Z score would be 1.
Z = (Raw Score - Mean) / Standard Deviation
Z = (85 - 75) / 10 = 10 / 10 = 1
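The same computation can be sketched in a few lines. The first call reproduces the worked example; the second part derives the mean and SD from a test group, using hypothetical scores and the population standard deviation (a sample standard deviation would be an equally defensible choice and gives slightly different values).

```python
from statistics import mean, pstdev


def z_score(raw: float, group_mean: float, group_sd: float) -> float:
    """Distance of a raw score from the group mean, in standard deviation units."""
    if group_sd == 0:
        raise ValueError("standard deviation is zero; Z scores are undefined")
    return (raw - group_mean) / group_sd


# Worked example from the text: raw score 85, mean 75, SD 10.
print(z_score(85, 75, 10))  # 1.0

# Hypothetical test group used to establish the norms.
group_scores = [60, 70, 75, 80, 90]
group_mean = mean(group_scores)
group_sd = pstdev(group_scores)
group_z = [z_score(score, group_mean, group_sd) for score in group_scores]
print(group_z)
```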
The utility of standardized scores, as indicated previously, is in the ease of comparisons they provide. Once we have calculated Z scores (and thankfully, there are computer programs that will do this for us), we can make comparisons with other candidates on other tests, and in addition, we have a tool we can use to combine test scores to achieve their desired weights. The diagram below illustrates how Z scores and other standardized scores compare with each other as well as how they compare to the normal curve, which represents the distribution of scores we hope to have from the administration of our tests. We can also see that if we calculate Z scores for our candidates on multiple exams, the point at which each score falls can be illustrated linearly and will give us a graphic of how a candidate performed on a test compared to the rest of the group.
Then to combine these scores we take the portion or percentage of the Z score that we want a test to contribute to the final score and use it to multiply the score. So a Z score of 1.0 times a weighting of .5 becomes .5 and a Z score of .5 times a weighting of .5 becomes .25. Summing these two values gives us a Z score of .75.
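The weighted combination described above can be sketched as follows. The component names are hypothetical labels, and the check that the weights sum to one is an added safeguard rather than something the article requires.

```python
def combine_z(z_scores: dict, weights: dict) -> float:
    """Weight each component's Z score and sum the results."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1.0")
    return sum(weights[name] * z_scores[name] for name in weights)


# Example from the text: Z scores of 1.0 and .5, each weighted .5.
candidate_z = {"written": 1.0, "oral": 0.5}
component_weights = {"written": 0.5, "oral": 0.5}

combined = combine_z(candidate_z, component_weights)
print(combined)  # 0.75
```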
Combining Z scores for final rankings should only be applied to passing scores. An important assumption is incorporated in this process: candidates who pass the exams are at least average or above, and therefore combining scores will not be confounded by negative Z scores (remember that Z scores can range from -4 to +4). Using only passing scores means that you have the option of using straight Z scores for establishing your ranked list or transforming them into a more recognizable metric. To accomplish that, you would establish a mean that represents the pass point (i.e., you set the mean to the same number as your pass point) and a standard deviation that approximates the average of the standard deviations achieved on the instruments themselves. Then, to compute the transformed scores, you use the mean as the starting point for a candidate’s score and add the product of the SD multiplied by the Z score.
That is, with the mean set to a pass point of 70 and an SD of 10, an individual with a Z score of 1.0 would have a score of 70 (mean) plus 10 (SD x 1.0), which equals 80.
As can be seen by these transformations, addressing the issues related to self-weighting can add several steps to determining scores, and that, along with the complexity of the computations, has served to deter many jurisdictions from correcting for self-weighting. The importance of this information is to inform test developers and users that there is an issue with self-weighting, that there are methods for correcting self-weighting errors, and that individuals responsible for ranking candidates can apply corrective methods should they choose.
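The transformation can be sketched as below. The pass point of 70 matches the example; the two instrument SDs (9 and 11) are hypothetical values chosen so that their average is the SD of 10 used in the text.

```python
from statistics import mean


def transformed_score(z: float, pass_point: float, instrument_sds: list) -> float:
    """Re-express a combined Z score on a scale anchored at the pass point.

    The scale's SD is taken as the average of the instruments' SDs.
    """
    scale_sd = mean(instrument_sds)
    return pass_point + scale_sd * z


# Hypothetical SDs for the written test and the oral board.
sds = [9, 11]

print(transformed_score(1.0, pass_point=70, instrument_sds=sds))   # 80.0
print(transformed_score(0.75, pass_point=70, instrument_sds=sds))  # 77.5
```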
This is part three of a four-part series on successive hurdles, test weighting and certification rules. If you’ve just joined us, we suggest you catch up on part 1 and part 2. Part 4 will discuss how certification rules fit into the process of successive hurdles and test weighting as we’ve discussed thus far. The conclusion to this series will be available to read on the Assessment Services Review on April 25, 2012. In case you missed it, check out Robert Burd’s previous series, Item Analysis In Public Safety.
Is smoking the new frontier for EEO actions? Hospitals and other health institutions (the National Institutes of Health in the Washington metro area comes to mind) proclaim that they are smoke-free. Some employers are going a step further and have become smoker-free. The reason is that these employers, regardless of their business, have a stake in keeping health benefit costs down. Having healthy workers would help achieve that goal, and smoking is unhealthy.
Under federal law smokers are not a protected class. But does smoking implicate protected classes? Smoking itself is an activity, not a disability. Nicotine addiction might be an impairment, but it may or may not be a disability under federal law. Another angle is whether smokers tend to be more prevalent in some race/ethnic groups. The Centers for Disease Control issued a report last year indicating that smoking rates varied by industry, but did not go into demographics.
If an employer wants to ban smokers, would testing for nicotine be an unlawful medical examination? It’s one of many issues to consider. On the state level, according to Pfeifer (2012), there are 17 states that allow smokers to be banned. But there are also a number of states (such as Wisconsin) that have “lawful product” statutes that prohibit an employer from refusing employment to those using a product lawfully available in the state.
Reference: Pfeifer, R. (2012, January 31). Employer smoking bans debated, but Wisconsin’s protection of smoking employees remains. Lexology. Retrieved at www.lexology.com/library.
Reprinted with permission from the Personnel Testing Council of Metropolitan Washington.
In the previous article, I introduced the concept of weighting exams that comprise the battery of instruments in a selection process. This article will explore that process more in depth. To begin with, some instruments lend themselves to being weighted and thus providing an impact on the final ranking of candidates and others do not. Determination of which instruments are appropriate for ranking and the weight given to those that are considered appropriate for that purpose should be established through the use of a comprehensive job analysis designed to support the content validity model for test development.
There are numerous published methodologies for conducting job analyses that are designed to comply with the Uniform Guidelines on Employee Selection Procedures (UGESP), even though they may differ in how they combine subject matter experts’ ratings on KSAP’s, which ultimately determine the weight given to selection components. Typically, these systems will collect ratings on KSAP’s and then review them to determine which ones have received ratings that indicate that they are required at time of hire, are important for job success and are linked to performing important job tasks effectively. Often, to make the system more manageable, the next step will involve grouping KSAP’s into domains, which is what is recommended by several job analysis procedures designed to conform with the requirements of the UGESP.
Ultimately the surviving KSAP’s should be plugged into an Exam Plan Outline which is essentially a grid with types of selection instruments being listed across the top and surviving KSAP’s being listed along the left side of the grid. The KSAP’s along with their relative weights are distributed under the type of selection instrument that would be the most effective for measuring that particular KSAP.
For example, in applying this model to selection for police officers, we would commonly find that KSAP’s would include ability to communicate verbally and ability to read and comprehend training materials. We would also commonly find that we would have written multiple-choice exams, oral exams, background investigations, psychological exams, medical exams, and physical ability exams. Plugging our two abilities into this hypothetical grid, we can see that we would put verbal communication under our oral board and we would put the ability to read under our written multiple-choice exam. Totaling the weights from each column in our grid will give us the weighting to be applied to each instrument in the selection process.
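The column totaling described above can be sketched in a few lines. The three weighted KSAP’s and their 5% weights mirror the article’s hypothetical grid; placing “Ability to make quick decisions” under the oral board is an assumption inferred from the grid’s 10% oral board total, and the variable names are illustrative.

```python
# Each weighted KSAP maps to (selection instrument, weight in percent).
# WR = Written, OB = Oral Board.
exam_plan = {
    "Ability to read and comprehend written materials": ("WR", 5),
    "Ability to make quick decisions": ("OB", 5),  # column placement assumed
    "Ability to respond to questions verbally": ("OB", 5),
}

# Sum the KSAP weights under each instrument (the grid's "Total Weight" row).
instrument_totals = {}
for ksap, (instrument, weight) in exam_plan.items():
    instrument_totals[instrument] = instrument_totals.get(instrument, 0) + weight

print(instrument_totals)  # {'WR': 5, 'OB': 10}
```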
The following hypothetical grid shows how the weights from the job analysis can be taken from the KSAP’s and assigned to the selection instruments in the battery.
WR = Written, OB = Oral Board, PF = Physical Fitness, BG = Background, PS = Psychological

KSAP’s                                              Exam Type
                                                    WR    OB    PF    BG    PS
Ability to read and comprehend written materials    5%
Ability to make quick decisions                           5%
Ability to respond to questions verbally                  5%
Ability to apprehend and subdue suspects                        C
21 Years of Age                                                       R
Stable thinking and mental processing                                       C
Total Weight                                        5%    10%

In establishing the weights to be applied to exams, another fundamental factor must be considered in regard to whether or not a particular instrument is appropriate for ranking. As indicated previously, job analysis systems designed to support the content validity model for test development are critical for establishing the weights for the instruments in the test battery; however, development and validation of the instruments utilized for background investigations, medical exams and psychological exams are not typically established through this model. While the need for such exams can be demonstrated through the use of the job analysis process, establishing the validity of these exams themselves typically requires utilization of the construct or criterion-related validity models.
It is difficult to support using instruments for ranking that measure constructs without proof that possessing more of a construct increases job performance, which is essentially the claim made by the content and criterion-related validity models. So while we cannot use the content validity model to demonstrate, for example, that having a more suitable background beyond the minimal threshold level makes one a better police officer, we can demonstrate that having more of an ability directly tied to job performance, such as greater ability to read and comprehend written materials, can make one a better police officer.
Once the critical test has been applied to determine if a component is suitable for ranking, the focus can shift to the actual weighting of components. As suggested earlier, some job analysis systems apply weights to individual KSAP’s at the beginning of the process that remain with them through out the process while others allow for a reshuffling of the deck so to speak after some KSAP’s have been eliminated from the selection process. Regardless of the system utilized, it is important to follow the process through its cycle to be assured that it conforms to the UGESP. Ultimately, whichever system is used, it should result in percentages that reflect the portion a particular component will contribute to the final score.
Typically selection processes developed from appropriate job analysis procedures will result in weighting written multiple-choice tests and oral exams including structured interviews and assessment centers. It is most important that the tests are weighted in proportion to their job importance and that the job analysis documents the reason for the weighting scheme that is chosen. For example, if job analysis results indicated that these two components should be equally weighted with each portion contributing 50% of the final score used for ranking. The common practice is to multiply a candidate’s scores on the oral board and written exam by .5 and then sum the results. (Note: It is important to remember that the range of weights used will vary depending on job analysis results and the other testing components being utilized. One study (Lowery, 1996) indicated the median weight for the written multiple-choice test was 30%. There is not one correct weighting scheme. The job analysis will have to support the scheme that is chosen.) Similarly, written multiple-choice tests frequently incorporate multiple subtests, one variation being the inclusion of two components, one cognitive and one non-cognitive with non-cognitive components being similar to an interest questionnaire or a background data questionnaire similar to IPMA-HR’s BDQ. These two components could also be weighted based on information obtained in the job analysis.
It is important to remember that tests tend to self-weight based upon their variance. This means that tests with more divergent scores have a greater impact on the ranking of candidates than tests with more homogeneous scores.
To ensure that the instruments in the selection process are weighted according to the exam plan outline, it is important to apply a correction or adjustment to test scores so that they actually contribute their desired weight to the final score as opposed to the weight established by their variance. One common method for ensuring that scores contribute their desired weight is to transform the scores into T scores or Z scores. This process involves using the statistical means of the instruments and their standard deviations to establish standardized scores and then combining the standardized scores. The process for establishing T scores and Z scores is described in most texts on statistics including Psychological Testing by Anne Anastasi. In addition, there is a statistical correction that can be made that provides a formula for determining what percentage of an exam score needs to be utilized to achieve the desired weight.
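As a rough sketch of the standardization approach, the Python below converts raw scores to Z scores and T scores and then applies the desired weights to the standardized scores. The function names and the sample scores are made up for illustration, and the population standard deviation is used as an assumption; a particular agency or text may prefer the sample standard deviation.

```python
from statistics import mean, pstdev


def z_scores(raw):
    """Z score: distance from the group mean in standard deviation units."""
    m = mean(raw)
    sd = pstdev(raw)  # population SD (an assumption for this sketch)
    if sd == 0:
        raise ValueError("all scores are identical; cannot standardize")
    return [(x - m) / sd for x in raw]


def t_scores(raw):
    """T score: a Z score rescaled to a mean of 50 and an SD of 10."""
    return [50 + 10 * z for z in z_scores(raw)]


def combine_standardized(written, oral, w_written=0.5, w_oral=0.5):
    """Apply the desired weights to standardized, not raw, scores."""
    zw, zo = z_scores(written), z_scores(oral)
    return [w_written * a + w_oral * b for a, b in zip(zw, zo)]


# Made-up data: the written scores are widely spread, the oral scores are not.
written = [60, 70, 80, 90, 100]
oral = [84, 86, 85, 83, 87]
print(combine_standardized(written, oral))
```

If the raw scores were simply averaged, the written test would dominate the ranking because its scores vary far more. After standardization both components have the same spread, so the .5 weights are applied on an equal footing.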
The next article in this series will go more in-depth on the topic of standardizing scores to adjust weights. Stat Trek includes formulas for important statistics. Most of these can be run using a statistical program like SAS or SPSS.
The keys to combining scores ultimately involve using comprehensive job analysis procedures to establish the desired weights and then using appropriate statistical methods for combining these scores so that they reflect the desired weights.
Resources:
Anne Anastasi and Susana Urbina (7th Edition, 2009) Psychological Testing, Prentice Hall (ISBN-13: 978-0205703890)
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, and Department of Justice. (1978). Uniform guidelines on employee selection procedures. 43 FR 38295. Available as a free download from www.uniformguidelines
Phillip E. Lowry, A Survey of the Assessment Center Process in the Public Sector, 25 Pub. Personnel Mgmt. 307, 309 (1996), IPMA-HR
This is part two of a four-part series on successive hurdles, test weighting and certification rules. If you’ve just joined us, read up on part 1, which introduces the concept of weighting exams that comprise the battery of instruments in a selection process. Part 3 will go more in depth on the topic of standardizing scores to adjust weights and will be available to read on the Assessment Services Review on April 18, 2012. The series will conclude with part 4 on April 25, 2012. In case you missed it, check out Robert Burd’s previous series, Item Analysis In Public Safety.
The Court will be asked to straighten out the aftermath of Ricci v. DeStefano. It has not yet decided whether to grant certiorari in Briscoe v. New Haven, No. 11-1024, petition for cert 2/15/2012, the case that has revived the fight over tests administered in 2003.
Michael Briscoe, an African American firefighter, brought the suit that the city said it was trying to avoid when it canceled promotion lists that would have benefitted Ricci et al. The district court, while somewhat sympathetic to Briscoe’s suit, said that the Supreme Court had spoken regarding promulgation of the lists and that Briscoe should have intervened in Ricci to ensure that his situation was taken into account. Among other things, Briscoe argued that the city could have weighted the written test results differently and had less adverse impact.
On appeal, the Second Circuit took the position that Ricci was one issue, but Briscoe’s discrimination claim is another and he was entitled to sue. The city would seem to be in a quandary: it had to comply with the Supreme Court (and subsequent orders from the district court implementing the Supreme Court decision) and is now being sued, arguably, for that compliance. The city also noted that other state and local jurisdictions could be whipsawed by litigation regardless of what they did to comply with the law until this matter is resolved.
Reprinted with permission from the Personnel Testing Council of Metropolitan Washington.
The reading list for the PSUP series of Police Supervisor Tests has been updated to reflect the release of a new edition of one of the books that appears on the list.
The updated reading list table is below:
TYPE          TEST NAME            LAST UPDATED
Police        PSUP 301/302/303     Mar 2012
Police        PL 301               Jan 2012
Police        PDET 101             Feb 2012
Fire          FCO 101-EM/102-EM    Mar 2011
Fire          FCO 103/104          Mar 2011
Corrections   CF-FLS 102           Feb 2011
ECC           ECC-FLS 102          Oct 2010

Medical doctors and psychologists rarely rely on the results of one clinical test when making diagnoses. Similarly, selection experts recommend using a battery of selection instruments when making employee selections. The people in these professions realize that the accuracy and reliability of their conclusions are greatly enhanced when they have a broader range of information on those being evaluated.
In selection it is often critical to measure quite divergent knowledge, skills, and abilities, which necessitates the use of multiple selection instruments as part of a battery that comprises the selection process. Most jobs require cognitive abilities and some require a body of knowledge, which in many cases can be measured by a written exam. In addition, most jobs require some degree of ability to communicate verbally. Since written tests cannot measure verbal communication, a second test, usually a structured interview, is necessary to measure whether or not a candidate possesses the verbal abilities required for the target job.
In addition to measuring these abilities in candidates, many positions require additional abilities which require the use of additional selection instruments. Many classes like police officer, fire fighter, corrections officer and park ranger require the measurement of candidates’ physical abilities, psychological stability, medical fitness and suitability of background. To utilize these instruments effectively and efficiently, they must be combined in a manner that provides the greatest support for administration of the selection process and maximization of each instrument’s validity. Combining the information from multiple instruments is where the employee selection model differs from the medical model.
Two critical concepts in the combining process involved in selection are what are termed “successive hurdles” and “weighting.” Both terms are pretty much what they sound like they should be. “Successive hurdles” refers to the sequencing of the administration of the instruments in the selection plan with candidates being required to “jump” each hurdle. “Weighting” refers to the portion or percentage an instrument contributes to the final score for candidates. Weighting is typically established from the job analysis done on a classification and the related content validity study for identifying the instruments to be used in the subsequent selection process.
These two concepts are intertwined in that the weight a selection instrument is given can impact the sequence of testing or the “Successive Hurdles” in the selection process. While the weighting process focuses on mathematical computations usually spelled out in the specific job analysis and test development system being utilized, sequencing of testing involves more administrative concerns.
Two considerations that should be utilized in establishing the order in which tests will be given are whether or not a test is considered pass/fail and the labor intensity of administering a particular test. The criticality of a particular knowledge, skill, ability, or personal trait (KSAP) will determine whether or not the test utilized to measure that KSAP will be pass/fail. Pass/fail as used in this sense indicates that success on the test demonstrates a sufficient amount of the required KSAP to allow a candidate to continue in the selection process, while a lack of success on the test will eliminate a candidate from the process. In that regard, as used here, pass/fail does not relate to whether or not a selection instrument is used for ranking, merely whether or not a threshold or pass point has been established that determines whether or not a candidate is permitted to continue to the next step. The usefulness of an instrument for ranking is determined by the validity of that instrument and whether or not it is capable of distinguishing levels of job performance and therefore supports ranking better test performers higher than those who perform more poorly on the test.
Narrowly defined, “Successive Hurdles” can refer to the fact that any individual who is not successful in one portion of the selection process cannot proceed to the next portion. Since tests that comprise a test battery or selection process should all measure different KSAPs and ideally correlate with job performance while not correlating with each other, all instruments in the battery should be used to determine whether or not a candidate is permitted to proceed to the next step. However, many systems, including those for police officer, may necessitate that candidates continue to the next phase while aspects of previous testing are evaluated. This being the case, most jurisdictions functionally define “Successive Hurdles” as the process involved in leaping over each step in the process successfully, whether or not success on previous instruments has been fully determined.
Combining the criticality of an instrument with its ease of administration often results in a clear-cut sequence of testing. Ideally, instruments that are relatively easy to administer, in that they are not particularly labor intensive, will also measure critical aspects of job performance or basic job requirements and therefore make clear choices for being administered first in the process. Perhaps the best example of this combining of the two concepts is found in the instruments designed to evaluate a candidate’s background. While conducting a thorough background investigation is a time-consuming and labor-intensive process, a quick inventory or “mini” background check that focuses only on immediate and legally defensible disqualifiers can be administered easily at the beginning of the selection process to narrow the applicant pool and reduce the costs of administering the next steps in the process. For example, in order to meet the basic Peace Officers Standards and Training requirements in Nevada, which essentially means an individual can be admitted to a training academy for police officer, the individual must be twenty-one years old and a citizen of the United States, with no felony convictions or convictions for domestic battery. This means that an instrument that determines eligibility to enter a training academy could be administered rather inexpensively while having a significant impact on the applicant pool.
Inserting the next steps in the process would involve evaluation of the same two concepts. Typically, a logical second step is a written exam in that it can be administered to a large number of applicants with the use of relatively few proctors. Other instruments such as physical fitness tests, oral boards, psychological evaluations, polygraphs, background investigations, and medical exams require an almost one-to-one ratio between applicants and those administering the test and therefore are more labor intensive and more expensive. Therefore, it makes sense to administer these instruments in an order that reflects their costs per candidate while taking into consideration the number of candidates succeeding in or failing these tests. In that regard, it is also important to determine when a sufficient amount of information has been gained on a candidate to determine whether or not it is feasible to make a conditional offer of employment so that additional testing can be done without violating any provisions of the Americans with Disabilities Act (ADA).
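The cost reasoning above can be illustrated with a small Python sketch. The hurdle names, per-candidate costs and pass rates are invented for the example, and the sketch looks only at cost: it ignores legal sequencing constraints such as the ADA considerations just mentioned.

```python
def total_cost(hurdles, applicants):
    """Expected cost of running hurdles in the given order.

    hurdles is a list of (name, cost_per_candidate, pass_rate) tuples.
    Every candidate still in the process is tested at each step, and
    only the passing share moves on to the next step.
    """
    remaining = float(applicants)
    cost = 0.0
    for _name, cost_per_candidate, pass_rate in hurdles:
        cost += remaining * cost_per_candidate
        remaining *= pass_rate
    return cost


# Hypothetical figures: a cheap written exam and an expensive oral board.
written = ("written exam", 10.0, 0.5)
oral = ("oral board", 100.0, 0.5)

print(total_cost([written, oral], applicants=200))  # inexpensive hurdle first
print(total_cost([oral, written], applicants=200))  # expensive hurdle first
```

With these made-up numbers, placing the inexpensive hurdle first means the expensive one is administered to only half as many candidates.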
The use of multiple selection instruments under the “Successive Hurdles” model is a sound selection process that is often necessitated by the complexity of the KSAPs that must be measured. Using this model wisely and effectively is a valuable administrative tool that can save agencies money while assisting in selecting the best candidates available.
This is part one of a four-part series on successive hurdles, test weighting and certification rules. Part 2 will focus on weighting selection instruments within the selection process and will be available to read on the Assessment Services Review on April 11, 2012. Part 3 will be available on April 18, 2012. The series will conclude with part 4 on April 25, 2012. In case you missed it, check out Robert Burd’s previous series, Item Analysis In Public Safety.

Who is a supervisor? The Court isn’t sure and is asking the U.S. Solicitor General (SG) for help. The SG provides amicus curiae briefs on issues in which the federal government has a stake, and sometimes the Court invites an analysis of a disputed legal issue. The Court invited the SG to file a brief expressing the government’s view on the definition of the term “supervisor” for the purpose of imposing vicarious liability on an employer for harassment in violation of Title VII (Vance v. Ball State Univ., U.S., No. 11-556, SG invited to file brief 2/21/12). The Seventh Circuit had ruled in this racial discrimination case that the alleged harasser was a co-worker, not a supervisor. Vance is arguing that the circuits are split on the issue. Ball State acknowledges that there is a circuit split, but claims that Vance’s alleged harasser would not be a supervisor under any theory. The major alternatives are someone with personnel action authority and someone who directs the daily work of others.
Below is a case where a panel in the Eighth Circuit split on the issue, only one of several in the case. The majority held that a “lead driver” who was in charge of a two-person long-haul trucking team did not have enough authority to be a supervisor; notably, the lead driver could not direct the subordinate driver beyond what was already established in the subordinate’s duties and could only make recommendations to management regarding performance. The dissent pointed to considerable practical authority that the lead driver had while on the road.
Reprinted with permission from the Personnel Testing Council of Metropolitan Washington.
As a result of updates to the books that support our police detective tests and our commitment to making sure our tests are the absolute best they can be, we’ve released an update to the PDET 101 Police Detective test. Because of these changes, the name of the test has been updated as follows:

PREVIOUS TEST   ALSO KNOWN AS   NEW TEST
PDET 101        PDET 2.1        PDET 201

Five questions from the PDET 101 were replaced to create the new PDET 201. These changes are a result of new editions of books that appear on the test’s reading list. The old questions were no longer supported by the new books, so we replaced them.
It is important to note that all replacement questions were written to assess the same content areas as the original questions and are supported by the books on the current reading list, dated March 2012.
Note that we strongly urge you to always post the most current reading list. That said, if you’ve already posted the reading list from January 2012 for the PDET 101, you can still order the PDET 201 with confidence; the January 2012 reading list fully supports the questions on the PDET 201.
What does this mean for you?
Agencies that have already posted reading lists for their candidates to begin studying can use the new versions with absolute confidence that their candidates have studied for the correct test.
The PDET 201 is available to order immediately. We will honor existing orders for the PDET 101, but future orders will be fulfilled with the new version exclusively. If you are a customer who has already placed an order for the PDET 101 but have not received your order yet, please contact the Assessment Services Department and let us know if you’d like to change your order to the PDET 201.
If you have any questions about this switchover or are concerned with how it may affect you, do not hesitate to contact us.
From this point forward, we’ll announce updates to our reading lists for promotional tests as they happen.
In the meantime, however, we’d like to bring to your attention the fact that the reading list for the PDET 101 Police Detective test was updated in February to reflect a new edition of one of the publications.
You can submit a request to receive a reading list on our website.
You’ll find that we’ve also added a table that will show you when the reading list for each test was last updated. As of today, that table looks like this:
Type          Test Name            Last Updated
Police        PSUP 301/302/303     Jan 2012
Police        PL 301               Jan 2012
Police        PDET 101             Feb 2012
Fire          FCO 101-EM/102-EM    Mar 2011
Fire          FCO 103/104          Mar 2011
Corrections   CF-FLS 102           Feb 2011
ECC           ECC-FLS 102          Oct 2010

We’ll be announcing future updates here on the Assessment Services Review so be sure and subscribe to Instant Updates on the ASR in the sidebar to the right.
As a result of ongoing feedback and our commitment to making sure our tests are the absolute best they can be, we’ve made some minor updates to PSUP series tests. Because of these changes, the names of the tests have been updated as follows:
Previous Test   Also Known As   New Test
PSUP 201        PSUP 1.2        PSUP 301
PSUP 202        PSUP 2.2        PSUP 302
PSUP 203        PSUP 3.2        PSUP 303

All links to the 200 series tests will be updated to the 300 series once available.
Changes to each test are as follows:
It is important to note that all replacement questions were written to assess the same content areas as the original questions and are supported by the books on the current reading list.
What does this mean for you?
Agencies who have already posted reading lists for their candidates to begin studying can use the new versions with absolute confidence that their candidates have studied for the correct test.
We will begin stocking the new PSUP 300 series on Monday, March 5th, 2012. As a result of low quantities on the older versions, it is very likely that we will not be able to fill orders for PSUP 200 series tests. Because of this, existing orders will be switched over to the 300 series tests.
If you have any questions about this switchover or are concerned with how it may affect you, do not hesitate to contact us.
It’s hard to believe that it has been two years since we last printed our test products and services catalog. We made a decision at that time to switch to a two-year catalog cycle and quite a lot has changed since then!
Besides updating you on the latest assessment product information, you’ll notice that it is no longer just one catalog. Instead, we’ve split it into a unique publication for each of the five different services we support:
By having separate catalogs, we can better align our products with your specific needs.
Inside each catalog we’ve reconfigured the product listings to give you a comparison view of the various tests we offer and the content areas they cover. The new layout will make it easier to see at a glance what makes one entry-level test different from another.
You’ll also find announcements about new products and services we’ll be offering in the coming year.
All of our current test security agreement signers will receive catalogs automatically, along with anyone who has expressed interest in our products in the past and provided us with their contact information. We’ve done our best to make sure your department receives only the catalogs most pertinent to your needs.
Catalogs should arrive in the mail within the next couple of weeks. If you do not receive a mailing or would like to request a catalog, please fill out the form below and we will add you to our mailing list.
In the two previous articles, we looked at the statistical and technical aspects of item analysis. Individual test developers will view the statistical computations and their value differently based upon their knowledge of statistics and their understanding of their application. However, a test developer or test user with a rudimentary understanding of item analysis can still make accurate decisions regarding the effectiveness of test items and, therefore, written exams. As we emphasized previously, IPMA-HR conducts item analyses on potential test items in its test development process and maintains item analysis data from successive administrations of all exams. These practices ensure that only items that perform well continue to be utilized and, in addition, reflect a standard that all test developers should employ. Also note that for our discussion, we will be focusing on typical four-response multiple-choice items and true/false items.
Effective utilization of item analysis information for item and test revision is where science meets art. This process of “cleaning” up test items and tests requires utilization of the basic information from an item analysis, effective analysis of the applicant response data and application of the information available for developing good test items. There is an extensive amount of scholarly information available on item writing as well as response theory and the effective practitioner should take the time to review some of this information prior to writing or “repairing” test items. It should also be noted that even though the information provided in this article focuses on actual test developers, it can also be extremely valuable for those who purchase or lease tests since it can assist them in evaluating the quality of tests they are considering.
As indicated in the previous articles, the first review of item analysis data should look at Item Difficulty and Item Discrimination. Item difficulty ideally should range from .4 to .6 with items below .3 and above .7 being flagged for review. Items with a negative Discrimination Index and items with an Index below .3 should also be flagged for review. Once this review has been completed, the test developer has the opportunity to apply analytical abilities in evaluating the item response patterns. Computer programs that conduct item analyses and provide the actual number of test takers that selected each possible response are particularly helpful in this process.
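The first-pass review described above amounts to a simple screening rule, sketched below in Python. The thresholds come from the paragraph; the function name, the wording of the flags and the sample items are hypothetical, and treating a value that falls exactly on a threshold as acceptable is an assumption of this sketch.

```python
def flag_item(difficulty, discrimination):
    """Return the reasons an item should be flagged for review (may be empty)."""
    reasons = []
    if difficulty < 0.3:
        reasons.append("too difficult")
    elif difficulty > 0.7:
        reasons.append("too easy")
    if discrimination < 0:
        reasons.append("negative discrimination")
    elif discrimination < 0.3:
        reasons.append("low discrimination")
    return reasons


# Made-up item statistics: (difficulty index, discrimination index).
items = {"Q1": (0.50, 0.45), "Q2": (0.85, 0.10), "Q3": (0.20, -0.10)}
for name, (p, d) in items.items():
    print(name, flag_item(p, d))
```

A flag is only a prompt for the closer look at response patterns that the rest of this article describes; it does not by itself mean an item should be discarded.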
In reviewing items with Difficulty Indexes of .7 and above, which indicate that the item is too easy because 70% or more of test takers answered it correctly, the test developer should be alerted that either the correct response is too transparent or obvious or the distractors (incorrect responses) do not represent viable answers. In looking at the distractors, it is important to see if there is one that was selected more frequently than the others and to use the wording of this distractor as a template for the other distractors. In that regard, all potential answers should be worded similarly, represent the same reading level, include common terms and be of similar length. Essentially, the key to writing items at an appropriate difficulty level is to ensure that the item accurately reflects knowledge needed to perform the job and is written at the same level as the required knowledge, with all distractors appearing as plausible as the keyed answer. Reading the stem and following it with each possible answer will give insights as to whether each distractor is plausible. In particular, this is valuable for double-checking grammar and subject-verb agreement, since any distractor that obviously doesn’t agree with the stem is a throwaway.
Experience has shown that item difficulty can be increased by revising items to incorporate five options instead of four, with the fifth being “all of the above” or “none of the above.” Similarly, items can be revised to incorporate multiple answers, with the keyed response indicating, for example, that “both A and B are correct” or some similar variation. Conversely, items written in this format that proved to be too difficult can be revised to eliminate the options of “none of the above” or “all of the above.” Note too that in order for these options to truly be effective, there must be some items where the keyed response is “all of the above” or “none of the above.” Writing and keying items in this fashion can increase the difficulty of a test, so it is important to continue to evaluate their performance through ongoing item analysis procedures in the manner practiced by IPMA-HR. As indicated previously, it is often necessary to continue to gather item analysis information since the stability of the results, and thus their reliability, increases as the number of test takers increases. Most test developers endorse a sample of one hundred or more as a good number for establishing the reliability of item analysis results and recommend caution in applying item analysis with fewer test takers.
Also note that there may be times when some “easy” items are intentionally included in a written exam. Some test developers use such items to help build some confidence in test takers. If this is part of the test developer’s goal, easier items should be grouped together at the beginning of a test or test section. In addition, there may be an appropriate use of items that would be considered easy in a situation where a test is used to determine mastery rather than ranking. For example, if there is a position within your agency that requires knowledge of basic math to perform effectively and a threshold level can be established, then everyone who meets that threshold would pass the test. Candidates would not be ranked since greater knowledge of math does not predict greater job performance, but the test would still be utilized even with relatively “easy” items because basic math is critical to the performance of the job.
Review of items with Difficulty that is below .3 indicating that an item is rather difficult with 30% or fewer candidates getting the item correct is akin to the review of items that are too easy. Again, the stem and possible answers must be reviewed to ensure the keyed answer is correct and also for the level of the knowledge being measured and the congruency of the stem with the possible answers. Response patterns can provide insight here as well particularly if there are distractors that garner an overwhelming percentage of the responses. These distractors need to be reviewed to develop theories as to what makes them so attractive and perhaps what can be done to reduce their level of plausibility.
As with items that would normally be considered too easy, there may be times when items considered difficult play a role in developing a test. Tests that are used for ranking need to provide a spread of candidates, and difficult items can contribute to this spread by increasing the ceiling of the test. So difficult items can be valuable as long as they test knowledge where more is better and they are not written at a level beyond what is required for the job.
In reviewing items with a negative Discrimination Index, those items where those with the lowest test scores answered the item correctly more often than those with the highest test scores, one of the first things that should be checked is the keyed answer. Miskeyed items can frequently be the cause of a negative Discrimination Index.
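One way to check for a miskeyed item is to look at which option the highest scorers actually chose. The Python sketch below is only a hypothetical heuristic, not a standard procedure: the function name and the response data are made up, and any option it reports still needs review by a subject matter expert.

```python
from collections import Counter


def suspect_miskey(top_group_responses, keyed_answer):
    """Return the option the top group chose more often than the keyed answer.

    Returns None when the keyed answer is the most popular choice (or tied
    for it) among the top group, or when there are no responses.
    """
    counts = Counter(top_group_responses)
    if not counts:
        return None
    option, chosen = counts.most_common(1)[0]
    if option != keyed_answer and chosen > counts.get(keyed_answer, 0):
        return option
    return None


# Made-up responses from the top-scoring group on an item keyed "B".
top_group = ["C", "C", "C", "C", "B", "C", "A", "C"]
print(suspect_miskey(top_group, keyed_answer="B"))
```

Here most of the strongest candidates chose "C", which suggests the answer key should be checked against the source material before the item is rewritten.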
Typically, items with a low discrimination index are items that encourage good test performers to overanalyze a question and therefore see one or more of the distractors as being as plausible as the keyed answer. This can also occur when items tend to be relatively easy and lead good performers to assume that there must be more to the item than is seen at first glance. Again, correcting these items involves a review of answer patterns and developing hypotheses regarding what led to the observed results. Once the hypotheses are developed, the guidelines related to writing good test items can be applied to overcome the inadequacies of the items.
Essentially, improving test quality involves trial administrations of the test and then analysis of the items with hypotheses being developed in regard to what went wrong with an item and then applying the correct item development principles to overcome the issues with the item. Being a good test developer involves writing items applying what we know about what makes good items and then rewriting items based on information obtained from an item analysis. As stated previously, this is as much art as it is science and the ability to write good tests only comes through practice and research.
This is the final part of our three-part series on item analysis in public safety departments. In case you missed it, check out Robert Burd’s previous series, Succession Planning in Public Safety.
Mr. Burd’s next series will begin in March, covering topics such as successive hurdles, test weighting and certification rules.
The Assessment Center Educational Materials are a comprehensive guide to the complicated process of administering an assessment center in your organization. Whether you’re using an in-house assessment center system or one developed for you by an outside organization, the ACEM is an invaluable tool to make sure your administrators, assessors and candidates are informed, prepared and know what to expect during the process.
Starting immediately, the ACEM is also $50 less! We’ve lowered the price to $249 and you can order it online today.
The Assessment Center Educational Materials include the following manuals:
Additionally, the ACEM includes a CD-ROM that contains a wide range of documents to assist you during an assessment center:
As with all of our publications, you must have a Test Security Agreement on file with us before you can order the Assessment Center Educational Materials.
Upon request, we can also provide you with an inspection copy of the ACEM, which includes a table of contents for the various manuals provided, explicitly describing the topics covered within each. You can request your inspection copy by emailing us or calling us at (800) 381-8378.
In the previous article, we began our discussion of the valuable information available through conducting an item analysis and we focused on the two most readily available pieces of information. First, of course, is the Difficulty Index. Just as the name implies, this index is an indicator of the difficulty of the item. It is expressed as a proportion or percentage and reflects the number of candidates who got the item right out of the total number of candidates who responded to the item. That is, if nine out of ten respondents answered an item correctly, the index would be .9 or 90%. From this illustration, we can also see that the Index actually has an inverse relationship with the difficulty of the item. That is, the higher the index or percentage, the easier the item is.
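The nine-out-of-ten illustration can be reproduced with a short Python sketch. The function name and the response data are hypothetical; candidates who skipped the item are represented here by None and are left out of the calculation, in line with the definition above.

```python
def difficulty_index(responses, keyed_answer):
    """Proportion of responding candidates who chose the keyed answer."""
    answered = [r for r in responses if r is not None]  # drop non-responses
    if not answered:
        raise ValueError("no candidate responded to this item")
    return sum(r == keyed_answer for r in answered) / len(answered)


# Nine of ten respondents choose the keyed answer "A"; one candidate skipped.
responses = ["A"] * 9 + ["B"] + [None]
print(difficulty_index(responses, keyed_answer="A"))
```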
The second Index we discussed was the Item Discrimination Index. Essentially, this index reflects how the candidates who performed best on the test responded to a specific item when compared to how the candidates who performed the worst on the test responded to that same item. The top 27% of test performers and the bottom 27% of test performers are used for calculating the Discrimination Index, which compares the proportion of the top group that answered the item correctly with the proportion of the bottom group that answered the item correctly. It is commonly computed as the difference between those two proportions, so a negative index means the bottom group outperformed the top group on the item.
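A minimal Python sketch of this calculation follows. It assumes the common upper-lower form of the index, that is, the proportion correct in the top 27% minus the proportion correct in the bottom 27%; the function name, the rounding of the group size and the sample data are assumptions made for illustration.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Proportion correct in the top group minus that in the bottom group.

    total_scores[i] is candidate i's total test score; item_correct[i] is
    1 if that candidate answered this item correctly and 0 otherwise.
    """
    n = len(total_scores)
    if n < 2 or n != len(item_correct):
        raise ValueError("need matching data for at least two candidates")
    k = max(1, round(n * fraction))  # size of each comparison group
    order = sorted(range(n), key=lambda i: total_scores[i])
    bottom, top = order[:k], order[-k:]
    p_top = sum(item_correct[i] for i in top) / k
    p_bottom = sum(item_correct[i] for i in bottom) / k
    return p_top - p_bottom


# Made-up data for ten candidates, listed from highest to lowest total score.
scores = [95, 90, 88, 80, 75, 70, 65, 60, 55, 50]
correct = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
print(discrimination_index(scores, correct))
```

With ten candidates each group contains three people. All three top scorers answered correctly and one of the three bottom scorers did, so the index is positive.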
These are both valuable measures in evaluating item performance because of the information they provide and the fact that they tend to be gross measures in regard to their interpretation. In evaluating an exam by utilizing an item analysis it can be seen that any items that have Difficulty levels indicating that almost everyone is getting them right or wrong need to be flagged for scrutiny. Likewise any items that have Discrimination levels indicating that the top 27% did not perform significantly better than the bottom 27% need to be flagged for scrutiny. In particular, any items with a negative index indicating that poorer performers answered the item correctly more often than top performers need to be fixed or removed.
That is truly the value of an item analysis for test developers: it provides information on how items performed and clues to how poorly performing items may be improved. For example, an item with a negative discrimination index may actually be miskeyed. This is where the additional information supplied by computer programs is particularly valuable, since their printouts detail the responses from all test takers who responded to the item along with the number of those who did not respond. Beyond utilizing the “gross measures” that serve to identify ineffective or bad items, utilizing this molecular view of each item provides information that can assist in developing theories as to what was actually going on with a test item when viewed by test takers. We will discuss that more in the final article on item analysis. At this point, we will continue our review of other statistics provided by item analysis and their value in evaluating item and exam performance.
The Difficulty Index of each individual item is reflected in the Mean of the test, which is the arithmetic average score, and tells the test developer how that item is contributing to the Mean. Again, the beauty of the Difficulty Index is its simplicity. It is easy to use and easy to interpret. Items with high indexes, indicating that almost everyone answered the item correctly, serve to raise the Mean of the test. A test dominated by such items is easy and does not provide the discrimination in performance among test takers that is necessary for it to be truly useful. On the other hand, if the Difficulty Index is very low, it means the item is difficult and it will tend to depress the Mean of the test. A test heavily populated with difficult items is just as problematic as a test that is too easy. It is possible that the test sample represents a group of weak candidates, but it is more likely that the items were written at a level beyond what is necessary for successful job performance. Discerning the difference is part of the art of test writing and will be covered in more detail in the next article.
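The link between item difficulty and the Mean can be illustrated with a small sketch. It assumes every item is worth one point and that every candidate answered every item; under those assumptions the mean score equals the sum of the item Difficulty Indexes. The response data and the function names below are made up for illustration.

```python
from statistics import mean

# Rows are candidates, columns are items; 1 = correct, 0 = incorrect.
# Hypothetical data, not taken from any real exam.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
]

def item_difficulties(matrix):
    """Difficulty Index (proportion correct) for each item column."""
    n_candidates = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_candidates for j in range(n_items)]

def mean_score(matrix):
    """Arithmetic average of the candidates' number-correct scores."""
    return mean(sum(row) for row in matrix)

p_values = item_difficulties(responses)
print(p_values)  # [0.8, 0.6, 0.2, 0.8]

# With one point per item, the two quantities agree (up to rounding error).
print(sum(p_values), mean_score(responses))
```

In this toy data the two easy items (.8) contribute most of the mean, while the hard item (.2) contributes very little.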
Two common statistics associated with the Discrimination Index are the Biserial Correlation and the Point Biserial Index. Textbooks indicate that these measures are time consuming and difficult to compute by hand and that they are also somewhat difficult to interpret. The Biserial Index is intended to provide a measure of the validity of an item, and the Point Biserial Index is intended to show how performance on an item correlates with performance on the test. In the past, computation and interpretation of these indexes made them less than ideal for use by the average practitioner developing tests. Now, commercially available item analysis programs routinely include calculations for them. Still, their formulas and the interpretation of their results are much more difficult and less intuitively apparent than the Difficulty and Discrimination Indexes.
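For readers curious about what such programs compute, the following is a minimal sketch of the Point Biserial Index using the standard textbook formula (the difference between the mean total scores of those who got the item right and those who got it wrong, divided by the standard deviation of all scores, times the square root of p times 1 - p). It is not the method of any particular commercial program. The function name is hypothetical, and the sketch assumes the total score includes the item being analyzed.

```python
import math
from statistics import mean, pstdev

def point_biserial(item_correct, total_scores):
    """Correlation between a right/wrong item and the total test score.

    item_correct: True/False per candidate for one item.
    total_scores: total test score per candidate, in the same order.
    """
    if len(item_correct) != len(total_scores):
        raise ValueError("inputs must have the same length")
    right = [s for c, s in zip(item_correct, total_scores) if c]
    wrong = [s for c, s in zip(item_correct, total_scores) if not c]
    if not right or not wrong:
        raise ValueError("item must have both correct and incorrect responses")
    spread = pstdev(total_scores)  # population standard deviation
    if spread == 0:
        raise ValueError("total scores have no variation")
    p = len(right) / len(item_correct)  # the item's Difficulty Index
    return (mean(right) - mean(wrong)) / spread * math.sqrt(p * (1 - p))

# Four hypothetical candidates: the two highest scorers got the item right.
r = point_biserial([True, True, False, False], [4, 3, 2, 1])
print(round(r, 2))  # 0.89
```

A value near +1 means the item was answered correctly mainly by high scorers; a negative value points to the same kind of problem as a negative Discrimination Index.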
If test practitioners focus on creating items based on a thorough job analysis and on being sure that the items created are an accurate measure of the knowledge and ability identified in that job analysis, they can have a fair amount of confidence in the validity of their items. That confidence can be strengthened by having items reviewed by subject matter experts (SMEs) to determine if their content, difficulty level, and reading levels are appropriate for inclusion in the written instrument. Once the exam has been developed and appropriate reviews have been conducted, the Item Difficulty and the Discrimination Index can serve as very effective measures in evaluating performance as well as an effective guide for improving test performance, which will be the focus of the next article.
This is part two of a three-part series on item analysis in public safety departments. Part 3 will be available on Feb. 29, 2012. In case you missed it, check out Robert Burd’s previous series, Succession Planning in Public Safety.
Any organization using written exams as part of their selection processes that doesn’t take the time to review the information provided by an item analysis is overlooking a treasure trove of information. Without performing an item analysis on a written exam and acting upon the information gained from such an analysis, a jurisdiction truly does not know how that exam is performing.
There are two fundamental concepts involved in test utility: test validity and reliability. Simply put, validity refers to whether or not a test measures what it is intended to measure, and reliability refers to how consistently a test measures what it is intended to measure. Item performance directly impacts these two concepts. Test items are the basic element for gathering information about test participants. In order for a test to have the necessary validity and reliability to make it worth using, test items have to perform optimally and gather the best information possible as accurately as possible. If test developers do not utilize information available from an item analysis, they have no idea how items are performing. If items are performing poorly, then all other statistical analyses become less meaningful and the test is not suited for its intended purpose. Hiring decisions based on the use of the test are questionable since they become another layer of what is ultimately a house of cards that does not have an adequate foundation. In this case, the written exam has not played its role in terms of optimizing accurate selections and supporting the mission of the organization.
An item analysis gives test users and developers a sort of molecular view of their exams and provides critical information that can be utilized to maximize the effectiveness of their tests. A word of caution is needed, however: test developers should be careful about assigning too much importance to an item analysis developed from small numbers of test takers. Typically, a sample of one hundred is considered a minimum for establishing reliable measures. Also, it is important to note that IPMA-HR sets a good example in the use of item analysis in that it conducts analyses of test items in the development of new tests to ensure that only the best performing items are included in its tests. Further, it continues to evaluate item performance on data compiled from customer test administrations of each test, which further assists in ensuring that the best performing items are used and that the effectiveness of tests is maintained at a high level.
Despite the vast amount of information on item analysis and the number of different ways of analyzing the data available from studying individual items, there are two basic pieces of information that specialists in the field agree upon and even non-statisticians can understand and apply. Specifically, these are item difficulty and the discrimination index.
Item difficulty is as straightforward as it sounds. It tells how difficult the item is in terms of those who took the test. It is expressed as a proportion and reflects the share of responding candidates who answered the item correctly. That is, if 30 candidates respond to a test item and 20 of them select the correct answer, the Item Difficulty Index is .67. The Difficulty Index is expressed as p and the formula is:
p = (number of candidates who answered the item correctly) / (total number of candidates who responded to the item)
It is important to recognize that p represents the percentage of correct responses and it is inversely related to the actual difficulty of the item. In other words, the higher p is, the easier the item. Therefore, a p of .90 indicates an easy item that almost everyone got right and a p of .10 represents a difficult item that almost everyone got wrong. Neither of these items would be particularly desirable on a test since they are not providing the discrimination among candidate performance that is desirable on a written exam.
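The calculation of p can be sketched in a few lines. The function name and the use of None to mark an omitted response are assumptions made for this illustration; omitted responses are left out of the denominator, following the definition of p as the proportion of candidates who responded to the item.

```python
from typing import Optional, Sequence

def difficulty_index(responses: Sequence[Optional[bool]]) -> float:
    """Proportion of responding candidates who answered the item correctly.

    Each entry is True (correct), False (incorrect), or None (no response).
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no candidate responded to this item")
    return sum(answered) / len(answered)

# 30 candidates respond and 20 of them select the correct answer.
item = [True] * 20 + [False] * 10
print(round(difficulty_index(item), 2))  # 0.67
```

Remember that a higher result means an easier item.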
Item Difficulty levels of .5 yield the maximum discrimination among candidates’ performance since they indicate that half of the candidates got the item correct and half got it incorrect. Ideally, item difficulty should range between .4 and .6, with items falling below .3 or above .7 being marked for review and possible revision.
The Item Discrimination Index measures how well individual items contribute to the measurement of the knowledge or ability the test is intended to measure. Since this measure is affected by the homogeneity of test items, each segment of the test should be considered separately. That is, a portion of a test intended to measure ability to perform mathematical computations should not have a Discrimination Index computed along with the portion of a test intended to measure knowledge of grammar.
The Item Discrimination Index is intended to reflect the relationship between performance on an item and the total test. Based on the way this index is computed, it can also be thought of as telling the test developer how those who did well on the test performed on the item compared to how those who did poorly on the test performed on the item. This index is computed by dividing the individuals who took the test into three groups: the 27% on the bottom, the 27% on the top, and the middle 46%. Since calculations are not done on the middle group, it is often said that the individuals taking the test are divided into two groups, top and bottom, but since computations use the top and bottom 27%, a middle group is automatically created.
Fortunately, numerous computer programs are available for calculating the top and bottom 27% and determining the Discrimination Index, such as the one available from the University of Maryland Scantron Scoring software (available from IPMA-HR by using our scoring service to score your tests). Since this information is so readily available, it is important to understand it and be able to apply it to improving the performance of written exams. Essentially, the calculations these programs perform involve selecting the top and bottom 27% from the test and then looking at the number of correct responses in each group.
Based on a test sample of 100, there would be 27 in both the top and bottom groups. If 20 candidates in the top group answered the item correctly and 9 people in the bottom group answered the item correctly, we would subtract 9 from 20 and then divide that number by the 27 in the top group. The result would be a Discrimination Index for that item of .41. That is:
(20 - 9) / 27 = 11 / 27 = .41 (rounded)
Eleven divided by twenty-seven equals approximately .41, which in this case indicates that the proportion of the top group answering the item correctly was 41 percentage points higher than the proportion of the bottom group. The Item Discrimination Index ranges from -1.00 to +1.00. That is, the index can range from everyone in the bottom group getting the item right and everyone in the top group getting it wrong to everyone in the top group getting the item right and everyone in the bottom group getting it wrong. When expressed in these terms, it can be seen how this index takes into consideration candidates’ overall performance on the test and compares how the best performers on the test did on the item with how the worst performers did on the item.
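The top and bottom 27% calculation described above can be sketched as follows. This is only an illustration, not the procedure used by any specific scoring program: the function name is hypothetical, the group size is rounded to the nearest whole candidate, and tied total scores at the group boundary are simply left in their original order, which is an assumption a real program may handle differently.

```python
from typing import Sequence

def discrimination_index(total_scores: Sequence[float],
                         item_correct: Sequence[bool],
                         fraction: float = 0.27) -> float:
    """(correct in top group - correct in bottom group) / group size."""
    if len(total_scores) != len(item_correct):
        raise ValueError("inputs must have the same length")
    n = len(total_scores)
    group_size = round(n * fraction)
    if group_size == 0:
        raise ValueError("sample too small to form top and bottom groups")
    # Candidate positions ranked from highest to lowest total score.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    top = ranked[:group_size]
    bottom = ranked[-group_size:]
    top_correct = sum(item_correct[i] for i in top)
    bottom_correct = sum(item_correct[i] for i in bottom)
    return (top_correct - bottom_correct) / group_size

# 100 hypothetical candidates with distinct total scores from 100 down to 1.
scores = list(range(100, 0, -1))
# 20 of the top 27 and 9 of the bottom 27 answer the item correctly;
# the middle 46 are filled in arbitrarily because they are not used.
item = [True] * 20 + [False] * 7 + [True] * 23 + [False] * 23 + [True] * 9 + [False] * 18
d = discrimination_index(scores, item)
print(round(d, 2))  # 0.41
```

Changing the responses of the middle 46 candidates leaves the result unchanged, which mirrors the point that no calculations are done on the middle group.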
Items with negative Discrimination Indices need to be eliminated if the problem with the item cannot be diagnosed and remedied. Item Discrimination Indices above .30 are generally considered acceptable, with items below that needing careful scrutiny for editing that could improve their performance. In the two articles that will follow, we will look at the additional information available from the typical item analysis and discuss how this information can be utilized to improve test performance.
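The review guidelines quoted in this article (difficulty below .3 or above .7, discrimination below .30, and negative discrimination) can be collected into a small screening helper. This is a sketch only: the function name and flag wording are invented for illustration, and treating values exactly at a threshold as acceptable is an assumption, since the guidelines do not say how boundary values should be handled.

```python
def review_flags(p: float, d: float) -> list:
    """Return review notes for an item with Difficulty Index p and
    Discrimination Index d, using the thresholds quoted in the article."""
    flags = []
    if p < 0.3:
        flags.append("difficulty: very hard item (p below .3), review")
    elif p > 0.7:
        flags.append("difficulty: very easy item (p above .7), review")
    if d < 0:
        flags.append("discrimination: negative, diagnose and fix or remove")
    elif d < 0.3:
        flags.append("discrimination: below .30, review for possible editing")
    return flags

print(review_flags(0.5, 0.41))   # [] (nothing to review)
print(review_flags(0.9, -0.10))  # two flags: too easy, negative discrimination
```

A flag is a prompt for human review, not an automatic decision to discard the item.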
This is part one of a three-part series on item analysis in public safety departments. Part 2 will be available to read on the Assessment Services Review on Feb. 22, 2012. Part 3 will be available on Feb. 29, 2012. In case you missed it, check out Robert Burd’s previous series, Succession Planning in Public Safety.