Testing and Assessment - Reliability and Validity
Chapter 3: Understanding Test Quality - Concepts of Reliability and Validity

Though reliability and validity are different from each other, they are related, and it is a common mistake to treat the terms "validity" and "reliability" as interchangeable. They are the two most important features of a test. Reliability concerns how consistently a test measures; validity concerns what the test measures, how well it measures it, and, for employment tests, the extent to which that content is essential to job performance. Validity evidence indicates that there is a linkage between test performance and job performance.
Reliability and Validity
Picture a target where the center is the concept you are trying to measure and each shot is one measurement. In the first scenario, your hits cluster tightly but away from the center: you are consistently and systematically measuring the wrong value for all respondents. This measure is reliable, but not valid; that is, it is consistent but wrong. The second scenario shows hits that are randomly spread across the target. You seldom hit the center, but, on average, you get the right answer for the group, though not very well for any individual.
In this case, you get a valid group estimate, but you are inconsistent. Here you can clearly see that reliability is directly related to the variability of your measure. The third scenario shows a case where your hits are spread across the target and you are consistently missing the center. Your measure in this case is neither reliable nor valid.
Finally, we see the "Robin Hood" scenario -- you consistently hit the center of the target. Your measure is both reliable and valid (I bet you never thought of Robin Hood in those terms before). Another way to think about the relationship between reliability and validity is shown in the figure below.
Here, we set up a 2x2 table. The columns of the table indicate whether you are trying to measure the same or different concepts. The rows show whether you are using the same or different methods of measurement. Imagine that we have two concepts we would like to measure: students' verbal and math ability.
Furthermore, imagine that we can measure each of these in two ways.
First, we can measure each ability with a written test.
Second, we can ask the student's classroom teacher to give us a rating of the student's ability based on their own classroom observation. The first cell on the upper left shows the comparison of the verbal written test score with itself. But how can we compare a measure with itself? We could do this by estimating the reliability of the written test through a test-retest correlation, parallel forms, or an internal consistency measure (see Types of Reliability).
What we are estimating in this cell is the reliability of the measure. This estimate also reflects the stability of the characteristic or construct being measured by the test.
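As a minimal sketch of the test-retest estimate just described, the reliability coefficient is simply the Pearson correlation between scores from two administrations of the same test. The score lists below are hypothetical illustration data, not from the guide:

```python
# Test-retest reliability as a Pearson correlation between two
# administrations of the same test, using only the standard library.
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [85, 78, 92, 60, 71, 88]   # hypothetical first administration
time2 = [83, 80, 90, 63, 70, 89]   # hypothetical retest two weeks later
print(round(pearson_r(time1, time2), 3))
```

A coefficient near 1.0 indicates that people kept roughly the same rank order across the two administrations, which is what a stable construct should produce.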
Some constructs are more stable than others. For example, an individual's reading ability is more stable over a particular period of time than that individual's anxiety level.
Therefore, you would expect a higher test-retest reliability coefficient on a reading test than you would on a test that measures anxiety. For constructs that are expected to vary over time, an acceptable test-retest reliability coefficient may be lower than is suggested in Table 1. Alternate or parallel form reliability indicates how consistent test scores are likely to be if a person takes two or more forms of a test.
A high parallel form reliability coefficient indicates that the different forms of the test are very similar, which means that it makes virtually no difference which version of the test a person takes.
On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably. Inter-rater reliability indicates how consistent test scores are likely to be if the test is scored by two or more raters.
On some tests, raters evaluate responses to questions and determine the score. Differences in judgments among raters are likely to produce variations in test scores. A high inter-rater reliability coefficient indicates that the judgment process is stable and the resulting scores are reliable.
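One common way to quantify agreement between two raters is Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. The guide does not name a specific statistic, so this is an illustrative choice, and the ratings below are hypothetical:

```python
# Cohen's kappa: chance-corrected agreement between two raters who
# each scored the same set of responses on a categorical scale.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category at random.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))
```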
Inter-rater reliability coefficients are typically lower than other types of reliability estimates. However, it is possible to obtain higher inter-rater reliability if raters are appropriately trained. Internal consistency reliability indicates the extent to which items on a test measure the same thing.
A high internal consistency reliability coefficient for a test indicates that the items on the test are very similar to each other in content (homogeneous). It is important to note that the length of a test can affect internal consistency reliability. For example, a very lengthy test can spuriously inflate the reliability coefficient. Tests that measure multiple characteristics are usually divided into distinct components.
Manuals for such tests typically report a separate internal consistency reliability coefficient for each component in addition to one for the whole test.
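The most widely reported internal consistency coefficient is Cronbach's alpha, although the guide does not prescribe a particular one. A minimal sketch with hypothetical item scores (rows are test takers, columns are items):

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: list of per-person lists, one score per item."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose to per-item columns
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var / total_var)

scores = [
    [4, 5, 4, 5],   # hypothetical item scores for five test takers
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [3, 4, 3, 4],
]
print(round(cronbach_alpha(scores), 3))
```

High alpha here reflects that every item rank-orders the five test takers similarly; as the text notes, alpha also rises mechanically with test length, so a long test can look more homogeneous than it is.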
Test manuals and reviews report several kinds of internal consistency reliability estimates. Each type of estimate is appropriate under certain circumstances, and the test manual should explain why a particular estimate is reported.

Standard error of measurement

Test manuals report a statistic called the standard error of measurement (SEM). It gives the margin of error that you should expect in an individual test score because of the imperfect reliability of the test. The SEM represents the degree of confidence that a person's "true" score lies within a particular range of scores.
For example, an SEM of "2" indicates that a test taker's "true" score probably lies within 2 points in either direction of the score he or she receives on the test. This means that if an individual receives a 91 on the test, there is a good chance that the person's "true" score lies somewhere between 89 and 93. The SEM is a useful measure of the accuracy of individual test scores.
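The SEM follows directly from a test's standard deviation and reliability. The values below are assumptions chosen so the SEM comes out near the 2-point example in the text:

```python
# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
import math

def sem(sd, reliability):
    """Standard error of measurement for a test with the given score SD."""
    return sd * math.sqrt(1 - reliability)

sd, reliability = 10.0, 0.96       # assumed test statistics, for illustration
e = sem(sd, reliability)           # 10 * sqrt(0.04) = 2.0
observed = 91
print(f"SEM = {e:.1f}; true score likely in [{observed - e:.0f}, {observed + e:.0f}]")
```

Note how the formula encodes the point made throughout this section: as reliability approaches 1, the SEM shrinks toward 0 and individual scores become more trustworthy.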
The smaller the SEM, the more accurate the measurements. When evaluating the reliability coefficients of a test, it is important to review the explanations provided in the manual for the following:

- Types of reliability used. The manual should indicate why a certain type of reliability coefficient was reported. It should also discuss sources of random measurement error that are relevant for the test.
- How reliability studies were conducted. The manual should indicate the conditions under which the data were obtained, such as the length of time that passed between administrations of a test in a test-retest reliability study. In general, reliabilities tend to drop as the time between test administrations increases.
- The characteristics of the sample group. The manual should indicate the important characteristics of the group used in gathering reliability information, such as education level, occupation, etc.
This will allow you to compare the characteristics of the people you want to test with the sample group.
If they are sufficiently similar, then the reported reliability estimates will probably hold true for your population as well. Appendix A lists some possible sources.

Test validity

Validity is the most important issue in selecting a test. Validity refers to what characteristic the test measures and how well the test measures that characteristic.
Validity tells you whether the characteristic being measured by a test is related to job qualifications and requirements. Validity gives meaning to the test scores. Validity evidence indicates that there is a linkage between test performance and job performance. It can tell you what you may conclude or predict about someone from his or her score on the test. If a test has been demonstrated to be a valid predictor of performance on a specific job, you can conclude that persons scoring high on the test are more likely to perform well on the job than persons who score low, all else being equal.
Validity also describes the degree to which you can make specific conclusions or predictions about people based on their test scores. In other words, it indicates the usefulness of the test. Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used. It is important to understand the differences between reliability and validity.
Validity will tell you how good a test is for a particular situation; reliability will tell you how trustworthy a score on that test will be. You cannot draw valid conclusions from a test score unless you are sure that the test is reliable. Even when a test is reliable, it may not be valid. You should be careful that any test you select is both reliable and valid for your situation. A test's validity is established in reference to a specific purpose; the test may not be valid for different purposes.
For example, the test you use to make valid predictions about someone's technical proficiency on the job may not be valid for predicting his or her leadership skills or absenteeism rate.
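The earlier point that you cannot draw valid conclusions from an unreliable test can be made quantitative with a classical psychometric bound (the attenuation ceiling): the observed correlation between a test and a job-performance criterion cannot exceed the square root of the product of their reliabilities. A sketch with hypothetical reliability values:

```python
# Attenuation ceiling: the observed validity coefficient r_xy is bounded
# by sqrt(r_xx * r_yy), where r_xx and r_yy are the reliabilities of the
# test and the criterion measure.
import math

def max_validity(test_reliability, criterion_reliability=1.0):
    """Upper bound on the observed test-criterion correlation."""
    return math.sqrt(test_reliability * criterion_reliability)

print(round(max_validity(0.81), 2))        # bound even with a perfect criterion
print(round(max_validity(0.81, 0.64), 2))  # bound with an unreliable criterion
```

So a test with reliability 0.81 can never show a validity coefficient above 0.9, no matter how job-relevant its content, which is why reliability is a precondition for validity rather than a substitute for it.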
This purpose-specificity leads to the next principle of assessment: a test's validity is also established in reference to specific groups, called reference groups, and the test may not be valid for different groups.
For example, a test designed to predict the performance of managers in situations requiring problem solving may not allow you to make valid or meaningful predictions about the performance of clerical employees. If, for example, the kind of problem-solving ability required for the two positions is different, or the reading level of the test is not suitable for clerical applicants, the test results may be valid for managers, but not for clerical employees.
Test developers have the responsibility of describing the reference groups used to develop the test. The manual should describe the groups for whom the test is valid, and the interpretation of scores for individuals belonging to each of these groups.
You must determine whether the test can be used appropriately with the particular type of people you want to test. This group of people is called your target population or target group. Use assessment tools that are appropriate for the target population. Your target group and the reference group do not have to match on all factors, but they must be sufficiently similar so that the test will yield meaningful scores for your group.
For example, a writing ability test developed for use with college seniors may be appropriate for measuring the writing ability of white-collar professionals or managers, even though these groups do not have identical characteristics.
In determining the appropriateness of a test for your target groups, consider factors such as occupation, reading level, cultural differences, and language barriers.