Notes from text
Boudett, City, Murnane
Chapter 2: Building Assessment Literacy
Assessments should be of middling difficulty; extremely easy or extremely hard tests give you little information about what students know.
Sampling principle of testing- making inferences about students' knowledge of an entire domain from a smaller sample.
Discrimination- discriminating items reveal differences in proficiency that already exist among students.
Measurement error- inconsistencies in scores that arise, for example, because different forms of a test sample content differently, because people's behavior fluctuates, and because individual scorers differ.
Reliability- degree of consistency of measurement; a reliable measure is one that gives you nearly the same answer time after time
Score inflation- increases in scores that do not reflect a true increase in students' proficiency
Sampling error refers to inconsistency that arises from choosing the particular people from whom to take measurements.
The margin of error is simply a way to quantify how much the results would vary from one sample to another.
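As a rough illustration of the margin of error, here is a minimal sketch that assumes a simple random sample and uses the common 95% normal approximation (z = 1.96). The scores and the function name `margin_of_error` are made up for illustration; they are not from the text.

```python
import math
import statistics

def margin_of_error(scores, z=1.96):
    """Approximate 95% margin of error for the mean of a simple random sample."""
    n = len(scores)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return z * standard_error

# Hypothetical scores from one sample of ten students
sample = [62, 70, 75, 81, 68, 77, 73, 66, 79, 69]
mean = statistics.mean(sample)
moe = margin_of_error(sample)
print(f"mean = {mean:.1f}, margin of error = +/- {moe:.1f}")
```

A different sample of ten students would give a somewhat different mean; the margin of error describes how large that sample-to-sample variation is likely to be. It shrinks as the sample gets larger.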
While a well-designed test can provide valuable information, there are many questions it cannot answer. How well does a person persevere in solving problems that take a long time and involve many false starts? To what extent has a student developed the dispositions we want, for example, a willingness to try applying what she has learned in math class to problems outside of school? How well does the student write long and complex papers requiring repeated revision? People demonstrate growth and proficiency in many ways that would not show up on any single test.
Significant decisions about a student should not be made on the basis of a single score.
Raw scores- percentage of possible credit. They are difficult to interpret and compare because they depend on the difficulty of the test, which is likely to vary.
Norm-referenced tests- designed to describe performance in terms of a distribution of performance. Individual scores are reported in comparison to others (a norm group).
Percentile rank- percentage of students in the norm group performing below a particular student's score. PR tells you where a student stands, but only relative to a specific comparison group taking a specific test.
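The percentile rank definition above can be sketched in a few lines. This is an illustrative sketch only: the norm groups are invented, and `percentile_rank` is a hypothetical helper that counts scores strictly below the student's score, following the definition in these notes.

```python
def percentile_rank(score, norm_group):
    """Percentage of norm-group scores strictly below `score`."""
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)

# Two hypothetical norm groups (made-up scores)
norm_group_a = [48, 52, 55, 60, 61, 65, 70, 72, 80, 91]
norm_group_b = [60, 65, 68, 69, 70, 75, 80, 85, 90, 95]

pr_a = percentile_rank(70, norm_group_a)
pr_b = percentile_rank(70, norm_group_b)
print(pr_a, pr_b)
```

The same score of 70 gets a different percentile rank in each group, which is why a PR is only meaningful relative to its comparison group.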
Criterion-referenced tests- determine whether a student has mastered a defined set of skills or knowledge; measure whether a student has reached a preestablished passing level (cut score). They do not rank students and serve only to differentiate those who passed from those who failed.
Standards-referenced tests- developed by specifying content standards and performance standards; scored with various performance levels
Developmental (vertical) scales- trace a student's development as he or she progresses through school
Grade equivalents- developmental scores that report a student's performance by comparing it to the median performance at a specific stage of schooling; easy to interpret and explain, but they have become unpopular and are rarely used. Ex: 3.7 would be a third grader in the seventh month of school.
Developmental scale (standard) score- reports performance on an arbitrary numerical scale; students who score the same are believed to have the same proficiency even if they are in different grades.
When interpreting the results of a single test, it is often useful to obtain performance data from more than one scale.
For purposes of diagnosis and instructional improvement, most educators want more detail rather than less. Although finer-grained levels of detail are instructionally more useful, the results will also be less reliable because fewer items are used in reporting performance.
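One way to see why fewer items means lower reliability is the Spearman-Brown formula from classical test theory. The notes do not mention this formula; it is brought in here as an assumption, and it presumes the items are roughly parallel (interchangeable in what they measure). The reliability value and item counts below are made up.

```python
def spearman_brown(reliability, length_ratio):
    """Predicted reliability when test length is multiplied by `length_ratio`.

    Spearman-Brown formula: k * r / (1 + (k - 1) * r),
    where r is the current reliability and k is the length ratio.
    """
    k, r = length_ratio, reliability
    return k * r / (1 + (k - 1) * r)

# Hypothetical: a 40-item test with reliability 0.90,
# reported as a subscore based on only 8 of those items.
full_test_reliability = 0.90
subscore_reliability = spearman_brown(full_test_reliability, 8 / 40)
print(f"predicted subscore reliability: {subscore_reliability:.2f}")
```

Under these assumptions the 8-item subscore is noticeably less reliable than the full test, which is the trade-off the paragraph above describes.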
Cohort-to-cohort change model- when schools test a given grade every year and gauge improvement by comparing each year's scores for students in that grade to the scores of the previous year's students in that grade (mandated by NCLB).
Longitudinal (value-added) assessment- measures the gains shown by a given cohort of students as it progresses through school.
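The two change models compare different pairs of numbers. The sketch below is a simplified illustration with invented mean scores; it assumes the scores sit on a developmental (vertical) scale so that grades can be compared, and it ignores students moving in or out of the school. Real value-added models are more involved than a simple difference of means.

```python
# Hypothetical mean scale scores keyed by (year, grade)
mean_scores = {
    (2022, 3): 410, (2022, 4): 430,
    (2023, 3): 415, (2023, 4): 438,
}

def cohort_to_cohort_change(scores, year, grade):
    """Same grade, different students: this year's grade vs. last year's."""
    return scores[(year, grade)] - scores[(year - 1, grade)]

def longitudinal_gain(scores, year, grade):
    """Same students, one grade later: follows the cohort up a grade."""
    return scores[(year, grade)] - scores[(year - 1, grade - 1)]

c2c = cohort_to_cohort_change(mean_scores, 2023, 4)
gain = longitudinal_gain(mean_scores, 2023, 4)
print(f"cohort-to-cohort change: {c2c}, longitudinal gain: {gain}")
```

The cohort-to-cohort figure compares two different groups of fourth graders, while the longitudinal figure follows one group from third to fourth grade.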
It is risky and misleading to rely on a single item to draw conclusions about a single student, both because of measurement error and because you cannot tell which skill caused the student to miss the question.
Three complementary strategies for interpreting scores on a particular assessment, all of which involve using additional information:
1. Look beyond one year's assessment results by applying either the cohort-to-cohort change or value-added assessment approach.
2. Compare your students' results with those of relevant students in the district or the state.
3. Compare your students' results on the most recent assessment with their performance on other assessments.
Three reasons why small differences should not be given credence:
1. Sampling error
2. Measurement error
3. Any given set of content standards could lead to a variety of different blueprints for a test.
Differences that are sizable or that persist for some time should be taken seriously.
To understand whether improved student scores are meaningful, educators need to determine whether teaching has been focused on increasing mastery rather than on changing scores.
If students are gaining mastery, then the improvement will show up in many different places- on other tests they take or in the quality of their later academic work- not just in their scores on their own state's test.
This book focuses on how to use assessment results to change practice in ways that make a long-term, meaningful difference for students.