I have used mathematical analysis and computer simulation to explore the year-to-year fluctuations in CSAP test results and rankings for individual schools that can be expected as a natural consequence of computing statistics on small samples, i.e., small numbers of students (fewer than, say, 1000). My primary test case is Lewis-Palmer High School (LPHS) in District #38, where in 2001 somewhat more than 300 students typically took each CSAP test.
My conclusions, for which details can be found in Sections III and IV, can be summarized as follows:
(1) Even in the absence of any other influences, we can expect year-to-year fluctuations of 10% - 15% in the fraction of students in the Proficient and Advanced groups on the 9th and 10th grade CSAP reading and writing tests, and of 15% - 20% on the 10th grade math test. A year-to-year change within these ranges is therefore not statistically significant, and one cannot assign meaning to it. This is the nature of the statistics of small samples.
(2) We can expect the total score for LPHS, which is used for ranking schools, to fluctuate randomly from year to year over a 20% - 30% range. The consequence may be that in some years LPHS will be ranked in the “Excellent” category, while in other years, purely because of such random fluctuations, it will drop into the next lower, or “High”, category. No significance can be attributed to such changes in ranking.
(3) I have found the results to be extremely sensitive to the values of the so-called “scale score” ranges that determine which students fall into the Unsatisfactory, Partially Proficient, Proficient, and Advanced groups. In my LPHS test case, varying the scale score ranges by 1% led to a 30% range of possible total scores for LPHS. I have proposed a solution, based on so-called fuzzy logic concepts, to moderate this extreme sensitivity.
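The cut-score sensitivity described in item (3) can be illustrated with a small sketch. This is only a generic illustration of the fuzzy-membership idea, not the specific scheme proposed in this report. The scale scores, the cut score of 600, the 1% shift to 606, and the ramp half-width of 20 points are all hypothetical values chosen for illustration. With a hard cut, each student counts as either 0 or 1 toward the group; with a fuzzy cut, a student near the boundary counts fractionally.

```python
# Hypothetical scale scores for ten students, clustered near the cut score.
SCORES = [560, 590, 598, 599, 600, 600, 601, 602, 610, 640]


def hard_count(scores, cut):
    """Number of students at or above a sharp cut score."""
    return sum(1.0 for s in scores if s >= cut)


def fuzzy_membership(score, cut, half_width):
    """Linear ramp from 0 to 1 across [cut - half_width, cut + half_width]."""
    low = cut - half_width
    high = cut + half_width
    if score <= low:
        return 0.0
    if score >= high:
        return 1.0
    return (score - low) / (high - low)


def fuzzy_count(scores, cut, half_width):
    """Sum of fractional memberships; students near the cut count partially."""
    return sum(fuzzy_membership(s, cut, half_width) for s in scores)


# Shift the hypothetical cut score by 1% (600 -> 606) and compare the change.
hard_change = hard_count(SCORES, 600) - hard_count(SCORES, 606)
fuzzy_change = fuzzy_count(SCORES, 600, 20) - fuzzy_count(SCORES, 606, 20)
print("change in group size, hard cut: ", hard_change)
print("change in group size, fuzzy cut:", round(fuzzy_change, 3))
```

In this toy data set the hard count moves in whole-student jumps when the cut shifts, while the fuzzy count moves by a smaller, gradual amount. The linear ramp is only one possible membership function, and its width is a free parameter.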
My overall conclusion is that, while the CSAP tests may provide guidance to teachers and administrators on how well individual students are performing, the small sample sizes mean that (all else remaining the same) the results lack the statistical significance required to measure class and school performance to nearly as fine a degree as one would hope.
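The small-sample effect behind these conclusions can be sketched with a simple simulation. This is a minimal illustration, not the simulation code used for this report. It assumes each student independently has the same fixed chance of scoring Proficient or Advanced. The underlying rate of 0.70 is a hypothetical value, and the function names are invented for this example. Only the class size of about 300 comes from the LPHS case above.

```python
import math
import random
import statistics


def simulate_fractions(n_students, p_true, n_years, seed=0):
    """Observed Proficient-or-Advanced fraction for each simulated year."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_years):
        hits = sum(1 for _ in range(n_students) if rng.random() < p_true)
        fractions.append(hits / n_students)
    return fractions


def binomial_std_error(n_students, p_true):
    """Theoretical standard deviation of the observed fraction."""
    return math.sqrt(p_true * (1.0 - p_true) / n_students)


N_STUDENTS = 300  # roughly the number of LPHS students taking each test
P_TRUE = 0.70     # hypothetical underlying rate, held constant every year

fractions = simulate_fractions(N_STUDENTS, P_TRUE, n_years=2000)
print("simulated std. dev. of yearly fraction:", statistics.pstdev(fractions))
print("binomial std. error, n = 300:          ", binomial_std_error(300, P_TRUE))
print("binomial std. error, n = 1000:         ", binomial_std_error(1000, P_TRUE))
print("lowest and highest simulated fraction: ", min(fractions), max(fractions))
```

Nothing about the school changes from one simulated year to the next, so all of the spread in the printed fractions comes from sampling alone. The standard error shrinks only as the square root of the number of students, which is why schools of a few hundred students are so exposed to this effect.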