How to interpret grade data

Standards, not curving

For years, the curve was a standby of college education. This "tool" allowed an educator to forgo the tiresome process of aligning their expectations for student performance by magically realigning student grades to fit a normal distribution.

Thankfully we've moved on from this, and educators now work tirelessly to develop goals for student performance and to assess whether or not students have reached them.

-----------------------------------

Managing grader standards in a large course

Modern college education at large institutions includes the management of introductory courses with many sections and often multiple graders. At our university most of these graders are new graduate students who secure teaching assistant appointments for funding before they secure a research assistantship. 

The Problem

The plethora of new graders in these courses results in drastically different grading standards being applied within the same course. This is not fair to the students, and the failure to return informative feedback about their performance costs them a chance for growth.

The Solution

I've found that ~1 hour per week is sufficient time to create and interpret analyses of TA grade distributions. We then discussed these analyses with our TAs in our weekly planning meetings, and individually mentored graders with divergent standards. We saw immediate gains from this intervention, and our graders felt the process was worthwhile and should be implemented in the future.

-----------------------------------

Tools for Interpreting Distributions of Grade Data

Histograms

Histograms are the default method of analyzing a distribution in most statistics tools. See Figure 1 for an example of how to interpret one. A histogram displays how frequently the values in a population fall into each specific "bin". The bin size is set by default by the program you use to create the histogram, but you should be able to edit it. Be wary: too few bins could cause you to overlook trends in the distribution of your data, while too many bins relative to your sample size could give an inaccurate picture of how your data are distributed. For more technical details on interpreting histograms I highly recommend Dr. John McDonald's free Handbook of Biological Statistics; his chapter on central tendency reviews their interpretation.
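As a rough illustration of the bin-size warning above, here is a minimal Python sketch that counts the same scores using three different bin widths. The scores, the 0 to 100 scale, and the function names bin_counts and print_histogram are all invented for this example; your statistics program will have its own binning controls.

```python
import math

def bin_counts(scores, bin_width, low=0, high=100):
    """Count how many scores fall into each bin of width `bin_width`.

    Bins are [start, start + bin_width); a score equal to `high`
    is counted in the last bin. Assumes every score lies in [low, high].
    """
    n_bins = math.ceil((high - low) / bin_width)
    counts = [0] * n_bins
    for score in scores:
        index = min(int((score - low) // bin_width), n_bins - 1)
        counts[index] += 1
    return counts

def print_histogram(scores, bin_width, low=0, high=100):
    """Print a text histogram, one row of '#' marks per bin."""
    for i, count in enumerate(bin_counts(scores, bin_width, low, high)):
        start = low + i * bin_width
        end = min(start + bin_width, high)
        print(f"{start:>3}-{end:<3} {'#' * count}")

# Hypothetical scores for one section (invented for illustration only).
example_scores = [0, 35, 52, 58, 61, 64, 66, 70, 72, 74, 75, 77,
                  78, 80, 81, 82, 84, 85, 87, 88, 90, 92, 95, 100]

for width in (50, 10, 2):  # coarse, moderate, and very fine for 24 scores
    print(f"bin width = {width}")
    print_histogram(example_scores, width)
```

With only two bins the low tail disappears into one bar, while with fifty bins most bars hold zero or one score, so neither extreme shows the shape of the distribution well.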

Normality of Grade Data

You might notice from Figure 1 that the distribution of these data does not fit nicely into a bell curve, also known as a "Gaussian distribution" or a "normal distribution". I find that this is typically true of grade data. I've dedicated a separate page to this topic; please click on this section heading to navigate there if you are interested. If you'd rather read the "cliff notes": it's OK if your data are somewhat non-normal. I find it most important to compare one grader's data to another's, not one grader's scores to a global standard.

Box Plots

Figure 1B shows a quartile plot, commonly referred to as a box plot, side by side with a histogram. Both figures show similar data; the advantage of the box plot is that you can stack plots of multiple populations side by side and compare them. These plots have a few important parts. The "box" portion of the plot represents the interquartile range, where the middle 50% of all grades fall. In most graphing software, the line inside the box marks the position of the median grade.

The "whiskers" of the plot can represent several different things depending on which program you are using. They generally represent the range that your data cover, and in simple graphing programs this is as far as the analysis goes. I generated these plots in JMP, which limits the length of each whisker to 1.5 interquartile ranges. This limit is commonly used to identify possible outliers in large populations of data: points that lie beyond the whiskers fall outside the "normal" range. You can see such examples in our sample, represented by the points below the lower whisker. It is not abnormal to observe many points like this in grade data; some students simply did not perform or did not turn in an assignment, and this contributes to the non-normality of these data.
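To make the parts of a box plot concrete, the following Python sketch computes them for one set of scores. It is only an approximation of what a graphing program does: the scores are invented, box_plot_summary is a hypothetical helper name, and the "inclusive" quartile method is an assumption, since different programs compute quartiles slightly differently and may not match these numbers exactly.

```python
import statistics

def box_plot_summary(scores, whisker=1.5):
    """Return the box, whisker ends, and possible outliers for `scores`.

    Assumption: quartiles use Python's "inclusive" method; other
    programs may use a different quartile definition.
    """
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    # Whiskers stop at the most extreme scores still inside the fences.
    inside = [s for s in scores if lower_fence <= s <= upper_fence]
    outliers = [s for s in scores if s < lower_fence or s > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(outliers),
    }

# Hypothetical scores, including one assignment that was not turned in.
example_scores = [0, 60, 70, 75, 80, 85, 90, 95, 100]
print(box_plot_summary(example_scores))
```

In this made-up sample the zero falls more than 1.5 interquartile ranges below the box, so it would be drawn as a separate point below the lower whisker rather than stretching the whisker down to it.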

Summary Statistics

The summary statistics presented in Figure 1 are another useful tool for analyzing your data. Of particular interest, the standard deviation will help you assess the amount of variation in your grader's scores. Put simply: if one grader's scores have a smaller standard deviation than the others', then that grader is more likely to give similar grades to most students.

Skewness is a measure of whether the variation in a sample of grades is distributed symmetrically around the mean. Put simply: most grade data will be negatively skewed, as we are more likely to give a score 20 points below the average than one 20 points above it. Thus, in Figure 1B the average grade was a 75, but the median of all grades was an 80. Because low scores can fall very far below the mean while high scores cannot rise as far above it, our data will be skewed. Because of this natural skewness I try to compare graders to each other unless I've transformed my data to compensate for it. To limit this page to practical considerations, this topic is dealt with, along with normality, on another page.
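The sketch below shows one way these summary statistics could be computed in Python. The scores are made up, chosen only so that the mean is 75 and the median is 80 like the values quoted above; they are not the data behind Figure 1. The skewness function uses the simple moment-based formula (third central moment divided by the 1.5 power of the second), which is an assumption: statistics packages often apply a sample-size correction, so their reported value may differ somewhat.

```python
import statistics

def skewness(scores):
    """Moment-based skewness: negative when the long tail is on the low side.

    Assumes the scores are not all identical (otherwise the variance is zero).
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    m2 = sum((s - mean) ** 2 for s in scores) / n  # second central moment
    m3 = sum((s - mean) ** 3 for s in scores) / n  # third central moment
    return m3 / m2 ** 1.5

def summarize(scores):
    """Collect the summary statistics discussed in this section."""
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "skewness": skewness(scores),
    }

# Hypothetical scores with a long low tail (invented for illustration).
example_scores = [20, 60, 75, 80, 80, 85, 90, 90, 95]
print(summarize(example_scores))
```

A single very low score pulls the mean below the median and makes the skewness negative, which is the pattern described above for typical grade data.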

-----------------------------------

Putting it all together: analyzing the distribution of grade data

Figure 2 shows a grade distribution that I observed from some of our graders early in a semester. As expected, these inexperienced graders displayed most of the characteristic misalignments.

Notice in Figure 1A how abnormally high TA-5's scores are relative to the others. High or low scores will not always be as obvious as this, but they are still relatively easy to identify at a glance. A second common trait is seen here: the lack of a whisker above the box. This indicates that at least 25% of students received maximum credit for this assignment. Although this is not impossible for some assignments, this was our students' first attempt at a challenging new type of assignment, and this group of students did not seem to perform noticeably better than the others.

Figure 1B shows a grader with a smaller range than the others. In this instance both their interquartile (IQ) range and their overall corrected range were narrower than those of graders 2, 3, and 4. Although less obvious than the errors made by TA-5, this error suggests that it is more difficult for a student either to excel or to fail in TA-1's course. Issues can arise from either too large or too small a range, but specific to grade calibration, I would set your expectations based on the most common range sizes displayed by your graders.

Figure 1C shows a far easier to miss misalignment by TA-2. Here we see a TA whose median lies far lower in the IQ range than their peers'. Note that it is as close to the bottom of the IQ range as TA-4's median is to the top of theirs. This indicates a skew in the opposite direction from what we would expect of grade data. I would generally ignore modest skews comparable to those of TA-4 or TA-1 because of the non-normality of grade data. However, a skew in the opposite direction suggests that it is comparatively more difficult to earn an exemplary grade in this section than in others, even if the means were identical. Perhaps this TA could look to replace some of their numerical demerits with formative feedback.
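The three checks described above (median level, width of the box, and where the median sits inside the box) can also be tabulated per grader rather than read off a plot. The following Python sketch is one possible way to do that, not the procedure used to produce the figures. The grader names and scores are invented, grader_profile is a hypothetical helper, and the "inclusive" quartile method and the 100-point maximum are assumptions.

```python
import statistics

def grader_profile(scores, max_score=100):
    """Summarize one grader's box so it can be compared with the others.

    median_position is 0.0 when the median sits at the bottom of the
    box and 1.0 when it sits at the top (None if the box has no height).
    """
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "median": median,
        "iqr": iqr,
        "median_position": (median - q1) / iqr if iqr else None,
        # True when the top quarter of scores sits at maximum credit,
        # which shows up on the plot as a missing upper whisker.
        "no_upper_whisker": q3 >= max_score,
    }

# Hypothetical sections (invented for illustration only).
sections = {
    "grader_a": [50, 60, 62, 65, 70, 80, 85, 90, 95],
    "grader_b": [70, 80, 90, 95, 100, 100, 100, 100, 100],
}

for name, scores in sections.items():
    print(name, grader_profile(scores))
```

In this made-up data, grader_b shows a high median and no upper whisker, while grader_a's median sits in the lower part of the box. As with the plots, such numbers only point to sections worth a closer look; they do not say why a section differs.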

Some caveats of grade distribution analysis to consider

First, you should expect some variation among your sections. Just as performing multiple statistical tests increases the probability of a false positive, the more sections you have, the more likely you are to have some sections that truly perform at a lower level than others.
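The multiple-testing analogy can be put into numbers with a short Python sketch. It assumes, purely for illustration, that each section independently has a 5% chance of looking unusual by chance alone; the function name and that rate are assumptions, and real sections are unlikely to be fully independent.

```python
def chance_of_any_flag(n_sections, per_section_rate=0.05):
    """Probability that at least one of `n_sections` looks unusual by chance.

    Assumes sections are independent and share the same per-section rate.
    """
    return 1 - (1 - per_section_rate) ** n_sections

for n in (1, 5, 10, 20):
    print(f"{n:>2} sections: {chance_of_any_flag(n):.2f}")
```

Under these assumptions the chance of seeing at least one unusual section grows steadily with the number of sections, which is why a single divergent section in a large course is a prompt for investigation rather than proof of a problem.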

Second, assessment is only one educational role performed by your graders. You may see variation in the levels of performance of some of your sections as your class progresses over a semester. Perhaps one grader is also more skilled at getting their students to meet their assessment goals. This is a good thing and is the reason that this analysis only detects possible problems that warrant further investigation.

As your semester progresses and learning occurs, I would expect your sections to become progressively more skewed. If you've aligned your learning objectives well, more of your students will achieve them as the course progresses.

Next step: Performing grade distribution analysis