Do Student Ratings Really Matter?

By Heather Morton
October 30, 2019

Senior Editor, MindEdge Learning

Student course ratings inevitably raise the question: what counts as learning?

Student Evaluation of Teaching (SET) – in which college students rate their courses and professors – understandably generates controversy among faculty whose jobs and salaries often depend, at least partially, on this feedback from students. There are strong proponents on both sides: those who believe student ratings provide genuine insight into the quality of the course and those who think the ratings are based on superficial qualities, such as the professor’s likeability, that are at best irrelevant (and at worst detrimental) to learning.

The controversy is fueled by conflicting research on the validity of student ratings – research that is conflicting because student learning is so difficult to assess. The traditional “gold standard” for research on student ratings is a situation in which students in a multi-section course taught by different professors take a common exam. The common exam offers a way to validate ratings: a professor who earned higher student ratings than other professors should have students who perform better on the exam. If the two numbers correlate, then administrators evaluating teaching in courses without a common exam can be confident the ratings tell them something about that professor’s quality of teaching.

But it’s easy to raise objections to even this gold standard. In K-12 education, people worry whether “teaching to the test” undermines critical thinking, creativity, and other hard-to-measure learning outcomes. Even if student ratings of professors perfectly correlated with exam scores (which they don’t), would those ratings signal the kind of deep learning and engagement that faculty in post-secondary education aim to foster? Some recent studies suggest they don’t.

But first, a little background on the fascinating history of research on ratings. Over the last 50 years, studies have appeared regularly with various and contradictory findings. However, broadly speaking, there have been two notable shifts in public opinion on student ratings in recent history. These shifts suggest that the measures we use to quantify “learning” affect whether we can believe student ratings.

Back in the 1970s, a series of experiments registered what came to be known as the “Dr. Fox effect.” An actor presented as “Dr. Myron Fox” spoke to healthcare professionals on a topic irrelevant to their work. The lively and engaging Dr. Fox earned high ratings from these well-educated learners, despite delivering a lecture specifically designed to teach them nothing. The experiment’s results suggested students respond to manner over matter, and that a teacher’s enthusiasm might distract from his incoherence.

But critics pointed to flaws in the research design – most significantly, the fact that students were not asked if they had learned anything from Dr. Fox’s lecture.[1] Subsequent studies cast doubt on the Dr. Fox effect by showing that students could rate a fake teacher positively on zeal while also recognizing that he didn’t increase their understanding of a subject.

The early 1980s saw a shift toward increased trust in student ratings. In 1981, Peter Cohen published a meta-analysis of 41 validity studies that considered 68 multi-section courses with common final exams. That study reported a “large” correlation between overall course rating and student achievement on the common exam, and a similar correlation between instructor effectiveness and achievement. Cohen’s study was influential for the next 30 years, showing up in articles, books, and blogs that argued for the validity of student ratings.

In 2017, however, a new meta-analysis appeared, the effects of which are just starting to ripple through the previous consensus. Bob Uttl, Carmela White, and Daniela Wong Gonzalez re-analyzed Cohen’s studies and created their own meta-analysis of data that has appeared since Cohen’s 1981 research. This new study comes to the conclusion that the strong correlations between ratings and learning outcomes were due to small sample-size bias.

Uttl reported, “The best evidence…indicates that the SET/learning correlation is zero. Contrary to a multitude of reviews, reports, as well as self-help books aimed at new professors…the simple scatterplots as well as more sophisticated meta-analyses methods indicate that students do not learn more from professors who receive higher SET ratings.”

Unfortunately, the small sample-size problem is inherent to Cohen’s research design. Few universities, after all, offer the opportunity to study more than 10 sections of a single course in one semester – and in Cohen’s research, the unit of measurement is the section itself. Student ratings of the course (or instructor) are averaged, and student grades on the common exam are averaged. The two averages are measured for their correlation. Studies of six to 10 sections of a multi-section course might – and often do – show a high correlation between student evaluations and performance on a common exam, but this only represents six to 10 data points. Meanwhile, studies that don’t show such a high correlation might not have been published. Getting larger sample sizes requires restricting studies to very large universities, or using a different measure of student learning.
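
To make the small-sample problem concrete, here is a minimal simulation sketch. It is not drawn from either meta-analysis: the function names, the choice of eight sections per study, the 0.5 cutoff for a "large" correlation, and the assumption that ratings and exam averages are completely unrelated are all illustrative. It shows how often studies of a handful of sections report a strong correlation by chance alone, and what happens to the average when only the strongly positive results get "published."

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_studies(n_sections, n_studies, seed):
    """Simulate studies in which the true rating/exam correlation is zero.

    Each study has n_sections data points: one average rating and one
    average exam score per section, drawn independently of each other.
    """
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_studies):
        ratings = [rng.gauss(0, 1) for _ in range(n_sections)]
        exams = [rng.gauss(0, 1) for _ in range(n_sections)]
        correlations.append(pearson_r(ratings, exams))
    return correlations


# Hypothetical setup: 2,000 small studies (8 sections each) and
# 2,000 large studies (190 sections each).
small = simulate_studies(n_sections=8, n_studies=2000, seed=0)
large = simulate_studies(n_sections=190, n_studies=2000, seed=1)

# Share of studies reporting a "large" correlation in either direction.
share_small = sum(abs(r) >= 0.5 for r in small) / len(small)
share_large = sum(abs(r) >= 0.5 for r in large) / len(large)

# Crude stand-in for publication bias: keep only strongly positive results.
published = [r for r in small if r >= 0.5]

print(f"Small studies with |r| >= 0.5: {share_small:.1%}")
print(f"Large studies with |r| >= 0.5: {share_large:.1%}")
print(f"Mean r across all small studies: {statistics.fmean(small):.3f}")
print(f"Mean r across 'published' small studies: {statistics.fmean(published):.3f}")
```

Under these assumptions, the average correlation across all the small studies sits near zero, but the subset that clears the cutoff looks impressive, which is the pattern a meta-analysis of published small studies would inherit.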

Uttl did the latter. In an effort to increase the sample size, Uttl’s meta-analysis drew on studies that used a different measure of learning – grades in subsequent classes in the discipline – as well as studies that used a common final exam.

Given the authority of a “meta-analysis” that critiques small sample size, it’s worth noting that the three largest studies in Uttl’s meta-analysis all came from the same research paper, an analysis of 190 sections of “Principles of Microeconomics,” 119 sections of “Principles of Macroeconomics,” and 85 sections of “Intermediate Microeconomics” (each course studied over multiple years) at Ohio State University. That is, they were all in a single discipline, at one institution, and conducted by the same team of researchers.

In any case, the issue may be less sample size than what is being measured. Perhaps both Cohen’s and Uttl’s studies are insightful. Perhaps students have an excellent sense of what is in their short-term interest – what kind of teaching helps them prepare for the upcoming final exam – but not as good an instinct for what will lead to longer-term learning, the kind that influences their future grades in the discipline.

Using a common final exam may fail to capture long-term, deep learning. But as a measure of learning, grades in subsequent upper-level courses in the same discipline might also present problems. For one thing, grades in later courses seem fairly distant from the initial site of learning. For another, it’s easy to imagine confounding factors. For instance, students who do well in the initial microeconomics course and rate their professor highly might develop an unconscious belief that they’ll do equally well in later economics courses, and put in less effort. Meanwhile, students who on average did not do quite as well on the final exam might arrive in the subsequent course feeling a need to work harder.

Still, we only know as much as the latest research, and that confirms faculty suspicions that student ratings do not tell us much about student learning. The Ohio State University study discovered what others have found: that student ratings are strongly correlated with students’ grades in the current course. Grades on either the end-of-course exam or the course itself evidently don’t tell us much about grades in later, related courses. We don’t know whether the present and future grades don’t line up because the students who experienced deep learning don’t perform as well in the short term, or because of other factors yet to be imagined.

There’s one final finding that should encourage everyone to keep searching for what student ratings tell us. The Ohio State University study (the one that studied 190, 119, and 85 sections of economics courses) found statistically significant differences in student outcomes from different professors. That is, certain professors’ students did better in later courses than did other professors’ students. So we still have evidence that some professors are better than others (at least at getting their students good grades later on). We just don’t yet know why that information is not showing up on student ratings, and what other information might be.