When you handle trash, do you still have to handle it with statistical care?

Neurocopiae takes a closer look at the carefully crafted pizza study survey by the Wansink lab.

When it comes to reheating leftover pizza, opinions are typically divided. I like cold pizza better because when you reheat a slice of pizza, it gets soggy. This soggy slice of pizza is a fitting metaphor for the next chapter in the Wansink pizzagate saga. I was a bit reluctant to write another post on the sad downfall of Ig Nobel laureate Brian Wansink, head of the Food & Brand Lab at Cornell University [Mindless publishing garnished with social science apologies], but I had to take a look at the now infamous pizza buffet data myself. A couple of days ago, Wansink posted a statement re-emphasizing that “[he], of course, take[s] accuracy and replication of our research results very seriously.” More importantly, Wansink finally granted access to the data that the four papers, which came under fire months ago, were based on: “My team has also worked to make the full anonymized data and scripts for each study available for review.” This is awesome because everything is settled now, right? Move on, methodological terrorists, nothing to see here. Well, almost.

A spreadsheet hit by a scattershot

After reading the text file into Excel, I thought I must have messed up the import. There are many missing values, and some of the entered values, for example in the variable calories, seem improbable. So I checked the text file and found no problem with the import. Good, I guess? After years of practice in data science, I finally learn from his response letter that “[different sample sizes in the statistical tests] arise because different numbers of individuals answered each question on the survey—an issue common to all field studies.” This is so common that there is absolutely no need to mention it in the methods or the table of results, or in Wansink’s words: “Survey non-response varying by question is common to all field studies. It was understood by both reviewers and editors and was not mentioned for brevity.” I like brevity! One more thing that Wansink does not mention is the assumption that comes along with this approach: the data are assumed to be missing completely at random (MCAR). However, this cannot be true because some questions refer to the progression of taste perception over multiple slices of pizza, so a missing answer presumably depends on how many slices a participant ate. Not quite random, I suppose.
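The MCAR concern can be illustrated with a minimal sketch. It assumes nothing about the real spreadsheet: the records are made up, and the column name `taste_last_slice` is a hypothetical stand-in for one of the later-slice questions. The idea is simply to compare the share of missing answers across groups defined by the number of slices eaten. If missingness were completely at random, the shares should look roughly alike.

```python
# Toy records, NOT the Wansink data. "taste_last_slice" is a hypothetical
# column name; None marks a question the participant left unanswered.
records = [
    {"pieces": 1, "taste_last_slice": None},
    {"pieces": 1, "taste_last_slice": None},
    {"pieces": 1, "taste_last_slice": 7},
    {"pieces": 3, "taste_last_slice": 6},
    {"pieces": 3, "taste_last_slice": 8},
]

def missing_rate_by_group(records, group_key, question_key):
    """Share of missing answers to one question, split by a grouping variable."""
    counts = {}  # group value -> (total rows, rows with a missing answer)
    for rec in records:
        group = rec[group_key]
        total, missing = counts.get(group, (0, 0))
        counts[group] = (total + 1, missing + (rec[question_key] is None))
    return {g: missing / total for g, (total, missing) in sorted(counts.items())}

rates = missing_rate_by_group(records, "pieces", "taste_last_slice")
for pieces, rate in rates.items():
    print(f"{pieces} slice(s): {rate:.0%} missing")
```

A large gap between groups would not prove anything on its own, but it would be a reason to state the missing-data assumption explicitly rather than drop it for brevity.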

Whenever I am concerned with the quality of collected data, I do some basic checks first. I see the “pieces” variable and the “calories” variable and I hypothesize that the correlation should be very high if everything is in order. The correlation is ~.5 and not quite what I had expected. I re-read the methods (because the data does not include a proper description of the variables) and learn that the calories variable refers to the participants’ own estimate of the number of calories of pizza they consumed. Okay, so perhaps the correlation should not be so high. By now, I already know that field studies are always very noisy and not well-controlled like lab studies. But still, some values just don’t make sense to me. For example, one participant had no slice of pizza according to the observations of the research assistant, but estimated that s/he had 400 kcal worth of pizza. For one slice of pizza, the estimated intake ranges from 50 to 1200 kcal, but maybe the participants just don’t have a clue about caloric density. This is field research so everything seems possible. For two slices of pizza, it gets more interesting: one participant estimated an intake of 0 kcal and two more participants estimated that they had consumed 1 kcal (!) of pizza. Okay, now I am getting more skeptical. Skeptical enough that I would, at the very least, flag those participants and double-check the raw data for plausibility before running it through many ANOVAs and publishing four papers. Or you can close your eyes and go ahead treating these variables as if they reflect an accurate read-out of the messy truth of a field study. For now, let us suppose you do the latter.
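The basic checks described above could look roughly like the following sketch. The rows are invented to mimic the kinds of cases mentioned (zero slices with 400 kcal, two slices with 0 or 1 kcal); they are not the actual spreadsheet. The threshold `min_kcal_per_slice` is an arbitrary assumption of mine for what counts as an implausibly low estimate, not a value from the papers.

```python
from math import sqrt

# Toy rows, NOT the real data: (pieces observed, calories estimated).
# None marks a missing value.
rows = [(0, 400), (1, 50), (1, 1200), (2, 0), (2, 1),
        (2, 600), (3, 900), (None, 500), (4, None)]

def pearson(xs, ys):
    """Plain Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def flag_implausible(rows, min_kcal_per_slice=25):
    """Indices of complete rows whose calorie estimate contradicts the slice count.

    min_kcal_per_slice is a hypothetical, deliberately lenient cutoff.
    """
    flagged = []
    for i, (pieces, kcal) in enumerate(rows):
        if pieces is None or kcal is None:
            continue  # incomplete rows need a separate decision
        if pieces == 0 and kcal > 0:
            flagged.append(i)  # calories reported without any observed slice
        elif pieces > 0 and kcal < pieces * min_kcal_per_slice:
            flagged.append(i)  # estimate far too low for the slices eaten
    return flagged

complete = [(p, k) for p, k in rows if p is not None and k is not None]
r = pearson([p for p, _ in complete], [k for _, k in complete])
flagged = flag_implausible(rows)
print(f"r = {r:.2f}, rows to double-check: {flagged}")
```

The point of flagging rather than deleting is that the flagged rows can then be compared against the raw survey sheets before any model is fitted.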


Are the observations dependent or is the survey worthless?

While briefly browsing the data, I spotted that the response vectors are quite similar across participants. I recall that one of the research questions of the papers was whether men eat more in the company of women, so I start to wonder if the observations are truly independent. In the description of the paper, it is stated:

Customers who entered the restaurant that day for lunch were recruited to participate in the study before being seated along with the people who joined them and were asked two questions related to restaurant choices

DOI 10.1007/s40806-015-0035-3

My interpretation is that one (random) customer from a group of customers was drawn for the survey. In this case, we can assume that the observations are independent.

However, if this is the case, I am really stunned by the high degree of similarity and the low pairwise distances between participants’ vectors of survey responses (vars 6:23). I have included the plots for you to take a closer look, but here are some quick facts. Some survey vectors are virtually indistinguishable. I count 296 pairwise distances of 0, indicating that, per participant, more than two other individuals gave the same survey responses. I find that number surprisingly high when you ask 18 questions. Then again, participants with many missing responses seem particularly dubious in this regard because they have only a few features to compare. Is it certain that half-completed surveys come from different participants though? Perhaps this is another good reason to exclude them? After excluding all zeros (even off the diagonal), the median absolute pairwise distance is still 13, so less than 1 per question on a 9-point Likert scale. Not much variance left to be explained by anything, I would say.
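For readers who want to try this kind of comparison themselves, here is a minimal sketch of a pairwise-distance check. It is not the script behind the numbers above. It assumes a Manhattan (sum of absolute differences) distance computed over only the questions both participants answered, it counts each pair of participants once, and it uses made-up vectors with four questions instead of 18.

```python
from itertools import combinations
from statistics import median

# Toy response vectors, NOT the real survey: 4 questions on a 9-point scale,
# None marks a skipped question.
vectors = [
    [7, 8, 6, 5],
    [7, 8, 6, 5],
    [7, None, 6, None],
    [2, 3, 9, 1],
]

def manhattan(a, b):
    """Sum of absolute differences over questions both participants answered."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None  # nothing to compare
    return sum(abs(x - y) for x, y in shared)

def pairwise_summary(vectors):
    """Number of identical pairs and the median distance among non-identical pairs."""
    dists = []
    for a, b in combinations(vectors, 2):
        d = manhattan(a, b)
        if d is not None:
            dists.append(d)
    zeros = sum(d == 0 for d in dists)
    nonzero = [d for d in dists if d > 0]
    return zeros, (median(nonzero) if nonzero else None)

zeros, med = pairwise_summary(vectors)
print(f"identical pairs: {zeros}, median non-zero distance: {med}")
```

Note how the half-completed third vector matches both complete ones on the two questions it shares with them. That is the sense in which sparse responses inflate the count of zero distances.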

Moreover, when I look at the correlation of the survey response vectors, I can’t help but see a strong pattern of similarity across participants. In the plots, it is apparent how high the similarity and how low the distance is between each participant (the diagonal) and many other participants contained in the spreadsheet. My take on the results is that either the answers are strongly interdependent, which would clearly violate the assumptions of the employed statistics and call for an adjustment of the degrees of freedom, or the survey was as carefully designed as the results were initially reported.


Garbage in, garbage out

Now we are getting back to the question that Andrew Gelman had posed earlier. Is this work merely trying to fit a significant statistical model to little more than noise? I think the answer is clearly yes. For example, the pizza slice count is not well approximated by a normal distribution, but hey, you can still run it through an ANOVA and live happily ever after. At least until you brag about the published output from a spreadsheet that contains very little valuable information to begin with. If this is an exact reflection of the data that was collected, I think we don’t have to worry too much about the p-hacking combined with creative storytelling anymore. The problems run much deeper. The observations are not very helpful for answering any question of interest. They don’t tell us much about unique individual characteristics of participants. They don’t tell us much about the psychological processes involved in the decision to have another slice of pizza or not. They just tell us a sad story about what field studies turn out to be when you don’t even try to approximate good scientific practice.


Jordan Anaya pointed me to a section in one of the pizza papers where the authors indicated that the observations are not independent.

I think this may partially account for a higher overall similarity of survey responses, but it seems unlikely to account for this unusually high degree of similarity. Nevertheless, this only stresses the need to account for the nested collection of data. In hierarchical linear modeling, the hypothesized evolutionary “eating heavily” effect would correspond to a cross-level interaction between the group (same vs. different gender) and the individual level. This would, in turn, reduce the degrees of freedom for the group observations, which could be enough to make the interaction non-significant. Unfortunately, we cannot evaluate it properly because there is no group id provided in the spreadsheet.
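To give a feeling for how much nesting can cost, here is a rough back-of-the-envelope sketch. It is not hierarchical linear modeling. Instead it uses the standard design-effect approximation for clustered samples, n_eff = n / (1 + (m - 1) * ICC), which assumes equally sized groups. The sample size, group size, and intraclass correlation below are hypothetical, since the spreadsheet contains no group id from which to estimate them.

```python
def effective_sample_size(n, cluster_size, icc):
    """Approximate number of independent observations in a clustered sample.

    Uses the design effect 1 + (cluster_size - 1) * icc and assumes
    all clusters (here: tables of diners) have the same size.
    """
    design_effect = 1 + (cluster_size - 1) * icc
    return n / design_effect

# Hypothetical numbers for illustration only: 100 diners seated in
# groups of 3, with an intraclass correlation of 0.5 within a group.
n_eff = effective_sample_size(n=100, cluster_size=3, icc=0.5)
print(f"100 diners behave like roughly {n_eff:.0f} independent observations")
```

Under these made-up values, half of the nominal sample size is gone, which is the kind of loss that could turn a borderline interaction non-significant.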
