What to Do When People Draw Different Conclusions From the Same Data
“In God we trust; all others must bring data.”
That famous line from statistician W. Edwards Deming has become a mantra for data-driven companies because it holds out the promise of objective answers. But in practice, as every analyst knows, interpreting data is a messy, subjective business. Ask two data scientists to look into the same question and you're liable to get two completely different answers, even when both are working with the same dataset.
So much for objectivity.
But several academics argue there is a better way. What if data analysis were crowdsourced, with multiple analysts working on the same problem and with the same data? Sure, the result might be a range of answers, rather than just one. But it would also mean more confidence that the results weren’t being influenced by any single analyst’s biases. Raphael Silberzahn of IESE Business School, Eric Luis Uhlmann of INSEAD, Dan Martin of the University of Virginia, and Brian Nosek of the University of Virginia and Center for Open Science are pursuing several research projects that explore this idea. And a paper released earlier this year gives an indication of how it might work.
The researchers recruited 61 analysts (mostly academics) and asked them to assess whether soccer referees were more likely to give red cards to players with darker skin tones. The analysts split into 29 teams and were given a dataset that included numerous variables about both players and referees.
Each team devised its own method for answering the question, then shared that approach – but not any results – with the group. What followed was a heated debate over which methods were defensible and which were not. If you're looking for a correlation between skin tone and red cards received, does it make sense to control for the position the player plays? What about the country their team is located in, or how many yellow cards they've received?
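To make the stakes of those choices concrete, here is a minimal sketch of how two teams might model the very same question differently. The data, column names, and coefficients below are invented for illustration; this is not the study's actual code or dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a toy dataset with hypothetical columns (not the study's data).
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "skin_tone": rng.uniform(0, 1, n),  # 0 = very light, 1 = very dark
    "position": rng.choice(["defender", "midfielder", "forward"], n),
    "league_country": rng.choice(["England", "Germany", "Spain", "France"], n),
    "yellow_cards": rng.poisson(2.0, n),
})
# Simulate a rare binary outcome so the models have something to fit.
logit_p = -4 + 0.3 * df["skin_tone"] + 0.2 * df["yellow_cards"]
df["red_card"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Team A: skin tone alone.
model_a = smf.logit("red_card ~ skin_tone", data=df).fit(disp=False)

# Team B: same question, but controlling for position, league country,
# and yellow cards -- each a defensible, debatable choice.
model_b = smf.logit(
    "red_card ~ skin_tone + C(position) + C(league_country) + yellow_cards",
    data=df,
).fit(disp=False)

# Same data, same question, two different estimates of the same effect.
print(model_a.params["skin_tone"], model_b.params["skin_tone"])
```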
After receiving the group's feedback on their proposed methods, teams could tweak their approach if they wished before proceeding to the actual analysis. All of the analyses were then shared with the group, prompting a debate over the results: which ones might have been skewed by outliers, and whether things would have looked different had teams taken other variables into account.
The results clearly illustrate why different analysts come to different conclusions about the same data. From 29 teams came 21 different sets of variables. Different teams also used different statistical models.
Not surprisingly, then, they came to different conclusions. Twenty of the teams found a statistically significant relationship between a player's skin color and the likelihood of receiving a red card; nine found no significant relationship.
Had it been just a single team using a single method, it would have stopped at that result, declared a relationship between skin color and red cards (or not), and been done with it.
But with 29 slightly different results, the group could see clearly that their analyses hinged on difficult, somewhat subjective decisions about which model to use and which variables to include. There was another round of debate, after which “the analysts converged toward agreement that there is a small, statistically significant relationship between player skin tone and receiving red cards, the cause of which is unknown.” And although this paper can’t prove it, the authors suggest that taking the median result from the range might provide a less biased answer to the question.
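As a toy illustration of that aggregation idea (the estimates below are made up, not taken from the paper):

```python
import statistics

# Hypothetical odds ratios reported by different teams for the skin-tone effect.
team_estimates = [0.89, 1.18, 1.21, 1.25, 1.28, 1.31, 1.38, 1.42, 1.56, 2.93]

# The median is robust to a single team's outlier (e.g., the 2.93 above),
# which is the sense in which it may give a less biased summary.
consensus = statistics.median(team_estimates)
print(f"Median odds ratio across teams: {consensus:.2f}")
```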
Though most companies don’t have 61 analysts to throw at every problem, the same general approach to analysis could be used in smaller teams. For instance, rather than working together from the beginning of a project, two analysts could each propose a method (or several) and then compare notes. Each could then run her own analysis and compare her results with her partner’s. In some cases, that comparison might lead them to trust one method over the other; in others, to average the results together when reporting back to the rest of the company.
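If a pair of analysts does decide to average, one standard way to combine two estimates is an inverse-variance weighted mean, a technique borrowed from meta-analysis rather than from the paper itself. A minimal sketch, with hypothetical numbers:

```python
def combine(est_a, se_a, est_b, se_b):
    """Inverse-variance weighted average of two effect estimates.

    Each estimate is weighted by 1 / SE^2, so the more precise
    analysis counts for more in the combined answer.
    """
    w_a, w_b = 1 / se_a**2, 1 / se_b**2
    pooled = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    pooled_se = (w_a + w_b) ** -0.5
    return pooled, pooled_se

# Hypothetical: analyst A found 0.31 (SE 0.10), analyst B found 0.18 (SE 0.12).
est, se = combine(0.31, 0.10, 0.18, 0.12)
print(f"Combined estimate: {est:.2f} +/- {se:.2f}")
```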
“What this may help [to do] is to identify blind spots from management,” said Raphael Silberzahn, one of the initiators of the research. “By engaging in crowdsourcing inside the company we may balance the influence of different groups.”
Silberzahn and his colleagues are currently working on a second effort in the same vein: crowdsourcing the analysis of how gender and status affect scientific debate. You can trust that the results will be interesting, and, thanks to the crowd, more likely to be right.