6. Best practices and general discussion
We have shown that there is no consensus in how the unexpected contents measure is scored, and that the most frequently used scoring system in our review did not factor any control measure into analyses of the task. Further, across several analyses, we found that the coding scheme researchers used to score the unexpected contents task potentially contributes to the significance level of the analyses they report. As such, we suspect that the coding scheme used for scoring this measure (and potentially related measures of false belief) could be a “researcher degree of freedom” (Simmons, Nelson, & Simonsohn, 2011, p. 1559) present in studies of children’s theory of mind. We do not suspect that malice or deceptive intent underlies any researcher’s choice of how to score this measure. Indeed, we suspect that some authors have used a particular coding scheme for long periods, and this presumably reflects a genuine belief that there is an unambiguous way of scoring the measure. But this seems to reflect a minority; the majority of researchers using this measure appear to succumb to the genuine ambiguity in how best to score it.
We want to point out a particular limitation of our review, which is that we focused only on the unexpected contents task. Other measures of false belief (e.g., unexpected transfer, appearance-reality) also use control questions or pretests to ensure that children have the attentional or memory capacities necessary to demonstrate their theory of mind knowledge. We limited our analysis to the unexpected contents task for two reasons. The first was practical: we had available to us a data set on children’s performance on this particular measure. The second was that in the unexpected contents procedure, there tends to be only one control question asked (about the actual contents of the box), which simplified the review of the literature and our categorization of the different coding schemes.
To our knowledge, other measures of false belief have not been analyzed in the manner presented here. We suspect that this analysis is representative of the broader literature on explicit judgments about false belief.