Two months into lockdown, we’ve grown accustomed to working from home, but there are still things we miss about being in the office: the buzz of colleagues working, impromptu chats in the kitchen, and most of all, the coffee machine. From decaf macchiatos to caffeinated double espressos, we were spoilt for choice – choice being the critical word here. Putting a group of behavioural scientists in front of a machine that provides endless opportunities for choice led to one predictable outcome: a trial, of course.
From the beginning of January (a timely moment if ever there was one) to the beginning of lockdown, three of us decided to settle once and for all the age-old debate: could we tell the difference between caffeinated and decaffeinated coffee? (No, no one had ever set the default of the BIT coffee machine to decaf, as far as we were aware – we nudge for good, after all.)
It was a single-blind experiment: Andrew made coffees for Jessie based on the outcome of a coin that he flipped away from Jessie’s prying eyes, Jessie did so for Tania, and Tania for Andrew – a véritable expérimentation à trois.
So, 85 coffees later, could we tell the difference? While Andrew turned out to be a connoisseur, correctly identifying 80% of his coffees, Jessie and Tania seemed to be guessing at best. Pooling our results together, however, we were correct for just under two thirds of our coffees (63%).
What information did we use to make our guesses? Jessie was sure she would be able to tell by flavour alone. But because she was interested in how caffeine affected her alertness, she asked Andrew to mix things up by adding either dairy, oat, soy, almond, or coconut milk. Inspired by survey methods to record illicit behaviour, this technique was designed to introduce extra variation in the flavour of Jessie’s coffees. By making it harder to tell what she was drinking by taste alone, she could focus more on the effects on her alertness. In this case, Jessie’s inability to discriminate suggests her subjective perception could be influenced by a placebo effect, for example feeling more alert simply due to associating that feeling with drinking coffee (two-thirds of Jessie’s wrong guesses were decaf coffees she thought were caffeinated).
‘Caffeine has no effect on me,’ asserted Tania confidently while we first brewed this subject by the coffee machine. This turned out to be accurate, as her discrimination was indeed no better than chance. However, she notes the experiment’s social benefits: as a new starter at BIT when the experiment began, it was a way to get to know colleagues informally in a way that was adjacent to work, yet not actually work. These interactions are something that people may miss when working from home; their importance is a key part of BIT’s recommendation to employers to schedule virtual social meet-ups and even create spontaneity via ‘randomised’ coffees/teas.
Andrew wears a Fitbit that tracks his sleep. We found, using regression analysis, that Andrew’s guesses of whether the coffee was caffeinated were predicted not only by whether it actually was caffeinated, but also by how many minutes he had slept the previous night. Less sleep made him more likely to guess ‘decaf’, when controlling for the coffee’s actual caffeination – presumably he sometimes mistakenly blamed the coffee for a lack of caffeine when he actually should have gone to bed earlier the night before. (For those interested: drinking a caffeinated coffee was also weakly associated with fewer minutes slept the following night.)
What can we learn from this trial (aside from the importance of having empirically-minded coffee-loving colleagues)? We think there are three key lessons:
1. If you’re not making accurate guesses, it’s better to be aware of that fact.
Partway through the experiment, we started writing down our confidence in our guesses (a percentage where 100% indicates certainty and 50% indicates chance, i.e. ‘no clue!’). While overconfidence may be advantageous in some cases, it’s dangerous to be systematically, confidently wrong. Rather, we strive to be ‘well-calibrated’: for example, someone who is 70% correct would have 70% confidence in their guesses (on average). Brier scores are a measure that gives credit for both good calibration and solid discrimination. Jessie’s Brier score (0.25) was almost as good as Andrew’s (0.23; lower is better), despite Andrew’s superior discrimination, as she was helped by being more confident when she got things right.
2. Second-guessing hurts your forecasting ability.
When we looked at our results in more detail, we found that our accuracy decreased after we started rating our confidence. Jessie correctly identified 9/10 coffees before we started rating our confidence, and Andrew 14/14, but both of them plummeted to approximately 50% accuracy thereafter. Amateur forecasters like us might do well to trust our initial instinct a bit more, rather than second-guessing ourselves into inaccurate oblivion.
3. Feedback is critical in making accurate forecasts.
In hindsight, Andrew realised that he may have had an advantage: ‘I had already started to wean myself off caffeine almost entirely before this trial. But if I felt I needed a kick, I’d make myself a regular caffeinated coffee.’ In other words, Andrew had some learning and feedback on the taste of the two coffee types that Jessie and Tania lacked. In this experiment, we didn’t give each other feedback on whether our previous guess had been right, which makes forecasting much harder. Providing rapid feedback is one of the key recommendations that forecasting gurus like Philip Tetlock make for improving judgement.
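For the curious, the Brier score mentioned under lesson 1 is simply the mean squared difference between the probability a forecaster assigned to an outcome and what actually happened (1 or 0). Here is a minimal Python sketch of that calculation. The function names and the tasting log are made up for illustration and are not the trial data; the sketch also assumes each guess comes with a confidence between 50% and 100%, as described above.

```python
def prob_caffeinated(guess, confidence):
    """Turn a guess plus a 0.5-1.0 confidence into P(coffee is caffeinated)."""
    if guess not in ("caffeinated", "decaf"):
        raise ValueError("guess must be 'caffeinated' or 'decaf'")
    if not 0.5 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.5 and 1.0")
    # Being 80% sure it is decaf means assigning 20% to caffeinated.
    return confidence if guess == "caffeinated" else 1.0 - confidence


def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical tasting log: (guess, confidence, was it actually caffeinated?)
log = [
    ("caffeinated", 0.9, True),
    ("decaf", 0.6, True),
    ("decaf", 0.8, False),
    ("caffeinated", 0.5, False),
]

forecasts = [prob_caffeinated(guess, conf) for guess, conf, _ in log]
outcomes = [1 if actual else 0 for _, _, actual in log]

print("Brier score:", round(brier_score(forecasts, outcomes), 3))

# Baseline: a taster who answers 50% every time.
print("Always-50% baseline:", brier_score([0.5] * len(outcomes), outcomes))
```

A taster who answers 50% every time scores exactly 0.25 whatever is in the cup, so scores close to 0.25 indicate little skill beyond honestly admitting uncertainty, while confident wrong answers push the score above it.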
Overall, we’re surprised by the results, wondering at our own inability to discriminate at much better than chance (as a threesome). Calibration and discrimination can be difficult but can be improved with practice, and BIT is actively working with the government and other organisations to improve both in forecasts of policy outcomes and other issues of importance to organisations. If you’re interested in working together on improving forecasts, please get in touch at firstname.lastname@example.org.