Thanks to Everybody Lies by Seth Stephens-Davidowitz—who will be joining us at BX2019 in September—we at BIT have a new hobby: looking at how online data can reveal previously invisible insights into behaviour and decisions.
For example, if you’re a Brit, you’d have had to be living under a rock the last few weeks to miss prominent members of the Conservative party fighting it out to be the new PM. In the past, MP votes, media headlines, and opinion polls would have been the only ways to take the temperature on candidate performance. Now, though, aggregate online search data can give a real-time description of what is happening.
Take, for instance, the meteoric rise of outsider Rory Stewart’s star:
Figure 1: Rory Stewart may have lacked MP votes, but he gave Boris Johnson a run for his money in the public consciousness stakes
Looking into these tools has been fun, but it is also urgent. In the last few years the insidious effects of partially hidden behaviours and beliefs, from sexual abuse to Islamophobia, have been brought to public attention in an unprecedented way. The signs were there; from 4chan to Twitter, the trolls were telling us everything we needed to know. We hope that incorporating online data into our toolkit will help us understand the root causes of some of the world’s most serious problems, design better interventions, and gain a better understanding of what is working as a result.
This post covers five themes on which I have been musing (musing is the correct word: these represent lunchtime forays on questions that interest me!) and the implications for our work and partners:
1. New measures of old problems: sexism on the internet
Seth Stephens-Davidowitz’s most famous work looks at what being black cost Obama in vote share (4 percentage points, both elections) by comparing Google searches for racist language to voter choices. This made me wonder whether we could use Google Trends, a public tool that allows you to compare the relative popularity of searches over time, to get a better read on sex-based discrimination. Specifically, are people more likely to search “how to conceive a boy” or “how to conceive a girl”?
Figure 2: Google Trends Global results for “how to conceive a boy” vs “how to conceive a girl” (past 5 years)
First, yes: this is something people Google. Second, globally we see a strong preference for boys. But preference, and its strength, is not consistent around the world.
The UK, Ireland, Canada, the Philippines, and New Zealand are, albeit marginally, more interested in conceiving baby girls, with the US and Australia both at 50/50. Nigeria, Kenya, South Africa, India, and Malaysia, on the other hand, strongly prefer boys. Although other data, such as the gender wage gap, political representation, share of movie dialogue, or evidence of son-bias in family size, do not yet suggest those countries with more balanced searches have reached parity, perhaps a dwindling of son-preference at conception will foreshadow progress in these more routine data sets.
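For anyone wanting to reproduce a "50/50" style comparison, here is a minimal sketch of one way to turn two Google Trends series into a single share. It assumes both terms were exported from the same Trends comparison (so they sit on the same relative scale); the function name and the numbers are hypothetical placeholders, not the data behind Figure 2.

```python
def boy_share(boy_interest, girl_interest):
    """Return the share (0 to 1) of combined search interest going to the 'boy' query.

    Both inputs are lists of relative interest values taken from the SAME
    Trends comparison, so they are measured on a common scale.
    """
    total_boy = sum(boy_interest)
    total_girl = sum(girl_interest)
    total = total_boy + total_girl
    if total == 0:
        raise ValueError("no search interest recorded for either term")
    return total_boy / total


# Placeholder values for illustration only (NOT real Trends data).
example_boy = [60, 60, 60, 60]
example_girl = [40, 40, 40, 40]

share = boy_share(example_boy, example_girl)
print(f"Share of interest in the 'boy' query: {share:.0%}")
```

A share near 50% would correspond to the balanced pattern described for the US and Australia; values well above it would indicate a stronger interest in conceiving boys.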
2. Measuring the clandestine: drugs on the internet
Google Correlate allows users to find searches whose popularity rises and falls together, over time or across US states; its outputs are best read as “where or when X is searched a lot, so is Y” rather than as statements about individual users. I’ve found it to be a little noisy but interesting, nonetheless. For example, looking for things that correlate with the search “delete search history” turned up a 0.92 correlation with searches for “oxycodone acetaminophen” (an opioid with an almost unspellable name, at that) in the US:
Figure 3: The correlation between “delete search history” and “oxycodone acetaminophen”
It is easy—perhaps too easy, given my previous caution on the noisiness of Google Correlate—to see the story in the coefficient. Indeed, the secrecy of the opioid crisis has been one of its most terrifying features, with families being torn apart without ever having seen the signs of an addiction developing.
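To make the caution concrete, here is a small sketch of how a coefficient like this can be computed. It assumes the figure is a standard Pearson correlation between two search-interest series; the series below are made-up placeholders, chosen only to show that any two series that both drift upwards will correlate strongly whether or not they have anything to do with each other.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder series (NOT real search data): both simply trend upwards.
series_a = [10, 12, 15, 18, 22]
series_b = [5, 7, 8, 11, 12]

print(f"r = {pearson_r(series_a, series_b):.2f}")
```

A high coefficient tells us the two lines move together, not why; a shared trend or seasonal pattern is enough to produce one.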
Other opioid-related behaviours can also be seen through Google. A state-by-state map of the US showing strength of correlation between “oxycodone” and purchase interest in the same drug (“buy oxycodone”) looks, to the naked eye, pretty similar to the geographic distribution of opioid overdoses. Of course, most opioid deaths are not the result of pills, but Google appears to censor some searches in its reporting tools, such as “buy heroin”, which limits the options:
Figure 4: US web search activity for “oxycodone” and “buy oxycodone” (Source: Figure 3)
Figure 5: US overdose deaths involving opioids 2015 (Source: the Economist, using CDC data)
Lastly, for the first time, Fentanyl has surpassed Heroin in terms of search frequency in the US.
Figure 6: Fentanyl and Heroin search popularity over time
Of course, searching either of these terms does not imply that the user intends to take these drugs but, in this case, public interest has trailed behind the impacts; in 2016, Fentanyl (a synthetic opioid that is much more powerful and cheaper than heroin) and its analogues overtook Heroin as the leading cause of overdose death in the US.
Although not evidence in itself (news of deaths from across the pond could be driving interest, to give just one plausible alternative), the uptick in searches for “Fentanyl” in the UK could be taken as a cause for concern:
Figure 7: Increasing popularity of Fentanyl searches in the UK
3. Meeting people where they’re at: Fatbergs and super-gonorrhea
At a recent workshop I was running, an attendee responsible for parts of the UK’s drainage system lamented: “We need people to care about flushing wet wipes but all they’re interested in is fatbergs!”
We checked it out and he was correct; a nation obsessed!
Figure 8: Rise of the Fatbergs
Relatedly (I promise), it turns out we Brits Google super-gonorrhea (a relatively rare example of a drug-resistant infection that has become something of a media darling) far more zealously than we Google “drug resistant infection”:
Figure 9: Super-gonorrhea grabs the public interest most-est (even though the sound of it is something quite atrocious…)
Of course, even though we ought to care about wet wipes and microbial resistance, they’re just not salient topics. Super-gonorrhea and fatbergs, on the other hand, play right into the availability heuristic; they grab our attention because they are unusual (that and they make us a bit giggly).
Rather than lamenting the irresponsibility of our online selves, we should celebrate the fact that we now know what people are interested in. Since online advertising allows us to drop ads right where (and when) people are most likely to see and attend to them, having this data makes it more likely we can get messages about responsible use of antibiotics or desisting from wet-wipe flushing to land.
4. Tracking the media: could Eastenders change a child’s life?
The spiky nature of the super-gonorrhea graph looks to be due to news coverage. We can also look at how other types of media can change public interest in real time. When Eastenders ran a storyline on fostering, for example, there was an all-time peak in people searching “foster a child”.
Figure 10: Fostering interest via Albert Square
Of course, it is hardly news that bizarre behaviour on Google can be explained by what’s happening on our TV screens, but perhaps we should pay more attention from a policy perspective. Will huge surges in searching “definately” (sic) foreshadow applications to police jobs? Will the sudden interest in Chernobyl lead to more nuclear physics applications at university? Will babies called Daenerys fare worse in tomorrow’s labour market than those called Arya? Time will tell.
5. Prevention and mental health
In Everybody Lies, Seth Stephens-Davidowitz talks about some impressive examples of online behaviour enabling diagnosis—Google search histories may be clinically useful in diagnosing pancreatic cancer, for example. But just as the internet can reveal illness, it can also cause it.
Researchers at Imperial and King’s Colleges estimate that “cyberchondria” (a particular form of health anxiety, in which the use of Dr Google exacerbates mental ill-health) is costing the NHS £420m a year. There is also evidence that we might be worried about the wrong things. Our World in Data, for example, shows that heart disease – responsible for 30% of US deaths – accounts for less than 3% of internet search and news coverage relating to mortality causes. Cancer, on the other hand, accounts for 37% of mortality-related Google searches compared to a similar 30% of deaths.
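One simple way to put those two comparisons on the same footing is to divide the share of attention by the share of deaths. The sketch below uses only the percentages quoted above; the function name is my own, and note that the heart disease figure is an upper bound ("less than 3%") covering search and news together, while the cancer figure is for Google searches alone, so the two ratios are indicative rather than strictly like-for-like.

```python
def attention_ratio(attention_share, death_share):
    """Ratio of attention share to death share.

    1.0 means attention is proportionate to deaths; below 1.0 means the
    cause is under-attended, above 1.0 means it is over-attended.
    """
    if death_share <= 0:
        raise ValueError("death_share must be positive")
    return attention_share / death_share


# Percentages as quoted in the text above.
heart_disease = attention_ratio(0.03, 0.30)  # "less than 3%", so at most this
cancer = attention_ratio(0.37, 0.30)

print(f"Heart disease: at most {heart_disease:.2f}")  # at most 0.10
print(f"Cancer: {cancer:.2f}")                        # 1.23
```

On these figures heart disease gets no more than a tenth of the attention its death toll would suggest, while cancer gets a little more than its share.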
As something of a cyberchondriac myself, I find this insight thrilling: I would greatly appreciate a well-placed nudge in the midst of a Google-frenzy. I wondered if it might be possible to spot disproportionate health concerns, which could be a result of anxiety, in the wild.
I took data from Cancer Research UK on the incidence and deadliness of the twenty most common cancers (2015 data). I then used data from the same period from NewsWhip on media coverage plus Google searches, set relative to breast cancer, the most searched of the group, to observe public interest.
Figure 11: Concern about cancer vs actual risk
The results suggest that, although directionally consistent with incidence (orange shaded area), public interest (blue and purple lines) does not track cancer incidence or deadliness (grey bars) especially well. Assuming that interest in breast cancer (the bar to which all others are pegged) is not disproportionately high, I was surprised to see there may be a deficit of concern about some cancers (notably prostate, lung, and bowel) compared to their prevalence and deadliness.
Other cancers, however, seem to loom larger than they ought in the public consciousness; notably leukemia, pancreatic cancer, ovarian cancer, and lymphoma. Some of this might be a result of missing data. After all, I can’t claim to know what people would actually Google if concerned (people may, like me, be incapable of spelling “oesophageal cancer”, or search “brain cancer” rather than “brain tumour”, for instance). But the patterns suggest there may also be some interplay between what gets reported in the news (the blue line) and what people search for (the purple line).
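For the curious, the pegging step can be sketched roughly as follows: rescale both public interest and incidence so that the reference cancer equals 100, then look at the difference. This is an illustration of the general approach rather than the exact calculation behind Figure 11; the function names, the labels and the numbers are all hypothetical placeholders.

```python
def peg_to_reference(values, reference):
    """Rescale a {name: value} mapping so that the reference item equals 100."""
    ref_value = values[reference]
    if ref_value <= 0:
        raise ValueError("reference value must be positive")
    return {name: 100 * value / ref_value for name, value in values.items()}


def concern_gap(interest, incidence, reference):
    """Pegged interest minus pegged incidence for each item present in both.

    Positive: more interest than incidence would suggest (relative to the
    reference). Negative: a possible deficit of concern.
    """
    pegged_interest = peg_to_reference(interest, reference)
    pegged_incidence = peg_to_reference(incidence, reference)
    return {
        name: pegged_interest[name] - pegged_incidence[name]
        for name in interest
        if name in incidence
    }


# Placeholder figures for illustration only (NOT the Figure 11 data).
interest = {"reference": 200, "cancer A": 50, "cancer B": 120}
incidence = {"reference": 1000, "cancer A": 500, "cancer B": 300}

for name, gap in concern_gap(interest, incidence, "reference").items():
    print(f"{name}: {gap:+.0f}")
```

Everything here hinges on the reference being a sensible yardstick, which is the same assumption about breast cancer flagged above.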
Adding real data on presenting concerns at the GP might improve our ability to predict which health concerns are likely to drive cyberchondria. But even without this, it is absolutely possible to spot unhealthy patterns of search in users’ online behaviour and target interventions, such as cognitive behavioural therapy or crisis helplines, before the situation escalates.
I am looking forward to using these techniques more formally and would encourage you to come and see Seth Stephens-Davidowitz describe use of these methods with a much higher degree of sophistication at BX2019.