  • Blog
  • 2nd Nov 2023

AI & BI: Building Trustworthy Tools – Testing AI chatbots in Public Service

This week world leaders and tech titans met at Bletchley Park to discuss the safety of frontier AI models. Bletchley Park was, of course, where the greatest minds and the most advanced computers of their day cracked the Enigma codes in World War 2. Very few people knew of the existence of those computers and only a select few actually used them. That is a far cry from today, with more than 100 million monthly users of ChatGPT.

As AI technology is built into products and services, at BIT we’re particularly interested in how people are going to interact with it, especially in public services. The promise and the opportunities are enormous. Conversational AI can help a busy single parent understand what childcare benefits they’re entitled to, provide career advice to a young person struggling to decide what to study, or give mental health support to someone who doesn’t have a friend to talk to.

To realise this value we need powerful and safe frontier AI models. But we also need an interface that people trust and want to use. That’s why we’re running an experiment to understand how people might interact with AI in public services. We’ve started with the example of an AI chatbot that helps people find the advice they need on government websites. 

Why are we interested in this? 

  • Most of us have experienced a bad chatbot. They can be slow, unhelpful, and time-wasting. Many have rigid scripts, canned responses, and frustrating interfaces. But good chatbots can also be an effective tool for changing beliefs and behaviour. At their best, even an ‘old school’ bot can provide people with essential information, making advice accessible, personable, and timely.
  • Now generative AI is changing the game for chatbots – it can converse in real time, understand context, and grasp nuance, meaning it can provide tailored advice on complex queries in an instant.
  • But we don’t know much about how people will interact with AI-powered chatbots, and their use could have implications for trust in both AI and public services. AI is not infallible – it can make mistakes. And, just as a frontline public servant’s demeanour can affect the outcome of a face-to-face interaction, an AI’s ‘tone’ and conversation style can shape a digital conversation and create lasting impressions on a human user. Bad deployment of AI could undermine trust not only in technology but in public services more broadly.

We’ll be testing various AI-powered chatbots on two mocked-up government websites, one providing rent advice and the other childhood health information. These integrate the OpenAI API into our research platform – Predictiv. Participants will be given a problem scenario and asked to identify the correct government advice and appropriate next steps. The chatbot treatments will differ in UX features – interface and tone.

Below are screenshots of the different chatbots we’ll be testing out.

1: ‘Basic’ bot

2: Cartoon bot

3: Full screen bot
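
As a rough illustration of how treatments like these might be allocated, here is a minimal Python sketch. It is not a description of how Predictiv works: the interface names follow the three bots above, but the tone labels, the fully crossed design, the website keys, and the `assign` function are all assumptions made for the example.

```python
import hashlib
from itertools import product

# Interface names follow the three bots shown above.
INTERFACES = ["basic", "cartoon", "full_screen"]
# Assumption: the post does not say which tones are tested; these are placeholders.
TONES = ["formal", "friendly"]
# The two mocked-up websites described in the post (key names are hypothetical).
WEBSITES = ["rent_advice", "childhood_health"]

# Assumption: a fully crossed interface x tone design.
ARMS = list(product(INTERFACES, TONES))


def assign(participant_id: str, salt: str = "chatbot-trial") -> dict:
    """Deterministically map a participant to an arm and a website.

    Hashing the ID with a salt gives an allocation that looks random
    across participants but is reproducible for any single participant.
    """
    digest = hashlib.sha256(f"{salt}:{participant_id}".encode()).hexdigest()
    n = int(digest, 16)
    interface, tone = ARMS[n % len(ARMS)]
    website = WEBSITES[(n // len(ARMS)) % len(WEBSITES)]
    return {"interface": interface, "tone": tone, "website": website}
```

Calling `assign("participant-001")` twice returns the same arm, so a participant who reloads the page would see the same chatbot each time.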

We’ll be measuring engagement with the bot; speed, accuracy, and confidence in completing the task; and beliefs about AI. In particular we want to understand:

  • How many people (and who) engage with a chatbot? 
  • Does using a chatbot actually save people time and give them a better understanding of the information?
  • Do UX features make a difference for engagement, speed and accuracy? 
  • Does engagement with a chatbot improve a user’s confidence in their answer? 
  • Does the sector of government, as well as the issue being tackled, have an impact on measures of engagement and trust?
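
To make these outcome measures concrete, the sketch below shows one way per-participant results could be summarised by treatment arm. It is illustrative only: the record fields (`arm`, `engaged`, `seconds`, `correct`, `confidence`) and the `summarise` function are hypothetical names, not the schema or analysis used in the experiment.

```python
from collections import defaultdict
from statistics import mean


def summarise(records: list[dict]) -> dict:
    """Group participant records by arm and average each outcome.

    Each record is assumed (hypothetically) to hold:
      arm        - label of the chatbot treatment shown
      engaged    - True if the participant used the chatbot at all
      seconds    - time taken to complete the task
      correct    - True if they identified the right advice
      confidence - self-reported confidence in their answer
    """
    by_arm = defaultdict(list)
    for record in records:
        by_arm[record["arm"]].append(record)

    summary = {}
    for arm, rows in by_arm.items():
        summary[arm] = {
            "n": len(rows),
            "engagement_rate": mean(1 if r["engaged"] else 0 for r in rows),
            "mean_seconds": mean(r["seconds"] for r in rows),
            "accuracy": mean(1 if r["correct"] else 0 for r in rows),
            "mean_confidence": mean(r["confidence"] for r in rows),
        }
    return summary
```

A real analysis would go further than simple averages, for example by comparing arms with a regression and breaking results down by participant characteristics, but the grouping step would look much the same.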

We’ll also be collecting some secondary data on beliefs about AI more generally, including trust in its use in broader public service domains.

The findings will give us an initial understanding of what drives trust, confidence, and uptake in these systems, all of which will be critical for their integration in public services. We’ll share these findings in the coming weeks. 

We’re excited to share those results but also look forward to doing more work using this approach. We see great potential in combining our online experimentation platform with AI tools and functionality. For example, we could test whether an AI tool could help teachers produce a high-quality lesson plan in a fraction of the time. Equally, we can use it to understand risks. We’ve previously used a simulated social media feed to test the impact of media literacy interventions on belief in fake news. We could now test whether people are more or less susceptible to AI-generated disinformation than to human-written content.

We’re keen to hear from anyone interested in building the evidence base on the potential and the risks of generative AI. Don’t hesitate to get in touch.