How should computer voices sound in public sector applications? This project aimed to improve understanding of voice design for trustworthy, effective communication in government and civic contexts.
Background & Challenge
Deep learning models like OpenAI's Whisper and ElevenLabs' Eleven make it easy to implement low-latency voice interfaces with a high degree of naturalness. Natural voice interfaces can improve accessibility and increase user input accuracy, as I explored in a previous project, Talking with my Taxes.
Synthetic voices hold a lot of potential for public services, but what makes a voice trustworthy and effective? I set out to design and test tools that let designers experiment with different voice characteristics and gather feedback from users.
Design Artifacts
Current research into synthetic voice suggests that four factors shape how a voice is perceived:
- How high or low does the voice sound? (Pitch)
- How stable is the pitch over time? Does it rise, fall, or stay the same? (Pitch stability)
- What is the tonal balance of the voice? Is it breathy or clear? (Breathiness)
- How fast does the voice speak? (Tempo or Speed)
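As a rough illustration, the four factors above could be held in one small settings object. This is a minimal sketch, not the data model used in the project's tools: the `VoiceSettings` name, the field names, and the normalized 0.0 to 1.0 scale are all assumptions made for the example.

```python
from dataclasses import dataclass, fields


def clamp_unit(value: float) -> float:
    """Keep a setting inside the assumed 0.0 to 1.0 range."""
    return max(0.0, min(1.0, float(value)))


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the four perceptual factors.

    The 0.0 to 1.0 scale is an assumption for this sketch; what each end of
    the scale means is left to the tool that consumes these values.
    """

    pitch: float = 0.5            # how high or low the voice sounds
    pitch_stability: float = 0.5  # how steady the pitch is over time
    breathiness: float = 0.5      # tonal balance, breathy versus clear
    tempo: float = 0.5            # how fast the voice speaks

    def __post_init__(self) -> None:
        # Frozen dataclasses need object.__setattr__ to adjust fields here.
        for f in fields(self):
            object.__setattr__(self, f.name, clamp_unit(getattr(self, f.name)))


# Out-of-range input is clamped rather than rejected.
example = VoiceSettings(pitch=1.4, tempo=-0.2)
```

Keeping the settings immutable and clamped means each voice a participant designs can be stored and compared later without worrying about invalid values.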
I developed two tools: a Google Colab-based voice design tool and a React-based tool. Both used the OpenAI Whisper API as a back end and enabled me to conduct user testing of synthetic voices in different scenarios.
Experimenting in Public
I hit the road and, with a MIDI controller and some headphones, asked members of the public to design their most trustworthy voice in settings like hospitals, cars, and train stations.
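One way a MIDI controller can drive voice parameters is to map each knob's control change value, which MIDI defines as an integer from 0 to 127, onto a normalized setting. The sketch below assumes this approach; the knob numbers (21 to 24), the parameter names, and the `apply_control_change` helper are hypothetical and are not taken from the project's code.

```python
MIDI_CC_MAX = 127  # MIDI control change values range from 0 to 127

# Hypothetical knob assignment; real controllers send different CC numbers.
KNOB_TO_PARAMETER = {
    21: "pitch",
    22: "pitch_stability",
    23: "breathiness",
    24: "tempo",
}


def apply_control_change(
    settings: dict[str, float], control: int, value: int
) -> dict[str, float]:
    """Return a copy of settings updated by one control change message.

    Knobs that are not assigned to a parameter leave the settings unchanged.
    """
    if not 0 <= value <= MIDI_CC_MAX:
        raise ValueError(f"MIDI CC value out of range: {value}")
    updated = dict(settings)
    name = KNOB_TO_PARAMETER.get(control)
    if name is not None:
        updated[name] = value / MIDI_CC_MAX  # scale to 0.0 to 1.0
    return updated


start = {"pitch": 0.5, "pitch_stability": 0.5, "breathiness": 0.5, "tempo": 0.5}
after_turn = apply_control_change(start, control=21, value=127)
```

Returning a new dictionary rather than mutating the old one makes it easy to log every adjustment a participant makes during a session.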
I found that pitch had the strongest influence: higher voices were often seen as too cheerful or insincere in serious contexts like medical updates, while lower voices conveyed calm and authority.
Participants almost always preferred a low setting for pitch stability. They often requested that the pitch of public announcements be stable and unchanging.
Breathiness added a sense of realism and certainty when used subtly, especially in emotionally loaded messages.
Speed needed to match the emotional tone: slower speech felt more reassuring at home but too sluggish in urgent public settings.
Results & Reflection
To improve trustworthiness, tune the voice to match the context: for a hospital call, prompt with "Speak with a calm, low pitch, and slight breathiness, slowly and gently, as if delivering important but sensitive news." For a train announcement, try "Use a firm, clear tone with moderate speed and minimal variation, like a confident guide ensuring people get to the right platform."
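The context-specific prompts above can be kept in a small lookup so that each message is paired with the right voice instructions. This is a minimal sketch: the context keys, the `build_voice_request` helper, and the shape of the returned dictionary are assumptions for illustration, and the call to an actual speech API is deliberately left out.

```python
# Prompts are quoted from the findings; the context keys are hypothetical.
CONTEXT_PROMPTS = {
    "hospital_call": (
        "Speak with a calm, low pitch, and slight breathiness, slowly and "
        "gently, as if delivering important but sensitive news."
    ),
    "train_announcement": (
        "Use a firm, clear tone with moderate speed and minimal variation, "
        "like a confident guide ensuring people get to the right platform."
    ),
}


def build_voice_request(context: str, message: str) -> dict[str, str]:
    """Pair a message with the voice instructions for its context."""
    if context not in CONTEXT_PROMPTS:
        known = ", ".join(sorted(CONTEXT_PROMPTS))
        raise ValueError(f"Unknown context {context!r}; expected one of: {known}")
    return {"instructions": CONTEXT_PROMPTS[context], "text": message}


request = build_voice_request(
    "train_announcement", "The next train departs from platform 4."
)
```

Failing loudly on an unknown context avoids silently falling back to a voice that was never tested for that setting.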
If you're a researcher interested in how voice should be designed across different contexts, you can explore the tool and contribute your own insights by visiting my GitHub.