# Voice as an input

Enabling users to interact with voice, have a conversation or take actions

**User behaviors unlocked**

✓ Using the product while moving, multitasking, or away from screens.
✓ Speaking naturally to brainstorm, narrate, or reason with the system.
✓ Building trust through visible feedback, so people feel confident actually using voice.

**User intent:** Work hands-free

**Macro trend:** Multi-modal

Voice as an input allows users to communicate with the system through, as you may have guessed, voice. It removes the need for manual navigation or even typing by enabling real-time, hands-free verbal commands. These assistants respond to trigger phrases, hold a full conversation, and complete tasks ranging from setting reminders to answering questions. The pattern is maturing, from isolated voice triggers to multimodal exchanges that bridge voice input with app functionality.


## Why does "voice input" matter?

This is a type of screenless UX, well suited to delegating low-risk or reversible tasks. It requires little cognitive effort and enables multitasking, PRO Max style.

Most recently, there's a hidden behavior shift taking place...

People are relying on voice more and more: taking walks with ChatGPT to think out loud, or talking to the system with [Wispr Flow](https://wisprflow.ai/) instead of *click, click, type*.

We're seeing voice being added to our day-to-day applications, not just standalone devices like the Echo Dot.

But why now? Because the technology is finally here. Our fear of "Uhuh, I didn't quite get that. Try again" is a thing of the past. Audio interpretation has gotten faster and much more advanced - take AirPods' live translation, for example.

In fact, it's so much better that users have their expectations raised again (since 2011's Siri launch); they expect accuracy, context awareness, and a seamless handoff between speaking, writing and editing. Done right, voice input can make a product feel like a **real partner**.

Let's dive into some key takeaways.

## Trust is visible

Users only believe in voice if they see immediate feedback and an accurate transcription. Break that once, especially during onboarding or an initial test, and you've lost that user for a couple of years.

**Perplexity** has a voice input mode with fancy visual feedback, reminiscent of the audio-visualizer days.
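To make "visible trust" concrete, here's a minimal sketch of that kind of immediate feedback using the standard Web Audio API: the moment the mic opens, an analyser node drives a simple level meter. The canvas-based meter is an illustrative assumption, not how Perplexity actually renders its visualizer.

```typescript
// Minimal sketch: show live input level as soon as the microphone is open,
// so the user immediately sees that the system is listening.
async function startLevelMeter(canvas: HTMLCanvasElement): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioCtx = new AudioContext();
  const analyser = audioCtx.createAnalyser();
  analyser.fftSize = 256;
  audioCtx.createMediaStreamSource(stream).connect(analyser);

  const data = new Uint8Array(analyser.frequencyBinCount);
  const g = canvas.getContext("2d")!;

  const draw = () => {
    analyser.getByteFrequencyData(data);
    const level = data.reduce((sum, v) => sum + v, 0) / data.length / 255; // 0..1
    g.clearRect(0, 0, canvas.width, canvas.height);
    g.fillRect(0, canvas.height * (1 - level), canvas.width, canvas.height * level);
    requestAnimationFrame(draw); // keep the feedback live while listening
  };
  draw();
}
```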

## Correction is power

A voice system is only as good as its editing flow. Punctuation, filler removal, and intent detection make it usable.

**Alexa** listens for a wake phrase, "Hey Alexa", and initiates actions like playing music, checking the weather, or controlling your home devices. Users speak their commands, and while Alexa interprets them, the device shows a vibrant blue ring as feedback. The interaction is hands-free and optimized for ambient environments like kitchens or living rooms.

**Wispr Flow** corrects grammar and removes the erms and umms. All the cleanup that used to happen visibly in text should now happen invisibly in voice.
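As a rough illustration of what that invisible cleanup involves, the sketch below strips filler words and tidies spacing and punctuation with naive rules. The filler list and heuristics are assumptions for the example; tools like Wispr Flow rely on far more sophisticated, model-driven processing.

```typescript
// Naive transcript cleanup: drop fillers, collapse whitespace, fix basic punctuation.
const FILLERS = /\b(um+|uh+|erm+|you know|i mean)\b[,.]?\s*/gi;

function cleanTranscript(raw: string): string {
  let text = raw.replace(FILLERS, "");                  // remove the erms and umms
  text = text.replace(/\s{2,}/g, " ").trim();           // tidy leftover spacing
  text = text.charAt(0).toUpperCase() + text.slice(1);  // capitalize the first word
  if (!/[.?!]$/.test(text)) text += ".";                // close the sentence
  return text;
}

// cleanTranscript("um so, you know, send the report uh tomorrow")
// -> "So, send the report tomorrow."
```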

## Voice in the flow

Context cleans up the mess. We're past the "dedicated device" era. Voice works best when it's baked into daily tools (calls, docs, chats) instead of living on a smart-speaker island. Embedding it reduces friction and raises adoption.

**Siri** is activated via long press or voice, "Hey Siri". Users can ask general questions like "What's the weather this week?" or request context-aware actions like "Add hiking to my calendar for 7 AM". The assistant integrates with apps, pulling relevant information and executing tasks while preserving continuity across follow-ups.
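For a sense of what "baked into daily tools" can look like on the web, here's a hedged sketch that streams dictation into whichever text field currently has focus, using the browser's SpeechRecognition API (exposed as webkitSpeechRecognition in Chromium; support varies by browser). Targeting the focused field is an assumption made for the example.

```typescript
// Sketch: dictate into the currently focused text field instead of a dedicated voice surface.
function dictateIntoActiveField(): void {
  const Recognition =
    (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
  if (!Recognition) return; // no support: quietly fall back to typing

  const recognizer = new Recognition();
  recognizer.continuous = true;     // keep listening across natural pauses
  recognizer.interimResults = true; // stream partial results for visible feedback

  recognizer.onresult = (event: any) => {
    const field = document.activeElement as HTMLInputElement | HTMLTextAreaElement | null;
    if (!field || !("value" in field)) return;
    const latest = event.results[event.results.length - 1];
    if (latest.isFinal) {
      field.value += latest[0].transcript + " "; // append only finalized speech
    }
  };

  recognizer.start();
}
```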

## Thinking out loud

Voice isn't just about commands anymore. Users are using it to reason, brainstorm, and multitask, so design for messy, exploratory speech, not just short queries. Subtle guidance ("Say 'next step'") empowers without overwhelming.


**Arc** enables users to speak during a phone call and have the assistant surface web results live. The AI listens mid-call, processes the query, and presents a summary or options without leaving the interface. This is an embedded voice interaction that functions within a live context, augmenting search with real-time voice input.



## AI UX checklist

When designing a voice AI UX, the interface is minimal, but it needs to account for multiple interaction and feedback states (a sketch of these states follows the list):

  • Ensure reliable wake-word detection or a manual trigger.

  • Show a "connected" state to confirm the system is ready to listen.

  • Show a listening state to keep the user in the loop on the system's status.

  • Create a response state so the user knows the system understood.

  • Design the moment when the user stops talking: what does it look and feel like?

  • Maintain context across follow-up questions or commands.

  • Show a state when the system is done talking.

  • Enable a fallback when speech is unclear or interrupted.

  • Keep voice responses brief and relevant.

  • Make the whole interaction interruptible, with priority given to the user.
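Here is one way to sketch those states in code before any visuals exist: a tiny state machine where every transition is surfaced to the user and the user can always interrupt. The `VoiceSession` and `VoiceUI` names are hypothetical placeholders to wire into your actual speech and rendering layers.

```typescript
// Hypothetical interaction states for a voice session; every transition is shown to the user.
type VoiceState =
  | "idle"        // waiting for a wake word or manual trigger
  | "connected"   // ready to listen
  | "listening"   // capturing speech, live feedback visible
  | "processing"  // user stopped talking, system is interpreting
  | "responding"  // system is speaking or rendering a result
  | "error";      // speech unclear or interrupted -> offer a fallback

interface VoiceUI {
  show(state: VoiceState): void; // e.g. ring, waveform, caption
}

class VoiceSession {
  private state: VoiceState = "idle";

  constructor(private ui: VoiceUI) {}

  private set(next: VoiceState): void {
    this.state = next;
    this.ui.show(next); // keep trust visible: never change state silently
  }

  wake(): void { if (this.state === "idle") this.set("connected"); }
  startListening(): void { if (this.state === "connected") this.set("listening"); }
  stopTalking(): void { if (this.state === "listening") this.set("processing"); }
  respond(): void { if (this.state === "processing") this.set("responding"); }
  finish(): void { this.set("idle"); }
  fail(): void { this.set("error"); }

  // Priority goes to the user: speaking over the system interrupts it.
  interrupt(): void {
    if (this.state === "responding" || this.state === "processing") this.set("listening");
  }
}
```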

The aim when working on a voice interaction is to get into code and a live experience as soon as possible to understand all the interaction states and iron out the nuances.

As a designer, there isn't much interface to draw, maybe a button and a visualizer, but a lot of design work goes into the part of the interaction that's invisible.

## Future play

Voice interfaces are soon going to evolve from simple, reactive tools into proactive, conversational systems. With advancements in memory and context modeling, future assistants will anticipate user needs and suggest helpful actions without being prompted.

Imagine this: your room is connected to a smart voice assistant that says,
"It’s getting quite warm; should I turn on the AC? Also, it’s late; would you like me to dim the lights?"

"I've gotten a ton of value out of aiverse over the last year!"

Dave Brown, Head of AI/ML at Amazon

Unlock this pattern
instantly with PRO

Access the entire Pattern Library

Access all upcoming Checklists

Access all upcoming Case studies

Get on-demand AI insights for your UX challenges

Curated by the Aiverse research team · Published Jul 15, 2025 · Last edited Jul 19, 2025
