You’re aboard the Starship Enterprise. You’re lost. But no one else is lost. That’s because they know something you don’t: speak to any wall or panel, and a virtual assistant is ready to take you to the holodeck at a moment’s notice. She’ll answer questions, open doors, and engage in (almost) human banter.
This circa 1966 Star Trek interaction is the inspiration (seriously, look it up) for a device that currently occupies 20 million homes: Alexa. For decades, we’ve been imagining a future where voice is the dominant user interface. It’s finally happening. Voice interfaces are in our homes, cars, and phones. As the tech continues to advance, new trends emerge.
Beyond Apple’s Siri and Amazon’s Alexa, tech giants and startups are developing, testing, and implementing voice assistants. Nearly half of all Americans use them.
Despite this, voice assistants have a long way to go. They didn’t hit the mainstream until Amazon introduced Alexa in 2014. Voice assistants rely heavily on machine learning and natural language processing – capabilities that took root in the late ’80s and are still in their relative infancy.
Developers see a future where voice assistants are all-knowing, self-taught, and seemingly human. 2018 will be a big step toward this future. Here are the trends to keep your eyes on.
There’s a correlation between the mass adoption of emerging tech and the development of “humanized” features.
Voice is difficult, however. Human beings use idioms and slang; we have accents; we evolve and change; we have nuances in our speech that make our “natural language” difficult to process. The challenge to create a machine that truly speaks like a human seems impossible.
To define Google Assistant’s personality, Google hired Emma Coats, ex-Pixar story artist. Amazon created the “Alexa Prize” – the largest chatbot showdown in history – where teams from universities around the globe compete for $1 million. The task: build “a socialbot that can converse coherently and engagingly with humans on popular topics for 20 minutes.”
This seems to be the future: robots that use neural networks, machine learning (ML), and natural language processing (NLP) internally to externally output what we perceive as human speech. If Star Trek nailed it in the ’60s, only time is keeping us from the Ex Machina reality, right?
Wrong, says Rand Hindi, CEO of Snips.ai (more on Snips to come). “Because it’s impossible to have all the context of a user, there will always be things the assistant cannot understand, which is why I believe more in specialized voice assistant that are great at doing specific things, rather than some sort of human AI like we see in movies.”
“A human voice interface means both a human level understanding of the sentence, and the ability to respond creatively, like a human would.” –Rand Hindi, CEO, Snips.ai
Speech is purely human, and until we can learn to chart human psychology like a logic tree, we can’t create a truly human voice interface.
Still, 2018 will be a big year for conversational voice assistants. Tech giants are investing piles of cash and resources (Amazon has over 5,000 people working on Alexa) in making voice more “human.” Out with “set a timer,” in with “how was your day?”
Chatbots are text-based assistants that use much of the same underlying technology as voice assistants (ML, NLP, etc), but have a visual interface.
DialogFlow (previously API.ai), Google’s “conversational experience” engine, enables developers to build systems that take a text request from a user. DialogFlow then pings an API or database, and responds to the user’s query with a text response. You type, “I want pizza.” The bot connects to the nearest Domino’s, orders a pizza, and types back, “It’s on the way!”
The only difference (in terms of underlying tech) comes from an added “voice” layer – you speak, the computer processes the language and converts it to text, then executes the chatbot processes above.
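The flow described above – speech converted to text, then routed through the same intent-matching logic a chatbot uses – can be sketched in a few lines of Python. Everything here is a toy stand-in: `speech_to_text` is a stub for a real STT service, and the keyword matcher stands in for the ML/NLP intent engines that DialogFlow or Lex actually use.

```python
# Minimal sketch: a voice assistant is a chatbot with a speech layer on top.

def speech_to_text(audio: bytes) -> str:
    """Stub standing in for a real speech-to-text service."""
    return audio.decode("utf-8")  # pretend the audio is already transcribed

# Toy intent table; a real engine learns these patterns with ML/NLP.
INTENTS = {
    "order.pizza": ["pizza"],
    "set.timer": ["timer"],
}

RESPONSES = {
    "order.pizza": "It's on the way!",
    "set.timer": "Timer set.",
    "fallback": "Hmm... I'm not quite sure I understand.",
}

def match_intent(text: str) -> str:
    """Toy keyword matcher in place of a trained NLP model."""
    lowered = text.lower()
    for intent, keywords in INTENTS.items():
        if any(word in lowered for word in keywords):
            return intent
    return "fallback"

def handle_voice_request(audio: bytes) -> str:
    text = speech_to_text(audio)   # the added "voice" layer
    intent = match_intent(text)    # shared chatbot logic
    return RESPONSES[intent]       # text response (or fed to text-to-speech)

print(handle_voice_request(b"I want pizza"))  # It's on the way!
```

Swap the stub for a real STT call and the keyword table for a trained model, and the structure is the same one the chatbot frameworks expose.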
Google isn’t the only company building development tools that encourage the marriage of text-based chatbots and voice assistants. Amazon built their chatbot development framework, Amazon Lex, from Alexa’s underlying tech.
IBM recently unveiled Watson Assistant – a voice-activated assistant born from existing IBM chatbot products Watson Conversation and Watson Virtual Assistant. Watson, though, markets a hardware-agnostic approach (it integrates into a number of environments, not just Echos or iPhones) as a value proposition (it should be noted that Amazon offers a similar SDK through AVS, the Alexa Voice Service). One thing Watson offers that Alexa doesn’t: customizable wake words (e.g. “Hey Snickerdoodle” instead of “Hey Alexa”).
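To see why a customizable wake word matters, here is a toy sketch of wake-word gating. Real detectors (Alexa’s, Watson’s) run on raw audio on-device; this text-based version, with a made-up wake word, only illustrates the idea that the assistant ignores everything not addressed to it.

```python
# Toy wake-word gate: only transcripts that start with the configured wake
# word reach the assistant. Real systems detect the wake word in raw audio
# on-device; matching on transcribed text here is purely illustrative.

def make_wake_word_gate(wake_word: str):
    prefix = wake_word.lower()

    def gate(transcript: str):
        lowered = transcript.lower().strip()
        if lowered.startswith(prefix):
            # Strip the wake word; pass the rest of the utterance along.
            return transcript.strip()[len(wake_word):].strip(" ,")
        return None  # not addressed to us; stay silent

    return gate

gate = make_wake_word_gate("Hey Snickerdoodle")
print(gate("Hey Snickerdoodle, play some jazz"))  # -> "play some jazz"
print(gate("Hey Alexa, play some jazz"))          # -> None
```

A white-label product like Watson Assistant effectively lets a brand supply its own `wake_word`, where Alexa hard-codes a short fixed list.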
From a UX perspective, voice and text interactions are dramatically different. From a marketing perspective, companies building for both voice and text significantly increase their audiences. The convergence of chatbot and voice frameworks is hard to ignore. Look out for co-branded, connected voice and text-based assistants moving forward.
Fun fact: in 2009, the team that developed Microsoft’s Cortana interviewed human personal assistants to influence language and functionality.
As a customer, consider how many times you’ve smashed the 0 key on your phone to skip the “automated assistant” and go straight to a human. We’ve made serious advances in AI, but haven’t reached the moment where humans prefer AI assistants over human assistants.
Facebook’s now-defunct text assistant, “M,” attempted to bridge the best of both worlds. In a 2017 trial run, 10,000 users used M to order flowers, book hotels, and get quotes from local companies all within the Messenger app. It was smarter than the competition by leaps and bounds, but that’s because it cheated.
When M didn’t know the answer, users weren’t met with the dreaded “Hmm… I’m not quite sure I understand.” Instead, they got a human whose answers created a satisfied customer and a smarter bot. A perfect hybrid approach, except that it couldn’t scale to a billion users and the entire project flopped.
Enabling human intervention at a Facebook or Amazon scale is impractical; Facebook knew that. The moral of the story: humans, not machines, will make voice assistants smarter.
Vying for the aforementioned Alexa Prize, one team from the Czech Republic knew that machine learning was a “superior method for tackling so-called classification problems, in which neural networks find unifying patterns in voluminous, noisy data.” But when it came to getting chatbots not just to translate speech into language but to say something back, good old-fashioned “handcrafting” was the way to go.
Expect to see voice assistants become “more human” – in part because of advanced ML, but mostly because of handcrafted response templates from human developers and voice UI professionals. Text-bots are inherently robotic (they happen on a screen). But voice assistants strive to capture humanity, and must integrate human training, intervention, interaction, and development.
We – as technology users – rely on companies like Amazon, Google, and Apple to predetermine every part of the voice experience. The “wake word” (Hey Siri, Okay Google) is decided for us. Data collection and storage (e.g. a cloud service connected to a server farm in Maiden, North Carolina) is managed for us. The ways our assistants get smarter (e.g. understanding contextual questions in a hotel room versus a home) are programmed for us. The drive to keep core features within an ecosystem explains why you can’t use Siri to control Spotify – and why the streaming service is attempting to break out. These three pillars of voice assistant technology – wake words, data collection/storage, and training sets – are obvious pain points for developers.
Newcomers like Watson Assistant (launched March 2018) address the above pain points head-on. Watson Assistant is a white-label product that enables customization for wake word, training sets, and data storage. It’s designed to enable the creation of branded voice experiences, and IBM’s currently advertising to hospitality, automobile, and customer service industries.
“From an industry perspective, I see a lot of companies now embracing voice in their products. But rather than a general purpose platform like Alexa, companies are looking to build their own voice assistant, with their own branding, and dedicated to whatever functionality their product has.”–Rand Hindi, CEO, Snips.ai
Inherently, there is less control in an ecosystem where more components are customizable – Watson Assistant experiences will, at first, be harder to develop than an Alexa Skill or Google Action. But an assistant named “Vroom Vroom,” trained on specific data for in-automobile requests (e.g. music and GPS control, traffic safety precautions), paints a very cool picture for the future of integrated, environment-native voice assistants.
This is (and has been) a trend. Founded in 2013, Snips, like Watson, enables customized wake words, stores data on-device, and is adamantly privacy-first (in fact, they tout GDPR compliance as a core value proposition). The organization claims to hit benchmarks on par with or better than Alexa, Siri, and DialogFlow, all without relying on third-party services or cloud-based interactions.
In today’s world, privacy and universal access are critical features for any emerging technology. Newcomers like Snips are holding giants like Amazon and Apple accountable, pushing them forward with innovations of their own. The Snips community has created over 20,000 bots (and counting).
Google’s aggressive presence at CES 2018 is an indication that voice is not going anywhere – and that Amazon’s reign may come to an end. Google is coming for the Echo Show, announcing partnerships with a number of speaker companies, several of which are building smart speakers with screens.
Google has made AI a core focus, integrating its vast network of knowledge into consumer-facing and developer-facing services like ARCore and – of course – Google Assistant. We’ve seen how powerful Google’s Vision API is – imagine what Assistant could understand if it crawled both text and visual assets.
As of April, Apple is finally addressing “the Siri problem,” hiring John Giannandrea, Google’s former head of search and artificial intelligence, to lead their AI efforts. Apple hopes this will lead to a dramatic turn-around, ending Siri’s run as the butt of voice jokes.
While Apple must rely on public data sets, Google and Facebook can leverage data from their billions of users around the globe. That said, the recent Cambridge Analytica scandal and further regulation on data collection may even the playing field.
Alexa will strive to keep its number-one spot, while smaller-market services like Cortana slip into obscurity as Microsoft slowly pivots away from consumer technology and the Windows phone. Watson Assistant and Snips will continue to push boundaries that Apple, Google, and Amazon may respond to. Even Adobe is getting into the mix.
As advances are made, regulations take shape, and new innovations come to the fore, let’s see if Amazon’s fall from the throne comes by the end of 2018.
As always, if you have questions or comments drop me a line: firstname.lastname@example.org.
- Four Pro Tips for First-Time Alexa Skill Developers
- We Published Three Alexa Skills. Only 2 Poke Fun at Trump.
- Team-Based Alexa Development: Heroku + Flask-Ask
Special thanks to editing mastermind Ruth Aitken and Rand Hindi for sharing his voice insights.