Facebook’s Portal looked like a slick alternative to the Amazon Echo speaker when it launched earlier this month, but problems abounded behind the scenes. Facebook had already delayed the video-calling device due to privacy concerns around the Cambridge Analytica scandal. And when it finally did launch, there was a glaring omission: no voice assistant from Facebook. Instead it came with Alexa, meaning anyone who bought the 15.6-inch version for $350 got an awkward gateway to Amazon, whose competing Echo Show cost at least $100 less. It also meant Facebook was blocked from collecting any speech data to train its voice technology further.
Facebook started investing heavily in voice tech in 2013. Yet despite that early start, and despite being one of the world’s biggest technology companies, with 30,275 employees and nearly $16 billion in 2017 profits, the company has yet to plant a stake in technology that lets you talk to computers, widely regarded as the next wave of human-to-machine interfaces.
The omission points to Facebook’s broader difficulties in turning innovative technology into products. Among its previous misfires: Android launcher Home, which shut down in April 2013; virtual currency Credits (closed in Sept. 2013); Snapchat competitors Poke and Slingshot (2014 and 2015) and mobile development platform Parse (2017). In the field of voice, Facebook bought multiple speech-based companies and hired experts specializing in voice technology over the last five years, but it has struggled to turn those investments into useful services, two senior sources who worked at the company told Forbes, largely due to chaotic product priorities and confusion over where researchers should focus their time. “After five years, to still not have a product is shameful,” said one.
By the time Portal became known within the company as a project some two years ago, it was too late for a concerted voice effort. “Facebook wanted to use their own speech-to-text technology for Portal, but it was not ready,” said a senior engineering source who spoke to Forbes on the condition of anonymity due to concerns about reputational and legal repercussions. Using Alexa represented “a huge drawback. If you don’t have access to data, it’s hard to progress and learn, and improve things.”
A spokeswoman for Facebook pointed out that Portal customers can activate the device by saying “Hey Portal” to initiate a call and access device controls, but admitted the company had to partner with Amazon to “provide the range of tools that people have come to expect from a home device.” Facebook did not answer questions from Forbes about its struggles to develop voice technology. In 2016, Facebook’s then head of Messenger, David Marcus, said the company was “not actively working on” voice.
Facebook has in fact worked on voice technology, but its efforts have suffered from confusing directions between product managers and voice engineers, as well as pressure to move more quickly than the development of voice-recognition technology allows. Product managers often wanted voice-based research to turn into products “within half a year,” said another senior engineering source, who asked not to be named due to nondisclosure agreements. Group-based product reviews held every six months would typically spur a change of direction, from voice-based search, to news transcription, to a voice assistant for Messenger, all internal projects that never turned into products.
The problem was that building voice technology takes much longer than half a year due to its sheer complexity. Voice data is constantly changing. There are different types of microphones, varying accents and different processing hardware between phones. To build software that recognizes speech, you also need to train it on a database of voices first, then put it out in the wild and train it some more on real voices.
For Apple’s Siri, that process took well over two years. When the iPhone maker launched Siri in October 2011, it outsourced its speech recognition software to Nuance, a legacy player in voice recognition. But Apple was loath to rely on third parties for a strategic product and set out to build its own software. In 2013 Apple established a voice technology office in Boston (a few miles away from Nuance), and in 2015, the company quietly dropped Nuance as a partner.
Overall in voice, Google is out in front. Its new Pixel 3 smartphone, for instance, includes a digital assistant feature called Duplex that can answer phone calls and even transcribe them into text in real time. “In the past nine months Google has gone from robotic voices to natural-sounding voices,” adds Peter Cahill, founder of the Irish voice technology startup Voysis. Google was responsible for developing WaveNet, a method for building eerily human-sounding artificial voices, and is “years ahead of Amazon.” Cahill described the Silicon Valley hierarchy of voice expertise like so: “Google at the top, then Amazon and Apple, then Facebook.” The latter, he added, was “struggling to get anything out the door.”
To its credit, Facebook got in early on voice technology. In 2013 it bought Mobile Technologies, a startup spun out of Carnegie Mellon University that developed Jibbigo, an early translation app that could listen to speech in one language and then play it in another language. When Facebook bought the startup and its staff of several dozen researchers for an undisclosed sum, it sparked excited speculation that Facebook would start working on some sort of competitor to Apple’s Siri, or more.
“Voice technology has become an increasingly important way for people to navigate mobile devices and the web,” Facebook’s Tom Stocky, who led the deal, wrote at the time. “This technology will help us evolve our products to match that evolution.”
Yet even as Facebook went on to double the size of Jibbigo’s team, the company didn’t end up using its voice or speech-recognition expertise. It mainly used Jibbigo’s technology to translate text on users’ posts so it wouldn’t have to rely on Microsoft’s Bing, according to a person involved in the deal. Jibbigo’s voice-recognition technology was “shut down” after one year, the person added, and essentially went to waste. “It didn’t generate enough clicks… [People] don’t have that many friends who speak another language.”
Voice technology is made up of two key components. The first is speech recognition, which carries out transcription. The second is natural-language understanding, which structures the transcription. Together such software is also referred to as voice AI.
In 2014, Facebook bought Wit.ai, a company that specialized in that second component of voice tech, natural-language understanding. Wit.ai licensed software to developers that let them structure the messiness of text into data they could query with software. Rather than combine Wit.ai’s technology with speech-recognition tech, though, Facebook used it to help businesses build chatbots on Facebook Messenger, a monetization initiative that was launched in April 2016.
Facebook’s voice efforts eventually manifested in two areas between 2015 and 2017, according to a source close to those projects: One was transcribing the audio of Facebook videos to make subtitles in real time, and the other was publishing cutting-edge research at Facebook’s AI division, known as FAIR (Facebook Artificial Intelligence Research). But the latter became part of the problem.
Facebook launched FAIR in December 2013, and the division is often compared to DeepMind, the AI research company that Google bought for an estimated $400 million in 2014. At FAIR, a team of 50 researchers work under the respected scientist Yann LeCun to solve long-term problems in AI. Facebook has a second, similar division called Applied Machine Learning (AML), with around 100 staff, responsible for commercializing AI research.
The challenges with these divisions were twofold, according to a senior source who worked on voice at Facebook. They turned into research enclaves that didn’t contribute much to product development, and they lured skilled engineers who would otherwise be working on products. “It created this parallel world of research,” the source said.
Among Facebook’s executive team, there were big ideas for using voice technology, such as building a digital assistant like Siri, but such projects required long-term commitments of time and staff. They were also thwarted by a lack of cooperation between researchers and product managers.
Many product managers who worked on Facebook’s voice ambitions didn’t have a clear understanding of the technology involved, the source added. The managers also tended to change every three to six months, just as core researchers were gravitating to the prestigious FAIR and AML divisions. The effect was like constantly repotting a tree and not giving it a chance to take root and grow. Facebook ultimately lacked “a cohesive team that stays with a problem.”
“Facebook never had a clear strategy for speech recognition or voice,” said the other senior engineering source. “It was never clear why they bought [Jibbigo]. It was a big question internally. We knew there was this team, but nobody knew why they were here.”
Ultimately, the reason Facebook didn’t make a stronger commitment to building voice technology was simple, they added: “There was no customer. Nobody said to Facebook, ‘I need this technology now.’” That was, of course, until the Portal project started developing internally. Such is the challenge for technology companies that want to keep ahead of the broader competition. Remaining innovative means making a decisive bet on a technology that has yet to be proven out, even when there’s no obvious customer. Facebook didn’t make that call with voice technology until it was too late.