Inside Amazon Echo's Voice Control

Overview

Amazon Echo is one of the best-selling IoT devices. Whether they are called smart speakers, virtual digital assistants, voice-controlled devices, or home robots, voice-based products are a rapidly growing category.

The second-generation Echo Dot was released in the U.S. this month, with its price cut from $89.99 to $49.99.

Market Impact

The Echo and Dot opened a market that drives competition among device vendors in voice capture, microphone audio resolution, advanced background-noise filtering, improved sound-field detection, and reliable connectivity to deliver better audio quality.

Companies such as XMOS are targeting this new voice-interface market even if their chips are not used in Echo. Paul Neil, VP of marketing and business development at XMOS, said that IoT is moving quickly and that voice is the most natural user interface for controlling IoT devices.

Neil added that their technology is ideal for voice interfaces because it combines traditional MCU performance, embedded DSP capabilities, and flexible I/O configurations.

However, hardware competition is only part of the smart-speaker market. Paul Erickson, senior analyst for connected home at IHS Markit, emphasized that the real competitive variable is the cloud.

Cloud services are increasingly competitive as vendors pursue smarter assistants capable of handling complex and ad hoc queries. Google planned to enter the market with Google Home and Google Assistant, and there were reports that Apple might enter later with Siri.

Future-Proofing and Cloud Updates

Another reason the Echo is popular is its potential for future-proofing: devices can gain new capabilities over time. Skip Ashton, VP of software at Silicon Labs, explained that future-proofing means ensuring a device can receive additional features as time progresses. For example, Alexa started with around 70 voice services and has grown to more than 1,700.

Echo can answer questions, read news, provide sports scores, control lights, order from Amazon, and set alarms. Users can also use Echo to call an Uber or order pizza delivery.

Ashton said Echo receives cloud updates roughly every two weeks, and Amazon notifies users when new features are released, which keeps user expectations aligned with ongoing feature additions.

Local Intelligence

Tom Hackenberg, chief analyst for embedded processors at IHS Markit, explained that smart-microphone and smart-speaker applications are valuable to processor vendors because the devices provide local intelligence while also leveraging cloud services.

Voice interfaces are appearing across broad markets, not only as digital assistants but also as speaker forms embedded in TVs, set-top boxes, HVAC/environmental controllers, and in automotive infotainment to enable hands-free operation.

Teardown Findings: Echo and Echo Dot

After tearing down Echo and Echo Dot and comparing them, Hackenberg noted that apart from memory suppliers, the processing components were not significantly different between the units.

According to an iFixit teardown, Amazon Echo used:

Samsung K4X2G323PD-8GD8 256 MB LPDDR1 RAM
SanDisk SDIN7DP2-4G 4 GB iNAND Ultra Flash (non-volatile storage)

The newer Dot used:

Micron MT46H64M32LFBQ 256 MB LPDDR SDRAM
Samsung KLM4G1FEPD 4 GB high-performance eMMC NAND Flash

Both products use the same processors: a Texas Instruments DM3725 media processor at the core, and a Qualcomm Atheros QCA6234 application-specific standard product (ASSP) for connectivity.

Hackenberg noted that memory choices can affect performance but are subject to price fluctuations, so component changes throughout a product lifecycle are common. Connectivity modules and, especially, media processors are more complex and typically remain stable unless a major product update occurs.

Atheros processors are designed as connectivity ASSPs built on a customized Tensilica Xtensa core and focus on coordinating network communications. Connectivity is critical because it affects which data can be retrieved and the speed and reliability of cloud interactions. Improvements in Wi-Fi throughput, quality of service (QoS), and range help deliver more immediate interactions between the cloud and the speaker.

All local intelligence is handled by the TI DM3725. Hackenberg said this SoC is designed for multimedia applications such as set-top boxes, TVs, and game systems. The DM3725 integrates an ARM Cortex-A8, TI's C64x+ DSP, and a 3D graphics acceleration engine. The Cortex-A8 is a mature and economical application processor that is adequate for executing simple local tasks. If applications grow beyond those of a simple speaker, changes may be required.

Integrated DSP

Hackenberg said the key to the SoC is the integration of a DSP and potentially a GPU. Typical designs include multiple input sensors, primarily microphones. The incoming audio is first heavily filtered by the DSP so the system can differentiate user speech from environmental noise quickly.
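As a rough illustration of the idea of separating speech from background noise, the sketch below flags audio frames whose level rises well above an estimated noise floor. It is a minimal energy-based detector written for this article, not the filtering Echo's DSP actually performs; the function names, the 20 ms frame size, the threshold ratio, and the assumption that the first few frames contain only noise are all hypothetical.

```python
import math

def frame_rms(samples):
    """Root-mean-square level of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_speech(frames, noise_frames=5, ratio=3.0):
    """Flag frames whose RMS exceeds `ratio` times the noise floor.

    Assumption: the first `noise_frames` frames contain no speech, so
    their mean RMS can stand in for the background noise level.
    """
    levels = [frame_rms(f) for f in frames]
    floor = sum(levels[:noise_frames]) / noise_frames
    return [level > ratio * floor for level in levels]

# Synthetic input: five quiet frames, then three louder frames as "speech".
RATE = 16000   # samples per second (assumed)
FRAME = 320    # 20 ms at 16 kHz (assumed)
quiet = [[0.01 * math.sin(2 * math.pi * 50 * n / RATE) for n in range(FRAME)]
         for _ in range(5)]
loud = [[0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(FRAME)]
        for _ in range(3)]
flags = detect_speech(quiet + loud)
```

In this synthetic case only the last three frames are flagged. A production front end would use far more selective filtering, but the structure (estimate the noise, then compare each frame against it) is the same.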

The system can also determine the relative position of the person speaking or identify who is speaking; it builds patterns that can be processed or matched, often with parts of the processing sent to the cloud.
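One common way to estimate where a sound comes from is to measure how much later it reaches one microphone than another. The sketch below finds that lag by cross-correlating two channels and converts it to an approximate arrival angle. It is an illustration only: the article does not describe Echo's localization method, and the function names, the 16 kHz sample rate, and the 0.1 m microphone spacing are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def best_lag(ref, other, max_lag):
    """Return the lag in samples at which `other` best matches `ref`.

    A positive result means `other` received the sound later than `ref`.
    """
    def score(lag):
        return sum(ref[n] * other[n + lag]
                   for n in range(len(ref))
                   if 0 <= n + lag < len(other))
    return max(range(-max_lag, max_lag + 1), key=score)

def arrival_angle(lag, rate, spacing):
    """Far-field angle in degrees from broadside for a two-microphone pair."""
    x = SPEED_OF_SOUND * lag / (rate * spacing)
    return math.degrees(math.asin(max(-1.0, min(1.0, x))))

# Synthetic input: a short click reaches microphone B three samples after A.
mic_a = [0.0] * 64
mic_a[20:24] = [1.0, -0.8, 0.5, -0.2]
mic_b = [0.0] * 3 + mic_a[:-3]

lag = best_lag(mic_a, mic_b, max_lag=8)
angle = arrival_angle(lag, rate=16000, spacing=0.1)  # assumed rate and spacing
```

Here the estimated lag is three samples. With more than two microphones, the same pairwise estimate can be repeated across the array to narrow down the direction.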

Hackenberg suggested that a GPU can be used locally for fast and efficient pattern matching. This allows the device to respond to stored control patterns—such as "lower volume" or "change channel"—without a network connection. The application core then provides the required response and controls inputs or displays as needed.
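The local-versus-cloud split described above can be sketched as a nearest-template lookup: compare features extracted from the audio against a small set of stored patterns, and hand the request to the cloud only when nothing matches well. This is a simplified illustration, not Amazon's implementation; the template vectors, the `match_command` name, and the 0.9 similarity threshold are hypothetical, and real systems use much richer acoustic features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical stored control patterns: command name -> feature vector.
TEMPLATES = {
    "lower volume": [0.9, 0.1, 0.0, 0.2],
    "change channel": [0.1, 0.8, 0.3, 0.0],
}

def match_command(features, threshold=0.9):
    """Return the best-matching local command, or None to defer to the cloud."""
    name, score = max(((n, cosine(features, t)) for n, t in TEMPLATES.items()),
                      key=lambda pair: pair[1])
    return name if score >= threshold else None

local = match_command([0.85, 0.15, 0.05, 0.2])   # close to a stored pattern
remote = match_command([0.4, 0.4, 0.4, 0.4])     # matches nothing well
```

The first input resolves locally to "lower volume"; the second returns None, which a caller would treat as the signal to send the request to the cloud.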

Microphone Array

One appealing feature of Echo and Dot is the seven-microphone array. Amazon states that multiple microphones and beamforming techniques enable Echo and Dot to hear across a room even when music is playing, and that Echo is tuned to provide 360-degree immersive audio.
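The simplest form of beamforming is delay-and-sum: shift each microphone's signal so that sound from the target direction lines up, then average the channels so the target reinforces while uncorrelated noise partly cancels. The sketch below shows the principle with seven synthetic channels. It is not Amazon's algorithm; the per-microphone delays, the noise level, and the function names are assumptions chosen for illustration.

```python
import math
import random

def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    `delays[i]` is how many samples later channel i hears the target
    than the earliest microphone.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(length)]

# Synthetic input: a 300 Hz tone reaches seven microphones at different
# times, and each microphone adds its own independent noise.
rng = random.Random(0)
target = [math.sin(2 * math.pi * 300 * n / 16000) for n in range(1000)]
delays = [0, 1, 2, 3, 4, 5, 6]  # hypothetical per-microphone delays
channels = [[0.0] * d + [s + rng.gauss(0, 0.5) for s in target]
            for d in delays]

output = delay_and_sum(channels, delays)

def mse(signal):
    """Mean squared error of a signal against the clean target."""
    return sum((x - t) ** 2 for x, t in zip(signal, target)) / len(signal)

single_mic_error = mse(channels[0])
beamformed_error = mse(output)
```

Comparing `beamformed_error` with `single_mic_error` shows the averaged output tracking the clean tone more closely than any single noisy channel. Steering toward a different direction amounts to choosing a different set of delays.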

Marwan Boustany, senior analyst for MEMS and sensors at IHS Markit, said Echo uses Knowles MEMS microphones. Improving the audio-band signal-to-noise ratio, matching, and performance helps far-field voice capture and improves speech recognition.

Ultimately, algorithm performance is the real key to better speech recognition. The "intelligence" often continues to rely on cloud processing, while local processing improves recognition for simple or predefined phrases such as wake words. Software vendors focused on voice recognition will be important to voice-enabled home systems like Alexa.

[Figure: XMOS xCore voice interface example]

Rising Competition

Several suppliers of microcontrollers and connectivity ASSPs could compete in this area, including Apple, Broadcom, Cypress, Microchip, NXP, Renesas, STMicroelectronics, and Silicon Labs. Designs combining 802.11n and Bluetooth 4.0 are less common; some lower-cost designs may use only Bluetooth.

Media processors present a tougher challenge. Mobile application processor vendors can supply suitable parts, but costs may be too high for simple applications. Vendors might choose solutions without equivalent DSP or pattern-matching capabilities.

Hackenberg listed candidates such as Apple Ax, Broadcom BCM7xxxx, HiSilicon Hi3xxx, NXP i.MX, MediaTek MT8xxx, ST STiHxxx, and Qualcomm Snapdragon. TI may have cost advantages for DSP support critical to voice recognition, but other vendors are narrowing the gap.

XMOS expects to gain momentum in this market. For voice-assistant products like Echo, key performance factors include far-field voice capture, beamforming, and processing speed. Neil said XMOS single-chip devices, with significant processing power and embedded DSP, provide scalable, differentiated solutions.