Hardware Guide
The ESP32-S3 handles on-device keyword spotting with TFLite Micro using DS-CNN models that classify 1-second audio windows into predefined commands. The 512 KB SRAM and 240 MHz dual-core processor handle feature extraction (MFCC) and inference within the real-time audio deadline.
| Spec | ESP32-S3 |
|---|---|
| Processor | Dual-core Xtensa LX7 @ 240 MHz |
| SRAM | 512 KB |
| Flash | Up to 16 MB (external) |
| Key Features | Vector instructions (SIMD), USB OTG, LCD/Camera interface, Up to 8 MB PSRAM |
| Connectivity | Wi-Fi 802.11 b/g/n, Bluetooth 5.0 LE |
| Price Range | $3 - $8 (chip), $10 - $25 (dev board) |
Voice recognition on the ESP32-S3 requires audio preprocessing (MFCC feature extraction) followed by neural network inference. The standard DS-CNN model from the TFLite Micro micro_speech example uses ~80 KB for the model plus ~50 KB for audio buffers and MFCC computation, well within the 512 KB SRAM budget. The Xtensa LX7's vector instructions accelerate the feature extraction math.

The ESP32-S3 has no built-in microphone, so you need an external I2S MEMS microphone such as the INMP441 or SPH0645. One core can handle audio capture and preprocessing while the second runs inference, a clear advantage over single-core alternatives. Wi-Fi connectivity enables forwarding recognized commands to cloud services or local smart home systems.
Set up ESP-IDF with TFLite Micro
Install ESP-IDF v5.1+ and add the tflite-micro-esp-examples component. The micro_speech example is the reference implementation for keyword spotting on ESP32.
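A typical setup sequence looks like the following. The repository path inside `tflite-micro-esp-examples` is an assumption; check the repo layout for the current location of the micro_speech example.

```shell
# Assumes ESP-IDF v5.1+ is already installed and exported into this shell.
git clone --recursive https://github.com/espressif/tflite-micro-esp-examples.git
cd tflite-micro-esp-examples/examples/micro_speech

# Select the ESP32-S3 target, then build, flash, and watch serial output.
idf.py set-target esp32s3
idf.py build flash monitor
```

`idf.py set-target` regenerates the build configuration for the S3, so run it before the first build rather than after.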
Connect an I2S MEMS microphone
Wire an INMP441 or SPH0645 I2S microphone to the ESP32-S3's I2S peripheral. Configure I2S in receive mode with 16 kHz sample rate and 16-bit depth.
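A sketch of the receive-side configuration using the ESP-IDF v5.x `i2s_std` driver is shown below. The GPIO numbers are placeholders for your wiring, and this is a minimal outline rather than a complete driver: some INMP441 setups capture 32-bit slots and shift the 24-bit samples down to 16 bits instead of configuring 16-bit slots directly.

```c
/* Minimal I2S RX setup sketch for an I2S MEMS mic on ESP-IDF v5.x.
 * GPIO numbers are placeholders -- match them to your wiring. */
#include <stdint.h>
#include <stddef.h>
#include "freertos/FreeRTOS.h"
#include "driver/i2s_std.h"

static i2s_chan_handle_t rx_chan;

void mic_init(void) {
    i2s_chan_config_t chan_cfg =
        I2S_CHANNEL_DEFAULT_CONFIG(I2S_NUM_AUTO, I2S_ROLE_MASTER);
    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, NULL, &rx_chan));

    i2s_std_config_t std_cfg = {
        .clk_cfg  = I2S_STD_CLK_DEFAULT_CONFIG(16000),       /* 16 kHz */
        .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
            I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO),   /* 16-bit mono */
        .gpio_cfg = {
            .mclk = I2S_GPIO_UNUSED,
            .bclk = GPIO_NUM_4,       /* SCK (placeholder) */
            .ws   = GPIO_NUM_5,       /* WS  (placeholder) */
            .dout = I2S_GPIO_UNUSED,  /* receive only */
            .din  = GPIO_NUM_6,       /* SD  (placeholder) */
        },
    };
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_chan, &std_cfg));
    ESP_ERROR_CHECK(i2s_channel_enable(rx_chan));
}

/* Blocking read of one chunk of samples into dest. */
size_t mic_read(int16_t *dest, size_t bytes) {
    size_t bytes_read = 0;
    ESP_ERROR_CHECK(i2s_channel_read(rx_chan, dest, bytes,
                                     &bytes_read, portMAX_DELAY));
    return bytes_read;
}
```

Calling `mic_init()` once at startup and `mic_read()` from the audio-capture task keeps the capture path simple to pin to one core later.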
Implement MFCC feature extraction
The micro_speech example includes audio_provider and feature_provider modules. These capture 1-second audio windows and compute 40-channel MFCC features as input to the classifier.
Train or customize the keyword model
The default micro_speech model recognizes 'yes', 'no', silence, and unknown speech. To add custom keywords, use the Speech Commands dataset with TensorFlow's training scripts, then quantize to int8 and convert to a C array for embedding.
The built-in microphone on the Sense variant eliminates external hardware, and Edge Impulse makes setup easier, but it offers half the RAM (256 KB) and roughly a quarter of the clock speed (64 MHz).
1 MB of SRAM and a 480 MHz Cortex-M7 allow larger-vocabulary models and faster inference, and CMSIS-NN accelerates the DSP operations, but there is no Wi-Fi and the cost is higher.
Chain keyword detection with actions on the ESP32-S3: compile the full pipeline to firmware from a visual workflow.