Hardware Guide
The ESP32-S3 handles on-device keyword spotting with TFLite Micro using DS-CNN models that classify 1-second audio windows into predefined commands. The 512 KB SRAM and 240 MHz dual-core processor handle feature extraction (MFCC) and inference within the real-time audio deadline.
| Spec | ESP32-S3 |
|---|---|
| Processor | Dual-core Xtensa LX7 @ 240 MHz |
| SRAM | 512 KB |
| Flash | Up to 16 MB (external) |
| Key Features | Vector instructions (SIMD), USB OTG, LCD/Camera interface, Up to 8 MB PSRAM |
| Connectivity | Wi-Fi 802.11 b/g/n, Bluetooth 5.0 LE |
| Price Range | $3 - $8 (chip), $10 - $25 (dev board) |
Voice recognition on the ESP32-S3 requires audio preprocessing (MFCC feature extraction) followed by neural network inference. The standard DS-CNN model from the TFLite Micro micro_speech example uses ~80 KB for the model plus ~50 KB for audio buffers and MFCC computation, well within the 512 KB SRAM budget. The Xtensa LX7's vector instructions accelerate the feature extraction math.

The ESP32-S3 has no built-in microphone, so you need an external I2S MEMS microphone such as the INMP441 or SPH0645. One core can handle audio capture and preprocessing while the second runs inference, a clear advantage over single-core alternatives. Wi-Fi connectivity enables forwarding recognized commands to cloud services or local smart home systems.
Set up ESP-IDF with TFLite Micro
Install ESP-IDF v5.1+ and add the tflite-micro-esp-examples component. The micro_speech example is the reference implementation for keyword spotting on ESP32.
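A typical setup sequence looks like the following. The repository path inside `tflite-micro-esp-examples` is an assumption; check the repo layout for the current location of the micro_speech example.

```shell
# Assumes ESP-IDF v5.1+ is already installed and exported into this shell.
git clone --recursive https://github.com/espressif/tflite-micro-esp-examples.git
cd tflite-micro-esp-examples/examples/micro_speech

# Select the ESP32-S3 target, then build, flash, and watch serial output.
idf.py set-target esp32s3
idf.py build flash monitor
```

`idf.py set-target` regenerates the build configuration for the S3, so run it before the first build rather than after.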
Connect an I2S MEMS microphone
Wire an INMP441 or SPH0645 I2S microphone to the ESP32-S3's I2S peripheral. Configure I2S in receive mode with 16 kHz sample rate and 16-bit depth.
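A sketch of the receive-side configuration using the ESP-IDF v5.x `i2s_std` driver is shown below. The GPIO numbers are placeholders for your wiring, and this is a minimal outline rather than a complete driver: some INMP441 setups capture 32-bit slots and shift the 24-bit samples down to 16 bits instead of configuring 16-bit slots directly.

```c
/* Minimal I2S RX setup sketch for an I2S MEMS mic on ESP-IDF v5.x.
 * GPIO numbers are placeholders -- match them to your wiring. */
#include <stdint.h>
#include <stddef.h>
#include "freertos/FreeRTOS.h"
#include "driver/i2s_std.h"

static i2s_chan_handle_t rx_chan;

void mic_init(void) {
    i2s_chan_config_t chan_cfg =
        I2S_CHANNEL_DEFAULT_CONFIG(I2S_NUM_AUTO, I2S_ROLE_MASTER);
    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, NULL, &rx_chan));

    i2s_std_config_t std_cfg = {
        .clk_cfg  = I2S_STD_CLK_DEFAULT_CONFIG(16000),       /* 16 kHz */
        .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
            I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO),   /* 16-bit mono */
        .gpio_cfg = {
            .mclk = I2S_GPIO_UNUSED,
            .bclk = GPIO_NUM_4,       /* SCK (placeholder) */
            .ws   = GPIO_NUM_5,       /* WS  (placeholder) */
            .dout = I2S_GPIO_UNUSED,  /* receive only */
            .din  = GPIO_NUM_6,       /* SD  (placeholder) */
        },
    };
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_chan, &std_cfg));
    ESP_ERROR_CHECK(i2s_channel_enable(rx_chan));
}

/* Blocking read of one chunk of samples into dest. */
size_t mic_read(int16_t *dest, size_t bytes) {
    size_t bytes_read = 0;
    ESP_ERROR_CHECK(i2s_channel_read(rx_chan, dest, bytes,
                                     &bytes_read, portMAX_DELAY));
    return bytes_read;
}
```

Calling `mic_init()` once at startup and `mic_read()` from the audio-capture task keeps the capture path simple to pin to one core later.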
Implement MFCC feature extraction
The micro_speech example includes audio_provider and feature_provider modules. These capture 1-second audio windows and compute 40-channel MFCC features as input to the classifier.
Train or customize the keyword model
The default micro_speech model recognizes 'yes', 'no', silence, and unknown speech. To add custom keywords, use the Speech Commands dataset with TensorFlow's training scripts, then quantize to int8 and convert to a C array for embedding.
The built-in microphone on the Sense variant eliminates external hardware, and Edge Impulse makes setup easier, but it offers half the RAM (256 KB) and roughly a quarter of the clock speed (64 MHz).
1 MB of SRAM and a 480 MHz Cortex-M7 allow larger-vocabulary models and faster inference, and CMSIS-NN accelerates the DSP operations, but there is no Wi-Fi and the cost is higher.
Chain keyword detection with actions on the ESP32-S3: compile the full pipeline to firmware from a visual workflow.