Can the ESP32-S3 do real-time keyword spotting?

Yes. The dual-core Xtensa LX7 at 240 MHz processes MFCC feature extraction on one core and runs DS-CNN inference on the other. A typical keyword spotting cycle (1-second window) completes well within the real-time deadline.

What microphone works with the ESP32-S3 for speech recognition?

I2S MEMS microphones like the INMP441 or SPH0645 connect directly to the ESP32-S3's I2S peripheral. Configure for 16 kHz sample rate. The ESP32-S3 does not have a built-in microphone.

How many keywords can the ESP32-S3 recognize simultaneously?

The standard DS-CNN model supports 10-12 keyword classes plus silence and unknown categories. Larger vocabularies require more model capacity. The 512 KB SRAM supports models up to ~150 KB, which can handle 20-30 keywords with accuracy tradeoffs.

Hardware Guide

ESP32-S3 for Voice Recognition with TensorFlow Lite Micro

The ESP32-S3 handles on-device keyword spotting with TFLite Micro using DS-CNN models that classify 1-second audio windows into predefined commands. The 512 KB SRAM and 240 MHz dual-core processor handle feature extraction (MFCC) and inference within the real-time audio deadline.

Hardware Specifications

Spec	ESP32-S3
Processor	Dual-core Xtensa LX7 @ 240 MHz
SRAM	512 KB
Flash	16 MB
Connectivity	Wi-Fi 802.11 b/g/n, Bluetooth 5.0 LE
Price range	$3-8 (chip), $10-25 (board)

Compatibility: Good

Voice recognition on the ESP32-S3 requires audio preprocessing (MFCC feature extraction) followed by neural network inference. The standard DS-CNN model from the TFLite Micro micro_speech example uses ~80 KB for the model plus ~50 KB for audio buffers and MFCC computation, well within the 512 KB SRAM budget. The Xtensa LX7's vector instructions accelerate the feature extraction math. The ESP32-S3 lacks a built-in microphone, so you need an external I2S MEMS microphone (INMP441 or SPH0645). One core handles audio capture and preprocessing while the second core runs inference, so the dual-core architecture is a clear advantage over single-core alternatives. Wi-Fi connectivity enables forwarding recognized commands to cloud services or local smart home systems.
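The memory figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in plain C, assuming the approximate sizes quoted in the text (~80 KB model, ~50 KB buffers) and 16-bit mono audio; the macro and function names are hypothetical and not part of ESP-IDF or TFLite Micro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rough figures from the text above; these are estimates, not measurements. */
#define SRAM_TOTAL_BYTES  (512u * 1024u)
#define MODEL_BYTES       (80u * 1024u)  /* ~80 KB model */
#define AUDIO_MFCC_BYTES  (50u * 1024u)  /* ~50 KB audio buffers + MFCC */

/* Bytes of 16-bit mono PCM produced in one window of window_ms milliseconds. */
size_t pcm_window_bytes(uint32_t sample_rate_hz, uint32_t window_ms)
{
    assert(sample_rate_hz > 0 && window_ms > 0);
    return (size_t)sample_rate_hz * window_ms / 1000u * sizeof(int16_t);
}

/* SRAM remaining after the model and the audio/MFCC buffers are accounted for.
 * This is an upper bound: the RTOS, Wi-Fi stack, and application also use SRAM. */
size_t sram_headroom_bytes(void)
{
    return SRAM_TOTAL_BYTES - MODEL_BYTES - AUDIO_MFCC_BYTES;
}
```

With these assumptions, a 1-second window at 16 kHz occupies 32,000 bytes of raw audio, and the pipeline leaves roughly 382 KB of the 512 KB for everything else.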

Getting Started

1

Set up ESP-IDF with TFLite Micro

Install ESP-IDF v5.1+ and add the tflite-micro-esp-examples component. The micro_speech example is the reference implementation for keyword spotting on ESP32.
2

Connect an I2S MEMS microphone

Wire an INMP441 or SPH0645 I2S microphone to the ESP32-S3's I2S peripheral. Configure I2S in receive mode with 16 kHz sample rate and 16-bit depth.
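One detail worth planning for: I2S MEMS microphones such as the INMP441 typically deliver their samples left-justified in 32-bit slots, so a capture path that reads 32-bit words has to narrow them to 16-bit PCM before feature extraction. The sketch below shows one way to do that in plain C. It assumes the raw words have already been read from the I2S peripheral into a buffer; the function name is hypothetical and the driver calls themselves are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: keep the top 16 bits of each left-justified
 * 32-bit I2S slot word to obtain signed 16-bit PCM. */
void i2s_slots_to_pcm16(const int32_t *slots, int16_t *pcm, size_t count)
{
    assert(slots != NULL && pcm != NULL);
    for (size_t i = 0; i < count; ++i) {
        /* Division (rather than >>) is well-defined for negative values
         * in C; it rounds toward zero and the result fits in int16_t. */
        pcm[i] = (int16_t)(slots[i] / 65536);
    }
}
```

If the microphone output is quiet, dividing by a smaller power of two adds gain, but the result must then be clamped to the int16_t range before the cast.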
3

Implement MFCC feature extraction

The micro_speech example includes audio_provider and feature_provider modules. These capture 1-second audio windows and compute 40-channel MFCC features as input to the classifier.
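To see how a 1-second window turns into a fixed-size classifier input, the framing arithmetic can be sketched as follows. The 30 ms frame length and 20 ms stride used in the usage note are assumptions for illustration (check the feature_provider configuration in your copy of the example), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of full analysis frames that fit in a clip when a frame of
 * frame_ms is advanced by stride_ms each step. */
size_t frame_count(uint32_t clip_ms, uint32_t frame_ms, uint32_t stride_ms)
{
    assert(stride_ms > 0);
    if (clip_ms < frame_ms) {
        return 0; /* clip too short for even one frame */
    }
    return (size_t)((clip_ms - frame_ms) / stride_ms) + 1u;
}

/* Total feature values handed to the classifier for one clip. */
size_t feature_tensor_len(size_t frames, size_t channels)
{
    return frames * channels;
}
```

Under those assumptions, frame_count(1000, 30, 20) yields 49 frames, and with 40 channels per frame the classifier input holds 1,960 values.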
4

Train or customize the keyword model

The default model recognizes 'yes' and 'no'. To add custom keywords, use the Speech Commands dataset with TensorFlow's training scripts, then quantize to int8 and convert to a C array for embedding.
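The int8 quantization step maps each real value to an 8-bit integer through a scale and a zero point, with real = scale * (q - zero_point). The sketch below illustrates that mapping in plain C; it is not the converter's implementation, and the function names and the example scale are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Map a real value to int8 using affine quantization. */
int8_t quantize_int8(float x, float scale, int32_t zero_point)
{
    assert(scale > 0.0f);
    float r = x / scale;
    /* Pre-clamp so the float-to-int cast below cannot overflow. */
    if (r > 512.0f) r = 512.0f;
    if (r < -512.0f) r = -512.0f;
    /* Round half away from zero, then shift by the zero point. */
    int32_t q = (int32_t)(r >= 0.0f ? r + 0.5f : r - 0.5f) + zero_point;
    /* Saturate to the int8 range. */
    if (q < -128) q = -128;
    if (q > 127) q = 127;
    return (int8_t)q;
}

/* Recover the approximate real value from its int8 representation. */
float dequantize_int8(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)((int32_t)q - zero_point);
}
```

With a scale of 0.5 and a zero point of 0, the value 1.0 quantizes to 2 and dequantizes back to 1.0, while values beyond the representable range saturate at -128 or 127. That saturation and rounding is where the accuracy loss of a quantized model comes from.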

Alternatives

Arduino Nano 33 BLE with Edge Impulse

The built-in microphone on the Sense variant eliminates external hardware. Setup via Edge Impulse is easier, but the board has half the RAM (256 KB) and roughly a quarter of the clock speed (64 MHz).

STM32H7 with TFLite Micro

1 MB SRAM and a 480 MHz Cortex-M7 allow larger vocabulary models and faster inference. CMSIS-NN provides optimized neural network kernels for Cortex-M cores. No Wi-Fi, and higher cost.


Orchestrating Voice AI Agents with ForestHub

Keyword spotting runs on-device; ForestHub on the Linux edge gateway routes events, adds LLM reasoning as a node, and takes action, with every step replayable and auditable.
