Hardware Guide
ESP32-S3 for Voice Recognition with TensorFlow Lite Micro
The ESP32-S3 performs on-device keyword spotting with TFLite Micro, using DS-CNN models that classify 1-second audio windows into predefined commands. Its 512 KB of SRAM and 240 MHz dual-core processor are sufficient to run feature extraction (MFCC) and inference within the real-time audio deadline.
Hardware Specifications
| Spec | ESP32-S3 |
|---|---|
| Processor | Dual-core Xtensa LX7 @ 240 MHz |
| SRAM | 512 KB |
| Flash | 16 MB |
| Connectivity | Wi-Fi 802.11 b/g/n, Bluetooth 5.0 LE |
| Price range | $3-8 (chip), $10-25 (board) |
Compatibility:
Voice recognition on the ESP32-S3 requires audio preprocessing (MFCC feature extraction) followed by neural network inference. The standard DS-CNN model from the TFLite Micro micro_speech example uses ~80 KB for the model plus ~50 KB for audio buffers and MFCC computation, about 130 KB in total and well within the 512 KB SRAM budget. The Xtensa LX7's vector instructions accelerate the feature extraction math. The ESP32-S3 has no built-in microphone, so you need an external I2S MEMS microphone (INMP441 or SPH0645). One core can handle audio capture and preprocessing while the second core runs inference, which is an advantage of the dual-core architecture over single-core alternatives. Wi-Fi connectivity allows recognized commands to be forwarded to cloud services or local smart home systems.
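The memory budget above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the approximate figures quoted in the text; the constant and function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Approximate figures quoted in the text, in KB.
constexpr std::uint32_t kSramKb = 512;         // ESP32-S3 on-chip SRAM
constexpr std::uint32_t kModelKb = 80;         // DS-CNN model (approximate)
constexpr std::uint32_t kAudioAndMfccKb = 50;  // audio buffers + MFCC (approximate)

// Total SRAM claimed by the keyword-spotting pipeline, in KB.
inline std::uint32_t UsedSramKb(std::uint32_t model_kb,
                                std::uint32_t buffers_kb) {
  return model_kb + buffers_kb;
}

// SRAM left over after the pipeline has been accounted for, in KB.
inline std::uint32_t RemainingSramKb(std::uint32_t sram_kb,
                                     std::uint32_t model_kb,
                                     std::uint32_t buffers_kb) {
  const std::uint32_t used = UsedSramKb(model_kb, buffers_kb);
  assert(used <= sram_kb);  // the pipeline must fit at all
  return sram_kb - used;
}

// Share of SRAM used by the pipeline, in percent, rounded down.
inline std::uint32_t UsedSramPercent(std::uint32_t sram_kb,
                                     std::uint32_t model_kb,
                                     std::uint32_t buffers_kb) {
  assert(sram_kb > 0);
  return UsedSramKb(model_kb, buffers_kb) * 100 / sram_kb;
}
```

With the quoted figures this gives 130 KB used, 382 KB remaining, and roughly a quarter of SRAM occupied. The remainder is not all available to the application, since the RTOS, the Wi-Fi stack, and task stacks also draw from the same SRAM.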
Getting Started
1. Set up ESP-IDF with TFLite Micro
Install ESP-IDF v5.1+ and add the tflite-micro-esp-examples component. The micro_speech example is the reference implementation for keyword spotting on the ESP32.
2. Connect an I2S MEMS microphone
Wire an INMP441 or SPH0645 I2S microphone to the ESP32-S3's I2S peripheral. Configure I2S in receive mode with a 16 kHz sample rate and 16-bit depth.
3. Implement MFCC feature extraction
The micro_speech example includes audio_provider and feature_provider modules. These capture 1-second audio windows and compute 40-channel MFCC features as input to the classifier.
4. Train or customize the keyword model
The default model recognizes 'yes' and 'no'. To add custom keywords, train on the Speech Commands dataset with TensorFlow's training scripts, then quantize the model to int8 and convert it to a C array for embedding in the firmware.
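The buffer sizes implied by steps 2 and 3, and the int8 mapping mentioned in step 4, can be sketched as follows. This is an illustration under stated assumptions, not the micro_speech implementation: the 30 ms frame length and 20 ms stride are assumed values that the text does not specify, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kSampleRateHz = 16000;  // step 2: 16 kHz sample rate
constexpr int kBytesPerSample = 2;    // step 2: 16-bit depth
constexpr int kWindowMs = 1000;       // step 3: 1-second window
constexpr int kFeatureChannels = 40;  // step 3: 40 channels per frame
constexpr int kFrameMs = 30;          // ASSUMPTION: analysis frame length
constexpr int kStrideMs = 20;         // ASSUMPTION: hop between frames

// Samples in one classification window.
constexpr int SamplesPerWindow() { return kSampleRateHz * kWindowMs / 1000; }

// Raw PCM bytes needed to hold one window.
constexpr int WindowBytes() { return SamplesPerWindow() * kBytesPerSample; }

// Number of full frames that fit into one window at the assumed stride.
constexpr int FramesPerWindow() {
  return 1 + (kWindowMs - kFrameMs) / kStrideMs;
}

// Total feature values handed to the classifier per window.
constexpr int FeatureCount() { return FramesPerWindow() * kFeatureChannels; }

// Affine int8 quantization: q = round(x / scale) + zero_point,
// clamped to the int8 range [-128, 127].
inline std::int8_t QuantizeToInt8(float value, float scale, int zero_point) {
  assert(scale > 0.0f);
  const long q = std::lround(value / scale) + zero_point;
  const long clamped = q < -128 ? -128 : (q > 127 ? 127 : q);
  return static_cast<std::int8_t>(clamped);
}
```

Under these assumptions one window holds 16,000 samples (32,000 bytes of PCM) and yields 49 frames of 40 values each. On the device, the scale and zero point would normally come from the quantization parameters stored with the model's input tensor rather than being chosen by hand.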
Alternatives
Arduino Nano 33 BLE with Edge Impulse
The built-in microphone on the Sense variant eliminates external hardware. Setup via Edge Impulse is easier, but the board has half the RAM (256 KB) and roughly a quarter of the clock speed (64 MHz).
STM32H7 with TFLite Micro
1 MB of SRAM and a 480 MHz Cortex-M7 allow larger-vocabulary models and faster inference. CMSIS-NN provides optimized neural network kernels. The tradeoffs are no built-in Wi-Fi and higher cost.
Frequently Asked Questions
- Can the ESP32-S3 do real-time keyword spotting?
- Yes. The dual-core Xtensa LX7 at 240 MHz runs MFCC feature extraction on one core and DS-CNN inference on the other. A typical keyword spotting cycle (1-second window) completes well within the real-time deadline.
- What microphone works with the ESP32-S3 for voice recognition?
- I2S MEMS microphones like the INMP441 or SPH0645 connect directly to the ESP32-S3's I2S peripheral. Configure for 16 kHz sample rate. The ESP32-S3 does not have a built-in microphone.
- How many keywords can the ESP32-S3 recognize simultaneously?
- The standard DS-CNN model supports 10-12 keyword classes plus silence and unknown categories. Larger vocabularies require more model capacity — the 512 KB SRAM supports models up to ~150 KB, which can handle 20-30 keywords with accuracy tradeoffs.
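To make the "keywords plus silence and unknown" layout concrete, the sketch below picks a command from a vector of int8 class scores. It is a simplified illustration: the label order, the threshold, and every name in it are hypothetical, and a production system would typically also smooth scores over several consecutive windows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical label order: two non-command classes first, then keywords.
constexpr const char* kLabels[] = {"silence", "unknown", "yes", "no"};
constexpr std::size_t kLabelCount = sizeof(kLabels) / sizeof(kLabels[0]);
constexpr std::size_t kFirstKeywordIndex = 2;  // classes below this are not commands

struct Detection {
  std::size_t index;  // index of the highest-scoring class
  bool accepted;      // true only for a keyword at or above the threshold
};

// Returns the label for a class index, or "unknown" if out of range.
inline const char* LabelFor(std::size_t index) {
  return index < kLabelCount ? kLabels[index] : "unknown";
}

// Finds the highest-scoring class and accepts it only if it is a keyword
// (not silence/unknown) and its score reaches the threshold.
inline Detection PickKeyword(const std::int8_t* scores, std::size_t count,
                             std::int8_t threshold) {
  assert(scores != nullptr && count > 0);
  std::size_t best = 0;
  for (std::size_t i = 1; i < count; ++i) {
    if (scores[i] > scores[best]) best = i;
  }
  const bool is_keyword = best >= kFirstKeywordIndex;
  return Detection{best, is_keyword && scores[best] >= threshold};
}
```

Adding keywords grows only the score vector and label table here; the cost of a larger vocabulary sits in the model itself, which is where the accuracy tradeoffs mentioned above arise.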
Orchestrating Voice AI Agents with ForestHub
Keyword detection runs on-device; ForestHub on the Linux edge gateway routes events, adds LLM reasoning as a node, and takes action, with every step replayable and auditable.