Hardware Guide
STM32H7 for Voice Recognition with TensorFlow Lite Micro
The STM32H7 runs keyword spotting and voice command recognition with TFLite Micro using CMSIS-NN accelerated inference. The 1 MB SRAM and 480 MHz Cortex-M7 support larger vocabulary models (30+ keywords) and faster-than-realtime feature extraction.
Hardware Specs
| Spec | STM32H7 |
|---|---|
| Processor | ARM Cortex-M7 @ 480 MHz |
| SRAM | 1024 KB |
| Flash | 2 MB |
| Key Features | Double-precision FPU, L1 cache (16 KB I + 16 KB D), JPEG codec, Chrom-ART Accelerator (DMA2D) |
| Connectivity | Ethernet, USB OTG HS/FS |
| Price Range | $8 - $20 (chip), $30 - $80 (dev board) |
Compatibility:
The STM32H7's 1024 KB of SRAM is 8x the 128 KB minimum for voice recognition, so it can hold significantly larger models than most MCUs. A DS-CNN model with 30+ keyword classes (~150-200 KB) fits comfortably, with headroom for audio buffers and the application stack. The Cortex-M7 at 480 MHz completes MFCC feature extraction in a fraction of the real-time deadline, so the CPU spends most of its time idle, waiting for the next audio window. CMSIS-NN kernels accelerate the convolutional layers by 2-4x through SIMD instructions. The STM32H7's SAI (Serial Audio Interface) peripheral handles I2S microphone input with DMA, so the CPU is not involved per sample and only services the buffer half-complete and complete interrupts. The main limitation is connectivity: there is no built-in wireless, so you need an external module for Wi-Fi or BLE if commands must reach a network. For embedded voice control with local actions only, this is not a concern.
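As a rough check on the headroom claim, the sketch below adds up a hypothetical memory budget. The 200 KB model size is the upper figure from the text; the 128 KB tensor arena and the one-second int16 capture buffer are illustrative assumptions, not measurements. The model is counted against SRAM here, which is pessimistic if the weights stay in flash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KB 1024u

/* Hypothetical budget structure; all sizes are in bytes. */
typedef struct {
    uint32_t model_bytes;    /* int8 DS-CNN weights (text: 150-200 KB) */
    uint32_t arena_bytes;    /* inference scratch memory (assumed value) */
    uint32_t audio_bytes;    /* raw PCM capture buffer */
    uint32_t feature_bytes;  /* MFCC feature matrix */
} voice_budget_t;

/* Bytes needed to hold `ms` milliseconds of 16-bit mono PCM. */
static uint32_t pcm_bytes(uint32_t sample_rate_hz, uint32_t ms)
{
    return (uint32_t)(((uint64_t)sample_rate_hz * ms / 1000u) * sizeof(int16_t));
}

/* Sum of all budget entries. */
static uint32_t budget_total(const voice_budget_t *b)
{
    assert(b != NULL);
    return b->model_bytes + b->arena_bytes + b->audio_bytes + b->feature_bytes;
}

/* Example: 200 KB model, 128 KB arena (assumed), 1 s of 16 kHz audio,
 * and 98 frames x 40 int8 MFCC coefficients. */
static voice_budget_t example_budget(void)
{
    voice_budget_t b;
    b.model_bytes   = 200u * KB;
    b.arena_bytes   = 128u * KB;
    b.audio_bytes   = pcm_bytes(16000u, 1000u);
    b.feature_bytes = 98u * 40u;
    return b;
}
```

Under these assumptions the total stays well below 1024 KB. STM32H7 SRAM is split across several regions (TCM, AXI SRAM, and others), so each large buffer also has to fit inside a single region.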
Getting Started
1. Configure STM32H7 audio input via SAI
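Use STM32CubeMX to configure the SAI peripheral for audio receive. The IMP34DT05 and MP45DT02 MEMS microphones are PDM-output parts, so connect them through the SAI's PDM interface (or the DFSDM filter on parts that include it) and decimate the bitstream to PCM. An I2S-output microphone can use the SAI in I2S receive mode directly. Set up a DMA circular buffer for continuous 16 kHz audio capture.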
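The circular-buffer capture can be handled with a ping-pong pattern: DMA fills one half of the buffer while the application reads the other. The sketch below is hardware-independent and runs on a host. The function names `audio_on_half_complete`, `audio_on_full_complete`, and `audio_take` are hypothetical, and the 10 ms half-buffer length is an assumption. In an STM32 HAL project the two notification functions would typically be called from `HAL_SAI_RxHalfCpltCallback` and `HAL_SAI_RxCpltCallback`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAMPLE_RATE_HZ   16000u
#define HALF_MS          10u                                 /* assumed chunk length */
#define HALF_SAMPLES     (SAMPLE_RATE_HZ * HALF_MS / 1000u)  /* 160 samples */
#define DMA_BUF_SAMPLES  (2u * HALF_SAMPLES)

/* Buffer the DMA controller would fill in circular mode. */
static int16_t dma_buf[DMA_BUF_SAMPLES];

/* Written by the interrupt side, cleared by the main loop. */
static const int16_t *volatile ready_half = NULL;
static volatile uint32_t overruns = 0;

/* Half-transfer notification: the first half holds fresh samples. */
void audio_on_half_complete(void)
{
    if (ready_half != NULL) overruns++;   /* main loop fell behind */
    ready_half = &dma_buf[0];
}

/* Transfer-complete notification: the second half holds fresh samples. */
void audio_on_full_complete(void)
{
    if (ready_half != NULL) overruns++;
    ready_half = &dma_buf[HALF_SAMPLES];
}

/* Main-loop side: copy out the finished half.
 * Returns the number of samples copied, or 0 if nothing is ready. */
size_t audio_take(int16_t *dst, size_t dst_len)
{
    const int16_t *src = ready_half;
    assert(dst != NULL);
    if (src == NULL || dst_len < HALF_SAMPLES) return 0;
    memcpy(dst, src, HALF_SAMPLES * sizeof(int16_t));
    ready_half = NULL;
    return HALF_SAMPLES;
}
```

The overrun counter gives a cheap way to detect that processing is not keeping up. On a Cortex-M7 with the data cache enabled, the DMA buffer usually also needs cache maintenance or placement in a non-cacheable region, which this sketch does not show.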
2. Add TFLite Micro with CMSIS-NN backend
Integrate TFLite Micro into your STM32CubeIDE project. Enable CMSIS-NN and CMSIS-DSP libraries for optimized convolution and DSP operations. The micro_speech example is the reference starting point.
3. Implement audio preprocessing pipeline
Process raw PCM audio into 40-channel MFCC features using the CMSIS-DSP FFT functions. The STM32H7's FPU handles floating-point FFT efficiently. A 30 ms window with a 10 ms stride gives (1000 - 30) / 10 + 1 = 98 frames per 1-second input.
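The framing arithmetic can be checked with a few lines of C. This is a minimal sketch with hypothetical helper names; it assumes 16 kHz audio, no padding at the clip edges, and window and stride lengths that are whole numbers of samples.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a duration in milliseconds to a sample count. */
static uint32_t ms_to_samples(uint32_t sample_rate_hz, uint32_t ms)
{
    return sample_rate_hz * ms / 1000u;
}

/* Number of complete analysis frames that fit in a clip (no padding). */
static uint32_t frame_count(uint32_t clip_ms, uint32_t window_ms, uint32_t stride_ms)
{
    assert(stride_ms > 0u);
    if (clip_ms < window_ms) return 0u;
    return (clip_ms - window_ms) / stride_ms + 1u;
}

/* Index of the first sample of frame `i`. */
static uint32_t frame_start(uint32_t sample_rate_hz, uint32_t stride_ms, uint32_t i)
{
    return i * ms_to_samples(sample_rate_hz, stride_ms);
}
```

At 16 kHz the window is 480 samples and the stride is 160 samples, so the last of the 98 frames starts at sample 15520 and ends exactly at sample 16000. Because 480 is not a power of two, a power-of-two FFT would typically zero-pad each frame to 512 points.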
4. Deploy the keyword spotting model
Use Google's Speech Commands dataset to train a DS-CNN model with your target keywords. Quantize to int8 and embed in firmware. On the STM32H7, you can afford a larger model (150-200 KB) for higher accuracy or more keyword classes than typical MCU deployments.
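Model quantization itself happens offline in the converter, but an int8 model with int8 inputs still needs its features quantized on-device with the input tensor's scale and zero point. The sketch below illustrates the affine int8 mapping, real = scale * (q - zero_point). The type and function names are hypothetical, and the parameter values used with it would come from the converted model, not from this example.

```c
#include <assert.h>
#include <stdint.h>

/* Affine int8 quantization parameters: real = scale * (q - zero_point). */
typedef struct {
    float   scale;
    int32_t zero_point;
} quant_params_t;

/* Map a real value to int8, saturating at the ends of the int8 range. */
static int8_t quantize(float x, quant_params_t p)
{
    float scaled;
    int32_t q;
    assert(p.scale > 0.0f);
    scaled = x / p.scale;
    /* Pre-clamp so the float-to-int cast below cannot overflow. */
    if (scaled > 256.0f) scaled = 256.0f;
    if (scaled < -256.0f) scaled = -256.0f;
    /* Round half away from zero without needing libm. */
    q = (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f)) + p.zero_point;
    if (q < -128) q = -128;
    if (q > 127) q = 127;
    return (int8_t)q;
}

/* Map an int8 value back to its real approximation. */
static float dequantize(int8_t q, quant_params_t p)
{
    return p.scale * (float)((int32_t)q - p.zero_point);
}
```

Values outside the representable range saturate rather than wrap, which is the behavior you generally want for audio features with occasional outliers.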
Alternatives
ESP32-S3 with TFLite Micro
Built-in Wi-Fi and BLE for connected voice control. 512 KB SRAM handles standard keyword models (10-12 classes). Lower cost, but half the RAM and clock speed of the STM32H7.
Arduino Nano 33 BLE with Edge Impulse
Built-in microphone (Sense variant) and Edge Impulse's end-to-end pipeline simplify development. 256 KB SRAM limits vocabulary size. Best for prototyping with minimal hardware.
FAQ
- How many keywords can the STM32H7 recognize?
- The 1 MB SRAM supports DS-CNN models with 30-50 keyword classes plus silence and unknown categories. Model size grows with vocabulary size, and larger vocabularies are possible given the available RAM. Accuracy depends on model architecture and training data.
- What is the inference latency for voice recognition on STM32H7?
- CMSIS-NN accelerated inference on a 30-class DS-CNN model completes well within the 1-second audio window on the STM32H7 at 480 MHz, leaving significant CPU headroom for application tasks. The L1 cache (16 KB data) reduces latency further when the data a layer touches fits in cache.
- Can the STM32H7 do continuous voice recognition?
- Yes. Configure the SAI peripheral with DMA circular buffer for uninterrupted audio capture. Process overlapping 1-second windows with 500ms stride for continuous monitoring. The 480 MHz Cortex-M7 handles both feature extraction and inference without dropping audio frames.
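The overlapping-window scheme from the last answer can be sketched as a buffer that always holds the most recent second of audio and signals every 500 ms that a new window is ready. The names below are hypothetical, the 160-sample chunk size in the usage is an assumption, and the copy-based shift is chosen for clarity; a ring buffer would avoid moving the data on every chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SAMPLES 16000u   /* 1 s at 16 kHz */
#define HOP_SAMPLES     8000u   /* 500 ms stride */

typedef struct {
    int16_t samples[WINDOW_SAMPLES]; /* most recent second, oldest first */
    size_t  filled;                  /* valid samples, saturates at WINDOW_SAMPLES */
    size_t  pending;                 /* new samples since the last ready window */
} sliding_window_t;

void sw_init(sliding_window_t *w)
{
    assert(w != NULL);
    memset(w, 0, sizeof *w);
}

/* Append a chunk of n samples (n <= HOP_SAMPLES).
 * Returns 1 when a full window with at least one hop of new audio is ready. */
int sw_push(sliding_window_t *w, const int16_t *chunk, size_t n)
{
    assert(w != NULL && chunk != NULL && n <= HOP_SAMPLES);
    /* Drop the oldest n samples, then append the chunk at the end. */
    memmove(w->samples, w->samples + n, (WINDOW_SAMPLES - n) * sizeof(int16_t));
    memcpy(w->samples + (WINDOW_SAMPLES - n), chunk, n * sizeof(int16_t));
    w->filled = (w->filled + n > WINDOW_SAMPLES) ? WINDOW_SAMPLES : w->filled + n;
    w->pending += n;
    if (w->filled < WINDOW_SAMPLES) return 0;   /* still warming up */
    if (w->pending >= HOP_SAMPLES) {
        w->pending %= HOP_SAMPLES;
        return 1;
    }
    return 0;
}
```

With 10 ms chunks, the first window becomes ready after 100 chunks and every 50 chunks after that. Feature extraction plus inference for each window then has to finish within the 500 ms hop to avoid falling behind.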