Hardware Guide
The STM32H7 runs keyword spotting and voice command recognition with TFLite Micro using CMSIS-NN accelerated inference. The 1 MB SRAM and 480 MHz Cortex-M7 support larger vocabulary models (30+ keywords) and faster-than-realtime feature extraction.
| Spec | STM32H7 |
|---|---|
| Processor | ARM Cortex-M7 @ 480 MHz |
| SRAM | 1024 KB |
| Flash | 2 MB |
| Key Features | Double-precision FPU, L1 cache (16 KB I + 16 KB D), JPEG codec, Chrom-ART Accelerator (DMA2D) |
| Connectivity | Ethernet, USB OTG HS/FS |
| Price Range | $8 - $20 (chip), $30 - $80 (dev board) |
The STM32H7's 1024 KB SRAM provides 8x the 128 KB minimum for voice recognition, enabling significantly larger models than most MCUs can handle. A DS-CNN model with 30+ keyword classes (~150-200 KB) runs comfortably with headroom for audio buffers and the application stack. The Cortex-M7 at 480 MHz completes MFCC feature extraction in a fraction of the real-time deadline, so the CPU is idle most of the time waiting for the next audio window. CMSIS-NN kernels accelerate the convolutional layers by 2-4x through SIMD instructions. The STM32H7's SAI (Serial Audio Interface) peripheral handles I2S microphone input with DMA, so the CPU is not involved per sample and only has to act when a buffer (or buffer half) has filled.
The limitation is connectivity: there is no built-in wireless, so you need external modules for Wi-Fi/BLE if commands need to reach a network. For embedded voice control (local actions only), this is not a concern.
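To make the "headroom" claim concrete, here is a rough SRAM budget sketch in portable C. The 1024 KB total, the 200 KB model size, the 16 kHz sample rate, and the 98 x 40 feature shape come from this guide. The tensor arena and stack/heap figures are illustrative assumptions, not measured values, and the model only counts against SRAM if its weights are copied out of flash.

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_TOTAL_KB     1024u  /* from the spec table */
#define MODEL_KB           200u  /* upper end of the 150-200 KB range; only applies if weights live in RAM */
#define TENSOR_ARENA_KB    128u  /* ASSUMPTION: illustrative arena size, measure for your model */
#define APP_STACK_HEAP_KB   64u  /* ASSUMPTION: illustrative application stack + heap */

#define SAMPLE_RATE_HZ   16000u
#define AUDIO_WINDOW_MS   1000u  /* one second of audio per inference */
#define MFCC_FRAMES         98u
#define MFCC_CHANNELS       40u

/* Convert a byte count to KB, rounding up. */
static uint32_t bytes_to_kb_ceil(uint32_t bytes)
{
    return (bytes + 1023u) / 1024u;
}

/* One second of 16-bit PCM at 16 kHz. */
uint32_t audio_buffer_kb(void)
{
    uint32_t samples = SAMPLE_RATE_HZ * AUDIO_WINDOW_MS / 1000u;
    return bytes_to_kb_ceil(samples * (uint32_t)sizeof(int16_t));
}

/* int8 feature map handed to the model. */
uint32_t feature_buffer_kb(void)
{
    return bytes_to_kb_ceil(MFCC_FRAMES * MFCC_CHANNELS * (uint32_t)sizeof(int8_t));
}

/* SRAM left over after the items above. */
uint32_t sram_headroom_kb(void)
{
    uint32_t used = MODEL_KB + TENSOR_ARENA_KB + APP_STACK_HEAP_KB
                  + audio_buffer_kb() + feature_buffer_kb();
    assert(used <= SRAM_TOTAL_KB);
    return SRAM_TOTAL_KB - used;
}
```

Under these assumptions, well under half of the SRAM is used. Note that on real STM32H7 parts the SRAM is divided into several regions (TCM, AXI SRAM, and others), so a single buffer cannot necessarily span the full total; check the linker script for your specific device.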
Configure STM32H7 audio input via SAI
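Use STM32CubeMX to configure the SAI peripheral for microphone input. The IMP34DT05 and MP45DT02 MEMS microphones output PDM rather than I2S, so they need the SAI's PDM interface (or the DFSDM peripheral, where available) plus PDM-to-PCM decimation; use I2S receive mode only with an I2S-output microphone. Set up a DMA circular buffer for continuous 16 kHz audio capture.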
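The DMA circular buffer is usually consumed in two halves ("ping-pong"): while DMA fills one half, the application processes the other. The sketch below shows only that bookkeeping in portable C. The function names and `BLOCK_SAMPLES` are hypothetical; in an ST HAL project, the two `on_dma_*` functions would typically be called from `HAL_SAI_RxHalfCpltCallback` and `HAL_SAI_RxCpltCallback` after starting reception with `HAL_SAI_Receive_DMA`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SAMPLES 160u  /* ASSUMPTION: 10 ms of audio at 16 kHz per half */

/* DMA target: two halves of BLOCK_SAMPLES each, filled in a circle. */
static int16_t dma_buffer[2u * BLOCK_SAMPLES];

/* Set from interrupt context, cleared by the main loop. */
static volatile bool first_half_ready = false;
static volatile bool second_half_ready = false;

/* Call when DMA has filled the first half (half-transfer event). */
void on_dma_half_complete(void)
{
    first_half_ready = true;
}

/* Call when DMA has filled the second half (transfer-complete event). */
void on_dma_complete(void)
{
    second_half_ready = true;
}

/* Main-loop side: returns the half that is ready to process, or NULL.
 * The caller must finish with the block before DMA wraps around to it. */
const int16_t *audio_take_block(void)
{
    if (first_half_ready) {
        first_half_ready = false;
        return &dma_buffer[0];
    }
    if (second_half_ready) {
        second_half_ready = false;
        return &dma_buffer[BLOCK_SAMPLES];
    }
    return NULL;
}
```

This sketch does not detect overruns (a half being refilled before it was consumed), and on the Cortex-M7 a DMA buffer placed in cacheable memory also needs cache maintenance or a non-cacheable region. Both are left out here for brevity.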
Add TFLite Micro with CMSIS-NN backend
Integrate TFLite Micro into your STM32CubeIDE project. Enable CMSIS-NN and CMSIS-DSP libraries for optimized convolution and DSP operations. The micro_speech example is the reference starting point.
Implement audio preprocessing pipeline
Process raw PCM audio into 40-channel MFCC features using the CMSIS-DSP FFT functions. The STM32H7's FPU handles floating-point FFT efficiently. A 30 ms window with a 10 ms stride gives (1000 - 30) / 10 + 1 = 98 frames per 1-second input.
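The framing arithmetic above can be checked with a small helper. This is a minimal sketch assuming the 16 kHz sample rate, 30 ms window, and 10 ms stride from this guide; the function names are hypothetical and not part of CMSIS-DSP or TFLite Micro.

```c
#include <stdint.h>

#define SAMPLE_RATE_HZ 16000u
#define WINDOW_MS         30u
#define STRIDE_MS         10u

/* Convert a duration in milliseconds to a sample count. */
static uint32_t ms_to_samples(uint32_t ms)
{
    return SAMPLE_RATE_HZ * ms / 1000u;
}

/* Number of complete analysis windows that fit in total_samples. */
uint32_t frame_count(uint32_t total_samples)
{
    uint32_t window = ms_to_samples(WINDOW_MS);  /* 480 samples */
    uint32_t stride = ms_to_samples(STRIDE_MS);  /* 160 samples */

    if (total_samples < window) {
        return 0u;
    }
    return (total_samples - window) / stride + 1u;
}

/* Index of the first sample of a given frame. */
uint32_t frame_start(uint32_t frame_index)
{
    return frame_index * ms_to_samples(STRIDE_MS);
}
```

For one second of audio (16000 samples), `frame_count` returns 98, and the last frame starts at sample 15520 and ends exactly at sample 16000. With 40 coefficients per frame, the model input is 98 x 40 = 3920 values.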
Deploy the keyword spotting model
Use Google's Speech Commands dataset to train a DS-CNN model with your target keywords. Quantize to int8 and embed in firmware. On the STM32H7, you can afford a larger model (150-200 KB) for higher accuracy or more keyword classes than typical MCU deployments.
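An int8-quantized model expects int8 inputs, so float MFCC features have to be mapped with the affine scheme TFLite uses, `real = scale * (q - zero_point)`. The sketch below shows that mapping in portable C. The function names are hypothetical; in a real project the `scale` and `zero_point` values would be read from the model's input tensor rather than hard-coded.

```c
#include <assert.h>
#include <stdint.h>

/* Map a real value to int8 using q = round(real / scale) + zero_point,
 * saturating to the int8 range. */
int8_t quantize_int8(float real_value, float scale, int32_t zero_point)
{
    assert(scale > 0.0f);

    float scaled = real_value / scale;

    /* Clamp in float first so the integer cast below cannot overflow. */
    if (scaled > 1000.0f) {
        scaled = 1000.0f;
    }
    if (scaled < -1000.0f) {
        scaled = -1000.0f;
    }

    /* Round half away from zero without depending on libm. */
    int32_t rounded = (scaled >= 0.0f) ? (int32_t)(scaled + 0.5f)
                                       : (int32_t)(scaled - 0.5f);

    int32_t q = rounded + zero_point;
    if (q > 127) {
        q = 127;
    }
    if (q < -128) {
        q = -128;
    }
    return (int8_t)q;
}

/* Inverse mapping, useful for turning int8 output scores back into floats. */
float dequantize_int8(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)((int32_t)q - zero_point);
}
```

For example, with `scale = 0.5` and `zero_point = 0`, a feature value of 1.0 quantizes to 2, and values far outside the representable range saturate at 127 or -128 instead of wrapping.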
Built-in Wi-Fi and BLE for connected voice control. 512 KB SRAM handles standard keyword models (10-12 classes). Lower cost, but half the RAM and clock speed of the STM32H7.
Built-in microphone (Sense variant) and Edge Impulse's end-to-end pipeline simplify development. 256 KB SRAM limits vocabulary size. Best for prototyping with minimal hardware.