Hardware Guide

STM32H7 for Voice Recognition with TensorFlow Lite Micro

The STM32H7 runs keyword spotting and voice command recognition with TFLite Micro using CMSIS-NN accelerated inference. The 1 MB SRAM and 480 MHz Cortex-M7 support larger vocabulary models (30+ keywords) and faster-than-realtime feature extraction.

Hardware Specs

Spec          STM32H7
Processor     ARM Cortex-M7 @ 480 MHz
SRAM          1024 KB
Flash         2 MB
Key Features  Double-precision FPU, L1 cache (16 KB I + 16 KB D), JPEG codec, Chrom-ART Accelerator (DMA2D)
Connectivity  Ethernet, USB OTG HS/FS
Price Range   $8 - $20 (chip), $30 - $80 (dev board)

Compatibility: Good

The STM32H7's 1024 KB SRAM provides 8x the 128 KB minimum for voice recognition, enabling significantly larger models than most MCUs can handle. A DS-CNN model with 30+ keyword classes (~150-200 KB) runs comfortably with headroom for audio buffers and the application stack. The Cortex-M7 at 480 MHz processes MFCC feature extraction in a fraction of the real-time deadline — the CPU is idle most of the time waiting for the next audio window. CMSIS-NN kernels accelerate the convolutional layers by 2-4x through SIMD instructions. The STM32H7's SAI (Serial Audio Interface) peripheral handles I2S microphone input with DMA, requiring zero CPU intervention for audio capture. The limitation is connectivity: no built-in wireless — you need external modules for Wi-Fi/BLE if commands need to reach a network. For embedded voice control (local actions only), this is not a concern.
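As a sanity check on the headroom claim above, here is a rough SRAM budget in C. All sizes are illustrative assumptions (200 KB model at the upper end, a 128 KB tensor arena, 1 second of 16-bit 16 kHz audio, 64 KB for application and stack), not measured values:

```c
#include <stddef.h>

/* Rough SRAM budget for the voice pipeline. Every size here is an
 * illustrative assumption; measure your own build to refine them. */
enum {
    SRAM_BYTES       = 1024 * 1024,  /* STM32H7 total SRAM          */
    MODEL_BYTES      = 200 * 1024,   /* int8 DS-CNN, upper end      */
    ARENA_BYTES      = 128 * 1024,   /* TFLite Micro tensor arena   */
    AUDIO_RING_BYTES = 16000 * 2,    /* 1 s of 16-bit @ 16 kHz      */
    FEATURE_BYTES    = 98 * 40,      /* 98 frames x 40 MFCC values  */
    APP_STACK_BYTES  = 64 * 1024,    /* application + RTOS          */
};

/* Total SRAM consumed by the pipeline under these assumptions. */
size_t sram_used(void) {
    return MODEL_BYTES + ARENA_BYTES + AUDIO_RING_BYTES
         + FEATURE_BYTES + APP_STACK_BYTES;
}
```

Even with these generous allocations the total stays under half of the 1 MB SRAM, which is the headroom the paragraph above refers to.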

Getting Started

  1. Configure STM32H7 audio input via SAI

     Use STM32CubeMX to configure the SAI peripheral for audio receive. The STM32H7's SAI includes a PDM interface, so a PDM MEMS microphone such as the IMP34DT05 or MP45DT02 can connect directly; an I2S microphone uses I2S receive mode instead. Set up a DMA circular buffer for continuous 16 kHz audio capture.
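The circular-buffer capture in step 1 is usually handled as a "ping-pong" scheme: while DMA fills one half of the buffer, the application consumes the other. A minimal sketch, with the HAL callback names noted in comments and buffer sizes chosen for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Double-buffered ("ping-pong") capture with a DMA circular buffer.
 * On hardware, these copies run inside the SAI DMA callbacks
 * (HAL_SAI_RxHalfCpltCallback / HAL_SAI_RxCpltCallback). */
#define DMA_BUF_SAMPLES 512              /* full circular buffer, illustrative */
#define HALF (DMA_BUF_SAMPLES / 2)

static int16_t dma_buf[DMA_BUF_SAMPLES]; /* written by DMA hardware */
static int16_t proc_buf[HALF];           /* consumed by the app     */

/* DMA has filled the first half: it is now stable, copy it out. */
void on_half_complete(void) {
    memcpy(proc_buf, dma_buf, HALF * sizeof(int16_t));
}

/* DMA has filled the second half and wrapped: copy the second half. */
void on_full_complete(void) {
    memcpy(proc_buf, dma_buf + HALF, HALF * sizeof(int16_t));
}
```

Because each half is copied while DMA writes the other, audio capture needs zero CPU intervention between callbacks, matching the claim in the compatibility section.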

  2. Add TFLite Micro with the CMSIS-NN backend

     Integrate TFLite Micro into your STM32CubeIDE project. Enable the CMSIS-NN and CMSIS-DSP libraries for optimized convolution and DSP operations. The micro_speech example is the reference starting point.
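TFLite Micro allocates all tensors from a caller-supplied arena, so part of this step is declaring one. A sketch of the declaration, with 128 KB as an assumed starting size for a 30-class DS-CNN (tune it with the interpreter's `arena_used_bytes()` once the model runs):

```c
#include <stdint.h>
#include <stddef.h>

/* Tensor arena for TFLite Micro. 128 KB is an assumed starting size;
 * shrink it once arena_used_bytes() reports the real requirement. */
#define TENSOR_ARENA_SIZE (128 * 1024)

/* 16-byte alignment keeps CMSIS-NN's SIMD loads efficient. */
static uint8_t tensor_arena[TENSOR_ARENA_SIZE] __attribute__((aligned(16)));

size_t arena_size(void) { return TENSOR_ARENA_SIZE; }
int arena_is_aligned(void) { return ((uintptr_t)tensor_arena % 16) == 0; }
```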

  3. Implement the audio preprocessing pipeline

     Process raw PCM audio into 40-channel MFCC features using the CMSIS-DSP FFT functions. The STM32H7's FPU handles floating-point FFTs efficiently. A 30 ms window with a 10 ms stride gives 98 frames per 1-second input.
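The 98-frame figure follows directly from the windowing parameters. A small helper makes the arithmetic explicit:

```c
/* Frame count for the windowing scheme above:
 * 30 ms windows, 10 ms stride, 16 kHz audio. */
#define SAMPLE_RATE 16000
#define WIN_MS 30
#define HOP_MS 10

int num_frames(int total_samples) {
    int win = SAMPLE_RATE / 1000 * WIN_MS;  /* 480 samples per window */
    int hop = SAMPLE_RATE / 1000 * HOP_MS;  /* 160 samples per hop    */
    if (total_samples < win) return 0;
    return 1 + (total_samples - win) / hop;
}
```

For a 1-second input (16000 samples): 1 + (16000 - 480) / 160 = 98 frames, each reduced to 40 MFCC channels.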

  4. Deploy the keyword spotting model

     Train a DS-CNN model on Google's Speech Commands dataset with your target keywords. Quantize it to int8 and embed it in the firmware. On the STM32H7 you can afford a larger model (150-200 KB) for higher accuracy or more keyword classes than typical MCU deployments allow.
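The int8 quantization in step 4 is the standard TFLite affine scheme: q = round(x / scale) + zero_point, clamped to [-128, 127]. A sketch (the scale and zero-point values in the test are illustrative, not taken from a real model):

```c
#include <stdint.h>

/* Affine int8 quantization as used by TFLite int8 models:
 * q = round(x / scale) + zero_point, clamped to [-128, 127]. */
int8_t quantize(float x, float scale, int zero_point) {
    float r = x / scale;
    /* round-half-away-from-zero without pulling in libm */
    long q = (long)(r >= 0.0f ? r + 0.5f : r - 0.5f) + zero_point;
    if (q < -128) q = -128;
    if (q > 127) q = 127;
    return (int8_t)q;
}

/* Inverse mapping back to float. */
float dequantize(int8_t q, float scale, int zero_point) {
    return ((int)q - zero_point) * scale;
}
```

Shrinking every weight and activation to one byte is what lets a 150-200 KB DS-CNN fit comfortably in the STM32H7's flash and SRAM.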

FAQ

How many keywords can the STM32H7 recognize?
The 1 MB SRAM supports DS-CNN models with 30-50 keyword classes plus silence and unknown categories. Model size increases roughly linearly with vocabulary size. Larger vocabularies are possible within the available RAM; accuracy then depends on model architecture and training data.
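One reason vocabulary scales cheaply: in a DS-CNN, only the final classification layer grows with the number of classes, at roughly (feature_dim + 1) int8 parameters per class. A sketch with an assumed 64-dimensional feature vector (illustrative, not from a specific model):

```c
/* Size in bytes of the final dense (classification) layer of a DS-CNN:
 * feature_dim weights plus one bias per class, one byte each at int8.
 * feature_dim = 64 in the test below is an assumed value. */
int head_bytes(int feature_dim, int num_classes) {
    return (feature_dim + 1) * num_classes;
}
```

Under these assumptions, growing from 12 to 32 classes adds only about 1.3 KB to the model, which is why the SRAM, not the vocabulary, is rarely the binding constraint.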
What is the inference latency for voice recognition on STM32H7?
CMSIS-NN accelerated inference on a 30-class DS-CNN model completes well within the 1-second audio window on the STM32H7 at 480 MHz, leaving significant CPU headroom for application tasks. The L1 cache reduces latency further for models that fit in cache.
Can the STM32H7 do continuous voice recognition?
Yes. Configure the SAI peripheral with DMA circular buffer for uninterrupted audio capture. Process overlapping 1-second windows with 500ms stride for continuous monitoring. The 480 MHz Cortex-M7 handles both feature extraction and inference without dropping audio frames.
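The overlapping-window schedule in this answer reduces to simple index arithmetic: each 1-second window starts 500 ms (8000 samples) after the previous one. A sketch of that bookkeeping:

```c
/* Continuous monitoring: overlapping 1 s windows with a 500 ms
 * stride over a growing 16 kHz sample stream. */
#define SR 16000
#define WIN_SAMPLES SR         /* 1 s window  */
#define HOP_SAMPLES (SR / 2)   /* 500 ms hop  */

/* Start sample of the i-th analysis window. */
int window_start(int i) { return i * HOP_SAMPLES; }

/* Number of complete windows available in a stream of n samples. */
int windows_in_stream(int n) {
    if (n < WIN_SAMPLES) return 0;
    return 1 + (n - WIN_SAMPLES) / HOP_SAMPLES;
}
```

With a 500 ms stride, every sample is classified twice (by two adjacent windows), which keeps short keywords from falling on a window boundary.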

Build Voice Control Pipelines in ForestHub

Map voice commands to device actions on the STM32H7 — design the full audio processing chain visually.
