
Hardware Guide

STM32H7 for Voice Recognition with TensorFlow Lite Micro

The STM32H7 runs keyword spotting and voice command recognition with TFLite Micro using CMSIS-NN accelerated inference. The 1 MB SRAM and 480 MHz Cortex-M7 support larger vocabulary models (30+ keywords) and faster-than-realtime feature extraction.

Hardware Specs

Spec: STM32H7
Processor: ARM Cortex-M7 @ 480 MHz
SRAM: 1024 KB
Flash: 2 MB
Key Features: Double-precision FPU, L1 cache (16 KB I + 16 KB D), JPEG codec, Chrom-ART Accelerator (DMA2D)
Connectivity: Ethernet, USB OTG HS/FS
Price Range: $8 - $20 (chip), $30 - $80 (dev board)

Compatibility: Good

The STM32H7's 1024 KB SRAM is 8x the 128 KB minimum for voice recognition, so it can hold significantly larger models than most MCUs. A DS-CNN model with 30+ keyword classes (~150-200 KB) runs comfortably with headroom for audio buffers and the application stack. The Cortex-M7 at 480 MHz completes MFCC feature extraction in a fraction of the real-time deadline, so the CPU spends most of its time idle, waiting for the next audio window. CMSIS-NN kernels accelerate the convolutional layers by 2-4x through SIMD instructions. The STM32H7's SAI (Serial Audio Interface) peripheral handles I2S microphone input with DMA, so audio capture itself needs no CPU intervention. The main limitation is connectivity: there is no built-in wireless, so you need an external module for Wi-Fi/BLE if commands must reach a network. For embedded voice control (local actions only), this is not a concern.

Getting Started

  1. Configure STM32H7 audio input via SAI

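    Use STM32CubeMX to configure the SAI peripheral for audio receive, then connect a MEMS microphone such as the IMP34DT05 or MP45DT02. Both of these parts output PDM rather than I2S, so their bitstream must be converted to PCM first (for example through the SAI's PDM interface or a decimation filter). A microphone with a native I2S output can feed SAI I2S receive mode directly. Set up a DMA circular buffer for continuous 16 kHz audio capture.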

  2. Add TFLite Micro with CMSIS-NN backend

    Integrate TFLite Micro into your STM32CubeIDE project. Enable CMSIS-NN and CMSIS-DSP libraries for optimized convolution and DSP operations. The micro_speech example is the reference starting point.

  3. Implement audio preprocessing pipeline

    Process raw PCM audio into 40-channel MFCC features using the CMSIS-DSP FFT functions. The STM32H7's FPU handles floating-point FFT efficiently. Window size: 30 ms with a 10 ms stride, which gives 98 frames per 1-second input.

  4. Deploy the keyword spotting model

    Use Google's Speech Commands dataset to train a DS-CNN model with your target keywords. Quantize to int8 and embed in firmware. On the STM32H7, you can afford a larger model (150-200 KB) for higher accuracy or more keyword classes than typical MCU deployments.
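
For step 1, a DMA circular buffer is commonly consumed as two halves (ping-pong): the firmware processes one half while DMA fills the other. The following is a minimal, hardware-independent sketch of that bookkeeping in C. The names `on_dma_half_complete`, `on_dma_full_complete`, `audio_take_block`, and the 10 ms block size are illustrative assumptions, not part of any ST library. On a real target the two handlers would typically be called from the HAL's SAI receive half-complete and complete callbacks.

```c
#include <stddef.h>
#include <stdint.h>

#define HALF_BUF_SAMPLES 160u  /* 10 ms at 16 kHz (assumed block size) */

/* Circular DMA target, split into two halves. On hardware, DMA writes here. */
static int16_t dma_buf[2 * HALF_BUF_SAMPLES];

/* Which half holds fresh samples: 0, 1, or -1 when nothing is pending. */
static volatile int ready_half = -1;
static volatile uint32_t overruns = 0;

static void mark_ready(int half) {
    if (ready_half != -1) {
        overruns++;  /* previous half was never consumed: main loop too slow */
    }
    ready_half = half;
}

/* Hypothetical hooks: call these from the DMA half/full transfer callbacks. */
void on_dma_half_complete(void) { mark_ready(0); }
void on_dma_full_complete(void) { mark_ready(1); }

/* Called from the main loop. Returns the fresh half, or NULL if none.
 * On a real target, briefly mask the DMA interrupt around the read and
 * reset of ready_half so a callback cannot slip in between them. */
const int16_t *audio_take_block(void) {
    int half = ready_half;
    if (half < 0) {
        return NULL;
    }
    ready_half = -1;
    return &dma_buf[(size_t)half * HALF_BUF_SAMPLES];
}

uint32_t audio_overrun_count(void) { return overruns; }
```

The main loop would poll `audio_take_block()` and hand each 160-sample block to the feature extractor. A non-zero overrun count indicates that processing is not keeping up with capture.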
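
The frame count quoted in step 3 follows from frames = (samples - window) / stride + 1. The short C sketch below reproduces that arithmetic for 16 kHz audio. The helper names are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Convert a duration in milliseconds to a sample count. */
size_t ms_to_samples(size_t ms, size_t sample_rate_hz) {
    return ms * sample_rate_hz / 1000;
}

/* Number of complete analysis frames that fit in n_samples of audio. */
size_t mfcc_frame_count(size_t n_samples, size_t win, size_t stride) {
    if (win == 0 || stride == 0 || n_samples < win) {
        return 0;
    }
    return (n_samples - win) / stride + 1;
}

/* Total feature values for one input (frames x MFCC channels). */
size_t mfcc_feature_count(size_t frames, size_t channels) {
    return frames * channels;
}
```

With a 16 kHz sample rate, the 30 ms window is 480 samples and the 10 ms stride is 160 samples, so one second of audio yields (16000 - 480) / 160 + 1 = 98 frames. At 40 channels that is 3,920 feature values per inference.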
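
Int8 quantization in step 4 maps each real value to an 8-bit integer through a scale and a zero point: q = round(x / scale) + zero_point, clamped to [-128, 127]. The TensorFlow Lite converter applies this to the model itself, so the sketch below only illustrates the arithmetic, for example when writing float MFCC features into an int8 input tensor. The function names are hypothetical, and `scale` and `zero_point` are assumed to come from the model's own tensor parameters.

```c
#include <stdint.h>

/* Affine quantization: q = round(x / scale) + zero_point, saturated to int8. */
int8_t quantize_int8(float x, float scale, int32_t zero_point) {
    float v = x / scale;
    /* Bound v before the cast so the float-to-int conversion stays in range.
     * +/-512 still saturates for any zero_point in [-128, 127]. */
    if (v > 512.0f)  v = 512.0f;
    if (v < -512.0f) v = -512.0f;
    /* Round half away from zero without needing libm. */
    int32_t q = (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f) + zero_point;
    if (q < -128) q = -128;
    if (q > 127)  q = 127;
    return (int8_t)q;
}

/* Inverse mapping: x is approximately scale * (q - zero_point). */
float dequantize_int8(int8_t q, float scale, int32_t zero_point) {
    return scale * (float)((int32_t)q - zero_point);
}
```

Values outside the representable range saturate rather than wrap, which is why the choice of scale (set during conversion from representative data) matters for accuracy.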

Alternatives

ESP32-S3 with TFLite Micro

Built-in Wi-Fi and BLE for connected voice control. 512 KB SRAM handles standard keyword models (10-12 classes). Lower cost, but half the RAM and clock speed of the STM32H7.

Arduino Nano 33 BLE with Edge Impulse

Built-in microphone (Sense variant) and Edge Impulse's end-to-end pipeline simplify development. 256 KB SRAM limits vocabulary size. Best for prototyping with minimal hardware.


FAQ

How many keywords can the STM32H7 recognize?
The 1 MB SRAM supports DS-CNN models with 30-50 keyword classes plus silence and unknown categories. Model size increases with vocabulary size. Larger vocabularies are possible given the available RAM, but accuracy depends on model architecture and training data.
What is the inference latency for voice recognition on STM32H7?
CMSIS-NN accelerated inference on a 30-class DS-CNN model completes well within the 1-second audio window on the STM32H7 at 480 MHz, leaving significant CPU headroom for application tasks. The L1 cache reduces latency further for models that fit in cache.
Can the STM32H7 do continuous voice recognition?
Yes. Configure the SAI peripheral with DMA circular buffer for uninterrupted audio capture. Process overlapping 1-second windows with 500ms stride for continuous monitoring. The 480 MHz Cortex-M7 handles both feature extraction and inference without dropping audio frames.
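
The overlapping-window scheme described in the last answer can be sketched as a ring buffer that holds the most recent second of audio and emits a full window every 500 ms. This is a minimal illustration in C, not a reference implementation: the `kws_window_t` type and function names are hypothetical, and the sketch assumes each push delivers at most one hop (8,000 samples) of audio.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SAMPLES 16000u  /* 1 s at 16 kHz */
#define HOP_SAMPLES     8000u  /* 500 ms stride */

typedef struct {
    int16_t ring[WINDOW_SAMPLES];
    size_t  head;        /* next write position, also the oldest sample */
    size_t  filled;      /* samples stored, saturates at WINDOW_SAMPLES */
    size_t  since_last;  /* samples received since the last emitted window */
} kws_window_t;

void kws_window_init(kws_window_t *w) {
    memset(w, 0, sizeof *w);
}

/* Push n new samples (n <= HOP_SAMPLES assumed). Returns 1 and copies the
 * latest 1-second window into out, oldest sample first, when one is due. */
int kws_window_push(kws_window_t *w, const int16_t *in, size_t n, int16_t *out) {
    int emitted = 0;
    for (size_t i = 0; i < n; i++) {
        w->ring[w->head] = in[i];
        w->head = (w->head + 1) % WINDOW_SAMPLES;
        if (w->filled < WINDOW_SAMPLES) {
            w->filled++;
        }
        w->since_last++;
        if (w->filled == WINDOW_SAMPLES && w->since_last >= HOP_SAMPLES) {
            /* Unwrap the ring: [head, end) is the older part, [0, head) newer. */
            size_t tail = WINDOW_SAMPLES - w->head;
            memcpy(out, &w->ring[w->head], tail * sizeof(int16_t));
            memcpy(out + tail, w->ring, w->head * sizeof(int16_t));
            w->since_last = 0;
            emitted = 1;
        }
    }
    return emitted;
}
```

Each time `kws_window_push` returns 1, `out` holds one second of audio ready for feature extraction and inference. To avoid dropping frames, both stages together need to finish within the 500 ms hop before the next window is due.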

Orchestrate Voice AI Agents with ForestHub

Keyword spotting runs on-device; ForestHub on the Linux edge gateway routes events, adds LLM reasoning as one node, and acts — replayable and auditable end to end.
