
How to Deploy AI Models to Microcontrollers

To deploy AI models to microcontrollers, train a model in TensorFlow or Edge Impulse, quantize it to int8, convert it to TFLite format, and flash the resulting C array alongside the TFLite Micro runtime to your target MCU.

Published 2026-04-01

What You Need Before Starting

Deploying AI to an MCU requires three things: a trained model, a conversion pipeline, and a firmware project for your target hardware.

On the training side, you need a model that is small enough to fit in your MCU’s memory. For most microcontrollers, that means models under 500 KB. You will train on a desktop machine or cloud service — not on the MCU itself.
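A quick back-of-envelope check helps before you train anything. The numbers below are illustrative assumptions, not measurements from any specific board: an int8 model stores roughly one byte per parameter, plus flatbuffer and metadata overhead:

```python
# Rough flash-footprint estimate for an int8-quantized model.
# The 20% overhead ratio is an illustrative assumption covering the
# flatbuffer schema, op metadata, and quantization parameters.
def estimate_flash_kb(param_count, overhead_ratio=0.2):
    """int8 stores ~1 byte per parameter."""
    return param_count * (1 + overhead_ratio) / 1024

# A ~300k-parameter classifier lands near 350 KB -- tight but
# workable on an MCU with 1 MB of flash.
print(round(estimate_flash_kb(300_000)))  # 352
```

If the estimate already exceeds your flash budget, shrink the architecture before investing in training.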

On the hardware side, you need:

  • A supported MCU (ESP32, STM32, Arduino Nano 33 BLE, or similar ARM Cortex-M / Xtensa chip)
  • The vendor’s toolchain installed (ESP-IDF for Espressif, STM32CubeIDE for ST, Arduino IDE for Arduino boards)
  • A USB cable and basic embedded C experience

Step 1: Train or Select a Model

Start with a pre-trained model or train your own. For first deployments, use one of these proven starting points:

  • Image classification: MobileNet V2 (quantized) — fits on ESP32-S3 with PSRAM
  • Keyword spotting: The TFLite Micro speech example — runs on virtually any supported MCU
  • Anomaly detection: A simple autoencoder or statistical model — under 20 KB on STM32F4

If you use Edge Impulse, the training and conversion happen in one pipeline. Upload your dataset, select a learning block, and Edge Impulse handles quantization and export.

If you use TensorFlow, train normally in Python, then convert to TFLite:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Step 2 adds the settings needed for full int8 quantization
tflite_model = converter.convert()

Step 2: Quantize to int8

Quantization converts 32-bit floating point weights to 8-bit integers. This is not optional for most MCUs — it reduces model size by 4x and speeds up inference significantly on chips without an FPU.
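To see what quantization actually does, here is a minimal NumPy sketch of the affine int8 scheme (q = round(x / scale + zero_point)). The weight values are synthetic, generated purely for illustration:

```python
import numpy as np

# Synthetic float32 "weights" -- illustrative, not from a real model
w = np.random.default_rng(0).normal(0.0, 0.4, 1000).astype(np.float32)

# Affine int8 quantization: map the observed float range onto [-128, 127]
scale = float(w.max() - w.min()) / 255.0
zero_point = int(round(-128 - w.min() / scale))
q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)

# Dequantize and measure the round-trip error
w_restored = (q.astype(np.float32) - zero_point) * scale
print(q.nbytes / w.nbytes)                    # 0.25 -> the 4x size reduction
print(np.abs(w - w_restored).max() <= scale)  # True: error stays within one step
```

The same scale/zero-point idea applies to activations, which is why the converter needs representative inputs to observe their ranges.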

Full integer quantization (int8 weights and activations) is the standard for MCU deployment. You need a representative dataset — a small sample of real inputs — to calibrate the activation ranges:

import numpy as np

def representative_dataset():
    # Yield a few hundred real input samples so the converter can
    # observe and calibrate activation ranges
    for sample in calibration_data:
        yield [sample.astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

Expect accuracy loss of 1-3 percentage points from quantization. If accuracy drops more than that, improve your representative dataset first; if the loss persists, your model may be too complex for int8 on the target hardware.

Step 3: Convert to C Array

Most MCU firmware has no filesystem to load a .tflite file from at runtime. Instead, you convert the binary model into a C array that gets compiled directly into the firmware:

xxd -i model.tflite > model_data.cc

This produces a C source file with the model bytes (add const to the array declaration so the model stays in flash instead of being copied into RAM):

unsigned char model_tflite[] = {
  0x20, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, ...
};
unsigned int model_tflite_len = 152384;
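If xxd is not available (for example, on Windows), a few lines of Python produce the same output. This is a sketch that mimics xxd -i; the array name matches the example above:

```python
# Convert a .tflite flatbuffer into a C array, mimicking `xxd -i`.
# `const` keeps the model in flash instead of copying it to RAM.
def tflite_to_c_array(data: bytes, name: str = "model_tflite") -> str:
    body = ",\n  ".join(
        ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        for i in range(0, len(data), 12)
    )
    return (
        f"const unsigned char {name}[] = {{\n  {body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )

# Usage with a stand-in payload; normally you would pass
# open("model.tflite", "rb").read()
print(tflite_to_c_array(b"TFL3"))
```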

Edge Impulse skips this step — it exports a complete C++ library with the model already embedded.

Step 4: Set Up the TFLite Micro Runtime

Your firmware needs the TFLite Micro interpreter. The setup follows the same pattern regardless of MCU:

  1. Include the runtime — add TFLite Micro source files to your build (as a static library or source inclusion)
  2. Register operators — only the ops your model actually uses, keeping binary size small
  3. Allocate a tensor arena — a static byte array that serves as the interpreter’s working memory
  4. Load the model — point the interpreter at your C array
  5. Run inference — copy input data into the input tensor, invoke, read the output tensor

The tensor arena size depends on your model. Start with 80-100 KB and shrink it until allocation fails, then set it to the smallest working size plus roughly 10% headroom. (TFLite Micro's interpreter.arena_used_bytes() reports the exact requirement after allocation.)

Step 5: Flash and Test

Build the firmware and flash it to your board:

MCU                   Build System    Flash Command
ESP32 / ESP32-S3      ESP-IDF         idf.py flash monitor
STM32H7 / STM32F4     STM32CubeIDE    Build + Run in IDE, or st-flash
Arduino Nano 33 BLE   Arduino CLI     arduino-cli upload -b arduino:mbed_nano:nano33ble

After flashing, verify inference works:

  • Check serial output for prediction results
  • Measure inference time — compare against your latency requirement
  • Monitor RAM usage — the tensor arena plus stack must fit in available SRAM

Common Pitfalls

Model too large for flash. A 500 KB model on a 1 MB flash chip may not leave enough room for the firmware itself. Budget 40-60% of flash for application code and runtime.

Tensor arena too small. The interpreter will return kTfLiteError without a clear message. Increase the arena in 10 KB steps until inference succeeds.

Operator not supported. TFLite Micro supports a subset of TFLite operators. If your model uses an op outside that subset (for example, one that requires the Flex delegate to fall back to full TensorFlow), you must restructure the model or implement the op manually.

Wrong input format. If your model expects int8 input but you feed float32 sensor data, results will be garbage. Match the input tensor type exactly.
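Matching the input format means applying the input tensor's quantization parameters to your sensor data. The scale and zero_point below are made-up example values; read the real ones from your model's input tensor metadata (in Python TFLite, input_details[0]["quantization"]):

```python
import numpy as np

# Example quantization parameters -- hypothetical values; in practice,
# read them from the input tensor's metadata.
scale, zero_point = 0.058, -5

def float_to_int8(x):
    """Quantize float32 sensor data to the model's int8 input format."""
    q = np.round(np.asarray(x, dtype=np.float32) / scale + zero_point)
    return np.clip(q, -128, 127).astype(np.int8)

readings = [0.0, 1.5, -2.0]
print(float_to_int8(readings))  # three int8 values ready for the input tensor
```

The same transformation, inverted, converts the int8 output tensor back to real-valued scores.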

Which MCU Should You Use?

The right choice depends on your use case:

Requirement             Recommended MCU   Why
Vision (camera input)   ESP32-S3          Camera interface, SIMD, PSRAM
Ultra-low power         STM32L4           < 100 nA shutdown mode
Maximum compute         STM32H7           480 MHz Cortex-M7, 1 MB SRAM
Budget / prototyping    ESP32-C3          $1-3 per chip, Wi-Fi included
Arduino ecosystem       Nano 33 BLE       Built-in sensors, simple IDE

Frequently Asked Questions

What size AI model can run on a microcontroller?
Most MCUs run models between 50 KB and 500 KB. An int8 quantized MobileNet V2 for image classification fits in roughly 250 KB. Simpler models like keyword spotting or anomaly detection can be under 20 KB.
Do I need to train the model on the microcontroller?
No. Training happens on a PC or cloud service. The microcontroller only runs inference — it executes the pre-trained model on live sensor data. On-device training on MCUs is experimental and not production-ready.
Can I use Python on a microcontroller for AI?
MicroPython exists but is too slow for real-time inference. Production deployments use C/C++ with TFLite Micro or Edge Impulse SDK. The model is compiled into a C array and linked directly into the firmware.
How long does inference take on a typical MCU?
It depends on the model and hardware. A keyword spotting model on ESP32 runs in roughly 20-50 ms. Object detection on ESP32-S3 with SIMD takes roughly 100-300 ms per frame. Anomaly detection on STM32L4 can run under 10 ms. These are estimated ranges — benchmark on target hardware for production, as performance varies with model architecture and optimization.

Skip the Boilerplate

ForestHub is designed to generate deployment-ready C code from a visual workflow. Pick your MCU, pick your model, deploy.
