Machine Learning Inference on STM32/Cortex M7
tl;dr: TensorFlow is currently the way to go to run a QAT-quantized model on Cortex-M using the optimized CMSIS-NN kernels.
This post will hopefully save someone’s weekend: Quantization Aware Training in TensorFlow/Keras, converting to TensorFlow Lite (now LiteRT) with tf.lite.TFLiteConverter, and running the result with TFLite-Micro is the only path that works as of January 2026.
The overall design of TFLite-Micro seems a bit unfortunate (admittedly, I don’t know the reasons behind it): it doesn’t do any AOT compilation and leaves a lot of work for runtime. Operations parsed from the flatbuffer are resolved dynamically when a model is instantiated, and an entire memory planner works out how to allocate buffers most efficiently. All of this adds up in code size and RAM use, and it also rules out kernel-fusion passes and similar optimizations.
But it exists and works, which is a big plus.
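For reference, the path that worked looks roughly like this. A minimal sketch, assuming you already have a trained Keras `model` and a `representative_dataset` generator (both placeholders here); `convert_full_int8` is my own wrapper name, not an API:

```python
import tensorflow as tf

def convert_full_int8(keras_model, representative_dataset):
    """Convert a Keras model to a fully int8 TFLite flatbuffer
    suitable for TFLite-Micro with CMSIS-NN kernels.

    For QAT, wrap the model first with
    tensorflow_model_optimization.quantization.keras.quantize_model
    and fine-tune it before calling this.
    """
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Calibration data for activation ranges: a generator yielding
    # lists of input arrays.
    converter.representative_dataset = representative_dataset
    # Reject anything that can't run as an int8 builtin kernel.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # bytes; write to .tflite or embed as a C array
```
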
Survey of Embedded ML
Executorch seems like a decent option for many uses. The runtime is a bit heavier, citing a 50kB base size, and not all int8 operations for CMSIS-NN are implemented. But progress seems fast, and adding them doesn’t actually look too complicated.
IREE is an MLIR-based compiler/optimizer/runtime that does everything and more, but unfortunately doesn’t have special support for ARMv7 (yet?). Plugging in architecture-specific ukernels seems doable at first glance, though, and would enable some really cool stuff.
Here’s a selection of projects I’ve looked at and tried to make work without success:
Converting PyTorch models with Quantization Aware Training via AI Edge Torch is mostly broken; people have had issues with it since 2024, it seems. Post-Training Quantization with pt2e does work, but full integer quantization this way can degrade some models to the point where they’re basically useless.
The PyTorch -> ONNX -> TF -> TFLite route isn’t easily possible either: onnx-tensorflow is dead, and onnx2tf looks great but does not support this particular use case.
microTVM/tvmc as listed in search results doesn’t exist anymore, and honestly I still don’t understand what TVM can do or how to use it, even after a not-so-quick look through its scattered docs.
mlonmcu looks very powerful potentially (though undocumented), but also starts with TFLite models as input.
nnom can be a replacement for the TFLite-Micro runtime and is TFLite-only.
tract is a bit heavy, and running without the Rust standard library is not planned.
tinygrad can spit out C code, but only with bare scalar for loops.
There’s also STM’s Cube AI thingy, if proprietary compilers are your thing I guess.
TFLite-Micro Quick Start
The tflite-on-stm32-cortexm7 repo contains a start-to-finish example of creating the quantized model and running it on an ARM Cortex-M7 microcontroller. In my case it’s an STM32H7S3L8 Nucleo board running at 600MHz, with ~600kB of RAM, instruction and data caches, a double precision FPU and DSP extensions making for a pretty powerful platform. It’s similar to the IMXRT1062 used on the Teensy 4 in terms of specs.
I’m running this small LSTM for testing:
```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def small_lstm(dim_in=40, filters=2, dim_hidden=20, dim_lstm_in=40, kernel_size=10):
    input_data = layers.Input(shape=(1, dim_in), name="data")
    input_state = layers.Input(shape=(dim_hidden * 4,), name="h")

    x = layers.Reshape((dim_in, 1))(input_data)
    x = layers.Conv1D(filters=filters, kernel_size=kernel_size,
                      activation='relu', data_format="channels_last",
                      use_bias=False)(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(dim_lstm_in)(x)

    # Unpack both LSTM layers' (h, c) states from one flat state buffer.
    h1 = input_state[:, 0 : dim_hidden]
    c1 = input_state[:, dim_hidden : dim_hidden * 2]
    h2 = input_state[:, dim_hidden * 2 : dim_hidden * 3]
    c2 = input_state[:, dim_hidden * 3 : dim_hidden * 4]

    x = layers.Reshape((1, dim_lstm_in))(x)
    x, h1_out, c1_out = layers.LSTM(dim_hidden, return_sequences=True,
                                    return_state=True, unroll=True)(
        x, initial_state=[h1, c1]
    )
    x, h2_out, c2_out = layers.LSTM(dim_hidden, return_sequences=True,
                                    return_state=True, unroll=True)(
        x, initial_state=[h2, c2]
    )
    state_buffer_out = layers.Concatenate(axis=-1, name="state_buffer_out")(
        [h1_out, c1_out, h2_out, c2_out]
    )

    x = layers.Flatten()(x)
    logits = layers.Dense(3)(x)
    probs = layers.Softmax(axis=-1, name="prediction")(logits)

    return Model(inputs=[input_data, input_state],
                 outputs=[probs, state_buffer_out])
```
Quirks I’ve encountered using `tf.lite.TFLiteConverter.from_keras_model`:

- Convolution layers are finicky with the TFLite converter. This specific shape with `channels_last` set is what ended up working. `use_bias=False` is not always required, but in other setups conversion failed without it. Conv2D rarely works at all.
- LSTM layers don’t seem to work as described in Google’s docs anymore, but using `unroll=True` and handling states manually in user code does the trick.
- Most ready-made models don’t work unless support is explicitly mentioned.
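The manual state handling the second point refers to can be sketched like this. This is a minimal stand-in with arbitrary dimensions, not the model from above: a single-step LSTM whose (h, c) state is exposed as explicit inputs and outputs, with the caller feeding the state back in on every call.

```python
import numpy as np
from tensorflow.keras import layers, Model

dim_in, dim_hidden = 8, 4

# One LSTM step with its (h, c) state as explicit model inputs/outputs,
# the same pattern as in the model above, reduced to a single layer.
x_in = layers.Input(shape=(1, dim_in))
h_in = layers.Input(shape=(dim_hidden,))
c_in = layers.Input(shape=(dim_hidden,))
seq, h_out, c_out = layers.LSTM(dim_hidden, return_sequences=True,
                                return_state=True, unroll=True)(
    x_in, initial_state=[h_in, c_in])
step = Model([x_in, h_in, c_in], [seq, h_out, c_out])

# The caller owns the recurrent state and threads it through each call;
# on-device, the same buffers are passed between TFLite invocations.
h = np.zeros((1, dim_hidden), np.float32)
c = np.zeros((1, dim_hidden), np.float32)
for _ in range(3):
    frame = np.random.rand(1, 1, dim_in).astype(np.float32)
    out, h, c = step.predict([frame, h, c], verbose=0)
```
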
Building and running everything is mostly straightforward and documented in the repo.
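Before flashing, it’s worth sanity-checking the converted flatbuffer against the Keras model on the host with the reference interpreter. A small sketch; `run_tflite` is my own helper name, not part of the repo:

```python
import tensorflow as tf

def run_tflite(model_bytes, inputs):
    """Run one inference with the reference (host-side) TFLite interpreter.

    `inputs` is a list of numpy arrays in input order, already cast to the
    input tensors' dtypes (int8 for a fully quantized model).
    """
    interp = tf.lite.Interpreter(model_content=model_bytes)
    interp.allocate_tensors()
    for detail, arr in zip(interp.get_input_details(), inputs):
        interp.set_tensor(detail["index"], arr)
    interp.invoke()
    return [interp.get_tensor(d["index"]) for d in interp.get_output_details()]
```
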
Benchmarks
There doesn’t seem to be a single model anywhere on Kaggle or KerasHub that fits in half a megabyte of RAM, so the TFLite example models and my own are all I’ve got.
The first rows are the MNIST LSTM example that ships with TFLite; the others list the parameters passed to the function above. Dynamic and full int8 quantization are explained in the docs (dynamic basically means integer weights, float activations).
As the chip I’m using is a boot-flash variant, all code and data are stored in external XSPI flash. Without any cache, every byte loaded goes over SPI, which is why the instruction and data caches make such a dramatic difference in the numbers below.
| Model (dim_in, filters, dim_hidden, dim_lstm_in, kernel_size) | Quantization | No Cache (ms) | ICache Only (ms) | I+DCache (ms) |
|---|---|---|---|---|
| LSTM, \(28\times 28\) input, \(20\) units | dynamic | 77 | 17.0 | 3.06 |
| LSTM, \(28\times 28\) input, \(20\) units | int8 | 77.9 | 22.9 | 0.75 |
| 40, 2, 20, 40, 10 | dynamic | 45.95 | 8.94 | 1.82 |
| 40, 2, 20, 40, 10 | int8 | 27.9 | 5.0 | 0.99 |
| 40, 2, 40, 40, 10 | dynamic | 45.9 | 8.52 | 1.89 |
| 40, 2, 40, 40, 10 | int8 | 45.1 | 8.56 | 1.92 |
| 60, 2, 20, 60, 10 | dynamic | 32.9 | 5.4 | 1.01 |
| 60, 2, 20, 60, 10 | int8 | 31.7 | 5.34 | 0.94 |
| 60, 2, 40, 60, 10 | int8 | 48.66 | 9.04 | 1.97 |
| 60, 4, 40, 60, 10 | int8 | 50.09 | 9.54 | 2.08 |
| 150, 8, 40, 150, 10 | int8 | 100.81 | 25.04 | 5.24 |