Machine Learning Inference on STM32/Cortex M7
tl;dr: TensorFlow is currently the way to go to run a QAT-quantized model on Cortex-M using the optimized CMSIS-NN kernels.
This post will hopefully save someone’s weekend: Quantization Aware Training in TensorFlow/Keras, converting to TensorFlow Lite (now LiteRT) with tf.lite.TFLiteConverter, and running the result with TFLite-Micro is the only path that works as of January 2026.
The overall design of TFLite-Micro seems a bit unfortunate (admittedly, I don’t know the reasons behind it): it doesn’t do any AOT compilation and leaves a lot of work for runtime. Operations parsed from the flatbuffer are resolved dynamically when a model is instantiated, and an entire memory planner works out how to allocate buffers most efficiently. All of this adds up in code size and RAM use, and it also rules out kernel-fusion passes and similar optimizations.
But it exists and works, which is a big plus.
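For reference, the path that worked looks roughly like this. A minimal sketch, assuming you already have a trained Keras `model` and a `representative_dataset` generator (both placeholders here); `convert_full_int8` is my own wrapper name, not an API:

```python
import tensorflow as tf

def convert_full_int8(keras_model, representative_dataset):
    """Convert a Keras model to a fully int8 TFLite flatbuffer
    suitable for TFLite-Micro with CMSIS-NN kernels.

    For QAT, wrap the model first with
    tensorflow_model_optimization.quantization.keras.quantize_model
    and fine-tune it before calling this.
    """
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Calibration data for activation ranges: a generator yielding
    # lists of input arrays.
    converter.representative_dataset = representative_dataset
    # Reject anything that can't run as an int8 builtin kernel.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # bytes; write to .tflite or embed as a C array
```
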
Survey of Embedded ML
Executorch seems like a decent option for many uses. The runtime is a bit heavier, citing a 50kB base size, and not all int8 operations for CMSIS-NN are implemented. But progress seems fast, and adding them doesn’t actually look too complicated.
IREE is an MLIR-based compiler/optimizer/runtime that does everything and more, but unfortunately doesn’t have special support for ARMv7 (yet?). Plugging in architecture-specific ukernels seems doable at first glance, though, and would enable some really cool stuff.
Here’s a selection of projects I’ve looked at and tried to make work without success:
Converting PyTorch models with Quantization Aware Training via AI Edge Torch is mostly broken; people have had issues with it since 2024, it seems. Post-Training Quantization with pt2e does work, but full integer quantization this way can degrade some models to the point where they’re basically useless.
The PyTorch -> ONNX -> TF -> TFLite route isn’t easily possible either: onnx-tensorflow is dead, and onnx2tf looks great but does not support this particular use case.
microTVM/tvmc as listed in search results doesn’t exist anymore, and honestly I still don’t understand what TVM can do or how to use it, even after a not-so-quick look through its scattered docs.
mlonmcu looks very powerful potentially (though undocumented), but also starts with TFLite models as input.
nnom can be a replacement for the TFLite-Micro runtime and is TFLite-only.
tract is a bit heavy, and running without the Rust standard library is not planned.
tinygrad can spit out C code, but only with bare scalar for loops.
There’s also STM’s Cube AI thingy, if proprietary compilers are your thing I guess.
TFLite-Micro Quick Start
The tflite-on-stm32-cortexm7 repo contains a start-to-finish example of creating the quantized model and running it on an ARM Cortex-M7 microcontroller. In my case it’s an STM32H7S3L8 Nucleo board running at 600MHz, with ~600kB of RAM, instruction and data caches, a double precision FPU and DSP extensions making for a pretty powerful platform. It’s similar to the IMXRT1062 used on the Teensy 4 in terms of specs.
I’m running this small LSTM for testing:
```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def small_lstm(dim_in=40, filters=2, dim_hidden=20, dim_lstm_in=40, kernel_size=10):
    input_data = layers.Input(shape=(1, dim_in), name="data")
    input_state = layers.Input(shape=(dim_hidden * 4,), name="h")

    x = layers.Reshape((dim_in, 1))(input_data)
    x = layers.Conv1D(filters=filters, kernel_size=kernel_size,
                      activation='relu', data_format="channels_last",
                      use_bias=False)(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(dim_lstm_in)(x)

    # Unpack both LSTM layers' (h, c) states from one flat state buffer.
    h1 = input_state[:, 0 : dim_hidden]
    c1 = input_state[:, dim_hidden : dim_hidden * 2]
    h2 = input_state[:, dim_hidden * 2 : dim_hidden * 3]
    c2 = input_state[:, dim_hidden * 3 : dim_hidden * 4]

    x = layers.Reshape((1, dim_lstm_in))(x)
    x, h1_out, c1_out = layers.LSTM(dim_hidden, return_sequences=True,
                                    return_state=True, unroll=True)(
        x, initial_state=[h1, c1]
    )
    x, h2_out, c2_out = layers.LSTM(dim_hidden, return_sequences=True,
                                    return_state=True, unroll=True)(
        x, initial_state=[h2, c2]
    )
    state_buffer_out = layers.Concatenate(axis=-1, name="state_buffer_out")(
        [h1_out, c1_out, h2_out, c2_out]
    )

    x = layers.Flatten()(x)
    logits = layers.Dense(3)(x)
    probs = layers.Softmax(axis=-1, name="prediction")(logits)

    return Model(inputs=[input_data, input_state],
                 outputs=[probs, state_buffer_out])
```
Quirks I’ve encountered using `tf.lite.TFLiteConverter.from_keras_model`:

- Convolution layers are finicky with the TFLite converter. This specific shape with `channels_last` set is what ended up working. `use_bias=False` is not always required, but in other setups conversion failed without it. Conv2D rarely works at all.
- LSTM layers don’t seem to work as described in Google’s docs anymore, but using `unroll=True` and handling states manually in user code does the trick.
- Most ready-made models don’t work unless support is explicitly mentioned.
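The manual state handling the second point refers to can be sketched like this. This is a minimal stand-in with arbitrary dimensions, not the model from above: a single-step LSTM whose (h, c) state is exposed as explicit inputs and outputs, with the caller feeding the state back in on every call.

```python
import numpy as np
from tensorflow.keras import layers, Model

dim_in, dim_hidden = 8, 4

# One LSTM step with its (h, c) state as explicit model inputs/outputs,
# the same pattern as in the model above, reduced to a single layer.
x_in = layers.Input(shape=(1, dim_in))
h_in = layers.Input(shape=(dim_hidden,))
c_in = layers.Input(shape=(dim_hidden,))
seq, h_out, c_out = layers.LSTM(dim_hidden, return_sequences=True,
                                return_state=True, unroll=True)(
    x_in, initial_state=[h_in, c_in])
step = Model([x_in, h_in, c_in], [seq, h_out, c_out])

# The caller owns the recurrent state and threads it through each call;
# on-device, the same buffers are passed between TFLite invocations.
h = np.zeros((1, dim_hidden), np.float32)
c = np.zeros((1, dim_hidden), np.float32)
for _ in range(3):
    frame = np.random.rand(1, 1, dim_in).astype(np.float32)
    out, h, c = step.predict([frame, h, c], verbose=0)
```
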
Building and running everything is mostly straightforward and documented in the repo.
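Before flashing, it’s worth sanity-checking the converted flatbuffer against the Keras model on the host with the reference interpreter. A small sketch; `run_tflite` is my own helper name, not part of the repo:

```python
import tensorflow as tf

def run_tflite(model_bytes, inputs):
    """Run one inference with the reference (host-side) TFLite interpreter.

    `inputs` is a list of numpy arrays in input order, already cast to the
    input tensors' dtypes (int8 for a fully quantized model).
    """
    interp = tf.lite.Interpreter(model_content=model_bytes)
    interp.allocate_tensors()
    for detail, arr in zip(interp.get_input_details(), inputs):
        interp.set_tensor(detail["index"], arr)
    interp.invoke()
    return [interp.get_tensor(d["index"]) for d in interp.get_output_details()]
```
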
Benchmarks
There doesn’t seem to be a single model anywhere on Kaggle or KerasHub that fits in half a megabyte of RAM, so the TFLite example models and my own are all I’ve got.
The first rows are the MNIST LSTM example that ships with TFLite; the others list the parameters passed to the function above. Dynamic and full int8 quantization are explained in the docs (dynamic basically means integer weights, float activations).
As the chip I’m using is a boot-flash variant, all code and data are stored in external XSPI flash. Without any cache, every byte loaded goes over SPI, which is why the instruction and data caches make such a dramatic difference in the numbers below.
| Model (dim_in, filters, dim_hidden, dim_lstm_in, kernel_size) | Quantization | No Cache (ms) | ICache Only (ms) | I+DCache (ms) |
|---|---|---|---|---|
| LSTM, \(28\times 28\) input, \(20\) units | dynamic | 77 | 17.0 | 3.06 |
| LSTM, \(28\times 28\) input, \(20\) units | int8 | 77.9 | 22.9 | 0.75 |
| 40, 2, 20, 40, 10 | dynamic | 45.95 | 8.94 | 1.82 |
| 40, 2, 20, 40, 10 | int8 | 27.9 | 5.0 | 0.99 |
| 40, 2, 40, 40, 10 | dynamic | 45.9 | 8.52 | 1.89 |
| 40, 2, 40, 40, 10 | int8 | 45.1 | 8.56 | 1.92 |
| 60, 2, 20, 60, 10 | dynamic | 32.9 | 5.4 | 1.01 |
| 60, 2, 20, 60, 10 | int8 | 31.7 | 5.34 | 0.94 |
| 60, 2, 40, 60, 10 | int8 | 48.66 | 9.04 | 1.97 |
| 60, 4, 40, 60, 10 | int8 | 50.09 | 9.54 | 2.08 |
| 150, 8, 40, 150, 10 | int8 | 100.81 | 25.04 | 5.24 |