Alright, let’s talk about quantization. I’ve started optimizing my deep learning pipeline, and since I’m using YOLOv5 as the base model (because it’s standard, widely used for benchmarking, and popular in AIoT applications), the first challenge I ran into was the SiLU activation function. Linear operations are straightforward to quantize, but SiLU? Not so much.
Why SiLU is a Problem for Quantization
The SiLU activation function is defined as:
\[ \text{SiLU}(x) = x \cdot \sigma(x) \]
where \( \sigma(x) \) is the sigmoid function:
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
The issue? The sigmoid function is non-linear and expensive to compute: evaluating it requires an exponential and a division, neither of which maps well to MCUs with SIMD instructions. Look-up tables (LUTs) are an option, but per-element table lookups don’t vectorize well with SIMD operations on embedded hardware.
Finding an Efficient Approximation
To solve this, I researched efficient approximations for the sigmoid function and found a great approach in this paper (see page 20).
The approximation is built from a quadratic helper function:
\[ g(x) = 0.5 \cdot (0.25x - 1)^2 \]
With this, the approximated sigmoid function is defined as:
\[ \sigma_{\text{approx}}(x) = \begin{cases} 0, & x < -4 \\ g(-x), & -4 \leq x \leq 0 \\ 1 - g(x), & 0 \leq x \leq 4 \\ 1, & x > 4 \end{cases} \]
Approximating SiLU
Since SiLU is simply \( x \cdot \sigma(x) \), the approximated SiLU function becomes:
\[ \text{SiLU}_{\text{approx}}(x) = \begin{cases} 0, & x < -4 \\ x \cdot g(-x), & -4 \leq x \leq 0 \\ x \cdot (1 - g(x)), & 0 \leq x \leq 4 \\ x, & x > 4 \end{cases} \]
Optimized Computation for MCUs
To avoid costly multiplications and divisions, we rewrite \( g(x) \) using bit shifts (this assumes \( x \) is held as a fixed-point integer, since shifts operate on integers):
\[ g(x) = \left( (x \gg 2) - 1 \right)^2 \gg 1 \]
Here \( \gg 2 \) replaces the multiplication by 0.25 and \( \gg 1 \) replaces the multiplication by 0.5. Bit shifts are single-cycle operations on most MCUs and vectorize cleanly with SIMD instructions.
Why This Works So Well
Our approximation is limited to the range \([-4,4]\), but this isn’t an issue because:
- For \( x < -4 \), SiLU naturally approaches 0, and our approximation does the same.
- For \( x > 4 \), SiLU behaves like \( x \), which our approximation also captures.
- Within \([-4,4]\), the quadratic approximation closely follows the real function.
So we maintain high accuracy while optimizing for embedded hardware!
What’s Next?
In the next post, I’ll dive into making this approximation work with quantized inputs instead of floating-point numbers. Stay tuned!
GitHub Implementation
You can find the full implementation here: GitHub Repository