Sunday, February 23, 2025

ESP32-P4 Deep Learning Pipeline Update 3: Quantizing the SiLU Approximation for Efficient Inference


Why Quantize the SiLU Approximation?

Quantization is crucial for deploying deep learning models on embedded systems, especially when leveraging specialized hardware features like SIMD instructions. While the ESP32-P4 includes a floating-point unit, it operates in scalar mode, making it inefficient for large-scale matrix operations. In contrast, its SIMD engine can process up to 16 int8 values simultaneously, offering a major speed advantage for integer arithmetic.

Using a LUT-based activation function on the ESP32-P4 requires processing each element individually and accessing memory for each lookup, which introduces latency. By replacing LUT-based activations with a quantized polynomial approximation, we eliminate memory bottlenecks and fully utilize the SIMD capabilities.
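
For contrast, a generic int8 LUT activation looks like the sketch below: every element costs one table load, which is exactly the per-element memory traffic the polynomial form avoids. This is an illustrative sketch, not the actual ESP-DL lookup code:

```c
#include <stdint.h>
#include <stddef.h>

/* Generic int8 LUT activation: one table read per element.
 * The 256-entry table is indexed by the int8 input value,
 * offset into the range 0..255. */
static void lut_activation(const int8_t *in, int8_t *out, size_t n,
                           const int8_t lut[256])
{
    for (size_t i = 0; i < n; i++)
        out[i] = lut[(uint8_t)(in[i] + 128)];  /* map -128..127 to 0..255 */
}
```

Because each iteration depends on a load whose address is data-dependent, this loop cannot be vectorized the way pure integer arithmetic can.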

Key benefits of integer-based quantization on the ESP32-P4:

  • Higher throughput: SIMD allows parallel processing of multiple int8 elements, significantly boosting performance.
  • Lower latency: Integer arithmetic (bit shifts, additions, multiplications) is computationally cheaper than floating-point operations.
  • Reduced power consumption: Integer math is more energy-efficient, a crucial factor for battery-powered embedded devices.
  • Minimized memory overhead: Storing precomputed activation values in a LUT consumes additional memory, whereas a polynomial approximation eliminates the need for external lookup tables.

By quantizing the SiLU function into an efficient integer form, we ensure that the activation function can be executed with minimal latency while maintaining high accuracy.

SiLU vs QSiLU Approximation

Understanding the SiLU Approximation

The SiLU activation function is defined as:

\[ \text{SiLU}(x) = x \cdot \sigma(x) \]

where the sigmoid function is:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

For embedded efficiency, we approximate SiLU using a piecewise polynomial function optimized for integer arithmetic. You can read more about the detailed process and optimization in the previous update here.

Step 1: Input Quantization

The quantized input \( x_q \) is defined as:

\[ x_q = \frac{x}{s_x} \]

where \( s_x \) is a power-of-two scale factor:

\[ s_x = 2^{-n_x} \]

Because \( s_x \) is a power of two, dividing by it is just a multiplication by \( 2^{n_x} \), which in fixed-point code is a left shift:

\[ x_q = x \ll n_x \]

Conversely, dequantization reduces to a right shift by \( n_x \).

Step 2: Integer-Based SiLU Computation

Recalling that \( \text{SiLU}(x) = x \cdot \sigma(x) \), we approximate it for efficient integer computation with a piecewise polynomial, where \( x \) below denotes the fixed-point input (the real value scaled by \( 2^{n_x} \)):

\[ y = \begin{cases} 0, & x < -4 \cdot 2^{n_x} \\ (2^{n_y - (3 n_x + 5)}) \cdot (x + 2^{2 + n_x})^2 \cdot x + z_y, & -4 \cdot 2^{n_x} \leq x \leq 0 \\ (2^{n_y - (3 n_x + 5)}) \cdot (2^{2 n_x + 5} - (x - 2^{2 + n_x})^2) \cdot x + z_y, & 0 \leq x \leq 4 \cdot 2^{n_x} \\ (x \cdot 2^{n_y - n_x}) + z_y, & x > 4 \cdot 2^{n_x} \end{cases} \]

Where:

  • \( x \) is the input value.
  • \( n_x \) and \( n_y \) are scaling factors chosen to optimize precision.
  • \( z_y \) is the zero-point for output quantization.

This formulation allows efficient computation using only multiplications and additions, making it ideal for embedded hardware with limited resources.

Step 3: Output Quantization

The real-valued activation \( y \) is mapped to the quantized output in the same way as the input:

\[ y_q = \frac{y}{s_y} + z_y \]

where \( s_y = 2^{-n_y} \) is a power-of-two scale factor and \( z_y \) is the zero-point offset.

In the integer implementation this scale never appears explicitly: the factor \( 2^{n_y} \) is folded into the polynomial constants of Step 2, so the intermediate product only needs a single right shift down to the output scale before \( z_y \) is added.

Optimized Integer-Only Implementation

We define the quantized SiLU approximation \( \text{QSiLUApprox}(x_q) \), which computes the quantized output \( y_q \) directly from the quantized input \( x_q \):

\[ y_q = \begin{cases} 0, & x_q < -(4 \ll n_x) \\ C_1 \cdot (x_q + C_2)^2 \cdot x_q + z_y, & -(4 \ll n_x) \leq x_q \leq 0 \\ C_1 \cdot (C_3 - (x_q - C_2)^2) \cdot x_q + z_y, & 0 \leq x_q \leq 4 \ll n_x \\ (x_q \ll (n_y - n_x)) + z_y, & x_q > 4 \ll n_x \end{cases} \]

where the constants are:

\[ C_1 = 1 \ll (n_y - (3 n_x + 5)), \quad C_2 = 1 \ll (2 + n_x), \quad C_3 = 1 \ll (2 n_x + 5) \]

For typical values of \( n_x \) and \( n_y \), the exponent of \( C_1 \) is negative, so the multiplication by \( C_1 \) is implemented as a right shift by \( 3 n_x + 5 - n_y \).
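
Put together, a scalar C sketch of \( \text{QSiLUApprox} \) might look as follows. The parameter values \( n_x = n_y = 4 \) and \( z_y = 0 \) are illustrative assumptions, and intermediates are widened to int32 so the products cannot overflow for int8-range inputs (the deployed ESP32-P4 version would map this onto SIMD instructions):

```c
#include <stdint.h>

/* Scalar sketch of QSiLUApprox.  Assumed (illustrative) parameters:
 * n_x = n_y = 4 fractional bits, zero-point z_y = 0.  Right shifts of
 * negative values are assumed to be arithmetic, which holds for
 * GCC/Clang on the ESP32-P4's RISC-V cores. */
#define NX 4
#define NY 4
#define ZY 0

static int32_t qsilu_approx(int32_t xq)
{
    const int32_t c2  = 1 << (2 + NX);       /* C2 = 2^(2+n_x)              */
    const int32_t c3  = 1 << (2 * NX + 5);   /* C3 = 2^(2n_x+5)             */
    const int32_t lim = 4 << NX;             /* real |x| = 4 in fixed point */
    const int32_t sh  = 3 * NX + 5 - NY;     /* right shift replacing C1    */

    if (xq < -lim)
        return ZY;                                          /* SiLU(x) ~ 0 */
    if (xq <= 0)
        return (((xq + c2) * (xq + c2) * xq) >> sh) + ZY;
    if (xq <= lim)
        return (((c3 - (xq - c2) * (xq - c2)) * xq) >> sh) + ZY;
    return (xq << (NY - NX)) + ZY;                          /* SiLU(x) ~ x */
}
```

For example, \( x = 1.0 \) (\( x_q = 16 \)) yields \( y_q = 11 \), i.e. \( 11/16 \approx 0.69 \) against the exact \( \text{SiLU}(1) \approx 0.73 \), which is within one quantization step of the float approximation.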

Trade-offs and Considerations

  • Accuracy vs. Efficiency: Lower bit-widths reduce precision, but the loss can be kept small with a careful choice of scale factors.
  • Scale Factor Selection: Power-of-two scaling simplifies computation but may require fine-tuning.
  • ESP32-P4 Optimizations: SIMD operations can further accelerate execution.

Final Thoughts

By leveraging integer-only arithmetic and power-of-two scaling, we achieve an efficient SiLU approximation for embedded AI applications. The next step involves benchmarking performance on the ESP32-P4 to evaluate accuracy and execution time.
