Wednesday, March 19, 2025

ESP32-P4 Deep Learning Pipeline Update 4: Optimizing QSiLUApprox Activation Using SIMD on ESP32-S3

Quantizing the SiLU Approximation for Efficient Inference

In our previous blog post, we explored the mathematical approximation of the SiLU activation function for quantized neural networks. Today, I'm excited to share how I implemented this approximation using SIMD instructions on the ESP32-S3 microcontroller, focusing on the unique challenges I encountered and how I solved them.
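For context, SiLU (the sigmoid-weighted linear unit, also known as swish) is defined as:

\[ \text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} \]

The approximation from the previous post replaces this transcendental form with basic arithmetic suitable for quantized integer inference.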

SIMD Challenges on ESP32-S3

While the ESP32-S3's SIMD capabilities offer significant performance advantages for neural network inference, implementing our SiLU approximation revealed several hardware limitations that required creative solutions:

  • No direct type conversion between int8_t and int16_t
  • No arithmetic right shift for vectors
  • No direct absolute value instruction

SIMD Architecture: Efficient Parallel Data Processing

The ESP32-S3 can process 16 int8 elements simultaneously with its SIMD instructions. This parallel processing capability is what makes the SIMD implementation so efficient for neural network operations. When you're working with quantized 8-bit neural networks, this means you can process 16 activations at once, which provides a significant performance boost compared to sequential processing.

Let's dive into how I tackled each of these challenges.

Challenge 1: Type Conversion with Sign Extension

Our quantized implementation requires converting between 8-bit and 16-bit integers while preserving sign. The ESP32-S3 doesn't provide direct SIMD instructions for this conversion, so I created a custom macro:

/**
 * @brief Expand an int8_t vector to int16_t with proper sign extension.
 *
 * Interleaves the int8_t vector `q_src` with the pre-zeroed register
 * `q_dst_high`, leaving the high half of the result in `q_dst_high` and
 * the low half in `q_src` itself. Each lane then holds its byte shifted
 * into the upper 8 bits; multiplying by `q_scale` (preloaded with 1 in
 * every lane, with SAR set to 8) arithmetically shifts each lane back
 * down, restoring the sign.
 */
#define EXPAND_INT8_TO_INT16(q_dst_high, q_src, q_scale) \
    asm volatile ( "EE.VZIP.8 " #q_dst_high ", " #q_src : : );  \
    asm volatile ( "EE.VMUL.S16 " #q_dst_high ", " #q_dst_high ", " #q_scale : : ); \
    asm volatile ( "EE.VMUL.S16 " #q_src ", " #q_src ", " #q_scale : : );

How Vector Expansion Works

The expansion process works in two key steps:

Step 1: Vector Interleaving

The EE.VZIP.8 instruction interleaves the bytes of q_src with a second, pre-zeroed register, effectively transforming:

\[ q\_src = [x_0, x_1, x_2, x_3, ..., x_{14}, x_{15}] \text{ (int8\_t)} \]

into:

\[ q\_dst\_high = [0, x_0, 0, x_1, ..., 0, x_7] \text{ (int16\_t)} \]

\[ q\_src = [0, x_8, 0, x_9, ..., 0, x_{15}] \text{ (int16\_t)} \]

Step 2: Sign Extension Fix

The interleaving places each byte in the upper half of its 16-bit lane, so a lane holds x << 8 rather than a sign-extended x. For example, a value of -1 (0xFF in int8) becomes 0xFF00 in int16, not the 0xFFFF required for proper sign extension.

To solve this, I use a clever multiplication trick:

  • Set the SAR (Shift Amount Register) to 8
  • Multiply by 1 using EE.VMUL.S16 (the `q_scale` register holds 1 in every lane)
  • The hardware automatically performs an arithmetic right shift by SAR after the multiplication

Each lane thus goes from x << 8 to a correctly sign-extended x: 0xFF00 becomes 0xFFFF (-1), while a positive value such as 0x0500 becomes 0x0005 (5).
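As a sanity check, the two steps can be modeled in plain C. This is a scalar sketch of a single lane, not the actual vector assembly:

```c
#include <stdint.h>

/* Scalar model of EXPAND_INT8_TO_INT16 for one lane (illustrative
 * sketch; the real code processes 8 lanes per Q register at once). */
int16_t expand_lane(int8_t x)
{
    /* Step 1 (EE.VZIP.8 with a zeroed register): the byte lands in the
     * upper half of the 16-bit lane, i.e. the lane holds x << 8. */
    int16_t lane = (int16_t)((uint16_t)((uint8_t)x) << 8);

    /* Step 2 (EE.VMUL.S16 by 1 with SAR = 8): multiply by 1, then an
     * arithmetic right shift by 8 restores the sign-extended value.
     * (>> on a negative value is arithmetic on mainstream compilers.) */
    return (int16_t)((lane * 1) >> 8);
}
```

For example, expand_lane(-1) models the 0xFF → 0xFF00 → 0xFFFF round trip.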

Challenge 2: Compressing Back to 8-bit

After performing our calculations in 16-bit precision, we need to compress the results back to 8-bit format:

/**
 * @brief Compress two int16_t vectors back to int8_t using EE.VUNZIP.8.
 *
 * This macro reverses the `EXPAND_INT8_TO_INT16` operation by using `EE.VUNZIP.8`
 * to merge `q_src_high` and `q_src_low` back into an `int8_t` vector.
 */
#define COMPRESS_INT16_TO_INT8(q_src_low, q_src_high) \
    asm volatile ( "EE.VUNZIP.8 " #q_src_low ", " #q_src_high : : );

The EE.VUNZIP.8 instruction elegantly performs the reverse operation of EE.VZIP.8, efficiently packing our 16-bit values back into 8-bit format.
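The packing step can likewise be modeled in scalar C. This sketch assumes each 16-bit lane's low byte already holds the final int8 result and that the first vector carries lanes 0–7 (register naming and lane order here are illustrative, not taken from the hardware manual):

```c
#include <stdint.h>

/* Scalar sketch of COMPRESS_INT16_TO_INT8 (not the real EE.VUNZIP.8):
 * keep the low byte of each 16-bit lane and pack the two half-vectors
 * back into 16 int8_t values. Assumes every lane fits in int8_t range. */
void compress_lanes(const int16_t first[8], const int16_t second[8],
                    int8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = (int8_t)(first[i]  & 0xFF);  /* lanes 0..7  */
        out[i + 8] = (int8_t)(second[i] & 0xFF);  /* lanes 8..15 */
    }
}
```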

Challenge 3: Implementing Absolute Value

The SiLU approximation requires calculating absolute values, but the ESP32-S3 lacks a direct SIMD instruction for this operation. I implemented it using the mathematical equivalence:

\[ |x| = \max(x, -x) \]

This allows us to leverage existing SIMD instructions for efficient computation of absolute values.
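In scalar C, the identity looks like this (a sketch of one lane; in the 16-bit domain it is safe because values expanded from int8_t can never reach INT16_MIN, where negation would overflow):

```c
#include <stdint.h>

/* |x| = max(x, -x), built from negate + compare because the ESP32-S3
 * lacks a vector absolute-value instruction. Safe for values expanded
 * from int8_t: -x cannot overflow int16_t in that range. */
static inline int16_t abs_via_max(int16_t x)
{
    int16_t neg = (int16_t)(-x);
    return (x > neg) ? x : neg;
}
```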

Results and Performance

By implementing these core operations as building blocks, I was able to create an efficient SIMD-based implementation of our SiLU approximation that:

  • Matches the accuracy of our Python reference implementation
  • Processes multiple elements in parallel using SIMD instructions
  • Avoids expensive memory operations by eliminating lookup tables
  • Maintains precision through careful management of data types

The resulting implementation can be directly integrated into inference engines running on the ESP32-S3, allowing for efficient computation of neural networks that use the SiLU activation function.

Why This Approach Matters

This implementation demonstrates how to overcome hardware limitations through creative use of available instructions. The techniques shown here can be applied to other activation functions and neural network operations to optimize performance on embedded systems with SIMD capabilities.

The key advantages of our approach are:

  • Parallelism: Processing up to 16 int8 values simultaneously
  • Memory efficiency: No lookup tables required
  • Computational efficiency: Using only basic arithmetic operations
  • Accuracy: Maintaining precision through proper type handling

What's Next?

In future posts, I'll explore additional optimizations for neural network inference on the ESP32-S3 and benchmark the performance gains of our SIMD implementation compared to traditional approaches.

GitHub Implementation

You can find the full implementation here: GitHub Repository