BlazeFace is a lightweight, accurate face detector tailored for mobile GPU inference. It was developed by Google researchers in 2019 to achieve super-real-time inference. BlazeFace stands apart from its predecessors thanks to a series of optimizations that substantially improve its speed. In this article, we look closely at BlazeFace and the main optimizations behind it.

The paper presents three main contributions: a lightweight feature extractor inspired by MobileNet V1/V2, an anchor scheme modified from the Single Shot MultiBox Detector (SSD), and an improved tie resolution strategy that replaces non-maximum suppression.

**Single/Double BlazeBlocks**

The BlazeFace feature extractor, as mentioned before, is based on the MobileNet architecture. This well-known neural network relies on depthwise separable convolution, which splits a convolution into two parts: a first step that filters each input channel separately, preserving the depth of the input, and a second step that can shrink or expand the depth while keeping the spatial dimensions unchanged.

A depthwise convolution followed by a pointwise convolution makes up a depthwise separable convolution. This technique reduces the computational complexity and therefore improves speed. Let’s see an example.

Consider a 10×10×3 input image convolved with 256 filters of size 3×3×3 (stride = 1, no padding). The result is an 8×8×256 tensor. Let’s count the multiplications needed:

*Normal Convolution*

Nₜₒₜ = (number of filters) × (number of kernel positions) × (kernel dimensions)

Nₜₒₜ = 256 × (8 × 8) × (3 × 3 × 3) = 442368 multiplications

*Separable Convolution*

1. Depthwise:

DW = (8 × 8) × (3 × 3 × 3) = 1728 multiplications

2. Pointwise:

PW = 256 × (8 × 8) × (1 × 1 × 3) = 49152 multiplications

Nₜₒₜ = DW + PW = 50880 multiplications

As you can see, the separable convolution needs roughly 8.7 times fewer multiplications (50880 versus 442368), and fewer operations mean a faster convolution.

Furthermore, the researchers observed that the pointwise part of a depthwise separable convolution dominates its overall computation, so increasing the kernel size of the depthwise part is relatively inexpensive. Repeating the calculation above with a 5×5 kernel and the same 8×8 grid of kernel positions, the depthwise part grows to 8 × 8 × 5 × 5 × 3 = 4800 multiplications and the total to 53952, only about 6% more than with a 3×3 kernel.
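The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting rule used in this article; the function names are hypothetical, and the 8×8 grid of kernel positions is taken from the example rather than derived from padding rules.

```python
def standard_conv_mults(positions, kernel, c_in, c_out):
    """Multiplications for a standard convolution with c_out filters of size kernel x kernel x c_in."""
    return c_out * positions * kernel * kernel * c_in


def separable_conv_mults(positions, kernel, c_in, c_out):
    """Multiplications for the depthwise and pointwise parts of a separable convolution."""
    depthwise = positions * kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_out * positions * c_in            # c_out filters of size 1 x 1 x c_in
    return depthwise, pointwise


positions = 8 * 8  # kernel positions used in the example above
for kernel in (3, 5):
    dw, pw = separable_conv_mults(positions, kernel, c_in=3, c_out=256)
    standard = standard_conv_mults(positions, kernel, c_in=3, c_out=256)
    print(f"{kernel}x{kernel}: standard={standard}, depthwise={dw}, pointwise={pw}, separable={dw + pw}")
```

The pointwise term does not depend on the kernel size at all, which is why moving from 3×3 to 5×5 changes the separable total so little.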

Based on this insight, the authors decided to utilize 5×5 kernels in the BlazeFace model’s bottlenecks (Single BlazeBlock). This choice allowed them to trade the increase in kernel size for a decrease in the total number of bottlenecks required to achieve a specific receptive field size in the model.
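To illustrate the trade-off, the sketch below counts how many stacked stride-1 convolutions are needed to reach a given receptive field. It uses the standard formula for stacked stride-1 layers (receptive field = 1 + layers × (kernel − 1)); the 17-pixel target and the function names are arbitrary choices for illustration, not values from the paper.

```python
def receptive_field(num_layers, kernel):
    """Receptive field (in pixels) of num_layers stacked stride-1 kernel x kernel convolutions."""
    return 1 + num_layers * (kernel - 1)


def layers_needed(target, kernel):
    """Smallest number of stacked stride-1 layers whose receptive field reaches target pixels."""
    layers = 0
    while receptive_field(layers, kernel) < target:
        layers += 1
    return layers


for kernel in (3, 5):
    print(f"{kernel}x{kernel}: {layers_needed(17, kernel)} layers to cover 17 pixels")
```

Under these assumptions, 5×5 kernels reach a 17-pixel receptive field in 4 layers, while 3×3 kernels need 8.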

In particular, the MobileNetV2 bottleneck contains two pointwise convolutions, a depth-increasing expansion and a depth-decreasing projection, with a depthwise convolution in between. The researchers modified this block to create the Double BlazeBlock:

First, the order of expansion and projection is swapped so that the residual connections in the bottlenecks operate at the expanded channel resolution. Then, thanks to its low overhead, another depthwise convolution layer is added between the two pointwise convolutions, accelerating the growth of the receptive field even further. Please note that an activation layer follows each pointwise convolution to introduce non-linearity, even though it is not shown in the illustrations.

The feature extractor is built by combining and repeating Single and Double BlazeBlocks.
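The structure just described can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the weights are random placeholders, the stride is fixed to 1, the 24-channel bottleneck is an assumed value, and applying the second activation after the residual sum is an assumption about where it sits.

```python
import numpy as np

rng = np.random.default_rng(0)


def depthwise_conv(x, kernel=5):
    """Depthwise convolution: one kernel x kernel filter per channel, stride 1, zero padding."""
    height, width, channels = x.shape
    weights = rng.standard_normal((kernel, kernel, channels)) * 0.1  # random placeholder weights
    pad = kernel // 2
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(kernel):
        for j in range(kernel):
            out += padded[i:i + height, j:j + width, :] * weights[i, j, :]
    return out


def pointwise_conv(x, out_channels):
    """Pointwise (1x1) convolution: mixes channels independently at every pixel."""
    weights = rng.standard_normal((x.shape[-1], out_channels)) * 0.1  # random placeholder weights
    return x @ weights


def relu(x):
    return np.maximum(x, 0.0)


def double_blaze_block(x, bottleneck_channels=24):
    """Sketch of a stride-1 Double BlazeBlock with matching input and output depth."""
    y = depthwise_conv(x)                              # 5x5 depthwise on the expanded tensor
    y = relu(pointwise_conv(y, bottleneck_channels))   # pointwise projection to fewer channels
    y = depthwise_conv(y)                              # extra 5x5 depthwise between the pointwise layers
    y = pointwise_conv(y, x.shape[-1])                 # pointwise expansion back to the input depth
    return relu(y + x)                                 # residual connection at the expanded resolution


features = rng.standard_normal((16, 16, 96))
print(double_blaze_block(features).shape)
```

The residual sum only works because the block returns to the input depth, which is the point of swapping the projection and expansion stages.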

**Anchor Scheme**

SSD-like object detection models rely on predefined fixed-size base bounding boxes known as *priors* or *anchors*. These initial boxes serve as a starting point for the model to predict adjustments in center position and dimensions, allowing for a better fit around the detected object.
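As a rough illustration of how predicted adjustments are applied to an anchor, here is a minimal decoding sketch. The parameterization (center shifts scaled by the anchor size, exponential scaling of width and height) is the one commonly used in SSD-style detectors and is an assumption here; BlazeFace implementations may encode the offsets differently, and `decode_boxes` is a hypothetical name.

```python
import numpy as np


def decode_boxes(anchors, offsets):
    """Apply predicted adjustments to anchors.

    anchors: (N, 4) array of [cx, cy, w, h]
    offsets: (N, 4) array of [dx, dy, dw, dh], assumed relative to the anchor size
    """
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]  # shift center by a fraction of the width
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]  # shift center by a fraction of the height
    w = anchors[:, 2] * np.exp(offsets[:, 2])           # scale width
    h = anchors[:, 3] * np.exp(offsets[:, 3])           # scale height
    return np.stack([cx, cy, w, h], axis=1)


anchor = np.array([[0.5, 0.5, 0.2, 0.2]])
offset = np.array([[0.1, -0.1, 0.0, 0.0]])
print(decode_boxes(anchor, offset))
```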

A typical SSD model uses predictions from feature maps with sizes of 1×1, 2×2, 4×4, 8×8, and 16×16. This practice of defining anchors at multiple resolution levels is commonly used to align with the range of object scales.

However, the BlazeFace architecture is optimized for mobile GPU inference, and the chosen anchor scheme takes this into account. GPU computation differs from CPU computation, particularly in the fixed cost of dispatching each layer computation. This cost becomes relatively significant for the deep low-resolution layers that are inherent to popular CPU-tailored architectures. The researchers therefore chose an anchor scheme that stops at the 8×8 feature map, without further downsampling. The scheme consists of 6 anchors per pixel on the 8×8 feature map and 2 anchors per pixel on the 16×16 feature map (since human faces vary little in aspect ratio, the anchors are limited to a 1:1 aspect ratio), resulting in:

Anchorsₜₒₜ = 6 × 8 × 8 + 2 × 16 × 16 = 896 anchors
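The anchor count can be reproduced by generating one set of anchor centers per feature map. This is a sketch under assumptions: centers are placed in the middle of each cell in normalized [0, 1] coordinates, and anchor sizes are left out because the article does not give them. The helper name is hypothetical.

```python
import numpy as np


def make_anchor_centers(grid_size, anchors_per_cell):
    """Return (grid_size * grid_size * anchors_per_cell, 2) normalized [cx, cy] anchor centers."""
    coords = (np.arange(grid_size) + 0.5) / grid_size  # cell centers in [0, 1]
    cx, cy = np.meshgrid(coords, coords)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)
    return np.repeat(centers, anchors_per_cell, axis=0)  # same center for every anchor in a cell


anchors = np.concatenate([
    make_anchor_centers(16, 2),  # 2 anchors per cell on the 16x16 map
    make_anchor_centers(8, 6),   # 6 anchors per cell on the 8x8 map
])
print(anchors.shape)
```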

**Output Layer**

It is worth noting that this model also predicts facial keypoints.

The paper does not describe the output layers in depth. However, based on some open source implementations, the model produces two output tensors for each of the 8×8 and 16×16 feature maps.

Considering the last Double BlazeBlock, with dimensions 8×8×96, the two output tensors (one for the labels and one for the coordinates) are obtained with a 3×3 convolution using the following number of filters:

– 6 filters for the labels (1 label per anchor)

– 96 filters for the coordinates (4 box and 12 keypoint coordinates per anchor)
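The filter counts above follow directly from the number of anchors per cell. The sketch below reshapes placeholder head outputs for the 8×8 feature map into one row per anchor; the channel ordering (all values of one anchor stored contiguously) is an assumption, since it depends on the implementation.

```python
import numpy as np

anchors_per_cell = 6
values_per_anchor = 4 + 12  # box coordinates + keypoint coordinates

# Placeholder head outputs for the 8x8 feature map, laid out as (height, width, filters)
labels_map = np.zeros((8, 8, anchors_per_cell))
coords_map = np.zeros((8, 8, anchors_per_cell * values_per_anchor))

# One row per anchor
labels = labels_map.reshape(-1, 1)
coords = coords_map.reshape(-1, values_per_anchor)
print(labels.shape, coords.shape)
```

By the same reasoning, the 16×16 feature map with 2 anchors per cell would need 2 label filters and 32 coordinate filters, giving 512 more rows and 896 rows in total.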

**Post-processing**

The last optimization, which targets accuracy, replaces the typical non-maximum suppression (NMS) step. With NMS, only one box is selected as the winning prediction. When the model runs on consecutive video frames, the winner tends to switch between different anchors, so the predictions exhibit temporal jitter that humans perceive as noise.

To minimize this effect, the authors replace the NMS algorithm with a *blending* strategy that estimates the regression parameters of a bounding box as a weighted mean of the overlapping predictions, leading to a 10% increase in accuracy.
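A minimal sketch of such a blending step is shown below. It follows the usual NMS loop but, instead of discarding the boxes that overlap the current best one, it averages them. Weighting by the confidence score and the 0.3 IoU threshold are assumptions, as the article does not specify them, and the function names are hypothetical.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def blend_detections(boxes, scores, iou_threshold=0.3):
    """Merge each cluster of overlapping boxes into one score-weighted average box."""
    order = np.argsort(scores)[::-1]  # process boxes from highest to lowest score
    boxes, scores = boxes[order], scores[order]
    remaining = np.ones(len(boxes), dtype=bool)
    blended = []
    for i in range(len(boxes)):
        if not remaining[i]:
            continue
        # Boxes still available that overlap the current best box (including itself)
        cluster = remaining & (iou(boxes[i], boxes) >= iou_threshold)
        blended.append(np.average(boxes[cluster], axis=0, weights=scores[cluster]))
        remaining &= ~cluster
    return np.array(blended)


boxes = np.array([[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0], [50.0, 50.0, 60.0, 60.0]])
scores = np.array([0.9, 0.6, 0.8])
print(blend_detections(boxes, scores))
```

Because every overlapping prediction contributes to the final box, a small change in which anchor scores highest moves the result only slightly, which is what reduces the jitter between frames.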

The BlazeFace model incorporates various optimizations, resulting in sub-millisecond inference time on mobile GPUs. Although originally designed for GPU inference, the algorithm achieves exceptional performance even on CPU.

I have developed a C++ library for CPU BlazeFace inference, here’s the GitHub link.