Blurring the background in Google Meet with ML in the browser

Video conferencing is becoming more and more important in people's work and personal lives. The better a call handles video processing and privacy, the less it distracts participants from the meeting itself. To help with this, we recently released background blur and background replacement filters in Google Meet. They use machine learning (ML) to better distinguish participants from their surroundings, whatever the environment. While other solutions require installing additional software, Meet's features are powered by advanced web ML technologies built with MediaPipe and run directly in your browser, with no extra steps required. One of the key goals in developing these features was to deliver real-time performance in the browser on almost all modern devices, which we achieved by combining efficient on-device ML models, WebGL-based rendering, and web-based ML inference via XNNPACK and TFLite.





Background blur and background replacement in Meet, powered by MediaPipe



An overview of our Web ML solution



Meet's new features are powered by MediaPipe, Google's open-source framework for live and streaming media, the same framework behind other real-time ML solutions such as hand, iris, and body pose tracking.



A key requirement for any on-device technology is high performance. To achieve it, MediaPipe's web pipeline builds on WebAssembly, a low-level binary code format designed specifically for web browsers that speeds up compute-heavy tasks. At run time, the browser translates WebAssembly instructions into native machine code that executes much faster than traditional JavaScript. In addition, Chrome 84 recently added support for WebAssembly SIMD, in which each instruction processes multiple data points, more than doubling performance.



Our solution processes each video frame by separating the user from the background (more on the segmentation model below), using the ML inference output to compute a low-resolution mask. Where needed, we further refine the mask to align it with image edges. The mask is then used to render the video output through WebGL2, with the background blurred or replaced.
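To make the per-frame flow concrete, here is a minimal Python sketch of the same steps: downscale the frame, run the segmentation model, upsample the low-resolution mask, and blend. The model file name, tensor layout, and blending step are illustrative assumptions; the production pipeline runs in C++/WebAssembly with WebGL2 rendering, not Python.

```python
# Illustrative sketch of the per-frame flow: downscale, segment, upsample
# the low-resolution mask, then composite. Not Meet's actual implementation.
import numpy as np
import tensorflow as tf

# Hypothetical model path; the real assets are loaded inside the browser pipeline.
interpreter = tf.lite.Interpreter(model_path="segmentation_256x144_fp16.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def segment_frame(frame_rgb: np.ndarray) -> np.ndarray:
    """Return a person-probability mask at the frame's full resolution."""
    h, w = frame_rgb.shape[:2]
    # Downscale to the model's small input resolution (e.g. 144x256).
    small = tf.image.resize(frame_rgb[None].astype(np.float32) / 255.0,
                            inp["shape"][1:3])
    interpreter.set_tensor(inp["index"], small.numpy())
    interpreter.invoke()
    low_res_mask = interpreter.get_tensor(out["index"])  # assumed (1, h', w', 1)
    # Upsample the low-resolution mask back to the frame size.
    return tf.image.resize(low_res_mask, (h, w)).numpy()[0, ..., 0]

def blur_background(frame_rgb, mask, blurred_rgb):
    """Blend the sharp frame and a pre-blurred copy using the mask."""
    alpha = mask[..., None]
    return (alpha * frame_rgb + (1.0 - alpha) * blurred_rgb).astype(np.uint8)
```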





WebML pipeline: all compute-heavy operations are implemented in C++/OpenGL and run in the browser via WebAssembly



In the current version, model inference runs on the client's CPU for low power consumption and the widest device coverage. To achieve real-time performance, we designed efficient ML models and accelerated inference with the XNNPACK library, the first inference engine built specifically for the new WebAssembly SIMD specification. With XNNPACK and SIMD acceleration, the segmentation model runs in real time in the browser.



Thanks to MediaPipe's flexible configuration, the background blur/replace solution adapts its processing to the capabilities of the device. On high-end devices it runs the full pipeline for the highest visual quality, while on low-end devices it switches to lightweight ML models and skips the mask refinement step.



Segmentation model



On-device ML models need to be ultra-lightweight for fast inference, low power consumption, and a small download size. For models running in the browser, the input resolution strongly affects the number of floating-point operations (FLOPs) needed to process each frame, so it also has to be small. We therefore downscale the image before feeding it to the model. Recovering an accurate mask from such a low-resolution output makes the model design more challenging.
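As a rough sanity check against the numbers in the Performance section below, per-frame compute for a fixed architecture scales approximately with the input area:

(256 × 144) / (160 × 96) = 36864 / 15360 = 2.4,

which closely matches the ratio of the two model variants' per-frame FLOPs, 64 M / 27 M ≈ 2.4.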



The overall segmentation network has a symmetric encoder/decoder structure: the decoder layers (light green) mirror the encoder layers (light blue). In particular, channel-wise attention with global average pooling is used in both encoder and decoder blocks, which keeps the CPU cost low.





Model architecture with a MobileNetV3 encoder (light blue) and a symmetric decoder (light green)
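The channel-wise attention mentioned above follows the squeeze-and-excitation pattern used in MobileNetV3. Below is a minimal Keras sketch of such a block; the reduction ratio and layer choices are illustrative assumptions, not the exact Meet model.

```python
# Sketch of a channel-wise attention (squeeze-and-excitation) block with
# global average pooling, as used in MobileNetV3-style encoder/decoder blocks.
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x: tf.Tensor, reduction: int = 4) -> tf.Tensor:
    channels = x.shape[-1]
    # "Squeeze": one scalar per channel via global average pooling.
    s = layers.GlobalAveragePooling2D()(x)                    # (B, C)
    # "Excitation": a tiny bottleneck MLP producing per-channel weights.
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)                   # (B, 1, 1, C)
    # Re-weight the feature maps channel by channel.
    return layers.Multiply()([x, s])
```

Because the pooled descriptor is a single vector per channel, the extra cost of this block is tiny compared with the convolutions it modulates, which is why it is CPU-friendly.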



For the encoder we use a modified MobileNetV3-small network, whose design was automatically tuned by network architecture search for the best performance on low-power hardware. To halve the model size, we exported it to TFLite with float16 quantization, which slightly reduces weight precision but has no noticeable effect on quality. The resulting model has 193 thousand parameters and is only about 400 KB in size.
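Float16 quantization is a standard option of the TFLite converter. A sketch of the export step, assuming a trained Keras model named `segmentation_model` (the name is illustrative):

```python
import tensorflow as tf

# `segmentation_model` is assumed to be a trained tf.keras.Model.
# Export to TFLite with float16 weight quantization, roughly halving the
# file size relative to float32 weights.
converter = tf.lite.TFLiteConverter.from_keras_model(segmentation_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_bytes = converter.convert()

with open("segmentation_fp16.tflite", "wb") as f:
    f.write(tflite_bytes)
```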



Rendering effects



After segmentation, we use OpenGL shaders for video processing and rendering the effects. The challenge is to render efficiently and without artifacts. In the refinement stage, a joint bilateral filter smooths the low-resolution mask.
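A simplified NumPy sketch of a joint bilateral filter that smooths the mask (already upsampled to frame size) using the RGB frame as the guide, so the mask snaps to edges in the full-resolution image. The window radius and sigmas are illustrative; the production version runs as an OpenGL shader.

```python
import numpy as np

def joint_bilateral_smooth(mask, guide_rgb, radius=4,
                           sigma_space=3.0, sigma_color=0.1):
    """Smooth `mask` (H, W) guided by `guide_rgb` (H, W, 3), both in [0, 1].

    Each output pixel is a weighted average of nearby mask values, with
    weights combining spatial distance and colour similarity in the guide.
    """
    h, w = mask.shape
    acc = np.zeros((h, w), dtype=np.float64)
    norm = np.zeros((h, w), dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # np.roll wraps around at the borders; fine for a sketch.
            shifted_mask = np.roll(mask, (dy, dx), axis=(0, 1))
            shifted_guide = np.roll(guide_rgb, (dy, dx), axis=(0, 1))
            spatial = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_space ** 2))
            color_dist = np.sum((guide_rgb - shifted_guide) ** 2, axis=-1)
            weight = spatial * np.exp(-color_dist / (2.0 * sigma_color ** 2))
            acc += weight * shifted_mask
            norm += weight
    return acc / np.maximum(norm, 1e-6)
```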





Suppressing rendering artifacts. Left: a joint bilateral filter smooths the segmentation mask. Middle: separable filters remove halo artifacts. Right: light wrapping in background replacement



The blur shader simulates a bokeh effect: the blur strength at each pixel is proportional to the segmentation mask value, similar to the circle of confusion in optics. Pixels are weighted by their blur radii so that foreground pixels do not bleed into the background. We implemented separable filters for the weighted blur instead of the popular Gaussian pyramid, since they remove halo artifacts around the person. For efficiency, the blur is performed at low resolution and blended with the input frame at its original resolution.
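A NumPy sketch of the weighting idea behind the separable blur: background pixels receive full weight while person pixels receive (nearly) zero weight, so foreground colours do not bleed into the blurred background. For brevity this sketch uses a constant box-blur radius rather than a per-pixel circle-of-confusion radius, and the kernel size and blend are illustrative; the real implementation is a GLSL shader working at reduced resolution.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def bokeh_blur(frame, person_mask, max_radius=15):
    """Mask-weighted separable blur.

    `frame` is (H, W, 3) float in [0, 1]; `person_mask` is (H, W) in [0, 1]
    with 1 = person.
    """
    bg_weight = (1.0 - person_mask)[..., None]            # (H, W, 1)
    weighted = frame * bg_weight
    size = 2 * max_radius + 1
    # Separable blur: one horizontal pass and one vertical pass.
    num = uniform_filter1d(uniform_filter1d(weighted, size, axis=0), size, axis=1)
    den = uniform_filter1d(uniform_filter1d(bg_weight, size, axis=0), size, axis=1)
    blurred_bg = num / np.maximum(den, 1e-6)
    # Blend: keep the sharp person, use the blurred image elsewhere.
    alpha = person_mask[..., None]
    return alpha * frame + (1.0 - alpha) * blurred_bg
```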





Background Blur Examples



For background replacement, we use a compositing technique known as light wrapping to overlay the segmented person onto a custom background. Light wrapping softens the segmentation edges by letting background light spill onto the foreground elements, which makes the composite more realistic. It also helps minimize halo artifacts when there is strong contrast between the foreground and the background.
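A NumPy sketch of a simple light-wrap composite: a blurred copy of the new background is blended in within a thin band just inside the segmentation edge, so background light appears to spill onto the person. The band width and wrap strength are illustrative assumptions, not Meet's shader parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def composite_with_light_wrap(person, new_bg, mask,
                              wrap_sigma=6.0, wrap_strength=0.4):
    """Composite `person` onto `new_bg` with a simple light wrap.

    `person` and `new_bg` are (H, W, 3) floats in [0, 1]; `mask` is (H, W)
    with 1 = person.
    """
    alpha = mask[..., None]
    comp = alpha * person + (1.0 - alpha) * new_bg
    # The difference between the sharp mask and a blurred copy is a thin
    # band just inside the person's silhouette.
    soft = gaussian_filter(mask, sigma=wrap_sigma)
    edge_band = np.clip(mask - soft, 0.0, 1.0)[..., None]
    # Let blurred background light "wrap" onto the foreground in that band.
    wrapped_bg = gaussian_filter(new_bg, sigma=(wrap_sigma, wrap_sigma, 0))
    return np.clip(comp + wrap_strength * edge_band * (wrapped_bg - comp),
                   0.0, 1.0)
```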





Background replacement examples



Performance



To optimize performance across devices, we provide model variants with different input sizes (256×144 and 160×96 in the current release) and automatically select the best one based on the available hardware resources.
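One way to express such selection logic, as a hedged sketch: benchmark a few frames with each variant and keep the highest-quality one that fits the real-time budget. The variant names, budget, and `run_inference` helper are illustrative assumptions, not Meet's actual heuristics.

```python
import time

# Hypothetical variants, ordered from highest to lowest quality.
MODEL_VARIANTS = ["segmentation_256x144_fp16.tflite",
                  "segmentation_160x96_fp16.tflite"]

def pick_model_variant(run_inference, frame, budget_ms=33.0,
                       warmup=5, trials=20):
    """Return the highest-quality variant whose median latency fits the budget.

    `run_inference(model_path, frame)` is an assumed helper that loads the
    model (cached) and runs a single inference on `frame`.
    """
    for path in MODEL_VARIANTS:
        for _ in range(warmup):
            run_inference(path, frame)
        times = []
        for _ in range(trials):
            start = time.perf_counter()
            run_inference(path, frame)
            times.append((time.perf_counter() - start) * 1000.0)
        median_ms = sorted(times)[len(times) // 2]
        if median_ms <= budget_ms:
            return path
    return MODEL_VARIANTS[-1]  # fall back to the lightest model
```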



We evaluated model inference speed and the end-to-end pipeline on two representative devices: a MacBook Pro 2018 with a 2.2 GHz 6-core Intel Core i7, and an Acer Chromebook 11 with an Intel Celeron N3060. For 720p input, the MacBook Pro runs the higher-quality model at 120 FPS and the end-to-end pipeline at 70 FPS, while the Chromebook runs inference of the lower-quality model at 62 FPS with an end-to-end pipeline at 33 FPS.



Model      FLOPs   Device               Model inference     Pipeline
256×144    64 M    MacBook Pro (2018)   8.3 ms (120 FPS)    14.3 ms (70 FPS)
160×96     27 M    Acer Chromebook 11   16.1 ms (62 FPS)    30 ms (33 FPS)

Model inference speed and end-to-end pipeline performance on a high-end laptop (MacBook Pro) and a low-end laptop (Chromebook)



To quantify model accuracy, we use two popular metrics: intersection-over-union (IoU, also known as the Jaccard index) and the boundary F-score. Both models perform well, especially for such a lightweight network:



Model      IoU      Boundary F-score
256×144    93.58%   0.9024
160×96     90.79%   0.8542
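For reference, the IoU used above is simply the overlap between predicted and ground-truth masks divided by their union. A small NumPy sketch (the boundary F-score additionally restricts precision and recall to pixels near the mask boundary, omitted here for brevity):

```python
import numpy as np

def intersection_over_union(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU (Jaccard index) between two binary masks of the same shape."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union
```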



We have also publicly released the model card for the segmentation model, detailing how it was evaluated. The evaluation data include images from 17 geographical sub-regions of the globe, annotated for skin tone and gender. The analysis shows that the model performs consistently across regions, skin tones, and genders, with only minor deviations in IoU scores.



Conclusion



We introduced a new browser-based ML solution for blurring and replacing backgrounds in Google Meet. With it, ML models and OpenGL shaders run efficiently on the web, delivering real-time performance with low power consumption even on low-end devices.


