Our case is a streaming site for gamers, esports players, and fans of live broadcasts in the style of Twitch.tv. Some users stream entertainment content, while others watch it. The content varies widely: games, live music, interactive shows, mukbang, ASMR, cooking, podcasts, and so on. In principle, it is limited by nothing except the streamer's imagination.
And the platform rules, which are enforced by moderators.
Why moderate unsafe content? There are two reasons. The first is current Russian legislation, under which distributing pornography is illegal. The second is user experience: the platform is aimed at people of all ages, and we cannot afford adult content on the homepage.
When we were faced with the task of tracking unsafe content, it immediately became obvious that distinguishing safe content from unsafe is not so easy. The first thing to understand is that porn and nudity are not identical concepts.
Pornography does not necessarily involve nudity: clothed sex is unsafe content, and often it can only be distinguished from “safe” content over time.
Nudity is not necessarily NSFW: sumo, wrestling, people dressed in latex - all of this is safe content that open-source solutions often get wrong.
With these considerations in mind, we looked at how the problem could be solved. Among the interesting open-source solutions, the Open NSFW model from Yahoo, trained on closed data, has existed for several years (there is an implementation on TF). There is also Alexander Kim's excellent open repository nsfw_data_scraper, from which you can get several hundred thousand images from Reddit, Imgur, and seemingly some other sites. The images are broken down into five classes: porn, hentai, erotic, neutral, and drawings. Many models have been trained on top of this data.
Open-source solutions suffer from several problems: the generally low quality of some models; incorrect behavior on the aforementioned hard cases and on safe images like twerking girls or memes with Ricardo Milos; and the difficulty of improving them, because either the models are outdated and trained on closed data, or the data is very noisy with an unpredictable distribution.
We concluded that temporal context is important for a good model: with it, we can catch more complex cases through their dynamics. The problem statement then becomes obvious.
Recognizing actions
In our case, this is still the same binary classification, but instead of a single image, we feed a sequence of frames as input.
How is this problem solved in general? In 2018, an excellent review from qure.ai came out, and it seems there has been no radical progress in the field since then, so I recommend it. More recent video research has shifted toward the harder task of understanding and describing video. There are graph networks and self-supervised learning - the second day of the last Machines Can See conference was entirely devoted to the latter.
So, action classification. The history of progress in neural network models goes roughly like this: first, three-dimensional convolutional networks were trained from scratch (C3D); then people tried combining convolutions with some kind of recurrent architecture or attention mechanism; at some point, Andrej Karpathy proposed various ways of fusing information from different frames; later still, it became standard to build two-stream models, where a sequence of BGR/RGB frames is fed to one input and the dense optical flow computed from them to the other. There were also experiments with additional features and special layers like NetVLAD. In the end, we looked at the models that performed best on the UCF101 benchmark, where videos are categorized into 101 action classes. The best of these for us turned out to be the I3D architecture from DeepMind, so I'll tell you more about it.
DeepMind I3D
As baselines, we tried training C3D and CNN-LSTM - both models take a long time to train and converge slowly. Then we took up I3D and life got better. It consists of two three-dimensional convolutional networks, one for BGR frames and one for optical flow, with one peculiarity: unlike previous models, it is pre-trained on ImageNet and on DeepMind's own dataset, Kinetics-700, which contains 650 thousand clips across 700 classes. This gives extremely fast convergence: the model reaches good quality within a few hours.
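For reference, a pre-trained I3D is easy to try out. Below is a minimal inference sketch, assuming the publicly available i3d-kinetics-400 module on TensorFlow Hub (our production model and weights differ):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the RGB I3D pre-trained on Kinetics-400 (public TF Hub module).
i3d = hub.load("https://tfhub.dev/deepmind/i3d-kinetics-400/1").signatures["default"]

# A dummy clip: batch x frames x height x width x RGB, values in [0, 1].
clip = tf.random.uniform((1, 64, 224, 224, 3), minval=0.0, maxval=1.0)

logits = i3d(clip)["default"]           # shape (1, 400): Kinetics class logits
probs = tf.nn.softmax(logits, axis=-1)
print(tf.argmax(probs, axis=-1).numpy())
```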
In production, we use only the RGB stream, since it is twice as fast, and dropping the optical flow barely hurts quality - in some places flow is even worse, because we mostly deal with streams of computer screens and webcams, where the content is sometimes pretty static.
We feed the model 16 frames rather than 64. We previously had a square input, but given the specifics of the platform we changed the input aspect ratio to 16:9. The task is binary classification, where class zero is non-porn and class one is porn. We trained with SGD with momentum, which performed slightly better than Adam. Minimal augmentation - horizontal flips and JPEG compression. Nothing special here.
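To make the setup concrete, here is a rough sketch of the training configuration in TensorFlow/Keras. The clip shape, flip, and JPEG-quality augmentation follow what is described above; build_i3d_rgb is a hypothetical placeholder for whichever open-source I3D implementation you use, and the exact resolution, learning rate, and quality range are assumptions:

```python
import tensorflow as tf

FRAMES, HEIGHT, WIDTH = 16, 180, 320   # 16:9 input; exact resolution is an assumption

def augment(clip, label):
    # clip: float32 tensor (FRAMES, HEIGHT, WIDTH, 3) with values in [0, 1].
    # Horizontal flip, applied consistently to the whole clip.
    flip = tf.random.uniform(()) < 0.5
    clip = tf.cond(flip, lambda: tf.reverse(clip, axis=[2]), lambda: clip)
    # Simulate stream artifacts by re-encoding every frame as JPEG.
    quality = tf.random.uniform((), 30, 95, dtype=tf.int32)
    clip = tf.map_fn(lambda f: tf.image.adjust_jpeg_quality(f, quality), clip)
    return clip, label

# build_i3d_rgb: hypothetical factory around an open-source I3D port,
# returning a Keras model with a single sigmoid output (porn / not porn).
model = build_i3d_rgb(input_shape=(FRAMES, HEIGHT, WIDTH, 3))
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)
```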
To close out the topic of models: after I3D there appeared EVANet (neural architecture search over frame sequences), SlowFast Networks (a network with two pathways running at different frame rates), and the Google AI paper Temporal Cycle-Consistency Learning, but we did not investigate them.
What did we train on?
As I wrote above, data is scarce. Nobody wants to publish it: it is difficult from a legal and ethical point of view, from licensing to the consent of every person appearing in the content. Datasets, their licenses, and publishing them are a whole story of their own - if anyone wants to write an article about that, I would love to read it. Of the significant academic datasets there is only the Brazilian NPDI, and unfortunately it is small, its data distribution is not diverse enough, it consists of keyframes, and the procedure for obtaining it is not the simplest. And we wanted a dataset built from video! So we had to assemble it ourselves.
A dataset consists of videos, which means you need to get videos from somewhere. There are two options: scraping porn sites and YouTube, or collecting videos manually. Each approach has its pros and cons.
Scraping would potentially give us much more variety in the data, and we could get labels almost for free by declaring that all frames of all videos from a conditional Pornhub are unsafe, and all frames of all videos from YouTube are safe. The disadvantages: all of this has to be stored somewhere, a dataset has to be assembled from it somehow, and most importantly, this naive labeling of porn videos is noisy. There are outright labeling errors - intros, scenes where everyone is dressed, close-ups with no sexual characteristics visible, hentai game menus - as well as elements the model can overfit to: logos, black screens, editing cuts. This noise amounts to a few percent, and with terabytes of video, getting rid of it is expensive. More on this later.
The second approach is manual collection. Its advantages: we can model any desired data distribution, the data is more predictable, and it is easier to label simply because there is less of it. But there are disadvantages too. Obviously, this approach yields less data, and it can also suffer from collector bias: whoever models the distribution may miss something.
We took the second approach. We made a list of everything that could potentially end up on a streaming platform - all kinds of games, animation, anime, playing musical instruments, reactions, memes, stream highlights - and tried to cover every possible type of unsafe content, from the most ordinary to trash in the spirit of porn with pterodactyls. We also included computer games that are often used for 3D hentai - Overwatch, for example. And we began collecting. In the end, I can highlight two insights.
Fetishists are good data collectors
Porn sites have a great many compilations for every taste, and a single video can contain excerpts from a hundred or two completely different videos. This yields a dataset comparable to scraping in terms of variety, while remaining quite cheap to label.
And so are YouTubers
Example one: there are compilations of streamer highlights on YouTube; sometimes they cover an entire year, last for hours, and contain up to a thousand edits, i.e. scenes. Example two: tops of games/anime/series. Say you need to clearly show a neural network what anime is, while Japan has a huge number of studios whose styles evolve every year. The solution is to download “top anime” videos for specific years from a well-known YouTuber. Or you need to cover a variety of scenes from a popular game: go and download, say, a videogamedunkey video about that game.
Data iteration
We went through several iterations of the data. At first it was about a hundred videos, roughly 70 hours in total, with the naive labeling “all frames from porn sites are porn, everything from YouTube is not”, from which we more or less evenly sampled frame sequences for the dataset.
The model trained this way worked well, but due to noise in the data, the first models made errors on various logos, black screens, and dressed girls on a black leather sofa (͡° ͜ʖ ͡°). The black screens, flagged with a confidence of 0.817, were especially puzzling, but it turned out to be an error in the data: in one of the porn compilations, the author had accidentally rendered the video ten minutes longer than necessary, so the training set contained a lot of “unsafe” black screens.
In the end, we labeled the data properly, and these errors disappeared. In the context of scraping, a thought arises: if such an error crept in during manual video selection, as with the black screens, then with thousands of scraped videos it would be even harder to track down.
For labeling almost all of the videos, we used CVAT, a tool from OpenCV.
Our two cents about CVAT
Computer Vision Annotation Tool - an open-source annotation tool that we used to label our videos. Annotations are exported as XML.
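As an illustration, pulling labeled intervals out of that export is only a few lines. A sketch assuming unsafe segments are annotated as tracks with a "porn" label, with element and attribute names following CVAT's "CVAT for video" XML format:

```python
import xml.etree.ElementTree as ET

def unsafe_intervals(xml_path, label="porn"):
    """Collect (start_frame, end_frame) intervals for tracks with a given label."""
    root = ET.parse(xml_path).getroot()
    intervals = []
    for track in root.iter("track"):
        if track.get("label") != label:
            continue
        frames = [int(box.get("frame")) for box in track.iter("box")]
        if frames:
            intervals.append((min(frames), max(frames)))
    return intervals

print(unsafe_intervals("annotations.xml"))
```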
Then, in the course of our work, we collected more videos and updated the catalog of games. As a result, we now have several hundred hours of video across several dozen categories, and we know that it consists of ~30,000 unique scenes, plus the data with an asterisk, which we will talk about below.
Great, we have raw labeled data! How do we turn it into a good dataset? The videos have different lengths, and each category differs in total duration and degree of variety - how do we tie it all together? How many samples can we take from the data? Its diversity is fundamentally limited (for instance, by the maximum number of video frames) - how do we tell when we are taking too much?
At the start, we did not worry much about these questions and simply took enough samples from each video of a given class so that porn and non-porn were roughly balanced in the dataset; the number of samples was chosen intuitively (“well, it seems something radically different happens several times a minute in almost every video, let's take 10,000 samples”) and then refined empirically using the metrics of the trained models.
Eventually we did address these questions, and we ended up with a rather elaborate tool for assembling datasets from video.
First of all, we wanted to know how much we could squeeze out of our video compilations. The logical measure is the number of editing cuts: the more distinct scenes a compilation contains, the more diverse the samples we can cut from it.
Looking for editing cuts
We could have simply used the peaks of the norm of the difference between adjacent frames, but instead we used an open network built specifically for finding cuts - TransNet. This gave us two things: first, we learned how many scenes our data contains in principle, and second, we learned which data categories have low diversity, so we topped up hentai, Minecraft, and the like with more videos.
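For reference, the simple baseline mentioned above - peaks of the inter-frame difference norm - fits in a dozen lines of OpenCV. A sketch with assumed threshold values (TransNet is what we actually used):

```python
import cv2
import numpy as np

def find_cuts(path, threshold=40.0, min_gap=10):
    """Naive shot-boundary detector: report frames where the mean absolute
    difference between consecutive downscaled grayscale frames spikes."""
    cap = cv2.VideoCapture(path)
    cuts, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, (64, 36)), cv2.COLOR_BGR2GRAY)
        gray = gray.astype(np.float32)
        if prev is not None:
            diff = np.abs(gray - prev).mean()
            # Keep a minimum gap between cuts to avoid double triggers.
            if diff > threshold and (not cuts or idx - cuts[-1] >= min_gap):
                cuts.append(idx)
        prev = gray
        idx += 1
    cap.release()
    return cuts
```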
Now our atomic unit for slicing is not a whole video but a single scene. This lets us assemble the most diverse dataset possible, balanced by category and class, and take safe scenes from porn videos into account. Videos are grouped into category folders, and scenes are sampled from them equally for each class. If we add new videos to the dataset, only a minimal amount of re-cutting and deletion of redundant samples takes place; the dataset is not re-sliced from scratch. Very convenient.
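The balancing logic itself can be sketched like this (the directory layout, file extension, and quotas are assumptions; the real tool also handles incremental updates):

```python
import random
from pathlib import Path

def sample_balanced(root, per_class=10_000, seed=42):
    """Assumed layout: root/<class>/<category>/ holds one clip per detected scene.
    Draw an equal share of scenes from each category within each class."""
    rng = random.Random(seed)
    dataset = {}
    for cls_dir in Path(root).iterdir():
        categories = [d for d in cls_dir.iterdir() if d.is_dir()]
        quota = per_class // max(len(categories), 1)
        picked = []
        for cat in categories:
            scenes = list(cat.glob("*.mp4"))
            rng.shuffle(scenes)
            picked.extend(scenes[:quota])
        dataset[cls_dir.name] = picked
    return dataset
```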
We assembled a dataset of 20,000 training samples, 2,000 validation samples, and 2,000 test samples, trained the model, liked the test metrics, and shipped it to production.
A little about production: every day we check tens of thousands of clips, so even one percent of false positives can flood the moderators. So for a while we collected a variety of false positives using a model with a slightly lowered response threshold, and as a result we accumulated a lot of real data, which we used for additional training.
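Conceptually the mining step is simple: run the model with a review threshold below the production alert threshold and queue everything in between for human labeling. A schematic sketch - the thresholds and queue helpers are assumptions, not our actual pipeline:

```python
ALERT_THRESHOLD = 0.9    # assumption: clips above this go straight to moderators
REVIEW_THRESHOLD = 0.5   # assumption: lowered threshold used for mining

def send_to_moderators(clip):          # stub for the real moderation queue
    print("alert:", clip)

def queue_for_labeling(clip, score):   # stub: store for manual labeling / retraining
    print(f"label me: {clip} ({score:.2f})")

def triage(clip, score):
    """Route a scored clip: alert moderators, or mine it as a hard example."""
    if score >= ALERT_THRESHOLD:
        send_to_moderators(clip)
    elif score >= REVIEW_THRESHOLD:
        queue_for_labeling(clip, score)  # these become the "data with an asterisk"
```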
This is the data with an asterisk. It let us tune the model to the platform's diverse content and reduce the load on moderators. Now false positives mostly occur on new games - for a while, for example, we kept catching Death Stranding and Valorant.
The current dataset consists of 30,000/5,000/3,000 train/val/test samples.
Evolution of our metrics on our test set, broken down by category, with a comparison against open solutions
(The charts report the f1 score and precision.)
Thanks to our detectors, the time moderators spend reviewing the entire platform has been cut several times over. Besides pornography, we also catch nudity, TV logos, and sports broadcasts, but those are stories for another time.
Fin.
A video version of this material can be seen here.