Neural networks guarding traffic rules



Violations of traffic rules by drivers carry operational, reputational, and legal risks for organizations.



Previously, dashcam footage from company vehicles was reviewed manually to identify violations. This was a routine, time-consuming process, since very large volumes of video had to be watched by hand. We decided to automate it and build a model that detects traffic violations and produces a risk-oriented selection of videos for review.



As a first step, we decided to look for gross violations: crossing a double solid line and running a red light.



For segmenting and detecting road markings in the image, we used a convolutional neural network with the U-Net architecture. U-Net is a sequence of convolution and pooling layers that first reduce the spatial resolution of the feature maps and then restore it; at each upsampling step, the decoder features are concatenated with the corresponding encoder features (skip connections) and passed through further convolution layers.
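To make the idea concrete, here is a minimal U-Net-style model in PyTorch. This is an illustrative sketch, not the exact network used in the project: the depth, channel counts, and the single "road marking" output class are assumptions.

```python
# A minimal U-Net-style encoder-decoder sketch (assumed architecture):
# convolutions + pooling shrink the feature maps, then upsampling
# restores resolution, with encoder features concatenated back in
# via skip connections.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in the original U-Net paper
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, n_classes=1):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 upsampled + 64 skip
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)    # 32 upsampled + 32 skip
        self.head = nn.Conv2d(32, n_classes, 1)  # per-pixel mask logits

    def forward(self, x):
        e1 = self.enc1(x)                   # full resolution
        e2 = self.enc2(self.pool(e1))       # 1/2 resolution
        b = self.bottleneck(self.pool(e2))  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                # segmentation logits

model = TinyUNet(n_classes=1)  # one class: road marking vs. background
mask_logits = model(torch.randn(1, 3, 256, 256))  # -> (1, 1, 256, 256)
```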



To train the model, a training dataset was needed. Unfortunately, all the openly available datasets we found consisted of photographs of roads outside Russia, and training on them was disappointing: the model often simply refused to recognize our domestic road markings as markings at all. So we decided to build a training set ourselves. About 1,500 frames were cut from dashcam videos, and the road surface was annotated on them using the Supervise.ly service (Fig. 1).
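Extracting still frames from dashcam video is straightforward with OpenCV; the sketch below is illustrative (the file names and the sampling interval are assumptions, not the project's actual values):

```python
# Cut training frames from a dashcam video with OpenCV
import cv2

cap = cv2.VideoCapture("dashcam.mp4")  # hypothetical input file
frame_idx, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    if frame_idx % 150 == 0:  # roughly one frame every 5 s at 30 fps
        # the output folder frames/ is assumed to exist
        cv2.imwrite(f"frames/frame_{saved:05d}.jpg", frame)
        saved += 1
    frame_idx += 1
cap.release()
```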







The model trained on this dataset became capable of recognizing road markings in our dashcam videos. The network finds solid lines in the frame and, if a line contains at least a predetermined number of pixels (so that spurious, broken, or dashed lines are discarded), approximates it with a straight line that our car must not cross.
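The filtering and approximation step could look roughly like this. This is an assumed reconstruction, not the project's actual code: the pixel threshold and the use of connected components are illustrative choices.

```python
# Keep only mask components with enough pixels, then fit a straight
# line to each surviving component.
import cv2
import numpy as np

MIN_PIXELS = 500  # illustrative threshold to drop broken/dashed segments

def solid_lines(mask):
    """mask: binary uint8 image from the U-Net (255 = road marking)."""
    n, labels = cv2.connectedComponents(mask)
    lines = []
    for i in range(1, n):  # label 0 is the background
        ys, xs = np.nonzero(labels == i)
        if len(xs) < MIN_PIXELS:
            continue  # too few pixels: likely dashed or spurious
        # Fit x = a*y + b through the component's pixels
        # (x as a function of y suits near-vertical lane lines)
        a, b = np.polyfit(ys, xs, deg=1)
        lines.append((a, b))
    return lines
```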







Figure 2 shows U-Net at work: above is the original frame from the windshield camera, below is the network's output, where the green areas are the road-marking mask and the thin red lines are the straight-line approximations of the markings.



The model performed very well on most dashcam videos, but it should be noted that difficulties arose with snow-covered roads and footage shot in the dark: in some cases, the markings are simply not visible.



To detect traffic lights and cars, we used a pretrained YOLOv3 network running on the Darknet framework. YOLOv3 is an improved version of the YOLO architecture, which stands for You Only Look Once. Its main feature is three output layers, each designed to detect objects at a different scale.



The main feature of this architecture, compared with others, is that most detection systems apply a neural network several times to different parts of the image, whereas YOLO applies the network to the entire image exactly once. The network divides the image into a grid and, for each cell, predicts bounding boxes (rectangles enclosing the detected objects) and the probability that the desired objects are present there.



The advantage of this approach is that by seeing the entire image, YOLO takes its context into account when detecting and recognizing objects. YOLO also has clear performance advantages: it is about a thousand times faster than R-CNN and several hundred times faster than Fast R-CNN.
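For illustration, a pretrained Darknet YOLOv3 model can be run through OpenCV's dnn module as sketched below. The config/weight file names are the standard Darknet releases and the thresholds are assumptions, not necessarily what was used in this project:

```python
# Run YOLOv3 on a single frame via OpenCV's dnn module
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_names = net.getUnconnectedOutLayersNames()  # the three output layers

frame = cv2.imread("frame.jpg")  # hypothetical dashcam frame
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(out_names)  # one array per output scale

h, w = frame.shape[:2]
for scale in outputs:
    for det in scale:  # det = [cx, cy, bw, bh, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:  # illustrative confidence threshold
            # box coordinates are relative; scale to pixel units
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            x, y = int(cx - bw / 2), int(cy - bh / 2)
            print(class_id, (x, y, int(bw), int(bh)))
```

In practice one would also apply non-maximum suppression (e.g. cv2.dnn.NMSBoxes) to merge overlapping boxes; it is omitted here for brevity.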







An example of YOLO in operation is shown in Figure 3. The analysis is performed frame by frame, and all red traffic lights in the frame are correctly detected by the network.



Training two neural networks requires a fairly powerful computer, especially in terms of the video card, since the computation runs on the GPU. We used an 8th-generation Intel Core i7 processor, an NVIDIA GTX 1080 graphics card, and 32 GB of RAM; this configuration was quite sufficient for the project.



Based on the results of applying the violation-detection models, we can call the project a success. The input to the script was one month of dashcam video with a total duration of 7 hours 11 minutes; model inference (processing the incoming videos) took 25 minutes. After all the video files were processed, 112 fragments of 8 seconds each were cut (about 15 minutes in total), in which violations were easy to identify, saving almost 7 hours of manual review.
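Cutting a short fragment around a flagged timestamp can be done with ffmpeg; the sketch below is illustrative (the paths, the example timestamp, and centering the clip on the event are assumptions):

```python
# Cut an 8-second fragment around a flagged moment using ffmpeg
import subprocess

def cut_fragment(video_path, event_sec, out_path, length=8):
    start = max(0, event_sec - length / 2)  # center the clip on the event
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start),   # seek to just before the event
        "-i", video_path,
        "-t", str(length),   # keep an 8-second window
        "-c", "copy",        # stream copy, no re-encoding
        out_path,
    ], check=True)

cut_fragment("dashcam.mp4", event_sec=754.0, out_path="violation_001.mp4")
```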



