AI that does not ask for bread

An article on how we built our AI step by step. Reading time 10+ minutes.



Introduction. We are a computer vision startup with low-cost development as a core concept. The team matches that spirit: 3-5 student developers of different levels and specializations, varying by day of the week and time of day (from 0.25 to 1.25 FTE each). My experience playing tag comes in very handy here.



Briefly about the product (a PC plus software): an intelligent video surveillance system that connects to a local network and does all video processing on its own. Two requirements: a user interface with differentiated access rights, and maximum autonomy of the algorithms.



On the technical side there were no hardware restrictions, the main thing was that it worked well; financially, however, there were: ~$500 for everything. And, of course, only new, modern components. The choice is not large, but it exists!



Having settled on the hardware, we moved to the software. The choice fell on a microservice architecture using Docker, for reasons that seemed sufficient at the time.



Feature development went from the simple and necessary (working with streams and video files) to the complex, with periodic reviews. We assembled an MVP, and several optimization sprints brought us noticeably closer to the cherished goal: meeting all four of the following points simultaneously, not separately:



  1. 16+ IP cameras (FHD / 25 fps) with live view, event- or time-based playback, and recording
  2. Parallel operation of all available CV algorithms
  3. The user works with the interface intensively and without delays, e.g. watching streams
  4. CPU load stays below 90% and everything works (!)


A little about the stack; the choice fell on C/C++, Python + TensorFlow, PHP, NodeJS, TypeScript, VueJS, PostgreSQL, Socket.io, and other odds and ends.



I deliberately skip the full list of implemented features in order to dwell in more detail on perhaps the most interesting and delightful one, from the field of CV and, to some extent, ML.



"Unique User"



An example use case: collect the visit history of each individual visitor, counting employees separately, even when we do not know in advance who is an employee (think of a shopping mall).

It would seem this problem has been solved 100,500+ times: phones and all sorts of other devices can already recognize faces, remember them, send them somewhere, save them. But 95% of those solutions are used in access control systems, where the user, trying to be recognized, stands in front of a 5 MP camera at a distance of 30-50 cm for several seconds while his face is checked against one or several faces from the database.



In our case, such conditions were a luxury. People moved erratically, staring at their smartphones, at a considerable distance from a ceiling-mounted camera. The cameras themselves added difficulties: most often budget models with 1.3-2 MP sensors and odd, inconsistent color rendition, different every time.



Partly this was solved by writing technical requirements for camera placement, but in general the system had to be able to recognize under such conditions anyway (worse, of course).



Approach to the solution: the task was decomposed into two subtasks plus a database structure.



Short-term memory



A separate service where the real-time work mainly happens. Its input is a frame from a camera (in fact, from another service); its output is an HTTP request containing a normalized 512-dimensional X-vector (face-id) plus some metadata, such as a timestamp.

There are many interesting logic and optimization solutions inside it, but that is all for now...
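
To make the hand-off concrete, here is a minimal Python sketch of what such an output could look like; the endpoint URL and field names are my illustrative assumptions, not the real API:

```python
import time

import numpy as np
import requests

def publish_face(raw_embedding: np.ndarray, camera_id: str) -> None:
    """Normalize a raw 512-d face embedding and ship it to long-term memory.
    The endpoint and payload field names below are illustrative assumptions."""
    x = raw_embedding / np.linalg.norm(raw_embedding)  # unit length, so distances are bounded
    payload = {
        "vector": x.tolist(),      # normalized 512-dimensional X-vector (face-id)
        "camera_id": camera_id,    # metadata: source camera
        "timestamp": time.time(),  # metadata: time stamp
    }
    requests.post("http://long-term-memory/match", json=payload, timeout=3.0)
```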



Long-term memory



A separate service where real-time requirements are not critical, although in some cases speed does matter (for example, a person on a stop list). In general, we limited processing to 3 seconds.

The service's input is the HTTP request from short-term memory with the 512-dimensional vector inside; its output is the visitor's Id.



The first thoughts are obvious, and the solution seems quite simple: receive the HTTP request → go to the database and fetch what is there → compare it with the request payload; if there is a match, that's our visitor; if not, create a new one.

The advantages of such a solution are countless, with only one minus: it does not work.



The problem was eventually solved, though we followed the path of the samurai, trying various approaches and periodically consulting the Internet. The resulting solution turned out to be moderately laconic. The concept is quite simple and is based on clustering:



  1. Each vector (a-vector) belongs to some User; each cluster (no more than M vectors, M = 30 out of the box) also belongs to some User. Whether a given a-vector belongs to cluster A is not guaranteed. The vectors in a cluster define the cluster's interactions; the vectors attached to a User define only that User's history.
  2. Each cluster has a centroid (in fact, an A-vector) and its own radius (hereinafter, range) of interaction with other vectors and clusters.
  3. The centroid and range are functions of the cluster, not static values.
  4. The proximity of vectors is measured by the squared Euclidean distance (in special cases, otherwise). There are a few other decent metrics, but we simply settled on this one.


Note: since we use normalized vectors, the distance between them is guaranteed to lie between 0 and 2. Now, on to the algorithm implementing the concept.
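
To make the steps below easier to follow, here is a minimal Python sketch of these structures. All names are illustrative assumptions; only M = 30 and the squared-distance metric come from the concept above:

```python
from dataclasses import dataclass, field

import numpy as np

M = 30  # maximum a-vectors per cluster (out of the box)

def sq_dist(u: np.ndarray, v: np.ndarray) -> float:
    """Squared Euclidean distance; for unit vectors this equals 2 - 2 * dot(u, v)."""
    return float(np.sum((u - v) ** 2))

@dataclass
class Cluster:
    user_id: int
    vectors: list = field(default_factory=list)  # a-vectors, at most M

    @property
    def centroid(self) -> np.ndarray:
        """The A-vector: recomputed from the member vectors (a function of
        the cluster, not static), renormalized to unit length."""
        c = np.mean(self.vectors, axis=0)
        return c / np.linalg.norm(c)
```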



# 1 The circle of suspects. The centroid as a hash function



The X-vector received from short-term memory is compared for proximity with the cluster centroids (A-vectors) available in the database; distant ones, where range[X, A] > 1, are discarded. If no cluster remains, a new one is created.



Next, the minimum distance between the X-vector and all a-vectors of the remaining clusters is found (min_range[X, a]).
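
A sketch of this step, continuing the illustrative structures above:

```python
def candidate_clusters(x: np.ndarray, clusters: list) -> list:
    """The circle of suspects: keep only clusters whose centroid (A-vector)
    is close to the X-vector; anything with range[X, A] > 1 is discarded."""
    return [c for c in clusters if sq_dist(x, c.centroid) <= 1.0]

def nearest_a_vector(x: np.ndarray, candidates: list):
    """min_range[X, a]: the closest individual a-vector among the suspects."""
    best_cluster, best_dist = None, float("inf")
    for c in candidates:
        for a in c.vectors:
            d = sq_dist(x, a)
            if d < best_dist:
                best_cluster, best_dist = c, d
    return best_cluster, best_dist
```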



# 2 Unique properties of the cluster. Self-regulating entity



For the cluster whose a-vector turned out closest to the X-vector, its own range_A is computed. It is a decreasing linear function of the number of vectors (N) already in the cluster: range_A = const * (1 - N / 2M), with const = 0.67 out of the box.
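
The same formula in code, under the same illustrative naming:

```python
CONST = 0.67  # out of the box

def cluster_range(n_vectors: int) -> float:
    """range_A = const * (1 - N / (2M)): the cluster's own interaction
    radius shrinks linearly as the cluster fills up with a-vectors."""
    return CONST * (1.0 - n_vectors / (2.0 * M))
```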



# 3 Validation and misunderstanding. If not this person, then who?!



If range_A > min_range[X, a], the X-vector is marked as belonging to cluster A. If not, then... oh... this starts to resemble a mathematical model of misunderstanding.

We decided that in this case we create a new cluster, thereby deliberately committing an error of the first kind, a "missed target".
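
Putting #1-#3 together, a hedged sketch of the matching decision (new_user_id is a hypothetical helper, not from the original system):

```python
def match(x: np.ndarray, clusters: list) -> Cluster:
    """Assign X to the nearest cluster if it falls within that cluster's own
    range; otherwise deliberately commit a 'missed target' error and open
    a new cluster."""
    candidates = candidate_clusters(x, clusters)
    if candidates:
        cluster, min_range = nearest_a_vector(x, candidates)
        if cluster is not None and cluster_range(len(cluster.vectors)) > min_range:
            return cluster  # X is marked as belonging to this A-cluster
    fresh = Cluster(user_id=new_user_id(), vectors=[x])  # new_user_id: hypothetical
    clusters.append(fresh)
    return fresh
```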



# 4 Additional training. How numbers form features



Subjective experience is when data becomes a tool. We have already recognized, but possibly with an error. Should we trust the X-vector enough to use it in future matching?! Let's check! The X-vector must:



  • be close enough to centroid A (range_A > range[X, A])
  • be useful and diverse: on the one hand we minimize the risk of errors, on the other we do not need copies either (Config_Max [0.35] > range[X, a] > Config_Min [0.125]). These configs thus determine the speed and correctness of "learning".


If these conditions are met, the X-vector is included in cluster A (before that it simply belonged to the User). If the cluster now holds more vectors than allowed, we remove the most central one (min_range[A, a]): it introduces the least variety and is essentially a function of the others; besides, the centroid already takes part in matching.
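
A sketch of this additional-training step, same illustrative assumptions as above:

```python
CONFIG_MAX = 0.35   # farther than this: too risky to learn from
CONFIG_MIN = 0.125  # closer than this: a near-copy, adds no variety

def maybe_learn(cluster: Cluster, x: np.ndarray, min_range: float) -> None:
    """Accept X as a new a-vector only if it is close to the centroid yet
    not a copy; evict the most central vector when over capacity."""
    close_enough = cluster_range(len(cluster.vectors)) > sq_dist(x, cluster.centroid)
    diverse = CONFIG_MIN < min_range < CONFIG_MAX
    if close_enough and diverse:
        cluster.vectors.append(x)
        if len(cluster.vectors) > M:
            c = cluster.centroid  # evict min_range[A, a]: the least variety
            idx = min(range(len(cluster.vectors)),
                      key=lambda i: sq_dist(c, cluster.vectors[i]))
            cluster.vectors.pop(idx)
```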



# 5 Working on mistakes. Turning disadvantages into advantages



In every difficult choice we took a step toward the "missed target" error: we created a new cluster and User. It's time to revisit them... all of them. After #4 we have a modified cluster A. Next we recalculate its centroid (A-vector) and look for the minimum distance to all other available centroids in our 512-dimensional space. Here the distance is computed in a more complex way, but that is not so important now. When the distance min_range[A, B] is less than a certain value (range_unity = 0.25 out of the box), we merge the two sets, compute a new centroid, and get rid of the less "useful" vectors if there are too many.

In other words: if 2+ clusters in fact belong to the same User, then after a series of detections they will drift closer and merge into one, together with their histories.
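
A simplified sketch of the merge; as noted, the real system computes this distance in a more complex way, so plain squared distance here is an assumption:

```python
RANGE_UNITY = 0.25  # out of the box

def maybe_merge(a: Cluster, clusters: list) -> None:
    """After updating cluster a, merge it with the nearest cluster if their
    centroids have drifted to within range_unity of each other."""
    others = [b for b in clusters if b is not a]
    if not others:
        return
    b = min(others, key=lambda c: sq_dist(a.centroid, c.centroid))
    if sq_dist(a.centroid, b.centroid) < RANGE_UNITY:
        a.vectors.extend(b.vectors)  # union of the two sets (and histories)
        clusters.remove(b)
        while len(a.vectors) > M:    # drop the least "useful" central vectors
            c = a.centroid
            idx = min(range(len(a.vectors)),
                      key=lambda i: sq_dist(c, a.vectors[i]))
            a.vectors.pop(idx)
```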



# 6 Combinatorial features. When the machine... thinks?!



Here it is worth defining a new term for this article. A phantom vector is a vector obtained not from short-term memory activity but as the result of a function over N vectors of the cluster (a1, a2, a3, a4...). Naturally, vectors obtained this way are stored and accounted for separately and carry no weight until matching determines them to be the nearest ones (see #3). The main benefit of phantom vectors is speeding up a cluster's early learning.
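
A sketch of the idea. The actual function over the cluster's vectors is not disclosed here, so the normalized midpoint of a pair stands in purely as an assumed illustration:

```python
from itertools import combinations

def phantom_vectors(cluster: Cluster, limit: int = 5) -> list:
    """Derive phantom vectors as a function over pairs of existing a-vectors
    (assumed: normalized midpoints). Stored separately from real a-vectors
    and used only if matching later finds them nearest (see #3)."""
    phantoms = []
    for a1, a2 in combinations(cluster.vectors, 2):
        p = (a1 + a2) / 2.0
        phantoms.append(p / np.linalg.norm(p))
        if len(phantoms) >= limit:
            break
    return phantoms
```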



The system is already in production. The results were obtained on real data outside the test environment, across 5000+ Users; a number of "weak spots" were also noticed there, then reinforced and taken into account in this text.



Interestingly, this approach has no user settings, and its operation is not supervised in any way; everything is completely autonomous. Additionally, time-series analysis lets us classify Users into different categories in a similar fashion, building associations along the way. This is how we resolved the question of who is an employee and who is a visitor.



The role of the system's operator is simple: periodically check email or the system interface for new activity reports.



Result



The proximity value for recognition based on long-term memory is ~0.12-0.25 for a moderately trained cluster (6-15 a-vectors). Learning then slows down as the probability of "vector copies" grows, but in the long run the proximity tends toward ~0.04-0.12 once the cluster contains 20+ a-vectors. Note that inside short-term memory, from frame to frame, the same parameter is ~0.5-1.2, which sounds something like: "A person looks more like himself in glasses two years ago than he did 100 ms ago." Such are the possibilities opened up by clustering in long-term memory.



Riddle



One of the tests resulted in an interesting observation.



Initial conditions:



  • Two absolutely identical PCs run absolutely identical video surveillance systems with absolutely identical settings. Both are connected to one and the same IP camera, positioned correctly per the technical requirements.


Action:



  • The systems are started at the same time and left alone for a week with all algorithms running. The traffic is ordinary, unaltered foot traffic.


Result:



  • The number of created Users, clusters, and a-vectors is the same, but the centroids are different; not significantly, but different. The question is: why? If you know, write in the comments or here.


It's a pity I cannot write about many more things here; perhaps I will describe something in the same detail in another article. Perhaps all of this has already been described in some wonderful manual, but to my regret, I never found one.



In conclusion, I will say that it is very interesting to watch from the inside how an autonomous AI system classifies the surrounding space, exhibiting various features inherent to it along the way. People fail to notice many things because of their accumulated experience of perception (step #4).




I sincerely hope this article proves useful to someone in their own project.


