RPi nanny

From time to time I get the urge to do something strange: an obviously useless thing that will never justify the money invested and that, six months after being built, just gathers dust on a shelf. On the other hand, it fully pays for itself in emotions, experience gained and new stories. There are already two articles of mine on Habr about such experiments: Alcoorgan and a smart bird feeder.



Well, it's time to talk about a new experiment: how I built it, what came of it, and how to repeat it.







The new project was prompted by an event that is, in a sense, banal: my son was born. I had arranged a month of vacation in advance, but the child turned out to be quiet, so there was free time, with him sleeping right next to me.



I have plenty of embedded computer-vision hardware lying around at home. So I decided to make a baby video monitor - but not the dull kind the shops are full of. Something smarter and more interesting.



The article is written as a narrative, to show how the development of the toy went, where it ended up and where it is heading next.



The article has several additions:



  1. A video where I show and explain how everything works.
  2. A small article on VC, where I explain why such things are unlikely to make it into real production, and about the limitations of ML systems of this kind.
  3. Sources of everything on GitHub + a ready-made image for the RPi. At the end of the article there is a description of how to use it.


Choosing an idea



The most commonplace functionality of a baby monitor is being able to see what is happening to the child at any moment. Unfortunately, this does not always work: you will not watch the stream all the time, it is not convenient. And the baby can simply be put to sleep nearby in a cocoon - why stream video at all? So, to begin with, the following set of requirements came together:



  1. The system should let me see a video or a photo from my phone at any time
  2. The system should detect that the child has woken up and send a notification
  3. The system should detect when the face is not visible, as a precaution against SIDS


Platform selection



I once wrote a long article on Habr comparing different platforms. Globally, for a prototype like this one there are several options:



  1. Jetson Nano. I have one, and its performance would be enough. The main drawback is TensorRT: not every network converts to it painlessly.
  2. VIM3. An interesting platform, but it has the downsides I described in that comparison article.
  3. Raspberry Pi + Movidius. I have both on hand, I have worked with this pair before and know its quirks.
  4. Raspberry Pi 4 - when working through OpenCV it can run lightweight open networks reasonably well, which should be enough. But I suspected the performance would fall short.
  5. Coral - I have one on hand, and it would pass on performance, but my other article explains why I don't like it :)


So I chose RPi + Movidius: I have them on hand and I know how to work with them.



Hardware



The computer is a Raspberry Pi 3B, the neural accelerator is a Movidius Myriad X. That part is clear.

Everything else I scraped together from what was lying around or bought in addition.







Camera



I checked three different ones that I had:



  • The Raspberry Pi camera. Noisy, inconvenient cable, no convenient mount. I gave up on it.
  • Some IP camera. Very handy because it does not need to be plugged into the RPi - the camera is separated from the computer. Mine even had two modes, day and night. But the one I had did not give enough image quality on the face.
  • A Genius webcam. I have been using it for about five years, and lately it has become a bit flaky, but for the RPi it is just right. Moreover, it turned out it can be trivially disassembled and the IR filter removed. Plus, as it turned out later, it was the only option with a microphone.






And the filter changes like this:







Clearly, this is not a production-grade solution. But it works.



In the code you will still find the leftover pieces for switching to the other two camera types. Some of it may even work outright if you change one or two parameters.



Lighting



I had an illuminator lying around from one of my old projects.



I soldered some kind of power supply to it. It shines well.







Point it at the ceiling - the room is lit.







Screen



For some modes of operation I needed a monitor. I settled on this one, although I'm not sure it was the right decision - maybe I should have taken a full-size one. But more on that later.







Power



The child sleeps in arbitrary places, so it is easier when the system runs off a power bank. I chose this one simply because we already had it at home for hiking:







OpenVino



Let's walk a little through OpenVino. As I said above, a big advantage of OpenVino is the large number of pre-trained networks. Which of them can be useful to us?



Face detection. There are many such networks in OpenVino:



  1. 1
  2. 2
  3. 3


Facial key-point (landmark) recognition - we need this to feed crops into the networks that follow.

Head-pose estimation - the child's activity and where he is looking.

Gaze-direction recognition - in case I try to make things interactive.

Depth estimation? Maybe something will come of it.

Skeleton (pose) estimation.

And there are many other interesting ones...



The main disadvantage of these networks follows from their main advantage: they are pre-trained...



This can be fixed, but right now we are making a quick prototype; the goal is not to work in 100% of cases, but to work in principle and bring at least some benefit.



Let's go. General logic, version 1



Since we are building an embedded device, we need some way to interact with it: receive photos and alarm signals. So I decided to do the same as with the bird feeder - go through Telegram - but this time do it properly.



For the first version, I decided:



  • Run all the networks listed above on the RPi (I wanted everything at once, in case the performance allowed it). This gives more options for solving the problem and more likely directions for development.
  • Write a general program template (a minimal sketch of such a loop is shown right after this list).
  • Come up with an algorithm that recognizes waking up.
  • Make an algorithm that sends a notification when the face is lost.
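
To make the plan concrete, here is roughly what the skeleton of such a template looks like. This is only an illustrative sketch: the Haar cascade stands in for the OpenVino face detector that really runs on the Movidius stick, and the bot token, chat id and thresholds are made-up placeholders, not values from the repository.

Hidden text
import time

import cv2
import requests

# Made-up placeholders; the real values come from tg_creedential.txt
BOT_TOKEN = "123456:ABC..."
CHAT_ID = "111111"

def notify(text):
    # Plain Telegram Bot API call; the project wraps this in a bot helper
    requests.post("https://api.telegram.org/bot%s/sendMessage" % BOT_TOKEN,
                  data={"chat_id": CHAT_ID, "text": text})

# Stand-in detector so the sketch runs anywhere; on the device an OpenVino
# face-detection network runs on the Movidius stick instead
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)        # the USB webcam
last_seen = time.time()

while True:
    ok, frame = cap.read()
    if not ok:
        time.sleep(1)            # camera not ready yet - try again
        continue
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    if len(faces) == 0:
        if time.time() - last_seen > 30:     # no face for 30 seconds -> alert
            notify("Face lost, please check the baby")
            last_seen = time.time()
    else:
        last_seen = time.time()
        # ...here the landmark / head-pose / eye networks run on the face crop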


Everything went more or less well, apart from a pile of small bugs along the way. That is inherent to computer vision... I'm used to it.



Here is a quick summary of what I came across:



  1. The OpenVino build for the RPi (the 2020 release) finally lets you do from openvino.inference_engine import IECore. Before that, OpenVino on the RPi was usable only through OpenCV, which is noticeably less convenient.
  2. Some models only converted with the -generate_deprecated_IR_V7 flag.
  3. There was some juggling of number formats (int8 vs int32) between the Movidius stick and the RPi; in the end I left everything unquantized.
  4. Not everything in OpenVino behaves the way the documentation suggests; some things you only discover by experiment.
  5. Some OpenVino questions could only be resolved through Intel's forums (and not always successfully).
  6. The ONNX produced by PyTorch 1.5 would not convert, so I had to roll back to 1.4…


But still... I am sure that if I had gone the TensorRT route, there would, as usual, have been even more problems.



So, everything is put together and the networks are running; we get something like this (running the stack of face detection, head orientation and key points over the head):







You can see that the face often gets lost when the child covers it with his hands or turns his head, and not all the indicators are stable.



What's next? How to analyze falling asleep?



I look at the networks that are available, and the first thing that comes to mind is emotion recognition. When the child is asleep and calm, his face has a neutral expression. But it is not that simple. The dark blue graph below is the "neutral" score for a sleeping child over an hour:







The other graphs are sad / angry / happy / surprised; it does not really matter which colour is which. Unfortunately, the output of these networks is unstable, which is exactly what we see. The instability appears when:



  • There is too much shadow on the face (which is not uncommon at night)
  • Children's faces were not in the OpenVino training set => arbitrary jumps to other emotions
  • The child actually pulls faces, including in his sleep


Overall, I was not surprised. I have worked with emotion-recognition networks before, and they are always unstable, partly because there is no clear boundary in the transition between emotions.



Okay, waking up cannot be recognized through emotions. I still did not want to train anything myself, so I decided to try the same networks from another angle. One of them gives the head rotation angle; this is already better (the plot is the total deviation, in degrees, from looking straight at the camera, over time). The last 5-10 minutes before waking up:
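
By "total deviation" I simply mean the magnitude of the head rotation relative to looking straight into the camera. Below is a minimal sketch of how such a value can be accumulated and turned into a wake-up guess; the angles come from the head-pose network, and the threshold and window are made up for illustration:

Hidden text
import math
import time

ALARM_THRESHOLD = 30.0        # degrees; a made-up value, tuned by looking at the plots
WINDOW_S = 5 * 60             # look at the last five minutes

history = []                  # (timestamp, deviation) pairs

def total_deviation(yaw, pitch, roll):
    # Rough scalar "how far from looking at the camera", in degrees
    return math.sqrt(yaw ** 2 + pitch ** 2 + roll ** 2)

def on_new_pose(yaw, pitch, roll):
    # Called once per processed frame with the head-pose network output
    history.append((time.time(), total_deviation(yaw, pitch, roll)))
    cutoff = time.time() - WINDOW_S
    recent = [d for t, d in history if t > cutoff]
    # If the head has been turned away for most of the window, assume waking up
    if recent and sum(d > ALARM_THRESHOLD for d in recent) > 0.5 * len(recent):
        return "probably waking up"
    return "still asleep"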







Better. But... My son may start waving his head in his sleep. Or, the other way around, if I set a high threshold, he may wake up and then not move his head at all. Getting a notification every time... Sad:





(the plot covers about an hour of sleep)



So we still need to do normal recognition.



Issues encountered in version 1



Let's summarize everything that I didn't like in the first version.



  1. Autostart. It is not convenient to restart this toy by hand every time: connect via SSH, launch the monitoring script. The script itself should:

    • Check the camera status. It happens that the camera is switched off or not plugged in; the system must wait until the user connects it.
    • Check the accelerator status. Same as with the camera.
    • Check the network. I want to use the thing both at home and at the dacha, and maybe somewhere else. And again, I don't want to log in via SSH => I need a way to connect to WiFi when there is no Internet.
  2. Waking up - network training. The simple approaches did not work out, which means I have to train a network to recognize open eyes.


Autostart



In general, the autorun scheme is as follows:



  • My program is launched at boot. How exactly - I wrote a separate article about it; doing this on the RPi is not entirely trivial. In short, at startup the script:

    • checks that OpenVino and the camera are available, and waits if they are not
    • checks that the Movidius stick is plugged in, and waits if it is not
    • if there is no network connection - waits for a QR code with the WiFi credentials and connects (a sketch of this part is shown below)
    • if the Telegram bot is not configured - waits for a QR code with its credentials
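
The QR-code part can be done entirely with OpenCV. Here is a sketch, assuming the built-in QRCodeDetector and the standard WIFI:T:WPA;P:<password>;S:<ssid>;; payload described further down; the real QRCode.py in the repository may differ in details:

Hidden text
import cv2

def read_wifi_qr(frame):
    # Returns (ssid, password) if the frame contains a WiFi QR code, else None
    data, _, _ = cv2.QRCodeDetector().detectAndDecode(frame)
    if not data.startswith("WIFI:"):
        return None
    fields = dict(part.split(":", 1)
                  for part in data[len("WIFI:"):].split(";") if ":" in part)
    return fields.get("S"), fields.get("P")

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        continue
    creds = read_wifi_qr(frame)
    if creds and all(creds):
        ssid, password = creds
        # ...write wpa_supplicant_auto.conf with these values and restart networking
        print("Got WiFi credentials for", ssid)
        break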




There is no ready-made eye recognition network in OpenVino.

Hahaha. Actually, such a network has since appeared. But, as it turned out, it was released only after I had started the development, and it showed up in the release and the documentation when I had more or less finished everything. I only found the update while writing this article.

But I am not going to redo everything, so I describe it the way I did it.



Training such a network is very easy. Above, I mentioned that the eyes are already outlined with bounding boxes; all that is left is to save every eye crop that appears in the frame. The result is a dataset like this:







It remains to label it and train on it. I described the labelling process in more detail here (and there is a 10-minute video of the process here). I used Toloka for the labelling: about 2 hours to set up the task, 5 minutes for the labelling itself, and roughly 300 rubles of budget.



For training I did not want to overthink it, so I took a deliberately fast network that is of sufficient quality for the task - MobileNetV2. The entire code, including loading the dataset, initialization and saving, took less than 100 lines (mostly taken from open sources; I rewrote a couple dozen lines):



Hidden text
import numpy as np
import torch
from torch import nn
from torch import optim
from torchvision import datasets, transforms, models



data_dir = 'F:/Senya/Dataset'
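# One folder of labelled eye crops in ImageFolder layout; 10% is held out for validation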
def load_split_train_test(datadir, valid_size = .1):
    train_transforms = transforms.Compose([transforms.Resize(64),
                                           transforms.RandomHorizontalFlip(),
                                           transforms.ToTensor(),
                                       ])
    test_transforms = transforms.Compose([transforms.Resize(64),
                                      transforms.ToTensor(),
                                      ])
    train_data = datasets.ImageFolder(datadir,
                    transform=train_transforms)
    test_data = datasets.ImageFolder(datadir,
                    transform=test_transforms)
    num_train = len(train_data)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))
    np.random.shuffle(indices)
    from torch.utils.data.sampler import SubsetRandomSampler
    train_idx, test_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    test_sampler = SubsetRandomSampler(test_idx)
    trainloader = torch.utils.data.DataLoader(train_data,
                   sampler=train_sampler, batch_size=64)
    testloader = torch.utils.data.DataLoader(test_data,
                   sampler=test_sampler, batch_size=64)
    return trainloader, testloader

trainloader, testloader = load_split_train_test(data_dir, .1)
print(trainloader.dataset.classes)

device = torch.device("cuda" if torch.cuda.is_available()
                                  else "cpu")
model = models.mobilenet_v2(pretrained=True)
model.classifier = nn.Sequential(nn.Linear(1280, 3),
                                 nn.LogSoftmax(dim=1))
print(model)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)
model.to(device)
epochs = 5
steps = 0
running_loss = 0
print_every = 10
train_losses, test_losses = [], []
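# Straightforward training loop: every 10 batches evaluate on the held-out split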
for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logps = model.forward(inputs)
        loss = criterion(logps, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for inputs, labels in testloader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    logps = model.forward(inputs)
                    batch_loss = criterion(logps, labels)
                    test_loss += batch_loss.item()

                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
            train_losses.append(running_loss / len(trainloader))
            test_losses.append(test_loss / len(testloader))
            print(f"Epoch {epoch + 1}/{epochs}.. "
                  f"Train loss: {running_loss / print_every:.3f}.. "
                  f"Test loss: {test_loss / len(testloader):.3f}.. "
                  f"Test accuracy: {accuracy / len(testloader):.3f}")
            running_loss = 0
            model.train()
torch.save(model, 'EyeDetector.pth')




And a couple more lines to save the model in ONNX:



Hidden text
from torchvision import transforms
import torch
from PIL import Image

use_cuda=1
mobilenet = torch.load("EyeDetector.pth")
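# Drop the final LogSoftmax so the exported network returns raw scores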
mobilenet.classifier = mobilenet.classifier[:-1]
mobilenet.cuda()
img = Image.open('E:/OpenProject/OpenVinoTest/face_detect/EyeDataset/krnwapzu_left.jpg')
mobilenet.eval()
transform = transforms.Compose([transforms.Resize(64),
                                      transforms.ToTensor(),
                                      ])

img = transform(img)
img = torch.unsqueeze(img, 0)
if use_cuda:
    img = img.cuda()
img = torch.autograd.Variable(img)
list_features = mobilenet(img)

ps = torch.exp(list_features.data.cpu())
top_p, top_class = ps.topk(1, dim=1)

list_features_numpy = []
for feature in list_features:
    list_features_numpy.append(feature.data.cpu().numpy())
mobilenet.cpu()
x = torch.randn(1, 3, 64, 64, requires_grad=True)
torch_out = mobilenet(x)

torch.onnx.export(mobilenet, x,"mobilnet.onnx", export_params=True, opset_version=10, do_constant_folding=True,
input_names = ['input'],output_names = ['output'])
print(list_features_numpy)




Saving the model to ONNX is needed so that it can later be converted and called from OpenVino. I did not bother with int8 quantization and left the model in 32-bit format.
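
After the OpenVino Model Optimizer turns the ONNX file into an IR pair (.xml + .bin), loading it on the Movidius stick takes only a few lines. A sketch, assuming the 2020-era Python API and the file names from the export above; the preprocessing mirrors the training transforms:

Hidden text
import cv2
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# mobilnet.xml / mobilnet.bin are produced by the Model Optimizer from mobilnet.onnx
net = ie.read_network(model="mobilnet.xml", weights="mobilnet.bin")
input_name = next(iter(net.inputs))
output_name = next(iter(net.outputs))
exec_net = ie.load_network(network=net, device_name="MYRIAD")

def eye_class(eye_crop_bgr):
    # Match the training preprocessing: RGB, 64x64, values scaled to [0, 1]
    rgb = cv2.cvtColor(cv2.resize(eye_crop_bgr, (64, 64)), cv2.COLOR_BGR2RGB)
    blob = rgb.transpose(2, 0, 1)[np.newaxis].astype(np.float32) / 255.0
    result = exec_net.infer({input_name: blob})[output_name]
    return int(np.argmax(result))      # index of the most likely class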



Accuracy analysis, quality metrics?.. Why bother in a hobby project. Such things are judged differently: no metric will tell you "the system works". Whether it works or not, you only understand in practice. Even 1% of errors can make a system unpleasant to use, and I have also seen the opposite - 20% errors, but the system configured so that they are not noticeable.



These things are easier to judge in practice - "does it work or not" - and only once you understand what "working" means, introduce metrics if they are needed.



Version 2 issues



The current implementation is qualitatively different, but it still has a number of problems:



  • Face detection. The detector is not tuned for children - in awkward poses the face is found in only about ⅓ of the frames.
  • Sound. The system does not react to sound at all, although crying is the most natural signal of waking up; you only learn about it from the picture.
  • What else can be squeezed out of the setup?


Retrain face detection?



I did not retrain face detection. Unlike eye recognition, that is a lot more work - both collecting a dataset and training it properly.



Of course, I could train it on my son's face; it would probably even work a little better than the current network. But not for other children. And perhaps not even for my son two months from now.

Collecting a proper dataset takes a long time.



Sound



I could have gone down the classical path of sound recognition and trained a network. It would not have taken that long - a few times longer than the eye recognition at most. But I did not want to mess around with collecting a dataset, so I took an easier route: the ready-made WebRTC voice activity detector. Everything turns out elegant and simple, in a couple of lines (see the sketch below).
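
A sketch of those "couple of lines", assuming the py-webrtcvad and pyaudio packages that are installed in the setup section below; the aggressiveness level and the "crying" threshold are my own placeholders. The VAD reacts to any voiced sound, which for a baby monitor is exactly what is needed:

Hidden text
import pyaudio
import webrtcvad

RATE = 16000                           # webrtcvad accepts 8/16/32/48 kHz
FRAME_MS = 30                          # and frames of 10/20/30 ms
SAMPLES_PER_FRAME = RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(3)                 # 0..3, 3 = most aggressive filtering

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=SAMPLES_PER_FRAME)

voiced_in_a_row = 0
while True:
    frame = stream.read(SAMPLES_PER_FRAME, exception_on_overflow=False)
    if vad.is_speech(frame, RATE):
        voiced_in_a_row += 1
    else:
        voiced_in_a_row = 0
    if voiced_in_a_row > 20:           # ~0.6 s of continuous sound -> treat as crying
        print("Sound detected")        # in the real device this triggers a Telegram alert
        voiced_in_a_row = 0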



The drawback I found is that the quality differs between microphones: one triggers on a squeak, another only on a loud cry.



Moving on - what else?



At some point I ran a test, playing a looped 5-second video of me and my wife:







It was clear that my son latches onto the faces of people in his field of view (the monitor kept him occupied for 30 minutes). And so the idea was born: control by facial expression. Not just a static video, but a form of interaction. It turned out something like this (when the son's emotion changes, the video sequence switches):





"Dad, have you lost your mind?!"



I should probably try it with a larger monitor. But I'm not ready for that yet.



Maybe the played video needs to be replaced. Fortunately, that is simple: the "video" is played from separate pictures, with the frame change matched to the FPS (roughly as in the sketch below).
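
The playback itself can be as simple as this. The folder layout (one directory of numbered frames per emotion) is made up for illustration; the emotion label would come from the same OpenVino network that watches the child:

Hidden text
import glob
import cv2

FPS = 25
# Made-up layout: clips/<emotion>/0001.jpg, 0002.jpg, ...
clips = {e: sorted(glob.glob("clips/%s/*.jpg" % e))
         for e in ("neutral", "happy", "surprise")}

current = "neutral"
idx = 0
while True:
    frames = clips[current]
    cv2.imshow("screen", cv2.imread(frames[idx % len(frames)]))
    idx += 1
    # current = detected_emotion   # switch the clip when the child's emotion changes
    if cv2.waitKey(int(1000 / FPS)) == 27:    # Esc to quit
        break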



Or maybe I just need to wait: at the current age, the child may simply not understand the connection between his emotions and the screen.



And then?



One of the most promising directions, it seems to me, is to try to control physical objects - lights, motors - through gaze direction or pose.



But I have not thought this through yet. For now, I will keep testing the emotion control.



How it looks in the end, description, thoughts



How everything works now (there is a larger video at the beginning of the article):



  • All control goes through Telegram, plus the camera.
  • If you do not need to control the video with emotions, the whole device looks like this:





  • It is started by switching on the power bank.
  • If a configured network is available, the device is ready for use right away.
  • If there is no network, you show it a QR code with the network credentials and the system starts automatically.
  • Through Telegram you can select the set of events to monitor:





  • Every time an interesting event occurs, a notification is sent:





  • At any time, you can request a photo from the device to see what is happening



Overall, the feedback from my wife:



  1. The face detector does not work very well. This really is a property of any detector not tuned for children. Usually it does not interfere with wake-up detection (at least a couple of decent photos with open eyes will come through). There are no plans to retrain it for now.
  2. Without a screen the launch is slightly opaque (did it read the QR code or not?), and with the screen there are a lot of wires. I think the right option would be to put LEDs on the GPIOs and light them depending on the status (connected, camera not working, Movidius not working, no connection to Telegram, and so on). Not done yet.
  3. It is sometimes hard to mount the camera. Since I have a couple of tripods, I manage somehow; without them nothing would probably have worked.
  4. It really does free up some time and give freedom of movement. More than an ordinary streaming baby monitor? I don't know. Maybe a little more.
  5. A cool thing to experiment with.


How to launch



As I said above, I tried to publish all the sources. The project is large and branched, so I may have forgotten something or not provided detailed instructions. Feel free to ask and clarify.



There are a couple of ways to deploy everything:



  1. Sources from GitHub. This is the more involved path: it will take a while to configure the RPi, and I may have forgotten something. But you have full control over the process (including the RPi settings).
  2. Use the ready-made image. You could call this graceless and insecure, but it is much easier.


Github



The main repository is located here - github.com/ZlodeiBaal/BabyFaceAnalizer

It consists of two files that you need to run:



  1. The script for initializing / checking the status / configuring the network is QRCode.py (remember, there is a more detailed description of this script). It connects to WiFi and checks that the Telegram bot settings are in place.
  2. The main working script is face.py


Besides, there are two things missing from the Git repository:



  1. WiFi credentials file - wpa_supplicant_auto.conf
  2. File with Telegram-bot credentials - tg_creedential.txt


You can let the system create them automatically on the next startup, or you can create them yourself from the templates below, filling in the blank fields:



tg_creedential.txt
token to access the HTTP API — issued by @BotFather in Telegram after the "/newbot" command

socks5://… — proxy address, if a proxy is needed

socks5 login — if the proxy requires it

socks5 password — if the proxy requires it



wpa_supplicant_auto.conf
network={

ssid="******"

psk="*******"

proto=RSN

key_mgmt=WPA-PSK

pairwise=CCMP

auth_alg=OPEN

}



RPi tuning: bells and whistles



Unfortunately, you cannot just copy the scripts onto the RPi and run them. Here is what else is needed for stable operation:



  1. Install l_openvino_toolkit_runtime_raspbian_p_2020.1.023.tgz according to the instructions - docs.openvinotoolkit.org/latest/openvino_docs_install_guides_installing_openvino_raspbian.html
  2. Install autorun
  3. Delete the message about the default password (maybe not necessary, but it bothered me) - sudo apt purge libpam-chksshpwd
  4. turn off screensaver - www.raspberrypi.org/forums/viewtopic.php?t=260355
  5. For audio detection:



    • pip3 install webrtcvad
    • sudo apt-get install python-dev
    • sudo apt-get install portaudio19-dev
    • sudo pip3 install pyaudio
  6. Download models from the OpenVino repository using the “Get_models.py” script in the “Models” folder


Image



The image is posted here (about 5 GB).



A couple of points:



  1. The standard login and password are used (pi / raspberry)
  2. SSH access is enabled
  3. By default, WiFi is not connected and the Telegram bot that the system will use for monitoring is not configured.


How to set up WiFi in an image



The first option is to show the QR code with the text after launch:



WIFI:T:WPA;P:qwerty123456;S:TestNet;;


where the value after P is the network password and the value after S is the network SSID.



  1. If you have a phone with Android 10, then there such a QR code is generated automatically when you click "share network"
  2. If not, then you can generate it at www.the-qrcode-generator.com


The second option is to SSH into the RPi (by connecting over the wire). Or turn on the monitor and keyboard. And put the file



wpa_supplicant_auto.conf
network={

ssid="*********"

psk="*******"

proto=RSN

key_mgmt=WPA-PSK

pairwise=CCMP

auth_alg=OPEN

}



with your WiFi settings into the "/home/pi/face_detect" folder.



How to set up a telegram bot in an image



The first option is to show the QR code with the text after launch:



tg_creedential.txt
token to access the HTTP API — issued by @BotFather in Telegram after the "/newbot" command

socks5://… — proxy address, if a proxy is needed

socks5 login — if the proxy requires it

socks5 password — if the proxy requires it



by generating it via www.the-qrcode-generator.com

The second option is to SSH into the RPi (connecting over a wire), or plug in a monitor and keyboard, and put the tg_creedential.txt file described above into the "/home/pi/face_detect" folder.



Remark about childhood



When I had assembled the first version and showed it to my mother, I got an unexpected reply:

“Oh, and we did almost the same in your childhood.”

"?!"

"Well, they put the stroller with you on the balcony, threw a microphone through the window, which was included in the amplifier in the apartment."


In general, it suddenly turned out that it is hereditary.



Remark about spouse



"How did your wife react?"

“How did she let you experiment on your son ?!”

People asked me this more than once.

But I have thoroughly corrupted my wife with all this. She even writes articles on Habr herself sometimes.



PS1



I am not an information security specialist. Of course, I tried to make sure that no passwords are exposed anywhere, and that everyone can configure everything for themselves, entering all the sensitive information after the first start.



But I do not rule out that I missed something somewhere. If you see obvious mistakes, point them out and I will try to fix them.



PS2



Most likely, I will post updates on this project in my Telegram channel or in the VKontakte group. If enough interesting material accumulates, I will write another article here.


