Investigation: How Anonymized Data Becomes Personal and Gets Sold to Third Parties

A week ago I got yet another call offering me a new car from a dealership I have certainly never visited. When I asked where the caller had got my phone number, first name and patronymic, the answer was blunt: "we picked your number at random from the numbering capacity." I did not believe this explanation, so I decided to find out how the data market works, who can leak information about users, and how easily and skillfully the Internet monopolists get around the Law "On Personal Data" (No. 152-FZ).



Read on to find out who monetizes my data and how it ends up in the hands of companies whose services I have never used: banks, insurance companies, medical centers, real estate developers and other organizations with their annoying advertising calls. And yes, this is a longread, just the way you like it.



Our beautiful country spent the spring and early summer of 2020 in self-isolation. Besides the obvious increase in the financial burden on business, the masks everywhere and the need to work from home, this period clearly showed how casually some market participants handle the personal data of Russians.



Background



I was prompted to write this article by an interview with Tigran Oganesovich Khudaverdyan in the media (TheBell, Roem) about the Yandex service that calculates the self-isolation index.



Let me briefly remind you what this is about: almost simultaneously with the announcement of "non-working days throughout the country", the Internet giant Yandex began to report regularly on how well citizens were complying with self-isolation measures. Officials and the media consulted this data daily. And although the topic is now gradually fading into the background, the questions about the primary source of this data have not gone away.



Since Yandex has already been caught taking, let's say, a relaxed attitude towards its users (recall at least the story of surveillance through applications), it is reasonable to assume that data on citizens' current location during self-isolation was collected through mobile applications with geolocation. Surveillance through smart gadgets is an obvious method in itself. In the capital, for example, there was an outright scandal: despite numerous violations of current legislation, the DIT of Moscow forced people to sign an onerous agreement with yet another "comrade major" of this kind.



In his interview, the Managing Director of Yandex states:



“We are not involved in any of this. I confess that this is a sore spot for us, because we are constantly suspected of being involved in this surveillance. But we have our own principle within the company: in no case, even in a difficult situation, should we violate the principles that Yandex has been guided by since its inception.”



I do not believe it. The journalists did not ask the most important question: on what data did Yandex base its "confidential" rating? This matters because no answer is publicly available - the Internet giant simply does not disclose its methodology:







It is reasonable to assume that the phrase "data on the use of various Yandex applications and services" means monitoring citizens' movements. But it is unlikely that either you or I gave explicit consent to such surveillance.



How the data market works



In the 90s, databases were sold on CDs at market stalls. Nowadays you can get the contact list you need even faster - you don't even have to go anywhere.



Obvious but illegal ways



You can search for someone else's data on social networks or in dedicated Telegram channels. I will not name the channels; I'm sure you can find them yourself if you wish.







Some more advanced citizens act a little differently: they post an offer agreement on their websites stating that the data is collected from public sources, and even cite articles of the law that supposedly allow them to do this:







The only nuance is that the documents on the Avito website expressly prohibit parsing the contact database of avito.ru on your own.



Online database sellers likewise collect information from every possible source. All these methods are, frankly, illegal, since they violate the provisions of the Law "On Personal Data" (No. 152-FZ). I am 100% sure that not a single sane person listed in such databases has consented to these companies publicly distributing information about them over the Internet.



Man-in-the-middle attack



Leaks through employees who have access to the customer base are another obvious route. Let's not dwell on this aspect.



The only way to deal with such people is access control, competent design of the contact base and the use of anti-fraud mechanisms that are developed by information security officers. The latter, by the way, regularly catch "sellers" and hand them over to law enforcement officers.



Subtle ways of collecting data



Internet companies, let's face it, have become completely brazen and have come up with a new way of handling user data as they please. Today, all the largest players in this market collect such a dossier on us, poor users, that James Bond, Richard Sorge, Mata Hari and Austin Powers combined would envy them. And none of the users authorized the Internet companies to collect this material.



Everyone has heard the story of the American elections, in which the Republican victory was reportedly secured by advertisements targeted at users of Google and Facebook. Moreover, these companies shared data with a third party, Cambridge Analytica, which built the "target audience" for the advertisements. Data collection is also practiced in China: a now-popular social network recently became notorious for using illegal surveillance methods that are prohibited even by Google's rules.



I must say that Russia's Yandex closely watches what its foreign colleagues do and uses similar methods: the company hides behind a screen of "impersonal data", which, as my own experience as a non-programmer has shown, can be deciphered with some skill even while sitting at home on the couch.



In December last year, RBC published an interesting article about a joint project of Yandex and the Bureau of Credit Histories (BKI) to transfer data about users' behavior on the Internet. As the authors of the solution envisage it, banks will be able to get additional information from Yandex about the people they are interested in, knowing only the client's e-mail address and mobile phone number.



An unnamed source in the article said that Yandex receives the data in hashed form, after which internal algorithms calculate a score for the specific person, and it is this score that is returned to the BKI. All this looks quite neat, but there is a nuance: the article quotes Alexander Pakhomov, Managing Partner of Law and Business Management Company, who, like me, believes that in the course of this procedure anonymized data becomes personal again:







How anonymized data becomes personal



Let's try to figure out what is happening “under the hood” of this service. I must say right away that this is difficult for me, since I tend to enjoy the blessings of great and beautiful Russia rather than spend my working days in the meeting rooms of Yandex's modern Moscow office. So I urge you to share information and correct me if I am mistaken about something.



Step 1. Hashing the data



Let's start by examining what Yandex itself means by "encrypted", "hashed" or "impersonal" data. The public service Yandex.Audience will help us with this.



From its description it follows that the service lets advertisers reach their own customers. To do this, you only need to give Yandex some customer identifiers: phone numbers or email addresses. This data can be uploaded in plain form, for example as a text or spreadsheet file, or in "impersonal" form, in which case the MD5 hashing algorithm is used.



The service then works as follows: Yandex identifies a specific user from these personal identifiers and shows that user targeted advertising on various Yandex services and portals.
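As a rough illustration of the "impersonal" upload described above, here is a minimal sketch of how an advertiser might hash customer identifiers with MD5 before sending them. The normalization rules (digits only for phones, lowercase for emails), the function names and the sample contacts are assumptions made for this example, not Yandex's documented format.

```python
import hashlib


def normalize_phone(raw: str) -> str:
    """Keep digits only (an assumed normalization, not an official rule)."""
    return "".join(ch for ch in raw if ch.isdigit())


def normalize_email(raw: str) -> str:
    """Trim whitespace and lowercase (an assumed normalization)."""
    return raw.strip().lower()


def md5_hex(identifier: str) -> str:
    """Return the MD5 digest of a string as 32 hexadecimal characters."""
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()


# Hypothetical customer list; the contacts are made up.
customers = ["+7 (900) 000-00-01", " Ivan@Example.com "]
hashed = [
    md5_hex(normalize_email(c)) if "@" in c else md5_hex(normalize_phone(c))
    for c in customers
]
for digest in hashed:
    print(digest)
```

Because the function is deterministic and no secret is mixed in, anyone who hashes the same normalized identifier gets the same digest, which is exactly what makes matching possible.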



What do we know about MD5?
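MD5 is a 128-bit hashing algorithm: for input of any length it produces a 128-bit digest, usually written as 32 hexadecimal characters.

The same input always gives the same digest, and the function is designed to be one-way: the input cannot be computed directly from the digest.



MD5 was developed in 1991, and the first weaknesses were reported as early as 1993. Practical collision attacks have been demonstrated since then, and by 2008 the algorithm was widely regarded as cryptographically broken.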



Step 2. Decrypting MD5 hashes



Technically, MD5 cracking can be done in one of four ways:



  • Dictionary search
  • Brute-force
  • Rainbow-crack
  • Hash function collision


Obviously the fastest and easiest option is to use rainbow tables. In fact, to implement this method, you just need to know the hash and make your table according to certain criteria.



How rainbow tables work
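Put simply, a rainbow table is a precomputed structure for looking up which plaintext produces a given hash. Phone numbers suit this approach well because the search space is small: a number written in full is just 11 digits.



Take, for example, the number 83910123456. Its MD5 hash is fba55dd11f758ab4f03fad3c5f19ba75.



We look the hash up in the table and get the original number back in the Plaintext field!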
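The lookup idea can be sketched in a few lines. Note that this is a plain precomputed dictionary rather than a true rainbow table (which stores hash chains built with reduction functions in order to save memory), and that the known prefix and the four unknown digits are assumptions chosen to keep the example small.

```python
import hashlib


def build_lookup(prefix: str, suffix_digits: int) -> dict:
    """Precompute MD5 -> number for every number that starts with `prefix`."""
    table = {}
    for n in range(10 ** suffix_digits):
        number = f"{prefix}{n:0{suffix_digits}d}"
        table[hashlib.md5(number.encode("ascii")).hexdigest()] = number
    return table


# 10,000 candidates: the first seven digits are assumed to be known.
table = build_lookup("8391012", 4)

# Hash the example number from the text, then recover it with a single lookup.
target = hashlib.md5(b"83910123456").hexdigest()
recovered = table.get(target)
print(recovered)
```

The table is built once and then answers any number of queries instantly; covering all 11-digit numbers would take 10^11 entries, which is why real tools trade memory for computation.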



Step 3. Comparison of data



There is not the slightest doubt that Yandex stores data in encrypted form. Roughly speaking, the search engine has a profile for each registered user which contains, among other things, that user's email addresses and phone number. Such data can easily be hashed and, if necessary (as we have already seen above), de-hashed.



Then, having received a list of contacts from an advertiser in any form, Yandex can easily compare it with its internal database, which contains the same identifiers. Simply put, Yandex matches the identifiers in its user profiles against the data supplied by the advertiser. This makes it possible to show ads to a specific user when that user opens a page of one Yandex service or another.
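The comparison step might look roughly like the sketch below. It assumes the platform simply hashes its own stored identifiers with unsalted MD5 and looks up the advertiser's hashes in the resulting index; the profile structure, user ids and contacts are all hypothetical, and nothing here is taken from Yandex's actual implementation.

```python
import hashlib


def md5_hex(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()


# Hypothetical internal profiles: user id -> identifiers known to the platform.
profiles = {
    "user-001": {"phone": "79000000001", "email": "ivan@example.com"},
    "user-002": {"phone": "79000000002", "email": "olga@example.com"},
    "user-003": {"phone": "79000000003", "email": "petr@example.com"},
}


def build_hash_index(profiles: dict) -> dict:
    """Map the MD5 of every known identifier to the user id that owns it."""
    index = {}
    for user_id, identifiers in profiles.items():
        for value in identifiers.values():
            index[md5_hex(value)] = user_id
    return index


def match_segment(uploaded_hashes: list, index: dict) -> set:
    """Return the user ids whose identifiers appear in the advertiser's upload."""
    return {index[h] for h in uploaded_hashes if h in index}


# The advertiser uploads only hashes, yet specific users are still singled out.
upload = [md5_hex("79000000002"), md5_hex("petr@example.com"), md5_hex("79990000000")]
matched = match_segment(upload, build_hash_index(profiles))
print(sorted(matched))
```

The point of the sketch is that the hash acts as a stable pseudonym: whoever holds the original identifiers can always join on it.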



Unique identification of users



There can be no talk of impersonal data exchange under such a scheme. All parties uniquely identify a specific user in the course of providing the service. Judging by the comments and descriptions, exactly the same scheme is used with credit bureaus. And apparently Yandex uses a solution suspiciously similar to the Crypt platform.



However, Yandex has never publicly announced the possibility of matching such profiles with mobile phone numbers or e-mails of its users. But, as we learned from media materials, Yandex does exactly this, at least when working with the United Credit Bureau.



Why not tell your customers about this honestly, when it is all in plain sight anyway? Instead, Yandex spokespeople coyly talk about the absence of "personal information" and use other invented terms that do not exist in Russian legislation and that let them sidestep some questions about the circulation and protection of citizens' data.



A little practice: Yandex, I found your violation of 152-FZ!



Does Yandex salt hashes? I cannot answer this question unequivocally; after all, I do not work at the company and do not know its inner workings. However, I can make two assumptions:



  • Yandex's server capabilities allow you to quickly dehash unsalted MD5 hashes;
  • to work with salted hashes, both parties need to know the salt.
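The second assumption is easy to check with a small sketch. The salting scheme used here (the salt is simply prepended to the value) and the secret strings are assumptions for illustration only.

```python
import hashlib


def md5_hex(value: str, salt: str = "") -> str:
    """MD5 of salt + value; an empty salt gives the plain unsalted hash."""
    return hashlib.md5((salt + value).encode("utf-8")).hexdigest()


phone = "79000000001"  # made-up number

# Unsalted: two independent parties compute the same digest, so lists can be matched.
unsalted_match = md5_hex(phone) == md5_hex(phone)

# Salted with different secrets: the digests no longer line up.
different_salts_match = (
    md5_hex(phone, salt="advertiser-secret") == md5_hex(phone, salt="platform-secret")
)

# Salted with a shared secret: matching works again, but the salt has to be exchanged.
shared_salt_match = md5_hex(phone, salt="shared") == md5_hex(phone, salt="shared")

print(unsalted_match, different_salts_match, shared_salt_match)
```

In other words, a matching service that accepts ready-made hashes from arbitrary advertisers either uses no salt or has to publish it, and a published salt protects very little.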


Obviously, the advertiser service uses unsalted hashes. Otherwise the interface for advertisers would have to include a salt field, and there isn't one! Let's take a close look at the screenshot in the description of Yandex.Audience:







Note the question mark next to the "Hashed data" checkbox. Let's go to the service itself and hover over it.







We see three hashes: a31259d185ad013e0a663437c60b5d0, 78ee6d68f49d2c90397d9fbffc3814d1 and 702e8494aeb560dff987e623e71bccf8. The first is clearly missing something: it has only 31 characters, but there should be 32! So we discard this hash right away.
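The length check is trivial to automate. The sketch below validates the three strings quoted above against the shape of an MD5 digest (32 hexadecimal characters); the helper name is made up for this example.

```python
import re

# An MD5 digest in hex is exactly 32 characters from 0-9 and a-f.
MD5_HEX = re.compile(r"[0-9a-f]{32}")

hashes = [
    "a31259d185ad013e0a663437c60b5d0",
    "78ee6d68f49d2c90397d9fbffc3814d1",
    "702e8494aeb560dff987e623e71bccf8",
]


def is_md5_hex(value: str) -> bool:
    """True if the whole string looks like a hex-encoded MD5 digest."""
    return MD5_HEX.fullmatch(value) is not None


for value in hashes:
    print(len(value), is_md5_hex(value), value)
```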



I could not reverse the other two hashes with the rainbow table I had built earlier either, so I decided to brute-force them. To do this, I had to switch a mining farm of six GeForce GTX 1060-class video cards from mining Ether to running hashcat.







I told the program to search using a mask of 11 digits (see the top arrow in the screenshot). As a result, my ordinary farm recovered the phone number behind one of the hashes in just 22 seconds. Just imagine how fast hashes can be brute-forced on Yandex's hardware!
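For readers without a GPU farm, the same mask idea can be reproduced on a CPU in a scaled-down form. The sketch below is an illustration, not the hashcat run from the screenshot: it assumes the first six digits are already known so that only 10^5 candidates remain, and the target number is made up. A full 11-digit mask means 10^11 candidates, which is out of reach for pure Python but routine for GPUs.

```python
import hashlib


def crack_digit_mask(target_hex: str, known_prefix: str, unknown_digits: int):
    """Try every digit combination after known_prefix; return the match or None."""
    target = bytes.fromhex(target_hex)
    for n in range(10 ** unknown_digits):
        candidate = f"{known_prefix}{n:0{unknown_digits}d}"
        if hashlib.md5(candidate.encode("ascii")).digest() == target:
            return candidate
    return None


# Made-up number, used only to produce a target hash for the demonstration.
secret_number = "79001234567"
target_hash = hashlib.md5(secret_number.encode("ascii")).hexdigest()

# Assume the first six digits are known, leaving 10**5 candidates to test.
found = crack_digit_mask(target_hash, known_prefix="790012", unknown_digits=5)
print(found)
```

Every extra unknown digit multiplies the work by ten, so the only real defense of an 11-digit identifier is the attacker's lack of hardware, and that is not much of a defense.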



Now let's find out who owns this number by simply looking it up in the Numbuster mobile app:







Now we go to the search engine, and in a matter of moments we get all the information we need:







Check and mate, Yandex: thanks to open information from your own site, it took me a couple of clicks to find out exactly who built your service! Needless to say, anyone reading this article can easily repeat the same steps. Why did you do this to Yaroslav?



What data can be in the profile of each user



To use Yandex services, you must provide your mobile phone number and email address. Through its applications and services Yandex knows almost everything about me: from the sites I visit (those with Yandex.Metrica installed, which is more than 54% of the Runet) to the phone number I enter in applications. It knows my routes from the Yandex.Go superapp, my illnesses, my taste in music. Yandex knows which theaters I go to, which films I watch, which goods I buy in stores, and what food I order.



This information, according to the company, "is used mainly for their own needs and the placement of targeted advertising based on knowledge of customer preferences." The key word here is “mainly”. Yandex used to be seen as an innovative company that provides users with free services and makes money from Internet advertising. But as we know from the media, Yandex now at the very least sells data through a credit bureau - I will show how the data transfer mechanism itself works a little further down. It is reasonable to assume that plenty of buyers will want information from the Internet giant about users, tied to phone numbers and email addresses.



In other words, now banks, insurance and legal companies, medical centers, developers can get the number of a person who visited a certain site or searched for a certain product, and call him for their advertising purposes. Or refuse to issue insurance or a bank loan.



Who does the Credit Bureau sell data to?



You don't need to be a professional analyst to understand that the BKI consolidates data about specific people not only for banks. On the website of the bureau that Yandex works with, you can see that, besides bank scoring, other services are available to customers:



Service "Triggers Bureau"



In trigger mode, information about your actions is passed to banks and insurance companies:







Note the logic of this service: you put your customers' phone numbers under monitoring, and as soon as one of them performs an action you are interested in, you receive a notification. No data on the customer's specific actions is transmitted, only the fact of the target action: applying for or taking out a car insurance policy, ordering a taxi, and so on.



Convenient, right? Especially for defending the position that "customer data is not transmitted and processed in Yandex". After all, an action such as a visit to a specific website can be reported simply by sending a hashed mobile number, without any data about the visit itself. And that hash, as I showed above, can easily be matched against the hashes of a customer base. For simplicity, you can even take the database of all possible mobile number ranges in Russia - it is available on the website of the Federal Communications Agency.
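To make the logic concrete, here is a toy model of such trigger monitoring. It is a sketch built on assumptions: the class, the event names and the customer base are all hypothetical and do not describe the bureau's real interface. It only shows why a hashed phone number plus an event type is enough for the subscriber to learn who did what.

```python
import hashlib
from collections import defaultdict


def md5_hex(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()


class TriggerService:
    """Toy model of trigger monitoring keyed by hashed phone numbers."""

    def __init__(self):
        self._watchers = defaultdict(list)  # phone hash -> subscriber inboxes

    def watch(self, phone_hash: str, inbox: list) -> None:
        self._watchers[phone_hash].append(inbox)

    def report_event(self, phone_hash: str, event: str) -> None:
        # Only the hash and the event type are passed on to subscribers.
        for inbox in self._watchers.get(phone_hash, []):
            inbox.append((phone_hash, event))


# The subscriber's own customer base lets it turn a hash back into a person.
bank_customers = {"79000000001": "Customer A", "79000000002": "Customer B"}
hash_to_phone = {md5_hex(phone): phone for phone in bank_customers}

inbox = []
service = TriggerService()
for phone_hash in hash_to_phone:
    service.watch(phone_hash, inbox)

service.report_event(md5_hex("79000000002"), "auto_insurance_request")
service.report_event(md5_hex("79000000009"), "taxi_order")  # nobody watches this number

resolved = [(bank_customers[hash_to_phone[h]], event) for h, event in inbox]
print(resolved)
```

Formally no "customer data" crosses the boundary, yet the subscriber ends up with a named person and a specific action.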



Again, it turns out that what Yandex calls "encrypted", "hashed" or "depersonalized" data is not really any of those things. And the scheme Yandex describes certainly does not prevent this data from being sold through the credit bureau services discussed here, which may be the very source of the spam calls to my phone.







Insurance companies, having gained access to data from Yandex mapping services and its masterpiece Yandex.Go superapp, can determine:



  • where I live and work;
  • how often I travel by car;
  • which routes I take;
  • how fast I drive;
  • what my driving style is: whether I brake sharply, drive recklessly or drive smoothly.


And this is not speculation: the fact that Yandex collects this data became known in 2019 thanks to European data protection legislation, the so-called GDPR. Under it, any company is obliged to tell EU citizens what data it collects and analyzes about them.



Journalists from Meduza took advantage of the GDPR and, from Lithuania, requested the data held on one of their employees.



The Meduza article says that the journalist received an archive from Yandex which, among other things, contained a file with the entire history of movements. The location was recorded whenever the application was launched on the smartphone, including in the background. The journalist describes it as "the history of launching the Maps application on an iPhone with the exact coordinates of where it happened" (file traffic_sessions.csv).
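It takes very little to turn such a log into the "where I live" item from the list above. The sketch below is purely illustrative: the real layout of traffic_sessions.csv is not described in the article, so the column names, the sample rows and the night-time heuristic are all assumptions.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample in an assumed format: timestamp, latitude, longitude.
SAMPLE = """timestamp,lat,lon
2020-04-01T23:10:00,55.7512,37.6184
2020-04-02T02:30:00,55.7514,37.6181
2020-04-02T14:00:00,55.7601,37.6402
2020-04-02T23:45:00,55.7509,37.6179
"""


def likely_home(rows, night_start=22, night_end=6, precision=3):
    """Return the most frequent night-time coordinate cell, or None."""
    cells = Counter()
    for row in rows:
        hour = datetime.fromisoformat(row["timestamp"]).hour
        if hour >= night_start or hour < night_end:
            cell = (round(float(row["lat"]), precision),
                    round(float(row["lon"]), precision))
            cells[cell] += 1
    return cells.most_common(1)[0][0] if cells else None


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
home = likely_home(rows)
print(home)
```

Swapping the night-time filter for working hours on weekdays would give a guess at the workplace in the same way.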



It is interesting that Yandex does not provide such information to citizens of the Russian Federation. Moreover, until now Yandex has not even provided a service that would make it possible to understand who and when requested the accumulated data about the user. Even Facebook has such a service - and the user himself can request and view all information about himself.



What personal information does Yandex collect for certain?



Let's turn to the legal documents on the Yandex website. From point 4 we learn that the Internet giant can collect the following categories of users' personal information while they use Yandex sites and services:



  • personal information: name, phone number, address and age;
  • electronic data (HTTP headers, IP address, cookies, web beacons / pixel tags, browser ID data, hardware and software information);
  • date and time of access to sites and / or services;
  • information about user activity while using sites and / or services: history of search queries; e-mail addresses of those with whom the user corresponds; e-mail content and attachments, as well as files stored in Yandex systems.


Why does Yandex collect this data?



The answer to this question can be found in the same document; let's look closely at point 5. In addition to understandable purposes such as:



  • providing users with search results for search queries;
  • compliance with the obligations established by law;
  • better understanding how users interact with sites and services,



Yandex separately notes that the collection of personal data is necessary in order to "offer you other products and services of Yandex or other companies that, in our opinion, may be of interest to you" (sub-clause "c" of point 5).



However, the law "On Personal Data" (No. 152-FZ) is categorical: Article 15 states that "the processing of personal data in order to promote goods, works, services on the market by making direct contacts with a potential consumer is allowed only with the prior consent of the subject of personal data." On the users' side, the regulatory authorities are FAS, Rospotrebnadzor and Roskomnadzor.



At the same time, the Internet giant freely hands other companies databases of supposedly anonymized identifiers which, in its own view, have ceased to be personal data. And Yandex secured this right to “share” through an inconspicuous line in the lengthy text of its own privacy policy.



Instead of a conclusion



Is it all legal? After all, I did not give Yandex the right to disclose information about me to anyone. Lawyers I know say that Internet data and Internet identifiers are a “gray” field in our legislation and it is impossible to hold Yandex accountable for the sale of such data about you.



And how fair is it that Yandex makes money from my data without explaining to me exactly how this happens and where these earnings come from? This has long been about more than the notorious iron ads that chase you across every site for two more weeks after you search for an iron. It also directly affects my quality of life and my access to social and other services such as loans, insurance and medical care.



You must agree that assessing me as a borrower or policyholder on the basis of my behavior on the Internet, done “in the dark” and backed only by veiled terms and offer agreements buried in page footers, looks completely unethical and opaque. This is very annoying.



Despite the GDPR and the tightening of Russian laws on the use of citizens' personal data, the Internet giant continues to monetize information about us and quite openly monitors all our actions through its services, even hiding behind the socially important task of informing the population and the authorities about compliance with the isolation regime, as in the case of the coronavirus. A reasonable question arises: who else uses our data besides Yandex and its commercial clients?


