Three pillars of linguistic analysis that InfoWatch Traffic Monitor cannot work without




Hello! Today we will talk about how linguistics is integrated into the work of a DLP system and how it helps us to protect important data from malicious attacks.



Recently, the need for companies to protect themselves from leaks of confidential information has grown significantly. The shift of employees to remote work has led to a marked increase in cyberattacks and information security crimes: according to analysts' reports, in the first third of 2020 the number of confidential information leaks from Russian companies rose by 38%, and the trend continues.



As a rule, legal documents, financial papers, personal data of employees and clients, etc. are under attack. To protect confidential data from intruders, companies install DLP (Data Loss Prevention) systems to prevent information leaks.



Linguistic analysis technology is deeply integrated into many DLP systems: content analysis is the foundation of traffic filtering for detecting violations, and its quality largely determines the quality of the product itself.



Linguistic Analysis: How It Works



Linguistic analysis technology automatically determines the topic of an analyzed piece of text and whether it is confidential, based on the terms and term combinations encountered in it.



We begin with an initial analysis of documents. First, the customer company determines the scope and content of the documents that are confidential and need protection (ideally, there should be at least ten sample documents for each category of protected information; if the customer is unsure which documents to provide, the list of restricted information adopted in the customer's organization can serve as a guide). The linguist then extracts the terms: words or phrases characteristic of the particular industry that capture the specifics of the text. It is crucial that these terms occur as often as possible in documents from one industry and as rarely as possible in others (for the banking sector, for example, typical terms are "cash balance", "settlement and cash services" or "deposit").
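The selection criterion just described (terms frequent in one industry's documents and rare elsewhere) can be sketched as a toy scoring function. Everything here, the function name and the sample documents alike, is a hypothetical illustration, not InfoWatch's actual procedure:

```python
from collections import Counter

def term_distinctiveness(in_domain_docs, out_domain_docs):
    """Score single-word term candidates: the score is high when a word is
    frequent in the target industry's documents and rare everywhere else."""
    in_counts = Counter(w for doc in in_domain_docs for w in doc.lower().split())
    out_counts = Counter(w for doc in out_domain_docs for w in doc.lower().split())
    # +1 smoothing: a word never seen outside the domain gets the highest score
    return {term: f_in / (out_counts.get(term, 0) + 1)
            for term, f_in in in_counts.items()}

banking = ["deposit deposit cash balance", "settlement and cash services deposit"]
retail = ["cash register and balance of goods"]
scores = term_distinctiveness(banking, retail)
# "deposit" never occurs in the retail sample, so it outranks the shared word "cash"
```

A real workflow would of course operate on phrases as well as single words and on much larger collections; the point is only the frequency contrast between industries.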



- Next, the terms are categorized. The number of categories is not fixed; however, the more categories are defined, the more fine-grained the classification becomes. Categories group terms into broader conceptual groups to organize the information more clearly.



When categorizing a term, the linguist may mark it as "characteristic." Characteristic terms occur only in the category to which they are assigned and in no other. If even one such term is found in an intercepted text, the text is automatically assigned to that term's category.



In general, a category can contain anywhere from three terms (the minimum number of non-characteristic terms the system must detect before flagging a text as confidential) to several thousand, depending on its specifics. A category consisting only of characteristic terms (for example, "Drugs" or "Terrorism") may contain several thousand terms. A category of non-characteristic terms (as a rule, these are categories built on the company's own documentation: personnel, accounting, or legal information) is best limited to a few dozen terms, roughly from three to fifty.
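The rules above, where a single characteristic term decides the category outright while at least three ordinary term hits are needed otherwise, can be sketched as follows (the category data is invented for illustration):

```python
def classify(text, categories):
    """Assign a text to CFD-style categories. One characteristic term is
    decisive; otherwise at least three non-characteristic terms must match."""
    words = set(text.lower().split())
    matched = []
    for name, cat in categories.items():
        if words & cat["characteristic"]:
            matched.append(name)              # a single characteristic term suffices
        elif len(words & cat["terms"]) >= 3:  # minimum of three ordinary terms
            matched.append(name)
    return matched

categories = {
    "Banking": {
        "characteristic": {"swift"},
        "terms": {"deposit", "balance", "settlement", "account"},
    },
}
```

Real matching also handles phrases, morphology, and term weights, all of which this sketch omits.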






- Then the linguist enters the categories into the content filtering database (CFD), on which the linguistic analysis is based. The CFD is a hierarchically structured dictionary comprising a list of categories and terms.



The CFD acts as a classifier on whose basis the analyzed information is distributed by topic.



When non-characteristic terms are added to the CFD, each is assigned a weight, a number from 1 to 10 (when a category is created, the default weight is 5). The weights of the terms in a category should be proportional to the frequencies with which the terms occur in the text, and only the frequencies of the terms relative to each other matter; their frequency relative to words that are not in the CFD has no effect.

For example, suppose we add the terms "glokaya", "kuzdra" and "shteko" to a CFD category with equal weights (whether they all weigh 10 or all weigh 1 makes no difference). Then the text "Glokaya kuzdra shteko budlanula bokra i kudryachit bokryonka" (Shcherba's famous nonsense sentence) will be detected with relevance 1. If, in an intercepted text, the words "glokaya" and "kuzdra" each occur 10 times while "shteko" occurs 100 times, the relevance under equal weights drops to roughly 0.69. In that case it is reasonable to set the weights of "glokaya" and "kuzdra" to 1 and the weight of "shteko" to 10; the relevance of the text then returns to 1. Of course, such a strict proportion cannot always be maintained, but it is worth striving for.



To determine how relevant a text is to a particular category, one of the classic information retrieval models is used: the vector space model, a popular way of working with linguistic objects of all kinds.






The main idea is as follows: there is a space whose dimensions are defined by the terms. For each document intercepted by the system, a vector is built whose coordinates are the number of times the corresponding term occurs in that document. A similar vector is constructed for each CFD category. The dimension of the vectors is the same for all analyzed texts and equals the number of words in the CFD.



The relevance of two vectors can then be calculated as the cosine of the angle between them, via the dot product and the norms:

cos(d, c) = (d · c) / (||d|| · ||c||)

where d is the document vector and c is the category vector. The cosine similarity of the intercepted document and a CFD category ranges from 0 to 1: the larger the value, the more similar the document is to that category.
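As a sketch of this formula, the following computes cosine relevance over the terms of one category and reproduces the numbers from the "glokaya kuzdra" weights example above. It is a simplification: a real vector would span the entire CFD vocabulary, not one category.

```python
import math

def relevance(doc_counts, category_weights):
    """Cosine of the angle between a document's term-frequency vector and
    a category's weight vector, restricted to the category's terms."""
    dot = sum(doc_counts.get(t, 0) * w for t, w in category_weights.items())
    doc_norm = math.sqrt(sum(doc_counts.get(t, 0) ** 2 for t in category_weights))
    cat_norm = math.sqrt(sum(w ** 2 for w in category_weights.values()))
    if doc_norm == 0:
        return 0.0
    return dot / (doc_norm * cat_norm)

# Term frequencies from the example: "glokaya" and "kuzdra" 10 times, "shteko" 100
counts = {"glokaya": 10, "kuzdra": 10, "shteko": 100}
equal_weights = {"glokaya": 5, "kuzdra": 5, "shteko": 5}
proportional = {"glokaya": 1, "kuzdra": 1, "shteko": 10}
print(round(relevance(counts, equal_weights), 2))  # 0.69
print(round(relevance(counts, proportional), 2))   # 1.0
```

Note that scaling all weights by the same factor does not change the cosine, which is why "all 10s or all 1s" makes no difference.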



Linguistic analysis based on content filtering databases has a number of advantages over other text classification technologies (InfoWatch linguists use those as well when analyzing documents, but more on them later).



The main distinguishing feature of the CFD is its flexibility: the databases can be customized to the needs of a particular company. Linguists manually extend and adjust the contents of the CFD, fine-tuning the technology for each customer.



Linguistic analysis based on the CFD finds the required terms and phrases while accounting for transliteration, typos, and morphology: given a term such as "transport lease", the system reacts to all inflected forms of the phrase, including variants with misprints. The search relies on morphological dictionaries (for Russian, A. A. Zaliznyak's dictionary; for foreign languages, separately compiled ones). The typo detector does not "correct" terms that are present in the morphological dictionary, which prevents false positives on legitimate words whose Damerau-Levenshtein distance from a protected term is one.
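For reference, the distance mentioned here counts insertions, deletions, substitutions, and transpositions of adjacent characters. A standard (restricted) implementation, shown here only to make the metric concrete rather than to reproduce InfoWatch's detector, looks like this:

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein distance: edits are insertion,
    deletion, substitution, and transposition of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("lease", "laese"))  # 1 (one transposition)
```

A misprint like "laese" is one edit away from "lease", so a typo-tolerant matcher would still catch it, while dictionary words at distance one are deliberately left alone.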



InfoWatch maintains a large collection of industry dictionaries. We have developed CFDs for a wide range of business areas, from space to energy, as well as narrow-profile databases (for example, on Islam, or containing C++ and Java source code) designed for the specific purposes of individual companies. It is also worth adding that, besides Russian, we have 95 CFDs in 33 foreign languages, with morphology support for many of them.



Autolinguist: quick protection of standard documents



As a rule, an individual company's document flow does not vary much: each department works with standard documents that are similar in subject matter and vocabulary.



To protect and classify such documents, InfoWatch's "arsenal" includes another text analysis tool: "Autolinguist".



As the name suggests, the technology allows you to automatically classify typical documents into predefined categories without resorting to manual analysis.



Analyzing documents to build a CFD is usually long and labor-intensive work (on average it takes a linguist 2-5 days to extract terms, create categories, and then eliminate false positives and false negatives); Autolinguist can significantly speed up the process of setting up text categorization.



The classifier uses the LIBLINEAR machine learning library, in particular its logistic regression algorithm, which yields the probability that a text document belongs to a certain category.
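Autolinguist itself relies on LIBLINEAR; purely as a stand-in illustration of the idea, here is a minimal bag-of-words logistic regression in plain Python, trained by gradient descent (the vocabulary and documents are invented):

```python
import math

def train_logreg(docs, labels, vocab, epochs=200, lr=0.5):
    """Tiny bag-of-words logistic regression trained by stochastic gradient
    descent. Returns one weight per vocabulary word plus a bias term."""
    w = [0.0] * len(vocab)
    b = 0.0
    feats = [[doc.split().count(v) for v in vocab] for doc in docs]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid turns the score into a probability
            g = p - y                        # gradient of the log loss
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def predict_proba(doc, vocab, w, b):
    """Probability that a document belongs to the positive category."""
    x = [doc.split().count(v) for v in vocab]
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

vocab = ["deposit", "balance", "vacation", "party"]
docs = ["deposit balance deposit", "balance deposit",
        "vacation party", "party vacation vacation"]
labels = [1, 1, 0, 0]  # 1 = confidential banking category
w, b = train_logreg(docs, labels, vocab)
```

LIBLINEAR solves the same optimization far more efficiently and at scale; the sketch only shows why the output is a probability rather than a hard label.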



Users can configure "Autolinguist" themselves: after loading a training collection of documents and training the classifier, they can later add new categories and adjust the contents of the document base.



Text objects: when regex is not a problem, but a solution



Another powerful tool for analyzing and detecting the required information is text objects: a technology based on regular expressions (which, as is well known, are an extremely flexible and convenient way to specify almost any search criterion) and used to protect data with a fixed external representation, such as credit card numbers, bank account details, and email addresses.






A text object can include one or more regular expression patterns or literal strings (words or phrases; in the latter case, the search requires an exact match to the string, without accounting for spelling variants or morphology).
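A rough sketch of such a text object might look like this; the patterns and literal string below are simplified illustrations, not InfoWatch's actual definitions:

```python
import re

# Hypothetical "text object": a few regex patterns plus a literal string.
PATTERNS = [
    re.compile(r"\b(?:\d[ -]?){15}\d\b"),        # 16-digit card number, optional separators
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]
LITERALS = ["strictly confidential"]

def find_text_objects(text):
    """Return every pattern match and literal-string hit found in the text."""
    hits = []
    for pat in PATTERNS:
        hits.extend(m.group(0) for m in pat.finditer(text))
    low = text.lower()
    hits.extend(s for s in LITERALS if s in low)
    return hits

sample = "Card 4276 8380 1234 5678, contact ivanov@example.com. Strictly confidential."
found = find_text_objects(sample)
```

A production-grade card-number object would also verify the Luhn checksum; that is exactly the kind of check the verification functions described next are for.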



To verify found text or digit combinations with the customer's needs in mind, and without changing the technology's source code, verification functions are written in Lua.



As an example, take the verification function for detecting international bank codes in the SWIFT system: it removes the "SWIFT" prefix, validates the remainder, and returns it without separators.
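The original screenshot of that Lua function is not reproduced here; as a hypothetical Python re-creation of the described logic (strip the prefix, validate, return the code without separators), one might write:

```python
import re

# A SWIFT/BIC code is 8 or 11 characters: a 4-letter bank code, a 2-letter
# ISO country code, a 2-character location code, and an optional 3-character branch.
BIC_RE = re.compile(r"[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")

def verify_swift(candidate):
    """Strip a leading 'SWIFT' prefix and any separators, then validate the
    remainder as a BIC code. Returns the normalized code, or None if invalid."""
    text = candidate.upper()
    text = re.sub(r"^SWIFT[:\s-]*", "", text)  # drop the prefix
    text = re.sub(r"[\s-]", "", text)          # drop separators
    return text if BIC_RE.fullmatch(text) else None

print(verify_swift("SWIFT: SABRRUMM"))        # SABRRUMM
print(verify_swift("swift DEUT-DE-FF-500"))   # DEUTDEFF500
```

The production function is written in Lua, as described above; this sketch only conveys the shape of the check.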



In addition to the set of pre-installed text objects (Russian, Belarusian, Kazakh, Vietnamese, Malay, and Arabic, as well as a number of international ones covering data from almost all business areas), users can create their own text objects unique to a particular business. For example, a transport organization may need to control vehicle VIN numbers, and a military structure, service ID numbers.






Friends, in this article you learned about the main intricacies of linguistic analysis in the InfoWatch Traffic Monitor system: content filtering databases and their building blocks, terms and categories; the "Autolinguist" technology, which can classify typical texts on its own; and text objects, used to detect template data.



Despite the proven effectiveness of the technologies we already have, we continue to actively develop semantic analysis, regularly extending existing CFDs, creating new ones along with new text objects, and expanding the scope of our linguistic technologies. I will definitely write about new developments and interesting features in the future.



Colleagues linguists, comment, ask difficult questions, throw useful links and share your experience! Let's make the world a better place together!



Author: Valeria Volobrinskaya (valeria_volob)








