Scalable data classification for security and privacy





Classifying data based on its content is an open problem. Traditional data loss prevention (DLP) systems approach it by fingerprinting the relevant data and monitoring endpoints for those fingerprints. Given the large number of ever-changing data assets at Facebook, this approach not only fails to scale, it is also ineffective at determining where the data actually lives. This article is about an end-to-end system built to discover sensitive semantic types within Facebook at scale and automatically enforce data retention and access controls.



The approach described here is our first end-to-end privacy system that attempts to solve this problem by incorporating data signals, machine learning, and traditional fingerprinting techniques to map and classify all data at Facebook. The system operates in a production environment, achieving an average F2 score of 0.9+ across various privacy classes while processing a large number of data assets in dozens of data stores. What follows is a translation of Facebook's arXiv paper on scalable, machine-learning-based data classification for security and privacy.



Introduction



Organizations today collect and store large amounts of data in a variety of formats and locations [1]. The data is then consumed in many places, sometimes copied or cached multiple times, so that valuable and sensitive business information ends up scattered across many corporate data stores. When an organization must comply with legal or regulatory requirements, for example during civil litigation, it becomes necessary to determine where the relevant data resides. When a privacy regulation states that an organization must mask all Social Security Numbers (SSNs) when sharing personal information with unauthorized entities, the natural first step is to find all SSNs across the organization's data stores. In such circumstances, data classification becomes critical [1]. A classification system enables organizations to automatically enforce privacy and security policies, such as access control and data retention policies. Below we introduce a system we built at Facebook that uses a variety of data signals, a scalable system architecture, and machine learning to discover sensitive semantic data types.



Discovering and classifying data means finding it and labeling it in such a way that relevant information can be retrieved quickly and efficiently when needed. The current process is largely manual: it consists of examining the relevant laws or regulations, determining which types of information should be considered sensitive and what the different levels of sensitivity are, and then building a corresponding classification policy [1]. With data loss prevention (DLP) systems, the data is fingerprinted and downstream endpoints are monitored for those fingerprints. When dealing with a warehouse that has a huge number of assets and petabytes of data, this approach simply does not scale.



Our goal is to build a data classification system that scales for both persistent and volatile user data, without any additional constraints on data type or format. This is an ambitious goal, and naturally fraught with difficulties. Any data record can be thousands of characters long.





Figure 1. Online and offline prediction flows



We must therefore represent the data efficiently, using a common set of features that can be combined and easily moved around. These features should not only support accurate classification but also provide the flexibility and extensibility to easily add and discover new data types in the future. Second, we have to deal with large offline tables: persistent data can be stored in tables that are many petabytes in size, which slows down scanning. Third, we must meet strict classification SLAs for volatile data, which requires the system to be highly efficient, fast, and accurate. Finally, we must provide low-latency classification of volatile data so that classification can be performed in real time for online use cases.



This article describes how we approached these problems and presents a fast and scalable classification system that classifies data items of all types, formats, and sources based on a common set of features. We extended the system architecture and built a custom machine learning model to quickly classify both offline and online data. The article is organized as follows: Section 2 introduces the overall design of the system, Section 3 discusses the machine learning components, and Sections 4 and 5 cover related work and outline directions for future work.



Architecture



To handle both persistent data and online data at Facebook scale, the classification system has two separate flows, which we discuss in detail below.



Persistent data



Initially, the system needs to learn about the wide variety of Facebook's information assets. For each data store, it collects some basic information, such as the datacenter that holds the data, the system that contains it, and the assets located in that particular store. This forms a metadata catalog that lets the system retrieve data efficiently without overloading clients and resources used by other engineers.



This metadata catalog provides a reliable source of truth for all scanned assets and allows the system to track the health of each asset. The information is used to prioritize scheduling, based on the collected data and internal signals from the system, such as the time of the asset's last successful scan, the time it was created, and its past memory and CPU requirements if it has been scanned before. Then, for each data asset (as resources become available), a scan job is launched for that asset.
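As an illustration, here is a minimal sketch of how catalog metadata could drive such prioritization, assuming hypothetical fields (creation time, last successful scan time, past peak memory) and an arbitrary weighting; it is not the actual scheduler.

```python
# Hypothetical prioritization sketch: prefer assets that are stale (never scanned,
# or scanned long ago) and slightly penalize historically expensive ones.
from dataclasses import dataclass
from typing import Optional
import time

@dataclass
class AssetMetadata:
    name: str
    created_ts: float                          # unix time the asset was created
    last_successful_scan_ts: Optional[float]   # None if never scanned
    last_scan_peak_memory_gb: Optional[float]  # None if never scanned

def scan_priority(asset: AssetMetadata, now: Optional[float] = None) -> float:
    """Higher value = scan sooner."""
    now = now if now is not None else time.time()
    if asset.last_successful_scan_ts is None:
        staleness = now - asset.created_ts               # never scanned: age since creation
    else:
        staleness = now - asset.last_successful_scan_ts  # time since last successful scan
    cost_penalty = asset.last_scan_peak_memory_gb or 1.0
    return staleness / cost_penalty

# Assets would then be scanned in descending priority order as resources free up.
```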



Each job is a compiled binary that runs a Bernoulli sample over the latest available data for each asset. The asset is split into individual columns, and the classification result for each column is processed independently. In addition, the system scans any rich data inside the columns: JSON, arrays, encoded structures, URLs, base64-serialized data, and more are all scanned. This can dramatically increase scan execution time, since a single table can contain thousands of nested columns inside a JSON blob.
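A minimal sketch of this per-asset scan step, assuming rows arrive as dictionaries of column name to raw value; the sampling rate, helper names, and flattening rules are illustrative, not the production scanner.

```python
# Bernoulli-sample rows, split them into columns, and recursively flatten rich
# nested data (e.g. JSON blobs) into sub-columns so that each nested field can
# be classified independently.
import json
import random

def bernoulli_sample(rows, p=0.01, seed=42):
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]

def flatten_value(value, prefix, out):
    """Recursively expand dicts/lists/JSON strings into flat sub-columns."""
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
            if isinstance(parsed, (dict, list)):
                return flatten_value(parsed, prefix, out)
        except (ValueError, TypeError):
            pass
    if isinstance(value, dict):
        for k, v in value.items():
            flatten_value(v, f"{prefix}.{k}", out)
    elif isinstance(value, list):
        for v in value:
            flatten_value(v, prefix, out)
    else:
        out.setdefault(prefix, []).append(value)
    return out

def scan_rows(sampled_rows):
    """Returns {column_path: [values...]} for independent per-column classification."""
    columns = {}
    for row in sampled_rows:               # row is a dict of column -> raw value
        for col, value in row.items():
            flatten_value(value, col, columns)
    return columns
```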



For each row selected from the data asset, the classification system extracts float and text features from the content and associates each feature back with the column it was taken from. The result of the feature extraction step is a map of all the features for each column found in the data asset.



Why features?



The concept of features is key. Instead of float and text features, we could simply pass around raw string samples fetched directly from each data asset, and machine learning models could be trained directly on those samples rather than on hundreds of feature computations that only approximate them. We chose features anyway, for several reasons:



  1. Cost: data samples can be thousands of characters long, so storing them and passing them through every part of the system would be expensive in memory and compute; a compact, common set of features can be combined and moved around far more cheaply.
  2. Aggregation: numeric features can be normalized and aggregated across columns and across different scans, which raw samples do not allow.
  3. Extensibility: a common feature set gives us the flexibility to add and discover new data types in the future without changing how data flows through the system.



The features are then sent to a prediction service, where we use rule-based classification and machine learning to predict the data labels of each column. The service relies on both rule classifiers and machine learning, and selects the best prediction produced by each predictor.



Rule classifiers are hand-written heuristics that use counts and weights to normalize a score for each data type into the 0 to 100 range. Once such an initial score has been generated for every data type, and provided the column name associated with the data does not fall into any "deny list", the rule classifier selects the data type with the highest normalized score.
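A minimal sketch of such a rule classifier, with hypothetical feature names, scoring functions, and deny lists; these are not the production heuristics.

```python
# Each data type gets a score normalized to 0-100 from the column's features;
# types whose column name hits a deny list are excluded, and the highest-scoring
# remaining type wins.
def rule_classify(column_name, features, rules, deny_lists):
    """rules: {data_type: score_fn(features) -> float in [0, 100]};
    deny_lists: {data_type: set of forbidden column-name substrings}."""
    scores = {}
    for data_type, score_fn in rules.items():
        denied = any(term in column_name.lower() for term in deny_lists.get(data_type, ()))
        if not denied:
            scores[data_type] = max(0.0, min(100.0, score_fn(features)))
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]

# Example heuristic: score EMAIL by the normalized share of email-looking values.
rules = {"EMAIL": lambda f: 100.0 * f.get("email_pattern_ratio", 0.0)}
deny_lists = {"EMAIL": {"hash", "id"}}
print(rule_classify("contact_email", {"email_pattern_ratio": 0.92}, rules, deny_lists))
```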



Because of the complexity of classification, relying purely on manual heuristics results in low accuracy, especially for unstructured data. For this reason, we developed a machine learning system to handle the classification of unstructured data such as user-generated content and addresses. Machine learning allowed us to start moving away from manual heuristics and to apply additional data signals (e.g. column names, data lineage), significantly improving detection accuracy. We dive deeper into our machine learning architecture later.



The prediction service stores the results for each column along with metadata about the time and state of the scan. Consumers and downstream processes that depend on this data can read it either from a daily published dataset, which aggregates the results of all scan jobs, or from the real-time data catalog API. The published predictions are the foundation for automatic enforcement of privacy and security policies.



Finally, once the prediction service has written all the data and all predictions have been saved, our data catalog API can return all data type predictions for an asset in real time. Each day, the system publishes a dataset containing all the latest predictions for every asset.



Volatile data



Although the process above is designed for persistent assets, non-persistent traffic is also considered part of an organization's data and can be important. For this reason, the system provides an online API for generating real-time classification predictions for any volatile traffic. Real-time prediction is widely used for classifying outgoing traffic, incoming data to machine learning models, and advertiser data.



The API takes two main arguments: a grouping key and the raw data to be predicted. The service performs the same feature extraction described above and groups features together under the same key. These features are also kept in a persistent cache for failover. For each grouping key, the service makes sure it has seen enough samples before calling the prediction service, following the process described above.
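A minimal sketch of what such an online flow could look like, with an assumed sample threshold and hypothetical extract/predict callables; this is not Facebook's actual API.

```python
# Callers pass a grouping key and raw data; features are buffered per key, and
# only once enough samples have been seen is the prediction service invoked.
from collections import defaultdict

MIN_SAMPLES = 100          # assumed threshold, for illustration only

class OnlineClassifier:
    def __init__(self, extract_features, predict):
        self.extract_features = extract_features    # raw payload -> feature dict
        self.predict = predict                      # list of feature dicts -> labels
        self.buffers = defaultdict(list)            # would also be persisted for failover

    def classify(self, grouping_key, raw_payload):
        self.buffers[grouping_key].append(self.extract_features(raw_payload))
        samples = self.buffers[grouping_key]
        if len(samples) < MIN_SAMPLES:
            return None                              # not enough samples yet for this key
        return self.predict(samples)
```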



Optimization



We use libraries and read-optimization techniques for warm storage [2] to scan some repositories and to ensure there is no disruption to other users accessing the same repository.



For extremely large tables (50+ petabytes), despite all of the optimizations and memory efficiency, the system struggles to scan and compute everything before it runs out of memory: the scan is computed entirely in memory, and intermediate results are not persisted during the scan. If a large table contains thousands of columns with unstructured blobs of data, the job may fail due to insufficient memory while computing predictions for the whole table, which reduces coverage. To combat this, we optimized the system to use scan speed as a proxy for how well it is handling the current load. We use the speed as a predictive mechanism to anticipate memory problems and compute the feature maps proactively, while using less data than usual.
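A minimal sketch of the idea, with assumed thresholds and hypothetical extract/emit callbacks; the production logic is more involved.

```python
# When throughput drops below a threshold (a proxy for memory pressure), the
# scanner proactively emits the feature maps computed so far and continues with
# a smaller sample, instead of failing the whole job out of memory.
import random
import time

def adaptive_scan(row_iterator, extract, emit,
                  min_rows_per_sec=1000.0, reduced_sample_rate=0.1):
    """extract(row, features) updates per-column feature maps in place;
    emit(features) sends a partial feature map to the prediction service."""
    features, processed, start = {}, 0, time.time()
    sample_rate = 1.0
    for row in row_iterator:
        if random.random() > sample_rate:
            continue
        extract(row, features)
        processed += 1
        if processed % 10_000 == 0:
            speed = processed / max(time.time() - start, 1e-6)
            if speed < min_rows_per_sec:            # slowdown as a proxy for memory pressure
                emit(features)                      # flush what we have so far
                features = {}
                sample_rate = reduced_sample_rate   # keep scanning, but on less data
    emit(features)
```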



Data signals



The classification system is only as good as the signals from the data. Here we will look at all the signals used by the classification system.



  • Content-based: the first and most important signal is, of course, the content itself. For every data asset we scan, a Bernoulli sample is taken and features are extracted from the data content. Many features come from the content; any number of floats are possible, representing counts of how many times a particular kind of pattern was seen. For example, we might have features for the number of emails seen in a sample, or for how many emoticons were seen in a sample. These feature counts can be normalized and aggregated across different scans (a small sketch of such features follows this list).
  • Column names and other metadata: the name of a column and other metadata about an asset are themselves useful signals when predicting its data type.
  • Data lineage: the origin of the data is another signal; when data is copied from an asset whose type is already known, that signal propagates to the copy.
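A minimal sketch of such content-based features, with illustrative patterns; the regexes and feature names are assumptions, not the production feature set.

```python
# Counts of pattern hits in the sampled values, normalized by sample size so
# they can be aggregated across scans.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
EMOJI_RE = re.compile(r"[\U0001F600-\U0001F64F]")   # a narrow emoji range, for illustration

def content_features(values):
    n = max(len(values), 1)
    return {
        "email_pattern_ratio": sum(bool(EMAIL_RE.search(str(v))) for v in values) / n,
        "emoji_pattern_ratio": sum(bool(EMOJI_RE.search(str(v))) for v in values) / n,
        "avg_length": sum(len(str(v)) for v in values) / n,
    }

print(content_features(["alice@example.com", "bob@example.org", "n/a"]))
```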




An important component is a rigorous methodology for measuring metrics. The main metrics for iterating on classification improvements are the precision and recall of each label, with the F2 score being the most important.



Calculating these metrics requires an independent methodology for labeling data assets, one that does not depend on the system itself but can be compared against it directly. Below, we describe how we collect ground truth across Facebook and use it to train our classification system.



Collecting ground truth



We accumulate ground truth from each of the sources listed below into its own table. Each table is responsible for aggregating the latest observed values from that particular source. Each source has data quality checks to ensure that the observed values are high quality and carry up-to-date data type labels.



  • Logging platform configurations: certain fields in Hive tables are populated with data that is known to belong to a certain type. Using and propagating this information serves as a reliable source of ground truth.
  • Data producers: some services produce data of a known type, for example location services that write GPS coordinates; such sources provide another set of ground truth labels.


We combine every major source of ground truth into a single corpus. The biggest challenge with ground truth is making sure it is representative of the data store; otherwise, the classification engines can overfit. To combat this, all of the sources above are used to provide balance when training models or computing metrics. In addition, human labelers uniformly sample columns across the store and label the data accordingly, keeping the collection of ground truth values unbiased.



Continuous integration



To enable rapid iteration and improvement, it is important to always measure system performance in real time. Because every classification improvement is measured against the current system, we can tactically target data for further improvements. Here we look at how the system closes the feedback loop driven by ground truth data.



When the scheduling system encounters an asset that has a label from a trusted source, we schedule two tasks. The first uses our production scanner, and thus the production feature set; the second uses the latest build of the scanner with the latest features. Each task writes its output to its own table, tagging the versions along with the classification results.



This is how we compare the results of the classification of the release candidate and the production model in real time.



While the datasets compare the RC and PROD features, the prediction service also logs many variants of its ML classification engine: the most recently built machine learning model, the current production model, and any experimental models. The same approach lets us "slice" across different versions of the model (independently of our rule classifiers) and compare metrics in real time, making it easy to tell when an ML experiment is ready to go to production.



Every night, the RC features computed for that day are sent to the ML training pipeline, where a model is trained on the latest RC features and its performance is evaluated against the ground truth dataset.



Every morning the model finishes training and is automatically published as an experimental model, joining the list of experimental versions.



Some results



More than 100 different data types are labeled with high accuracy. Well-structured types such as emails and phone numbers are classified with an F2 score above 0.95. Free-form data types such as user-generated content and names also perform very well, with F2 scores above 0.85.



A large number of separate columns of persistent and volatile data are classified daily across all stores. More than 500 terabytes are scanned daily in over 10 data stores. Most of these repositories have over 98% coverage.



Classification has also become very efficient over time: classification jobs in the persistent offline flow take an average of 35 seconds from scanning an asset to computing predictions for every column.





Figure 2. A diagram of the continuous integration flow, showing how RC features are generated and sent to the model.





Figure 3. High-level diagram of a machine learning component.



Machine learning system component



In the previous section we took a deep dive into the architecture of the whole system, highlighting scale, optimization, and the offline and online data flows. In this section we look at the prediction service and describe the machine learning system that powers it.



With over 100 data types and some unstructured content such as post data and user-generated content, relying purely on manual heuristics results in subpar classification accuracy, especially for unstructured data. For this reason, we also developed a machine learning system to deal with the complexity of unstructured data. Machine learning lets us move away from manual heuristics and work with features and additional data signals (for example, column names and data lineage) to improve accuracy.



The model learns vector representations (embeddings) [3] over dense and sparse features separately. These are then combined to form a vector that passes through a series of batch normalization [4] and non-linearity steps to produce the final output. The end result is a floating point number in [0, 1] for each label, indicating the probability that the example belongs to the given sensitivity type. Using PyTorch for the model allowed us to move faster and let developers outside the team quickly make and test changes.
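A minimal PyTorch sketch of this kind of architecture, with assumed layer sizes and feature shapes; it is not Facebook's actual model. Sparse text features go through an embedding bag, dense float features through a linear layer, the two vectors are concatenated, passed through batch-norm plus non-linearity blocks, and finished with a per-label sigmoid.

```python
import torch
import torch.nn as nn

class ColumnClassifier(nn.Module):
    def __init__(self, vocab_size, num_dense, num_labels, embed_dim=64, hidden=128):
        super().__init__()
        # Sparse (text) features as token ids -> pooled embedding
        self.sparse = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        # Dense (float) features -> same-sized vector
        self.dense = nn.Sequential(nn.Linear(num_dense, embed_dim), nn.ReLU())
        self.trunk = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, sparse_ids, sparse_offsets, dense_feats):
        s = self.sparse(sparse_ids, sparse_offsets)      # (batch, embed_dim)
        d = self.dense(dense_feats)                      # (batch, embed_dim)
        x = torch.cat([s, d], dim=1)
        return torch.sigmoid(self.head(self.trunk(x)))   # (batch, num_labels), each in [0, 1]
```

In training, such per-label outputs would typically be paired with a binary cross-entropy loss for each label.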



When designing the architecture, it was important to model sparse (e.g. text) and dense (e.g. numeric) features separately because of their intrinsic differences. For the final architecture, it was also important to run a parameter sweep to find the optimal learning rate, batch size, and other hyperparameters. The choice of optimizer was also an important hyperparameter: we found that the popular Adam optimizer often leads to overfitting, whereas a model trained with SGD is more stable. There were additional nuances that we had to build directly into the model, for example static rules that make the model produce a deterministic prediction when a feature has a certain value. These static rules are defined by our clients. We found that including them directly in the model resulted in a more self-contained and robust architecture than implementing a post-processing step to handle these special edge cases. Note also that these rules are disabled during training so that they do not interfere with the gradient descent training process.
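A minimal sketch (assumed logic, not the production implementation) of applying client-defined static rules inside the model at inference time only, so they do not interfere with gradient descent.

```python
def apply_static_rules(probs, features, rules, training):
    """probs: (batch, num_labels) tensor of model outputs;
    features: {feature_name: (batch,) tensor of feature values};
    rules: list of (feature_name, expected_value, label_index, forced_prob)."""
    if training:
        return probs  # rules are disabled during training
    probs = probs.clone()
    for feature_name, expected, label_idx, forced in rules:
        hit = features[feature_name] == expected   # boolean mask over the batch
        probs[hit, label_idx] = forced             # deterministic prediction for matching rows
    return probs
```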



Problems



Collecting high-quality ground truth data was one of the challenges. The model needs ground truth for every class so that it can learn associations between features and labels. In the previous section, we discussed data collection methods both for measuring the system and for training models. Analysis showed that data classes such as credit card and bank account numbers are not very common in our storage, which makes it difficult to collect large amounts of ground truth for training. To address this, we developed processes for generating synthetic ground truth for these classes. We generate such data for sensitive types, including SSNs, credit card numbers, and IBAN numbers, that the model previously could not predict. This approach lets us handle sensitive data types without the privacy risk that comes with storing real sensitive data.
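As an illustration, here is a minimal sketch of generating synthetic ground truth for one rare sensitive type: Luhn-valid, card-like 16-digit numbers that can serve as positive training examples without touching real user data. The prefix and format are arbitrary, not the production generator.

```python
import random

def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:          # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def synthetic_card_number() -> str:
    # 15 random payload digits (arbitrary "4" prefix) plus a valid check digit
    payload = "4" + "".join(random.choice("0123456789") for _ in range(14))
    return payload + luhn_check_digit(payload)

print(synthetic_card_number())  # a 16-digit Luhn-valid number, safe to use as training data
```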



Aside from the ground truth issues, there are open architectural questions we are working on, such as change isolation and early stopping. Change isolation matters so that when changes are made to different parts of the network, their impact is confined to specific classes and does not broadly affect overall prediction performance. Improving the early stopping criteria is also critical, so that we can stop training at a point that is stable for all classes, rather than at a point where some classes start to overfit while others have not yet converged.



Feature importance



When a new feature is introduced into the model, we want to know its overall impact on the model. We also want the predictions to be interpretable by humans, so that we can understand exactly which features are used for each data type. For this, we developed and integrated per-class feature importance for the PyTorch model. Note that this differs from overall feature importance, which is usually supported out of the box, because overall importance does not tell us which features matter for a particular class. We measure the importance of a feature by computing the increase in prediction error after permuting that feature. A feature is "important" if permuting its values increases the model error, because in that case the model relied on the feature for its predictions. A feature is "unimportant" if shuffling its values leaves the model error unchanged, because in that case the model ignored it [5].
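A minimal sketch of per-class permutation feature importance, assuming a generic predict function over a NumPy feature matrix; the interfaces are assumptions, not the production implementation.

```python
import numpy as np

def per_class_permutation_importance(predict_fn, X, y, class_idx, n_repeats=5, rng=None):
    """predict_fn(X) -> (n_samples, n_classes) probabilities; y is one-hot (n_samples, n_classes).
    Returns the increase in squared error for class_idx after permuting each feature."""
    if rng is None:
        rng = np.random.default_rng(0)
    base_error = np.mean((predict_fn(X)[:, class_idx] - y[:, class_idx]) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        errors = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the feature/label association
            err = np.mean((predict_fn(X_perm)[:, class_idx] - y[:, class_idx]) ** 2)
            errors.append(err)
        importances[j] = np.mean(errors) - base_error      # error increase = importance
    return importances
```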



Per-class feature importance makes the model interpretable: we can see what the model pays attention to when predicting a label. For example, when we analyze the ADDR class, we check that an address-related feature such as AddressLinesCount ranks high in the per-class feature importance table, so that our human intuition matches what the model has learned.



Evaluation



It is important to define a single metric for success. We chose F2, which balances recall and precision while weighting recall more heavily. Recall matters more than precision in our privacy use case, because it is critical for the team not to miss any sensitive data (while maintaining reasonable precision). Actual F2 performance numbers for our model are beyond the scope of this article; however, with careful tuning we achieve a high (0.9+) F2 score for the most important sensitive classes.
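For reference, a small sketch of the F-beta score computed from precision and recall; with beta = 2, recall is weighted more heavily than precision.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta = 2 gives the F2 score used in this article."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(precision=0.85, recall=0.95))  # ~0.93, dominated by the high recall
```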



Related work



There are many algorithms for automatic classification of unstructured documents using various methods such as pattern matching, document similarity search, and various machine learning methods (Bayesian, decision trees, k-nearest neighbors, and many others) [6]. Any of these can be used as part of the classification. However, the problem is scalability. The classification approach in this article is biased towards flexibility and performance. This allows us to support new classes in the future and keep latency low.



There is also plenty of work on data fingerprinting. For example, the authors of [7] describe a solution focused on catching leaks of confidential data; their main assumption is that a data fingerprint can be matched against a set of known sensitive data. The authors of [8] describe a similar privacy-leak problem, but their solution is tied to a specific Android architecture and classifies data only when user actions lead to personal information being sent, or when user data leaks from the underlying application. Our situation is somewhat different, because user data can also be highly unstructured, so we need a more sophisticated technique than fingerprinting.



Finally, to deal with the scarcity of some types of sensitive data, we introduced synthetic data. There is a large body of literature on data augmentation; for example, the authors of [9] investigated the role of noise injection during training and observed positive results in supervised learning. Our approach differs because in a privacy setting injecting noisy data can be counterproductive, so we instead focus on high-quality synthetic data.



Conclusion



In this article, we presented a system that can classify a piece of data, which lets us build systems that enforce privacy and security policies. We have shown that scalable infrastructure, continuous integration, machine learning, and high-quality ground truth data are key to the success of many of our privacy initiatives.



There are many directions for future work. They include supporting unschematized data (files), classifying not only the data type but also the sensitivity level, and using self-supervised learning during training to generate targeted synthetic examples, which in turn would help the model reduce its loss the most. Future work may also focus on the investigation workflow, where we go beyond detection and provide root-cause analysis for various privacy violations. This would help in cases such as sensitivity analysis (i.e. whether the privacy sensitivity of a data type is high, for example a user's IP address, or low, for example an internal Facebook IP).



Bibliography
  1. David Ben-David, Tamar Domany, and Abigail Tarem. Enterprise data classification using semantic web technologies. In Peter F. Patel-Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Z. Pan, Ian Horrocks, and Birte Glimm, editors, The Semantic Web – ISWC 2010, pages 66–81, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
  2. Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. f4: Facebook’s warm BLOB storage system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 383–398, Broomfield, CO, October 2014. USENIX Association.
  3. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
  4. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
  5. Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001.
  6. Thair Nu Phyu. Survey of classification techniques in data mining.
  7. X. Shu, D. Yao, and E. Bertino. Privacy-preserving detection of sensitive data exposure. IEEE Transactions on Information Forensics and Security, 10(5):1092–1103, 2015.
  8. Zhemin Yang, Min Yang, Yuan Zhang, Guofei Gu, Peng Ning, and Xiaoyang Wang. Appintent: Analyzing sensitive data transmission in android for privacy leakage detection. pages 1043–1054, 11 2013.
  9. Qizhe Xie, Zihang Dai, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation.















