A data-driven approach to hardening Android security





We work hard to keep the Android platform secure for all users on all devices. Every month, security updates are released with fixes for vulnerabilities reported through the Vulnerability Rewards Program (VRP). Beyond fixing individual bugs, we also harden the platform against other potential vulnerabilities, for example with compiler-based mitigations and an improved testing environment. The Android ecosystem includes devices with a wide variety of capabilities, so every hardening decision must balance trade-offs and take the available data into account.



This article explains how we select security controls for specific circumstances and how they are implemented.



Keeping Android secure takes a holistic approach. To make it harder for potential vulnerabilities to be exploited, we make data-driven decisions using multiple principles and techniques. When it comes to hardening the platform, the following questions need to be answered:



  • What data do we have and how can they help us make decisions?
  • What attack prevention tools are available? How can they be improved? In what situations should they be applied?
  • What problems can arise when using certain security tools? What trade-offs need to be taken into account?


The principles we use in our security choices reflect our overall approach to protecting users of the Android platform.



Make data-driven security decisions



To find out which platform components would benefit from particular mitigations, we draw on several data sources. The Android Vulnerability Rewards Program (VRP) is perhaps the most informative of them. Our security engineers analyze every vulnerability reported through the program and determine its root cause and severity (based on the published severity guidelines). Internal and external bug reports are another source: they help identify vulnerable components as well as code patterns that frequently lead to bugs. Knowing what these patterns look like, and how severe and frequent the resulting bugs are, lets us make informed decisions about which mitigations will be most effective.





High and critical severity vulnerabilities fixed in Android Security Bulletins in 2019.



However, we cannot rely solely on vulnerability reports, because they are inherently biased. Security researchers tend to focus on "hot" areas where vulnerabilities have already been found (for example, Stagefright), or on areas where vulnerabilities are easy to find with off-the-shelf tools. For example, if a security analysis tool is published on GitHub, many researchers will use it.



We try to spread our hardening efforts evenly, so our teams also focus on the less-explored and more complex parts of the platform. In addition, automated fuzzing runs continuously on virtual machines and physical Android devices, which lets us find and fix bugs early in development. When deciding which tools to use, we also analyze the root causes and severity of the issues we find.



As part of the Android VRP, we encourage researchers to submit full exploit chains, which show an entire attack from start to finish. Real attacks usually combine several vulnerabilities, and exploit chains make these combinations clearly visible, so they are very informative. Our security engineers analyze both entire chains and their individual links, looking for new attack strategies. This analysis helps identify mitigations that prevent vulnerabilities from being chained together (for example, address space layout randomization and Control Flow Integrity). It also shows whether an attack could be mitigated by restricting the resources a process can access.



Some vulnerabilities can appear in several chains, and in different positions within them. Defense in depth is therefore the better approach: it reduces the value of individual vulnerabilities and lengthens exploit chains. This makes it harder for an attacker to build a working chain and carry out an attack.



To understand current security threats and anticipate future trends, we stay closely connected to the security community, in particular by:



  • working closely with third-party security researchers;
  • reading security publications and attending conferences;
  • studying the techniques used by malware;
  • tracking the latest developments in security research;
  • contributing to external projects such as KSPP, syzbot, LLVM, and Rust.


This gives us a better understanding of our overall security strategy, the effectiveness of existing mitigations, and opportunities for improvement.



Why stronger protection is necessary



Strengthening protection and preventing attacks



Analyzing the data helps identify areas where an effective mitigation can eliminate entire classes of vulnerabilities. For example, if some platform components accumulate many vulnerabilities caused by integer overflows, they should be built with an Undefined Behavior Sanitizer (UBSan) check such as the Integer Overflow Sanitizer. If memory-corruption vulnerabilities are common, hardened memory allocators (enabled by default in Android 11) and exploit mitigations such as Control Flow Integrity help make memory overflows and use-after-free vulnerabilities harder to exploit.
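The following minimal C sketch makes the integer-overflow class concrete. It shows an unsigned subtraction that wraps when its input is zero. This is the same kind of `0 - 1` on `size_t` that the SurfaceFlinger diagnostic later in this article reports. The function names are hypothetical. The checked variant uses `__builtin_sub_overflow`, a GCC/Clang builtin, to reject the input explicitly. The sanitizer takes a different route: it instruments the unchecked version automatically and stops at the point of overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, unchecked: when count == 0, count - 1 silently
 * wraps to SIZE_MAX. With the unsigned integer overflow check enabled,
 * this subtraction would be reported instead of producing a huge index. */
static size_t last_index_unchecked(size_t count) {
    return count - 1;
}

/* Hypothetical helper, checked: __builtin_sub_overflow stores the result
 * in *out and returns true if the subtraction overflowed, so we return
 * false and let the caller reject the input. */
static bool last_index_checked(size_t count, size_t *out) {
    return !__builtin_sub_overflow(count, (size_t)1, out);
}
```

When the code is built without a sanitizer, `last_index_unchecked(0)` returns `SIZE_MAX`, and a later array access would silently go out of bounds. `last_index_checked(0, &idx)` returns `false`, and `last_index_checked(5, &idx)` stores `4`.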



Before describing how we use this data, here is a classification of platform hardening tools. They fall into the following main categories (although some tools and techniques apply to more than one):



  • Exploit mitigations
    • Deterministic runtime prevention detects undefined or unwanted behavior and aborts program execution. This prevents memory corruption while leaving only the possibility of less severe crashes. These tools can often be applied selectively and still be effective, because they target individual bugs. Examples: Integer Overflow Sanitizer and BoundsSanitizer.
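    • Exploitation technique mitigations target the techniques attackers use to turn a vulnerability into a working exploit, rather than individual bugs. Examples include Control Flow Integrity (CFI).
  • Architectural decomposition splits large, privileged processes into smaller components, each holding only the privileges it needs. This limits privilege escalation.
  • Sandboxing and isolation restrict what a process can access, which limits the damage an attacker can do after compromising it.
  • Memory-safe languages. Unlike C and C++, languages such as Java, Kotlin, and Rust are memory-safe and rule out whole classes of memory-safety bugs. Using them reduces Android's exposure to the vulnerabilities common in C/C++ code.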
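Deterministic runtime prevention can be pictured as a hand-written check. The minimal C sketch below is not how BoundsSanitizer is implemented, because the compiler inserts equivalent checks automatically. It only shows the behavior: an out-of-bounds read is detected and execution stops instead of reading adjacent memory. `kTable`, `TABLE_LEN`, and `table_get` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fixed-size table standing in for any parser buffer. */
#define TABLE_LEN 4
static const uint8_t kTable[TABLE_LEN] = {10, 20, 30, 40};

/* Returns true if idx is a valid index into an array of len elements. */
static bool index_in_bounds(size_t idx, size_t len) {
    return idx < len;
}

/* Detect-and-abort: an invalid index ends the process deterministically,
 * which may crash the process but never corrupts or leaks memory. */
static uint8_t table_get(size_t idx) {
    if (!index_in_bounds(idx, TABLE_LEN)) {
        fprintf(stderr, "bounds check failed: index %zu, length %d\n",
                idx, TABLE_LEN);
        abort();
    }
    return kTable[idx];
}
```

This trade-off is the one described above. A bug that would have been exploitable becomes, at worst, a crash. That crash is why stability testing still matters even when false positives are impossible.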




Depending on the specific problem, we decide which of these tools to use and how. For example, all of them may be appropriate for a large process that handles untrusted data and performs complex parsing. The media frameworks are a good demonstration of how architectural decomposition can make exploitation harder and prevent privilege escalation.





Architectural decomposition and isolation of the media frameworks over time



Remotely reachable attack surfaces (NFC, Bluetooth, Wi-Fi, and media content) have traditionally contained the most severe vulnerabilities, so hardening them is a priority. Vulnerabilities in these components usually stem from the most common root causes seen in VRP data, and we recently enabled sanitizers in all of them.



Mitigations are also useful for libraries and processes that enforce or sit within security boundaries (for example, libbinder and the standard libraries libui, libcore, and libcutils), because these libraries are not tied to specific processes. However, these libraries are critical to system performance and stability. We therefore need strong confidence that a mitigation will improve security before applying it to them.



Finally, hardening the kernel is important given its high level of privilege. Every codebase has different characteristics and functionality, so each has a different likelihood of containing vulnerabilities. Stability and performance are the main criteria here: we apply only mitigations that are effective and do not degrade the user experience. That is why we carefully analyze all available kernel-related data before choosing a hardening strategy.

The data-driven approach has produced tangible results. After the Stagefright vulnerability was discovered in 2015, we began receiving reports of many other critical vulnerabilities in the Android media frameworks. To make matters worse, many of them were remotely reachable. In Android Nougat we carried out a large-scale architectural decomposition of the media frameworks, and we accelerated the fixing of vulnerabilities in media components. Thanks to these changes, there were no reports in 2020 of internet-reachable critical vulnerabilities in the media frameworks.



How the deployment decision is made



Naturally, it makes sense to focus on the attack prevention tools that work best. To identify them, we look at how each tool affects performance, how much work is required to deploy and support it, and whether it will negatively affect system stability.



Performance



When choosing a mitigation, we need to understand how it affects device performance. If a mitigation adds too much overhead to a component or to the system as a whole, battery life and overall performance can suffer. This is especially important for entry-level devices, which need strong security too. We therefore prefer the most effective mitigations with the smallest performance impact.



When evaluating performance, we look not only at CPU time but also at memory usage, code size, battery life, and UI jank. To make sure a mitigation performs well across the Android ecosystem, it is especially important to measure these parameters on entry-level devices.
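The CPU-time part of such an evaluation can be sketched in C as a tiny harness. The approach is to time the same workload in a baseline build and in a build with the mitigation enabled, then compare the two. This covers only processor time, not memory, code size, battery, or jank. `workload_stub` is a hypothetical stand-in for the real code under test (for example, a codec's decode routine). The standard C `clock()` is used for portability, although a real harness would likely use more runs and more careful timing.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Volatile sink so the compiler cannot optimize the workload away. */
static volatile uint64_t g_sink;

/* Hypothetical stand-in for the code being measured. Build it once
 * without and once with the mitigation under evaluation. */
static void workload_stub(void) {
    uint64_t acc = 0;
    for (uint32_t i = 0; i < 100000u; i++) {
        acc += i % 7u;   /* total stays far below UINT64_MAX: no overflow */
    }
    g_sink = acc;
}

/* Average processor time per call, in microseconds. */
static double cpu_us_per_call(void (*fn)(void), int iterations) {
    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        fn();
    }
    clock_t end = clock();
    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC / iterations;
}

/* Relative overhead of a mitigated build over a baseline: 0.05 means 5%. */
static double relative_overhead(double baseline, double mitigated) {
    return (mitigated - baseline) / baseline;
}
```

For instance, if the baseline build measures 100 µs per call and the mitigated build measures 110 µs, `relative_overhead` reports 0.10, a 10% overhead. Whether that is acceptable depends on the component, as discussed next.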



Where a mitigation is applied matters a great deal. For example, Binder is the primary interprocess communication mechanism, so any added overhead there is immediately felt across the whole device. A media codec that only needs to decode frames at the content's frame rate is a different case. If it can decode much faster than frames are displayed, the extra overhead is far less critical.
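The media-codec argument comes down to simple arithmetic, shown below as a minimal C sketch with hypothetical numbers. At 30 fps, each frame has about 33.3 ms. A codec that decodes a frame in 5 ms can absorb even a 50% mitigation overhead (7.5 ms). A codec that needs 30 ms per frame cannot absorb 20% (36 ms). Real decisions rely on measured benchmarks rather than assumed figures like these.

```c
#include <assert.h>
#include <stdbool.h>

/* Time available per frame, in milliseconds, at a given playback rate. */
static double frame_budget_ms(double fps) {
    return 1000.0 / fps;
}

/* Returns true if decoding still fits in the per-frame budget after a
 * mitigation adds overhead_fraction of extra work (0.10 means 10%). */
static bool fits_frame_budget(double decode_ms, double overhead_fraction,
                              double fps) {
    return decode_ms * (1.0 + overhead_fraction) <= frame_budget_ms(fps);
}
```

No comparable headroom exists for Binder. Every transaction is on some caller's critical path, so per-call overhead accumulates across the whole system instead of disappearing into idle time between frames.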



We use benchmarks to measure the performance impact of a mitigation. If a component has no benchmark, we create one, for example by invoking the affected codec to decode a media file. If testing shows unacceptable overhead, there are several options:



  • Selectively disable the mitigation for functions that have a significant performance impact. Typically only a few functions account for most of the run time. Excluding them preserves performance while keeping most of the security benefit. This approach has been applied to one of the media codecs. To reduce the risk, the excluded functions should be reviewed for bugs beforehand.
  • Optimize the mitigation itself. This often requires compiler changes. For example, our team optimized the Integer Overflow Sanitizer and BoundsSanitizer.
  • Tune the mitigation. Some mitigations, such as the heap-hardening features built into the Scudo allocator, can be tuned to improve performance.
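The first option, selectively excluding a hot function, can be sketched with Clang's `no_sanitize` attribute. The helper below is hypothetical: it computes a 32-bit FNV-1a hash whose multiplication wraps modulo 2^32 by design. Integer-overflow checks inside it would add a per-byte cost and would only ever report intentional overflows. This sketch shows only the source-level attribute. The exact exclusion mechanism used for the media codec mentioned above may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clang's no_sanitize attribute excludes one function from the named
 * sanitizer checks. The macro expands to nothing on other compilers. */
#if defined(__clang__)
#define NO_SANITIZE_INTEGER __attribute__((no_sanitize("integer")))
#else
#define NO_SANITIZE_INTEGER
#endif

/* Hypothetical hot-path helper: 32-bit FNV-1a hash. The multiply is
 * meant to wrap modulo 2^32, so overflow checks here would be pure cost. */
NO_SANITIZE_INTEGER
static uint32_t fnv1a32(const uint8_t *data, size_t len) {
    uint32_t hash = 2166136261u;      /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 16777619u;            /* FNV prime; wraps by design */
    }
    return hash;
}
```

The rest of the component keeps its checks. Only this function, whose overflow behavior is known and reviewed, opts out.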


Many of these improvements required changes to LLVM, so they benefit not only Android but also the wider LLVM community.



Deployment and support



When choosing a mitigation, we weigh not only its security benefit and performance cost but also the cost of deploying and supporting it over the long term.



Impact of security measures on the stable operation of the system



It is important to know whether a mitigation can produce false positives. For example, if the Bounds sanitizer triggers, an illegal out-of-bounds access has definitely occurred (although it may not have been exploitable). With the Integer Overflow sanitizer, false positives are possible, because integer overflow is often intentional and harmless.



Therefore, we also consider how a mitigation affects system stability. Whether a crash comes from a false positive or a real security issue, the user is inconvenienced either way. Here again, it matters which components a mitigation is applied to, because failures in some components affect system stability far more than others. If a mitigation crashes a media codec, the video simply stops playing. However, if netd crashes while an update is being installed, the device may fail to boot. Even for mitigations where false positives are not a concern (such as the Bounds sanitizer), we still test extensively to make sure the device remains stable. For example, an off-by-one error may not normally cause a crash, but the Bounds sanitizer will abort the process, which can harm system stability.



It is also important to know whether we can identify in advance every place where a mitigation might cause a crash. For the Integer Overflow sanitizer, for example, this is very hard to predict without extensive testing. It is difficult to tell which integer overflows are intentional (and therefore allowed) and which could lead to vulnerabilities.



Support



Beyond the immediate challenges of deploying a mitigation, we also consider the cost of supporting it over the long term. We estimate the time needed to integrate it with existing systems, enable and debug it, deploy it to devices, and maintain it after launch. SELinux is a good example. Writing a policy takes considerable time and effort, and that policy must then be maintained for years as code changes and features are added or removed.



We aim to keep the stability impact of mitigations low and to give developers the information they need. To that end, we are improving existing algorithms to reduce false positives, and we publish documentation at source.android.com. Making failures easier to debug also reduces the maintenance burden on developers. For example, to make UBSan errors easier to identify, we enabled support for the UBSan minimal runtime by default in the Android build system. The minimal runtime was originally added to LLVM by other Google developers specifically for this purpose. When a program aborts because of the Integer Overflow sanitizer, the following line is appended to the SIGABRT abort message:



Abort message: 'ubsan: sub-overflow' 
      
      





On seeing this message, developers know to enable diagnostic mode to print detailed information about the failure:



frameworks/native/services/surfaceflinger/SurfaceFlinger.cpp:2188:32: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'size_t' (aka 'unsigned long')
      
      





Similarly, SELinux provides the audit2allow tool, which proposes policy rules that would allow operations that were denied:



adb logcat -d | audit2allow -p policy
 #============= rmt ==============
 allow rmt kmem_device:chr_file { read write };
      
      





audit2allow does not always suggest the right rules, but it is a great help to developers who are new to SELinux.



Conclusion



Every Android release brings new mitigations that protect the entire ecosystem while preserving the performance and stability users need. Data analysis plays an important role in this. We hope this article has helped you understand the challenges of deploying new mitigations and how we address them.






Thanks to our colleagues and the authors of this article: Kevin Deus, Joel Galenson, Billy Lau, and Ivan Lozano of Android Security & Privacy. Special thanks to Zviad Kardava and Jeff Vander Stoep for their help in preparing this article.


