Pysa: How to Avoid Security Issues in Python Code

On August 7, Facebook introduced Pysa, an open source, security-focused static analyzer that handles the millions of lines of code behind Instagram. This post discloses the tool's limitations, touches on design decisions and, of course, covers the means that help avoid false positives. It shows a situation where Pysa is most useful, as well as code where the analyzer does not apply. Details from the Facebook Engineering blog are below the cut.

Last year, we wrote about how we built Zoncolan, a static analysis tool that analyzes over 100 million lines of Hack code and helps engineers prevent thousands of potential security issues. That success inspired Pysa, the Python Static Analyzer. The analyzer is built on top of Pyre, Facebook's type checker for Python. Pysa tracks how data flows through code. Data flow analysis is useful because security and privacy issues can often be modeled as data flowing into a place where it shouldn't go.

Pysa helps identify many types of problems. The analyzer checks if the code correctly uses certain internal structures to prevent access or disclosure of user data based on technical privacy policies. In addition, the analyzer detects common web application security issues such as XSS and SQL injection. Like Zoncolan, the new tool has helped scale up Python application security efforts. This is especially true for Instagram.

Pysa on Instagram

Facebook's largest Python repository is the millions of lines of code that run Instagram's servers. When Pysa runs on a code change proposed by a developer, it returns results in about an hour, rather than the weeks or months a manual review might take. This makes it possible to find and prevent a problem quickly enough that it never reaches the codebase. The results are sent directly to the developer or to security engineers, depending on the type of problem and the signal-to-noise ratio in the particular situation.

Pysa and Open Source

Pysa's source code and many issue definitions are open, so other developers can analyze the code of their own projects. Because we use open source server-side frameworks such as Django and Tornado ourselves, Pysa can find security issues in projects built on these frameworks from the first run. Using Pysa with frameworks that don't yet have coverage is usually as easy as adding a few lines of configuration: you just need to tell the analyzer where data enters the server.

Pysa has been used to detect issues such as CVE-2019-19775 in open source Python projects. We also worked with the Zulip project and included Pysa in its codebase.

How it works

Pysa is designed with lessons learned from Zoncolan. It uses the same algorithms to perform static analysis and even shares code with Zoncolan. Like Zoncolan, Pysa tracks the flow of data through a program. The user defines sources, where important data originates, and sinks, the places that data should not reach. In security applications, the most common kinds of sources are the points where user-controlled data enters the application, such as the HttpRequest.GET dictionary in Django. Sinks are usually much more varied and can include APIs that execute code or touch the system, for example eval or os.open. Pysa iteratively performs rounds of analysis to build summaries that determine which functions return data from a source and which functions have parameters that reach a sink. When the analyzer finds that a source eventually connects to a sink, it reports an issue. This process can be visualized as a tree with the issue at the top and the sources and sinks at the leaves.

To perform interprocedural analysis, that is, to follow the flow of data across function calls, the analyzer needs to map function calls to their implementations. To do this, it uses all the information available in the code, including optional static types where present. We worked with Pyre to obtain this information. While Pysa relies heavily on Pyre and both tools share the same repository, it is important to note that these are separate products with separate applications.
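The rounds of summary building described above can be illustrated with a toy fixpoint computation. This is only a sketch under simplifying assumptions, not Pysa's actual algorithm or data model: the `Function` record, the `SOURCE` and `SINK` markers, and the function names in `program` are all hypothetical, and each function is reduced to "what it returns" and "where its parameter goes".

```python
from dataclasses import dataclass, field

SOURCE = "<source>"  # hypothetical marker for user-controlled input
SINK = "<sink>"      # hypothetical marker for a dangerous operation


@dataclass
class Function:
    # Callees (or SOURCE) whose result this function returns.
    returns_from: set = field(default_factory=set)
    # Callees (or SINK) that receive this function's parameter.
    param_to: set = field(default_factory=set)
    # (producer, consumer) pairs: the producer's result is passed to the consumer.
    flows: list = field(default_factory=list)


def analyze(program):
    returns_source, reaches_sink = set(), set()
    changed = True
    while changed:  # repeat rounds until the summaries stop changing
        changed = False
        for name, fn in program.items():
            if name not in returns_source and any(
                c == SOURCE or c in returns_source for c in fn.returns_from
            ):
                returns_source.add(name)
                changed = True
            if name not in reaches_sink and any(
                c == SINK or c in reaches_sink for c in fn.param_to
            ):
                reaches_sink.add(name)
                changed = True
    # An issue is a local flow whose producer returns source data
    # and whose consumer eventually reaches a sink.
    issues = [
        (name, producer, consumer)
        for name, fn in program.items()
        for producer, consumer in fn.flows
        if (producer == SOURCE or producer in returns_source)
        and (consumer == SINK or consumer in reaches_sink)
    ]
    return returns_source, reaches_sink, issues


program = {
    "read_user_id": Function(returns_from={SOURCE}),
    "get_param": Function(returns_from={"read_user_id"}),
    "run_query": Function(param_to={SINK}),
    "load_pictures": Function(param_to={"run_query"}),
    "view": Function(flows=[("get_param", "load_pictures")]),
}
returns_source, reaches_sink, issues = analyze(program)
print(issues)
```

The point of the summaries is that `view` is flagged without re-analyzing the bodies of `get_param` or `load_pictures`: the analysis only consults what was already learned about them.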

False positives

Security engineers are the main users of Pysa at Facebook. Like any engineers working with automated bug detection tools, we had to decide how to deal with false positives (a signal, but no problem) and false negatives (a problem, but no signal).

Pysa's design aims to avoid missing problems and to detect as many real problems as possible. However, reducing the number of false negatives can require trade-offs that increase the number of false positives. Too many false positives cause alert fatigue and create the risk that real problems are overlooked in the noise. Pysa has two tools for removing unwanted signals: sanitizers and features.

A sanitizer is a blunt tool. It tells the analyzer to stop following a data flow once the flow has passed through a given function or attribute. Sanitizers let you encode domain knowledge about transformations that always make data safe from a security and privacy perspective.

Features are subtler: they are small pieces of metadata that Pysa attaches to data flows as it tracks them. Unlike sanitizers, features do not remove issues from the analysis results. Instead, features and other metadata can be used to filter results after the analysis. Filters are usually written for a specific source-sink pair, to ignore issues where the data has already been made safe for a specific type of sink (but not for all types).
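The difference between the two mechanisms can be shown with a small model. This is a hedged sketch, not Pysa's configuration format or internals: the `Flow` record, the `tags` field (standing in for the metadata attached to a flow), and the names `"XSS"`, `"SQL"` and `"html_escaped"` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    sink: str
    tags: frozenset = frozenset()  # metadata attached while tracking


def through_sanitizer(flow):
    # A sanitizer ends tracking: the flow is dropped and can never be reported.
    return None


def with_tag(flow, tag):
    # A tag keeps the flow alive and only records what happened to the data.
    return Flow(flow.source, flow.sink, flow.tags | {tag})


def filter_issues(flows, sink, ignored_tag):
    # Post-analysis filter for one sink kind: hide flows that carry the tag.
    return [f for f in flows if not (f.sink == sink and ignored_tag in f.tags)]


# The same HTML-escaped user input reaches two different kinds of sink.
xss_flow = with_tag(Flow("UserControlled", "XSS"), "html_escaped")
sql_flow = with_tag(Flow("UserControlled", "SQL"), "html_escaped")

# Escaping makes the data safe for HTML output, but not for SQL.
remaining = filter_issues([xss_flow, sql_flow], sink="XSS", ignored_tag="html_escaped")
print(remaining)
```

Had the escaping function been modeled as a sanitizer instead, the SQL flow would have disappeared as well, which is why the tag-and-filter approach is the more precise of the two.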

To understand in which situations Pysa is most useful, imagine that the following code runs to load a user profile:

# views/user.py
async def get_profile(request: HttpRequest) -> HttpResponse:
   profile = await load_profile(request.GET['user_id'])
   ...

# controller/user.py
async def load_profile(user_id: str):
   user = await load_user(user_id) # Loads a user safely; no SQL injection
   pictures = await load_pictures(user.id)
   ...

# model/media.py
async def load_pictures(user_id: str):
   query = f"""
      SELECT *
      FROM pictures
      WHERE user_id = {user_id}
   """
   result = await run_query(query)
   ...

# model/shared.py
async def run_query(query: str):
   connection = create_sql_connection()
   result = await connection.execute(query)
   ...

Here the potential SQL injection in load_pictures cannot be exploited: the function always receives a valid user_id from the load_user function in load_profile. When configured correctly, Pysa probably won't report an issue. Now imagine that an enterprising engineer writing controller-level code realizes that fetching the user data and the pictures at the same time returns results faster:

# controller/user.py
async def load_profile(user_id: str):
   user, pictures = await asyncio.gather(
       load_user(user_id),
       load_pictures(user_id) # no longer 'user.id'!
   )
   ...

The change might look harmless, but it actually connects the user-controlled string user_id to the SQL injection problem in load_pictures. In an application with many layers between the entry point and the database queries, the engineer may not realize that the data is completely controlled by the user, or that an injection problem is hidden in the called function. This is exactly the situation the analyzer was written for. When an engineer proposes a similar change at Instagram, Pysa discovers that data flows from user-controlled input into an SQL query and reports the problem.
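The original post does not prescribe a fix, but one common remedy for the underlying problem is to stop building the query with string formatting and pass the value as a bound parameter. The sketch below uses the standard library's sqlite3 module as a stand-in for the article's database layer; the table contents and the two function names are assumptions made for the demonstration.

```python
import sqlite3


def make_db():
    # In-memory database with two users' pictures (illustrative data).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pictures (user_id INTEGER, url TEXT)")
    conn.executemany(
        "INSERT INTO pictures VALUES (?, ?)", [(1, "a.png"), (2, "b.png")]
    )
    return conn


def load_pictures_unsafe(conn, user_id):
    # Same pattern as in the article: user input is pasted into the SQL text.
    query = f"SELECT url FROM pictures WHERE user_id = {user_id}"
    return [row[0] for row in conn.execute(query)]


def load_pictures_safe(conn, user_id):
    # The value is bound as a parameter and is never parsed as SQL.
    query = "SELECT url FROM pictures WHERE user_id = ?"
    return [row[0] for row in conn.execute(query, (user_id,))]


conn = make_db()
payload = "1 OR 1=1"  # attacker-controlled "user id"
leaked = sorted(load_pictures_unsafe(conn, payload))  # every user's pictures
blocked = load_pictures_safe(conn, payload)           # no rows match
print(leaked, blocked)
```

With the parameterized version, the injected text is compared as a plain value, so the change from user.id to user_id in the controller would no longer expose other users' rows.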

Analyzer limitations

It is impossible to write a perfect static analyzer. Pysa has limitations that stem from its scope (security issues related to data flow) and from design decisions that trade off performance against precision and accuracy. Python as a dynamic language has unique characteristics that underlie some of these design decisions.

Problem space

Pysa is designed to detect only security issues related to data flow. Not all security or privacy issues can be modeled as data flows. Consider an example:

def admin_operation(request: HttpRequest):
  if not user_is_admin():
      raise Http404  # Http404 is an exception in Django, so it is raised, not returned
 
  delete_user(request.GET["user_to_delete"])

Pysa is not the right tool to ensure that the authorization check user_is_admin runs before the privileged operation delete_user. The analyzer can detect data flowing from request.GET to delete_user, but that data never passes through the user_is_admin check. You could rewrite the code to make the problem something Pysa can model, or you could build the permission check into the administrative operation delete_user. Above all, though, this code shows the kind of problem Pysa does not solve.
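The second option mentioned above, building the permission check into delete_user itself, could look roughly like this. It is a minimal framework-free sketch: the `User` class, the `actor` parameter and the in-memory `users` dictionary are assumptions for illustration and are not part of the original example.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    is_admin: bool = False


def delete_user(actor, user_to_delete, users):
    # The privileged operation checks permissions itself,
    # so no caller can forget to do it.
    if not actor.is_admin:
        raise PermissionError(f"{actor.name} may not delete users")
    users.pop(user_to_delete, None)


users = {"alice": User("alice", is_admin=True), "bob": User("bob")}

try:
    delete_user(users["bob"], "alice", users)  # a non-admin is refused
    refused = False
except PermissionError:
    refused = True

delete_user(users["alice"], "bob", users)      # an admin succeeds
print(refused, sorted(users))
```

The check is still not something a data flow analysis verifies, but it no longer depends on every caller remembering to run it first.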

Resource limits

We made design decisions about limits so that Pysa can finish its analysis before proposed changes land in the codebase. When the analyzer tracks data flows through too many attributes of an object, it sometimes has to simplify and treat the entire object as containing that data. This can lead to false positives.

Another constraint is development time, which forced compromises on which Python features are supported. Pysa does not yet include decorators in the call graph when calling functions and therefore misses problems inside decorators.
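The simplification described above can be modeled in a few lines. This is a toy sketch, not Pysa's implementation: the limit of three attributes, the `WHOLE_OBJECT` marker and the attribute names are all hypothetical.

```python
WHOLE_OBJECT = "*"  # hypothetical marker: every attribute is considered tainted


def add_taint(tainted, attribute, limit=3):
    """Record a tainted attribute, collapsing to the whole object past the limit."""
    if WHOLE_OBJECT in tainted:
        return {WHOLE_OBJECT}
    updated = tainted | {attribute}
    return {WHOLE_OBJECT} if len(updated) > limit else updated


def is_tainted(tainted, attribute):
    return WHOLE_OBJECT in tainted or attribute in tainted


tainted = set()
for name in ["bio", "email", "website"]:
    tainted = add_taint(tainted, name)
precise = is_tainted(tainted, "created_at")    # still tracked per attribute

tainted = add_taint(tainted, "nickname")       # fourth attribute exceeds the limit
collapsed = is_tainted(tainted, "created_at")  # now reported as tainted
print(precise, collapsed)
```

After the collapse, an attribute that never held user data is treated as tainted, which is the kind of false positive the trade-off accepts in exchange for bounded analysis time.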

Python as a dynamic language

Python's flexibility makes static analysis difficult. It is hard to track data flows through method calls without type information. In the code below, it is impossible to determine which implementation of fly is called:

class Bird:
  def fly(self): ...
 
class Airplane:
  def fly(self): ...
 
def take_off(x):
  x.fly()  # Which function does this call?

The analyzer does work on completely untyped projects, but it takes a little effort to add coverage for the important types.
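To see why a type annotation helps, consider the following sketch. It uses runtime introspection via typing.get_type_hints purely to imitate what a static tool can read from the source; the helper `resolve_method` and the typed and untyped variants of take_off are hypothetical names, and Pyre's real call resolution is more involved (for example, it must account for subclasses that override a method).

```python
import typing


class Bird:
    def fly(self):
        return "flap"


class Airplane:
    def fly(self):
        return "roar"


def take_off_untyped(x):
    return x.fly()  # no annotation: the target is unknown


def take_off_typed(x: Bird):
    return x.fly()  # annotation: the declared type names the target


def resolve_method(func, parameter, method):
    """Return the implementation that `parameter.method()` refers to, if annotated."""
    declared = typing.get_type_hints(func).get(parameter)
    return getattr(declared, method, None) if declared is not None else None


typed_target = resolve_method(take_off_typed, "x", "fly")
untyped_target = resolve_method(take_off_untyped, "x", "fly")
print(typed_target, untyped_target)
```

With the annotation in place, the call can be mapped to Bird.fly; without it, there is nothing in the function to distinguish Bird from Airplane.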

The dynamic nature of Python imposes another limitation. See below:

def secret_eval(request: HttpRequest):
  os = importlib.import_module("os")
 
  # Pysa won't know what 'os' is, and thus won't
  # catch this remote code execution issue
  os.system(request.GET["command"])

The remote code execution vulnerability is clearly visible here, but the analyzer will miss it. The os module is imported dynamically, and Pysa does not understand that the local variable os refers to the os module. Python allows you to dynamically import almost any code at any time. In addition, the language lets you change the behavior of a function call for almost any object. Pysa could be taught to analyze this particular pattern and detect the problem, but Python's dynamism means there are endless examples of pathological data flows that the analyzer will not see.

Results

In the first half of 2020, Pysa accounted for 44 percent of all problems detected on Instagram. Across all vulnerability types, Pysa found 330 unique issues in proposed code changes. Of these, 49 (15%) turned out to be significant and 131 (40%) were real but had mitigating circumstances. The remaining 150 (45%) were false positives.
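The gap between what the program does at runtime and what can be read from its text is easy to demonstrate. In this small sketch the helper name `load_module` is hypothetical; the module name is assembled from pieces at runtime, so nothing in the source text spells out which module is loaded.

```python
import importlib
import os


def load_module(prefix, suffix):
    # The module name only exists once the program runs.
    return importlib.import_module(prefix + suffix)


module = load_module("o", "s")
same_object = module is os  # at runtime this is the very same module object
print(same_object)
```

At runtime the result is simply the os module, but an analyzer that reads the code without executing it only sees a call that returns some unknown module.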
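The figures above are internally consistent, which a few lines of arithmetic confirm. The labels in this sketch are shorthand chosen for illustration; the counts and the total come from the paragraph above.

```python
total = 330
counts = {"significant": 49, "real_but_mitigated": 131, "remaining": 150}

# The three categories account for every reported issue.
covered = sum(counts.values())

# Percentages rounded to whole numbers, as quoted in the text.
shares = {label: round(100 * n / total) for label, n in counts.items()}
print(covered, shares)
```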

We regularly review issues reported in other ways, for example through the Bug Bounty program. This is how we make sure we correct all false negatives. The detection of each type of vulnerability is configurable. Through constant refinement, security engineers have tuned some vulnerability types to the point where they report real problems 100 percent of the time.

Overall, we are happy with the trade-offs we made to help security engineers scale. But there is always room for improvement. We built Pysa to continually improve code quality through close collaboration between security engineers and software engineers. This allowed us to iterate quickly and create a tool that meets our needs better than any out-of-the-box solution. That collaboration led to additions and refinements to Pysa's mechanisms. For example, the way an issue trace is displayed has changed, which makes it easier to see false negatives.

See the Pysa documentation and tutorial.
