Checklist for Code Review in Distributed Systems

points of view by sanja



Microservice architecture is now widespread in software development. Organizations that adopt it face not only the usual difficulty of implementing business logic but also distributed failures.



Distributed computing errors are well documented but hard to detect, which makes building a large, reliable distributed system a complex challenge. Code that looks fine in a monolith can become a problem once its calls start going over the network.



The Mail.ru Cloud Solutions team translated an article whose author spent several years tracking down typical failures in production code and studying their causes. The article presents the code review guidelines the author uses as a basic checklist.



The remote system fails



No matter how carefully a system is designed, it will fail at some point. That is a fact of life once software runs in production.



Failures have many causes: bugs, infrastructure problems, sudden traffic spikes, gradual decay from neglect. Whatever the cause, they are almost inevitable. The robustness and reliability of the whole architecture depends on how the calling modules handle errors:



  1. . . , , . β€” . , .
  2. . , . ? ? , ? ? .




This situation is worse than a complete failure, because the caller cannot tell whether the remote system is still up and running. To handle this scenario, always check for the issues described below.



Some of these problems can be handled transparently to the application code by service mesh technologies such as Istio. Whichever method is used, make sure the problems are actually handled:



  1. Set timeouts for remote system calls. This applies to remote API calls, database calls, and event publishing. Check that a finite, reasonable timeout is set on every call to a remote system. This avoids wasting resources on waiting if the remote system becomes unresponsive.
  2. -. , β€” . , .



    , - (, ). , , . β€” , .
  3. (Circuit Breaker). , , Hystrix. . , Circuit Breaker . β€” .
  4. - . - β€” , . , . , -. , .
  5. . , . , , .
  6. . , ( API, ), β€” . : , , . .
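
As an illustration of the first item in the list above (timeouts on remote calls), here is a minimal sketch using only the Python standard library. The names `call_remote` and `RemoteCallTimeout` and the 2-second default are assumptions made for this example, not something the original article prescribes. In real code the same idea usually means passing a timeout parameter to whatever HTTP, database, or messaging client is in use.

```python
import socket


class RemoteCallTimeout(Exception):
    """Raised when the remote system does not answer within the time budget."""


def call_remote(host: str, port: int, payload: bytes, timeout_s: float = 2.0) -> bytes:
    """Send payload and wait for a reply, with a finite timeout on every socket operation.

    Hypothetical helper for illustration only.
    """
    try:
        # The timeout is applied to connect() and then stays set on the socket,
        # so sendall() and recv() are bounded by it as well.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(payload)
            return conn.recv(4096)
    except socket.timeout as exc:
        # Translate the low-level error into one the caller can handle explicitly.
        raise RemoteCallTimeout(
            f"no reply from {host}:{port} within {timeout_s}s"
        ) from exc
```

Without the `timeout` argument the socket would block indefinitely on an unresponsive peer, tying up a thread or connection slot for as long as the remote system stays silent.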


,



  1. , API . - API. , API . API API, β€” .
  2. SLA β€” . SLA , . , .



    SLA : β€” . , SLA. β€” , , .
  3. API-. SLA β€” SLA.
  4. . β€” , . , , , . .



    β€” «» , «» . , id = 123, id =123. , «» , Β« Β». .




  1. . , , . , Redis, . , .
  2. . API (), ? , , ? API ?
  3. . , , , . . . , , . . .




  1. Check input at every entry point. In a distributed environment, any part of the system can be compromised or simply contain bugs. Each module must therefore validate what it receives and must not assume that its input is clean, that is, safe.
  2. Never store credentials in a code repository. This is a very common mistake and a hard one to stamp out. Credentials should always be loaded at runtime from external, preferably secure, storage.
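
The two points above can be sketched roughly as follows. This is an illustration under stated assumptions: the payload fields `order_id` and `quantity`, their limits, the function names, and the `DB_PASSWORD` environment variable are all hypothetical. A real service would likely read secrets from a dedicated secret store rather than plain environment variables.

```python
import os
import re

# Hypothetical rule: an order id is 1 to 12 ASCII digits.
_ORDER_ID = re.compile(r"[0-9]{1,12}")


def parse_order_request(raw: object) -> dict:
    """Validate an incoming payload at the entry point; raise ValueError if it is not clean."""
    if not isinstance(raw, dict):
        raise ValueError("payload must be a JSON object")
    order_id = raw.get("order_id")
    quantity = raw.get("quantity")
    if not isinstance(order_id, str) or not _ORDER_ID.fullmatch(order_id):
        raise ValueError("order_id must be a string of 1 to 12 digits")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        raise ValueError("quantity must be an integer")
    if not 1 <= quantity <= 1000:
        raise ValueError("quantity must be between 1 and 1000")
    # Return only the fields that were checked; anything else is dropped.
    return {"order_id": order_id, "quantity": quantity}


def load_db_password(env=os.environ) -> str:
    """Read the credential from the runtime environment instead of the source code."""
    password = env.get("DB_PASSWORD")
    if not password:
        raise RuntimeError("DB_PASSWORD is not set; refusing to start")
    return password
```

Failing fast when the credential is missing keeps a misconfigured instance from starting and then failing later in a less obvious way.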


I hope you find these guidelines useful for reducing common bugs in distributed systems code.



Good luck!


