Uber's Domain-Oriented Microservice Architecture

Translator's note: A recent article from Uber Engineering describes this large company's journey toward an improved version of its microservice architecture. Although some readers, with good reason, saw the new approach as "just applying DDD principles to microservices", the article drew great interest from developers and other engineers. We are therefore glad to present its Russian-language version, prepared specially for habr.

Introduction

Recently, the drawbacks of service-oriented architectures and, in particular, of microservice architectures (MA) have been actively discussed. Just a few years ago, many organizations were eager to migrate to MA because of its benefits: flexibility in the form of independent deployments, transparent ownership, increased system stability, and better separation of concerns. The mood has since changed: the microservice approach is now criticized for its tendency to seriously increase complexity, which sometimes makes even trivial features hard to implement. (We covered this in the talk "Microservices: size matters, even if you have Kubernetes". Translator's note.)

Uber currently has about 2,200 critical microservices, and we have experienced all the pros and cons of this approach ourselves. Over the past two years, Uber has tried to reduce the complexity of the microservice landscape while maintaining the architecture's advantages along the way. With this post, we plan to present our generic approach to microservice architectures called the Domain-Oriented Microservice Architecture (DOMA).

While it has been popular in recent years to criticize microservice architectures for their shortcomings, few have dared to proclaim that they should be abandoned entirely. Their operational benefits are all too important; furthermore, there seem to be no (or extremely limited) alternatives to this approach. The goal of our generalized approach is to help organizations that want to reduce overall system complexity while maintaining the flexibility inherent in MA.

This article will explore DOMA, the challenges that led to this approach at Uber, its benefits for platform and product teams, and finally some tips for those looking to migrate to this architecture.

What is a microservice?

Microservices are an extension of service-oriented architectures. Unlike the rather large "services" of the 2000s, each microservice performs one narrow task. These applications are hosted and reachable over the network and expose a well-defined interface. Other applications call this interface using remote procedure calls (RPC).

A key characteristic of MA is the way in which code is hosted, invoked, and deployed. Large monolithic applications are usually also divided into encapsulated components with well-defined interfaces, but those interfaces are called directly within the process rather than over the network. In this sense, a microservice can be thought of as a library whose functions are slower to call, because every call pays for network latency and for serialization and deserialization.
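The "library with lower performance" comparison can be made concrete with a small sketch. It is not Uber code: the function fare_estimate and its rates are hypothetical, and JSON stands in for whatever wire format a real RPC framework would use. Network latency is not simulated at all, so the measured gap covers only the serialization part of the overhead.

```python
import json
import timeit


def fare_estimate(distance_km: float, minutes: float) -> float:
    """Hypothetical pricing function; the rates are invented for illustration."""
    return round(2.0 + 1.2 * distance_km + 0.3 * minutes, 2)


def call_in_process(distance_km: float, minutes: float) -> float:
    # Library-style call: a plain function invocation inside one process.
    return fare_estimate(distance_km, minutes)


def call_like_rpc(distance_km: float, minutes: float) -> float:
    # Service-style call without the network hop: the request and the
    # response are each serialized once and deserialized once.
    request = json.dumps({"distance_km": distance_km, "minutes": minutes})
    args = json.loads(request)
    response = json.dumps({"fare": fare_estimate(**args)})
    return json.loads(response)["fare"]


if __name__ == "__main__":
    direct = timeit.timeit(lambda: call_in_process(5.0, 12.0), number=100_000)
    rpc_like = timeit.timeit(lambda: call_like_rpc(5.0, 12.0), number=100_000)
    print(f"in-process: {direct:.3f}s, with serialization: {rpc_like:.3f}s")
```

Both paths return the same value; only the cost of getting it differs, and a real network call would widen that gap further.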

Thinking of microservices in this way, we might ask why we need a microservice architecture at all. The classic answer is the ability to deploy individual components independently and to scale them easily. With a large monolithic application, the organization is forced to deploy or release all the code at the same time. As a result, each new version carries a lot of changes, deployments become risky and time-consuming, and any mistake can bring down the entire system.

Thus, companies move to microservices for operational convenience at the expense of performance. They also have to bear the additional cost of maintaining the infrastructure that microservices require. Experience shows that in many situations this trade-off makes sense. At the same time, it is a powerful argument against a premature transition to MA.

Motivation

At the time of the transition to microservices (around 2012-2013), we at Uber had two main monolithic services, and we faced a lot of operational problems that microservices successfully solve:

Availability risks. Any mistake in the monolith's codebase could bring down the entire system (in this case, all of Uber).
Risky and costly deployments. Deployments were very difficult to carry out and often had to be rolled back to the previous version.
Poor separation of concerns. It was very hard to keep track of who was responsible for what in the colossal codebase. During exponential growth, haste sometimes blurred the boundaries between logic and components.
Inefficient execution. Together, the problems above made it difficult for teams to work independently and autonomously.

In other words, as the number of engineers at Uber grew from tens to hundreds and many teams came to own their own parts of the technology stack, the monolithic architecture increasingly tied the fates of these teams together and prevented them from working independently.

Therefore, we decided to switch to MA. As a result, our systems became more flexible and our teams more autonomous:

System reliability. The overall system reliability increases with the transition to MA. An individual service can crash (and can be rolled back to a previous version) without risking crashing the entire system.
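Separation of concerns. Architecturally, MA forces the question "Why does this service exist?", which defines the roles of the different components more clearly.
Clear ownership. It becomes much clearer who owns which code. Services are usually owned by an individual, a team, or an organization, which makes rapid growth easier.
Autonomous execution. Independent deployments and clear ownership boundaries let product and platform teams work autonomously.
Developer velocity. Teams can deploy their code independently and therefore move at their own pace.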

It is no exaggeration to say that Uber could not have reached its current scale and quality level without MA.

However, as the company continued to grow and the number of engineers increased from hundreds to thousands, we began to notice a number of problems caused by significantly increased system complexity. With MA, we trade a single monolithic codebase for a number of "black boxes" whose functionality can change at any time and lead to unexpected behavior.

For example, to get to the root cause of a single problem, engineers had to dig through roughly 50 services owned by 12 different teams.

Understanding the dependencies between services can become quite difficult, because services can interact with each other at many levels. A latency spike in the n-th dependency can cause an avalanche of problems in upstream services, and without the proper tools it is impossible to understand what happened. All of this makes debugging very difficult.

Figure: Uber's microservice architecture as of mid-2018, as visualized by Jaeger.

To implement even the simplest feature, an engineer often has to work with many services that are owned by completely different teams and people. As a result, a lot of time goes into coordination, meetings, design consultations, and code review. The initial benefit of transparent ownership gradually erodes as teams continually reach into each other's services, change data models, and even deploy on behalf of service owners. This can create networked monoliths, in which services only appear to be independent but in fact have to be deployed together for any change to be made safely.

Figure: an example of such a complex system at Uber (around 2018, before DOMA), with ten touchpoints for a simple integration.

The result is slower development, instability that hits service owners, more time-consuming migrations, and so on. Alas, there is no turning back for organizations that have already switched to MA. The situation is well described by the familiar phrase: "You can't live with them, and you can't live without them."

Domain-Oriented Microservice Architecture

Think of microservices as I/O-bound libraries, and of a microservice architecture as one huge distributed application. We can then use well-known architectural ideas to think about how best to organize our code.

Thus, the Domain-Oriented Microservice Architecture (DOMA) draws on well-established ways of organizing code such as Domain-Driven Design, Clean Architecture, Service-Oriented Architecture, and object-oriented and interface-oriented design patterns. We see DOMA as innovative in the sense that it is a relatively new way of applying existing design principles to the globally distributed systems of large organizations.

Here are some basic DOMA concepts and related terminology:

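Instead of looking at individual microservices, we look at groups of related ones. We call these groups domains.
Next, we group domains into so-called layers. The layer a domain belongs to determines which dependencies are available to the microservices in that domain. We call the resulting arrangement layer design.
We provide each domain with a clean interface that serves as the single point of entry into it. We call these interfaces gateways.
Finally, each domain should be agnostic of other domains: it should not 'hardcode' logic belonging to another domain in its codebase or data models. Because teams often do need to place logic in another team's domain (for example, custom validation logic or extra context on a data model), we provide an extension architecture with well-defined extension points inside the domain.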

In other words, layer design, domain gateways, and predefined extension points turn a microservice architecture from something complex into something comprehensible: a structured set of flexible, reusable, layered components.

The rest of this article will focus on Uber's implementation of DOMA and its benefits. Practical advice will also be given to companies wishing to adopt this approach.

Implementation in Uber

Domains

Uber domains are collections of one or more microservices grouped by a logical combination of functionality. The natural question is how big a domain should be. We give no prescription here: some domains include dozens of services, others just one. What matters is thinking carefully about the logical role of each grouping. For example, we have grouped map search services, fare services, and matching services (which pair drivers with riders) into separate domains. Domains also do not always mirror the company's organizational structure. Uber Maps, for instance, is split into three domains, with 80 microservices hidden behind three different gateways.

Layer design

Layer design answers the question of which services may call which other services within Uber's MA. It can be viewed as separation of concerns at a global scale, or as a mechanism for global dependency management.

Layer design helps us reason about the blast radius of failures and about product specificity across Uber's dependent services. As you move from the bottom layers to the top, fewer services are affected by a failure and the functionality serves a narrower product scope. Conversely, more services depend on functionality in the lower layers, so the blast radius of a failure there is usually larger and the range of business cases served is wider. The figure below illustrates this concept.
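The notion of blast radius can be sketched as a graph traversal: starting from a failed service, collect every service that depends on it directly or transitively. The service names and the dependency graph below are hypothetical and are used only to illustrate the idea; the article does not describe how Uber computes this.

```python
from collections import deque

# Hypothetical dependency graph: service -> services it calls.
DEPENDS_ON = {
    "rider-app-api": ["ride-request", "fares"],
    "driver-app-api": ["ride-request"],
    "ride-request": ["fares", "accounts"],
    "fares": ["accounts"],
    "accounts": [],
}


def blast_radius(failed: str, depends_on: dict) -> set:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers = {name: set() for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    for name in DEPENDS_ON:
        print(name, "->", sorted(blast_radius(name, DEPENDS_ON)))
```

In this toy graph, a failure in the low-level accounts service reaches every other service, while a failure in rider-app-api reaches none, which matches the bottom-to-top narrowing described above.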

One way to picture this: the upper layers hold functionality tied to a specific (narrow) user experience (for example, mobile features), while the lower layers hold more general business functionality (for example, account management or trips in the ridesharing marketplace). Each layer depends only on the layers beneath it, which gives concrete meaning to concepts such as blast radius and domain integration.

It's worth noting that functionality often moves down this diagram, from specific to more general. You can imagine a simple feature that becomes more important (more of a "platform") over time as requirements evolve. In fact, this kind of downward migration is expected: many of Uber's core business platforms started out as features for drivers or riders and became more generalized over time, as new lines of business (such as Uber Eats or Uber Freight) emerged and more dependencies were attached to them.

Within Uber, we distinguish the following five layers.

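Infrastructure layer. Provides functionality that any engineering organization could use. It is Uber's answer to the big engineering questions, such as storage or networking.
Business layer. Provides functionality that Uber as an organization can use, but that is not specific to a particular product category or line of business such as Rides, Eats, or Freight.
Product layer. Provides functionality that relates to a particular product category or line of business but does not depend on a specific mobile application. For example, the "request a ride" logic is used by several Rides applications: Rider, Rider "Lite", m.uber.com, etc.
Presentation layer. Provides functionality directly related to features of a consumer-facing application (mobile/web).
Edge layer. Safely exposes Uber services to the outside world. This layer is also aware of the mobile applications.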

As you can see, each successive layer represents an ever narrower grouping of functionality and has a smaller blast radius (in other words, fewer components depend on the functionality within that layer).
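The rule that a layer may depend only on the layers beneath it lends itself to an automated check. The sketch below is an illustration, not Uber tooling: the layer names infrastructure, business, product, presentation, and edge are assumed for the five layers, the domain names are hypothetical, and the checker assumes that dependencies within the same layer are permitted, which the article does not specify.

```python
# Assumed layer order from bottom (index 0) to top.
LAYERS = ["infrastructure", "business", "product", "presentation", "edge"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def layer_violations(domain_layer: dict, dependencies: dict) -> list:
    """Return (domain, dependency) pairs where a domain calls into a higher layer."""
    violations = []
    for domain, deps in dependencies.items():
        for dep in deps:
            if RANK[domain_layer[dep]] > RANK[domain_layer[domain]]:
                violations.append((domain, dep))
    return violations


# Hypothetical domains and the layers they are assigned to.
DOMAIN_LAYER = {
    "storage": "infrastructure",
    "fares": "business",
    "ride-request": "product",
    "rider-home-screen": "presentation",
}

# Hypothetical dependencies; the last one points upward on purpose.
DEPENDENCIES = {
    "rider-home-screen": ["ride-request"],
    "ride-request": ["fares"],
    "fares": ["storage", "rider-home-screen"],
}

if __name__ == "__main__":
    print(layer_violations(DOMAIN_LAYER, DEPENDENCIES))
```

A check of this kind could run in continuous integration, so that an upward dependency is caught when it is introduced rather than during an outage.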

Gateways

The term API gateway is already well established in microservice architectures. Our definition differs little from the usual one, except that we tend to think of a gateway as the single entry point into the corresponding group of services (which we call a domain). The success of a gateway depends on a well-designed API.

This diagram illustrates the high-level design of a gateway. The gateway abstracts away the details of a domain's internals: its services, data tables, ETL pipelines, and so on. Other domains see only its interfaces: RPC APIs, events, and queries in the messaging system.

Because upstream consumers call only one service, gateways bring numerous benefits for future migrations, discoverability, and overall system complexity: upstream services have a single dependency instead of depending on the many downstream services that may exist inside a domain. From an object-oriented design perspective, gateways are interface definitions, which leaves us free to do whatever we want with the underlying "implementation" (that is, the group of microservices).
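The "gateway as interface definition" idea can be sketched in a few lines. Everything here is hypothetical: the FaresGateway interface, the two internal pricing services, and the rates are invented to show that a consumer written against the gateway keeps working when the services behind it are replaced.

```python
from typing import Protocol


class FaresGateway(Protocol):
    """The only surface that other domains are allowed to call (hypothetical)."""

    def estimate(self, distance_km: float) -> float: ...


class _LegacyPricingService:
    # Internal service, invisible outside the domain.
    def price(self, distance_km: float) -> float:
        return 2.0 + 1.5 * distance_km


class _NewPricingService:
    # Replacement service with a different internal API (meters, not km).
    def quote(self, distance_m: float) -> float:
        return 2.0 + 1.5 * (distance_m / 1000)


class FaresGatewayV1:
    def __init__(self) -> None:
        self._impl = _LegacyPricingService()

    def estimate(self, distance_km: float) -> float:
        return self._impl.price(distance_km)


class FaresGatewayV2:
    # Same public contract, new services behind it.
    def __init__(self) -> None:
        self._impl = _NewPricingService()

    def estimate(self, distance_km: float) -> float:
        return self._impl.quote(distance_km * 1000)


def show_estimate(gateway: FaresGateway, distance_km: float) -> str:
    # Upstream consumer: depends on the gateway contract only.
    return f"Estimated fare: {gateway.estimate(distance_km):.2f}"


if __name__ == "__main__":
    print(show_estimate(FaresGatewayV1(), 4.0))
    print(show_estimate(FaresGatewayV2(), 4.0))
```

The consumer function never changes between the two versions, which is the property that makes migrations behind a gateway cheap.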

Extensions

Extensions, as the name implies, are a mechanism for extending domains. At its most basic, an extension provides a way to extend the functionality of a service without changing the internals of that service or affecting its overall reliability. At Uber we have two extension models: logic extensions and data extensions. The extension concept has allowed us to scale the architecture so that multiple teams can work independently of one another.

Logic extensions

Logic extensions provide a mechanism for extending the underlying logic of a service. For them, we use a variation of the provider or plugin pattern, with an interface that is defined separately for each service. This allows teams to implement their logic against the interface alone, without touching the core platform code.

Suppose, for example, that a driver goes online. We usually run various checks to make sure the driver is allowed to go online (safety, compliance, and so on), and each check is owned by its own team. One possible approach is to have every team write its logic inside the same endpoint, but that adds complexity: each check requires different, completely unrelated logic.

With logic extensions, the endpoint, called go online, defines an interface with a predefined request and response type that every extension must conform to. Each team registers an extension that implements its own logic. In this case, an extension might simply take some information about the driver and return a boolean value (bool) indicating whether the driver is allowed to go online. The go online endpoint itself then simply iterates over these responses and checks whether any of them is false.

This approach separates the core code from the extensions and isolates them from each other: an extension does not know what other logic is being executed. It also makes it easy to build additional functionality around extensions, for example observability or feature flagging.
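The go online example can be sketched as a small plugin registry. This is an illustration under assumptions, not Uber's implementation: the Driver fields, the register decorator, and the two checks are hypothetical, and a real platform would register extensions across service boundaries rather than in one process.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Driver:
    # Hypothetical request type shared by all extensions.
    driver_id: str
    license_valid: bool
    hours_driven_today: float


# The extension interface: take a Driver, return True if going online is allowed.
GoOnlineExtension = Callable[[Driver], bool]

_extensions: List[GoOnlineExtension] = []


def register(extension: GoOnlineExtension) -> GoOnlineExtension:
    """Register a team's check without touching the endpoint code."""
    _extensions.append(extension)
    return extension


@register
def compliance_check(driver: Driver) -> bool:
    # Owned by a (hypothetical) compliance team.
    return driver.license_valid


@register
def safety_check(driver: Driver) -> bool:
    # Owned by a (hypothetical) safety team; the 12-hour limit is invented.
    return driver.hours_driven_today < 12


def go_online(driver: Driver) -> bool:
    """Core endpoint: the driver may go online only if no extension returns False."""
    return all(extension(driver) for extension in _extensions)


if __name__ == "__main__":
    print(go_online(Driver("d1", license_valid=True, hours_driven_today=3.0)))
    print(go_online(Driver("d2", license_valid=True, hours_driven_today=13.0)))
```

Because the endpoint only iterates over registered callables, a new team can add a check without the platform team changing go_online.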

Data extensions

This type of extension provides a mechanism for attaching arbitrary data to an interface, so that the core platform's data models do not bloat. For data extensions we rely heavily on features such as Protobuf's Any, which lets us attach arbitrary data to requests. Services often store this data or pass it on to a logic extension, so the core platform never deserializes (and therefore never "knows" anything about) this arbitrary context. Using Any incurs some infrastructure overhead in exchange for stronger typing. A simpler alternative is to represent the arbitrary data as JSON.
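A minimal sketch of the JSON variant follows. The RideRequest type, the extension_data field, and the airport extension are hypothetical; the point is only that the core platform passes the blob through as an opaque string and never parses it.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class RideRequest:
    # Fields the core platform understands.
    rider_id: str
    pickup: str
    # Opaque JSON blob owned by an extension; the platform never parses it.
    extension_data: str = "{}"


def platform_handle(request: RideRequest, extension: Callable[[str], str]) -> str:
    """Core platform logic: validates its own fields, forwards the blob untouched."""
    if not request.rider_id or not request.pickup:
        raise ValueError("rider_id and pickup are required")
    return extension(request.extension_data)


def airport_extension(raw: str) -> str:
    # Only the extension's owner knows the schema of the blob.
    context = json.loads(raw)
    return f"terminal {context['terminal']}"


if __name__ == "__main__":
    request = RideRequest(
        rider_id="r1",
        pickup="SFO",
        extension_data=json.dumps({"terminal": "2"}),
    )
    print(platform_handle(request, airport_extension))
```

The trade-off mentioned above is visible here: the JSON blob needs no extra infrastructure, but nothing checks its schema until the extension reads it.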

Custom extensions

In addition to logic and data extensions, many teams at Uber have developed custom extension patterns suited to their own domains. For example, most of the integrations related to the presentation layer use DAG-based task execution logic.
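The article does not describe how the DAG-based task execution works, so the following is only a generic sketch of the technique, using the standard library's graphlib (Python 3.9+). The task names and their outputs are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_dag(
    tasks: Dict[str, Callable[[dict], object]],
    dependencies: Dict[str, Set[str]],
) -> dict:
    """Run each task after its dependencies, passing it their results."""
    results: dict = {}
    # static_order() yields every node only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](inputs)
    return results


# Hypothetical tasks for rendering a home screen.
TASKS = {
    "fetch_profile": lambda inputs: {"name": "Ada"},
    "fetch_trips": lambda inputs: 3,
    "render_home": lambda inputs: (
        f"{inputs['fetch_profile']['name']} has {inputs['fetch_trips']} trips"
    ),
}

# Task -> tasks it depends on.
TASK_DEPENDENCIES = {"render_home": {"fetch_profile", "fetch_trips"}}

if __name__ == "__main__":
    print(run_dag(TASKS, TASK_DEPENDENCIES)["render_home"])
```

TopologicalSorter raises CycleError if the declared dependencies contain a cycle, so a malformed task graph fails before any task runs.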

Benefits

DOMA has influenced nearly every major Uber business to one degree or another. Over the past year, we have mainly focused on the business layer. It provides generalized logic for the various lines of business in a company.

DOMA is relatively new to Uber, and we will certainly share more information and examples of our architecture in the future. The first results are encouraging: DOMA has greatly simplified the work of developers and reduced the overall complexity of the system.

Products and platforms

DOMA is the result of a collaborative effort between the various product and platform teams at Uber. In many cases, platform support costs have dropped by an order of magnitude. Product teams have benefited from specificity and accelerated development.

For example, an early platform consumer of our extension architecture was able to reduce the time to prioritize and integrate a new feature from three days to three hours by reducing code review times, scheduling, and accelerating consumer education.

Reduced complexity

Previously, product teams had to work with many downstream services within a domain, but now they only need to call one. By reducing the number of touchpoints when introducing a new feature, implementation time has been reduced by 25-30%. In addition, we were able to distribute 2,200 services across 70 domains. About half of them have been implemented, and for the majority there is a plan for implementation in one form or another.

Future migrations

At Uber, we have calculated that a microservice has a half-life of 1.5 years. In other words, every year and a half, 50% of our services become obsolete. Without gateways, a microservice architecture can turn into migration hell: ever-changing microservices demand constant migrations from their upstream consumers. Gateways let teams avoid depending on a domain's downstream services directly, which means those services can change without forcing upstream migrations.
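To get a feel for what a 1.5-year half-life implies, the figure can be plugged into a simple exponential decay model. The model itself is an assumption made for illustration; the article states only the half-life and the count of about 2,200 services.

```python
def fraction_remaining(years: float, half_life_years: float = 1.5) -> float:
    """Share of today's services still current after `years`, assuming exponential decay."""
    return 0.5 ** (years / half_life_years)


if __name__ == "__main__":
    total_services = 2200  # figure quoted in the article
    for years in (1.5, 3.0, 4.5):
        share = fraction_remaining(years)
        print(f"after {years} years: {share:.1%}, about {round(total_services * share)} services")
```

Under this assumption, only a quarter of today's services would still be current after three years, which is why consumers that depend on individual services rather than on gateways face continual migrations.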

Two of Uber's biggest platform upgrades over the past year have happened behind gateways. These platforms have hundreds of dependent services, and without gateways, all existing consumers would have to be migrated. It would be incredibly expensive, making a complete redesign of the platform unrealistic.

New lines of business and products

Platforms built on DOMA have proven to be much more extensible and easier to maintain. Most of the teams at Uber that switched to DOMA did so because supporting new lines of business had become too expensive.

Practical advice

In this section, I've compiled some practical tips for companies that might be interested in DOMA. The guiding principle here is that, in our experience, a mature and thoughtful microservice architecture is based on incremental shifts in the right direction at the right time. In reality, it is almost impossible to completely “rewrite” MA.

Therefore, we view the evolution of MA as something like trimming a hedge so that it grows in the right direction, rather than as a one-time, all-out effort. It is a dynamic and gradual process.

Startups

The key questions here are: "When should we move to MA?" and "Does this make sense for our organization?" As we saw above, while microservices provide an operational advantage in organizations with a large number of engineers, they also increase overall complexity, which can make it difficult to implement new features.

In small organizations, the operational advantage is unlikely to compensate for the increased architectural complexity. Moreover, MAs usually require dedicated engineering resources to support them, which may be too expensive for an early stage company or simply suboptimal in terms of prioritization.

With that said, it might be wise to postpone the transition to microservices for a while. If the organization decides to switch to microservices, we recommend that it use the analogy of a large distributed application and think in advance about dividing problem areas between services. Also keep in mind that the earliest microservices are likely to be the most important and long-lived, as they describe a key part of the business.

Medium business

MA's usefulness increases in mid-sized companies with many teams, where the lines of responsibility are gradually blurring between different functions and platforms.

This is where you can start thinking about the hierarchy of microservices. Dependency management may come to the fore as some services can become much more critical to running a business and more teams will rely on them.

Early investments in platforming can pay dividends later on. Creation of business platforms that do not depend on other products allows avoiding the accumulation of technical debt and the penetration of arbitrary product logic into the main services of the platform. Perhaps an extension mechanism should be introduced at this stage to achieve this goal.

Given that the number of microservices is still small, it may not make sense to bundle them together yet. However, it is worth noting here that a domain in the context of the DOMA implementation in Uber may well include a single service, so a "domain-oriented" train of thought still does not hurt.

Big business

Large engineering organizations can have hundreds of specialists, microservices, and many dependencies. It is in these conditions that DOMA reaches its full potential. Surely such companies will have obvious clusters of microservices that can be easily combined into domains with gateways in front of them. Legacy services often need refactoring / rewriting and subsequent migration. This means that gateways will soon begin to bring real benefits in terms of ease of migration (if, of course, they are already deployed).

The importance of a transparent and understandable hierarchy will also increase: some services will be “product” for certain functions or groups of functions, while others will support multiple products and act as “platforms”. At this stage, it is critical to keep arbitrary product logic separate from platforms to avoid massive operational stress on platform teams, and to minimize the risk of global system instability.

Final thoughts

At Uber, we continue to actively develop DOMA as more teams migrate to it. The main idea behind DOMA is that a microservice architecture is just one large distributed program. And the same principles can be applied to its evolution as to any other software. DOMA is just an approach for practical thinking about these principles. We hope you find it helpful and look forward to your feedback!

DOMA itself is the result of a cross-functional effort by nearly 60 engineers from across Uber. I would like to express special gratitude to the following people for their contributions to this work over the past 2 years:

Alex Zylman, Alexandre Wilhelm, Allen Lu, Ankit Srivastava, Anthony Tran, Anupam Dikshit, Anurag Biyani, Daniel Wolf, Deepti Chedda, Dmitriy Bryndin, Gaurav Tungatkar, Jacob Greenleaf, Jaikumar Ganesh, Jennie Ngyabuae, Joeoshier , Kusha Kapoor, Linda Fu, Madan Thangavelu, Nimish Sheth, Parth Shah, Shawn Burke, Simon Newton, Steve Sherwood, Uday Kiran Medisetty and Waleed Kadous.

Acknowledgments: This work combined many existing industry design patterns to solve problems at Uber and also proposed some new patterns (such as extensions). We are grateful to the industry for its work on them. We are also grateful to the LinkedIn engineers who worked on Superblocks for sharing their experience with us.
