I work as a technical lead in the System team, which is responsible for the performance and stability of the service. From March to November 2020, Miro grew sevenfold - up to 600+ thousand unique users per day. Now our monolith works on 350 servers, we use about 150 instances to store user data.
The more users interact with the service, the more attention will be required to find and eliminate bottlenecks on the servers. Let me tell you how we solved this problem.
Part one: problem statement and introductory
In my understanding, any application can be represented as a model: it consists of tasks and handlers. Tasks are queued and executed sequentially, as in the figure below:
Not everyone agrees with this statement of the problem: someone will say that there are no queues on RESTful servers - only handlers, request processing methods.
I see it differently: not all requests are processed simultaneously, some are waiting for their turn in the memory of the web engine, in sockets or somewhere else. There is a queue anyway.
Miro WebSocket , . , , . , .
, β .
, . , , .
: . , β , β . : . : , . , β .
: . , . , . , .
, , , . , , .
. : (1%) , (99%).
: . , (userβs action).
: 2% , , β , . UX β , . .
:
. .
, , , : - input/output (IO).
, . β . , , SQL- .
(data access layer, DAL) , . , (observable).
: Miro jOOQ SQL. : SQL-, . Redis , . DAL . , , . .
RESTful , - .
, .
. , , SQL Redis. time-counters: , , Redis .
Prometheus Jaeger. ? , β . .
: Miro, . , , β . Prometheus , .
β , . Jaeger . β .
Stack trace
, , , . β .
data access layers stack trace. , end point Redis.
stack trace , . β , , .
Miro stack traces Grafana, dump third-party . : projects.pt.server.RepositoryImpl.findUser (RepositoryImpl.java:171) RepositoryImpl.findUser:171.
WatchDog
stack trace β , . , , .
β WatchDog. , . , .
β 100 5 . WatchDog thread, stack trace.
: 5 , stack trace. , alert, β , - deadlock .
2020 , , Miro 20% . , .
, β , . . β , . .