Developer path

Hello! My name is Alexey Skorobogaty. I joined Lamoda as a developer in 2015. Today I am a system architect of the e-commerce platform and the Technical Lead of the CORE team. In this article, I want to share the insights I have gained over these five years, in the form of takeaways, with stories, memes and links to further reading.






I would be glad to have any discussion in the comments under the article: questions, opinions, refutations!



There are known knowns



At Lamoda, I joined the team that supported and developed the order processing system. Nothing was clear, but it was terribly interesting.



After the small but ambitious web studio where I had worked before, I was impressed by the sense of seriousness inside a large company. The established development processes seemed like a perfectly polished mechanism. Code reviews from the lead and the team members were rigorous but instructive, as they have to be for such a complex and critical system. I was handed small, targeted tasks that touched literally one or two files, no more. Most of the codebase and of the system's behavior was hidden from me by the fog of war.



After about a month, I completed one of my first tasks that made real changes to a working system. In essence, it came down to adding one field to the report on refunds to customers. Code review, unit tests, a QA engineer testing the release - everything looked fine. Because the system was large and complex, we released twice a week on a fixed schedule, and on Thursday my task went to production. The release engineer spent most of the day building and rolling out the code, and then compulsively switching between tabs with monitoring graphs, errors, queues and logs - everything that could indicate a problem. But everything looked great. The code was merged into the master branch, and everyone went off to deal with other tasks.



The silence in the logs and monitoring hid a terrible bug: the database query returned an incorrect number of rows. The total amount to be refunded was several times higher than the real one... But we only found out about this on Monday. I still remember the tired and reproachful look the Tech Lead gave me as we rode the office elevator the next morning. He had been hunting the bug until three in the morning and preparing a fix for release. And the company took a hit from my mistake. That was my first critical bug, but far from the last. People make mistakes, and they make them all the time.



Takeaway #1: Business processes and data come first. Pay close attention to the data the system works with. Determine what you are dealing with before making changes, and understand the context in which the changes are made. Always look at the problem you are solving from one level of context higher. In other words, have a clear picture of what is happening in terms of the business process and of who consumes the models you touch. An application can have as many layers of abstraction as you like, of whatever quality, but none of that matters if the model or the business process as a whole is broken.
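To make the takeaway concrete, here is a minimal sketch of one common way in which "adding one field" to a report can silently change the number of rows a query returns. This is an illustration only: the article does not say what the actual cause of the bug was, and the tables, columns and function names below are hypothetical.

```python
import sqlite3


def build_db():
    # Hypothetical schema: one refund for an order that has three items.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE refunds (id INTEGER PRIMARY KEY, order_id INTEGER, amount INTEGER);
        CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
        INSERT INTO refunds (id, order_id, amount) VALUES (1, 100, 500);
        INSERT INTO order_items (order_id, sku) VALUES (100, 'A'), (100, 'B'), (100, 'C');
    """)
    return conn


def refund_total_with_join(conn):
    # Joining to a one-to-many table repeats the refund row once per item,
    # so the summed amount is multiplied by the number of items.
    row = conn.execute("""
        SELECT SUM(r.amount)
        FROM refunds r
        JOIN order_items i ON i.order_id = r.order_id
    """).fetchone()
    return row[0]


def refund_total(conn):
    # Summing over the refunds table alone counts each refund exactly once.
    row = conn.execute("SELECT SUM(amount) FROM refunds").fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_db()
    print(refund_total_with_join(conn), refund_total(conn))
```

Neither query raises an error or writes anything to the logs, which is why a check against the business meaning of the number (what should the total for this order be?) is the only thing that would catch the difference.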



I kept working in the same team and gained experience, and six months later, at a team stand-up, I remarked that on the whole I now understood how our order processing system worked.



Of course I was wrong.



The complexity of large systems should never be underestimated. The American politician Donald Rumsfeld put it very well:

... as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns - the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tends to be the difficult ones.



Takeaway #2: When working with complex systems, it is important to understand what we know about them, what we do not know, and what about their behavior we do not even suspect. This is not only about tooling and following the “Monitoring towards Observability” trend, but also about dependency management and risk assessment at design time. For example, before deciding to use a trendy new database for a critical system, I strongly advise you to visit boringtechnology.club.



Everything is broken



After two years of working with the order processing system, I could say with confidence that I knew about 80% of the application. That is, for each module of the system I understood how it worked and could make changes to it. I knew which business processes were reflected in a particular model and how they were connected and affected each other. I implemented the integration with the payment processing system, which had been designed by a neighboring team. Besides the integration itself, we had to get rid of the legacy code, since payments had previously been part of our system; this was my last and largest refactoring of a big module. Everything went so smoothly that it was not even interesting.



At the same time, a conflict was brewing inside me as a developer. I honestly did not understand why our order processing system, so critical to the operation of the entire business, was so fragile. The neighboring large systems were just as fragile. Judging by everything I had seen in two years of work, complex systems could be expected to be reliable only on standard, well-tested scenarios. And when you try to make the changes the business demands, things fall apart at the first sharp maneuver by an unlucky developer.



Reflecting on all this, I came across the article Everything is broken, in which the author writes about the same problem, but on an even larger scale (Software disenchantment covers the same ground from a different angle). I am always thrilled to find outside confirmation of my inner feelings. So it was that time: after reading the article, I finally felt my vague discontent turn into a vivid and obvious insight:

Software is so bad because it's so complex.



We did not have to look far for an example in our own work: right at that time, by adding just a couple of fields, we completely broke order creation for a while.



Our big and important systems are so bad because they do not fit into our heads! And all the business processes locked inside those systems do not fit into the heads of managers and analysts either. There is simply no one person who understands how it all works together.



Takeaway #3: When designing systems, it is important to consider their cognitive load. It is made up of the complexity of the technical solutions and of the models and processes of the domain. In a well-designed system, most of the cognitive load comes from the domain and little of it from the technical solutions. Ideally, a single system should carry a cognitive load that one person can handle.



Okay, the problem is clear. But suppose we have the opportunity to rewrite an overly complex and therefore bad system, simplifying it. What else should you pay attention to? In cybernetics, there is the Conant-Ashby theorem:



Every good regulator of a system must be a model of that system. (Good regulator)



The meaning of the theorem is that if we want to control some object, we need a good (accurate and understandable) model of that object. The more complex the object, or the less information we have about it, the harder it is to obtain a good model, and that hurts our ability to control it.



I think very few people would disagree that all our services are models. But what are we modeling? It is very important to pay attention to business processes, to model not only state, but also behavior.



At the end of 2017, I moved to the new CORE team. The team had been formed specifically to carry out the part of the IT strategy concerned with decomposing monolithic systems. Our main goal was to carve up that same very large but fragile order processing system (voice-over: the samurai did not yet know that this path had a beginning, but no end!).

It was a new experience for me: a team with completely different principles and a different way of thinking. Decisions were made quickly, there was room for experiments and a right to make mistakes. The balance turned out just right: we tried things and rolled them back where the impact was minimal, but planned every step in detail where it was critical.



We wrote a new order creation service from scratch in another language (we were PHP developers and switched to Go). We evaluated the first result and rewrote it again. The emphasis was on simplicity and reliability. We put the data model at the center and built the whole architecture around it. The result was a reliable and resilient service. We managed to put it into operation without failures, using an experiment mechanism. Over time, the system has proven its worth more than once.






Takeaway #4: All models are wrong, but some are useful. Modeling state alone is not enough to build correct and stable systems. You also have to look at behavior: communication patterns, streams of events, who is responsible for which data. Look for relationships between data and pay attention to the reasons behind those relationships.
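As a rough illustration of the difference between modeling state and modeling behavior, the sketch below keeps not only the current status of an order but also the allowed transitions and a record of who caused each one. The statuses, the transition table and the actor names are assumptions made up for this example; they are not taken from the Lamoda system.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# Behavior: which status changes are allowed at all (hypothetical rules).
ALLOWED = {
    Status.NEW: {Status.PAID, Status.CANCELLED},
    Status.PAID: {Status.SHIPPED, Status.CANCELLED},
    Status.SHIPPED: set(),
    Status.CANCELLED: set(),
}


class Order:
    def __init__(self):
        self.status = Status.NEW
        # Stream of events: (from, to, actor) for every transition made.
        self.events = []

    def transition(self, target, actor):
        # Reject changes the model of behavior does not allow,
        # instead of letting any code overwrite the status field.
        if target not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.name} -> {target.name} is not allowed")
        self.events.append((self.status, target, actor))
        self.status = target


if __name__ == "__main__":
    order = Order()
    order.transition(Status.PAID, actor="payments")
    print(order.status, order.events)
```

A model that stores only `status` would accept any value written to it; here the transition table and the event list make explicit which component is responsible for which change.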



It's all about the dum dum da da dum dum



At my university there was a course in mathematical analysis taught by an associate professor with a Ph.D., Elena Nikolaevna. She was very strict, but fair. Every now and then the tests contained problems whose solution required you to "twist" the lectures a little, that is, to take an independent step towards understanding the material. And at the final exam, which, by the way, I passed on the second attempt, I had to show flexible thinking and use my intuition to solve a problem and earn a grade of “good”. Here is the rule that E.N. repeated to us throughout the course, and which I still use ten years later:

When you don't know what to do, do what you know.



That is why I was proud of my “good” in mathematical analysis: by E.N.'s standards, knowing the material was not enough; you also had to understand it and be able to synthesize something new.



Takeaway #5: The further you go, the more responsibility you have to take on and the more decisions you have to make. At some point absolute confidence disappears as a category, and in its place comes the art of balance, which follows the courage to take a step.



There comes a time when there is no one around who could remove the uncertainty for you. You have to assess the risks and take responsibility yourself, and make decisions in the face of uncertainty.



In the second half of 2018, our team led the Gift Certificates project. Initially, I was responsible for development in and around processing. Later, by the end of the year, the technical leadership of the entire project passed to me, along with the task of restoring the balance of power after part of the team had left.



The rules and world order in my developer's head were bursting at the seams, and then finally collapsed. Responsibility for a large and complex project knocked the idealistic, rainbow-colored ideas about the world of development out of me. The harsh world of constraints and good-enough solutions demanded that I rethink and revise all the approaches and rules I had followed.






Takeaway #6: Impostor syndrome. What if I get exposed? Of course you will be exposed if you do nothing. But if you do something important, after a while you notice that there is no one left to expose you.



Divergence and Convergence



According to the chronology of my "Developer's Path", this is where a technically interesting story about the personal policies project should go. In that project we implemented real-time data processing and changed the very principles of the system architecture on the fly, moving to an event-driven architecture. But I already have a separate talk about it from the Highload '19 conference (and an article here on Habr). So I would rather tell you about the “lofty matters” of technical, and not so technical, management.



When a developer grows into a senior position, which should be read as "ready to take responsibility and able to make decisions autonomously", the classic next step is team lead. A team lead is a person who is primarily responsible for the team, that is, for people and development processes. The customer does not come to the developer; the customer comes to the team lead, and holds the team lead to the commitments as well.



The result is paradoxical: as soon as a developer has grown enough to work independently as an engineer, he is thrown into a storm called management.

Perhaps for some people this path is quite comfortable, and the transition from the extremely unambiguous algorithms and protocols of interacting computer systems to coordinating a group of people looks logical. But it seems to me no accident that most conversations in specialized chats and at conferences for team leads revolve around the concept of “pain”.



What is the team lead's pain? Is it not that an engineer has been put in charge of management?! Why this happens is understandable: we have no school of technical management as such, and it is assumed that an IT engineer is a superhuman who can figure out anything, including such a “simple” thing as management.



But I decided to go the other way and chose the position of a tech lead as the next career step. As an architect, I work with development teams, and now I hear from the guys what I myself said to managers a year ago:

Why are the requirements so poorly worked out? Hacky workarounds! Two weeks?! There is a month of work here.



But hey, now it is my job to solve such problems. And as soon as you shift your thinking into the cost & benefit paradigm, you realize that all these problems cannot be solved - you bout dat life!



Takeaway #7: A discovery! Managers do not solve problems; they manage messes.



As a technical leader, my job is to remove uncertainty for the development team. Requirements not worked out? Hacky workarounds? The architecture does not allow for it? These are all signals of fragility and divergence in the system.



Let's say the task statement for the order creation service looks like this:

You need to add the X field and the Y field. The value of the Y' field in the output must equal the value of Z if X is 1.



The problem lies in the very statement of the requirements. The mistake is that it is completely unclear what state of the system is to be achieved. Steps that are fully spelled out in the statement still lead to uncertainty during implementation and operation.

After several such tasks, the order creation service will be in a rather fragile state, and incidents like the one where we added a couple of fields and everything broke will start to happen.
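One possible reading of the difference between prescribing steps and describing the target state is sketched below. The fields x, y and z mirror the X, Y and Z of the task statement above (y stands for the output value of Y); the `Order` type, both functions and the invariant itself are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    x: int
    y: int
    z: int


def apply_step(order: Order) -> Order:
    # Step-style requirement: "if X is 1, make Y equal to Z".
    # It describes one code path and says nothing about orders
    # that reach the system by any other route.
    if order.x == 1:
        return Order(order.x, order.z, order.z)
    return order


def satisfies_target_state(order: Order) -> bool:
    # State-style requirement: a property every order must satisfy,
    # no matter which code path produced it.
    if order.x == 1:
        return order.y == order.z
    return True


if __name__ == "__main__":
    produced_by_new_code = apply_step(Order(x=1, y=5, z=9))
    produced_elsewhere = Order(x=1, y=5, z=9)
    print(satisfies_target_state(produced_by_new_code))
    print(satisfies_target_state(produced_elsewhere))
```

A requirement written as a property of the state can be checked in tests and in production for every order, while a requirement written as a step can only be checked for the code that happens to perform that step.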



Objective: to ensure the convergence of the states of systems, the consistency of the statement of tasks and the reduction of uncertainty to achieve stability






The people working above the line of representation are constantly building and updating their models of what lies below the line. These activities are critical to the resilience of Internet systems and are a major source of adaptive capacity. (Above the Line, Below the Line)



The architect must understand the socio-technical system as a single whole, and be able to coordinate processes above the line of representation so that the systems below the line meet the constraints of correctness, stability and adaptability.






Takeaway #8: If the rules stop working, congratulations, you've reached the boundary conditions under which the current model stops working. It's time to revise your ideas and choose a new model that will meet current constraints and allow you to build adequate processes and rules.



Software is simple, people are hard!



No, really. That is what one book on architecture said. And the further I go, the more often I find myself repeating it.



Technical concepts, algorithms and standards are clear: you just have to sit down and work through them methodically. I am not trying to belittle the work of engineers; algorithms for distributed systems are extremely complex if you do not build such systems every day. But more often than not, the main difficulty we run into at work is understanding why a given service needs that particular level of abstraction over the domain. And the problem is often compounded by the fact that the person who wrote the service is no longer around.



Algorithms that are easy to implement are more successful than mathematically rigorous ones. Paxos is mathematically rigorous, but consensus algorithms only came into practical use once the Raft protocol, which is easier to implement, had been described.

Go is criticized for being too limited. Yet Docker, Kubernetes and many other distributed systems are written in it. A well-designed system of constraints serves as the foundation for successful systems.



Everything would be much easier if technology did not have to reckon with the human factor. But it does. Any IT system that more than one person builds and maintains must take the human factor into account.



This is where technologies at the intersection of software & people emerge, designed to structure the chaos and describe complex interactions. Domain-Driven Design, microservices, Agile - all of them create constraints that describe the principles and rules of interaction. Structures appear that people know how to work with. But such technologies do not always make things better. Quite often it turns out the other way around - see What Money Cannot Buy.



Takeaway #9: Programs can and should be simple. To get there, you need to put effort into building an engineering culture. It is the culture that ultimately determines how well the services perform.






Reading list



Books



The Manager's Path: A Guide for Tech Leaders Navigating Growth and Change, Camille Fournier

An Elegant Puzzle: Systems of Engineering Management, Will Larson

Team Topologies: Organizing Business and Technology Teams for Fast Flow, Matthew Skelton and Manuel Pais

Righting Software, Juval Löwy

Thinking in Systems: A Primer, Donella Meadows



Articles



Mental Models, Farnam Street

Complexity Bias: Why We Prefer Complicated to Simple, Farnam Street

What Money Cannot Buy, LessWrong

Becoming Unusually Truth-Oriented, LessWrong

Programming Laws and Reality: Do We Know What We Think We Know?

No Silver Bullet: Essence and Accidents of Software Engineering

The Art Of Troubleshooting

From Fragile to Antifragile Software

Computers can be understood

Tuckman Was Wrong! (About teams)

How to fail at almost everything and still win big

Simplicity Before Generality, Use Before Reuse

Simplicity, Please - A Manifesto for Software Development

Software Design is Human Relationships


