Kafka, Lamoda and an irresistible desire to learn





Nikita Galushko, a developer in Lamoda's Online Shop division, visited the Slurm training center to share his impressions of the Kafka course, explain how the technology is used, and describe the problems it solves in the R&D (Research and Development) department.



"How quickly will we hit the network channel that Kafka uses - in two years or less?"


Lamoda is one of the largest online stores in Russia and the CIS. To keep the site running smoothly and shipments and deliveries fast, 340 IT specialists work at Lamoda: developers, QA engineers, analysts, DevOps specialists, product managers, and designers. Let's find out how the cogs turn inside this system.



Tell us a little about yourself. What do you do at Lamoda, and how long have you worked there?



As I like to say: "I push buttons, and they pay me money for it." I've been pushing buttons for about six years now, writing in Go the whole time. I haven't been at Lamoda all that long, since October 2020, but I've already managed to get my tentacles into many of its services.



I work in the R&D (Research and Development) department. It's the kind of place where one week you write a service on top of Kafka and figure out how to make a distributed transaction between Kafka and PostgreSQL, and the next week you write a code generator for Aerospike. It's a very interesting job.
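
The interview doesn't say how that Kafka/PostgreSQL transaction problem was actually solved at Lamoda, but one standard answer is the transactional outbox pattern: the event is written to an outbox table in the same database transaction as the business change, and a separate relay process ships rows from that table to Kafka. A minimal sketch in Go, with hypothetical table and topic names:

```go
package orders

import (
	"context"
	"database/sql"
)

// placeOrder stores an order and its "order placed" event atomically.
// A separate relay later reads unsent rows from the outbox table and
// publishes them to Kafka, so the event is never lost even if the
// service crashes right after the commit.
func placeOrder(ctx context.Context, db *sql.DB, orderID string, payload []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if Commit succeeds

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO orders (id, payload) VALUES ($1, $2)`,
		orderID, payload); err != nil {
		return err
	}
	// The event becomes durable in the same transaction as the order itself.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (topic, key, value) VALUES ('orders.placed', $1, $2)`,
		orderID, payload); err != nil {
		return err
	}
	return tx.Commit()
}
```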



Do I understand correctly that R&D is strongly tied to analytics?



Yes, we work closely with data scientists. They do the analytics, train and build models, and then come to us with a request to embed those models into the application or the website. For example, a model that shows users a personalized list of recommendations, or one that suggests which clothing or shoe size is most likely to fit a customer.



What do you like about the work? What makes the R&D department interesting for you?



Probably the freedom. You aren't limited to a single knowledge domain, one service, or a small set of services. In R&D you can change a Go service one day and a Java one the next. For some people that's a drawback, since it makes it hard to focus on one service. But for me it's a chance to try my hand in different areas, see which approaches other developers used to solve different problems, and learn something new.



We recently started building a code generator for parsing Aerospike results. Since Aerospike returns raw data, you constantly have to parse it by hand, which means potential mistakes and wasted time: you forget something or fail to check something, and the result isn't what you expect. So we thought it over and decided to write our own code generator. For now it runs in test mode, but I hope we will open-source it soon.
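
For context, here's what that hand-parsing looks like with the Aerospike Go client: a record comes back as a map of bin names to interface{} values, and every field needs its own type assertion. The struct, namespace, and bin names below are invented for illustration, and the import path may need a version suffix depending on the client version; the generated code would replace exactly this kind of boilerplate:

```go
package store

import (
	"fmt"

	as "github.com/aerospike/aerospike-client-go"
)

// User is a hypothetical struct we want to fill from an Aerospike record.
type User struct {
	Name string
	Age  int
}

func fetchUser(client *as.Client, id string) (*User, error) {
	key, err := as.NewKey("shop", "users", id) // namespace and set are made up
	if err != nil {
		return nil, err
	}
	rec, err := client.Get(nil, key)
	if err != nil {
		return nil, err
	}
	// rec.Bins is a map[string]interface{}: every field has to be asserted
	// by hand, and a forgotten check is exactly the bug codegen removes.
	name, ok := rec.Bins["name"].(string)
	if !ok {
		return nil, fmt.Errorf("bin %q has unexpected type %T", "name", rec.Bins["name"])
	}
	age, ok := rec.Bins["age"].(int)
	if !ok {
		return nil, fmt.Errorf("bin %q has unexpected type %T", "age", rec.Bins["age"])
	}
	return &User{Name: name, Age: age}, nil
}
```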



Does Lamoda often do open source projects?



Lamoda has a relatively well-known open source project called Gonkey. You can read more about it in our article.



Gonkey is a standardized set of tools that makes it easy to write automated tests in YAML. That's convenient because both developers and testers can write such tests, which raises the test coverage.
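
A Gonkey test case is a YAML document describing a request and the expected response, roughly like this; the exact schema is in the project's README at github.com/lamoda/gonkey, and the endpoint and bodies here are invented:

```yaml
- name: creating an order returns ok
  method: POST
  path: /api/v1/orders
  request: '{"sku": "ABC-123", "qty": 1}'
  response:
    200: '{"status": "ok"}'
```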



Right now the tool isn't evolving inside the company as quickly as we'd like, but we plan to devote more time to it in the future: close the issues that get opened on GitHub, answer questions, and keep improving it.



That probably requires a broad stack of knowledge and studying twice as much as usual. Is that so?



I would say that you just need to have a broad outlook.



Superficial?



Not exactly. There is such a thing as a T-shaped developer: someone who knows one area very well, even excellently, while being more or less versed in a number of others. I already said that I write some Java now, but 99% of my code is still in Go. So when you know the Go stack very well and also know how to look around, that helps not only in R&D but in general. You can borrow ideas and approaches from other technologies and languages.



I know people who wrote in Python for a while and then switched to Go; they liked how the language approaches error handling. Now they are trying to bring the same approach into the Python projects they work on.



Every developer who wants to grow probably has no other choice: you have to level up in different areas. You can't sit forever understanding only one highly specialized niche. Wherever you work, you need to keep developing.



As I understand it, the tasks in R&D are quite diverse. Did you have to learn anything new along the way?



Learning and development have been with me ever since I first got acquainted with programming in the 10th grade at the lyceum. I enjoy learning something new and telling others about it. Before Lamoda I worked at VKontakte and developed the same way: read articles, took courses, watched conference talks.



I'm not urging anyone to run off and read books and develop selflessly; everyone decides that for themselves. Take an example from one of my past jobs: we wrote everything in Go, and in parallel I kept an eye on Rust. It wasn't that popular yet, and the "Go versus Rust" articles from those distant times were very interesting to me, even though I didn't need any of it for work.



Speaking of working at Lamoda, what have you had to level up over the last year, besides Kafka?



Working with Kubernetes and writing Helm charts. By the way, I took your Kubernetes course, because I hadn't worked with it before. Until then it was either virtual machines or physical hardware, and everything went either through the admins, or you had access to roll out a deb package yourself. So I had to master Kubernetes: not at the level of "check the pod's state through kubectl", but at the level of writing a Helm chart correctly and understanding how it works inside.



While we're on the subject of courses, let's talk about Kafka. Why did you take the Kafka course?



I saw a banner on the website: "Kafka course coming soon." And I thought: "We must go!"



As I mentioned, I just love learning new things. You don't need to be a genius to send or read a Kafka message at a basic level. But Kafka has long been firmly embedded at Lamoda, so a deep dive into the technology was inevitable for me.



What is Kafka for you?



For me it's a distributed, fault-tolerant log with a simple interface that can pump an incredibly large amount of data through itself.



How is Kafka used in Lamoda?



It seems to me that it's very hard to find a service at Lamoda that doesn't use Kafka in some way. We have built an event bus on top of it: a single bus of events for all of Lamoda. Someone writes events to it, and any other participant connected to the bus can read them and react in some way.



As for new projects, we recently launched a service for collecting analytical data from the backends (xlog is its internal name). It also works through Kafka, since it requires high throughput from the whole system.



Kafka is also needed for working with ClickHouse, which has a Kafka engine. We just write to Kafka, and ClickHouse reads the data and stores it itself. That's very convenient, because one of our projects needs to make many inserts into ClickHouse, and often. As we know, ClickHouse doesn't handle that well out of the box; it needs a suitable proxy in front of it. There are solutions on the market from Yandex and VKontakte, but since Lamoda already has solid Kafka expertise, we decided to use Kafka to talk to ClickHouse.
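
The Kafka engine setup typically looks like this in ClickHouse: a Kafka-engine table acts as the consumer, and a materialized view continuously moves rows into an ordinary MergeTree table. A sketch with invented table names and columns:

```sql
-- Kafka-engine table: ClickHouse consumes the topic through it.
CREATE TABLE events_queue (
    ts      DateTime,
    payload String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse-events',
         kafka_format = 'JSONEachRow';

-- Target table for the actual storage.
CREATE TABLE events (
    ts      DateTime,
    payload String
) ENGINE = MergeTree
ORDER BY ts;

-- The materialized view drains the queue into storage,
-- batching inserts the way ClickHouse likes them.
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT ts, payload FROM events_queue;
```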



We also actively use it for all sorts of analytics.



How does the R&D team use Kafka? If Kafka is a log for you, do I understand correctly that you build services on top of it, that is, you work with Kafka Streams?



We have our own wrapper around the Kafka client library that provides a layer of abstraction. In essence, Go has channels: developers read from a channel and write to a channel. They may not even think about whether Kafka is behind it or not.
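
Lamoda's wrapper is internal, but the idea is easy to sketch: hide the Kafka client behind a function that returns a plain Go channel. A minimal version on top of the segmentio/kafka-go client (my illustration, not their code):

```go
package kafkachan

import (
	"context"

	kafka "github.com/segmentio/kafka-go"
)

// Messages hides the Kafka consumer behind an ordinary Go channel:
// callers just range over it and never touch the client directly.
func Messages(ctx context.Context, brokers []string, topic, group string) <-chan kafka.Message {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: brokers,
		Topic:   topic,
		GroupID: group, // with a GroupID set, ReadMessage commits offsets automatically
	})
	out := make(chan kafka.Message)
	go func() {
		defer close(out)
		defer r.Close()
		for {
			m, err := r.ReadMessage(ctx)
			if err != nil {
				return // context cancelled or a fatal reader error
			}
			select {
			case out <- m:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}
```

Consumer code then just does `for m := range Messages(ctx, brokers, "events", "my-group")` and never needs to know Kafka is on the other end.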



What problems have you and your team faced with Kafka? How did you try to solve them?



Right now the question is how soon we will saturate the network link that Kafka uses: in two years or earlier? And behind it comes another question: which compression should we enable, and for which Kafka topics, to push that moment back?



Roughly speaking, that same analytics collection service is the first candidate for compression. But we can't just switch compression on, because it's a trade-off against CPU usage on the producers and consumers.
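
On the producer side, enabling compression is a one-line change; the hard part is exactly the CPU trade-off he describes. With the segmentio/kafka-go client it looks roughly like this (broker address and topic name invented):

```go
package main

import (
	"context"
	"log"

	kafka "github.com/segmentio/kafka-go"
)

func main() {
	w := &kafka.Writer{
		Addr:     kafka.TCP("kafka-1:9092"), // hypothetical broker address
		Topic:    "xlog-events",             // hypothetical analytics topic
		Balancer: &kafka.LeastBytes{},
		// Candidates: kafka.Gzip, kafka.Snappy, kafka.Lz4, kafka.Zstd.
		// Each trades producer and consumer CPU against network and disk
		// savings differently, which is what the benchmarks have to settle.
		Compression: kafka.Zstd,
	}
	defer w.Close()

	err := w.WriteMessages(context.Background(),
		kafka.Message{Key: []byte("user-42"), Value: []byte(`{"event":"view"}`)},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```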



Right now I'm preparing a document with tests and analysis. By the way, your course helped me a lot with this, because it has a separate lesson on how to benchmark Kafka. In the document I want to establish whether this service needs compression enabled at all, and if so, which algorithm, since there are several to choose from. To me this is the most obvious area for improvement.



Are there any current burning problems with Kafka?



When we were setting up ClickHouse to work with Kafka, there was a problem: the Describe permissions on the consumer group we used weren't configured quite correctly.



That sounds less scary than the prospect of running out of network bandwidth.



I'll also ask about something I learned recently. Starting with Kafka 2.8, KIP-500, the proposal to abandon ZooKeeper, has shipped in early access. As I understand it, Kafka was constrained by the presence of ZooKeeper and its limitations, and they promise that dropping ZooKeeper will raise the possible number of partitions to two million. Does that solve your problem in any way?



To answer directly: no, it doesn't, because we aren't running into the limits of Kafka itself but into the network link in front of it. Kafka easily copes with the amount of data we send it; the link doesn't get any wider because of that.



As for KIP-500 itself, they've taken the first step towards abandoning ZooKeeper, but so far it doesn't look like a battle-tested solution: for any more or less loaded production system it's probably not worth abruptly dropping ZooKeeper and switching to 2.8.



The thing is, it's usually the ops team that deals with Kafka, and they need to understand how to solve the problems that come up. Right now they know: if something happens, go do one specific thing in Kafka and another in ZooKeeper. When ZooKeeper is gone, that playbook stops working, and they'll have to build up expertise in the matter all over again.



Do I understand correctly that you're hitting the network, horizontal scaling of Kafka doesn't help, and it's purely a network problem?



In general, Kafka is built in such a way that you will run into the network or something else before you run into its own performance. I remember the course saying exactly that, with the instructors explaining in detail why it's so. But let's not spoil it, so that those who are interested can go and take the course.



So the big task is clear: throughput. And you went to the course to tackle it and bring something back to the team. Was that a coincidence, or was it deliberate?



A coincidence, because I originally went to take the course as a whole, not just to learn how to write to Kafka. Indeed, for good and correct replication you need to set the right acks. The course dives you into the internals of the system and how it works.
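
The acks setting he refers to controls how many replicas must confirm a write before the producer considers it sent. A fragment building on the earlier segmentio/kafka-go sketch (topic name invented):

```go
// Same segmentio/kafka-go client as in the earlier sketch.
w := &kafka.Writer{
	Addr:  kafka.TCP("kafka-1:9092"),
	Topic: "orders", // hypothetical
	// kafka.RequireNone (acks=0): fire and forget.
	// kafka.RequireOne  (acks=1): the partition leader has written it.
	// kafka.RequireAll  (acks=all): all in-sync replicas have it; together
	// with a sensible min.insync.replicas, this is what correct replication needs.
	RequiredAcks: kafka.RequireAll,
}
```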



Speaking of the course, there is no division into developer and admin tracks. Did you go through all the topics, or did you skim the admin ones?



I went through all the topics, because everything is interesting to me. I get a kick out of learning new things. Usually I take notes, and after a while I come back to them, re-read them, and throw something out. If something is unclear, I go back, rewatch part of the course, and rewrite part of my notes.



Did you do the hands-on labs? Was it hard to complete them, especially the admin-oriented ones?



I did, though not all of them yet. And no, everything was well enough planned out: what to do, how to do it, and what result to expect. It was interesting.



For one thing, some of the practical work uses a Java application. I wasn't interested in just finishing the assignment, so I spent some time studying the Java code that works with Kafka. You need to look a little deeper and wider when you go through practical tasks like these.



So you look at the technology itself.



Yes. When I was doing the lab on partitions, it touched on replication. And I thought: what would happen if I did it a little differently? I took the time to play around and check: what happens if I turn off one node? What about two? What if I do something else? It's good that there's a ready-made lab environment for practice, so you don't have to spin anything up yourself and waste time on that.



Tell us, what was the most interesting part for you? What did you learn that really surprised you?



For example, that Kafka is, in effect, an in-memory queue. We're used to databases not just writing to disk but also calling fsync so that the operating system flushes the data to disk, because calling write alone doesn't guarantee the data will actually be written.



Kafka doesn't do this: it just issues the write system call, and that's it. It essentially says: "You're the operating system, you decide when to flush." Kafka's reliability is actually ensured through replication. I didn't know that; I thought Kafka called fsync and honestly persisted all data to disk. That's how cunning it is.
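
This is visible in the broker configuration: the flush knobs exist but are effectively off by default, and durability is configured through replication instead. A hedged sketch of the relevant server.properties entries (the flush default is as documented; the replication values shown are typical choices, not defaults):

```properties
# Force an fsync every N messages; the default is Long.MAX_VALUE,
# i.e. effectively never, so flushing is left to the OS page cache.
# (log.flush.interval.ms, the time-based variant, is unset by default.)
log.flush.interval.messages=9223372036854775807

# Durability comes from replication instead; typical production values:
default.replication.factor=3
min.insync.replicas=2
```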



It was also interesting to hear about the problems of running across several data centers.



Back to your bandwidth tasks: what did you manage to take away from the course to meet the challenges ahead?



How to measure Kafka's performance, and how it works with compressed data. Kafka doesn't decompress messages: it writes them to disk and hands them to consumers as is. That made it clear that you have to look at CPU costs not only on the producer that writes to Kafka, but also on the consumer that reads from it. Well, and how to benchmark it all correctly.



Was there anything in the course that was difficult for you? Perhaps the practical work in particular?



There was, actually: I watched some videos several times. Specifically, I rewatched the part about consumer group rebalancing. On first viewing it wasn't entirely clear how it happens. I don't mean plain rebalancing, but incremental rebalancing. That I had to rewatch and re-read.



The topic is complex in itself. You watch the video and it seems more or less clear, but you want to understand all the processes precisely, so you have to rewatch. Then you sit down with a pencil and draw it out, and only then does everything seem to fall into place.



And in conclusion: what are your plans for learning and work?



I started a blog. I bought a domain and put the site up on DigitalOcean for free; they serve static content at no charge. Blogging gives me an incentive to learn something, write it down, and share it with others. You understand a topic when you can explain it to someone so that they understand it too.



And a blog trains exactly that skill of retelling. Right now I'm digging into the efficiency of GIN indexes; an article on the topic is coming soon, based on last year's GolangConf talk.



You always have to keep looking for what's yours: if you have the energy, why not read an article on how Go works.


