The future of Prometheus and the project ecosystem (2020)

Translator's note: This is a translation of an article based on a recent talk by Richard Hartmann, a prominent member of the Prometheus development team, community director at Grafana Labs, founder of the OpenMetrics project, and chair of SIG Observability at the CNCF. The author sums up the past year in the life of the open source project Prometheus and its community, and discusses the main difficulties and near-term prospects.

At PromCon Online 2020, I gave a talk titled "The Future of Prometheus and Its Ecosystem". Here are its key points.

Summary

Prometheus has changed a lot since the last PromCon in 2019.

2.14

Version 2.14 introduced a new React-based UI. It still lags behind the old one in functionality, but we are working on it and continue to improve it.

2.15

Version 2.15 was all about the metadata API. The Prometheus exposition format has long supported the special comments HELP, TYPE, etc., but Prometheus itself used to simply discard this data. Now that it is retained, it can be exposed through the API. Grafana, for example, already takes advantage of this and gives users additional information about the time series they are working with.
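To make the HELP and TYPE metadata concrete, here is a minimal Python sketch that collects it from a snippet of exposition text. It is an illustration only, not how Prometheus itself does it: the function name `parse_metadata` and the sample metric are assumptions, and the parser ignores escaping and other details of the real format.

```python
from collections import defaultdict


def parse_metadata(exposition_text):
    """Collect HELP and TYPE comments per metric name (simplified sketch)."""
    metadata = defaultdict(dict)
    for line in exposition_text.splitlines():
        # Metadata lines look like "# HELP <name> <text>" or "# TYPE <name> <type>".
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[0] == "#" and parts[1] in ("HELP", "TYPE"):
            _, keyword, name, value = parts
            metadata[name][keyword.lower()] = value
    return dict(metadata)


# Hypothetical sample; the metric name and help text are made up for illustration.
SAMPLE = """\
# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
"""

if __name__ == "__main__":
    print(parse_metadata(SAMPLE))
```

Before 2.15 this is the information that was thrown away after scraping; a consumer such as a dashboard can use it to show a description and type next to each series.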

2.16

Version 2.16 focused on various improvements and stability. For example, users had been asking since 2016 for the ability to select local time in the UI and for a query log; it was nice to close these issues.

2.17

Speaking of long-standing feature requests, version 2.17 finally brought the long-awaited "I" in ACID for the database: isolation.

2.18

Version 2.18 improved tracing and exemplar support in the exposition format. Exemplars are the first user-visible result of implementing OpenMetrics in Prometheus. By combining them with Grafana, you get a convenient drill-down from a coarse view to a more detailed one.

[Two screenshots: the view before and after the drill-down]
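For readers who have not seen an exemplar, the sketch below parses one from a single sample line written in the OpenMetrics text style, where the exemplar follows a `#` after the sample value. This is a simplified illustration under stated assumptions: the regular expressions, the function name `parse_exemplar`, and the sample line are hypothetical, and label values containing spaces or escapes are not handled.

```python
import re

# Simplified shape: "<series> <value> # {<exemplar labels>} <exemplar value> [<timestamp>]"
EXEMPLAR_RE = re.compile(
    r'^(?P<series>\S+) (?P<value>\S+) # \{(?P<labels>[^}]*)\} '
    r'(?P<ex_value>\S+)(?: (?P<ex_ts>\S+))?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exemplar(line):
    """Return the sample and its exemplar, or None if the line has no exemplar."""
    match = EXEMPLAR_RE.match(line)
    if match is None:
        return None
    return {
        "series": match["series"],
        "value": float(match["value"]),
        # Exemplar labels typically carry a reference such as a trace ID.
        "exemplar_labels": dict(LABEL_RE.findall(match["labels"])),
        "exemplar_value": float(match["ex_value"]),
    }


# Hypothetical sample line; metric name and trace ID are made up.
LINE = 'request_seconds_bucket{le="0.5"} 12 # {trace_id="abc123"} 0.43 1600000000.0'

if __name__ == "__main__":
    print(parse_exemplar(LINE))
```

The trace ID carried in the exemplar labels is what lets a dashboard link a point on a graph to one concrete request.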

2.19

In 2.19, we reduced memory consumption by as much as 50%! Even though Prometheus is already quite efficient, there is still significant potential for optimization, which is both exciting and daunting.

This graph is a great illustration of this:

2.20

Version 2.20 boasts the longest changelog since v2.6 (!). The headline item is probably native service discovery support for Docker Swarm and DigitalOcean.

But there is a more important change that goes beyond two independent service discovery mechanisms: we are taking Prometheus apart and taking a fresh look at many old decisions and established approaches. The world has changed (perhaps we had a hand in that too), and this must be taken into account both in the project itself and in others.

node_exporter

To round things off, node_exporter has reached version 1.0 and now includes EXPERIMENTAL TLS support. The Cloud Native Computing Foundation sponsored an audit of node_exporter by Cure53, covering both the exporter in general and our TLS implementation in particular. It was doubly worth it: we not only vetted the TLS implementation before copying it to other exporters, but did so on node_exporter, the guinea pig whose patterns the other exporters copy.

Future

Sometimes I get the feeling that we, as a project, are resting on our laurels. A while ago I ran a brainstorming session inside Grafana under the motto "Prometheus is missing features" and encouraged Red Hat to do the same. Along the way, we created a catch-all document that is available to the entire Prometheus team. It serves as a basis for working out specific topics, which are broken down into discussion items for the dev summits as soon as they are ready.

Developer Summits

Last year we had two dev summits: one after KubeCon EU and the second after PromCon. We planned to do the same in 2020, but COVID got in the way. There have been no in-person summits this year, but I believe we have found a way around that: shorter, more frequent, virtual meetings. We meet in blocks of 4 hours instead of gathering for 1-2 days at a time. The first such dev summit took place on July 10, and the next one will probably take place on August 7. We will keep holding them until we have worked through the whole backlog (although it keeps growing as new items from the document mentioned above are added).

Right now I want to do two things:

Metadata

Metadata is starting to bring real value to Prometheus (see 2.15 above). We need more ways of working with metadata (for example, propagating it via remote read/write). The consensus below does not cover interesting questions such as "What if the metadata changes or disappears?" or "What if it becomes an attack vector?"

CONSENSUS: We want to better support metadata. The work will be carried out in a dedicated document.

CONSENSUS: PR 6815 will go in as an EXPERIMENTAL workaround. It will most likely look different in 3.x.

Workflow changes and s/master/main/

Cleaning up the cruft that accumulates in workflows needs no special explanation, but a few words are in order about cleaning up terminology. We take it seriously: it is not the most important thing, but it is something we can do now. For the moment we are waiting for the corresponding tooling from GitHub. As soon as it is available, we will try to bring in a paid intern for this work through Community Bridge.

I am in talks with the Linux Foundation and the CNCF about potentially doing this across all projects. For anyone interested in the topic, it is a great chance to explore many projects, work with a variety of tools, and meet many people. Contact me on Twitter (@TwitchiH) or by mail (richih on grafana.com) if you're interested.

CONSENSUS: Set "Require status checks before merge" in all prometheus/... repositories. Do not allow direct pushes to the main branch? Do not allow force pushes to the main branch?

CONSENSUS: Disable force pushes to all main branches.

CONSENSUS: By default, pushing to the main branch is allowed, but it should be disabled for some "important" repos, for example prometheus/prometheus (at the maintainer's discretion).
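The three consensus items above can be read as a small policy. The Python sketch below builds a branch-protection request body that reflects it, using field names from GitHub's REST API for updating branch protection as I understand them. Several things here are assumptions rather than decisions from the summit: the function name `protection_payload`, the check name `ci/test`, modelling "no direct pushes" as a required pull request review, and the choice of `strict: False`. Nothing is sent over the network.

```python
import json


def protection_payload(important_repo, status_checks):
    """Build a branch-protection body reflecting the consensus (sketch only)."""
    return {
        # Status checks must pass before merging, in every repository.
        "required_status_checks": {
            "strict": False,  # assumption: branches need not be up to date
            "contexts": list(status_checks),
        },
        "enforce_admins": False,
        # Assumption: "no direct pushes" is modelled as requiring a pull request.
        # Only "important" repos get this; elsewhere direct pushes stay allowed.
        "required_pull_request_reviews": (
            {"required_approving_review_count": 1} if important_repo else None
        ),
        "restrictions": None,
        # Force pushes are disabled on all main branches.
        "allow_force_pushes": False,
    }


if __name__ == "__main__":
    # "ci/test" is a hypothetical status check name.
    print(json.dumps(protection_payload(True, ["ci/test"]), indent=2))
```

Keeping the "important repo" decision as a single boolean mirrors the consensus that this is left to each maintainer's discretion.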

Backfilling

This is one of the oldest feature requests and a good example of how to approach consensus. Opinions within the Prometheus team differ widely on this issue, and it is difficult to find common ground. I therefore wrote a limited and very specific consensus proposal with three criteria: "We want to support backfilling over the network at least in blocks that do not overlap with the head block."

After lengthy discussion and attempts to reach a consensus, it became clear that this would not be easy. I therefore reformulated the proposal as follows: "We want to support backfilling over the network at least by streams that do not overlap with the head block."

Only by making everyone state their own opinion on the matter were we able to arrive at the final version: "We would like to support backfilling over the network in blocks that do not overlap with the head block, provided it is properly implemented." Every word here was chosen to reflect the exact extent and boundaries of the consensus.
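The key condition in the final wording is that a backfilled block must not overlap with the head block. A minimal Python sketch of such a check follows. It assumes half-open time ranges in milliseconds (start inclusive, end exclusive); the names `TimeRange`, `overlaps`, and `can_backfill` are hypothetical and do not come from the Prometheus code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRange:
    min_t: int  # inclusive start, milliseconds since the epoch
    max_t: int  # exclusive end


def overlaps(a, b):
    """Two half-open ranges overlap if each starts before the other ends."""
    return a.min_t < b.max_t and b.min_t < a.max_t


def can_backfill(block, head):
    """Accept a block only if it does not overlap with the head block."""
    return not overlaps(block, head)


if __name__ == "__main__":
    head = TimeRange(min_t=1_000, max_t=2_000)
    print(can_backfill(TimeRange(0, 1_000), head))    # ends where the head starts
    print(can_backfill(TimeRange(500, 1_500), head))  # reaches into the head
```

With half-open ranges, a block that ends exactly where the head begins is accepted, while anything reaching into the head's range is rejected.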

CONSENSUS: [...] Prometheus [...] OpenMetrics, CSV [...]

Another housekeeping task. Here I want to criticize Go: it was developed in a world where a single monorepo is the norm. Google keeps all (or most) of its shared code in one repository. This approach has many advantages, but it does not translate well to the outside world. Go is slowly but surely moving away from this legacy.

Fun fact: I wrote the consensus proposal almost at the very beginning of the discussion. It was clear that we would at least try it. It was clear that Ben Kochie would volunteer to do it. And it was clear that node_exporter would be the "victim". As a rule, we strive to improve the workflow, Ben is always a volunteer, and node_exporter is the test bed from which we then copy the results to other exporters. And yes, it was important that the discussion ran for a while and that people arrived at this on their own, rather than being presented with a done deal.

CONSENSUS: Delete it in node_exporter and see if we are happy with the result.

Mailing lists and IRC

Google is blocked in China, but our mailing lists run on Google Groups. We decided to try to make it possible to subscribe by email alone. I checked: prometheus-users+subscribe@googlegroups.com works. You can also read the archives (https://www.mail-archive.com/prometheus-users@googlegroups.com/maillist.html) if you wish.

In addition, we made sure that everyone can use IRC through Matrix, since xkcd 1782 is very relevant for some.

Documentation

Let me repeat what I said at the summit: "The very first thing I didn't like about Prometheus in 2015 was the documentation; in 2020, I simply hate it. It is difficult to use and almost useless, suitable only for those who already know what they are doing, or at least have a good grasp of the concepts." In short, we will work in this direction:

* User manual [...]

* Reference [...], PromQL [...]

* Guides [...]

Diana Payton [...]

If you are interested, we are currently looking for a technical writer on Grafana Cloud who will work on the official upstream documentation for Prometheus. At the end of the day, we take our commitment to the community seriously.

As usual, the notes from the dev summit will be published. You can also read the results of the 2020-1 summit and of summits from previous years.