Analysis of vacancies using clustering

This article describes a method for analyzing vacancies returned by the search query "Python", using the Gaussian Mixture Model (GMM) clustering algorithm. For each resulting cluster, the most requested skills and the salary range are shown.



What was the purpose of this study? I wanted to know:



  • In what applications Python is used
  • What knowledge is required: databases, libraries, frameworks
  • How in-demand specialists in each area are
  • What salaries are offered






Loading data



Vacancies were downloaded from hh.ru using its API (dev.hh.ru). For the query "Python", 1994 vacancies were retrieved (Moscow region) and split into training and test sets in an 80/20 proportion: the training set contains 1595 vacancies, the test set 399. The test set is used only in the Top / Antitop skills and Job classification sections.



Features



Based on the text of the uploaded vacancies, two groups of the most common word n-grams were formed:



  • 2-grams in Cyrillic and Latin
  • 1-grams in Latin


In IT vacancies, key skills and technologies are usually written in English, so the second group included words only in Latin.



After selection, the first group contained 81 2-grams and the second 98 1-grams:

No. n n-gram Weight Vacancies
1 2 in python 8 258
2 2 ci cd 8 230
3 2 understanding of principles 8 221
4 2 knowledge of sql 8 178
5 2 development and 9 174
... ... ... ... ...
82 1 sql 5 490
83 1 linux 6 462
84 1 postgresql 5 362
85 1 docker 7 358
86 1 java 9 297
... ... ... ... ...


It was decided to divide vacancies into clusters according to the following criteria, in order of priority:

Priority Criterion Weight
1 Field (applied direction), position, experience. n-grams: "machine learning", "linux administration", "excellent knowledge" 7-9
2 Tools, technologies, software. n-grams: "sql", "linux os", "pytest" 4-6
3 Other skills. n-grams: "technical education", "English", "interesting tasks" 1-3


Deciding which group of criteria an n-gram belongs to, and what weight to assign to it, was done intuitively. Here are a couple of examples:

  1. At first glance, "Docker" can be attributed to the second group of criteria with a weight of 4 to 6. But the mention of "Docker" in the vacancy most likely means that the vacancy will be for the position of "DevOps engineer". Therefore, "Docker" fell into the first group and received a weight of 7.

  2. The same goes for «Java»: a mention of «Java» most likely means a vacancy for a Java developer "with knowledge of Python" rather than a Python developer. Such vacancies deserve their own cluster, so «Java» also fell into the first group and received a weight of 9.








For the calculations, each vacancy was transformed into a vector of 179 integers (the number of selected features) from 0 to 9, where 0 means the i-th n-gram is absent from the vacancy, and a value from 1 to 9 means the i-th n-gram is present and has that weight. Further in the text, a point means a vacancy represented by such a vector.

Example:

Let's say the list of n-grams contains only three values:

No. n n-gram Weight Vacancies
1 2 in python 8 258
2 2 understanding of principles 8 221
3 1 sql 5 490


Then for a vacancy with the text:



Requirements:



  • 3+ years of experience in python development.
  • Good knowledge of sql


the vector is equal to [8, 0, 5].
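This vectorization can be sketched in a few lines; the n-gram list and weights are the three-row example above.

```python
# Vectorizing a vacancy as described above; the n-gram list and weights
# are the three-row example from the text.
ngrams = [("in python", 8), ("understanding of principles", 8), ("sql", 5)]

def vectorize(text):
    """Weight of each n-gram if it occurs in the text, 0 otherwise."""
    t = text.lower()
    return [w if g in t else 0 for g, w in ngrams]

vacancy = "3+ years of experience in python development. Good knowledge of sql"
print(vectorize(vacancy))  # [8, 0, 5]
```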



Metrics



To work with data, you need to have a feel for it. In our case, I wanted to see whether there are any clumps of points that we could treat as clusters. To do this, I used the t-SNE algorithm to project all vectors into 2-D space.



The essence of the method is to reduce the dimensionality of the data while preserving, as far as possible, the proportions of the distances between the points of the set. It is quite hard to understand how t-SNE works from the formulas alone, but I liked one example found somewhere on the Internet. Suppose we have balls in three-dimensional space. We connect every ball to every other ball with an invisible spring; the springs pass through one another freely and do not interfere with each other. The springs act in both directions, i.e. they resist both pulling the balls apart and pushing them together. The system is in a stable state and the balls are stationary: if we pull one ball aside and release it, the spring forces return it to its original position. Now we take two large plates and squeeze the balls into a thin layer, while still allowing them to move in the plane between the plates. The spring forces take over: the balls move and eventually stop when all the forces balance out. The springs ensure that balls that were close to each other remain relatively close in the plane, and balls that were far apart remain far apart. Using springs and plates, we converted three-dimensional space into two-dimensional space while preserving the distances between points in some form!



I used the t-SNE algorithm only for visualizing the points of the set. It helped me choose the metric and tune the feature weights.



If we use the familiar Euclidean metric from everyday life, the layout of the vacancies looks like this:





The figure shows that most points are concentrated in the center, with small branches to the sides. With such a layout, clustering algorithms that rely on distances between points will not produce anything good.



There are many metrics (ways to define the distance between two points), and some will work well on the data you are exploring. As the distance measure I chose the Jaccard measure, extended to take the n-gram weights into account. The Jaccard measure is simple, but it works well for the problem at hand.

Example:

N-grams of vacancy 1: «in python», «sql», «docker»

N-grams of vacancy 2: «in python», «sql», «php»

Weights:

«in python» - 8

«sql» - 5

«docker» - 7

«php» - 9

Intersection (n-grams present in both vacancy 1 and vacancy 2): «in python», «sql» = 8 + 5 = 13

Union (n-grams present in either vacancy 1 or vacancy 2): «in python», «sql», «docker», «php» = 8 + 5 + 7 + 9 = 29

Distance = 1 - (intersection / union) = 1 - (13 / 29) = 0.55
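The worked example can be reproduced with a small weighted Jaccard function:

```python
# Weighted Jaccard distance reproducing the worked example above.
weights = {"in python": 8, "sql": 5, "docker": 7, "php": 9}

def jaccard_distance(a, b, w):
    intersection = sum(w[g] for g in a & b)   # n-grams in both vacancies
    union = sum(w[g] for g in a | b)          # n-grams in either vacancy
    return 1 - intersection / union

v1 = {"in python", "sql", "docker"}
v2 = {"in python", "sql", "php"}
print(round(jaccard_distance(v1, v2, weights), 2))  # 0.55
```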


The matrix of distances between all pairs of points was calculated; its size is 1595 x 1595, i.e. 1,271,215 distances between unique pairs in total. The average distance turned out to be 0.96, and for 619,659 pairs the distance is exactly 1 (no similarity at all). The following chart shows that, overall, vacancies have little in common with each other:





With the Jaccard metric, the same space now looks like this:





Four distinct areas of high density have appeared, plus two small low-density clusters. At least that's how my eyes see it!
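The 2-D embedding itself can be sketched with scikit-learn's t-SNE on a precomputed distance matrix. A random symmetric stand-in is used here instead of the real 1595 x 1595 Jaccard matrix.

```python
# 2-D embedding with t-SNE from a precomputed distance matrix.
# A random symmetric stand-in replaces the real 1595 x 1595 Jaccard matrix.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
D = rng.random((50, 50))
D = (D + D.T) / 2          # a distance matrix must be symmetric
np.fill_diagonal(D, 0.0)   # and have zero self-distances

# metric='precomputed' requires init='random' in recent scikit-learn
xy = TSNE(n_components=2, metric="precomputed", init="random",
          perplexity=10, random_state=0).fit_transform(D)
print(xy.shape)  # (50, 2)
```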



Clustering



As the clustering algorithm I chose the Gaussian Mixture Model (GMM). The algorithm receives the data as vectors, and the n_components parameter sets the number of clusters into which the set must be split. You can read about how the algorithm works here (in English). I used the ready-made implementation from the scikit-learn library: sklearn.mixture.GaussianMixture.



Note that GMM does not use a metric; it separates the data using only the set of features and their weights. In this article, the Jaccard distance is used to visualize the data, to calculate the compactness of clusters (I took the average distance between cluster points as compactness), and to determine the central point of a cluster (the typical vacancy), i.e. the point with the smallest average distance to the other points of the cluster. Many clustering algorithms do use distances between points; the Other methods section discusses metric-based kinds of clustering that also give good results.
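Fitting the model might look like this. It is a sketch with stand-in data shaped like the article's feature matrix; covariance_type is my assumption, since the article does not state which one was used.

```python
# Fitting GMM on stand-in data shaped like the article's feature matrix
# (integer weights 0..9); covariance_type is an assumption.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 30)).astype(float)

gmm = GaussianMixture(n_components=6, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(X)
print(sorted(set(labels)))  # cluster indices from the range 0..5
```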



In the previous section we estimated by eye that there would most likely be six clusters. This is how the clustering results look with n_components = 6:







In the figure showing the clusters separately, the clusters are arranged in descending order of the number of points, left to right and top to bottom: cluster 4 is the largest, cluster 5 the smallest. The compactness of each cluster is indicated in brackets.



Visually, the clustering turned out not very good, even allowing for the imperfections of the t-SNE algorithm. Analyzing the clusters was not encouraging either.



To find the optimal number of clusters n_components, we will use the AIC and BIC criteria, which you can read about here. Both criteria are built into sklearn.mixture.GaussianMixture. This is how the criteria graph looks:
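Such a sweep can be computed as follows (stand-in data again; the article swept the real 1595 x 179 matrix):

```python
# Sweeping n_components and scoring each fit with AIC and BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(300, 20)).astype(float)

scores = []
for k in range(2, 15):
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(X)
    scores.append((k, gmm.aic(X), gmm.bic(X)))

best_k = min(scores, key=lambda s: s[2])[0]  # k with the lowest BIC
print(best_k)
```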





With n_components = 12 the BIC criterion has the lowest (best) value, and the AIC criterion is also close to its minimum (which is reached at n_components = 23). Let's divide the vacancies into 12 clusters:







The clusters are now more compact, both visually and numerically. During manual analysis, the vacancies fell into groups that a human can interpret; the figure shows the cluster names. Clusters 11 and 4 are marked as Trash and Trash 2:



  1. In cluster 11, all features have approximately the same total weights.
  2. Cluster 4 is dedicated to Java. Nevertheless, there are few vacancies for the position of Java Developer in the cluster, knowledge of Java is often required as "will be an additional plus."


Clusters



After deleting two uninformative clusters numbered 11 and 4, the result is 10 clusters:





For each cluster, a table shows the features and 2-grams most often found in that cluster's vacancies.



Legend:



S - the share of vacancies in which the feature occurs, multiplied by the feature's weight

% - the percentage of vacancies in which the feature / 2-gram occurs

Typical vacancy - the vacancy with the smallest average distance to the other points of the cluster
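Both quantities in the legend can be computed directly from the distance matrix and the cluster labels. A sketch with a random stand-in matrix:

```python
# Compactness (mean pairwise distance inside a cluster) and the typical
# vacancy (the medoid: smallest mean distance to the rest of the cluster),
# computed from a precomputed distance matrix. Stand-in data.
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((30, 30))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)
labels = rng.integers(0, 3, size=30)

def cluster_stats(D, labels, c):
    idx = np.where(labels == c)[0]
    sub = D[np.ix_(idx, idx)]
    compactness = sub.sum() / (len(idx) * (len(idx) - 1))
    medoid = idx[sub.mean(axis=1).argmin()]  # index of the typical vacancy
    return compactness, medoid

print(cluster_stats(D, labels, 0))
```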



Data Analyst



Number of vacancies: 299

Typical vacancy: 35805914

No. Feature S Feature % 2-gram %
1 excel 3.13 sql 64.55 knowledge of sql 18.39
2 r 2.59 excel 34.78 in developing 14.05
3 sql 2.44 r 28.76 python r 14.05
4 knowledge of sql 1.47 bi 19.40 with big 13.38
5 data analysis 1.17 tableau 15.38 development and 13.38
6 tableau 1.08 google 14.38 data analysis 13.04
7 with big 1.07 vba 13.04 knowledge of python 12.71
8 development and 1.07 science 9.70 analytical warehouse 11.71
9 vba 1.04 dwh 6.35 development experience 11.71
10 knowledge of python 1.02 oracle 6.35 databases 11.37


C++ Developer



Number of vacancies: 139

Typical vacancy: 39955360

No. Feature S Feature % 2-gram %
1 c++ 9.00 c++ 100.00 development experience 44.60
2 java 3.30 linux 44.60 c c++ 27.34
3 linux 2.55 java 36.69 c++ python 17.99
4 c# 1.88 sql 23.02 in c++ 16.55
5 go 1.75 c# 20.86 development on 15.83
6 development on 1.27 go 19.42 data structures 15.11
7 good knowledge 1.15 unix 12.23 writing experience 14.39
8 data structures 1.06 tensorflow 11.51 programming on 13.67
9 tensorflow 1.04 bash 10.07 in developing 13.67
10 programming experience 0.98 postgresql 9.35 programming languages 12.95


Linux / DevOps Engineer



Number of vacancies: 126

Typical vacancy: 39533926

No. Feature S Feature % 2-gram %
1 ansible 5.33 linux 84.92 ci cd 58.73
2 docker 4.78 ansible 76.19 administration experience 42.06
3 bash 4.78 docker 74.60 bash python 33.33
4 ci cd 4.70 bash 68.25 tcp ip 39.37
5 linux 4.43 prometheus 58.73 customization experience 28.57
6 prometheus 4.11 zabbix 54.76 monitoring and 26.98
7 nginx 3.67 nginx 52.38 prometheus grafana 23.81
8 administration experience 3.37 grafana 52.38 monitoring systems 22.22
9 zabbix 3.29 postgresql 51.59 with docker 16.67
10 elk 3.22 kubernetes 51.59 configuration management 16.67


Python Developer



Number of vacancies: 104

Typical vacancy: 39705484

No. Feature S Feature % 2-gram %
1 in python 6.00 docker 65.38 in python 75.00
2 django 5.62 django 62.50 development on 51.92
3 flask 4.59 postgresql 58.65 development experience 43.27
4 docker 4.24 flask 50.96 django flask 04.24
5 development on 4.15 redis 38.46 rest api 23.08
6 postgresql 2.93 linux 35.58 python from 21.15
7 aiohttp 1.99 rabbitmq 33.65 databases 18.27
8 redis 1.92 sql 30.77 writing experience 18.27
9 linux 1.73 mongodb 25.00 with docker 17.31
10 rabbitmq 1.68 aiohttp 22.12 with postgresql 16.35


Data scientist



Number of vacancies: 98

Typical vacancy: 38071218

No. Feature S Feature % 2-gram %
1 pandas 7.35 pandas 81.63 machine learning 63.27
2 numpy 6.04 numpy 75.51 pandas numpy 43.88
3 machine learning 5.69 sql 62.24 data analysis 29.59
4 pytorch 3.77 pytorch 41.84 data science 26.53
5 ml 3.49 ml 38.78 knowledge of python 25.51
6 tensorflow 3.31 tensorflow 36.73 numpy scipy 24.49
7 data analysis 2.66 spark 32.65 python pandas 23.47
8 scikitlearn 2.57 scikitlearn 28.57 in python 21.43
9 data science 2.39 docker 27.55 mathematical statistics 20.41
10 spark 2.29 hadoop 27.55 algorithms of machine 20.41


Frontend Developer



Number of vacancies: 97

Typical vacancy: 39681044

No. Feature S Feature % 2-gram %
1 javascript 9.00 javascript 100.00 html css 27.84
2 django 2.60 html 42.27 development experience 25.77
3 react 2.32 postgresql 38.14 in developing 17.53
4 nodejs 2.13 docker 37.11 knowledge of javascript 15.46
5 frontend 2.13 css 37.11 and support 15.46
6 docker 2.09 linux 32.99 python and 14.43
7 postgresql 1.91 sql 31.96 css javascript 13.40
8 linux 1.79 django 28.87 databases 12.37
9 html css 1.67 react 25.77 in python 12.37
10 php 1.58 nodejs 23.71 design and 11.34


Backend Developer



Number of vacancies: 93

Typical vacancy: 40226808

No. Feature S Feature % 2-gram %
1 django 5.90 django 65.59 python django 26.88
2 js 4.74 js 52.69 development experience 25.81
3 react 2.52 postgresql 40.86 knowledge of python 20.43
4 docker 2.26 docker 35.48 in developing 18.28
5 postgresql 2.04 react 27.96 ci cd 17.20
6 understanding of principles 1.89 linux 27.96 confident knowledge 16.13
7 knowledge of python 1.63 backend 22.58 rest api 15.05
8 backend 1.58 redis 22.58 html css 13.98
9 ci cd 1.38 sql 20.43 ability to understand 10.75
10 frontend 1.35 mysql 19.35 in someone else's 10.75


DevOps Engineer



Number of vacancies: 78

Typical vacancy: 39634258

No. Feature S Feature % 2-gram %
1 devops 8.54 devops 94.87 ci cd 51.28
2 ansible 5.38 ansible 76.92 bash python 30.77
3 bash 4.76 linux 74.36 administration experience 24.36
4 jenkins 4.49 bash 67.95 and support 23.08
5 ci cd 4.10 jenkins 64.10 docker kubernetes 20.51
6 linux 3.54 docker 50.00 development and 17.95
7 docker 2.60 kubernetes 41.03 writing experience 17.95
8 java 2.08 sql 29.49 and customization 17.95
9 administration experience 1.95 oracle 25.64 development and 16.67
10 and support 1.85 openshift 24.36 scripting 14.10


Data Engineer



Number of vacancies: 77

Typical vacancy: 40008757

No. Feature S Feature % 2-gram %
1 spark 6.00 hadoop 89.61 data processing 38.96
2 hadoop 5.38 spark 85.71 big data 37.66
3 java 4.68 sql 68.83 development experience 23.38
4 hive 4.27 hive 61.04 knowledge of sql 22.08
5 scala 3.64 java 51.95 development and 19.48
6 big data 3.39 scala 51.95 hadoop spark 19.48
7 etl 3.36 etl 48.05 java scala 19.48
8 sql 2.79 airflow 44.16 data quality 18.18
9 data processing 2.73 kafka 42.86 and processing 18.18
10 kafka 2.57 oracle 35.06 hadoop hive 18.18


QA Engineer



Number of vacancies: 56

Typical vacancy: 39630489

No. Feature S Feature % 2-gram %
1 test automation 5.46 sql 46.43 test automation 60.71
2 testing experience 4.29 qa 42.86 testing experience 53.57
3 qa 3.86 linux 35.71 in python 41.07
4 in python 3.29 selenium 32.14 automation experience 35.71
5 development and 2.57 web 32.14 development and 32.14
6 sql 2.05 docker 30.36 testing experience 30.36
7 linux 2.04 jenkins 26.79 writing experience 28.57
8 selenium 1.93 backend 26.79 testing on 23.21
9 web 1.93 bash 21.43 automated testing 21.43
10 backend 1.88 ui 19.64 ci cd 21.43




Salaries



Salaries are indicated in only 261 (22%) of the 1,167 vacancies in the clusters.



When calculating salaries:



  1. If a range "from ... to ..." was specified, the average of the two values was used
  2. If only "from ..." or only "to ..." was specified, that value was taken
  3. The calculations used (or converted salaries to) after-tax (NET) values
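These rules can be sketched as a small helper (the parameter names are illustrative, and amounts are assumed to be NET already):

```python
# The salary rules above as a helper function (parameter names
# illustrative; amounts assumed to be NET already).
def salary(from_=None, to=None):
    if from_ is not None and to is not None:
        return (from_ + to) / 2                # rule 1: midpoint of the range
    return from_ if from_ is not None else to  # rule 2: the given bound

print(salary(100_000, 150_000))  # 125000.0
print(salary(from_=120_000))     # 120000
print(salary(to=90_000))         # 90000
```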


On the chart:



  1. Clusters are ranked in descending order of median salary
  2. The vertical bar in a box is the median
  3. The box is the range [Q1, Q3], where Q1 (25%) and Q3 (75%) are percentiles, i.e. 50% of salaries fall inside the box
  4. The "whiskers" cover salaries in the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], where IQR = Q3 - Q1 is the interquartile range
  5. Individual points are outliers that did not fall within the whiskers (some outliers are not shown on the diagram)




Top / Antitop skills



The charts were built over all 1994 uploaded vacancies. Salaries are indicated in 443 (22%) of them. For each feature, the vacancies containing that feature were selected and the median salary was calculated over them.
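The per-feature median can be sketched like this (the vacancy records here are made up):

```python
# Median salary per feature: select the vacancies containing the feature,
# then take the median of their salaries. The records are made up.
import statistics

vacancies = [
    {"skills": {"sql", "python"}, "salary": 150_000},
    {"skills": {"sql"}, "salary": 120_000},
    {"skills": {"docker", "python"}, "salary": 200_000},
    {"skills": {"sql"}, "salary": None},  # no salary given: skipped
]

def median_salary(feature):
    s = [v["salary"] for v in vacancies
         if feature in v["skills"] and v["salary"] is not None]
    return statistics.median(s) if s else None

print(median_salary("sql"))  # 135000.0
```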







Job classification



Clustering could have been done much more simply, without resorting to complex mathematical models: compile the top vacancy titles, divide them into groups, then analyze each group for top n-grams and average salaries. No need to select features or assign weights to them.



This approach would work reasonably well for the "Python" query. But for the query "1C Programmer" it will not, because vacancy titles for 1C programmers rarely mention the 1C configuration or the applied area. And there are many areas where 1C is used: accounting, payroll, tax calculation, cost calculation at manufacturing enterprises, warehouse accounting, budgeting, ERP systems, retail, management accounting, etc.



For myself, I see two tasks for analyzing vacancies:



  1. Understand where a programming language that I know little about is used (as in this article).
  2. Filter new posted jobs.


Clustering suits the first task; the second calls for classifiers: random forests, decision trees, neural networks. Nevertheless, I wanted to evaluate how well the chosen model suits the job classification task.



If you use the predict() method built into sklearn.mixture.GaussianMixture, nothing good happens: it assigns most vacancies to the large clusters, and two of the first three clusters are uninformative. I used a different approach:



  1. Take the vacancy we want to classify, vectorize it, and get a point in our space.
  2. Calculate the distance from this point to every cluster. As the distance between a point and a cluster, I took the average distance from the point to all points of the cluster.
  3. The cluster with the smallest distance is the predicted class for the vacancy; the distance itself indicates the confidence of the prediction.
  4. To increase the precision of the model, I chose 0.87 as a threshold distance, i.e. if the distance to the nearest cluster is greater than 0.87, the model leaves the vacancy unclassified.
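The procedure above can be sketched as follows (the distances and labels are made up; 0.87 is the article's threshold):

```python
# Nearest-cluster classification with a rejection threshold, as described
# in the steps above. Distances and labels are made up.
import numpy as np

THRESHOLD = 0.87  # the article's threshold distance

def classify(dist_to_all, labels, threshold=THRESHOLD):
    """dist_to_all[i] is the distance from the new vacancy to training point i."""
    best_cluster, best_dist = None, float("inf")
    for c in np.unique(labels):
        d = dist_to_all[labels == c].mean()  # mean distance to the cluster
        if d < best_dist:
            best_cluster, best_dist = c, d
    if best_dist > threshold:
        return None, best_dist  # N/a: too far from every cluster
    return best_cluster, best_dist

labels = np.array([0, 0, 1, 1])
print(classify(np.array([0.2, 0.4, 0.9, 0.95]), labels))    # cluster 0
print(classify(np.array([0.95, 0.99, 0.9, 0.95]), labels))  # unclassified
```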


To evaluate the model, 30 vacancies were randomly selected from the test set. In the Verdict column:

N/a: the model did not classify the vacancy (distance > 0.87)

+: correct classification

-: incorrect classification

Vacancy Nearest cluster Distance Verdict
37637989 Linux / DevOps Engineer 0.9464 N/a
37833719 C++ Developer 0.8772 N/a
38324558 Data Engineer 0.8056 +
38517047 C++ Developer 0.8652 +
39053305 Trash 0.9914 N/a
39210270 Data Engineer 0.8530 +
39349530 Frontend Developer 0.8593 +
39402677 Data Engineer 0.8396 +
39415267 C++ Developer 0.8701 N/a
39734664 Data Engineer 0.8492 +
39770444 Backend Developer 0.8960 N/a
39770752 Data scientist 0.7826 +
39795880 Data Analyst 0.9202 N/a
39947735 Python Developer 0.8657 +
39954279 Linux / DevOps Engineer 0.8398 -
40008770 DevOps Engineer 0.8634 -
40015219 C++ Developer 0.8405 +
40031023 Python Developer 0.7794 +
40072052 Data Analyst 0.9302 N/a
40112637 Linux / DevOps Engineer 0.8285 +
40164815 Data Engineer 0.8019 +
40186145 Python Developer 0.7865 +
40201231 Data scientist 0.7589 +
40211477 DevOps Engineer 0.8680 +
40224552 Data scientist 0.9473 N/a
40230011 Linux / DevOps Engineer 0.9298 N/a
40241704 Trash 2 0.9093 N/a
40245997 Data Analyst 0.9800 N/a
40246898 Data scientist 0.9584 N/a
40267920 Frontend Developer 0.8664 +


Total: 12 vacancies were left unclassified, 2 were misclassified, and 16 were classified correctly. The model's recall is 60%, its precision 89%.



Weak sides



The first problem: let's take two vacancies.

Vacancy 1 - "Lead C++ Programmer"

"Requirements:

  • 5+ years of C++ development experience.
  • Knowledge of Python will be an additional plus."

Vacancy 2 - "Lead Python Programmer"

"Requirements:

  • 5+ years of Python development experience.
  • Knowledge of C++ will be an additional plus."

From the point of view of the model, these two vacancies are identical. I tried adjusting the feature weights according to the order in which they occur in the text, but this did not lead to anything good.



The second problem: like many clustering algorithms, GMM assigns every point of the set to some cluster. Uninformative clusters are not a problem in themselves, but informative clusters also pick up outliers. This is easily solved by cleaning the clusters, for example by removing the most atypical points, those with the greatest average distance to the rest of the cluster's points.



Other methods



The clustering algorithm comparison page demonstrates the various clustering algorithms well. GMM was the only one that gave good results on my data; the rest either did not work or gave very modest results.



Among the algorithms I implemented myself, two approaches gave good results:



  1. Points with high density in some neighborhood, located far away from each other, were selected and became cluster centers. Clusters were then grown from these centers by joining neighboring points.
  2. Agglomerative clustering, i.e. iterative merging of points and clusters. The scikit-learn library includes this kind of clustering, but it did not work well for me. In my implementation, I recomputed the linkage matrix after each merge iteration, and the process stopped when certain boundary parameters were reached; in practice, dendrograms do not help to understand the merging process when 1500 elements are being clustered.


Conclusion



The research gave me answers to all the questions posed at the beginning of the article. I also gained hands-on experience with clustering by implementing variations of known algorithms. I really hope the article motivates readers to carry out their own analytical research and helps in this exciting pursuit.


