Analyzing networks using graphs



Social network analysis is the process of exploring various systems using network theory. It began to be widely used just when it became clear that a huge number of existing networks (social, economic, biological) have universal properties: having studied one type, you can understand the structure of any other networks and learn how to make predictions from them.



Any network consists of individual participants (people or things in the network) and the relationships between them. Networks are very often visualized using graphs - structures made up of many points and lines that represent the connections between these points. Participants are represented as nodes of the network, and their relationships are represented as lines connecting them. Such visualization helps to get a qualitative and quantitative assessment of networks:





Figure: 1. Directional graph depicting money circulation between banks that form the foreign exchange market (1). EU banks are marked in red, North America - in blue, and other countries - in green.





Figure: 2. Graph showing the cooperation of audit partners in Denmark in 2010-2014 (2)



Using the analysis of social networks, it is possible to analyze a variety of interactions and exchange processes of resources, both material and informational. For example, by analyzing the network of transactions between bank clients (where the nodes are the bank's clients, and the edges are transfers between them), it is possible to determine the circle of persons involved in fraudulent transactions, or to identify violations of internal regulations by bank employees.



It is also possible to build a network of working relationships (for example, various types of communication between employees), which can help in understanding the social structure of the organization and the position of each employee in this structure. Given the data on the type of communication for each employee, one can even analyze how such characteristics as leadership, mentoring and cooperation affect his career, and, based on the knowledge gained, determine career goals and propose training programs to achieve them.



In addition, events can be predicted using the example of networks. For example, there are models that estimate the likelihood of a software failure, some of them consider people as a source of predictions - after all, it is people who develop and test products before release. Their interactions form a network: you can think of developers as nodes, and whether they worked together on the same file in the same release, as edges of a network. Understanding the interactions and information about previous failures will tell a lot about the reliability of the final product and point out the files that are most likely to fail.



Now let's try to apply the analysis of social networks in practice. To do this, we will use the Python programming language, or rather the networkx library for working with graphs, the matplotlib library for visualization, and the community library for highlighting communities within the network. Let's import them:



1.	import community
2.	import networkx as nx
3.	import matplotlib.cm as cm
4.	import matplotlib.pyplot as plt


1. Importing data and transforming it into a graph



As a dataset, let's take an email correspondence from a large European university, which contains anonymous information about all incoming and outgoing email messages between members of a research institution (link). The dataset contains a txt file where each line lists pairs of nodes that are related to each other.



1.	G = nx.read_edgelist('email-Eu-core.txt', create_using=nx.DiGraph())
2.	print(' : {}'.format(G.number_of_nodes()))
3.	print(' : {}'.format(G. number_of_edges()))
4.	print('       : {}'.format(round(G.number_of_edges() / float(G.number_of_nodes()), 4)))

 : 1005
 : 25571
      : 25.4438


In the above code, the dataset has been imported and converted into a graph. Then we sequentially deduced the main parameters of the graph: the number of nodes, edges and the average number of neighbors of the nodes in the graph. The last parameter reflects how closely the nodes are connected to each other.



2. Main characteristics of graphs



In order to understand how you can work with each specific graph, you first need to understand how it works. Let's take a quick look at the characteristics by which you can understand the structure of a graph.



First of all, consider the concepts of connectivity and directionality. A graph is called connected if each pair of nodes in the graph is interconnected, i.e. from any node you can come to any other node. If the graph is disconnected, then it can be divided into maximally connected subgraphs (called components). Also, graphs can be directed and undirected. This is determined by the presence of the orientation of the relationship between the two participants. One example of a directed network is transactions between bank customers, where each customer can have both incoming and outgoing payments.



In the general case, in directed graphs, the connections are not reciprocal; therefore, for directed graphs, instead of the concept of connectivity, the concepts of components of weak and strong connectivity are distinguished. A component is considered weakly connected if ignoring the direction results in a connected graph. A component of strong connectivity can be such if all its vertices are mutually reachable. Let's take a look at the structure of the graph from our email dataset:



1.	if nx.is_directed(G):
2.	        if nx.is_weakly_connected(G):
3.	                print('         .')
4.	        else:
5.	                print('       .')
6.	else:
7.	        if nx.is_connected(G):
8.	                print('    .')
9.	        else:
10.	                print('       .')


The graph is directional and consists of several components.



Here we checked the directionality and connectivity of the graph and found out that the graph from the dataset is directed and contains several disconnected components. Let's try to take a closer look at the largest components of strong and weak connectivity:



1.	G_weak = G.subgraph(max(nx.weakly_connected_components(G), key=len))
2.	print(' : {}'.format(G_weak.number_of_nodes()))
3.	print(' : {}'.format(G_weak.number_of_edges()))
4.	print('      : {}'.format(round(G_weak.number_of_edges() / float(G_weak.number_of_nodes()), 4)))
5.	
6.	G_strong = G.subgraph(max(nx.strongly_connected_components(G), key=len))
7.	print(' : {}'.format(G_strong.number_of_nodes()))
8.	print(' : {}'.format(G_strong.number_of_edges()))
9.	print('      : {}'.format(round(G_strong.number_of_edges() / float(G_strong.number_of_nodes()), 4)))


 : 986
 : 25552
      : 25.9148

 : 803
 : 24729
      : 30.7958


So, we have obtained the main characteristics for the weakly connected component and for the strong connected component included in it. Let's see what conclusions we can draw at this stage. We see that out of 1005 participants, 986 people communicate with each other, while 183 people of them sent emails to other people unilaterally, and only 803 people maintained two-way communication. In 823 cases, the attempt to establish communication via e-mail failed. We also see that the most active participants (included in the strong connectedness component) communicate with an average of 30 people.



Let's look at other key characteristics of graphs that define their structure. Graphs are considered weighted if the relations between the nodes reflect not only the very existence of the connection, but also a certain weight, reflecting the strength of this connection. Our dataset with emails is not weighted, since it only takes into account the fact of correspondence, but not the number of emails sent.



In addition, nodes and links can create different types of networks: monocotyledonous, dicotyledonous, or multi-level. Monocotyledonous networks consist of one type of participants and the connections between them. Bipartite networks consist of two different types of participants, where one type of node is associated only with the other type. Multilevel networks also include two types of participants, but links can connect both different types of participants and the same type of participants (for example, relationships between managers and relationships between projects). The dataset we have taken for research is a monocotyledonous network, since it consists of only one type of participants and the connections between them.



3. Graph visualization



Now let's try to visualize the dataset we have taken. For this we need the matplotlib library, which we already imported above:



1.	plt.figure(figsize=(15, 15))
2.	plt.title('E-mails')
3.	nx.draw(G, node_size=5)
4.	plt.show()


The first line sets the size of the future image, which is then assigned a specific name. The third line of the draw function passes a graph and specifies the size of the nodes, after which the graph is displayed on the screen. Let's take a look at it:





Fig. 3. Graph of user interactions with information about all incoming and outgoing emails between members of the research institution.



On the resulting graph, we see that there are a number of points that are not connected with the rest of the communication participants. These people, not connected with the rest of the participants, got on the graph, since they sent letters exclusively to themselves. You can also notice that at the periphery there are a number of points that are connected with the rest of the graph by a few incoming connections - these are participants that fell into the weakly connected component for our graph, but did not fall into the strong connected component.



Let's also look at a graph that illustrates the strong connectedness component - people who have two-way communication with other members of the research institution:





Fig. 4. Component of strong connectivity with information about all incoming and outgoing emails between members of the research institution...



4. Degree of knot and distribution of degrees of knots



Now that we know the structure of our graph and are able to visualize it, let's move on to a more detailed analysis. Each node in the graph has a degree - the number of nearest neighbors of that node. In directed networks, one can distinguish both the degree of entry (the number of incoming connections to the node) and the degree of exit (the number of outgoing connections from the node). If we calculate the degree for each node of the graph, we can determine the distribution of the degrees of the nodes. Let's take a look at it for the email graph:



1.	degree = dict(G.degree())
2.	degree_values = sorted(set(degree.values()))
3.	hist = [list(degree.values()).count(x) for x in degree_values]
4.	plt.figure(figsize=(10, 10))
5.	plt.plot(degree_values, hist, 'ro-')
6.	plt.legend(['Degree'])
7.	plt.xlabel('Degree')
8.	plt.ylabel('Number of nodes')
9.	plt.show()




Figure: 5. Distribution of degrees in a graph with information about all incoming and outgoing e-mails between members of the research institution



On the resulting graph we see the distribution of degrees of nodes: a large number of nodes have few connections with neighbors, but there are a small number of large nodes that have a number of connections with others participants is huge. This trend is called a power law of distribution, and it is very typical for large networks. This law can describe the distribution of the population of different cities, the ranking of sites on the Internet and even the distribution of wealth among people.



5. Path, diameter and average distance in the graph



Now let's determine how connected the members of our graph are. First, let's talk about the different types of node spacing.



Any sequence of edges that connects nodes is called a path. Most often, research considers a simple path, that is, a path without cycles and repeating nodes. The shortest path between two nodes is called the geodetic distance. The longest shortest path in a graph is called its diameter, but it is very sensitive to deviations (one chain in a multimillion-dollar graph can change its diameter). In directed graphs, the concept of diameter can be used only for the component of strong connectivity, because to calculate the diameter, it is necessary that for each pair of nodes there is a path between them. In undirected graphs, it is sufficient that the component under consideration is connected.



Another very informative characteristic is the average distance between nodes, which can be obtained by taking the average of all the shortest paths in the graph. The average distance is determined by the structure of the graph: if the graph is built in the form of a chain, it will be large, but the closer the nodes are connected, the smaller the average distance becomes. The average distance can be calculated for both the strong connected component and the weakly connected component:



1.	print (': ', nx.diameter(G_strong))
2.	print ('     : ', nx.average_shortest_path_length(G_strong))
3.	print ('     : ', nx.average_shortest_path_length(G_weak))

: 6
     :  2.5474824768713336
     :  2.164486568301397


Let's take a look at the results. The diameter in this case shows us the maximum distance between two strangers, and here, as in the well-known theory of six handshakes, this distance is 6. The average distance in the components is also small, on average 2 “handshakes” are enough for two strangers. An interesting phenomenon can also be seen here: the average distance in the strong connected component is slightly lower than in the weakly connected component. This can be explained by the fact that the direction of the bonds is not taken into account for the weakly connected component (only the very fact of its presence). Because of this, the connection in the weak component appears where it was absent for the strong component.



6. Clustering and allocation of communities



Having figured out the distances between the participants, let's move on to other phenomena that reflect how the participants in the graph are connected to each other: clustering and communities.



The cluster coefficient shows that two elements of the graph, connected through the third element, with a high degree of probability are connected with each other. Even if they are not linked, the likelihood that they will be linked in the future is much higher than the other two randomly taken nodes. This phenomenon, called clustering or transitivity, is extremely common in social graphs.



Graphs with a high degree of clustering are characterized by the presence of a significant number of connected triplets (three nodes connected to each other). They are the building block of many social networks, where the number of such triangles is very large. Often not even triangles arise, but entire cluster formations, called communities, which are more closely related to each other than to the rest of the graph.



Let's look at the clustering and the cluster coefficient for the weakly connected component:



1.	print (': ', nx.transitivity(G_strong))
2.	print (' : ', nx.average_clustering(G_strong))


:  0.2201493109315837
 :  0.37270757578876434


For a component of strong connectivity, we can obtain the same characteristics, and also determine the number of central nodes (nodes that are most closely related to the rest) and the number of nodes at the periphery of the graph:



1.	print (': ', nx.transitivity(G_strong))
2.	print (' : ', nx.average_clustering(G_strong))
3.	print ('  : ', len(nx.center(G_strong)))
4.	print ('   : ', len(nx.periphery(G_strong)))


:  0.2328022090200813
 :  0.3905903756516427
  : 46
   : 3


In the second case, the clustering and clustering coefficient increased, reflecting that the strong connected component contains a larger number of related triplets. Let's try together to define the main communities in the loosely coupled component:



1.	G_undirected = G_weak.to_undirected()
2.	partition = community.best_partition(G_undirected)
3.	communities = set(partition.values())
4.	communities_dict = {c: [k for k, v in partition.items() if v == c] for c in communities}
5.	highest_degree = {k: sorted(v, key=lambda x: G.degree(x))[-5:] for k, v in communities_dict.items()}
6.	print(' : ', len(highest_degree))
7.	print('    :', ', '.join([str(len(highest_degree[key])) for key in highest_degree]))


 : 8
    : 113, 114, 125, 131, 169, 188, 54, 92


So, in the column with e-mails, 8 communities can be distinguished, which are more closely related to each other than to the rest of the column. The smallest of the communities has 54 members, and the largest has 188 members. For networks with overlapping or nested communities, determining the optimal partitioning can be more difficult. Therefore, each time you run the code, the composition of the communities may differ. Let's see the visualization for the split we got:



1.	pos = nx.spring_layout(G)
2.	plt.figure(figsize=(15, 15))
3.	cmap = cm.get_cmap('gist_rainbow', max(partition.values()) + 1)
4.	nx.draw_networkx_nodes(G, pos, partition.keys(), node_size=20, cmap=cmap, node_color=list(partition.values()))
5.	nx.draw_networkx_edges(G, pos, alpha=0.5)
6.	plt.show()




Figure: 6. Display of various communities in a loosely coupled component with information about all incoming and outgoing e-mails between members of the research institution.



In the graph above, we see the distribution of participants by community, where the color of the nodes describes the belonging to a particular community.



7. Reciprocity of ties



In addition to the properties already described, there is also such a thing as reciprocity in a directed network. This characteristic describes what percentage of outgoing links have a feedback, incoming link. In order to find out, we use the special function overall_reciprocity of the networkx library and look at the level of reciprocity in the graph and its components:



1.	print ('  : ', nx.overall_reciprocity(G))
2.	print ('    : ', nx.overall_reciprocity(G_weak))
3.	print ('    : ', nx.overall_reciprocity(G_strong))

  :  0.6933635759258535
    :  0.6938791484032562
    :  0.7169719762222492


In 71% of the cases, users received replies to their messages in the strong connectivity component. For the weakly connected component and the entire graph as a whole, the level is predictably lower.



8. Universal properties of networks



To summarize, complex networks, in general, have certain properties and some are characteristic of many networks. Our analysis of the email dataset supports these trends well:



  1. . , , . : web-, , , (Wikipedia, Microsoft). , .
  2. . , , « ». : , .
  3. , , : 80% , . – , , . , , .
  4. , . , .


We analyzed and confirmed all the trends listed above for a dataset with email correspondence: a large connected component was found in the graph, containing more than 80% of all nodes. Within this component, most of the nodes were characterized by a small number, however, there was a small percentage of participants who had a huge number of neighbors. We also saw that the diameter and average distance between the participants in the graph is small: the average distance in the weakly connected component, containing 986 participants, was only 2.1, which means that most of the participants are connected to each other after just two handshakes. In addition, the graph is characterized by a high degree of reciprocity: 69% of all participants maintained two-way contact with each other.



Notes



The full text of the study can be found in the book by Nikolai Viktorovich Ursul "The Whole Truth About Forex" (2019).



The research is described in the article "The Application of Social Network Analysis to Accounting and Auditing" Slobodan Kacanski, Dean Lusher (2017, link ).



All Articles