Graph ML: Similarity Methods & Applications

Community-Based Methods

As with index-based methods, the algorithms belonging to this family also compute an index representing the probability of the disconnected nodes being connected.

The main difference between index-based and community-based methods is related to the logic behind them.

Indeed, community-based methods, before generating the index, need to compute information about the community belonging to those nodes.

Community Common Neighbor

To estimate the probability of two nodes being connected, this algorithm computes the number of common neighbors and adds to this value the number of common neighbors belonging to the same community.

Formally, for two nodes 𝑣 and 𝑢, the community common neighbor value is computed as follows:

Press enter or click to view image in full size

In this formula, 𝑓(𝑤) = 1 if 𝑤𝑤 belongs to the same community as 𝑢 and 𝑣; otherwise, this is 0. The function can be computed in NetworkX using this code.

import networkx as nxedges = [[1,3],[2,3],[2,4],[4,5],[5,6],[5,7]]
G = nx.from_edgelist(edges)
G.nodes[1]["community"] = 0
G.nodes[2]["community"] = 0
G.nodes[3]["community"] = 0
G.nodes[4]["community"] = 1
G.nodes[5]["community"] = 1
G.nodes[6]["community"] = 1
G.nodes[7]["community"] = 1
preds = nx.cn_soundarajan_hopcroft(G,[(1,2),(2,5),(3,4)])

From the preceding code snippet, it is possible to see how we need to assign the community property to each node of the graph.

This property is used to identify nodes belonging to the same community when computing the function 𝑓(w) defined in the previous equation.

The community value can also be automatically computed using specific algorithms.

As the example code above shows, the cn_soundarajan_hopcroft function takes the input graph and a couple of nodes for which we want to compute the score. As a result, we get the following result.

Press enter or click to view image in full size

The result indicates the pairs of nodes and their corresponding scores, where the score reflects the number of shared neighbors between the nodes, adjusted by their community membership.

The main difference from the previous function is in the index value. Indeed, we can easily see that the result is not in the range (0,1).

Community Resource Allocation

As with the previous method, the community resource allocation algorithm merges information obtained from the neighbors of the nodes with the community, as shown in the following formula.

Press enter or click to view image in full size

Here, 𝑓(𝑤) = 1 if 𝑤𝑤 belongs to the same community as 𝑢 and 𝑣; otherwise, this is 0.

The function can be computed in networkx using this code.

import networkx as nxedges = [[1,3],[2,3],[2,4],[4,5],[5,6],[5,7]]
G = nx.from_edgelist(edges)
G.nodes[1]["community"] = 0
G.nodes[2]["community"] = 0
G.nodes[3]["community"] = 0
G.nodes[4]["community"] = 1
G.nodes[5]["community"] = 1
G.nodes[6]["community"] = 1
G.nodes[7]["community"] = 1
preds = nx. ra_index_soundarajan_hopcroft(G,[(1,2),(2,5),(3,4)])

From the preceding code snippet, it is possible to see how we need to assign the community property to each node of the graph.

This property is used to identify nodes belonging to the same community when computing the function 𝑓(w) defined in the previous equation.

The community value can also be automatically computed using specific algorithms.

As we saw, the ra_index_soundarajan_hopcroft function takes the input graph and a couple of nodes for which we want to compute the score. As a result, we get the result.

Press enter or click to view image in full size

From the preceding output, it is possible to see the influence of the community in the computation of the index.

Since nodes 1 and 2 belong to the same community, they have a higher value in the index.

On the contrary, edges (2,5) and (3,4) feature the value 0, since they belong to different communities from each other.

In networkx, two other methods to compute the probability of a connection between two nodes based on their similarity score merged with community information are nx.within_inter_cluster and nx.common_neighbor_centrality.

Community-based methods, therefore, are best suited for applications where structural information is crucial, such as criminal network analysis and protein-protein interaction networks, since they leverage community structure to improve accuracy.

Furthermore, they are effective in detecting hidden connections in sparse networks, and they are more computationally expensive than index-based methods and may require community detection as a preprocessing step.