- """
- Graph summarization finds smaller representations of graphs resulting in faster
- runtime of algorithms, reduced storage needs, and noise reduction.
- Summarization has applications in areas such as visualization, pattern mining,
- clustering and community detection, and more. Core graph summarization
- techniques are grouping/aggregation, bit-compression,
- simplification/sparsification, and influence based. Graph summarization
- algorithms often produce either summary graphs in the form of supergraphs or
- sparsified graphs, or a list of independent structures. Supergraphs are the
- most common product, which consist of supernodes and original nodes and are
- connected by edges and superedges, which represent aggregate edges between
- nodes and supernodes.
- Grouping/aggregation based techniques compress graphs by representing
- close/connected nodes and edges in a graph by a single node/edge in a
- supergraph. Nodes can be grouped together into supernodes based on their
- structural similarities or proximity within a graph to reduce the total number
- of nodes in a graph. Edge-grouping techniques group edges into lossy/lossless
- nodes called compressor or virtual nodes to reduce the total number of edges in
- a graph. Edge-grouping techniques can be lossless, meaning that they can be
- used to re-create the original graph, or techniques can be lossy, requiring
- less space to store the summary graph, but at the expense of lower
- reconstruction accuracy of the original graph.
- Bit-compression techniques minimize the amount of information needed to
- describe the original graph, while revealing structural patterns in the
- original graph. The two-part minimum description length (MDL) is often used to
- represent the model and the original graph in terms of the model. A key
- difference between graph compression and graph summarization is that graph
- summarization focuses on finding structural patterns within the original graph,
- whereas graph compression focuses on compressions the original graph to be as
- small as possible. **NOTE**: Some bit-compression methods exist solely to
- compress a graph without creating a summary graph or finding comprehensible
- structural patterns.
- Simplification/Sparsification techniques attempt to create a sparse
- representation of a graph by removing unimportant nodes and edges from the
- graph. Sparsified graphs differ from supergraphs created by
- grouping/aggregation by only containing a subset of the original nodes and
- edges of the original graph.
- Influence based techniques aim to find a high-level description of influence
- propagation in a large graph. These methods are scarce and have been mostly
- applied to social graphs.
- *dedensification* is a grouping/aggregation based technique to compress the
- neighborhoods around high-degree nodes in unweighted graphs by adding
- compressor nodes that summarize multiple edges of the same type to
- high-degree nodes (nodes with a degree greater than a given threshold).
- Dedensification was developed for the purpose of increasing performance of
- query processing around high-degree nodes in graph databases and enables direct
- operations on the compressed graph. The structural patterns surrounding
- high-degree nodes in the original is preserved while using fewer edges and
- adding a small number of compressor nodes. The degree of nodes present in the
- original graph is also preserved. The current implementation of dedensification
- supports graphs with one edge type.
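
For example, dedensification routes the edges from a set of nodes to their
shared high-degree neighbors through a single compressor node (a small,
illustrative sketch)::

    >>> import networkx as nx
    >>> G = nx.DiGraph()
    >>> G.add_edges_from((u, hub) for u in range(5) for hub in ("h1", "h2"))
    >>> H, compressors = nx.dedensify(G, threshold=2)
    >>> G.number_of_edges(), H.number_of_edges()
    (10, 7)
    >>> len(compressors)
    1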

For more information on graph summarization, see `Graph Summarization Methods
and Applications: A Survey <https://dl.acm.org/doi/abs/10.1145/3186727>`_
"""
from collections import Counter, defaultdict

import networkx as nx

__all__ = ["dedensify", "snap_aggregation"]


def dedensify(G, threshold, prefix=None, copy=True):
    """Compresses neighborhoods around high-degree nodes.

    Reduces the number of edges to high-degree nodes (nodes with a degree
    greater than a given threshold) by adding compressor nodes that summarize
    multiple edges of the same type to those nodes.
    The implementation currently supports graphs with a single edge type.

    Parameters
    ----------
    G: graph
        A networkx graph
    threshold: int
        Degree threshold for a node to be considered a high-degree node; nodes
        with a degree greater than this value are treated as high-degree nodes.
        The threshold must be greater than or equal to 2.
    prefix: str or None, optional (default: None)
        An optional prefix for denoting compressor nodes
    copy: bool, optional (default: True)
        If True, dedensification is performed on a copy of ``G``; otherwise
        ``G`` is modified in place.

    Returns
    -------
    dedensified networkx graph : (graph, set)
        2-tuple of the dedensified graph and set of compressor nodes

    Notes
    -----
    Following the algorithm in [1]_, edges are removed from the graph by
    compressing the neighborhoods around high-degree nodes: compressor nodes
    are added that summarize multiple edges of the same type to high-degree
    nodes. A compressor node is only added when doing so reduces the total
    number of edges in the given graph. This implementation currently supports
    graphs with a single edge type.

    Examples
    --------
    Dedensification will only add compressor nodes when doing so would result
    in fewer edges::

        >>> original_graph = nx.DiGraph()
        >>> original_graph.add_nodes_from(
        ...     ["1", "2", "3", "4", "5", "6", "A", "B", "C"]
        ... )
        >>> original_graph.add_edges_from(
        ...     [
        ...         ("1", "C"), ("1", "B"),
        ...         ("2", "C"), ("2", "B"), ("2", "A"),
        ...         ("3", "B"), ("3", "A"), ("3", "6"),
        ...         ("4", "C"), ("4", "B"), ("4", "A"),
        ...         ("5", "B"), ("5", "A"),
        ...         ("6", "5"),
        ...         ("A", "6"),
        ...     ]
        ... )
        >>> c_graph, c_nodes = nx.dedensify(original_graph, threshold=2)
        >>> original_graph.number_of_edges()
        15
        >>> c_graph.number_of_edges()
        14

    A dedensified, directed graph can be "densified" to reconstruct the
    original graph::

        >>> original_graph = nx.DiGraph()
        >>> original_graph.add_nodes_from(
        ...     ["1", "2", "3", "4", "5", "6", "A", "B", "C"]
        ... )
        >>> original_graph.add_edges_from(
        ...     [
        ...         ("1", "C"), ("1", "B"),
        ...         ("2", "C"), ("2", "B"), ("2", "A"),
        ...         ("3", "B"), ("3", "A"), ("3", "6"),
        ...         ("4", "C"), ("4", "B"), ("4", "A"),
        ...         ("5", "B"), ("5", "A"),
        ...         ("6", "5"),
        ...         ("A", "6"),
        ...     ]
        ... )
        >>> c_graph, c_nodes = nx.dedensify(original_graph, threshold=2)
        >>> # re-densifies the compressed graph into the original graph
        >>> for c_node in c_nodes:
        ...     all_neighbors = set(nx.all_neighbors(c_graph, c_node))
        ...     out_neighbors = set(c_graph.neighbors(c_node))
        ...     for out_neighbor in out_neighbors:
        ...         c_graph.remove_edge(c_node, out_neighbor)
        ...     in_neighbors = all_neighbors - out_neighbors
        ...     for in_neighbor in in_neighbors:
        ...         c_graph.remove_edge(in_neighbor, c_node)
        ...         for out_neighbor in out_neighbors:
        ...             c_graph.add_edge(in_neighbor, out_neighbor)
        ...     c_graph.remove_node(c_node)
        ...
        >>> nx.is_isomorphic(original_graph, c_graph)
        True

    References
    ----------
    .. [1] Maccioni, A., & Abadi, D. J. (2016, August).
       Scalable pattern matching over compressed graphs via dedensification.
       In Proceedings of the 22nd ACM SIGKDD International Conference on
       Knowledge Discovery and Data Mining (pp. 1755-1764).
       http://www.cs.umd.edu/~abadi/papers/graph-dedense.pdf
    """
    if threshold < 2:
        raise nx.NetworkXError("The degree threshold must be >= 2")

    degrees = G.in_degree if G.is_directed() else G.degree
    # Group nodes based on degree threshold
    high_degree_nodes = {n for n, d in degrees if d > threshold}
    low_degree_nodes = G.nodes() - high_degree_nodes
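
    # Map each distinct set of high-degree neighbors to the nodes that share
    # exactly that set of neighbors; every entry is a candidate for one
    # compressor node.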
    auxiliary = {}
    for node in G:
        high_degree_neighbors = frozenset(high_degree_nodes & set(G[node]))
        if high_degree_neighbors:
            if high_degree_neighbors in auxiliary:
                auxiliary[high_degree_neighbors].add(node)
            else:
                auxiliary[high_degree_neighbors] = {node}

    if copy:
        G = G.copy()

    compressor_nodes = set()
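    # A compressor node is only added when it saves edges: |H| * |L| direct
    # edges are replaced by |L| + |H| edges routed through the compressor, so
    # compression requires |H| * |L| > |H| + |L|.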
    for index, (high_degree_nodes, low_degree_nodes) in enumerate(auxiliary.items()):
        low_degree_node_count = len(low_degree_nodes)
        high_degree_node_count = len(high_degree_nodes)
        old_edges = high_degree_node_count * low_degree_node_count
        new_edges = high_degree_node_count + low_degree_node_count
        if old_edges <= new_edges:
            continue
        compression_node = "".join(str(node) for node in high_degree_nodes)
        if prefix:
            compression_node = str(prefix) + compression_node
        for node in low_degree_nodes:
            for high_node in high_degree_nodes:
                if G.has_edge(node, high_node):
                    G.remove_edge(node, high_node)

            G.add_edge(node, compression_node)
        for node in high_degree_nodes:
            G.add_edge(compression_node, node)
        compressor_nodes.add(compression_node)
    return G, compressor_nodes


def _snap_build_graph(
    G,
    groups,
    node_attributes,
    edge_attributes,
    neighbor_info,
    edge_types,
    prefix,
    supernode_attribute,
    superedge_attribute,
):
    """
    Build the summary graph from the data structures produced by the SNAP
    aggregation algorithm.

    Used in the SNAP aggregation algorithm to build the output summary graph
    and supernode lookup dictionary. The original graph and the data structures
    are used to create the supernodes with the correct node attributes, and the
    superedges with the correct edge attributes.

    Parameters
    ----------
    G: networkx.Graph
        the original graph to be summarized
    groups: dict
        A dictionary of unique group IDs and their corresponding node groups
    node_attributes: iterable
        An iterable of the node attributes considered in the summarization process
    edge_attributes: iterable
        An iterable of the edge attributes considered in the summarization process
    neighbor_info: dict
        A data structure indicating the number of edges a node has with the
        groups in the current summarization of each edge type
    edge_types: dict
        dictionary of edges in the graph and their corresponding attributes
        recognized in the summarization
    prefix: string
        The prefix to be added to all supernodes
    supernode_attribute: str
        The node attribute for recording the supernode groupings of nodes
    superedge_attribute: str
        The edge attribute for recording the edge types represented by superedges

    Returns
    -------
    summary graph: networkx.Graph
    """
    output = G.__class__()
    node_label_lookup = {}
    for index, group_id in enumerate(groups):
        group_set = groups[group_id]
        supernode = f"{prefix}{index}"
        node_label_lookup[group_id] = supernode
        supernode_attributes = {
            attr: G.nodes[next(iter(group_set))][attr] for attr in node_attributes
        }
        supernode_attributes[supernode_attribute] = group_set
        output.add_node(supernode, **supernode_attributes)
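
    # After splitting, every node in a group has the same group-level
    # connectivity, so a single representative node per group is enough to
    # derive that group's superedges.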
    for group_id in groups:
        group_set = groups[group_id]
        source_supernode = node_label_lookup[group_id]
        for other_group, group_edge_types in neighbor_info[
            next(iter(group_set))
        ].items():
            if group_edge_types:
                target_supernode = node_label_lookup[other_group]
                summary_graph_edge = (source_supernode, target_supernode)

                edge_types = [
                    dict(zip(edge_attributes, edge_type))
                    for edge_type in group_edge_types
                ]

                has_edge = output.has_edge(*summary_graph_edge)
                if output.is_multigraph():
                    # one parallel superedge per edge type, skipping types
                    # already added from the other endpoint's group
                    if not has_edge:
                        for edge_type in edge_types:
                            output.add_edge(*summary_graph_edge, **edge_type)
                    elif not output.is_directed():
                        existing_edge_data = output.get_edge_data(*summary_graph_edge)
                        for edge_type in edge_types:
                            if edge_type not in existing_edge_data.values():
                                output.add_edge(*summary_graph_edge, **edge_type)
                else:
                    superedge_attributes = {superedge_attribute: edge_types}
                    output.add_edge(*summary_graph_edge, **superedge_attributes)

    return output


def _snap_eligible_group(G, groups, group_lookup, edge_types):
    """
    Find a group that is eligible to be split.

    A group is eligible to be split if its nodes do not all have the same
    type(s) of edges to the same other groups.

    Parameters
    ----------
    G: graph
        graph to be summarized
    groups: dict
        A dictionary of unique group IDs and their corresponding node groups
    group_lookup: dict
        dictionary of nodes and their current corresponding group ID
    edge_types: dict
        dictionary of edges in the graph and their corresponding attributes
        recognized in the summarization

    Returns
    -------
    tuple
        ID of the group to split (or None if no group needs splitting), and
        the neighbor_info data structure of per-node edge-type counts toward
        each group
    """
    neighbor_info = {node: {gid: Counter() for gid in groups} for node in group_lookup}
    for group_id in groups:
        current_group = groups[group_id]

        # build neighbor_info for nodes in group
        for node in current_group:
            neighbor_info[node] = {group_id: Counter() for group_id in groups}
            edges = G.edges(node, keys=True) if G.is_multigraph() else G.edges(node)
            for edge in edges:
                neighbor = edge[1]
                edge_type = edge_types[edge]
                neighbor_group_id = group_lookup[neighbor]
                neighbor_info[node][neighbor_group_id][edge_type] += 1
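
        # A group is homogeneous when, for every other group, each edge type
        # seen from this group is used by either all of its member nodes or
        # none of them; such a group does not need to be split.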
        # check if group_id is eligible to be split
        group_size = len(current_group)
        for other_group_id in groups:
            edge_counts = Counter()
            for node in current_group:
                edge_counts.update(neighbor_info[node][other_group_id].keys())

            if not all(count == group_size for count in edge_counts.values()):
                # only the neighbor_info of the returned group_id is required
                # for handling group splits
                return group_id, neighbor_info

    # if no eligible groups, complete neighbor_info is calculated
    return None, neighbor_info


def _snap_split(groups, neighbor_info, group_lookup, group_id):
    """
    Splits a group based on edge types and updates the groups accordingly.

    Splits the group with the given group_id based on the edge types of its
    nodes, so that the nodes in each new group have the same types of edges to
    the same other groups.

    Parameters
    ----------
    groups: dict
        A dictionary of unique group IDs and their corresponding node groups
    neighbor_info: dict
        A data structure indicating the number of edges a node has with the
        groups in the current summarization of each edge type
    group_lookup: dict
        dictionary of nodes and their current corresponding group ID
    group_id: object
        ID of group to be split

    Returns
    -------
    dict
        The updated groups based on the split
    """
    new_group_mappings = defaultdict(set)
    for node in groups[group_id]:
        signature = tuple(
            frozenset(edge_types) for edge_types in neighbor_info[node].values()
        )
        new_group_mappings[signature].add(node)

    # leave the biggest new_group as the original group
    new_groups = sorted(new_group_mappings.values(), key=len)
    for new_group in new_groups[:-1]:
        # Assign an unused integer as the new_group_id
        # (the original group IDs are tuples, so there is no collision)
        new_group_id = len(groups)
        groups[new_group_id] = new_group
        groups[group_id] -= new_group
        for node in new_group:
            group_lookup[node] = new_group_id

    return groups


def snap_aggregation(
    G,
    node_attributes,
    edge_attributes=(),
    prefix="Supernode-",
    supernode_attribute="group",
    superedge_attribute="types",
):
    """Creates a summary graph based on attributes and connectivity.

    This function uses the Summarization by Grouping Nodes on Attributes
    and Pairwise edges (SNAP) algorithm for summarizing a given
    graph by grouping nodes by node attributes and their edge attributes
    into supernodes in a summary graph. The SNAP acronym should not be
    confused with the Stanford Network Analysis Project (SNAP).

    Here is a high-level view of how this algorithm works:

    1) Group nodes by node attribute values.

    2) Iteratively split groups until all nodes in each group have edges
       to nodes in the same groups. That is, until all the groups are
       homogeneous in their member nodes' edges to other groups. For example,
       if all the nodes in group A only have edges to nodes in group B, then
       the group is homogeneous and does not need to be split. If all nodes in
       group B have edges with nodes in groups {A, C}, but some also have edges
       with other nodes in B, then group B is not homogeneous and needs to be
       split into a group of nodes having edges with {A, C} and a group of
       nodes having edges with {A, B, C}. This way, viewers of the summary
       graph can assume that all nodes in a group have the exact same node
       attributes and the exact same edges.

    3) Build the output summary graph, where the groups are represented by
       supernodes, and superedges represent the edges shared between all the
       nodes in each respective group.

    A SNAP summary graph can be used to visualize graphs that are too large to
    display or visually analyze, or to efficiently identify sets of similar nodes
    with similar connectivity patterns to other sets of similar nodes based on
    specified node and/or edge attributes in a graph.

    Parameters
    ----------
    G: graph
        Networkx Graph to be summarized
    node_attributes: iterable, required
        An iterable of the node attributes used to group nodes in the
        summarization process. Nodes with the same values for these attributes
        will be grouped together in the summary graph.
    edge_attributes: iterable, optional
        An iterable of the edge attributes considered in the summarization
        process. If provided, unique combinations of the attribute values found
        in the graph are used to determine the edge types in the graph. If not
        provided, all edges are considered to be of the same type.
    prefix: str
        The prefix used to denote supernodes in the summary graph. Defaults to
        'Supernode-'.
    supernode_attribute: str
        The node attribute for recording the supernode groupings of nodes.
        Defaults to 'group'.
    superedge_attribute: str
        The edge attribute for recording the edge types of multiple edges.
        Defaults to 'types'.

    Returns
    -------
    networkx.Graph: summary graph

    Examples
    --------
    SNAP aggregation takes a graph and summarizes it in the context of
    user-provided node and edge attributes such that a viewer can more easily
    extract and analyze the information represented by the graph.

    >>> nodes = {
    ...     "A": dict(color="Red"),
    ...     "B": dict(color="Red"),
    ...     "C": dict(color="Red"),
    ...     "D": dict(color="Red"),
    ...     "E": dict(color="Blue"),
    ...     "F": dict(color="Blue"),
    ... }
    >>> edges = [
    ...     ("A", "E", "Strong"),
    ...     ("B", "F", "Strong"),
    ...     ("C", "E", "Weak"),
    ...     ("D", "F", "Weak"),
    ... ]
    >>> G = nx.Graph()
    >>> for node in nodes:
    ...     attributes = nodes[node]
    ...     G.add_node(node, **attributes)
    ...
    >>> for source, target, type in edges:
    ...     G.add_edge(source, target, type=type)
    ...
    >>> node_attributes = ("color",)
    >>> edge_attributes = ("type",)
    >>> summary_graph = nx.snap_aggregation(
    ...     G, node_attributes=node_attributes, edge_attributes=edge_attributes
    ... )
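
    With this graph, nodes ``A``/``B``, ``C``/``D``, and ``E``/``F`` are
    expected to be grouped into three supernodes connected by two superedges
    (an illustrative check of the example above):

    >>> summary_graph.number_of_nodes()
    3
    >>> summary_graph.number_of_edges()
    2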

    Notes
    -----
    The summary graph produced is called a maximum Attribute-Relationship
    compatible (AR-compatible) grouping. According to [1]_, an AR-compatible
    grouping means that all nodes in each group have the same exact node
    attribute values and the same exact edges and edge types to one or more
    nodes in the same groups. The maximum AR-compatible grouping is the
    grouping with the minimal cardinality.

    The AR-compatible grouping is the most detailed grouping provided by
    any of the SNAP algorithms.

    References
    ----------
    .. [1] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation
       for graph summarization. In Proc. 2008 ACM-SIGMOD Int. Conf.
       Management of Data (SIGMOD’08), pages 567–580, Vancouver, Canada,
       June 2008.
    """
    edge_types = {
        edge: tuple(attrs.get(attr) for attr in edge_attributes)
        for edge, attrs in G.edges.items()
    }
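    # Each edge is stored once in an undirected graph, but later steps look
    # edges up from either endpoint, so also record the reversed orientation
    # with the same edge type.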
    if not G.is_directed():
        if G.is_multigraph():
            # list is needed to avoid mutating while iterating
            edges = [((v, u, k), etype) for (u, v, k), etype in edge_types.items()]
        else:
            # list is needed to avoid mutating while iterating
            edges = [((v, u), etype) for (u, v), etype in edge_types.items()]
        edge_types.update(edges)
    group_lookup = {
        node: tuple(attrs[attr] for attr in node_attributes)
        for node, attrs in G.nodes.items()
    }
    groups = defaultdict(set)
    for node, node_type in group_lookup.items():
        groups[node_type].add(node)

    eligible_group_id, neighbor_info = _snap_eligible_group(
        G, groups, group_lookup, edge_types
    )
    while eligible_group_id:
        groups = _snap_split(groups, neighbor_info, group_lookup, eligible_group_id)
        eligible_group_id, neighbor_info = _snap_eligible_group(
            G, groups, group_lookup, edge_types
        )
    return _snap_build_graph(
        G,
        groups,
        node_attributes,
        edge_attributes,
        neighbor_info,
        edge_types,
        prefix,
        supernode_attribute,
        superedge_attribute,
    )
|