  1. """
  2. Graph summarization finds smaller representations of graphs resulting in faster
  3. runtime of algorithms, reduced storage needs, and noise reduction.
  4. Summarization has applications in areas such as visualization, pattern mining,
  5. clustering and community detection, and more. Core graph summarization
  6. techniques are grouping/aggregation, bit-compression,
  7. simplification/sparsification, and influence based. Graph summarization
  8. algorithms often produce either summary graphs in the form of supergraphs or
  9. sparsified graphs, or a list of independent structures. Supergraphs are the
  10. most common product, which consist of supernodes and original nodes and are
  11. connected by edges and superedges, which represent aggregate edges between
  12. nodes and supernodes.
  13. Grouping/aggregation based techniques compress graphs by representing
  14. close/connected nodes and edges in a graph by a single node/edge in a
  15. supergraph. Nodes can be grouped together into supernodes based on their
  16. structural similarities or proximity within a graph to reduce the total number
  17. of nodes in a graph. Edge-grouping techniques group edges into lossy/lossless
  18. nodes called compressor or virtual nodes to reduce the total number of edges in
  19. a graph. Edge-grouping techniques can be lossless, meaning that they can be
  20. used to re-create the original graph, or techniques can be lossy, requiring
  21. less space to store the summary graph, but at the expense of lower
  22. reconstruction accuracy of the original graph.
  23. Bit-compression techniques minimize the amount of information needed to
  24. describe the original graph, while revealing structural patterns in the
  25. original graph. The two-part minimum description length (MDL) is often used to
  26. represent the model and the original graph in terms of the model. A key
  27. difference between graph compression and graph summarization is that graph
  28. summarization focuses on finding structural patterns within the original graph,
  29. whereas graph compression focuses on compressions the original graph to be as
  30. small as possible. **NOTE**: Some bit-compression methods exist solely to
  31. compress a graph without creating a summary graph or finding comprehensible
  32. structural patterns.
  33. Simplification/Sparsification techniques attempt to create a sparse
  34. representation of a graph by removing unimportant nodes and edges from the
  35. graph. Sparsified graphs differ from supergraphs created by
  36. grouping/aggregation by only containing a subset of the original nodes and
  37. edges of the original graph.
  38. Influence based techniques aim to find a high-level description of influence
  39. propagation in a large graph. These methods are scarce and have been mostly
  40. applied to social graphs.
  41. *dedensification* is a grouping/aggregation based technique to compress the
  42. neighborhoods around high-degree nodes in unweighted graphs by adding
  43. compressor nodes that summarize multiple edges of the same type to
  44. high-degree nodes (nodes with a degree greater than a given threshold).
  45. Dedensification was developed for the purpose of increasing performance of
  46. query processing around high-degree nodes in graph databases and enables direct
  47. operations on the compressed graph. The structural patterns surrounding
  48. high-degree nodes in the original is preserved while using fewer edges and
  49. adding a small number of compressor nodes. The degree of nodes present in the
  50. original graph is also preserved. The current implementation of dedensification
  51. supports graphs with one edge type.
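
For example, dedensifying a graph in which five nodes each point to the same
two high-degree hubs routes those ten edges through a single compressor node
(a minimal illustrative sketch; the node labels are arbitrary):

>>> import networkx as nx
>>> G = nx.DiGraph([(n, hub) for n in "abcde" for hub in "XY"])
>>> H, compressors = nx.dedensify(G, threshold=2)
>>> G.number_of_edges(), H.number_of_edges(), len(compressors)
(10, 7, 1)

As the examples in :func:`dedensify` show, the original graph can be
reconstructed from the compressed one, so the compression is lossless.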

For more information on graph summarization, see `Graph Summarization Methods
and Applications: A Survey <https://dl.acm.org/doi/abs/10.1145/3186727>`_
"""

from collections import Counter, defaultdict

import networkx as nx

__all__ = ["dedensify", "snap_aggregation"]


def dedensify(G, threshold, prefix=None, copy=True):
    """Compresses neighborhoods around high-degree nodes

    Reduces the number of edges to high-degree nodes by adding compressor nodes
    that summarize multiple edges of the same type to high-degree nodes (nodes
    with a degree greater than a given threshold), thereby reducing the total
    number of edges around those nodes. The implementation currently supports
    graphs with a single edge type.

    Parameters
    ----------
    G: graph
       A networkx graph
    threshold: int
       Minimum degree threshold of a node to be considered a high degree node.
       Only nodes with a degree strictly greater than the threshold are
       compressed. The threshold must be greater than or equal to 2.
    prefix: str or None, optional (default: None)
       An optional prefix for denoting compressor nodes
    copy: bool, optional (default: True)
       If True, dedensification is performed on a copy of the graph; if False,
       the graph is modified in place.

    Returns
    -------
    dedensified networkx graph : (graph, set)
        2-tuple of the dedensified graph and set of compressor nodes

    Notes
    -----
    According to the algorithm in [1]_, removes edges in a graph by
    compressing/decompressing the neighborhoods around high degree nodes by
    adding compressor nodes that summarize multiple edges of the same type
    to high-degree nodes. Dedensification will only add a compressor node when
    doing so will reduce the total number of edges in the given graph. This
    implementation currently supports graphs with a single edge type.

    Examples
    --------
    Dedensification will only add compressor nodes when doing so would result
    in fewer edges::

        >>> original_graph = nx.DiGraph()
        >>> original_graph.add_nodes_from(
        ...     ["1", "2", "3", "4", "5", "6", "A", "B", "C"]
        ... )
        >>> original_graph.add_edges_from(
        ...     [
        ...         ("1", "C"), ("1", "B"),
        ...         ("2", "C"), ("2", "B"), ("2", "A"),
        ...         ("3", "B"), ("3", "A"), ("3", "6"),
        ...         ("4", "C"), ("4", "B"), ("4", "A"),
        ...         ("5", "B"), ("5", "A"),
        ...         ("6", "5"),
        ...         ("A", "6"),
        ...     ]
        ... )
        >>> c_graph, c_nodes = nx.dedensify(original_graph, threshold=2)
        >>> original_graph.number_of_edges()
        15
        >>> c_graph.number_of_edges()
        14

    A dedensified, directed graph can be "densified" to reconstruct the
    original graph::

        >>> original_graph = nx.DiGraph()
        >>> original_graph.add_nodes_from(
        ...     ["1", "2", "3", "4", "5", "6", "A", "B", "C"]
        ... )
        >>> original_graph.add_edges_from(
        ...     [
        ...         ("1", "C"), ("1", "B"),
        ...         ("2", "C"), ("2", "B"), ("2", "A"),
        ...         ("3", "B"), ("3", "A"), ("3", "6"),
        ...         ("4", "C"), ("4", "B"), ("4", "A"),
        ...         ("5", "B"), ("5", "A"),
        ...         ("6", "5"),
        ...         ("A", "6"),
        ...     ]
        ... )
        >>> c_graph, c_nodes = nx.dedensify(original_graph, threshold=2)
        >>> # re-densifies the compressed graph into the original graph
        >>> for c_node in c_nodes:
        ...     all_neighbors = set(nx.all_neighbors(c_graph, c_node))
        ...     out_neighbors = set(c_graph.neighbors(c_node))
        ...     for out_neighbor in out_neighbors:
        ...         c_graph.remove_edge(c_node, out_neighbor)
        ...     in_neighbors = all_neighbors - out_neighbors
        ...     for in_neighbor in in_neighbors:
        ...         c_graph.remove_edge(in_neighbor, c_node)
        ...         for out_neighbor in out_neighbors:
        ...             c_graph.add_edge(in_neighbor, out_neighbor)
        ...     c_graph.remove_node(c_node)
        ...
        >>> nx.is_isomorphic(original_graph, c_graph)
        True

    References
    ----------
    .. [1] Maccioni, A., & Abadi, D. J. (2016, August).
       Scalable pattern matching over compressed graphs via dedensification.
       In Proceedings of the 22nd ACM SIGKDD International Conference on
       Knowledge Discovery and Data Mining (pp. 1755-1764).
       http://www.cs.umd.edu/~abadi/papers/graph-dedense.pdf
    """
    if threshold < 2:
        raise nx.NetworkXError("The degree threshold must be >= 2")

    degrees = G.in_degree if G.is_directed() else G.degree
    # Group nodes based on degree threshold
    high_degree_nodes = {n for n, d in degrees if d > threshold}
    low_degree_nodes = G.nodes() - high_degree_nodes

    # Map each set of high-degree neighbors to the nodes that share it
    auxiliary = {}
    for node in G:
        high_degree_neighbors = frozenset(high_degree_nodes & set(G[node]))
        if high_degree_neighbors:
            if high_degree_neighbors in auxiliary:
                auxiliary[high_degree_neighbors].add(node)
            else:
                auxiliary[high_degree_neighbors] = {node}

    if copy:
        G = G.copy()

    compressor_nodes = set()
    for index, (high_degree_nodes, low_degree_nodes) in enumerate(auxiliary.items()):
        low_degree_node_count = len(low_degree_nodes)
        high_degree_node_count = len(high_degree_nodes)
        old_edges = high_degree_node_count * low_degree_node_count
        new_edges = high_degree_node_count + low_degree_node_count
        # Only add a compressor node when it reduces the edge count
        if old_edges <= new_edges:
            continue
        compression_node = "".join(str(node) for node in high_degree_nodes)
        if prefix:
            compression_node = str(prefix) + compression_node
        # Reroute the low-degree nodes' edges through the compressor node
        for node in low_degree_nodes:
            for high_node in high_degree_nodes:
                if G.has_edge(node, high_node):
                    G.remove_edge(node, high_node)

            G.add_edge(node, compression_node)
        for node in high_degree_nodes:
            G.add_edge(compression_node, node)
        compressor_nodes.add(compression_node)
    return G, compressor_nodes


def _snap_build_graph(
    G,
    groups,
    node_attributes,
    edge_attributes,
    neighbor_info,
    edge_types,
    prefix,
    supernode_attribute,
    superedge_attribute,
):
    """
    Build the summary graph from the data structures produced in the SNAP aggregation algorithm

    Used in the SNAP aggregation algorithm to build the output summary graph. This
    process uses the original graph and the data structures to create the supernodes
    with the correct node attributes, and the superedges with the correct edge
    attributes.

    Parameters
    ----------
    G: networkx.Graph
        the original graph to be summarized
    groups: dict
        A dictionary of unique group IDs and their corresponding node groups
    node_attributes: iterable
        An iterable of the node attributes considered in the summarization process
    edge_attributes: iterable
        An iterable of the edge attributes considered in the summarization process
    neighbor_info: dict
        A data structure indicating, for each node, the number of edges of each
        edge type it has to each group in the current summarization
    edge_types: dict
        dictionary of edges in the graph and their corresponding attributes recognized
        in the summarization
    prefix: string
        The prefix to be added to all supernodes
    supernode_attribute: str
        The node attribute for recording the supernode groupings of nodes
    superedge_attribute: str
        The edge attribute for recording the edge types represented by superedges

    Returns
    -------
    summary graph: Networkx graph
    """
    output = G.__class__()
    node_label_lookup = {}
    # Create one supernode per group, carrying the group's shared node attributes
    for index, group_id in enumerate(groups):
        group_set = groups[group_id]
        supernode = f"{prefix}{index}"
        node_label_lookup[group_id] = supernode
        supernode_attributes = {
            attr: G.nodes[next(iter(group_set))][attr] for attr in node_attributes
        }
        supernode_attributes[supernode_attribute] = group_set
        output.add_node(supernode, **supernode_attributes)

    # Connect supernodes whose member nodes share edges in the original graph
    for group_id in groups:
        group_set = groups[group_id]
        source_supernode = node_label_lookup[group_id]
        for other_group, group_edge_types in neighbor_info[
            next(iter(group_set))
        ].items():
            if group_edge_types:
                target_supernode = node_label_lookup[other_group]
                summary_graph_edge = (source_supernode, target_supernode)

                edge_types = [
                    dict(zip(edge_attributes, edge_type))
                    for edge_type in group_edge_types
                ]

                has_edge = output.has_edge(*summary_graph_edge)
                if output.is_multigraph():
                    if not has_edge:
                        for edge_type in edge_types:
                            output.add_edge(*summary_graph_edge, **edge_type)
                    elif not output.is_directed():
                        # in an undirected multigraph the edge may already have
                        # been added from the opposite direction; only add edge
                        # types that are not yet present
                        existing_edge_data = output.get_edge_data(*summary_graph_edge)
                        for edge_type in edge_types:
                            if edge_type not in existing_edge_data.values():
                                output.add_edge(*summary_graph_edge, **edge_type)
                else:
                    superedge_attributes = {superedge_attribute: edge_types}
                    output.add_edge(*summary_graph_edge, **superedge_attributes)

    return output


def _snap_eligible_group(G, groups, group_lookup, edge_types):
    """
    Determines if a group is eligible to be split.

    A group is eligible to be split if its nodes do not all have edges of the
    same type(s) with the same other groups, i.e. the group is not yet
    homogeneous with respect to its members' connections.

    Parameters
    ----------
    G: graph
        graph to be summarized
    groups: dict
        A dictionary of unique group IDs and their corresponding node groups
    group_lookup: dict
        dictionary of nodes and their current corresponding group ID
    edge_types: dict
        dictionary of edges in the graph and their corresponding attributes recognized
        in the summarization

    Returns
    -------
    tuple: ID of the first group eligible to be split (None if no group needs
        splitting), and the neighbor_info data structure of per-node edge
        counts per group
    """
    neighbor_info = {node: {gid: Counter() for gid in groups} for node in group_lookup}
    for group_id in groups:
        current_group = groups[group_id]

        # build neighbor_info for nodes in group
        for node in current_group:
            neighbor_info[node] = {group_id: Counter() for group_id in groups}
            edges = G.edges(node, keys=True) if G.is_multigraph() else G.edges(node)
            for edge in edges:
                neighbor = edge[1]
                edge_type = edge_types[edge]
                neighbor_group_id = group_lookup[neighbor]
                neighbor_info[node][neighbor_group_id][edge_type] += 1

        # check if group_id is eligible to be split
        group_size = len(current_group)
        for other_group_id in groups:
            edge_counts = Counter()
            for node in current_group:
                edge_counts.update(neighbor_info[node][other_group_id].keys())

            if not all(count == group_size for count in edge_counts.values()):
                # only the neighbor_info of the returned group_id is required
                # for handling group splits
                return group_id, neighbor_info

    # if no eligible groups, complete neighbor_info is calculated
    return None, neighbor_info


def _snap_split(groups, neighbor_info, group_lookup, group_id):
    """
    Splits a group based on edge types and updates the groups accordingly

    Splits the group with the given group_id based on the edge types
    of its nodes so that the nodes in each new group all have the same
    types of edges with the same other groups.

    Parameters
    ----------
    groups: dict
        A dictionary of unique group IDs and their corresponding node groups
    neighbor_info: dict
        A data structure indicating, for each node, the number of edges of each
        edge type it has to each group in the current summarization
    group_lookup: dict
        dictionary of nodes and their current corresponding group ID
    group_id: object
        ID of group to be split

    Returns
    -------
    dict
        The updated groups based on the split
    """
    # Nodes with identical per-group edge-type signatures belong in the same new group
    new_group_mappings = defaultdict(set)
    for node in groups[group_id]:
        signature = tuple(
            frozenset(edge_types) for edge_types in neighbor_info[node].values()
        )
        new_group_mappings[signature].add(node)

    # leave the biggest new_group as the original group
    new_groups = sorted(new_group_mappings.values(), key=len)
    for new_group in new_groups[:-1]:
        # Assign unused integer as the new_group_id
        # ids are tuples, so will not interact with the original group_ids
        new_group_id = len(groups)
        groups[new_group_id] = new_group
        groups[group_id] -= new_group
        for node in new_group:
            group_lookup[node] = new_group_id

    return groups


def snap_aggregation(
    G,
    node_attributes,
    edge_attributes=(),
    prefix="Supernode-",
    supernode_attribute="group",
    superedge_attribute="types",
):
    """Creates a summary graph based on attributes and connectivity.

    This function uses the Summarization by Grouping Nodes on Attributes
    and Pairwise edges (SNAP) algorithm for summarizing a given
    graph by grouping nodes by node attributes and their edge attributes
    into supernodes in a summary graph. This name SNAP should not be
    confused with the Stanford Network Analysis Project (SNAP).

    Here is a high-level view of how this algorithm works:

    1) Group nodes by node attribute values.

    2) Iteratively split groups until all nodes in each group have edges
       to nodes in the same groups. That is, until all the groups are homogeneous
       in their member nodes' edges to other groups. For example,
       if all the nodes in group A only have edges to nodes in group B, then the
       group is homogeneous and does not need to be split. If all nodes in group B
       have edges with nodes in groups {A, C}, but some also have edges with other
       nodes in B, then group B is not homogeneous and needs to be split into a
       group of nodes having edges with {A, C} and a group of nodes having
       edges with {A, B, C}. This way, viewers of the summary graph can
       assume that all nodes in a group have the exact same node attributes and
       the exact same edges.

    3) Build the output summary graph, where the groups are represented by
       super-nodes. Edges represent the edges shared between all the nodes in
       each respective group.

    A SNAP summary graph can be used to visualize graphs that are too large to display
    or visually analyze, or to efficiently identify sets of similar nodes with similar connectivity
    patterns to other sets of similar nodes based on specified node and/or edge attributes in a graph.

    Parameters
    ----------
    G: graph
        Networkx Graph to be summarized
    node_attributes: iterable, required
        An iterable of the node attributes used to group nodes in the summarization process.
        Nodes with the same values for these attributes will be grouped together
        in the summary graph.
    edge_attributes: iterable, optional
        An iterable of the edge attributes considered in the summarization process. If provided, unique
        combinations of the attribute values found in the graph are used to
        determine the edge types in the graph. If not provided, all edges
        are considered to be of the same type.
    prefix: str
        The prefix used to denote supernodes in the summary graph. Defaults to 'Supernode-'.
    supernode_attribute: str
        The node attribute for recording the supernode groupings of nodes. Defaults to 'group'.
    superedge_attribute: str
        The edge attribute for recording the edge types of multiple edges. Defaults to 'types'.

    Returns
    -------
    networkx.Graph: summary graph

    Examples
    --------
    SNAP aggregation takes a graph and summarizes it in the context of user-provided
    node and edge attributes such that a viewer can more easily extract and
    analyze the information represented by the graph.

    >>> nodes = {
    ...     "A": dict(color="Red"),
    ...     "B": dict(color="Red"),
    ...     "C": dict(color="Red"),
    ...     "D": dict(color="Red"),
    ...     "E": dict(color="Blue"),
    ...     "F": dict(color="Blue"),
    ... }
    >>> edges = [
    ...     ("A", "E", "Strong"),
    ...     ("B", "F", "Strong"),
    ...     ("C", "E", "Weak"),
    ...     ("D", "F", "Weak"),
    ... ]
    >>> G = nx.Graph()
    >>> for node in nodes:
    ...     attributes = nodes[node]
    ...     G.add_node(node, **attributes)
    ...
    >>> for source, target, type in edges:
    ...     G.add_edge(source, target, type=type)
    ...
    >>> node_attributes = ("color",)
    >>> edge_attributes = ("type",)
    >>> summary_graph = nx.snap_aggregation(
    ...     G, node_attributes=node_attributes, edge_attributes=edge_attributes
    ... )
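
    Each supernode records its member nodes in the "group" node attribute (the
    default ``supernode_attribute``). Since the supernode labels depend on the
    ``prefix``, a minimal, order-independent check of the expected grouping and
    size of the summary is:

    >>> sorted(sorted(data["group"]) for _, data in summary_graph.nodes(data=True))
    [['A', 'B'], ['C', 'D'], ['E', 'F']]
    >>> summary_graph.number_of_nodes(), summary_graph.number_of_edges()
    (3, 2)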

    Notes
    -----
    The summary graph produced is called a maximum Attribute- and
    Relationship-compatible (AR-compatible) grouping. According to [1]_, an
    AR-compatible grouping means that all nodes in each group have the same
    exact node attribute values and the same exact edges and
    edge types to one or more nodes in the same groups. The maximum
    AR-compatible grouping is the grouping with the minimal cardinality.

    The AR-compatible grouping is the most detailed grouping provided by
    any of the SNAP algorithms.

    References
    ----------
    .. [1] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation
       for graph summarization. In Proc. 2008 ACM-SIGMOD Int. Conf.
       Management of Data (SIGMOD’08), pages 567–580, Vancouver, Canada,
       June 2008.
    """
    edge_types = {
        edge: tuple(attrs.get(attr) for attr in edge_attributes)
        for edge, attrs in G.edges.items()
    }
    if not G.is_directed():
        if G.is_multigraph():
            # list is needed to avoid mutating while iterating
            edges = [((v, u, k), etype) for (u, v, k), etype in edge_types.items()]
        else:
            # list is needed to avoid mutating while iterating
            edges = [((v, u), etype) for (u, v), etype in edge_types.items()]
        edge_types.update(edges)

    group_lookup = {
        node: tuple(attrs[attr] for attr in node_attributes)
        for node, attrs in G.nodes.items()
    }
    groups = defaultdict(set)
    for node, node_type in group_lookup.items():
        groups[node_type].add(node)

    # Iteratively split groups until the grouping is AR-compatible
    eligible_group_id, neighbor_info = _snap_eligible_group(
        G, groups, group_lookup, edge_types
    )
    while eligible_group_id:
        groups = _snap_split(groups, neighbor_info, group_lookup, eligible_group_id)
        eligible_group_id, neighbor_info = _snap_eligible_group(
            G, groups, group_lookup, edge_types
        )
    return _snap_build_graph(
        G,
        groups,
        node_attributes,
        edge_attributes,
        neighbor_info,
        edge_types,
        prefix,
        supernode_attribute,
        superedge_attribute,
    )