分割图表,使每种类型的节点在每个分区中只出现一次,并尽可能少地进行切割

问题描述 投票:0回答:1

我在 BigQuery 中有一个表,它表示图的边缘。图中的每个节点都有一个类型。每条边都有一个created_at 属性。现在我想将图切割成子组件。没有子组件具有多个选定类型 (A) 的节点。我想用尽可能少的边缘切割来做到这一点。更喜欢切割较新的边缘而不是较旧的边缘。

  1. 什么是有效的方法来做到这一点?
  2. 这可以在 SQL 中通过递归来完成吗?
  3. 我需要使用什么类型的算法?

这是一个示例表结构。

在生成的子组件中,没有子组件具有多个 A 类型节点。

Visualization of an example graph

节点:

节点id 节点类型
1 A
2 B
3 A
4 C
5 A
6 B

边缘

边缘_id 源节点 目标节点 创建于 来源类型 目标类型
1 1 4 2024-05-01 20:43:00 A C
2 2 6 2024-05-02 06:43:00 B B
3 3 6 2024-05-03 08:43:00 A B
4 6 4 2024-05-04 19:43:00 B C
5 4 5 2024-05-05 21:43:00 C A
从上面反转边缘...
6 4 1 2024-05-01 20:43:00 C A
7 6 2 2024-05-02 06:43:00 B B
8 6 3 2024-05-03 08:43:00 B A
9 4 6 2024-05-04 19:43:00 C B
10 5 4 2024-05-05 21:43:00 A C

这就是我目前获取组件 ID 的方法。我目前陷入困境的是如何将这些组件切割成子组件。

with recursive
    connected_components as (
        select
            source_node as node_id,
            source_node as front,
            target_node,
            source_type,
            target_type,
            1 as depth,
            [source_node] as visited_nodes
        from edges
        union all
        select
            cc.node_id,
            e.source_node,
            e.target_node as front,
            e.source_type,
            e.target_type,
            cc.depth + 1 as depth,
            array_concat(cc.visited_nodes, [e.target_node]) as visited_nodes
        from edges as e
        inner join connected_components as cc on e.source_node = cc.target_node
        where cc.depth < 50 and not e.target_node in unnest(cc.visited_nodes)
    ),

    components as (
        select node_id, min(front) as component_id from connected_components group by 1
    )

select * from components

因此,我希望有一个类似的查找表,但作为每个节点的子组件 ID 的查找。

如果无法通过 sql 解决,也可以通过 Python 解决,然后将组件的查找表重新导入 Bigquery。

根据上面提供的数据,我期望得到下表:

id 组件id component_id_from_pic(仅供参考)
1 1 c2
4 1 c2
5 5 c1
6 2 c3
2 2 c3
3 2 c3
python graph networkx graph-theory connected-components
1个回答
0
投票

我想您可以从 A 类型的每个节点中进行孤立的广度优先搜索。将组件初始化为单例集,每个集合都包含 A 类型的节点。现在通过选择要添加到每个组件的边来扩展这些组件,类似于BFS。您可以按照与创建相反的顺序选择这些边。

我在 Python 中的尝试如下。 首先创建数据集(假设您有办法从 BigQuery 中提取该数据集):

import pandas as pd

data = {'node_id':[1,2,3,4,5,6],
        'node_type':['A', 'B', 'A', 'C', 'A', 'B'],
        'attr_name': ['type', 'type', 'type', 'type', 'type', 'type']}

nodes = pd.DataFrame(data)

data = {'edge_id' : [1,2,3,4,5],
        'source': [1,2,3,6,4],
        'target': [4,6,6,4,5],
        'created_at': [1,2,3,4,5],
        'source_type': ['A', 'B', 'A', 'B', 'C'],
        'target_type': ['C', 'B', 'B', 'C', 'A']}

edges = pd.DataFrame(data)

用它创建一个networkx图:

import networkx as nx

attrs = dict()
G = nx.from_pandas_edgelist(new_edges, 'source', 'target', edge_attr='created_at')
for i in range(len(nodes)):
  attrs[nodes.loc[i, 'node_id']] = {'type': nodes.loc[i, 'node_type']}
nx.set_node_attributes(G, attrs)


# in case G is disconnected to start with
connected_components = [G.subgraph(c).copy() for c in nx.connected_components(G)]
#Note: you can replace the above line with pre-defined components if you have

然后创建对的字典:

import copy

given_type = 'A'

for component in connected_components:

  #Make an isolated component for each node of type <given_type>
  sub_components = [node for node in component.nodes if component.nodes[node]['type'] == given_type]
  comp_belong = dict((sub_components[i], i+1) for i in range(len(sub_components))) #This is a dictionary of the form <node_id: component_id>
  visited = [] 
  remaining_edges = []
  while len(comp_belong) < G.number_of_nodes():
    
    candidate_edges = set()

    cb = copy.deepcopy(comp_belong) #So that iterable does not change size in the for loop
    
    for node in [x for x in cb.keys() if x not in visited]:
        
        if node in visited:
          continue

        visited.append(node)

        for e in G.edges(node):
          if (e[0], e[1]) not in candidate_edges and (e[1], e[0]) not in candidate_edges: #select the edges connected to node, similar to breadth first search 
            candidate_edges.add(e)

        sorted_edges = sorted(candidate_edges, key = lambda edge: component.edges[edge]['created_at'], reverse =  True) #sort the edges by time, youngest first
        
        for selected_edge in sorted_edges:
          #identify whichever end point is not in the component dictionary and add it to the same component as its parent
          if selected_edge[0] in comp_belong.keys() and selected_edge[1] not in comp_belong.keys():
            comp_belong[selected_edge[1]] = comp_belong[selected_edge[0]]
            remaining_edges.append(selected_edge)
          elif selected_edge[0] not in comp_belong.keys() and selected_edge[1] in comp_belong.keys():
            comp_belong[selected_edge[0]] = comp_belong[selected_edge[1]]
            remaining_edges.append(selected_edge)
print(comp_belong) #this is your desired output, put it in a data frame or whatever data structure you need. 

然后删除remaining_edges

的边并可视化图形:

G.remove_edges_from([e for e in G.edges if (e[0], e[1]) not in remaining_edges and (e[1], e[0]) not in remaining_edges])
nx.draw(G)
© www.soinside.com 2019 - 2024. All rights reserved.