分割图表，使每种类型的节点在每个分区中只出现一次，并尽可能少地进行切割

Question

我在 BigQuery 中有一个表，它表示图的边缘。图中的每个节点都有一个类型。每条边都有一个created_at 属性。现在我想将图切割成子组件。没有子组件具有多个选定类型 (A) 的节点。我想用尽可能少的边缘切割来做到这一点。更喜欢切割较新的边缘而不是较旧的边缘。

什么是有效的方法来做到这一点？
这可以在 SQL 中通过递归来完成吗？
我需要使用什么类型的算法？

这是一个示例表结构。

在生成的子组件中，没有子组件具有多个 A 类型节点。

节点：

节点id	节点类型
1	A
2	B
3	A
4	C
5	A
6	B

边缘

边缘_id	源节点	目标节点	创建于	来源类型	目标类型
1	1	4	2024-05-01 20:43:00	A	C
2	2	6	2024-05-02 06:43:00	B	B
3	3	6	2024-05-03 08:43:00	A	B
4	6	4	2024-05-04 19:43:00	B	C
5	4	5	2024-05-05 21:43:00	C	A
从上面反转边缘...
6	4	1	2024-05-01 20:43:00	C	A
7	6	2	2024-05-02 06:43:00	B	B
8	6	3	2024-05-03 08:43:00	B	A
9	4	6	2024-05-04 19:43:00	C	B
10	5	4	2024-05-05 21:43:00	A	C

这就是我目前获取组件 ID 的方法。我目前陷入困境的是如何将这些组件切割成子组件。

with recursive
    connected_components as (
        select
            source_node as node_id,
            source_node as front,
            target_node,
            source_type,
            target_type,
            1 as depth,
            [source_node] as visited_nodes
        from edges
        union all
        select
            cc.node_id,
            e.source_node,
            e.target_node as front,
            e.source_type,
            e.target_type,
            cc.depth + 1 as depth,
            array_concat(cc.visited_nodes, [e.target_node]) as visited_nodes
        from edges as e
        inner join connected_components as cc on e.source_node = cc.target_node
        where cc.depth < 50 and not e.target_node in unnest(cc.visited_nodes)
    ),

    components as (
        select node_id, min(front) as component_id from connected_components group by 1
    )

select * from components

因此，我希望有一个类似的查找表，但作为每个节点的子组件 ID 的查找。

如果无法通过 sql 解决，也可以通过 Python 解决，然后将组件的查找表重新导入 Bigquery。

根据上面提供的数据，我期望得到下表：

id	组件id	component_id_from_pic（仅供参考）
1	1	c2
4	1	c2
5	5	c1
6	2	c3
2	2	c3
3	2	c3

Answer 1

我想您可以从 A 类型的每个节点中进行孤立的广度优先搜索。将组件初始化为单例集，每个集合都包含 A 类型的节点。现在通过选择要添加到每个组件的边来扩展这些组件，类似于BFS。您可以按照与创建相反的顺序选择这些边。

我在 Python 中的尝试如下。首先创建数据集（假设您有办法从 BigQuery 中提取该数据集）：

import pandas as pd

data = {'node_id':[1,2,3,4,5,6],
        'node_type':['A', 'B', 'A', 'C', 'A', 'B'],
        'attr_name': ['type', 'type', 'type', 'type', 'type', 'type']}

nodes = pd.DataFrame(data)

data = {'edge_id' : [1,2,3,4,5],
        'source': [1,2,3,6,4],
        'target': [4,6,6,4,5],
        'created_at': [1,2,3,4,5],
        'source_type': ['A', 'B', 'A', 'B', 'C'],
        'target_type': ['C', 'B', 'B', 'C', 'A']}

edges = pd.DataFrame(data)

用它创建一个networkx图：

import networkx as nx

attrs = dict()
G = nx.from_pandas_edgelist(new_edges, 'source', 'target', edge_attr='created_at')
for i in range(len(nodes)):
  attrs[nodes.loc[i, 'node_id']] = {'type': nodes.loc[i, 'node_type']}
nx.set_node_attributes(G, attrs)


# in case G is disconnected to start with
connected_components = [G.subgraph(c).copy() for c in nx.connected_components(G)]
#Note: you can replace the above line with pre-defined components if you have

然后创建对的字典：

import copy

given_type = 'A'

for component in connected_components:

  #Make an isolated component for each node of type <given_type>
  sub_components = [node for node in component.nodes if component.nodes[node]['type'] == given_type]
  comp_belong = dict((sub_components[i], i+1) for i in range(len(sub_components))) #This is a dictionary of the form <node_id: component_id>
  visited = [] 
  remaining_edges = []
  while len(comp_belong) < G.number_of_nodes():
    
    candidate_edges = set()

    cb = copy.deepcopy(comp_belong) #So that iterable does not change size in the for loop
    
    for node in [x for x in cb.keys() if x not in visited]:
        
        if node in visited:
          continue

        visited.append(node)

        for e in G.edges(node):
          if (e[0], e[1]) not in candidate_edges and (e[1], e[0]) not in candidate_edges: #select the edges connected to node, similar to breadth first search 
            candidate_edges.add(e)

        sorted_edges = sorted(candidate_edges, key = lambda edge: component.edges[edge]['created_at'], reverse =  True) #sort the edges by time, youngest first
        
        for selected_edge in sorted_edges:
          #identify whichever end point is not in the component dictionary and add it to the same component as its parent
          if selected_edge[0] in comp_belong.keys() and selected_edge[1] not in comp_belong.keys():
            comp_belong[selected_edge[1]] = comp_belong[selected_edge[0]]
            remaining_edges.append(selected_edge)
          elif selected_edge[0] not in comp_belong.keys() and selected_edge[1] in comp_belong.keys():
            comp_belong[selected_edge[0]] = comp_belong[selected_edge[1]]
            remaining_edges.append(selected_edge)
print(comp_belong) #this is your desired output, put it in a data frame or whatever data structure you need.

然后删除remaining_edges中

不

的边并可视化图形：

G.remove_edges_from([e for e in G.edges if (e[0], e[1]) not in remaining_edges and (e[1], e[0]) not in remaining_edges])
nx.draw(G)

分割图表，使每种类型的节点在每个分区中只出现一次，并尽可能少地进行切割

问题描述投票：0回答：1

节点：

边缘

1个回答

最新问题

分割图表，使每种类型的节点在每个分区中只出现一次，并尽可能少地进行切割

问题描述 投票：0回答：1

节点：

边缘

1个回答

最新问题

问题描述投票：0回答：1