如何高效匹配分组元素的不同版本?

问题描述 投票:0回答:1

我正在尝试以尽可能小的差异重新映射版本化数据和代码版本之间的主键。

我有一个元素列表,例如

[a,b,c,...,j]
,以及一些将这些元素分配给组的确定性过程,例如
g0_0={a,b,c}, g0_1={d,e,f,g,h}, g0_2={i}, and g0_3={j}

后来,我将元素添加到列表中,例如

[k,l]
,我改进了我的流程,例如我的小组在这里
g1_0={a,b,c,d}, g1_1={e,f,g,h,i,k}, g1_2={j,l}

如何有效地获得从旧组到新组的(多对多)映射,以最小化映射集之间的差异?

在这个例子中我们会看到:

g0_0 |-> g1_0: diff=1
g0_1,g0_2 |-> g1_1: diff=2
g0_3 |-> g1_2: diff=1
total_diff = 4

我可以将其组装为具有二分图的混合整数优化问题:

g0_0   g0_1   g0_2   g0_3
  |       | /       /
g1_0     g1_1   g_1_2

其中边缘具有离散值 [0,1],并且我最小化总差异。

但是,在这个应用程序中,我的组数量要大得多。我预计大小为

count(g0_*)
count(g1_*)
的邻接矩阵太大,无法使这种方法易于处理。

algorithm optimization set primary-key graph-theory
1个回答
0
投票

OP 希望“差异尽可能小”。

不幸的是,没有具体说明如何计算这个“差异”,并且OP没有回复评论。

我假设“差异”是一对组之间的编辑距离。 即从一组移动到另一组所需的插入和删除的数量。 例如

g0_1={d,e,f,g,h}
g1_1={e,f,g,h,i,k}
需要删除d,添加i和k,编辑距离为3。

但是,无论是在我的假设中还是在发布的问题中,都有问题

g0_1,g0_2 |-> g1_1: diff=2

没有办法解决这个问题,因为OP似乎已经放弃了这个问题。 因此,我将在前进的过程中不断做出必要的假设。

假设1:diff 计算为编辑距离,我们需要最小化映射组的编辑距离之和。

假设 2:第一个版本中的每个组都必须映射到第二个版本中的组。即,版本 1 组中没有遗漏的多对一映射

这些假设使得解决这个问题的算法非常简单:贪婪算法,其中版本 1 中的每个组都映射到具有最小编辑距离的版本 2 组。

这给出了与OP发布的不同的最佳解决方案。

0 -> 0, dist = 1
1 -> 1, dist = 3
2 -> 2, dist = 3
3 -> 2, dist = 1
Total 8

这是实现此功能的 C++ 代码:

#include <string>
#include <fstream>
#include <sstream>
#include <iostream>
#include <vector>
#include <algorithm>

///////////////////// data structure ///////////////////

typedef std::string element_t;

typedef std::vector<element_t> group_t;

typedef std::vector<group_t> version_t;

// An edit distance between two groups
struct sEditDist
{
    int iv1;  // zero-based index to group in first version
    int iv2;  // zero-based index to group in second version
    int dist; // count of insertions and deletion needs to reach 2nd group from first

    void display();
};

typedef std::vector<sEditDist> vEdits_t;

//////////////////////// Globals //////////////////////////

// the versions
version_t v1, v2;

// the optimal connections
vEdits_t vConnections;

////////////////////// methods //////////////////////////////

/// @brief generate example from stackoverflow question
/// @return
/// https://stackoverflow.com/q/79258933/16582
void gen1()
{
    // g0_0={a,b,c}, g0_1={d,e,f,g,h}, g0_2={i}, and g0_3={j}.
    v1 = version_t{
        {"a", "b", "c"},
        {"d", "e", "f", "g", "h"},
        {"i"},
        {"j"}};

    // g1_0={a,b,c,d}, g1_1={e,f,g,h,i,k}, g1_2={j,l}
    v2 = version_t{
        {"a", "b", "c", "d"},
        {"e", "f", "g", "h", "i", "k"},
        {"j", "l"}};
}
/// @brief edit distance between two groups
/// @param g1
/// @param g2
/// @return
int editDistance(
    const group_t &g1,
    const group_t &g2)
{
    /* Do a linear search through a possible mapping between version 1 and 2 groups
    calculating the edit distance ( number of insertions and deletions reuired ) between them.

    This is the bottleneck.  It is blindingly fast for a few elements 
    O(N) where N is the number of elements ( element present in both versions counts for 2 )
    When there are more than ~100,000 elements this might drag a bit.

    TODO: This can be optimised by sorting the members of each group and doing a binary search

    */
    int dist = 0;
    for (auto &e : g1)
    {
        auto it = std::find(g2.begin(), g2.end(), e);
        if (it == g2.end())
            dist++; // element needs to be added to first group
    }
    for (auto &e : g2)
    {
        auto it = std::find(g1.begin(), g1.end(), e);
        if (it == g1.end())
            dist++; // element needs to be deleted from first group
    }
    return dist;
}

/// @brief connect v1 groups to v2 groups, mimimizing the total edit distance
void Connect()
{
    sEditDist ed;
    for (ed.iv1 = 0; ed.iv1 < v1.size(); ed.iv1++)
    {
        int bestv2 = 0;
        int bestDist = INT_MAX;
        for (ed.iv2 = 0; ed.iv2 < v2.size(); ed.iv2++)
        {
            // calculate edit distance
            ed.dist = editDistance(
                v1[ed.iv1],
                v2[ed.iv2]);
            ed.display();

            // keep the smallest edit distance
            if (ed.dist < bestDist)
            {
                bestDist = ed.dist;
                bestv2 = ed.iv2;
            }
        }
        // add connection with smallest edit distance from g1
        ed.iv2 = bestv2;
        ed.dist = bestDist;
        vConnections.push_back(ed);
    }
}
void ConnectDisplay()
{
    std::cout << "\nOptimal mapping between versions\n";
    for (auto &d : vConnections)
        d.display();
}

void sEditDist::display()
{
    std::cout << iv1 << " -> " << iv2 << ", dist = " << dist << "\n";
}

/////////////// Main //////////////////////

main()
{
    // generate problem
    gen1();

    // make connections
    Connect();

    // display result
    ConnectDisplay();
    return 0;
}

此应用程序的完整 VSCODE 项目位于 https://github.com/JamesBremner/so79258933

© www.soinside.com 2019 - 2024. All rights reserved.