Best way to match strings from different systems


Suppose I have a list of strings like this (the real data set is much larger and contains other data as well):

List<string> modelNames =
    [
        "XC60 Momentum Standard T6",
        "XC60 Inscription Standard T6",
        "XC60 R designStandard T6",
        "XC60 T5 Powershift",
        "XC60 D3 DRIVE MANUAL",
        "XC60 D3 GEARTRONIC",
        "XC60 D5 GEARTRONIC AWD",
        "XC60 T6 AWD GEARTRONIC",
        "XC60 T5  AWD R DESIGN",
        "XC60 D5 GEARTRONIC AWD R DESIGN",
        "XC60 T6 AWD GEARTRONIC R DESIGN",
    ];

I want to find the closest match for strings like these:

"2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr"
"2.0 D4 Momentum Auto Euro 6 (s/s) 5dr"
"2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)"
"2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)"

As you can see, the strings don't match at all as a whole, but some parts of them do.

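I want to produce some kind of confidence score. My idea is to break both sets of strings into words and then see which pair has the most matching words. I'm not sure whether this is the best way to do this kind of analysis, or what the best and most efficient way to do it in C# would be.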

Perhaps there is a better approach than computing a score the way I described above?

I would be grateful for any ideas, suggestions and pointers.

Thanks,

Kane

c# string data-analysis fuzzy-search fuzzy-comparison
1 Answer

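I tested this with your sample strings. The results are not good: at most one word matches for any pair, which I don't think is enough for reliable matching. Also, my solution compares every name in one list with every name in the other, so it has O(n·m) time complexity (O(n²) for lists of similar size) and will not scale well if you have large collections.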
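If plain word overlap stays the matching criterion, one possible way to soften the quadratic cost is an inverted index: map each word to the names that contain it, so that a lookup only touches names sharing at least one word. The following is a minimal sketch, not a tuned implementation. `BuildIndex`, `CountSharedWords` and `Tokenize` are hypothetical helper names introduced for illustration, and the two candidate strings are taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data copied from the question (the real data set is assumed to be larger).
string[] candidates =
{
    "2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr",
    "2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)",
};

// Build the index once: word -> positions of the candidates containing it.
var index = BuildIndex(candidates);

// Look up one name; only candidates sharing at least one word are visited.
var hits = CountSharedWords("XC60 T5  AWD R DESIGN", index);
foreach (var hit in hits)
{
    Console.WriteLine($"{candidates[hit.Key]}  (shared={hit.Value})");
}

// Hypothetical helper: builds the word -> candidate positions map.
static Dictionary<string, List<int>> BuildIndex(IReadOnlyList<string> names)
{
    var result = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
    for (int i = 0; i < names.Count; i++)
    {
        // Distinct so that a repeated word is indexed once per name.
        foreach (var word in Tokenize(names[i]).Distinct(StringComparer.OrdinalIgnoreCase))
        {
            if (!result.TryGetValue(word, out var list))
            {
                result[word] = list = new List<int>();
            }
            list.Add(i);
        }
    }
    return result;
}

// Hypothetical helper: returns candidate position -> number of shared words.
static Dictionary<int, int> CountSharedWords(string name, Dictionary<string, List<int>> wordIndex)
{
    var counts = new Dictionary<int, int>();
    foreach (var word in Tokenize(name).Distinct(StringComparer.OrdinalIgnoreCase))
    {
        if (!wordIndex.TryGetValue(word, out var list)) continue;
        foreach (int i in list)
        {
            counts.TryGetValue(i, out int current); // current is 0 when the key is absent
            counts[i] = current + 1;
        }
    }
    return counts;
}

// Splits on spaces and drops the empty entries caused by double spaces.
static string[] Tokenize(string s) =>
    s.Split(' ', StringSplitOptions.RemoveEmptyEntries);
```

Very common words such as "Auto" or "Euro" still produce long candidate lists, so the worst case remains quadratic; the gain comes from names that share few or no words.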

Setup:

static List<string> modelNames =
[
    "XC60 Momentum Standard T6",
    "XC60 Inscription Standard T6",
    "XC60 R designStandard T6",
    "XC60 T5 Powershift",
    "XC60 D3 DRIVE MANUAL",
    "XC60 D3 GEARTRONIC",
    "XC60 D5 GEARTRONIC AWD",
    "XC60 T6 AWD GEARTRONIC",
    "XC60 T5  AWD R DESIGN",
    "XC60 D5 GEARTRONIC AWD R DESIGN",
    "XC60 T6 AWD GEARTRONIC R DESIGN",
];
static List<string> modelNames2 =
[
    "2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr",
    "2.0 D4 Momentum Auto Euro 6 (s/s) 5dr",
    "2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)",
    "2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)"
];

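static (string name, string[] words) GetWords(string sentence)
{
    // Split on spaces and drop the empty entries produced by double spaces
    // (for example in "XC60 T5  AWD R DESIGN").
    return (sentence, sentence.Split(' ', StringSplitOptions.RemoveEmptyEntries));
}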

Test:

var names1 = modelNames.Select(n => GetWords(n));
var names2 = modelNames2.Select(n => GetWords(n));
foreach (var n1 in names1) {
    int bestCount = 0;
    List<string> bestMatches = [];
    foreach (var n2 in names2) {

        int count = n1.words
            .Intersect(n2.words, StringComparer.InvariantCultureIgnoreCase)
            .Count();
        if (count > bestCount) {
            bestCount = count;
            bestMatches.Clear();
            bestMatches.Add(n2.name);
        } else if (count > 0 && count == bestCount) {
            bestMatches.Add(n2.name);
        }
    }
    Console.WriteLine($"{n1.name}  (count={bestCount})");
    foreach (var match in bestMatches) {
        Console.WriteLine($"    {match}");
    }
}
Console.ReadKey();

Prints:

XC60 Momentum Standard T6  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0 D4 Momentum Auto Euro 6 (s/s) 5dr
XC60 Inscription Standard T6  (count=0)
XC60 R designStandard T6  (count=0)
XC60 T5 Powershift  (count=1)
    2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)
XC60 D3 DRIVE MANUAL  (count=0)
XC60 D3 GEARTRONIC  (count=0)
XC60 D5 GEARTRONIC AWD  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T6 AWD GEARTRONIC  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T5  AWD R DESIGN  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 D5 GEARTRONIC AWD R DESIGN  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T6 AWD GEARTRONIC R DESIGN  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
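
Part of the reason the counts above stay so low is tokenization: "R-Design" and "R DESIGN" never match as whole words, and a raw count says nothing about how long the two names are. A possible refinement, which I have not tried on the full data set, is to normalize the tokens first and then turn the overlap into a score between 0 and 1, for example the Jaccard index (shared tokens divided by the tokens in either name). This is only a sketch under those assumptions. `Tokens` and `Similarity` are hypothetical names, and the normalization rule (lower-case, split on everything except ASCII letters, digits and dots) is my own choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Two names taken from the lists in the question.
string a = "XC60 T6 AWD GEARTRONIC R DESIGN";
string b = "2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)";

var (shared, score) = Similarity(a, b);
Console.WriteLine($"shared={shared}, score={score:F3}");

// Lower-cases the text and splits on anything that is not an ASCII letter,
// a digit or a dot, so "R-Design" becomes the two tokens "r" and "design".
static HashSet<string> Tokens(string text) =>
    Regex.Split(text.ToLowerInvariant(), "[^a-z0-9.]+")
         .Where(t => t.Length > 0)
         .ToHashSet();

// Jaccard index: |A ∩ B| / |A ∪ B|, computed on the distinct tokens of each name.
static (int Shared, double Score) Similarity(string left, string right)
{
    HashSet<string> l = Tokens(left);
    HashSet<string> r = Tokens(right);
    int common = l.Count(t => r.Contains(t));
    int unionSize = l.Count + r.Count - common;
    return (common, unionSize == 0 ? 0.0 : (double)common / unionSize);
}
```

With this normalization the pair shares three tokens (awd, r, design) instead of one. The score is still low because the second name carries many tokens the first has no counterpart for, so weighting rare tokens such as engine codes more heavily than common ones may be worth exploring.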