Suppose I have a list of strings like this (the real dataset is much larger and also contains other data):
List<string> modelNames =
[
"XC60 Momentum Standard T6",
"XC60 Inscription Standard T6",
"XC60 R designStandard T6",
"XC60 T5 Powershift",
"XC60 D3 DRIVE MANUAL",
"XC60 D3 GEARTRONIC",
"XC60 D5 GEARTRONIC AWD",
"XC60 T6 AWD GEARTRONIC",
"XC60 T5 AWD R DESIGN",
"XC60 D5 GEARTRONIC AWD R DESIGN",
"XC60 T6 AWD GEARTRONIC R DESIGN",
];
I want to find the closest match for strings like these:
"2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr"
"2.0 D4 Momentum Auto Euro 6 (s/s) 5dr"
"2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)"
"2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)"
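As you can see, the strings don't match outright, but some parts of them do.
I want to produce some kind of confidence score. My idea is to break both sets of strings into words and then see which candidate shares the most words. I'm not sure whether this is the best way to approach this kind of analysis, or what the best and most efficient way to do it in C# would be.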
Perhaps there is a better approach than the scoring I described above?
Any ideas, suggestions, and pointers would be greatly appreciated.
Thanks,
Kane
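I tested this with your example strings, and the results are not good: at most one word matches, which I don't think is enough for a reliable match. Also, my solution has O(n²) time complexity, because every string in one list is compared with every string in the other, so it will not scale well to large collections.
Setup: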
static List<string> modelNames =
[
"XC60 Momentum Standard T6",
"XC60 Inscription Standard T6",
"XC60 R designStandard T6",
"XC60 T5 Powershift",
"XC60 D3 DRIVE MANUAL",
"XC60 D3 GEARTRONIC",
"XC60 D5 GEARTRONIC AWD",
"XC60 T6 AWD GEARTRONIC",
"XC60 T5 AWD R DESIGN",
"XC60 D5 GEARTRONIC AWD R DESIGN",
"XC60 T6 AWD GEARTRONIC R DESIGN",
];
static List<string> modelNames2 =
[
"2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr",
"2.0 D4 Momentum Auto Euro 6 (s/s) 5dr",
"2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)",
"2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)"
];
static (string name, string[] words) GetWords(string sentence)
{
    return (sentence, sentence.Split());
}
Test:
var names1 = modelNames.Select(n => GetWords(n));
var names2 = modelNames2.Select(n => GetWords(n));
foreach (var n1 in names1) {
    int bestCount = 0;
    List<string> bestMatches = [];
    foreach (var n2 in names2) {
        // Number of distinct words the two names share, ignoring case.
        int count = n1.words
            .Intersect(n2.words, StringComparer.InvariantCultureIgnoreCase)
            .Count();
        if (count > bestCount) {
            // New best score: discard the previous candidates.
            bestCount = count;
            bestMatches.Clear();
            bestMatches.Add(n2.name);
        } else if (count > 0 && count == bestCount) {
            // Tie with the current best score: keep this candidate too.
            bestMatches.Add(n2.name);
        }
    }
    Console.WriteLine($"{n1.name} (count={bestCount})");
    foreach (var match in bestMatches) {
        Console.WriteLine($" {match}");
    }
}
Console.ReadKey();
Prints:
XC60 Momentum Standard T6 (count=1)
 2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
 2.0 D4 Momentum Auto Euro 6 (s/s) 5dr
XC60 Inscription Standard T6 (count=0)
XC60 R designStandard T6 (count=0)
XC60 T5 Powershift (count=1)
 2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)
XC60 D3 DRIVE MANUAL (count=0)
XC60 D3 GEARTRONIC (count=0)
XC60 D5 GEARTRONIC AWD (count=1)
 2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
 2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T6 AWD GEARTRONIC (count=1)
 2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
 2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T5 AWD R DESIGN (count=1)
 2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
 2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)
 2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 D5 GEARTRONIC AWD R DESIGN (count=1)
 2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
 2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T6 AWD GEARTRONIC R DESIGN (count=1)
 2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
 2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
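One possible refinement, sketched here under my own assumptions rather than as a tested solution: part of the problem is that a plain whitespace split keeps "R-Design" as a single word, so it never matches "R DESIGN", and a raw count is not comparable between short and long names. The snippet below splits on a few extra separators and turns the overlap into a score between 0 and 1 using Jaccard similarity. The helper names Tokenize and Jaccard and the choice of separators are hypothetical, not something from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: lower-case the text and split on whitespace, hyphens,
// slashes and parentheses, so "R-Design" yields the tokens "r" and "design".
static HashSet<string> Tokenize(string text) =>
    new HashSet<string>(
        text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '-', '/', '(', ')' },
                   StringSplitOptions.RemoveEmptyEntries));

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a value between 0 and 1.
static double Jaccard(HashSet<string> a, HashSet<string> b)
{
    if (a.Count == 0 && b.Count == 0) return 0.0;
    int shared = a.Count(token => b.Contains(token));
    return (double)shared / (a.Count + b.Count - shared);
}

// Sample data taken from the question.
string[] candidates =
{
    "XC60 Momentum Standard T6",
    "XC60 T5 AWD R DESIGN",
    "XC60 D5 GEARTRONIC AWD R DESIGN",
};
string query = "2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol " +
               "Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)";

// Tokenize the query once, then score every candidate against it.
var queryTokens = Tokenize(query);
var ranked = candidates
    .Select(name => (name, score: Jaccard(Tokenize(name), queryTokens)))
    .OrderByDescending(r => r.score);

foreach (var (name, score) in ranked)
    Console.WriteLine($"{score:F2}  {name}");
```

The scores will still be low for this data, because the two sources describe the cars in mostly different vocabulary. Giving more weight to distinctive tokens such as the engine code, or mapping known synonyms before scoring, would probably matter more than the exact similarity formula.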