我有一个在我编写的脚本中使用的函数,用于从列表中删除多余的阻塞关键字。基本上,输入(以任何顺序):
{"apple", "bapple", "banana", "cherry", "bananaman", "sweetherrypie", "sweet", "b"}
它应该输出一个缩小的字符串数组(以任何顺序):
{"apple", "cherry", "sweet", "b"}
这样做是为了减少匹配算法需要通过的关键字数量,作为个人学习项目(如果字符串包含“banana”,它也会包含“b”,所以香蕉是不必要的,可以丢弃
它需要具有极高的性能,因为它必须在超过 10,000,000 个字符串的列表上运行。
最初,我编写了一些 Python 代码来执行此操作:
def deduplicate_keywords(keywords):
# Remove duplicates and sort the keywords by length in ascending order
keywords = sorted(set(keywords), key=len)
unique_keywords = []
for keyword in keywords:
not_redundant = True
for unique_keyword in unique_keywords:
# Check if the current keyword includes any of the unique keywords
if unique_keyword in keyword:
not_redundant = False
break
# If the keyword does not include any of the shorter keywords, add it to unique_keywords
if not_redundant:
unique_keywords.append(keyword)
return unique_keywords
以及衡量其性能的性能基准:
class TestDeduplicateKeywords(unittest.TestCase):
def test_performance_intensive_case(self):
keywords = [
"abcdef",
"def",
"ghidef",
"jkl",
"mnop",
"qrst",
"uvwxyz",
"ghijkl",
"abcdefgh",
"ijklmnop",
"mnopqrstuv",
"rstuvwx",
"mnopqrstuvwx",
"uvwx",
]
# Extending the list to make it very long and complex
base_keywords = keywords[:]
for i in range(200000):
base_keywords.extend([f"{k}{i}" for k in keywords])
base_keywords.extend([f"{i}{k}" for k in keywords])
base_keywords.extend([f"{i}{k}{i}" for k in keywords])
keywords = base_keywords
old_function_time = 0
new_function_time = 0
for repeat in range(1):
# This loop (not present in the cpp code) is used to loop it multiple times to find an average
start_time = time.time()
result = deduplicate_keywords(keywords)
end_time = time.time()
old_function_time = (
repeat * old_function_time + (end_time - start_time)
) / (repeat + 1)
start_time = time.time()
result = deduplicate_keywords_new(keywords)
# Above is a function that can be edited to check whether changes speed up the code
end_time = time.time()
new_function_time = (
repeat * new_function_time + (end_time - start_time)
) / (repeat + 1)
# As the list is extended with numbers, the result will be unique non-redundant keywords
expected_result = ["def", "jkl", "mnop", "qrst", "uvwx"]
self.assertEqual(set(result), set(expected_result))
print(
f"Test performance_intensive_case for old function took {old_function_time} seconds\n"
f"Test performance_intensive_case for new function took {new_function_time} seconds\n"
)
if old_function_time < new_function_time:
print(f"Old is faster than new")
else:
print(f"New is faster than old")
if __name__ == "__main__":
unittest.main()
这对于我在
12 seconds
的弱笔记本电脑上的需求来说太慢了。
因此,我尝试在 cpp 中编写相同的内容,看看是否能获得性能提升。它so慢得多,所以我向函数本身添加了一些计时代码,以查看速度减慢的地方:
#include <iostream>
#include <vector>
#include <string>
#include <set>
#include <algorithm>
#include <chrono>
#include <cassert>
std::vector<std::string> deduplicate_keywords(const std::vector<std::string>& keywords) {
auto t_start = std::chrono::high_resolution_clock::now();
// Remove duplicates using a set
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << "started old function!" << std::endl;
std::set<std::string> unique_set(keywords.begin(), keywords.end());
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "removed duplicates in " << std::chrono::duration<double>(t2 - t1).count() << " seconds" << std::endl;
// Convert set back to vector and sort by length
auto t3 = std::chrono::high_resolution_clock::now();
std::vector<std::string> sorted_keywords(unique_set.begin(), unique_set.end());
std::sort(sorted_keywords.begin(), sorted_keywords.end(), [](const std::string& a, const std::string& b) {
return a.length() < b.length();
});
auto t4 = std::chrono::high_resolution_clock::now();
std::cout << "Sorted by length in " << std::chrono::duration<double>(t4 - t3).count() << " seconds" << std::endl;
// Filter out redundant keywords
auto t5 = std::chrono::high_resolution_clock::now();
std::vector<std::string> unique_keywords;
for (const auto& keyword : sorted_keywords) {
for (const auto& unique_keyword : unique_keywords) {
// Check if the current keyword includes any of the unique keywords
if (keyword.find(unique_keyword) != std::string::npos) {
goto next_iteration;
}
}
// If the keyword does not include any of the shorter keywords, add it to unique_keywords
unique_keywords.push_back(keyword);
next_iteration:
continue;
}
auto t6 = std::chrono::high_resolution_clock::now();
std::cout << "Filtered in " << std::chrono::duration<double>(t6 - t5).count() << " seconds" << std::endl;
auto t7 = std::chrono::high_resolution_clock::now();
sorted_keywords.clear();
sorted_keywords.shrink_to_fit();
unique_keywords.shrink_to_fit();
auto t8 = std::chrono::high_resolution_clock::now();
std::cout << "Destructed in " << std::chrono::duration<double>(t8 - t7).count() << " seconds" << std::endl;
auto t9 = std::chrono::high_resolution_clock::now();
auto total_time = std::chrono::duration<double>(t9 - t_start).count();
std::cout << "Total function time: " << total_time << " seconds" << std::endl;
return unique_keywords;
}
void test_performance_intensive_case() {
std::vector<std::string> keywords = {
"abcdef", "def", "ghidef", "jkl", "mnop", "qrst", "uvwxyz",
"ghijkl", "abcdefgh", "ijklmnop", "mnopqrstuv", "rstuvwx",
"mnopqrstuvwx", "uvwx"
};
// Extending the list to make it very long and complex
std::vector<std::string> base_keywords = keywords;
for (int i = 0; i < 20000; ++i) {
for (const auto& k : keywords) {
base_keywords.push_back(k + std::to_string(i));
base_keywords.push_back(std::to_string(i) + k);
base_keywords.push_back(std::to_string(i) + k + std::to_string(i));
}
}
auto keywords_extended = base_keywords;
std::set<std::string> expected_result = {"def", "jkl", "mnop", "qrst", "uvwx"};
// there is no looping to calculate an average because it is so much slower
auto start_time_total = std::chrono::high_resolution_clock::now();
auto start_time = std::chrono::high_resolution_clock::now();
std::vector<std::string> result_old = deduplicate_keywords(keywords_extended);
auto end_time = std::chrono::high_resolution_clock::now();
double old_function_time = std::chrono::duration<double>(end_time - start_time).count();
auto end_time_total = std::chrono::high_resolution_clock::now();
double total_test_time = std::chrono::duration<double>(end_time_total - start_time_total).count();
assert(std::set<std::string>(result_old.begin(), result_old.end()) == expected_result);
std::cout << "old function time: " << old_function_time << " seconds" << std::endl;
std::cout << "Total test case time: " << total_test_time << " seconds" << std::endl;
auto start_time_total_new = std::chrono::high_resolution_clock::now();
auto start_time_new = std::chrono::high_resolution_clock::now();
std::vector<std::string>result_new = deduplicate_keywords_new(keywords_extended); // same as in the python code, useful for comparison
auto end_time_new = std::chrono::high_resolution_clock::now();
double new_function_time = std::chrono::duration<double>(end_time - start_time).count();
auto end_time_total_new = std::chrono::high_resolution_clock::now();
double total_test_time_new = std::chrono::duration<double>(end_time_total_new - start_time_total_new).count();
assert(std::set<std::string>(result_new.begin(), result_new.end()) == expected_result);
std::cout << "new function time: " << new_function_time << " seconds" << std::endl;
std::cout << "Total test case time: " << total_test_time_new << " seconds" << std::endl;
std::cout << "Test performance_intensive_case for old function took " << old_function_time
<< " seconds but new function took " << new_function_time << " seconds" << std::endl;
if (old_function_time < new_function_time) {
std::cout << "Old is faster than new" << std::endl;
} else {
std::cout << "New is faster than old" << std::endl;
}
}
int main() {
test_performance_intensive_case();
return 0;
}
这会输出到终端 this (缩短,删除当前空的 deduplicate_keywords_new 函数的日志):
started old function!
removed duplicates in 9.64173 seconds
Sorted by length in 3.36138 seconds
Filtered in 0.366933 seconds
Destructed in 7.63208 seconds
Total function time: 21.0037 seconds
old function time: 405.213 seconds
Total test case time: 405.213 seconds
除了总体速度降低之外,日志似乎表明返回函数(
return unique keywords
)似乎需要非常长的时间。为什么是这样?我能做些什么来阻止这种情况呢?我对 cpp 编码和开发相当陌生,因此如果您的解释详细而透彻,那就太好了。谢谢!
恕我直言,这个解决方案更快:
struct sort {
bool operator () (const std::string& s1, const std::string& s2) const {
size_t len1 = s1.length();
size_t len2 = s2.length();
return len1 == len2 ? s1 < s2 : len1 < len2;
}
};
struct unique {
bool operator () (const std::string& s1, const std::string& s2) const {
return s1.find(s2) == std::string::npos &&
s2.find(s1) == std::string::npos &&
s1 < s2;
}
};
std::vector<std::string> deduplicate_keywords (const std::vector<std::string>& keywords)
{
std::set<std::string, sort> sorted_keywords (keywords.cbegin(), keywords.cend());
std::set<std::string, unique> unique_keywords (sorted_keywords.cbegin(), sorted_keywords.cend());
return std::vector<std::string>(unique_keywords.cbegin(), unique_keywords.cend());
}
比较运算符由std::set的优化插入算法使用。