Searching for text in a very large txt file (50+ GB)


I have a hashes.txt file that stores strings together with their compressed SHA-256 hashes. Each line of the file has the following format:

<compressed_hash>:<original_string>

The compressed_hash is built from characters 6, 13, 20 and 27 of the full SHA-256 hash (the same indices the code below uses). For example, the string alon, whose SHA-256 hash is:

5a24f03a01d5b10cab6124f3c0e7086994ac9c869fc8e76e1463458f829fc864

would be stored as:

0db3:alon
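
As a concrete illustration of that format, here is a minimal sketch that reuses the same character indices as the scripts below (the actual printed key depends on the real digest; the question's example shows it as 0db3):

import hashlib

full_hash = hashlib.sha256("alon".encode()).hexdigest()
compressed = full_hash[6] + full_hash[13] + full_hash[20] + full_hash[27]
print(f"{compressed}:alon")  # one line of hashes.txt, e.g. "0db3:alon" per the example above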

I have a search.py script that works as follows.

For example, if the user enters

5a24f03a01d5b10cab6124f3c0e7086994ac9c869fc8e76e1463458f829fc864

into search.py, the script searches hashes.txt for its compressed form 0db3. If more than one match is found, for example:

0db3:alon

0db3:apple

the script re-hashes each matching string (alon, apple) to get its full SHA-256 hash. If one of those full hashes matches the user's input (here, the full hash of alon matches 5a24f03a01d5b10cab6124f3c0e7086994ac9c869fc8e76e1463458f829fc864), the script prints that string (alon).

The problem with this script is that a single search usually takes about an hour, because my hashes.txt is 54 GB. Here is search.py:

import hashlib
import mmap

def compress_hash(hash_value):
    return hash_value[6] + hash_value[13] + hash_value[20] + hash_value[27]

def search_compressed_hash(hash_input, compressed_file):
    compressed_input = compress_hash(hash_input)
    potential_matches = []
    
    with open(compressed_file, "r+b") as file:
        # Memory-map the file, size 0 means the whole file
        mmapped_file = mmap.mmap(file.fileno(), 0)
        
        # Read through the memory-mapped file line by line
        for line in iter(mmapped_file.readline, b""):
            line = line.decode().strip()
            parts = line.split(":", 1)  # Split only on the first colon
            if len(parts) == 2:  # Ensure there are exactly two parts
                compressed_hash, string = parts
                if compressed_hash == compressed_input:
                    potential_matches.append(string)
        
        mmapped_file.close()
    
    return potential_matches

def verify_full_hash(potential_matches, hash_input):
    for string in potential_matches:
        if hashlib.sha256(string.encode()).hexdigest() == hash_input:
            return string
    return None

if __name__ == "__main__":
    while True:
        hash_input = input("Enter the hash (or type 'exit' to quit): ")
        if hash_input.lower() == 'exit':
            break
        
        potential_matches = search_compressed_hash(hash_input, "hashes.txt")
        found_string = verify_full_hash(potential_matches, hash_input)
        
        if found_string:
            print(f"Corresponding string: {found_string}")
        else:
            print("String not found for the given hash.")

And, in case it helps, here is the hash.py script that actually generates the strings and their hashes and writes them to hashes.txt:

import hashlib
import sys
import time

# Set the interval for saving progress (in seconds)
SAVE_INTERVAL = 60  # Save progress every minute
BUFFER_SIZE = 1000000  # Number of hashes to buffer before writing to file

def generate_hash(string):
    return hashlib.sha256(string.encode()).hexdigest()

def compress_hash(hash_value):
    return hash_value[6] + hash_value[13] + hash_value[20] + hash_value[27]

def write_hashes_to_file(start_length):
    buffer = []  # Buffer to store generated hashes
    last_save_time = time.time()  # Store the last save time
    
    for generated_string in generate_strings_and_hashes(start_length):
        full_hash = generate_hash(generated_string)
        compressed_hash = compress_hash(full_hash)
        buffer.append((compressed_hash, generated_string))
        
        if len(buffer) >= BUFFER_SIZE:
            save_buffer_to_file(buffer)
            buffer = []  # Clear the buffer after writing to file
        
        # Check if it's time to save progress
        if time.time() - last_save_time >= SAVE_INTERVAL:
            print("Saving progress...")
            save_buffer_to_file(buffer)  # Save any remaining hashes in buffer
            buffer = []  # Clear buffer after saving
            last_save_time = time.time()
    
    # Save any remaining hashes in buffer
    if buffer:
        save_buffer_to_file(buffer)

def save_buffer_to_file(buffer):
    with open("hashes.txt", "a") as file_hashes:
        file_hashes.writelines(f"{compressed_hash}:{generated_string}\n" for compressed_hash, generated_string in buffer)

def generate_strings_and_hashes(start_length):
    for length in range(start_length, sys.maxsize):  # Use sys.maxsize to simulate infinity
        current_string = [' '] * length  # Initialize with spaces
        while True:
            yield ''.join(current_string)
            if current_string == ['z'] * length:  # Stop when all characters reach 'z'
                break
            current_string = increment_string(current_string)

def increment_string(string_list):
    index = len(string_list) - 1
    
    while index >= 0:
        if string_list[index] == 'z':
            string_list[index] = ' '
            index -= 1
        else:
            string_list[index] = chr(ord(string_list[index]) + 1)
            break
    
    if index < 0:
        string_list.insert(0, ' ')
    
    return string_list

def load_progress():
    # You may not need this function anymore
    return 1  # Just return a default value

if __name__ == "__main__":
    write_hashes_to_file(load_progress())

My operating system is Windows 10.

python hash sha256
1 Answer

You only have 65536 unique search keys. Why not just make 65536 files? You could use 256 directories with 256 files each to keep things manageable.

That lets you eliminate the scan entirely. Bonus: your files get smaller, because you no longer need to store the key at all.

Of course, you'd have to do some one-time processing to split the big hash file into the smaller files, but after that your lookups should be effectively instant.
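
A minimal sketch of that idea, assuming the compressed key is always 4 hex characters as produced by the question's compress_hash(); the "buckets" directory layout, flush threshold and function names below are illustrative choices, not part of the original scripts:

import hashlib
import os

def compress_hash(hash_value):
    # Same four characters the question's scripts use.
    return hash_value[6] + hash_value[13] + hash_value[20] + hash_value[27]

def flush(buffers, root):
    # Append each buffered bucket to its file: buckets/<key[:2]>/<key[2:]>.txt
    for key, strings in buffers.items():
        directory = os.path.join(root, key[:2])
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, key[2:] + ".txt"), "a", encoding="utf-8") as out:
            out.write("\n".join(strings) + "\n")

def split_once(source="hashes.txt", root="buckets", flush_every=1_000_000):
    # One-time pass over the big file, routing every string into its bucket.
    # The key itself is no longer stored; the file path encodes it.
    buffers = {}
    pending = 0
    with open(source, "r", encoding="utf-8") as src:
        for line in src:
            key, _, string = line.rstrip("\n").partition(":")
            if len(key) != 4:
                continue  # skip malformed lines
            buffers.setdefault(key, []).append(string)
            pending += 1
            if pending >= flush_every:
                flush(buffers, root)
                buffers.clear()
                pending = 0
    if buffers:
        flush(buffers, root)

def lookup(hash_input, root="buckets"):
    # Read only the one small bucket that can contain candidates, then verify
    # against the full SHA-256 exactly as the original search.py does.
    key = compress_hash(hash_input)
    path = os.path.join(root, key[:2], key[2:] + ".txt")
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as bucket:
        for candidate in bucket:
            candidate = candidate.rstrip("\n")
            if hashlib.sha256(candidate.encode()).hexdigest() == hash_input:
                return candidate
    return None

After running split_once() once, each query only has to read a single bucket file of roughly 54 GB / 65536, i.e. on the order of a megabyte, so lookups become near-instant instead of taking an hour.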
