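I am trying to write a Python script that goes to a specific path and reads all the files into a SQLite table (the contents of the files are irrelevant). Once the files are in the SQLite database, I want to loop over them and generate the MD5 sum, source directory, and creation date for each file. That way, when I later query by creation date, I get a list of files with their MD5 and source directory.
So far I can only add the file names from the path to SQLite, and not much else. I did try using a Pandas DataFrame, but that did not help much.
Here is what I have so far: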
import sqlite3
import os
import pandas as pd

directory = "/unixhomes003/Desktop/"

# Create connection
conn = sqlite3.connect("dbfile1.db")
cur = conn.cursor()

# Create the table and insert the file names
cur.execute("""CREATE TABLE IF NOT EXISTS filenames (filename TEXT UNIQUE)""")
for root, dirnames, filenames in os.walk(directory):
    cur.executemany(
        "INSERT OR IGNORE INTO filenames (filename) VALUES (?)",
        [(filename,) for filename in filenames if filename.endswith("")],
    )
conn.commit()

# Read the table into a Pandas DataFrame
sql_query = pd.read_sql_query("SELECT * FROM filenames", conn)
df = pd.DataFrame(sql_query, columns=["filename"])
print(df)
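You're on the right track with reading the files into a SQLite table. To generate the MD5 sum, source directory, and creation date for each file, you can iterate over the files in the directory and use Python's os and hashlib modules to retrieve the necessary information. Here is an updated version of the code with the additional functionality: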
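import hashlib
import os
import sqlite3
from datetime import datetime

import pandas as pd


def calculate_md5(file_path, chunk_size=65536):
    """Calculate the MD5 sum of a file, reading it in chunks to limit memory use."""
    md5 = hashlib.md5()
    with open(file_path, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


directory = "/unixhomes003/Desktop/"

# Create connection
conn = sqlite3.connect("dbfile1.db")
cur = conn.cursor()

# Create the table. The same file name can occur in several directories, so
# uniqueness is enforced on (filename, source_directory), not on filename alone.
# If dbfile1.db already contains the old one-column table, drop it first;
# CREATE TABLE IF NOT EXISTS will not add the new columns.
cur.execute("""CREATE TABLE IF NOT EXISTS filenames (
    filename TEXT,
    md5 TEXT,
    source_directory TEXT,
    created_date TEXT,
    UNIQUE (filename, source_directory)
)""")

for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        file_path = os.path.join(root, filename)
        try:
            md5 = calculate_md5(file_path)
            # On Unix, getctime() is the last metadata change, not the true creation time.
            ctime = os.path.getctime(file_path)
        except OSError:
            continue  # skip files that cannot be read
        # Store the date as ISO 8601 text so that it sorts chronologically
        created_date = datetime.fromtimestamp(ctime).isoformat(sep=" ", timespec="seconds")
        cur.execute(
            "INSERT OR IGNORE INTO filenames (filename, md5, source_directory, created_date) VALUES (?, ?, ?, ?)",
            (filename, md5, root, created_date),
        )
conn.commit()

# Query the table, oldest files first
df = pd.read_sql_query("SELECT * FROM filenames ORDER BY created_date", conn)
print(df)
conn.close()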
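Since the goal is to query by creation date later, here is a minimal sketch of what that lookup could look like. It assumes created_date is stored as ISO 8601 text (for example "2023-01-15 09:30:00"), which sorts and compares chronologically as a plain string; if the column holds the raw float from os.path.getctime() instead, convert it first. The helper names to_iso, files_created_between, and build_demo_db are hypothetical, and the rows in the demo table are made-up placeholders, not real files or hashes.

```python
import sqlite3
from datetime import datetime


def to_iso(epoch_seconds):
    """Convert an epoch timestamp (such as os.path.getctime() returns) to sortable text."""
    return datetime.fromtimestamp(epoch_seconds).isoformat(sep=" ", timespec="seconds")


def files_created_between(conn, start, end):
    """Return rows whose created_date satisfies start <= created_date < end, oldest first."""
    cur = conn.execute(
        "SELECT filename, md5, source_directory, created_date "
        "FROM filenames "
        "WHERE created_date >= ? AND created_date < ? "
        "ORDER BY created_date",
        (start, end),
    )
    return cur.fetchall()


def build_demo_db():
    """Build an in-memory table with placeholder rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE filenames ("
        "filename TEXT, md5 TEXT, source_directory TEXT, created_date TEXT)"
    )
    rows = [
        ("b.txt", "placeholder-md5-b", "/tmp/demo", "2023-01-20 14:00:00"),
        ("c.txt", "placeholder-md5-c", "/tmp/demo/sub", "2023-03-02 08:15:00"),
        ("a.txt", "placeholder-md5-a", "/tmp/demo", "2023-01-15 09:30:00"),
    ]
    conn.executemany("INSERT INTO filenames VALUES (?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    # All files created in January 2023 (the end bound is exclusive)
    for row in files_created_between(demo, "2023-01-01", "2023-02-01"):
        print(row)
```

Using a half-open range (inclusive start, exclusive end) avoids having to spell out "23:59:59" for the last day of the period. To run the same query against the real database, pass sqlite3.connect("dbfile1.db") instead of the demo connection.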