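I'm currently writing my thesis, in which I try to build family trees automatically. Right now I'm trying to link birth certificates to the respective parents by looking at marriage certificates and matching names: if the names of both parents on a birth certificate are exactly the same as the names on a marriage certificate, I want to link them. However, there are multiple marriage certificates with exactly the same names. My attempt at "solving" this is to find the marriage certificate whose date is closest to the birth certificate's date, to reduce errors (I'm working with 12 million birth certificates and 5 million marriage certificates).
What I want to get out of this query is the difference in years between each child's birth and the parents' marriage, over all results.
I first create a PARENTS_ID, which contains the mother's full name and the father's full name, and a MARRIAGE_ID, in which the bride's full name is appended to the groom's. I created a chunk generator to reduce memory usage, and a difference counter that counts how many times a particular difference is found.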
def generate_chunks(df, chunk_size):
    for i in range(0, len(df), chunk_size):
        yield df.iloc[i:i+chunk_size]

def freq(dic, arr):
    for i in arr:
        if i not in dic:
            dic[i] = 1
        else:
            dic[i] += 1
    return dic

marriages.sort_values(by=['EVENT_YEAR'])
marriages.sort_values(by=['MARRIAGE_ID'])
births.sort_values(by=['EVENT_YEAR'])

# Create an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA synchronous = OFF;')
conn.execute('PRAGMA journal_mode = MEMORY;')
conn.execute('PRAGMA temp_store = MEMORY;')
marriages.to_sql('sqlmarriages', conn, index=False, if_exists='append')
conn.execute("CREATE INDEX idx_marriage_id ON sqlmarriages (MARRIAGE_ID);")

chunk_size = 2000
dic = {}
with open('age_diffs.txt', 'w') as f:
    for i, chunk in enumerate(generate_chunks(births, chunk_size)):
        print(f"Processing chunk {i} at {time.time() - start:.2f} seconds")
        chunk.to_sql('sqlchunk', conn, index=False, if_exists='append')
        conn.execute("CREATE INDEX idx_parents_id ON sqlchunk (PARENTS_ID);")
        query = """
            SELECT b.*, m.EVENT_YEAR - b.PR_BIR_YEAR AS AGE_DIFF, FIRST_VALUE(AGE_DIFF) OVER (PARTITION BY b.PARENTS_ID ORDER BY ABS(b.PR_BIR_YEAR - m.EVENT_YEAR) ASC) AS AGE_DIFF
            FROM sqlchunk b
            INNER JOIN sqlmarriages m
            ON b.PARENTS_ID = m.MARRIAGE_ID
        """
        matches = pd.read_sql_query(query, conn)
        dic = freq(dic, matches.loc[:, 'AGE_DIFF'].tolist())
print(dic)
conn.close()
temp_conn.close()
However, when I run this, I get an error saying that AGE_DIFF does not exist as a column. Does anyone know how to fix this?
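A likely cause, for context: in SQLite an alias defined in a SELECT list cannot be referenced by another expression in that same list, so FIRST_VALUE(AGE_DIFF) looks for a real column named AGE_DIFF and fails. The query also gives two output columns the same alias. The minimal sketch below reproduces the problem on a toy table and shows that computing the difference in a subquery first makes the name resolvable. The table t, its columns, and its rows are invented for illustration, and the sketch assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy table (hypothetical): one row per candidate birth/marriage pairing.
conn.execute("CREATE TABLE t (grp TEXT, birth_year INTEGER, marriage_year INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("a", 1900, 1895), ("a", 1900, 1870), ("b", 1910, 1908)],
)

# The alias diff is used inside the same SELECT list that defines it.
broken = """
SELECT marriage_year - birth_year AS diff,
       FIRST_VALUE(diff) OVER (
           PARTITION BY grp ORDER BY ABS(birth_year - marriage_year)
       ) AS closest
FROM t
"""
try:
    conn.execute(broken).fetchall()
    alias_error = None
except sqlite3.OperationalError as exc:
    alias_error = str(exc)  # SQLite rejects the query

# Computing diff in a CTE first turns it into a real column of d.
fixed = """
WITH d AS (
    SELECT grp, marriage_year - birth_year AS diff
    FROM t
)
SELECT grp, diff,
       FIRST_VALUE(diff) OVER (PARTITION BY grp ORDER BY ABS(diff)) AS closest
FROM d
ORDER BY grp, ABS(diff)
"""
rows = conn.execute(fixed).fetchall()
conn.close()
```

The same idea, applied to the full join, is what the CTE-based query in the answer does.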
import sqlite3
import time

import pandas as pd


def generate_chunks(df, chunk_size):
    """Yield successive row slices of df, each at most chunk_size rows long."""
    for i in range(0, len(df), chunk_size):
        yield df.iloc[i:i+chunk_size]


def freq(dic, arr):
    """Increment the count in dic for every value in arr."""
    for i in arr:
        dic[i] = dic.get(i, 0) + 1
    return dic


# Assuming births and marriages are DataFrames
# and you have their EVENT_YEAR and PARENTS_ID / MARRIAGE_ID columns prepared.

start = time.time()  # reference point for the progress messages below

# Create an in-memory database and load the marriages table
conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA synchronous = OFF;')
conn.execute('PRAGMA journal_mode = MEMORY;')
conn.execute('PRAGMA temp_store = MEMORY;')
marriages.to_sql('sqlmarriages', conn, index=False, if_exists='append')
conn.execute("CREATE INDEX idx_marriage_id ON sqlmarriages (MARRIAGE_ID);")

chunk_size = 2000
dic = {}
with open('age_diffs.txt', 'w') as f:
    for i, chunk in enumerate(generate_chunks(births, chunk_size)):
        print(f"Processing chunk {i} at {time.time() - start:.2f} seconds")
        # Write chunk to a temporary table; 'replace' drops the previous
        # chunk (and its index) so rows do not accumulate across iterations
        chunk.to_sql('sqlchunk', conn, index=False, if_exists='replace')
        # Index the chunk to speed up the join
        conn.execute("CREATE INDEX idx_parents_id ON sqlchunk (PARENTS_ID);")
        # AGE_DIFF is computed inside the CTE, so the outer SELECT can use it
        # as an ordinary column; rn = 1 keeps the closest marriage per partition
        query = """
            WITH matched AS (
                SELECT
                    b.*,
                    m.EVENT_YEAR,
                    m.EVENT_YEAR - b.PR_BIR_YEAR AS AGE_DIFF,
                    ROW_NUMBER() OVER (
                        PARTITION BY b.PARENTS_ID
                        ORDER BY ABS(b.PR_BIR_YEAR - m.EVENT_YEAR) ASC
                    ) AS rn
                FROM sqlchunk b
                INNER JOIN sqlmarriages m
                    ON b.PARENTS_ID = m.MARRIAGE_ID
            )
            SELECT * FROM matched WHERE rn = 1;
        """
        matches = pd.read_sql_query(query, conn)
        # Write or print differences if needed
        # f.write(...) or just process them
        age_differences = matches['AGE_DIFF'].tolist()
        dic = freq(dic, age_differences)
print(dic)
conn.close()
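What this achieves
Efficient chunking: the large births dataset is processed in manageable chunks of 2,000 rows, which keeps memory usage reasonable.
Indexed join: creating indexes on PARENTS_ID and MARRIAGE_ID helps speed up the join.
Smallest absolute difference: using ROW_NUMBER() with ORDER BY ABS(b.PR_BIR_YEAR - m.EVENT_YEAR) and keeping only rn = 1 means that, for each PARENTS_ID in the chunk, only the single joined row with the smallest gap between birth and marriage is kept.
Frequency dictionary: after extracting only the closest matches, you update the frequency dictionary to track how many times each year difference occurs.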
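The frequency update can also be written with collections.Counter from the standard library, which adds per-chunk counts directly. This is only a minimal sketch: the nested list chunk_diffs is made-up data standing in for the AGE_DIFF values that each chunk's query would return.

```python
from collections import Counter

# Hypothetical AGE_DIFF values from three processed chunks
# (negative means the marriage came before the birth).
chunk_diffs = [[-1, -3, -1], [-2, -1], [-3, -3, 0]]

counts = Counter()
for diffs in chunk_diffs:
    counts.update(diffs)  # adds one per occurrence, like freq(dic, arr)

# Sorting by key makes the distribution easier to read or plot.
distribution = dict(sorted(counts.items()))
```

A Counter behaves like a dict, so it can replace dic in the loop without changing the rest of the code.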
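The final dic will contain the distribution of these smallest year differences, giving you a picture of the typical gap between a marriage and the birth of a child.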
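One caveat worth checking against your data: PARTITION BY b.PARENTS_ID keeps a single row per parent pair within a chunk, so siblings that fall into the same chunk would collapse into one result. If one row per birth certificate is wanted, partitioning by a per-birth key is an option. The sketch below uses SQLite's implicit rowid as that key, which is an assumption on my part; a real birth-certificate ID column, if the data has one, would be the more robust choice. The toy rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sqlchunk (PARENTS_ID TEXT, PR_BIR_YEAR INTEGER)")
conn.execute("CREATE TABLE sqlmarriages (MARRIAGE_ID TEXT, EVENT_YEAR INTEGER)")
# Two siblings with the same parents, and two couples with identical names.
conn.executemany("INSERT INTO sqlchunk VALUES (?, ?)",
                 [("anna|jan", 1901), ("anna|jan", 1903)])
conn.executemany("INSERT INTO sqlmarriages VALUES (?, ?)",
                 [("anna|jan", 1899), ("anna|jan", 1850)])

query = """
WITH matched AS (
    SELECT
        b.rowid AS birth_row,
        b.PR_BIR_YEAR,
        m.EVENT_YEAR - b.PR_BIR_YEAR AS AGE_DIFF,
        ROW_NUMBER() OVER (
            PARTITION BY b.rowid  -- one partition per birth row
            ORDER BY ABS(b.PR_BIR_YEAR - m.EVENT_YEAR) ASC
        ) AS rn
    FROM sqlchunk b
    INNER JOIN sqlmarriages m
        ON b.PARENTS_ID = m.MARRIAGE_ID
)
SELECT birth_row, PR_BIR_YEAR, AGE_DIFF
FROM matched
WHERE rn = 1
ORDER BY birth_row;
"""
per_birth = conn.execute(query).fetchall()
conn.close()
```

With this partitioning both siblings keep their own closest marriage, so each birth contributes one value to the distribution.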