How to mask hashed users as random values in Azure Databricks [email protected]

Question · Votes: 0 · Answers: 1

We have Excel files in file storage, and each Excel file contains more than 10,000 columns of JSON data.

For example, here is a sample:

json:

{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,[email protected],ab6019a4-851c-4af2-8ddc-1e03ee9be97a,[email protected],85ff7cda-5f2d-4d32-b51c-b88ad4d55b5a,[email protected],48168530-659c-44d3-8985-65f9b0af2b85,[email protected],0937a1e5-8a68-4573-ae9c-e13f9a2f3617,[email protected],c822dd8b-0b79-4c13-af1e-bc080b8108c5,[email protected],ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,[email protected]"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}

We simply want to change the hashed user values to plain masked values. Like this: for the users across the whole Excel file, convert

[email protected]

to

[email protected]

Every time, we manually copy the user data and mask it like the code below, which costs us a lot of time; then we take whatever output we get and replace the hash values with it by hand.

import random

# Pool of masked replacement addresses (the addresses are redacted in this post)
main=['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']  
  
l=["0e07209b-807b-4938-8bfd-f87cee98e924,[email protected],c747a82c-656e-40eb-9194-88c4a0f8061e"]  
n=len(l)  
print(n)  
print(random.sample(main,n))

My question: is there a way in Azure Databricks to replace the hashed values in the Users JSON key for the entire Excel file with random users such as

[email protected]

in one pass, and then write the result back to the original location?
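
For illustration, the per-value transformation I am after looks roughly like the sketch below (mask_users and pool are hypothetical names, not part of our code):

import random

def mask_users(users_value, pool):
    # Split the comma-separated "GUID,email,GUID,email,..." string and
    # replace every token with a random masked address drawn from pool.
    # Assumes len(pool) >= the number of tokens.
    tokens = users_value.split(',')
    return ','.join(random.sample(pool, len(tokens)))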

python excel azure pyspark azure-databricks
1 Answer

As you mentioned, you want to change the hashed user values to plain masked values.

I tried the following approach:

sample_json = """
{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,[email protected],ab6019a4-851c-4af2-8ddc-1e03ee9be97a,[email protected],85ff7cda-5f2d-4d32-b51c-b88ad4d55b5a,[email protected],48168530-659c-44d3-8985-65f9b0af2b85,[email protected],0937a1e5-8a68-4573-ae9c-e13f9a2f3617,[email protected],c822dd8b-0b79-4c13-af1e-bc080b8108c5,[email protected],ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,[email protected]"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
"""
masked_emails = [
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
]
def mask_emails(json_str):
    try:
        data = json.loads(json_str)
        users = data.get("Users", [])
        if users:
            # "Users" holds one comma-separated string of GUID,email tokens
            original_emails = users[0].split(',')
            # draw as many masked addresses as there are tokens
            masked = random.sample(masked_emails, len(original_emails))
            data["Users"] = [','.join(masked)]
        return json.dumps(data)
    except Exception:
        # leave malformed JSON unchanged
        return json_str
# register the masking function as a UDF and apply it to the JSON column
mask_emails_udf = udf(mask_emails, StringType())
data = [(sample_json,)]
df = spark.createDataFrame(data, ["json_column"])
df = df.withColumn("transformed", mask_emails_udf(col("json_column")))
display(df)

In the code above, I mask the email addresses in the JSON data: the function reads the JSON, replaces the user email addresses (together with their accompanying GUIDs) in the Users key with masked values, and returns the transformed JSON in a new column.

Result:

transformed
{"SearchName": "", "Id": "", "RequestType": "", "StartDateUtc": "2022-12-01T00:00:00Z", "EndDateUtc": "2023-04-28T00:00:00Z", "RecordType": null, "Workload": "", "Operations": [], "Users": ["[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "multifactor", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}

[Errno 2] No such file or directory: '/dbfs/FileStore/Book111.xlsx'

This error means the Python interpreter could not find the specified Excel file at the given DBFS path.
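
Before reading, it is worth confirming the file actually exists; dbutils.fs.ls raises an exception when the path is missing:

# List the DBFS folder to confirm the workbook is present (folder path from the error)
try:
    for f in dbutils.fs.ls("dbfs:/FileStore/"):
        print(f.name, f.size)
except Exception as e:
    print(f"Path not found: {e}")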

To get around this error and write the result out in xlsx format, I tried the following:

# Convert the Spark DataFrame to pandas for the Excel writer
pandas_df = df.toPandas()

# Target folder on DBFS
dir_path = "/FileStore/tables/"
dbutils.fs.mkdirs(dir_path)
file_path = dir_path + "transformed_data.xlsx"

# Write locally first (the Excel writer needs a regular filesystem),
# then copy the file into DBFS
local_file_path = "/tmp/transformed_data.xlsx"
pandas_df.to_excel(local_file_path, index=False, engine='openpyxl')
dbutils.fs.cp("file:" + local_file_path, "dbfs:" + file_path)
print(f"File saved to {file_path}")

Result:

dbutils.fs.ls("/FileStore/tables/transformed_data.xlsx")

[FileInfo(path='dbfs:/FileStore/tables/transformed_data.xlsx', name='transformed_data.xlsx', size=5623, modificationTime=1720070340000)]
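
To double-check the contents, the uploaded file can be read back through the /dbfs FUSE mount (assuming the cluster exposes it):

import pandas as pd

# Read the copy back from DBFS and inspect the first rows
check_df = pd.read_excel("/dbfs/FileStore/tables/transformed_data.xlsx", engine="openpyxl")
print(check_df.head())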