How do I split an .xlsx file into multiple .csv files using Python in Snowflake?


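I have an .xlsx file stored in a Blob storage container that is integrated with a Snowflake stage, so I can access it from Snowflake. I am trying to get Python (running in a Snowflake stored procedure) to split the .xlsx file into multiple .csv files in the same stage.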

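I have been using a combination of this post and Copilot to get some form of working stored procedure, but it does not work and returns an error. The error that keeps recurring is: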

Python Interpreter Error:

Traceback (most recent call last):
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/descriptors/base.py", line 59, in _convert
    value = expected_type(value)
TypeError: an integer is required (got type datetime.date)
 During handling of the above exception, another exception occurred:

 Traceback (most recent call last):
  File "_udf_code.py", line 10, in main
    workbook = load_workbook(f, data_only=True, read_only=True)
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/reader/excel.py", line 348, in load_workbook
    reader.read()
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/reader/excel.py", line 295, in read
    self.read_properties()
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/reader/excel.py", line 176, in read_properties
    self.wb.properties = DocumentProperties.from_tree(src)
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/descriptors/serialisable.py", line 103, in from_tree
    return cls(**attrib)
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/packaging/core.py", line 107, in __init__
    self.modified = modified or now
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/descriptors/base.py", line 272, in __set__
    super().__set__(instance, value)
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/descriptors/nested.py", line 33, in __set__
    super().__set__(instance, value)
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/descriptors/base.py", line 71, in __set__
    value = _convert(self.expected_type, value)
  File "/usr/lib/python_udf/07da35646d4f3b655d96bf0222d31c191a6797e450b0ac0eef955ee969f584c8/lib/python3.8/site-packages/openpyxl/descriptors/base.py", line 61, in _convert
    raise TypeError('expected ' + str(expected_type))
TypeError: expected <class 'datetime.datetime'>
 in function SPLIT_XLSX_TO_CSV_PROC with handler main

The error says an integer is required but a datetime.date was received, even though I asked for all data in the .xlsx to be stored as strings.

My (frankly terrible) code is below. I am still learning Python, so please forgive any rookie mistakes or errors obviously inspired by Copilot.

CREATE OR REPLACE PROCEDURE split_xlsx_to_csv_proc(file_path string, sheet_to_process string, sheet_to_ignore string, target_stage string)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'pandas', 'openpyxl')
HANDLER = 'main'
EXECUTE AS CALLER
AS
$$
from openpyxl import load_workbook
import os, sys, csv
import pandas as pd

def main(session, file_path, sheet_to_process, sheet_to_ignore, target_stage):
    session.file.get(file_path, "/tmp/")
    file_name = os.path.basename(file_path)
    with open(os.path.join("/tmp", file_name), "rb") as f:
        workbook = load_workbook(f, data_only=True, read_only=True)

        # Check if the sheet to process is the one to ignore
        if sheet_to_process == sheet_to_ignore:
            return f"Sheet '{sheet_to_ignore}' is set to be ignored."

        # Choose the desired worksheet
        worksheet = workbook[sheet_to_process]

        # Open the CSV file in write mode
        with open("/tmp/exported.csv", 'w', newline='') as csv_file:
            csv_writer = csv.writer(csv_file)
            # Iterate through rows in the worksheet and write to the CSV file
            for row in worksheet.iter_rows(values_only=True):
                csv_writer.writerow([str(cell) if cell is not None else '' for cell in row])

        # Close the workbook
        workbook.close()

    # auto_compress=False keeps the staged name as exported.csv (the default would gzip it)
    session.file.put("file:///tmp/exported.csv", target_stage, auto_compress=False)

    return os.path.join(target_stage, "exported.csv")
$$;

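I tried to set it up so that you define one sheet to copy and one sheet to ignore (I suspected that one sheet contained bad data, but the error is still returned even when that sheet is ignored).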

Can anyone help me with this? I hoped this would be a simple task, and maybe I am overcomplicating it, but nothing seems to work.

python python-3.x snowflake-cloud-data-platform
1 Answer

It would be easier to use pandas, which you are already using:

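import pandas as pd

sheet_name = "Sheet1"  # placeholder: name of the worksheet to export

# dtype=str asks pandas to keep cell values as strings, as the question intends
df = pd.read_excel(r"c:\temp\orders.xlsx", sheet_name=sheet_name, dtype=str)

df.to_csv(r"c:\temp\orders.csv", index=False)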
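
One caveat: the traceback in the question fails inside read_properties, while openpyxl is parsing the workbook's document metadata (the "modified" timestamp), before any worksheet cell is read. Since pandas reads .xlsx files through openpyxl, pd.read_excel may hit the same error on this file. A possible workaround, offered as a sketch rather than a confirmed fix, is to copy the workbook without its docProps/core.xml part before loading it. This assumes that openpyxl only parses that part when it is present in the archive, which is worth checking against the openpyxl version in use. The function name strip_core_properties and the file paths in the usage note are hypothetical.

```python
import zipfile

# OOXML package part that stores the created/modified timestamps
CORE_PROPS = "docProps/core.xml"


def strip_core_properties(src_path, dst_path):
    """Copy the .xlsx archive at src_path to dst_path without docProps/core.xml.

    Hypothetical helper: every other member of the archive is copied unchanged.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == CORE_PROPS:
                continue  # skip the metadata part that fails to parse
            dst.writestr(item, src.read(item.filename))
    return dst_path
```

In the stored procedure this could be called right after session.file.get, for example strip_core_properties("/tmp/orders.xlsx", "/tmp/orders_clean.xlsx"), with the cleaned copy then passed to load_workbook or pd.read_excel. The worksheet data is untouched; only the document-level metadata is dropped.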