Python .txt to .xlsx conversion: chunking and indexing problem


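My Python code has a chunking and/or indexing problem. I am trying to convert a text file to an xlsx file, and the problem is that an xlsx worksheet has a hard limit on the number of rows it can hold: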

Traceback (most recent call last):
  File "/Users/rbarrett/Git/Cleanup/yourPeople3/convert_txt_to_xls.py", line 46, in <module>
    df_chunk.to_excel(writer, sheet_name=f'Sheet{sheet_number}', index=False)
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/core/generic.py", line 2417, in to_excel
    formatter.write(
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/io/formats/excel.py", line 952, in write
    writer._write_cells(
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/io/excel/_openpyxl.py", line 487, in _write_cells
    xcell = wks.cell(
            ^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/openpyxl/worksheet/worksheet.py", line 244, in cell
    cell = self._get_cell(row, column)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/openpyxl/worksheet/worksheet.py", line 257, in _get_cell
    raise ValueError(f"Row numbers must be between 1 and 1048576. Row number supplied was {row}")
ValueError: Row numbers must be between 1 and 1048576. Row number supplied was 1048577

As we can see, the error is:

ValueError: Row numbers must be between 1 and 1048576. Row number supplied was 1048577

It looks like the slicing in the following script is off:

  • convert_txt_to_xls.py
#!/usr/bin/env python3

import pandas as pd
import argparse

# Set up argument parsing
parser = argparse.ArgumentParser(description="Convert a TXT file to CSV or XLSX format.")
parser.add_argument("input_txt_file", help="Path to the input TXT file")
parser.add_argument("output_file", help="Path to the output file (either .csv or .xlsx)")
parser.add_argument("--type", choices=['csv', 'xlsx'], required=True, help="Output file type: 'csv' or 'xlsx'")
parser.add_argument("--multiple-sheets", action='store_true', help="Split data across multiple sheets if type is 'xlsx'")
parser.add_argument("--delimiter", default=' ', help="Delimiter used in the input TXT file")

# Parse the arguments
args = parser.parse_args()

# Read the .txt file into a pandas DataFrame
df = pd.read_csv(args.input_txt_file, delimiter=args.delimiter, engine='python')

# Print DataFrame shape for inspection
print(f"DataFrame shape: {df.shape}")
print(df.head())

# Handle output based on the specified type
if args.type == 'csv':
    # Write the DataFrame to a CSV file
    df.to_csv(args.output_file, index=False)
    print(f"Conversion complete: {args.output_file}")

elif args.type == 'xlsx':
    if args.multiple_sheets:
        # Define Excel's maximum row limit
        max_rows = 1048576

        # Create a Pandas Excel writer using openpyxl
        with pd.ExcelWriter(args.output_file, engine='openpyxl') as writer:
            sheet_number = 1
            for i in range(0, len(df), max_rows):
                # Extract the chunk of data to be written
                df_chunk = df.iloc[i:i + max_rows].copy()

                # Reset the index to ensure each sheet starts at row 1
                df_chunk.reset_index(drop=True, inplace=True)

                # Write the chunk to the corresponding sheet
                df_chunk.to_excel(writer, sheet_name=f'Sheet{sheet_number}', index=False)
                sheet_number += 1

        print(f"Conversion complete with multiple sheets: {args.output_file}")
    else:
        # Write the entire DataFrame to a single Excel sheet
        if len(df) > max_rows:
            raise ValueError("Data exceeds Excel's row limit. Use --multiple-sheets to split data across sheets.")
        df.to_excel(args.output_file, index=False, engine='openpyxl')
        print(f"Conversion complete: {args.output_file}")

For example, I have a text file with many lines that look like this:

46eb61ab1c0i90e909090w.................2 blob 88924339 logs/swf.log.1
5fb..........................c53da3f0cf1 blob 79474600 logs/swf.log.1
0f373270ad....................e3441da6bd blob 75058654 logs/swf.log.1
7f2..................5e510548fe2f35f9358 blob 74196729 hub/growth/growth/files/NewHireOnboarding.pptx
d7........................1e7e1cb8c0631f blob 70885244 logs/sqllog

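The file has a lot of lines, though: 4730559 of them. What I am trying to do is create another sheet and chunk the data, so that when I hit the row limit I can start paginating across sheets. What is wrong with this part of the Python script?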

If you want to run the script:

python3 convert_txt_to_xls.py blobs_with_sizes.txt blobs_with_sizes.xlsx --type=xlsx --multiple-sheets --delimiter=' '

I use ' ' as the delimiter because the columns are separated by a space.

python-3.x pandas xlsx txt file-conversion
1 Answer

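You are off by one. Because to_excel also writes the column header into row 1 of every sheet, each sheet can hold at most 1048575 data rows. A chunk of 1048576 rows puts its last row at row 1048577, which is exactly the row number in the error.

Replace

max_rows = 1048576

with

max_rows = 1048575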
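
To make the arithmetic concrete, here is a minimal sketch of the chunking with the header row accounted for. The helper names sheet_slices and write_in_sheets are hypothetical (they are not in the question's script), and the sketch assumes one header row per sheet, which is what to_excel writes by default.

```python
EXCEL_MAX_ROWS = 1_048_576   # hard per-sheet limit reported by openpyxl
HEADER_ROWS = 1              # assumption: to_excel writes one header row per sheet
MAX_DATA_ROWS = EXCEL_MAX_ROWS - HEADER_ROWS  # 1048575


def sheet_slices(n_rows, max_data_rows=MAX_DATA_ROWS):
    """Yield (sheet_name, start, stop) so each slice fits under the header row."""
    starts = range(0, n_rows, max_data_rows)
    for sheet_number, start in enumerate(starts, start=1):
        stop = min(start + max_data_rows, n_rows)
        yield f"Sheet{sheet_number}", start, stop


def write_in_sheets(df, path):
    """Write df to path, one sheet per slice (hypothetical helper)."""
    import pandas as pd  # imported here so sheet_slices works without pandas

    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for name, start, stop in sheet_slices(len(df)):
            # index=False drops the index column, so no reset_index is needed
            df.iloc[start:stop].to_excel(writer, sheet_name=name, index=False)


if __name__ == "__main__":
    # Row count taken from the question
    for name, start, stop in sheet_slices(4_730_559):
        print(name, start, stop, stop - start)
```

With the 4730559 rows from the question this gives five sheets. Separately, the question's script defines max_rows only inside the --multiple-sheets branch, so the single-sheet branch would raise a NameError when it checks len(df) > max_rows; defining the limit once at the top, as in this sketch, avoids that.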