Python xlrd: reading the format of each value in a cell

Question

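I am currently working on a data engineering project. I want to read Excel files with the ".xls" extension into my Python workspace. With Pandas that is not a big problem. However, I also want to import the formatting of the Excel file: for every value in every cell, I want to read its color and whether it is struck out.

I have tried different approaches. Below is my latest attempt, which gives me the font color of the whole cell and whether the whole cell is struck out. But I only get one value per cell, even though a cell can contain several values, each of which may be colored black, red, green, ... and may or may not be struck out.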

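I open the workbook with xlrd. Then I iterate over every row and column of cells and read the font from the workbook. In strike_bool and font_color_bool I store whether the cell is struck out and whether the font matches an allowed color. Depending on the type of the cell value, I save it with the correct data type in the list filtered_row, which represents one row of the Excel file. I then collect the filtered_row lists in the list filtered_data, which is finally converted into the Pandas DataFrame df_proper.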

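At the moment I get only one font color value and one strikethrough value per cell. But I need the font color of every value within the cell, and the same goes for strikethrough.

How can I iterate over every value in a given cell and check each value's color and whether that particular value is struck out?

My code: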

import pandas as pd
import numpy as np
import xlrd
import math

palette = self.get_color_palette(data_path)
workbook = xlrd.open_workbook(data_path, formatting_info=True, on_demand=True)

for sheet_name in sheet_names:
    print(f'Read sheet: {sheet_name}')
    df_proper = pd.DataFrame()
    sheet = workbook.sheet_by_name(sheet_name)
                    
    # Iterate through the cells
    filtered_data = []
    for row in range(sheet.nrows):
        filtered_row = []
        for col in range(sheet.ncols):
            keep_cell = False
            cell = sheet.cell(row, col)
            xf_index = cell.xf_index
            xf = workbook.xf_list[xf_index]
            font_index = xf.font_index
            font = workbook.font_list[font_index]
            # True when the cell's font is NOT struck out
            strike_bool = not font.struck_out
            # True when the cell's font color is one of the allowed colors
            font_color_bool = bool(self.compare_font_color(palette[font.colour_index]))
            # Keep the value only if both conditions hold
            keep_cell = font_color_bool and strike_bool
                                
            if cell.value == '':
                filtered_row.append(math.nan)
            elif isinstance(cell.value, (int, float)):
                if isinstance(cell.value, float) and cell.value.is_integer():
                    filtered_row.append(int(cell.value) if keep_cell else None)
                elif isinstance(cell.value, float):
                    filtered_row.append(float(cell.value) if keep_cell else None)
                else:
                    filtered_row.append(int(cell.value) if keep_cell else None)
            else:
                filtered_row.append(str(cell.value) if keep_cell else None)
        filtered_data.append(filtered_row)
    # Build a DataFrame from the filtered data
    df_proper = pd.DataFrame(filtered_data)
    dfs[sheet_name] = df_proper
    workbook.unload_sheet(sheet_name)

Example:

Column A	Column B
That might not be a value	Value2
Value3	Value4

The script should recognize that "That" is written in red and that "not" is struck out.

Answer 1

I think the key to the problem is getting the formatting of each character in a cell of the xls file.
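For context, xlrd exposes per-character formatting through Sheet.rich_text_runlist_map, which maps (rowx, colx) to a list of (offset, font_index) pairs; if the first offset is not zero, the leading text uses the cell's own font. The following is only a minimal sketch of how such a run list maps onto slices of the cell text. The function name split_rich_text, the sample text, and the font indexes 5 and 6 are illustrative assumptions, not part of xlrd.

```python
def split_rich_text(text, runlist, cell_font_index):
    """Split cell text into (segment_text, font_index) pairs.

    runlist has the shape xlrd uses in rich_text_runlist_map:
    a list of (offset, font_index) pairs in ascending offset order.
    cell_font_index is the font of the cell's own XF record.
    """
    # No rich-text runs: the whole text uses the cell's font.
    if not runlist:
        return [(text, cell_font_index)]

    segments = []
    first_offset = runlist[0][0]
    # Text before the first run falls back to the cell's font.
    if first_offset > 0:
        segments.append((text[:first_offset], cell_font_index))

    for i, (start, font_index) in enumerate(runlist):
        # A run ends where the next one starts, or at the end of the text.
        end = runlist[i + 1][0] if i + 1 < len(runlist) else len(text)
        segments.append((text[start:end], font_index))
    return segments


# Hypothetical run list: font 5 for "That", font 6 for "not", font 0 elsewhere.
sample_runs = [(0, 5), (4, 0), (12, 6), (15, 0)]
print(split_rich_text("That may be not a value", sample_runs, 0))
```

Each returned font index can then be looked up in book.font_list to read attributes such as colour_index and struck_out.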
I made a sample sheet like this:

The text value of cell (0, 0) is ab, in red.
The text value of cell (1, 0) is cd, struck out.

Here is my code for reading all of these styles, tested with Python 3.7.9 and xlrd 1.2.0.

import xlrd
# accessing Column 'A' in this example
COL_IDX = 0

with xlrd.open_workbook('xls_file', formatting_info=True) as book:
    sheet = book.sheet_by_index(0)
    for row_idx in range(sheet.nrows):
        text_cell = sheet.cell_value(row_idx, COL_IDX)
        text_cell_xf = book.xf_list[sheet.cell_xf_index(row_idx, COL_IDX)]

        # skip rows where cell is empty
        if not text_cell:
            continue

        print(f'============\nText of ({row_idx}, {COL_IDX}) is `{text_cell}`')

        text_cell_runlist = sheet.rich_text_runlist_map.get((row_idx, COL_IDX))
        if text_cell_runlist:
            print('============\nStyle segments of this cell:')

            segments = []
            for segment_idx in range(len(text_cell_runlist)):
                start = text_cell_runlist[segment_idx][0]
                # the last segment starts at 'start' and runs to the end of the string
                end = None
                if segment_idx != len(text_cell_runlist) - 1:
                    end = text_cell_runlist[segment_idx + 1][0]
                segment_text = text_cell[start:end]
                segments.append({
                    'text': segment_text,
                    'font': book.font_list[text_cell_runlist[segment_idx][1]]
                })

            # the runs do not start at offset 0: the leading text is styled with the cell's own font
            if text_cell_runlist[0][0] != 0:
                segments.insert(0, {
                    'text': text_cell[:text_cell_runlist[0][0]],
                    'font': book.font_list[text_cell_xf.font_index]
                })

            for segment in segments:
                print('------------\nTEXT:', segment['text'])
                print('color:', segment['font'].colour_index)
                print('bold:', segment['font'].bold)
                print('struck out:', segment['font'].struck_out)
        else:
            # no rich-text runs: the whole cell uses a single font
            print('------------\nCell whole style')
            print('italic:', book.font_list[text_cell_xf.font_index].italic)
            print('bold:', book.font_list[text_cell_xf.font_index].bold)
            print('struck out:', book.font_list[text_cell_xf.font_index].struck_out)
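To tie this back to the filtering the question asks for, segment dictionaries with 'text' and 'font' keys, like those built in the code above, can be reduced to just the text that is not struck out and has an allowed color. This is a rough sketch under assumptions: keep_visible_text and allowed_colour_indexes are hypothetical names, the SimpleNamespace objects merely stand in for xlrd font objects (only the struck_out and colour_index attributes are used), and the index values 8 and 10 are illustrative, since the actual indexes depend on the workbook's palette.

```python
from types import SimpleNamespace


def keep_visible_text(segments, allowed_colour_indexes):
    """Join the text of segments that are not struck out and use an allowed color.

    segments: list of dicts with 'text' and 'font' keys; each font needs
    the attributes struck_out and colour_index (as xlrd font objects have).
    allowed_colour_indexes: set of colour_index values to keep (hypothetical).
    """
    kept = [
        seg['text']
        for seg in segments
        if not seg['font'].struck_out
        and seg['font'].colour_index in allowed_colour_indexes
    ]
    return ''.join(kept)


# Stand-ins for xlrd font objects; the index values are assumptions.
black = SimpleNamespace(struck_out=0, colour_index=8)
red = SimpleNamespace(struck_out=0, colour_index=10)
struck = SimpleNamespace(struck_out=1, colour_index=8)

segments = [
    {'text': 'That', 'font': red},
    {'text': ' may be ', 'font': black},
    {'text': 'not', 'font': struck},
    {'text': ' a value', 'font': black},
]

# Keep only black, non-struck text; red and struck-out runs are dropped.
print(repr(keep_visible_text(segments, {8})))
```

Whether a given colour should be kept or dropped is a project decision; the set passed as allowed_colour_indexes plays the role of the question's compare_font_color check.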
