Extracting table data from multiple .xlsx files with inconsistent structure

Question

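I have several Excel spreadsheets that contain a mix of data, including tables. I need to access one specific table from each file. I considered using pandas skiprows, but the row at which the table starts varies from sheet to sheet, whether counted from the start or from the end of the sheet. In the example below, I need to access the last table, the one with the headers "Well", "Content", etc., and convert it to a DataFrame for processing. To be clear, in this example the table starts at row 115, but this varies from file to file. Likewise, the distance from the end of the sheet is variable and inconsistent. Any help is much appreciated!

I looked at openpyxl but did not find anything that can isolate a table based on its header values. I also looked into pd.read_excel with skiprows and/or indexing the DataFrame with iloc. The problem is that the table's position is inconsistent and its size is variable.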

Answer 1

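I was able to solve this by getting the index of the table's header row, and then the index of the nearest row below it that contains a cell which is consistent across all files (in this case, "Basic settings" is always the same distance from the bottom of the table). As follows: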

import pandas as pd

# Assumed setup: read the sheet without a header row so columns are labelled 0, 1, 2, ...
# ("data.xlsx" is a placeholder path)
raw_table = pd.read_excel("data.xlsx", header=None)
# Header row: the first row whose first column is 'Well'
header_index = raw_table[raw_table[0].eq('Well')].index.values[0]
# Footer row: the first row whose first column is 'Basic settings'
footer_index = raw_table[raw_table[0].eq('Basic settings')].index.values[0]
# Slice between the two rows; the footer sits 3 rows below the end of the table,
# so subtract 3
cropped_table = raw_table[header_index:footer_index-3]
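
To apply the same idea to many files, the slicing can be wrapped in a function. The following is only a minimal sketch, not a tested solution for the asker's files: extract_table and extract_from_files are hypothetical helper names, the default of 3 rows between the table and the footer comes from the answer above, and taking the last "Well" row (rather than the first) is an assumption based on the question asking for the last table. It also promotes the header row to column names and drops columns that are empty within the table, which the original snippet does not do.

```python
import numpy as np
import pandas as pd


def extract_table(raw, header_marker="Well", footer_marker="Basic settings", footer_gap=3):
    """Crop the table that starts at the header row and ends footer_gap rows above the footer row."""
    first_col = raw.iloc[:, 0]

    # Positions of rows whose first cell equals the header marker.
    header_rows = np.flatnonzero(first_col.eq(header_marker).to_numpy())
    if header_rows.size == 0:
        raise ValueError(f"no row starts with {header_marker!r}")
    header_pos = header_rows[-1]  # assumption: the wanted table is the last match

    # First footer marker that appears below the chosen header row.
    footer_rows = np.flatnonzero(first_col.eq(footer_marker).to_numpy())
    footer_rows = footer_rows[footer_rows > header_pos]
    if footer_rows.size == 0:
        raise ValueError(f"no row starts with {footer_marker!r} below the header")
    end_pos = footer_rows[0] - footer_gap  # exclusive end, same as footer_index-3 above

    # Keep the header row plus the data rows, dropping columns that are empty in this block.
    block = raw.iloc[header_pos:end_pos].dropna(axis=1, how="all")
    table = block.iloc[1:].reset_index(drop=True)
    table.columns = block.iloc[0].tolist()  # promote the header row to column names
    return table


def extract_from_files(paths, **kwargs):
    """Read the first sheet of each workbook without a header and crop it (hypothetical helper)."""
    return {str(p): extract_table(pd.read_excel(p, header=None), **kwargs) for p in paths}
```

With a folder of workbooks (the folder name "data" here is made up), this could be called as extract_from_files(Path("data").glob("*.xlsx")) after importing Path from pathlib. The result is a dict mapping each file path to its cropped DataFrame. If the gap between the table and the footer differs in your files, pass a different footer_gap.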
