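I am trying to compare the columns and dtypes of two DataFrames to check that they are equal; the rows are expected to differ.
I am using pandas version 1.1.2: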
pd.__version__
'1.1.2'
if (df1.columns.difference(df2.columns).empty
        and (df1.dtypes == df2.dtypes).all()):
    ...
But this line raises an error:
(df1.dtypes == df2.dtypes).all()
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/vfrank/dev-working/venv/lib/python3.7/site-packages/pandas/core/ops/common.py", line 65, in new_method
    return method(self, other)
  File "/Users/vfrank/dev-working/venv/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 370, in wrapper
    res_values = comparison_op(lvalues, rvalues, op)
  File "/Users/vfrank/dev-working/venv/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 239, in comparison_op
    res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
  File "/Users/vfrank/dev-working/venv/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 53, in comp_method_OBJECT_ARRAY
    result = libops.vec_compare(x.ravel(), y.ravel(), op)
  File "pandas/_libs/ops.pyx", line 169, in pandas._libs.ops.vec_compare
TypeError: Cannot interpret 'StringDtype' as a data type
What is happening here, and how can I fix it?
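A side note on the column check in the condition above: `df1.columns.difference(df2.columns)` only lists labels that are in `df1` but not in `df2`, so it is still empty when `df2` has extra columns. Below is a minimal sketch of a two-sided check using `Index.symmetric_difference`. The tiny frames `a` and `b` and the helper name `same_columns` are made up for illustration.

```python
import pandas as pd


def same_columns(left, right):
    """True when both frames have exactly the same column labels (order ignored)."""
    return left.columns.symmetric_difference(right.columns).empty


# Hypothetical frames: b has one extra column that a lacks.
a = pd.DataFrame({"id": [1], "summary": ["x"]})
b = pd.DataFrame({"id": [2], "summary": ["y"], "extra": [0.5]})

# The one-sided check misses the extra column in b.
one_sided = a.columns.difference(b.columns).empty
# The two-sided check notices it.
two_sided = same_columns(a, b)
```

With these frames the one-sided check passes even though the column sets differ, while the two-sided check does not.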
Update: the dtypes of the first DataFrame:
df1.dtypes
update_timestamp string
issue_type_desc string
id Int64
project_key string
teamname string
issue_key string
summary string
created_at string
updated_at string
updated_at_utc string
status_updated_at string
resolved_at string
closed_at string
due_date string
age_ts Int64
resolution_seconds string
reporter_name string
assignee string
issue_type_id Int64
priority_id Int64
status_id Int64
status_detail string
status_category string
resolution_id string
security_level_id Int64
parent_issue_key string
subtask_parent_key string
subtask_parent_id Int64
subtask_key string
epickey string
epic_id Int64
epic_parent_key string
customfield_fversion string
customfield_10004 string
epicdesc string
customfield_10005 string
sprint string
customfield_10003_original string
customfield_10003_status string
storypoint float64
storyhours float64
subtaskhours string
subcategory string
customfield_feature string
zephyr_last_executed_at string
zephyr_last_executed_by string
zephyr_last_execution_status string
fix_versions string
deleted Int64
lastupdateddatetime string
dtype: object
And the dtypes of the second DataFrame:
df2.dtypes
update_timestamp string
issue_type_desc string
id Int64
project_key string
teamname string
issue_key string
summary string
created_at string
updated_at string
updated_at_utc string
status_updated_at string
resolved_at string
closed_at string
due_date string
age_ts Int64
resolution_seconds string
reporter_name string
assignee string
issue_type_id Int64
priority_id Int64
status_id Int64
status_detail string
status_category string
resolution_id string
security_level_id Int64
parent_issue_key string
subtask_parent_key string
subtask_parent_id Int64
subtask_key string
epickey string
epic_id Int64
epic_parent_key string
customfield_fversion string
customfield_10004 string
epicdesc string
customfield_10005 string
sprint string
customfield_10003_original string
customfield_10003_status string
storypoint string
storyhours string
subtaskhours string
subcategory string
customfield_feature string
zephyr_last_executed_at string
zephyr_last_executed_by string
zephyr_last_execution_status string
fix_versions string
deleted Int64
lastupdateddatetime string
dtype: object
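Comparing the two listings by eye, only `storypoint` and `storyhours` differ (`float64` in `df1`, `string` in `df2`). A rough sketch of how such a diff could be produced programmatically is below. It compares the dtype names (`str(dtype)`) instead of the dtype objects, on the assumption that plain string comparison avoids the failing elementwise `==`. The helper `dtype_mismatches` and the small sample frames are hypothetical, not part of my real code.

```python
import pandas as pd


def dtype_mismatches(left, right):
    """Map each shared column whose dtype name differs to (left name, right name)."""
    left_names = {col: str(dtype) for col, dtype in left.dtypes.items()}
    right_names = {col: str(dtype) for col, dtype in right.dtypes.items()}
    return {
        col: (left_names[col], right_names[col])
        for col in left_names
        if col in right_names and left_names[col] != right_names[col]
    }


# Hypothetical stand-ins for df1 and df2 with three of the columns listed above.
sample1 = pd.DataFrame({
    "id": pd.array([1, 2], dtype="Int64"),
    "summary": pd.array(["a", "b"], dtype="string"),
    "storypoint": [1.0, 2.5],
})
sample2 = pd.DataFrame({
    "id": pd.array([3], dtype="Int64"),
    "summary": pd.array(["c"], dtype="string"),
    "storypoint": pd.array(["3"], dtype="string"),
})

mismatches = dtype_mismatches(sample1, sample2)
print(mismatches)
```

Columns that exist in only one frame are ignored here, so this is meant to complement the column check rather than replace it.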
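Update: so far I have been able to work around the error by writing a function that does the comparison, but this does not seem like an ideal solution. I was hoping there would be a way to compare dtypes that works with these new extension types: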
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def compare_dataframe_column_dtypes(df_compare1, df_compare2):
    """Return True if both DataFrames have the same dtype for every column.

    Since pandas 1.0.0 and the introduction of StringDtype,
    `df1.dtypes == df2.dtypes` can fail on StringDtype, so catch the
    error, convert StringDtype columns to the pre-1.0.0 string
    representation (object dtype), and compare again.
    """
    # Work on copies so the callers' DataFrames are not modified.
    df1 = df_compare1.copy()
    df2 = df_compare2.copy()
    try:
        return bool((df1.dtypes == df2.dtypes).all())
    except Exception as ex:
        logger.error("%s. Converting StringDtype columns to str and then comparing again", ex.args[0])
        for column in df1:
            if pd.StringDtype.is_dtype(df1[column]):
                df1[column] = df1[column].astype(str)
        for column2 in df2:
            if pd.StringDtype.is_dtype(df2[column2]):
                df2[column2] = df2[column2].astype(str)
    # Second attempt, now that the StringDtype columns are plain object columns.
    return bool((df1.dtypes == df2.dtypes).all())
I stumbled on this pretty late, but you may be able to convert them to dictionaries and compare those:
if dict(df1.dtypes) == dict(df2.dtypes):
    return True
return False
https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
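If comparing the dtype objects inside the dictionaries still raises for some dtype pairs (I have not checked every pandas/NumPy combination), a variation is to build the dictionaries from the dtype names instead. Here is a minimal sketch, wrapped in a function so that the `return` is valid. `same_schema` is just a name I picked, and the frames are toy data.

```python
import pandas as pd


def same_schema(left, right):
    """True when both frames have the same columns with the same dtype names."""
    left_names = {col: str(dtype) for col, dtype in left.dtypes.items()}
    right_names = {col: str(dtype) for col, dtype in right.dtypes.items()}
    # Dict equality ignores key order, so column order does not matter here.
    return left_names == right_names


# Toy frames: same schema with different rows, then one differing dtype.
x = pd.DataFrame({"id": pd.array([1, 2], dtype="Int64"), "storyhours": [1.0, 2.0]})
y = pd.DataFrame({"id": pd.array([9], dtype="Int64"), "storyhours": [4.5]})
z = y.astype({"storyhours": "string"})

matches = same_schema(x, y)
differs = same_schema(x, z)
```

Because the dictionary keys are the column labels, this also fails when the column sets differ. One caveat: dtypes that share a name, such as two categoricals with different categories, would compare as equal.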