Each line of my data looks like this:
8,0 0 1 0.000000000 8082 A WS 24664872 + 8 <- (8,2) 23604576
I want to split the data into columns like this:
col1 col2 col3 col4 col5 col6 col7 col8 col9
8,0 0 1 0.000000000 8082 A WS 24664872 + 8 <- (8,2) 23604576
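I am new to data processing in Python and don't know how to separate the columns correctly. Because the file is large, my current code reads it in chunks: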
import pandas as pd

file_path = "test_data.txt"
chunk_size = 1000000
#column_names = ["col1", "col2", "col3"]
df_list = []
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    df_list.append(chunk)
df = pd.concat(df_list)
#print(df.head(10))
for row in df.iterrows():
    print(row)
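Judging from the sample line you provided, it looks like we can simply split on whitespace and then join the parts[8:] portion (everything from index 8 onward, that is, from the ninth token) back into a single field. To build some intuition, try the following snippet, which does what you want on a single string: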
import pandas as pd
data_string = "8,0 0 1 0.000000000 8082 A WS 24664872 + 8 <- (8,2) 23604576"
parts = data_string.split() # Split the string by whitespace
result = pd.Series(parts[:8] + [" ".join(parts[8:])])
print(result)
# Output:
# 0 8,0
# 1 0
# 2 1
# 3 0.000000000
# 4 8082
# 5 A
# 6 WS
# 7 24664872
# 8 + 8 <- (8,2) 23604576
# dtype: object
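As a side note, the same split can be sketched with the `maxsplit` argument of `str.split`, which stops after a given number of splits and leaves the rest of the line in the last element. The helper name `split_line` and its `n_fixed` parameter are hypothetical, chosen only for this illustration.

```python
def split_line(line, n_fixed=8):
    """Split a line into n_fixed single-token fields plus one remainder field."""
    # strip() first, because the remainder would otherwise keep trailing whitespace
    return line.strip().split(None, n_fixed)


line = "8,0 0 1 0.000000000 8082 A WS 24664872 + 8 <- (8,2) 23604576"
fields = split_line(line)
print(len(fields))  # number of fields
print(fields[-1])   # the remainder of the line
```

One difference from the split-then-join version: the remainder is kept verbatim, so runs of several spaces inside it are not collapsed to a single space.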
Now we can translate this to pandas, with a few adjustments, using pandas.Series.str and pandas.DataFrame.agg.
import pandas as pd

file_path = "test_data.txt"
chunk_size = 1000000
df_list = []
for chunk in pd.read_csv(
    file_path,
    chunksize=chunk_size,
    header=None,
    engine="python",
    delimiter="\t",  # Assuming the file contains no tabs, each line loads as a single column
):
    # Split every line on whitespace, one token per column
    chunk_df = chunk[0].str.split(expand=True)
    # Join the columns from index 8 onward into a single column
    chunk_df["8"] = chunk_df.iloc[:, 8:].agg(" ".join, axis=1)
    # Keep only the first 8 columns and the newly created column
    chunk_df = pd.concat([chunk_df.iloc[:, :8], chunk_df["8"]], axis=1)
    # Append the modified chunk to the list
    df_list.append(chunk_df)

# Concatenate all chunks into a single DataFrame
df = pd.concat(df_list, ignore_index=True)
df.head(10)
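This should give you a DataFrame with nine columns: the first eight hold the individual fields, and the last holds the rest of the line.
Note that the whole idea is to load each chunk as a single-column DataFrame so that we can split it afterwards. The first line inside the loop, chunk_df = chunk[0].str.split(expand=True), already gives the first 8 columns (indices 0 through 7) correctly. The rest of the code joins all the remaining columns, which can probably be done in many ways.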
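One of those other ways, sketched here under some assumptions, is to cap the number of splits with the `n` parameter of `pandas.Series.str.split`, so that the remainder of each line stays in one column and no join step is needed. The function name `read_trace`, the `col1` to `col9` labels (taken from the question), and the in-memory sample that simply repeats the question's line are illustrative only; `dtype=str` is added so that every line is treated as text.

```python
import io

import pandas as pd


def read_trace(source, chunk_size=1_000_000, n_fixed=8):
    """Read whitespace-separated lines into n_fixed columns plus one remainder column."""
    names = [f"col{i}" for i in range(1, n_fixed + 2)]
    chunks = []
    for chunk in pd.read_csv(
        source,
        chunksize=chunk_size,
        header=None,
        engine="python",
        delimiter="\t",  # assumes the data contains no tab characters
        dtype=str,       # keep every line as text
    ):
        # At most n_fixed splits: the rest of the line ends up in the last column
        parts = chunk[0].str.strip().str.split(n=n_fixed, expand=True)
        # Pad with empty columns if every line in this chunk was shorter than expected
        parts = parts.reindex(columns=range(n_fixed + 1))
        parts.columns = names
        chunks.append(parts)
    return pd.concat(chunks, ignore_index=True)


# Illustrative input: the sample line from the question, repeated twice
sample = "8,0 0 1 0.000000000 8082 A WS 24664872 + 8 <- (8,2) 23604576\n" * 2
df = read_trace(io.StringIO(sample), chunk_size=1)
print(df)
```

For the real file, pass the path instead of the `StringIO` object, for example `read_trace("test_data.txt")`. All columns come back as strings, so numeric columns would still need converting, for instance with `pd.to_numeric`.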