How to split this data into DataFrame columns with column names using pandas

Question

Every line of my data looks like this:

8,0    0        1     0.000000000  8082  A  WS 24664872 + 8 <- (8,2) 23604576

I want to split the data into columns like this:

col1   col2     col3  col4         col5  col6 col7 col8      col9
8,0    0        1     0.000000000  8082  A    WS   24664872  + 8 <- (8,2) 23604576

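I am new to data processing in Python and don't know how to separate the columns correctly. Because the file is large, the code I am currently using reads it in chunks: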

import pandas as pd
file_path = "test_data.txt"
chunk_size = 1000000
#column_names = ["col1", "col2", "col3"]


df_list = []

for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    df_list.append(chunk)

df = pd.concat(df_list)
#print(df.head(10))

for row in df.iterrows():
    print(row)

Answer 1

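From the sample line you provided, it looks like we can simply split on whitespace and then join the `8:` part (everything from index 8 onward) back together. To get some intuition, you can try the following snippet, which does what you want on a plain string: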

import pandas as pd

data_string = "8,0    0        1     0.000000000  8082  A  WS 24664872 + 8 <- (8,2) 23604576"
parts = data_string.split()  # Split the string by whitespace
result = pd.Series(parts[:8] + [" ".join(parts[8:])])

print(result)

# Output:
# 0    8,0
# 1     0
# 2     1
# 3     0.000000000
# 4     8082
# 5     A
# 6     WS
# 7    24664872
# 8    + 8 <- (8,2) 23604576
# dtype: object
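As a side note, the same string-level idea can be sketched with the `maxsplit` argument of Python's built-in `str.split`, which stops after a fixed number of splits and leaves the remainder of the line untouched. Unlike the join above, this keeps any runs of whitespace inside the tail as they are; for the sample line the result is the same.

```python
data_string = "8,0    0        1     0.000000000  8082  A  WS 24664872 + 8 <- (8,2) 23604576"

# Split on runs of whitespace at most 8 times, so the 9th item
# holds the rest of the line in one piece.
parts = data_string.split(maxsplit=8)

print(len(parts))   # number of fields
print(parts[-1])    # the unsplit tail
```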

Now we can translate this to `pandas`, with a few tweaks, using `pandas.Series.str` and `pandas.Series.apply`.

file_path = "test_data.txt"
chunk_size = 1000000

df_list = []

for chunk in pd.read_csv(
    file_path,
    chunksize=chunk_size,
    header=None,
    engine="python",
    delimiter="\t",  # This should make all lines be loaded as a single column
):
    chunk_df = chunk[0].str.split(expand=True)

    # Join everything from column index 8 onward into a single column.
    # fillna("") guards against shorter rows, which split() pads with None.
    chunk_df["8"] = chunk_df.iloc[:, 8:].fillna("").agg(" ".join, axis=1).str.strip()

    # Select only the first 8 columns and the newly created column
    chunk_df = pd.concat([chunk_df.iloc[:, :8], chunk_df["8"]], axis=1)
    
    # Append modified chunk to the list
    df_list.append(chunk_df)

# Concatenate all chunks into a single DataFrame
df = pd.concat(df_list, ignore_index=True)

df.head(10)

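This should give you a DataFrame with nine columns: the first eight fields of each line, followed by one column holding the joined remainder.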

Note that the whole idea is to load each `chunk` as a single-column DataFrame so that we can split it afterwards. The first line inside the loop:

chunk_df = chunk[0].str.split(expand=True)

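already gives the first 8 columns (indices 0 through 7) correctly. The rest of the code joins all the remaining columns into one, which can probably be done in several ways.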
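One such alternative, as a minimal sketch rather than a definitive solution: `Series.str.split` accepts an `n` argument that limits the number of splits, so the tail can stay in one column without a separate join step. The example below also assigns the `col1` to `col9` names from the question. It reads from an in-memory `io.StringIO` instead of `test_data.txt` so that it runs on its own; the second sample line is made up for illustration (a shorter tail), and the `delimiter="\t"` trick still assumes the data contains no tab characters.

```python
import io
import pandas as pd

# Stand-in for the real file. The first line is the sample from the question;
# the second is a hypothetical line with a shorter tail.
sample = io.StringIO(
    "8,0    0        1     0.000000000  8082  A  WS 24664872 + 8 <- (8,2) 23604576\n"
    "8,0    0        2     0.000000000  8082  A  WS 24664872 + 8\n"
)

column_names = [f"col{i}" for i in range(1, 10)]  # col1 .. col9

frames = []
for chunk in pd.read_csv(
    sample,
    chunksize=1,        # tiny chunk size, only for this demo
    header=None,
    engine="python",
    delimiter="\t",     # no tabs in the data, so each line lands in one column
):
    # At most 8 splits: column 8 keeps the remainder of the line.
    fields = chunk[0].str.split(n=8, expand=True)
    # Make sure every chunk has exactly 9 columns before naming them.
    fields = fields.reindex(columns=range(9))
    fields.columns = column_names
    frames.append(fields)

df = pd.concat(frames, ignore_index=True)
print(df)
```

For the real file, `sample` would be replaced by the file path and `chunksize` by a larger value such as the 1000000 used above. All columns come out as strings, so numeric columns would still need an explicit conversion, for example with `pd.to_numeric`.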
