I built a DataFrame by combining data from several sources. Here is a sample subset:
index  CompanyName  Source1site  Source2site  Source3site  City
1      Comp1        web1.com     Nan          web2.com     Paris
2      Comp1        Web2.com     web2.com     Nan          Nan
3      Comp2        Nan          site1.com    Nan          Oakland
4      Comp2        site2.com    Nan          Nan          London
5      Comp3        Nan          Nan          Nan          Mexico
6      Comp4        Nan          url1.com     Nan          Nan
7      Comp5        Nan          example.com  Nan          New York
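Source1site, Source2site, and Source3site all hold website domains collected for CompanyName from different sources. I want to merge these three columns into one while preserving the data in the other columns. Here is an example of the output I am looking for: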
index  CompanyName  MergeSourceSite  City
1      Comp1        web1.com         Paris
2      Comp1        web2.com         Paris
3      Comp2        site1.com        Oakland
4      Comp2        Site1.com        London
5      Comp2        site2.com        Oakland
6      Comp2        site2.com        London
7      Comp3        Nan              Mexico
8      Comp4        url1.com         Nan
9      Comp5        example.com      New York
Any help would be greatly appreciated.
Thanks,
One way to approach this is with melt:
import pandas as pd

df = your_dataframe  # placeholder: your DataFrame, with "index" as a regular column

# Melt the three source columns into a single long column
merged_sources = df.melt(
    id_vars=["index", "CompanyName", "City"],
    value_vars=["Source1site", "Source2site", "Source3site"],
    value_name="MergeSourceSite",
)

# Drop rows with no site (only works if missing values are real NaN, not the string "Nan")
merged_sources = merged_sources.dropna(subset=["MergeSourceSite"])

# Drop the helper column added by melt, remove duplicates, sort by the original index
merged_sources = (
    merged_sources.drop(columns=["variable"])
    .drop_duplicates()
    .sort_values(by="index")
    .reset_index(drop=True)
)
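This gives you one row per original row and site, with the three source columns collapsed into MergeSourceSite. It does not reproduce the requested output exactly, though: companies with no site at all (Comp3) are removed by dropna, a missing City is not filled in from the same company's other rows, sites are not paired with every city of the company, and "Web2.com" and "web2.com" count as different values.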
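If the goal is the exact shape shown in the question (every distinct site of a company paired with every known city of that company, and companies without a site still listed), something like the following might get closer. It is only a sketch under a few assumptions of mine: the "Nan" cells are real missing values, domains should be compared case-insensitively (so they are lowercased), and the variable names `sites`, `cities`, `companies`, and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; "Nan" cells are assumed to be real missing values.
df = pd.DataFrame({
    "CompanyName": ["Comp1", "Comp1", "Comp2", "Comp2", "Comp3", "Comp4", "Comp5"],
    "Source1site": ["web1.com", "Web2.com", np.nan, "site2.com", np.nan, np.nan, np.nan],
    "Source2site": [np.nan, "web2.com", "site1.com", np.nan, np.nan, "url1.com", "example.com"],
    "Source3site": ["web2.com", np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    "City": ["Paris", np.nan, "Oakland", "London", "Mexico", np.nan, "New York"],
})
site_cols = ["Source1site", "Source2site", "Source3site"]

# Distinct sites per company, lowercased so "Web2.com" and "web2.com" collapse.
sites = (
    df.melt(id_vars=["CompanyName"], value_vars=site_cols, value_name="MergeSourceSite")
    .drop(columns=["variable"])
    .dropna(subset=["MergeSourceSite"])
    .assign(MergeSourceSite=lambda d: d["MergeSourceSite"].str.lower())
    .drop_duplicates()
)

# Distinct known cities per company.
cities = df[["CompanyName", "City"]].dropna().drop_duplicates()

# Start from the full company list so companies with no site or no city survive
# the left merges; merging on CompanyName pairs each site with each city.
companies = df[["CompanyName"]].drop_duplicates()
result = (
    companies.merge(sites, on="CompanyName", how="left")
    .merge(cities, on="CompanyName", how="left")
    .sort_values(["CompanyName", "MergeSourceSite", "City"], kind="stable")
    .reset_index(drop=True)
)
print(result)
```

On the sample data this yields nine rows, like the requested output, with Comp3 kept (no site) and Comp4 kept (no city). The row order within a company is alphabetical, so it differs slightly from the order in the question, and "Site1.com" shows up as "site1.com" because of the lowercasing.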