我有一个数据框,其中有一个“文本”列和一个“关键字”列,是使用 KeyBERT 生成的。这是系列类型的列的结果,即:
df['words'][1]
看起来像这样
[('white butterfly', 0.4587),
('pest horseradish', 0.4974),
('mature caterpillars', 0.6484)]
我想要实现的是有两列,一列代表单词/短语,另一列代表分数,如下所示:
df2data = [["The word horseradish is attested in English from the 1590s. It combines the word horse (formerly used in a figurative sense to mean strong or coarse) and the word radish. Some sources claim that the term originates from a mispronunciation of the German word meerrettich as mareradish. However, this hypothesis has been disputed, as there is no historical evidence of this term being used. In Central and Eastern Europe, horseradish is called chren, hren and ren (in various spellings like kren) in many Slavic languages, in Austria, in parts of Germany (where the other German name Meerrettich is not used), in North-East Italy, and in Yiddish (כריין transliterated as khreyn). It is common in Ukraine (under the name of хрін, khrin), in Belarus (under the name of хрэн, chren), in Poland (under the name of chrzan), in the Czech Republic (křen), in Slovakia (chren), in Russia (хрен, khren), in Hungary (torma), in Romania (hrean), in Lithuania (krienas), and in Bulgaria (under the name of хрян)",
['mispronunciation german','mareradish hypothesis','horse used'],
[0.3715, 0.422,0.4594]],
["Widely introduced by accident, cabbageworms, the larvae of Pieris rapae, the small white butterfly, are a common caterpillar pest in horseradish. The adults are white butterflies with black spots on the forewings that are commonly seen flying around plants during the day. The caterpillars are velvety green with faint yellow stripes running lengthwise down the back and sides. Fully grown caterpillars are about 25-millimetre (1 in) in length. They move sluggishly when prodded. They overwinter in green pupal cases. Adults start appearing in gardens after the last frost and are a problem through the remainder of the growing season. There are three to five overlapping generations a year. Mature caterpillars chew large, ragged holes in the leaves leaving the large veins intact. Handpicking is an effective control strategy in home gardens",
['white butterfly','pest horseradish','mature caterpillars'],
[0.4587, 0.4974,0.6484]]]
df2 = pd.DataFrame(data = df2data, columns=['texts', 'words','scores'])
话 | 分数 | 文字 |
---|---|---|
[德语发音错误,mareradish 假设...] | [0.3715, 0.422, 0.4594] | “辣根”一词自 1590 年代起就在英语中得到证实... |
【白蝴蝶、害虫辣根、成熟猫……】 | [0.4587, 0.4974, 0.6484] | 菜青虫、菜青虫的幼虫偶然被广泛引入,... |
我尝试过对单词列建立索引,但只能在系列本身内滑动。
生成可重现示例的代码:
import pandas as pd
from keybert import KeyBERT
kw_model = KeyBERT()
text = ["The word horseradish is attested in English from the 1590s. It combines the word horse (formerly used in a figurative sense to mean strong or coarse) and the word radish. Some sources claim that the term originates from a mispronunciation of the German word meerrettich as mareradish. However, this hypothesis has been disputed, as there is no historical evidence of this term being used. In Central and Eastern Europe, horseradish is called chren, hren and ren (in various spellings like kren) in many Slavic languages, in Austria, in parts of Germany (where the other German name Meerrettich is not used), in North-East Italy, and in Yiddish (כריין transliterated as khreyn). It is common in Ukraine (under the name of хрін, khrin), in Belarus (under the name of хрэн, chren), in Poland (under the name of chrzan), in the Czech Republic (křen), in Slovakia (chren), in Russia (хрен, khren), in Hungary (torma), in Romania (hrean), in Lithuania (krienas), and in Bulgaria (under the name of хрян)","Widely introduced by accident, cabbageworms, the larvae of Pieris rapae, the small white butterfly, are a common caterpillar pest in horseradish. The adults are white butterflies with black spots on the forewings that are commonly seen flying around plants during the day. The caterpillars are velvety green with faint yellow stripes running lengthwise down the back and sides. Fully grown caterpillars are about 25-millimetre (1 in) in length. They move sluggishly when prodded. They overwinter in green pupal cases. Adults start appearing in gardens after the last frost and are a problem through the remainder of the growing season. There are three to five overlapping generations a year. Mature caterpillars chew large, ragged holes in the leaves leaving the large veins intact. Handpicking is an effective control strategy in home gardens"]
df = pd.DataFrame(text, columns=['texts'])
df['words'] = kw_model.extract_keywords(df['texts'], keyphrase_ngram_range=(1, 2), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=3)
其中一种方法是:转置内部序列(使用
zip
?)并将 pandas.Series
应用于它们:
df = pd.DataFrame({
'words': [
[('mispronunciation german', 0.3715),
('mareradish hypothesis', 0.422),
('horse used', 0.4594)],
[('white butterfly', 0.4587),
('pest horseradish', 0.4974),
('mature caterpillars', 0.6484)]]
})
splitted_values = df['words'].apply(lambda x: pd.Series([*zip(*x)], ['words','scores']))
print(splitted_values.to_string()
输出:
words scores
0 (mispronunciation german, mareradish hypothesis, horse used) (0.3715, 0.422, 0.4594)
1 (white butterfly, pest horseradish, mature caterpillars) (0.4587, 0.4974, 0.6484)