I am analyzing data: I have sentences, one example per line.
PhraseCleaned
0 get house business distribute sell outside house opportunities
1 business changing offices culture work business
2 search company best practices
3 1 let go back desk spaces one
These are all sentences, and I need to count how many times the same word appears in each row; the result I get looks like this:
id PhraseCleaned
0 get house business house opportunities
1 business changing offices culture work business
2 desk big work culture
This is the picture I actually need to get to.
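For a reproducible setup, the sample above can be built as a DataFrame. A sketch (the frame is named df here, with the PhraseCleaned column from the printout, so the answer snippets below run as written):

import pandas as pd

# the four example phrases from the question
df = pd.DataFrame({'PhraseCleaned': [
    'get house business distribute sell outside house opportunities',
    'business changing offices culture work business',
    'search company best practices',
    '1 let go back desk spaces one',
]})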
I did this:
tokenized_data = PraseFinalD.apply(lambda row: nltk.word_tokenize(row['PhraseCleaned']), axis=1)
It splits each sentence into a comma-separated list of words:
[get, house, business, house, opportunities ]
[business, changing, offices, culture, work, business]
[desk, big, work, culture]
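Note that nltk.word_tokenize needs NLTK's punkt tokenizer data; a one-time setup in case it is missing:

import nltk

nltk.download('punkt')  # word_tokenize raises a LookupError without this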
Now I am trying to count them, but this just counts all the words together. PhraseFinal is a list... I cleaned the data and removed some things:
word2count = {}
for data in PhraseFinal:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1
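As an aside, collections.Counter does the same bookkeeping in one expression (still one global count, not per row); a minimal sketch, assuming PhraseFinal is the list of cleaned sentences:

from collections import Counter

import nltk

# equivalent to the loop above: one global count over all sentences
word2count = Counter(
    word
    for data in PhraseFinal
    for word in nltk.word_tokenize(data)
)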
You can build the per-row counts on df with collections.Counter and split them into columns with .tolist():
from collections import Counter
import pandas as pd
# create a word count dict and split it into columns
df1 = pd.DataFrame(df['PhraseCleaned'].apply(lambda x: Counter(x.split())).tolist())
print(df1)
   get house business distribute sell outside opportunities changing offices culture work search company best practices 1 let go back desk spaces one
0  1.0 2.0 1.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1  NaN NaN 2.0 NaN NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN
3  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
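NaN just marks words that never occur in that row; if integer counts are preferred, the frame can be filled, e.g.:

# turn the missing counts into explicit zeros
df1 = df1.fillna(0).astype(int)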
# join df and df1
df2 = df.join(df1)
print(df2)
  PhraseCleaned                                                    get house business distribute sell outside opportunities changing offices culture work search company best practices 1 let go back desk spaces one
0 get house business distribute sell outside house opportunities  1.0 2.0 1.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 business changing offices culture work business                 NaN NaN 2.0 NaN NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 search company best practices                                   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN
3 1 let go back desk spaces one                                   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
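With the counts joined on, per-word lookups are plain column access; for example:

# how many times 'business' occurs in each phrase
print(df2['business'].fillna(0).astype(int))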
With the scikit-learn vectorizer:
from operator import itemgetter
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
texts = """get house business distribute sell outside house opportunities
business changing offices culture work business
search company best practices
1 let go back desk spaces one""".split('\n')
df = pd.DataFrame({'text': texts})
# Initialize the counter.
vectorizer = CountVectorizer()
# Get the unique vocabulary and get the counts.
vectorizer.fit_transform(df['text'])
# Using idiom from https://www.kaggle.com/alvations/basic-nlp-with-nltk/#To-vectorize-any-new-sentences,-we-use--CountVectorizer.transform()
# Print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(vectorizer.vocabulary_.items(), key=itemgetter(1)))
print('Vocab:', words_sorted_by_index)
print()
print('Matrix/Vectors:\n', vectorizer.transform(df['text']).toarray())
[out]:
Vocab: ('back', 'best', 'business', 'changing', 'company', 'culture', 'desk', 'distribute', 'get', 'go', 'house', 'let', 'offices', 'one', 'opportunities', 'outside', 'practices', 'search', 'sell', 'spaces', 'work')
Matrix/Vectors:
[[0 0 1 0 0 0 0 1 1 0 2 0 0 0 1 1 0 0 1 0 0]
[0 0 2 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1]
[0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0]
[1 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0]]
把它放回DataFrame。
from operator import itemgetter
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
texts = """get house business distribute sell outside house opportunities
business changing offices culture work business
search company best practices
1 let go back desk spaces one""".split('\n')
df = pd.DataFrame({'text': texts})
# Initialize the counter.
vectorizer = CountVectorizer()
# Get the unique vocabulary and get the counts.
vectorizer.fit_transform(df['text'])
# Using idiom from https://www.kaggle.com/alvations/basic-nlp-with-nltk/#To-vectorize-any-new-sentences,-we-use--CountVectorizer.transform()
# Print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(vectorizer.vocabulary_.items(), key=itemgetter(1)))
matrix = vectorizer.transform(df['text']).toarray()
# Putting it back to the DataFrame.
df_new = pd.concat([df, pd.DataFrame(matrix)], axis=1)
column_names = dict(zip(range(len(words_sorted_by_index)), words_sorted_by_index))
df_new = df_new.rename(column_names, axis=1)
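On newer scikit-learn (>= 1.0) the rename step can be skipped by asking the fitted vectorizer for its feature names directly; a sketch using get_feature_names_out():

# column names straight from the vectorizer, already sorted by index
df_new = pd.concat(
    [df, pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())],
    axis=1)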
And write it to a CSV file:
df_new.to_csv('data-analogize.csv', index=False)
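As a quick sanity check, the file can be read back (assuming the same working directory):

print(pd.read_csv('data-analogize.csv').head())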