Count the occurrences of each word per row in a pandas DataFrame of sentences.

Problem description (Votes: -1, Answers: 1)

I'm analyzing data; I have sentences, one example per row.

PhraseCleaned   
0   get house business distribute sell outside house opportunities  
1   business changing offices culture work business
2   search company best practices 
3   1 let go back desk spaces one

These are all sentences, and for each row I need to count how many times each word appears, so that I get a result like this:

id    PhraseCleaned 
0   get house business house opportunities  
1   business changing offices culture work business
2   desk big work culture

This image shows what I actually need to get:

[image of what I want to get]

I did this:

tokenized_data = PraseFinalD.apply(lambda row: nltk.word_tokenize(row['PhraseCleaned']), axis=1)

It separates the words with commas (each row becomes a list of tokens):

[get, house, business, house, opportunities ]
[business, changing, offices, culture, work, business]
[desk, big, work, culture]
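
As a side note, nltk.word_tokenize needs the punkt tokenizer models to be available locally; if they are missing, a one-time download fixes it (a minimal sketch, assuming a standard NLTK install):

import nltk
nltk.download('punkt')  # one-time download of the models that word_tokenize relies on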

Now I'm trying to count them, but this just counts all the words together. PhraseFinal is a list... I cleaned the data and removed some things.

# Counts words across ALL rows, not per row.
word2count = {}
for data in PhraseFinal:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
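
The loop above aggregates counts into a single global dict. To count per row instead, one option (a minimal sketch, assuming PhraseFinal is the list of cleaned sentences shown above) is to build one Counter per sentence:

from collections import Counter

import nltk

# One Counter per row instead of one global dict.
row_counts = [Counter(nltk.word_tokenize(sentence)) for sentence in PhraseFinal]
# row_counts[0] would be e.g. Counter({'house': 2, 'get': 1, ...})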
Tags: python, pandas, loops, nlp, nltk
1 Answer
0 votes
  1. Given your data as df
  2. Create a word-count dict with collections.Counter and split it into columns with .tolist()
  3. Join it to df
from collections import Counter
import pandas as pd

# create a word count dict and split it into columns
df1 = pd.DataFrame(df['PhraseCleaned'].apply(lambda x: Counter(x.split())).tolist())

print(df1)

 get  house  business  distribute  sell  outside  opportunities  changing  offices  culture  work  search  company  best  practices    1  let   go  back  desk  spaces  one
 1.0    2.0       1.0         1.0   1.0      1.0            1.0       NaN      NaN      NaN   NaN     NaN      NaN   NaN        NaN  NaN  NaN  NaN   NaN   NaN     NaN  NaN
 NaN    NaN       2.0         NaN   NaN      NaN            NaN       1.0      1.0      1.0   1.0     NaN      NaN   NaN        NaN  NaN  NaN  NaN   NaN   NaN     NaN  NaN
 NaN    NaN       NaN         NaN   NaN      NaN            NaN       NaN      NaN      NaN   NaN     1.0      1.0   1.0        1.0  NaN  NaN  NaN   NaN   NaN     NaN  NaN
 NaN    NaN       NaN         NaN   NaN      NaN            NaN       NaN      NaN      NaN   NaN     NaN      NaN   NaN        NaN  1.0  1.0  1.0   1.0   1.0     1.0  1.0

# join df and df1
df2 = df.join(df1)

print(df2)

                                                  PhraseCleaned  get  house  business  distribute  sell  outside  opportunities  changing  offices  culture  work  search  company  best  practices    1  let   go  back  desk  spaces  one
 get house business distribute sell outside house opportunities  1.0    2.0       1.0         1.0   1.0      1.0            1.0       NaN      NaN      NaN   NaN     NaN      NaN   NaN        NaN  NaN  NaN  NaN   NaN   NaN     NaN  NaN
                business changing offices culture work business  NaN    NaN       2.0         NaN   NaN      NaN            NaN       1.0      1.0      1.0   1.0     NaN      NaN   NaN        NaN  NaN  NaN  NaN   NaN   NaN     NaN  NaN
                                  search company best practices  NaN    NaN       NaN         NaN   NaN      NaN            NaN       NaN      NaN      NaN   NaN     1.0      1.0   1.0        1.0  NaN  NaN  NaN   NaN   NaN     NaN  NaN
                                  1 let go back desk spaces one  NaN    NaN       NaN         NaN   NaN      NaN            NaN       NaN      NaN      NaN   NaN     NaN      NaN   NaN        NaN  1.0  1.0  1.0   1.0   1.0     1.0  1.0
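
If zeros are preferable to NaN for words that never appear in a row, the counts can be filled before joining (a small variation on the code above, not part of the original answer):

# Optional: fill missing counts with 0 and keep integer dtypes.
df2 = df.join(df1.fillna(0).astype(int))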

0 votes

With a scikit-learn vectorizer:

from operator import itemgetter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data from the question (also defined in the full example below).
texts = """get house business distribute sell outside house opportunities
business changing offices culture work business
search company best practices
1 let go back desk spaces one""".split('\n')

df = pd.DataFrame({'text': texts})

# Initialize the counter.
vectorizer = CountVectorizer()
# Get the unique vocabulary and get the counts.
vectorizer.fit_transform(df['text'])

# Using idiom from https://www.kaggle.com/alvations/basic-nlp-with-nltk/#To-vectorize-any-new-sentences,-we-use--CountVectorizer.transform()
# Print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(vectorizer.vocabulary_.items(), key=itemgetter(1)))
print('Vocab:', words_sorted_by_index)
print()
print('Matrix/Vectors:\n', vectorizer.transform(df['text']).toarray())

[out]:

Vocab: ('back', 'best', 'business', 'changing', 'company', 'culture', 'desk', 'distribute', 'get', 'go', 'house', 'let', 'offices', 'one', 'opportunities', 'outside', 'practices', 'search', 'sell', 'spaces', 'work')

Matrix/Vectors:
 [[0 0 1 0 0 0 0 1 1 0 2 0 0 0 1 1 0 0 1 0 0]
 [0 0 2 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1]
 [0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0]
 [1 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0]]
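
As a side note, on scikit-learn 1.0 or newer the vocabulary is available directly in column order via get_feature_names_out(), which avoids the sorting idiom (assuming that version is installed):

# Equivalent on scikit-learn >= 1.0: feature names already sorted by column index.
print('Vocab:', vectorizer.get_feature_names_out())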

Putting it back into a DataFrame:

from operator import itemgetter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = """get house business distribute sell outside house opportunities
business changing offices culture work business
search company best practices 
1 let go back desk spaces one""".split('\n')

df = pd.DataFrame({'text': texts})

# Initialize the counter.
vectorizer = CountVectorizer()
# Get the unique vocabulary and get the counts.
vectorizer.fit_transform(df['text'])

# Using idiom from https://www.kaggle.com/alvations/basic-nlp-with-nltk/#To-vectorize-any-new-sentences,-we-use--CountVectorizer.transform()
# Print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(vectorizer.vocabulary_.items(), key=itemgetter(1)))
matrix = vectorizer.transform(df['text']).toarray()

# Putting it back to the DataFrame.
df_new = pd.concat([df, pd.DataFrame(matrix)], axis=1)
column_names = dict(zip(range(len(words_sorted_by_index)), words_sorted_by_index))
df_new = df_new.rename(column_names, axis=1)  # assign, so the word-named columns are kept for the CSV below

And write it to a csv file:

df_new.to_csv('data-analogize.csv', index=False)