How do I save the words tokenized from articles to a CSV file, together with their sentence ID numbers?

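I am trying to extract all the words from articles stored in a CSV file and write each word, along with the ID number of the sentence that contains it, to a new CSV file.

What I have tried so far: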

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
df = pd.read_csv(r"D:\data.csv", nrows=10)

row = 0; sentNo = 0
while( row < 1 ):
    sentences = sent_tokenize(df['articles'][row])
    for index, sents in enumerate(sentences):
        sentNo += 1
        words = word_tokenize(sents)
        print(f'{sentNo}: {words}')
    row += 1

df['articles'][0] contains:

The ultimate productivity hack is saying no. Not doing something will always be faster than doing it. This statement reminds me of the old computer programming saying, “Remember that there is no code faster than no code.”

I only used df['articles'][0], and it gives this output:

1:['The', 'ultimate', 'productivity', 'hack', 'is', 'saying', 'no', '.']
2:['Not', 'doing', 'something', 'will', 'always', 'be', 'faster', 'than', 'doing', 'it', '.']
3:['This', 'statement', 'reminds', 'me', 'of', 'the', 'old', 'computer', 'programming', 'saying', ',', '“', 'Remember', 'that', 'there', 'is', 'no', 'code', 'faster', 'than', 'no', 'code', '.', '”']

How can I write a new output.csv file in the format below, containing all the sentences from all the articles in data.csv?

Sentence No | Word
1             The
              ultimate
              productivity
              hack
              is
              saying
              no
              .
2             Not
              doing 
              something 
              will
              always
              be
              faster
              than
              doing
              it
              .
3             This 
              statement 
              reminds 
              me 
              of 
              the 
              old 
              computer 
              programming 
              saying
              , 
              “
              Remember
              that 
              there
              is
              no
              code
              faster
              than
              no
              code
              .
              ”

I am new to Python and am using it in a Jupyter Notebook.

This is my first post on Stack Overflow. If anything is out of order, please correct me so I can learn. Thank you.

python pandas csv preprocessor
1 Answer

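You just need to iterate over the words and write a new line for each one.

Because commas also come through as "words", the resulting CSV becomes somewhat unpredictable; you may want to use a different delimiter or remove the commas from the word list.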
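One way around the comma problem, sketched here under assumptions rather than taken from the original answer, is to let Python's standard csv module do the writing, since it quotes any field that contains the delimiter. The function name write_word_rows and the sample token lists are hypothetical; the lists stand in for what word_tokenize would return.

```python
import csv
import io

def write_word_rows(tokenized_sentences, out_file):
    """Write one row per word; the sentence number appears only on the first word."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Sentence No", "Word"])
    for sent_no, words in enumerate(tokenized_sentences, start=1):
        for position, word in enumerate(words):
            writer.writerow([sent_no if position == 0 else "", word])

# Hypothetical sample input standing in for word_tokenize() output.
sample = [["no", "."], ["saying", ",", "code"]]
buffer = io.StringIO()
write_word_rows(sample, buffer)
print(buffer.getvalue())
```

The comma token is written as a quoted field, so a CSV reader still sees exactly two columns on that row. To write to disk instead of a buffer, an open file object (opened with newline='') could be passed in place of the StringIO.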

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
df = pd.read_csv(r"D:\data.csv", nrows=10)

row = 0; sentNo = 0
while( row < 1 ):
    sentences = sent_tokenize(df['articles'][row])
    for index, sents in enumerate(sentences):
        sentNo += 1
        words = word_tokenize(sents)
        word_num = 0
        for word in words:
            if word_num == 0:
                print(f'{sentNo},{word}')  # first row, print number and word
            else:
                print(f',{word}')           # all other rows, blank then word
            word_num += 1
    row += 1

Edit: this seems like a cleaner approach.

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.read_csv(r"D:\data.csv", nrows=10)
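sentences = sent_tokenize(df['articles'][0])  # first article only
stcNum = 1

# newline='' keeps the explicit '\r\n' from being translated again on Windows
with open('output.csv', 'w', encoding='utf-8', newline='') as f:
    for stc in sentences:
        # split the sentence into words; iterating over stc directly would yield characters
        for i, word in enumerate(word_tokenize(stc)):
            prntLine = ','
            if i == 0:  # only the first word of a sentence carries the number
                prntLine = str(stcNum) + prntLine
            prntLine = prntLine + word + '\r\n'
            f.write(prntLine)
        stcNum += 1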

output.csv:

1,The
,ultimate
,productivity
,hack
,is
,saying
,no
,.
2,Not
,doing
,something
,will
,always
,be
,faster
,than
,doing
,it
,.
3,This
,statement
,reminds
,me
,of
,the
,old
,computer
,programming
,saying
,,
,“
,Remember
,that
,there
,is
,no
,code
,faster
,than
,no
,code
,.
,”
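The question asks for every article in data.csv, while the code above only handles the first one. A minimal sketch of extending it follows, with the sentence counter running across all articles. The names iter_word_rows and naive_sentences are hypothetical, and the tokenizers are passed in as parameters so the example runs without NLTK; with NLTK installed, sent_tokenize and word_tokenize could presumably be passed instead, and df['articles'] used as the list of articles.

```python
import csv
import io

def iter_word_rows(articles, split_sentences, split_words):
    """Yield (sentence_no, word) pairs; numbering continues across articles."""
    sent_no = 0
    for article in articles:
        for sentence in split_sentences(article):
            sent_no += 1
            for position, word in enumerate(split_words(sentence)):
                # Only the first word of each sentence carries the number.
                yield (sent_no if position == 0 else "", word)

# Stand-in sentence splitter (assumption): one sentence per line of text.
def naive_sentences(text):
    return [s for s in text.split("\n") if s]

articles = ["First one\nSecond one", "Third one"]
out = io.StringIO()  # for a real file: open("output.csv", "w", newline="", encoding="utf-8")
writer = csv.writer(out, lineterminator="\n")
writer.writerows(iter_word_rows(articles, naive_sentences, str.split))
print(out.getvalue())
```

Keeping the row generation separate from the writing makes it easy to swap the tokenizers or the output target without touching the numbering logic.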
