I am trying to extract all the words from articles stored in a CSV file and write the sentence ID number and the words of each sentence to a new CSV file.
What I have tried so far:
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.read_csv(r"D:\data.csv", nrows=10)

row = 0
sentNo = 0
while row < 1:
    sentences = sent_tokenize(df['articles'][row])
    for index, sents in enumerate(sentences):
        sentNo += 1
        words = word_tokenize(sents)
        print(f'{sentNo}: {words}')
    row += 1
df['articles'][0]
contains:
The ultimate productivity hack is saying no. Not doing something will always be faster than doing it. This statement reminds me of the old computer programming saying, “Remember that there is no code faster than no code.”
I only used df['articles'][0], and it gives output like this:
1:['The', 'ultimate', 'productivity', 'hack', 'is', 'saying', 'no', '.']
2:['Not', 'doing', 'something', 'will', 'always', 'be', 'faster', 'than', 'doing', 'it', '.']
3:['This', 'statement', 'reminds', 'me', 'of', 'the', 'old', 'computer', 'programming', 'saying', ',', '“', 'Remember', 'that', 'there', 'is', 'no', 'code', 'faster', 'than', 'no', 'code', '.', '”']
How can I write a new output.csv file in the format below, containing all the sentences from all the articles in data.csv:
Sentence No | Word
1             The
              ultimate
              productivity
              hack
              is
              saying
              no
              .
2             Not
              doing
              something
              will
              always
              be
              faster
              than
              doing
              it
              .
3             This
              statement
              reminds
              me
              of
              the
              old
              computer
              programming
              saying
              ,
              “
              Remember
              that
              there
              is
              no
              code
              faster
              than
              no
              code
              .
              ”
I am new to Python and I am using it in a Jupyter Notebook.
This is my first post on Stack Overflow. If anything is out of order, please correct me so I can learn. Thank you.
You just need to iterate over the words and write a new row for each one.
Since commas are also counted as "words", this can get a bit unpredictable - consider using a different delimiter, or stripping the punctuation tokens from the word list.
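For example, one way to drop the punctuation tokens is to keep only tokens that contain at least one letter or digit. This is a sketch with the token list hardcoded to match the sample output above, so it runs without NLTK:

```python
# word_tokenize returns punctuation (",", ".", curly quotes) as separate tokens.
# Keep only tokens containing at least one alphanumeric character.
words = ["This", "statement", ",", "\u201c", "Remember", "no", "code", ".", "\u201d"]

cleaned = [w for w in words if any(ch.isalnum() for ch in w)]
print(cleaned)  # → ['This', 'statement', 'Remember', 'no', 'code']
```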
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.read_csv(r"D:\data.csv", nrows=10)

row = 0
sentNo = 0
while row < 1:
    sentences = sent_tokenize(df['articles'][row])
    for index, sents in enumerate(sentences):
        sentNo += 1
        words = word_tokenize(sents)
        word_num = 0
        for word in words:
            if word_num == 0:
                print(f'{sentNo},{word}')  # first word: sentence number, then the word
            else:
                print(f',{word}')          # remaining words: blank field, then the word
            word_num += 1
    row += 1
Edit: this seems like a cleaner way to do it.
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.read_csv(r"D:\data.csv", nrows=10)
sentences = sent_tokenize(df['articles'][0])

f = open('output.csv', 'w')
stcNum = 1
for stc in sentences:
    words = word_tokenize(stc)
    for i, word in enumerate(words):
        prntLine = ','
        if i == 0:
            prntLine = str(stcNum) + prntLine  # sentence number only on the first word
        prntLine = prntLine + word + '\n'
        f.write(prntLine)
    stcNum += 1
f.close()
output.csv:
1,The
,ultimate
,productivity
,hack
,is
,saying
,no
,.
2,Not
,doing
,something
,will
,always
,be
,faster
,than
,doing
,it
,.
3,This
,statement
,reminds
,me
,of
,the
,old
,computer
,programming
,saying
,,
,“
,Remember
,that
,there
,is
,no
,code
,faster
,than
,no
,code
,.
,”
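The snippets above only process the first article. One way to extend this to every article and write the two-column format is Python's built-in csv module. This is a sketch with the tokenized sentences hardcoded so it runs standalone; in practice you would build `tokenized` by running sent_tokenize and word_tokenize over each row of df['articles']:

```python
import csv

# Tokenized sentences, as sent_tokenize + word_tokenize would produce them;
# hardcoded here so the sketch runs without NLTK or the CSV data file.
tokenized = [
    ["The", "ultimate", "productivity", "hack", "is", "saying", "no", "."],
    ["Not", "doing", "something", "will", "always", "be", "faster", "."],
]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Sentence No", "Word"])
    for sent_no, words in enumerate(tokenized, start=1):
        for i, word in enumerate(words):
            # Write the sentence number only on the first word of each sentence.
            writer.writerow([sent_no if i == 0 else "", word])
```

Because the sentence counter never resets, the numbering stays unique across all articles if `tokenized` is built from every row of the DataFrame.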