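I have mapper and reducer code that finds the most frequent word in a text file. Instead, I want to output the most frequent word in one specific column of my text file. The column is named "genres", and it holds multiple strings separated by commas. Here is a sample of my txt file: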
tconst averageRating numVotes titleType primaryTitle startYear genres
tt0002020 5.2 85 short Queen Elizabeth 1912 Biography,Drama,History
tt0002026 4 7 movie Anny - Story of a Prostitute 1912 Drama,Romance
tt0002029 6.1 33 short Poor Jenny 1912 Short
tt0002031 4.6 8 movie As You Like It 1912 \N
tt0002033 5.6 26 short Asesinato y entierro de Don José Canalejas 1912 Short
tt0002034 4.9 17 short At Coney Island 1912 Comedy,Short
tt0002041 3.9 14 short The Baby and the Stork 1912 Crime,Drama,Short
tt0002045 4.2 71 short The Ball Player and the Bandit 1912 Drama,Romance,Short
//Mapper code
import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()
//Reducer
from itertools import groupby
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
max_count = 0
max_word = None

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        # a new word starts: check whether the previous word beats the maximum
        if current_count > max_count:
            max_count = current_count
            max_word = current_word
        current_count = count
        current_word = word

# do not forget to check the last word
if current_count > max_count:
    max_count = current_count
    max_word = current_word

print('%s\t%s' % (max_word, max_count))
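Could you guide me on how to change this code so that it prints the most frequent word in the 'genres' column? I would also like to output the word count for every word in 'genres'. Please let me know if I need to provide any other information.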
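Try indexing into the line variable after splitting it. Use the index to pick out the specific column whose most frequent word you want to find.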
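As a rough sketch of that idea, the functions below take only the last field of each line and split it on commas, then count each genre. This assumes the genres column is always the last field, contains no spaces, and uses \N for a missing value, as in the sample; if the real file is tab-delimited, splitting on '\t' and taking index 6 may be more robust. The names genres_from_line, mapper, and reducer are made up for illustration and are not part of the original code.

```python
import sys
from itertools import groupby


def genres_from_line(line):
    """Return the genres found in the last whitespace-separated field."""
    fields = line.split()
    # Skip blank lines and the header row (assumed to start with 'tconst').
    if not fields or fields[0] == 'tconst':
        return []
    last = fields[-1]
    # '\N' is treated as "no genres" (assumption based on the sample data).
    if last == '\\N':
        return []
    return [genre for genre in last.split(',') if genre]


def mapper(lines, out=sys.stdout, separator='\t'):
    """Emit one 'genre<TAB>1' line per genre occurrence."""
    for line in lines:
        for genre in genres_from_line(line):
            out.write('%s%s%d\n' % (genre, separator, 1))


def reducer(lines, out=sys.stdout, separator='\t'):
    """Write the count of every genre and return the most frequent one.

    Relies on the input being sorted by key, which Hadoop does
    between the map and reduce steps.
    """
    pairs = (line.rstrip('\n').split(separator, 1)
             for line in lines if line.strip())
    max_word, max_count = None, 0
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        out.write('%s%s%d\n' % (word, separator, total))
        if total > max_count:
            max_word, max_count = word, total
    return max_word, max_count
```

In mapper.py you would call mapper(sys.stdin); in reducer.py you would call reducer(sys.stdin) and print the returned pair at the end. Note that if the job runs with more than one reducer, each reducer only sees part of the keys, so the maximum it reports is the maximum for its own share.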