I am using a List<String> that contains a large text. The text looks like this:
List<String> lines = Arrays.asList("The first line", "The second line", "Some words can repeat", "The first the second"); //etc
I need to count the words in it, producing output like this:
first - 2
line - 2
second - 2
repeat - 1
some - 1
words - 1
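Words shorter than 4 characters should be skipped, which is why "the" and "can" are not in the output. This is only a small example; in the real case, if a word is rare and has fewer than 20 entries, I should skip it as well. Then the map should be sorted by key in alphabetical order. All of this should use only streams, without "if", "while" and "for" constructions.
What I have implemented: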
Map<String, Integer> wordCount = Stream.of(lines)
        .flatMap(Collection::stream)
        .flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
        .filter(str -> (str.length() >= 4))
        .collect(Collectors.toMap(
                i -> i.toLowerCase(),
                i -> 1,
                (a, b) -> java.lang.Integer.sum(a, b))
        );
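wordCount contains a Map of the words and their numbers of entries. But how can I skip the rare words? Should I create a new stream? If so, how do I get at the values of the Map? I tried this, but it is not correct: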
String result = Stream.of(wordCount)
.filter(i -> (Map.Entry::getValue > 10));
My calculation should return a String:
"word" - number of entries
Thanks!
Given the stream that is already done:
List<String> lines = Arrays.asList(
        "For the rabbit, it was a bad day.",
        "An Antillean rabbit is very abundant.",
        "She put the rabbit back in the cage and closed the door securely, then ran away.",
        "The rabbit tired of her inquisition and hopped away a few steps.",
        "The Dean took the rabbit and went out of the house and away."
);
Map<String, Integer> wordCounts = Stream.of(lines)
        .flatMap(Collection::stream)
        .flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
        .filter(str -> (str.length() >= 4))
        .collect(Collectors.toMap(
                String::toLowerCase,
                i -> 1,
                Integer::sum)
        );
System.out.println("Original:" + wordCounts);
Original output:
Original:{dean=1, took=1, door=1, very=1, went=1, away=3, antillean=1, abundant=1, tired=1, back=1, then=1, house=1, steps=1, hopped=1, inquisition=1, cage=1, securely=1, rabbit=5, closed=1}
You can do this:
String results = wordCounts.entrySet()
        .stream()
        .filter(wordToCount -> wordToCount.getValue() > 2) // 2 is rare
        .sorted(Map.Entry.comparingByKey())
        .map(wordCount -> wordCount.getKey() + " - " + wordCount.getValue())
        .collect(Collectors.joining(", "));
System.out.println(results);
Filtered output:
away - 3, rabbit - 5
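For reference, the two steps above can be put together into one self-contained class. This is only a sketch: the class name RareWordFilter, the method frequentWords and its parameters minLength and minCount are made up for illustration, and the split pattern is simplified to "any run of non-letters", which is an assumption and not the pattern from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class RareWordFilter {

    // Hypothetical helper: counts words of at least minLength characters and
    // keeps only those that occur more than minCount times, sorted by word.
    static String frequentWords(List<String> lines, int minLength, int minCount) {
        Map<String, Integer> wordCounts = lines.stream()
                // Simplified split (assumption): any run of non-letter characters separates words.
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.toMap(String::toLowerCase, word -> 1, Integer::sum));

        // Second stream over the finished counts: filter, sort by key, format.
        return wordCounts.entrySet().stream()
                .filter(entry -> entry.getValue() > minCount)
                .sorted(Map.Entry.comparingByKey())
                .map(entry -> entry.getKey() + " - " + entry.getValue())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "For the rabbit, it was a bad day.",
                "An Antillean rabbit is very abundant.",
                "She put the rabbit back in the cage and closed the door securely, then ran away.",
                "The rabbit tired of her inquisition and hopped away a few steps.",
                "The Dean took the rabbit and went out of the house and away.");
        System.out.println(frequentWords(lines, 4, 2)); // away - 3, rabbit - 5
    }
}
```

Passing the thresholds as parameters keeps the "rare" cutoff in one place, so switching to the 20 mentioned in the question is a one-argument change.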
You cannot exclude any words that occur fewer than rare times until you have computed the frequency counts.
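Since the filter can only run on finished counts, a second pass over the entries is needed either way. If a single expression is preferred, that second pass can be folded into the collector with Collectors.collectingAndThen. A minimal sketch, assuming a simplified split pattern; the names SingleExpressionCount, frequentWords, minLength and minCount are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SingleExpressionCount {

    // Hypothetical helper: counts all words first, then drops the rare ones.
    static Map<String, Long> frequentWords(List<String> lines, int minLength, long minCount) {
        return lines.stream()
                // Simplified split (assumption): any run of non-letters separates words.
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.collectingAndThen(
                        // Step 1: count every word; nothing can be dropped yet.
                        Collectors.groupingBy(String::toLowerCase, Collectors.counting()),
                        // Step 2: runs only after counting has finished; keeps words
                        // seen more than minCount times in a key-sorted TreeMap.
                        (Map<String, Long> counts) -> counts.entrySet().stream()
                                .filter(entry -> entry.getValue() > minCount)
                                .collect(Collectors.toMap(
                                        Map.Entry::getKey,
                                        Map.Entry::getValue,
                                        (a, b) -> a,
                                        TreeMap::new))));
    }
}
```

The finisher function still streams the intermediate map, so this changes the shape of the code and not the amount of work.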
Here is how I might go about it: rebuild the map as a TreeMap so that the words are sorted in lexical order.
List<String> list = Arrays.asList(....);
int wordRarity = 10; // words occurring this many times or fewer are dropped
int wordLength = 4; // minimum word length to accept
// Entry below is java.util.Map.Entry
Map<String, Long> map = list.stream()
        .flatMap(str -> Arrays.stream(
                str.split("\\p{Punct}|\\s+|[0-9]|…|«|»|“|„")))
        .filter(str -> str.length() >= wordLength)
        .collect(Collectors.groupingBy(String::toLowerCase,
                Collectors.counting()))
        // here is where the rare words are filtered out.
        .entrySet().stream().filter(e -> e.getValue() > wordRarity)
        .collect(Collectors.toMap(Entry::getKey, Entry::getValue,
                (a, b) -> a, TreeMap::new));
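Note that the (a,b)->a lambda is a merge function for handling duplicates and is never actually called here. Unfortunately, toMap does not let you specify a Supplier without also specifying the merge function.
The simplest way to print the entries is as follows: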
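If the unused merge function bothers you, one alternative is the three-argument Collectors.groupingBy(classifier, mapFactory, downstream), which takes the map supplier directly. The sketch below counts straight into a TreeMap and then removes the rare words with Collection.removeIf on the values view. Note that removeIf is not a stream operation, so whether it satisfies the "only streams" requirement is a judgment call. The names SortedCount, sortedCounts, minLength and minCount are hypothetical, and the split pattern is a simplification.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SortedCount {

    // Hypothetical helper: counts words directly into a sorted map.
    static TreeMap<String, Long> sortedCounts(List<String> lines, int minLength, long minCount) {
        TreeMap<String, Long> counts = lines.stream()
                // Simplified split (assumption): any run of non-letters separates words.
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.groupingBy(
                        String::toLowerCase,
                        TreeMap::new,            // map factory, no merge function needed
                        Collectors.counting()));
        // Removing from the values view also removes the matching keys from the map.
        counts.values().removeIf(count -> count <= minCount);
        return counts;
    }
}
```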
map.entrySet().forEach(e -> System.out.printf("%s - %s%n",
        e.getKey(), e.getValue()));