如何在大量单词中找到出现频率最高的单词(例如900000)

问题描述 投票:0回答:4

我面临的任务是生成 900000 个随机单词,然后打印出最常见的单词。所以这是我的算法:

1. move all number into a collection rather than printhing out them
2. for (900000...){move the frequency of Collection[i] to another collection B}
** 90W*90W is too much for a computer(lack of efficiency)
3. find the biggest number in that collection and the index.
4. then B[index] is output.

但问题是我的电脑无法处理第二步。所以我在这个网站上搜索并找到了一些关于在一堆单词中查找单词频率的答案,我查看了答案代码,但我还没有找到一种方法将它们应用到大量单词中。

现在我在这里展示我的代码:

/** Funny Words Generator
  * Tony
  */

import java.util.*;

public class WordsGenerator {

  //data field (can be accessed in whole class):
  private static int xC; // define a xCurrent so we can access it all over the class
  private static int n;
  private static String[] consonants = {"b","c","d","f","g","h","j","k","l","m","n","p","r","s","t","v","w","x","z"};
  private static String[] vowels = {"a", "e", "i", "o", "u"};
  private static String funnyWords = "";



  public static void main(String[] args) {

    Scanner sc = new Scanner(System.in);
    int times = 900000; // words number
    xC = sc.nextInt(); // seeds (only input)

    /* Funny word list */
    ArrayList<String> wordsList = new ArrayList<String>();
    ArrayList<Integer> frequencies = new ArrayList<Integer>();
    int maxFreq;
    for (int i = 0; i < times; i++) {
      n = 6; // each words are 6 characters long
      funnyWords = ""; // reset the funnyWords each new time
      for (int d = 0; d < n; d ++) {

        int letterNum = randomGenerator(); /* random generator will generate numbers based on current x */
        int letterIndex = 0; /* letterNum % 19 or % 5 based on condition */

        if ((d + 1) % 2 == 0) {
          letterIndex = letterNum % 5;
          funnyWords += vowels[letterIndex];
        }

        else if ((d + 1) % 2 != 0) {
          letterIndex = letterNum % 19;
          funnyWords += consonants[letterIndex];
        }
      }
      wordsList.add(funnyWords);
    }


    /* put all frequencies of each words into an array called frequencies */
    for (int i = 0; i < 900000; i++) {
      frequencies.add(Collections.frequency(wordsList, wordsList.get(i)));
    }



    maxFreq = Collections.max(frequencies);
    int index = frequencies.indexOf(maxFreq); // get the index of the most frequent word
    System.out.print(wordsList.get(index));


    sc.close();
  }

  /** randomGenerator
    * param: N(generate times), seeds
    * return: update the xC and return it */
  private static int randomGenerator() {
    int a = 445;
    int c = 700001;
    int m = 2097152;
    xC = (a * xC + c) % m; // update
    return xC; // return
  }

}

所以我意识到也许有办法以某种方式跳过第二步。任何人都可以给我提示吗?只是一个提示而不是代码,这样我就可以自己尝试一下,那就太好了!谢谢!

修改: 我看到你的很多答案代码都包含“words.stream()”,我用谷歌搜索但找不到它。请问各位大佬能告诉我哪里可以找到这方面的知识吗?这个流方法在哪个类中?谢谢!

java algorithm arraylist collections
4个回答
3
投票

您可以使用 Java Lambda 来完成此操作(需要 JDK 8)。另请注意,您的单词列表中可以包含相同频率的单词。

public class Main {
    public static void main(String[] args) {

        List<String> words = new ArrayList<>();

        words.add("World");
        words.add("Hello");
        words.add("World");
        words.add("Hello");

        // Imagine we have 90000 words in word list
        Set<Map.Entry<String, Integer>> set = words.stream()
                // Here we create map of unique words and calculates their frequency
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum)).entrySet();

        // Find the max frequency
        int max = Collections
                .max(set, (a, b) -> Integer.compare(a.getValue(), b.getValue())).getValue();

        // We can have words with the same frequency like in my words list. Let's get them all
        List<String> list = set.stream()
                .filter(entry -> entry.getValue() == max)
                .map(Map.Entry::getKey).collect(Collectors.toList());

        System.out.println(list); // [Hello, World]


    }
}

2
投票

这基本上可以分为两个步骤:

  1. 计算词频,如
    Map<String, Long>
    。有多种选择,请参阅此问题了解示例。
  2. 计算该map的最大条目,其中“最大”指的是具有最高值的条目。

所以,如果你真的有能力,你可以写得非常紧凑:

private static <T> T maxCountElement(List<? extends T> list)
{
    return Collections.max(list.stream().collect(Collectors.groupingBy(
        Function.identity(), Collectors.counting())).entrySet(), 
            (e0, e1) -> Long.compare(e0.getValue(), e1.getValue())).getKey();
}

已编辑以回复评论:

紧凑的表示可能不是最具可读性的。分解它会使代码变得有点复杂,但可能会更清楚那里发生的事情:

private static <T> T maxCountElement(List<? extends T> list)
{
    // A collector that receives the input elements, and converts them 
    // into a map. The key of the map is the input element. The value 
    // of the map is the number of occurrences of the element
    Collector<T, ?, Map<T, Long>> collector = 
        Collectors.groupingBy(Function.identity(), Collectors.counting());

    // Create the map and obtain its set of entries
    Map<T, Long> map = list.stream().collect(collector);
    Set<Entry<T, Long>> entrySet = map.entrySet();

    // A comparator that compares two map entries based on their value
    Comparator<Entry<T, Long>> comparator = 
        (e0, e1) -> Long.compare(e0.getValue(), e1.getValue());

    // Compute the maximum element of the set of entries. That is,
    // the entry with the largest value (which is the entry for the
    // element with the maximum number of occurrences)
    Entry<T, Long> entryWithMaxValue = 
        Collections.max(entrySet, comparator);

    return entryWithMaxValue.getKey();
}

1
投票

HashMap 是最快的数据结构之一,只需循环遍历每个单词,将其用作 HashMap 的键,在循环内部,使计数器成为 hashMap 的值。

HashMap<string, Integer> hashMapVariable = new HashMap<>();
...
//inside the loop of words
if (hashMapVariable.containsKey(word){
   hashMapVariable.put(key, hashMapVariable.get(key) + 1);
} else {
   hashMapVariable.put(word, 1);
}
...

对于每个键(单词),只需增加与该键关联的值。尽管你必须检查密钥是否存在(在java中是

hashMapVariable.containsKey("key")
)。如果它存在,则只需递增,否则将其添加到 HashMap 中。通过这样做,您并没有恢复整个数据,您只是将每个键设为一个,并将其出现的次数作为该键的值。

在循环结束时,最常见的单词将具有最高的计数器/值。


1
投票

您可以使用

HashMap
key
存储 word,其值为
correspond times

伪代码如下:

String demo(){
   int maxFrequency = 0;
   String maxFrequencyStr = "";
   String strs[] ;
   Map<String,Integer> map = new HashMap<String,Integer>();
   for(int i = 0; i < 900000;i++){//for
      if(map.containsKey(strs[i])){
          int times = map.get(strs[i]);
          map.put(strs[i], times+1);
          if(maxFrequency<times+1){
              maxFrequency = times + 1;
              maxFrequencyStr = strs[i];
          }
      }
      else{
          map.put(strs[i], 1);
          if(maxFrequency<1){
              maxFrequency = 1;
              maxFrequencyStr = strs[i];
          }
      }
   }//for
   return maxFrequencyStr;
 }
© www.soinside.com 2019 - 2024. All rights reserved.