Finding the most popular word in someone's tweets

Question

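In a project, I am trying to query the tweets of a specific user handle, find the most common word in that user's tweets, and return the frequency of that word.

Here is my code: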

  public String mostPopularWord()
  {
     this.removeCommonEnglishWords();
     this.sortAndRemoveEmpties();

     Map<String, Integer> termsCount = new HashMap<>();
     for(String term : terms)
     {
        Integer c = termsCount.get(term);
        if(c==null)
           c = new Integer(0);
        c++;
        termsCount.put(term, c);
     }
     Map.Entry<String,Integer> mostRepeated = null;
     for(Map.Entry<String, Integer> curr: termsCount.entrySet())
     {
         if(mostRepeated == null || mostRepeated.getValue()<curr.getValue())
             mostRepeated = curr;
     }

     //frequencyMax = termsCount.get(mostRepeated.getKey());

     try 
     {
        frequencyMax = termsCount.get(mostRepeated.getKey());
        return mostRepeated.getKey();
     } 
     catch (NullPointerException e) 
     {
        System.out.println("Cannot find most popular word from the tweets.");
     }

     return ""; 
  }

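I also think it will help to show the code of the first two methods that I call in the method above. They are all in the same class, which defines the following fields: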

  private Twitter twitter;
  private PrintStream consolePrint;
  private List<Status> statuses;
  private List<String> terms;
  private String popularWord;
  private int frequencyMax;

  @SuppressWarnings("unchecked")
  public void sortAndRemoveEmpties()
  {
     Collections.sort(terms);
     terms.removeAll(Arrays.asList("", null));
  }

  private void removeCommonEnglishWords()
  {          
     Scanner sc = null;

     try
     {
        sc = new Scanner(new File("commonWords.txt"));
     }
     catch(Exception e)
     {
        System.out.println("The file is not found");
     }

     List<String> commonWords = new ArrayList<String>(); 
     int count = 0;
     while(sc.hasNextLine())
     {
        count++;
        commonWords.add(sc.nextLine()); 
     }

     Iterator<String> termIt = terms.iterator();
     while(termIt.hasNext())
     {
        String term = termIt.next();
        for(String word : commonWords)
           if(term.equalsIgnoreCase(word))
              termIt.remove();
     }
  }

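Sorry for the long code. What is frustrating is that, even though my removeCommonEnglishWords() method appears to be correct (it was discussed in another post), mostPopularWord() still returns "the" when I run it. That word is clearly in my list of common English words and is supposed to be removed from the terms list. What might I be doing wrong?

Update 1: Here is a link to the commonWords file: https://drive.google.com/file/d/1VKNI-b883uQhfKLVg-L8QHgPTLNb22uS/view?usp=sharing

Update 2: One thing I noticed while debugging is that the while(sc.hasNextLine()) loop in removeCommonEnglishWords() is skipped entirely. I don't understand why.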

Answer 1

It might be simpler to use streams, like this:

String mostPopularWord() {
    return terms.stream()
            .collect(Collectors.groupingBy(s -> s, Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .findFirst()
            .map(Map.Entry::getKey)
            .orElse("");
}
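
To make the snippet above easier to try out, here is a minimal, self-contained sketch of the same idea. The class name PopularWordDemo and the sample terms are made up for illustration, and the method takes the list as a parameter instead of reading the terms field. It also swaps the full sort for a single max() pass, which is an alternative to the answer's code and not a correction of it. Note that groupingBy rejects null keys, so the list is assumed to contain no nulls.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class PopularWordDemo {

    // Returns the most frequent term, or "" if the list is empty.
    // If several terms share the highest count, which one wins is unspecified.
    static String mostPopularWord(List<String> terms) {
        // Count how often each term occurs.
        Map<String, Long> counts = terms.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Pick the entry with the largest count in one pass (no full sort).
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("");
    }

    public static void main(String[] args) {
        List<String> terms = List.of("java", "twitter", "java", "api", "java", "api");
        System.out.println(mostPopularWord(terms)); // prints "java"
    }
}
```

If the frequency is needed as well, the same counts map can be queried with counts.get(word) after the winner is known.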

Answer 2

I tried your code. Here is what you need to do. Replace the following part of removeCommonEnglishWords():

Iterator<String> termIt = terms.iterator();
while(termIt.hasNext())
{
   String term = termIt.next();
   for(String word : commonWords)
      if(term.equalsIgnoreCase(word))
         termIt.remove();
}

with this:

 List<String> reducedTerms = new ArrayList<>();
 for( String term : this.terms ) {
     if( !commonWords.contains( term ) ) reducedTerms.add( term );
 }

 this.terms = reducedTerms;

Since you did not provide the class, I created one with some assumptions, but I think this code will work.
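
One thing to watch for: List.contains compares with equals, so the replacement is case-sensitive, whereas the original loop used equalsIgnoreCase. A term such as "The" would survive if the word list only contains "the". The following is a rough sketch of a case-insensitive variant. The class CommonWordFilter and the method removeCommonWords are hypothetical names, and the method takes both lists as parameters so that it can run on its own.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CommonWordFilter {

    // Returns a new list with the common words removed.
    // Matching ignores case and surrounding whitespace; null and blank terms are dropped too.
    static List<String> removeCommonWords(List<String> terms, Collection<String> commonWords) {
        // Normalize the common words once so each lookup is a cheap set membership test.
        Set<String> stopWords = new HashSet<>();
        for (String word : commonWords) {
            stopWords.add(word.trim().toLowerCase(Locale.ROOT));
        }

        List<String> reducedTerms = new ArrayList<>();
        for (String term : terms) {
            if (term == null) {
                continue;
            }
            String normalized = term.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty() && !stopWords.contains(normalized)) {
                reducedTerms.add(term); // keep the term in its original form
            }
        }
        return reducedTerms;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("The", "cat", "the", "sat", "THE");
        System.out.println(removeCommonWords(terms, List.of("the"))); // prints [cat, sat]
    }
}
```

Building a new list also avoids calling remove on the iterator more than once per element, which the original inner loop could do if the word list contained duplicates.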

Answer 3

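Here is a slightly different approach using streams.

It uses the relatively common frequency-count idiom with streams and collects the counts into a map.
It then does a simple scan to find the largest count and returns either that word or the string "No words found".
It also filters out the words in a Set<String> named ignore, so you will need to create that as well.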


           import java.util.Comparator;
           import java.util.Map;
           import java.util.Map.Entry;
           import java.util.Set;
           import java.util.stream.Collectors;

            Set<String> ignore = Set.of("the", "of", "and", "a",
            "to", "in", "is", "that", "it", "he", "was",
            "you", "for", "on", "are", "as", "with",
            "his", "they", "at", "be", "this", "have",
            "via", "from", "or", "one", "had", "by",
            "but", "not", "what", "all", "were", "we",
            "RT", "I", "&", "when", "your", "can",
            "said", "there", "use", "an", "each",
            "which", "she", "do", "how", "their", "if",
            "will", "up", "about", "out", "many",
            "then", "them", "these", "so", "some",
            "her", "would", "make", "him", "into",
            "has", "two", "go", "see", "no", "way",
            "could", "my", "than", "been", "who", "its",
            "did", "get", "may", "…", "@", "??", "I'm",
            "me", "u", "just", "our", "like");


            Map.Entry<String, Long> entry = terms.stream()
                // trim first so that padded words still match the ignore set
                .map(String::trim)
                .filter(wd -> !ignore.contains(wd))
                .collect(Collectors.groupingBy(a -> a,
                        Collectors.counting()))
                .entrySet().stream()
                .collect(Collectors.maxBy(Comparator
                        .comparing(Entry::getValue)))
                .orElse(Map.entry("No words found", 0L));


              System.out.println(entry.getKey() + " " + entry.getValue());
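
Instead of hard-coding the ignore set, it could be built from the commonWords.txt file mentioned in the question. The sketch below is only an illustration: the class IgnoreSetLoader and both method names are hypothetical, and it assumes the file holds one word per line in UTF-8. Unlike a catch block that only prints a message, a missing or unreadable file surfaces as an IOException here, which makes an empty word list harder to overlook.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class IgnoreSetLoader {

    // Turns raw lines (assumed: one word per line) into a lower-cased set,
    // trimming whitespace and skipping blank lines.
    static Set<String> toIgnoreSet(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(line -> line.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    // Reads the word list from disk. A missing or unreadable file raises
    // IOException instead of silently producing an empty set.
    static Set<String> loadIgnoreSet(Path file) throws IOException {
        return toIgnoreSet(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

A call such as IgnoreSetLoader.loadIgnoreSet(Path.of("commonWords.txt")) would then produce the set. Because the entries are lower-cased, each term should be lower-cased as well before it is checked with ignore.contains.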
