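TermRangeQuery does not behave the way I expect. I am new to Lucene and also new to Java, so either I don't understand what my code should produce, or I have made some ugly mistake. Here is the code (you can try it at https://repl.it/@Tekener/AstonishingAridWatch):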
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

@SuppressWarnings("deprecation")
class Main {
    private static IndexSearcher indexSearcher;
    private static IndexReader indexReader;
    private static String separatorLine = "===========================";

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        Directory directory = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, config);

        System.out.println(separatorLine);
        System.out.println("Building the index:");
        indexWriter.addDocument(createDocumentWithFields("1st", "Humpty Dumpty sat on a wall,"));
        indexWriter.addDocument(createDocumentWithFields("2nd", "Humpty Dumpty had a great fall."));
        indexWriter.addDocument(createDocumentWithFields("3rd", "All the king's horses and all the king'smen"));
        indexWriter.addDocument(createDocumentWithFields("4th", "Couldn't put Humpty together again."));
        System.out.println(separatorLine);
        indexWriter.commit();
        indexWriter.close();

        indexReader = DirectoryReader.open(directory);
        indexSearcher = new IndexSearcher(indexReader);

        showQueryResult(1, TermRangeQuery.newStringRange("content", "a", "h", true, true));
        showQueryResult(2, TermRangeQuery.newStringRange("content", "A", "H", true, true));
        showQueryResult(3, TermRangeQuery.newStringRange("content", "a", "f", true, true));
        showQueryResult(4, TermRangeQuery.newStringRange("content", "A", "F", true, true));
    }

    private static void showQueryResult(int queryNo, Query query) throws IOException {
        System.out.println(String.format("Query #%d: %s", queryNo, query.toString()));
        TopDocs topDocs = indexSearcher.search(query, 100);
        System.out.println("Result:");
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document doc = indexReader.document(scoreDoc.doc);
            System.out.println(String.format("name: %s - content: %s", doc.getField("name").stringValue(), doc.getField("content").stringValue()));
        }
        System.out.println(separatorLine);
    }

    private static Document createDocumentWithFields(String name, String content) {
        System.out.println(String.format("name: %s - content: %s", name, content));
        Document doc = new Document();
        doc.add(new StringField("name", name, Store.YES));
        doc.add(new TextField("content", content, Store.YES));
        return doc;
    }
}
This is the console output:
===========================
Building the index:
name: 1st - content: Humpty Dumpty sat on a wall,
name: 2nd - content: Humpty Dumpty had a great fall.
name: 3rd - content: All the king's horses and all the king'smen
name: 4th - content: Couldn't put Humpty together again.
===========================
Query #1: content:[a TO h]
Result:
name: 1st - content: Humpty Dumpty sat on a wall,
name: 2nd - content: Humpty Dumpty had a great fall.
name: 3rd - content: All the king's horses and all the king'smen
name: 4th - content: Couldn't put Humpty together again.
===========================
Query #2: content:[A TO H]
Result:
===========================
Query #3: content:[a TO f]
Result:
name: 1st - content: Humpty Dumpty sat on a wall,
name: 2nd - content: Humpty Dumpty had a great fall.
name: 3rd - content: All the king's horses and all the king'smen
name: 4th - content: Couldn't put Humpty together again.
===========================
Query #4: content:[A TO F]
Result:
===========================
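My conclusion: if the indexed text (for the "content" field) is stored as lower-case strings, then the results of queries #1, #2 and #4 could be correct. But if that is the case, the result of query #3 is wrong: query #3 should find only the 3rd and 4th entries. Where is my mistake?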
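The standard analyzer uses the lower case filter, so yes, all of the indexed data will be lower case:
Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a configurable list of stop words.
Also, bear this in mind: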
TermRangeQuery.newStringRange("content", "a", "f", true, true);
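means that the range is inclusive (the two true values): it includes "a" and "f" themselves.
The range is tested against each indexed term of the field, not against the field's whole text. So the "a" in "a great fall" is a match (and so is "dumpty"). That is why all 4 results are found in query #3. Change the third search to something like the following to see the effect: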
TermRangeQuery.newStringRange("content", "a", "b", true, true);
TermRangeQuery.newStringRange("content", "a", "b", false, false);
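To make the inclusive versus exclusive difference concrete without a Lucene dependency, here is a minimal plain-Java sketch. It is not Lucene code: the class TermRangeSketch and its helpers terms and matches are hypothetical names, the tokenization is only a rough stand-in for what the standard analyzer does, and String.compareTo is assumed to agree with Lucene's term ordering for these ASCII-only terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TermRangeSketch {

    // Rough approximation of the standard analyzer (an assumption, not the real
    // tokenizer): split on anything that is not a letter, digit or apostrophe,
    // then lower-case each token.
    static List<String> terms(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}']+")) {
            if (!token.isEmpty()) {
                result.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    // A document matches if at least one of its terms falls inside the range.
    static boolean matches(String text, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        for (String term : terms(text)) {
            int lo = term.compareTo(lower);
            int hi = term.compareTo(upper);
            boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
            boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
            if (aboveLower && belowUpper) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Humpty Dumpty sat on a wall,",
            "Humpty Dumpty had a great fall.",
            "All the king's horses and all the king'smen",
            "Couldn't put Humpty together again."
        };
        for (String doc : docs) {
            System.out.println(doc
                + " | inclusive a..b: " + matches(doc, "a", "b", true, true)
                + " | exclusive a..b: " + matches(doc, "a", "b", false, false));
        }
    }
}
```

In this sketch the first two documents match only the inclusive range, because their single qualifying term is "a" itself. The last two match both ranges through "all", "and" and "again".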
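The next point is not strictly related to your question, but it may be useful. In general, you search with the same analyzer that you used when indexing the data (there are exceptions). For example, it is very common to want search terms to match case-insensitively. Running the search terms through the standard analyzer achieves that.
There are various ways to do this. Here is one; there may well be slicker ways: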
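// Needs: import org.apache.lucene.queryparser.classic.QueryParser;
// parse() throws org.apache.lucene.queryparser.classic.ParseException,
// so the calling method has to declare or handle it.
QueryParser parser = new QueryParser("content", analyzer);
Query q1 = TermRangeQuery.newStringRange("content", "b", "h", true, true);
Query query1 = parser.parse(q1.toString());
showQueryResult(1, query1);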
Given the points above, the results should now make sense.
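As an illustration of why queries #2 and #4 come back empty, and of what lower-casing the search terms changes, here is a small plain-Java sketch. It does not use Lucene at all: RangeBoundsSketch, inRange and normalize are hypothetical names, and lower-casing with Locale.ROOT is only an approximation of what an analyzer does to a search term.

```java
import java.util.Locale;

class RangeBoundsSketch {

    // Lexicographic, inclusive range check on a single term.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Hypothetical helper: lower-case a bound so that it lines up with
    // the lower-cased terms stored in the index.
    static String normalize(String bound) {
        return bound.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Dumpty" as it would be indexed after lower-casing.
        String term = "dumpty";

        // Upper-case bounds: every lower-case letter sorts after "H".
        System.out.println(inRange(term, "A", "H"));                       // false

        // Normalized bounds: the term now falls between "a" and "h".
        System.out.println(inRange(term, normalize("A"), normalize("H"))); // true
    }
}
```

Upper-case ASCII letters sort before lower-case ones, so a range bounded by "A" and "H" can never contain a lower-cased term.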
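If you want to explore what has actually been indexed, I recommend switching to this:
import java.nio.file.Paths;
import org.apache.lucene.store.MMapDirectory;
and something like this: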
Directory directory = new MMapDirectory(Paths.get("E:/lucene/indexes/range_queries"));
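In any case, RAMDirectory is not generally recommended, except perhaps for demos.
Once the data is stored on disk, you can point Luke at it. Luke is a very useful tool (with a GUI) for browsing index data. It is provided as a JAR file (lucene-luke-8.x.x.jar), which you can find in the main Lucene binary distribution.
Edit: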
If you use an on-disk directory such as MMapDirectory, the index survives between runs, so you may also need to use this:
if (!DirectoryReader.indexExists(directory)) {
// index builder logic here
}
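This avoids re-populating the index with duplicate data.
Regarding stop words: in Lucene 8, the standard analyzer's stop word list is empty by default. You can supply a list of words to the constructor as an org.apache.lucene.analysis.CharArraySet: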
import org.apache.lucene.analysis.CharArraySet;
...
CharArraySet myStopWords = new CharArraySet(2, true);
myStopWords.add("foo");
myStopWords.add("bar");
Analyzer analyzer = new StandardAnalyzer(myStopWords);
Or you can use one of the existing stop word lists. Here is the English one:
import static org.apache.lucene.analysis.en.EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
...
Analyzer analyzer = new StandardAnalyzer(ENGLISH_STOP_WORDS_SET);
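To see what a stop word list does to the terms that end up in the index, here is a plain-Java sketch that does not depend on Lucene. Everything in it is illustrative: StopWordSketch and indexedTerms are hypothetical names, the tokenization only approximates the standard analyzer, and the two-word stop list is made up for the example and is not Lucene's English list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {

    // Approximate analysis chain (an assumption): tokenize, lower-case,
    // then drop every term that appears in the stop word set.
    static List<String> indexedTerms(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}']+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !stopWords.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "Humpty Dumpty sat on a wall,";

        // Empty stop list: every token is kept, including "a".
        Set<String> noStopWords = new HashSet<>();
        System.out.println(indexedTerms(text, noStopWords));
        // [humpty, dumpty, sat, on, a, wall]

        // Made-up stop list containing "a" and "on".
        Set<String> someStopWords = new HashSet<>(Arrays.asList("a", "on"));
        System.out.println(indexedTerms(text, someStopWords));
        // [humpty, dumpty, sat, wall]
    }
}
```

Once "a" is filtered out, this document no longer contains a term between "a" and "b", so an inclusive range over those bounds would stop matching it.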