How to view a Lucene index

Question

I am trying to learn and understand how Lucene works and what is inside a Lucene index. Basically, I want to see how data is represented in the index.

I am using lucene-core 8.6.0 as a dependency.

Below is my very basic Lucene code:

private Document create(File file) throws IOException {
    Document document = new Document();

    Field field = new Field("contents", new FileReader(file), TextField.TYPE_NOT_STORED);
    Field fieldPath = new Field("path", file.getAbsolutePath(), TextField.TYPE_STORED);
    Field fieldName = new Field("name", file.getName(), TextField.TYPE_STORED);

    document.add(field);
    document.add(fieldPath);
    document.add(fieldName);

    // Create the analyzer
    Analyzer analyzer = new StandardAnalyzer();

    // Create the IndexWriter, passing it the analyzer
    Path indexPath = Files.createTempDirectory("tempIndex");
    Directory directory = FSDirectory.open(indexPath);
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, indexWriterConfig);
    iwriter.addDocument(document);
    iwriter.close();
    return document;
}

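Note: I understand the theory behind Lucene (inverted indexes), but I do not understand how the Lucene library applies this concept, or how it lays out the files that make searching simple and practical.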

I tried Limo, but it did not help. Even after I set the index location in the web.xml file, it did not work.

Answer 1

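If you want a good introductory code example that uses a current version of Lucene (building an index and then using it), you can start with the basic demo (pick your version; this link is for Lucene 8.6).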

The source code for the demo (using the latest version of Lucene) can be found on GitHub.

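If you want to explore the index data after you have created it, you can use Luke. If you have not used it before: to run Luke, download the binary release from the main download page. Unzip the file, then navigate to the luke directory and run the relevant script (luke.bat or luke.sh).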

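(The only version of the LIMO tool I could find is this one on SourceForge. Given that it dates from 2007, it is almost certainly no longer compatible with current Lucene index files. Maybe there is a more recent version somewhere.)

If you want an overview of the files in a typical Lucene index, you can start here. Many specific questions can be answered by looking at the API documentation for the relevant packages and classes.

Personally, I also find the Solr and ElasticSearch documentation very useful for explaining specific concepts, which often relate directly to Lucene.

Beyond that, I do not worry too much about how Lucene manages its internal index data structures. Instead, I focus on the different types of analyzers and queries that can be used to access that data.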
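As a companion to the file overview mentioned above, here is a minimal sketch that lists the files in an index directory and labels each one by its extension. It uses only the JDK, not Lucene. The class name `IndexFileLister` is hypothetical, and the role table is a deliberately incomplete summary of the Lucene 8.x file-format documentation, so check it against the documentation for your version. Small indexes are often written as compound files, in which case you may see little more than `.cfs`, `.cfe`, and `.si` files plus a `segments_N` file.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class IndexFileLister {

    // Extension -> role. Incomplete on purpose; an assumption-laden summary,
    // not an authoritative list for every Lucene version.
    private static final Map<String, String> ROLES = new LinkedHashMap<>();
    static {
        ROLES.put("si", "segment info");
        ROLES.put("cfs", "compound file (data)");
        ROLES.put("cfe", "compound file (entry table)");
        ROLES.put("fnm", "field infos");
        ROLES.put("fdt", "stored field data");
        ROLES.put("fdx", "stored field index");
        ROLES.put("tim", "term dictionary");
        ROLES.put("tip", "term index");
        ROLES.put("doc", "postings: doc ids and frequencies");
        ROLES.put("pos", "postings: positions");
        ROLES.put("nvd", "norms data");
        ROLES.put("nvm", "norms metadata");
        ROLES.put("scf", "SimpleTextCodec compound file");
    }

    /** Returns a short description of an index file based on its name. */
    static String describe(String fileName) {
        if (fileName.startsWith("segments_")) {
            return "commit point (list of segments)";
        }
        if (fileName.equals("write.lock")) {
            return "writer lock";
        }
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1);
        return ROLES.getOrDefault(ext, "unknown / not in this table");
    }

    /** Maps every file name in the directory to its description, sorted by name. */
    static Map<String, String> listIndexFiles(Path indexDir) throws IOException {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path p : files) {
                String name = p.getFileName().toString();
                result.put(name, describe(name));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Pass the index directory as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        listIndexFiles(dir).forEach((name, role) -> System.out.println(name + "  ->  " + role));
    }
}
```

Run it with the path of the index directory as the only argument, for example the temporary directory created by the code in the question.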

Update: SimpleTextCodec

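It is now a few months later, but here is another way to explore Lucene index data: SimpleTextCodec. The standard Lucene codecs (which control how data is written to and read from index files) use binary formats, so they are not human-readable. You cannot just open an index file and see what is in it.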

However, if you change the codec to SimpleTextCodec, Lucene will create plain-text index files in which you can see the structure more clearly.

This codec is purely for informational/educational use and should not be used in production.

To use the codec, you first need to include the relevant dependency, for example:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-codecs</artifactId>
    <version>8.7.0</version>
</dependency>

Now you can use this new codec as follows:

iwc.setCodec(new SimpleTextCodec());

So, for example:

final String indexPath = "/path/to/index_dir";
final String docsPath = "/path/to/inputs_dir";
final Path docDir = Paths.get(docsPath);

Directory dir = FSDirectory.open(Paths.get(indexPath));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
System.out.println(iwc.getCodec().getName());

try (IndexWriter writer = new IndexWriter(dir, iwc)) {
    // read documents, and write index data:
    indexDocs(writer, docDir);
}

You are now free to inspect the resulting index files in a text editor (such as Notepad++).

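In my case, the index data produced several files, but the one I was interested in was my *.scf file: a "compound" file containing various "virtual file" sections, in which the human-readable index data is stored.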
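When reading those text files, it can help to have the shape of an inverted index in mind: each term maps to the documents that contain it, along with a frequency and the positions of the term in each document. The following is a conceptual sketch only, written in plain Java without Lucene. `ToyInvertedIndex` is a hypothetical class, its tokenizer is a crude stand-in for a real analyzer, and its `dump()` layout is merely reminiscent of the kind of term/doc/freq/pos listing you can expect to find in the plain-text files. It is not Lucene's actual implementation or exact file format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ToyInvertedIndex {

    // term -> (docId -> positions of the term in that document); both maps are sorted
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns the id assigned to it. */
    public int addDocument(String text) {
        int docId = nextDocId++;
        // Crude stand-in for an analyzer: lowercase, then split on non-letters/digits.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with punctuation
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
        return docId;
    }

    /** Returns the ids of all documents containing the term, in ascending order. */
    public List<Integer> docsContaining(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs.keySet());
    }

    /** Renders the whole index as indented text: term, then doc, freq and positions. */
    public String dump() {
        StringBuilder sb = new StringBuilder();
        postings.forEach((term, docs) -> {
            sb.append("term ").append(term).append('\n');
            docs.forEach((docId, positions) -> {
                sb.append("  doc ").append(docId).append('\n');
                sb.append("    freq ").append(positions.size()).append('\n');
                for (int pos : positions) {
                    sb.append("    pos ").append(pos).append('\n');
                }
            });
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Lucene is a search library");
        index.addDocument("Search the index with Lucene");
        System.out.println(index.docsContaining("lucene"));
        System.out.print(index.dump());
    }
}
```

A search for a single term is then just a lookup in the term map, which is why this layout makes searching cheap. Lucene adds a great deal on top of this idea (segments, compression, scoring data), so treat the sketch as a reading aid rather than a description of its internals.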

Answer 2

Another option is I-Rex, which was developed for information retrieval research. Here is the link:

https://github.com/souravsaha/I-REX/tree/shell-lucene8

Feel free to add to or edit the code.
