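PyLucene indexer and retriever sample

Question  Votes: 2  Answers: 2

I am new to Lucene. I want to write sample code for PyLucene 6.5 in Python 3. I adapted this sample code to that version. However, I could not find documentation, and I am not sure whether my changes are correct.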

# indexer.py
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, StringField, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory, FSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()
    indexPath = File("index/").toPath()
    indexDir = FSDirectory.open(indexPath)
    writerConfig = IndexWriterConfig(StandardAnalyzer())
    writer = IndexWriter(indexDir, writerConfig)

    print("%d docs in index" % writer.numDocs())
    print("Reading lines from sys.stdin...")

    tft = FieldType()
    tft.setStored(True)
    tft.setTokenized(True)
    n = 0  # stays 0 if stdin is empty
    for n, l in enumerate(sys.stdin, 1):  # count lines starting at 1
        doc = Document()
        doc.add(Field("text", l, tft))
        writer.addDocument(doc)
    print("Indexed %d lines from stdin (%d docs in index)" % (n, writer.numDocs()))
    print("Closing index of %d docs..." % writer.numDocs())
    writer.close()

This code reads lines from standard input and stores them in the index directory.

# retriever.py
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.index import IndexReader, DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory, FSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()
    analyzer = StandardAnalyzer()
    indexPath = File("index/").toPath()
    indexDir = FSDirectory.open(indexPath)
    reader = DirectoryReader.open(indexDir)
    searcher = IndexSearcher(reader)

    query = QueryParser("text", analyzer).parse("hello")
    MAX = 1000
    hits = searcher.search(query, MAX)

    print("Found %d document(s) that matched query '%s':" % (hits.totalHits, query))
    for hit in hits.scoreDocs:
        print(hit.score, hit.doc, hit.toString())
        doc = searcher.doc(hit.doc)
        print(doc.get("text").encode("utf-8"))

We should be able to retrieve (search) documents with retriever.py, but it returns nothing. What is wrong with it?

python python-3.x lucene pylucene
2 Answers
1
vote

I think the best approach is to download the PyLucene tarball (for the version of your choice):

https://www.apache.org/dist/lucene/pylucene/

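Inside you will find a test3/ folder (for Python 3; the test2/ folder is for Python 2) containing Python tests. They cover common operations such as indexing, reading, and searching. Given the lack of documentation around PyLucene, I found them very useful.

In particular, look at test_Pylucene.py

Note

If the changelog is not intuitive enough, this is also a very good way to quickly grasp the changes and adapt your code between versions.

(Why I am not providing code in this answer: the problem with code snippets in SO answers about PyLucene is that they quickly become outdated once a new version is released, as we can see in most of the existing ones.)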


0
votes
In []: tft.indexOptions()
Out[]: <IndexOptions: NONE>

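Although DOCS_AND_FREQS_AND_POSITIONS is documented as the default, that is no longer the case. It is the default for a TextField; a plain FieldType must call setIndexOptions explicitly. With IndexOptions NONE the "text" field is stored but never indexed, which is why the search finds nothing.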
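
A minimal sketch of that fix, assuming PyLucene 6.5 as in the question and that lucene.initVM() has already been called. The helper names make_indexed_field_type and make_text_field are hypothetical and only serve the illustration; the first sets the index options on the FieldType explicitly, the second sidesteps the issue by using a TextField.

```python
def make_indexed_field_type():
    """Return a FieldType that is stored, tokenized and indexed.

    Assumption: PyLucene is installed and lucene.initVM() was called earlier.
    """
    import lucene  # noqa: F401  (must be imported before the org.apache.* modules)
    from org.apache.lucene.document import FieldType
    from org.apache.lucene.index import IndexOptions

    tft = FieldType()
    tft.setStored(True)
    tft.setTokenized(True)
    # Without this call indexOptions() stays NONE and the field is not searchable.
    tft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
    return tft


def make_text_field(name, value):
    """Alternative: build a TextField, which is indexed and tokenized by default."""
    import lucene  # noqa: F401
    from org.apache.lucene.document import Field, TextField

    # Field.Store.YES keeps the original text so doc.get(name) works at search time.
    return TextField(name, value, Field.Store.YES)
```

In indexer.py this would mean either replacing the hand-built tft with make_indexed_field_type(), or replacing doc.add(Field("text", l, tft)) with doc.add(make_text_field("text", l)). Lines that were added to the index before the change stay unsearchable until they are indexed again.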
