Apache Tika Server v2 does not expose the async or pipes endpoints


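My goal is to use Tika Server to parse a variety of file types asynchronously, given S3 source and destination URLs. Using this guide as a starting point, I am running Tika Server (2.9.2) locally with Docker, but I don't see any /async or /pipes endpoints. I don't expect them to work locally without a bucket, which is fine, but I would expect the endpoints to at least show up. Here is their documentation on tika-pipes.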

These are the only logs I get at startup, and both /async and /pipes return 404. The main home page looks fine, but it doesn't list the routes I'm looking for either.
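
The 404 check can be repeated with a small probe. This is only a sketch, assuming the server is reachable on the port 9998 published by the docker run command in this question; `http_status` and `probe_endpoints` are hypothetical helper names. It also assumes that a registered route answers a plain GET with some status other than 404 (for example 405 for a wrong method), which I have not verified against Tika.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is still what we want.
        return err.code


def probe_endpoints(base_url, paths, get_status=http_status):
    """Map each path to True if the server answers with anything but 404."""
    base = base_url.rstrip("/")
    return {path: get_status(base + path) != 404 for path in paths}


# Example call (assumed address, needs a running server):
# print(probe_endpoints("http://localhost:9998", ["/async", "/pipes"]))
```

The `get_status` parameter exists so the probe can be exercised without a running server.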

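I assume that either I need to expose these endpoints explicitly, or the server isn't picking up the jars I added and so doesn't load the endpoints automatically. Or perhaps there is some other configuration file I don't understand.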

Any pointers would be appreciated!

My tika-config.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
  </parsers>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
  <pipes>
    <params>
      <tikaConfig>./config/tika-config.xml</tikaConfig>
    </params>
  </pipes>
  <async>
    <params>
      <timeoutMillis>1000000</timeoutMillis>
    </params>
  </async>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
      <params>
        <name>s3f</name>
        <region>us-east-1</region>
        <bucket>tika-bucket</bucket>
        <credentialsProvider>instance</credentialsProvider>
        <spoolToTemp>false</spoolToTemp>
        <extractUserMetadata>false</extractUserMetadata>
        <maxConnections>100</maxConnections>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
      <params>
        <param name="name" type="string">s3e</param>
        <param name="region" type="string">us-east-1</param>
        <param name="credentialsProvider" type="string">instance</param>
        <param name="bucket" type="string">tika-bucket</param>
        <param name="fileExtension" type="string">json</param>
        <param name="spoolToTemp" type="bool">true</param>
      </params>
    </emitter>
  </emitters>
</properties>

Dockerfile (I removed a bunch of lines from this part that just download dependencies; they are in the tutorial linked above):

FROM ubuntu:focal as base
RUN apt-get update

ENV TIKA_VERSION 2.9.2
ENV TIKA_SERVER_JAR tika-server-standard

FROM base as dependencies

RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install gdal-bin tesseract-ocr \
        tesseract-ocr-eng curl gnupg

# Set this environment variable if you need to run OCR
ENV OMP_THREAD_LIMIT=1

RUN apt-get -y install openjdk-17-jdk

FROM dependencies as fetch_tika

# download all the tika dependencies (removed those lines of code for this question)

ENV TIKA_VERSION=$TIKA_VERSION
RUN mkdir /tika-bin
COPY --from=fetch_tika /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar /tika-bin/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar
# The extra dependencies need to be added into tika-bin together with the tika-server jar
COPY --from=fetch_tika /tika-fetcher-s3-${TIKA_VERSION}.jar /tika-bin/tika-fetcher-s3-${TIKA_VERSION}.jar
COPY --from=fetch_tika /tika-emitter-s3-${TIKA_VERSION}.jar /tika-bin/tika-emitter-s3-${TIKA_VERSION}.jar
RUN mkdir /config
COPY tika-config.xml /config

EXPOSE 9998
ENTRYPOINT [ "/bin/sh", "-c", "exec java -cp \"/tika-bin/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $0 $@"]

Then build and run:

docker build --tag 'tika_server_local' .

docker run -d \
    --name tika_container \
    -v tika_dir:/config \
    -p 9998:9998 tika_server_local:latest \
    -c ./config/tika-config.xml

Tags: java, parsing, apache-tika, tika-server

1 Answer

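I solved this myself after finding that when I forced /async and /pipes to be enabled, the server threw an explicit error saying it could not parse the emitter configuration. My guess is that when these endpoints are not forced, the error is swallowed and the endpoints are simply not enabled.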

Forcing them on:

  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
      <endpoints>
        <endpoint>pipes</endpoint>
        <endpoint>async</endpoint>
      </endpoints>
    </params>
  </server>

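I fixed the error by rewriting the emitter configuration to look more like the parser configuration, using plain child elements instead of <param name="..."> entries (my fault, again, for blindly following the tutorial).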

<emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
  <params>
    <name>s3e</name>
    <region>us-east-1</region>
    <bucket>tika-bucket</bucket>
    <credentialsProvider>instance</credentialsProvider>
    <spoolToTemp>true</spoolToTemp>
    <maxConnections>100</maxConnections>
    <fileExtension>json</fileExtension>
  </params>
</emitter>
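
The difference between the two emitter blocks is easy to miss in a long config. As a rough aid, the sketch below flags any fetcher or emitter whose params still use the `<param name="...">` form that failed to parse for me. `find_attribute_style_params` is a hypothetical helper, not part of Tika, and it encodes only what I observed on 2.9.2; it says nothing about which forms Tika accepts in general.

```python
import xml.etree.ElementTree as ET


def find_attribute_style_params(config_xml):
    """Return (class, param name) pairs written as <param name="...">.

    Only <fetcher> and <emitter> elements are inspected.
    """
    root = ET.fromstring(config_xml)
    hits = []
    for tag in ("fetcher", "emitter"):
        for component in root.iter(tag):
            # Look only at direct <params>/<param> children of this component.
            for param in component.iterfind("./params/param"):
                hits.append((component.get("class"), param.get("name")))
    return hits


# Example call (path is an assumption):
# with open("tika-config.xml", encoding="utf-8") as fh:
#     print(find_attribute_style_params(fh.read()))
```

An empty list means no fetcher or emitter uses the attribute-style form.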

Finally, one thing missing from the guide: in the JSON body of my request, I needed to add an id field to each object, as shown below.

[
    {
        "id": "1234",
        "fetcher": "s3f",
        "fetchKey": "presignedS3GetRequest",
        "emitter": "s3e",
        "emitKey": "presignedS3PutRequest"
    }
]
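
A minimal sketch of building and sending such a body, assuming the server is reachable at http://localhost:9998 and that the batch is POSTed to /async as JSON. `build_async_batch` and `post_batch` are hypothetical helpers, the generated UUIDs are just one way to guarantee an id, and the fetch and emit keys in the example call are placeholders.

```python
import json
import urllib.request
import uuid


def build_async_batch(items, fetcher="s3f", emitter="s3e"):
    """Build the JSON array for the request; every entry gets an "id"."""
    batch = []
    for item in items:
        batch.append({
            # Reuse the caller's id if given, otherwise generate one.
            "id": item.get("id") or str(uuid.uuid4()),
            "fetcher": fetcher,
            "fetchKey": item["fetchKey"],
            "emitter": emitter,
            "emitKey": item["emitKey"],
        })
    return json.dumps(batch)


def post_batch(body, url="http://localhost:9998/async"):
    """POST the batch; the URL is an assumption, adjust to your setup."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read().decode("utf-8")


# Example call (placeholder keys, needs a running server):
# body = build_async_batch([{"fetchKey": "in/doc.pdf", "emitKey": "out/doc"}])
# print(post_batch(body))
```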