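My goal is to use Tika Server to parse various file types asynchronously, given S3 source and destination URLs. Using this guide as a starting point, I have Tika Server (2.9.2) running locally with Docker, but I don't see any /async or /pipes endpoints. I don't expect them to work locally without a bucket, which is fine, but I would expect the endpoints to at least show up. Here is their documentation on tika-pipes.
These are the only logs I get at startup, and both the /async and /pipes endpoints return 404. The main landing page looks fine, but it doesn't list the routes I'm looking for either.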
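For anyone who wants to reproduce the 404 check, here is a minimal sketch in Python. The `probe` helper is a hypothetical name of mine, not part of Tika, and it assumes the container is published on localhost:9998 as in the run command in this post. It only reports the HTTP status for each path.

```python
import urllib.error
import urllib.request


def probe(base_url, paths, timeout=5.0):
    """Return {path: HTTP status code, or None if the server was unreachable}."""
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses still carry a status code we can report.
            results[path] = err.code
        except (urllib.error.URLError, OSError):
            # Connection refused, DNS failure, timeout, and similar.
            results[path] = None
    return results
```

Calling `probe("http://localhost:9998", ["/", "/async", "/pipes"])` sends a plain GET to each path. A 404 means the route is not registered. I would expect a registered POST-only route to answer with some other status (such as 405), but I have not verified that against Tika specifically.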
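I assume that either I need to expose these endpoints explicitly, or the server isn't picking up the jars I'm bringing in and therefore isn't loading them automatically. Or maybe there is some other config file I don't understand.
Any pointers would be appreciated!
My tika-config.xml: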
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
</parser>
</parsers>
<server>
<params>
<enableUnsecureFeatures>true</enableUnsecureFeatures>
</params>
</server>
<pipes>
<params>
<tikaConfig>./config/tika-config.xml</tikaConfig>
</params>
</pipes>
<async>
<params>
<timeoutMillis>1000000</timeoutMillis>
</params>
</async>
<fetchers>
<fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
<params>
<name>s3f</name>
<region>us-east-1</region>
<bucket>tika-bucket</bucket>
<credentialsProvider>instance</credentialsProvider>
<spoolToTemp>false</spoolToTemp>
<extractUserMetadata>false</extractUserMetadata>
<maxConnections>100</maxConnections>
</params>
</fetcher>
</fetchers>
<emitters>
<emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
<params>
<param name="name" type="string">s3e</param>
<param name="region" type="string">us-east-1</param>
<param name="credentialsProvider" type="string">instance</param>
<param name="bucket" type="string">tika-bucket</param>
<param name="fileExtension" type="string">json</param>
<param name="spoolToTemp" type="bool">true</param>
</params>
</emitter>
</emitters>
</properties>
Dockerfile (I removed a bunch of lines from this part that just download dependencies; they are in the tutorial linked above):
FROM ubuntu:focal as base
RUN apt-get update
ENV TIKA_VERSION 2.9.2
ENV TIKA_SERVER_JAR tika-server-standard
FROM base as dependencies
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install gdal-bin tesseract-ocr \
tesseract-ocr-eng curl gnupg
# Set this environment variable if you need to run OCR
ENV OMP_THREAD_LIMIT=1
RUN apt-get -y install openjdk-17-jdk
FROM dependencies as fetch_tika
# download all the tika dependencies (removed those lines of code for this question)
ENV TIKA_VERSION=$TIKA_VERSION
RUN mkdir /tika-bin
COPY --from=fetch_tika /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar /tika-bin/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar
# The extra dependencies need to be added into tika-bin together with the tika-server jar
COPY --from=fetch_tika /tika-fetcher-s3-${TIKA_VERSION}.jar /tika-bin/tika-fetcher-s3-${TIKA_VERSION}.jar
COPY --from=fetch_tika /tika-emitter-s3-${TIKA_VERSION}.jar /tika-bin/tika-emitter-s3-${TIKA_VERSION}.jar
RUN mkdir /config
COPY tika-config.xml /config
EXPOSE 9998
ENTRYPOINT [ "/bin/sh", "-c", "exec java -cp \"/tika-bin/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $0 $@"]
Then build and run:
docker build --tag 'tika_server_local' .
docker run -d \
--name tika_container \
-v tika_dir:/config \
-p 9998:9998 tika_server_local:latest \
-c ./config/tika-config.xml
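I solved this myself after discovering that when I force-enabled /async and /pipes, the server threw an explicit error saying it could not parse the emitter config. My guess is that when these endpoints are not forced, the error is simply swallowed and the endpoints are not enabled.
Force-enabling the endpoints: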
<server>
<params>
<enableUnsecureFeatures>true</enableUnsecureFeatures>
<endpoints>
<endpoint>pipes</endpoint>
<endpoint>async</endpoint>
</endpoints>
</params>
</server>
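I fixed the error by rewriting the emitter config to look more like the parser config, with each setting as its own element (my fault again for blindly following the tutorial).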
<emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
<params>
<name>s3e</name>
<region>us-east-1</region>
<bucket>tika-bucket</bucket>
<credentialsProvider>instance</credentialsProvider>
<spoolToTemp>true</spoolToTemp>
<maxConnections>100</maxConnections>
<fileExtension>json</fileExtension>
</params>
</emitter>
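Since the server swallowed this error unless the endpoints were forced, a quick pre-flight check on the config file can help. The sketch below is not a Tika validator. It only looks for the one pattern that bit me, `<param name="...">` children under a fetcher's or emitter's `<params>`, and `find_param_style_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def find_param_style_entries(config_xml):
    """List (component class, param name) pairs written as <param name="...">.

    Only inspects <fetchers>/<fetcher> and <emitters>/<emitter> directly
    under the root <properties> element.
    """
    root = ET.fromstring(config_xml)
    hits = []
    for section, tag in (("fetchers", "fetcher"), ("emitters", "emitter")):
        for component in root.findall(f"./{section}/{tag}"):
            for param in component.findall("./params/param"):
                hits.append((component.get("class"), param.get("name")))
    return hits
```

For a file on disk, something like `find_param_style_entries(pathlib.Path("tika-config.xml").read_bytes())` should work. An empty list means no `<param name="...">` entries were found in those two sections, which says nothing about whether the rest of the config is valid.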
Finally, one thing missing from the guide: in the JSON body of my request, I needed to add an id field to each object, as shown below.
[
{
"id": "1234",
"fetcher": "s3f",
"fetchKey": "presignedS3GetRequest",
"emitter": "s3e",
"emitKey": "presignedS3PutRequest"
}
]
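To show how a body like the one above could be built and sent from code, here is a minimal Python sketch. The helper names `build_async_body` and `make_async_request` are hypothetical. The fetcher and emitter names default to the `s3f` and `s3e` values from my config, and I am assuming the endpoint takes a POST with a JSON content type.

```python
import json
import urllib.request


def build_async_body(jobs, fetcher="s3f", emitter="s3e"):
    """Serialize (id, fetch_key, emit_key) tuples into the JSON array format."""
    entries = []
    for job_id, fetch_key, emit_key in jobs:
        entries.append({
            "id": job_id,  # the field the guide left out
            "fetcher": fetcher,
            "fetchKey": fetch_key,
            "emitter": emitter,
            "emitKey": emit_key,
        })
    return json.dumps(entries).encode("utf-8")


def make_async_request(base_url, body):
    """Build (but do not send) a POST request for the /async endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/async",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request is only constructed here. Passing it to `urllib.request.urlopen(...)` would actually submit it to a running server.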