How to fix a corrupted translog in a Dockerized Elasticsearch instance

Question (0 votes, 2 answers)

tl;dr: When your Elasticsearch instance is Dockerized, running the elasticsearch-shard utility appears to be impossible. If that is true, how do we fix the corrupted translog errors that occasionally crash ES?

I have been running Elasticsearch (ES) locally in Docker via docker-compose without problems for a while, but today when I started it up, it began crashing with this error message:

TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/translog-175.tlog] is corrupted 
(See the end of the post for the full error message.)

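Some googling shows that this problem can be fixed by running the utility bin/elasticsearch-shard remove-corrupted-data. The problem is that ES must be shut down in order to run this utility, but ES needs to be running for the container hosting the ES instance to stay alive. That means there is no way to reach elasticsearch-shard inside the environment where the data and the Elasticsearch instance actually live.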

I have verified that the container does not stay alive when ES is stopped from the container's command line, like so:

## get into the docker container
docker exec -it 43146ff2a50c bash
## kill elasticsearch
pkill -f elasticsearch

This immediately kills the container and kicks me out of the shell.

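I then tried to see whether another Docker container, one with access to the same data volume but not based on the ES image (so that it stays alive while ES is down), could run the utility and repair the data on disk. I created a new docker-compose entry with a new Dockerfile, keeping all the settings the same but building from an ubuntu image (ignore the variables other than ES_01_DATA_VOLUME; they are not relevant):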

docker-compose.yml

 es01-truncate-corrupted-shards:
        build:
            context: .
            dockerfile: Elasticsearch.TruncateCorruptedShards.Dockerfile
            args:
                - CERTS_DIR=${CERTS_DIR}
        container_name: es01-truncate-corrupted-shards
        environment:
            - node.name=es01
            - cluster.name=es-docker-cluster
            - discovery.seed_hosts=es02,es03
            - cluster.initial_master_nodes=es01,es02,es03
            - bootstrap.memory_lock=true
            - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
            - xpack.license.self_generated.type=basic 
            - xpack.security.enabled=true
            - xpack.security.http.ssl.enabled=true 
            - xpack.security.http.ssl.key=$CERTS_DIR/es01/es01.key
            - xpack.security.http.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
            - xpack.security.http.ssl.certificate=$CERTS_DIR/es01/es01.crt
            - xpack.security.transport.ssl.enabled=true 
            - xpack.security.transport.ssl.verification_mode=certificate 
            - xpack.security.transport.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
            - xpack.security.transport.ssl.certificate=$CERTS_DIR/es01/es01.crt
            - xpack.security.transport.ssl.key=$CERTS_DIR/es01/es01.key
        ulimits:
            memlock:
                soft: -1
                hard: -1
        volumes:
            - ${ES_01_DATA_VOLUME}
            - ${CERTS_VOLUME}
        ports:
            - ${ES_01_PORT}
        mem_limit: ${SINGLE_NODE_MEM_LIMIT}

Elasticsearch.TruncateCorruptedShards.Dockerfile

FROM ubuntu:rolling

RUN apt-get update \
    && apt-get install --yes curl \
    && apt-get install -y gnupg \
    && curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add - \
    && echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-7.x.list \
    && apt update \
    && apt install elasticsearch

RUN /usr/share/elasticsearch/bin/elasticsearch-shard remove-corrupted-data

When I run this, it installs everything correctly and tries to use the utility, but then fails with an error like this:

#6 1.265     WARNING: Elasticsearch MUST be stopped before running this tool.
#6 1.265
#6 1.360 Exception in thread "main" ElasticsearchException[no node folder is found in data folder(s), node has not been started yet?]
#6 1.363    at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.processDataPaths(ElasticsearchNodeCommand.java:148)
#6 1.363    at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.execute(ElasticsearchNodeCommand.java:168)
#6 1.363    at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77)
#6 1.363    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112)
#6 1.363    at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:95)
#6 1.363    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112)
#6 1.363    at org.elasticsearch.cli.Command.main(Command.java:77)
#6 1.363    at org.elasticsearch.index.shard.ShardToolCli.main(ShardToolCli.java:24)

This leads me to believe that, despite having access to the ES_01_DATA_VOLUME volume, it knows that no instance has been set up in that container.

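Ultimately, I don't much care how the corrupted translog gets fixed, as long as it can be, but it looks to me as though these limitations of the Docker environment make it impossible. Do I need to install ES on the host machine, point it at the data files, and let it modify them? That seems to be the same trick as the second, non-ES container I already tried, so it would fail too. It also defeats the purpose of a containerized environment.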

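I'm stumped and would greatly appreciate any help. It is hard to imagine that repairing corrupted data files is impossible, or that the ES team overlooked it!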

Full error message from ES

{"type": "server", "timestamp": "2022-07-28T22:40:49,356Z", "level": "WARN", "component": "o.e.i.c.IndicesClusterStateService", "cluster.name": "es-docker-cluster", "node.name": "es01", "message": "[application_log][0] marking and sending shard failed due to [shard failure, reason [failed to recover from translog]]", "cluster.uuid": "W-cXJOamQw-XU8LyZ9ZUoA", "node.id": "SBUMvZCRRTaZvhhVqmm9sQ" ,
es01     | "stacktrace": ["org.elasticsearch.index.engine.EngineException: failed to recover from translog",

{"type": "server", "timestamp": "2022-07-28T22:40:49,361Z", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "es-docker-cluster", "node.name": "es03", "message": "failing shard [failed shard, shard [plant_pod_application_log][0], node[SBUMvZCRRTaZvhhVqmm9sQ], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=_fVgJKMpSymo_mGd-QvRxQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2022-07-28T22:40:46.444Z], failed_attempts[4], failed_nodes[[VqFl_rNTRnyoHHgVJdIBhQ, SBUMvZCRRTaZvhhVqmm9sQ]], delayed=false, details[failed shard on node [VqFl_rNTRnyoHHgVJdIBhQ]: failed recovery, failure RecoveryFailedException[[plant_pod_application_log][0]: Recovery failed on {es03}{VqFl_rNTRnyoHHgVJdIBhQ}{qLLppr6pTrCa8-lCFhz1NA}{172.22.0.5}{172.22.0.5:9300}{dilmrt}{ml.machine_memory=5175267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/translog-187.tlog] is corrupted, translog truncated]; nested: EOFException[read past EOF. pos [16592762] length: [4] end: [16592762]]; ], allocation_status[fetching_shard_data]], message [shard failure, reason [failed to recover from translog]], failure [EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/translog-175.tlog] is corrupted, translog truncated]; nested: EOFException[read past EOF. pos [16592762] length: [4] end: [16592762]]; ], markAsStale [true]]", "cluster.uuid": "W-cXJOamQw-XU8LyZ9ZUoA", "node.id": "VqFl_rNTRnyoHHgVJdIBhQ" ,
es03     | "stacktrace": ["org.elasticsearch.index.engine.EngineException: failed to recover from translog",

I know these are listed as warnings, but they are the only output that looks wrong, and when I ping the cluster to check its health I get this payload:

{"cluster_name":"es-docker-cluster","status":"red","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":20,"active_shards":30,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":2,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":93.75

As a result, Kibana is never able to load.

docker elasticsearch docker-compose
2 Answers
3 votes

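I'm 7 months late, but I am currently using Docker instances to try to repair a bunch of corrupted indices outside the cluster and ran into the same problem, so I'm documenting this for posterity and for the next time I hit it myself :-)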

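The fix is simple: run the container, but override the entrypoint. That is, you add -it to the switches (interactive container) and add /bin/bash after the image name. That way you end up in a bash shell inside a freshly started container instead of running ES.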

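You can then start ES by running /usr/local/bin/docker-entrypoint.sh and kill it with Ctrl-C, which drops you back into the bash shell. The container will not exit until you exit bash, so you are now free to run elasticsearch-shard or whatever tool you need, start ES again to call the reroute API, and so on.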

Also, one gotcha I ran into: run elasticsearch-shard as the elasticsearch user. If you run it as root, it creates the new translog files owned by root, and ES will then be unable to reroute the shard.
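The steps in this answer can be sketched roughly as below. This is a minimal sketch under stated assumptions, not a tested recipe: the image tag and the volume name es01_data are placeholders, the index name application_log is taken from the log output in the question, and the data path is assumed to be the image default, /usr/share/elasticsearch/data. The helper only prints the docker command unless RUN=1 is set, so it can be reviewed before any data is touched.

```shell
# Hypothetical placeholders: substitute your own image tag and data volume.
ES_IMAGE="${ES_IMAGE:-docker.elastic.co/elasticsearch/elasticsearch:7.17.0}"
ES_DATA_VOLUME="${ES_DATA_VOLUME:-es01_data}"

# Run an arbitrary command in a throwaway container that mounts the ES data
# volume, as the elasticsearch user, without starting ES itself.
# Prints the command by default; set RUN=1 to actually execute it.
es_container() {
  if [ "${RUN:-0}" = "1" ]; then
    docker run --rm -it --user elasticsearch \
      -v "$ES_DATA_VOLUME:/usr/share/elasticsearch/data" "$ES_IMAGE" "$@"
  else
    printf '%s\n' "docker run --rm -it --user elasticsearch -v $ES_DATA_VOLUME:/usr/share/elasticsearch/data $ES_IMAGE $*"
  fi
}

# Example invocations (index name and shard number are assumptions):
#   es_container /bin/bash
#   es_container bin/elasticsearch-shard remove-corrupted-data --index application_log --shard-id 0
```

Passing --user elasticsearch is meant to avoid the root-owned translog files mentioned above. If files were already created as root, their ownership would need to be fixed before ES can use them again.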


0 votes

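Two years late! Adding to @vegvamp's answer, which helped me through the whole process. I hope this helps someone out there! In my case, it was easier to stop ES from starting in docker-compose.yml: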

services:   
    elasticsearch:
        image: elasticsearch:6.8.22
        entrypoint: ["/bin/bash", "-c", "sleep infinity"]

Then carefully run bin/elasticsearch-translog ...
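With the entrypoint overridden like this, the container stays up without ES running, so a tool can be invoked through docker-compose exec. Here is a rough sketch, assuming the service is named elasticsearch as in the snippet above. The tool and its arguments are deliberately left for the caller to supply, since they depend on the ES version and on which shard is damaged. The helper prints the command unless RUN=1 is set.

```shell
# Service name from the docker-compose.yml snippet above; change if yours differs.
ES_SERVICE="${ES_SERVICE:-elasticsearch}"

# Run a command inside the already-running (but ES-less) service container
# as the elasticsearch user. Prints the command by default; RUN=1 executes it.
compose_tool() {
  if [ "${RUN:-0}" = "1" ]; then
    docker-compose exec --user elasticsearch "$ES_SERVICE" "$@"
  else
    printf '%s\n' "docker-compose exec --user elasticsearch $ES_SERVICE $*"
  fi
}

# Example: a harmless check of which user the tool would run as.
#   docker-compose up -d elasticsearch
#   RUN=1 compose_tool id -un
```

Running the harmless id -un check first is a cheap way to confirm the user before pointing a repair tool at the data directory.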

After that, I restarted the container to let ES start, and I had to manually allocate the affected shard by running:

curl -X POST "http://localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "graylog_236",
        "shard": 2,
        "node": "TQzzzzzzzzzzzz",
        "accept_data_loss": true
      }
    }
  ]
}'

Note that "accept_data_loss": true must be set.

I most likely did lose some data. Fortunately, it wasn't sensitive!
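To find the index and shard values for a reroute command like the one above, one option is to list the shards and keep only the unassigned primaries. This is a sketch that assumes an unsecured cluster at http://localhost:9200, as in the curl call above. A cluster with TLS and authentication, like the one in the question, would need extra curl options for credentials and the CA certificate.

```shell
# Assumed endpoint; matches the plain-HTTP curl call in this answer.
ES_URL="${ES_URL:-http://localhost:9200}"

# Fetch one line per shard: index, shard number, primary/replica flag,
# state, and (for unassigned shards) the reason.
list_shards() {
  curl -s "$ES_URL/_cat/shards?h=index,shard,prirep,state,unassigned.reason"
}

# Read _cat/shards output on stdin and keep only unassigned primaries,
# printing: index, shard number, reason.
unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" { print $1, $2, $5 }'
}

# Usage:
#   list_shards | unassigned_primaries
```

Each line printed gives an index and shard number that could go into an allocate_stale_primary command. The node value still has to be picked by hand, for example from the output of the _cat/nodes API.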
