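I have a 40-node server cluster and dozens of clients. When a client node is restarted, one of the servers goes offline with logs about a blocked system-critical NIO worker thread. The following log entry appears several times (3, probably because the number of connection attempts is configured to 3):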
[09:02:27,567][SEVERE][sys-stripe-1-#2][TcpCommunicationSpi] Failed to send message to remote node [node=ZookeeperClusterNode [id=5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa, addrs=[2001:558:0000:0000:250:56ff:fe9a:f36f%ens192, 2001:558:0000:0000:0:0:0:2f%ens192, 100.0.0.175, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=297, loc=false, client=true], msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=5, arr=[6283512,6292346,6292347,6283518,6259652]]]]]
class org.apache.ignite.IgniteCheckedException: Failed to connect to node due to unrecoverable exception (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa, addrs=[/[2001:558:0000:0000:250:56ff:fe9a:f36f%ens192]:47100, /[2001:558:0000:0000:0:0:0:2f%ens192]:47100, /100.0.0.175:47100, /[0:0:0:0:0:0:0:1%lo]:47100, /127.0.0.1:47100], err= class org.apache.ignite.IgniteCheckedException: Remote node does not observe current node in topology : 5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa]
at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:625)
at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:693)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:1181)
at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:691)
at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.createCommunicationClient(ConnectionClientPool.java:442)
at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:231)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1105)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1052)
at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:2102)
at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:2195)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1279)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1318)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.sendDeferredUpdateResponse(GridDhtAtomicCache.java:3510)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$2500(GridDhtAtomicCache.java:147)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout.run(GridDhtAtomicCache.java:3745)
at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:637)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
at java.base/java.lang.Thread.run(Thread.java:832)
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to node due to unrecoverable exception (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa, addrs=[/[2001:558:0000:0000:250:56ff:fe9a:f36f%ens192]:47100, /[2001:558:0000:0000:0:0:0:2f%ens192]:47100, /100.0.0.175:47100, /[0:0:0:0:0:0:0:1%lo]:47100, /127.0.0.1:47100], err= class org.apache.ignite.IgniteCheckedException: Remote node does not observe current node in topology : 5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa]
at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:627)
... 17 more
Caused by: class org.apache.ignite.IgniteCheckedException: Remote node does not observe current node in topology : 5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa
at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:502)
... 17 more
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to node due to unrecoverable exception (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa, addrs=[/[2001:558:0000:0000:250:56ff:fe9a:f36f%ens192]:47100, /[2001:558:0000:0000:0:0:0:2f%ens192]:47100, /100.0.0.175:47100, /[0:0:0:0:0:0:0:1%lo]:47100, /127.0.0.1:47100], err= class org.apache.ignite.IgniteCheckedException: Remote node does not observe current node in topology : 5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa]
at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:627)
... 17 more
Caused by: class org.apache.ignite.IgniteCheckedException: Remote node does not observe current node in topology : 5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa
at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:502)
... 17 more
Caused by: class org.apache.ignite.IgniteCheckedException: Remote node does not observe current node in topology : 5c10e43b-e726-4e5b-aa5c-2ccfc3e8abaa
at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:502)
... 17 more
After that:
[09:03:26,145][SEVERE][grid-timeout-worker-#22][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-1, threadName=sys-stripe-1-#2, blockedFor=95s]
[09:03:26,146][SEVERE][grid-timeout-worker-#22][] Critical system error detected. Will be handled accordingly to configured handler [hnd=RestartProcessFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet []]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-1, igniteInstanceName=null, finished=false, heartbeatTs=1697619710526]]]
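It tries to restart, but never succeeds.
According to the logs, it is trying to keep a connection to the previously running instance of the client that has since been restarted.
In short: a server node can be affected by the state of a client. Is there a way to completely "unlink" that state, so that server nodes are unaffected no matter what happens to the clients?
We could use timeouts, but that means the "link" still exists. I see messages about the server node sending some data to the client; is there a way to disable that? Is it some system-critical data? Do thin clients work differently?
Notes: the ZK discovery SPI is used.
I tried lowering the connection timeout and increasing systemWorkerBlockedTimeout, but that still does not help if something critical happens on the network, and server state can still be affected by client state.
I tried setting IGNITE_ENABLE_FORCIBLE_NODE_KILL to true, but that still does not solve the state-coupling problem.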
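I bet you are shutting down the clients ungracefully. That leaves the server nodes no choice but to detect that something has gone wrong. Try to stop the client nodes gracefully. Generally, the way to do that is to call Ignite.close() programmatically. Additional information can be found on the page.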
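A minimal sketch of such a graceful stop, assuming ignite-core is on the classpath. The class name GracefulClient and the bare configuration are illustrative only; a real client would also carry the discovery SPI settings of the cluster (ZooKeeper discovery in the question), which are omitted here. Since Ignite extends AutoCloseable, try-with-resources calls Ignite.close() on the way out, even if the work inside throws:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class GracefulClient {
    /** Builds a client-mode configuration (discovery settings omitted; assumption). */
    public static IgniteConfiguration clientConfig() {
        return new IgniteConfiguration().setClientMode(true);
    }

    public static void main(String[] args) {
        // try-with-resources invokes client.close() when the block exits,
        // so the client leaves the topology instead of simply disappearing.
        try (Ignite client = Ignition.start(clientConfig())) {
            System.out.println("Client node id: " + client.cluster().localNode().id());
            // ... application work goes here ...
        }
    }
}
```

For a long-running service where the node outlives a single block, the same idea applies: keep the Ignite reference and call close() from the application's own shutdown path before the process exits.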
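Other thoughts.
As can be seen in the log ([super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet []]]), the list of ignored failure types is empty. This must have been configured explicitly. Here is a link to an answer on this question. Most likely, SYSTEM_WORKER_BLOCKED should not have been excluded from that list.
Adjusting systemWorkerBlockedTimeout or similar timeouts looks like a very narrow and specific measure to me.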
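For illustration, here is a rough sketch of putting SYSTEM_WORKER_BLOCKED back into the ignored set on the server side. It assumes ignite-core on the classpath and keeps the RestartProcessFailureHandler seen in the log. The class name ServerFailureHandlerConfig is made up for this example, and the two ignored types are what I believe recent Ignite 2.x versions ignore by default, so check them against your version:

```java
import java.util.EnumSet;

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.RestartProcessFailureHandler;

public class ServerFailureHandlerConfig {
    /** Server configuration whose failure handler ignores blocked-worker failures. */
    public static IgniteConfiguration serverConfig() {
        RestartProcessFailureHandler hnd = new RestartProcessFailureHandler();

        // Failures of these types are still logged, but the handler
        // does not restart the node because of them.
        hnd.setIgnoredFailureTypes(EnumSet.of(
            FailureType.SYSTEM_WORKER_BLOCKED,
            FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

        return new IgniteConfiguration().setFailureHandler(hnd);
    }
}
```

The server would then be started with Ignition.start(ServerFailureHandlerConfig.serverConfig()), or with the equivalent bean definition in the Spring XML configuration. This only stops a blocked stripe from taking the node down; it does not remove the underlying wait on the dead client connection.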