I am currently trying to run some code across multiple TPU cores in Google Colab, but I get an error when the synchronization call (xm.rendezvous) is placed at the end of the target function; when the rendezvous is placed at the top instead, it works fine. Here is an example:
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# "Map function": acquires a corresponding Cloud TPU core and prints
# its device and ordinals
def simple_map_fn(index, flags):
    # xm.rendezvous('init')  # placing the rendezvous here instead of at the bottom works fine
    # Acquires the (unique) Cloud TPU core corresponding to this process's index
    device = xm.xla_device()
    ordinal = xm.get_ordinal()
    local_ordinal = xm.get_local_ordinal()
    print(f"index {index}, process device {device}, local ordinal {local_ordinal}, ordinal {ordinal}")
    # Barrier to prevent the master from exiting before the workers connect.
    xm.rendezvous('leave')

# Spawns eight of the map functions, one for each of the eight cores on
# the Cloud TPU
flags = {}
xmp.spawn(simple_map_fn, args=(flags,), nprocs=8, start_method='fork')
When I run the code above in a Google Colab notebook, I get the following error:
Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'leave': Socket closed (14)
Any idea why it fails when the rendezvous is placed at the bottom of the target function?
After some careful investigation, I found that the problem does not occur when the Google Colab instance runs as a "high-RAM" instance. I believe the most likely cause of this error is an out-of-memory condition: with the standard amount of RAM, one or more of the forked worker processes is killed before it reaches the 'leave' rendezvous, so the surviving workers see the socket close.
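If the cause really is memory pressure, one defensive workaround is to check how much physical memory is available before deciding how many processes to spawn. The sketch below is a hypothetical guard, not part of the original code; `available_memory_gib` is a helper I made up, and it reads Linux-only sysconf values (which is fine for Colab, but not portable). The threshold of 20 GiB is an illustrative guess, not a measured number.

```python
import os

def available_memory_gib():
    # Approximate available physical memory in GiB (Linux only).
    page_size = os.sysconf('SC_PAGE_SIZE')      # bytes per page
    avail_pages = os.sysconf('SC_AVPHYS_PAGES') # pages currently available
    return page_size * avail_pages / 2**30

# Hypothetical guard: on a standard (non-high-RAM) Colab instance,
# fall back to a single process instead of forking all eight workers.
nprocs = 8 if available_memory_gib() > 20 else 1
print(f"available memory: {available_memory_gib():.1f} GiB, spawning {nprocs} process(es)")
```

You would then pass `nprocs=nprocs` to `xmp.spawn`. This does not fix the underlying issue, but it avoids the confusing "Socket closed" failure mode by not over-subscribing memory in the first place.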