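I am trying to create an environment of 7 VMs with Terraform code. The code creates the NICs and the VMs (the NSG, RG and subnet already exist), and I use a custom RHEL8 image.
After the VMs are created, I "provision" them with Ansible playbooks run through Jenkins. The problem appears in this provisioning stage: on every run, after 12-13 minutes, 1-3 VMs disconnect with one of the following errors: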
fatal: FAILED! => {"changed": false, "elapsed": 641, "msg": "timed out waiting for ping module test: Failed to connect to the host via ssh: ssh: connect to host 10.22.136.194 port 22: Connection timed out"}
or
UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: mux_client_request_session: read from master failed: Broken pipe\r\nssh: connect to host 10.22.136.195 port 22: Connection timed out", "unreachable": true}
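I contacted a Microsoft support engineer, who said the NIC was not attached properly but could not elaborate on this.
Note: the same infrastructure and pipeline worked fine on a custom CentOS 7 image. SSH timeouts did occur there too, but they were not a blocker as they are now.
I tried destroying and recreating the VMs, changing the SSH timeout limit in ansible.cfg (it did not help), and working with the MS support engineer, but the issue is still unresolved.
The above setup should
Here is my main.tf, where I create the VMs and NICs: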
`resource "azurerm_network_interface" "nw_interface" {
  count               = var.vm_count
  name                = "nic-${var.customer_acronym}-${var.env_acronym}-${var.vm_name}00${count.index + 1}-01"
  location            = var.rg.location
  resource_group_name = var.rg.name

  ip_configuration {
    name                          = "vm-${var.customer_acronym}-${var.env_acronym}-${var.vm_name}0${count.index + 1}-network-configuration"
    subnet_id                     = var.subnet_id
    private_ip_address_allocation = var.pvt_ip_allocation
  }

  tags = merge(var.tags, tomap({ "Name" = "vm-${var.customer_acronym}-${var.env_acronym}-${var.vm_name}00${count.index + 1}", "Role" = var.role }))
}

resource "azurerm_virtual_machine" "azure_vm" {
  count               = var.vm_count
  name                = "vm-${var.customer_acronym}-${var.env_acronym}-${var.vm_name}00${count.index + 1}"
  location            = var.rg.location
  resource_group_name = var.rg.name
  vm_size             = var.vm_size

  # Attach the NIC that has the same index as this VM
  network_interface_ids = [azurerm_network_interface.nw_interface[count.index].id]

  # Delete the OS disk automatically when the VM is deleted
  delete_os_disk_on_termination = var.del_disk_on_termination

  # Delete the data disks automatically when the VM is deleted
  delete_data_disks_on_termination = var.del_disk_on_termination

  storage_image_reference {
    id = var.vm_image_id
  }

  storage_os_disk {
    name              = "disk-${var.customer_acronym}-${var.env_acronym}-${var.vm_name}0${count.index + 1}-disk"
    caching           = var.disk_opts.caching
    create_option     = var.disk_opts.create_option
    managed_disk_type = var.disk_opts.managed_disk_type
    disk_size_gb      = var.os_disk_size
    os_type           = var.disk_opts.os_type
  }

  os_profile {
    computer_name  = "${var.vm_name}0${count.index + 1}"
    admin_username = var.admin_username
    admin_password = var.admin_password
  }

  os_profile_linux_config {
    disable_password_authentication = var.disable_password_authentication
  }

  tags = merge(var.tags, tomap({ "Name" = "vm-${var.customer_acronym}-${var.env_acronym}-${var.vm_name}00${count.index + 1}", "Role" = var.role }))
}`
This is the part of my variables.tf that refers to the image:

    variable "image_name" {
      default     = "rhlel8-base-image-2024-09-16"
      description = "Image name from which Dev VM would be created"
    }
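Note: adding what the support engineer said:
"Troubleshooting performed: As seen in the logs, which show many VMware references, the issue appears to be related to the network not coming up. The base image is suspected of not being prepared for Azure." (https://i.sstatic.net/XIw4DbLc.png)
Facing SSH timeouts in different VMs
The blocker is most likely the custom image, which is not set up for Azure. In general, check the image's network agent and network settings.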
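As a rough sketch of that check, the snippet below reports whether a few relevant packages are present on an affected VM. The package names (`WALinuxAgent`, `cloud-init`, `open-vm-tools`) and the `waagent` service name are assumptions about a typical RHEL image, and `report_pkg` is a hypothetical helper, so adjust them to whatever the image actually contains.

```shell
# Run on an affected VM, e.g. from the Azure serial console.
# Prints whether an RPM package is present; says "not found" when the
# package is missing or the rpm tool itself is unavailable.
report_pkg() {
  if command -v rpm >/dev/null 2>&1 && rpm -q "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: not found"
  fi
}

# Assumed package names: WALinuxAgent (Azure Linux Agent), cloud-init,
# and open-vm-tools (VMware guest tools possibly left over in the image).
for pkg in WALinuxAgent cloud-init open-vm-tools; do
  report_pkg "$pkg"
done

# Assumed service name on RHEL: waagent.
if command -v systemctl >/dev/null 2>&1; then
  systemctl is-enabled waagent || true
  systemctl is-active waagent || true
fi
```

If VMware tooling shows up while the Azure agent is missing or inactive, that would be consistent with the support engineer's suspicion about the base image.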
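You can try a larger timeout in Ansible so that the machines have enough time to establish the connection, and also try adding retries. Then test direct SSH access to the problematic VM using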
ssh -v <vm_user>@<vm_ip_address>
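The timeout-and-retry idea can be sketched in shell as follows. `retry` is a hypothetical helper, not part of Ansible or OpenSSH. The commented invocations use placeholder values, and `site.yml` is an assumed playbook name. `ANSIBLE_TIMEOUT` and `ANSIBLE_SSH_RETRIES` are the environment variables Ansible reads for the connection timeout and the number of SSH reconnection attempts.

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $retry_i/$retry_attempts failed: $*" >&2
    # Sleep only when another attempt will follow.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Example (placeholders, not executed here): probe SSH up to 10 times, 15 s apart.
# retry 10 15 ssh -o BatchMode=yes -o ConnectTimeout=10 <vm_user>@<vm_ip_address> true

# Example (assumed values): give Ansible a longer timeout and SSH retries.
# ANSIBLE_TIMEOUT=60 ANSIBLE_SSH_RETRIES=5 ansible-playbook -i your_hosts_file site.yml
```

Running the SSH probe before the playbook separates "the VM never became reachable" from "the connection dropped mid-run", which are the two failure modes in the error messages above.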
Based on the reference below, try using ansible_user instead of ansible_ssh_user, and likewise ansible_password instead of ansible_ssh_pass:
ansible 192.168.15.29 -i your_hosts_file -m ping -e "ansible_ssh_user=remote ansible_ssh_pass=password"
or
ansible 192.168.15.29 -i your_hosts_file -m ping -e "ansible_user=remote ansible_password=password"
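Make this change in your Ansible file, and ping the host with the `-c paramiko` option added to the command, so the connection goes through Paramiko instead of the OpenSSH client.
Reference:
"Ansible can't connect with SSH (banner exchange)", Stack Overflow, answer by Ripper Tops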