目录

13 号 上午 GPU 服务器突然不能访问了,可以通过 CPU 服务器访问 GPU 服务器。这一周一直在查找问题,这里记录一下过程。

traceroute 路由追踪

  • GPU 服务器
    traceroute gpu1
    
    traceroute to gpu1 (172.16.33.66), 64 hops max, 52 byte packets
     1  * * *
     2  172.16.136.2 (172.16.136.2)  7.462 ms  3.820 ms  3.014 ms
     3  * * *
     4  * * *
     5  * * *
     6  * * *
     7  * * *
     8  * * *
     9  * * *
    10  * * *
    
  • CPU 服务器
    traceroute cpu1
    
    traceroute to cpu1 (172.16.33.157), 64 hops max, 52 byte packets
     1  * * *
     2  172.16.136.2 (172.16.136.2)  7.827 ms  4.712 ms  3.162 ms
     3  * * *
     4  cpu1 (172.16.33.157)  8.619 ms  4.205 ms  4.982 ms
    

tcpdump 抓包

在GPU服务器上抓取 22 端口的数据包

sudo tcpdump -i any 'tcp port 22'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
14:59:21.757214 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 4001681261:4001681457, ack 480004153, win 501, options [nop,nop,TS val 48088535 ecr 2526753149], length 196
14:59:21.757344 IP cpu1.60682 > gpu2.ssh: Flags [.], ack 196, win 501, options [nop,nop,TS val 2526753233 ecr 48088535], length 0
14:59:21.757945 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 196:400, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753233], length 204
14:59:21.757974 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 400:572, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753233], length 172
14:59:21.758023 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 572:752, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753233], length 180
14:59:21.758047 IP cpu1.60682 > gpu2.ssh: Flags [.], ack 400, win 501, options [nop,nop,TS val 2526753234 ecr 48088536], length 0
14:59:21.758047 IP cpu1.60682 > gpu2.ssh: Flags [.], ack 572, win 501, options [nop,nop,TS val 2526753234 ecr 48088536], length 0
14:59:21.758069 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 752:932, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753234], length 180
14:59:21.758131 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 932:1112, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753234], length 180
14:59:21.758144 IP cpu1.60682 > gpu2.ssh: Flags [.], ack 752, win 501, options [nop,nop,TS val 2526753234 ecr 48088536], length 0
14:59:21.758163 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 1112:1284, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753234], length 172
14:59:21.758193 IP cpu1.60682 > gpu2.ssh: Flags [.], ack 932, win 501, options [nop,nop,TS val 2526753234 ecr 48088536], length 0
14:59:21.758202 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 1284:1456, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753234], length 172
14:59:21.758221 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 1456:1788, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753234], length 332
14:59:21.758242 IP cpu1.60682 > gpu2.ssh: Flags [.], ack 1112, win 501, options [nop,nop,TS val 2526753234 ecr 48088536], length 0
14:59:21.758251 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 1788:2104, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753234], length 316
14:59:21.758263 IP gpu2.ssh > cpu1.60682: Flags [P.], seq 2104:2276, ack 1, win 501, options [nop,nop,TS val 48088536 ecr 2526753234], length 172

netstat 查看网络连接状态和统计信息

显示所有处于监听状态的 TCP 和 UDP 网络连接的信息,包括源 IP、目标 IP、源端口、目标端口以及协议类型等。

netstat -tuln

查看端口占用的进程

sudo lsof -i:22
COMMAND     PID   USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
sshd    1635782   root    4u  IPv4 1127621335      0t0  TCP cpu1:ssh->192.168.73.2:62014 (ESTABLISHED)
sshd    1635927 lnsoft    4u  IPv4 1127621335      0t0  TCP cpu1:ssh->192.168.73.2:62014 (ESTABLISHED)
sshd    1936495   root    3u  IPv4 1037646169      0t0  TCP *:ssh (LISTEN)
sshd    1936495   root    4u  IPv6 1037635959      0t0  TCP *:ssh (LISTEN)

ip addr 查看网卡信息

GPU 服务器

ip addr
8: br-b1a37228308c: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:69:12:87:eb brd ff:ff:ff:ff:ff:ff
    inet 192.168.64.1/20 brd 192.168.79.255 scope global br-b1a37228308c
       valid_lft forever preferred_lft forever
    inet6 fe80::42:69ff:fe12:87eb/64 scope link 
       valid_lft forever preferred_lft forever

在 GPU 服务器上监控 br-b1a37228308c 网桥

sudo tcpdump -i br-b1a37228308c
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-b1a37228308c, link-type EN10MB (Ethernet), capture size 262144 bytes

11:59:17.123725 ARP, Request who-has 192.168.73.2 tell gpu1, length 28
11:59:18.131063 ARP, Request who-has 192.168.73.2 tell gpu1, length 28
11:59:19.155067 ARP, Request who-has 192.168.73.2 tell gpu1, length 28

很有可能是这个网桥设备出的问题,它的网段是:192.168.64.1/20 - 192.168.79.255,而我的电脑 IP 是:192.168.73.2

通过变更登录无线设备解决了这个问题,我的 IP 变更为:172.16.122.222

dmesg 查看内核日志

dmesg | grep br-b1a37228308c
[   28.911290] br-b1a37228308c: port 1(veth4974272) entered blocking state
[   28.911294] br-b1a37228308c: port 1(veth4974272) entered disabled state
[   28.911885] br-b1a37228308c: port 1(veth4974272) entered blocking state
[   28.911889] br-b1a37228308c: port 1(veth4974272) entered forwarding state
[   28.911928] IPv6: ADDRCONF(NETDEV_CHANGE): br-b1a37228308c: link becomes ready
[   28.913006] br-b1a37228308c: port 1(veth4974272) entered disabled state
[   28.929009] br-b1a37228308c: port 2(veth086e445) entered blocking state
[   28.929012] br-b1a37228308c: port 2(veth086e445) entered disabled state
[   28.932252] br-b1a37228308c: port 2(veth086e445) entered blocking state
[   28.932255] br-b1a37228308c: port 2(veth086e445) entered forwarding state
[   29.130690] br-b1a37228308c: port 3(veth6495b3f) entered blocking state
[   29.130695] br-b1a37228308c: port 3(veth6495b3f) entered disabled state
[   29.131245] br-b1a37228308c: port 3(veth6495b3f) entered blocking state
[   29.131248] br-b1a37228308c: port 3(veth6495b3f) entered forwarding state
[   29.252413] br-b1a37228308c: port 2(veth086e445) entered disabled state
[   29.252716] br-b1a37228308c: port 3(veth6495b3f) entered disabled state
[   30.376166] br-b1a37228308c: port 1(veth4974272) entered blocking state
[   30.376170] br-b1a37228308c: port 1(veth4974272) entered forwarding state
[   30.432464] br-b1a37228308c: port 3(veth6495b3f) entered blocking state
[   30.432467] br-b1a37228308c: port 3(veth6495b3f) entered forwarding state
[   30.432703] br-b1a37228308c: port 2(veth086e445) entered blocking state
[   30.432708] br-b1a37228308c: port 2(veth086e445) entered forwarding state
[160528.388829] device br-b1a37228308c entered promiscuous mode
[161199.293262] device br-b1a37228308c left promiscuous mode

journalctl 查看系统日志

journalctl | grep br-b1a37228308c
Nov 20 14:25:24 gpu1 kernel: br-b1a37228308c: port 2(veth80039d3) entered disabled state
Nov 20 14:25:24 gpu1 kernel: br-b1a37228308c: port 1(veth68fe7f7) entered disabled state
Nov 20 14:25:24 gpu1 kernel: br-b1a37228308c: port 3(veth10420d1) entered disabled state
Nov 20 14:25:26 gpu1 systemd-networkd[1565]: br-b1a37228308c: Lost carrier
Nov 20 14:25:27 gpu1 kernel: br-b1a37228308c: port 2(veth80039d3) entered disabled state
Nov 20 14:25:27 gpu1 kernel: br-b1a37228308c: port 2(veth80039d3) entered disabled state
Nov 20 14:25:27 gpu1 kernel: br-b1a37228308c: port 1(veth68fe7f7) entered disabled state
Nov 20 14:25:27 gpu1 kernel: br-b1a37228308c: port 1(veth68fe7f7) entered disabled state
Nov 20 14:25:27 gpu1 kernel: br-b1a37228308c: port 3(veth10420d1) entered disabled state
Nov 20 14:25:27 gpu1 kernel: br-b1a37228308c: port 3(veth10420d1) entered disabled state
Nov 20 14:30:00 gpu1 systemd-networkd[1523]: br-b1a37228308c: Link UP
Nov 20 14:30:01 gpu1 systemd-networkd[1523]: br-b1a37228308c: Gained carrier
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered blocking state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered disabled state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered blocking state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered forwarding state
Nov 20 14:30:01 gpu1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): br-b1a37228308c: link becomes ready
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered disabled state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered blocking state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered disabled state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered blocking state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered forwarding state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered blocking state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered disabled state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered blocking state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered forwarding state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered disabled state
Nov 20 14:30:01 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered disabled state
Nov 20 14:30:02 gpu1 systemd-networkd[1523]: br-b1a37228308c: Gained IPv6LL
Nov 20 14:30:02 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered blocking state
Nov 20 14:30:02 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered forwarding state
Nov 20 14:30:02 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered blocking state
Nov 20 14:30:02 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered forwarding state
Nov 20 14:30:02 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered blocking state
Nov 20 14:30:02 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered forwarding state
Nov 20 15:21:04 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered disabled state
Nov 20 15:21:05 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered disabled state
Nov 20 15:21:05 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered disabled state
Nov 20 15:21:05 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered disabled state
Nov 20 15:21:05 gpu1 kernel: br-b1a37228308c: port 2(vethcdc9775) entered disabled state
Nov 20 15:21:05 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered disabled state
Nov 20 15:21:05 gpu1 kernel: br-b1a37228308c: port 1(veth8eff6fd) entered disabled state
Nov 20 15:21:05 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered disabled state
Nov 20 15:21:05 gpu1 kernel: br-b1a37228308c: port 3(vethc75a3fb) entered disabled state
Nov 20 15:21:05 gpu1 systemd-networkd[1523]: br-b1a37228308c: Lost carrier
Nov 20 15:23:56 gpu1 systemd-networkd[1509]: br-b1a37228308c: Link UP
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 1(veth4974272) entered blocking state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 1(veth4974272) entered disabled state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 1(veth4974272) entered blocking state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 1(veth4974272) entered forwarding state
Nov 20 15:23:57 gpu1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): br-b1a37228308c: link becomes ready
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 1(veth4974272) entered disabled state
Nov 20 15:23:57 gpu1 systemd-networkd[1509]: br-b1a37228308c: Gained carrier
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 2(veth086e445) entered blocking state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 2(veth086e445) entered disabled state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 2(veth086e445) entered blocking state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 2(veth086e445) entered forwarding state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 3(veth6495b3f) entered blocking state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 3(veth6495b3f) entered disabled state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 3(veth6495b3f) entered blocking state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 3(veth6495b3f) entered forwarding state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 2(veth086e445) entered disabled state
Nov 20 15:23:57 gpu1 kernel: br-b1a37228308c: port 3(veth6495b3f) entered disabled state
Nov 20 15:23:58 gpu1 systemd-networkd[1509]: br-b1a37228308c: Gained IPv6LL
Nov 20 15:23:58 gpu1 systemd-networkd[1509]: br-b1a37228308c: Lost carrier
Nov 20 15:23:58 gpu1 kernel: br-b1a37228308c: port 1(veth4974272) entered blocking state
Nov 20 15:23:58 gpu1 kernel: br-b1a37228308c: port 1(veth4974272) entered forwarding state
Nov 20 15:23:58 gpu1 systemd-networkd[1509]: br-b1a37228308c: Gained carrier
Nov 20 15:23:58 gpu1 kernel: br-b1a37228308c: port 3(veth6495b3f) entered blocking state
Nov 20 15:23:58 gpu1 kernel: br-b1a37228308c: port 3(veth6495b3f) entered forwarding state
Nov 20 15:23:58 gpu1 kernel: br-b1a37228308c: port 2(veth086e445) entered blocking state
Nov 20 15:23:58 gpu1 kernel: br-b1a37228308c: port 2(veth086e445) entered forwarding state
Nov 22 11:58:59 gpu1 sudo[1527288]:   lnsoft : TTY=pts/2 ; PWD=/home/lnsoft ; USER=root ; COMMAND=/usr/sbin/tcpdump -i br-b1a37228308c
Nov 22 11:58:59 gpu1 kernel: device br-b1a37228308c entered promiscuous mode
Nov 22 12:10:10 gpu1 kernel: device br-b1a37228308c left promiscuous mode

参考资料