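[ Experience ] Fixing a routing loop caused by a slip of the hand

March 8, 2020 - Experience

Background: a customer emailed to report that traffic entering through our IXP port toward our internal network was detouring via the HE (Hurricane Electric) port, with a slight drop in performance, and asked us to investigate.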

#Greetings,

#I happened to do a traceroute to [ our destination IP ], and saw some rather odd behavior... Just thought you might want to look into why you're correctly advertising a prefix into FCIX, but your peering router seems to be hairpinning the traffic through Hurricane's port back to your transit port?

kenneth@shell:~$ traceroute [ our destination IP ]
traceroute to [ our destination IP ], 30 hops max, 60 byte packets
 1  [ Client GW ] 0.425 ms  0.455 ms  0.507 ms
 2  [ our IXP IP ]  0.140 ms  0.125 ms  0.151 ms
 3  AS6939.ixp.fcix.net (206.80.238.9)  0.268 ms  0.193 ms  0.193 ms
 4  [ our Transit IP ]  0.180 ms  0.183 ms  0.181 ms
 5  [ our Internal Port ]  0.787 ms  0.780 ms  0.787 ms
 6  [ our destination IP ]  0.220 ms  0.198 ms  0.183 ms
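
The hairpin in this trace can also be spotted mechanically: any hop that is not ours but sits between two hops that are ours means traffic left our network and came back. Below is a minimal Python sketch of that check. The hop labels (`our-ixp`, `our-transit`, and so on) are hypothetical stand-ins for the redacted addresses, and `foreign_hops_between_own` is an illustrative helper, not part of any tool used in this incident.

```python
def foreign_hops_between_own(hops, own):
    """Return hops that are not ours but lie between our first and last hop."""
    own_idx = [i for i, hop in enumerate(hops) if hop in own]
    if len(own_idx) < 2:
        return []  # fewer than two of our hops: nothing can be sandwiched
    first, last = own_idx[0], own_idx[-1]
    return [hop for hop in hops[first:last + 1] if hop not in own]


# Hop labels are placeholders for the redacted addresses in the trace.
hops = [
    "client-gw",
    "our-ixp",
    "AS6939.ixp.fcix.net",
    "our-transit",
    "our-internal",
    "our-destination",
]
own = {"our-ixp", "our-transit", "our-internal", "our-destination"}

print(foreign_hops_between_own(hops, own))  # prints ['AS6939.ixp.fcix.net']
```

A clean path, where every hop after the first of ours is also ours, yields an empty list.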

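Conclusion: a manual configuration error caused the detour and made pings to the directly connected interface loop. A textbook human error.

Troubleshooting:

First, check the VRF routing table for the relevant entries.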

#show ip route vrf ixp-fcix [ our /32 ip ]

Routing Table: ixp-fcix
Routing entry for [ our subnet /24 ]
  Known via "bgp xxxxx", distance 20, metric 0 (connected), type external
  Routing Descriptor Blocks:
  * directly connected, via Null0
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: none

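This route is somewhat off: it points to Null0 instead of a specific address learned from the internal IGP. It cannot cause a loop (or a detour) by itself, though, so I set it aside to deal with later.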
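
For routine checks, the same reading of the route entry can be scripted. The following is a rough Python sketch, assuming the output has the shape shown above; `summarize_route` and `points_to_null0` are hypothetical helper names, and the parsing covers only the fields visible in this output.

```python
import re


def summarize_route(output):
    """Pull the prefix, source protocol and next hops out of `show ip route` text."""
    info = {"entry": None, "known_via": None, "next_hops": []}
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Routing entry for "):
            info["entry"] = line[len("Routing entry for "):]
        elif line.startswith("Known via "):
            match = re.match(r'Known via "([^"]+)"', line)
            if match:
                info["known_via"] = match.group(1)
        elif line.startswith("* "):
            # Descriptor block lines start with "* " in this output
            info["next_hops"].append(line[2:])
    return info


def points_to_null0(info):
    """True if any next hop of the route is the Null0 discard interface."""
    return any(hop.endswith("via Null0") for hop in info["next_hops"])


sample = """\
Routing Table: ixp-fcix
Routing entry for [ our subnet /24 ]
  Known via "bgp xxxxx", distance 20, metric 0 (connected), type external
  Routing Descriptor Blocks:
  * directly connected, via Null0
      Route metric is 0, traffic share count is 1
"""

route = summarize_route(sample)
print(route["known_via"], points_to_null0(route))  # prints: bgp xxxxx True
```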

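Digging further, I ran MTR from the FCIX-connected machine toward our router's interface and, surprisingly, saw a loop, as shown in the figure:

Could the forwarding table be at fault? I checked the Cisco CEF table:

The second entry is slightly off, but it would not cause a loop.

If neither the routing table nor the forwarding table has a fatal error, why is there a loop?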

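In my experience, when the routing table and the forwarding table both look fine and the problem persists, the next thing to suspect is policy that sits below the forwarding table. That layer is the policy table (policy-based routing). A policy applied there is written straight into the low-level forwarding path and steers packets without consulting the routing table or the forwarding table, so traffic can be forwarded even when neither table has a matching entry. Without further ado, check it: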

#show ip policy
Interface      Route map
Te1/1/0    [ our routing policy ]

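Sure enough, someone had mistakenly applied a customer's routing policy to the FCIX interface, and the contents of that policy were indeed tied to HE. After thinking it through and confirming it was safe, I decided to remove the policy.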
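
A stray binding like this is easy to miss by eye, so it can help to compare `show ip policy` output against a list of bindings you actually expect. The Python sketch below assumes the two-column output format shown above; the route-map name `CUSTOMER-HE-POLICY` is a made-up placeholder for the redacted one, and `parse_ip_policy` and `unexpected_bindings` are hypothetical helpers.

```python
def parse_ip_policy(output):
    """Map interface name to route-map name from `show ip policy` text."""
    bindings = {}
    for raw in output.splitlines():
        line = raw.strip()
        # Skip blanks, the echoed command line and the column header
        if not line or line.startswith("#") or line.startswith("Interface"):
            continue
        iface, _, route_map = line.partition(" ")
        bindings[iface] = route_map.strip()
    return bindings


def unexpected_bindings(bindings, expected):
    """Return bindings that are absent from, or differ from, the expected map."""
    return {i: rm for i, rm in bindings.items() if expected.get(i) != rm}


# Route-map name is a placeholder; the real one is redacted in this post.
sample = """\
#show ip policy
Interface      Route map
Te1/1/0    CUSTOMER-HE-POLICY
"""

expected = {}  # assumption: no policy routing should exist on the IXP port
print(unexpected_bindings(parse_ip_policy(sample), expected))
# prints {'Te1/1/0': 'CUSTOMER-HE-POLICY'}
```

An empty result means every binding on the router matches what was intended.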

After the removal, the error on the test machine changed to unreachable, instead of the earlier detour/loop:

root@fcix1 ~ # ping [ our destination IP ]
PING [ our destination IP ] ([ our destination IP ]) 56(84) bytes of data.
From [ our destination IP ] icmp_seq=1 Destination Host Unreachable
From [ our destination IP ] icmp_seq=2 Destination Host Unreachable
From [ our destination IP ] icmp_seq=3 Destination Host Unreachable
From [ our destination IP ] icmp_seq=6 Destination Host Unreachable
^C
--- [ our destination IP ] ping statistics ---
7 packets transmitted, 0 received, +4 errors, 100% packet loss, time 123ms

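As noted above, the VRF configuration still has a problem: the VRF holds neither an IGP-learned route nor a connected route for the destination, so the router interface's connected route has to be imported into the VRF.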

interface TenGigabitEthernet0/2/0.100
 encapsulation dot1Q 100
 ip vrf select source
 ip vrf receive ixp-fcix

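This command exists specifically to import connected routes (that is, the interface's address information) into a VRF. With the route imported into the VRF, check the routing table again: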

#show ip route vrf ixp-fcix [ our destination IP ] 

Routing Table: ixp-fcix
Routing entry for [ our destination IP ]/32
  Known via "connected", distance 0, metric 0 (connected)
  Routing Descriptor Blocks:
  * directly connected, via TenGigabitEthernet0/2/0.100
      Route metric is 0, traffic share count is 1

The routing table looks fine. Now check the CEF table:

#show ip cef vrf ixp-fcix [ our destination IP ]  
[ our destination IP ]/32
  receive for TenGigabitEthernet0/2/0.100

Everything checks out, so go back to the test machine and test connectivity:

root@fcix-test ~ # mtr -4 [ our destination IP ] --report
Start: 2020-03-08T08:55:18-0700
HOST: fcix-test                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- [ our destination IP ]               0.0%    10    0.2   0.2   0.1   0.3   0.1

root@fcix-test ~ # ping -4 [ our destination IP ] -c 3
PING [ our destination IP ] ([ our destination IP ]) 56(84) bytes of data.
64 bytes from [ our destination IP ]: icmp_seq=1 ttl=255 time=0.124 ms
64 bytes from [ our destination IP ]: icmp_seq=2 ttl=255 time=0.160 ms
64 bytes from [ our destination IP ]: icmp_seq=3 ttl=255 time=0.155 ms

--- [ our destination IP ] ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 36ms
rtt min/avg/max/mdev = 0.124/0.146/0.160/0.018 ms

Connectivity is restored and the path is normal. Problem solved.

Lesson learned:

Sometimes tracking down a human error really is a killer...