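Background: a customer emailed to report that traffic entering through our IXP port toward our internal network was detouring via our HE (Hurricane Electric) port, with a slight drop in performance, and asked us to investigate.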
#Greetings,
#I happened to do a traceroute to [ our destination IP ], and saw some rather odd behavior... Just thought you might want to look into why you're correctly advertising a prefix into FCIX, but your peering router seems to be hairpinning the traffic through Hurricane's port back to your transit port?
kenneth@shell:~$ traceroute [ our destination IP ]
traceroute to [ our destination IP ], 30 hops max, 60 byte packets
 1  [ Client GW ]  0.425 ms  0.455 ms  0.507 ms
 2  [ our IXP IP ]  0.140 ms  0.125 ms  0.151 ms
 3  AS6939.ixp.fcix.net (206.80.238.9)  0.268 ms  0.193 ms  0.193 ms
 4  [ our Transit IP ]  0.180 ms  0.183 ms  0.181 ms
 5  [ our Internal Port ]  0.787 ms  0.780 ms  0.787 ms
 6  [ our destination IP ]  0.220 ms  0.198 ms  0.183 ms
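The pattern in this traceroute (one of our hops, then a third-party hop, then our hops again) is easy to check mechanically. Here is a minimal Python sketch of that check. The hop labels are placeholders standing in for the redacted addresses above, and `find_hairpin` is a hypothetical helper of mine, not part of any tool.

```python
def find_hairpin(hops, ours):
    """Return (first, last) indexes of the first run of foreign hops that
    sits between two of our own hops, or None if there is no such run."""
    last_own = None
    for i, hop in enumerate(hops):
        if hop in ours:
            # A gap larger than 1 means foreign hops were crossed in between.
            if last_own is not None and i - last_own > 1:
                return (last_own + 1, i - 1)
            last_own = i
    return None


# Placeholder labels for the six hops in the traceroute (assumed names).
hops = [
    "client-gw",
    "our-ixp-ip",
    "AS6939.ixp.fcix.net",
    "our-transit-ip",
    "our-internal-port",
    "our-destination-ip",
]
ours = {"our-ixp-ip", "our-transit-ip", "our-internal-port", "our-destination-ip"}

print(find_hairpin(hops, ours))  # (2, 2): hop index 2 is the detour via HE
```

A path that stays inside our network after the first own hop returns `None`.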
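Conclusion: a manual misconfiguration caused the routing detour and also made pings to the directly connected interface loop. A textbook human error.
Investigation:
First, check the VRF routing table and look at the relevant entry.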
#show ip route vrf ixp-fcix [ our /32 ip ]
Routing Table: ixp-fcix
Routing entry for [ our subnet /24 ]
  Known via "bgp xxxxx", distance 20, metric 0 (connected), type external
  Routing Descriptor Blocks:
  * directly connected, via Null0
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: none
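This route is actually a bit off: the lookup for the /32 only hits the /24 aggregate pointing at Null0, and there is no more-specific route learned from the internal IGP. It would not cause a loop (or a detour), though, so I set it aside to deal with later.
Checking further, I found that running MTR toward our router's interface from a machine attached to FCIX actually showed a loop, as in the figure:
Could the forwarding table be broken? So I checked the Cisco CEF table:
The second entry is slightly off, but it would not cause a loop.
So if neither the routing table nor the forwarding table has a fatal error, why is there a loop?
In my experience, when both the forwarding table and the routing table look fine and the error persists, it is time to look at whatever gets applied before the forwarding lookup: the policy table, that is, policy-based routing (PBR). A policy attached to an ingress interface is evaluated before the normal destination lookup, so matching packets bypass both the routing table and the forwarding table and can be forwarded even when neither has a corresponding entry. Without further ado, check it: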
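To make the ordering concrete, here is a minimal Python sketch of the decision as I understand it: the ingress interface's policy is consulted first, and only packets that no policy entry matches fall through to the normal lookup. This is a toy model, not how IOS is implemented. The route-map contents, the `"HE port"` and `"receive"` labels, and the documentation addresses are all assumptions for illustration.

```python
import ipaddress


def pbr_next_hop(route_map, src):
    """Return the next hop set by the first matching route-map entry, or None.

    route_map is a list of (source_prefix, next_hop) pairs (a simplification:
    real route-maps can match on much more than the source address).
    """
    src = ipaddress.ip_address(src)
    for match_prefix, set_next_hop in route_map:
        if src in ipaddress.ip_network(match_prefix):
            return set_next_hop
    return None


def forward(ingress_if, src, dst, policies, fib_lookup):
    """Pick a next hop: PBR on the ingress interface wins over the FIB."""
    route_map = policies.get(ingress_if)
    if route_map is not None:
        hop = pbr_next_hop(route_map, src)
        if hop is not None:
            return hop  # the FIB is never consulted for this packet
    return fib_lookup(dst)


# Hypothetical policy: everything arriving on Te1/1/0 is sent to the HE port.
policies = {"Te1/1/0": [("0.0.0.0/0", "HE port")]}


def fib_lookup(dst):
    # Stand-in for a healthy FIB entry for the destination.
    return "receive"


print(forward("Te1/1/0", "203.0.113.7", "192.0.2.1", policies, fib_lookup))  # HE port
print(forward("Te1/1/0", "203.0.113.7", "192.0.2.1", {}, fib_lookup))        # receive
```

With the policy in place the packet is detoured no matter what the FIB says, which is why inspecting only the routing and CEF tables turned up nothing fatal.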
#show ip policy
Interface      Route map
Te1/1/0        [ our routing policy ]
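Sure enough, someone had mistakenly applied a customer routing policy to the FCIX interface, and a look at the policy's contents showed it was indeed tied to HE. After thinking it through and confirming it was safe, I decided to remove the policy.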
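A check like this could be automated. Below is a minimal Python sketch that parses text in the two-column shape of the `show ip policy` output above and flags any interface carrying a route-map that is not on an allow-list. The route-map name `CUSTOMER-VIA-HE`, the second interface `Te1/2/0`, and the allow-list are made up for illustration, and collecting the output from the router is left out.

```python
def parse_show_ip_policy(text):
    """Parse 'show ip policy' style output into {interface: route_map}."""
    result = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # interface name, then the rest of the line
        if len(parts) != 2 or parts[0] == "Interface":
            continue  # skip blank lines and the header row
        result[parts[0]] = parts[1].strip()
    return result


def unexpected_pbr(text, allowed_interfaces):
    """Return policies attached to interfaces that should not carry PBR."""
    return {
        ifname: route_map
        for ifname, route_map in parse_show_ip_policy(text).items()
        if ifname not in allowed_interfaces
    }


# Hypothetical sample: Te1/2/0 is a customer port, Te1/1/0 is the IXP port.
sample = (
    "Interface      Route map\n"
    "Te1/1/0        CUSTOMER-VIA-HE\n"
    "Te1/2/0        CUSTOMER-VIA-HE\n"
)
allowed = {"Te1/2/0"}

print(unexpected_pbr(sample, allowed))  # {'Te1/1/0': 'CUSTOMER-VIA-HE'}
```

Run periodically against every router, something along these lines would have flagged the stray policy before a customer did.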
After the removal, the error on the test machine changed to unreachable instead of the earlier detour/loop, as shown below:
root@fcix1 ~ # ping [ our destination IP ]
PING [ our destination IP ] ([ our destination IP ]) 56(84) bytes of data.
From [ our destination IP ] icmp_seq=1 Destination Host Unreachable
From [ our destination IP ] icmp_seq=2 Destination Host Unreachable
From [ our destination IP ] icmp_seq=3 Destination Host Unreachable
From [ our destination IP ] icmp_seq=6 Destination Host Unreachable
^C
--- [ our destination IP ] ping statistics ---
7 packets transmitted, 0 received, +4 errors, 100% packet loss, time 123ms
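As noted above, our VRF table configuration still had a problem: it held neither routes learned from the IGP nor connected routes. So we need to import the router interface's connected route into the VRF.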
interface TenGigabitEthernet0/2/0.100
 encapsulation dot1Q 100
 ip vrf select source
 ip vrf receive ixp-fcix
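The ip vrf receive command exists specifically to import connected routes (that is, it installs the interface's address information into the VRF). With the route imported into the VRF, check the routing table again: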
#show ip route vrf ixp-fcix [ our destination IP ]
Routing Table: ixp-fcix
Routing entry for [ our destination IP ]/32
  Known via "connected", distance 0, metric 0 (connected)
  Routing Descriptor Blocks:
  * directly connected, via TenGigabitEthernet0/2/0.100
      Route metric is 0, traffic share count is 1
The routing table looks fine. Now check the CEF table:
#show ip cef vrf ixp-fcix [ our destination IP ]
[ our destination IP ]/32
  receive for TenGigabitEthernet0/2/0.100
Everything checks out now, so back to the test machine to verify connectivity:
root@fcix-test ~ # mtr -4 [ our destination IP ] --report
Start: 2020-03-08T08:55:18-0700
HOST: fcix-test                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- [ our destination IP ]     0.0%    10    0.2   0.2   0.1   0.3   0.1
root@fcix-test ~ # ping -4 [ our destination IP ] -c 3
PING [ our destination IP ] ([ our destination IP ]) 56(84) bytes of data.
64 bytes from [ our destination IP ]: icmp_seq=1 ttl=255 time=0.124 ms
64 bytes from [ our destination IP ]: icmp_seq=2 ttl=255 time=0.160 ms
64 bytes from [ our destination IP ]: icmp_seq=3 ttl=255 time=0.155 ms
--- [ our destination IP ] ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 36ms
rtt min/avg/max/mdev = 0.124/0.146/0.160/0.018 ms
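Connectivity is restored and the path is normal. Problem solved.
Lessons learned:
- Don't casually configure PBR on interfaces that are not customer-facing service interfaces.
- Don't do additional route aggregation on the router. When routes are imported into another routing table, the prefix-list match means only the aggregate gets imported (and its next hop is still Null0 after the import), while the more-specific IGP routes are easily forgotten, so packets end up being dropped.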
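The second lesson can be reproduced in a few lines. The Python sketch below assumes an exact-match prefix-list (no ge/le), a global table holding both the aggregate and a more-specific route, and documentation prefixes in place of our real ones. It is a toy model of the import, not the router's actual behavior.

```python
import ipaddress


def import_routes(global_table, prefix_list):
    """Copy only the routes whose prefix exactly equals a prefix-list entry."""
    allowed = {ipaddress.ip_network(p) for p in prefix_list}
    return {
        prefix: next_hop
        for prefix, next_hop in global_table.items()
        if ipaddress.ip_network(prefix) in allowed
    }


def lookup(table, dst):
    """Longest-prefix match: return the next hop of the most specific route."""
    dst = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in table.items()
        if dst in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Global table: the aggregate (discard route) plus one more-specific route.
global_table = {
    "192.0.2.0/24": "Null0",
    "192.0.2.1/32": "TenGigabitEthernet0/2/0.100",
}

# The prefix-list only names the aggregate, so the /32 is left behind.
vrf_table = import_routes(global_table, ["192.0.2.0/24"])

print(lookup(global_table, "192.0.2.1"))  # TenGigabitEthernet0/2/0.100
print(lookup(vrf_table, "192.0.2.1"))     # Null0, so the packet is dropped

# Adding the connected /32 to the VRF restores delivery.
vrf_table["192.0.2.1/32"] = "TenGigabitEthernet0/2/0.100"
print(lookup(vrf_table, "192.0.2.1"))     # TenGigabitEthernet0/2/0.100
```

The aggregate alone is enough to advertise the prefix to peers, which is why nothing looks wrong from the outside until traffic for a specific address actually arrives.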
Sometimes tracking down a human error really is a killer...