[ Experience ] Fixing a routing loop caused by a fat-fingered configuration

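Background: a customer emailed us to report that traffic entering through our IXP port toward our internal network was detouring through our HE port, with a slight drop in performance, and asked us to look into it.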

#Greetings,

#I happened to do a traceroute to [ our destination IP ], and saw some rather odd behavior... Just thought you might want to look into why you're correctly advertising a prefix into FCIX, but your peering router seems to be hairpinning the traffic through Hurricane's port back to your transit port?

kenneth@shell:~$ traceroute [ our destination IP ]
traceroute to [ our destination IP ], 30 hops max, 60 byte packets
 1  [ Client GW ] 0.425 ms  0.455 ms  0.507 ms
 2  [ our IXP IP ]  0.140 ms  0.125 ms  0.151 ms
 3  AS6939.ixp.fcix.net (206.80.238.9)  0.268 ms  0.193 ms  0.193 ms
 4  [ our Transit IP ]  0.180 ms  0.183 ms  0.181 ms
 5  [ our Internal Port ]  0.787 ms  0.780 ms  0.787 ms
 6  [ our destination IP ]  0.220 ms  0.198 ms  0.183 ms
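
The symptom in this traceroute is a hop owned by someone else (AS6939) sitting between two of our own hops. A minimal sketch of how that pattern could be spotted automatically, assuming each hop has already been labelled with its owner by hand (the labels and the function name `find_hairpins` are illustrative, not part of any tool used in this incident):

```python
def find_hairpins(hop_owners, ours="us"):
    """Return indexes of hops owned by someone else that sit between two of our hops."""
    own = [i for i, owner in enumerate(hop_owners) if owner == ours]
    if len(own) < 2:
        return []
    first, last = own[0], own[-1]
    # Any foreign hop strictly between our first and last hop is a detour.
    return [i for i in range(first + 1, last) if hop_owners[i] != ours]


# Hop owners transcribed by hand from the traceroute above (hypothetical labels).
hops = ["client", "us", "AS6939", "us", "us", "us"]
print(find_hairpins(hops))
```

The returned indexes are zero-based positions in the list, so they are one lower than the hop numbers traceroute prints.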

Conclusion: a manual configuration error caused the detour and made pings to the directly connected interface loop. A textbook human error.

Troubleshooting:

First, check the VRF routing table for the relevant entries.

#show ip route vrf ixp-fcix [ our /32 ip ]

Routing Table: ixp-fcix
Routing entry for [ our subnet /24 ]
  Known via "bgp xxxxx", distance 20, metric 0 (connected), type external
  Routing Descriptor Blocks:
  * directly connected, via Null0
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: none

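This route is already somewhat wrong: there is no more specific address learned from the internal IGP, only the /24 pointing at Null0. It would not cause a loop (or a detour), though, so I set it aside to deal with later.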

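Digging further, I found that running MTR from a machine attached to FCIX toward our router's interface actually showed a loop (the screenshot is not preserved here).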

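Could the forwarding table be at fault? I checked the Cisco CEF table next (the output is not preserved here).

The second entry was slightly off, but not in a way that would cause a loop.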

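So if neither the routing table nor the forwarding table had a fatal error, why was there a loop?

In my experience, when both tables look fine and the fault persists, the next thing to suspect is policy that acts ahead of the forwarding table, namely the policy table. A routing policy applied to an interface steers packets directly, without going through the routing table or the forwarding table, so traffic can be forwarded even when neither table has a matching entry. Without further ado, I checked: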

#show ip policy
Interface      Route map
Te1/1/0    [ our routing policy ]

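Sure enough, someone had mistakenly applied a customer's routing policy to the FCIX interface, and a look at the policy's contents confirmed that it was tied to HE. After thinking it through and confirming it was safe, I decided to remove the policy.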
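
The reasoning above, that an interface policy is consulted before the destination lookup, can be sketched as a toy model. This is a deliberately simplified illustration, not how IOS is implemented: it assumes the policy matches every packet arriving on the interface, and the prefix `192.0.2.0/24`, the next-hop labels, and the function name `lookup` are all hypothetical.

```python
import ipaddress


def lookup(dst, ingress_if, policy, fib):
    """Pick a next hop: ingress policy first, then longest-prefix match in the FIB."""
    if ingress_if in policy:
        # Policy wins; the FIB is never consulted for this packet.
        return policy[ingress_if]
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in fib.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return "drop"
    # Longest prefix (most specific route) wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


fib = {"192.0.2.0/24": "Null0"}            # hypothetical aggregate route
bad_policy = {"Te1/1/0": "HE port"}        # policy wrongly bound to the IXP port

print(lookup("192.0.2.10", "Te1/1/0", bad_policy, fib))
print(lookup("192.0.2.10", "Te1/1/0", {}, fib))
```

With the policy bound, the packet is sent toward the HE port regardless of the tables. With the policy removed, the lookup falls through to the aggregate pointing at Null0, which is consistent with the symptom changing from a detour to unreachable.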

After the removal, the error on the test machine changed to unreachable instead of the earlier detour/loop:

root@fcix1 ~ # ping [ our destination IP ]
PING [ our destination IP ] ([ our destination IP ]) 56(84) bytes of data.
From [ our destination IP ] icmp_seq=1 Destination Host Unreachable
From [ our destination IP ] icmp_seq=2 Destination Host Unreachable
From [ our destination IP ] icmp_seq=3 Destination Host Unreachable
From [ our destination IP ] icmp_seq=6 Destination Host Unreachable
^C
--- [ our destination IP ] ping statistics ---
7 packets transmitted, 0 received, +4 errors, 100% packet loss, time 123ms

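As noted above, the VRF table was still misconfigured: it held neither IGP-learned routes nor connected routes, so the connected route for the router interface had to be imported into the VRF.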

interface TenGigabitEthernet0/2/0.100
 encapsulation dot1Q 100
 ip vrf select source
 ip vrf receive ixp-fcix

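This command exists specifically to import the connected route (that is, the interface's own address information) into the VRF. With the route imported, check the routing table again: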

#show ip route vrf ixp-fcix [ our destination IP ] 

Routing Table: ixp-fcix
Routing entry for [ our destination IP ]/32
  Known via "connected", distance 0, metric 0 (connected)
  Routing Descriptor Blocks:
  * directly connected, via TenGigabitEthernet0/2/0.100
      Route metric is 0, traffic share count is 1

The routing table looks right. Now the CEF table:

#show ip cef vrf ixp-fcix [ our destination IP ]  
[ our destination IP ]/32
  receive for TenGigabitEthernet0/2/0.100
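
The before/after comparison of the routing table can also be scripted. A rough sketch, assuming the `show ip route vrf` output has been captured as text; the helper names `egress_interfaces` and `is_blackholed` are made up for this example, and the regular expression only covers the "directly connected, via ..." lines seen in this post:

```python
import re


def egress_interfaces(show_ip_route_output):
    """Collect the interfaces named on 'directly connected, via ...' lines."""
    return re.findall(r"directly connected, via (\S+)", show_ip_route_output)


def is_blackholed(show_ip_route_output):
    """True when every listed path points at Null0."""
    ifaces = egress_interfaces(show_ip_route_output)
    return bool(ifaces) and all(iface == "Null0" for iface in ifaces)


before = """Routing Descriptor Blocks:
  * directly connected, via Null0
      Route metric is 0, traffic share count is 1
"""
after = """Routing Descriptor Blocks:
  * directly connected, via TenGigabitEthernet0/2/0.100
      Route metric is 0, traffic share count is 1
"""

print(is_blackholed(before), is_blackholed(after))
```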

Everything checks out, so back to the test machine to verify connectivity:

root@fcix-test ~ # mtr -4 [ our destination IP ] --report
Start: 2020-03-08T08:55:18-0700
HOST: fcix-test                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- [ our destination IP ]               0.0%    10    0.2   0.2   0.1   0.3   0.1

root@fcix-test ~ # ping -4 [ our destination IP ] -c 3
PING [ our destination IP ] ([ our destination IP ]) 56(84) bytes of data.
64 bytes from [ our destination IP ]: icmp_seq=1 ttl=255 time=0.124 ms
64 bytes from [ our destination IP ]: icmp_seq=2 ttl=255 time=0.160 ms
64 bytes from [ our destination IP ]: icmp_seq=3 ttl=255 time=0.155 ms

--- [ our destination IP ] ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 36ms
rtt min/avg/max/mdev = 0.124/0.146/0.160/0.018 ms

Connectivity is restored and the path is normal. Problem solved.

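Lessons learned:

Do not carelessly configure PBR on interfaces that are not customer-facing service ports.
Do not aggregate routes again on the router itself. When routes are imported into the routing table, the prefix-list match means only the aggregate gets imported (and its next hop is still Null0), while the more specific IGP routes are easily forgotten, so packets end up being dropped.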
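
The first lesson lends itself to a periodic audit. A minimal sketch, assuming `show ip policy` output in the two-column form shown earlier and a hand-maintained set of interfaces where PBR is expected; the route-map name `CUSTOMER-POLICY`, the interface `Gi0/0/1`, and both function names are hypothetical:

```python
def parse_ip_policy(output):
    """Map interface name -> route-map name from 'show ip policy' text."""
    bindings = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines, the column header, and any echoed CLI prompt line.
        if len(parts) != 2 or parts[0] == "Interface" or parts[0].startswith("#"):
            continue
        bindings[parts[0]] = parts[1].strip()
    return bindings


def unexpected_pbr(output, customer_ifaces):
    """List interfaces that carry a policy but are not known customer ports."""
    return sorted(i for i in parse_ip_policy(output) if i not in customer_ifaces)


sample = """#show ip policy
Interface      Route map
Te1/1/0    CUSTOMER-POLICY
"""

print(unexpected_pbr(sample, customer_ifaces={"Gi0/0/1"}))
```

Any interface this reports is worth a second look before it turns into a customer ticket.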

Sometimes tracking down a human error really can be the death of you...

LittleWolf Network Universe