This all started when I introduced a new network monitoring tool on to the network, the tool was “cisco prime”, but before I say any more lets be clear the issue here has nothing to do with prime which is a great tool for managing cisco devices. I noticed that when pushing new IOS files and backing up switch configuration that some time they would seem to lose network connectivity. I was able to ping them and ssh to them from my desktop, but they would simple not speak to the prime server.
So lets start with the set up (ip address modified of course)
Prime server – a vmware guest , in vlan 1, ip address 10.10.224.98/21
my desktop – physical machine, in vlan 1 also, ip address 10.10.226.46/21
Switch – management interface in vlan 666, ip address 20.20.255.6/24
interface Vlan666
ip address 20.20.255.6 255.255.255.0
no ip proxy-arp
end
switch#sh ru int vlan 1
Building configuration…
Current configuration : 65 bytes
!
interface Vlan1
no ip address
no ip proxy-arp
shutdown
end
Router – 6506 with a live interface in both vlan 1 and 666 set as DFGW on clients.
So to start with I can ping every thing and every thing can ping every thing else, and on the switch with a show arp I see
switch#sh arp
Protocol Address Age (min) Hardware Addr Type Interface
Internet 20.20.255.1 0 4403.a754.8300 ARPA Vlan666
Internet 20.20.255.6 – 7010.5c99.f241 ARPA Vlan666
switch#
So all looking good, I can see the switch IP address and that of the DFGW
switch#ping 10.10.224.98
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.224.98, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/6 ms
But then the ping stops working? and the switch can no longer contact the prime server, however I can still see it from my desktop? Logging back on to the switch I again look at the ARP cache..
Protocol Address Age (min) Hardware Addr Type Interface
Internet 10.10.224.99 2 0050.5683.61ce ARPA Vlan1
Internet 20.20.255.1 0 4403.a754.8300 ARPA Vlan666
Internet 20.20.255.6 – 7010.5c99.f241 ARPA Vlan666
What?? why has the switch learnt a mac address on vlan 1? I have two issues with it doing this, first Prime is still trying to contact it via the DFGW ( I can see this in a packet trace) so the switch is not seeing the packets coming in on vlan 1, second the interface for vlan 1 is shut down so there should be no ARP entries on it! to get it working I clear the ARP cache of this entry and all is good… well for a few days / hours and then it happens again, but the time between issues seems very random. Keeping an eye on it I see it happen with anther monitoring server, and then another? The one thing I notice is that all the times it happens it is always a server on vmware. Physical servers/appliance and desktops never seem to have this issue. This is the first piece of the puzzle, what does Vmware do different to other servers? They Migrate! And when they migrate between the physical hosts the vmware system sends a gratuitous arp on to the network to alert switches what port in the network to now find the server on? And some switches that have “ip arp gleaning” switched on which is the default hear this and place an entry in to there ARP table. Even though the switch had the interface vlan 1 shut down, it still passed vlan1 traffic through the switch at layer 2 and this seemed to be enough that it saw the ARP packet, and added the entry to its ARP table. Then of course the it try’s to use this entry for communication but as the interface is indeed shut it will not work!
A little bit of time with CISCO TAC and the solution was to disable IP arp gleaning on all the access switches, it might be useful for provisioning switches but as I found it can cause issues.
no ip arp gleaning tftp
no ip arp gleaning udp
The fact it was learning on a disabled interface is a bug and something CISCO are looking in to.
However that’s not quite the end of the story, disabling “IP ARP Gleaning” did not work! and this had us scratching out heads for a while, until one of the cisco engineers I was talking to noticed this..
Cisco Bug CSCun38166
“no ip arp gleaning tftp and udp” doesn’t work
CSCun38166
On 2960,when we configure “no ip arp gleaning tftp” and “no ip arp gleaning udp” then do
tftp use command”copy flash:teset tftp:”
It is expected following behavior:1.do not learn a IP address from other segment.
2.add a arp entry corresponding with HSRP virtual mac addressHowever in our case,it turns out like below:
1.generate a arp entry of IP address from other segment
2.add a arp entry corresponding with HSRP PHY mac address.
Conditions:
1.no ip arp gleaning tftp
OR
no ip arp gleaning udp
2.do tftp use following command
copy flash:teset tftp:
And looking at the version code, oh yes I would be running one of the affected versions. 15.0(2)SE6, a quick upgrade to version SE7 and all is good.
I am impressed though, hitting 2 cisco bugs in one issue.
And in the end it was prime that sorted it all out, a single click and it pushed the IOS update and the configuration for “no ip arp gleaning” to all 2960s affected devices, not going to ake this a post for plugging prime, but it does have its good points.