On Saturday 29 February 2020 our engineers were alerted by internal monitoring systems to a potential network issue.
Engineers arrived on site at the Micron21 Data Center to find all network paths for external internet services flapping up and down, creating intermittent network connectivity, and began reviewing the issue in more detail.
The issue was flooding our top-of-rack Dell Force10 switches with network traffic. Our team followed standard troubleshooting steps to identify the root cause and mitigate the problem, and in parallel a support ticket was opened with the Dell Networking team to initiate a remote session into the switches and begin analysis.
On initial investigation Dell engineers were unable to identify the root cause of the problem.
A third parallel task was initiated: Cisco was contacted to assist with root cause analysis and troubleshooting.
Cisco engineers looking into the issue performed packet captures and were able to identify that every packet entering the switches carried a TTL (Time To Live) of 1, meaning each packet was due to expire at its very next hop. With this information at hand, Dell engineers were able to confirm that all traffic entering the switches was being punted directly to the switch CPUs, congesting and flooding the switches and reducing network throughput.
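
As a purely illustrative aside, the sketch below shows one way such low-TTL traffic can be spotted in a capture on a Linux host using the Python scapy library. The interface name and packet count are placeholders and do not reflect the actual capture setup used by the Cisco engineers during the incident.

    # Illustrative only: report packets arriving with TTL 1, which a switch
    # or router must hand to its CPU to generate an ICMP Time Exceeded reply.
    from scapy.all import sniff, IP

    def report(pkt):
        ip = pkt[IP]
        print(f"TTL={ip.ttl} {ip.src} -> {ip.dst} proto={ip.proto}")

    # "eth0" is a placeholder interface; the lfilter keeps only IPv4 packets
    # whose TTL has already dropped to 1 on arrival.
    sniff(iface="eth0", filter="ip", count=100,
          lfilter=lambda p: IP in p and p[IP].ttl <= 1,
          prn=report)
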
The issue was identified as a loop, potentially created by a bug, fault or external source. Dell requested that senior engineers, who were another four hours away, be brought in to troubleshoot and properly identify the cause; taking this path would have extended the outage window by many hours.
At this stage we were faced with a choice: accept a minimum of another four hours of downtime, or manually begin an up/down troubleshooting process on every port across the switch stack to see if we could identify and disable the packet loop causing the outage.
We decided that customer uptime was a higher priority than waiting at least another four hours for Dell's senior engineers to come online and engage in troubleshooting, so the decision was made to manually bring every port on the switches down and back up to identify and isolate the cause of the data loop.
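
For readers interested in the mechanics, the sketch below illustrates the kind of port up/down cycle described above, scripted against a Force10 switch with the Python netmiko library. It is an assumption-laden example only: the management address, credentials and interface names are placeholders, and as noted below the actual work during the incident was carried out by hand on site.

    # Illustrative only: cycle each port down and back up, pausing so an
    # operator can observe whether the flooding stops. The offending port
    # would be left shut down once the loop is broken.
    import time
    from netmiko import ConnectHandler

    switch = ConnectHandler(
        device_type="dell_force10",
        host="203.0.113.10",       # placeholder management IP
        username="admin",
        password="********",
    )

    for n in range(0, 48):         # placeholder port range
        port = f"TenGigabitEthernet 0/{n}"
        switch.send_config_set([f"interface {port}", "shutdown"])
        time.sleep(5)              # watch CPU load / link flapping here
        switch.send_config_set([f"interface {port}", "no shutdown"])

    switch.disconnect()
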
After approximately 30 minutes of manually removing and replacing cables on the switch stack, the loop was broken and traffic started flowing again.
At this stage we are unable to confirm the root cause of the issue that triggered the loop and flooded all top-of-rack switch ports with traffic.
This is the first time a problem of this type has been seen in our environment in the last 14 years, and there are many unanswered questions that still need to be explored.
The (vh) team want to thank you for your understanding over this period. If you have any questions, please feel free to open a support ticket via our accounts portal at https://control.velocityhost.com.au or contact your account manager directly.
Yours sincerely,
Gerardo Altman
Director Of Problem Solving.