Intermittent Network Connectivity
Incident Report for Velocity Host Status Page
Postmortem

The Alert & Response

On Saturday 29 February 2020, our engineers were alerted by internal monitoring systems to a potential network issue.

Engineers arrived on site at the Micron21 Data Center to find every network path for external internet services flapping up and down, causing intermittent network connectivity. They began reviewing the issue in more detail.

The issue affected our top-of-rack Dell Force10 switches, flooding them with network traffic. Our team followed standard troubleshooting steps to identify the root cause and mitigate the problem. In parallel, a support ticket was opened with the Dell Networking team to initiate a remote session into the switches and begin analysis.

On initial investigation, Dell engineers were unable to identify the root cause of the problem.

A third parallel task was initiated: contacting Cisco to assist with root cause analysis and troubleshooting.

Cisco engineers performed packet captures and were able to identify that all packets entering the switches carried a TTL (Time To Live) of 1, the lowest possible hop count. With this information at hand, Dell engineers determined that all traffic entering the switches was being punted directly to the switch CPUs, congesting and flooding the switches and reducing network flow.
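To illustrate the kind of check the capture analysis performed (the exact tooling and commands used were not recorded in this report), low-TTL packets can be flagged programmatically. The sketch below runs over simulated packet metadata rather than a real capture; in practice this would be done with tcpdump or Wireshark:

```python
# Hypothetical sketch: flagging low-TTL packets, the symptom observed in
# the captures. The packet data below is simulated, not from the incident.

packets = [
    {"src": "203.0.113.10", "dst": "198.51.100.5", "ttl": 1},
    {"src": "198.51.100.5", "dst": "203.0.113.10", "ttl": 64},
    {"src": "203.0.113.11", "dst": "198.51.100.6", "ttl": 1},
]

def low_ttl_packets(pkts, threshold=1):
    """Return packets whose TTL is at or below the threshold.

    A packet arriving with TTL 1 cannot be forwarded in hardware: the
    switch must expire it itself, so it is punted to the CPU. Enough of
    these and the control plane is overwhelmed.
    """
    return [p for p in pkts if p["ttl"] <= threshold]

suspects = low_ttl_packets(packets)
print(len(suspects))  # 2 of the 3 sample packets arrive with TTL 1
```

With a live capture, the equivalent tcpdump filter for IPv4 TTL 1 is `ip[8] = 1` (byte offset 8 of the IPv4 header is the TTL field).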

Issue Identification

The issue was identified as a loop, potentially created by a bug, a fault, or an external source. Dell requested that senior engineers be involved so they could troubleshoot and properly identify the cause, but those engineers were another four hours away; taking this path would have extended the outage window by many hours.

At this stage we were faced with a choice: accept a minimum of another four hours of downtime, or manually perform an up/down troubleshooting process on every port across the switch stack to try to identify and disable the packet loop creating the outage.

Our Course Of Action

We decided that customer uptime took priority over waiting at least another four hours for Dell senior engineers to come online and engage in troubleshooting. The decision was made to manually cycle every port on the switches up and down to identify and isolate the cause of the data loop.
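The port-by-port process described above amounts to a linear elimination search: take one port down, check whether the loop persists, and restore it if it was not the culprit. The sketch below models that process; the port names and the loop-detection probe are illustrative stand-ins, since on the night this was done by hand, pulling and re-seating cables:

```python
# Hypothetical sketch of the up/down isolation process. Port names and
# the loop_active() probe are illustrative; the real work was manual.

def isolate_looped_port(ports, loop_active):
    """Disable ports one at a time; return the first port whose removal
    stops the loop, or None if no single port breaks it."""
    for port in ports:
        # Take the port down (modelled here as passing it to the probe),
        # then check whether looped traffic is still flooding the stack.
        if not loop_active(disabled=port):
            return port  # loop broke: this port carried the looped traffic
        # Loop still active: bring the port back up and move on.
    return None

# Simulated 48-port stack where the loop persists unless Te1/7 is down.
ports = [f"Te1/{n}" for n in range(1, 49)]
culprit = isolate_looped_port(ports, lambda disabled: disabled != "Te1/7")
print(culprit)  # prints Te1/7
```

One port at a time is slow but safe: it guarantees that at most one customer-facing link is down at any moment during the search.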

After approximately 30 minutes of manually removing and replacing cables across the switch stack, the loop was broken and traffic started flowing again.

At this stage we are unable to confirm the root cause of the issue that triggered the loop and flooded all top-of-rack switch ports with congested traffic.

Potential Causes:
  • Bug in the switching structure
  • Fault in the switching structure
  • External source triggering the loop by flooding the switch CPUs

This is the first time a problem of this type has been seen in our environment in 14 years, and there are many unanswered questions that need to be explored.

Next Steps

  • Implement procedures for quickly troubleshooting this type of event if it ever arises in the future.
  • Schedule downtime to upgrade the top-of-rack switches to the latest FTOS (Force10 Operating System) release - yet to be confirmed.

Synopsis

  • A networking issue was detected
  • Engineers were dispatched to site for investigation
  • Dell and Cisco engineers were engaged to assist in troubleshooting the root cause of the issue and to provide additional insight into the problem.
  • Due to the prolonged troubleshooting process with Dell and Cisco, and with Dell senior engineers unavailable for an unreasonable amount of time, a management decision was made to manually find and fix the issue.
  • After manual troubleshooting, the identified network packet loop was broken and internet services were restored.
  • An action plan has been formulated for faster troubleshooting of any potential future issues of this nature and potential switch firmware updates - TBA.

The (vh) team wants to thank you for your understanding over this period. If there are any questions, please feel free to open a support ticket via our accounts portal at https://control.velocityhost.com.au or contact your account manager directly.

Yours sincerely,

Gerardo Altman

Director Of Problem Solving.

https://velocityhost.com.au

Posted Mar 05, 2020 - 09:56 AEDT

Resolved
An intermittent network issue was detected affecting external-to-internal access to all (vh) internet-facing services at the Micron21 DC.

The issue specifically pertained to network traffic passing over our top-of-rack switches.

Dell and Cisco engineers were engaged, alongside our DC team, to assist with troubleshooting the root cause of the issue.

The issue is currently marked as resolved. At this stage we are unclear on the root cause; our team is reviewing the incident in more detail and gathering all relevant data so we can report back.

We expect more information to be available by close of business Monday 2nd of March 2020.

Thank you for your patience and understanding.

Kind Regards
(vh) Support Team
Posted Feb 29, 2020 - 06:00 AEDT