
Our support goes further than many expect!

Created by Danny Sternol | Blog

At Flying Circus, we often like to emphasise that the key to successful application operation lies in personal and constructive communication between all parties: this is especially true in the event of an error.

Why cooperation, communication and commitment matter more in support cases than apportioning blame is demonstrated by a recent incident in which a network component of a large fibre optic backbone operator at the Frankfurt site triggered diffuse, intermittent connection errors for one of our customers. But let's start from the beginning.

On Wednesday evening (12 June 2024), we received an initial enquiry in the chat asking whether we could see any anomalies in the network from our side. A quick check of our monitoring showed that everything was working as expected. On Thursday morning, we received further reports from the customer, and within our infrastructure we could see that outgoing packets were being forwarded from our routers to the Internet but, in some cases, were not being acknowledged by the remote peer. In our experience, the best way to get to the bottom of such a problem is to communicate quickly with all parties involved.

So we asked the customer for support so that we could compare our findings with those of their network technicians. We also brought our infrastructure and data centre service provider directly into the troubleshooting process and, with comprehensive insights from different sides, quickly got a feel for the error pattern. Unfortunately, errors in public Internet infrastructure cannot generally be avoided, and errors of this kind are difficult to diagnose. A faulty network component located at DE-CIX in Frankfurt became increasingly apparent as a possible cause. DE-CIX plays a major role in global Internet traffic and is the largest German Internet exchange. Internet service providers (ISPs), content delivery networks (CDNs) and other network operators exchange data traffic directly with each other at such exchanges, thus forming the public Internet from their individual networks.
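To compare findings from different sides, it helps when everyone measures the same thing. The following is only a minimal sketch, not the tooling we used during the incident: a small Python probe that repeatedly opens a TCP connection to a remote peer and counts how many attempts fail. The function name `probe_tcp` and its parameters are assumptions made for this illustration.

```python
import socket
import time


def probe_tcp(host, port, attempts=20, timeout=2.0, pause=0.0):
    """Open a TCP connection `attempts` times and count the failures.

    Hypothetical helper for illustration only. Returns a tuple
    (failures, attempts) so results from different vantage points
    can be compared directly.
    """
    failures = 0
    for _ in range(attempts):
        try:
            # create_connection performs the full TCP handshake; a timeout
            # or refusal means the peer never acknowledged our packets.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failures += 1
        time.sleep(pause)
    return failures, attempts


# Example call (host and port are placeholders):
# failures, total = probe_tcp("example.org", 443, attempts=50, pause=0.5)
# print(f"{failures} of {total} connection attempts failed")
```

Run from several networks at the same time, a probe like this gives a rough picture of whether failures depend on the path the traffic takes, which is the kind of indication that pointed towards a single component in our case.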

While our customer opened a support case with their IP service provider, our team worked on a workaround: rerouting the traffic to our primary data centre in Oberhausen via our Halle site so that the potentially problematic hardware component in Frankfurt would be taken out of the equation. Meanwhile, we received several reports from informal contacts via DENOG that confirmed the error pattern we had diagnosed. DENOG (German Network Operators Group) is a community of experts who work in network operation and Internet infrastructure and share their knowledge and experience with each other.

On Thursday (13 June 2024) at around 3 p.m., shortly before the rollout of our workaround, we received the first reports that the faulty component in Frankfurt appeared to have been replaced and the fault rectified.

Why are we writing about this error? Because it is important to us to communicate that the error-free operation and accessibility of the application for end customers is our top priority. Even if an error has not occurred within our sphere of influence, we take the initiative to actively and jointly search for a solution to the problem at hand. Unfortunately, this approach to troubleshooting is not standard in many managed IT services, even though problems like this one can often only be solved with the cooperation of all parties involved. The joint, professionally focused exchange between developers, customers, service providers and us is an integral part of our way of working from the very first minute. Even, and especially, when things don't go as planned!
