This is a story about the failure of a large computer system in the days before it went live. I am telling it to give others the opportunity to learn from it. It does not name or blame any people or organizations; in fact, everyone involved is a highly skilled professional who learns every day from events like these. My duty is to share.
Response Times
Since the system in question is not an end-user system but an API used to build mash-ups that serve end users, performance is very important: every request incurs the internet latency twice, first between the user and the mash-up, then between the mash-up and the API.
Gut feeling
As we were nearing the end of the project, performance tests became a more regular item on our backlog. Both the overall tests and our component tests showed figures within the desired range. We aimed for 80% of requests within 400ms and 99% within 5s, as those were the official targets. Although we were well within range, I felt that the system could do better. The 95th percentile was about 2s, while I felt it could be close to 500ms. It was just a gut feeling based on previous experience with similar systems, and I could not figure out why, this time, we did not achieve the superb response times we were used to.
Never one problem
But, as I said, we were well within range. People even complimented me personally and, most importantly, there were other things on my mind: I could not log in to the VPN anymore. This was a serious problem, as all the machines making up the system were reachable only over the VPN. It needed fixing first.
The VPN connection had never really been stable; apart from the first few days after it was delivered, it never was. Connecting was often slow, and it asked for my password twice even though I was sure I had typed it correctly the first time. Logging in via SSH was also very slow, and public key authentication only worked intermittently. But now it did not connect at all. Well, other people still had access, so I used their accounts. After all, they were the ones who needed it most, and they were fine.
After a week I was able to connect again. It was still slow but it worked. So let’s forget this misery and go on. There is a system that must go live!
Almost live
On Friday night all machines in the system were running the latest version and all databases were properly populated. Everything looked fine for the release the following Monday. Only the DNS entries had to be switched and the system would be live.
The gates open
Saturday morning I got a text message from the project manager. The system was working, but horribly slowly. Requests took seconds, tens of seconds, up to a minute. Consistently. Adrenaline! I ran to my computer, called my colleague and started investigating. Within minutes we knew that something was badly wrong and we called all hands to battle stations. Software developers, system maintainers, the customer's support center: about ten people from five organizations were on it.
We went deep, deep down into the system's guts, and after three hours we found a problem. The internal DNS service was responding slowly; too slowly. Requests took one second to complete, on average, and this caused the socket library to block while waiting for an answer every time it wanted to communicate with another system. This, it turned out, slowed down almost everything.
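To make that failure mode concrete: if every outgoing connection starts with a name lookup, a one-second resolver adds a full second to every hop. Below is a minimal sketch, in Python and with a placeholder host name, of how resolution time can be separated from connection time. It is an illustration, not the tooling we actually used.

```python
# Illustrative only: separate how long name resolution takes from how long
# the TCP connection itself takes. The host name is a placeholder, not one
# of the hosts from this story.
import socket
import time

HOST = "example.com"
PORT = 443

t0 = time.monotonic()
addrinfo = socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)
t1 = time.monotonic()
print(f"DNS lookup took {(t1 - t0) * 1000:.0f} ms")

# Connect to the first resolved address for comparison.
family, socktype, proto, _, sockaddr = addrinfo[0]
t2 = time.monotonic()
with socket.socket(family, socktype, proto) as s:
    s.settimeout(5)
    s.connect(sockaddr)
t3 = time.monotonic()
print(f"TCP connect took {(t3 - t2) * 1000:.0f} ms")
```

In our case, a healthy connect time next to a one-second lookup time would have pointed straight at the resolver.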
The software engineers knew immediately what to do: in the places where DNS was time-critical, replace the host names with IP addresses. And that is what we did. While we worked the VPN connection kept failing, but we got the fix in place. It was eight o'clock in the evening. I took a drink and went to sleep.
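For readers who want to picture the workaround: the essence is to take name resolution out of the request path, either by configuring the address directly, as described above, or by resolving it once at start-up. A hedged sketch with made-up names and addresses:

```python
# Sketch of the idea behind the workaround, not the actual change we made:
# keep the DNS lookup out of the request path. Names, addresses and ports
# below are made up.
import socket

# Either configure the address directly (as described above) ...
BACKEND_ADDR = "10.0.0.17"
# ... or resolve the name once at start-up and reuse the result:
# BACKEND_ADDR = socket.gethostbyname("db.internal.example")

def open_backend_connection(port: int = 5432) -> socket.socket:
    """Connect using the cached address; no DNS lookup per request."""
    return socket.create_connection((BACKEND_ADDR, port), timeout=5)
```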
Hell breaks loose
The next morning, on a beautiful Sunday, the project manager sent me another text message: the system was spitting out incomplete data. Again I launched myself towards my laptop and gazed at our monitoring system: nine of the 13 services were gone. Dead. Red flags all over. And not a single message on my phone. Oh dear, oh dear.
My first reflex was to try to log in to one of the machines hosting the dead services. But the VPN did not connect. Trying… connecting… no luck.
Again I called all hands on deck. But this time there was nothing else I could do; I had to wait for the VPN access to be fixed first. After three hours the problem was found and fixed: the LDAP service had failed, and hence authentication could not proceed properly. Restarting the LDAP service brought everything online again.
Saved
Everything? Yes, everything. Even the nine dead services were running happily again. There was no need for me to log in or even connect to the VPN. Even the response times were better: the previously unsatisfying 95th percentile was now down to 350ms, much closer to what I had expected.
What happened?
LDAP can back a DNS service, and so it did here. For a reason still unknown it became slower and slower until it eventually collapsed completely. Since DNS is such a fundamental system component, its malfunction affects almost all other services. The VPN went down. Nine of our services went down. The only four remaining were the ones into which we had poked the IP addresses the day before. Did you wonder why we did not get notifications before the project manager texted me? Because the e-mail and messaging services were also down. There is almost nothing that works without DNS.
What do I take away from this story? First of all, there is the obvious single point of failure: the DNS service. While it is not within my power to fix that, we did reduce our dependency on it. However, this has its limitations: using IP addresses instead of host names gets in the way of load balancing and name-based routing. So in the end, every system needs reliable DNS. But this is not the most powerful lesson to take away.
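To illustrate that limitation: with name-based routing, a server decides what to serve from the name the client sends along (the HTTP Host header, or the SNI field for TLS), not from the IP address it was reached on. So connecting by bare IP only works as long as the client still sends the right name. A small sketch, with made-up addresses and names:

```python
# Why bare IPs undermine name-based routing: the TCP connection goes to the
# IP, but the server still routes on the name in the Host header (and, for
# HTTPS, on the TLS SNI field). The address and host name below are made up.
import http.client

conn = http.client.HTTPConnection("203.0.113.10", 80, timeout=5)
conn.request("GET", "/health", headers={"Host": "api.example.com"})
resp = conn.getresponse()
print(resp.status, resp.reason)
conn.close()
```

Pin the IP and forget the name, and any load balancer or virtual host that keys on that name no longer sees the traffic it expects.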
What did I learn?
The most fundamental and most potent lesson I have learned, or in fact re-learned, is: if you see something strange, investigate! It is clear from this story that there were early warnings that could have prevented disaster. First, the VPN connections were slow and failed often. Second, we felt that the performance could be better. We ignored both signals for a considerable time, and both clearly hinted at the same problem.
The most powerful lesson
Now, this is not a new lesson. In fact, almost everyone with some experience in systems maintenance already knows it. So what I am actually learning is how difficult it can be to apply what you know to be good. Circumstances, context, people dynamics, pressure, and even personal well-being and fitness are strong forces that influence people's ability to sense the often small signals and act on them. You cannot neutralize these forces by stating that people must act 'professionally'. Both sensing and acting are crucial human activities that only happen when the circumstances are right. No matter how obvious and well known a rule may be, applying it consistently is a whole different ball game.