Seeing To It: December 2009 Archives

December 31, 2009

The Perfect Storm - Enough umbrella to allow sight, but enough wind to counter its benefits...

n.b. - this has been mostly written since the July 4th weekend, but required so much review and editing time that I put it off until I had time to do that - it still rings true, though the causes of the issue were immediately addressed, and have remained so. I did put a couple extra things in there, given that I have time to think. CPT


"We are born wet, naked and hungry. Then things get worse." - Anonymous

There is a concept with which knowledge management practitioners should be familiar - the butterfly effect. Without getting into the discussion of mathematical progressions and biased assumptions, allow me to submit that there is at least a practical lesson to be taken from the concept, because it really does happen, if not in the physical world then certainly in the haze of electrons.

The basic tenet of the butterfly effect is that all things are connected, such that when a butterfly in one part of the world flaps its wings, the aggregation of energy unleashed grows into a storm on the other side of the planet, or something to that effect. Here is a picture, though, of how this progresses, and why it is important to be in a continual process of improvement and evaluation.

One reason for the gap in blogging of late has been the recovery from what a colleague (and fellow ICS graduate) termed a perfect storm. For those unfamiliar with the book by Sebastian Junger by that title, the story goes that there was a ship called the Andrea Gail (I believe - going from memory here), manned by seasoned sailors, which was caught in a storm so severe that it got the better of the men and their ship. If you like happy endings it's not a tale that is pleasing, because the good guys lose in a very permanent way. The concept is that no matter the skill level of the professional, and no matter the quality of the equipment in question, there is a situation which can defeat all. From here on I will call that situation the flap.

For us the flap occurred at the worst possible time - the week when so much of the fiscal year-end closing work would be done, mere minutes away from the backup that was set to run every 4 hours, and only 2 days before a holiday weekend. It happened in the worst possible way - a total loss of the virtual machine environment for the entire system. It included anathema for those of us in IT - data loss. It included software issues, at least as far as database maintenance is concerned. All three IT elements were involved - people, hardware and software. If it could affect, it did.

Without going into details, by the wee hours of the following Monday morning, after almost 5 days of frantic work on the back end, everything was back, save for a small and replaceable set of data. There were benefits, including further proof of the ability to recover from a disaster scenario (part of the recovery time was because the recovery tapes were only configured to be read back in one at a time, and for over 500GB of data that takes a bit of time, but it did come back) and the knowledge that our thinly-provisioned staffing can get the job done with only a major hit to their work/life balance. Even so, there were ripples through the environment that took a while to ride out. Submitted for your learning is a list of the big ones:

1) That nagging knowledge that some mysteries are never going to be understood. The root cause for the failure within our virtual environment is not assessable. Any forensic hope we had of figuring out the cause was destroyed in restoring service. It grates that the drive to analyze and understand will forever be without this one thing. This is a ripple because it not only goes against the quest for learning, it also means that I can never say that the root cause has been addressed and rectified. We will devise a cure for the situation, but we will never have the exact cure, because we cannot assess it.

2) The dread on the part of our users that we will not be adequate in our ability to protect them. This is a major concern for me personally, because of a commitment to client satisfaction. An unofficial slogan I use occasionally is that we work the dark side of the clock so that you never have to darken your sunny disposition in the light side, and the safety and security of data is taken more seriously than can be expressed. Because of that trust, the few thousand bits that had to be reproduced were mourned more than the gigabytes that were restored. The users don't know the lengths that are taken to ensure good service, and they shouldn't have to.

3) Mitigating the future, or, fool me once, shame on you, fool me twice, print out my resume. Without a clear knowledge of the causes for the outage, keeping the same thing from happening is difficult at best. It is being addressed, and I defy (with due respect to the electrons, of course) the same situation to leave a mark on anything that we do from that bleary Monday morning hence. However, as good as we are, and as much talent as was here in this hallway (for the latent observer I'm still in the office at this writing doing a software update so that the users won't realize anything was done, other than a couple minutes that will be needed on their machines to roll out the new client updates - perhaps another unofficial slogan could be, "Our hours for your minutes."), we are stuck probably over-engineering to ensure things are good.

4) Producing gun-shy work. Whenever there is a large-scale outage, there is an incredible surge of two forces - those wanting to know what happened, how it happened and how it will be prevented in the future, and those wanting to know WHO happened and what will happen to them. The combination is not good. Fortunately the latter hasn't really trickled down to be a factor this time, but if you spend enough time in the industry you will certainly find it, and it will cause hesitation and over-caution in the work.


An interesting thing is that without the wind we would have a hard time flying, and that is the real lesson of this butterfly effect. That said, here are some design precepts that should be standard in the thinking of the practitioner, but especially the practitioner coming out of the Center (for the non-ICS readers, the Center for Information and Communication Sciences is a world-class graduate program at Ball State that supplies the rare blend of 'theory-trained/hands-on tested/ready to hit the ground running and add value from the onset' employees to premier technology and telecommunications companies around the world):

1) There are many fine concepts, like data deduplication, out there in the industry. Know their limitations. Smart deduplication does NOT mean that a single piece of data only ever exists in one place at one time, it means that the semi-permanent home for access to a piece of data is in one place. The difference is that a piece of data can exist in a short-term duplication state to provide failover and gap data protection, or you can lose data when your perfect storm hits. We are mitigating this with a short term duplication of near-line data on a mechanically and logically separated system. In the event of a hardware and/or software failure in anything other than the network, no data will be lost. The cloud? Did Danger lose a bunch of text messages? Do GMail and Hotmail suffer more than rare outages? Does Salesforce.com? Caveat emptor for your mission critical stuff. Will outages always occur? Will the technology improve? How will you know if your data winds up on the good side of those questions?

I'll just say this... if you read it in a glossy ad, or in a newsletter, take that blurb and make it an iron-clad guarantee backed up with stiff financial and other penalties as part of the contract and see if they will still sign it. If not, back away quickly and on your way out the door tell them to fire their marketing people and hire some honest-to-goodness coders to fix the things they've been promising.

2) If you do not have a disaster recovery plan that has been tested AND proven, then you have nothing. We were able to bring back the sum total of our history by having just such a plan. Part of the reason the plan worked is because of the platform it is running on, and how it has been architected. Virtualization on the mainframe was being done decades ago, and when you try to recover or manage it, it shows. I'm not saying that it's not possible to recover a non-mainframe system, but I am saying it is easier to recover one physical box with many logical partitions than it is to recover a hundred of them. The ability to come back from this is one more indication that the fight is a good one against the FUD-mongers who try to move everything under the sun off of the big iron because they don't understand how it works. Ignorance and/or a lack of knowledge are poor justifications for failing to use the right tool for the right job.

3) Wear your raincoat. The best protection against the perfect storm is a complete understanding of your system, the software, and where the development is going with it. It is NOT - repeat NOT - enough to be able to stick in the CD and know what to choose. You need to LEARN that there are communications protocols, system characteristics, operating systems, dll's (IBM invented those, but they are mostly on Windows boxes, though you can create static MFC references to make them more ubiquitous - part of the learning that took place during this last week of Dec. 28...) and other pieces parts that create this system. If this sounds like a plug for a specific Master's program with which I'm familiar - it is. It is also a plug for LEARNING. If you don't appreciate the learning process, then you do not belong in IT. It's not a cruel value judgment, it is a fact. In many jobs, a lack of knowledge just makes you less efficient a worker, but in anything involving IT, a continuing lack of knowledge makes you a liability. Your knowledge depth IS the raincoat you can wear during your perfect storm.

4) Don't be a macho know-it-all. Counterbalancing the need to have a deep knowledge of your system is the need to work well with those who have an expert level of knowledge in theirs, especially when their system and yours interact. They will need to ask detailed questions about how your system functions so they can help fix whatever happens, and you will need to know in detail how their system and yours form their symbiotic combination of bits. Learn like one possessed, but know it is as Socrates once said, "The only true wisdom is in knowing you know nothing."
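For the coders in the audience, the first precept above (deduplication with a short-term duplicate to cover the gap between backups) can be sketched in a few lines of Python. This is a toy illustration and not our production setup: the class names `DedupStore` and `ProtectedStore`, the in-memory dictionaries, and the `staging` list that stands in for the mechanically and logically separated system are all assumptions made up for the example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes (the single long-term home of the data)
        self.names = {}  # name -> digest

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # store the content only if it is new
        self.names[name] = digest

    def get(self, name):
        return self.blobs[self.names[name]]

    def snapshot(self):
        # Copies stand in for a backup written to other media.
        return dict(self.blobs), dict(self.names)

    @classmethod
    def from_snapshot(cls, snap):
        store = cls()
        blobs, names = snap
        store.blobs, store.names = dict(blobs), dict(names)
        return store


class ProtectedStore:
    """Dedup store plus a short-term duplicate of writes since the last backup."""

    def __init__(self):
        self.primary = DedupStore()
        self.staging = []  # hypothetical separate system: (name, data) pairs
        self.last_backup = DedupStore().snapshot()

    def write(self, name, data):
        self.staging.append((name, data))  # duplicate first, then store
        self.primary.put(name, data)

    def backup(self):
        self.last_backup = self.primary.snapshot()
        self.staging.clear()  # the gap is covered by the backup now

    def recover(self):
        restored = DedupStore.from_snapshot(self.last_backup)
        for name, data in self.staging:  # replay the gap data
            restored.put(name, data)
        self.primary = restored
        return restored


store = ProtectedStore()
store.write("ledger-a", b"year-end totals")
store.write("ledger-b", b"year-end totals")  # same content, stored once
store.backup()
store.write("ledger-c", b"late entry")       # arrives after the last backup
store.primary = DedupStore()                 # simulate total loss of the primary
recovered = store.recover()
```

The point of the sketch is the order of operations in `write`: the duplicate lands on the separate system before the deduplicated store is touched, so a failure minutes before the next backup loses nothing that was acknowledged. Clearing the duplicate only after a good backup is what keeps it short-term.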

The final piece of learning from this is to make sure that your users understand how deeply they are protected, and how fiercely they are worked for. If they trust you, and your efforts, then on the odd occasion that something happens to disrupt the flow of their data, it goes better than if they think you are a nameless horde of MMPG-playing bandwidth misers. They don't need to hear the deep details, but it is a pretty good thing for them to know that they are using a system which has weathered disaster testing, modifications and other things.