
Monday, June 4, 2012

Keep the System Up!


Keep The Lights On

There are many services we provide in I.T. Many are hard to define and even harder to price effectively. However, there is one service that is very simple. It has real, measurable outcomes with significant business impacts in the event of a failure. Everybody knows when you’ve failed to meet expected service levels, and they can calculate just how much it costs when you make a mistake.
The service is: “Systems Availability”. 
In this blog entry, I spend a brief moment giving some pointers on how to keep your systems available. I will show you how you can achieve “Zero Outages” or “No Unplanned Systems Outages” for the benefit of your customers.
The primary premise is that failed systems availability is a symptom of several potential causes. You have to manage the causes to eliminate the problem.
Causes of System Outages

Poorly Designed Systems Break: If a system is poorly crafted by a designer/architect, if the technology is immature or untested, if the solution is old and no longer supported by the vendor, or if knowledge of the system's workings is fast disappearing, then you have a problem.
I do not have data yet to give me an overall view of the main contributors to systems instability for these reasons, but my instinct tells me that all of these factors are at play, with poor design the main factor in the majority of cases.
Changes to Systems Cause Instability: Experience has taught me that if something is working and you fiddle with it, it has a tendency to break. The data supports this as a main contributor to system instability. Therefore, we firstly have to become better at verifying the quality and impact of changes (which presupposes proper design and testing, as mentioned above), and secondly, we have to minimize the rate and scale of change once a system is stable and in production.
Non-Maintained Systems Eventually Break: Like any motor vehicle on the road, our systems need ongoing servicing to continue operating as expected.
If this is neglected, they fail their "Warrant of Fitness", and instability and failures are the key symptoms. There are many cases where production support teams have virtually no visibility of the day-to-day maintenance routines required to keep systems operational. If this is improved, then one will have healthier, more stable and more available solutions on hand.
Poor Monitoring: You will get a standard reply from production teams once you start talking about systems availability: "Yes, but our monitoring is inadequate, and therefore the systems fail." 
Monitoring tells you when things are going wrong and helps you to be pro-active in rectifying issues. It is not a cause of failed systems; there is flawed logic in blaming a lack of monitoring, or poor monitoring, for systems instability. I am keeping it on the cause list for one reason only: its value in allowing production support teams to avoid an outage pro-actively.
Infrastructure Failure: Most hardware has a limited life-span (not so with software) or has inherent manufacturing defects. When a disk goes, or memory fails, you can only rely on the resilience built into the hardware you use to save you from a systems failure. That is why the likes of UNIX, SUN, RAID and TCP/IP provide better availability: they are designed to work around hardware, network and system failures.
If you use hardware with these features included (like airbags and ABS in your car), then you have a better chance of avoiding outages altogether. If not, then you will have an outage when the hardware or network goes.
How To Avoid Outages

Verify System Quality - Quality Control: In Systems Operations and Support, one has one chance to avoid outages due to bad design and/or immature or untested solutions. This is when one verifies the solution's quality, integrity and stability during a pre-production/staging release.
The primary objectives of a pre-production/staging release are to verify that it won't compromise the current production stability of the solution, that the release process and procedures work as designed, and that the new release can be backed out as required when things do go wrong.
If one materially improves the practices used to verify quality, and ensures that pre-production/staging environments provide a reasonable basis for this verification, then one can avoid a significant number of design-induced systems outages.
Contrary to popular belief, a pre-production/staging release is not for the users to verify that the functionality complies with requirements - this is a secondary objective (usually only material in highly integrated environments).
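To make this concrete, below is a minimal sketch (in Python) of the kind of automated checks a pre-production/staging verification could run before sign-off. The staging URL and back-out package path are hypothetical placeholders, not part of any specific solution; treat it as an illustration of the verification idea, not a definitive implementation.

    # Sketch of pre-production/staging release verification checks.
    # STAGING_URL and BACKOUT_PACKAGE are hypothetical placeholders.
    import urllib.request
    from pathlib import Path

    STAGING_URL = "http://staging.example.internal/health"
    BACKOUT_PACKAGE = Path("/releases/backout/previous-release.tar.gz")

    def check_service_responds() -> bool:
        """The staging service must answer its health endpoint."""
        try:
            with urllib.request.urlopen(STAGING_URL, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    def check_backout_available() -> bool:
        """A back-out package must exist before the release is approved."""
        return BACKOUT_PACKAGE.is_file()

    if __name__ == "__main__":
        checks = {
            "staging service responds": check_service_responds(),
            "back-out package present": check_backout_available(),
        }
        for name, passed in checks.items():
            print(("PASS" if passed else "FAIL") + ": " + name)
        if not all(checks.values()):
            raise SystemExit("Verification failed - do not promote this release.")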
Managed Change - Rate and Scale: If change causes your systems to break, then a smart thing would be to avoid it altogether. There is material confirmation, and experience will tell any Systems Administrator, that when you change something, the chances of an unplanned systems outage increase significantly.
It is not practical to avoid changing systems altogether, because systems evolve around the needs of their users, so one has to go for the second-best option.
Firstly, decrease the rate of change. This means trying to delay changes as much as possible into weekly, monthly or annual releases. The lower the frequency of change, the lower the risk of unplanned outages.
Secondly, decrease the scale of change. There is a direct correlation between the scale of a change and the risk it holds for stability. Therefore, try to make small changes during each release and leave the major modifications for significant releases only. Don’t try to do too much in a release - keep it simple.
There is a balance to be struck between rate of change and scale of change, and each solution is unique in its requirements for both.
I have worked in SAP environments where we had releases twice daily with small enhancements and without any real issues, and then I've worked in Web solution areas where minor modifications twice a year turn it all to custard.
Your Solution Architect/Designer and Systems Administrators are best placed to guide you on the risks of each change for each solution.
Maintain Systems and Monitor Health: The simple solution is to build a list of maintenance tasks for the solution and ensure that they get done, and are checked and verified, regularly.
Again, each solution has unique requirements, and your vendor, Architect/Designer and Systems Administrators will be able to capture the majority of maintenance tasks required for a solution. If they don't capture the important ones, they will soon learn what they are when they have an outage on hand. 
Use monitoring to report on the health of the system so that you can avoid outages. That is all monitoring will provide - pro-active feedback - and it buys you time to solve a problem before it causes an outage.
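As a rough illustration, a maintenance checklist can be as simple as a scheduled script that runs each task and records the result. The sketch below is in Python; the task names and the 80% disk threshold are illustrative assumptions, not a definitive maintenance list for any particular solution.

    # Sketch of a scheduled maintenance/health check run.
    # Task names and the disk threshold are illustrative assumptions.
    import shutil
    from datetime import datetime, timezone

    def disk_space_ok(path="/", threshold=0.80):
        """Warn before the disk fills up, rather than after the outage."""
        usage = shutil.disk_usage(path)
        return (usage.used / usage.total) < threshold

    MAINTENANCE_TASKS = {
        "disk space below threshold": disk_space_ok,
        # Add the tasks your vendor, Architect/Designer and Systems
        # Administrators identify, e.g. backup verification, log rotation,
        # certificate expiry checks.
    }

    def run_maintenance():
        timestamp = datetime.now(timezone.utc).isoformat()
        for name, task in MAINTENANCE_TASKS.items():
            status = "OK" if task() else "ATTENTION REQUIRED"
            # In practice this output would feed your monitoring/ticketing tool.
            print(timestamp, status, name)

    if __name__ == "__main__":
        run_maintenance()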
Configuration Management: You will be amazed to find how little you know of your solutions, how out of date that knowledge is, and how hard it is to get to the information.
I've implemented very practical (and archaic) ways to keep the Configuration Management of a solution under control. One key approach is making the Systems Administrator and Solution Architect/Designer responsible for the detail - they need to interface with the Configuration Manager and ensure the Configuration Management Database (CMDB) and Definitive Software Library (DSL) are up to date for their solution. These need to be updated with the changes of each release, and all maintenance activities logged and signed off.
This activity is audited once per month or quarter by a peer, a team member and the immediate Operations Manager, and once a year by your I.T. Auditors.
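A very small sketch of the idea follows, in Python, using a hypothetical record layout (a real CMDB/DSL tool has its own schema): a per-solution record that every release and maintenance activity must update, and that the periodic audit can check for staleness.

    # Sketch of a per-solution configuration record that releases and
    # maintenance runs keep current. Field names are hypothetical; a real
    # CMDB/DSL has its own schema.
    from dataclasses import dataclass, field
    from datetime import date, timedelta

    @dataclass
    class ConfigurationRecord:
        solution: str
        version: str
        owner: str                      # Systems Administrator / Solution Architect
        last_updated: date
        maintenance_log: list = field(default_factory=list)

        def record_release(self, version, when):
            """Called as part of every release, so the record never drifts."""
            self.version = version
            self.last_updated = when

        def record_maintenance(self, activity, signed_off_by, when):
            """Every maintenance activity is logged and signed off."""
            self.maintenance_log.append((when, activity, signed_off_by))
            self.last_updated = when

        def is_stale(self, today, max_age_days=90):
            """What the monthly/quarterly peer audit would flag."""
            return (today - self.last_updated) > timedelta(days=max_age_days)

    # Usage example (hypothetical solution name, owner and dates)
    record = ConfigurationRecord("billing-web", "4.2", "jsmith", date(2012, 3, 1))
    record.record_maintenance("database index rebuild", "jsmith", date(2012, 5, 14))
    print("Needs audit attention:", record.is_stale(date(2012, 6, 4)))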
Evolve Systems Towards Improved Availability and Stability: Lastly, most systems are not perfect, and therefore there needs to be a plan for each system that details its evolution towards stability and availability. 
These deliverables need to be tabled with investment managers/system owners and prioritized along with all other investments and planned changes.
Improved Service Delivery Through ITIL Deployment
ITIL is a set of processes (in no way comprehensive or complete) that provides a framework for IS delivery.
There is nothing that proves that adopting ITIL will improve your service delivery; however, you will potentially have made progress down a CMM (Capability Maturity Model) road in the following ways.
  • You will be able to testify that you have repeatable practices. 
  • You will be able to use a standard set of measurements to verify your outcome.
  • You will be able to manage your activity based on these standard processes and measured outcome.
  • You will be able to use a structured innovative process to improve your activities and outcomes.
Some will argue that you can get all of the above right without ITIL.

It is just that ITIL has somehow started to become the rules for the IT Services Game (like there are rules for Rugby, Netball and Cricket), and I definitely support a journey down this path.
However, if one wants to succeed in solving problems like "Systems Outages" through ITIL framework adoption, then one needs to add two missing pieces to the puzzle.
Three Parameters Drive This Outcome

Success depends on getting three key parameters right: Process (ITIL), Job, and Person.
The I.T. service business is ninety-five percent reliant on what people do. If one cannot get ITIL into what we do each day (the People), then there will be no benefit from this approach. It will be a set of processes filed in a cupboard, or it will be like playing rugby with netball rules and cricket with rugby rules: no one knows what is expected of them individually to benefit the collective/team.
To be materially successful in ensuring improved I.T. service delivery, such as no unplanned systems outages, we need to get the job of each person defined in the ITIL framework and be able to cast the right person into the right job.
This way, the person in the role will know what is expected from them and the outcome they are responsible to produce within each ITIL process.
I have seen many ITIL initiatives fail dismally due to the failure to address the Role and the Person issues.
To illustrate the point, I will leave you with one question to contemplate: "Who is the one person/role responsible and accountable for systems availability in your company?"
Let me know your answer. I would gladly contribute to your business’ success.
Hendrik van Wyk
