ITIL defines an "Incident" as any unplanned interruption to an IT service or reduction in the quality of an IT service, and a "Problem" as the cause of one or more of those incidents. The primary objectives of Problem Management are to prevent problems and the resulting incidents from happening, to eliminate recurring incidents, and to minimize the impact of incidents that cannot be prevented.
Problem Management depends on a mature Incident Management process. Although it is possible to start early with Problem Management, the process is tightly integrated with Incident Management, so it is best to implement Problem Management after Incident Management is in place. You will need incident data, impact, frequency and incident trends to identify relevant and worthwhile Problems to work on.
It is often possible to start Problem Management activities without a formally defined Problem Management process. Instead of getting bogged down at the start of the project with process design, supporting tools and documentation, consider going for quick wins.
Start with actions like:
- Identify the top 5 to 10 incidents
- If needed, provide guidance to incident management/service desk on how to record incidents
- Find some problems and solve them!
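As a rough illustration of the first quick win, the sketch below counts incidents by category to surface the most frequent ones. The record layout (`id`, `category`, `summary`) and the sample data are assumptions made for this example; a real service desk export will have its own fields.

```python
from collections import Counter

# Hypothetical incident records; the field names are assumptions for illustration.
incidents = [
    {"id": "INC001", "category": "Email", "summary": "Mailbox full"},
    {"id": "INC002", "category": "VPN", "summary": "Connection drops"},
    {"id": "INC003", "category": "Email", "summary": "Cannot send mail"},
    {"id": "INC004", "category": "Printer", "summary": "Paper jam"},
    {"id": "INC005", "category": "Email", "summary": "Mailbox full"},
    {"id": "INC006", "category": "VPN", "summary": "Login timeout"},
]


def top_incident_categories(records, n=5):
    """Return the n most frequent incident categories with their counts."""
    counts = Counter(record["category"] for record in records)
    return counts.most_common(n)


for category, count in top_incident_categories(incidents, n=2):
    print(f"{category}: {count} incidents")
```

Grouping by category is only one possible choice; the same counting could be done per Service or per Configuration Item if the incident records carry that information.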
A key activity in Problem Management is to look for the root cause of one or more incidents and recommend a permanent fix. Choosing the right people for the job is crucial. Analytical people with the right technology background are best suited to such roles. This need not be a permanent role. In fact, most organisations do not assign someone to be "THE Problem Manager". Problem Managers are best identified and assigned based on the Problem(s) at hand. Sometimes a task force is appointed instead of a single person. Besides technical skills, the assigned Problem Manager(s) should preferably have problem-solving skills and experience with techniques like Kepner-Tregoe, Pain-Value Analysis and Ishikawa diagrams.
At some stage, the process will need to be designed, documented and formally rolled out. Roles and responsibilities for Problem Management need to be defined, and a process owner needs to be assigned.
Reports and metrics have to be defined. Examples include:
- Number of Problems and Known Errors in a period by status, Service or Category.
- Percentage of Problems which have been solved per category and period.
- Average time for finding root cause per category.
- Average resolution time of problems and known errors per category.
- Effort invested in Problems pending resolution and expected effort required for closure per period (as measured by resolution time).
- Total Problem Management effort on a per Service basis vis-à-vis changes in Service availability. This would relate Problem Management effort with estimated downtime avoided due to incident prevention.
- Number of problems that recur.
Unlike Incident Management metrics like "percentage solved within target time", Problem Management metrics are typically not included explicitly in SLAs.
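Two of the report metrics listed above (percentage of Problems solved per category, and average time to find the root cause per category) can be sketched as follows. The record fields (`opened`, `root_cause_found`, `resolved`) and the sample values are assumptions for this example, not the schema of any particular tool.

```python
from collections import defaultdict
from datetime import date

# Hypothetical problem records; field names and dates are assumptions.
problems = [
    {"id": "PRB1", "category": "Network", "opened": date(2024, 1, 1),
     "root_cause_found": date(2024, 1, 5), "resolved": date(2024, 1, 10)},
    {"id": "PRB2", "category": "Network", "opened": date(2024, 1, 3),
     "root_cause_found": date(2024, 1, 6), "resolved": None},
    {"id": "PRB3", "category": "Email", "opened": date(2024, 2, 1),
     "root_cause_found": None, "resolved": None},
]


def percent_solved_by_category(records):
    """Percentage of problems with a resolution date, per category."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        if record["resolved"] is not None:
            solved[record["category"]] += 1
    return {cat: 100.0 * solved[cat] / totals[cat] for cat in totals}


def avg_days_to_root_cause(records):
    """Average days from opening to root cause, per category.

    Problems without an identified root cause are left out of the average.
    """
    days = defaultdict(list)
    for record in records:
        if record["root_cause_found"] is not None:
            elapsed = record["root_cause_found"] - record["opened"]
            days[record["category"]].append(elapsed.days)
    return {cat: sum(values) / len(values) for cat, values in days.items()}


print(percent_solved_by_category(problems))
print(avg_days_to_root_cause(problems))
```

Categories with no identified root cause yet do not appear in the second result at all, which a report would probably want to flag rather than silently omit.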
Setting up a Known Error Database (KEDB) is another key activity. The KEDB maintains information about problems (i.e., isolation and resolution procedures) and the appropriate workarounds, scripts, references to patches, FAQs and resolutions. The KEDB or knowledge database must allow for flexible retrieval of information, preferably by keyword search.
However, the KEDB may not add much value if the Incident Management process or the Service Desk staff are not mature enough to use it efficiently. A KEDB is of little use if Service Desk or IT staff do not help capture information and use the system to aid first-line diagnostics. So setting up a KEDB system is not enough in itself. A knowledge management mindset and culture are needed as well. Incentives and metrics will have to be introduced to motivate the right behaviour in Incident and Problem Management staff.
A tool to support the creation and tracking of Problem and Known Error records should be considered. Given the close dependency between Incident and Problem Management, integrating the incident and problem workflows and data records in the tool is important. Most commercially available tools, like BMC's Remedy or HP's Service Manager, come with separately purchasable but integrated modules for Incident Management, Problem Management and Change Management, plus a Configuration Management Database (CMDB) to store the system management records and Configuration Item (CI) information.
Like any other ITIL process, the Problem Management process should then go through Plan-Do-Check-Act cycles and be improved and refined over time.
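To make the keyword retrieval requirement concrete, here is a minimal in-memory sketch of a Known Error lookup. The `KnownError` and `KnownErrorDatabase` classes, their fields, and the sample records are hypothetical and exist only for this illustration; a production KEDB would normally sit inside the service management tool and use a proper search index.

```python
from dataclasses import dataclass, field


@dataclass
class KnownError:
    # Hypothetical record layout, assumed for illustration only.
    error_id: str
    symptoms: str
    workaround: str
    keywords: set = field(default_factory=set)


class KnownErrorDatabase:
    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def search(self, query):
        """Return records containing every query word (case-insensitive substring match)."""
        terms = query.lower().split()
        if not terms:
            return []
        results = []
        for record in self._records:
            text = " ".join([record.symptoms, record.workaround, *record.keywords]).lower()
            if all(term in text for term in terms):
                results.append(record)
        return results


kedb = KnownErrorDatabase()
kedb.add(KnownError("KE1", "VPN session ends with a timeout after login",
                    "Reconnect using the backup gateway", {"vpn", "remote access"}))
kedb.add(KnownError("KE2", "Printer queue stalls on large documents",
                    "Restart the print spooler service", {"printer", "spooler"}))

for match in kedb.search("vpn timeout"):
    print(match.error_id, "-", match.workaround)
```

Requiring every query word to match keeps results narrow for first-line diagnostics; matching any word instead would return more candidates at the cost of more noise.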