My manager asked me whether we can do better with our production support. I responded with a document that I had come across. Here we go.
* Define acceptable Service Level Agreements (SLAs).
For example, what is an acceptable response time in the event of a problem? What is an acceptable level of downtime for an unplanned failure? What is an acceptable number of unplanned failures in a particular time frame? SLAs define the contract between our customers and the support team, which makes compliance easy to measure and track.
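To make the three SLA questions concrete, here is a minimal sketch of a compliance check in Python. The `Incident` fields, the `sla_report` function, and all thresholds are illustrative assumptions, not values from any real agreement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One unplanned failure (hypothetical record layout)."""
    reported: datetime   # when the problem was reported
    responded: datetime  # when support first responded
    resolved: datetime   # when service was restored


def sla_report(incidents, period, max_response, max_downtime, max_failures):
    """Compare a period's incidents against assumed SLA targets."""
    # Total downtime is the sum of reported-to-resolved intervals.
    downtime = sum((i.resolved - i.reported for i in incidents), timedelta())
    # Incidents whose first response took longer than the agreed limit.
    slow_responses = [
        i for i in incidents if i.responded - i.reported > max_response
    ]
    return {
        "failures": len(incidents),
        "downtime": downtime,
        "availability_pct": 100.0 * (1 - downtime / period),
        "response_breaches": len(slow_responses),
        "compliant": (
            not slow_responses
            and downtime <= max_downtime
            and len(incidents) <= max_failures
        ),
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 10, 0)
    incident = Incident(
        reported=start,
        responded=start + timedelta(minutes=10),
        resolved=start + timedelta(hours=1),
    )
    # Assumed targets: respond within 15 minutes, at most 4 hours of
    # downtime and 3 failures per 30-day period.
    print(sla_report(
        [incident],
        period=timedelta(days=30),
        max_response=timedelta(minutes=15),
        max_downtime=timedelta(hours=4),
        max_failures=3,
    ))
```

Running a report like this once per period is one way to turn the agreement into numbers that can be tracked over time.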
* Document each problem and build up a knowledge base
What were the symptoms? How did they present themselves? What troubleshooting steps were taken? How was the problem resolved? What was the root cause? This historical information will help future support work when the same issue comes back.
* Automate Production Monitoring
Automated monitoring is often faster than manual checks at identifying problem conditions, and it can proactively alert the support team to potential issues (e.g. out of memory, high CPU, a downstream service being down).
[This is the area where we can improve our level of support immediately, i.e. by adding more monitoring scripts for better coverage.]
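A monitoring script of this kind could look roughly like the Python sketch below, using only the standard library. The function names, thresholds, and the downstream host and port are all hypothetical, and `os.getloadavg` is available on Unix-like systems only. Memory checks are left out because the standard library has no portable call for them.

```python
import os
import socket


def check_load(max_load_per_cpu=1.0):
    """Flag high CPU pressure using the 1-minute load average (Unix only)."""
    load1, _, _ = os.getloadavg()
    per_cpu = load1 / (os.cpu_count() or 1)
    return per_cpu <= max_load_per_cpu, f"1-min load per CPU: {per_cpu:.2f}"


def check_tcp(host, port, timeout=2.0):
    """Check that a downstream service accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"{host}:{port} reachable"
    except OSError as exc:
        return False, f"{host}:{port} unreachable ({exc})"


def run_checks(checks, alert=print):
    """Run every check; call `alert` for each failure and return their names."""
    failures = []
    for name, check in checks.items():
        ok, detail = check()
        if not ok:
            failures.append(name)
            alert(f"ALERT [{name}] {detail}")
    return failures


if __name__ == "__main__":
    # Hypothetical downstream endpoint; replace with a real dependency.
    checks = {
        "cpu": check_load,
        "downstream": lambda: check_tcp("downstream.example.internal", 8080),
    }
    run_checks(checks)
```

Passing `alert` as a parameter keeps the sketch independent of how the team is notified; in practice it could send an email or a page instead of printing. The script would typically be run on a schedule, for example from cron.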
* Create a troubleshooting decision tree
A decision tree makes it easy for the support staff to quickly identify the appropriate next step toward resolving the problem.
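As a minimal sketch, a decision tree can be stored as nested yes/no questions whose leaves are recommended actions. The `Question` class, the `walk` function, and every question and action in `TREE` are made up for illustration; a real tree would come from the team's own incident history.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Question:
    """An inner node: a yes/no question with a branch for each answer."""
    text: str
    yes: "Node"
    no: "Node"


# A node is either another question or a leaf (the recommended next step).
Node = Union[Question, str]


def walk(node, answer):
    """Follow the tree until a leaf is reached and return that action.

    `answer` is a callable taking the question text and returning a bool.
    """
    while isinstance(node, Question):
        node = node.yes if answer(node.text) else node.no
    return node


# Hypothetical tree, for illustration only.
TREE = Question(
    "Is the service responding to health checks?",
    yes=Question(
        "Is CPU usage high?",
        yes="Capture a thread dump and review recent deployments",
        no="Check the downstream services",
    ),
    no=Question(
        "Is the process running?",
        yes="Look for out-of-memory errors in the logs",
        no="Restart the service and review the crash logs",
    ),
)


def ask(question):
    """Prompt the support engineer on the console."""
    return input(f"{question} [y/n] ").strip().lower().startswith("y")


if __name__ == "__main__":
    print("Next step:", walk(TREE, ask))
```

Keeping the tree as plain data means it can be updated as new incidents are documented, without changing the code that walks it.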