Why is it important to document the steps you took during the troubleshooting process?

When your monitoring systems start sending a deluge of alerts or your servers suddenly stop responding, it’s easy to go into crisis mode. That’s why Step One of this guide to troubleshooting is to remain calm. Let common sense prevail, be sure to maintain your documentation, and get down to the art of troubleshooting your IT systems. Just follow these eight general guidelines to pinpoint the issue and take steps towards remediation.

1) Calm down and start communicating

Don’t panic. Keep breathing. Grab some snacks and settle in. It’s vital to keep your head, especially during a serious service outage. While you’re radiating calm vibes, start sending out communications about the issue: namely, that you are aware of it, and that you will follow up with stakeholders, team members, and users/customers at set intervals, such as within 30 minutes, or every hour until resolution.

Keep your promises when it comes to follow-up notifications. There’s no better way to alienate your users than keeping them in the dark during a service interruption. By offering regular updates, you can also control expectations and get out ahead of conversations like, “You said this would be fixed already,” as you alert customers to shifting timelines and ongoing efforts to remedy the problem.

2) Document and describe the problem in detail

If you don’t already have a troubleshooting documentation process in place, now is the time to start. Use a simple spreadsheet to describe the issue symptoms (what is happening), when it is happening, what components appear to be affected, which users are encountering the problem, and of course the date and time. If it is obvious, go ahead and note why the problem is occurring (that makes for pretty easy troubleshooting). Note any personnel involved in the troubleshooting process as well.
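If a spreadsheet feels like too much ceremony in the middle of an outage, a small script can keep the same record. The following is a minimal sketch in Python, assuming a CSV file as the "spreadsheet"; the column names and the `log_incident` helper are hypothetical, chosen only to mirror the fields listed above.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column names; rename them to match your own process.
FIELDS = [
    "logged_at", "symptoms", "started_at", "components",
    "affected_users", "suspected_cause", "personnel",
]


def log_incident(path, symptoms, started_at, components,
                 affected_users, personnel, suspected_cause=""):
    """Append one incident entry to a CSV file, writing the header if the file is new."""
    file = Path(path)
    is_new = not file.exists() or file.stat().st_size == 0
    row = {
        "logged_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "symptoms": symptoms,
        "started_at": started_at,
        "components": "; ".join(components),
        "affected_users": "; ".join(affected_users),
        "suspected_cause": suspected_cause,  # leave empty if not yet known
        "personnel": "; ".join(personnel),
    }
    with file.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Calling `log_incident("incidents.csv", "Login page returns HTTP 500", "2024-01-01 09:15", ["web-01"], ["external users"], ["alice"])` adds one row, and the resulting file opens in any spreadsheet application.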

One vital thing to look for is whether any changes were made recently. A software update or component change often leads to problems, and is a relatively easy fix to roll back, assuming you’re using backups. You’re using backups or disaster recovery, aren’t you?

Double-check that everything is plugged in and attempt the classic “turn it off and back on” by power cycling/restarting the system. Check your performance monitoring tools to be certain that it isn’t just a heavy system load causing problems. Be sure to be as specific as possible and to investigate the symptoms for yourself. Which brings us to…

3) Replicate the error conditions

While some errors may prove difficult to recreate, the majority are simple to reproduce (plug in a device and the system doesn’t recognize it, for instance). If you learned of the error from a third party, repeat the same steps they took to trigger the error.

What devices are involved? Does the error persist with different end clients? Internal network or public internet? Did any significant events happen before/during the problem?

If you are troubleshooting remotely, try to get as specific a description as you can from the person reporting the problem. Walk them through the steps they took to reach the problem again, if you can, or use a remote desktop control app to gain access to their system.

4) Gather supporting information and logs

Get screenshots or copies of any error messages that appear or originally appeared, including any reference codes or links to knowledge base articles that they may include. Event logs from Windows Event Viewer, VMware logs, or other relevant sources can help by providing a timestamp and clues as to which process may have caused the breakdown.

Use any tools you have on hand, even those as simple as ping for network problems. This is the time to leverage performance monitors/system diagnostics, and network/system inventories.
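Once you know roughly when the problem started, it helps to cut a log down to the minutes around that time. Here is a rough Python sketch of that idea. It assumes each relevant log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp, which will not be true of every log format, and the `lines_near` helper is a made-up name for illustration.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, incident_time, window_minutes=10):
    """Return log lines whose leading timestamp is within the window around incident_time."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in log_lines:
        stamp = line[:19]  # assumed "YYYY-MM-DD HH:MM:SS" prefix
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(when - incident_time) <= window:
            matches.append(line)
    return matches
```

For a real file you might pass `open("app.log", encoding="utf-8")` as `log_lines` (the file name is a placeholder). Lines without a timestamp, such as stack trace continuations, are skipped in this sketch, so check the full log if the filtered view looks incomplete.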

5) Lay out the entire system in a clear manner

Use a whiteboard, talk through the problem and associated components with a coworker, or just chicken scratch on a napkin: however you do it, you need a structured and organized method of looking at the big picture. Sometimes this step is all you need to have a Eureka moment.

6) Start digging

On to the research portion. If the problem hasn’t become apparent yet, you’ll need to turn to the web and potentially engage vendor support to help pinpoint it. Knowledge bases, web forums, archived help desk tickets, search engine queries, and your fellow engineers can all be of great use in this step.

The supporting information you collected above will prove very useful here. If you have specific error messages or system logs, you are likely to find someone else out there who has encountered the same issue.

7) Trial and error

After doing some research, you should have a few clues as to where to begin to remedy the problem. Your first attempt at fixing it may or may not be successful, so be sure to create a system backup before any major change so that you are ready to roll back. If it is a minor fix, you should still be sure to document exactly what you’re changing.
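A full system backup depends on your platform and tooling, but the same habit applies at small scale. As an illustration only, the Python sketch below copies a single configuration file to a timestamped backup before you edit it and restores it if the change does not help; `backup_file` and `restore_file` are hypothetical helper names, and this is not a substitute for a real backup or disaster recovery system.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_file(path):
    """Copy a file to a timestamped sibling and return the backup path."""
    source = Path(path)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.copy2(source, backup)  # copy2 also preserves modification times
    return backup


def restore_file(backup, path):
    """Roll back by copying the backup over the changed file."""
    shutil.copy2(backup, path)
```

The backup path returned by `backup_file` is worth recording in your troubleshooting notes alongside a description of the change, so anyone on the team can undo it.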

This will help you narrow down the problem systematically and avoid repeating combinations and fixes you have already tried. Try changing settings, removing or rolling back new software, repairing corrupted system files, defragmenting hard drives, updating system software like drivers or operating systems, or replacing any faulty hardware. Check the DNS and DHCP settings and make sure the firewall or proxies are configured correctly.
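Two of those network checks are easy to script. The following Python sketch, using only the standard `socket` module, asks whether a hostname resolves and whether a TCP port accepts connections. The `resolve` and `port_open` names are invented for this example, and a closed port can mean a firewall rule, a stopped service, or a wrong address, so treat the result as a clue and not a diagnosis.

```python
import socket


def resolve(hostname):
    """Return the sorted unique IP addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry's last element is the socket address; its first item is the IP.
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

An empty list from `resolve` points toward DNS, while a hostname that resolves but fails `port_open` points toward the service itself or something in between, such as a firewall or proxy.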

8) Talk to a vendor or third party

Hopefully, by now you’ve pinpointed the issue and have figured out a way to resolve it. Still stumped? It’s time to call in the big guns. Open a ticket with the relevant software or hardware vendor, your systems integrator, or your data center provider and see if they can help get things back to normal.

Chances are you’ll eventually run into a unique issue, but the majority of IT problems are relatively common and easy to reproduce. Standard troubleshooting steps like a system reboot, reinstalling/updating drivers, or examining network settings are all great first stops. The most important thing is to keep your head and maintain clear communication with those who are affected by the problem. You’ll be back up and running in no time.

What are the important steps of the troubleshooting process?

Troubleshooting follows a systematic approach: identify the problem, plan a resolution, test it, and resolve the problem. The analysis performed and the experience gained along the way should also be taken into account. Moreover, troubleshooting is closely linked to the inspection process, and the two sometimes go together.

What are the steps to troubleshooting a document?

Let's take a more in-depth look at each of these steps to determine what they really mean...
Identify the Problem.
Establish a Theory of Probable Cause.
Test the Theory to Determine the Cause.
Establish a Plan of Action and Implement the Solution.
Verify Full System Functionality and Implement Preventive Measures.