If a production server issue were a villain, it would be Thanos: scary, powerful, and driven by goals beyond your understanding. But hey, even that big purple guy had his weaknesses. So, calm down, take a deep breath, and let me tell you how to deal with it.
I have been in the industry for 20+ years. So, believe me, I have faced lots of production server issues throughout my career. And the article you are reading is the outcome of hundreds (if not thousands) of battles.
Let’s Face It, Pals
It is midnight, and you are sleeping peacefully. Suddenly, your phone blows up with notifications. And guess what? That picky client’s server is down. Now, it is an “all eyes on you” situation. You feel like a politician who has to explain a terrorist attack on the capital. The whole situation is unpleasant, stressful, and (sometimes) traumatic.
I hate to ask this, but what are you going to do now?
I would forget about every horrible scenario and look for the right way to solve the production server issue. And that is what you should do, too.
Okay, What Do I Mean by the “Right Way?”
I am sorry to say that there is no universal runbook when it comes to server problems. Every failure can be unique, and each case needs its own inspection and analysis. So, the right way to solve a production issue is the one that results in the quickest reliable response, even if it is only a temporary workaround.
Let’s get back to the Thanos example. What is the right way to stop a purple guy who is about to make half of the world’s population disappear? It is by preventing him from snapping. And I don’t care how you do that. Cut off his fingers? Tape his arm to a tree? Glue his fingers together? Anything. Just do it.
That is the right way to deal with a production server issue as well. I am not encouraging you to forget about your developer-ish morals. What I am pointing out is that the clock is ticking. Your client does not care about the technical benefits of a proper fix over a workaround.
No running server, no money. So, face the music before it is too late.
Finding the best approach to a production server issue has three aspects: the right concern, the right goal, and the right approach.
- The Right Concern
What is the first thing you worry about when you face a downed server? Your reputation? The costs? Any upcoming client complaints? *BZZZT!* All of them are wrong answers.
Your primary concern is (and always should be) the users.
It is not about “you,” “your reputation,” or “your salary.” It is about an end-user who cannot make a purchase, use a service, or solve a problem.
- The Right Goal
A production server issue (aka Thanos!) has goals beyond your understanding. But what is yours? The right plan to face such a villain is to save as much precious stuff as possible. In this case, the bad guy wants to stop your users from accessing your software/website. So, rescuing as many users as possible sounds like the best objective.
I know; the most heroic target is to defeat the villain once and for all. But you are not Spider-Man. So, no one is going to admire you for fighting the bad guy(s). All you will get is a crowd booing you off the stage. That is because the users do not care about the root cause of a production server issue. What they want is a running application.
Your goal should be to fix things as fast and efficiently as possible.
- The Right Approach
Pour a cup of coffee, close your eyes, think about your happiest memories while coding, and WATCH YOUR EMPIRE FALL! This is not the right approach to fix a production server issue. Act fast (and furious). Play all your cards. Ask for help. Get your team ready. And think of it as the end game. (Pun intended).
The best approach is an all-out strategy. Keep backups handy. Extract the logs immediately and use everything you have to get the server running again.
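To make "extract the logs immediately" a bit more concrete, here is a minimal Python sketch. It is only an illustration, not a prescribed tool: the function name `snapshot_logs`, the `*.log` file pattern, and the archive naming are all my assumptions, and your logs may live somewhere else entirely. The idea is to freeze a copy of the evidence before a restart or a rollback wipes it.

```python
import tarfile
import time
from pathlib import Path


def snapshot_logs(log_dir: str, dest_dir: str) -> Path:
    """Archive every *.log file under log_dir into a timestamped tar.gz.

    Hypothetical helper: the name, the *.log pattern, and the archive
    naming scheme are assumptions made for illustration only.
    """
    source = Path(log_dir)
    target = Path(dest_dir)
    target.mkdir(parents=True, exist_ok=True)

    # UTC timestamp so archives from different machines sort consistently.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive = target / f"incident-logs-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        for log_file in sorted(source.rglob("*.log")):
            # Store paths relative to log_dir so the archive is portable.
            tar.add(log_file, arcname=str(log_file.relative_to(source)))
    return archive
```

Run it once at the start of the incident, for example `snapshot_logs("/var/log/myapp", "/tmp/incident")` (both paths are placeholders), and then go ahead with whatever workaround gets the server up.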
Production Server Issue This-or-That
As I said, there is no bible to refer to when you face a problem with your server. Of course, you still want to check every piece of information you have (e.g., software runbook). But you should be prepared to enter the land of unknowns.
I cannot give you the roadmap. But here are some hints on how to find the right path.
- 1. Finding the Root Cause vs. Implementing a Workaround
Imagine you spent two days finding the cause of the issue. And now, you are ready to solve it. Congratulations! You have done a great job. But you are fired. That is because you let the server remain down for two days.
If spotting the root cause is going to take a while, do NOT chase it now. Instead, find the most accessible workaround to get things going again.
Professionalism dictates that only the most dependable solutions are acceptable. But hey, nirvana does not exist. Your desire to dive deep into the flaws is admirable. However, no one is going to pay you for that when there is no running application.
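One way to keep yourself honest about this trade-off is to time-box the investigation. The Python sketch below is only an illustration of that rule of thumb: the function name `choose_action`, the returned labels, and the 15-minute default budget are my assumptions, not an industry standard. Pick a budget that matches what an hour of downtime costs your client.

```python
def choose_action(
    minutes_spent_on_rca: float,
    root_cause_found: bool,
    rca_budget_minutes: float = 15.0,  # assumed default, tune per product
) -> str:
    """Decide what to do next during an outage (hypothetical helper)."""
    if root_cause_found:
        # The cause is already known, so fixing it directly is the fast path.
        return "fix_root_cause"
    if minutes_spent_on_rca >= rca_budget_minutes:
        # Budget exhausted: stop digging and restore service first.
        return "apply_workaround"
    return "keep_investigating"
```

For example, `choose_action(20, root_cause_found=False)` returns `"apply_workaround"`: after 20 minutes with no answer, the deep dive waits until the server is back.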
- 2. Keeping the Issue Private vs. Calling for an Emergency Meeting
First of all, you cannot keep a server issue private. Sooner or later, your client will receive tickets or error messages. So, calling for an emergency meeting is a much more professional tactic. Plus, hearing others’ analysis speeds up the problem-solving.
Honestly, you probably have no idea what happened. However, someone might have useful information. It could be your client, who accidentally did something they should not have, or one of the developers with reliable insights.
Make it public before it ‘gets public.’ That way, you can hang onto your reputation while having more people to lend a hand.
- 3. Relying on One Developer vs. Turning on Avengers Mode
The thing is that sometimes you manage to keep the flaws under control. So, the question is, “should you have one person handle it afterward?”
Think of it this way: Thanos has all six stones. But you found a way to steal them. Now, should you send one superhero (say, Iron Man) to execute your plan? Or is it best to bring the Avengers back together and split the responsibilities?
Of course, having more developers (or superheroes) is a better idea. That way, you have a higher chance of success due to the variety of your team’s skills and capabilities.
Who Is “Fury” When There Is a Production Server Issue?
Everyone is furious when there is a crisis in the software development process. However, who is “Fury?” For those who do not read comics, Nick Fury is the former director of S.H.I.E.L.D. and the man who assembled the Avengers. So, again, who is your boss now when there is a fire waiting to be put out?
In my experience, there should always be a case manager when it comes to server issues. It could be your SDLC head or the DevOps team manager. However, you should choose a Fury who knows how not to be furious in times of crisis.
As a DevOps engineer, I believe that Development-and-Operations experts make proper case managers. They know how to benefit from agile relationships to achieve the fastest reliable solution/product.
This is what you need when fixing a production server issue: a case manager who is good with issues, knows what the product means to the SDLC, and cares about the ticking clock.
Key Takeaways You Should Never Take Away
Here is a list of things to keep in mind when facing a production server issue.
- Your concern is with the customers/users.
- Your goal is to get things back running again as soon as possible.
- The right approach is the fastest one.
- Workarounds are your friends while you repair a production server issue.
- More help is better. So, make the issue public and let everyone comment on it.
- Don’t expect one developer to be the jack of all trades.
- Have a case manager.
- Postpone the detailed RCA (Root Cause Analysis).
- And finally, do not forget about the documentation.