System Down: The Anatomy of an “Oopsie!”
You have all heard the horror stories of large organizations, and even governments, that have fallen prey to malicious actors. Massive data breaches, crippled operations, and stolen medical records are just a few examples of these debilitating and embarrassing threats.
Having worked in IT since 1993, I have seen a lot. I have worked with businesses and governments ranging in size from 3 to over 1000 employees. I have worked with businesses hit by viruses that were just a nuisance, and I have resolved situations where data got encrypted by a crypto locker variant. All of these situations occurred in businesses that were ill-prepared to deal with an imminent threat. One organization in particular leaps to mind. The impact for them was especially painful in recovery cost, lost productivity, and delayed order fulfillment.
For confidentiality, I will not be mentioning the company name or location. All I will say is that they operate in the Pacific Northwest.
This organization was a newer client at the time and had not yet agreed to implement all of the recommendations I had proposed. Like most businesses, they were cost-conscious and had not budgeted for some of the changes. Their servers had been virtualized using Hyper-V on older hardware, so I was supporting one physical server and three virtualized servers.
This episode started when one of their employees disabled the anti-malware software on their computer because they thought it was causing performance issues with their virtual load testing solution. After it was disabled, this person mistyped the name of a well-known website. That site was able to plant a RAT (Remote Access Trojan) on the computer. One more important detail: this person happened to be a local administrator on every computer in the company. After business hours, a bad actor located in another part of the world accessed this employee’s computer via the RAT. They then proceeded to disable the security solutions on every other computer in the organization. Once they accomplished this, they uploaded a file to every workstation and server in the organization. This file encrypted all the data stored on the local drives, then sabotaged the operating system so that if the user rebooted to see if the problem went away, the OS was damaged beyond repair. Since the attacker was able to reach every computer in the organization, every bit of data on all the servers was encrypted.
By now you are probably thinking something like, “Yikes! Thank goodness for disaster recovery solutions!” That is exactly what I thought on my way in to resolve this situation. And yes, thank goodness for the backups. The biggest problem we ran into with the restoration of data was performance. Their entire backup solution was cloud-based. Their internet was 50 megabit, so you’re thinking “no problem!” That’s what I thought too. We’ll circle back to that in a few minutes.
The recovery for this client started immediately. The biggest blessing on this dark day was that I had just started an infrastructure refresh. I had just delivered a new physical server that was destined to be the new Hyper-V host, replacing hardware that was almost seven years old. Because I had the basic groundwork laid, I had all the new servers built and fully updated within 5 hours. This is the point where I started running into issues.
Something you may already know, but I’ll say it anyway: not all cloud-based backup solutions are equal. This client had about 12 terabytes of data backed up to the cloud, most of it enormous CAD and other modeling files. As the data started restoring to the server, we quickly maxed out the 50-megabit connection. I got the go-ahead from the owner to increase the speed to “whatever I thought was appropriate.” I called the ISP and had the bandwidth bumped to 200 megabit in less than 45 minutes. Now the frustration began in earnest. The backup solution in place did not list any speed limits on upstream or downstream data, but there had to be a limit somewhere, given the poor restoration performance: the speed never went above 56 megabit. After testing and verifying the ISP’s performance, I called the backup vendor. When I finally got through 30 minutes later, they informed me that there wasn’t a speed limit, but that they had algorithms that distributed bandwidth so that no single customer could consume the entire connection. They either had a lot of customers, or they had very limited bandwidth. Of course, they would not admit to either, and I was stuck with the miserable performance.
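To put that 56-megabit ceiling in perspective, here is a rough back-of-the-envelope estimate. This is just a minimal sketch using the round numbers above; it ignores protocol overhead, compression, and anything clever the vendor might do behind the scenes:

```python
# Back-of-the-envelope restore time for ~12 TB over various link speeds.
# Uses decimal units (1 TB = 1,000,000 MB) and 86,400 seconds per day.

def restore_days(data_tb: float, throughput_mbps: float) -> float:
    """Minimum days to move data_tb terabytes at throughput_mbps megabits/sec."""
    megabits = data_tb * 1_000_000 * 8        # TB -> megabits
    return (megabits / throughput_mbps) / 86_400

for label, mbps in [("original 50 Mb line", 50),
                    ("observed ~56 Mb cap", 56),
                    ("full 200 Mb line", 200)]:
    print(f"{label}: ~{restore_days(12, mbps):.1f} days for 12 TB")
# Prints roughly 22.2, 19.8, and 5.6 days respectively.
```

At roughly 56 megabit, a full 12-terabyte restore would take the better part of three weeks, which is why prioritizing the most critical files first (described next) was the only workable approach.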
I ended up working with the various department heads to determine which files were critical RIGHT NOW and selectively restored those files first. They then specified a secondary tier of important files, and everything left was restored last. The biggest downside was that selective restoration was extremely tedious due to the complex directory structures.
While the data was restoring, I started rebuilding all the computers in the organization. After the first 24 hours, I had the servers rebuilt, updated, and secured; the domain and Active Directory restored; all the workstations rebuilt; and data restoring to the shares.
All told, this project took the better part of 5 days, the majority of which was spent restoring data files and fixing glitches with permissions on shares and files. In total, there were over 90 billable hours spent on this project, which worked out to $16,650. All because one person decided to disable their security software. We worked with the client and lowered the bill to just over $11,000. They still complained, but they also realized the value of the work to their business, so they eventually paid.
Lessons learned from this experience:
- Verify the performance and capabilities of cloud-based backup solutions before signing up for them
- Keep a local copy of the backup data
  - Their backup solution had an unused option to back up to a local NAS as well
- Don’t just list the security recommendations; make them a key part of the presentation, repeatedly highlighting the potential issues and driving the security concerns home
- When there is push-back on remediation suggestions, you also need to push back, so the point is made abundantly clear. Go into that meeting prepared with the following:
  - Actual data and examples to back up your assertions
  - Potential disaster remediation times and costs (see the sketch after this list)
  - The hidden costs, such as damage to the business’s reputation, lost productivity, and lost production
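To illustrate the kind of numbers worth bringing to that meeting, here is a minimal sketch of an outage-cost tally. Only the IT hours and rate mirror the billable figures above; every other figure is an illustrative placeholder, not actual client data, and reputation damage is left out because it resists a simple dollar figure:

```python
# Minimal outage-cost tally for a client presentation.
# it_hours and it_rate come from the incident above ($16,650 / 90 h = $185/h);
# all other numbers below are illustrative placeholders.

def outage_cost(employees: int, loaded_hourly_rate: float, idle_hours: float,
                it_hours: float, it_rate: float,
                lost_revenue_per_day: float, days_down: float) -> dict:
    """Return a breakdown of the visible costs of an outage."""
    return {
        "idle payroll":        employees * loaded_hourly_rate * idle_hours,
        "recovery labor":      it_hours * it_rate,
        "delayed fulfillment": lost_revenue_per_day * days_down,
    }

costs = outage_cost(employees=12, loaded_hourly_rate=45, idle_hours=16,
                    it_hours=90, it_rate=185,
                    lost_revenue_per_day=8_000, days_down=5)
for item, dollars in costs.items():
    print(f"{item}: ${dollars:,.0f}")
print(f"total (before reputation damage): ${sum(costs.values()):,.0f}")
```

Even with conservative placeholder figures, the total lands well above the recovery bill alone, which is exactly the point to make when a client balks at the cost of prevention.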
This story could have had a much worse ending than it did. At the time, this was an organization of 12 people, with seven computers and three servers. Imagine the impact on a larger organization that was ill-prepared for such an event. The results could be catastrophic to the business!
As always, I welcome feedback and comments.
Success in the IT Industry: Whatever you do, DON’T PANIC!
I’m going to kick off a small series here about succeeding in the IT industry. These will cover lessons I have learned over 20+ years of working as an IT professional. I will do my best to make sure the topics and content apply to consultants, such as myself, as well as to those who work for a single entity. So, with that introduction, off we go!
If you have worked in this industry for any length of time, I can guarantee you have had at least one person come running up to you, sure that their life was about to end due to a lost file, a jammed printer containing the presentation to the board that’s due in 5 minutes, or an inability to access the internet on their smartphone while in the restroom. In any of those situations, it is pretty easy for us to remain calm, reassure that person, and help them quickly resolve their problem.
But what do you do when it’s your server or server farm that has suddenly dropped off the network, denying the CFO access to the data he needs for a meeting that started 5 minutes ago? How do you react when the worst happens in the systems you are responsible for, and all the upper management staff are standing over your shoulder, watching you and demanding an estimate of when the company will be back up and running, all the while reminding you of the expense of paying 50+ employees to sit around drinking coffee?
Hopefully, your answer doesn’t contain the words “panic”, “freak out”, or “I don’t know”.
If you work in the IT industry as a network or systems administrator, I can personally guarantee you that there will be times that this happens. Technology is not infallible and, in my personal opinion, subscribes to Murphy’s law: “Anything that can go wrong, will go wrong, and at the worst possible time.”
So, how do you prepare for that? Can you prepare for that? How do you deal with the ownership or management staff breathing down your neck?
Rule number one: KEEP CALM!
There is absolutely nothing gained by panicking. In fact, if you panic, it will increase the panic level of everyone around you. Imagine, if you will, a herd of zebras on the plains of Africa. One of them notices a lion that appears to be stalking the herd. It follows its natural instinct to run away from the danger as fast as it can, making noise while doing so. This alerts the rest of the herd to the danger and causes them all to panic. The result is a stampede and ever-increasing panic as they lose sight of the lion in the dust cloud they create while running. Now imagine that same zebra, instead of panicking, simply watches the lion. After a few seconds, it sees that the lion is going to lie down in the shade because it is really hot. A sleeping lion is not a great threat, so the zebra goes back to munching the plains grass. The herd doesn’t stampede, and the peace is kept. That doesn’t mean the zebra stops checking on the lion every so often, just to make sure it really is napping.
The same thing applies in IT. You will get people who come running into your office or calling you in a panic. You WILL have servers that go offline for mysterious reasons and cause all sorts of havoc around the office. You might even have equipment that quite literally goes up in smoke; I have witnessed that several times. Since these things are pretty much inevitable in this industry, you need to have a plan to deal with them, and you need the proper attitude to handle any situation that comes up. You need to appear calm, cool, and collected in everything you do.
Some of this response comes from experience. The longer you do something, the more issues you see, and the better prepared you are to handle them as they come up. The hardest part is not dealing with the issues; it’s dealing with the people affected by them. When they approach you at a dead run or panicked on the phone, you need to be able to reassure them, let them know you are aware of the problem, and let them know you are working to resolve it as quickly as humanly possible. Easy to say, not always easy to do. And to my knowledge, there is no training program that can prepare you for the flood of varied responses you will get from the people in your organization. Some will decide it’s time for a coffee break. Some will call you or come visit you, thinking that their presence might in some way help you solve the problem faster. I have even seen people break down in tears over issues they have no control over.
All of these responses can be a major distraction and can add to your stress as you try to resolve the situation. Sometimes it becomes necessary to ask people to leave you alone so you can do your job. This needs to be stated nicely, but firmly. My best example is a CFO/VP at one of my clients. He came from a very large company with a huge IT staff and was used to getting status updates and resolution estimates every 10 minutes during an outage or incident. His new company, my client, has two locations, about 100 employees overall, and one IT guy: me. With me being the only point of contact for IT issues, giving status updates every 10 minutes could be a real problem, since it distracts from the task at hand. During one particularly major issue involving Microsoft Exchange, I finally had to sit him down and explain that having to stop every 10 minutes, find him, update him on the problem, give him a resolution estimate, and then get back to the task was going to easily triple the time (and thus the bill) for getting the issue resolved. Once he understood that, and realized I would let people know when there was something to actually report, he backed off on his requirement for such frequent updates. The net result was that problems got resolved much faster. If he was really curious, he would come find me, and if I did not look completely absorbed in the issue at hand, he would ask a simple “how’s it going?” and get a quick reply while I kept working the issue. It was a win-win for everyone.
The bottom line is this: when everything around you is going crazy, and the employees and/or management are all panicked, it is your job to be the calm at the center of the storm. Let it swirl around you, maybe even ruffle your hair a little, but under no circumstances should you visibly panic. Panicking will amplify the panic in other people and could even cause some to lose a little faith in you and your abilities. As the person responsible for protecting their network, their data, and, in the eyes of some, their livelihood, you need to be the bastion of calm during a real or perceived crisis.