Resolving Network and Application Problems

Every day, companies make judgment calls about what counts as adequate and acceptable performance from their business network and applications, and it usually comes down to a dollars-and-cents discussion. Striving for optimal performance doesn’t seem to make financial sense. But if they took a deeper dive and looked at costs from a variety of perspectives, they might find that they are sacrificing productivity to save a few dollars on the tools that troubleshoot these problems and eliminate them at their root cause. And over time, those costs add up.


INTRODUCTION

Network and application problems cost companies a tremendous amount of time and money, and if they go unresolved, customer loyalty, brand reputation, and public perception can suffer as well. Internally, slow applications and poor service quality frustrate employees and drag down their productivity, which directly impacts the bottom line. Externally, slow application response, database issues, and service disconnections can cause customer confidence to plummet, leading customers to look elsewhere to meet their needs.

In many organizations, sub-optimal performance is often tolerated for weeks, months, or even years before being addressed.

The question is – Why?

Especially when these problems can be resolved.

For one thing, it is difficult to put a hard number on the real cost of these issues, in part because both hard and soft costs are involved, and soft costs are especially tricky to quantify.

For example, lost employee time can be calculated from the number of hours spent waiting for applications to respond. This is typically treated as a soft cost, and it is difficult though not impossible to measure, especially when lost employee time affects revenue-generating activities or customer interactions. It is harder still to fully understand the amount of revenue lost to customer abandonment and broken loyalty, especially for businesses involved in eCommerce.

Another reason these problems persist is that IT organizations often lack the visibility needed to identify and resolve them. Analysis and troubleshooting solutions may be viewed as expensive and difficult to justify, when in fact the cost of tolerating the problems may be far greater. Without that visibility, IT departments are forced to resort to blind troubleshooting, attempting to simply upgrade the problem away. When organizations take this approach, they often waste money on upgrades and improvements that don’t address the root cause, so the problems persist or recur.

In many environments, employees and IT departments grow to accept these issues as normal, or may just conclude that the network is slow. Reduced productivity may become the new baseline, which drives up the overall cost of doing business.

  • The average number of ‘limited outages’ is 11 every two years, and the estimated cost of data center downtime across industries is over $5,000 per minute. Sources: Ponemon Institute© Research Report: 2013 Report on Data Center Outages and eWeek® article: Unplanned IT Downtime Can Cost $5K per Minute.

LET’S GET REAL

Let’s look at three real examples of companies that tolerated poor application and network performance for months. In each case the cost seemed small at first, but without the right tools or visibility, the persistent costs kept escalating.

EXAMPLE 1

A credit card company with several large customer support centers began to experience problems logging into its business-critical applications. At the start of each shift, employees would double-click three application icons to bring up the software they needed to begin taking calls from customers. After an upgrade to the client-side operating system, employees found that those applications were taking much longer to start, over two minutes each after clicking the icon on the desktop.

To address this issue, the IT department began combing the network for problems. Meanwhile, employees were unable to take customer calls for the first ten minutes of their shift each day because of the delay in starting their applications. Management decided to have them clock in ten minutes before their standard shift, simply to start their applications and be ready for calls by their shift start time. Now, ten minutes may not seem like much time. But consider these numbers. There were over 350 people per shift, with three shifts per day. This adds up to 1050 employees affected. Each person was paid for ten extra minutes a day, which totaled 10,500 minutes. On average, the position paid around $15 per hour, or $2.50 per ten minutes. That meant the company was spending $2,625 every day paying employees for the extra time it took to start their applications.

This problem went on for about six months before a consultant was brought in to find the root cause. By then, the company had paid over $477,000 in extra wages to compensate for the poor performance! That number does not include the money lost on blind troubleshooting and network upgrades that did not resolve the root cause, nor does it account for the number of times employees let slip to customers that the network was slow. (Haven’t we all heard that on the phone?)
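
As a quick sanity check on those figures, here is a minimal back-of-the-envelope sketch in Python that reproduces the daily and six-month wage cost from the numbers in the example. The 182-day figure is an assumption, treating six months of round-the-clock call center operation as roughly 182 days.

```python
# Back-of-the-envelope cost of the ten-minute application delay
# (figures taken from the example above; the day count is an assumption).

employees_per_shift = 350      # people per shift
shifts_per_day = 3             # shifts per day
hourly_wage = 15.00            # average pay, dollars per hour
extra_minutes = 10             # extra paid minutes per employee per day

affected_employees = employees_per_shift * shifts_per_day   # 1050
cost_per_employee_day = hourly_wage / 60 * extra_minutes    # $2.50
daily_cost = affected_employees * cost_per_employee_day     # $2,625

days = 182                     # ~six months of daily operation (assumed)
print(f"Daily cost:     ${daily_cost:,.2f}")
print(f"Six-month cost: ${daily_cost * days:,.2f}")         # about $477,750
```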

EXAMPLE 2

A hotel was having connectivity problems to the internet, affecting both employees and guests. Connections were spotty and intermittent, with no predictable pattern as to who would be affected, or where and when the problem would strike. This went on for months.

To troubleshoot, the hotel brought in its IT contractor, who recommended a full wireless assessment ($10,000). Following the assessment, the contractor recommended upgrading the WiFi infrastructure ($15,000), even though the problem was experienced on the wire as well. This did not resolve the issue, so new switching hardware was recommended as an upgrade ($15,000). Problems persisted, so the hotel decided to increase its bandwidth and change internet providers ($500/mo). The problem did not go away, but the hotel was now more than $40,000 into guessing its way toward a resolution.

  • Even the biggest companies with great IT resources are not immune. On August 16, 2013, Google was down for less than 5 minutes. All of its services were unavailable, and the volume of global internet traffic plummeted by 40%. Source: Martin MacKay, “Downtime Report: Top Ten Outages in 2013.”

EXAMPLE 3

Users in a large enterprise were experiencing sudden disconnects from their CRM application. The server team was quick to blame the network, claiming it was slow and dropping connections. The network support team, in turn, pointed the finger at the server and application teams, insisting that the network was fine and showed no packet loss.

This blame game went on for months, while both sides made several upgrades and changes to their respective environments, none of which resolved the issue. By the time a consultant arrived to help find the root cause, they were already tens of thousands of dollars into new networking and server equipment, yet the root cause had still not been identified.

HOW TO STOP THE INSANITY

These examples show us that misunderstood network and application problems can be very expensive and are often tolerated for far too long. They result in reduced employee productivity and poor customer experience, with dollars (and time!) wasted on guesswork troubleshooting. What adds insult to injury is that most of these problems can be resolved quickly with the right visibility.

For example, the problem suffered by the credit card company above turned out to be the new operating system using IPv6 (AAAA) DNS queries to resolve the application server names. The DNS server didn’t know what to do with these queries, so it simply ignored them. Thinking the packets were lost, the clients kept resending the DNS queries for around two minutes before falling back to an IPv4 (A) query, which resolved immediately.
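
As an illustration (not the tool the consultant actually used), here is a minimal Python sketch, using only the standard library, that times IPv6 versus IPv4 name resolution for a hypothetical host name. A large gap between the two lookups is exactly the kind of symptom described above.

```python
import socket
import time

HOST = "app-server.example.com"  # hypothetical application server name

def timed_lookup(family, label):
    """Time a DNS lookup for the given address family and report the result."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(HOST, None, family)
        outcome = "resolved"
    except socket.gaierror as exc:
        outcome = f"failed ({exc})"
    elapsed = time.monotonic() - start
    print(f"{label}: {outcome} in {elapsed:.2f} s")

timed_lookup(socket.AF_INET6, "IPv6 (AAAA) lookup")  # slow or failing in the example above
timed_lookup(socket.AF_INET, "IPv4 (A) lookup")      # resolved immediately
```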

With the right visibility, the problem was found, adjustments were quickly made, and the business stopped spending the extra $2,625 per day in wages. However, this came only after almost half a million dollars had been wasted.

In many IT environments, these problems persist due to a lack of visibility in two key ways. First, because of the many soft variables involved, the company cannot put a solid number on the dollars being lost, or it may be completely unaware that slow performance is costing money at all.

Second, they lack the visibility necessary to resolve problems quickly, often assuming that tools and visibility solutions are not worth their cost. Or the solutions that are in place do not provide end-to-end analysis from client to application to cloud, leaving blind spots where problems linger or hide. When the real cost of these issues can be calculated, IT can justify putting tools in place to prevent further loss, while steering the IT budget toward meaningful upgrades and changes that will actually improve performance.

  • In a study on data center outages, only 21% of respondents said they were confident they knew the root cause, yet most organizations (45–66%) noted that they repair, replace, or purchase new IT or infrastructure equipment in an attempt to fix issues. Source: Ponemon Institute© Research Report: 2013 Report on Data Center Outages

VISIBILITY REVEALED

There are three pillars to network visibility: SNMP, flow metrics, and packet capture. Every IT department should make use of each of these methods to gain full visibility of the environment – both in real time and back in time. Active tests should also be run against application servers to make sure response time does not suffer and performance stays within a tolerable threshold.
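
As a simple illustration of an active test, here is a minimal Python sketch, using only the standard library, that measures the response time of a hypothetical application URL and flags it when it crosses an assumed threshold. Production monitoring tools do far more, but the idea is the same.

```python
import time
import urllib.request

URL = "http://app.example.com/health"   # hypothetical application endpoint
THRESHOLD_S = 2.0                       # assumed "tolerable" response-time threshold

def check_response_time(url, threshold_s):
    """Fetch the URL, time the round trip, and flag slow responses."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
            status = resp.status
    except Exception as exc:
        print(f"{url}: request failed ({exc})")
        return
    elapsed = time.monotonic() - start
    flag = "SLOW" if elapsed > threshold_s else "OK"
    print(f"{url}: HTTP {status} in {elapsed:.2f} s [{flag}]")

check_response_time(URL, THRESHOLD_S)
```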

SNMP provides statistics for discovered devices, interfaces, and servers. It polls devices for interface and memory utilization, as well as Ethernet errors and packet drops. These tools are critical in finding utilization spikes and problem points where packets are being dropped.
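
For instance, a minimal sketch of SNMP polling might look like the following. It shells out to the net-snmp snmpget utility to read the standard IF-MIB error and discard counters for one interface; the host name, community string, and interface index are placeholders.

```python
import subprocess

HOST = "switch1.example.com"   # hypothetical device
COMMUNITY = "public"           # SNMP v2c community string (placeholder)
IF_INDEX = 1                   # interface index to poll

# Standard IF-MIB counters: inbound errors and outbound discards
OIDS = {
    "ifInErrors":    f"1.3.6.1.2.1.2.2.1.14.{IF_INDEX}",
    "ifOutDiscards": f"1.3.6.1.2.1.2.2.1.19.{IF_INDEX}",
}

for name, oid in OIDS.items():
    # Requires the net-snmp command-line tools to be installed
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", HOST, oid],
        capture_output=True, text=True,
    )
    value = result.stdout.strip() or result.stderr.strip()
    print(f"{name}: {value}")
```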

Flow metrics (NetFlow, J-Flow, sFlow, IPFIX, etc.) provide traffic details for packets traversing an interface. They describe the who, what, where, and why of a traffic stream without the need to capture packets. Flow data is critical when tracking down network congestion problems, even those that occurred in the past.
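
To illustrate what flow data offers, here is a minimal sketch that aggregates already-exported flow records, represented here as simple Python dictionaries with made-up values since collection mechanics vary by product, into top conversations by byte count. Real flow collectors do this at scale and keep history for back-in-time analysis.

```python
from collections import defaultdict

# Hypothetical flow records as a collector might expose them
# (source/destination address, destination port, protocol, byte count).
flows = [
    {"src": "10.1.1.5", "dst": "10.2.0.10", "dport": 443,  "proto": "TCP", "bytes": 1_200_000},
    {"src": "10.1.1.9", "dst": "10.2.0.10", "dport": 443,  "proto": "TCP", "bytes": 850_000},
    {"src": "10.1.1.5", "dst": "10.3.0.25", "dport": 1433, "proto": "TCP", "bytes": 4_700_000},
]

# Who is talking to whom, and how much: top conversations by byte count
totals = defaultdict(int)
for f in flows:
    totals[(f["src"], f["dst"], f["dport"])] += f["bytes"]

for (src, dst, dport), nbytes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{src} -> {dst}:{dport}  {nbytes / 1_000_000:.1f} MB")
```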

Packet capture gives deep-dive detail into application performance and transactions, and is sometimes the only way to get the information necessary to isolate a problem. The difficulty with packet capture is that it is so detailed that traces can be hard to read, especially in data center environments. Or engineers mistakenly rely on freeware packet capture engines running on commercial laptops, which cannot capture at line rate, leading to wasted time during analysis. Full 24/7 analysis at the packet level is a complex hurdle for most IT environments because there is so much data to collect. Even when there is a packet expert in-house, the magnitude of the collected data can make problem isolation like looking for a needle in a haystack.
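
To show the kind of question packet-level data can answer, here is a minimal sketch using the third-party scapy library to scan a saved capture file for repeated DNS queries, the retransmission pattern behind the credit card company’s two-minute delay. The capture file name is a placeholder, and a production analyzer would of course handle far larger captures.

```python
from collections import Counter

from scapy.all import rdpcap
from scapy.layers.dns import DNS

CAPTURE_FILE = "client_startup.pcap"   # hypothetical capture of an application start-up

packets = rdpcap(CAPTURE_FILE)

# Count DNS queries (qr == 0) by queried name and query type
queries = Counter()
for pkt in packets:
    if pkt.haslayer(DNS) and pkt[DNS].qr == 0 and pkt[DNS].qd is not None:
        name = pkt[DNS].qd.qname.decode(errors="replace")
        qtype = pkt[DNS].qd.qtype       # 1 = A, 28 = AAAA
        queries[(name, qtype)] += 1

# Names queried many times were likely retransmitted (unanswered) queries
for (name, qtype), count in queries.most_common(10):
    if count > 1:
        print(f"{name} (qtype {qtype}) queried {count} times")
```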

(NEAR) PERFECTION IS POSSIBLE

You can stop the insanity. Think about the value of all your resources and your investments – not just in infrastructure but in human capital as well. You can choose optimal performance for your employees and your customers without suffering a zero-sum (or negative sum) return. Fluke Networks offers solutions that are intuitive, easy to use, and flexible with continuous process improvement built in (so you don’t have to).

Consider tools like the OptiView XG Network Analysis Tablet and Visual TruView for unified network and application performance management, which give your organization and its network engineers the ability to visualize network and application problems as they arise. With end-to-end coverage of the network, from WiFi at the access layer to high-performance links in the data center and everything in between, network engineers have the right solution to resolve issues before they begin to impact the bottom line.

These tools work together to ensure that there are no blind spots on the network or in the application, providing response-time metrics both in real time and back in time. Guided workflows turn anyone into an efficient troubleshooter, enabling all areas of responsibility within IT to come together and resolve costly performance problems.

You don’t have to tolerate the tremendous costs of network and application performance problems for another minute. It’s time to stop the insanity. See how OptiView XG and Visual TruView can help bring an end to that expensive word – slow.

  • "Insanity: doing the same thing over and over again and expecting different results." Author Rita Mae Brown in her book Sudden Death on Pg. 68 from 1983