In part 2, I talked about how a multi-tier application is another reason why small and medium businesses need application testing. If you haven’t read it, go here.

In this post, I talk about reason #3.

When I was brought on to help resolve this application issue and started asking my questions, I found out that the issues the users were having occurred intermittently. Once I hear “intermittent”, I know I will need to do some continuous captures.

Capturing data continuously just means that my packet capture tool will run until a user reports a problem. And when an issue is reported, I download the packet trace and start looking at it for clues.

When user-reported application problems occur at any time and cannot be triggered, you must have the appropriate tool to capture the data. I’ve used Wireshark and other tools, like Riverbed’s SteelCentral Transaction Analyzer. You also need the expertise to filter the capture appropriately, so that the packets from the moment the issue happens are stored and retained long enough to be downloaded for analysis.

I typically capture all packets on the client side and the first 100 packets on the server side. Those 100 packets are enough to see the protocol headers, which give me information about the data being sent and received.
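To make the continuous-capture idea a little more concrete, here is a minimal sketch of how a ring-buffer capture could be set up with dumpcap, the capture engine that ships with Wireshark. It only builds the command line; it does not start a capture. The interface name, file names, and sizes are placeholders I made up for illustration, and the `-s` snapshot length truncates each packet to a number of bytes, which is one common way to keep just the headers and not necessarily the exact server-side setup described above.

```python
def build_ring_capture_command(interface, out_path, file_size_kb=100000,
                               file_count=20, snaplen=None, capture_filter=None):
    """Return a dumpcap argument list for a ring-buffer (continuous) capture.

    dumpcap rotates to a new file once the current one reaches file_size_kb,
    and starts overwriting the oldest file after file_count files exist.
    """
    cmd = [
        "dumpcap",
        "-i", interface,                    # capture interface (placeholder name)
        "-b", f"filesize:{file_size_kb}",   # rotate after this many kB
        "-b", f"files:{file_count}",        # keep at most this many files
        "-w", out_path,                     # base name for the capture files
    ]
    if snaplen is not None:
        cmd += ["-s", str(snaplen)]         # truncate each packet to N bytes
    if capture_filter:
        cmd += ["-f", capture_filter]       # libpcap capture filter expression
    return cmd


def retained_kb(file_size_kb, file_count):
    """Rough upper bound on how much capture data the ring buffer holds."""
    return file_size_kb * file_count


# Hypothetical client-side capture: full packets, HTTPS traffic only.
client_cmd = build_ring_capture_command("eth0", "client.pcapng",
                                        capture_filter="tcp port 443")
print(" ".join(client_cmd))
```

The point of sizing the ring buffer is retention: the files have to cover at least the time between the user hitting the error and someone downloading the trace, otherwise the interesting packets get overwritten.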

We Start Capturing Continuously

Once the continuous captures were running on all the tiers where I had capture agents installed, users started reporting the Internet Explorer errors they were getting while using the application.

One came…then another…and another.

I looked at all of this data, and after a while, a pattern started to develop. There was an increase in retransmissions and packet loss whenever the errors occurred.
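For readers who want to see what "an increase in retransmissions" looks like in data terms, here is a simplified sketch. It assumes the trace has already been reduced to (flow, sequence number, payload length) records, which is a format I chose for illustration, and it flags a segment as a retransmission when the same sequence number and length show up again on the same flow. Real analyzers use more heuristics than this (sequence wraparound, keep-alives, partial overlaps), so treat it as a rough approximation.

```python
from collections import defaultdict


def count_retransmissions(segments):
    """Count repeated data segments per flow.

    segments: iterable of (flow_id, seq, payload_len) tuples in capture order.
    Returns a dict mapping flow_id to the number of repeated segments seen.
    """
    seen = defaultdict(set)     # flow_id -> {(seq, payload_len), ...}
    counts = defaultdict(int)   # flow_id -> retransmission count
    for flow_id, seq, payload_len in segments:
        if payload_len == 0:
            continue            # pure ACKs carry no data, so skip them
        key = (seq, payload_len)
        if key in seen[flow_id]:
            counts[flow_id] += 1
        else:
            seen[flow_id].add(key)
    return dict(counts)


# Made-up sample: flow "a" resends the segment starting at sequence 101.
sample = [
    ("a", 1, 100),
    ("a", 101, 100),
    ("a", 101, 100),   # same seq and length again: counted as a retransmission
    ("a", 201, 0),     # pure ACK, ignored
    ("b", 1, 50),
]
print(count_retransmissions(sample))  # {'a': 1}
```

Comparing these counts for the minutes around each reported error against a quiet period is one way to see the kind of pattern described above.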

In part 1, I talked about how the application was developed for the LAN and therefore was pretty chatty. So it was no surprise when I noticed the application opening many TCP connections. It did this in order to display relevant information to the user in various panels in the browser, but these bursts of new connections occurred at the same time the errors did.

For any given panel-specific request, the application would open up a separate connection, and the errors would occur when it could not do this.

In my analysis of one particular set of packets, I noticed the browser trying to open up a connection on port 443 (the HTTPS port). I saw three SYN packets being sent without any replies to the client. After the failed third attempt, when TCP stops trying, the browser displays the “Page Cannot Be Displayed” error. Since the application is composed of numerous panels, the error is only displayed in the panel where it failed.
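The pattern of repeated SYNs with no reply can also be picked out programmatically. The sketch below assumes the packets have been reduced to (source IP, source port, destination IP, destination port, flags) tuples, with "S" for a SYN and "SA" for a SYN-ACK. That record format, the addresses, and the threshold of three attempts are my own choices for the example, not output from any particular tool.

```python
from collections import defaultdict


def find_unanswered_syns(packets, min_attempts=3):
    """Return connection attempts that sent SYNs but never got a SYN-ACK.

    packets: iterable of (src, sport, dst, dport, flags) tuples, where flags
    is "S" for a SYN and "SA" for a SYN-ACK. Other flag values are ignored.
    """
    attempts = defaultdict(int)   # (src, sport, dst, dport) -> SYN count
    answered = set()              # attempts that received a SYN-ACK
    for src, sport, dst, dport, flags in packets:
        if flags == "S":
            attempts[(src, sport, dst, dport)] += 1
        elif flags == "SA":
            # A SYN-ACK travels server -> client, so flip the tuple
            # to match the key used for the original SYN.
            answered.add((dst, dport, src, sport))
    return [conn for conn, n in attempts.items()
            if n >= min_attempts and conn not in answered]


# Made-up trace: one connection retries its SYN three times with no reply,
# another connection to the same server is answered normally.
trace = [
    ("10.0.0.5", 50000, "203.0.113.10", 443, "S"),
    ("10.0.0.5", 50001, "203.0.113.10", 443, "S"),
    ("203.0.113.10", 443, "10.0.0.5", 50001, "SA"),
    ("10.0.0.5", 50000, "203.0.113.10", 443, "S"),
    ("10.0.0.5", 50000, "203.0.113.10", 443, "S"),
]
print(find_unanswered_syns(trace))
```

Keying on the full four-tuple matters here: the failing and the working connections go to the same server and port, and only the client's source port tells them apart.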

[Image: browser_packet]

But was the request from the browser’s source port being dropped somewhere? Other communications were clearly getting through during this time without any dropped packets. Not every panel received the error; only that one panel did.

Remember all those tiers and locations where I set up my agents? These packets were captured in the data center, after going across the WAN provider’s network.

Other data I captured at the user locations showed some packet loss between the user and the data center.

I was also able to get device statistics from the WAN provider, which showed up to 1% packet loss on the customer’s data when these errors were being encountered.

Losses and Retransmissions

So I’ve got packet loss between the users and the data center and packet retransmissions between the data center and the SaaS provider’s network.

Up to this point, there is nothing showing why these losses and retransmissions are occurring. However, it’s clear that they’re affecting the application. And they are getting triggered sporadically.

The intermittent nature of these issues is yet another reason why small and medium businesses need to test their applications. Things happen on networks or to applications that cannot always be easily reproduced. With a test or a process for testing applications in place, you can more quickly resolve such issues, rather than losing time trying to recreate the problem.

It’s clear now that this issue could be anywhere. As is the case in many applications I’ve had to troubleshoot, there is rarely one single culprit. In part 4, I’ll talk about the primary reason why small and medium businesses need to test their applications.