
This is still ongoing.  Lots of Perfstats have been sent to NetApp for analysis.  One item they found is puzzling.  We are using NetApp's Snap processing; there is Snap Vault and Snap Mirror.  Mirror creates images on the local filer so recovery can take place back to a point in time.  As changes happen, the old blocks are kept in this Mirror block list.  But only so many 'Mirrors' are kept active at one time.  As a new one comes online, an old one goes off, and those blocks then go back into the free list.  That can take a while, and it lined up with one of the symptoms of our problem: high CPU time.  They attributed the high CPU to the Mirror process releasing those blocks.

That would then impact the response time of NFS requests for the clients.  Bummer.  We were seeing response times go from 1-3 milliseconds up to 100 milliseconds.  The one application doing the most work suffered the most, of course.

We also found another set of files that were being referenced a lot: shared libraries (.so files).  Not read, just referenced.  Once again it went back to the application team to see if they could be placed somewhere else.

During this time we were waiting on the networking team to enable another device to capture the traffic.  Our other approaches were traces on the application server and switch-mirror traces to a Windows system.  But tracing the Filer at a steady rate of 1 Gb is tough, as the capture size grows fast.  We have Gigamon as a monitoring switch fabric and now have a 10 Gb port connected to a GigaStor device (from Network Instruments) to record.  Now we don't have dropped packets messing up the trace, and we have 48 TB of space to store the results, so we can reach back and look at some history of what was happening when the problem occurred.  I'll also mention that Riverbed has some great products to address this as well.

We saw that one of the high traffic sinks was the Snap Vault process.  During this time frame the application was also seeing high response time.  We did not have that knowledge before.  One more thing to talk with NetApp about.

For those interested, here is the Wireshark command line (tshark.exe) used to read the trace and create a text file with certain fields.  I then used Excel to summarize that.

"C:\Program Files\Wireshark\tshark.exe" ^
-o nfs.default_fhandle_type:ONTAP_V3 ^
-T fields -E separator=, -e frame.number -e rpc.auth.uid -e rpc.time -e nfs.procedure_v3 -e nfs.fh.hash -e nfs.name ^
-e nfs.fh.mount.fileid -e nfs.fh.fileid -e nfs.fh.export.fileid -e nfs.count3 ^
-R "(tcp.dstport == 2049) && (rpc)" ^
-r file-name-of-trace-file

Run this with the output redirected to a file (> output.txt) that you then bring into Excel.  The -E separator=, option tells tshark to use ',' between the output fields, which makes splitting them into columns in Excel an easy process.
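If you want a first-pass roll-up before the data even hits Excel, a few lines of Perl will do it.  This is just a sketch against the field order in the command above (frame, uid, rpc.time, procedure, ...); keep in mind that rpc.time is only filled in on reply packets, so with a request-only capture filter that column may be empty and only the request counts will be meaningful.

#!/usr/bin/perl
# Roll up the tshark CSV: request counts (and total rpc.time where present)
# per NFS procedure, plus request counts per UID.
# Column order matches the -e list above: frame, uid, rpc.time, procedure, ...
use strict;
use warnings;

my (%calls, %time, %by_uid);
while (<>) {
    chomp;
    my ($frame, $uid, $rpctime, $proc) = split /,/;
    next unless defined $proc && $proc ne '';            # no NFS procedure on this row
    $calls{$proc}++;
    $time{$proc} += $rpctime if defined $rpctime && $rpctime ne '';
    $by_uid{$uid}++          if defined $uid     && $uid ne '';
}

print "procedure,requests,total_rpc_time_s\n";
printf "%s,%d,%.6f\n", $_, $calls{$_}, $time{$_} // 0
    for sort { $calls{$b} <=> $calls{$a} } keys %calls;

print "\nuid,requests\n";
printf "%s,%d\n", $_, $by_uid{$_}
    for sort { $by_uid{$b} <=> $by_uid{$a} } keys %by_uid;

Run it as perl nfs_summary.pl output.txt (the script name is whatever you like) and the comma-separated output drops into Excel the same way the raw file does.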

Lots of items in here this week, and a lot to digest.

Our problem is still being worked on.  This is taking way too long to come to a conclusion.

We have a mystery question: what is all this activity to the NFS Filer from a given server?  It has been blamed for some slowdowns noticed by some applications.

The challenge has been to find what is causing this activity, since this is a general-purpose workhorse server that is used by a number of different applications.  It seems not all of them are playing nice in the same sandbox.

Some people would want to use this challenge as a reason for each application area to have their own server and even their own NetApp Filer system.  That way they cannot be impacted by anyone else.  But that is not how it is done when you want to manage your data center as an Enterprise and have some scaling and flexible buying power.  When resources are cut up into small parts and each area has its own, you are challenged to get the utilization very high.  You end up with a lot of waste, all in the name of Fear, Uncertainty and Doubt (F.U.D.) about the risk of sharing.

Many times we have found that the impacting workload, once we finally identify it, belongs to the very user who is crying the most.  They are doing it to themselves and costing the Enterprise a lot more money instead of fixing the problem.

We used some network captures to assist us with finding out what was going on between the server and the Filer.  (Wireshark was used, but that is the topic of another blog.)  This showed a lot of activity with the function GetAttribute, which tests a file to see if it has changed since the last time it was used.  It turned out to be for files used by an Adobe PDF creation process.  Fonts and templates were being checked and causing a lot of overhead.  They do not change very often (like a couple of times per year) once they are established.  Templates might change with a new version or release of something, but then they settle down again.

We had some SAN space we could define for this system, but it would not be shared storage like the Filer is.  So once the files were placed there, we would need them duplicated for the Test/Stage and Production versions of the server.  But we needed less than 80 GB of space for this.  The system already had HBAs, so we could add some SAN storage and reduce the amount of activity taking place over the network.  This ended up reducing the GetAttribute workload by 10x.
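As an aside, the same style of tshark field output shown earlier on this page can be rolled up per file handle to see which files are being polled the hardest.  A rough sketch, assuming the same column order and that NFSv3 GETATTR shows up as procedure number 1 in that column; the handles that float to the top can then be traced back to inodes and file names on the Filer.

#!/usr/bin/perl
# Rough sketch: count GETATTR requests per NFS file handle hash from the
# tshark CSV (same column order as the command earlier on this page).
# NFSv3 GETATTR is procedure 1; the top handles can then be chased back
# to inodes and file names on the Filer.
use strict;
use warnings;

my %getattr;
while (<>) {
    chomp;
    my (undef, undef, undef, $proc, $fh_hash) = split /,/;
    next unless defined $proc && $proc eq '1';                  # GETATTR only
    $getattr{$fh_hash}++ if defined $fh_hash && $fh_hash ne '';
}

my @top = sort { $getattr{$b} <=> $getattr{$a} } keys %getattr;
printf "%-22s %d\n", $_, $getattr{$_} for grep { defined } @top[0 .. 19];

Twenty handles is an arbitrary cut-off; adjust to taste.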

Step one provided some relief to the Filer CPU activity.  But the CPU busy was still very high – in the high 80s to low 90s. Plus we saw some pauses taking place.

I fired up some simple netstat -p tcp and nfsstat -c commands and had that output going to a set of files along with a date/time stamp.  I put this on a 10-second interval and let it pump data to files named with $(hostname) and $(date '+%y%m%d-%H') so that I could keep the number of files small, but also make it self-naming in case I needed to do this on other systems at the same time.
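The collection loop itself was nothing fancy.  Here is a rough equivalent in Perl (the real collection was driven by the shell substitutions above; the exact netstat/nfsstat flags are just the ones mentioned here and vary a bit by Unix flavor):

#!/usr/bin/perl
# Sketch of the 10-second collector: append timestamped netstat/nfsstat
# output to hourly files named <hostname>-<yymmdd-HH>.txt.
use strict;
use warnings;
use POSIX qw(strftime);
use Sys::Hostname;

my $host = hostname();
while (1) {
    my $stamp = strftime('%Y-%m-%d %H:%M:%S', localtime);
    my $file  = sprintf '%s-%s.txt', $host, strftime('%y%m%d-%H', localtime);
    open my $out, '>>', $file or die "cannot append to $file: $!";
    my $net = qx(netstat -p tcp);    # TCP protocol statistics
    my $nfs = qx(nfsstat -c);        # NFS client-side counters
    print {$out} "==== $stamp ====\n", $net, $nfs;
    close $out;
    sleep 10;
}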

Enter one of my fave programs for data analysis: Perl.  I use it to assist with analysis and to transform output into input for Excel, my other fave program for data analysis.  Each of these commands (netstat and nfsstat) creates very verbose output that is hard to wade through to see if there really is a problem or not.  By using Perl to clean that up, I can then use Excel to create deltas between one interval and the prior interval to see if there is a pattern or not.
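The clean-up side looks something like this.  A minimal sketch, assuming the hourly files written by the collector sketch above and the usual nfsstat layout of counter names on one line with their values on the next; the labels differ between Unix flavors, so the 'calls' counter here is just an example of one worth tracking.

#!/usr/bin/perl
# Sketch: turn the collector files into a CSV of per-interval deltas for
# one counter.  nfsstat usually prints counter names on one line and the
# values on the next, so pair those lines up, keep the chosen counter,
# and subtract the previous sample.
use strict;
use warnings;

my $want = 'calls';                 # counter to track; pick any label nfsstat shows
my ($stamp, $prev);
my @names;

print "timestamp,$want,delta\n";
while (<>) {
    chomp;
    if (/^==== (.+) ====$/) {       # timestamp line written by the collector
        ($stamp, @names) = ($1);
        next;
    }
    if (/^[A-Za-z]/) {              # a line of counter names
        @names = split ' ';
        next;
    }
    if (@names && /^\s*\d/) {       # the matching line of values
        my %row;
        @row{@names} = split ' ';
        if (defined $stamp && defined $row{$want}) {
            my $delta = defined $prev ? $row{$want} - $prev : 0;
            print "$stamp,$row{$want},$delta\n";
            $prev = $row{$want};
        }
        @names = ();
    }
}

The delta column is what goes into Excel; a cumulative counter by itself tells you very little, but the jump from one 10-second sample to the next is where the bad intervals stand out.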

We began to see that there were periods of very good service and file response time, and then some short intervals of very poor service.  Something was indeed happening that was not normal.  Now we had the proof that we needed to be able to talk with NetApp.  They had been running their PerfStat process, but it was not showing anything wrong, partly because of the length of their collection interval.  The 10-second interval output from these two Unix commands provided the information we needed to show them that something was not right.

This is still ongoing, so I don't have a fix for the problem.  But we do have better information now showing that it is a real problem.  At least for a few of the 10-second intervals.  But not for all of them, which is why this has been so hard to find.

By the way, the network trace of the NFS traffic showed us the UID of the user making the request and also gave us the file inode, which we could then look up on the Filer to find out the real file name.  That gave us clues as to what was going on with the PDF generation process, for instance.

Till next time.