This is still ongoing.  Lots of Perfstats have been sent to NetApp for analysis.  One puzzling item was found in them.  We are using NetApp's Snap processing; there is SnapVault and SnapMirror.  Mirror creates images on the local filer so recovery can take place back to a point in time.  As changes happen, the old blocks are kept in this Mirror block list.  But only so many 'Mirrors' are kept active at one time.  As a new one comes online, an old one goes off, and its blocks then go back into the free list.  That release can take a while, and it lined up with one of the symptoms of our problem: high CPU time.  NetApp attributed the high CPU to the Mirror process releasing these blocks.
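To make that rotation concrete, here is a small toy sketch in Python.  This is NOT how ONTAP actually implements it, and the retention count and structures are made up purely for illustration; it just shows why dropping the oldest image means walking and freeing every block that image alone was still holding, which is work that grows with the change rate.

# Toy sketch of the image rotation described above -- not ONTAP's
# implementation, just an illustration of the block-release cost.
from collections import deque

MAX_IMAGES = 4        # assumed retention count, purely illustrative
images = deque()      # retained point-in-time images, oldest first
free_blocks = set()   # blocks available for reuse by new writes

def take_new_image(blocks_changed_since_last_image):
    """Bring a new image online, keeping the old copies of changed blocks
    on its block list; if we are past the retention limit, the oldest
    image goes off and the blocks only it still held return to the free list."""
    images.append(set(blocks_changed_since_last_image))
    if len(images) > MAX_IMAGES:
        released = images.popleft()
        still_held = set().union(*images)
        # Walking this released block list and returning each block is the
        # step the vendor pointed at for the high CPU -- the busier the
        # volume, the longer the list.
        free_blocks.update(released - still_held)

# e.g. take_new_image({101, 102, 103}) keeps the old copies of those blocks;
# once that image rolls off, whatever no newer image still holds is freed.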

That in turn impacts the response time of NFS requests from the clients.  Bummer.  We were seeing response times go from 1-3 milliseconds up to 100 milliseconds.  The one application doing the most work suffered the most, of course.

We found another set of files that were being referenced.  There was a set of .so library files that were also being referenced a lot.  Not read, just referenced.  Once again it went back to the application team to see if they could be placed somewhere else.

During this time we were waiting on the networking team to enable another device to capture the traffic.  Our other approaches were traces on the application server and switch-mirror traces to a Windows system.  But tracing the filer at a steady 1 Gb/s is tough, as the capture size grows fast.  We have GigaMon as a monitoring switch fabric and have a 10 Gb port connected to a GigaStore device (from Network Instruments) to record.  Now we don't have dropped packets messing up the trace, and we have 48 TB of space to store the results, so we can reach back and look at some history of what was happening when the problem hit.  I'll also mention that Riverbed has some great products to address this as well.

We saw that one of the big traffic sinks was the SnapVault process.  During this time frame the application was also seeing high response times.  We did not have that knowledge before.  One more thing to talk with NetApp about.

For those interested, this is the Wireshark command line (tshark.exe) I used to read the trace and create a text file with certain fields.  I then used Excel to summarize that.

"C:\Program Files\Wireshark\tshark.exe" ^
-o nfs.default_fhandle_type:ONTAP_V3 ^
-T fields -E separator=, -e frame.number -e rpc.auth.uid -e rpc.time -e nfs.procedure_v3 -e nfs.fh.hash -e nfs.name ^
-e nfs.fh.mount.fileid -e nfs.fh.fileid -e nfs.fh.export.fileid -e nfs.count3 ^
-R "(tcp.dstport == 2049) && (rpc)" ^
-r file-name-of-trace-file

Run this and redirect the output to a file (> output.txt) that you then bring into Excel.  The -E separator=, option tells tshark to use ',' as the separator between the output fields, which makes splitting them into columns in Excel an easy process.
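If you would rather skip the Excel step, here is a rough Python sketch of the same kind of summary.  It assumes the output went to output.txt as above and that the columns come out in the same order as the -e options; the script and its field names are my own illustration, not something tshark produces.  It totals calls and average rpc.time per NFSv3 procedure, which is also a quick way to spot files (like those .so libraries) that rack up GETATTR/ACCESS/LOOKUP traffic without many READs.

# summarize_nfs.py - a rough stand-in for the Excel step.  Assumes output.txt
# came from the tshark command above: comma-separated, no header row, columns
# in the same order as the -e options.
import csv
from collections import defaultdict

FIELDS = ["frame", "uid", "rpc_time", "proc", "fh_hash", "name",
          "mount_fileid", "fileid", "export_fileid", "count3"]

# A few NFSv3 procedure numbers per RFC 1813, for readable output.
PROC_NAMES = {"0": "NULL", "1": "GETATTR", "2": "SETATTR", "3": "LOOKUP",
              "4": "ACCESS", "6": "READ", "7": "WRITE", "16": "READDIR",
              "17": "READDIRPLUS", "21": "COMMIT"}

calls = defaultdict(int)       # call count per procedure
rpc_time = defaultdict(float)  # summed rpc.time (seconds) per procedure

with open("output.txt", newline="") as f:
    for row in csv.DictReader(f, fieldnames=FIELDS):
        proc = PROC_NAMES.get(row["proc"], row["proc"] or "other")
        calls[proc] += 1
        try:
            rpc_time[proc] += float(row["rpc_time"])
        except (TypeError, ValueError):
            pass  # frames without an rpc.time value are skipped

print("procedure     calls   avg rpc.time (ms)")
for proc, n in sorted(calls.items(), key=lambda kv: -kv[1]):
    print(f"{proc:<12}{n:>8}{rpc_time[proc] / n * 1000:>18.3f}")

Swap the "proc" key for "name" in the two dictionaries and you get the same breakdown per file name instead of per procedure.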

Lots of items in here this week, and a lot to digest.

Our problem is still being worked on.  This is taking way too long to come to a conclusion.

 
