For our Linux systems we were able to add a very simple cron job that touches a file every 5 minutes.
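A minimal sketch of the kind of crontab entry involved, assuming the shared directory is mounted at /nfs/heartbeat and each host touches its own file (the path and file name are made up for illustration):

# every 5 minutes, update this host's heartbeat file on the shared NFS mount
*/5 * * * * /bin/touch /nfs/heartbeat/$(hostname).hb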

This is an NFS-mounted location, and we pointed our monitoring tools at it to look for the oldest file in that directory.  The only thing in that directory is this ‘heartbeat’ file.  Once it is older than 5 minutes, we turn that system yellow.  Once it is over 10 minutes, it turns red.  We then know that either the system is having issues, or it is at least no longer connected to the NFS mount.
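On the monitoring side, the check can be as small as a find against that directory.  A rough sketch, again assuming the /nfs/heartbeat path and the 5 / 10 minute thresholds above:

# anything not touched in the last 10 minutes is red
find /nfs/heartbeat -type f -mmin +10 -printf 'RED    %p\n'
# touched within the last 10 minutes but not the last 5 is yellow
find /nfs/heartbeat -type f -mmin +5 ! -mmin +10 -printf 'YELLOW %p\n'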

When this was requested, I wondered what the big deal was with creating this process.  But when there are hundreds of systems, it turns into the early warning signal that things are going bump in the night.

If there is just a system or two in this state, it is no big deal.  Check whether their files are still being touched and whether there is a reason why they are no longer updating.  But when lots of systems are in this status, you know there is a global problem.  That requires a different approach.

Very simple to put into place.  Very simple to use.  Lots of value add.

This is different from things like pinging a list of systems to see if they are still up.  This is using those systems to do the work themselves.  Not often enough to cause a performance hit from running it, but often enough to catch issues.

Henry


What do you do when you find yourself working for a company that would rather outsource all performance analysis?  HELP.

Those who know me know that is not a good mix.  When application owners say that there must not be a problem because there are no tickets against us, regardless of what the dashboards say, then you know there is a problem brewing.

EPIC is an application used in the healthcare industry, and the modules we had in place added the EMR (Electronic Medical Records) for the primary care doctors.  This is not for the patients in the hospitals; that is the next set of modules they are getting, the Enterprise version.

When the response time is over 5 seconds the workflow is labeled Yellow.  When the response time is over 10 seconds the workflow is labeled Red.  Yet the help screens say these values are 3 and 6 seconds; the actual values are 5 and 10, based on the actual response times that are seen.  Plus, this all runs over VDI / Citrix, yet the actual screens have a lot of Tab keys and drop-down choices.  My thought is that more interactions are taking place between the Citrix systems and the thin client than are being recorded; i.e., slow response time moving around the screen is not reported within the workflow response time measurements and thus is hidden from view.

This all reminded me of the sub-second discussions back from the TSO days of system development.  Recall that once the response time starts to creep above 1 second, the user no longer feels the system is a tool; it becomes a hindrance to their workflow.

When you look at the exceptions (Yellow or Red) and where they are happening (office locations), the story starts to come out that there must be problems in certain areas of the application's usage.  But the users are being told: you have no choice, you must use this system.  So why complain?  There is no value in complaining.  They just need to get the job done.

It so happens that my primary doctor is also on this system.  I just happened to be seeing him about some other challenges.  He mentioned that he is now seeing 2/3 fewer people because of the slowness of the system, and now he must do all data entry while the patient is in the room.  I was able to see firsthand the little impacts that were happening.

When I brought this up to the project manager, I found out that they were understaffed for training, so the hours for teaching the doctors how to use the system were cut back.  Plus, they did not know how to pull up the exception reports.  Of course, my eyes had gone straight to those the first time I had a chance to play with the monitoring system.

The EPIC experts did some troubleshooting and found that the disk I/O appeared to be challenged.  Of course, the EPIC system is built on Caché, which has a habit of doing a flush write every 80 seconds.  This flush is done without regard to the I/O damage it causes: suddenly a deep write queue builds up, which of course interferes with the read requests.  The writes have to finish within 20-30 seconds and normally finish in 1-2 seconds.  So why not spread them out over the 20 seconds?  That is not an option.  Why not spread them over even 5 seconds?  Nope, not an option.  Instead, the guidance is to spread the data over more disks in the array so that the workload is spread.  And EPIC and Caché are not even hardware vendors.

It has been an interesting journey, to say the least.

We have this DB on IBM’s SVC with a DS4300 back-end.  We do not have any performance reporting on the SVC or the DS4300.  I started to dig into the SVC and found that while we do have some performance metrics, they are at 5-minute intervals.  A lot hides in 5-minute averages.  Even so, this showed that other systems were causing issues for the EPIC sub-system.  This is all shared DASD.  Surprise, another system could cause impact!

Another target rich set of systems.


At last, part of the mystery is finally coming together, at least for another set of NetApp devices.  This is another mystery we have had in the same NetApp family of devices.  These are devoted to Windows workstations.

We have two different classes of disks in one filer.  One is Fibre Channel and the other is SATA; one spins at 15K RPM and the other at 7.2K RPM.  We were told there was no way they could cause trouble for each other.  WRONG.  They share the cache.  So if the slower devices are being driven hard with writes, the cache process becomes a bottleneck.

There is a part of the cache called Non-Volatile Storage (NVS).  This covers the case where you need to keep a copy of the data before it is actually written to disk.  For a period of time you have a version of the write in cache and also in NVS.  That way, if the device loses power, the data is supposed to stay in NVS until it can be destaged to disk.  At least that is the story as to why it exists.

When a write happens, there are then two copies: one in cache and one in NVS.  The write is signaled as finished and the application can continue on with its work.  But if the NVS becomes full, that signal is delayed until the data is physically written to the disk.

Do you see the problem now?  In our case, the slower disks are being used by a function that takes backups of workstations (the DLO product).  Think massive write operations.  At some point the amount of activity grew enough that we crossed the line, and now we have delays in all of our writes.
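To make the mechanism concrete with purely made-up numbers: say the DLO backups push 150 MB/s of writes, the NVS segment holds 512 MB, and the SATA aggregate can destage about 60 MB/s.  The NVS then fills in roughly 512 / (150 - 60), or about 6 seconds, and from that point on every write on the filer, NFS and CIFS alike, is paced by the slow disks instead of being acknowledged out of cache.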

Normally the write response time is around 1 millisecond or less.  But when this problem starts to happen, we see write response time climb to 50 ms and then to 100-200 ms.  That is when the workstations become unresponsive and the users start calling in that they are dead in the water.  These are VMware workstations, and in some cases VMware servers as well.  So not only are those users impacted, but all of the services they are working with are impacted too.  Thus even the non-VMware workstation users feel the pain.

The VMware workstations are using NFS for their access and the DLO Backup process is using CIFS.  We have the ability to cut off all CIFS traffic.  When we do that, the NFS response time instantly goes back to 1 millisecond.  Problem solved as far as they are concerned.
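For reference, on the 7-mode filers of that vintage the on/off switch is along these lines; treat the exact commands as an assumption and check them against your ONTAP release:

cifs terminate     (stop serving CIFS on the filer, which halts the DLO backup traffic)
cifs restart       (bring CIFS back once the NFS latency has recovered)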

I had no idea that we could use NFS for our VMware systems, but it was a great discovery on my part.  We could ‘fix’ the problem, at least for those users.

We have another set of NetApp systems on order, but they are not here yet.  Plus, we have other internal delays before we can cut the DLO work over to those systems.  We may have to limp along with the risk of lost files if we cannot limit the damage from this DLO workload.

It seems that as the number of users in this DLO sub-system has grown, we have finally reached the critical mass that lets us see this problem for what it really is: a design choice we made to keep costs down, without a plan already in place to cut it over once the size reached that critical mass.

The management screens of the NetApp device did show that the disk utilization for the DLO system was in the 80% device-busy range.  That should have been a signal that there were issues.  But we had been told there was complete isolation between the two systems and that slowness in one area would not impact the other.  WRONG.
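If you want to watch for this yourself from the filer console, here is a hedged sketch (these are 7-mode commands and the details may differ on your release):

sysstat -x 1     (one-second samples, including a disk utilization column for the busiest disk)
statit -b        (start a per-disk sample window)
statit -e        (end the window and print per-disk busy percentages)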

Until next time.


This is still ongoing.  Lots of Perfstats have been sent to NetApp for analysis.  One puzzling item was found in them.  We are using NetApp’s Snap processing: there is SnapVault and SnapMirror.  Mirror creates images on the local filer so recovery can take place back to a point in time.  As changes happen, the old blocks are kept in this mirror block list.  But only so many ‘mirrors’ are kept active at one time; as a new one comes online, an old one goes away, and those blocks then go back into the free list.  That can take a while, and it was one of the symptoms of our problems: high CPU time.  They attributed this high CPU to the mirror process releasing these blocks.

That would then impact the response time of NFS requests for the clients.  Bummer.  We were seeing response times go from 1-3 milliseconds up to 100 milliseconds.  The one application doing the most work suffered the most, of course.

We also found another set of files being referenced heavily: a set of .so files.  Not read, just referenced.  Once again, back to the application team to see if they could be placed somewhere else.

During this time we were waiting on the networking team to enable another device to capture the traffic.  Our other approaches were traces on the application server and switch mirror traces to a Windows system.  But tracing traffic to the filer at a steady rate of 1 Gb/s is tough, as the capture size grows fast.  We have GigaMon as a monitoring switch fabric and a 10 Gb port connected to a GigaStore device (from the Network Instruments vendor) to record.  Now we don’t have dropped packets messing up the trace, and we have 48 TB of space to store the results, so we can reach back and look at history from when the problem was happening.  I’ll also mention that Riverbed has some great products to address this as well.

We saw that one of the high-traffic sinks was the SnapVault process.  During that time frame the application was also seeing high response time.  We did not have that knowledge before.  One more thing to talk with NetApp about.

For those interested, this is the Wireshark tshark.exe invocation to read the trace and create a text file with certain fields.  I then used Excel to summarize that.

"C:\Program Files\Wireshark\tshark.exe" ^
-o nfs.default_fhandle_type:ONTAP_V3 ^
-T fields -E separator=, -e frame.number -e rpc.auth.uid -e rpc.time -e nfs.procedure_v3 -e nfs.fh.hash -e nfs.name ^
-e nfs.fh.mount.fileid -e nfs.fh.fileid -e nfs.fh.export.fileid -e nfs.count3 ^
-R "(tcp.dstport == 2049) && (rpc)" ^
-r file-name-of-trace-file

Run this with '> output.txt' to redirect it to a file that you then bring into Excel.  The -E separator option tells tshark to use ',' as the separator for the output fields, which makes splitting into columns an easy process for Excel.
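If you would rather skip the Excel step, a quick pass on a Unix box (or under Cygwin) does the same kind of summarizing.  A sketch, assuming the output was saved as output.txt and using the field order from the command above, where column 2 is rpc.auth.uid:

# count NFS requests per UID from the comma-separated tshark output
awk -F, '{ c[$2]++ } END { for (u in c) print c[u], u }' output.txt | sort -rn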

Lots of items in here this week – a lot to digest.

Our problem is still being worked on.  This is taking way too long to come to a conclusion.

 


We have a mystery question: what is all this activity to the NFS filer from a given server?  It has been blamed for some slowdowns noticed by some applications.

The challenge has been to find what is causing this activity, since this is a general-purpose workhorse server used by a number of different applications.  It seems not all of them are playing nice in the same sandbox.

Some people would want to use this challenge as a reason for each application area to have its own server and even its own NetApp filer.  That way they cannot be impacted by anyone else.  But that is not how it is done when you want to manage your data center as an enterprise and have some scale and flexible buying power.  When resources are cut up into small parts and each area has its own, it is hard to get the utilization very high.  You end up with a lot of waste, all in the name of Fear, Uncertainty and Doubt (F.U.D.) about the risk of sharing.

Many times, when we have found the impacting workload, it belongs to the very user who is crying the most.  They are doing it to themselves and causing the enterprise to spend a lot more money instead of fixing the problems.

We used some network captures to help us find out what was going on between the server and the filer.  (Wireshark was used, but that is the topic of another blog.)  This showed a lot of activity with the GetAttribute (GETATTR) function, which tests a file to see if it has changed since the last time it was used.  This turned out to be for files used by an Adobe PDF creation process.  Fonts and templates were being checked and causing a lot of overhead, yet they do not change very often once they are established (maybe a couple of times per year).  Templates might change with a new version or release of something, but then they settle down again.

We had some SAN space we could define for this system, but it would not be shared storage like the filer is.  So once the files were placed there, we would need them duplicated for the Test / Stage and Production versions of the server.  But we only needed less than 80 GB of space for this, and the system already had HBAs, so we could add some SAN storage and reduce the amount of activity taking place over the network.  This ended up reducing the GetAttribute workload by 10x.
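For what it is worth, the breakdown by NFS procedure can come straight out of tshark as well.  A sketch with a made-up capture file name (in NFSv3, GETATTR is procedure 1):

# count NFSv3 calls by procedure number; a tall count for procedure 1 means GETATTR traffic
tshark -r server-to-filer.pcap -R "nfs" -T fields -e nfs.procedure_v3 | sort | uniq -c | sort -rn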

Step one provided some relief to the Filer CPU activity.  But the CPU busy was still very high – in the high 80s to low 90s. Plus we saw some pauses taking place.

I fired up some simple netstat -p tcp and nfsstat -c commands and had the output go to a set of files along with a date/time stamp.  I put this on a 10-second interval and let it pump data to files named with $(hostname) and $(date '+%y%m%d-%H'), so that I could keep the number of files small but also make them self-naming in case I needed to do this on other systems at the same time.
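The collector itself was nothing fancy; it had roughly this shape, using the file-naming convention described above (everything else is illustrative):

# take a netstat and nfsstat snapshot every 10 seconds, stamped and appended
# to hourly files named after this host
while true; do
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    out="$(hostname)-$(date '+%y%m%d-%H')"
    { echo "=== $ts"; netstat -p tcp; } >> "${out}-netstat.txt"
    { echo "=== $ts"; nfsstat -c; } >> "${out}-nfsstat.txt"
    sleep 10
done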

Enter one of my favorite programs for data analysis: Perl.  I use it to assist with analysis and to transform output into input for Excel, my other favorite program for data analysis.  Each of these commands (netstat and nfsstat) creates very verbose output that is hard to work with when you are trying to see whether there really is a problem.  By using Perl to clean that up, I can then use Excel to create deltas between each interval and the prior interval to see if there is a pattern.
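I used Perl for the cleanup, but just to show the delta idea, here is the same step sketched as an awk one-liner, assuming one cumulative counter has already been pulled out of the nfsstat output into a file with one value per 10-second interval (the file name is hypothetical):

# print the change in the counter from one interval to the next
awk 'NR > 1 { print $1 - prev } { prev = $1 }' calls-per-interval.txt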

We began to see periods of very good service and file response time, and then some short intervals of very poor service.  Something was indeed happening that was not normal.  Now we had the proof we needed to be able to talk with NetApp.  They had been running their PerfStat process, but it was not showing anything wrong, which was partly caused by the length of their collection interval.  The 10-second interval output from these two Unix commands provided the information we needed to show them that something was not right.

This is still ongoing, so I don’t have a fix for the problem.  But we do have better information now showing that it is a real problem, at least for a few of the 10-second intervals.  But not for all of them, which is why this has been so hard to find.

By the way, the network trace of the NFS traffic showed us the UID of the user making each request and also gave us the file inode, which we could then look up on the filer to find the real file name.  That gave us clues as to what was going on with the PDF generation process, for instance.
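One low-tech way to turn that inode back into a path is to search the same export from any NFS client that mounts it.  A sketch with a made-up mount point and inode number:

# map inode 123456 back to its file name(s) on the mounted export
find /mnt/filer_export -inum 123456 -print 2>/dev/null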

Till next time.


Do people still use J2SE platforms? Of course. Does batch processing still happen? Does legacy processing still happen even when people only want to do the new things? You betcha.

We are running the HP HotSpot JVM 1.6. The cool part is that with HPJmeter (free from HP) we can see inside the JVM, and the application team did not have to do anything other than use Java 1.6. HPJmeter will see all the JVMs running Java 1.6 on that system. It works over X Windows, using the HPJmeter Java app and a Java connector that the HPJmeter install starts up as root. That enables dynamic connections to the JVM to do the simple stuff.

This is not Wily or Dynatrace or other specialized products, but it gives simple insight into what is happening: which threads are started, what their stacks look like, and the heap sizes (all of them). Lock contention can also be seen.

The application team sent through some 5,000 MQ messages and we watched how it handled them. This is a very old application, so they don’t want to spend much money on fixing it up. They set the MQ threads and DB threads a long time ago. They follow the old standard: if it is not broken so that someone notices, don’t tell me about it, or at least don’t expect me to fix it now. At one time they must have increased the number of MQ handles without knowing that the DB activity inside also needed to be adjusted. The increased MQ handles made the MQ queue notices go away, so end of story. Now it uses more memory and waits inside the JVM, whereas before it waited outside the JVM.

If we need to come back to this module again, we will need to find a reason to also update the INI fields to change a few other things. Until then, we are happy that at least it is not 100% in error recovery like it was the other week, when it pegged all the CPUs at 100% and caused major problems for everyone else.

Till next time.


I have found that sometimes people want to separate capacity questions and performance questions. Yet they are tied together. The worst part is that poor performance may not always be solved by just buying more capacity.

This blog will explore some of the surprises that have happened when that has been ignored or forgotten.

Applications can either go wide or go high.  Wide is scaling out to more servers or multiple JVMs.  High is adding more threads to the same JVM.

Some companies choose to use custom web application servers, but most of the industry is working with application servers like WebSphere and standard databases like UDB, Sybase, or Oracle.  These are the ‘standard’ set of workhorses that companies have used for years.

We will explore some challenges and present how you too can have insight into what is going on within your applications without needing to spend $$$ to get that insight. 

Stay tuned as we walk down this garden path together. 

I work with performance issues and capacity planning concerns. I used the play on words: Performance Ate My Capacity. I work mainly in enterprise-sized companies that have to scale their workloads to 1,000 active users or more. It is amazing the challenge some designers face when they have to consider scaling to high workloads. What works for a few concurrent users does not always work for 1,000 concurrent users, or at least no one would pay for the resources to make it work.