At last, part of the mystery is finally coming together, at least for another set of NetApp devices. This is another mystery we have had in the same NetApp family, this time on the systems devoted to Windows workstations.

We have two different classes of disk in one filer. One is Fibre Channel and the other is SATA; one spins at 15K RPM and the other at 7.2K RPM. We were told there was no way they could cause trouble for each other. WRONG. They share the cache. So if the slower devices are being driven hard with writes, the cache becomes a bottleneck for everything.

There is a portion of the cache called Non-Volatile Storage (NVS). It covers the case where you need to keep a copy of a write before it is actually committed to disk. For a period of time you have a version of the write in cache and also in NVS. That way, if the device loses power, the write is supposed to stay in NVS until it can be destaged to disk. At least that is the story as to why it exists.

When a write happens, there are then two copies: one in cache and one in NVS. The write is acknowledged as finished and the application can continue with the work it is doing. But if the NVS becomes full, that acknowledgment is delayed until the data is physically written to disk.
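To make the mechanism concrete, here is a toy model in Python. The sizes, destage times, and names (NVS_CAPACITY, destage_ms and so on) are all invented for illustration; this is a sketch of the general write-back idea, not NetApp's actual internals.

from collections import deque

NVS_CAPACITY = 8   # hypothetical number of buffered writes the NVS can hold

class Filer:
    def __init__(self):
        self.nvs = deque()   # pending writes, shared by BOTH disk tiers
        self.clock_ms = 0.0

    def destage_oldest(self, destage_ms):
        # Physically write the oldest NVS entry out to its disk tier.
        tier = self.nvs.popleft()
        self.clock_ms += destage_ms[tier]

    def write(self, tier, destage_ms):
        # Accept a write; the ack is near-instant unless the shared NVS is full.
        latency = 1.0   # the normal ~1 ms acknowledgment path
        if len(self.nvs) >= NVS_CAPACITY:
            # NVS full: the ack waits for a destage, no matter which tier
            # this new write is actually headed for.
            before = self.clock_ms
            self.destage_oldest(destage_ms)
            latency += self.clock_ms - before
        self.nvs.append(tier)
        return latency

# Invented destage times: fast FC spindles vs. slow SATA spindles.
destage_ms = {"fc": 5.0, "sata": 120.0}

filer = Filer()
for _ in range(NVS_CAPACITY):          # a burst of DLO backup writes (SATA)
    filer.write("sata", destage_ms)    # fills the shared NVS
print("FC write latency with NVS full: %.1f ms" % filer.write("fc", destage_ms))

With those made-up numbers, the write headed for the fast Fibre Channel tier reports about 121 ms instead of 1 ms, simply because the NVS was stuffed full of pending SATA writes. That is the same shape as what we saw in production.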

Do you see the problem now? In our case the slower disks are being used by a function that takes backups of workstations (the DLO product). Think massive write operations. Somewhere along the way the activity grew enough that we crossed the line, and now we have delays in all of our writes.

Normally the write response time is 1 millisecond or less. But when this problem starts, we see write response time climb to 50 ms and then to 100-200 ms. That is when the workstations become unresponsive and users start calling in that they are dead in the water. These are VMware workstations, and in some cases VMware servers as well. So not only are the users impacted, but all of the services they work with are impacted too. Thus even the non-VMware workstation users feel the pain.

The VMware workstations use NFS for their access and the DLO backup process uses CIFS. We have the ability to cut off all CIFS traffic, and when we do that, the NFS response time instantly goes back to 1 millisecond. Problem solved as far as those users are concerned.

I had no idea that we could use NFS for our VMware systems, but it was a great discovery on my part. We could ‘fix’ the problem, at least for those users.

We have another set of NetApp systems on order, but they are not here yet. Plus we have other internal delays before we can cut the DLO work over to the new systems. We may have to limp along, with the risk of lost files, if we cannot limit the damage from this DLO workload.

It seems that as the number of users in this DLO sub-system has grown, we have finally reached the critical mass that lets us see this problem for what it really is: a design choice we made to keep costs down, without a plan already in place to cut the workload over once it reached that size.

The management screens of the NetApp device did show that disk utilization for the DLO system was running around 80% device busy. That should have been a signal that there were issues. But we had been told there was complete isolation between the two systems and that slowness in one area would not impact the other. WRONG.
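In hindsight, a simple alert on those two numbers would have flagged the problem long before users called in. Here is a rough Python sketch of the kind of check I mean; the thresholds are guesses, and how you actually collect the disk-busy and write-latency samples from the filer (SNMP, the stats commands, a monitoring tool) is left open.

DISK_BUSY_WARN_PCT = 80       # sustained device-busy on the DLO (SATA) disks
WRITE_LATENCY_WARN_MS = 10    # normal is ~1 ms, so 10 ms is already unusual

def check_filer_health(samples):
    """samples: list of dicts like {'disk_busy_pct': 82, 'write_ms': 55}."""
    alerts = []
    for s in samples:
        if s["disk_busy_pct"] >= DISK_BUSY_WARN_PCT:
            alerts.append("DLO disks %d%% busy - the backup load is "
                          "saturating the slow tier" % s["disk_busy_pct"])
        if s["write_ms"] >= WRITE_LATENCY_WARN_MS:
            alerts.append("write latency %.0f ms - the NVS may be backing "
                          "up behind the slow tier" % s["write_ms"])
    return alerts

# Samples shaped like our incident: busy SATA disks, then climbing latency.
for alert in check_filer_health([{"disk_busy_pct": 82, "write_ms": 1},
                                 {"disk_busy_pct": 85, "write_ms": 150}]):
    print(alert)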

Until next time.
