What do you do when you find yourself working for a company that would rather outsource all performance analysis?   HELP.

Those who know me know that is not a good mix.  When Application Owners say there must not be a problem because there are no tickets against us, regardless of what the dashboards say, then you know there is a problem brewing.

EPIC is an application used in the healthcare industry, and the modules we had in place added the EMR (Electronic Medical Records) for the primary care Drs.  This is not for the patients in the hospitals; that is the next set of modules they are getting, the Enterprise version.  When the response time is over 5 seconds the workflow is labeled Yellow, and when it is over 10 seconds it is labeled Red.  Yet the Help screens say those values are 3 and 6 seconds; the actual values, based on the response times that are seen, are 5 and 10.  Plus this all runs over VDI / Citrix, and the actual screens have a lot of Tab keys and drop-down choices.  My thought is that more interactions are taking place between the Citrix systems and the thin client than are being recorded, i.e., the slow response time moving around the screen is not captured within the workflow response time measurements and is thus hidden from view.
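Just to keep the numbers straight in my own head, here is a minimal sketch of the workflow classification as it actually behaves, using the 5 and 10 second values seen in practice rather than the 3 and 6 in the Help screens (the function and names are mine, not EPIC's):

    # Classify a workflow response time against the thresholds seen in practice.
    # The Help screens claim 3 s / 6 s, but the reporting behaves as if the
    # cutoffs were 5 s and 10 s.
    YELLOW_SECONDS = 5.0
    RED_SECONDS = 10.0

    def classify_workflow(response_seconds):
        """Return the exception color for one workflow response time."""
        if response_seconds > RED_SECONDS:
            return "Red"
        if response_seconds > YELLOW_SECONDS:
            return "Yellow"
        return "Green"

    # A 7-second workflow shows up as Yellow, but the keystroke-by-keystroke lag
    # over Citrix between the two workflow timestamps is never measured at all.
    print(classify_workflow(7.0))   # -> Yellow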

This all reminded me of the sub-second response time discussions back from the TSO days of system development.  Recall that once response time creeps above 1 second, the user no longer feels the system is a tool; it becomes a hindrance to their workflow.

When you look at the Exceptions (Yellow or Red) and where they are happening (office locations), the story starts to come out that there must be problems in certain areas of the application's usage.  But the users are being told they have no choice: they must use this system.  So why complain?  There is no value in complaining.  They just need to get the job done.
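Rolling the exceptions up by office is enough to see the pattern.  Something along these lines is all it takes (the record layout here is my own invention, not the actual report format):

    from collections import Counter

    # Hypothetical exception records pulled from the monitoring system:
    # (office_location, color)
    exceptions = [
        ("Office A", "Red"),
        ("Office A", "Yellow"),
        ("Office B", "Yellow"),
        ("Office A", "Yellow"),
        ("Office C", "Red"),
    ]

    # Count Yellow/Red exceptions per office to see where the pain concentrates.
    by_office = Counter(office for office, _color in exceptions)
    for office, count in by_office.most_common():
        print(office, count)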

It so happens that my primary Dr. is also on this system, and I happened to be seeing him about some other challenges.  He mentioned that he is now seeing 2/3 fewer patients because of the slowness of the system, and that he now must do all data entry while the patient is in the room.  I was able to see firsthand the little impacts that were happening.

When I brought this up to the project manager, I found out that they were understaffed for training, so the hours for teaching the Drs. how to use the system had been cut back.  Plus, they did not know how to pull up the Exception reports.  Of course, my eyes had gone straight to those the first time I had a chance to play with the monitoring system.

The EPIC experts did some troubleshooting and found that the disk IO appeared to be challenged.  Of course, the EPIC system is built on Caché, and that has a habit of doing a flush write every 80 seconds.  This flush is done without regard to the IO damage it causes.  Suddenly a high IO write queue builds up, which of course interferes with the read requests.  The writes have to finish within 20-30 seconds and normally finish in 1-2 seconds.  So why not spread them out over the 20 seconds?  That is not an option.  Why not spread them over even 5 seconds?  Nope, not an option.  Instead, spread the data over more disks in the array so that the workload is spread.  And this advice comes from EPIC and Caché, who are not even hardware vendors.
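To see why dumping the flush all at once hurts, here is a toy queue model of that write-daemon burst.  The block counts and drain rate are assumptions picked for illustration, not measurements from our arrays:

    # Toy model of the Caché flush: every 80 seconds the dirty blocks are dumped
    # on the array at once instead of being paced out over the interval.
    FLUSH_INTERVAL = 80      # seconds between flush cycles
    DIRTY_BLOCKS = 6000      # dirty blocks accumulated per cycle (assumed)
    SERVICE_RATE = 300       # blocks per second the array can drain (assumed)

    def peak_write_queue(spread_seconds, step=0.1):
        """Peak write-queue depth when DIRTY_BLOCKS arrive evenly over spread_seconds."""
        queue = peak = 0.0
        arrival_rate = DIRTY_BLOCKS / max(spread_seconds, step)
        t = 0.0
        while t < FLUSH_INTERVAL:
            if t < spread_seconds:
                queue += arrival_rate * step       # flush traffic lands
            peak = max(peak, queue)                # reads wait behind this queue
            queue = max(queue - SERVICE_RATE * step, 0.0)
            t += step
        return round(peak)

    print("dumped at once  :", peak_write_queue(0.1))   # huge spike, reads starve
    print("spread over 5 s :", peak_write_queue(5))
    print("spread over 20 s:", peak_write_queue(20))    # queue barely builds

Pacing the same number of blocks at roughly the rate the array can absorb keeps the write queue, and therefore the read latency, nearly flat; that is the whole argument for spreading the flush out.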

It has been an interesting journey, to say the least.

We have this DB on IBM's SVC with a DS4300 back-end.  We do not have any performance reporting on the SVC or the DS4300.  I started to dig into the SVC and found that while we do have some performance metrics, they are at 5-minute intervals.  A lot hides in 5-minute averages.  Even so, this showed that other systems were causing issues with the EPIC sub-system.  This is all shared DASD.  Surprise, another system could cause impact!
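A quick illustration of just how much hides in a 5-minute average, with made-up numbers:

    # "A lot hides in 5-minute averages."  Illustrative only: 300 one-second
    # response-time samples, mostly healthy, with a 20-second spike buried inside.
    samples_ms = [5.0] * 280 + [200.0] * 20   # 280 s at 5 ms, 20 s at 200 ms

    five_minute_average = sum(samples_ms) / len(samples_ms)
    worst_second = max(samples_ms)

    print(f"5-minute average: {five_minute_average:.1f} ms")   # 18.0 ms, looks tame
    print(f"worst second    : {worst_second:.1f} ms")          # 200.0 ms, what users feel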

Another target-rich set of systems.