
For our Linux systems we were able to add a very simple process to cron to touch a file every 5 minutes.

This is an NFS-mounted location, and we pointed our monitoring tools to look for the oldest file in that directory. All that is in that directory is this 'heartbeat' file. Once it is older than 5 minutes, we turn that into yellow. Once it is over 10 minutes, it turns red. We then know that either that system is having issues, or it is at least no longer connected to the NFS system.
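As a rough sketch of the idea (not the actual cron entry or monitoring tool configuration we used), the touch and the age check could look like this in Python. The function names and the choice to treat a missing or unreachable file as red are assumptions made for illustration; only the 5 and 10 minute thresholds come from the description above.

```python
import os
import time
from pathlib import Path

YELLOW_AFTER_SECONDS = 5 * 60   # older than 5 minutes turns yellow
RED_AFTER_SECONDS = 10 * 60     # older than 10 minutes turns red


def write_heartbeat(path):
    """What the scheduled job does: refresh the file's modification time."""
    Path(path).touch()


def heartbeat_status(path, now=None):
    """Classify a heartbeat file as green, yellow, or red by its age."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # Assumption: a missing file or unreachable mount is the worst case.
        return "red"
    if age > RED_AFTER_SECONDS:
        return "red"
    if age > YELLOW_AFTER_SECONDS:
        return "yellow"
    return "green"
```

The monitored system would call write_heartbeat on its own file every 5 minutes, and the monitoring side would call heartbeat_status against the same NFS path.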

When this was requested, I wondered what the big deal was with creating this process.  When there are hundreds of systems, this turns into the early warning signal that things are going bump in the night.

If there is just a system or two in this state, then it is no big deal. Check whether their files are still being touched and whether there is a reason why they are no longer updating. But when there are lots of systems in this status, you know there is a global problem. That requires a different approach.
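The "one or two systems versus lots of systems" decision can be sketched as a small triage function. This is only an illustration: the function name, the host-to-status dictionary, and the 20% cutoff for calling a problem global are all assumptions, not values from our monitoring setup.

```python
def triage(statuses, global_fraction=0.2):
    """Decide whether stale heartbeats look isolated or global.

    statuses maps a host name to 'green', 'yellow', or 'red'.
    global_fraction is a hypothetical cutoff: at or above this share
    of non-green hosts, the problem is treated as global.
    """
    stale = sorted(host for host, status in statuses.items() if status != "green")
    if not stale:
        return "ok", stale
    if len(stale) / len(statuses) >= global_fraction:
        return "global", stale
    return "isolated", stale
```

An "isolated" result points at checking the individual hosts; a "global" result points at shared infrastructure such as the NFS server or the network.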

Very simple to put into place.  Very simple to use.  Lots of value add.

This is different from things like pinging a list of systems to see if they are still up. This is using those systems to do the work. It does not run often enough to cause performance problems, but it runs often enough to catch issues.



What do you do when you find yourself working for a company that would rather outsource all performance analysis? HELP.

Those who know me know that is not a good mix. When Application Owners say that there must not be a problem because there are no tickets against us, regardless of what the Dashboards say, then you know there is a problem brewing.

EPIC is an application that is used in the Healthcare industry, and the modules that we had in place added the EMR (Electronic Medical Records) for the Primary Care Drs. This is not for the Patients in the hospitals. That is the next set of modules they are getting, for the Enterprise version. When the response time is over 5 seconds the workflow is labeled Yellow. When the response time is over 10 seconds the workflow is labeled Red. Yet the Help screens mention that these values are 3 and 6 seconds. The actual values are 5 and 10, based on the actual response time that is seen. Plus this all runs over VDI / Citrix systems, and the actual screens have a lot of Tab keys and Drop Down choices. My thought is that more interactions are taking place between the Citrix systems and the thin client than are being recorded, i.e. slow response time while moving around the screen is not reported within the workflow response time measurements and is thus hidden from view.

This all reminded me of the Sub-Second discussions back from the TSO days of system development.  Recall that if the response time starts to creep up above 1 second then the user no longer feels the system is a tool but is a hindrance to their workflow.

When you look at the Exceptions (Yellow or Red) and the location where these are happening (Office Locations) then the story starts to come out that there must be problems in certain areas of the usage of the application.  But they are being told, you have no choice, you must use this system.  So why complain?  There is no value in complaining.  They just need to get the job done.

It so happens that my primary Dr. is also in this system. I just happened to be seeing him about some other challenges. He mentioned that he is now seeing 2/3 fewer people because of the slowness of the system, and now he must do all data entry while the patient is in the room. I was able to see firsthand the little impacts that were happening.

When I brought this up to the project manager I found out that they were understaffed for training and thus the training hours were cut back for teaching the Drs. how to use the system.  Plus they did not know how to pull up the Exception reports.  Of course, my eyes had gone to that the first time I had a chance to play with the monitoring system.

The EPIC experts did some troubleshooting and found that the Disk IO appeared to be challenged. Of course the EPIC system is built on Caché, and that has a habit of doing a flush write every 80 seconds. This flush is done without regard to the IO damage that is done. Suddenly a high IO write queue is built that of course interferes with the read requests. The writes have to finish within 20-30 seconds and normally finish in 1-2 seconds. So why not spread them out over the 20 seconds? That is not an option. Why not spread them over 5 seconds even? Nope, not an option. Instead, spread the data over more disks in the array so that the workload is spread. This is from EPIC and Caché, and neither is even a hardware vendor.

It has been an interesting journey to say the least.

We have this DB on IBM’s SVC and also on a DS4300 back-end.  We do not have any performance reporting on the SVC nor the DS4300.  I started to dig into the SVC and found that while we do have some performance metrics, they are at 5 minute intervals.  A lot hides in 5 minute averages.  Even so, this showed that other systems were causing issues with the EPIC sub-system.  This is all shared DASD.  Surprise, another system could cause impact!
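To show how much can hide in a 5 minute average, here is a small worked example. The numbers are invented for illustration and are not measurements from the SVC or the DS4300: assume one latency sample per second, with 280 quiet seconds at 5 ms and a 20 second burst of contention at 200 ms.

```python
from statistics import mean

INTERVAL_SECONDS = 300  # one 5 minute reporting interval


def summarize(samples_ms):
    """Return the interval average and the worst single sample."""
    return {"mean_ms": mean(samples_ms), "max_ms": max(samples_ms)}


# Hypothetical per-second latencies: 280 quiet seconds, then a 20 second burst.
samples = [5.0] * 280 + [200.0] * 20
assert len(samples) == INTERVAL_SECONDS

summary = summarize(samples)
print(summary)
```

The interval average works out to (280 x 5 + 20 x 200) / 300 = 18 ms, which looks harmless on a report, while users lived through 20 seconds at 200 ms.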

Another target rich set of systems.

I have found that sometimes people want to separate Capacity questions and Performance questions. Yet they are tied together. The worst part is that poor Performance may not always be solved by just buying more Capacity.

This Blog will explore some issues related to the surprises that have happened when that has been ignored or forgotten.

Applications can either go wide or go high.  Wide is scaling to other servers or multiple JVMs.  High is when you can add more Threads to the same JVM.

Some companies can choose to use custom Web Application servers, but most of the industry is working with Application Servers like WebSphere and standard Databases like UDB, Sybase or Oracle. These are the 'standard' set of workhorses that companies have used for years.

We will explore some challenges and present how you too can have insight into what is going on within your applications without needing to spend $$$ to get that insight. 

Stay tuned as we walk down this garden path together. 

I work with Performance issues and Capacity Planning concerns. I used the play on words: Performance ate my Capacity. I work mainly with Enterprise-sized companies that have to scale their workloads to 1,000 active users or more. It is amazing how much of a challenge some designers have when they have to consider scaling to high workloads. What will work for a few concurrent users does not always work for 1,000 concurrent users. Or at least no one would pay for the resources to make it work.