Service Disruption - 22nd July 15:40

15:40 - Easycallnow had an issue for a few minutes. The service is coming back and agents should be able to log in now.

15:55 - Unfortunately the problem has just recurred. Engineers are investigating.

16:10 - The problem has been identified and we are taking steps to stabilise the service.

16:12 - The cause of the problem has been identified and the service has now been restored. We do not expect any further disruption. We apologise for the problem.

16:30 - Unfortunately, service has again been interrupted. We are looking at the problem now.

16:40 - Service restored.

17:22 - Very sorry, the service was briefly interrupted again.

 


Reason for Outage – No or Slow responses to API on Silo1

Date of event: 2015-07-22

Event Summary:

On Wednesday 2015-07-22 from 15:35:

Customers on Silo1 reported that they were unable to execute API queries, resulting in frozen or slow screens. This coincided with an alarm for the number of sessions in progress on a load balancer, which was significantly higher than normal, as shown in the 5-minute average sample graphic below.

Various actions were taken to increase resources and recover service. Since then, no further service issues have been experienced. The graphic below shows session counts for the previous two weeks and the period following the incident.

Impact Analysis: 

Repair Action:

Initial actions taken:

  1. Primary and backup internal and external load balancers were patched and rebooted.
  2. Load balancer session limits were increased from 2,000 to 12,000.
  3. Report server resources were shared into the API server pool.

These actions provided sufficient resources so that customers were no longer affected by what was initially suspected to be a denial-of-service attack.

Subsequent investigation of log files captured at the time found that one particular query was intermittently running slowly, causing a backlog of queued queries, which ultimately caused the service issue once the session limits were reached.

 

Root Cause and Preventive Measures:

The root cause is that the database servers were intermittently selecting an incorrect index for a particular query, causing the associated session to last more than one second. As this query (token authentication) is called many thousands of times per second, this very quickly used up all available sessions.
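
As a rough illustration of why the sessions were exhausted so quickly (the figures below are hypothetical, since the report does not state the exact query rate), the number of concurrent sessions is approximately the query arrival rate multiplied by the average query duration, so a query called a few thousand times per second that slows from milliseconds to over a second jumps from a handful of concurrent sessions to thousands:

```python
# Rough illustration with hypothetical numbers: steady-state concurrent
# sessions are approximately arrival_rate * average_duration (Little's law).

def concurrent_sessions(arrival_rate_per_sec: float, avg_duration_sec: float) -> float:
    """Approximate concurrent sessions needed: L = lambda * W."""
    return arrival_rate_per_sec * avg_duration_sec

# Normal case: 3,000 auth queries/sec at ~5 ms each -> ~15 concurrent sessions.
print(concurrent_sessions(3_000, 0.005))   # 15.0

# Degraded case: the same rate at just over 1 s each -> ~3,000 concurrent
# sessions, well above the original 2,000-session load balancer limit.
print(concurrent_sessions(3_000, 1.0))     # 3000.0
```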

The initial fix of increasing session limits and database resources remains valid. In addition, the token authentication software has been changed to force index selection, ensuring that this query cannot run slowly again.
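
The sketch below shows the general shape of such a fix: pinning the index in the token lookup so the query planner cannot fall back to a slower plan. The report does not name the database engine or schema, so the MySQL-style FORCE INDEX hint and the table, index, and column names here are assumptions for illustration only.

```python
# Minimal sketch of forcing index selection on the token authentication query.
# Hypothetical schema; assumes a MySQL-style FORCE INDEX hint and a DB-API cursor.

TOKEN_AUTH_QUERY = """
    SELECT account_id, expires_at
    FROM auth_tokens FORCE INDEX (idx_auth_tokens_token)
    WHERE token = %s
      AND expires_at > NOW()
"""

def authenticate(cursor, token: str):
    """Look up a token with the index pinned, so the optimiser cannot
    pick a plan that scans the table and holds the session open."""
    cursor.execute(TOKEN_AUTH_QUERY, (token,))
    return cursor.fetchone()
```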
