Portal Errors Saturday 8th June 2019 9:35:00 am


Our engineers have been working on a backend database issue this morning preventing some access to the portal.

We will post more information soon as to the cause.

On review of the logs, the cause of the database failure has been identified. As part of the database setup, we run four clustered servers using MariaDB Galera. This works very well, giving us active-active replication between the nodes (with a load balancer on top). On one of the nodes we also run a backup task every ten minutes, which takes a copy of the data and retains it outside the cluster. It is this backup task that has been identified as having caused the server process to fail.

The process failed in such a way that it was still running and able to continue serving existing connections, but new connections would time out. The load balancer did not detect this because its own connection to the node was already established, so it continued to send connections to the affected node.
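One way to catch this kind of failure is a health check that opens a brand-new connection on every probe and waits for the server to respond, rather than relying on a connection that is already established. The sketch below is illustrative only and is not the check our load balancer runs. It assumes the node speaks the MariaDB/MySQL client protocol, in which the server sends its handshake packet first; the function name and the timeout value are hypothetical.

```python
import socket


def node_accepts_new_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True only if a brand-new connection receives a server greeting in time."""
    try:
        # Open a fresh TCP connection for every probe; never reuse an old one.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # In the MariaDB/MySQL protocol the server speaks first, so a healthy
            # node sends at least one byte without the client sending anything.
            greeting = conn.recv(1)
            return len(greeting) > 0
    except OSError:
        # Covers refused connections and timeouts (socket.timeout is an OSError).
        return False
```

Waiting for the greeting matters because the operating system can complete the TCP handshake on behalf of a hung process, so a bare connect test could still pass against a node in the state described above.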

Another factor that compounded the problem was a failure in the monitoring software we use (PRTG). We run this in a two-cluster setup in which both locations monitor the same services, and an alert is raised when both locations see a monitored service as down. At 3:27am, one of these locations began allocating over 400GB of cached webpage data (used for page monitoring), which gradually filled the storage until it was full at 6:10am. This shut down the monitoring server at that location, as it could not write any more monitoring data. A local storage alert would normally come from the node in question, but it had shut down because of the space used. When this happens, the remaining monitoring location is meant to use only its own check data to decide whether a service is down and alert accordingly.

For reasons yet to be determined, the remaining location did not detect that the other was down and kept its last saved "up" result. When the database server stopped accepting connections and portal pages generated errors, the remaining location saw these as "down", but because there was another result saying the opposite (the cached "up"), it did not alert.
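The alerting rule described above, and the effect of a stale cached result, can be sketched as follows. This is not PRTG's actual logic, which is still being investigated with the vendor. It is a minimal illustration that assumes each result carries a timestamp and that a peer result older than a chosen age is ignored; the names CheckResult, should_alert and max_age are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    is_up: bool
    checked_at: float  # Unix timestamp of when the check ran


def should_alert(local: CheckResult, peer: Optional[CheckResult],
                 now: float, max_age: float = 120.0) -> bool:
    """Decide whether to alert from the local result and the peer's last result."""
    # Discard a peer result that is too old to trust (assumed safeguard).
    if peer is not None and now - peer.checked_at > max_age:
        peer = None
    if peer is None:
        # Peer unavailable: decide on the local check data alone.
        return not local.is_up
    # Both locations reporting: alert only when both see the service as down.
    return not local.is_up and not peer.is_up
```

With a rule like this, an "up" result saved several hours earlier by a location that has since shut down would be ignored, and the local "down" result alone would raise the alert.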

Both the increase in webpage data and the stuck cluster status are related to PRTG, and a high-severity case has been raised with them to determine why this happened. We have also raised the backup agent issue with R1Soft, the vendor of that software, to check for any known issues which could cause this.

We will also be looking into ways to mitigate a similar occurrence in the future.

This issue affected portal services only, and there was no downtime on any hosting or server products.

We are setting this to operational, although still under investigation, whilst we review the logs to establish what caused the portal unavailability. We aim to post some initial details of our findings in due course.