RunCloud Main Server 503 Incident Report on 19 July 2017 (GMT +8)

What happened?

At 11:00 A.M. on 19 July 2017, our main server (the Panel) was down for a short while (up to 40 minutes, to be exact). The panel was returning HTTP status 503, which means the service was unavailable. I (the sysadmin) was taking a shower at the time and had no idea, while the CEO kept calling me to look at what had just happened to the system.

Checking server resources

Without even getting dressed, I quickly logged in to the server to check the php-fpm process. My assumption was correct: php-fpm was down. I started the php-fpm process and kept watching its log file. After a few moments it was killed again with SIGKILL (kill -9). My first guess was that we were under a DDoS attack, but the server load was only 0.1. So I checked the server memory: there was no swap space left and only 74MB of free memory. The buffer/cache held about 2GB, which the system should free if it needs the memory, but at the time I didn't realize that having no swap left meant the system was already in trouble.
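For reference, here is a minimal sketch of the checks described above, assuming a systemd-based Ubuntu server running PHP 7.0 (the exact service name and log path on our server may differ):

# Check whether php-fpm is running (service name is an assumption)
systemctl status php7.0-fpm

# Start it and watch its log for crashes (log path is an assumption)
systemctl start php7.0-fpm
tail -f /var/log/php7.0-fpm.log

# Check load average and memory/swap usage
uptime
free -m

In hindsight, a SIGKILL that nobody issued by hand on a memory-starved box is typically the kernel OOM killer at work; something like dmesg | grep -i oom would have confirmed it.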

The infrastructure

I will be as transparent as possible about our stack. We run RunCloud on only two servers: the first is the Panel and the second hosts the database. Both have 4GB of RAM. They are small servers, but if they aren't causing problems, why scale vertically or horizontally? And we are happy with them, because our small servers have served the biggest servers on DigitalOcean and Linode (our customers' servers). Since MariaDB lives on the separate server, the problem was not coming from the database (which is usually the main suspect).

The panel

RunCloud Panel uses Redis to store cache and beanstalkd to store queued jobs. Since our queue worker runs about 50 background processes at a time and the queue was empty at that moment (because php-fpm was crashing), the problem was most likely coming from Redis. I began checking the Redis info.

127.0.0.1:6379> INFO

And yes, we had nearly 2GB of cache sitting in memory, which had not been cleared because it had not yet reached the end of its Time To Live (TTL). For most systems 2GB of cache is nothing, but for us, still on a small server, it was a big deal. Since we use Redis only to store cache, there was no harm in clearing it. After clearing it, firing the

free -m

command makes me happy. We gained back the memory that we lose and luckily I know it is coming from Redis.

The Solution

The problem with Redis happened because we had left it unconfigured. Left unconfigured, Redis will use all of the system's memory (we didn't think it would ever use that much). We love Redis, and with it properly configured our small server has no problem serving our customers. So I edited the Redis config (/etc/redis/redis.conf) and added these two values.

maxmemory 500mb
maxmemory-policy volatile-ttl

These settings limit Redis's memory usage to 500MB. When usage approaches 500MB, Redis automatically evicts the stored keys that are closest to the end of their TTL.
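The config-file edit only takes effect after Redis restarts, but the same limits can also be applied to a running instance with standard redis-cli commands, for example:

# Apply the limits at runtime, without a restart
redis-cli CONFIG SET maxmemory 500mb
redis-cli CONFIG SET maxmemory-policy volatile-ttl

# Verify the running configuration
redis-cli CONFIG GET maxmemory
redis-cli CONFIG GET maxmemory-policy

Note that volatile-ttl only evicts keys that have an expiry set, which fits our setup since every cache entry carries a TTL.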

For our Clients

Since we ran into this problem ourselves, and we had left our clients' Redis servers unconfigured as well, we will be updating the Redis configuration on our clients' servers. We will set Redis to use at most 1/8 of each customer's server memory. This process will be automatic and you don't have to do anything.
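As an illustration only (a hypothetical script, not the automation RunCloud actually ships), computing and applying a 1/8 limit on a client server could look something like this:

#!/bin/sh
# Sketch: cap Redis at 1/8 of total system RAM.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
limit_bytes=$(( total_kb * 1024 / 8 ))

redis-cli CONFIG SET maxmemory "$limit_bytes"
redis-cli CONFIG SET maxmemory-policy volatile-ttl

# Persist the runtime change back into redis.conf
# (works only if Redis was started with a config file)
redis-cli CONFIG REWRITE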
