HTTP Error 503. Service Unavailable: A Case in Hosting Support

Work in hosting support is mostly routine: the bulk of client requests are resolved according to well-established procedures. But sometimes you still run into a non-trivial problem, and then the engineer's main task is to find the single correct path that leads to its solution. In this article I want to tell how we encountered the intermittent error "HTTP Error 503. Service Unavailable" on our shared hosting, how we tried to catch it and diagnose it, and how the story ended in an unexpected way.



Start



The hosting provides users with a typical Linux + Apache + MySQL + PHP stack and a control panel on top of it. In our case this is ISP Manager 5 Business on CentOS 7 converted to CloudLinux. On the administrative side, CloudLinux provides tools for managing limits, as well as a PHP selector with various modes of operation (CGI, FastCGI, LSAPI).



This time a client contacted us with the following problem: his WordPress site had started to return 503 errors from time to time, which he reported to us.



Response codes in the 5xx range indicate server-side issues. These can be problems either with the site itself or with the web server that serves it.



Typical situations in which we receive the following errors:



  • 500 Internal Server Error - quite often associated with syntax errors in the site code, missing libraries, or an unsupported PHP version; there may also be problems connecting to the site's database or incorrect permissions on files/directories
  • 502 Bad Gateway - for example, when Nginx points to the wrong Apache port, or the Apache process has stopped working for some reason
  • 504 Gateway Timeout - a response from Apache was not received within the time specified in the web server configuration
  • 508 Resource Limit Is Reached - the limit of resources allocated to the user has been exceeded


This list contains only some of the most common cases. It is also worth noting that when the limits are exceeded, the user can receive both 500 and 503 errors.



When diagnosing these errors, the first step is to check the web server logs. This is usually enough to identify the culprit and fix the problem.
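
In our setup the web server logs live under /var/www/httpd-logs, so the first look is usually something like the commands below (the site name is a placeholder, and the access-log name is only an assumption about the usual naming scheme):

tail -n 200 /var/www/httpd-logs/sitename.error.log
grep -c ' 503 ' /var/www/httpd-logs/sitename.access.log   # count 503 responses, assuming the standard access-log format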



Regarding the 503 error in our case, we saw an entry in the logs:

[lsapi:error] [pid 49817] [client xxxx:6801] [host XXX.XX] Error on sending request (GET /index.php HTTP/1.0); uri (/index.php) content-length (0): ReceiveAckHdr: nothing to read from backend (LVE ID 8514), check docs.cloudlinux.com/mod_lsapi_troubleshooting.html
Based only on this log, it was not possible to determine what the problem might be.



Primary diagnosis



First of all, we checked the statistics on the user exceeding his limits. Minor overruns had been recorded on previous days, but the errors in the log were fresh; moreover, they appeared at intervals of one to several minutes.
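
On CloudLinux this is done with the lve-stats tools; roughly along these lines, although the exact flag spelling depends on the lve-stats version, so treat it as a sketch rather than an exact recipe:

lveinfo --user=USERNAME --period=7d   # resource usage and limit faults for the user over the last week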



We also studied the CloudLinux recommendations at the link given in the error log, but changing the parameters suggested there did not bring any results.



The site used a database on a MySQL 5.7 server running on the same machine in a Docker container. The container logs contained messages:



[Note] Aborted connection 555 to db: 'dbname' user: 'username' host: 'x.x.x.x' (Got an error reading communication packets)


Among these messages there were also ones about aborted connections from the site under investigation. This suggested that the connection to the DBMS was not being made correctly. To check this, we deployed a copy of the site on a test domain and converted the site's database to the DBMS native to CentOS 7, MariaDB 5.5.65. Several hundred requests were run against the test site with the curl utility, and the error could not be reproduced. But this result was preliminary, and after converting the database on the production site the problem remained.
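
The check itself was nothing fancy; a loop along these lines is enough (the test domain is just a placeholder):

# fire a few hundred requests at the test copy and count the response codes
for i in $(seq 1 300); do
  curl -s -o /dev/null -w '%{http_code}\n' 'http://test-sitename.example/'
done | sort | uniq -c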



Thus, an incorrect connection to the DBMS was ruled out as the cause.



The next hypothesis was that the problem lay in the site itself. To check it, we set up a separate virtual server and built the closest possible copy of the environment on it; the only significant difference was the absence of CloudLinux. The problem could not be reproduced on the test server, so we concluded that the site code was in order. Nevertheless, we also tried disabling the WordPress plugins in the same way, but the problem persisted.



As a result, we came to the conclusion that the problem was on our hosting side.



After analyzing the logs of other sites, we found that the problem was present on many of them: about 100 sites at the time of checking:



/var/www/httpd-logs# grep -Rl "ReceiveAckHdr: nothing to read from backend" ./ | wc -l
99


During testing, we also found that a freshly installed, clean WordPress CMS periodically returned the 503 error as well.



Approximately two months earlier we had carried out work on modernizing the server; in particular, we had changed the Apache MPM from Worker to Prefork in order to be able to run PHP via LSAPI instead of slow CGI. There was an assumption that this could have had an effect, or that some additional Apache settings were required, but we could not simply switch back to Worker: changing the Apache mode rewrites all the site configs, the process is not fast, and not everything might go smoothly.
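
For reference, the active MPM and the loaded mod_lsapi module are easy to confirm on CentOS, where the Apache binary is httpd:

httpd -V | grep -i 'Server MPM'               # prefork / worker / event
httpd -M 2>/dev/null | grep -Ei 'mpm|lsapi'   # loaded MPM and lsapi modules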



Adjusting the Apache settings also did not give the desired result.



Along the way, we searched for similar problems in search engines. On one forum, participants argued that the hoster has a problem and needs to be changed if the issue is not resolved. It doesn't sound very optimistic when you are on the other side, but you can understand the client: why would he need hosting that doesn't work?



At this stage, we had collected the available information and the results of the work performed, and contacted CloudLinux support with them.



Detailed diagnostics



For several days the CloudLinux support staff dug into the issue. Their recommendations mainly concerned the configured user limits. We checked this as well: with the limits disabled (the CageFS option for the user), and also with the limits enabled but PHP running as an Apache module, the problem was not observed. Based on this, it was suggested that CloudLinux was somehow involved. As a result, by the end of the week the request was escalated to the third level of support, but there was still no solution.



Along the way, we studied the Apache documentation on the CGI and LSAPI modes, set up a second Apache instance on the hosting server on a different port with a test site, and ruled out the influence of Nginx by sending requests directly to Apache, where we received the same error codes.
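
Bypassing Nginx is just a matter of talking to Apache's backend port directly; the port below is only an example, not our actual configuration:

curl -I -H 'Host: sitename' http://127.0.0.1:8080/index.php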



The LSAPI documentation on diagnosing 503 errors helped us get off the ground:

www.litespeedtech.com/support/wiki/doku.php/litespeed_wiki:php:503-errors

In the Advanced Troubleshooting section, it is suggested to trace the processes found in the system:



while true; do if mypid=`ps aux | grep $USERNAME | grep lsphp | grep $SCRIPTNAME | grep -v grep | awk '{print $2; }' | tail -1`; then strace -tt -T -f -p $mypid; fi ; done


We refined the command so that the trace of each process is recorded in a separate file named with its identifier.



Looking through the trace files, we saw identical lines in some of them:



cat trace.* | tail
...
47307 21:33:04.137893 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=42053, si_uid=0} ---
47307 21:33:04.140728 +++ killed by SIGHUP +++
...


If we look at the description of the structure delivered with a signal (siginfo_t), we will see the field



pid_t    si_pid;       /* Sending process ID */


which holds the identifier of the process that sent the signal.



By the time we studied the traces, the process with PID 42053 was no longer in the system, so while capturing traces we decided to also monitor the processes that send the SIGHUP signal.

Under the spoiler, the steps are described that allowed us to determine what kind of process it is, as well as to get its trace and additional information about which processes it sends the SIGHUP signal to.



Trace technique
Console 1.



tail -f /var/www/httpd-logs/sitename.error.log


Console 2.



while true; do if mypid=`ps aux | grep $USERNAME | grep lsphp | grep "sitename" | grep -v grep | awk '{print $2; }' | tail -1`; then strace -tt -T -f -p $mypid -o /tmp/strace/trace.$mypid; fi ; done


Console 3.



while true; do if mypid=`cat /tmp/strace/trace.* | grep si_pid | cut -d '{' -f 2 | cut -d'=' -f 4 | cut -d',' -f 1`; then ps -aux | grep $mypid; fi; done;


Console 4.



seq 1 10000 | xargs -i sh -c "curl -I http://sitename/"


As soon as a 503 error appears in console 1, we stop the command running in console 4.



As a result, we got the name of the process: /opt/alt/python37/bin/python3.7 -sbb /usr/sbin/cagefsctl --rebuild-alt-php-ini



This process was started in the system once a minute.
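
This regularity is easy to observe simply by polling the process list, for example:

watch -n 10 'ps ax | grep "cagefsctl --rebuild-alt-php-ini" | grep -v grep'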



We traced several cagefsctl processes in order to capture at least one of them from start to finish:



for i in `seq 1 100`; do strace -p $(ps ax | grep cagefsctl | grep rebuild-alt-php-ini | grep -v grep | awk '{print $1}') -o /tmp/strace/cagefsctl.trace.$(date +%s); done;


Next, we examined what it did, for example:



cat /tmp/strace/cagefsctl.trace.1593197892 | grep SIGHUP


We also obtained the IDs of the processes that were terminated with the SIGHUP signal; the terminated processes were the PHP processes running at that moment.



The collected data was passed to CloudLinux support to clarify whether this process is legitimate and whether it should run with such frequency.



Later, we received an answer that the command /usr/sbin/cagefsctl --rebuild-alt-php-ini works correctly; the only caveat is that it was being executed too often. It is normally called after a system update or a change in PHP settings.



The only lead left at this point was to check who the parent of the cagefsctl process was.
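
A quick way to do this, assuming a rebuild process is running at the moment you look, is to take its PID and ask ps for the parent:

pid=$(pgrep -f 'cagefsctl --rebuild-alt-php-ini' | head -1)
ppid=$(ps -o ppid= -p "$pid")
ps -fp "$ppid"   # full command line of the parent process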



The result was not long in coming, and to our surprise the parent of cagefsctl turned out to be the ispmgrnode process. That was a little strange, because the logging level for ISP Manager was set to the maximum and the cagefsctl call was not visible in ispmgr.log.



Now there was enough data to contact the ISP System support as well.



Summary



The issue was triggered by an ISP Manager update. Updating ISP Manager is in itself a normal event, but this time it started a synchronization process that ended with an error and was restarted every minute. The synchronization process invoked cagefsctl, which in turn terminated the running PHP processes.



The reason the synchronization process kept failing was the earlier work on upgrading the hosting server's hardware. A few months before the problem appeared, a PCIe NVMe drive had been installed in the server, an XFS partition had been created on it and mounted at /var, and the users' files had been moved there, but the disk quotas had not been updated. The mount options alone were not enough: it was also necessary to change the file system type in the ISP Manager settings, because it invokes commands to update disk quotas, and those commands are different for Ext4 and XFS.
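
For illustration, the quota tooling really is different for the two file systems; a rough sketch, with /var standing in for the actual mount point:

# Ext4: quotas live in aquota.* files and are managed by the classic quota tools
quotacheck -ugm /var      # (re)build the quota files
repquota -u /var          # per-user usage report

# XFS: quotas are part of the file system and are managed via xfs_quota
xfs_quota -x -c 'report -u -h' /var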



Thus, the problem made itself felt several months after the work.



Conclusions



We created the problem ourselves, but that did not become clear until the very last moment. For the future, we will try to take as many nuances as possible into account. With the help of our more experienced colleagues from CloudLinux and ISP System support, the problem was resolved. Our hosting is now stable, and we have gained experience that will be useful to us in future work.



P.S. I hope you found the article interesting and that it will help someone solve a similar problem faster.


