
PHP/FPM/Nginx Issue: Runs fine for hours, then maxes out FPM children in a matter of seconds, resulting in 502 errors

Server Fault Asked by Slickrick12 on November 4, 2021

We recently migrated a long-running (and trouble-free) Ubuntu 16.04 / nginx (1.10.3) / PHP 7.0 stack over to an Ubuntu 18.04 / nginx (1.14.0) / PHP 7.2 stack. The migration has introduced instability at the PHP-FPM level that never existed before, and hours of investigation have turned up nothing as to the root cause of the issue.

As an aside, both deployments are managed through AWS, and in both cases the PHP app runs identical code and connects to the same MySQL RDS endpoint.

Issue

Post-upgrade, the system runs as usual for several hours (sometimes more than 24) without issue. Then, within the space of a minute, FPM starts logging warnings that script execution has timed out in various children, like so:

[15-Jul-2020 19:50:22] WARNING: [pool www] child 29649, script '/path/to/app.php' (request: "POST /app.php") execution timed out (140.371526 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25847, script '/path/to/app.php' (request: "GET /app.php") execution timed out (136.178758 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (135.822576 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25401, script '/path/to/app.php' (request: "POST /app.php") execution timed out (135.927461 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25229, script '/path/to/app.php' (request: "GET /app.php") execution timed out (143.179479 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25110, script '/path/to/app.php' (request: "POST /app.php") execution timed out (141.080509 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 24904, script '/path/to/app.php' (request: "POST /app.php") execution timed out (142.441526 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 24679, script '/path/to/app.php' (request: "GET /app.php") execution timed out (138.165572 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 24652, script '/path/to/app.php' (request: "POST /app.php") execution timed out (142.414685 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 15982, script '/path/to/app.php' (request: "GET /app.php") execution timed out (125.469362 sec), terminating

Shortly thereafter, FPM starts spawning new children to try to handle the load from nginx.

[15-Jul-2020 19:50:37] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 46 total children
[15-Jul-2020 19:50:38] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 1 idle, and 51 total children
[15-Jul-2020 19:50:39] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 55 total children
[15-Jul-2020 19:50:40] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 60 total children
[15-Jul-2020 19:50:41] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 65 total children

Then the pool hits its cap of 70 children, and FPM stops accepting requests from nginx ([error] 1072#1072: *234300 connect() to unix:/run/php/php7.2-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream), resulting in 502 errors.

[15-Jul-2020 19:50:42] WARNING: [pool www] server reached pm.max_children setting (70), consider raising it  

At this point, the app is effectively down until the FPM service is restarted. When FPM is restarted, the hanging requests/processes are immediately terminated and the issue is instantly resolved for the time being.
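
For completeness, the manual fix is nothing more than a service restart (the unit name matches the cgroup listing further down):

sudo systemctl restart php7.2-fpm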

Note: The app can normally run with <= 10 workers, even at peak times, so the fact that it scales out instantly because of long-running processes is certainly an abnormality.

Repeating logs for the same process ID

On the surface the logs look random, but if you filter them by child process ID it becomes clear that each child is trying to terminate a single request over and over again. What's more, successive log entries for a given process ID are ~40 seconds apart, which you can see by comparing the timestamp at the beginning of each entry with the reported script time at the end. Here I've filtered the log for a specific process ID as an example:

[15-Jul-2020 19:40:39] NOTICE: [pool www] child 25617 started
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (135.847853 sec), terminating
[15-Jul-2020 19:51:03] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (176.182362 sec), terminating
[15-Jul-2020 19:51:43] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (216.516705 sec), terminating
[15-Jul-2020 19:52:23] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (256.849969 sec), terminating
[15-Jul-2020 19:53:04] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (297.183892 sec), terminating
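
(The filtering above is just a grep by child ID; the log path is an assumption for a stock Ubuntu 18.04 install:)

grep 'child 25617' /var/log/php7.2-fpm.log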

Occurs when traffic is increasing or at peak

Based on the times of day the issue crops up, it is fairly obvious that the issue predictably occurs when traffic is increasing. So for example, first thing in the morning or at the peak of the day, when traffic is high, the issue occurs. We have not seen this issue occur when traffic is lower, such as in the middle of the night.

New Relic is Installed

A New Relic daemon is installed on both deployments via their apt repository, per the recommended installation method outlined on their website.

   CGroup: /system.slice/php7.2-fpm.service
           ├─ 6554 php-fpm: master process (/etc/php/7.2/fpm/php-fpm.conf)
           ├─ 6583 /usr/bin/newrelic-daemon --agent --pidfile /var/run/newrelic-daemon.pid --logfile /var/log/newrelic/newrelic-daemon.log --port @newrelic --wait-for-port 0s --define utilization.detect_aws=true --define utilization.de
           ├─ 6603 /usr/bin/newrelic-daemon --agent --pidfile /var/run/newrelic-daemon.pid --logfile /var/log/newrelic/newrelic-daemon.log --port @newrelic --wait-for-port 0s --define utilization.detect_aws=true --define utilization.de
           ├─ 7228 php-fpm: pool www
           ├─ 7520 php-fpm: pool www
           ├─ 7548 php-fpm: pool www
           ├─ 7569 php-fpm: pool www
           ├─11835 php-fpm: pool www
           ├─11944 php-fpm: pool www
           ├─11997 php-fpm: pool www
           ├─12106 php-fpm: pool www
           ├─12934 php-fpm: pool www
           └─13049 php-fpm: pool www

Note: We have not yet tried disabling the New Relic agent, and on the surface it does not appear to be a related cause.

FPM Pool Config

Here's the pool config; it's the same one used on both stacks:

pm = dynamic
pm.max_children = 70
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10
pm.max_requests = 1000

Nginx Config

The same nginx configuration is used on both deployments.

user www-data;
worker_processes auto;
pid /run/nginx.pid;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;

  client_max_body_size 24m;
  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;
  keepalive_timeout 120;
  limit_req_status 429;
  limit_req_zone $http_x_forwarded_for zone=web:16m rate=10r/s;
  types_hash_max_size 2048;
  server_tokens off;
  ssl_protocols TLSv1.2;
  ssl_prefer_server_ciphers on;
  gzip on;
  gzip_disable "msie6";
  gzip_vary on;
  gzip_proxied any;
  gzip_comp_level 6;
  gzip_buffers 16 8k;
  gzip_http_version 1.1;
  gzip_min_length 256;
  gzip_types text/plain text/css application/json application/javascript text/xml application/xml text/javascript image/x-icon;
  log_format main '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" - $request_time X-Forwarded-For=$http_x_forwarded_for Host=$host';

  include /etc/nginx/conf.d/*.conf;
  include /etc/nginx/sites-enabled/*;
}

PHP Config

The same config is used on both deployments; all other settings are left at their defaults.

max_execution_time=120
max_input_vars=5000
memory_limit=256M
post_max_size=24M

Stuff we’ve tried and/or looked at

  • We tried adjusting the FPM pool configuration to include request_terminate_timeout = 121 (just beyond the max_execution_time set in PHP) in the hope that it would terminate these stuck processes, but it did not work: the processes keep logging "terminating" roughly every 40 seconds, sometimes for more than 1000 seconds (see the snippet after this list).

  • The master FPM config is left at its defaults with no alterations; the stock php-fpm.conf file is identical between the two versions of FPM. We examined the emergency_restart_threshold, emergency_restart_interval and process_control_timeout settings, but they don't seem to be a factor in the issue we're experiencing.

  • Reverting the app back to Ubuntu 16.04 (PHP 7.0.33) magically solves the problem, but that is obviously not what we're trying to accomplish by upgrading.

  • Restarting FPM when the logs show it getting busy clears the problem, but again, that's a workaround, not a solution.

  • Examined frontend logging for signs of a coincidental DoS attack, but traffic is not unusual.

  • Examined nginx/php error logs for indications that scripts are not properly shutting down (i.e. no response). Nothing found.
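
For reference, here is the snippet mentioned in the first bullet: the pool-level directive as we tried it, plus the master-config settings we examined (the commented lines show what we understand to be the shipped defaults in php-fpm.conf):

; www pool - value as tried, just beyond PHP's max_execution_time
request_terminate_timeout = 121

; php-fpm.conf - examined but left at the shipped defaults
;emergency_restart_threshold = 0
;emergency_restart_interval = 0
;process_control_timeout = 0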

Conclusion

Something in PHP, nginx, or FPM is configured in a way that ultimately produces some sort of race condition in which the entire FPM service is consumed by sleeping/stuck/stale processes. These eat up all available workers until the pool runs out of capacity and rejects all traffic indefinitely, until the FPM process is manually restarted.

We are not ruling out an issue within the app itself, but the fact that it runs without issue on the previous OS/PHP/FPM versions points to an incompatibility at the PHP level; so far, though, we haven't seen any logging to support this.

My goal here is to figure out a.) why this is happening, and b.) what we can do to mitigate or work around the issue.

Thanks for your eyes!

One Answer

Identical code on different servers can very easily behave completely differently, all things considered.

A simple example is any code that makes curl_exec calls against a third-party (remote) web server that requires HTTP or IP address authorization.

In that case, if one server is authorized but the other isn't, you end up with code that hangs waiting on libcurl's effectively indefinite timeout, where page processing is stopped only by the max_execution_time configuration.
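
As a defensive measure against exactly this kind of hang, explicit cURL timeouts make a blocked remote call fail fast instead of pinning an FPM worker; a minimal sketch (the URL and limits are illustrative, not taken from the app in question):

<?php
// Give every outbound cURL call a hard deadline so a blocked third-party
// endpoint cannot hold an FPM worker for minutes at a time.
$ch = curl_init('https://api.example.com/endpoint'); // hypothetical URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);  // seconds allowed to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 10);        // seconds allowed for the whole request
$response = curl_exec($ch);

if ($response === false) {
    // Fail fast and log, rather than hanging the worker.
    error_log('curl failed: ' . curl_error($ch));
}
curl_close($ch);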

You really want to check the PHP-FPM slow log, as suggested by @Michael Hampton. You can find a detailed write-up on how to do this here.
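
Enabling it is a pool-level change; a minimal sketch, with the path and threshold as illustrative values. Once a request runs longer than request_slowlog_timeout, FPM dumps that child's PHP backtrace to the slow log, which usually shows exactly which call the worker is stuck in:

; /etc/php/7.2/fpm/pool.d/www.conf (illustrative)
slowlog = /var/log/php7.2-fpm.slow.log
request_slowlog_timeout = 10s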

There is no mention of which caching mechanisms you use. To greatly improve how much traffic your website can handle, use a full-page caching mechanism like Varnish Cache.

Also don't forget about PHP's OPcache, the realpath cache, etc.
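
A minimal php.ini sketch of those settings (the values are illustrative and should be tuned to the size of the codebase):

opcache.enable=1
opcache.memory_consumption=128
opcache.max_accelerated_files=10000
realpath_cache_size=4096k
realpath_cache_ttl=600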

I would also look into MySQL slow query log and see if you have missing indices.

The problem of a missing MySQL index becomes more obvious as your data grows (that is, simply over time as the website's tables get bigger). It will get worse and worse until you track down the offending queries and fix the schema so that those few bad queries use keys instead of scanning millions of rows on every page request.
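
The workflow is roughly the following (table, column and threshold are hypothetical; on RDS the slow-query-log settings are normally changed via the DB parameter group rather than SET GLOBAL):

-- Log anything slower than 1 second (self-managed MySQL)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- For a query from the slow log, "type: ALL" with a large "rows" estimate
-- usually means a full table scan, i.e. a missing index.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Hypothetical fix: index the column the query filters on.
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);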

Finally, look into tools like NGINX Amplify: that kind of monitoring, where you can see graphs for CPU load, NGINX connections and PHP-FPM metrics, helps you see how load correlates with connections. Post-upgrade, you might simply have received more traffic than usual. Investigate the logs with goaccess, see which IP addresses are most frequent, check their user agents, whois them as candidates for potential blocking, etc.
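
For example, a quick terminal report over the access log looks something like this (the log path is an assumption, and since the question uses a custom log_format, goaccess will need a matching --log-format/--date-format/--time-format rather than one of its built-in presets):

goaccess /var/log/nginx/access.log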

Answered by Danila Vershinin on November 4, 2021
