Here’s the latest on the performance problems I’ve been tracking. It turns out that they don’t appear to be directly related to the load from the status display. The technical details follow; I’d very much welcome input from any Apache or PHP gurus on how I might proceed in debugging the problem.
The basic issue is that a few times a day, an Apache httpd child process suddenly explodes in memory size and consumes all available memory. The behavior is sudden, not gradual: within a few seconds to a minute, the process swells to two or three orders of magnitude larger than its usual size.
Below is output from ps aux (the fifth and sixth columns are VSZ and RSS, in KB); the first httpd is a ‘normal’ one, the second is the problem child:
nobody 23166 0.0 0.1 17572 2148 ? S 16:06 0:00 /usr/local/apache/bin/httpd
nobody 23167 1.4 63.7 2245624 1315584 ? D 16:06 1:00 /usr/local/apache/bin/httpd
Generally only a small number of child processes display this behavior: often just one at a time.
The system setup is Linux with Apache 1.3.33, PHP 4.3.10 and MySQL 4.0.22-standard. I am not running mod_perl.
The vast majority of traffic on the site is PHP scripts, some of which access the MySQL backend. I strongly suspect a problem in one of the scripts, but have not been able to identify which one. I added the ‘%P’ variable (the PID of the child serving the request) to my Apache log format, so I can now identify the requests that the exploding child process handled before the error, but thus far no pattern has emerged: a different script appears last each time. I suspect that the last entry I see in the log is the last successful request, not the one that causes the problem, since Apache writes the log entry only after a request completes. (I am aware of mod_log_forensic, which I understand logs each request *before* it is processed, but I do not feel comfortable rebuilding my Apache server to include it at this time.)
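To make the per-child view concrete, here is a minimal sketch of how the last few logged requests for each child could be pulled out of the access log. It assumes, purely for illustration, that ‘%P’ was appended as the final space-separated field of each log line; the function name and the `keep` parameter are hypothetical, not part of any Apache tooling.

```python
from collections import defaultdict, deque


def last_requests_by_pid(log_lines, keep=5):
    """Group access-log lines by child PID, keeping only the most recent few.

    Assumption: the PID (%P) is the last space-separated field on each line.
    """
    recent = defaultdict(lambda: deque(maxlen=keep))
    for line in log_lines:
        # Split off the trailing field; everything before it is the request record.
        head, _, pid = line.rstrip("\n").rpartition(" ")
        if pid.isdigit():
            recent[int(pid)].append(head)
    # Convert the bounded deques to plain lists for easier printing.
    return {pid: list(entries) for pid, entries in recent.items()}
```

Looking up the rogue PID in the returned dictionary shows what that child completed just before it went wrong. The request that actually triggered the blow-up will still be missing, because it never finished and so was never logged.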
I am working around the issue with a cron script that checks whether a child process has exploded and restarts Apache if needed; this helps to mask the issue but is obviously not a fix. I have considered setting MaxRequestsPerChild to a non-zero value, but based on my understanding I doubt it would help (given that this is not a gradual leak but a case where the process goes wild in the middle of a request).
I recognize that it is unlikely that this problem can be diagnosed at a distance, but would welcome suggestions on debugging tools and techniques which might help me narrow down the problem area. In particular: other than mod_log_forensic, is there a way to truly see what that child process was doing when it went rogue?
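For readers who want to set up a similar stopgap, the detection half of such a watchdog might look roughly like the sketch below. This is not the actual cron script; the 500,000 KB threshold and the function names are assumptions chosen for illustration, and the restart step is deliberately left out.

```python
import subprocess

RSS_LIMIT_KB = 500_000  # hypothetical threshold; tune to the machine's RAM


def find_bloated(ps_output, name="httpd", rss_limit_kb=RSS_LIMIT_KB):
    """Return (pid, rss_kb) for processes matching `name` whose RSS exceeds the limit.

    Expects `ps aux` layout:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    bloated = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        # Skip the header row and anything that is not a full process line.
        if len(fields) < 11 or not fields[1].isdigit() or not fields[5].isdigit():
            continue
        if name in fields[10] and int(fields[5]) > rss_limit_kb:
            bloated.append((int(fields[1]), int(fields[5])))
    return bloated


def check_live():
    """Run `ps aux` on this machine and report any bloated httpd children."""
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    return find_bloated(out)
```

A cron job could call check_live() every minute and restart Apache only when the returned list is non-empty. Recording the offending PID and timestamp before restarting would also make it easier to match the event against the access log.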
Any and all suggestions are appreciated…
Update 1/16: Thanks again to all who have suggested additional debugging techniques. I tried a few of them this morning and have gathered additional data. None of it has given me an “aha!” moment yet, but I am posting it here so that gurus more skilled than I am can examine it.
Output of lsof for rogue httpd process pid 10990
Output of ‘strace -q -f’ for rogue httpd process pid 10990
Output of ‘cat /proc/PID/maps’ for rogue httpd process pid 10990
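For anyone examining the maps output, a rough sketch like the following can rank the mappings by size, which may show whether the growth sits in one huge anonymous region or is spread across many. It assumes the standard /proc/PID/maps line layout (address range, permissions, offset, device, inode, optional pathname); the function name and the ‘[anonymous]’ label are illustrative only.

```python
def largest_mappings(maps_text, top=5):
    """Return the `top` largest mappings as (size_kb, perms, label) tuples."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a mapping line
        # The first field is "start-end" in hexadecimal.
        start, _, end = fields[0].partition("-")
        size_kb = (int(end, 16) - int(start, 16)) // 1024
        # Mappings with no pathname are anonymous memory (heap, malloc arenas, etc.).
        label = fields[5].strip() if len(fields) > 5 else "[anonymous]"
        regions.append((size_kb, fields[1], label))
    regions.sort(key=lambda r: r[0], reverse=True)
    return regions[:top]
```

Feeding it the saved maps text for the rogue process and comparing against a healthy child would show which regions differ.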