Analyzing an Unexpected Exit of the Tomcat Process

Before the Spring Festival, a department reported that Tomcat in their test environment kept exiting unexpectedly. After checking the actual environment, we found that it was not a JVM crash: the log contains a record of an orderly shutdown. The whole sequence from pause to destroy:

pause
Pausing ProtocolHandler
stopInternal
Stopping service Catalina
stop
Stopping ProtocolHandler
destroy
Destroying ProtocolHandler

From these logs we can conclude:

1) Tomcat was not shut down the normal way through the script (via port: that is, by sending a shutdown command to port 8005)

If it had been shut down normally via the port, the following log entry would appear before pause:

await
A valid shutdown command was received via the shutdown port. Stopping the Server instance.

This is then followed by pause -> stop -> destroy.

2) Tomcat's shutdown hook was triggered, and the destruction logic was executed

There are two ways this can happen. One is that the application code asks the JVM to exit somewhere; the other is a signal sent by the system (other than kill -9: with SIGKILL the JVM gets no chance to run its shutdown hooks).

First, by reviewing the code, the application team and the middleware team ruled out an explicit JVM exit call in this application. That left only the signal case. After some investigation, we found that each unexpected Tomcat exit coincided with the end of an ssh session.

With this clue, Gintoki immediately looked at the script used in the other team's test environment. Simplified, it looks like this:

$ cat 
#!/bin/bash
cd /data/server/tomcat/bin/
./ start
tail -f /data/server/tomcat/logs/

After Tomcat starts, the current shell process does not exit; it blocks on the tail process, which writes the log content to the terminal. In this situation, if the user closes the ssh terminal window directly (with the mouse or a shortcut key), the java process exits as well. If the user first terminates the script with ctrl-c and then closes the ssh terminal, the java process does not exit.

This is an interesting phenomenon. When Tomcat is started with the start argument, the java process ends up reparented to init (process id 1). It no longer has a parent-child relationship with the current process and has nothing to do with the ssh process, so why does closing the ssh terminal window make the java process exit?

Our guess was that closing the ssh window sends an exit signal to the current interactive shell and its running child processes. To verify this, we found a machine with systemtap installed. The stap script below is borrowed from our colleague Jianquan:

function time_str: string () {
return ctime(gettimeofday_s() + 8 * 60 * 60);
}
probe begin {
printdln(" ", time_str(), "BEGIN");
}
probe end {
printdln(" ", time_str(), "END");
}
probe  {
if (sig_name == "SIGHUP" || sig_name == "SIGQUIT" || 
sig_name=="SIGINT" || sig_name=="SIGKILL" || sig_name=="SIGABRT") {
printd(" ", time_str(), sig_name, "[", uid(), pid(), cmdline_str(), 
"] -> [", task_uid(task), sig_pid, pid_name, "], ");
task = pid2task(pid());
while (task_pid(task) > 0) {
printd(" ", "[", task_uid(task), task_pid(task), task_execname(task), "]");
task = task_parent(task);
}
println("");
}
}

The process tree (pstree) during the simulation looked roughly as follows. After Tomcat starts, the java process has already detached and been reparented to init:

|-sshd(1622)-+-sshd(11681)---sshd(11699)---bash(11700)---(13285)---tail(13299)

With help from Bo Yu of the kernel team, we found the following:

a) When ctrl-c is used to terminate the current process, the system's events process sends SIGINT to the java and tail processes, and to the script itself:

SIGINT [ 0 11 ] -> [ 0 20629 tail ] 
SIGINT [ 0 11 ] -> [ 0 20628 java ] 
SIGINT [ 0 11 ] -> [ 0 20615  ]

Note: pid 11 is the events process

b) When the ssh terminal window is closed, sshd sends SIGHUP to bash, and bash sends it on to its downstream processes. But why does the java process receive it as well?

SIGHUP [ 0 11681 sshd:  [priv] ] -> [ 57316 11700 bash ] 
SIGHUP [ 57316 11700 -bash ] -> [ 57316 11700 bash ]
SIGHUP [ 57316 11700 ] -> [ 0 13299 tail ] 
SIGHUP [ 57316 11700 ] -> [ 0 13298 java ] 
SIGHUP [ 57316 11700 ] -> [ 0 13285  ]

However, Bo Yu was very busy and could not keep helping with the analysis (he offered some guesses, but they turned out not to be the cause).

Having confirmed that signals were the cause, my questions became:

1) Why does SIGINT (kill -2) not make the Tomcat process exit?

2) Why does SIGHUP (kill -1) make the Tomcat process exit?

My first thought was that the JVM might handle OS signals differently under certain parameters (or because of some JNI code). I looked at the application's JVM parameters and found nothing suspicious, and I also ruled out Tomcat using apr/tcnative.

Let's look at how a JVM process handles SIGINT and SIGHUP by default, simulating it with the Scala REPL:

scala> ().addShutdownHook(
new Thread() { override def run() { println("ok") } })

Sending kill -2 and kill -1 to this java process shows that both make the JVM exit and both trigger the shutdown hook. This matches Oracle's documentation on how the HotSpot virtual machine handles signals: SIGTERM, SIGINT, and SIGHUP all trigger shutdown hooks.

So the JVM is not the problem. The next guess was that it had something to do with the state of the process. The script does not use start-stop-daemon to start the java process; simplified, what it does for the start argument is equivalent to:

eval '"/pathofjdk/bin/java"' 'params'  start '&'

It simply runs java in the background. When the script process exits, the ppid of the java process becomes 1.

I spent a lot of time suspecting an OS-level cause, but that turned out to be irrelevant. After the Spring Festival, I asked Shaoming and Jianquan to analyze the problem with me, since they have a C background and know the underlying system better. After most of a day of guessing and verifying, we finally confirmed that the shell was the cause.

Why SIGINT (kill -2) does not make the background java process exit

For simplicity, we use sleep to stand in for the process. First, in interactive mode:

$ sleep 1000 & 
$ ps -opid,pgid,ppid,stat,cmd -C sleep
PID PGID PPID STAT CMD
9897 9897 9813 S sleep 1000

Note that the pid of the sleep 1000 process is the same as its pgid (process group id). In this case, kill -2 does kill the sleep 1000 process.

Now we put the sleep process into a script and execute it in the background:

$ cat 
#!/bin/sh
sleep 4400 &
echo "shell exit"

After running the script, the pid of the sleep 4400 process differs from its pgid. The pgid is the id of its parent process, that is, the script process that has already exited.

$ ps -opid,pgid,ppid,comm -p 63376
PID PGID PPID COMM
63376 63375 1 sleep

This time, kill -2 cannot kill the sleep 4400 process.

At this point we were very close to the cause: the shell must be doing something to the background process's signal handling. Shaoming wrote a small program that installs its own handler, to see whether kill -2 works on it:
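
The observation above can be reproduced with a small script. This is a minimal sketch, assuming a POSIX shell, a procps-style `ps`, and a `sleep` that accepts fractional seconds; the variable names are illustrative only. It starts a background sleep from a non-interactive script, compares process group ids, and then uses the wait status to tell which signal actually ended the child.

```shell
#!/bin/sh
# Run as a script (non-interactive, job control off by default).
sleep 30 &
child=$!
sleep 0.2   # give the child time to finish starting up

# Compare the process group of this shell with that of the background child.
self_pgid=$(ps -o pgid= -p $$ | tr -d ' ')
child_pgid=$(ps -o pgid= -p "$child" | tr -d ' ')
echo "shell pgid=$self_pgid child pid=$child child pgid=$child_pgid"

# Send SIGINT first, then SIGTERM: the wait status shows which one was fatal.
kill -INT "$child"
sleep 0.2
kill -TERM "$child"
status=0
wait "$child" || status=$?
echo "child exit status: $status"   # 128 + number of the fatal signal
```

A status of 143 (128 + 15) means SIGTERM ended the child and the earlier SIGINT had no effect; 130 (128 + 2) would mean SIGINT had killed it.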

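#include <stdio.h>   /* printf */
#include <stdlib.h>  /* exit */
#include <signal.h>  /* signal, SIGINT */

/* Custom SIGINT handler: print a marker and exit. */
void my_handler(int sig) {
    (void)sig;
    printf("handler aaa\n");
    exit(0);
}

int main(void) {
    /* Installing a handler replaces whatever disposition was inherited. */
    signal(SIGINT, my_handler);
    for (;;) { }  /* spin until a signal arrives */
    return 0;
}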

We run the compiled program in the background from a script:

$ cat 
#!/bin/sh
/tmp/ &

This time, kill -2 does kill the process. This shows that the shell changes the signal handling before the user's code runs, that is, the script sets it when it forks the child process, and a program that installs its own handler overrides that setting. Following this clue, a Google search told us: in non-interactive mode, the shell sets SIGINT to be ignored for background processes.

The default job control setting differs between interactive mode and non-interactive mode.

Why does the shell not set SIGINT to be ignored for background processes in interactive mode, yet does so in non-interactive mode? This is easy to understand. Suppose a foreground process is taking too long: we can suspend it with ctrl-z and then move it to the background with bg %n. Likewise, a background process started with cmd & can be brought back to the foreground with fg %n and then stopped with ctrl-c. For that to work, SIGINT obviously cannot be ignored.

Why do background processes in interactive mode get their own process group ID? Because if they kept the parent's process group ID, keyboard-generated signals such as ctrl-c delivered to that group would reach every member of it. If a background process were also a member of the parent's process group, and SIGINT cannot be ignored because job control needs it, then a casual ctrl-c in the terminal could make all background processes exit. That is clearly unreasonable, so to avoid this interference each background process gets its own pgid.

In non-interactive mode, job control is usually not needed, so it is turned off by default (you can still enable it in a script with set -m). With job control disabled, a background process in the script stays in the parent's process group, and it is shielded from the SIGINT delivered to the group's members by having SIGINT set to be ignored, because this signal is meaningless to it.

Back to the Tomcat example: when the script is run with the start argument, java is started in the background by a non-interactive shell, so the shell also sets the java process to ignore SIGINT. Therefore, when ctrl-c ends the script, the SIGINT sent by the system has no effect on java.
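
On Linux, the ignored signals of a process can be inspected directly. The following is a minimal sketch, assuming a Linux `/proc` filesystem and `awk`; the variable names are illustrative. It reads the `SigIgn` bitmask of a background child started by a non-interactive script, where bit n-1 stands for signal n (SIGHUP=1, SIGINT=2, SIGQUIT=3).

```shell
#!/bin/sh
# Linux-only sketch: relies on /proc/<pid>/status.
sleep 30 &
child=$!
sleep 0.2   # let the shell finish setting up the child

# SigIgn is a hexadecimal bitmask of the signals the process ignores.
mask=$(awk '/^SigIgn:/ {print $2}' "/proc/$child/status")
echo "SigIgn mask of background child: $mask"

val=$((0x$mask))
hup_ignored=$(( (val & 0x1) != 0 ))    # SIGHUP  (1)
int_ignored=$(( (val & 0x2) != 0 ))    # SIGINT  (2)
quit_ignored=$(( (val & 0x4) != 0 ))   # SIGQUIT (3)
echo "SIGHUP=$hup_ignored SIGINT=$int_ignored SIGQUIT=$quit_ignored"

kill -TERM "$child"   # clean up
```

Run as a plain script, this should report SIGINT and SIGQUIT as ignored (1). SIGHUP is ignored only if the script itself was started with SIGHUP already ignored, for example under nohup.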

Why SIGHUP (kill -1) makes the Tomcat process exit

In non-interactive mode, the shell sets SIGINT and SIGQUIT to be ignored for the java process, but not SIGHUP. Let's look again at the process tree at that time:

|-sshd(1622)-+-sshd(11681)---sshd(11699)---bash(11700)---(13285)---tail(13299)

After sshd sends SIGHUP to the bash process, bash passes SIGHUP on to its child process, and also propagates it to the members of that child's process group. Because the background java process inherited its pgid from its parent process (which in turn inherited it from its own parent), the java process is still a member of that process group, so it receives SIGHUP and exits.

If we enable job control in the script, the java process no longer exits:
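
The effect of SIGHUP on such a background process can be checked in isolation. This is a minimal sketch, assuming a POSIX shell that was not itself started with SIGHUP ignored (for example, not under nohup); the variable names are illustrative.

```shell
#!/bin/sh
# Run as a script (non-interactive, job control off by default).
sleep 30 &
child=$!
sleep 0.2   # give the child time to finish starting up

# SIGHUP keeps its default disposition, so it terminates the child.
kill -HUP "$child"
status=0
wait "$child" || status=$?
echo "child exit status: $status"   # 128 + 1 when SIGHUP was fatal
```

A status of 129 confirms that the child was terminated by SIGHUP.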

#!/bin/bash
set -m 
cd /home/admin/tt/tomcat/bin/
./ start
tail -f /home/admin/tt/tomcat/logs/

Now the background java process no longer inherits the parent's pgid; it uses its own pid as its pgid. After the script has finished and exited, the java process is reparented to init and has no remaining connection with the script process, so bash no longer sends signals to it.
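
The effect of set -m on process groups can be seen with a direct background child. This is a minimal sketch, assuming bash and a procps-style `ps`; the variable names are illustrative, and sleep stands in for the java process.

```shell
#!/bin/bash
# Enable job control in a non-interactive script.
set -m
sleep 30 &
child=$!
sleep 0.2   # give the child time to finish starting up

# With job control on, the background job is placed in its own process group.
self_pgid=$(ps -o pgid= -p $$ | tr -d ' ')
child_pgid=$(ps -o pgid= -p "$child" | tr -d ' ')
echo "shell pgid=$self_pgid child pid=$child child pgid=$child_pgid"

kill -TERM "$child"   # clean up
```

Here the child's pgid should equal its own pid rather than the pgid of the script.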
