SoFunction
Updated on 2025-03-08

Troubleshooting a Tomcat startup failure

Preface

Recently, after a code update to one of our applications, the release failed on some machines: Tomcat never finished starting there. The log stopped at "Deploying web application", and several restarts produced the same result. This article records the whole investigation for anyone who runs into something similar.

Troubleshooting process

1. Tomcat startup thread stuck

Below, "Tomcat startup thread" refers to the thread named localhost-startStop-$id.

Use jstack to print out Tomcat's thread stack:

jstack `jps |grep Bootstrap |awk '{print $1}'` > 

The dump shows that the thread localhost-startStop-1 is in the WAITING state, with the following stack:

"localhost-startStop-1" #26 daemon prio=5 os_prio=0 tid=0x00007fe6c8002000 nid=0x3dc1 waiting on condition [0x00007fe719c1e000]
 : WAITING (parking)
 at (Native Method)
 - parking to wait for <0x00000007147be150> (a )
 at (:175)
 at (:429)
 at (:191)
 at (:183)
 at (:130)

The corresponding code is as follows:

final ResponseFuture<XxxMessage<Result>> future = (request);
(request);
XxxMessage<Result> response = ();

The thread is blocked on the last line: the call that should yield the response never returns. That call waits for the response to the registration request the client sent to Xxx-Server.
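A minimal sketch of how a bounded wait keeps a startup thread from parking forever. It uses the JDK's CompletableFuture as a stand-in, because whether Xxx's ResponseFuture offers a timed wait is not known from this article. The class name BoundedWait, the helper await, and the 200 ms timeout are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedWait {
    /** Waits at most timeoutMillis for the response and describes the outcome. */
    static String await(CompletableFuture<String> future, long timeoutMillis) {
        try {
            return "response: " + future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller gets control back and can log, retry, or abort startup.
            return "timed out after " + timeoutMillis + " ms";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a registration request whose response never arrives.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        System.out.println(await(neverCompleted, 200));
        System.out.println(await(CompletableFuture.completedFuture("registered"), 200));
    }
}
```

With an untimed get() the first call would park indefinitely, which matches the WAITING (parking) state in the thread dump above.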

2. Xxx registration request did not return

Use tcpdump to capture the traffic (the service port of Xxx-Server is yyy):

tcpdump -X -s0 -i bond0 port yyy

The capture contains only the TCP connection handshake; there are no packets with length != 0:

IP app-ip.56599 > : Flags [S], seq 3536490816, win 14600, options [mss 1460,sackOK,TS val 3049061547 ecr 0], length 0
IP  > app-ip.56599: Flags [S.], seq 2500877640, ack 3536490817, win 14480, options [mss 1460,sackOK,TS val 1580197458 ecr 3049061547], length 0
IP app-ip.56599 > : Flags [.], ack 1, win 14600, options [nop,nop,TS val 3049061548 ecr 1580197458], length 0

So the registration request got no response because it was never sent in the first place.

3. The Xxx registration request was not sent out

The Xxx code was invoked, yet no data went out. A more robust approach would have been to register a Listener on the ChannelFuture returned by writeAndFlush and check the result in the callback once the write completes.

With guidance from our Netty expert @yh, I traced Netty's code with BTrace.

Add the following option to Tomcat's startup script under bin/ so that the BTrace agent starts together with Tomcat:

JAVA_OPTS="$JAVA_OPTS -javaagent:${BTRACE_HOME}/build/=script=${BTRACE_HOME}/scripts/,stdout=true,debug=true,noServer=true"

The script contains the methods to be traced. These steps narrowed down why the request was never sent:

  • First, I found that the write method of the traced interface was never called, which confirmed the inference that the request had not been sent;
  • Then I found that the write method of another interface was called but reported an error;
  • Finally, I located the class whose write method throws the exception. The exception stack is:
: : (I)I
 (:125)
 ...
Caused by: : 
 (I)I
 $MemoryRegionCache.<init>(:372)
 ...

At this point the cause is clearer: a method with the descriptor (I)I could not be found at run time.

The BTrace method that exposed the problem is:

@OnMethod(
    clazz = "+",
    method = "write",
    location = @Location(value = )
)
public static void errorChannelOutboundHandlerWrite(@ProbeClassName String className, Throwable cause) {
    println("error , real class: " + className);
    (cause);
    println("=====================");
}

This raises a question: why was the exception never logged?

The answer is in the following code:

private void invokeWrite(Object msg, ChannelPromise promise) {
    try {
        ((ChannelOutboundHandler)()).write(this, msg, promise);
    } catch (Throwable var4) {
        notifyOutboundHandlerException(var4, promise);
    }
}

notifyOutboundHandlerException passes the exception to the Listeners registered on the promise. The old Xxx code registers no Listener, so the exception was never printed.
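The effect can be reproduced without Netty. The sketch below is only an analogy: it uses the JDK's CompletableFuture in place of ChannelPromise, and a Runnable in place of the handler's write method. The class name SwallowedFailure and both helpers are hypothetical. It shows that a Throwable caught and stored in a promise stays invisible unless something is listening on that promise.

```java
import java.util.concurrent.CompletableFuture;

class SwallowedFailure {
    /** Mimics the try/catch above: the Throwable is stored in the promise, not logged. */
    static void invokeWrite(Runnable handlerWrite, CompletableFuture<Void> promise) {
        try {
            handlerWrite.run();
            promise.complete(null);
        } catch (Throwable t) {
            promise.completeExceptionally(t);
        }
    }

    /** Registers a callback first, then writes; returns the failure it saw, or null on success. */
    static String writeWithListener(Runnable handlerWrite) {
        CompletableFuture<Void> promise = new CompletableFuture<>();
        String[] seen = new String[1];
        promise.whenComplete((ok, cause) -> {
            if (cause != null) {
                seen[0] = cause.toString();
            }
        });
        invokeWrite(handlerWrite, promise);
        return seen[0];
    }

    public static void main(String[] args) {
        Runnable failingWrite = () -> { throw new NoSuchMethodError("simulated"); };

        // No callback registered: the error sits in the promise and nothing is printed.
        invokeWrite(failingWrite, new CompletableFuture<>());

        // Callback registered: the cause becomes visible.
        System.out.println("listener saw: " + writeWithListener(failingWrite));
    }
}
```

In Netty itself the equivalent step is the one suggested earlier: add a listener to the ChannelFuture returned by writeAndFlush and inspect its result there.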

4. Cause of the NoSuchMethodError

Next I checked the Netty jars under $WEBAPP-DIR/WEB-INF/lib:

netty-3.10.
netty-all-4.1.
netty-buffer-4.1.
netty-codec-4.1.
netty-codec-http-4.1.
netty-common-4.1.
netty-handler-4.1.
netty-resolver-4.1.
netty-transport-4.1.
transport-netty3-client-5.0.
transport-netty4-client-5.0.

What stands out is that the version of netty-all-4.1. does not match the versions of the other Netty jars. The next step is to confirm which jars the two classes involved, $MemoryRegionCache and the class it calls, are loaded from.

Add the following option to Tomcat's startup script under bin/ to log class loading:

JAVA_OPTS="$JAVA_OPTS -verbose:class"

The output contains:

...
[Loaded $MemoryRegionCache from file:$WEBAPP-DIR/WEB-INF/lib/WEB-INF/lib/netty-buffer-4.1.]
...
[Loaded  from file:$WEBAPP-DIR/WEB-INF/lib/netty-all-4.1.]
...

The class loaded from netty-all-4.1. has no safeFindNextPositivePowerOfTwo method (normally this class would be loaded from netty-common-4.1.).
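When restarting with -verbose:class is inconvenient, the same question can be asked from code. This is a small sketch using only JDK calls. The class name WhereLoaded and the helper locationOf are hypothetical, and to answer the question for a web application it would have to run inside that application so that the webapp class loader resolves the classes.

```java
import java.security.CodeSource;

class WhereLoaded {
    /** Returns the jar or directory a class was loaded from, or "bootstrap" for JDK core classes. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return (source == null || source.getLocation() == null)
                ? "bootstrap"
                : source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass fully qualified class names as arguments; the defaults are placeholders.
        String[] names = args.length > 0
                ? args
                : new String[] {String.class.getName(), WhereLoaded.class.getName()};
        for (String name : names) {
            System.out.println(name + " <- " + locationOf(Class.forName(name)));
        }
    }
}
```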

That explains why startup hangs:

Netty jar loading conflict => exception when Xxx sends the registration request => no response, so the blocking call never returns => Tomcat startup thread stuck

One puzzle remains: why do some machines start normally while others fail?

5. Different machines behave differently

Look at the order of the Netty-related jars on the failing machine, as listed by ls -f (only the jars relevant to the problem are shown):

$ ls -f |grep netty
netty-buffer-4.1.
netty-all-4.1.
...
netty-common-4.1.
...

The man page explains the -f option of ls:

-f do not sort, enable -aU, disable -ls --color

That is, ls prints the entries in the order returned by the getdents system call, without sorting. According to the man page, ls otherwise sorts entries alphabetically by default.

So the NoSuchMethodError arises as follows: $MemoryRegionCache is loaded from netty-buffer-4.1., and it calls a method on a class that is loaded from netty-all-4.1., where that method does not exist.

Compare this with the order of the Netty-related jars on a machine that starts correctly:

$ ls -f |grep netty
...
netty-all-4.1.
...
netty-common-4.1.
netty-buffer-4.1.
...

Here netty-all-4.1. comes first, so all Netty classes are loaded from it and no incompatibility arises.

That leaves one more question: why does the order returned by ls -f differ between machines on ext4?

I can't answer this for now; at least, I have not found a code-level explanation that convinces me.

After all, an explanation without code is not an explanation, a task without a deadline is not a task, a source-code reading without a flowchart or a write-up is not a source-code reading, and a performance test without a report is not a performance test.

Here is an explanation based on observed behavior, which I find fairly convincing:

On modern filesystems where directory data structures are based on a search tree or hash table, the order is practically unpredictable.
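The JDK makes the same caveat: the documentation of java.io.File.list() states that the returned names are not guaranteed to appear in any specific order. The sketch below contrasts the raw listing with a sorted one. The class name JarOrder and its helpers are hypothetical, and whether a particular Tomcat version searches WEB-INF/lib in raw directory order is an assumption suggested by the symptoms above, not something this code proves.

```java
import java.io.File;
import java.util.Arrays;

class JarOrder {
    /** Returns jar names in whatever order the file system reports them. */
    static String[] rawOrder(File dir) {
        String[] names = dir.list((d, name) -> name.endsWith(".jar"));
        return names == null ? new String[0] : names; // null means "not a directory" or an I/O error
    }

    /** Returns the same names sorted, which is identical on every machine. */
    static String[] sortedOrder(File dir) {
        String[] names = rawOrder(dir);
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // The directory argument is a placeholder for a real WEB-INF/lib path.
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println("file system order: " + Arrays.toString(rawOrder(dir)));
        System.out.println("sorted order:      " + Arrays.toString(sortedOrder(dir)));
    }
}
```

Running it against the same lib directory on two machines gives a quick way to see whether the raw orders differ, much like the ls -f comparison above.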

What we can do

Hindsight is easy :) Joking aside, reviewing problems after they happen is the only way to make fewer mistakes.

  • Basic components: think through failure cases more thoroughly and do not swallow exceptions; set timeouts on operations that may block (a reminder to myself)
  • Publishing system: add rules about packages that must not coexist, such as netty-all and netty-common in the case above
  • Container isolation: isolate the third-party packages used by middleware from those used by business code
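The publishing-system rule could start from a check like the following: a rough sketch that scans a lib directory and reports every .class entry found in more than one jar. The class name DuplicateClassCheck and the helper findDuplicates are hypothetical, and a real rule would probably need an allow-list, since entries such as module-info.class legitimately appear in many jars.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

class DuplicateClassCheck {
    /** Maps each .class entry found in two or more jars under libDir to the jars that contain it. */
    static Map<String, List<String>> findDuplicates(Path libDir) throws IOException {
        Map<String, List<String>> owners = new TreeMap<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    Enumeration<JarEntry> entries = jarFile.entries();
                    while (entries.hasMoreElements()) {
                        String name = entries.nextElement().getName();
                        if (name.endsWith(".class")) {
                            owners.computeIfAbsent(name, k -> new ArrayList<>())
                                  .add(jar.getFileName().toString());
                        }
                    }
                }
            }
        }
        // Keep only the entries that more than one jar provides.
        owners.values().removeIf(jarNames -> jarNames.size() < 2);
        return owners;
    }

    public static void main(String[] args) throws IOException {
        // The directory argument is a placeholder for a real WEB-INF/lib path.
        Path libDir = Paths.get(args.length > 0 ? args[0] : ".");
        findDuplicates(libDir).forEach((cls, jars) -> System.out.println(cls + " -> " + jars));
    }
}
```

A release pipeline could fail the build when the returned map is not empty, which would have flagged netty-all next to netty-common before deployment.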
